Custom Programs In Shell Pipelines
- write custom scripts that work in shell pipelines
Announcements
- You do not need to use custom scripts on HW due tomorrow.
- A couple of people have asked for extensions on the assignment; that’s fine.
- HW specific text channels (good suggestion @lKappachino)
- Formatted code works better than screenshots in Discord, so use backticks.
123 GO - How do you feel about HW due tomorrow? For example, “fine”, “program works, just need to write”, “lost”.
Resources
Custom steps
When no existing shell program does what you want, write your own program.
gunzip input.txt.gz |
cut --fields=6-10 |
???? | # <-- complex step, not easy to do in shell
gzip > output.txt.gz
123 GO: What’s an example of a complex data processing step that may not be easy to do in the shell? Open ended question, be creative.
Pick your language
We call the program a script because it’s usually just a single file.
The script simply needs to produce stdout from stdin, and high level programming languages suitable for data science handle this use case well.
That’s why these languages are sometimes called scripting languages.
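As a rough sketch of such a script, here is a step that is awkward in pure shell: computing the median of a value for each group of lines. The input layout (a key, a tab, then a number, already sorted by key) and the function name group_medians are assumptions made for illustration, not part of the pipeline above.

```python
import sys
from itertools import groupby
from statistics import median


def group_medians(lines):
    """Yield "key<TAB>median" for each run of consecutive lines sharing a key.

    Assumes every line looks like "key<TAB>number" and that the input is
    already sorted by key, for example by an earlier `sort` step.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(rows, key=lambda row: row[0]):
        values = [float(row[1]) for row in group]
        yield f"{key}\t{median(values)}"


def main():
    # Read stdin line by line and write one result line per group to stdout.
    for result in group_medians(sys.stdin):
        print(result)


if __name__ == "__main__":
    main()
```

Keeping the logic in a function that takes any iterable of lines means it can be checked on a small list before it is dropped into a pipeline.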
Here’s an over the top example:
$ sqlite3 database.sqlite < selection.sql |
bash sweet.sh |
awk able.awk |
Rscript rad.R |
python pretty.py |
julia joy.jl |
ruby radiant.rb |
octave maybe.m |
perl peachy.pl > output.txt
In the above pipeline:
- database.sqlite is a local SQL database file
- selection.sql is a SQL query. The command sqlite3 database.sqlite < selection.sql means that sqlite3 takes selection.sql as standard input.
- sweet.sh is a bash script
- able.awk is an Awk script
- rad.R is an R script
- pretty.py is a Python script
- joy.jl is a Julia script
- radiant.rb is a Ruby script
- maybe.m is an Octave / Matlab script
- peachy.pl is a Perl script
They all work together because each program reads from stdin and writes to stdout, and each pipe connects one program’s stdout to the next program’s stdin.
123 GO - How many languages do you need to be at least familiar with to understand the above pipeline?
That’s why I call this example “over the top”. These languages have significant overlap with one another, and it’s reasonable to stick with one language when possible.
TODO: add hello world examples for R and python.
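A hello world for a pipeline in Python might look like the following minimal sketch. The file name hello.py and the function name greet are illustrative choices, not something fixed by the course.

```python
import sys


def greet(name):
    """Return a greeting for `name`."""
    return "hello " + name


def main():
    # Each line of standard input becomes one greeting on standard output.
    for line in sys.stdin:
        print(greet(line.rstrip("\n")))


if __name__ == "__main__":
    main()
```

Saved as hello.py, it could be used as `seq 3 | python hello.py`, which greets each of the three input lines.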
Comparing R, Python, Julia
My History
I started with Matlab and R in college courses in 2012, and I started using Python in 2013 at work. I’ve had my eye on Julia since around 2015, and dabbled just a bit.
People invest tremendous amounts of time developing skill in a language or learning technologies. It can be quite threatening to hear “X is dumb, Y is better.” Keep this in mind next time you witness a technology flame war.
R, Python, Julia, and Matlab are all relatively high level languages. Their speed for data analysis comes from compiled code. I would argue that the high level syntax is not that different among languages, particularly for linear algebra, which tends to resemble the mathematical form.
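The claim that speed comes from compiled code can be seen in a small experiment. The sketch below assumes the standard CPython interpreter, where the builtin sum() runs its loop in compiled C, and compares it with the same loop written in interpreted Python. The function name python_sum is made up for this illustration, and the timings will differ from machine to machine.

```python
import timeit


def python_sum(values):
    """Add up `values` one element at a time in interpreted Python."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(1_000_000))

# Both approaches compute the same answer.
assert python_sum(values) == sum(values)

# Time five runs of each; the builtin delegates its loop to compiled code.
loop_time = timeit.timeit(lambda: python_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)
print(f"interpreted loop: {loop_time:.3f} s")
print(f"builtin sum():    {builtin_time:.3f} s")
```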
The Two Language Problem
The “two language problem” is that sometimes you want or need much faster code than you can get with an interpreted language like Python or R, so you rewrite the slow code in a faster language like C or Fortran. This is a problem because now you need to know three things (taking R and C as the example): the R language, the C language, and how the interface between them works. In general, when programming you want to be in the highest level language possible, because this allows you to get more done with less code, so you’re more productive.
Language | 1.0 release year | Initial idea |
---|---|---|
Python | 1994 | code should be easy to read (high level) |
R | 2000 (S much earlier) | by statisticians, for statisticians |
Julia | 2018 | solve the two language problem |
123 GO: What do you like about one language that you use? I may ask a few of you to elaborate.
This class is about concepts, not syntax. People can and do use R, Python, and Julia for the kinds of applications we will do in this class. You’re welcome to use whatever language you want, but I encourage you to stick with Julia for this semester, because I believe Julia will make it easier to write programs that are efficient enough for the data sets we’re after.
- lecture and homework will use Julia
- you’ll get better support from me and your peers
The final project is a good time to experiment with different languages.
Julia basic script
This is a basic example, not following best practices yet.
# Example usage:
#
# $ seq 3 | julia hello.jl
# hello 1
# hello 2
# hello 3
"""
    greet(x)

Greet `x` with a hello.
"""
function greet(x)
    println("hello " * x)
end

# Process stdin
for line in eachline()
    greet(line)
end
Builtin variables in Julia relevant to shell pipelines:
- stdin: standard input
- stdout: standard output
- ARGS: command line arguments
What great variable names!
Julia script
Allow the user to specify behavior as a command line argument, and add main().
This is closer to best practices.
# Example usage:
#
# $ seq 3 | julia hello2.jl bye
# hello 1bye
# hello 2bye
# hello 3bye
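function greet(x, after = "")
    println("hello " * x * after)
end

# Process stdin, appending the optional first command line argument to each greeting
function main()
    # Fall back to an empty suffix when no argument is given
    user_after = isempty(ARGS) ? "" : ARGS[1]
    for line in eachline()
        greet(line, user_after)
    end
end

# Run main() only when this file is executed as a script, not when it is included
if abspath(PROGRAM_FILE) == @__FILE__
    main()
end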
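For comparison, roughly the same structure in Python is sketched below. This is one possible translation, not part of the course material, and the file name hello2.py is hypothetical. Here sys.argv plays the role of ARGS, except that sys.argv[0] holds the script name, so the first user argument is sys.argv[1].

```python
import sys


def greet(x, after=""):
    """Return the greeting for `x`, followed by the optional suffix `after`."""
    return "hello " + x + after


def main():
    # Use an empty suffix when the user gives no command line argument.
    user_after = sys.argv[1] if len(sys.argv) > 1 else ""
    for line in sys.stdin:
        print(greet(line.rstrip("\n"), user_after))


# Run main() only when the file is executed as a script, not when it is imported.
if __name__ == "__main__":
    main()
```

Saved as hello2.py, it would be used the same way as the Julia version: `seq 3 | python hello2.py bye`.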
Exercise
Modify the above Julia program to create a script / command line program that behaves like seq when used with one argument:
$ seq 3
1
2
3
Call your file seq.jl and verify it behaves as follows (for any positive integer, not just 4):
$ julia seq.jl 4
1
2
3
4
Upload your program to Canvas as seq.jl.txt.