Improving Code Stream Homework

4 minute read

  • Critique and improve code by applying general engineering principles

Reading your code taught me new things about Julia, and helped me rewrite my code to improve it. Thanks!

Adopt a beginner / growth oriented mindset when you learn a new technology.

Announcements

Resubmit reservoir sampling homework through Monday.

New test cases :

  1. time seq 1e9 | julia shuf.jl samples 100 integers from 1 to 1 billion
  2. seq 20 | sed "s/1/one/" | julia shuf.jl shuffles non integer input lines
  3. seq 10 | julia shuf.jl 10 shuffles the integers from 1 to 10
  4. seq 11 | julia shuf.jl 10 samples and shuffles 10 integers without replacement from 1 to 11.

123 GO - What’s one food that you cooked recently?

References

Warm up

Most did something like this to put a default for the command line argument:

if isempty(ARGS)
    sampleSize = 100  #default sample size
else
    sampleSize = parse(Int64, ARGS[1])
end

The general pattern is if CONDITION, where CONDITION was one of the following:

ARGS == []              1
ARGS != []              2
isempty(ARGS)           3
!isempty(ARGS)          4
length(ARGS) != 0       5
length(ARGS) == 0       6
size(ARGS)[1] == 1      7
length(ARGS) > 0 && length(ARGS[1]) > 0     8

123 GO: which is your favorite?

A different approach is to try and then catch the exception.

numRows = 100
try
    numRows = parse(Int, ARGS[1])
catch
end

Quotes

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

C.A.R. Hoare


Rule of Simplicity: Design for simplicity; add complexity only where you must.

Eric Raymond


I hate code, and I want as little of it as possible in our product.

Jack Diederich


  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.

Tim Peters

Examples

These examples are brief snippets of code illustrating broader ideas. They won’t run without a little more context. I’d be happy to work them into self contained examples, just ask!


Organize logic into small functions you can understand, test, and tinker with. This is the essence of programming.


# 1
# jump right into reservoir sampling

# 2 
function main()
    ... # implement reservoir sampling

# 3 
function shuf(data=stdin, size=parse(Int, ARGS[1]))

# Separate function into something that works on any data stream.
function reservoir_sample(stream, size=100)

function main()
    R = reservoir_sample(stdin)

DRY Don’t repeat yourself.

if i < sampleSize  #smaller sequence than sample size
    num = Random.shuffle!(num)
    display(num) 
else
    display(num)  
end  

123 GO: What is unnecessarily repeated?

Better:

if i < sampleSize  #smaller sequence than sample size
    num = Random.shuffle!(num)
end  
display(num)  

Better:

if i < sampleSize
    num = Random.shuffle(num)
end  
display(num)  

There’s still a bug here, and we can exercise it with this test:

$ seq 11 | julia shuf.jl 10
1
2
3
4
11
6
7
8
9
10

Best:

# Before:
if i < sampleSize  #smaller sequence than sample size
    num = Random.shuffle!(num)
    display(num) 
else
    display(num)  
end  

# After:
num = Random.shuffle(num)
display(num)  

Stream (iterate) through the data for memory efficiency. There’s no reason to use reservoir sampling if the data fits in memory.

data = readlines(stdin)  # Reads EVERYTHING from stdin

for line in eachline(stdin)
    push!(stream,parse(Int,line))
end

We like lazy data structures that stream over inputs when processing large data.


More generally, assume as little as possible about your inputs.

for line in eachline()
    n = parse(Int, line)

    # ... do something with n


# Better:
for line in eachline()

    # ... do something with line

Choose descriptive variable names.

# Any argument for args? seems less ideal
args = parse(Int, ARGS[1])

# Strong argument- following notation in Wikipedia algo
k = parse(Int, ARGS[1])

reservoir_size = parse(Int, ARGS[1])

reservoirSize = parse(Int, ARGS[1])

123 GO- what’s your favorite?

Most say reservoir_size, reservoirSize

Too far: reservoirSizeForSimpleReservoirSampling


Avoid global variables when possible. The only global variables should be the functions that you define (usually).

global i = 0  # defining
for line in eachline()
    global i += 1   # reads and updates a global variable - AVOID when possible
    # ...


# This will work and be better:
function main()
    i = 0
    for line in eachline()
        i += 1
end
 

The way to avoid globals here is to wrap everything up in one function, see demo after class on Monday, Mar 1.


Iterate directly over objects, avoiding explicit indexing (like x[i]) if possible.

for i = 1:(size(result, 1))
    print(result[i])
    println()
end

# Better:
for ri in result
    println(ri)     # ri will be set to each value in iterable object `result`
end

# can iterate over `eachline(stdin)`


Learn the idioms of your language.

This takes years of experience and inquisitiveness.

for tup in enumerate(data)
    i, val = tup[1], tup[2]

Format your code consistently.

In particular, don’t mix tabs and spaces. It’s a programmer faux pas.

    for element in reservoir
    println(element)
  end

Updated: