Split Apply Combine


  • Split data, apply functions, and combine results

Announcements:

  • Data science panel Friday, thanks for submitting questions
  • Homework hint: for the questions on the martingale, it may be easiest to modify the doublebet strategy function. Ask me on Discord and I’ll help you with this, because we haven’t covered control flow like loops and conditional statements.

Review

Last class we looked again at the roulette simulations.

source("https://raw.githubusercontent.com/clarkfitzg/stat128/master/roulette.R")

set.seed(1380)
d = play(simple_strategy(even), nplayers = 500L, ntimes = 1000L)

Our eventual goal was to compute a couple summary statistics for every value of time in a specific form, so we wrote a function summary_at_time to do this for one particular time.

#' Returns several summary statistics at a particular time
#'
#' @param dt data frame with only one distinct time value
#' @return output: data frame with one row containing columns mean, sd, and time
summary_at_time = function(dt)
{
    time = unique(dt$time)
    data.frame(time, mean = mean(dt$winnings), sd = sd(dt$winnings))
}

We can use our function as follows:

d1 = d[d$time == 1, ]
s1 = summary_at_time(d1)

Let’s do a few more time points.

d2 = d[d$time == 2, ]
s2 = summary_at_time(d2)
d3 = d[d$time == 3, ]
s3 = summary_at_time(d3)

s1, s2, and s3 contain the data we want, so we just need to “stack” all these data frames into one using rbind, and then we can make our plot, at least for three data points.

rbind(s1, s2, s3)

We have 1000 times to do, so we need to repeat the above code 1000 times, right? 😜

s1 = summary_at_time(d[d$time == 1, ])
s2 = summary_at_time(d[d$time == 2, ])
s3 = summary_at_time(d[d$time == 3, ])
s4 = summary_at_time(d[d$time == 4, ])
s5 = summary_at_time(d[d$time == 5, ])
s6 = summary_at_time(d[d$time == 6, ])
s7 = summary_at_time(d[d$time == 7, ])
s8 = summary_at_time(d[d$time == 8, ])
# ...

No way! There’s a better way. Recall the DRY principle: “Don’t Repeat Yourself”.

lapply

lapply is used to apply the same function to many data elements. Let’s see it in action.

What does seq do?

seq(2)
seq(5)

Suppose we need many sequences.

seq15 = lapply(1:5, seq)

lists

What is this seq15 object? Could it be a data frame? Columns in a data frame must all have the same length, and the elements of seq15 have lengths 1 through 5, so it doesn’t make sense to put it in a data frame.

class(seq15)

It’s a list, which is a general purpose data container. Any element of a list can be any kind of object, including another list.

seq15[[2]] = letters
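Here is a small sketch of how indexing a list behaves. The list x is made up for illustration and is not part of the roulette data. Double brackets pull out one element, while single brackets give back a shorter list.

```r
# Hypothetical list, for illustration only
x = list(1:3, letters[1:2], list(TRUE, "nested"))

# Elements can differ in type and length
lengths(x)        # number of items in each element

# [[ ]] extracts the element itself
class(x[[1]])

# [ ] keeps the list wrapper, returning a list of length 1
class(x[1])

# An element can itself be a list
class(x[[3]])
```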

Q: It’s often more convenient to have results in a data frame or vector, so why return a list?

A: Because the list is so general, it’s always safe for lapply to return a list. It means lapply doesn’t need to make any assumptions about what your code does, and this is a great general principle in software engineering: Make as few assumptions as you can :)
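When you do know the shape of the results, you can convert the list yourself. The sketch below uses two base R functions we haven’t covered in these notes, vapply and unlist, on a fresh list called seqs (a name chosen here for illustration).

```r
seqs = lapply(1:5, seq)

# Every call to length() returns one integer, so we can safely ask for a vector.
# integer(1) is the template each result must match.
lens = vapply(seqs, length, integer(1))
lens

# unlist flattens a list of atomic vectors into one long vector
flat = unlist(seqs)
length(flat)
```

vapply stops with an error if any result doesn’t match the template, so the assumption is checked rather than silently trusted.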

lapply details

lapply is roughly equivalent to this code:

lapply_idea = function(X, FUN, ...)
{
    result = vector(mode = "list", length = length(X))
    # Loop over the positions in X, not its values, so this works for any X
    for(i in seq_along(X)){
        result[[i]] = FUN(X[[i]], ...)
    }
    result
}

lapply_idea(1:5, seq)
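The ... in the signature is how extra arguments reach FUN. Here is a short sketch with the real lapply and a made-up list called scores: na.rm = TRUE is passed along to mean on every call.

```r
# Hypothetical data with a missing value
scores = list(c(1, NA, 3), c(4, 5))

# na.rm = TRUE is forwarded to mean() for each element
with_na_rm = lapply(scores, mean, na.rm = TRUE)
with_na_rm

# Without it, the first mean is NA
without_na_rm = lapply(scores, mean)
without_na_rm
```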

You might ask, why use lapply instead of just a loop then? There are a few answers:

1) lapply uses less code than the loop, which means fewer opportunities for bugs.
2) lapply is fast and idiomatic.
3) lapply is easy to make faster through parallel programming.
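To give a rough idea of the third point, here is a sketch using mclapply from the parallel package that ships with R. The choice of 2 worker processes is an assumption for illustration. mclapply relies on forking, which isn’t available on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Assumption: 2 workers. On Windows mclapply only supports mc.cores = 1.
cores = if (.Platform$OS.type == "windows") 1L else 2L

serial_result = lapply(1:5, seq)
parallel_result = mclapply(1:5, seq, mc.cores = cores)

# Same answer either way; only the way the work is scheduled differs
identical(serial_result, parallel_result)
```

For a task this tiny the parallel version will likely be slower because of the overhead of starting workers. The point is only that the code barely changes.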

split

We have a data frame with time as a column. If we could split our data frame apart into a list based on the unique values in the time column, then we would be able to use our function with lapply. Let’s try split.

ds = split(d, d$time)

ds should have as many elements as there are distinct values in the time column. Let’s verify.

length(unique(d$time))

length(ds)

Let’s take a look at some element in ds. It should be a data frame where all the time values are the same.

head(ds[[234]])
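It can be easier to see what split does on a data frame small enough to print in full. The data frame toy below is made up for illustration. It stands in for d, with 2 players observed at 3 times.

```r
# Toy stand-in for d (made-up values)
toy = data.frame(
    player = rep(1:2, times = 3),
    time = rep(1:3, each = 2),
    winnings = c(1, -1, 2, 0, 1, -1)
)

pieces = split(toy, toy$time)

names(pieces)         # one name per distinct time, stored as character
pieces[["2"]]         # only the rows where time == 2
sapply(pieces, nrow)  # number of rows in each piece
```

Notice that split names each piece after its time value, so you can look pieces up by name as well as by position.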

Wonderful, now we can apply our summary function.

ds2 = lapply(ds, summary_at_time)

head(ds2)

Combining

We have our results; we just need to combine all the rows together. rbind works just fine to combine the first few:

rbind(ds2[[1]], ds2[[2]], ds2[[3]], ds2[[4]])

But we’re not going to put 1000 numbers in there; that would violate DRY. How do we do it? Check out the documentation for rbind.

?rbind

rbind(..., deparse.level = 1, make.row.names = TRUE,
    stringsAsFactors = default.stringsAsFactors(), factor.exclude = TRUE)

Arguments:

     ...: (generalized) vectors or matrices.

The first argument is ..., read as “dot-dot-dot” or “ellipsis”. This means rbind accepts an arbitrary number of arguments, and operates on them all, which is a general idiom in R.
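You can write your own functions that take ... too. The helper total_of below is hypothetical, made up only to show the idea: it captures whatever arguments it is given with list(...) and adds them up.

```r
# Hypothetical helper: accepts any number of numeric arguments
total_of = function(...)
{
    args = list(...)     # capture the arguments as a list
    sum(unlist(args))
}

total_of(1, 2)
total_of(1, 2, 3, 4)

# rbind handles its arguments in the same spirit: one row per argument
m = rbind(1:2, 3:4, 5:6)
dim(m)
```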

We want ALL of the elements in ds2 to appear as arguments to rbind. In other words, we have a function to call, and a list of all the arguments that we would like to call that function with. That’s the purpose of do.call.

dfinal = do.call(rbind, ds2)

This says, call the function rbind with the arguments in ds2. Let’s look at it and make sure it’s what we want. NEVER assume something worked :)
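To convince yourself that do.call really is the same as typing out the arguments, here is a sketch with three tiny made-up data frames. The names a, b, and c3 and their values are for illustration only.

```r
# Three one-row data frames with made-up values
a = data.frame(time = 1, mean = 0.5)
b = data.frame(time = 2, mean = -0.2)
c3 = data.frame(time = 3, mean = 0.1)

by_hand = rbind(a, b, c3)
by_do_call = do.call(rbind, list(a, b, c3))

# Both approaches build the same data frame
all.equal(by_hand, by_do_call)

# If the list has names, rbind uses them as row names for one-row data frames
named = do.call(rbind, list(first = a, second = b))
rownames(named)
```

This explains the row names you see on dfinal: they come from the names that split put on the list.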

dim(dfinal)
head(dfinal)
tail(dfinal)

Putting it all together

d contains all our data.

ds = split(d, d$time)
ds2 = lapply(ds, summary_at_time)
dfinal = do.call(rbind, ds2)
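If you want to try the three steps without downloading the roulette code, here is a sketch that runs on its own. It repeats summary_at_time from above and uses a made-up data frame toy in place of d. The names pieces, summaries, and result are chosen for illustration.

```r
# Copy of summary_at_time from above, so this block runs on its own
summary_at_time = function(dt)
{
    time = unique(dt$time)
    data.frame(time, mean = mean(dt$winnings), sd = sd(dt$winnings))
}

# Toy stand-in for d (made-up values): 3 players, 4 times
toy = data.frame(
    time = rep(1:4, each = 3),
    winnings = c(1, -1, 1,  2, 0, 0,  3, -1, 1,  2, -2, 0)
)

pieces = split(toy, toy$time)                 # split
summaries = lapply(pieces, summary_at_time)   # apply
result = do.call(rbind, summaries)            # combine

result

# Cross-check the means with tapply, another base R grouping tool
check = tapply(toy$winnings, toy$time, mean)
all.equal(result$mean, as.vector(check))
```

Checking against a second, independent calculation is one way to follow the advice above and never assume something worked.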

In the future we’ll look at the packages data.table and dplyr, which can make this kind of operation simpler and faster. The purpose of studying it in detail here is to learn the concepts, along with some standard R idioms.

Now we can happily create our plot with ggplot2.

library(ggplot2)

g = ggplot(data = dfinal, mapping = aes(x = time, y = mean)) +
    geom_line()

print(g)
