Introduction To User Defined Functions


Implement a function given a description of what it should do
Describe R’s formula syntax, y ~ x
Select rows of a data frame by index

123GO - what’s your dream job?

Announcements:

The standards for homework grading are high, to help you grow.
I’ll provide an example of a submission that would earn a full score.

live notes

Defining Functions

Functions are the heart of R. Everything that happens in R is a function call. To really use R, we need to define our own functions.

Q: What’s the mathematical definition of a function?

A: Maps an input to exactly one output. input ---> output

In the last homework, we found the data point that was closest to the mean. Let’s implement this as a general function, closest_to_mean.

Q: 123GO - What’s a simple example we can test to verify the behavior of our function? We need an input and output.

A: There are many ways. Here’s one: closest_to_mean(c(10, 11, 12)) should output 11.

Let’s write a placeholder for the function.

#' Return the data point in `x` closest to the mean of `x`.
#'
closest_to_mean = function(x)
{
}

The empty braces {} after function(x) mean that this function doesn’t do anything yet. It’s like a skeleton or outline, ready for implementation. We can attempt to use it, and verify it doesn’t do anything:

> closest_to_mean(precip)
NULL

NULL represents the absence of a value, and we’ll talk about it in detail in a future lecture.
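Here’s a quick sketch to confirm that an empty body really returns NULL. The name do_nothing is made up for this example; it isn’t part of the homework.

```r
# A function whose body is empty braces evaluates to NULL.
# `do_nothing` is a hypothetical name used only for illustration.
do_nothing = function(x)
{
}

result = do_nothing(c(10, 11, 12))

# is.null() tests whether a value is NULL.
is.null(result)
```

The input doesn’t matter here, because the body never uses x.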

In the last homework, we did this calculation for precip, and we can implement our function based on this code:

diff_from_mean = abs(precip - mean(precip))
closest_index = which.min(diff_from_mean)
precip[closest_index]

The code above is written with precip as the input. We need to change it so that x is the input, and stick it in the body of the function, the part between the braces {}.

#' Return the data point in `x` closest to the mean of `x`.
#'
closest_to_mean = function(x)   # <--- function name, arguments
{                               # <--- begin body
    diff_from_mean = abs(x - mean(x))
    closest_index = which.min(diff_from_mean)
    x[closest_index]            # <--- last line of body is the return value
}

Finally, we verify that our function does what we expected.

> closest_to_mean(precip)
Cleveland
       35

> closest_to_mean(c(10, 11, 12))
[1] 11
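One way to make this kind of check repeatable is to write it down with stopifnot, which stays silent when every condition is TRUE and raises an error otherwise. This is only a sketch; the extra test inputs below are my own assumptions, not part of the homework.

```r
# Same function as above, repeated so this block runs on its own.
closest_to_mean = function(x)
{
    diff_from_mean = abs(x - mean(x))
    closest_index = which.min(diff_from_mean)
    x[closest_index]
}

# The simple example from earlier: the mean is 11, so 11 is closest.
stopifnot(closest_to_mean(c(10, 11, 12)) == 11)

# The order of the input should not change the answer.
stopifnot(closest_to_mean(c(12, 10, 11)) == 11)

# Ties: 10 and 12 are both distance 1 from the mean 11.
# which.min returns the index of the FIRST minimum, so we get 10.
tie_result = closest_to_mean(c(10, 12))
tie_result
```

The tie case is worth knowing about: the function still returns exactly one output, but which one depends on the order of x.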

🤑 Sweeeet.

Model Formulas

R uses ~ (tilde, pronounced “till-duh”) to specify statistical models. The most basic way to specify a model is y ~ x, which you can think of as saying “model y as a function of x”.

The R documentation goes into more detail.

?formula

The models fit by, e.g., the ‘lm’ and ‘glm’ functions are specified in a compact symbolic form. The ‘~’ operator is basic in the formation of such models. An expression of the form ‘y ~ model’ is interpreted as a specification that the response ‘y’ is modelled by a linear predictor specified symbolically by ‘model’.

Many functions accept model formulas, for example, boxplot. Let’s load our data for the homework and see.

air = read.csv("http://webpages.csus.edu/fitzgerald/files/stat128/fall20/ad_viz_plotval_data.csv")

Suppose we want to see comparative boxplots of DAILY_AQI_VALUE, split into groups by COUNTY.

boxplot(DAILY_AQI_VALUE ~ COUNTY, data = air)

We see a boxplot for each county, all on the same y axis scale, so that we can easily compare them. Think about the formula DAILY_AQI_VALUE ~ COUNTY as saying, “model DAILY_AQI_VALUE as a function of COUNTY”.
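A formula is an ordinary R object, so we can save it in a variable and pass it to different functions. The sketch below uses the built-in mtcars data instead of the homework file, so it runs without a download; using aggregate here is my own choice of a second function that accepts a formula.

```r
# Save a formula in a variable. mtcars ships with R.
f = mpg ~ cyl

class(f)       # the class of a formula object
all.vars(f)    # the variable names that appear in the formula

# aggregate() also accepts a formula:
# "summarize mpg as a function of cyl", here using the median.
med = aggregate(f, data = mtcars, FUN = median)
med
```

The result has one row per value of cyl, which is the same grouping that boxplot(mpg ~ cyl, data = mtcars) would draw.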

Q: How can we group all the counties besides Sacramento together and do a boxplot?

A: We need a new column saying whether or not each row belongs to Sacramento County.

air[, "Sac"] = air[, "COUNTY"] == "Sacramento"
boxplot(DAILY_AQI_VALUE ~ Sac, data = air)
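To see what the new column contains, here’s the same idea on a tiny made-up data frame. The values in toy are invented for illustration and don’t come from the homework data.

```r
# Hypothetical toy data, NOT the real air quality file.
toy = data.frame(
    COUNTY = c("Sacramento", "Yolo", "Placer", "Sacramento"),
    DAILY_AQI_VALUE = c(40, 55, 62, 38)
)

# == compares every element of COUNTY to "Sacramento",
# producing one TRUE or FALSE per row.
toy[, "Sac"] = toy[, "COUNTY"] == "Sacramento"

toy[, "Sac"]          # TRUE FALSE FALSE TRUE
table(toy[, "Sac"])   # how many rows fall in each group
```

Because Sac has only two distinct values, a formula like DAILY_AQI_VALUE ~ Sac splits the data into exactly two groups.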

Selecting Rows by Index

I’d like to reorder air so that the smallest values of DAILY_AQI_VALUE show up first. How do we do this?

Subsetting x with integers allows us to permute x, and we can use this idea to reorder our data frames. Here’s an example of randomly permuting letters.

> rand_index = sample(length(letters))
>
> rand_index
 [1] 14  6 19  2  4 16  3 10 18 13 15  8 23 11  1 12  5 26 21  7 24 25 20  9 17
[26] 22
> letters[rand_index]
 [1] "n" "f" "s" "b" "d" "p" "c" "j" "r" "m" "o" "h" "w" "k" "a" "l" "e" "z" "u"
[20] "g" "x" "y" "t" "i" "q" "v"

rand_index is a random permutation of the integers 1 to 26. letters[rand_index] selects the elements of letters corresponding to rand_index. 14 comes first in rand_index, so the 14th letter n shows up first. 6 comes next in rand_index, so the 6th letter f shows up next, and so on.
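The random output above will differ every time you run it. Here’s a small sketch that is easier to check by hand, followed by a reproducible version of the random permutation. The seed value 128 is arbitrary, an assumption I picked for the example.

```r
# A permutation we choose ourselves, so we can predict the result.
first_three = letters[1:3]          # "a" "b" "c"
index = c(3, 1, 2)
first_three[index]                  # "c" "a" "b"

# set.seed() makes sample() reproducible. The seed is arbitrary.
set.seed(128)
rand_index = sample(length(letters))
shuffled = letters[rand_index]

# A permutation contains each of 1, 2, ..., 26 exactly once.
identical(sort(rand_index), 1:26)   # TRUE

# So the shuffled letters are the same set of letters, reordered.
setequal(shuffled, letters)         # TRUE
```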

Recall general data subsetting has this form.

data[rows, columns]
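As a reminder of how that form behaves, here’s a sketch on a small made-up data frame. The name df and its values are hypothetical.

```r
# Hypothetical data frame for illustration.
df = data.frame(name = c("a", "b", "c"), value = c(30, 10, 20))

# Rows 3 and 1, in that order. Leaving columns blank keeps all columns.
df[c(3, 1), ]

# Leaving rows blank keeps all rows. One column comes back as a vector.
df[, "value"]         # 30 10 20

# One row and one column gives a single element.
df[2, "value"]        # 10
```

Notice that df[c(3, 1), ] returns the rows in the order we asked for, which is exactly what we need for sorting.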

We need a vector of integers that shows the sorted order of DAILY_AQI_VALUE. order will give us this.

aqi_order = order(air[, "DAILY_AQI_VALUE"])
head(aqi_order)

The first two values in the output were 1065 and 582. This means element 1065 is the smallest value in DAILY_AQI_VALUE, followed by element 582, and so on. Let’s sort the data using aqi_order and save the result.

air2 = air[aqi_order, ]
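Here’s the same technique on a vector small enough to check by hand. The data frame small and its columns are made up for this sketch.

```r
x = c(30, 10, 20)

# order() returns the positions of the elements from smallest to largest:
# the smallest value (10) is element 2, then element 3, then element 1.
ord = order(x)
ord                   # 2 3 1

# Subsetting with those positions sorts the vector.
x[ord]                # 10 20 30

# The same index sorts every column of a data frame together.
small = data.frame(id = 1:3, aqi = x)
small_sorted = small[ord, ]
small_sorted[, "id"]  # 2 3 1
```

Each row stays intact: the id values move along with their aqi values.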

# 123GO - What do you think, same or different?
dim(air)
dim(air2)

Let’s plot them and see what happened.

plot(air[, "DAILY_AQI_VALUE"], type = "l")  # "l" for line plot

plot(air2[, "DAILY_AQI_VALUE"], type = "l")
