Selecting Subsets Of Data Frames
- create boolean vectors using comparison operators
- select rows by condition
- reorder rows of data frame
We also need to know about dates:
- date class
- format strings
- conversion to dates
- compare dates
- update columns in place
123GO - what’s your word of the day?
Announcements
- These online notes are more thorough, please use them in your homework.
- HW graded
References
Dates
Let’s load the data for our homework.
air = read.csv("http://webpages.csus.edu/fitzgerald/files/stat128/fall20/ad_viz_plotval_data.csv")
We can extract some columns of interest.
d = air[, "Date"]
pm2 = air[, "Daily.Mean.PM2.5.Concentration"]
If we try to plot pm2
as a function of d
, something bad happens.
plot(d, pm2)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
Q: Anyone know what’s the matter?
A: R doesn’t know how to plot d
.
Why not? These look like dates, which you and I can read.
head(d)
[1] "01/22/2020" "01/25/2020" "01/01/2020" "01/04/2020" "01/07/2020"
[6] "01/10/2020"
The problem is that we haven’t yet told R that they are dates.
class(d)
[1] "character"
Let’s convert them into dates!
d2 = as.Date(d)
Error in charToDate(x) :
character string is not in a standard unambiguous format
We need to specify the date format.
?as.Date
as.Date(x, format, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"),
Q: What do you think this "%Y-%m-%d"
means in the context of a date?
A: Year-month-day, so dates should look like “2020-08-16”.
> head(d)
[1] "01/22/2020" "01/25/2020" "01/01/2020" "01/04/2020" "01/07/2020"
[6] "01/10/2020"
Q: Does our data look like one of those format strings?
A: Nope. We need to supply our own.
Q: What order does our string use?
A: The first element, "01/22/2020"
implies month/day/year, so the right format string to use is “%m/%d/%Y”.
d2 = as.Date(d, format = "%m/%d/%Y")
class(d2)
plot(d2, pm2)
That looks more like it!
Q: Did we change the data in d or our original data frame?
A: No
class(d)
class(air[, "Date"])
Suppose we want to?
air[, "Date"] = as.Date(air[, "Date"], format = "%m/%d/%Y")
class(air[, "Date"])
logical comparisons
Logical vectors are vectors where the value is either TRUE or FALSE (or NA, but that’s for later). Comparisons are one way to create logical vectors.
x = c(1, 2, 2, 3, 18)
Let’s check if x
is equal to 2.
x == 2
I can assign this value into a new variable, which has class “logical”.
isx2 = x == 2
class(isx2)
Note =
does variable assignment, which is COMPLETELY different from ==
for equality comparison.
Your turn: Predict the result of the following.
1 < x # 1
x <= 2 # 2
You can use the same idea with dates. Let’s just look at the first ten elements:
air[1:10, "Date"]
Since we converted them to dates above, we can check:
as.Date("2020-01-14") < air[1:10, "Date"]
We could save this result as a new column in our data frame.
air[, "afterJuly1"] = as.Date("2020-07-01") <= air[, "Date"]
row selection
We can select subsets of the data that we’re interested in.
If data
is our data frame, and columns
is a vector of column names, then we select columns as follows:
data[, columns]
To select a subset of rows, this pattern generalizes to:
data[rows, columns]
That is, leaving rows
empty selects all rows.
Suppose I want to select the rows where the value of the COUNTY
column is "El Dorado"
.
We can check if this is TRUE
or FALSE
for every row:
edcounty = air[, "COUNTY"] == "El Dorado"
Then we can filter out the rows we like using this logical vector:
air2 = air[edcounty, ]