Based on data for 390 PAL students and 888 non-PAL students, the unadjusted, raw difference in average grade for PAL and non-PAL students was 0.21 on a A=4.0 grade scale. However, since students self-select into supplemental PAL instruction, it is possible that the resulting PAL and non-PAL groups were not balanced with respect to other characteristics which could impact course grade. For example, if students with better study habits tend to enroll in PAL, all else being equal, the PAL mean grade would be higher than non-PAL-- even if PAL had no effect on course grade. Consequently, we also performed a propensity score analysis to adjust the estimated effect of PAL on course grade for potential self-selection biases.
After adjusting for self-selection bias, we found that PAL students earned an average grade \(0.38 \pm 0.10\) higher than non-PAL students. A sensitivity analysis indicates that this analysis is moderately sensitive to unknown confounders. Although the data give us sufficient evidence to conclude that PAL increases students' grades in Bio 121, the existence of an unknown confounder similar in magnitude to stem.flg (which indicates if the student's major is STEM or not) would nullify that conclusion.
A propensity score analysis was conducted to assess the effect of PAL supplemental instruction on Bio 121 course grade. Propensity score adjustment was necessary since the data are observational and the characteristics of students who voluntarily enroll in PAL may differ in ways that may, independently of PAL, impact course grade compared to students who do not enroll in PAL. In propensity score analysis, variables related to both likelihood of PAL enrollment and course grade (confounders) are used in a logistic regression model to obtain a propensity score, which is a student's likelihood of enrolling in PAL.
For Bio 121, seven covariates were found to have a statistically significant relationship to likelihood of enrolling in PAL. Variables related to increased likelihood of enrolling were: enrollment in PAL in the past, OTHER ethnicity (combined ethnicities of Foreign, Native American, and Pacific Islander), and being a STEM major. Higher term units attempted was related to a higher likelihood of enrolling in PAL whereas lower grades in Bio 184 (the prereq) were related to higher probability of PAL enrollment. Students previously enrolled in PASS and seniors were less likely to enroll in PAL. Instructor was also related to probability of PAL enrollment.
Using the propensity score model, all students in the dataset, PAL and non-PAL, are assigned a propensity score. Then, each PAL student is matched to one or more non-PAL students who have similar propensity score(s). After matching, the PAL and matched non-PAL groups are compared to determine if the distribution of each covariate is similar between the two groups. This is called a balance check. If the standardized difference between the non-PAL and PAL means is less than 0.10 than the strong criteria in (Liete 2017, p.10) is met for covariate balance. If the standardized difference is under 0.25, than a more lenient criteria is met. The highest absolute value standardized mean differences in this analysis are 0.1276 and 0.0715. Consequently, adequate balance appears to have been achieved.
The difference in the average grade for the matched PAL and non-PAL data is then calculated. The estimated increase in the mean grade of students in PAL over those not in PAL after correcting for self-selection biases is \(0.38 \pm 0.10\) or between 0.28 and 0.48 on a 4.0 grade scale. This result is statistically significant with a P-value of 0.00009 and is based on 322 PAL students and 370 matched non-PAL students. For comparison, the non-propensity score adjusted difference in average grade for PAL and non-PAL students was 0.21.
The estimated PAL effect is based on the assumption that the propensity model includes all potential confounders for PAL enrollment and grade in Bio 121. However, it is possible that unknown confounders exist. A sensitivity analysis was conducted to determine how strong an unknown confounder must be to nullify the statistically significant PAL effect that was found in this analysis. The sensitivity analysis (Rosenbaum, 2002) indicated that an unknown confounder which increases the odds of being in PAL by more than 1.4 is enough to change the treatment effect from significant to non-significant. Inspection of the covariates in the estimated propensity model for Bio 121 indicates that if there is an unknown confounder that has an effect on the propensity score similar to the effect of stem.flg (indicating whether or not the student is a STEM major) observed in this analysis, the PAL effect would become non-significant. Thus, this finding is sensitive to unknown confounders. It is possible a variable like the number of hours per week a student works (which is not in our dataset) is an unknown confounder which could reverse the statistical significance of this analysis.
Additionally, a number of variables such as SAT scores (46% missing) and high school gpa (41% missing) were removed from this analysis due to large amounts of missingness. Since all students who had missing information on any included covariate were eliminated from the analysis, a balance had to be struck between retaining a sufficiently large pool of PAL and non-PAL students and retaining a sufficient number of important covariates. Variables which were eliminated from this analysis had substantial missing data or were subjectively judged as unlikely to be confounding. The choices about which variables to retain resulted in the original pool of 390 PAL students in Bio 121 being reduced to 322. Also, 370 matched non-PAL students were selected out of 888 original non-PAL students. When a PAL student had more than one suitable match among the non-PAL students, all non-PAL students were taken as matches and weighted appropriately in the final estimated PAL effect.
First Add Some R Packages to the Workspace Caution: warning messages are suppressed to reduce clutter in the output.
defaultW <- getOption("warn")
options(warn = -1)
library(dplyr)
library(MatchIt)
library(Matching)
library(DataExplorer)
library(cobalt)
library(backports) # to get the deparse1() function for cobalt
library(rbounds)
library(tidyr)
options(warn = defaultW)
The script contains a function I wrote called id.vars.zero.freq to identify variables causing complete separation in logistic regression.
source("IdentifyCompleteSepVars2.R")
Make sure the PAL datafile is in the same directory as this RMarkdown file. The files which includes data through the Spring 2019 semester has 1,099,371 rows and 174 columns. The total of the grd.pt.unt column is 2,237,555.
load("~/PAL data analysis/PALdatafull.Rdata")
dim(PALdatafull)
## [1] 1099371 174
sum(PALdatafull$grd.pt.unt)
## [1] 2237555
The full file is quite large so reducing down to bio classes makes the code run faster. Each instance of a course taken by a student is included in the data file for the semester during which it was taken and every semester thereafter. 'test1' below filters so the dataset only includes the semester during which the course was taken. The biology dataset should contain 19,652 rows and 174 columns.
bio.classes <- paste("BIO", c(22, 121, 131, 139, 184))
test1 <- PALdatafull$base.time.course==1
test.bio <- PALdatafull$course %in% bio.classes
bio.indx <- test1 & test.bio
bio.dat <- PALdatafull[bio.indx, ]
dim(bio.dat)
## [1] 19652 174
The course.seq variables indicates how many times a student has taken a course prior to the current attempt. To filter on the first attempt at a course, we set course.seq to 0.
raw.table = data.frame(class=character(),
nonPALavg=numeric(),
PALavg=numeric(),
Diff=numeric(),
NonPAL_Num= integer(),
PAL_Num=integer(),
CompletePAL=numeric(),
TermPALStart=integer(),
row.names=NULL,
stringsAsFactors = FALSE)
for (i in 1:length(bio.classes))
{
curr.class = bio.classes[i]
temp = subset(bio.dat, course==curr.class & course.seq==0)
pal.start=min(unique(temp$term.code[temp$palN==2]))
# only include terms after PAL start term
temp = subset(temp, term.code>= pal.start)
x=tapply(temp$grd.pt.unt,temp$palN,
mean, na.rm=T) %>%
as.numeric %>%
round(2)
y=table(temp$palN) %>% as.numeric
raw.table[i, 'class' ] = curr.class
raw.table[i, c(2:4,7)]=c(x[1], x[3],x[3]-x[1],
round(y[3]/sum(y),2))
raw.table[i, c(5,6,8)]= c(y[1], y[3], pal.start)
}
# formatted table
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.3
kable(raw.table, caption = "Raw Comparison of PAL and non-PAL Grades (No Propensity Adjustment)")
class | nonPALavg | PALavg | Diff | NonPAL_Num | PAL_Num | CompletePAL | TermPALStart |
---|---|---|---|---|---|---|---|
BIO 22 | 1.81 | 2.33 | 0.52 | 949 | 413 | 0.29 | 2158 |
BIO 121 | 2.18 | 2.39 | 0.21 | 888 | 390 | 0.30 | 2148 |
BIO 131 | 2.33 | 2.76 | 0.43 | 1516 | 544 | 0.26 | 2148 |
BIO 139 | 2.41 | 2.61 | 0.20 | 308 | 98 | 0.24 | 2178 |
BIO 184 | 2.51 | 2.73 | 0.22 | 509 | 149 | 0.22 | 2178 |
Create new variables: delay since high school and cumulative percent of units passed. Collapse sparse categories and other miscellaneous clean up of data. Sparse categories can cause complete separation in logistic regression and are only predictive for a few students.
yr.course.taken = as.numeric(gsub(".*([0-9]{4})","\\1",bio.dat$coh.term))
bio.dat$delay.from.hs = ifelse(!is.na(yr.course.taken) & !is.na(bio.dat$hs.grad.yr), yr.course.taken-bio.dat$hs.grad.yr, NA)
sum(is.na(bio.dat$delay.from.hs)) #1864 missing values
## [1] 1864
# remove 106 students who did not complete PAL
bio.dat=subset(bio.dat, palN!=1)
#recode palN to factor with 0/1 levels
bio.dat$palN = ifelse(bio.dat$palN==2, 1, 0)
#clean up category names in m.rmd and e.rmd
bio.dat$m.rmd[bio.dat$m.rmd=="Not Remedial\nin Math"]="Not Remedial in Math"
bio.dat$m.rmd[bio.dat$m.rmd=="Remedial\nin Math"]="Remedial in Math"
bio.dat$m.rmd = droplevels(bio.dat$m.rmd)
bio.dat$e.rmd[bio.dat$e.rmd=="Not Remedial\nin English"]="Not Remedial in English"
bio.dat$e.rmd[bio.dat$e.rmd=="Remedial\nin English"]="Remedial in English"
bio.dat$e.rmd = droplevels(bio.dat$e.rmd)
table(bio.dat$e.rmd)
##
## Not Remedial in English Remedial in English
## 15474 3704
# create feature, proportion of cumulative units taken that were passes
bio.dat$cum.percent.units.passed = bio.dat$tot.passd.prgrss.start/bio.dat$tot.taken.prgrss.start
# collapse sparse categories
bio.dat=group_category(data = bio.dat, feature = "load.code", threshold = 0.05, update = TRUE)
table(bio.dat$load.code, PAL=bio.dat$palN, droplevels(bio.dat$course))
## , , = BIO 121
##
## PAL
## 0 1
## F 2104 383
## H 428 13
## OTHER 308 61
##
## , , = BIO 131
##
## PAL
## 0 1
## F 3127 491
## H 486 19
## OTHER 462 68
##
## , , = BIO 139
##
## PAL
## 0 1
## F 2315 87
## H 402 1
## OTHER 340 15
##
## , , = BIO 184
##
## PAL
## 0 1
## F 2751 144
## H 417 5
## OTHER 339 6
##
## , , = BIO 22
##
## PAL
## 0 1
## F 3332 430
## H 496 9
## OTHER 471 36
bio.dat=group_category(data = bio.dat, feature = "addr.type", threshold = 0.05, update = TRUE)
table(bio.dat$addr.type, PAL=bio.dat$palN, droplevels(bio.dat$course))
## , , = BIO 121
##
## PAL
## 0 1
## OTHER 1263 101
## PERM 1577 356
##
## , , = BIO 131
##
## PAL
## 0 1
## OTHER 1608 103
## PERM 2467 475
##
## , , = BIO 139
##
## PAL
## 0 1
## OTHER 1530 19
## PERM 1527 84
##
## , , = BIO 184
##
## PAL
## 0 1
## OTHER 1411 20
## PERM 2096 135
##
## , , = BIO 22
##
## PAL
## 0 1
## OTHER 1627 70
## PERM 2672 405
# code instructor as alphabetic letter for anonymity
bio.dat$Instructor_01=droplevels(factor(bio.dat$Instructor_01))
instructor.vec = sort(unique(bio.dat$Instructor_01))
num.instr = length(instructor.vec)
bio.dat$Instructor_01 = factor(bio.dat$Instructor_01, levels = instructor.vec, labels=as.character(1:num.instr))
key.instr.code = cbind(as.character(instructor.vec),
1:num.instr)
Collapse 'censusMajor' variable separately for each course since the amount of collapsing necessary will vary by course.
bio121.dat <- subset(bio.dat, course=="BIO 121" & term.code>2143)
bio121.dat=group_category(data = bio121.dat, feature = "censusMajor", threshold = 0.05, update = TRUE)
table(bio121.dat$censusMajor, PAL=bio121.dat$palN)
## PAL
## 0 1
## Biology 765 321
## Chemistry 86 27
## Clinical Science/Biomedical Laboratory Science 90 44
## Microbiology 47 26
## Molecular Biology 43 20
## OTHER 90 19
For Bio 184, the grade for the student's first attempt and the number of times it was taken are added to the Bio 121 dataset. However, few students retook Bio 184 so there was inadequate balance on number of times Bio 184 was taken, and it was removed from the propensity model after balance checks.
bio184.dat <- subset(bio.dat, course=="BIO 184")
bio184.dat$emplid <- as.factor(bio184.dat$emplid)
bio184.dat$term.code <- as.numeric(bio184.dat$term.code)
bio184.emplid.and.grade = bio184.dat[, c("emplid", "term.code","grd.pt.unt")]
# Extract grade on first attempt and number of times taken for Bio 184
bio184.summary = bio184.dat %>% group_by(emplid) %>% summarise_at("term.code", list(num.times.taken.bio184=length,first.term.taken.bio184=min))
# Add grade in Bio 184 to the dataframe for Bio 184
first.term.bio184.grd <- merge(bio184.summary, bio184.emplid.and.grade, by.x=c("emplid","first.term.taken.bio184"),by.y=c("emplid", "term.code"))
names(first.term.bio184.grd)[4] <- "grd.pt.unt.bio184"
# extract first attempt at Bio 121
first.term.bio121.grd <- subset(bio121.dat,
course.seq==0)
# Combine data on Bio 184 and Bio 121
first.grd.bio184.and.bio121 <- merge(first.term.bio184.grd, first.term.bio121.grd, by=c("emplid"))
Remove variables having too many missing values in order to retain a larger pool of PAL and non-PAL students.
plot_missing(first.grd.bio184.and.bio121, missing_only = TRUE)
zero_missing_vars <- first.grd.bio184.and.bio121 %>%
summarise(across( ,list(function(x) sum(is.na(x))))) %>%
gather() %>%
arrange(value)
zero_missing_vars <- zero_missing_vars %>%
filter(zero_missing_vars$value != 0)
zero_missing_vars$var.name = substr(zero_missing_vars$key,1,nchar(zero_missing_vars$key)-2)
zero_missing_vars <- first.grd.bio184.and.bio121 %>%
dplyr::select(zero_missing_vars$var.name)
missing_vars <- profile_missing(zero_missing_vars)
missing_vars
## feature num_missing pct_missing
## 1 gender 1 0.0008764242
## 2 hous.coh.term.flg 2 0.0017528484
## 3 m.rmd 2 0.0017528484
## 4 m.rmd.detail 2 0.0017528484
## 5 e.rmd 2 0.0017528484
## 6 e.rmd.detail 2 0.0017528484
## 7 rmd.flg 2 0.0017528484
## 8 rmd.subj.flg 2 0.0017528484
## 9 adm.area 2 0.0017528484
## 10 veteran 2 0.0017528484
## 11 hs.eng.units 2 0.0017528484
## 12 hs.math.units 2 0.0017528484
## 13 hs.phys.units 2 0.0017528484
## 14 hs.bio.units 2 0.0017528484
## 15 hs.lab.units 2 0.0017528484
## 16 hs.forLang.units 2 0.0017528484
## 17 hs.socstud.units 2 0.0017528484
## 18 hs.perfarts.units 2 0.0017528484
## 19 hs.elective.units 2 0.0017528484
## 20 ge.critical.thinking.status 2 0.0017528484
## 21 ge.english.comp.status 2 0.0017528484
## 22 ge.math.status 2 0.0017528484
## 23 ge.oral.comm.status 2 0.0017528484
## 24 state 2 0.0017528484
## 25 career.num 3 0.0026292726
## 26 bot.level 3 0.0026292726
## 27 enroll.status 3 0.0026292726
## 28 term.gpa 3 0.0026292726
## 29 higher.ed.gpa 3 0.0026292726
## 30 csus.gpa 3 0.0026292726
## 31 withdraw_code 3 0.0026292726
## 32 unt_taken_prgrss 3 0.0026292726
## 33 unt_passd_prgrss 3 0.0026292726
## 34 tot_cumulative 3 0.0026292726
## 35 cur_resident_terms 3 0.0026292726
## 36 cum_resident_terms 3 0.0026292726
## 37 acad.plan 3 0.0026292726
## 38 campus.gpa 4 0.0035056968
## 39 total.gpa 4 0.0035056968
## 40 term.units.attemptedCensus 4 0.0035056968
## 41 acad.stand 5 0.0043821209
## 42 zip 7 0.0061349693
## 43 pass.flg 9 0.0078878177
## 44 pass.flgN 9 0.0078878177
## 45 coh 11 0.0096406661
## 46 coh.term 11 0.0096406661
## 47 base.time.coh 11 0.0096406661
## 48 enrl.flg 11 0.0096406661
## 49 enrl.flgERS 11 0.0096406661
## 50 pst.flg 11 0.0096406661
## 51 pst.flgERS 11 0.0096406661
## 52 grad.flg 11 0.0096406661
## 53 grad.flgERS 11 0.0096406661
## 54 rtn.flg 11 0.0096406661
## 55 rtn.flgERS 11 0.0096406661
## 56 ttgF 11 0.0096406661
## 57 maxUnits 11 0.0096406661
## 58 tot_cumulative.start 11 0.0096406661
## 59 stem.final.flg 11 0.0096406661
## 60 admis.plan 12 0.0105170903
## 61 hs.grad.yr 12 0.0105170903
## 62 delay.from.hs 12 0.0105170903
## 63 pell.coh.term.flg 13 0.0113935145
## 64 stem.base.flg 14 0.0122699387
## 65 eth.erss 26 0.0227870289
## 66 urm.flgERS 26 0.0227870289
## 67 m.rmd.admin 32 0.0280455741
## 68 m.rmd.admin.detail 32 0.0280455741
## 69 mother.ed 44 0.0385626643
## 70 tot.passd.prgrss.start 48 0.0420683611
## 71 tot.taken.prgrss.start 48 0.0420683611
## 72 cum.percent.units.passed 48 0.0420683611
## 73 father.ed 82 0.0718667835
## 74 admit.term 122 0.1069237511
## 75 plan.college 219 0.1919368975
## 76 plan.college.desc 219 0.1919368975
## 77 plan.dept 219 0.1919368975
## 78 plan.deptAbbr 219 0.1919368975
## 79 plan.degree 219 0.1919368975
## 80 plan.type 219 0.1919368975
## 81 deg.plan1 407 0.3567046450
## 82 grad.term 407 0.3567046450
## 83 ttg 407 0.3567046450
## 84 grad.termERS 430 0.3768624014
## 85 hs.gpa 466 0.4084136722
## 86 sat.math.score 527 0.4618755478
## 87 sat.math.flg 527 0.4618755478
## 88 sat.verbal.score 527 0.4618755478
## 89 sat.verbal.flg 527 0.4618755478
## 90 sat.test.date 527 0.4618755478
## 91 trf.gpaADM 609 0.5337423313
## 92 treat.section 729 0.6389132340
## 93 fys.term.code 847 0.7423312883
## 94 fys.grd 847 0.7423312883
## 95 fys.rpt.flg 847 0.7423312883
## 96 pledge.term 1051 0.9211218230
## 97 deg.plan2 1120 0.9815950920
## 98 withdraw_reason 1140 0.9991235758
## 99 Instructor_02 1141 1.0000000000
## 100 deg.plan3 1141 1.0000000000
## 101 deg.plan4 1141 1.0000000000
## 102 deg.plan5 1141 1.0000000000
## 103 deg.plan6 1141 1.0000000000
sum(missing_vars$pct_missing>0.05) # remove 31 variables with >5% missing values
## [1] 31
vars.missing.data = as.character(missing_vars[missing_vars$pct_missing>0.05, "feature"])
dim(first.grd.bio184.and.bio121)
## [1] 1141 179
bio121.with.prq = first.grd.bio184.and.bio121[,!(names(first.grd.bio184.and.bio121) %in% vars.missing.data)]
dim(bio121.with.prq)
## [1] 1141 148
Some variables may be single-values and are, thus, unrelated to propensity score. There were 23 single-valued variables in this dataset which were removed.
single.category.vars = sapply(bio121.with.prq, function(x) length(unique(x))==1)
sum(single.category.vars)
## [1] 13
names(bio121.with.prq)[single.category.vars]
## [1] "AP_PHYSC" "AP_PHYSC.flg" "course" "component"
## [5] "units" "course.numeric" "div" "course.seq"
## [9] "rpt.flg" "c2s" "base.time.course" "years"
## [13] "pass.term.flg"
# remove the 13 variables just listed
keep.vars = names(which(single.category.vars==FALSE))
bio121.with.prq2 = bio121.with.prq[, keep.vars]
dim(bio121.with.prq2)
## [1] 1141 135
These are typically variables with very small frequencies in some categories such as 'veteran' and 'acad.stand'. The categories with small frequencies are usually all PAL or all nonPAL. This causes what is known as a complete separation error in estimating the logistic regression model for propensity scoring. The low frequencies in some categories don't really mean that all hypothetical students in these low-frequency categories have the same PAL enrollment behavior. The all or nothing PAL enrollment behavior is more likely an artifact of the small number of people in the category. Consequently, these sparse categories will be combined with other categories to avoid complete separation or the related variable will be discarded if it is unlikely to be a confounder.
sep.cols = id.vars.zero.freq(bio121.with.prq2, treat.col= which(names(bio121.with.prq2)=="palN"))
sum(sep.cols)
## [1] 31
names(bio121.with.prq2)[sep.cols]
## [1] "emplid" "coh.term" "eth.erss"
## [4] "m.rmd.detail" "m.rmd.admin.detail" "rmd.subj.flg"
## [7] "veteran" "AP_PHYS.flg" "AP_PHYS2.flg"
## [10] "AP_PHYSM.flg" "state" "career.course"
## [13] "acad.prog.course" "reason" "grade"
## [16] "grade.base" "pal" "bot.level"
## [19] "enroll.status" "withdraw_code" "enrl.flg"
## [22] "enrl.flgERS" "pst.flg" "pst.flgERS"
## [25] "rtn.flg" "rtn.flgERS" "ttgF"
## [28] "acad.stand" "admis.plan" "acad.plan"
## [31] "treat.flg"
# censusMajor is similar to acad.plan
# collapse eth.erss categories, remove others causing complete separation
bio121.with.prq2 = group_category(data = bio121.with.prq2, feature = "eth.erss", threshold = 0.05, update = TRUE)
Subjective judgment was used to narrow the pool of variables down to those likely to be confounders. It's important to include all variables correlated with outcome even if it is uncertain whether they are related to likeihood of enrolling in PAL. This allows for a more precise estimate of the treatment effect.
vars.to.keep = c("grd.pt.unt","palN","coh","eth.erss","fys.flg","pell.coh.term.flg","hous.coh.term.flg","mother.ed","father.ed","m.rmd","e.rmd","adm.area","ge.math.status","addr.type","median.income","pct.female.head","reason","Instructor_01","pass","bot.level","censusMajor","term.units.attemptedCensus","base.time.coh","higher.ed.gpa.start","acad.stand","plan.type","course.age","stem.flg","prevPASS","grd.pt.unt.bio184","num.times.taken.bio184","cum.percent.units.passed","gender", "load.code","prevPAL")
# father.ed and plan.type dropped earlier due to missingness
new.vars = intersect(vars.to.keep, names(bio121.with.prq2))
dim(bio121.with.prq2)
## [1] 1141 135
bio121.with.prq2 = bio121.with.prq2[ ,new.vars]
dim(bio121.with.prq2)
## [1] 1141 33
sep.cols = id.vars.zero.freq(bio121.with.prq2, treat.col= which(names(bio121.with.prq2)=="palN"))
sep.cols
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
sum(sep.cols)
## [1] 3
names(bio121.with.prq2)[sep.cols]
## [1] "reason" "bot.level" "acad.stand"
bio121.with.prq2=group_category(data = bio121.with.prq2, feature = "acad.stand", threshold = 0.03, update = TRUE)
with(bio121.with.prq2, table(acad.stand, palN))
## palN
## acad.stand 0 1
## Good Standing 744 341
## OTHER 42 14
bio121.with.prq2=group_category(data = bio121.with.prq2, feature = "reason", threshold = 0.01, update = TRUE)
with(bio121.with.prq2, table(reason, palN))
## palN
## reason 0 1
## ENRL 740 324
## OTHER 46 31
bio121.with.prq2=group_category(data = bio121.with.prq2, feature = "bot.level", threshold = 0.01, update = TRUE)
with(bio121.with.prq2, table(bot.level, palN))
## palN
## bot.level 0 1
## Junior 208 125
## OTHER 8 5
## Senior 570 225
dim(bio121.with.prq2)
## [1] 1141 33
bio121.with.prq2 = bio121.with.prq2[complete.cases(bio121.with.prq2), ]
dim(bio121.with.prq2)
## [1] 1048 33
Subjectively identified four potential confounders to force the model to retain: grd.pt.unt.bio184, cum.percent.units.passed, gender, and eth.erss. Stepwise variable selection will be used to select which of the variables currently in the PAL dataset to include in the propensity model.
min.model = glm(palN ~grd.pt.unt.bio184+cum.percent.units.passed+gender+eth.erss, data=bio121.with.prq2, family=binomial)
summary(min.model)
##
## Call:
## glm(formula = palN ~ grd.pt.unt.bio184 + cum.percent.units.passed +
## gender + eth.erss, family = binomial, data = bio121.with.prq2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1845 -0.8788 -0.7837 1.4132 1.9607
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38171 0.92223 -0.414 0.67895
## grd.pt.unt.bio184 -0.08099 0.08689 -0.932 0.35130
## cum.percent.units.passed 0.64034 1.05076 0.609 0.54225
## genderMale -0.28729 0.14058 -2.044 0.04100 *
## eth.erssAsian -0.85497 0.33310 -2.567 0.01027 *
## eth.erssHispanic -0.55330 0.33871 -1.634 0.10235
## eth.erssOTHER -0.21235 0.42740 -0.497 0.61931
## eth.erssTwo or More Races -1.36932 0.43754 -3.130 0.00175 **
## eth.erssUnknown -0.77099 0.45399 -1.698 0.08946 .
## eth.erssWhite -0.66942 0.33897 -1.975 0.04828 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1294.6 on 1047 degrees of freedom
## Residual deviance: 1273.3 on 1038 degrees of freedom
## AIC: 1293.3
##
## Number of Fisher Scoring iterations: 4
biggest = formula(glm(palN ~.-grd.pt.unt, data=bio121.with.prq2, family=binomial))
bio121.step.first.order = step(min.model,
direction="forward",scope = biggest,
trace=FALSE, k=2)
summary(bio121.step.first.order)
##
## Call:
## glm(formula = palN ~ grd.pt.unt.bio184 + cum.percent.units.passed +
## gender + eth.erss + prevPAL + term.units.attemptedCensus +
## bot.level + Instructor_01 + prevPASS + stem.flg + acad.stand,
## family = binomial, data = bio121.with.prq2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8431 -0.8490 -0.6283 1.1157 2.3530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.21891 1.08511 -1.123 0.26131
## grd.pt.unt.bio184 -0.19008 0.09585 -1.983 0.04736 *
## cum.percent.units.passed -0.63764 1.15057 -0.554 0.57945
## genderMale -0.18438 0.14896 -1.238 0.21579
## eth.erssAsian -0.75350 0.35591 -2.117 0.03425 *
## eth.erssHispanic -0.59406 0.36175 -1.642 0.10055
## eth.erssOTHER 0.09333 0.46233 0.202 0.84002
## eth.erssTwo or More Races -1.13909 0.46599 -2.444 0.01451 *
## eth.erssUnknown -0.44805 0.48091 -0.932 0.35150
## eth.erssWhite -0.54406 0.36249 -1.501 0.13338
## prevPAL 0.42085 0.06544 6.431 1.27e-10 ***
## term.units.attemptedCensus 0.15657 0.03111 5.032 4.85e-07 ***
## bot.levelOTHER 0.15345 0.72225 0.212 0.83175
## bot.levelSenior -0.51861 0.16180 -3.205 0.00135 **
## Instructor_0121 -0.57759 0.33028 -1.749 0.08032 .
## Instructor_0122 -0.17901 0.15547 -1.151 0.24956
## Instructor_0135 -1.31701 0.55547 -2.371 0.01774 *
## prevPASS -0.24216 0.12016 -2.015 0.04386 *
## stem.flgSTEM 0.33783 0.19037 1.775 0.07596 .
## acad.standOTHER -0.59611 0.43762 -1.362 0.17315
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1294.6 on 1047 degrees of freedom
## Residual deviance: 1169.2 on 1028 degrees of freedom
## AIC: 1209.2
##
## Number of Fisher Scoring iterations: 5
bio121.step.first.order$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 1038 1273.335 1293.335
## 2 + prevPAL -1 46.135395 1037 1227.200 1249.200
## 3 + term.units.attemptedCensus -1 23.837403 1036 1203.362 1227.362
## 4 + bot.level -2 14.706694 1034 1188.656 1216.656
## 5 + Instructor_01 -3 10.247171 1031 1178.408 1212.408
## 6 + prevPASS -1 3.997985 1030 1174.410 1210.410
## 7 + stem.flg -1 3.211949 1029 1171.199 1209.199
## 8 + acad.stand -1 2.000993 1028 1169.198 1209.198
model.first.order=formula(bio121.step.first.order)
An attempt to add second order terms was made. The problems with complete separation due to categories with low frequencies was compounded for two-way interaction terms. I added some two-way interactions manually then ran a likelihood ratio test to see if the linear model could be rejected in favor of the model with two-way interations. The test did not reject the null hypothesis that the linear model was adequate.
bio121.first.order.prop.model = glm(model.first.order, data=bio121.with.prq2, family=binomial)
summary(bio121.first.order.prop.model)
##
## Call:
## glm(formula = model.first.order, family = binomial, data = bio121.with.prq2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8431 -0.8490 -0.6283 1.1157 2.3530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.21891 1.08511 -1.123 0.26131
## grd.pt.unt.bio184 -0.19008 0.09585 -1.983 0.04736 *
## cum.percent.units.passed -0.63764 1.15057 -0.554 0.57945
## genderMale -0.18438 0.14896 -1.238 0.21579
## eth.erssAsian -0.75350 0.35591 -2.117 0.03425 *
## eth.erssHispanic -0.59406 0.36175 -1.642 0.10055
## eth.erssOTHER 0.09333 0.46233 0.202 0.84002
## eth.erssTwo or More Races -1.13909 0.46599 -2.444 0.01451 *
## eth.erssUnknown -0.44805 0.48091 -0.932 0.35150
## eth.erssWhite -0.54406 0.36249 -1.501 0.13338
## prevPAL 0.42085 0.06544 6.431 1.27e-10 ***
## term.units.attemptedCensus 0.15657 0.03111 5.032 4.85e-07 ***
## bot.levelOTHER 0.15345 0.72225 0.212 0.83175
## bot.levelSenior -0.51861 0.16180 -3.205 0.00135 **
## Instructor_0121 -0.57759 0.33028 -1.749 0.08032 .
## Instructor_0122 -0.17901 0.15547 -1.151 0.24956
## Instructor_0135 -1.31701 0.55547 -2.371 0.01774 *
## prevPASS -0.24216 0.12016 -2.015 0.04386 *
## stem.flgSTEM 0.33783 0.19037 1.775 0.07596 .
## acad.standOTHER -0.59611 0.43762 -1.362 0.17315
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1294.6 on 1047 degrees of freedom
## Residual deviance: 1169.2 on 1028 degrees of freedom
## AIC: 1209.2
##
## Number of Fisher Scoring iterations: 5
bio121.final = bio121.with.prq2
bio121.final$propensity = bio121.first.order.prop.model$fitted.values
Seven covariates were found to have a statistically significant relationship to likelihood of enrolling in PAL. Variables related to increased likelihood of enrolling were: enrollment in PAL in the past, OTHER ethnicity (combined ethnicities of Foreign, Native American, and Pacific Islander), and being a STEM major. Higher term units attempted was related to a higher likelihood of enrolling in PAL whereas lower grades in Bio 184 (the prereq) were related to higher probability of PAL enrollment. Students previously enrolled in PASS and seniors were less likely to enroll in PAL. Instructor was also related to probability of PAL enrollment.
match.bio121 = Match(Y=bio121.final$grd.pt.unt, Tr=bio121.final$palN, X=bio121.final$propensity, BiasAdjust = F, estimand = "ATT", M=1, caliper=0.25)
Check to see if covariates are sufficiently balanced bettween the non-PAL and PAL groups. If the standardized difference between the non-PAL and PAL means is less than 0.10 than the strong criteria in Liete is met for covariate balance. If the standardized difference is under 0.25, than a more lenient criteria is met. The highest absolute value standardized mean differences are 0.1276 and 0.0715 so balance appears to be adequate.
bio121.bal = bal.tab(match.bio121, bio121.step.first.order$formula, data=bio121.final, un=TRUE, quick=FALSE)
bio121.bal
## Balance Measures
## Type Diff.Un Diff.Adj
## grd.pt.unt.bio184 Contin. -0.0564 -0.0715
## cum.percent.units.passed Contin. -0.0018 -0.1276
## gender_Male Binary -0.0705 0.0323
## eth.erss_African American Binary 0.0316 -0.0064
## eth.erss_Asian Binary -0.0552 -0.0248
## eth.erss_Hispanic Binary 0.0328 0.0390
## eth.erss_OTHER Binary 0.0233 0.0001
## eth.erss_Two or More Races Binary -0.0370 0.0113
## eth.erss_Unknown Binary -0.0039 -0.0043
## eth.erss_White Binary 0.0084 -0.0149
## prevPAL Contin. 0.4295 -0.0101
## term.units.attemptedCensus Contin. 0.3907 -0.0420
## bot.level_Junior Binary 0.0895 0.0151
## bot.level_OTHER Binary 0.0055 -0.0004
## bot.level_Senior Binary -0.0950 -0.0146
## Instructor_01_15 Binary 0.0782 -0.0358
## Instructor_01_21 Binary -0.0067 -0.0054
## Instructor_01_22 Binary -0.0274 0.0383
## Instructor_01_35 Binary -0.0442 0.0028
## prevPASS Contin. 0.0755 -0.0372
## stem.flg_STEM Binary 0.0590 0.0128
## acad.stand_OTHER Binary -0.0235 -0.0118
##
## Sample sizes
## Control Treated
## All 725. 323
## Matched (ESS) 173.43 322
## Matched (Unweighted) 370. 322
## Unmatched 355. 0
## Discarded 0. 1
love.plot(bio121.bal, binary = "raw", stars = "std", var.order = "unadjusted",
thresholds = c(m = .1), abs = F)
The estimated increase in the mean grade of students in PAL over those not in PAL after correcting for self-selection biases is 0.38. This result is statistically significant with a P-value of 0.00009 and is based on 322 PAL students and 370 matched non-PAL students.
est.effect=summary(match.bio121)
##
## Estimate... 0.38278
## AI SE...... 0.10232
## T-stat..... 3.741
## p.val...... 0.0001833
##
## Original number of observations.............. 1048
## Original number of treated obs............... 323
## Matched number of observations............... 322
## Matched number of observations (unweighted). 574
##
## Caliper (SDs)........................................ 0.25
## Number of obs dropped by 'exact' or 'caliper' 1
indx.trt = match.bio121$index.treated
indx.cntr = match.bio121$index.control
wts = match.bio121$weights
mean.trt=sum(bio121.final[indx.trt, 'grd.pt.unt']*wts)/sum(wts)
mean.cntrl=sum(bio121.final[indx.cntr, 'grd.pt.unt']*wts)/sum(wts)
diff.means = match.bio121$est
std.error = match.bio121$se
num.uniq.contrl.matched = length(unique(match.bio121$index.control))
num.trt = match.bio121$wnobs
approx.p.val = (1-pnorm(diff.means/std.error)) #right-tail
results = data.frame(course=c("Bio 121"), mean.gpa.PAL=round(mean.trt,2), mean.gpa.nonPAL=round(mean.cntrl,2), difference.mean.gpa=round(diff.means,2), standard.error=round(std.error,2),p.val=round(approx.p.val,4),number.control=num.uniq.contrl.matched, number.treated=num.trt)
kable(results, caption = "Propensity-scored Adjusted Comparison of PAL and non-PAL Grades")
course | mean.gpa.PAL | mean.gpa.nonPAL | difference.mean.gpa | standard.error | p.val | number.control | number.treated |
---|---|---|---|---|---|---|---|
Bio 121 | 2.44 | 2.06 | 0.38 | 0.1 | 1e-04 | 370 | 322 |
psens(match.bio121, Gamma=2, GammaInc = 0.1)
##
## Rosenbaum Sensitivity Test for Wilcoxon Signed Rank P-Value
##
## Unconfounded estimate .... 0
##
## Gamma Lower bound Upper bound
## 1.0 0 0.0000
## 1.1 0 0.0003
## 1.2 0 0.0054
## 1.3 0 0.0385
## 1.4 0 0.1477
## 1.5 0 0.3535
## 1.6 0 0.5990
## 1.7 0 0.7994
## 1.8 0 0.9186
## 1.9 0 0.9727
## 2.0 0 0.9923
##
## Note: Gamma is Odds of Differential Assignment To
## Treatment Due to Unobserved Factors
##
Note that in the above table \(\Gamma=1.4\) in the first column is the first row where 0.05 is between the Lower and Upper bounds. This means that an unknown confounder which increases the odds of being in PAL by more than 1.4 is enough to change the treatment effect from significant to non-significant. The next code block generates the effect on the odds ratio of each variable in the propensity score. Thus, if there is an unknown confounder that has an effect on the propensity score similar to stem.flg (STEM major or not), the PAL effect would become non-significant. Thus, this finding is sensitive to unknown confounders. It is possible a variable like the number of hours per week a student works which is not in our dataset is a confounder which could reverse the statistical significance of this analysis.
sort(exp(abs(bio121.first.order.prop.model$coefficients)))
## eth.erssOTHER bot.levelOTHER
## 1.097822 1.165844
## term.units.attemptedCensus Instructor_0122
## 1.169494 1.196028
## genderMale grd.pt.unt.bio184
## 1.202468 1.209351
## prevPASS stem.flgSTEM
## 1.274004 1.401901
## prevPAL eth.erssUnknown
## 1.523254 1.565261
## bot.levelSenior eth.erssWhite
## 1.679694 1.722992
## Instructor_0121 eth.erssHispanic
## 1.781744 1.811324
## acad.standOTHER cum.percent.units.passed
## 1.815043 1.892003
## eth.erssAsian eth.erssTwo or More Races
## 2.124415 3.123921
## (Intercept) Instructor_0135
## 3.383485 3.732237