Executive Summary
Detailed Summary
Add R packages
Source a script
Import the Data
Subset Data for Biology
Compare the mean gpa for PAL and non-PAL students for Biology classes without Propensity Score Adjustment
Data Cleaning and Feature Engineering
Extract Bio 121 Data
Extract data for Prerequisite Course Bio 184
Analyze missingness
Remove single-valued variables
Identify variables causing complete separation in logistic regression
Filter to relevant variables
Check again for Complete Separation
Subset on Complete Cases only in Bio 121 Data
Build a Logistic Regression Model for Propensity Score
Propensity Score Matching
Balance check
Estimate Difference Between Mean grade in Bio 121 of PAL and non-PAL students
Sensitivity Analysis

Executive Summary

Based on data for 390 PAL students and 888 non-PAL students, the unadjusted, raw difference in average grade for PAL and non-PAL students was 0.21 on a A=4.0 grade scale. However, since students self-select into supplemental PAL instruction, it is possible that the resulting PAL and non-PAL groups were not balanced with respect to other characteristics which could impact course grade. For example, if students with better study habits tend to enroll in PAL, all else being equal, the PAL mean grade would be higher than non-PAL-- even if PAL had no effect on course grade. Consequently, we also performed a propensity score analysis to adjust the estimated effect of PAL on course grade for potential self-selection biases.

After adjusting for self-selection bias, we found that PAL students earned an average grade \(0.38 \pm 0.10\) higher than non-PAL students. A sensitivity analysis indicates that this analysis is moderately sensitive to unknown confounders. Although the data give us sufficient evidence to conclude that PAL increases students' grades in Bio 121, the existence of an unknown confounder similar in magnitude to stem.flg (which indicates if the student's major is STEM or not) would nullify that conclusion.

Detailed Summary

A propensity score analysis was conducted to assess the effect of PAL supplemental instruction on Bio 121 course grade. Propensity score adjustment was necessary since the data are observational and the characteristics of students who voluntarily enroll in PAL may differ in ways that may, independently of PAL, impact course grade compared to students who do not enroll in PAL. In propensity score analysis, variables related to both likelihood of PAL enrollment and course grade (confounders) are used in a logistic regression model to obtain a propensity score, which is a student's likelihood of enrolling in PAL.

For Bio 121, seven covariates were found to have a statistically significant relationship to likelihood of enrolling in PAL. Variables related to increased likelihood of enrolling were: enrollment in PAL in the past, OTHER ethnicity (combined ethnicities of Foreign, Native American, and Pacific Islander), and being a STEM major. Higher term units attempted was related to a higher likelihood of enrolling in PAL whereas lower grades in Bio 184 (the prereq) were related to higher probability of PAL enrollment. Students previously enrolled in PASS and seniors were less likely to enroll in PAL. Instructor was also related to probability of PAL enrollment.

Using the propensity score model, all students in the dataset, PAL and non-PAL, are assigned a propensity score. Then, each PAL student is matched to one or more non-PAL students who have similar propensity score(s). After matching, the PAL and matched non-PAL groups are compared to determine if the distribution of each covariate is similar between the two groups. This is called a balance check. If the standardized difference between the non-PAL and PAL means is less than 0.10 than the strong criteria in (Liete 2017, p.10) is met for covariate balance. If the standardized difference is under 0.25, than a more lenient criteria is met. The highest absolute value standardized mean differences in this analysis are 0.1276 and 0.0715. Consequently, adequate balance appears to have been achieved.

The difference in the average grade for the matched PAL and non-PAL data is then calculated. The estimated increase in the mean grade of students in PAL over those not in PAL after correcting for self-selection biases is \(0.38 \pm 0.10\) or between 0.28 and 0.48 on a 4.0 grade scale. This result is statistically significant with a P-value of 0.00009 and is based on 322 PAL students and 370 matched non-PAL students. For comparison, the non-propensity score adjusted difference in average grade for PAL and non-PAL students was 0.21.

The estimated PAL effect is based on the assumption that the propensity model includes all potential confounders for PAL enrollment and grade in Bio 121. However, it is possible that unknown confounders exist. A sensitivity analysis was conducted to determine how strong an unknown confounder must be to nullify the statistically significant PAL effect that was found in this analysis. The sensitivity analysis (Rosenbaum, 2002) indicated that an unknown confounder which increases the odds of being in PAL by more than 1.4 is enough to change the treatment effect from significant to non-significant. Inspection of the covariates in the estimated propensity model for Bio 121 indicates that if there is an unknown confounder that has an effect on the propensity score similar to the effect of stem.flg (indicating whether or not the student is a STEM major) observed in this analysis, the PAL effect would become non-significant. Thus, this finding is sensitive to unknown confounders. It is possible a variable like the number of hours per week a student works (which is not in our dataset) is an unknown confounder which could reverse the statistical significance of this analysis.

Additionally, a number of variables such as SAT scores (46% missing) and high school gpa (41% missing) were removed from this analysis due to large amounts of missingness. Since all students who had missing information on any included covariate were eliminated from the analysis, a balance had to be struck between retaining a sufficiently large pool of PAL and non-PAL students and retaining a sufficient number of important covariates. Variables which were eliminated from this analysis had substantial missing data or were subjectively judged as unlikely to be confounding. The choices about which variables to retain resulted in the original pool of 390 PAL students in Bio 121 being reduced to 322. Also, 370 matched non-PAL students were selected out of 888 original non-PAL students. When a PAL student had more than one suitable match among the non-PAL students, all non-PAL students were taken as matches and weighted appropriately in the final estimated PAL effect.

Add R packages

First Add Some R Packages to the Workspace Caution: warning messages are suppressed to reduce clutter in the output.

defaultW <- getOption("warn") 
options(warn = -1) 

library(dplyr)
library(MatchIt)
library(Matching)
library(DataExplorer)
library(cobalt)
library(backports)  # to get the deparse1() function for cobalt
library(rbounds)
library(tidyr)

options(warn = defaultW)

Source a script

The script contains a function I wrote called id.vars.zero.freq to identify variables causing complete separation in logistic regression.

source("IdentifyCompleteSepVars2.R")

Import the Data

Make sure the PAL datafile is in the same directory as this RMarkdown file. The files which includes data through the Spring 2019 semester has 1,099,371 rows and 174 columns. The total of the grd.pt.unt column is 2,237,555.

load("~/PAL data analysis/PALdatafull.Rdata")
dim(PALdatafull)

## [1] 1099371     174

sum(PALdatafull$grd.pt.unt)

## [1] 2237555

Subset Data for Biology

The full file is quite large so reducing down to bio classes makes the code run faster. Each instance of a course taken by a student is included in the data file for the semester during which it was taken and every semester thereafter. 'test1' below filters so the dataset only includes the semester during which the course was taken. The biology dataset should contain 19,652 rows and 174 columns.

bio.classes <- paste("BIO", c(22, 121, 131, 139, 184))

test1 <- PALdatafull$base.time.course==1
test.bio <- PALdatafull$course %in% bio.classes
bio.indx <- test1 & test.bio

bio.dat <- PALdatafull[bio.indx, ]
dim(bio.dat)

## [1] 19652   174

Compare the mean gpa for PAL and non-PAL students for Biology classes without Propensity Score Adjustment

The course.seq variables indicates how many times a student has taken a course prior to the current attempt. To filter on the first attempt at a course, we set course.seq to 0.

raw.table = data.frame(class=character(),
                       nonPALavg=numeric(),
                       PALavg=numeric(), 
                       Diff=numeric(), 
                       NonPAL_Num= integer(),
                       PAL_Num=integer(),
                       CompletePAL=numeric(),
                       TermPALStart=integer(),
                       row.names=NULL,
                       stringsAsFactors = FALSE)

for (i in 1:length(bio.classes))
    {
      curr.class = bio.classes[i]
      temp = subset(bio.dat, course==curr.class & course.seq==0)
      pal.start=min(unique(temp$term.code[temp$palN==2]))
      # only include terms after PAL start term
      temp = subset(temp, term.code>= pal.start)
      
      x=tapply(temp$grd.pt.unt,temp$palN, 
               mean, na.rm=T) %>% 
        as.numeric %>% 
        round(2)
      
      y=table(temp$palN) %>% as.numeric
      
      raw.table[i, 'class' ] = curr.class
      raw.table[i, c(2:4,7)]=c(x[1], x[3],x[3]-x[1],
                        round(y[3]/sum(y),2))
      raw.table[i, c(5,6,8)]= c(y[1], y[3], pal.start)
              
    }
# formatted table
library(knitr)

## Warning: package 'knitr' was built under R version 3.6.3

kable(raw.table, caption = "Raw Comparison of PAL and non-PAL Grades (No Propensity Adjustment)")

Raw Comparison of PAL and non-PAL Grades (No Propensity Adjustment)
class	nonPALavg	PALavg	Diff	NonPAL_Num	PAL_Num	CompletePAL	TermPALStart
BIO 22	1.81	2.33	0.52	949	413	0.29	2158
BIO 121	2.18	2.39	0.21	888	390	0.30	2148
BIO 131	2.33	2.76	0.43	1516	544	0.26	2148
BIO 139	2.41	2.61	0.20	308	98	0.24	2178
BIO 184	2.51	2.73	0.22	509	149	0.22	2178

Data Cleaning and Feature Engineering

Create new variables: delay since high school and cumulative percent of units passed. Collapse sparse categories and other miscellaneous clean up of data. Sparse categories can cause complete separation in logistic regression and are only predictive for a few students.

yr.course.taken = as.numeric(gsub(".*([0-9]{4})","\\1",bio.dat$coh.term))

bio.dat$delay.from.hs = ifelse(!is.na(yr.course.taken) & !is.na(bio.dat$hs.grad.yr), yr.course.taken-bio.dat$hs.grad.yr, NA)

sum(is.na(bio.dat$delay.from.hs))  #1864 missing values

## [1] 1864

# remove 106 students who did not complete PAL
bio.dat=subset(bio.dat, palN!=1)

#recode palN to factor with 0/1 levels
bio.dat$palN = ifelse(bio.dat$palN==2, 1, 0)

#clean up category names in m.rmd and e.rmd
bio.dat$m.rmd[bio.dat$m.rmd=="Not Remedial\nin Math"]="Not Remedial in Math"

bio.dat$m.rmd[bio.dat$m.rmd=="Remedial\nin Math"]="Remedial in Math"

bio.dat$m.rmd = droplevels(bio.dat$m.rmd)

bio.dat$e.rmd[bio.dat$e.rmd=="Not Remedial\nin English"]="Not Remedial in English"

bio.dat$e.rmd[bio.dat$e.rmd=="Remedial\nin English"]="Remedial in English"

bio.dat$e.rmd = droplevels(bio.dat$e.rmd)
table(bio.dat$e.rmd)

## 
## Not Remedial in English     Remedial in English 
##                   15474                    3704

# create feature, proportion of cumulative units taken that were passes
bio.dat$cum.percent.units.passed =  bio.dat$tot.passd.prgrss.start/bio.dat$tot.taken.prgrss.start

# collapse sparse categories
bio.dat=group_category(data = bio.dat, feature = "load.code", threshold = 0.05, update = TRUE)

table(bio.dat$load.code, PAL=bio.dat$palN, droplevels(bio.dat$course))

## , ,  = BIO 121
## 
##        PAL
##            0    1
##   F     2104  383
##   H      428   13
##   OTHER  308   61
## 
## , ,  = BIO 131
## 
##        PAL
##            0    1
##   F     3127  491
##   H      486   19
##   OTHER  462   68
## 
## , ,  = BIO 139
## 
##        PAL
##            0    1
##   F     2315   87
##   H      402    1
##   OTHER  340   15
## 
## , ,  = BIO 184
## 
##        PAL
##            0    1
##   F     2751  144
##   H      417    5
##   OTHER  339    6
## 
## , ,  = BIO 22
## 
##        PAL
##            0    1
##   F     3332  430
##   H      496    9
##   OTHER  471   36

bio.dat=group_category(data = bio.dat, feature = "addr.type", threshold = 0.05, update = TRUE)
table(bio.dat$addr.type, PAL=bio.dat$palN, droplevels(bio.dat$course))

## , ,  = BIO 121
## 
##        PAL
##            0    1
##   OTHER 1263  101
##   PERM  1577  356
## 
## , ,  = BIO 131
## 
##        PAL
##            0    1
##   OTHER 1608  103
##   PERM  2467  475
## 
## , ,  = BIO 139
## 
##        PAL
##            0    1
##   OTHER 1530   19
##   PERM  1527   84
## 
## , ,  = BIO 184
## 
##        PAL
##            0    1
##   OTHER 1411   20
##   PERM  2096  135
## 
## , ,  = BIO 22
## 
##        PAL
##            0    1
##   OTHER 1627   70
##   PERM  2672  405

# code instructor as alphabetic letter for anonymity
bio.dat$Instructor_01=droplevels(factor(bio.dat$Instructor_01))

instructor.vec = sort(unique(bio.dat$Instructor_01))
num.instr = length(instructor.vec)

bio.dat$Instructor_01 = factor(bio.dat$Instructor_01, levels = instructor.vec, labels=as.character(1:num.instr))

key.instr.code = cbind(as.character(instructor.vec),
                       1:num.instr)

Extract Bio 121 Data

Collapse 'censusMajor' variable separately for each course since the amount of collapsing necessary will vary by course.

bio121.dat <- subset(bio.dat, course=="BIO 121" & term.code>2143)

bio121.dat=group_category(data = bio121.dat, feature = "censusMajor", threshold = 0.05, update = TRUE)

table(bio121.dat$censusMajor, PAL=bio121.dat$palN)

##                                                 PAL
##                                                    0   1
##   Biology                                        765 321
##   Chemistry                                       86  27
##   Clinical Science/Biomedical Laboratory Science  90  44
##   Microbiology                                    47  26
##   Molecular Biology                               43  20
##   OTHER                                           90  19

Extract data for Prerequisite Course Bio 184

For Bio 184, the grade for the student's first attempt and the number of times it was taken are added to the Bio 121 dataset. However, few students retook Bio 184 so there was inadequate balance on number of times Bio 184 was taken, and it was removed from the propensity model after balance checks.

bio184.dat <- subset(bio.dat, course=="BIO 184")
bio184.dat$emplid <- as.factor(bio184.dat$emplid)
bio184.dat$term.code <- as.numeric(bio184.dat$term.code)

bio184.emplid.and.grade = bio184.dat[, c("emplid", "term.code","grd.pt.unt")]

#  Extract grade on first attempt and number of times taken for Bio 184
bio184.summary = bio184.dat %>% group_by(emplid) %>% summarise_at("term.code", list(num.times.taken.bio184=length,first.term.taken.bio184=min))

#  Add grade in Bio 184 to the dataframe for Bio 184
first.term.bio184.grd <- merge(bio184.summary, bio184.emplid.and.grade, by.x=c("emplid","first.term.taken.bio184"),by.y=c("emplid", "term.code"))

names(first.term.bio184.grd)[4] <- "grd.pt.unt.bio184"

#  extract first attempt at Bio 121
first.term.bio121.grd <- subset(bio121.dat,
                                course.seq==0)

# Combine data on Bio 184 and Bio 121
first.grd.bio184.and.bio121 <- merge(first.term.bio184.grd, first.term.bio121.grd, by=c("emplid"))

Analyze missingness

Remove variables having too many missing values in order to retain a larger pool of PAL and non-PAL students.

plot_missing(first.grd.bio184.and.bio121, missing_only = TRUE)

zero_missing_vars <- first.grd.bio184.and.bio121 %>% 
  summarise(across( ,list(function(x) sum(is.na(x))))) %>%
  gather() %>%
  arrange(value)

zero_missing_vars <- zero_missing_vars %>%
  filter(zero_missing_vars$value != 0)

zero_missing_vars$var.name = substr(zero_missing_vars$key,1,nchar(zero_missing_vars$key)-2)

zero_missing_vars <- first.grd.bio184.and.bio121 %>%
  dplyr::select(zero_missing_vars$var.name)

missing_vars <- profile_missing(zero_missing_vars)
missing_vars

##                         feature num_missing  pct_missing
## 1                        gender           1 0.0008764242
## 2             hous.coh.term.flg           2 0.0017528484
## 3                         m.rmd           2 0.0017528484
## 4                  m.rmd.detail           2 0.0017528484
## 5                         e.rmd           2 0.0017528484
## 6                  e.rmd.detail           2 0.0017528484
## 7                       rmd.flg           2 0.0017528484
## 8                  rmd.subj.flg           2 0.0017528484
## 9                      adm.area           2 0.0017528484
## 10                      veteran           2 0.0017528484
## 11                 hs.eng.units           2 0.0017528484
## 12                hs.math.units           2 0.0017528484
## 13                hs.phys.units           2 0.0017528484
## 14                 hs.bio.units           2 0.0017528484
## 15                 hs.lab.units           2 0.0017528484
## 16             hs.forLang.units           2 0.0017528484
## 17             hs.socstud.units           2 0.0017528484
## 18            hs.perfarts.units           2 0.0017528484
## 19            hs.elective.units           2 0.0017528484
## 20  ge.critical.thinking.status           2 0.0017528484
## 21       ge.english.comp.status           2 0.0017528484
## 22               ge.math.status           2 0.0017528484
## 23          ge.oral.comm.status           2 0.0017528484
## 24                        state           2 0.0017528484
## 25                   career.num           3 0.0026292726
## 26                    bot.level           3 0.0026292726
## 27                enroll.status           3 0.0026292726
## 28                     term.gpa           3 0.0026292726
## 29                higher.ed.gpa           3 0.0026292726
## 30                     csus.gpa           3 0.0026292726
## 31                withdraw_code           3 0.0026292726
## 32             unt_taken_prgrss           3 0.0026292726
## 33             unt_passd_prgrss           3 0.0026292726
## 34               tot_cumulative           3 0.0026292726
## 35           cur_resident_terms           3 0.0026292726
## 36           cum_resident_terms           3 0.0026292726
## 37                    acad.plan           3 0.0026292726
## 38                   campus.gpa           4 0.0035056968
## 39                    total.gpa           4 0.0035056968
## 40   term.units.attemptedCensus           4 0.0035056968
## 41                   acad.stand           5 0.0043821209
## 42                          zip           7 0.0061349693
## 43                     pass.flg           9 0.0078878177
## 44                    pass.flgN           9 0.0078878177
## 45                          coh          11 0.0096406661
## 46                     coh.term          11 0.0096406661
## 47                base.time.coh          11 0.0096406661
## 48                     enrl.flg          11 0.0096406661
## 49                  enrl.flgERS          11 0.0096406661
## 50                      pst.flg          11 0.0096406661
## 51                   pst.flgERS          11 0.0096406661
## 52                     grad.flg          11 0.0096406661
## 53                  grad.flgERS          11 0.0096406661
## 54                      rtn.flg          11 0.0096406661
## 55                   rtn.flgERS          11 0.0096406661
## 56                         ttgF          11 0.0096406661
## 57                     maxUnits          11 0.0096406661
## 58         tot_cumulative.start          11 0.0096406661
## 59               stem.final.flg          11 0.0096406661
## 60                   admis.plan          12 0.0105170903
## 61                   hs.grad.yr          12 0.0105170903
## 62                delay.from.hs          12 0.0105170903
## 63            pell.coh.term.flg          13 0.0113935145
## 64                stem.base.flg          14 0.0122699387
## 65                     eth.erss          26 0.0227870289
## 66                   urm.flgERS          26 0.0227870289
## 67                  m.rmd.admin          32 0.0280455741
## 68           m.rmd.admin.detail          32 0.0280455741
## 69                    mother.ed          44 0.0385626643
## 70       tot.passd.prgrss.start          48 0.0420683611
## 71       tot.taken.prgrss.start          48 0.0420683611
## 72     cum.percent.units.passed          48 0.0420683611
## 73                    father.ed          82 0.0718667835
## 74                   admit.term         122 0.1069237511
## 75                 plan.college         219 0.1919368975
## 76            plan.college.desc         219 0.1919368975
## 77                    plan.dept         219 0.1919368975
## 78                plan.deptAbbr         219 0.1919368975
## 79                  plan.degree         219 0.1919368975
## 80                    plan.type         219 0.1919368975
## 81                    deg.plan1         407 0.3567046450
## 82                    grad.term         407 0.3567046450
## 83                          ttg         407 0.3567046450
## 84                 grad.termERS         430 0.3768624014
## 85                       hs.gpa         466 0.4084136722
## 86               sat.math.score         527 0.4618755478
## 87                 sat.math.flg         527 0.4618755478
## 88             sat.verbal.score         527 0.4618755478
## 89               sat.verbal.flg         527 0.4618755478
## 90                sat.test.date         527 0.4618755478
## 91                   trf.gpaADM         609 0.5337423313
## 92                treat.section         729 0.6389132340
## 93                fys.term.code         847 0.7423312883
## 94                      fys.grd         847 0.7423312883
## 95                  fys.rpt.flg         847 0.7423312883
## 96                  pledge.term        1051 0.9211218230
## 97                    deg.plan2        1120 0.9815950920
## 98              withdraw_reason        1140 0.9991235758
## 99                Instructor_02        1141 1.0000000000
## 100                   deg.plan3        1141 1.0000000000
## 101                   deg.plan4        1141 1.0000000000
## 102                   deg.plan5        1141 1.0000000000
## 103                   deg.plan6        1141 1.0000000000

sum(missing_vars$pct_missing>0.05)  # remove 31 variables with >5% missing values

## [1] 31

vars.missing.data = as.character(missing_vars[missing_vars$pct_missing>0.05, "feature"])

dim(first.grd.bio184.and.bio121)

## [1] 1141  179

bio121.with.prq = first.grd.bio184.and.bio121[,!(names(first.grd.bio184.and.bio121) %in% vars.missing.data)]
dim(bio121.with.prq)

## [1] 1141  148

Remove single-valued variables

Some variables may be single-values and are, thus, unrelated to propensity score. There were 23 single-valued variables in this dataset which were removed.

single.category.vars = sapply(bio121.with.prq, function(x) length(unique(x))==1)

sum(single.category.vars)

## [1] 13

names(bio121.with.prq)[single.category.vars]

##  [1] "AP_PHYSC"         "AP_PHYSC.flg"     "course"           "component"       
##  [5] "units"            "course.numeric"   "div"              "course.seq"      
##  [9] "rpt.flg"          "c2s"              "base.time.course" "years"           
## [13] "pass.term.flg"

# remove the 13 variables just listed
keep.vars = names(which(single.category.vars==FALSE))
bio121.with.prq2 = bio121.with.prq[, keep.vars]
dim(bio121.with.prq2)

## [1] 1141  135

Identify variables causing complete separation in logistic regression

These are typically variables with very small frequencies in some categories such as 'veteran' and 'acad.stand'. The categories with small frequencies are usually all PAL or all nonPAL. This causes what is known as a complete separation error in estimating the logistic regression model for propensity scoring. The low frequencies in some categories don't really mean that all hypothetical students in these low-frequency categories have the same PAL enrollment behavior. The all or nothing PAL enrollment behavior is more likely an artifact of the small number of people in the category. Consequently, these sparse categories will be combined with other categories to avoid complete separation or the related variable will be discarded if it is unlikely to be a confounder.

sep.cols = id.vars.zero.freq(bio121.with.prq2, treat.col= which(names(bio121.with.prq2)=="palN"))

sum(sep.cols)

## [1] 31

names(bio121.with.prq2)[sep.cols]

##  [1] "emplid"             "coh.term"           "eth.erss"          
##  [4] "m.rmd.detail"       "m.rmd.admin.detail" "rmd.subj.flg"      
##  [7] "veteran"            "AP_PHYS.flg"        "AP_PHYS2.flg"      
## [10] "AP_PHYSM.flg"       "state"              "career.course"     
## [13] "acad.prog.course"   "reason"             "grade"             
## [16] "grade.base"         "pal"                "bot.level"         
## [19] "enroll.status"      "withdraw_code"      "enrl.flg"          
## [22] "enrl.flgERS"        "pst.flg"            "pst.flgERS"        
## [25] "rtn.flg"            "rtn.flgERS"         "ttgF"              
## [28] "acad.stand"         "admis.plan"         "acad.plan"         
## [31] "treat.flg"

# censusMajor is similar to acad.plan 

# collapse eth.erss categories, remove others causing complete separation
bio121.with.prq2 = group_category(data = bio121.with.prq2, feature = "eth.erss", threshold = 0.05, update = TRUE)

Filter to relevant variables

Subjective judgment was used to narrow the pool of variables down to those likely to be confounders. It's important to include all variables correlated with outcome even if it is uncertain whether they are related to likeihood of enrolling in PAL. This allows for a more precise estimate of the treatment effect.

vars.to.keep = c("grd.pt.unt","palN","coh","eth.erss","fys.flg","pell.coh.term.flg","hous.coh.term.flg","mother.ed","father.ed","m.rmd","e.rmd","adm.area","ge.math.status","addr.type","median.income","pct.female.head","reason","Instructor_01","pass","bot.level","censusMajor","term.units.attemptedCensus","base.time.coh","higher.ed.gpa.start","acad.stand","plan.type","course.age","stem.flg","prevPASS","grd.pt.unt.bio184","num.times.taken.bio184","cum.percent.units.passed","gender", "load.code","prevPAL")


# father.ed and plan.type dropped earlier due to missingness
new.vars = intersect(vars.to.keep, names(bio121.with.prq2))

dim(bio121.with.prq2)

## [1] 1141  135

bio121.with.prq2 = bio121.with.prq2[ ,new.vars]
dim(bio121.with.prq2)

## [1] 1141   33

Check again for Complete Separation

sep.cols = id.vars.zero.freq(bio121.with.prq2, treat.col= which(names(bio121.with.prq2)=="palN"))
sep.cols

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

sum(sep.cols)

## [1] 3

names(bio121.with.prq2)[sep.cols]

## [1] "reason"     "bot.level"  "acad.stand"

bio121.with.prq2=group_category(data = bio121.with.prq2, feature = "acad.stand", threshold = 0.03, update = TRUE)
with(bio121.with.prq2, table(acad.stand, palN))

##                palN
## acad.stand        0   1
##   Good Standing 744 341
##   OTHER          42  14

bio121.with.prq2=group_category(data = bio121.with.prq2, feature = "reason", threshold = 0.01, update = TRUE)
with(bio121.with.prq2, table(reason, palN))

##        palN
## reason    0   1
##   ENRL  740 324
##   OTHER  46  31

bio121.with.prq2=group_category(data = bio121.with.prq2, feature = "bot.level", threshold = 0.01, update = TRUE)
with(bio121.with.prq2, table(bot.level, palN))

##          palN
## bot.level   0   1
##    Junior 208 125
##    OTHER    8   5
##    Senior 570 225

Subset on Complete Cases only in Bio 121 Data

dim(bio121.with.prq2)

## [1] 1141   33

bio121.with.prq2 = bio121.with.prq2[complete.cases(bio121.with.prq2), ]
dim(bio121.with.prq2)

## [1] 1048   33

Build a Logistic Regression Model for Propensity Score

Subjectively identified four potential confounders to force the model to retain: grd.pt.unt.bio184, cum.percent.units.passed, gender, and eth.erss. Stepwise variable selection will be used to select which of the variables currently in the PAL dataset to include in the propensity model.

min.model = glm(palN ~grd.pt.unt.bio184+cum.percent.units.passed+gender+eth.erss, data=bio121.with.prq2, family=binomial)

summary(min.model)

## 
## Call:
## glm(formula = palN ~ grd.pt.unt.bio184 + cum.percent.units.passed + 
##     gender + eth.erss, family = binomial, data = bio121.with.prq2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1845  -0.8788  -0.7837   1.4132   1.9607  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)   
## (Intercept)               -0.38171    0.92223  -0.414  0.67895   
## grd.pt.unt.bio184         -0.08099    0.08689  -0.932  0.35130   
## cum.percent.units.passed   0.64034    1.05076   0.609  0.54225   
## genderMale                -0.28729    0.14058  -2.044  0.04100 * 
## eth.erssAsian             -0.85497    0.33310  -2.567  0.01027 * 
## eth.erssHispanic          -0.55330    0.33871  -1.634  0.10235   
## eth.erssOTHER             -0.21235    0.42740  -0.497  0.61931   
## eth.erssTwo or More Races -1.36932    0.43754  -3.130  0.00175 **
## eth.erssUnknown           -0.77099    0.45399  -1.698  0.08946 . 
## eth.erssWhite             -0.66942    0.33897  -1.975  0.04828 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1294.6  on 1047  degrees of freedom
## Residual deviance: 1273.3  on 1038  degrees of freedom
## AIC: 1293.3
## 
## Number of Fisher Scoring iterations: 4

biggest = formula(glm(palN ~.-grd.pt.unt, data=bio121.with.prq2, family=binomial))

bio121.step.first.order = step(min.model, 
               direction="forward",scope = biggest, 
               trace=FALSE, k=2) 

summary(bio121.step.first.order)

## 
## Call:
## glm(formula = palN ~ grd.pt.unt.bio184 + cum.percent.units.passed + 
##     gender + eth.erss + prevPAL + term.units.attemptedCensus + 
##     bot.level + Instructor_01 + prevPASS + stem.flg + acad.stand, 
##     family = binomial, data = bio121.with.prq2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8431  -0.8490  -0.6283   1.1157   2.3530  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.21891    1.08511  -1.123  0.26131    
## grd.pt.unt.bio184          -0.19008    0.09585  -1.983  0.04736 *  
## cum.percent.units.passed   -0.63764    1.15057  -0.554  0.57945    
## genderMale                 -0.18438    0.14896  -1.238  0.21579    
## eth.erssAsian              -0.75350    0.35591  -2.117  0.03425 *  
## eth.erssHispanic           -0.59406    0.36175  -1.642  0.10055    
## eth.erssOTHER               0.09333    0.46233   0.202  0.84002    
## eth.erssTwo or More Races  -1.13909    0.46599  -2.444  0.01451 *  
## eth.erssUnknown            -0.44805    0.48091  -0.932  0.35150    
## eth.erssWhite              -0.54406    0.36249  -1.501  0.13338    
## prevPAL                     0.42085    0.06544   6.431 1.27e-10 ***
## term.units.attemptedCensus  0.15657    0.03111   5.032 4.85e-07 ***
## bot.levelOTHER              0.15345    0.72225   0.212  0.83175    
## bot.levelSenior            -0.51861    0.16180  -3.205  0.00135 ** 
## Instructor_0121            -0.57759    0.33028  -1.749  0.08032 .  
## Instructor_0122            -0.17901    0.15547  -1.151  0.24956    
## Instructor_0135            -1.31701    0.55547  -2.371  0.01774 *  
## prevPASS                   -0.24216    0.12016  -2.015  0.04386 *  
## stem.flgSTEM                0.33783    0.19037   1.775  0.07596 .  
## acad.standOTHER            -0.59611    0.43762  -1.362  0.17315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1294.6  on 1047  degrees of freedom
## Residual deviance: 1169.2  on 1028  degrees of freedom
## AIC: 1209.2
## 
## Number of Fisher Scoring iterations: 5

bio121.step.first.order$anova

##                           Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                              NA        NA      1038   1273.335 1293.335
## 2                    + prevPAL -1 46.135395      1037   1227.200 1249.200
## 3 + term.units.attemptedCensus -1 23.837403      1036   1203.362 1227.362
## 4                  + bot.level -2 14.706694      1034   1188.656 1216.656
## 5              + Instructor_01 -3 10.247171      1031   1178.408 1212.408
## 6                   + prevPASS -1  3.997985      1030   1174.410 1210.410
## 7                   + stem.flg -1  3.211949      1029   1171.199 1209.199
## 8                 + acad.stand -1  2.000993      1028   1169.198 1209.198

model.first.order=formula(bio121.step.first.order)

Fit Propensity Score model (linear terms only)

An attempt to add second order terms was made. The problems with complete separation due to categories with low frequencies was compounded for two-way interaction terms. I added some two-way interactions manually then ran a likelihood ratio test to see if the linear model could be rejected in favor of the model with two-way interations. The test did not reject the null hypothesis that the linear model was adequate.

bio121.first.order.prop.model = glm(model.first.order, data=bio121.with.prq2, family=binomial)
summary(bio121.first.order.prop.model)

## 
## Call:
## glm(formula = model.first.order, family = binomial, data = bio121.with.prq2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8431  -0.8490  -0.6283   1.1157   2.3530  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.21891    1.08511  -1.123  0.26131    
## grd.pt.unt.bio184          -0.19008    0.09585  -1.983  0.04736 *  
## cum.percent.units.passed   -0.63764    1.15057  -0.554  0.57945    
## genderMale                 -0.18438    0.14896  -1.238  0.21579    
## eth.erssAsian              -0.75350    0.35591  -2.117  0.03425 *  
## eth.erssHispanic           -0.59406    0.36175  -1.642  0.10055    
## eth.erssOTHER               0.09333    0.46233   0.202  0.84002    
## eth.erssTwo or More Races  -1.13909    0.46599  -2.444  0.01451 *  
## eth.erssUnknown            -0.44805    0.48091  -0.932  0.35150    
## eth.erssWhite              -0.54406    0.36249  -1.501  0.13338    
## prevPAL                     0.42085    0.06544   6.431 1.27e-10 ***
## term.units.attemptedCensus  0.15657    0.03111   5.032 4.85e-07 ***
## bot.levelOTHER              0.15345    0.72225   0.212  0.83175    
## bot.levelSenior            -0.51861    0.16180  -3.205  0.00135 ** 
## Instructor_0121            -0.57759    0.33028  -1.749  0.08032 .  
## Instructor_0122            -0.17901    0.15547  -1.151  0.24956    
## Instructor_0135            -1.31701    0.55547  -2.371  0.01774 *  
## prevPASS                   -0.24216    0.12016  -2.015  0.04386 *  
## stem.flgSTEM                0.33783    0.19037   1.775  0.07596 .  
## acad.standOTHER            -0.59611    0.43762  -1.362  0.17315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1294.6  on 1047  degrees of freedom
## Residual deviance: 1169.2  on 1028  degrees of freedom
## AIC: 1209.2
## 
## Number of Fisher Scoring iterations: 5

bio121.final = bio121.with.prq2
bio121.final$propensity = bio121.first.order.prop.model$fitted.values

Seven covariates were found to have a statistically significant relationship to likelihood of enrolling in PAL. Variables related to increased likelihood of enrolling were: enrollment in PAL in the past, OTHER ethnicity (combined ethnicities of Foreign, Native American, and Pacific Islander), and being a STEM major. Higher term units attempted was related to a higher likelihood of enrolling in PAL whereas lower grades in Bio 184 (the prereq) were related to higher probability of PAL enrollment. Students previously enrolled in PASS and seniors were less likely to enroll in PAL. Instructor was also related to probability of PAL enrollment.

Propensity Score Matching

match.bio121 = Match(Y=bio121.final$grd.pt.unt, Tr=bio121.final$palN, X=bio121.final$propensity, BiasAdjust = F, estimand = "ATT", M=1, caliper=0.25)

Balance Check

Check to see if covariates are sufficiently balanced bettween the non-PAL and PAL groups. If the standardized difference between the non-PAL and PAL means is less than 0.10 than the strong criteria in Liete is met for covariate balance. If the standardized difference is under 0.25, than a more lenient criteria is met. The highest absolute value standardized mean differences are 0.1276 and 0.0715 so balance appears to be adequate.

bio121.bal = bal.tab(match.bio121, bio121.step.first.order$formula, data=bio121.final, un=TRUE, quick=FALSE)

bio121.bal

## Balance Measures
##                               Type Diff.Un Diff.Adj
## grd.pt.unt.bio184          Contin. -0.0564  -0.0715
## cum.percent.units.passed   Contin. -0.0018  -0.1276
## gender_Male                 Binary -0.0705   0.0323
## eth.erss_African American   Binary  0.0316  -0.0064
## eth.erss_Asian              Binary -0.0552  -0.0248
## eth.erss_Hispanic           Binary  0.0328   0.0390
## eth.erss_OTHER              Binary  0.0233   0.0001
## eth.erss_Two or More Races  Binary -0.0370   0.0113
## eth.erss_Unknown            Binary -0.0039  -0.0043
## eth.erss_White              Binary  0.0084  -0.0149
## prevPAL                    Contin.  0.4295  -0.0101
## term.units.attemptedCensus Contin.  0.3907  -0.0420
## bot.level_Junior            Binary  0.0895   0.0151
## bot.level_OTHER             Binary  0.0055  -0.0004
## bot.level_Senior            Binary -0.0950  -0.0146
## Instructor_01_15            Binary  0.0782  -0.0358
## Instructor_01_21            Binary -0.0067  -0.0054
## Instructor_01_22            Binary -0.0274   0.0383
## Instructor_01_35            Binary -0.0442   0.0028
## prevPASS                   Contin.  0.0755  -0.0372
## stem.flg_STEM               Binary  0.0590   0.0128
## acad.stand_OTHER            Binary -0.0235  -0.0118
## 
## Sample sizes
##                      Control Treated
## All                   725.       323
## Matched (ESS)         173.43     322
## Matched (Unweighted)  370.       322
## Unmatched             355.         0
## Discarded               0.         1

love.plot(bio121.bal, binary = "raw", stars = "std", var.order = "unadjusted", 
          thresholds = c(m = .1), abs = F)

Estimate Difference Between Mean grade in Bio 121 of PAL and non-PAL students

The estimated increase in the mean grade of students in PAL over those not in PAL after correcting for self-selection biases is 0.38. This result is statistically significant with a P-value of 0.00009 and is based on 322 PAL students and 370 matched non-PAL students.

est.effect=summary(match.bio121)

## 
## Estimate...  0.38278 
## AI SE......  0.10232 
## T-stat.....  3.741 
## p.val......  0.0001833 
## 
## Original number of observations..............  1048 
## Original number of treated obs...............  323 
## Matched number of observations...............  322 
## Matched number of observations  (unweighted).  574 
## 
## Caliper (SDs)........................................   0.25 
## Number of obs dropped by 'exact' or 'caliper'  1

indx.trt = match.bio121$index.treated
indx.cntr = match.bio121$index.control
wts = match.bio121$weights

mean.trt=sum(bio121.final[indx.trt, 'grd.pt.unt']*wts)/sum(wts)
mean.cntrl=sum(bio121.final[indx.cntr, 'grd.pt.unt']*wts)/sum(wts)


diff.means = match.bio121$est

std.error =  match.bio121$se
num.uniq.contrl.matched =  length(unique(match.bio121$index.control))

num.trt = match.bio121$wnobs

approx.p.val =  (1-pnorm(diff.means/std.error)) #right-tail

results = data.frame(course=c("Bio 121"), mean.gpa.PAL=round(mean.trt,2), mean.gpa.nonPAL=round(mean.cntrl,2), difference.mean.gpa=round(diff.means,2), standard.error=round(std.error,2),p.val=round(approx.p.val,4),number.control=num.uniq.contrl.matched, number.treated=num.trt)

kable(results, caption = "Propensity-scored Adjusted Comparison of PAL and non-PAL Grades")

Propensity-scored Adjusted Comparison of PAL and non-PAL Grades
course	mean.gpa.PAL	mean.gpa.nonPAL	difference.mean.gpa	standard.error	p.val	number.control	number.treated
Bio 121	2.44	2.06	0.38	0.1	1e-04	370	322

Sensitivity Analysis

psens(match.bio121, Gamma=2, GammaInc = 0.1)

## 
##  Rosenbaum Sensitivity Test for Wilcoxon Signed Rank P-Value 
##  
## Unconfounded estimate ....  0 
## 
##  Gamma Lower bound Upper bound
##    1.0           0      0.0000
##    1.1           0      0.0003
##    1.2           0      0.0054
##    1.3           0      0.0385
##    1.4           0      0.1477
##    1.5           0      0.3535
##    1.6           0      0.5990
##    1.7           0      0.7994
##    1.8           0      0.9186
##    1.9           0      0.9727
##    2.0           0      0.9923
## 
##  Note: Gamma is Odds of Differential Assignment To
##  Treatment Due to Unobserved Factors 
##

Note that in the above table \(\Gamma=1.4\) in the first column is the first row where 0.05 is between the Lower and Upper bounds. This means that an unknown confounder which increases the odds of being in PAL by more than 1.4 is enough to change the treatment effect from significant to non-significant. The next code block generates the effect on the odds ratio of each variable in the propensity score. Thus, if there is an unknown confounder that has an effect on the propensity score similar to stem.flg (STEM major or not), the PAL effect would become non-significant. Thus, this finding is sensitive to unknown confounders. It is possible a variable like the number of hours per week a student works which is not in our dataset is a confounder which could reverse the statistical significance of this analysis.

sort(exp(abs(bio121.first.order.prop.model$coefficients)))

##              eth.erssOTHER             bot.levelOTHER 
##                   1.097822                   1.165844 
## term.units.attemptedCensus            Instructor_0122 
##                   1.169494                   1.196028 
##                 genderMale          grd.pt.unt.bio184 
##                   1.202468                   1.209351 
##                   prevPASS               stem.flgSTEM 
##                   1.274004                   1.401901 
##                    prevPAL            eth.erssUnknown 
##                   1.523254                   1.565261 
##            bot.levelSenior              eth.erssWhite 
##                   1.679694                   1.722992 
##            Instructor_0121           eth.erssHispanic 
##                   1.781744                   1.811324 
##            acad.standOTHER   cum.percent.units.passed 
##                   1.815043                   1.892003 
##              eth.erssAsian  eth.erssTwo or More Races 
##                   2.124415                   3.123921 
##                (Intercept)            Instructor_0135 
##                   3.383485                   3.732237

PAL effectiveness Analysis for Bio 121

Michelle Norris and Caitlin Esgana

July 30, 2020