Rpart Greedy Algorithms

  • Describe the rpart regression model at a high level
  • Interpret a printed rpart object

live notes


rpart is short for recursive partitioning: the algorithm repeatedly splits the data into two pieces, then splits each piece again.

library(rpart)

n = 100
#x = sort(3 * runif(n))  # alternative: random x locations
x = seq(from = 0, to = 3, length.out = n)
y = rep(2, n)
y[x < 1] = 1   # step down to 1 on [0, 1)
y[2 < x] = 3   # step up to 3 on (2, 3]
noise = rnorm(n, sd = 0.2)
y = y + noise
d = data.frame(x, y)

plot(x, y)

The true mean function here is piecewise constant: a step function taking the values 1, 2, and 3.

Fit a tree using the default settings:

fit1 = rpart(y ~ x, data = d)

plot(x, y)
lines(x, predict(fit1))
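Printing the fitted object shows the tree structure as text. Here is a self-contained version of the example above, with a seed added (a choice of mine, not in the original) so the output is reproducible:

```r
library(rpart)

set.seed(42)  # hypothetical seed, added only for reproducibility
n = 100
x = seq(from = 0, to = 3, length.out = n)
y = rep(2, n)
y[x < 1] = 1
y[2 < x] = 3
y = y + rnorm(n, sd = 0.2)
d = data.frame(x, y)

fit1 = rpart(y ~ x, data = d)
print(fit1)
# Each row is a node: node number, split rule, n, deviance, and the
# predicted value (the mean of y in that node). A trailing * marks a
# leaf, and the children of node k are numbered 2k and 2k + 1.
```

The exact split thresholds in the printout depend on the noise, but they should land near the true breakpoints at x = 1 and x = 2.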

To see the parameters for rpart:

?rpart.control
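A few of the rpart.control defaults are worth knowing, since they govern when a node may be split:

```r
library(rpart)

ctrl = rpart.control()
ctrl$minsplit   # 20: minimum observations in a node before a split is attempted
ctrl$minbucket  # 7: minimum observations allowed in any leaf (round(minsplit / 3))
ctrl$cp         # 0.01: a split must improve fit by this fraction of the root deviance
ctrl$maxdepth   # 30: maximum depth of any node in the tree
```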

Greedy algorithms do the best they can at every step, which means they don’t always find the globally optimal solution. The following plot illustrates the greedy nature of rpart. With three leaf nodes, the optimal solution for this data is to split around 33 and 66. rpart instead starts out by splitting at 50.5, because that’s the best single first split. The best second split then divides the upper half of the data, which has 51 points, so around 75.

n = 101
x = seq(n)
y = x
# minsplit = 51 means only nodes with at least 51 observations are
# eligible to be split, so the tree stops after two splits
fit = rpart(y ~ x, minsplit = 51)

plot(x, y)
lines(x, predict(fit))

This isn’t a flaw in the algorithm; it’s just a property to be aware of. Globally optimal trees are expensive to find, and in practice greedy trees work just fine.
