DynForest with categorical outcome

pbc2 dataset

The pbc2 dataset (Murtaugh et al. 1994) is loaded with the DynForest package to illustrate its functionalities. The pbc2 data come from a
clinical trial conducted by the Mayo Clinic between 1974 and 1984 to
treat primary biliary cholangitis (PBC), a chronic liver disease.
A total of 312 patients were enrolled in the trial to evaluate the effectiveness of D-penicillamine compared to a placebo in treating PBC, and were followed until the end of the trial, leading to a total of 1945 observations. During the follow-up, several continuous clinical
markers were collected over time: the level of serum bilirubin (serBilir), the level of serum cholesterol (serChol), the level of albumin (albumin), the level of alkaline phosphatase (alkaline), the level of aspartate aminotransferase (SGOT), the platelet count (platelets) and the prothrombin time (prothrombin). Four non-continuous time-dependent predictors
were also collected: the presence of ascites (ascites), the presence of hepatomegaly (hepatomegaly), the presence of blood vessel malformations in the skin (spiders) and the edema level (edema). These time-dependent predictors were recorded according to the time variable. In addition to these time-dependent predictors, a few predictors were collected at enrollment: the sex (sex), the age (age) and the drug treatment (drug). During the follow-up, 140 patients died before transplantation, 29 patients were transplanted and 143 patients were censored alive (event). The time of the first event (censoring alive or any event) was considered as the event time (years).
library("DynForest")
data(pbc2)
head(pbc2)
#> id time ascites hepatomegaly spiders edema serBilir
#> 1 1 0.0000000 Yes Yes Yes edema despite diuretics 14.5
#> 2 1 0.5256817 Yes Yes Yes edema despite diuretics 21.3
#> 3 10 0.0000000 Yes No Yes edema despite diuretics 12.6
#> 4 100 0.0000000 No Yes No No edema 2.3
#> 5 100 0.4681853 No Yes No No edema 2.5
#> 6 100 0.9801774 Yes No No edema no diuretics 2.9
#> serChol albumin alkaline SGOT platelets prothrombin histologic drug
#> 1 261 2.60 1718 138.0 190 12.2 4 D-penicil
#> 2 NA 2.94 1612 6.2 183 11.2 4 D-penicil
#> 3 200 2.74 918 147.3 302 11.5 4 placebo
#> 4 178 3.00 746 178.3 119 12.0 4 placebo
#> 5 NA 2.94 836 189.1 98 11.4 4 placebo
#> 6 NA 3.02 650 124.0 99 11.7 4 placebo
#> age sex years event
#> 1 58.76684 female 1.0951703 2
#> 2 58.76684 female 1.0951703 2
#> 3 70.56182 female 0.1396342 2
#> 4 51.47027 male 1.5113350 2
#> 5 51.47027 male 1.5113350 2
#> 6 51.47027 male 1.5113350 2
For the illustration, we select patients still at risk at 4 years and recode the event variable as event = 1 for subjects who died between 4 and 10 years, while subjects who were transplanted were recoded event = 0, as were the subjects still alive. We split the subjects into two datasets: (i) one dataset to train the random forest using 2/3 of the patients; (ii) one dataset to compute predictions on the remaining 1/3 of patients.
pbc2 <- pbc2[which(pbc2$years > 4 & pbc2$time <= 4), ]
pbc2$event <- ifelse(pbc2$event == 2, 1, 0)
pbc2$event[which(pbc2$years > 10)] <- 0
set.seed(1234)
id <- unique(pbc2$id)
id_sample <- sample(id, length(id) * 2/3)
id_row <- which(pbc2$id %in% id_sample)
pbc2_train <- pbc2[id_row, ]
pbc2_pred <- pbc2[-id_row, ]
We use the same strategy as in the survival context (section 4) to build the random forest, with the same predictors and the same association for time-dependent predictors.
timeData_train <- pbc2_train[,c("id","time",
"serBilir","SGOT",
"albumin","alkaline")]
timeVarModel <- list(serBilir = list(fixed = serBilir ~ time,
random = ~ time),
SGOT = list(fixed = SGOT ~ time + I(time^2),
random = ~ time + I(time^2)),
albumin = list(fixed = albumin ~ time,
random = ~ time),
alkaline = list(fixed = alkaline ~ time,
random = ~ time))
fixedData_train <- unique(pbc2_train[,c("id","age","drug","sex")])
With a categorical outcome, the definition of the output object is slightly different. We should specify type = "factor" to define the outcome as categorical, and the dataframe in Y should contain only two columns: the variable identifier id and the outcome event.
We executed the dynforest() function to build the random forest with hyperparameters mtry = 7 and nodesize = 2 as follows:
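The corresponding code is not displayed in this excerpt; the following is a minimal sketch reusing the objects created above (timeData_train, fixedData_train, timeVarModel) and assuming the dynforest() interface from the survival illustration; the seed value and ncores = 7 (the core count reported in the summary output) are assumptions.

```r
# Output object: with a categorical outcome, type = "factor" and the
# dataframe holds only the subject identifier and the event (one row per subject)
Y <- list(type = "factor",
          Y = unique(pbc2_train[, c("id", "event")]))

# Build the random forest (sketch; seed and ncores are assumptions)
res_dyn <- dynforest(timeData = timeData_train, fixedData = fixedData_train,
                     timeVar = "time", idVar = "id",
                     timeVarModel = timeVarModel, Y = Y,
                     ntree = 200, mtry = 7, nodesize = 2,
                     ncores = 7, seed = 1234)
```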
With a categorical outcome, the OOB prediction error is evaluated using a misclassification criterion. This criterion can be computed with the compute_ooberror() function, and the results of the random forest can be displayed using summary():
res_dyn_OOB <- compute_ooberror(dynforest_obj = res_dyn)
summary(res_dyn_OOB)
dynforest executed for categorical outcome
Splitting rule: Minimize weighted within-group Shannon entropy
Out-of-bag error type: Missclassification
Leaf statistic: Majority vote
----------------
Input
Number of subjects: 150
Longitudinal: 4 predictor(s)
Numeric: 1 predictor(s)
Factor: 2 predictor(s)
----------------
Tuning parameters
mtry: 7
nodesize: 2
ntree: 200
----------------
----------------
dynforest summary
Average depth per tree: 5.89
Average number of leaves per tree: 16.81
Average number of subjects per leaf: 5.72
----------------
Out-of-bag error based on Missclassification
Out-of-bag error: 0.2333
----------------
Computation time
Number of cores used: 7
Time difference of 2.87888 mins
----------------
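The splitting rule reported in the summary minimizes the weighted within-group Shannon entropy. A minimal base-R sketch of this criterion (toy data, not the forest's actual trees):

```r
# Shannon entropy of a group of outcomes (0 for a pure group)
shannon <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log(p))
}
# Weighted within-group entropy of a candidate split (to be minimized)
split_entropy <- function(y, group) {
  w <- table(group) / length(group)        # group weights
  sum(w * tapply(y, group, shannon))
}
y     <- c(0, 0, 1, 1, 1, 0)
group <- c("L", "L", "R", "R", "R", "L")   # a perfectly separating split
split_entropy(y, group)                    # -> 0
```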
In this illustration, we built the random forest using 150 subjects, because we only kept the subjects still at risk at the landmark time of 4 years and split the dataset into 2/3 for training and 1/3 for testing. We have on average 5.7 subjects per leaf, and the average depth per tree is 5.9. This random forest predicted the wrong outcome for 23% of the subjects. The random forest performance can be optimized by choosing the mtry and nodesize hyperparameters that minimize the OOB misclassification error.
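The misclassification criterion itself is simple: it is the proportion of subjects whose OOB prediction disagrees with the observed class. A minimal base-R sketch with hypothetical values:

```r
# OOB misclassification error: proportion of wrongly predicted subjects
oob_pred <- c(0, 1, 1, 0, 0, 1, 0, 1, 1, 0)  # hypothetical OOB predictions
obs      <- c(0, 1, 0, 0, 1, 1, 0, 1, 1, 0)  # hypothetical observed events
mean(oob_pred != obs)                        # -> 0.2
```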
We can predict the probability of death between 4 and 10 years for subjects still at risk at the landmark time of 4 years. In classification mode, predictions are performed using the majority vote: the prediction over the trees is a category of the outcome, along with the proportion of trees that lead to this category. Predictions are computed using the predict() function; a dataframe can then easily be built from the returned object to get the prediction and probability of the outcome for each subject:
timeData_pred <- pbc2_pred[,c("id","time",
"serBilir","SGOT",
"albumin","alkaline")]
fixedData_pred <- unique(pbc2_pred[,c("id","age","drug","sex")])
pred_dyn <- predict(object = res_dyn,
timeData = timeData_pred,
fixedData = fixedData_pred,
idVar = "id", timeVar = "time",
t0 = 4)
head(data.frame(pred = pred_dyn$pred_indiv,
proba = pred_dyn$pred_indiv_proba))
pred proba
101 0 0.945
104 0 0.790
106 1 0.600
108 0 0.945
112 1 0.575
114 0 0.650
As shown in this example, the predictions are made with varying confidence, ranging from 57.5% for subject 112 to 94.5% for subject 101. For instance, we predict no event for subject 101 with a probability of 94.5%, and an event for subject 106 with a probability of 60.0%.
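The majority-vote aggregation behind these prediction/probability pairs can be sketched in base R (toy vote matrix, not the forest's actual trees):

```r
# Each row is a subject, each column one tree's predicted class (toy values)
votes <- matrix(c(0, 0, 1, 0,
                  1, 1, 1, 0,
                  0, 0, 0, 1), nrow = 3, byrow = TRUE)
majority_vote <- function(v) {
  tab <- table(v)
  list(pred = names(tab)[which.max(tab)],  # most voted category
       proba = max(tab) / length(v))       # proportion of agreeing trees
}
lapply(seq_len(nrow(votes)), function(i) majority_vote(votes[i, ]))
```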
The most predictive variables can be identified using compute_vimp() and displayed using the plot() function as follows:
res_dyn_VIMP <- compute_vimp(dynforest_obj = res_dyn_OOB, seed = 123)
plot(res_dyn_VIMP, PCT = TRUE)
Again, we found that the most predictive variable is serBilir. When perturbing serBilir, the OOB prediction error increased by 15%.
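The VIMP is obtained by permutation: the importance of a predictor is the increase in OOB error after randomly permuting its values. A minimal base-R sketch of the idea (toy classifier standing in for the forest, hypothetical data):

```r
set.seed(123)
x <- c(0.2, 0.8, 0.9, 0.1, 0.7, 0.3)     # hypothetical predictor
y <- c(0, 1, 1, 0, 1, 0)                 # outcome perfectly linked to x
rule <- function(z) as.numeric(z > 0.5)  # toy classifier (assumption)
err_ref  <- mean(rule(x) != y)           # reference error (0 here)
err_perm <- mean(rule(sample(x)) != y)   # error after permuting the predictor
err_perm - err_ref                       # importance: increase in error
```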
The minimal depth is computed using the compute_vardepth() function and is displayed at the predictor and feature levels using the plot() function. The results are displayed in Figure 2 using the random forest with the maximal mtry hyperparameter value (i.e., mtry = 7) for a better understanding.
depth_dyn <- compute_vardepth(dynforest_obj = res_dyn_OOB)
p1 <- plot(depth_dyn, plot_level = "predictor")
p2 <- plot(depth_dyn, plot_level = "feature")
We observe that serBilir and albumin have the lowest minimal depth and are used to split the subjects in almost all the trees (199 and 196 out of 200 trees, respectively) (Figure 2A). Figure 2B provides further results. In particular, this graph shows that the random intercepts (indicated by bi0) of serBilir and albumin are the earliest predictors used to split the subjects and are present in 193 and 190 out of 200 trees, respectively.