Overview of DynForest package

DynForest methodology was implemented into the R package DynForest (Devaux 2024) freely available on The Comprehensive R Archive Network (CRAN) to users.

The package includes two main functions: dynforest() and predict() for the learning and the prediction steps. These functions are fully described in section 3.1 and 3.2. Other functions available are briefly described in the table below. These functions are illustrated in examples, one for a survival outcome, one for a categorical outcome and one for a continuous outcome.

Function Description
Learning and prediction steps
dynforest() Function that builds the random forest
predict() Function for S3 class dynforest predicting the outcome on new subjects using the individual-specific information
Assessment function
compute_ooberror() Function that computes the Out-Of-Bag error to be minimized to tune the random forest
Exploring functions
compute_vimp() Function that computes the importance of variables
compute_gvimp() Function that computes the importance of a group of variables
compute_vardepth() Function that extracts information about the tree building process
plot() functions for S3 class:
dynforest Plot the estimated CIF for given tree nodes or subjects
dynforestpred Plot the predicted CIF for the cause of interest for given subjects
dynforestvimp Plot the importance of variables by value or percentage
dynforestgvimp Plot the importance of a group of variables by value or percentage
dynforestvardepth Plot the minimal depth by predictors or features
Other functions
summary() Function for class S3 dynforest or dynforestoob displaying information about the type of random forest, predictors included, parameters used, Out-Of-Bag error (only for dynforestoob class) and brief summaries about the leaves
print() Function to print object of class dynforest, dynforestoob, dynforestvimp, dynforestgvimp, dynforestvardepth and dynforestpred
get_tree() Function that extracts the tree structure for a given tree
get_treenode() Function that extracts the terminal node identifiers for a given tree

dynforest() function

dynforest() is the function to build the random forest. The call of this function is:

dynforest(timeData = NULL, fixedData = NULL, idVar = NULL, 
          timeVar = NULL, timeVarModel = NULL, Y = NULL, 
          ntree = 200, mtry = NULL, nodesize = 1, minsplit = 2, cause = 1,
          nsplit_option = "quantile", ncores = NULL,
          seed = 1234, verbose = TRUE)

Arguments

timeData is an optional argument that contains the dataframe in longitudinal format (i.e., one observation per row) for the time-dependent predictors. In addition to time-dependent predictors, this dataframe should include a unique identifier and the measurement times. This argument is set to NULL if no time-dependent predictor is included. Argument fixedData contains the dataframe in wide format (i.e., one subject per row) for the time-fixed predictors. In addition to time-fixed predictors, this dataframe should also include the same identifier as used in timeData. This argument is set to NULL if no time-fixed predictor is included. Argument idVar provides the name of identifier variable included in timeData and fixedData dataframes. Argument timeVar provides the name of time variable included in timeData dataframe. Argument timeVarModel contains as many lists as time-dependent predictors defined in timeData to specify the structure of the mixed models assumed for each predictor. For each time-dependent predictor, the list should contain a fixed and a random argument to define the formula of a mixed model to be estimated with lcmm R package (Proust-Lima, Philipps, and Liquet 2017). fixed defines the formula for the fixed-effects and random for the random-effects (e.g., list(Y1 = list(fixed = Y1 ~ time, random = ~ time)). Argument Y contains a list of two elements type and Y. Element type defines the nature of the outcome (surv for survival outcome with possibly competing causes, numeric for continuous outcome and factor for categorical outcome) and element Y defines the dataframe which includes the identifier (same as in timeData and fixedData dataframes) and outcome variables.

Arguments ntree, mtry, nodesize and minsplit are the hyperparameters of the random forest. Argument ntree controls the number of trees in the random forest (200 by default). Argument mtry indicates the number of variables randomly drawn at each node (square root of the total number of predictors by default). Argument nodesize indicates the minimal number of subjects allowed in the leaves (1 by default). Argument minsplit controls the minimal number of events required to split the node (2 by default).

For survival outcome, argument cause indicates the event the interest. Argument nsplit_option indicates the method to build the two groups of individuals at each node. By default, we build the groups according to deciles (quantile option) but they could be built according to random values (sample option).

Argument ncores indicates the number of cores used to grow the trees in parallel mode. By default, we set the number of cores of the computer minus 1. Argument seed specifies the random seed. It can be fixed to replicate the results. Argument verbose allows to display a progression bar during the execution of the function.

Values

dynforest() function returns an object of class dynforest containing several elements:

  • data a list with longitudinal predictors (Longitudinal element), continuous predictors (Numeric element) and categorical predictors (Factor element)
  • rf is a dataframe with one column per tree containing a list with several elements, which includes:
    • leaves the leaf identifier for each subject used to grow the tree
    • idY the identifiers for each subject used to grow the tree
    • V_split the split summary (more detailed below)
    • Y_pred the estimated outcome in each leaf
    • model_param the estimated parameters of the mixed model for the longitudinal predictors used to split the subjects at each node
    • Ytype, hist_nodes, Y, boot and Ylevels internal information used in other functions
  • type the nature of the outcome
  • times the event times (only for survival outcome)
  • cause the cause of interest (only for survival outcome)
  • causes the unique causes (only for survival outcome)
  • Inputs the list of predictors names for Longitudinal (longitudinal predictor), Continuous (continuous predictor) and Factor (categorical predictor)
  • Longitudinal.model the mixed model specification for each longitudinal predictor
  • param a list of hyperparameters used to grow the random forest
  • comput.time the computation time

The main information returned by rf is V_split element which can also be extract using get_tree() function. This element contains a table sorted by the node/leaf identifier (id_node column) with each row representing a node/leaf. Each column provides information about the splits:

  • type: the nature of the predictor (Longitudinal for longitudinal predictor, Numeric for continuous predictor or Factor for categorical predictor) if the node was split, Leaf otherwise;
  • var_split: the predictor used for the split defined by its order in timeData and fixedData;
  • feature: the feature used for the split defined by its position in random statistic;
  • threshold: the threshold used for the split (only with Longitudinal and Numeric). No information is returned for Factor;
  • N: the number of subjects in the node/leaf;
  • Nevent: the number of events of interest in the node/leaf (only with survival outcome);
  • depth: the depth level of the node/leaf.

Additional information about the dependencies

dynforest() function internally calls other functions from related packages to build the random forest:

  • hlme() function (from lcmm package (Proust-Lima, Philipps, and Liquet 2017)) to fit the mixed models for the time-dependent predictors defined in timeData and timeVarModel arguments
  • Entropy() function (from base package) to compute the Shannon entropy
  • survdiff() function (from survival package (Therneau 2022)) to compute the log-rank statistic test
  • crr() function (from cmprsk package (Gray 2020)) to compute the Fine & Gray statistic test

predict() function

predict() is the S3 function for class dynforest to predict the outcome on new subjects. Landmark time can be specified to consider only longitudinal data collected up to this time to compute the prediction. The call of this function is:

predict(object, timeData = NULL, fixedData = NULL,
        idVar, timeVar, t0 = NULL)

Arguments

Argument object contains a dynforest object resulting from dynforest() function. Argument timeData contains the dataframe in longitudinal format (i.e., one observation per row) for the time-dependent predictors of new subjects. In addition to time-dependent predictors, this dataframe should also include a unique identifier and the time measurements. This argument can be set to NULL if no time-dependent predictor is included. Argument fixedData contains the dataframe in wide format (i.e., one subject per row) for the time-fixed predictors of new subjects. In addition to time-fixed predictors, this dataframe should also include an unique identifier. This argument can be set to NULL if no time-fixed predictor is included. Argument idVar provides the name of the identifier variable included in timeData and fixedData dataframes. Argument timeVar provides the name of time-measurement variable included in timeData dataframe. Argument t0 defines the landmark time; only the longitudinal data collected up to this time are to be considered. This argument should be set to NULL to include all longitudinal data.

Values

predict() function returns several elements:

  • t0 the landmark time defined in argument (NULL by default)
  • times times used to compute the individual predictions (only with survival outcome). The times are defined according to the time-to-event subjects used to build the random forest.
  • pred_indiv the predicted outcome for the new subject. With survival outcome, predictions are provided for each time defined in times element.
  • pred_leaf a table giving for each tree (in column) the leaf in which each subject is assigned (in row)
  • pred_indiv_proba the proportion of the trees leading to the category prediction for each subject (only with categorical outcome)

References

Devaux, Anthony. 2024. DynForest: Random Forest with Multivariate Longitudinal Predictors. https://CRAN.R-project.org/package=DynForest.
Gray, Bob. 2020. cmprsk: Subdistribution Analysis of Competing Risks. https://CRAN.R-project.org/package=cmprsk.
Proust-Lima, Cécile, Viviane Philipps, and Benoit Liquet. 2017. “Estimation of Extended Mixed Models Using Latent Classes and Latent Processes: The R Package lcmm.” Journal of Statistical Software 78 (2): 1–56. https://doi.org/10.18637/jss.v078.i02.
Therneau, Terry M. 2022. A Package for Survival Analysis in R. https://CRAN.R-project.org/package=survival.