Boosting-RF

Development and added value of boosting with penalized splines and random forests in excess mortality hazard models applied to cancer epidemiology

Our objective is to develop, and make freely available to the community (through R packages), net survival methods adapted to the analysis of rich data with numerous covariates. We will then apply these new statistical methods to answer three epidemiological questions using such data.

Project period:

2020 - 2025

SESSTIM project member(s):

Roch GIORGI

Non-SESSTIM project member(s):

Nadine BOSSARD (Project coordinator)

Funders:

INCa SHS-E-SP 2020 (SHSESP20-011)

Partners:

SESSTIM, UMR 1252 Inserm/IRD/Aix-Marseille Université.

London School of Hygiene and Tropical Medicine (LSHTM).

Réseau Francim.

Topics:

Excess mortality, net survival, boosting, random forest, machine learning, cancer

Research problem:

For cancer patients, the mortality hazard consists of two components: the excess mortality hazard, which is directly or indirectly related to cancer, and the expected (or background) mortality hazard, which is related to all other causes. The excess mortality hazard makes it possible to calculate a ‘net survival’, i.e. the survival that would be observed if cancer were the sole cause of death. In cancer epidemiology, net survival and the dynamics of the excess hazard with time since diagnosis are both major indicators, and evaluating the effect of prognostic variables on these indicators is important.
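As a minimal numerical sketch of this decomposition, the snippet below adds an excess hazard and an expected hazard to obtain the observed hazard, then derives net survival as the exponential of minus the cumulative excess hazard. The two hazard functions and their parameter values are hypothetical and chosen only for illustration; they do not come from the project or from any real data.

```python
import numpy as np

# Hypothetical hazards (per year), chosen only for illustration.
def excess_hazard(t):
    # Cancer-related hazard: high just after diagnosis, then decaying.
    return 0.30 * np.exp(-0.5 * t)

def expected_hazard(t):
    # Background hazard from all other causes, assumed constant here.
    return np.full_like(t, 0.02)

def cumulative(h, t):
    # Cumulative integral of h over t (trapezoidal rule), starting at 0.
    steps = 0.5 * (h[1:] + h[:-1]) * np.diff(t)
    return np.concatenate(([0.0], np.cumsum(steps)))

# Time since diagnosis, in years.
t = np.linspace(0.0, 5.0, 501)

# Observed hazard = excess hazard + expected hazard.
observed_hazard = excess_hazard(t) + expected_hazard(t)

# Net survival depends on the excess hazard only.
net_survival = np.exp(-cumulative(excess_hazard(t), t))
expected_survival = np.exp(-cumulative(expected_hazard(t), t))
overall_survival = np.exp(-cumulative(observed_hazard, t))
```

Because the hazards are additive, the overall survival in this sketch factorizes as the product of net survival and expected survival at every time point.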

In the current context of the development of health data platforms, registry or cohort data may be augmented with various new data sources. These data open up many interesting epidemiological perspectives. Yet efficient statistical methods for analyzing such data, with numerous covariates, are lacking.

Methods:

Two complementary techniques originating from machine learning will be extended to the net survival setting. First, we will extend boosting techniques to excess hazard models, using multidimensional penalized splines as base learners. This method simultaneously tackles the challenging issues of variable selection and model choice, and allows modelling of the dynamics of the excess hazard and of the effect of prognostic variables on these dynamics. Second, we will develop a random forest methodology for net survival estimation; this methodology is expected to provide accurate predictions and can also be used to identify the main prognostic variables. The performance of these methods will then be evaluated through simulation.
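To illustrate how boosting can perform variable selection, here is a deliberately simplified sketch of component-wise boosting. It is not the project's method: it assumes a squared-error loss and one linear base learner per covariate, whereas the project targets excess hazard models with penalized spline base learners. The function name `componentwise_boost`, its parameters and the simulated data are all hypothetical.

```python
import numpy as np

def componentwise_boost(X, y, n_steps=200, step=0.1):
    """Component-wise L2 boosting with one linear base learner per covariate.

    Simplifying assumptions: squared-error loss, linear base learners.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)          # centre each covariate
    intercept = y.mean()
    coef = np.zeros(p)
    resid = y - intercept            # current residuals (negative gradient)
    norms = (Xc ** 2).sum(axis=0)
    for _ in range(n_steps):
        # Least-squares slope of the residuals on each covariate.
        slopes = Xc.T @ resid / norms
        # Reduction in residual sum of squares offered by each candidate.
        gain = slopes ** 2 * norms
        # Update only the best-fitting covariate, by a small step.
        j = int(np.argmax(gain))
        coef[j] += step * slopes[j]
        resid -= step * slopes[j] * Xc[:, j]
    return intercept, coef

# Simulated data: only covariates 0 and 3 truly affect the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=500)

intercept, coef = componentwise_boost(X, y)
# Covariates never picked by the algorithm keep a coefficient of exactly 0.
selected = np.flatnonzero(coef)
```

Since each iteration updates a single covariate, stopping early leaves weak or irrelevant covariates with small or zero coefficients, which is the selection mechanism that boosting relies on.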

These methods will benefit the community by increasing its ability to analyze rich data, possibly involving complex phenomena, in a (net) survival setting.