Model-agnostic global Survival Prediction of Patients with Myeloid Leukemia in QRT/Gustave Roussy Challenge (challengedata.ens.fr): Python’s survivalist Quickstart
In this post, I provide quickstart Python code based on the starter code provided at https://challengedata.ens.fr/challenges/162. In particular, I use the Python package survivalist, which does survival analysis with machine learning and uncertainty quantification (introduced in https://thierrymoudiki.github.io/blog/2024/12/15/python/agnostic-survival-analysis).
Data Challenge: Leukemia Risk Prediction
GOAL OF THE CHALLENGE and WHY IT IS IMPORTANT:
The goal of the challenge is to predict disease risk for patients with blood cancer, in the context of specific subtypes of adult myeloid leukemias.
The risk is measured through the overall survival of patients, i.e., the time from diagnosis of the blood cancer to death or last follow-up.
Estimating the prognosis of patients is critical for optimal clinical management.
For example, patients with low-risk disease will be offered supportive care to improve blood counts and quality of life, while patients with high-risk disease will be considered for hematopoietic stem cell transplantation.
The performance metric used in the challenge is the IPCW-C-Index.
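Concretely, the IPCW-C-Index (inverse probability of censoring weighted concordance index) measures how well predicted risk scores rank patients by survival time, while reweighting comparable pairs to correct for censoring. Here is a minimal sketch of how it can be computed with scikit-survival's concordance_index_ipcw; the toy data and tau value are illustrative only:

# Illustrative only: toy survival data and arbitrary risk scores,
# scored with the IPCW concordance index from scikit-survival
import numpy as np
from sksurv.metrics import concordance_index_ipcw
from sksurv.util import Surv

y_toy = Surv.from_arrays(
    event=np.array([True, False, True, True]),  # True = death observed
    time=np.array([1.5, 4.0, 2.5, 0.8]),        # years of follow-up
)
risk_scores = np.array([2.1, 0.3, 1.2, 3.0])    # higher = higher predicted risk

# The first argument is used to estimate the censoring distribution
# (here, the same toy set); tau truncates the evaluation horizon
cindex = concordance_index_ipcw(y_toy, y_toy, risk_scores, tau=3.0)[0]
print(f"IPCW C-index: {cindex:.3f}")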
THE DATASETS
The training set is made of 3,323 patients.
The test set is made of 1,193 patients.
For each patient, you have access to CLINICAL data and MOLECULAR data.
The details of the data are as follows:
- OUTCOME:
- OS_YEARS = Overall survival time in years
- OS_STATUS = 1 (death), 0 (alive at last follow-up)
CLINICAL DATA, with one line per patient:
- ID = unique identifier per patient
- CENTER = clinical center
- BM_BLAST = Bone marrow blasts in % (blasts are abnormal blood cells)
- WBC = White Blood Cell count in Giga/L
- ANC = Absolute Neutrophil count in Giga/L
- MONOCYTES = Monocyte count in Giga/L
- HB = Hemoglobin in g/dL
- PLT = Platelet count in Giga/L
- CYTOGENETICS = A description of the karyotype observed in the blood cells of the patients, measured by a cytogeneticist. Cytogenetics is the science of chromosomes. A karyotype is performed from the blood tumoral cells. The convention for notation is ISCN (https://en.wikipedia.org/wiki/International_System_for_Human_Cytogenomic_Nomenclature); cytogenetic notation is described at https://en.wikipedia.org/wiki/Cytogenetic_notation. Note that a karyotype can be normal or abnormal. The notation 46,XX denotes a normal karyotype in females (23 pairs of chromosomes, including 2 X chromosomes) and 46,XY in males (23 pairs of chromosomes, including 1 X chromosome and 1 Y chromosome). A common abnormality in the blood cancerous cells is, for example, a loss of chromosome 7 (monosomy 7, or -7), which is typically associated with higher-risk disease; a toy string check for this pattern is sketched below.
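As a toy illustration of how such a karyotype string could be turned into a feature, here is a hypothetical helper that flags monosomy 7; real ISCN parsing is considerably more involved, so treat this as a sketch only:

import re

def has_monosomy7(karyotype) -> bool:
    """Hypothetical helper: flag monosomy 7 ('-7') in an ISCN karyotype
    string. Real ISCN parsing is much more involved than this regex."""
    if not isinstance(karyotype, str):
        return False  # e.g., missing values (NaN)
    # '-7' not preceded by a digit or '(' and not followed by another digit,
    # so '-17' or '-70' do not match
    return bool(re.search(r"(?<![\d(])-7(?!\d)", karyotype.lower()))

print(has_monosomy7("45,xy,-7[20]"))                     # True: monosomy 7
print(has_monosomy7("46,xx"))                            # False: normal female karyotype
print(has_monosomy7("46,xy,del(20)(q12)[2]/46,xy[18]"))  # False: other abnormality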
GENE MOLECULAR DATA, with one line per patient per somatic mutation. Mutations are detected from the sequencing of the blood tumoral cells.
We call somatic (= acquired) mutations the mutations that are found in the tumoral cells but not in other cells of the body.
- ID = unique identifier per patient
- CHR START END = position of the mutation on the human genome
- REF ALT = reference and alternate (=mutant) nucleotide
- GENE = the affected gene
- PROTEIN_CHANGE = the consequence of the mutation on the protein that is expressed by a given gene
- EFFECT = a broad categorization of the mutation consequences on a given gene.
- VAF = Variant Allele Fraction: the proportion of cells carrying the mutation (a per-patient aggregation of this table is sketched below).
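Since the molecular table has one line per mutation while the models below expect one line per patient, the mutation-level data has to be aggregated before it can be used as features. The quickstart below only uses clinical variables, but here is one possible (illustrative) aggregation, assuming the column names from the description above:

# Illustrative sketch: collapse the per-mutation molecular table into
# per-patient features (number of mutations, maximum VAF)
import pandas as pd

maf_df = pd.read_csv("./molecular_train.csv")
mol_features = (
    maf_df.groupby("ID")
          .agg(N_MUTATIONS=("GENE", "size"), MAX_VAF=("VAF", "max"))
          .reset_index()
)
print(mol_features.head())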
!pip install survivalist --upgrade --no-cache-dir --verbose
!pip install scikit-survival
!pip install nnetsauce
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

import os
os.environ['PATH'] = f"/root/.cargo/bin:{os.environ['PATH']}"

!echo $PATH
!rustc --version
!cargo --version
!pip install glmnetforpython --verbose --upgrade --no-cache-dir
!pip install git+https://github.com/Techtonique/genbooster.git --upgrade --no-cache-dir
import numpy as np
import pandas as pd
import glmnetforpython as glmnet
from matplotlib import pyplot as plt
from sklearn.utils.discovery import all_estimators
from sklearn.datasets import load_iris, load_breast_cancer, load_wine, load_digits
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.tree import ExtraTreeRegressor
from sklearn.model_selection import train_test_split
from genbooster.genboosterregressor import BoosterRegressor
from genbooster.randombagregressor import RandomBagRegressor
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
from time import time
from survivalist.nonparametric import kaplan_meier_estimator
from survivalist.datasets import load_whas500, load_gbsg2, load_veterans_lung_cancer
from survivalist.custom import SurvivalCustom
from survivalist.custom import PISurvivalCustom
from survivalist.ensemble import GradientBoostingSurvivalAnalysis
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored, concordance_index_ipcw
from sklearn.impute import SimpleImputer
from sksurv.util import Surv

# Clinical Data
df = pd.read_csv("./clinical_train.csv")
df_eval = pd.read_csv("./clinical_test.csv")

# Molecular Data
maf_df = pd.read_csv("./molecular_train.csv")
maf_eval = pd.read_csv("./molecular_test.csv")

target_df = pd.read_csv("./target_train.csv")
#target_df_test = pd.read_csv("./target_test.csv")

# Preview the data
df.head()
 | ID | CENTER | BM_BLAST | WBC | ANC | MONOCYTES | HB | PLT | CYTOGENETICS |
---|---|---|---|---|---|---|---|---|---|
0 | P132697 | MSK | 14.00 | 2.80 | 0.20 | 0.70 | 7.60 | 119.00 | 46,xy,del(20)(q12)[2]/46,xy[18] |
1 | P132698 | MSK | 1.00 | 7.40 | 2.40 | 0.10 | 11.60 | 42.00 | 46,xx |
2 | P116889 | MSK | 15.00 | 3.70 | 2.10 | 0.10 | 14.20 | 81.00 | 46,xy,t(3;3)(q25;q27)[8]/46,xy[12] |
3 | P132699 | MSK | 1.00 | 3.90 | 1.90 | 0.10 | 8.90 | 77.00 | 46,xy,del(3)(q26q27)[15]/46,xy[5] |
4 | P132700 | MSK | 6.00 | 128.00 | 9.70 | 0.90 | 11.10 | 195.00 | 46,xx,t(3;9)(p13;q22)[10]/46,xx[10] |
Step 1: Data Preparation (clinical data only)
For survival analysis, we’ll format the dataset so that OS_YEARS represents the time variable and OS_STATUS represents the event indicator.
# Drop rows where 'OS_YEARS' is NaN if conversion caused any issues
target_df.dropna(subset=['OS_YEARS', 'OS_STATUS'], inplace=True)

# Check the data types to ensure 'OS_STATUS' is boolean and 'OS_YEARS' is numeric
print(target_df[['OS_STATUS', 'OS_YEARS']].dtypes)

# Convert 'OS_YEARS' to numeric if it isn't already
target_df['OS_YEARS'] = pd.to_numeric(target_df['OS_YEARS'], errors='coerce')

# Ensure 'OS_STATUS' is boolean
target_df['OS_STATUS'] = target_df['OS_STATUS'].astype(bool)

# Select features
features = ['BM_BLAST', 'HB', 'PLT']
target = ['OS_YEARS', 'OS_STATUS']

# Create the survival data format
X = df.loc[df['ID'].isin(target_df['ID']), features]
y = Surv.from_dataframe('OS_STATUS', 'OS_YEARS', target_df)
OS_STATUS    float64
OS_YEARS     float64
dtype: object
Step 2: Splitting the Dataset
We’ll split the data into training and testing sets to evaluate the model’s performance.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Median imputation for missing values in the clinical features
imputer = SimpleImputer(strategy="median")
X_train[['BM_BLAST', 'HB', 'PLT']] = imputer.fit_transform(X_train[['BM_BLAST', 'HB', 'PLT']])
X_test[['BM_BLAST', 'HB', 'PLT']] = imputer.transform(X_test[['BM_BLAST', 'HB', 'PLT']])
display(X_train.head())
display(y_train)
 | BM_BLAST | HB | PLT |
---|---|---|---|
1048 | 3.00 | 9.10 | 150.00 |
1987 | 15.00 | 11.00 | 45.00 |
214 | 6.00 | 6.90 | 132.00 |
2135 | 2.00 | 10.00 | 178.00 |
2150 | 10.00 | 10.00 | 53.00 |
array([(False, 1.91780822), ( True, 1.28219178), ( True, 1.49041096), ..., (False, 8.63561644), (False, 0.47671233), (False, 1.29041096)], dtype=[('OS_STATUS', '?'), ('OS_YEARS', '<f8')])
Step 3: Cox Proportional Hazards Model
To account for censoring in survival analysis, we use a Cox Proportional Hazards (Cox PH) model, a widely used method that estimates the effect of covariates on survival times without assuming a specific baseline survival distribution. The Cox PH model is based on the hazard function \(h(t \mid X)\), which represents the instantaneous risk of an event (e.g., death) at time \(t\) given covariates \(X\). The model assumes that the hazard can be expressed as:
\[h(t | X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)\]
where \(h_0(t)\) is the baseline hazard function, and the \(\beta\) values are coefficients for each covariate, representing the effect of \(X\) on the hazard. For example, a coefficient \(\beta_j = 0.7\) corresponds to a hazard ratio of \(\exp(0.7) \approx 2\), i.e., roughly a doubling of the instantaneous risk per unit increase in \(X_j\). Importantly, the proportional hazards assumption implies that the hazard ratios between individuals are constant over time. This approach effectively leverages both observed and censored survival times, making it more suitable for survival data than standard regression techniques that ignore censoring.
# Initialize and train the Cox Proportional Hazards model
cox = CoxPHSurvivalAnalysis()
cox.fit(X_train, y_train)

# Evaluate the model using the IPCW concordance index
cox_cindex_train = concordance_index_ipcw(y_train, y_train, cox.predict(X_train), tau=7)[0]
cox_cindex_test = concordance_index_ipcw(y_train, y_test, cox.predict(X_test), tau=7)[0]
print(f"Cox Proportional Hazards Model Concordance Index IPCW on train: {cox_cindex_train:.5f}")
print(f"Cox Proportional Hazards Model Concordance Index IPCW on test: {cox_cindex_test:.5f}")
Cox Proportional Hazards Model Concordance Index IPCW on train: 0.66302
Cox Proportional Hazards Model Concordance Index IPCW on test: 0.66060
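Because the model is log-linear in the covariates, the fitted coefficients are directly interpretable. A small follow-up sketch, reusing cox and X_train from the cell above:

import numpy as np
import pandas as pd

# Exponentiated Cox coefficients are hazard ratios: values above 1 increase
# the instantaneous risk, values below 1 decrease it (per unit of the covariate)
hazard_ratios = pd.Series(np.exp(cox.coef_), index=X_train.columns)
print(hazard_ratios)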
Step 4: Other models
4 – 1 Demo
import xgboost as xgb
import lightgbm as lgb

# Kaplan-Meier estimate on the test set, for comparison
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time, conf_type="log-log")

# Initialize and train an XGBoost-based survival model with prediction intervals
estimator = PISurvivalCustom(regr=xgb.XGBRegressor(), type_pi="kde")
estimator.fit(X_train, y_train)

# Plot the predicted survival function (with interval) for one patient
surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])
for fn in surv_funcs.mean:
    plt.step(fn.x, fn(fn.x), where="post")
    plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y,
                     alpha=0.25, color="lightblue", step="post")
plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
plt.ylim(0, 1)
plt.show()

# Evaluate the model using the IPCW concordance index (mean, lower, upper predictions)
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).mean, tau=7)[0]
print(f"XGBoost-based model Concordance Index IPCW on test (mean): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).lower, tau=7)[0]
print(f"XGBoost-based model Concordance Index IPCW on test (lower): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).upper, tau=7)[0]
print(f"XGBoost-based model Concordance Index IPCW on test (upper): {cindex_test:.5f}")
XGBoost-based model Concordance Index IPCW on test (mean): 0.60130
XGBoost-based model Concordance Index IPCW on test (lower): 0.60106
XGBoost-based model Concordance Index IPCW on test (upper): 0.59588
import lightgbm as lgb

# Kaplan-Meier estimate on the test set, for comparison
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time, conf_type="log-log")

# Initialize and train a LightGBM-based survival model with prediction intervals
estimator = PISurvivalCustom(regr=lgb.LGBMRegressor(verbose=0), type_pi="kde")
estimator.fit(X_train, y_train)

# Plot the predicted survival function (with interval) for one patient
surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])
for fn in surv_funcs.mean:
    plt.step(fn.x, fn(fn.x), where="post")
    plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y,
                     alpha=0.25, color="lightblue", step="post")
plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
plt.ylim(0, 1)
plt.show()

# Evaluate the model using the IPCW concordance index (mean, lower, upper predictions)
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).mean, tau=7)[0]
print(f"LightGBM-based model Concordance Index IPCW on test (mean): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).lower, tau=7)[0]
print(f"LightGBM-based model Concordance Index IPCW on test (lower): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).upper, tau=7)[0]
print(f"LightGBM-based model Concordance Index IPCW on test (upper): {cindex_test:.5f}")
LightGBM-based model Concordance Index IPCW on test (mean): 0.61745
LightGBM-based model Concordance Index IPCW on test (lower): 0.62040
LightGBM-based model Concordance Index IPCW on test (upper): 0.61757
4 – 2 Models galore
# Loop over scikit-learn regressors as base learners
import nnetsauce as ns
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from functools import partial
from sklearn.utils.discovery import all_estimators
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
from sksurv.metrics import concordance_index_ipcw
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv
from sklearn.impute import SimpleImputer

# Get all regressors from scikit-learn
regressors = [est for est in all_estimators() if 'Regressor' in est[0]]
# Append xgb.XGBRegressor and lgb.LGBMRegressor as (name, class) tuples
regressors += [('XGBRegressor', xgb.XGBRegressor),
               ('LGBMRegressor', partial(lgb.LGBMRegressor, verbose=0))]

results = []

for name, Regressor in tqdm(regressors):
    print("\n\n ----- base learner", name)
    try:
        # Plain base learner
        estimator = PISurvivalCustom(regr=Regressor(), type_pi="kde")
        estimator.fit(X_train, y_train)
        y_pred = estimator.predict(X_test.iloc[:])
        c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
        c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
        c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
        print("\n c_index", c_index)
        results.append([name, c_index, c_index_lower, c_index_upper])

        # Base learner wrapped in an nnetsauce CustomRegressor
        estimator = PISurvivalCustom(regr=ns.CustomRegressor(Regressor()), type_pi="kde")
        estimator.fit(X_train, y_train)
        y_pred = estimator.predict(X_test.iloc[:])
        c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
        c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
        c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
        print("\n c_index", c_index)
        results.append(["custom" + name, c_index, c_index_lower, c_index_upper])

        # Base learner wrapped in a genbooster RandomBagRegressor
        estimator = PISurvivalCustom(regr=RandomBagRegressor(Regressor()), type_pi="kde")
        estimator.fit(X_train, y_train)
        y_pred = estimator.predict(X_test.iloc[:])
        c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
        c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
        c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
        print("\n c_index", c_index)
        results.append(["bagging" + name, c_index, c_index_lower, c_index_upper])
    except Exception as e:
        continue
pd.options.display.float_format = '{:.5f}'.format
results_df = pd.DataFrame(results, columns=['Regressor', 'Concordance Index IPCW',
                                            'lower bound', 'upper bound'])
results_df.drop(columns=['lower bound', 'upper bound'], inplace=True)
results_df.sort_values(by='Concordance Index IPCW', ascending=False, inplace=True)
results_df
 | Regressor | Concordance Index IPCW |
---|---|---|
35 | baggingPassiveAggressiveRegressor | 0.66627 |
39 | baggingRANSACRegressor | 0.66416 |
26 | baggingHuberRegressor | 0.66340 |
48 | baggingTheilSenRegressor | 0.66338 |
51 | baggingTransformedTargetRegressor | 0.66288 |
47 | customTheilSenRegressor | 0.66249 |
52 | TweedieRegressor | 0.66238 |
24 | HuberRegressor | 0.66225 |
46 | TheilSenRegressor | 0.66224 |
45 | baggingSGDRegressor | 0.66125 |
25 | customHuberRegressor | 0.66078 |
32 | baggingMLPRegressor | 0.66022 |
50 | customTransformedTargetRegressor | 0.66021 |
54 | baggingTweedieRegressor | 0.65999 |
49 | TransformedTargetRegressor | 0.65943 |
53 | customTweedieRegressor | 0.65934 |
2 | baggingAdaBoostRegressor | 0.65665 |
37 | RANSACRegressor | 0.65656 |
31 | customMLPRegressor | 0.65547 |
44 | customSGDRegressor | 0.65448 |
20 | baggingGradientBoostingRegressor | 0.64801 |
1 | customAdaBoostRegressor | 0.64506 |
0 | AdaBoostRegressor | 0.64473 |
42 | baggingRandomForestRegressor | 0.63564 |
18 | GradientBoostingRegressor | 0.63519 |
5 | baggingBaggingRegressor | 0.63496 |
19 | customGradientBoostingRegressor | 0.63449 |
59 | baggingLGBMRegressor | 0.63271 |
23 | baggingHistGradientBoostingRegressor | 0.63223 |
28 | customKNeighborsRegressor | 0.62687 |
14 | baggingExtraTreesRegressor | 0.62214 |
30 | MLPRegressor | 0.62160 |
57 | LGBMRegressor | 0.62069 |
21 | HistGradientBoostingRegressor | 0.61921 |
13 | customExtraTreesRegressor | 0.61834 |
58 | customLGBMRegressor | 0.61702 |
29 | baggingKNeighborsRegressor | 0.61671 |
4 | customBaggingRegressor | 0.61650 |
12 | ExtraTreesRegressor | 0.61259 |
3 | BaggingRegressor | 0.61154 |
36 | QuantileRegressor | 0.61013 |
41 | customRandomForestRegressor | 0.60759 |
40 | RandomForestRegressor | 0.60749 |
38 | customRANSACRegressor | 0.60653 |
33 | PassiveAggressiveRegressor | 0.60526 |
11 | baggingExtraTreeRegressor | 0.60368 |
56 | customXGBRegressor | 0.60044 |
22 | customHistGradientBoostingRegressor | 0.60003 |
8 | baggingDecisionTreeRegressor | 0.59707 |
27 | KNeighborsRegressor | 0.59139 |
55 | XGBRegressor | 0.58723 |
34 | customPassiveAggressiveRegressor | 0.58205 |
9 | ExtraTreeRegressor | 0.57820 |
6 | DecisionTreeRegressor | 0.56057 |
10 | customExtraTreeRegressor | 0.55844 |
7 | customDecisionTreeRegressor | 0.55117 |
15 | GaussianProcessRegressor | 0.53342 |
16 | customGaussianProcessRegressor | 0.52007 |
17 | baggingGaussianProcessRegressor | 0.49695 |
43 | SGDRegressor | 0.40105 |
results_df.shape
(60, 2)
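Finally, to turn one of these models into a challenge submission, you would refit on the full training set and score the patients in clinical_test.csv. A hypothetical sketch with the Cox model, reusing df, df_eval, target_df, features, y, and imputer from above (the exact submission file format and column names are assumptions; check them on the challenge page):

# Hypothetical submission sketch: refit on all training patients and
# predict risk scores for the evaluation set (clinical_test.csv)
X_full = df.loc[df['ID'].isin(target_df['ID']), features].copy()
X_full[features] = imputer.fit_transform(X_full[features])
X_eval = df_eval[features].copy()
X_eval[features] = imputer.transform(X_eval[features])

cox_full = CoxPHSurvivalAnalysis()
cox_full.fit(X_full, y)

submission = pd.DataFrame({"ID": df_eval["ID"], "risk_score": cox_full.predict(X_eval)})
# Column names and format are assumptions; check challengedata.ens.fr
submission.to_csv("submission.csv", index=False)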