Model-agnostic global Survival Prediction of Patients with Myeloid Leukemia in QRT/Gustave Roussy Challenge (challengedata.ens.fr): Python’s survivalist Quickstart


In this post, I provide quickstart Python code based on the one provided at https://challengedata.ens.fr/challenges/162. In particular, I use the Python package survivalist, which does survival analysis with machine learning and uncertainty quantification (introduced in https://thierrymoudiki.github.io/blog/2024/12/15/python/agnostic-survival-analysis).


Data Challenge: Leukemia Risk Prediction

GOAL OF THE CHALLENGE AND WHY IT IS IMPORTANT:

The goal of the challenge is to predict disease risk for patients with blood cancer, specifically for subtypes of adult myeloid leukemias.

The risk is measured through the overall survival of patients, i.e. the duration of survival from the diagnosis of the blood cancer to the time of death or last follow-up. Patients alive at their last follow-up are right-censored: we only know that they survived at least that long.

Estimating the prognosis of patients is critical for optimal clinical management. For example, patients with low-risk disease will be offered supportive care to improve blood counts and quality of life, while patients with high-risk disease will be considered for hematopoietic stem cell transplantation.

The performance metric used in the challenge is the IPCW-C-Index (Inverse Probability of Censoring Weighted Concordance Index).
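
For reference (my notation, not the challenge's): the C-index estimates the probability that, among a comparable pair of patients, the one with the higher predicted risk is the one who dies first. The IPCW variant (Uno et al.) reweights pairs by the inverse of the censoring distribution \(\hat{G}\), so that the estimate is not biased by censoring. Ignoring ties,

\[\hat{C}_{\mathrm{IPCW}} = \frac{\sum_{i \neq j} \Delta_i \, \hat{G}(T_i)^{-2} \, \mathbf{1}(T_i < T_j, \; T_i < \tau) \, \mathbf{1}(\hat{r}_i > \hat{r}_j)}{\sum_{i \neq j} \Delta_i \, \hat{G}(T_i)^{-2} \, \mathbf{1}(T_i < T_j, \; T_i < \tau)}\]

where \(T_i\) is the observed time for patient \(i\), \(\Delta_i\) the event indicator, \(\hat{r}_i\) the predicted risk score, and \(\tau\) a truncation time (tau=7 years in the code below).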

THE DATASETS

The training set consists of 3,323 patients.

The test set consists of 1,193 patients.

For each patient, you have access to CLINICAL data and MOLECULAR data.

The details of the data are as follows:

  • OUTCOME:
    • OS_YEARS = Overall survival time in years
    • OS_STATUS = 1 (death), 0 (alive at last follow-up, i.e. right-censored)
  • CLINICAL DATA, with one line per patient:

    • ID = unique identifier per patient
    • CENTER = clinical center
    • BM_BLAST = Bone marrow blasts in % (blasts are abnormal blood cells)
    • WBC = White Blood Cell count in Giga/L
    • ANC = Absolute Neutrophil count in Giga/L
    • MONOCYTES = Monocyte count in Giga/L
    • HB = Hemoglobin in g/dL
    • PLT = Platelet count in Giga/L
    • CYTOGENETICS = A description of the karyotype observed in the blood cells of the patient, measured by a cytogeneticist. Cytogenetics is the science of chromosomes; a karyotype is performed on the blood tumoral cells. The notation follows the ISCN convention (https://en.wikipedia.org/wiki/International_System_for_Human_Cytogenomic_Nomenclature); see also https://en.wikipedia.org/wiki/Cytogenetic_notation. A karyotype can be normal or abnormal: 46,XX denotes a normal karyotype in females (23 pairs of chromosomes, including two X chromosomes) and 46,XY in males (23 pairs of chromosomes, including one X and one Y chromosome). A common abnormality in the blood cancerous cells is, for example, a loss of chromosome 7 (monosomy 7, or -7), which is typically associated with higher-risk disease (a parsing sketch is given after this list)
  • GENE MOLECULAR DATA, with one line per patient per somatic mutation. Mutations are detected from the sequencing of the blood tumoral cells.
    Somatic (= acquired) mutations are those found in the tumoral cells but not in the other cells of the body.

    • ID = unique identifier per patient
    • CHR START END = position of the mutation on the human genome
    • REF ALT = reference and alternate (=mutant) nucleotide
    • GENE = the affected gene
    • PROTEIN_CHANGE = the consequence of the mutation on the protein that is expressed by a given gene
    • EFFECT = a broad categorization of the mutation consequences on a given gene.
    • VAF = Variant Allele Fraction = the proportion of cells carrying the mutation
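
The quickstart below only uses three clinical features, but both the molecular table (one row per mutation) and the CYTOGENETICS string can be turned into per-patient features. Here is a minimal, hypothetical sketch of what that could look like (the column names come from the data dictionary above; the aggregation choices and the monosomy-7 flag are mine, not the challenge's):

import pandas as pd

df = pd.read_csv("./clinical_train.csv")       # clinical data, one row per patient
maf_df = pd.read_csv("./molecular_train.csv")  # molecular data, one row per mutation

# Aggregate the mutation table to one row per patient
mol_features = (maf_df.groupby("ID")
                      .agg(N_MUTATIONS=("GENE", "size"),
                           MAX_VAF=("VAF", "max"))
                      .reset_index())

# Crude flag for monosomy 7 ("-7") in the ISCN karyotype string
df["MONOSOMY_7"] = df["CYTOGENETICS"].str.contains("-7", regex=False, na=False).astype(int)

# Merge back; patients without any recorded mutation get a count of 0
df = df.merge(mol_features, on="ID", how="left").fillna({"N_MUTATIONS": 0})

Now, on to the quickstart itself. First, install the required packages (the Rust toolchain is presumably needed to build some of them from source):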
!pip install survivalist --upgrade --no-cache-dir --verbose
!pip install scikit-survival
!pip install nnetsauce
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

import os
os.environ['PATH'] = f"/root/.cargo/bin:{os.environ['PATH']}"

!echo $PATH
!rustc --version
!cargo --version


!pip install glmnetforpython --verbose --upgrade --no-cache-dir
!pip install git+https://github.com/Techtonique/genbooster.git --upgrade --no-cache-dir
import numpy as np
import pandas as pd
import glmnetforpython as glmnet
import matplotlib.pyplot as plt
from time import time
from tqdm import tqdm
from sklearn.utils.discovery import all_estimators
from sklearn.datasets import load_iris, load_breast_cancer, load_wine, load_digits
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.tree import ExtraTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from genbooster.genboosterregressor import BoosterRegressor
from genbooster.randombagregressor import RandomBagRegressor
from survivalist.nonparametric import kaplan_meier_estimator
from survivalist.datasets import load_whas500, load_gbsg2, load_veterans_lung_cancer
from survivalist.custom import SurvivalCustom, PISurvivalCustom
from survivalist.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored, concordance_index_ipcw
from sksurv.util import Surv

# Clinical Data
df = pd.read_csv("./clinical_train.csv")
df_eval = pd.read_csv("./clinical_test.csv")

# Molecular Data
maf_df = pd.read_csv("./molecular_train.csv")
maf_eval = pd.read_csv("./molecular_test.csv")

target_df = pd.read_csv("./target_train.csv")
#target_df_test = pd.read_csv("./target_test.csv")

# Preview the data
df.head()
   ID       CENTER  BM_BLAST  WBC     ANC   MONOCYTES  HB     PLT     CYTOGENETICS
0  P132697  MSK     14.00     2.80    0.20  0.70       7.60   119.00  46,xy,del(20)(q12)[2]/46,xy[18]
1  P132698  MSK     1.00      7.40    2.40  0.10       11.60  42.00   46,xx
2  P116889  MSK     15.00     3.70    2.10  0.10       14.20  81.00   46,xy,t(3;3)(q25;q27)[8]/46,xy[12]
3  P132699  MSK     1.00      3.90    1.90  0.10       8.90   77.00   46,xy,del(3)(q26q27)[15]/46,xy[5]
4  P132700  MSK     6.00      128.00  9.70  0.90       11.10  195.00  46,xx,t(3;9)(p13;q22)[10]/46,xx[10]
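
Since we will impute missing values later on, it is worth a quick look at how much is actually missing in the clinical columns (a small check I added; not part of the original challenge snippet):

# Fraction of missing values per clinical column
print(df[['BM_BLAST', 'WBC', 'ANC', 'MONOCYTES', 'HB', 'PLT']].isna().mean().round(3))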

Step 1: Data Preparation (clinical data only)

For survival analysis, we’ll format the dataset so that OS_YEARS represents the time variable and OS_STATUS represents the event indicator.

# Convert 'OS_YEARS' to numeric if it isn't already
target_df['OS_YEARS'] = pd.to_numeric(target_df['OS_YEARS'], errors='coerce')

# Drop rows where 'OS_YEARS' or 'OS_STATUS' is NaN (e.g. if the conversion coerced any values)
target_df.dropna(subset=['OS_YEARS', 'OS_STATUS'], inplace=True)

# Check the data types ('OS_STATUS' is still float64 at this point)
print(target_df[['OS_STATUS', 'OS_YEARS']].dtypes)

# Ensure 'OS_STATUS' is boolean, as expected by sksurv/survivalist
target_df['OS_STATUS'] = target_df['OS_STATUS'].astype(bool)

# Select features
features = ['BM_BLAST', 'HB', 'PLT']
target = ['OS_YEARS', 'OS_STATUS']

# Create the survival data format (event indicator first, then time)
X = df.loc[df['ID'].isin(target_df['ID']), features]
y = Surv.from_dataframe('OS_STATUS', 'OS_YEARS', target_df)
OS_STATUS    float64
OS_YEARS     float64
dtype: object

Step 2: Splitting the Dataset

We’ll split the data into training and testing sets to evaluate the model’s performance.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Median imputation for missing values (fit on the training set only, to avoid leakage)
imputer = SimpleImputer(strategy="median")
X_train[['BM_BLAST', 'HB', 'PLT']] = imputer.fit_transform(X_train[['BM_BLAST', 'HB', 'PLT']])
X_test[['BM_BLAST', 'HB', 'PLT']] = imputer.transform(X_test[['BM_BLAST', 'HB', 'PLT']])
display(X_train.head())
display(y_train)
      BM_BLAST  HB     PLT
1048  3.00      9.10   150.00
1987  15.00     11.00  45.00
214   6.00      6.90   132.00
2135  2.00      10.00  178.00
2150  10.00     10.00  53.00

array([(False, 1.91780822), ( True, 1.28219178), ( True, 1.49041096), ...,
       (False, 8.63561644), (False, 0.47671233), (False, 1.29041096)],
      dtype=[('OS_STATUS', '?'), ('OS_YEARS', '<f8')])
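
Note that train_test_split ignores the outcome: if you want both splits to have a similar censoring rate, you can stratify on the event indicator (a variant I sometimes use; not used in the rest of this post):

# Alternative split (not used below): stratify on the event indicator so that
# the proportion of observed deaths vs. censored patients is similar in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42,
    stratify=[event for event, _ in y])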

Step 3: Cox Proportional Hazards Model

To account for censoring in survival analysis, we use a Cox Proportional Hazards (Cox PH) model, a widely used method that estimates the effect of covariates on survival times without assuming a specific baseline survival distribution. The Cox PH model is based on the hazard function \(h(t | X)\), which represents the instantaneous risk of an event (e.g., death) at time \(t\) given covariates \(X\). The model assumes that the hazard can be expressed as:

\[h(t | X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)\]

where \(h_0(t)\) is the baseline hazard function, and the \(\beta\) values are coefficients for each covariate, representing the effect of \(X\) on the hazard. Importantly, the proportional hazards assumption implies that the hazard ratios between individuals are constant over time. This approach effectively leverages both observed and censored survival times, making it more suitable for survival data than standard regression techniques that ignore censoring.
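
As a worked example (with a number invented purely for illustration): if the fitted coefficient for hemoglobin were \(\beta_{HB} = -0.1\), each additional g/dL of HB would multiply the hazard by

\[\exp(-0.1) \approx 0.905,\]

i.e. roughly a 9.5% lower instantaneous risk of death at any time \(t\), holding the other covariates fixed.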

# Initialize and train the Cox Proportional Hazards model
cox = CoxPHSurvivalAnalysis()
cox.fit(X_train, y_train)

# Evaluate the model using the IPCW concordance index
# (tau=7 is the truncation time, in years, for the IPCW estimator)
cox_cindex_train = concordance_index_ipcw(y_train, y_train, cox.predict(X_train), tau=7)[0]
cox_cindex_test = concordance_index_ipcw(y_train, y_test, cox.predict(X_test), tau=7)[0]
print(f"Cox Proportional Hazard Model Concordance Index IPCW on train: {cox_cindex_train:.5f}")
print(f"Cox Proportional Hazard Model Concordance Index IPCW on test: {cox_cindex_test:.5f}")
Cox Proportional Hazard Model Concordance Index IPCW on train: 0.66302
Cox Proportional Hazard Model Concordance Index IPCW on test: 0.66060
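
Since sksurv's CoxPHSurvivalAnalysis exposes its fitted coefficients in the coef_ attribute, we can also inspect the estimated hazard ratios (a quick addition of mine, not in the original challenge snippet):

# Log hazard ratios and hazard ratios per unit increase of each feature
coefs = pd.Series(cox.coef_, index=X_train.columns)
print(pd.DataFrame({"log HR": coefs, "HR": np.exp(coefs)}))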

Step 4: Other Models

4 – 1 Demo

survivalist's PISurvivalCustom wraps a scikit-learn-style regressor into a survival model with uncertainty quantification (type_pi="kde" requests kernel-density-based prediction intervals). Below, we use an XGBoost base learner and compare the predicted survival curve of the first test patient, with its prediction band, to the Kaplan-Meier estimate.

import xgboost as xgb
import lightgbm as lgb

# Kaplan-Meier estimate on the test set, with log-log confidence bands
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time,
                            conf_type="log-log")

# Initialize and train an XGBoost-based survival model with prediction intervals
estimator = PISurvivalCustom(regr=xgb.XGBRegressor(),
                             type_pi="kde")

estimator.fit(X_train, y_train)

surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])

# Plot the predicted survival curve and its prediction band
# against the Kaplan-Meier estimate and its confidence band
for fn in surv_funcs.mean:
    plt.step(fn.x, fn(fn.x), where="post", label="predicted")
    plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y, alpha=0.25, color="lightblue", step="post")
    plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
    plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
    plt.ylim(0, 1)
    plt.legend()
    plt.show()

# Evaluate the point predictions and the interval bounds using the IPCW concordance index
preds = estimator.predict(X_test)
xgb_cindex_mean = concordance_index_ipcw(y_train, y_test, preds.mean, tau=7)[0]
print(f"XGBRegressor Concordance Index IPCW on test (mean): {xgb_cindex_mean:.5f}")
xgb_cindex_lower = concordance_index_ipcw(y_train, y_test, preds.lower, tau=7)[0]
print(f"XGBRegressor Concordance Index IPCW on test (lower): {xgb_cindex_lower:.5f}")
xgb_cindex_upper = concordance_index_ipcw(y_train, y_test, preds.upper, tau=7)[0]
print(f"XGBRegressor Concordance Index IPCW on test (upper): {xgb_cindex_upper:.5f}")

(Figure: predicted survival curve and prediction band for the first test patient, XGBoost base learner, vs. the Kaplan-Meier estimate.)

XGBRegressor Concordance Index IPCW on test (mean): 0.60130
XGBRegressor Concordance Index IPCW on test (lower): 0.60106
XGBRegressor Concordance Index IPCW on test (upper): 0.59588
# Same demo, with a LightGBM base learner (lightgbm was imported above)
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time,
                            conf_type="log-log")
estimator = PISurvivalCustom(regr=lgb.LGBMRegressor(verbose=0),
                             type_pi="kde")

estimator.fit(X_train, y_train)

surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])

# Plot the predicted survival curve and its prediction band
# against the Kaplan-Meier estimate and its confidence band
for fn in surv_funcs.mean:
    plt.step(fn.x, fn(fn.x), where="post", label="predicted")
    plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y, alpha=0.25, color="lightblue", step="post")
    plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
    plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
    plt.ylim(0, 1)
    plt.legend()
    plt.show()

# Evaluate the point predictions and the interval bounds using the IPCW concordance index
preds = estimator.predict(X_test)
lgbm_cindex_mean = concordance_index_ipcw(y_train, y_test, preds.mean, tau=7)[0]
print(f"LGBMRegressor Concordance Index IPCW on test (mean): {lgbm_cindex_mean:.5f}")
lgbm_cindex_lower = concordance_index_ipcw(y_train, y_test, preds.lower, tau=7)[0]
print(f"LGBMRegressor Concordance Index IPCW on test (lower): {lgbm_cindex_lower:.5f}")
lgbm_cindex_upper = concordance_index_ipcw(y_train, y_test, preds.upper, tau=7)[0]
print(f"LGBMRegressor Concordance Index IPCW on test (upper): {lgbm_cindex_upper:.5f}")

(Figure: predicted survival curve and prediction band for the first test patient, LightGBM base learner, vs. the Kaplan-Meier estimate.)

LGBMRegressor Concordance Index IPCW on test (mean): 0.61745
LGBMRegressor Concordance Index IPCW on test (lower): 0.62040
LGBMRegressor Concordance Index IPCW on test (upper): 0.61757
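
The imports above also include survivalist's GradientBoostingSurvivalAnalysis, which is not used in this post; since survivalist derives from scikit-survival, it should follow the same fit/predict API, so a direct fit would look like this (a sketch under that assumption, untested here):

# Gradient boosting survival model from survivalist, used directly
estimator_gbs = GradientBoostingSurvivalAnalysis(random_state=42)
estimator_gbs.fit(X_train, y_train)
print(f"GradientBoostingSurvivalAnalysis Concordance Index IPCW on test: "
      f"{concordance_index_ipcw(y_train, y_test, estimator_gbs.predict(X_test), tau=7)[0]:.5f}")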

4 – 2 Models Galore

We now loop over (almost) all scikit-learn regressors, plus XGBoost and LightGBM, and use each one as a base learner in three ways: as-is, wrapped in nnetsauce's CustomRegressor (which augments the inputs with a randomized hidden layer), and wrapped in genbooster's RandomBagRegressor (a bagging ensemble). Base learners that fail to fit are simply skipped.

# Loop over scikit-learn regressors (plus XGBoost and LightGBM)
import nnetsauce as ns
from functools import partial

# Get all regressors from scikit-learn
regressors = [est for est in all_estimators() if 'Regressor' in est[0]]

# Append xgb.XGBRegressor and lgb.LGBMRegressor as (name, class) tuples
regressors += [('XGBRegressor', xgb.XGBRegressor),
               ('LGBMRegressor', partial(lgb.LGBMRegressor, verbose=0))]

results = []

for name, Regressor in tqdm(regressors):

    print("\n\n ----- base learner", name)

    try:

      # 1. Base learner used as-is
      estimator = PISurvivalCustom(regr=Regressor(), type_pi="kde")
      estimator.fit(X_train, y_train)
      # Make predictions and evaluate the model
      y_pred = estimator.predict(X_test.iloc[:])
      c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
      c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
      c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
      print("\n c_index", c_index)
      results.append([name, c_index, c_index_lower, c_index_upper])

      # 2. Base learner wrapped in nnetsauce's CustomRegressor
      estimator = PISurvivalCustom(regr=ns.CustomRegressor(Regressor()), type_pi="kde")
      estimator.fit(X_train, y_train)
      # Make predictions and evaluate the model
      y_pred = estimator.predict(X_test.iloc[:])
      c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
      c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
      c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
      print("\n c_index", c_index)
      results.append(["custom" + name, c_index, c_index_lower, c_index_upper])

      # 3. Base learner wrapped in genbooster's RandomBagRegressor (bagging)
      estimator = PISurvivalCustom(regr=RandomBagRegressor(Regressor()), type_pi="kde")
      estimator.fit(X_train, y_train)
      # Make predictions and evaluate the model
      y_pred = estimator.predict(X_test.iloc[:])
      c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
      c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
      c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
      print("\n c_index", c_index)
      results.append(["bagging" + name, c_index, c_index_lower, c_index_upper])

    except Exception:
      # Skip base learners that fail to fit or predict
      continue
pd.options.display.float_format = '{:.5f}'.format
results_df = pd.DataFrame(results, columns=['Regressor', 'Concordance Index IPCW', 'lower bound', 'upper bound'])
results_df.drop(columns=['lower bound', 'upper bound'], inplace=True)
results_df.sort_values(by='Concordance Index IPCW', ascending=False, inplace=True)
results_df
    Regressor                              Concordance Index IPCW
35  baggingPassiveAggressiveRegressor      0.66627
39  baggingRANSACRegressor                 0.66416
26  baggingHuberRegressor                  0.66340
48  baggingTheilSenRegressor               0.66338
51  baggingTransformedTargetRegressor      0.66288
47  customTheilSenRegressor                0.66249
52  TweedieRegressor                       0.66238
24  HuberRegressor                         0.66225
46  TheilSenRegressor                      0.66224
45  baggingSGDRegressor                    0.66125
25  customHuberRegressor                   0.66078
32  baggingMLPRegressor                    0.66022
50  customTransformedTargetRegressor       0.66021
54  baggingTweedieRegressor                0.65999
49  TransformedTargetRegressor             0.65943
53  customTweedieRegressor                 0.65934
2   baggingAdaBoostRegressor               0.65665
37  RANSACRegressor                        0.65656
31  customMLPRegressor                     0.65547
44  customSGDRegressor                     0.65448
20  baggingGradientBoostingRegressor       0.64801
1   customAdaBoostRegressor                0.64506
0   AdaBoostRegressor                      0.64473
42  baggingRandomForestRegressor           0.63564
18  GradientBoostingRegressor              0.63519
5   baggingBaggingRegressor                0.63496
19  customGradientBoostingRegressor        0.63449
59  baggingLGBMRegressor                   0.63271
23  baggingHistGradientBoostingRegressor   0.63223
28  customKNeighborsRegressor              0.62687
14  baggingExtraTreesRegressor             0.62214
30  MLPRegressor                           0.62160
57  LGBMRegressor                          0.62069
21  HistGradientBoostingRegressor          0.61921
13  customExtraTreesRegressor              0.61834
58  customLGBMRegressor                    0.61702
29  baggingKNeighborsRegressor             0.61671
4   customBaggingRegressor                 0.61650
12  ExtraTreesRegressor                    0.61259
3   BaggingRegressor                       0.61154
36  QuantileRegressor                      0.61013
41  customRandomForestRegressor            0.60759
40  RandomForestRegressor                  0.60749
38  customRANSACRegressor                  0.60653
33  PassiveAggressiveRegressor             0.60526
11  baggingExtraTreeRegressor              0.60368
56  customXGBRegressor                     0.60044
22  customHistGradientBoostingRegressor    0.60003
8   baggingDecisionTreeRegressor           0.59707
27  KNeighborsRegressor                    0.59139
55  XGBRegressor                           0.58723
34  customPassiveAggressiveRegressor       0.58205
9   ExtraTreeRegressor                     0.57820
6   DecisionTreeRegressor                  0.56057
10  customExtraTreeRegressor               0.55844
7   customDecisionTreeRegressor            0.55117
15  GaussianProcessRegressor               0.53342
16  customGaussianProcessRegressor         0.52007
17  baggingGaussianProcessRegressor        0.49695
43  SGDRegressor                           0.40105


results_df.shape
(60, 2)
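
The bagged robust linear base learners top the ranking, slightly above the Cox PH baseline (0.66060 on test). To close the loop, here is a hypothetical sketch of how one could refit the best configuration on the full training set and score the challenge's test patients (the submission file name and column names are assumptions; check the challenge page for the exact requirements):

from sklearn.linear_model import PassiveAggressiveRegressor

# Refit the best configuration on the full training set
imputer_full = SimpleImputer(strategy="median")
X_full = pd.DataFrame(imputer_full.fit_transform(X), columns=features)
best = PISurvivalCustom(regr=RandomBagRegressor(PassiveAggressiveRegressor()),
                        type_pi="kde")
best.fit(X_full, y)

# Score the challenge's test patients (higher predicted value = higher risk,
# as in the concordance computations above)
X_eval = pd.DataFrame(imputer_full.transform(df_eval[features]), columns=features)
submission = pd.DataFrame({"ID": df_eval["ID"],
                           "risk_score": best.predict(X_eval).mean})
submission.to_csv("submission.csv", index=False)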