{ "cells": [ { "cell_type": "markdown", "id": "7c726394-3d15-41d2-8425-360d89b7e88a", "metadata": {}, "source": [ "#### Interfacing with scikit-learn\n", "Using survivalpredict's pipeline to interface with the sklearn ecosystem.\n", "\n", "Survivalpredict's `SklearnSurvivalPipeline` class lets a survival model plug directly into the wider scikit-learn ecosystem, including the tools built around hyperparameter tuning, feature selection, and scaling out compute for scikit-learn compatible code." ] }, { "cell_type": "code", "execution_count": 1, "id": "eb50477c-b96d-4d73-90af-656a7c5fd9e9", "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "import numpy as np\n", "import itertools\n", "import pandas as pd\n", "\n", "from sklearn.model_selection import GridSearchCV, cross_val_score\n", "from sklearn.compose import make_column_transformer, make_column_selector\n", "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "\n", "from mlxtend.feature_selection import SequentialFeatureSelector\n", "from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs\n", "\n", "from survivalpredict.estimators import CoxProportionalHazard, KaplanMeierSurvivalEstimator\n", "from survivalpredict.metrics import integrated_brier_score_administrative_sklearn_scorer\n", "from survivalpredict.pipeline import SklearnSurvivalPipeline, build_sklearn_pipeline_target, make_sklearn_survival_pipeline\n", "from survivalpredict.strata_preprocessing import StrataBuilderEncoder, StrataColumnTransformer, make_strata_column_transformer\n", "\n", "from SurvSet.data import SurvLoader\n", "loader = SurvLoader()" ] }, { "cell_type": "markdown", "id": "3d469ca7-e7de-4547-8efd-d9b72b6223f9", "metadata": {}, "source": [ "Let's start by loading in the data, using `SurvSet`. Notice that we are keeping the data as a dataframe. 
This retains the feature names, so we can refer to specific strata columns by name when tuning for the best-performing strata." ] }, { "cell_type": "code", "execution_count": 2, "id": "ba37d084-091a-4c89-b33f-90b7872a06d1", "metadata": {}, "outputs": [], "source": [ "df = loader.load_dataset(\"support2\")[\"df\"]\n", "df = df.dropna()\n", "\n", "times = df[\"time\"].to_numpy().astype(np.int64)\n", "events = df[\"event\"].to_numpy().astype(np.bool_)\n", "X_df = df[list(set(df.columns).difference(set((\"pid\", \"event\", \"time\"))))]" ] }, { "cell_type": "markdown", "id": "fdd1245e-7fc4-4159-89d2-91e2d2c9cc20", "metadata": {}, "source": [ "Next, we determine the maximum point in time at which the model is evaluated. As time goes on, a smaller and smaller percentage of individuals is still in the study. Including these tail-end time points in the integrated Brier score would artificially inflate the score and obscure improvements in performance at the time points that cover most individuals. Here we cap the evaluation at the 85th percentile of the observed times." ] }, { "cell_type": "code", "execution_count": 3, "id": "253f8c06-3b87-41b9-ad83-cf2df896e2b2", "metadata": {}, "outputs": [], "source": [ "max_time = np.percentile(times, 85).round().astype(np.int64)" ] }, { "cell_type": "markdown", "id": "72bd2699-5e58-4ea3-a73f-7c2c57bc71fb", "metadata": {}, "source": [ "The `build_sklearn_pipeline_target` function combines the non-feature vectors, such as `times`, `events`, `times_start` (for left censoring), and predetermined `strata`, into a single vector that can be passed into scikit-learn classes. Its output serves as the `y` vector for the `SklearnSurvivalPipeline` class.\n", "\n", "Next, we will build a pipeline that turns the `fac_sex` and `fac_income` columns into the strata, preprocesses the remaining features, and finally trains a Cox model." 
] }, { "cell_type": "code", "execution_count": 4, "id": "c455d769-e5c0-4400-804c-8ec0e045c994", "metadata": {}, "outputs": [], "source": [ "y = build_sklearn_pipeline_target(times, events)\n", "\n", "strata_transformers = make_strata_column_transformer(\n", " (StrataBuilderEncoder(), [\"fac_sex\", \"fac_income\"])\n", ")\n", "\n", "column_transformers = make_column_transformer(\n", " (StandardScaler(), make_column_selector(dtype_include=np.number)),\n", " (OneHotEncoder(), make_column_selector(dtype_include=[object, \"string\"])),\n", ")\n", "\n", "pipeline_cox = make_sklearn_survival_pipeline(\n", " strata_transformers,\n", " column_transformers,\n", " CoxProportionalHazard(),\n", " max_time=max_time,\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "id": "42351e79-b3c9-447d-b7de-33baed5db6af", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
SklearnSurvivalPipeline(max_time=818,\n",
" steps=[('stratacolumntransformer',\n",
" StrataColumnTransformer(strata_transformers=[('stratabuilderencoder',\n",
" StrataBuilderEncoder(),\n",
" ['fac_sex',\n",
" 'fac_income'])],\n",
" stratabuilderencoder__columns=['fac_sex',\n",
" 'fac_income'])),\n",
" ('columntransformer',\n",
" ColumnTransformer(transformers=[('standardscaler',\n",
" StandardScaler(),\n",
" <sklearn.compose._column_transformer.make_column_selector object at 0x7f9320e88440>),\n",
" ('onehotencoder',\n",
" OneHotEncoder(),\n",
" <sklearn.compose._column_transformer.make_column_selector object at 0x7f9320fbdf90>)])),\n",
" ('coxproportionalhazard',\n",
" CoxProportionalHazard())])"
| | params | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|
| 101 | {'coxproportionalhazard__alpha': 10, 'strataco... | -119.853315 | 5.188538 | 1 |
| 68 | {'coxproportionalhazard__alpha': 10, 'strataco... | -120.893513 | 6.171047 | 2 |
| 2 | {'coxproportionalhazard__alpha': 0, 'stratacol... | -121.189043 | 5.495687 | 3 |
| 134 | {'coxproportionalhazard__alpha': 25, 'strataco... | -121.211370 | 7.280129 | 4 |
| 233 | {'coxproportionalhazard__alpha': 50, 'strataco... | -121.580471 | 8.243867 | 5 |
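
The `params` column above is truncated, but its visible keys follow scikit-learn's `<step>__<parameter>` naming. As a rough sketch of how a grid over regularization strengths and strata subsets could be enumerated: the `alpha` values are only the ones visible in the top rows of the table (the actual grid may have included more), and the full strata key is an assumption pieced together from the pipeline repr, not a name confirmed by the truncated output.

```python
import itertools

# Strata columns used earlier in this notebook.
candidate_strata = ["fac_sex", "fac_income"]

# Every non-empty subset of the candidate columns, each as a list of names.
strata_subsets = [
    list(combo)
    for size in range(1, len(candidate_strata) + 1)
    for combo in itertools.combinations(candidate_strata, size)
]

# Keys follow scikit-learn's `<step>__<parameter>` convention.
# ASSUMPTION: the strata key below is inferred from the pipeline repr
# (step 'stratacolumntransformer', parameter 'stratabuilderencoder__columns').
param_grid = {
    "coxproportionalhazard__alpha": [0, 10, 25, 50],
    "stratacolumntransformer__stratabuilderencoder__columns": strata_subsets,
}

# Number of parameter combinations an exhaustive search would evaluate.
n_candidates = 1
for values in param_grid.values():
    n_candidates *= len(values)

print(strata_subsets)
print(n_candidates)
```

A dictionary of this shape is what `GridSearchCV` accepts as `param_grid`. Note that in the table rank 1 goes to the score closest to zero, which is consistent with a scorer that negates a lower-is-better metric.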
SequentialFeatureSelector(estimator=SklearnSurvivalPipeline(max_time=818,\n",
" steps=[('coxproportionalhazard',\n",
" CoxProportionalHazard())]),\n",
" forward=False, k_features=(1, 1), n_jobs=5,\n",
" scoring=make_scorer(integrated_brier_score_administrative_sklearn_metric, greater_is_better=False, response_method='predict'))"
| | feature_idx | cv_scores | avg_score | feature_names |
|---|---|---|---|---|
| 7 | (0, 3, 4, 15, 17, 18, 19) | [-143.56667782179701, -136.25594055650689, -15... | -144.112512 | (0, 3, 4, 15, 17, 18, 19) |
| 6 | (0, 3, 4, 17, 18, 19) | [-143.83516942714164, -135.87268885869483, -15... | -144.263809 | (0, 3, 4, 17, 18, 19) |
| 8 | (0, 1, 3, 4, 15, 17, 18, 19) | [-141.46966297859967, -138.35774779284924, -15... | -144.270329 | (0, 1, 3, 4, 15, 17, 18, 19) |
| 9 | (0, 1, 3, 4, 15, 17, 18, 19, 23) | [-149.19738821435038, -135.50584628943156, -15... | -144.318642 | (0, 1, 3, 4, 15, 17, 18, 19, 23) |
| 10 | (0, 1, 3, 4, 15, 16, 17, 18, 19, 23) | [-149.00292841678186, -134.64053897377698, -15... | -144.356467 | (0, 1, 3, 4, 15, 16, 17, 18, 19, 23) |
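
The rows above can be ranked directly by `avg_score`. A minimal sketch, using only the five rows shown and assuming the scores are negated values of a lower-is-better metric (as the `greater_is_better=False` scorer in the selector repr suggests), so that the largest value marks the best subset:

```python
# avg_score keyed by the number of retained features, copied from the rows above.
avg_scores = {
    7: -144.112512,
    6: -144.263809,
    8: -144.270329,
    9: -144.318642,
    10: -144.356467,
}

# Column indices retained at each subset size, also copied from the table.
feature_idx = {
    7: (0, 3, 4, 15, 17, 18, 19),
    6: (0, 3, 4, 17, 18, 19),
    8: (0, 1, 3, 4, 15, 17, 18, 19),
    9: (0, 1, 3, 4, 15, 17, 18, 19, 23),
    10: (0, 1, 3, 4, 15, 16, 17, 18, 19, 23),
}

# The scorer negates the metric, so the highest avg_score is the best subset.
best_k = max(avg_scores, key=avg_scores.get)

# Flip the sign back to recover the value of the underlying metric.
best_metric = -avg_scores[best_k]

print(best_k, feature_idx[best_k], best_metric)
```

On a fitted mlxtend selector, the full version of this table is available through `get_metric_dict()`, so the same ranking can be applied to every subset size rather than the five shown here.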