Increasing The Accuracy of Predictive Models with Stacked Ensemble Techniques: Healthcare Example

Sriram Parthasarathy
Published in Product Coalition
Mar 21, 2022


Cancer predictive models have long been used to improve early diagnosis and to better identify ideal treatment and monitoring. Terms such as "XGBoost" or "Convolutional Neural Networks," once used only by computer science Ph.D.s, are now part of the vocabulary of CXOs and frontline clinicians who want to take advantage of higher-performance computing and advances in genomics and biomarkers to improve patient outcomes. Rising healthcare costs are driving demand for big-data-driven healthcare applications with greater efficiency and better economics.

Top chronic diseases

Heart disease, cancer, and diabetes are the leading drivers of the nation's $3.8 trillion in annual health care costs.

According to the CDC, $208.9 billion was spent on cancer treatment in 2020, including $29.8 billion on breast cancer.

For women, the three most common cancers are breast, lung, and colorectal; together they account for an estimated 50% of all new cancer diagnoses in women in 2020.

Breast cancer is one of the most common cancers in women in the United States. About 1 in 8 women will develop breast cancer in her lifetime, and about 1 in 3 of all new cancers diagnosed in women each year is breast cancer. A family history of breast cancer doubles the risk, yet the majority of women (85%) diagnosed with breast cancer have no known family history of the disease. Certain gene mutations also increase the risk: for example, 5 to 10% of breast cancers are linked to mutations in the BRCA1 and BRCA2 genes.

In this article, I will use a breast cancer use case to build a predictive model and improve its accuracy using stacked ensemble techniques.

Data assembly

One of the most important parts of model development is assembling and cleaning the data. Roughly 80% of healthcare data is unstructured and difficult to extract, use, and analyze. To build a realistic model, one needs to automate the process of reading clinical notes and extracting important concepts such as diagnoses, symptoms, treatments, procedures, adverse events, and biomarkers, many of which are described mainly in the notes along with their context.

Extracting meaningful information from a note with context is a hard problem to solve.

For example, consider four sentences with different contexts: 1. Patient has breast cancer. 2. Patient is negative for breast cancer. 3. Patient's mom has breast cancer. 4. No traces of breast cancer tissue were found. All four reference breast cancer, but how it is described is critical for determining who the subject is (the patient or someone else) and whether that subject actually has cancer. You can read more about this in the article Extracting Insights From Clinical Notes using NLP techniques.

In addition, this unstructured data needs to be merged with structured data such as demographics, medications, labs, and the problem list. Merging, cleaning, and transforming data is an important part of the routine; you can read more about this in the article Practical Strategies to Handle Missing Values & transforming the data. Often in healthcare problems, one class is far more abundant than the other; for example, relatively few people have cancer. This is called a class imbalance problem, and you can read about it in Techniques for handling biased data.

In the next section we will discuss the model details.

Traditional Model

Traditionally, a single machine learning algorithm is used to solve a predictive problem. The problem with that approach is that, for complex problems, it may not work well because of constraints in the parameters, the data format, and so on. This is why using a diverse set of models helps achieve better performance.

Using multiple models

The approach here is to train multiple models and use each of them to create predictions. So if you have 3 models, there will be 3 predicted values.

The next question is how to merge these predictions. One common approach is a majority vote: if two of the three classifiers predict Yes, the final answer is Yes; if two predict No, the final answer is No. It is a simple approach in which we trust that the majority wins.
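As a quick sketch of majority voting in R (the prediction vectors below are made up for illustration):

```r
# Hypothetical Yes/No predictions from three classifiers for five patients
pred_gbm <- c("Yes", "No", "Yes", "No", "Yes")
pred_rf  <- c("Yes", "No", "No",  "No", "Yes")
pred_lr  <- c("No",  "No", "Yes", "Yes", "Yes")

# Count the Yes votes per patient; 2 or more out of 3 wins
votes <- (pred_gbm == "Yes") + (pred_rf == "Yes") + (pred_lr == "Yes")
final <- ifelse(votes >= 2, "Yes", "No")
final  # "Yes" "No" "Yes" "No" "Yes"
```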

Instead, what if we train another model that takes the 3 predictions from the 3 models, along with the actual value, and learns how to predict the final outcome? The new model would learn to produce the final prediction based on the individual models' predictions. Note that this new model is not given the original raw inputs.
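A minimal sketch of that idea in base R, using simulated base-model outputs rather than the H2O models trained later: a logistic regression meta-learner is fit on the three predictions and the actual labels only.

```r
set.seed(5)
n <- 200
actual <- rbinom(n, 1, 0.5)            # simulated true labels

# Simulated probability predictions from three imperfect base models
p1 <- plogis(2 * actual - 1 + rnorm(n))
p2 <- plogis(2 * actual - 1 + rnorm(n))
p3 <- plogis(2 * actual - 1 + rnorm(n))

# Meta-learner: sees only the base predictions, not the raw inputs
meta <- glm(actual ~ p1 + p2 + p3, family = binomial)
final_pred <- as.integer(predict(meta, type = "response") > 0.5)
mean(final_pred == actual)             # stacked accuracy on the training data
```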

Using an algorithm to learn from the predicted values

Stacking Ensemble Model

Putting it all together, this process is called stacking: multiple base models make the initial predictions, and a new meta-learner model then predicts the final value. It is a technique in which the ensemble machine learning algorithm tries to learn how to best combine the 3 predictions from the 3 models. Note that we used two levels of stacking here; more than two levels can also be used.

Data Set details

For today's example, I will use the Breast Cancer Wisconsin data set. The columns in the data set are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image.

In this data set, our goal is to predict whether the cancer is benign or malignant based on the image-derived factors present as columns.

Step 1: Data preparation

Using R, I read the data set, did a minor cleanup, and removed the patient ID field, which is not needed for training (ID fields are generally excluded from the training process). I split the data into 75% for training and 25% for testing; typical train/test splits are 70/30, 75/25, or 80/20.
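A sketch of the split in base R (the data frame here is a stand-in for the cleaned data set, with the ID column already dropped):

```r
set.seed(5)
# df stands in for the cleaned breast-cancer data
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100),
                 diagnosis = sample(c("B", "M"), 100, replace = TRUE))

# 75/25 train/test split on row indices
train_idx    <- sample(nrow(df), size = 0.75 * nrow(df))
trainingdata <- df[train_idx, ]
testdata     <- df[-train_idx, ]
c(nrow(trainingdata), nrow(testdata))  # 75 25
```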

Here are the attributes for this data set

Here is a peek at the sample data set

Step 2: Build out the base learner models

Build out 3 base models: Random Forest, GBM, and a GLM (logistic regression). I have included the R code used to train the 3 models as well as the ensemble model; readers who only want an overall picture of the process can skip the code.

For this experiment, I used the R language to build the models; you can also use Python or Scala for this purpose. I used the H2O library to train the 3 models (Random Forest, GBM, and GLM) as well as the ensemble model. Please make sure you have a recent Java SDK installed on your machine so that H2O, called from RStudio, can use it.
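The snippets below refer to x, y, and trainingdata; a sketch of how they can be set up with H2O is shown here. The file name and the diagnosis column name are placeholders, and a running local H2O cluster (backed by Java) is assumed.

```r
library(h2o)
h2o.init()  # starts a local H2O cluster; requires Java

# Import the data as an H2OFrame (file name is a placeholder)
trainingdata <- h2o.importFile("breast_cancer_wisconsin_train.csv")

# y is the response column; x is the remaining predictor columns
y <- "diagnosis"
x <- setdiff(colnames(trainingdata), y)
trainingdata[, y] <- as.factor(trainingdata[, y])  # classification target
```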

GBM model

The R code snippet for training the GBM model is shown below.

Model_gbm <- h2o.gbm(x = x,
                     y = y,
                     training_frame = trainingdata,
                     nfolds = 5,
                     keep_cross_validation_predictions = TRUE,
                     seed = 5)

Accuracy from this model is 0.981692677070828

Random Forest model

The R code for training the Random Forest model is shown below.

Model_rf <- h2o.randomForest(x = x,
                             y = y,
                             training_frame = trainingdata,
                             nfolds = 5,
                             keep_cross_validation_predictions = TRUE,
                             seed = 5)

Accuracy from the Random Forest model is 0.987620048019208

Logistic Regression (GLM) model

The R code snippet for training the GLM (logistic regression) model is shown below.

Model_lr <- h2o.glm(x = x,
                    y = y,
                    training_frame = trainingdata,
                    family = "binomial",
                    nfolds = 5,
                    keep_cross_validation_predictions = TRUE,
                    seed = 5)

Accuracy from the GLM (logistic regression) model is 0.986644657863145

The highest accuracy among the 3 models is 0.987620048019208 (Random Forest).

The question is: can the ensemble model's accuracy be higher than this? Let's see.

Step 3: Train the ensemble model

The R code snippet for training the ensemble model is shown below.

ensemble <- h2o.stackedEnsemble(x = x,
                                y = y,
                                training_frame = trainingdata,
                                base_models = list(Model_gbm, Model_rf, Model_lr))

Accuracy from the ensemble model is 0.986644657863145
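As a sketch, accuracies like those quoted above can be read off with h2o.performance on held-out data (testdata here is assumed to be the 25% test split created earlier):

```r
# Evaluate the ensemble and each base model on the held-out test frame
perf <- h2o.performance(ensemble, newdata = testdata)
h2o.auc(perf)  # area under the ROC curve on the test set

for (m in list(Model_gbm, Model_rf, Model_lr, ensemble)) {
  print(h2o.auc(h2o.performance(m, newdata = testdata)))
}
```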

Accuracy comparison

All 3 base models did quite well, and the ensemble's accuracy was on par with the best of them. It is recommended to try a stacking model to see whether the accuracy goes up.

In summary, the ensemble machine learning algorithm tries to learn how to best use the predictions from models trained on the original data. It leverages the capabilities of multiple models and can make predictions better than any single model. It's a good technique to experiment with for any predictive problem.
