A Machine Learning Approach to Predicting MGMT Methylation Status in Glioblastoma Patients | by Jarrett Evans | Jul, 2023

Glioblastoma (GBM) tumors need to be treated efficiently and effectively because of how deadly they are for the patients who develop them. They have a median survival of 14–16 months and account for about 45% of all malignant central nervous system tumors.


The team used a two-stage approach to predict MGMT methylation status: first eliminating noisy radiomics features, then embedding classification algorithms in a genetic algorithm to help identify the best predictive features.

Several machine-learning techniques were tested during this research, with the aim of finding the most meaningful radiomics features for prediction. The features were extracted from multimodal magnetic resonance imaging (MRI) scans. The two-stage feature selection method started with an eXtreme Gradient Boosting (XGBoost) model, followed by a genetic algorithm (GA)-based wrapper model. GA models work in a similar fashion to natural selection: the ‘fittest’ set of features for prediction is identified.

The data were preprocessed and segmented multimodal MRI features from The Cancer Genome Atlas. In total, 53 GBM patients were included and 704 radiomics features were obtained.

The genetic algorithm workflow consisted of six steps: generation of the initial population, fitness assessment, selection of parents, crossover, mutation, and population replacement for the next generation. The formula used for selection probability (where features are selected based on their performance in the fitness assessment stage) is shown below:

Selection Probability Formula
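
The six steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: it assumes fitness-proportionate (roulette-wheel) selection, where the probability of picking individual i is its fitness divided by the total fitness of the population, and it swaps the real classifier for a made-up `toy_fitness` function. All function names, the population size, the mutation rate, and the "informative" feature indices are hypothetical.

```python
import random


def selection_probabilities(fitnesses):
    """Assumed roulette-wheel rule: p_i = f_i / sum of all f."""
    total = sum(fitnesses)
    if total == 0:
        # No individual has any fitness, so fall back to a uniform choice.
        return [1.0 / len(fitnesses)] * len(fitnesses)
    return [f / total for f in fitnesses]


def toy_fitness(mask, informative=frozenset({0, 3, 7})):
    """Hypothetical stand-in for classifier performance: rewards selecting the
    'informative' feature indices and lightly penalizes every extra feature."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in informative)
    extras = sum(mask) - hits
    return max(hits - 0.1 * extras, 0.0)


def run_ga(n_features=10, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # 1. Generate the initial population of random feature masks (1 = feature kept).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=toy_fitness)
    for _ in range(generations):
        # 2. Fitness assessment of every candidate feature subset.
        fitnesses = [toy_fitness(mask) for mask in population]
        probs = selection_probabilities(fitnesses)
        next_population = []
        while len(next_population) < pop_size:
            # 3. Selection of two parents, weighted by selection probability.
            mom, dad = rng.choices(population, weights=probs, k=2)
            # 4. Single-point crossover.
            cut = rng.randint(1, n_features - 1)
            child = mom[:cut] + dad[cut:]
            # 5. Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        # 6. Population replacement for the next generation.
        population = next_population
        best = max(population + [best], key=toy_fitness)
    return best, toy_fitness(best)


best_mask, best_score = run_ga()
print(best_mask, best_score)
```

In the study, the fitness assessment step would be the classification performance of RF, XGBoost, or SVM on the selected features rather than a toy scoring function.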

Once the initial features had been selected by the XGBoost algorithm, the next step was classification: using the features to predict whether each patient fell into the MGMT Methylated or Unmethylated class, with that prediction serving as the fitness assessment. The team tried three different machine learning algorithms in the fitness assessment piece of the genetic algorithm workflow: Random Forest (RF), XGBoost, and Support Vector Machines (SVM). To implement the algorithms, they used the Python machine learning library scikit-learn.


Once they had trained the three versions of the algorithm, they assessed model performance with three measurements: accuracy, specificity, and recall (also called sensitivity). To compare performance, they took the mean over 20 runs of cross-validation and used a Kolmogorov-Smirnov test for evaluation.
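
For readers less familiar with these three measurements, here is a minimal sketch of how they fall out of a confusion matrix. The labels and predictions are invented for illustration (1 standing for methylated, 0 for unmethylated), and the function names are hypothetical rather than taken from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        # Share of all patients classified correctly.
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Sensitivity (recall): share of positive cases that were found.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of negative cases that were correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Made-up example: 4 methylated (1) and 6 unmethylated (0) patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```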

The best result was achieved by the Random Forest algorithm incorporated into the Genetic Algorithm (GA-RF). This technique outperformed the others in all of the evaluation metrics with an accuracy of 0.925, a sensitivity of 0.894, and a specificity of 0.966.

After the GA-RF model finished, there was an optimal subset of 22 radiomics features: 17 textural features, 3 histogram-based features, 1 volume feature, and 1 intensity feature. Textural features can help clinicians by reflecting spatial intensity correlations and distributions of voxels, which could help quantify the “multiregional variations” in blood flow, edema, and necrosis. Histogram features illustrate the frequency distribution of the intensity values that occur in an image.

Expanded Use-Case

To test the wider applicability of the model, the team used the learned features on a new dataset, this time for patients with Low-Grade Gliomas (LGG). They applied the learned features directly, without any additional feature selection.

The results for the GA-RF model on the LGG dataset were an accuracy of 0.75, a sensitivity of 0.78, and a specificity of 0.62. Given that no transfer learning or fine-tuning was done, these are promising results.

The performance obtained with these features on the LGG dataset suggests that they could potentially be reused for other similar diseases.


A potential limitation of this technique is that with a small number of patients, researchers may experience a high-dimensionality-related problem. This occurs when the number of features is high compared to the amount of training data that is available. If this happens it can be challenging to learn accurate relationships between the data and the target variable.

High-dimensionality problems are a widespread issue within radiomics, not just in this study. To mitigate this limitation, the team used cross-validation to be more confident in the results they were obtaining.

Cross-validation Overview

Cross-validation works by splitting the data into different training and testing groups for each iteration of the validation. By doing this multiple times and taking the mean of the results, you can be more confident that your model will repeatedly obtain the results it is giving.
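
As a rough sketch of the idea, the snippet below repeats a k-fold split 20 times and averages the scores. The 53 samples and the 20 repetitions mirror the numbers mentioned in this article, but the number of folds (5), the class balance of the labels, and the trivial majority-class "model" are all assumptions made purely for illustration.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and yield (train, test) index lists per fold."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def majority_class_score(train_idx, test_idx, labels):
    """Hypothetical stand-in model: always predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return mean(1.0 if labels[i] == majority else 0.0 for i in test_idx)


def repeated_cv(labels, k=5, repeats=20, seed=0):
    rng = random.Random(seed)
    run_means = []
    for _ in range(repeats):
        # A fresh shuffle gives different train/test groups on every repetition.
        fold_scores = [majority_class_score(train, test, labels)
                       for train, test in k_fold_indices(len(labels), k, rng)]
        run_means.append(mean(fold_scores))
    # The mean over all repetitions is the figure that gets reported.
    return mean(run_means), run_means


labels = [1] * 30 + [0] * 23  # 53 made-up labels; the class balance is invented
mean_accuracy, run_means = repeated_cv(labels)
print(round(mean_accuracy, 3))
```

Swapping `majority_class_score` for a real classifier's train-and-evaluate step would give the kind of averaged accuracy, sensitivity, and specificity figures described earlier.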

In cancer treatment, clinicians rely on tumor characteristics and grades to optimize treatment with chemotherapy, radiation therapy, and surgery.


If this technique continues to be developed and is eventually put to use, it could give doctors a noninvasive way to determine the MGMT methylation status of their patients. This information can help them make more informed treatment decisions, which could lead to better prognoses for patients. It also opens the door to other potential uses of radiomics in oncology.
