Python for Predicting ED Visits | Healthcare Analytics

Analyzing social determinants of health (SDOH) with Python


The goal of this project is to use Social Determinants of Health (SDoH) data by county available from AHRQ to look for relationships between specific variables and the county’s ED visit rate. Ultimately, I’d like to develop a predictive model with the top features related to a high ED rate. I decided to look at 2019 and 2020 data (2018 was not available). This dataset was used with the explicit permission of HRSA.

This step-by-step tutorial goes through my process of loading, cleaning, analyzing, and modeling the data.

The first step was to load the two data files and check out the shape.

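Loading the files and checking the shapes follows the usual pandas pattern. The sketch below uses small in-memory CSVs as stand-ins so it runs on its own; in practice you would point pd.read_csv at the two downloaded AHRQ files (the filenames are yours to fill in).

```python
import io
import pandas as pd

# small in-memory CSVs stand in for the two AHRQ county files so the
# pattern is runnable; the real files differ in their column counts
csv2019 = "COUNTYFIPS,ACS_PCT_MALE\n01001,49.2\n01003,48.7"
csv2020 = "COUNTYFIPS,ACS_PCT_MALE,NEWVAR\n01001,49.1,1\n01003,48.6,2"

dfa = pd.read_csv(io.StringIO(csv2019))  # 2019 data
dfb = pd.read_csv(io.StringIO(csv2020))  # 2020 data
print(dfa.shape, dfb.shape)
```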

Since the two dataframes have a different number of columns, I’m going to import the data dictionaries and pull the columns that are the same.

I merged the data dictionaries on the column names (inner join) to get the final list of common columns. Once I had the columns, I selected a subset of each dataframe with those columns and concatenated them with axis=0 to add them vertically. My df_final includes the 2019 and 2020 data for the common columns.

dictionary2019 = pd.read_csv('Data/datadictionary2019.csv', encoding="ISO-8859-1")
dictionary2020 = pd.read_csv('Data/datadictionary.csv', encoding="ISO-8859-1")

# inner join on the column name to get the list of columns common to both years
commoncolumns = dictionary2020.merge(dictionary2019, how='inner', on='name')['name'].values.tolist()

# subset each year to the common columns, then stack the years vertically
df_final = pd.concat([dfa[commoncolumns], dfb[commoncolumns]], axis=0)

This data has 674 columns, so we are going to need to try to narrow down which features to look at. Let’s start with my variable of interest, the ED visit rate.

The dataset includes "Emergency department visits per 1,000 male Medicare (dual and non-dual) beneficiaries." However, the data provides separate male and female rates, so I am going to create a weighted average for the overall ED rate.

To do so I multiplied the rate by the percentage of males and the percentage of females respectively, then summed the values.

Male ED Rate: 597.1129158207091

Female ED Rate: 639.9023742580443

import numpy as np

# create an average overall ED rate by weighting the male and female rates by
# their percentage of the population and adding (the two ED-rate column names
# below are placeholders -- substitute the exact names from the data dictionary)
df_final['EDrate'] = (df_final['ACS_PCT_MALE'] * df_final['MALE_ED_RATE']
                      + df_final['ACS_PCT_FEMALE'] * df_final['FEMALE_ED_RATE']) / 100

# print the mean ED rate to use as our baseline for Good and Bad outcomes
print(df_final['EDrate'].mean())

After looking at the data frame, we can see missing data for some of the counties. To address this, I dropped certain columns and imputed others with the mean.

For this project, we will drop rows with a missing value for the ED rate calculation because that means they did not have data related to emergency visits.
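The row drop described above is a one-liner with pandas; a minimal sketch, using a toy stand-in for df_final and assuming the weighted-average column is named EDrate as in the later snippets:

```python
import numpy as np
import pandas as pd

# toy stand-in for df_final; in the real pipeline EDrate is the weighted average
df_final = pd.DataFrame({'COUNTYFIPS': ['01001', '01003', '01005'],
                         'EDrate': [597.1, np.nan, 639.9]})

# drop counties with a missing ED rate -- they reported no emergency-visit data
df_final = df_final.dropna(subset=['EDrate'])
print(len(df_final))
```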

Then we will find the 80th percentile for the dataset to use as a cutoff point for ‘high ED rate’. This will be my outcome variable for future predictions. The overall EDrate will be used for correlations and exploratory data analysis.

#create cutoff value for high EDrate
cutoff = np.percentile(df_final['EDrate'], 80)
#flag 1 if the ED rate is greater than the 80th percentile (high), else 0 (low)
df_final['HighED'] = np.where(df_final['EDrate'] > cutoff, 1, 0)

In order to handle some of the missing data, I started by dropping any columns with >10% missing values. Then I got a list of the remaining columns with any missing values, as shown below.

# drop columns with >10% missing (keep columns with at least 90% non-null values)
df_final.dropna(thresh=int(0.90*len(df_final)), axis=1, inplace=True)

#list columns remaining with missing values
df_final.isnull().sum().to_frame(name='counts').query('counts > 0').sort_values(by='counts', ascending=False)

Let’s try a simple imputer with the mean for all columns that are type float. First we will need to separate our training and testing sets:

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# float columns to be mean-imputed
columns = df_final.loc[:, df_final.dtypes == float].columns.values

# split on the index so train and test can be imputed separately (no leakage)
X_train, X_test, y_train, y_test = train_test_split(df_final.index, df_final['HighED'], stratify=df_final['HighED'], test_size=0.25, random_state=42)

# fit the mean imputer on the training rows only, then transform both sets
imp = SimpleImputer(strategy='mean')
df_impute_train = df_final.loc[X_train].copy()
df_impute_test = df_final.loc[X_test].copy()
df_impute_train[columns] = imp.fit_transform(df_impute_train[columns])
df_impute_test[columns] = imp.transform(df_impute_test[columns])


Next, we will select all of the columns with a dtype of float so that we can run correlations between each feature and the target feature. Based on the correlation, we will set a threshold and keep columns that are significantly positively correlated or negatively correlated with EDrate.

#collect positive correlations over .2 (small-moderate effect size)
def positivecorrelations(threshold=.2):
    positivecolumns = []
    for col in columns:
        if df_impute_train['EDrate'].corr(df_impute_train[col]) > threshold:
            positivecolumns.append(col)
    return positivecolumns

#collect negative correlations less than -.2 (small-moderate effect size)
def negativecorrelations(threshold=-.2):
    negativecolumns = []
    for col in columns:
        if df_impute_train['EDrate'].corr(df_impute_train[col]) < threshold:
            negativecolumns.append(col)
    return negativecolumns


I made a final list of the columns as shown below:

#make a final list of significant columns
sigcols = positivecorrelations() + negativecorrelations()

We ended up with 140 columns. After I printed the list of columns, I realized I still had some cleaning up to do.

We want to make sure that we’re not including any variables that contain ED so we will filter all of those columns in addition to our calculated Female rate and Male rate out of our list with the code below.

stringVal = "ED"
finalcols=[x for x in sigcols if stringVal not in x]


Filtering out the ED-related columns narrowed the dataset down to 112 columns. But now that I’m looking at the list of columns, I see that we should also exclude anything with _IP (inpatient), _PA (post-acute), and _EM (E&M). We’re also not interested in the min and max temperature per month, so I’ll go ahead and drop those columns too.

stringVal = "_IP"
stringVal2 = "_PA"
stringVal3 = "_EM"
stringVal4 = "TEMP"  # placeholder substring for the monthly min/max temperature columns

finalcols = [x for x in finalcols if stringVal not in x]
finalcols = [x for x in finalcols if stringVal2 not in x]
finalcols = [x for x in finalcols if stringVal3 not in x]
finalcols = [x for x in finalcols if stringVal4 not in x]
#result is 77 columns

Based on another careful examination of the output, I found some features that measure very similar things (e.g., estimated percentages overall vs. aged X-Y). Let’s drop the columns that specify age ranges when the overall value is present. In addition, the PQI is repeated for different subsets of the population, so we will take the weighted average of the male and female rates to find the overall rate, just as we did earlier with the EDrate.
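Dropping the age-banded duplicates can be sketched with a regular expression. The sample column names and the trailing _X_Y suffix convention below are hypothetical; adjust the pattern to match the actual names in the data dictionary.

```python
import re

# hypothetical sample of remaining column names
finalcols = ['ACS_PCT_DISABLE', 'ACS_PCT_DISABLE_15_44', 'ACS_PCT_UNEMPLOY']

age_suffix = re.compile(r'_\d+_\d+$')  # e.g. a trailing _15_44 age band

# keep an age-banded column only when no overall version of it exists
finalcols = [c for c in finalcols
             if not (age_suffix.search(c) and age_suffix.sub('', c) in finalcols)]
print(finalcols)
```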


df_impute_train.loc[:, 'MalePQI']=(df_impute_train['ACS_PCT_MALE']*df_impute_train['MMD_OVERALL_PQI_M_RATE'])/100
df_impute_train.loc[:, 'FemalePQI']=(df_impute_train['ACS_PCT_FEMALE']*df_impute_train['MMD_OVERALL_PQI_F_RATE'])/100
df_impute_train.loc[:, 'PQI']=df_impute_train['MalePQI']+df_impute_train['FemalePQI']

df_impute_test.loc[:, 'MalePQI']=(df_impute_test['ACS_PCT_MALE']*df_impute_test['MMD_OVERALL_PQI_M_RATE'])/100
df_impute_test.loc[:, 'FemalePQI']=(df_impute_test['ACS_PCT_FEMALE']*df_impute_test['MMD_OVERALL_PQI_F_RATE'])/100
df_impute_test.loc[:, 'PQI']=df_impute_test['MalePQI']+df_impute_test['FemalePQI']

# drop the subgroup-specific PQI rate columns now that we have the overall PQI
# (the two substrings below are placeholders -- match them to your column names)
rate = "_PQI_"
race = "_RACE_"
finalcols = [x for x in finalcols if rate not in x]
finalcols = [x for x in finalcols if race not in x]

dictionary2020[['name', 'label']][dictionary2020['name'].isin(finalcols)]

I also pulled the full labels for the columns from the data dictionary and reviewed them to ensure that all of the columns were practical, given my background knowledge of healthcare analytics.

The final result yielded 78 columns to work with. Next, I wanted to quickly visualize the relationship between each of these variables and the EDrate, so I wrote a little loop to create scatterplots:

import matplotlib.pyplot as plt

def create_scatterplots(var='EDrate'):
    for col in finalcols:
        plt.scatter(df_impute_train[col], df_impute_train[var])
        plt.title("{} vs {}".format(col, var))
        plt.show()

The next part of this analysis involved developing a few predictive models. For this project, I used a logistic regression, support vector machine, random forest classifier, and XGBoost classifier. The first models we’re testing are logistic regression and SVM so we need to scale the data.

I went with a MinMaxScaler to bring all of the features onto the same 0-1 scale.

from sklearn.preprocessing import MinMaxScaler

# fit the scaler on the training features only, then transform both sets
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(df_impute_train[finalcols])
X_test_scaled = scaler.transform(df_impute_test[finalcols])


If you’re new to classification problems, check out this introduction article on Logistic Regressions by Ayush Pant. For this logistic regression, I decided to set the class_weight=’balanced’, given that this is an imbalanced classification problem:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', random_state=0, class_weight='balanced').fit(X_train_scaled, y_train)

model.score(X_test_scaled, y_test)

The confusion matrix below displays the TP, TN, FP, and FN. We also print the classification report.

from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, model.predict(X_test_scaled))

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm, cmap='summer', alpha=0.3)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted LowED', 'Predicted HighED'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual LowED', 'Actual HighED'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='black')
print(classification_report(y_test, model.predict(X_test_scaled)))

The next model I tried was a Support Vector Machine. If this is also a topic that’s new for you, I’d recommend checking out Ajay Yadav’s article on SVM.

For my project, I experimented a bit with the class weights to see what would strike the best balance in the model results.
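That experiment can be sketched as a small sweep over candidate weights for the minority HighED class. The synthetic data below stands in for the scaled SDoH features so the sketch is runnable on its own:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in for the scaled features, with an 80/20 imbalance like HighED
X, y = make_classification(n_samples=400, weights=[0.8], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# weight the minority class progressively more heavily and compare accuracy
for w in (1, 2, 4, 8):
    clf = SVC(class_weight={0: 1, 1: w}).fit(X_tr, y_tr)
    print(w, round(clf.score(X_te, y_te), 3))
```

Accuracy alone hides the trade-off, so in practice you would compare the full classification reports as each weight shifts errors between false positives and false negatives.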

from sklearn.svm import SVC

clf = SVC(class_weight='balanced').fit(X_train_scaled, y_train)
clf.score(X_test_scaled, y_test)
cm = confusion_matrix(y_test, clf.predict(X_test_scaled))

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm, cmap='summer', alpha=0.3)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted LowED', 'Predicted HighED'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual LowED', 'Actual HighED'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='black')
print(classification_report(y_test, clf.predict(X_test_scaled)))

The next models to test are Random Forest Classifier and XGBoost. Because tree-based models don’t require scaling, I’m going to use the original X data for these two.
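Building that unscaled X data is just a column selection on the imputed frames. The sketch below uses toy stand-ins for df_impute_train, df_impute_test, and finalcols so it runs on its own:

```python
import pandas as pd

# toy stand-ins for the imputed frames and the final feature list
df_impute_train = pd.DataFrame({'PQI': [1.0, 2.0],
                                'ACS_PCT_MALE': [49.0, 50.0],
                                'EDrate': [600.0, 650.0]})
df_impute_test = pd.DataFrame({'PQI': [1.5],
                               'ACS_PCT_MALE': [49.5],
                               'EDrate': [610.0]})
finalcols = ['PQI', 'ACS_PCT_MALE']

# unscaled feature matrices for the tree-based models
X_train = df_impute_train[finalcols]
X_test = df_impute_test[finalcols]
print(X_train.shape, X_test.shape)
```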

Before you go ahead and start running a random forest classifier, you might want to read a background article, like Tony Yiu’s article Understanding Random Forest. Once you feel like you have a good understanding of the concepts, try running a simple example with the code below:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_features=None, random_state=0, class_weight='balanced').fit(X_train, y_train)
clf.score(X_test, y_test)
cm = confusion_matrix(y_test, clf.predict(X_test))

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm, cmap='summer', alpha=0.3)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted LowED', 'Predicted HighED'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual LowED', 'Actual HighED'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='black')
print(classification_report(y_test, clf.predict(X_test)))

With an accuracy of 94%, the Random Forest looks like it’s doing pretty well, but I’m concerned about the false negative rate. Let’s see if XGBoost can do better.
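The miss rate I'm worried about can be read straight off the confusion matrix. The cm values below are a hypothetical stand-in for the random-forest matrix, not the actual results:

```python
import numpy as np

# hypothetical confusion matrix: rows = actual (Low, High), cols = predicted
cm = np.array([[150, 5],
               [12, 25]])

# false negative rate: actual HighED counties predicted as LowED
fn_rate = cm[1, 0] / (cm[1, 0] + cm[1, 1])
print(f"False negative rate: {fn_rate:.2%}")
```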

New to XGBoost? Review George Seif’s article — A Beginner’s guide to XGBoost to gain familiarity with the boosting trees before you run the code below.

import xgboost as xgb

# initialize the classifier
xgb_cl = xgb.XGBClassifier(random_state=0)

# fit, y_train)

# predict
preds = xgb_cl.predict(X_test)
cm = confusion_matrix(y_test, preds)

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm, cmap='summer', alpha=0.3)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted LowED', 'Predicted HighED'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual LowED', 'Actual HighED'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
print(classification_report(y_test, preds))

It looks like the Random Forest Classifier is the clear winner, with higher accuracy, precision, recall, and f1.

Let’s take a look at some of the feature importance values:

# feature importances from the random forest
features = X_test.columns.values
importances = clf.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 20))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')

# SHAP values for the same model
import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
1. The most important feature is the percentage of Black females. This suggests that this minority group faces greater hardships related to acute healthcare, and any discrimination or health inequity should be addressed immediately.

2. The percentage of disabled veterans is also a top feature. This suggests that disability and veteran status contribute to high ED utilization and that the underlying risk factors need to be addressed.

3. The SHAP values show that the PQI and the percentage of households that received food stamps/SNAP in the past 12 months have a significant impact. This indirectly ties socioeconomic status to poor health outcomes, suggesting that if healthcare providers can help in this area, patients might not visit the emergency department as frequently.

Based on these findings, data professionals in the healthcare sector may consider developing predictive models to isolate SDoH features that are correlated with poor health outcomes.

The hope would be that a targeted healthcare program can address SDoH needs such as discrimination due to race, disability, or socioeconomic status. By offering more intensive care and support to high-risk individuals or high-risk areas, healthcare organizations can work to treat these patients in a non-acute (non-ED) setting.

This can be achieved through care providers like community health facilities, home health care agencies, and other partners with health insurance companies.

In this article, I introduced a public dataset related to SDoH. By analyzing 2019 and 2020 data, I developed several predictive models whose goal was to predict whether a county would have high Emergency Department usage based on SDoH factors. The best model from my analysis was the random forest classifier. I reviewed the core features driving the model and explained the implications for healthcare analysts.


Check out the entire notebook on my GitHub


Internet Citation: Social Determinants of Health Database. Content last reviewed November 2022. Agency for Healthcare Research and Quality, Rockville, MD.

