
Machine learning

A heart disease model that explains itself.

UCI data, four models, SHAP on the predictions. Built to be inspected, not just scored.

By Daniel Jeun


Heart disease prediction is an old benchmark. The UCI dataset has been beaten to death. The only interesting reason to do it again is to do it well, with every step inspectable.

The pipeline

Missing numerical values are filled with the median, missing categorical values with the mode. The target is binarized into presence or absence of heart disease. The id and dataset columns are dropped. Categorical columns are one-hot encoded with drop_first to avoid multicollinearity. Numerical features are standardized. The split is 80/20 with stratified sampling. SMOTE handles the class imbalance on the training set, never on the test set.

Four models train: Logistic Regression for the linear baseline, a Random Forest and an XGBoost ensemble for two flavors of tree-based scoring, and a Support Vector Classifier for a margin-based comparison. SHAP runs on the predictions so a clinician looking at any single output can ask why the model said what it said and get an answer. The top features are thalch (maximum heart rate achieved) and exang_True (exercise-induced angina). That aligns with the clinical literature, which is a good sign for a model that could easily have latched onto something spurious.

The honest detail

Cholesterol shows a negative correlation with heart disease in this dataset. That’s the wrong direction: high cholesterol is an established risk factor. It is almost certainly a data quality issue in the original UCI compilation, not a real protective effect. A clinician reading the report needs to see that called out, not buried under a feature importance bar chart.
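A check along these lines makes the issue visible. It is a sketch under one loud assumption: that the data quality problem takes the form of placeholder zeros recorded where cholesterol was never measured. The write-up does not state the cause, and the column names chol and target, the function cholesterol_check, and the eight-row toy frame are all hypothetical.

```python
import pandas as pd


def cholesterol_check(df, chol="chol", target="target"):
    """Compare the chol/target correlation with and without zero readings."""
    nonzero = df[df[chol] > 0]
    return {
        # Share of rows where cholesterol is recorded as exactly 0.
        "zero_share": float((df[chol] == 0).mean()),
        # Pearson correlation on all rows, placeholders included.
        "corr_all": float(df[chol].corr(df[target])),
        # Same correlation after dropping the zero readings.
        "corr_nonzero": float(nonzero[chol].corr(nonzero[target])),
    }


# Toy rows built only to exercise the function; not taken from the UCI data.
toy = pd.DataFrame({
    "chol": [0, 0, 0, 0, 200, 220, 240, 260],
    "target": [1, 1, 1, 1, 0, 0, 1, 1],
})
report = cholesterol_check(toy)
print(report)
```

In the toy rows the sign flips once the zeros are removed. Whether the real data behaves the same way has to be checked, not assumed, and the result belongs in the report next to the SHAP plots.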

The point of the project is the inspectability. A score on a beaten dataset is not the contribution. Being able to point at a prediction and explain it is.

Python scikit-learn XGBoost SHAP SMOTE
