migraine-risk-prediction

Migraine Attack Risk Prediction

A Clinically Interpretable Machine Learning Pipeline

Author: Patricia C. Torrell · Clinical Data Analyst
Stack: Python · R · scikit-learn · tidyverse · SHAP · ggplot2
Domain: Neurology · Longitudinal Clinical Data · Personalised Medicine


Overview

This project develops a fully transparent, end-to-end machine learning pipeline to estimate the probability of a migraine attack from longitudinal clinical, treatment, and environmental data.

The pipeline goes beyond prediction: every modelling decision is clinically justified, every output is interpretable, and the results are validated at both population and patient level. It is designed to reflect the standards expected in a pharmaceutical or clinical research environment — reproducible, auditable, and medically grounded.

Full variable definitions and clinical coding are documented in docs/data_dictionary.md.

Core question: Can we predict the onset of a migraine attack using available clinical and environmental data — and if so, who is most sensitive to treatment and environment?


What This Project Does

| Stage | Description |
|---|---|
| EDA | Longitudinal structure analysis, target prevalence, distribution of clinical and environmental variables |
| Feature Engineering | Treatment phase classification, seasonal decomposition, cumulative attack history, rolling air quality mean, age bands |
| Modelling | Logistic Regression, Gradient Boosting, XGBoost, all trained with patient-level cross-validation (GroupKFold) |
| Calibration | Isotonic and Platt calibration with honest out-of-fold evaluation |
| Threshold Optimisation | FP/FN cost-based search with minimum sensitivity constraint (≥0.80) |
| Subgroup & Fairness Analysis | Odds ratio (OR) estimation with clustered robust SEs across sex, age, medication, subtype, air quality quartiles |
| Interpretability | Permutation Importance, PDP, ALE (global + stratified), ICE per patient, sensitivity analysis (to demonstrate clinical heterogeneity), and SHAP (global + per patient) |
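
The Threshold Optimisation stage above can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/thresholding.py`: the function name `pick_threshold`, the cost weights (`cost_fp=1.0`, `cost_fn=2.0`), the threshold grid, and the synthetic data are all assumptions. Only the minimum sensitivity of 0.80 is taken from the stage description.

```python
import numpy as np


def pick_threshold(y_true, y_prob, cost_fp=1.0, cost_fn=2.0, min_sensitivity=0.80):
    """Return (threshold, cost, sensitivity) with the lowest total FP/FN cost
    among thresholds whose sensitivity is at least min_sensitivity.
    Returns None if no threshold on the grid satisfies the constraint."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    best = None
    for t in np.linspace(0.01, 0.99, 99):  # hypothetical grid
        pred = y_prob >= t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        if sensitivity < min_sensitivity:
            continue  # constraint violated, skip this threshold
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (float(t), float(cost), float(sensitivity))
    return best


# Synthetic labels and probabilities, for illustration only.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
p = np.clip(0.35 + 0.3 * y + rng.normal(0, 0.2, size=500), 0, 1)

result = pick_threshold(y, p)
threshold, cost, sens = result
print(f"threshold={threshold:.2f}  cost={cost:.0f}  sensitivity={sens:.3f}")
```

Treating the sensitivity floor as a hard filter, rather than folding it into the cost, keeps the clinical requirement explicit: a cheaper threshold that misses too many attacks is never selected.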

Final Model

Logistic Regression — tuned + isotonic calibration

Chosen for interpretability, clinical plausibility, and consistent performance across all evaluation criteria.

| Metric | Value | 95% CI (bootstrap) |
|---|---|---|
| ROC-AUC | 0.715 | [0.681 – 0.749] |
| PR-AUC | 0.754 | [0.711 – 0.792] |
| Brier Score | 0.214 | [0.200 – 0.229] |
| ECE | 0.074 | [0.051 – 0.099] |
| Precision | 0.672 | [0.638 – 0.708] |
| Sensitivity | 0.867 | [0.836 – 0.894] |
| Optimal threshold | 0.47 | FP/FN cost-optimised |

All metrics were computed on a held-out test set of unseen patients (an honest patient-level split: no patient contributes records to both training and test data).
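
One way to obtain bootstrap intervals like those reported above is to resample whole patients rather than individual records, so that repeated measures from one patient stay together. The sketch below assumes that approach; the README does not state how the bootstrap was drawn, so the patient-level resampling, the function name `cluster_bootstrap_ci`, the number of resamples, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def cluster_bootstrap_ci(y_true, y_prob, groups, metric, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `metric`, resampling patients with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_prob, groups = (np.asarray(a) for a in (y_true, y_prob, groups))
    ids = np.unique(groups)
    rows_by_id = {g: np.flatnonzero(groups == g) for g in ids}
    stats = []
    for _ in range(n_boot):
        sampled = rng.choice(ids, size=len(ids), replace=True)
        idx = np.concatenate([rows_by_id[g] for g in sampled])
        if len(np.unique(y_true[idx])) < 2:
            continue  # metric undefined when only one class is present
        stats.append(metric(y_true[idx], y_prob[idx]))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Synthetic example: 40 hypothetical patients with 20 records each.
rng = np.random.default_rng(1)
groups = np.repeat(np.arange(40), 20)
y = rng.integers(0, 2, size=groups.size)
p = np.clip(0.4 + 0.2 * y + rng.normal(0, 0.25, size=groups.size), 0, 1)

auc = roc_auc_score(y, p)
ci_low, ci_high = cluster_bootstrap_ci(y, p, groups, roc_auc_score)
print(f"ROC-AUC {auc:.3f}  95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

The same function accepts any metric with a `(y_true, y_prob)` signature, for example `sklearn.metrics.brier_score_loss`.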


Key Clinical Findings

What drives migraine risk?

Global interpretability (Permutation Importance + SHAP) consistently identifies three dominant predictors:

Does treatment reduce risk?

Treatment phase alone does not reach statistical significance, but medication type does, and strongly. Patients on reduced medication show the highest attack proportion (75.8%), above those on continuous treatment (58.8%) and those on no medication (40.5%). This ordering is consistent with confounding by indication: patients with more severe disease are more likely to be medicated, so the association should not be read as medication causing attacks.
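
To put the three proportions above on the odds-ratio scale used elsewhere in the analysis, they can be converted by hand. The sketch below is plain arithmetic on the reported percentages; choosing "no medication" as the reference group is an assumption, and the results are crude, unadjusted ratios, not the model-based estimates with clustered standard errors produced in the R notebooks.

```python
def odds(p):
    """Convert a proportion to odds."""
    return p / (1 - p)


# Attack proportions by medication type, as reported in the text.
attack_proportion = {"reduced": 0.758, "continuous": 0.588, "none": 0.405}

reference = odds(attack_proportion["none"])  # assumed reference group
crude_or = {
    group: odds(p) / reference
    for group, p in attack_proportion.items()
    if group != "none"
}

for group, value in crude_or.items():
    print(f"{group} vs none: crude OR = {value:.2f}")
```

This gives crude odds ratios of roughly 4.6 (reduced vs none) and 2.1 (continuous vs none), which illustrates how large the unadjusted gap between medication groups is.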

Does air quality matter?

No significant association was found between air quality quartile and attack risk. All quartiles show odds ratios very close to 1 (Q2: OR = 0.97, p = 0.806; Q3: OR = 1.00, p = 0.989; Q4: OR = 1.05, p = 0.767), with wide confidence intervals and no consistent directional trend. Air quality does not emerge as a reliable predictor in this model.

Is clinical response heterogeneous?

Yes — and this is one of the most clinically relevant findings of the project. ICE curves and patient-level sensitivity indices show that:

This supports the case for personalised clinical monitoring rather than population-level protocols.
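
For readers unfamiliar with ICE curves, the idea can be sketched in a few lines. The project computes ICE in R; the Python version below is only an illustration on synthetic data. The helper `ice_curves`, the feature index, the grid, and the definition of a per-row "sensitivity index" as the range of each curve are all assumptions, not the project's definitions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for clinical features (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)
model = LogisticRegression().fit(X, y)


def ice_curves(model, X, feature, grid):
    """For every row, predict the probability while one feature is swept over
    `grid` and all other features keep their observed values."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        curves[:, j] = model.predict_proba(X_mod)[:, 1]
    return curves


grid = np.linspace(-2, 2, 21)
curves = ice_curves(model, X, feature=0, grid=grid)

# Hypothetical sensitivity index: how far each row's predicted risk moves
# across the grid. Averaging the curves gives the partial dependence (PDP).
sensitivity_index = curves.max(axis=1) - curves.min(axis=1)
pdp = curves.mean(axis=0)
print(f"sensitivity index range: {sensitivity_index.min():.3f} to {sensitivity_index.max():.3f}")
```

Even with a purely additive logistic model, the curves differ on the probability scale because each row starts from a different baseline risk, which is why per-patient curves can look heterogeneous while the PDP looks smooth.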


Interpretability Highlights

The project applies a full suite of model-agnostic interpretability methods:


Project Structure

.
├── data/
│   ├── raw/                        # Original dataset (Kaggle)
│   └── processed/                  # Engineered features, splits, predictions
│
├── models/                         # Trained models, calibrators, preprocessor
│
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_feature_engineering.ipynb
│   ├── 04_model_training.ipynb
│   ├── 05_calibration.ipynb
│   ├── 06_tuning.ipynb
│   ├── 07_model_experiments.ipynb  # Attempted model improvements; no significant gain over baseline,
│   │                               # final model unchanged
│   ├── 08_threshold_optimization.ipynb
│   ├── 09_subgroup_fairness.ipynb  # R kernel
│   └── 10_interpretability.ipynb   # R kernel
│
├── reports/
│   ├── figures/                    # All plots (PNG, 300 dpi)
│   └── tables/                     # All output tables (CSV)
│
├── src/
│   ├── data_prep.py
│   ├── features.py
│   ├── modeling.py
│   ├── calibration.py
│   ├── evaluation.py
│   ├── thresholding.py
│   ├── visualization.py
│   ├── visualization.R
│   ├── subgroup_fairness.R
│   ├── interpretability.R
│   └── config.py
│
├── docs/
│   ├── data_dictionary.md       # Clinical variable definitions and coding
│   ├── CLINICAL_SUMMARY.md      # Key findings for clinical audiences
│   └── PROJECT_STORY.md         # Analytical decisions and lessons learned
│
├── requirements.txt
├── LICENSE.txt
└── README.md

Technical Notes

Why logistic regression over tree-based models?
Gradient Boosting and XGBoost showed lower ROC-AUC and worse calibration in validation. More importantly, logistic regression produces directly interpretable coefficients and probability outputs that are clinically meaningful — a key requirement for any decision-support tool in a medical context.
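
A validation comparison of this kind depends on keeping each patient's records within a single fold. The sketch below shows one way to do that with scikit-learn's `GroupKFold` and out-of-fold predictions. It uses synthetic data and default hyperparameters, and it leaves out XGBoost to avoid an extra dependency; the group sizes, feature count, and model settings are assumptions and do not reproduce the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic longitudinal data: 40 hypothetical patients, 25 records each.
rng = np.random.default_rng(42)
n_patients, n_visits = 40, 25
groups = np.repeat(np.arange(n_patients), n_visits)
X = rng.normal(size=(n_patients * n_visits, 4))
patient_effect = rng.normal(scale=0.8, size=n_patients)[groups]
logit = 0.9 * X[:, 0] - 0.6 * X[:, 1] + patient_effect
y = (rng.random(logit.size) < 1 / (1 + np.exp(-logit))).astype(int)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

cv = GroupKFold(n_splits=5)  # all records of a patient land in the same fold
oof_auc = {}
for name, model in models.items():
    oof = cross_val_predict(
        model, X, y, groups=groups, cv=cv, method="predict_proba"
    )[:, 1]
    oof_auc[name] = roc_auc_score(y, oof)
    print(f"{name}: out-of-fold ROC-AUC = {oof_auc[name]:.3f}")

# Sanity check: no patient appears on both sides of any split.
no_leakage = all(
    set(groups[train]).isdisjoint(groups[test])
    for train, test in cv.split(X, y, groups)
)
```

Without grouping, records from the same patient would appear in both training and validation folds, and flexible models could look better than they are by memorising patient-specific patterns.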

Why isotonic calibration?
Platt scaling consistently worsened both the Brier score and ECE across all models, whereas isotonic calibration improved the reliability of predicted probabilities without degrading discrimination.
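
A comparison like this can be set up by fitting both calibrators on one set of scores and scoring them on another. The sketch below is a simplified stand-in for the project's out-of-fold procedure: it uses a single calibration/evaluation split, synthetic deliberately miscalibrated scores, and a hypothetical equal-width, 10-bin ECE helper. Its numbers say nothing about which method wins on the real data.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins: weighted mean |observed rate - mean prediction|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0, 1, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # values in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return float(ece)


# Synthetic scores that systematically under-predict the true probability.
rng = np.random.default_rng(0)
p_true = rng.uniform(size=4000)
y = (rng.random(4000) < p_true).astype(int)
raw = p_true ** 2

s_cal, y_cal = raw[:2000], y[:2000]    # used to fit the calibrators
s_eval, y_eval = raw[2000:], y[2000:]  # used only for evaluation

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(s_cal, y_cal)
# Platt scaling: a sigmoid fitted on the raw score (large C = almost no penalty).
platt = LogisticRegression(C=1e6).fit(s_cal.reshape(-1, 1), y_cal)

calibrated = {
    "raw": s_eval,
    "isotonic": iso.predict(s_eval),
    "platt": platt.predict_proba(s_eval.reshape(-1, 1))[:, 1],
}
results = {
    name: {
        "brier": brier_score_loss(y_eval, prob),
        "ece": expected_calibration_error(y_eval, prob),
    }
    for name, prob in calibrated.items()
}
for name, m in results.items():
    print(f"{name:9s} Brier={m['brier']:.3f}  ECE={m['ece']:.3f}")
```

Evaluating on scores the calibrator never saw matters most for isotonic regression, which is flexible enough to overfit small calibration sets.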

Why a mixed Python/R pipeline?
Python handles preprocessing, modelling (scikit-learn), and SHAP (via the shap package's KernelExplainer). R handles the statistical modelling layer (robust clustered standard errors, OR estimation, forest plots) and advanced interpretability (PDP, ALE, ICE), where the R ecosystem is more mature and audit-ready for a clinical context.


Data Source

Hebaqueen. Medicine and Environment Impact on Migraine. Kaggle.
kaggle.com/datasets/hebaqueen/medicine-and-environment-impact-on-migraine

133 patients · 4,152 longitudinal records · 10 original variables · study period 1997–2000


Limitations


Contact

Patricia C. Torrell
Clinical Data Analyst — open to opportunities in pharmaceutical data analytics, clinical research, and medical affairs


MIT License