hms-dbmi/ENLIGHT

ENLIGHT: Interpretable Multimodal AI for Grounded Cancer Pathology Diagnosis and Molecular Profiling

Artificial intelligence (AI)-powered computational pathology has emerged as an increasingly useful tool in cancer evaluation, augmenting clinical interpretation and uncovering previously unrecognized relationships between tissue morphology and underlying molecular alterations. Recent advances in pathology foundation models have enabled diagnostic workflows with unprecedented scalability and efficiency. However, standard AI models remain black boxes, offering limited interpretability and insufficient pathological grounding to justify their assessments. Here we establish Explainable Neoplasm Learning in Grounded Histology Terms (ENLIGHT), a large multimodal model (LMM) designed to systematically identify cancer diagnoses, subtypes, and genetic alterations across 28 organs. We trained ENLIGHT on 38.36 million pathology image-text pairs and evaluated it on 5.68 million independent validation samples across 40 patient cohorts from 44 institutions worldwide, covering six categories of core diagnostic tasks. ENLIGHT demonstrates strong generalizability in zero-shot classification, cancer subtyping, cross-modal retrieval, visual question answering, report generation, and molecular profile prediction. Across these tasks on independent, unseen cohorts, it consistently outperforms state-of-the-art (SOTA) pathology models, exceeding those tailored to predictive objectives by up to 12% and those tailored to generative tasks by up to 281% (mean ROUGE-L across six open-ended visual question answering benchmarks). Importantly, ENLIGHT explains its diagnostic decisions using interpretable pathological concepts that align with established medical knowledge, while uncovering novel links between tissue morphology and molecular alterations. By integrating the reasoning capabilities of LMMs with interpretable pathology grounding, ENLIGHT provides a versatile, scalable, and transparent platform to advance biomedical research, education, and clinical decision support in pathology.

Install environment

See environment.md for setup instructions.

Download checkpoints

Download the checkpoints and set the path to $CKPTDIR

Download data example

Download the example data and set the path to $DATADIR
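Taken together, the two paths above can be exported once per session. This is only an illustrative layout (the repo does not prescribe where the downloads live); adjust the directories to wherever you placed the files:

```shell
# Illustrative layout; point these at your actual download locations.
export CKPTDIR="$HOME/enlight/checkpoints"   # should contain enlight-fm/enlight-visual-encoder.pt
export DATADIR="$HOME/enlight/data"          # should contain the example data
mkdir -p "$CKPTDIR" "$DATADIR"
```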

Zero-shot classification

Evaluate cancer grading

AGGC22: https://aggc22.grand-challenge.org/

SICAPv2: https://data.mendeley.com/datasets/9xxm58dvs3/1

RCC-KMC: https://www.kaggle.com/datasets/sachidwivedi1234/kmc-kidney-histopathology-dataset

Set path to $BASE

python eval-zeroshot/zeroshot_classification.py --database $BASE --data AGGC22 --pretrained_path $CKPTDIR/enlight-fm/enlight-visual-encoder.pt --task grading
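To sweep all three grading cohorts, the single command above can be wrapped in a loop. This is a sketch that assumes the script accepts the same flags for each dataset key; the leading echo makes it a dry run, so drop it to actually execute:

```shell
# Dry-run sweep over the grading cohorts; remove "echo" to execute for real.
for DATA in AGGC22 SICAPv2 RCC-KMC; do
  echo python eval-zeroshot/zeroshot_classification.py \
    --database "$BASE" --data "$DATA" \
    --pretrained_path "$CKPTDIR/enlight-fm/enlight-visual-encoder.pt" \
    --task grading
done
```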

Evaluate microenvironment classification

BreCaHAD: https://figshare.com/articles/dataset/BreCaHAD_A_Dataset_for_Breast_Cancer_Histopathological_Annotation_and_Diagnosis/7379186

BIDC: https://data.mendeley.com/datasets/w7jjcx7gj6/1

HD30000: https://www.kaggle.com/datasets/aicoder/histopathology-dataset/data

Hist700: https://figshare.com/articles/dataset/LungHist700_A_Dataset_of_Histological_Images_for_Deep_Learning_in_Pulmonary_Pathology/25459174?file=45206104

WSSS4LUAD: https://wsss4luad.grand-challenge.org/

NPC-88k: https://www.kaggle.com/datasets/wshmunirah/npc-88k-public

OSCC: https://data.mendeley.com/datasets/ftmp4cvtmb/1

Tolkach: https://zenodo.org/records/7548828

DigestPath19: https://digestpath2019.grand-challenge.org/

PAIP21: https://paip2021.grand-challenge.org/Home/

Choledoch: https://www.kaggle.com/datasets/ethelzq/multidimensional-choledoch-database

CAMEL: https://github.com/ThoroughImages/CAMEL

Chaoyang: https://bupt-ai-cz.github.io/HSA-NRL/

MHIST: https://bmirds.github.io/MHIST/

SPIDER_colon: https://huggingface.co/datasets/histai/SPIDER-colorectal

UniToPatho: https://github.com/EIDOSlab/UNITOPATHO

OCELOT: https://ocelot2023.grand-challenge.org/datasets/

Camelyon16: https://camelyon16.grand-challenge.org/

Camelyon17: https://camelyon17.grand-challenge.org/

SkinCancer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/7QCR8S

SIPAKMED: https://www.kaggle.com/datasets/prahladmehandiratta/cervical-cancer-largest-dataset-sipakmeds

ETI: https://github.com/ssea-lab/DL4ETI

UBC-OCEAN: https://www.kaggle.com/competitions/UBC-OCEAN

UHB: https://zenodo.org/records/3825933

Set path to $BASE

python eval/zeroshot_classification.py --database $BASE --data SPIDER_colon --pretrained_path $CKPTDIR/enlight-fm/enlight-visual-encoder.pt
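The same command can be looped over several of the cohorts listed above. Note that only SPIDER_colon appears as a dataset key in the repo's example; the other keys below are assumptions that mirror the cohort names and may differ from the CLI's actual identifiers. The leading echo makes this a dry run:

```shell
# Dry-run sweep; dataset keys other than SPIDER_colon are assumed from the
# cohort names above and may need adjusting. Remove "echo" to execute.
for DATA in SPIDER_colon MHIST BreCaHAD Camelyon16; do
  echo python eval/zeroshot_classification.py \
    --database "$BASE" --data "$DATA" \
    --pretrained_path "$CKPTDIR/enlight-fm/enlight-visual-encoder.pt"
done
```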

Zero-shot retrieval

Download datasets

TCGA-UT: https://zenodo.org/records/5889558

Set path to $BASE

Evaluate image-to-text and text-to-image retrieval
python eval/zeroshot_retrieval.py --database $BASE --data ut-0 --pretrained_path $CKPTDIR/enlight-fm/enlight-visual-encoder.pt
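To keep the retrieval metrics around for later comparison, the command's console output can be teed into a log file. The log name is illustrative, and the leading echo makes this a dry run; remove it to actually evaluate:

```shell
# Dry run: tee the (echoed) retrieval command into a log file.
# Remove "echo" to run the real evaluation and capture its metrics.
LOG="retrieval_ut-0.log"
echo python eval/zeroshot_retrieval.py --database "$BASE" --data ut-0 \
  --pretrained_path "$CKPTDIR/enlight-fm/enlight-visual-encoder.pt" | tee "$LOG"
```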

Visual question answering

Patch-image QA benchmarks

Download dataset

PathMMU: https://huggingface.co/datasets/jamessyx/PathMMU

PathVQA: https://huggingface.co/datasets/dz-osamu/PathVQA

Set path to $IMG_DIR

Preprocess into the expected format
python eval-vqa/format_vqa_batch_infer.py $IMG_DIR pathmmu
Run inference to generate answers
BASE=$IMG_DIR CKPTDIR=$CKPTDIR bash eval/vqa_batch_infer-pathmmu.sh
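Both patch-level benchmarks can be formatted and inferred in one pass. Only the pathmmu key and script name appear in the repo's example; the pathvqa key and the vqa_batch_infer-pathvqa.sh script name below are assumptions made by analogy and may need adjusting. The leading echoes make this a dry run:

```shell
# Dry-run both patch-level VQA benchmarks. The "pathvqa" key and script name
# are assumed by analogy with the pathmmu example; remove "echo" to execute.
for BENCH in pathmmu pathvqa; do
  echo python eval-vqa/format_vqa_batch_infer.py "$IMG_DIR" "$BENCH"
  echo BASE="$IMG_DIR" CKPTDIR="$CKPTDIR" bash "eval/vqa_batch_infer-$BENCH.sh"
done
```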

Slide QA Example

CKPTDIR=$CKPTDIR DATADIR=$DATADIR bash eval-vqa/vqa_infer_slide.sh

Explainable Classification

Classify and Explain Subtyping

CKPTDIR=$CKPTDIR DATADIR=$DATADIR bash eval-xclassify/explain_classify.sh

Classify and Explain Molecular Alteration

CKPTDIR=$CKPTDIR DATADIR=$DATADIR bash eval-xclassify/explain_classify.sh

Acknowledgements

We thank the following open-source repositories:

open_clip

LLaVA

Quilt-1M

PathAssist

Issues

Please open an issue or send questions to xuan_gong@hms.harvard.edu or Kun-Hsing_Yu@hms.harvard.edu
