Official code and analysis for the ACL 2026 paper:
**Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck**

Meiru Zhang, Zaiqiao Meng, Nigel Collier (University of Cambridge).
📄 Paper (arXiv) · 📦 Preprocessed data (Release v1.0.0)
We study why Large Language Models fail at multi-hop question answering in long contexts. Using Multi-Focus Attention Instruction (MFAI) — a training-free probe that explicitly indexes evidence positions via natural-language cues — we show:
- Step-function, not linear decay. Performance depends on the absolute bucket (Beginning / Middle / Tail) of the evidence, not the linear distance between hops. On MuSiQue the between-bucket gap is ~4× the within-bucket variation.
- Weakest Link Effect. When evidence is split across buckets, multi-hop accuracy collapses toward the minimum single-bucket accuracy, not the average.
- Recognition is the bottleneck. Matched MFAI rescues low-visibility positions by up to 11.5% — the gap is attentional, not a lack of reasoning capacity.
- Task topology modulates robustness. Vertical reasoning chains (MuSiQue) are vulnerable to misleading MFAI; horizontal ones (NeoQA) are resilient.
- System-2 thinking overrides both biases, but at ~6× the output-token cost.
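The Weakest Link Effect makes a concrete, checkable prediction: when hops land in different buckets, expected accuracy tracks the *minimum* of the single-bucket accuracies rather than their mean. A minimal sketch with hypothetical bucket accuracies (illustrative numbers only, not results from the paper):

```python
# Hypothetical single-bucket accuracies (illustrative only, not paper numbers).
bucket_acc = {"beginning": 0.72, "middle": 0.41, "tail": 0.58}

def weakest_link_prediction(buckets):
    """Predicted multi-hop accuracy when hops land in `buckets`:
    the minimum single-bucket accuracy, not the average."""
    return min(bucket_acc[b] for b in buckets)

def average_prediction(buckets):
    """The naive alternative: the mean of the single-bucket accuracies."""
    accs = [bucket_acc[b] for b in buckets]
    return sum(accs) / len(accs)

# Evidence split across the beginning and middle buckets:
print(weakest_link_prediction(["beginning", "middle"]))  # -> 0.41
print(average_prediction(["beginning", "middle"]))       # ~ 0.565
```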
Findings replicate across MuSiQue, NeoQA, and 2WikiMultiHopQA (including MuSiQue 3- and 4-hop), and extend to the larger quantized Qwen2.5-32B-Instruct-GPTQ-Int8. Six LLMs evaluated: Llama-3.1-8B, Ministral-8B, Qwen2.5-{7B,14B}, and Qwen3-8B (thinking & non-thinking).
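MFAI itself is a natural-language cue that indexes evidence positions ahead of the question. The template below is a hypothetical illustration of the idea only; the repo's actual prompt wording lives in `src/utils/` and may differ:

```python
def mfai_cue(evidence_positions, n_docs=18):
    """Build a hypothetical MFAI-style instruction that points the model
    at the documents containing the evidence (1-indexed positions)."""
    docs = ", ".join(f"Document {i}" for i in evidence_positions)
    return (
        f"The context below contains {n_docs} documents. "
        f"Pay close attention to {docs}: they contain the evidence "
        "needed to answer the question."
    )

print(mfai_cue([2, 17]))
```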
```
weakest-link-effect/
├── src/
│   ├── data_pre/      # Dataset preprocessing (NeoQA, MuSiQue, 2Wiki)
│   ├── infer/         # Inference runners (entity QA, event QA)
│   ├── evaluate/      # Unified evaluation pipeline (YAML-driven)
│   ├── analyze/       # Spread / cross / multihop analysis and plotting
│   ├── att_heatmap/   # Attention heatmap visualization
│   └── utils/         # Shared prompting and answer-extraction utilities
├── scripts/
│   ├── infer/         # Experiment launch scripts
│   ├── evaluate/      # Evaluation orchestration
│   ├── analyze/       # Analysis automation
│   └── test/          # Reproducibility checks (paper number verification,
│                      #   paper table regeneration, order-sensitivity audit)
├── config/
│   └── evaluate.yaml  # Models, datasets, paths
├── READMEs/           # Depth docs (see §Documentation below)
├── tests/             # Unit + smoke tests
├── requirements.txt
├── LICENSE
└── CITATION.cff
```
```bash
conda create -n event python=3.10
conda activate event
pip install -r requirements.txt
```

Preprocess the three MHQA banks (18-document context, fixed distractors):
```bash
python src/data_pre/preprocess_neoqa_bank.py --output_dir dataset/processed_neoqa_bank --seed 42
python src/data_pre/preprocess_musique_bank.py --output_dir dataset/processed_musique_bank --seed 42
python src/data_pre/preprocess_2wiki_bank.py --output_dir dataset/processed_2wiki_bank --seed 42
```

Details: READMEs/README_EXPERIMENT.md.
Inference is served via vLLM.
```bash
# Start a local vLLM server (see scripts/infer/start_vllm_server.sh for options)
./scripts/infer/start_vllm_server.sh

# Launch the Spread / Cross experiments for a target model
./scripts/run_multihop_inference.sh
```

The evaluation pipeline is YAML-driven and self-contained.
```bash
# One model × dataset × experiment
python -m src.evaluate.evaluate --dataset musique --experiment spread --model Qwen2.5-7B-Instruct

# Everything configured in config/evaluate.yaml
python scripts/evaluate/run_evaluate.py
```

```bash
# Spread / Cross plots for any dataset
python -m src.analyze --dataset musique --analysis spread
python -m src.analyze --dataset neoqa --analysis cross

# Bootstrap CIs and McNemar's test
python -m src.analyze --dataset all --analysis stats
```

The following scripts are the canonical way to generate every numeric artifact
in the paper. They write their outputs to docs/ (created on first run).
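The statistics step above (`--analysis stats`: percentile-bootstrap CIs and McNemar's test on paired per-question correctness) can be sketched independently of the pipeline. The functions below are illustrative stand-ins, not the repo's implementation:

```python
import random
from math import comb

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile-bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def mcnemar_exact(a, b):
    """Exact (binomial) two-sided McNemar p-value on paired 0/1 outcomes:
    only the discordant pairs (one model right, the other wrong) matter."""
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n, k = n01 + n10, min(n01, n10)
    if n == 0:
        return 1.0
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```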
```bash
# Regenerate every appendix .tex table -> docs/paper_tables/
PYTHONPATH=. python scripts/test/generate_paper_tables.py

# Verify every numeric claim in the paper against raw results
# -> docs/paper_numbers_verification.{md,csv}
PYTHONPATH=. python scripts/test/verify_paper_numbers.py

# Reversed-distractor-order robustness audit
# -> docs/order_sensitivity_phase2_results.md
PYTHONPATH=. python scripts/test/analyze_order_phase2.py
```

The preprocessed 18-document banks for NeoQA (same- and random-timeline),
MuSiQue (2/3/4-hop), and 2WikiMultiHopQA (compositional, inference,
comparison, bridge-comparison) are available as a GitHub Release asset
on the v1.0.0 tag:
processed_dataset.tar.gz (≈ 37 MB) — extract at the repository root:

```bash
tar -xzf processed_dataset.tar.gz   # populates ./dataset/processed_*
```

The banks are deterministic from the preprocessing scripts in src/data_pre/
with --seed 42 (see the Data Sources and Baselines section below for
how to regenerate them from each dataset's original source).
We do not ship raw JSONL traces (the full set is ~12 GB). Re-run them from the shipped banks:
```bash
./scripts/infer/start_vllm_server.sh --model Qwen/Qwen2.5-7B-Instruct
./scripts/infer/run_qwen3_musique.sh --model Qwen2.5-7B-Instruct
./scripts/infer/run_qwen3_neoqa.sh --model Qwen2.5-7B-Instruct
./scripts/infer/run_2wiki_compositional.sh --model Qwen2.5-7B-Instruct
# ...see READMEs/README_EXPERIMENT.md for the full launch matrix
```

Once results/ is populated, analysis follows the same flow as the paper:
```bash
# Bootstrap CIs + McNemar's test
python -m src.analyze --dataset all --analysis stats

# Spread / Cross plots (per dataset)
python -m src.analyze --dataset musique --analysis spread
python -m src.analyze --dataset musique --analysis cross
# ...see READMEs/README_ANALYSIS.md for the full list

# Appendix tables + numeric-claim verification
PYTHONPATH=. python scripts/test/generate_paper_tables.py
PYTHONPATH=. python scripts/test/verify_paper_numbers.py
```

We build on the following public datasets and preprocessing pipelines. We ship only code and preprocessed banks derived from these sources; please cite the originals if you use them.
| Dataset | Paper | Access |
|---|---|---|
| MuSiQue | Trivedi et al., 2022. MuSiQue: Multihop Questions via Single-hop Question Composition | github.com/StonyBrookNLP/musique · HF: via Shahar6000/MoreDocsSameLen (our pipeline reads MuSiQue through this distractor-packaged HF dataset) |
| NeoQA | Glockner et al., 2025. NeoQA: Evidence-based Question Answering with Generated News Events (ACL Findings 2025) | github.com/amazon-science/NeoQA · HF: mglockner/neoqa (the context-ablation subset used for our Spread/Cross runs) |
| 2WikiMultiHopQA | Ho et al., 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps | github.com/Alab-NII/2wikimultihop · distractor-packaged copy via Spongeorge/long-context-multihop (file data/base/2wiki_2ndhalfvalid_adtldocs.json) |
| Repo | Used for | Paper |
|---|---|---|
| Spongeorge/long-context-multihop | 2WikiMultiHopQA distractor packaging; input to src/data_pre/preprocess_2wiki_bank{,_4hop}.py | Baker et al., 2024. Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA (arXiv:2412.10079) |
| shaharl6000/MoreDocsSameLen | MuSiQue distractor packaging (18-doc context) and open-ended EM / F1 scoring logic mirrored in src/evaluate/metrics.py | Levy et al., 2025. More Documents, Same Length (arXiv) |
| amazon-science/NeoQA | NeoQA data pipeline and prompt / answer-extraction baseline mirrored in src/infer/neoqa/ and src/utils/ | Glockner et al., 2025 (above) |
We re-use document packaging conventions and evaluation utilities from these
repos; our additions (MFAI probe, Spread/Cross protocols, multi-hop
extensions, unified analysis pipeline) live in src/.
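The open-ended EM / F1 scoring mirrored in src/evaluate/metrics.py follows the usual SQuAD-style convention; a minimal sketch (the repo's normalization details may differ):

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```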
If you prefer to regenerate the banks from scratch instead of downloading our
processed_dataset.tar.gz:
```bash
# MuSiQue — via Hugging Face (no manual download)
python src/data_pre/preprocess_musique_bank.py --output_dir dataset/processed_musique_bank --seed 42
python src/data_pre/preprocess_musique_bank_3_4hop.py --output_dir dataset/processed_musique_bank --seed 42

# NeoQA — follow the decryption steps at github.com/amazon-science/NeoQA,
# then place the context-ablation subset under dataset/neoqa_context-ablation/
python src/data_pre/preprocess_neoqa_bank.py --data_path dataset/neoqa_context-ablation \
    --output_dir dataset/processed_neoqa_bank \
    --num_docs 18 --seed 42
python src/data_pre/preprocess_neoqa_random_bank.py --data_path dataset/neoqa_context-ablation \
    --output_dir dataset/processed_neoqa_bank_random \
    --num_docs 18 --seed 42

# 2WikiMultiHopQA — clone long-context-multihop and point at its packaged JSON
python src/data_pre/preprocess_2wiki_bank.py \
    --input_path path/to/long-context-multihop/data/base/2wiki_2ndhalfvalid_adtldocs.json \
    --output_dir dataset/processed_2wiki_bank --seed 42
python src/data_pre/preprocess_2wiki_bank_4hop.py \
    --input_path path/to/long-context-multihop/data/base/2wiki_2ndhalfvalid_adtldocs.json \
    --output_dir dataset/processed_2wiki_bank --seed 42
```

Four topic-focused docs live under READMEs/:

- README_INTRO.md — research motivation, hypothesis, research questions.
- README_METHOD.md — technical method: I/O formats, module layout.
- README_EXPERIMENT.md — full experiment setup: datasets, preprocessing, models, metrics.
- README_ANALYSIS.md — analysis pipeline, visualization, result structure.
If you use this code, please cite:
```bibtex
@inproceedings{zhang2026weakestlink,
  title         = {Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck},
  author        = {Zhang, Meiru and Meng, Zaiqiao and Collier, Nigel},
  booktitle     = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
  year          = {2026},
  eprint        = {2601.12499},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2601.12499}
}
```

MIT — see LICENSE. Copyright (c) 2026 University of Cambridge Language Technology Lab; Meiru Zhang.
Open a GitHub issue, or email mz468 [at] cam.ac.uk.