Official code and analysis for the ACL 2026 paper:
**Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck**

Meiru Zhang, Zaiqiao Meng, Nigel Collier (University of Cambridge).
📄 Paper (arXiv) · 📦 Preprocessed data (Release v1.0.0)
We study why Large Language Models fail at multi-hop question answering in long contexts. Using Multi-Focus Attention Instruction (MFAI) — a training-free probe that explicitly indexes evidence positions via natural-language cues — we show:
- Step-function, not linear decay. Performance depends on the absolute bucket (Beginning / Middle / Tail) of the evidence, not the linear distance between hops. On MuSiQue the between-bucket gap is ~4× the within-bucket variation.
- Weakest Link Effect. When evidence is split across buckets, multi-hop accuracy collapses toward the minimum single-bucket accuracy, not the average.
- Recognition is the bottleneck. Matched MFAI rescues low-visibility positions by up to 11.5% — the gap is attentional, not a lack of reasoning capacity.
- Task topology modulates robustness. Vertical reasoning chains (MuSiQue) are vulnerable to misleading MFAI; horizontal ones (NeoQA) are resilient.
- System-2 thinking overrides both biases, but at ~6× the output-token cost.
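The Weakest Link Effect makes a concrete, checkable prediction: when hops land in different buckets, expected accuracy tracks the *minimum* of the single-bucket accuracies rather than their mean. A minimal sketch with hypothetical bucket accuracies (illustrative numbers only, not results from the paper):

```python
# Hypothetical single-bucket accuracies (illustrative only, not paper numbers).
bucket_acc = {"beginning": 0.72, "middle": 0.41, "tail": 0.58}

def weakest_link_prediction(buckets):
    """Predicted multi-hop accuracy when hops land in `buckets`:
    the minimum single-bucket accuracy, not the average."""
    return min(bucket_acc[b] for b in buckets)

def average_prediction(buckets):
    """The naive alternative: the mean of the single-bucket accuracies."""
    accs = [bucket_acc[b] for b in buckets]
    return sum(accs) / len(accs)

# Evidence split across the beginning and middle buckets:
print(weakest_link_prediction(["beginning", "middle"]))  # -> 0.41
print(average_prediction(["beginning", "middle"]))       # ~ 0.565
```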
Findings replicate across MuSiQue, NeoQA, and 2WikiMultiHopQA (including MuSiQue 3- and 4-hop), and extend to the larger quantized Qwen2.5-32B-Instruct-GPTQ-Int8. Six LLMs evaluated: Llama-3.1-8B, Ministral-8B, Qwen2.5-{7B,14B}, and Qwen3-8B (thinking & non-thinking).
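MFAI itself is a natural-language cue that indexes evidence positions ahead of the question. The template below is a hypothetical illustration of the idea only; the repo's actual prompt wording lives in `src/utils/` and may differ:

```python
def mfai_cue(evidence_positions, n_docs=18):
    """Build a hypothetical MFAI-style instruction that points the model
    at the documents containing the evidence (1-indexed positions)."""
    docs = ", ".join(f"Document {i}" for i in evidence_positions)
    return (
        f"The context below contains {n_docs} documents. "
        f"Pay close attention to {docs}: they contain the evidence "
        "needed to answer the question."
    )

print(mfai_cue([2, 17]))
```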
```
weakest-link-effect/
├── src/
│   ├── data_pre/      # Dataset preprocessing (NeoQA, MuSiQue, 2Wiki)
│   ├── infer/         # Inference runners (entity QA, event QA)
│   ├── evaluate/      # Unified evaluation pipeline (YAML-driven)
│   ├── analyze/       # Spread / cross / multihop analysis and plotting
│   ├── att_heatmap/   # Attention heatmap visualization
│   └── utils/         # Shared prompting and answer-extraction utilities
├── scripts/
│   ├── infer/         # Experiment launch scripts
│   ├── evaluate/      # Evaluation orchestration
│   ├── analyze/       # Analysis automation
│   └── test/          # Reproducibility checks (paper number verification,
│                      #   paper table regeneration, order-sensitivity audit)
├── config/
│   └── evaluate.yaml  # Models, datasets, paths
├── READMEs/           # Depth docs (see §Documentation below)
├── tests/             # Unit + smoke tests
├── requirements.txt
├── LICENSE
└── CITATION.cff
```
```bash
conda create -n event python=3.10
conda activate event
pip install -r requirements.txt
```

Preprocess the three MHQA banks (18-document context, fixed distractors):
```bash
python src/data_pre/preprocess_neoqa_bank.py --output_dir dataset/processed_neoqa_bank --seed 42
python src/data_pre/preprocess_musique_bank.py --output_dir dataset/processed_musique_bank --seed 42
python src/data_pre/preprocess_2wiki_bank.py --output_dir dataset/processed_2wiki_bank --seed 42
```

Details: READMEs/README_EXPERIMENT.md.
Inference is served via vLLM.
```bash
# Start a local vLLM server (see scripts/infer/start_vllm_server.sh for options)
./scripts/infer/start_vllm_server.sh

# Launch the Spread / Cross experiments for a target model
./scripts/run_multihop_inference.sh
```

The evaluation pipeline is YAML-driven and self-contained.
```bash
# One model × dataset × experiment
python -m src.evaluate.evaluate --dataset musique --experiment spread --model Qwen2.5-7B-Instruct

# Everything configured in config/evaluate.yaml
python scripts/evaluate/run_evaluate.py
```

```bash
# Spread / Cross plots for any dataset
python -m src.analyze --dataset musique --analysis spread
python -m src.analyze --dataset neoqa --analysis cross

# Bootstrap CIs and McNemar's test
python -m src.analyze --dataset all --analysis stats
```

The following scripts are the canonical way to generate every numeric artifact
in the paper. They write their outputs to docs/ (created on first run).
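The statistics step above (`--analysis stats`: percentile-bootstrap CIs and McNemar's test on paired per-question correctness) can be sketched independently of the pipeline. The functions below are illustrative stand-ins, not the repo's implementation:

```python
import random
from math import comb

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile-bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def mcnemar_exact(a, b):
    """Exact (binomial) two-sided McNemar p-value on paired 0/1 outcomes:
    only the discordant pairs (one model right, the other wrong) matter."""
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n, k = n01 + n10, min(n01, n10)
    if n == 0:
        return 1.0
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```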
```bash
# Regenerate every appendix .tex table -> docs/paper_tables/
PYTHONPATH=. python scripts/test/generate_paper_tables.py

# Verify every numeric claim in the paper against raw results
# -> docs/paper_numbers_verification.{md,csv}
PYTHONPATH=. python scripts/test/verify_paper_numbers.py

# Reversed-distractor-order robustness audit
# -> docs/order_sensitivity_phase2_results.md
PYTHONPATH=. python scripts/test/analyze_order_phase2.py
```

The preprocessed 18-document banks for NeoQA (same- and random-timeline),
MuSiQue (2/3/4-hop), and 2WikiMultiHopQA (compositional, inference,
comparison, bridge-comparison) are available as a GitHub Release asset
on the v1.0.0 tag:
processed_dataset.tar.gz (≈ 37 MB) — extract at the repository root:

```bash
tar -xzf processed_dataset.tar.gz   # populates ./dataset/processed_*
```

The banks are deterministic from the preprocessing scripts in src/data_pre/
with --seed 42 (see the Data Sources and Baselines section below for
how to regenerate them from each dataset's original source).
We do not ship raw JSONL traces (the full set is ~12 GB). Re-run them from the shipped banks:
```bash
./scripts/infer/start_vllm_server.sh --model Qwen/Qwen2.5-7B-Instruct
./scripts/infer/run_qwen3_musique.sh --model Qwen2.5-7B-Instruct
./scripts/infer/run_qwen3_neoqa.sh --model Qwen2.5-7B-Instruct
./scripts/infer/run_2wiki_compositional.sh --model Qwen2.5-7B-Instruct
# ...see READMEs/README_EXPERIMENT.md for the full launch matrix
```

Once results/ is populated, analysis follows the same flow as the paper:
```bash
# Bootstrap CIs + McNemar's test
python -m src.analyze --dataset all --analysis stats

# Spread / Cross plots (per dataset)
python -m src.analyze --dataset musique --analysis spread
python -m src.analyze --dataset musique --analysis cross
# ...see READMEs/README_ANALYSIS.md for the full list

# Appendix tables + numeric-claim verification
PYTHONPATH=. python scripts/test/generate_paper_tables.py
PYTHONPATH=. python scripts/test/verify_paper_numbers.py
```

We build on the following public datasets and preprocessing pipelines. We ship only code and preprocessed banks derived from these sources; please cite the originals if you use them.
| Dataset | Paper | Access |
|---|---|---|
| MuSiQue | Trivedi et al., 2022. MuSiQue: Multihop Questions via Single-hop Question Composition | github.com/StonyBrookNLP/musique · HF: via Shahar6000/MoreDocsSameLen (our pipeline reads MuSiQue through this distractor-packaged HF dataset) |
| NeoQA | Glockner et al., 2025. NeoQA: Evidence-based Question Answering with Generated News Events (ACL Findings 2025) | github.com/amazon-science/NeoQA · HF: mglockner/neoqa (the context-ablation subset used for our Spread/Cross runs) |
| 2WikiMultiHopQA | Ho et al., 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps | github.com/Alab-NII/2wikimultihop · distractor-packaged copy via Spongeorge/long-context-multihop (file data/base/2wiki_2ndhalfvalid_adtldocs.json) |
| Repo | Used for | Paper |
|---|---|---|
| Spongeorge/long-context-multihop | 2WikiMultiHopQA distractor packaging; input to src/data_pre/preprocess_2wiki_bank{,_4hop}.py | Baker et al., 2024. Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA (arXiv:2412.10079) |
| shaharl6000/MoreDocsSameLen | MuSiQue distractor packaging (18-doc context) and open-ended EM / F1 scoring logic mirrored in src/evaluate/metrics.py | Levy et al., 2025. More Documents, Same Length (arXiv) |
| amazon-science/NeoQA | NeoQA data pipeline and prompt / answer-extraction baseline mirrored in src/infer/neoqa/ and src/utils/ | Glockner et al., 2025 (above) |
We re-use document packaging conventions and evaluation utilities from these
repos; our additions (MFAI probe, Spread/Cross protocols, multi-hop
extensions, unified analysis pipeline) live in src/.
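The open-ended EM / F1 scoring mirrored in src/evaluate/metrics.py follows the usual SQuAD-style convention; a minimal sketch (the repo's normalization details may differ):

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```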
If you prefer to regenerate the banks from scratch instead of downloading our
processed_dataset.tar.gz:
```bash
# MuSiQue — via Hugging Face (no manual download)
python src/data_pre/preprocess_musique_bank.py --output_dir dataset/processed_musique_bank --seed 42
python src/data_pre/preprocess_musique_bank_3_4hop.py --output_dir dataset/processed_musique_bank --seed 42

# NeoQA — follow the decryption steps at github.com/amazon-science/NeoQA,
# then place the context-ablation subset under dataset/neoqa_context-ablation/
python src/data_pre/preprocess_neoqa_bank.py --data_path dataset/neoqa_context-ablation \
    --output_dir dataset/processed_neoqa_bank \
    --num_docs 18 --seed 42
python src/data_pre/preprocess_neoqa_random_bank.py --data_path dataset/neoqa_context-ablation \
    --output_dir dataset/processed_neoqa_bank_random \
    --num_docs 18 --seed 42

# 2WikiMultiHopQA — clone long-context-multihop and point at its packaged JSON
python src/data_pre/preprocess_2wiki_bank.py \
    --input_path path/to/long-context-multihop/data/base/2wiki_2ndhalfvalid_adtldocs.json \
    --output_dir dataset/processed_2wiki_bank --seed 42
python src/data_pre/preprocess_2wiki_bank_4hop.py \
    --input_path path/to/long-context-multihop/data/base/2wiki_2ndhalfvalid_adtldocs.json \
    --output_dir dataset/processed_2wiki_bank --seed 42
```

Four topic-focused docs live under READMEs/:

- README_INTRO.md — research motivation, hypothesis, research questions.
- README_METHOD.md — technical method: I/O formats, module layout.
- README_EXPERIMENT.md — full experiment setup: datasets, preprocessing, models, metrics.
- README_ANALYSIS.md — analysis pipeline, visualization, result structure.
If you use this code, please cite:
```bibtex
@inproceedings{zhang2026weakestlink,
  title         = {Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck},
  author        = {Zhang, Meiru and Meng, Zaiqiao and Collier, Nigel},
  booktitle     = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
  year          = {2026},
  eprint        = {2601.12499},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2601.12499}
}
```

MIT — see LICENSE. Copyright (c) 2026 University of Cambridge Language Technology Lab; Meiru Zhang.
Open a GitHub issue, or email mz468 [at] cam.ac.uk.