
Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck

Official code and analysis for the ACL 2026 paper:

Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck
Meiru Zhang, Zaiqiao Meng, Nigel Collier (University of Cambridge).

📄 Paper (arXiv) · 📦 Preprocessed data (Release v1.0.0)

TL;DR

We study why Large Language Models fail at multi-hop question answering in long contexts. Using Multi-Focus Attention Instruction (MFAI) — a training-free probe that explicitly indexes evidence positions via natural-language cues — we show:

  1. Step-function, not linear decay. Performance depends on the absolute bucket (Beginning / Middle / Tail) of the evidence, not the linear distance between hops. On MuSiQue the between-bucket gap is ~4× the within-bucket variation.
  2. Weakest Link Effect. When evidence is split across buckets, multi-hop accuracy collapses toward the minimum single-bucket accuracy, not the average.
  3. Recognition is the bottleneck. Matched MFAI rescues low-visibility positions by up to 11.5%; the gap is attentional, not a lack of reasoning capacity.
  4. Task topology modulates robustness. Vertical reasoning chains (MuSiQue) are vulnerable to misleading MFAI; horizontal ones (NeoQA) are resilient.
  5. System-2 thinking overrides both biases, but at ~6× the output-token cost.

Findings replicate across MuSiQue, NeoQA, 2WikiMultiHopQA, and MuSiQue 3- and 4-hop, and extend to the larger Qwen2.5-32B-Instruct-GPTQ-Int8. Six LLMs evaluated: Llama-3.1-8B, Ministral-8B, Qwen2.5-{7B,14B}, and Qwen3-8B (thinking & non-thinking).
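As a toy illustration of the Weakest Link Effect, the following sketch contrasts the min-aggregation we observe with the naive mean-aggregation one might expect. The bucket accuracies here are hypothetical numbers for illustration, not results from the paper:

```python
# Toy illustration of the Weakest Link Effect.
# The single-bucket accuracies below are HYPOTHETICAL, not paper results.
single_bucket_acc = {"beginning": 0.72, "middle": 0.48, "tail": 0.65}

def weakest_link_prediction(buckets, acc=single_bucket_acc):
    """Cross-bucket accuracy collapses toward the minimum single-bucket accuracy."""
    return min(acc[b] for b in buckets)

def average_prediction(buckets, acc=single_bucket_acc):
    """Naive alternative: the mean of the involved buckets' accuracies."""
    return sum(acc[b] for b in buckets) / len(buckets)

# Evidence split across the Beginning and Middle buckets:
print(weakest_link_prediction(["beginning", "middle"]))  # 0.48 (what we observe)
print(average_prediction(["beginning", "middle"]))       # 0.60 (what averaging predicts)
```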

Repository Layout

weakest-link-effect/
├── src/
│   ├── data_pre/         # Dataset preprocessing (NeoQA, MuSiQue, 2Wiki)
│   ├── infer/            # Inference runners (entity QA, event QA)
│   ├── evaluate/         # Unified evaluation pipeline (YAML-driven)
│   ├── analyze/          # Spread / cross / multihop analysis and plotting
│   ├── att_heatmap/      # Attention heatmap visualization
│   └── utils/            # Shared prompting and answer-extraction utilities
├── scripts/
│   ├── infer/            # Experiment launch scripts
│   ├── evaluate/         # Evaluation orchestration
│   ├── analyze/          # Analysis automation
│   └── test/             # Reproducibility checks (paper number verification,
│                         #   paper table regeneration, order-sensitivity audit)
├── config/
│   └── evaluate.yaml     # Models, datasets, paths
├── READMEs/              # In-depth docs (see §Documentation below)
├── tests/                # Unit + smoke tests
├── requirements.txt
├── LICENSE
└── CITATION.cff

Quick Start

1. Environment

conda create -n event python=3.10
conda activate event
pip install -r requirements.txt

2. Data preparation

Preprocess the three MHQA banks (18-document context, fixed distractors):

python src/data_pre/preprocess_neoqa_bank.py   --output_dir dataset/processed_neoqa_bank   --seed 42
python src/data_pre/preprocess_musique_bank.py --output_dir dataset/processed_musique_bank --seed 42
python src/data_pre/preprocess_2wiki_bank.py   --output_dir dataset/processed_2wiki_bank   --seed 42

Details: READMEs/README_EXPERIMENT.md.
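The Beginning / Middle / Tail buckets over the 18-document context can be pictured with a small sketch. The equal-thirds split below is an illustrative assumption, not necessarily the exact boundaries the preprocessing scripts use (see READMEs/README_EXPERIMENT.md for those):

```python
# Hypothetical mapping from document position to positional bucket.
# The equal-thirds split (6 docs per bucket) is an ASSUMPTION for illustration.
NUM_DOCS = 18
BUCKETS = ("beginning", "middle", "tail")

def bucket_of(doc_index: int, num_docs: int = NUM_DOCS) -> str:
    """Map a 0-based document position to its coarse positional bucket."""
    third = num_docs // len(BUCKETS)
    return BUCKETS[min(doc_index // third, len(BUCKETS) - 1)]

print([bucket_of(i) for i in (0, 8, 17)])  # ['beginning', 'middle', 'tail']
```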

3. Inference

Inference is served via vLLM.

# Start a local vLLM server (see scripts/infer/start_vllm_server.sh for options)
./scripts/infer/start_vllm_server.sh

# Launch the Spread / Cross experiments for a target model
./scripts/run_multihop_inference.sh
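For reference, vLLM's server speaks the OpenAI-compatible chat API, so a single request looks roughly like the sketch below. The port, endpoint path, and model name are assumptions about the local setup; check scripts/infer/start_vllm_server.sh for the values actually used:

```python
import json
from urllib.request import Request

# Default vLLM serving endpoint (ASSUMED; your server flags may differ).
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(question: str, context: str,
                       model: str = "Qwen2.5-7B-Instruct") -> Request:
    """Build one OpenAI-compatible chat request for the local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user",
                      "content": f"{context}\n\nQuestion: {question}"}],
        "temperature": 0.0,  # greedy decoding for reproducibility
    }
    return Request(VLLM_URL, data=json.dumps(payload).encode("utf-8"),
                   headers={"Content-Type": "application/json"})

req = build_chat_request("Who founded the company?", "<18-document context>")
# urllib.request.urlopen(req) would send it once the server is running.
```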

4. Evaluation

The evaluation pipeline is YAML-driven and self-contained.

# One model × dataset × experiment
python -m src.evaluate.evaluate --dataset musique --experiment spread --model Qwen2.5-7B-Instruct

# Everything configured in config/evaluate.yaml
python scripts/evaluate/run_evaluate.py
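The open-ended EM / F1 metrics follow the standard SQuAD-style conventions that the repo notes it mirrors from MoreDocsSameLen. A minimal sketch of that convention (an illustration, not the repo's exact code in src/evaluate/metrics.py):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def f1(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # True
print(round(f1("Paris, France", "Paris"), 2))           # 0.67
```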

5. Analysis & plotting

# Spread / Cross plots for any dataset
python -m src.analyze --dataset musique --analysis spread
python -m src.analyze --dataset neoqa   --analysis cross

# Bootstrap CIs and McNemar's test
python -m src.analyze --dataset all --analysis stats
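Both statistics behind --analysis stats are standard; here is a self-contained sketch over hypothetical per-example correctness vectors (not real results, and not the repo's implementation):

```python
import math
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for mean accuracy."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the two discordant counts."""
    n, k = b + c, min(b, c)
    p_tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)

# Hypothetical per-example correctness for two systems on the same items.
sys_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
sys_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
b = sum(a and not s for a, s in zip(sys_a, sys_b))  # A right, B wrong
c = sum(s and not a for a, s in zip(sys_a, sys_b))  # B right, A wrong
print(mcnemar_exact(b, c))          # 0.25
print(bootstrap_ci(sys_a))          # (low, high) CI for sys_a's accuracy
```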

6. Reproduce paper tables and number checks

The following scripts are the canonical way to generate every numeric artifact in the paper. They write their outputs to docs/ (created on first run).

# Regenerate every appendix .tex table -> docs/paper_tables/
PYTHONPATH=. python scripts/test/generate_paper_tables.py

# Verify every numeric claim in the paper against raw results
# -> docs/paper_numbers_verification.{md,csv}
PYTHONPATH=. python scripts/test/verify_paper_numbers.py

# Reversed-distractor-order robustness audit
# -> docs/order_sensitivity_phase2_results.md
PYTHONPATH=. python scripts/test/analyze_order_phase2.py

Preprocessed Data

The preprocessed 18-document banks for NeoQA (same- and random-timeline), MuSiQue (2/3/4-hop), and 2WikiMultiHopQA (compositional, inference, comparison, bridge-comparison) are available as a GitHub Release asset on the v1.0.0 tag:

  • processed_dataset.tar.gz (≈ 37 MB) — extract at the repository root:
tar -xzf processed_dataset.tar.gz       # populates ./dataset/processed_*

The banks are deterministically reproducible from the preprocessing scripts in src/data_pre/ with --seed 42 (see the Data Sources and Baselines section below for how to regenerate them from each dataset's original source).

Reproducing Raw Inference Outputs

We do not ship raw JSONL traces (the full set is ~12 GB). Re-run them from the shipped banks:

./scripts/infer/start_vllm_server.sh --model Qwen/Qwen2.5-7B-Instruct
./scripts/infer/run_qwen3_musique.sh     --model Qwen2.5-7B-Instruct
./scripts/infer/run_qwen3_neoqa.sh       --model Qwen2.5-7B-Instruct
./scripts/infer/run_2wiki_compositional.sh --model Qwen2.5-7B-Instruct
# ...see READMEs/README_EXPERIMENT.md for the full launch matrix

Once results/ is populated, analysis follows the same flow as the paper:

# Bootstrap CIs + McNemar's test
python -m src.analyze --dataset all --analysis stats

# Spread / Cross plots (per dataset)
python -m src.analyze --dataset musique --analysis spread
python -m src.analyze --dataset musique --analysis cross
# ... see READMEs/README_ANALYSIS.md for the full list

# Appendix tables + numeric-claim verification
PYTHONPATH=. python scripts/test/generate_paper_tables.py
PYTHONPATH=. python scripts/test/verify_paper_numbers.py

Data Sources and Baselines

We build on the following public datasets and preprocessing pipelines. We ship only code and preprocessed banks derived from these sources; please cite the originals if you use them.

Datasets (originals)

  • MuSiQue (Trivedi et al., 2022, "MuSiQue: Multihop Questions via Single-hop Question Composition"): github.com/StonyBrookNLP/musique · HF via Shahar6000/MoreDocsSameLen (our pipeline reads MuSiQue through this distractor-packaged HF dataset)
  • NeoQA (Glockner et al., 2025, "NeoQA: Evidence-based Question Answering with Generated News Events", ACL Findings 2025): github.com/amazon-science/NeoQA · HF: mglockner/neoqa (subset context-ablation for our Spread/Cross runs)
  • 2WikiMultiHopQA (Ho et al., 2020, "Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps"): github.com/Alab-NII/2wikimultihop · distractor-packaged copy via Spongeorge/long-context-multihop (file data/base/2wiki_2ndhalfvalid_adtldocs.json)

Preprocessing and evaluation baselines

  • Spongeorge/long-context-multihop: 2WikiMultiHopQA distractor packaging; input to src/data_pre/preprocess_2wiki_bank{,_4hop}.py. Paper: Baker et al., 2024, "Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA" (arXiv:2412.10079)
  • shaharl6000/MoreDocsSameLen: MuSiQue distractor packaging (18-doc context) and open-ended EM / F1 scoring logic mirrored in src/evaluate/metrics.py. Paper: Levy et al., 2025, "More Documents, Same Length" (arXiv)
  • amazon-science/NeoQA: NeoQA data pipeline and prompt / answer-extraction baseline mirrored in src/infer/neoqa/ and src/utils/. Paper: Glockner et al., 2025 (above)

We re-use document packaging conventions and evaluation utilities from these repos; our additions (MFAI probe, Spread/Cross protocols, multi-hop extensions, unified analysis pipeline) live in src/.

How to obtain the raw data

If you prefer to regenerate the banks from scratch instead of downloading our processed_dataset.tar.gz:

# MuSiQue — via Hugging Face (no manual download)
python src/data_pre/preprocess_musique_bank.py       --output_dir dataset/processed_musique_bank --seed 42
python src/data_pre/preprocess_musique_bank_3_4hop.py --output_dir dataset/processed_musique_bank --seed 42

# NeoQA — follow the decryption steps at github.com/amazon-science/NeoQA
# then place the context-ablation subset under dataset/neoqa_context-ablation/
python src/data_pre/preprocess_neoqa_bank.py        --data_path dataset/neoqa_context-ablation \
                                                    --output_dir dataset/processed_neoqa_bank \
                                                    --num_docs 18 --seed 42
python src/data_pre/preprocess_neoqa_random_bank.py --data_path dataset/neoqa_context-ablation \
                                                    --output_dir dataset/processed_neoqa_bank_random \
                                                    --num_docs 18 --seed 42

# 2WikiMultiHopQA — clone long-context-multihop and point at its packaged JSON
python src/data_pre/preprocess_2wiki_bank.py \
    --input_path path/to/long-context-multihop/data/base/2wiki_2ndhalfvalid_adtldocs.json \
    --output_dir dataset/processed_2wiki_bank --seed 42
python src/data_pre/preprocess_2wiki_bank_4hop.py \
    --input_path path/to/long-context-multihop/data/base/2wiki_2ndhalfvalid_adtldocs.json \
    --output_dir dataset/processed_2wiki_bank --seed 42

Documentation

Four topic-focused docs live under READMEs/, including README_EXPERIMENT.md (data preparation and inference) and README_ANALYSIS.md (analysis commands).

Citation

If you use this code, please cite:

@inproceedings{zhang2026weakestlink,
  title     = {Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck},
  author    = {Zhang, Meiru and Meng, Zaiqiao and Collier, Nigel},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
  year      = {2026},
  eprint    = {2601.12499},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url       = {https://arxiv.org/abs/2601.12499}
}

License

MIT — see LICENSE. Copyright (c) 2026 University of Cambridge Language Technology Lab; Meiru Zhang.

Contact

Open a GitHub issue, or email mz468 [at] cam.ac.uk.

About

Initial camera-ready release accompanying the ACL 2026 paper "Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck"
