Fix missing or wrong Tibetan stacked syllables in PDF text extraction.
Tibetan PDFs often embed fonts with incomplete or incorrect ToUnicode CMap
entries for stacked syllable glyphs. Standard extraction tools then silently
drop subjoined letters — རྡོ་རྗེ་ becomes རོ་རེ་ — or insert spurious
characters. This tool patches the in-memory CMap using a pre-built database
derived from the original fonts' GSUB ligature tables. You can either
re-extract corrected Unicode text, or write a new PDF whose ToUnicode maps
are fixed so copy-paste and other tools see the right Tibetan.
pip install git+https://github.com/gangagyatso4364/pdf-to-xml-unicoded-font.gitRequirements: Python 3.8+, PyMuPDF (fitz), fontTools.
tibetan-pdf-fix document.pdf
# writes: document.raw.txt document.patched.txt document.diff.txt
tibetan-pdf-fix doc1.pdf doc2.pdf doc3.pdfWrite a patched PDF only (same ToUnicode logic as above; no .txt files):
tibetan-pdf-fix --patch-pdf document.pdf
# writes: document.patched.pdf (input PDF is not overwritten)
tibetan-pdf-fix -p doc1.pdf doc2.pdf # short formfrom tibetan_pdf_fix import extract_tibetan_pdf
result = extract_tibetan_pdf("document.pdf")
print(result["patched"]) # corrected Unicode text
print(result["stats"]) # {"fonts_seen": 3, "patched": 2, "upgrades": 1583, ...}
print(result["char_delta"]) # net change in character count
# Output files are written by default to the same directory as the PDF:
# document.raw.txt — original extraction (before patching)
# document.patched.txt — corrected extraction
# document.diff.txt — every changed lineOutput to a different directory:
result = extract_tibetan_pdf("document.pdf", output_dir="/tmp/output")In-memory only (no files written):
result = extract_tibetan_pdf("document.pdf", write_files=False)Patch the PDF and return (or write) the corrected document:
from tibetan_pdf_fix import patch_tibetan_pdf
result = patch_tibetan_pdf("document.pdf")
# default: writes document.patched.pdf next to the input
# result["pdf_bytes"] — patched PDF as bytes
# result["stats"] — same shape as extract_tibetan_pdf
# result["output_path"] — Path written, or None if write_file=False
result = patch_tibetan_pdf("document.pdf", output_path="/tmp/out.pdf")
result = patch_tibetan_pdf("document.pdf", write_file=False) # bytes only, no fileThe bundled database (reverse_db.json, 2.4 MB) covers 68 Tibetan font
variants from three families:
| Family | Variants |
|---|---|
| Monlam Uni OuChan | OuChan1 through OuChan5 |
| Himalaya | Himalaya-A through Himalaya-N, Himalaya-SN, himalaya0 |
| Jomolhari | Jomolhari |
| (others) | Additional variants in bodyig.zip |
The database is built from GSUB type-4 ligature rules in each font, so stacked syllables of arbitrary depth are correctly decomposed — not just the simple vowel+consonant combinations.
Only Type0 / CID / Identity-H fonts are patched. These fonts preserve the original GID in the PDF, so the database lookup is exact.
TrueType simple-encoding fonts (typical of Ghostscript-generated PDFs) are currently not patched. In Ghostscript PDFs the char-codes are sequentially assigned by Ghostscript and do not correspond to original font GIDs, making reliable lookup impossible without reading the embedded subset's glyph order.
- For each Tibetan font in the PDF, look up its GID→Unicode mapping in the
pre-built
reverse_db.json. - Parse the PDF's existing ToUnicode CMap for that font.
- Replace each entry where the database has a mapping (the database is authoritative — it comes from the font's own GSUB table, not from the PDF creator's heuristics).
- Write the corrected CMap back into the in-memory PDF, then either
re-extract text (
extract_tibetan_pdf) or serialise the document to a new PDF file (patch_tibetan_pdf/tibetan-pdf-fix --patch-pdf).
See docs/approach.md for the full technical explanation.
Lines changed: 2,540 / 5,361 Char delta: +9,969
| Before (wrong) | After (correct) |
|---|---|
ཀོང་ཡངས་རོལ་བའི་རྣལ་འབོར་པ་ |
ཀློང་ཡངས་རོལ་བའི་རྣལ་འབྱོར་པ་ |
ཀི་ཟབ་གཏེར་ |
ཀྱི་ཟབ་གཏེར་ |
རོ་རེའི་སེ་ཕེང་ |
རྡོ་རྗེའི་སྐྱེ་ཕྲེང་ |
སིང་གསོལ་འདེབས་ |
སྙིང་གསོལ་འདེབས་ |
བིན་རླབས་ |
བྱིན་རླབས་ |
Full output: docs/examples/TI1751-01-001/
Lines changed: 10,205 / 11,979 Char delta: -23,725
Word had embedded incorrect multi-character sequences for many vowel glyphs
(spurious subjoined-ja ྗ inserted before every ོ):
| Before (wrong) | After (correct) |
|---|---|
བྗོད་གངས་ཅན་ |
བོད་གངས་ཅན་ |
དང་པྗོ་དཔར་སྐྲུན་ |
དང་པོ་དཔར་སྐྲུན་ |
ཐྗོས་བསམ་སྗོམ་གསུམ་ |
ཐོས་བསམ་སྒོམ་གསུམ་ |
མྱིག་འདུན་ |
མིག་འདུན་ |
Full output: docs/examples/TI1055-01-001/
The pre-built reverse_db.json is included in the package. If you have
bodyig.zip (the source font archive, ~69 MB, not included in this repo),
you can rebuild it:
# Place bodyig.zip in the scripts/ directory, then:
pip install fonttools
python scripts/build_reverse_db.pytibetan_pdf_fix/ Python package
├── extractor.py Core logic: patch ToUnicode, extract text or emit PDF
└── data/
└── reverse_db.json GID -> Unicode database (bundled)
scripts/
├── build_reverse_db.py Rebuild database from bodyig.zip
└── build_glyph_db.py Legacy alternate builder
docs/
├── approach.md Full technical explanation
└── examples/ Example PDFs and before/after outputs
├── TI1055-01-001/ Word PDF (.pdf, .raw.txt, .patched.txt, .diff.txt)
└── TI1751-01-001/ InDesign PDF (.pdf, .raw.txt, .patched.txt, .diff.txt)
MIT