OpenPecha/nalanda-docx-formater

Nalanda DOCX Formatter

A tool that reads a DOCX book file and detects Tibetan chapter title paragraphs by their font styling. It replaces each matched title with a formatted header table containing a Nalanda University logo, the styled chapter title, and a QR code linking to the chapter's URL.

Installation

pip install .

For development:

pip install -e ".[dev]"

System Requirements

PDF conversion is optional (DOCX output is always produced) and requires one of the following:

  • LibreOffice (recommended) — best fidelity for Tibetan fonts and table formatting
    sudo apt install libreoffice    # Debian/Ubuntu
  • Pandoc + XeLaTeX (fallback) — handles Unicode/Tibetan well
    sudo apt install pandoc texlive-xetex    # Debian/Ubuntu

If neither is installed, the tool will produce the DOCX output and skip PDF generation.
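
The fallback order described above could be sketched roughly as follows. This is an illustration only, not the tool's actual code: the function name `pick_pdf_converter` and the executable names it probes (`libreoffice`, `soffice`, `pandoc`, `xelatex`) are assumptions about a typical install.

```python
import shutil
from typing import Callable, Optional


def pick_pdf_converter(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return "libreoffice", "pandoc", or None, in that order of preference.

    Hypothetical helper: `which` is injectable so the logic can be
    exercised without installing anything.
    """
    # LibreOffice's command-line binary is commonly named either
    # "libreoffice" or "soffice", depending on the platform (assumption).
    if which("libreoffice") or which("soffice"):
        return "libreoffice"
    # The fallback needs both Pandoc and the XeLaTeX engine on PATH.
    if which("pandoc") and which("xelatex"):
        return "pandoc"
    # Neither converter found: keep the DOCX and skip PDF generation.
    return None
```

Calling `pick_pdf_converter()` with no arguments probes the real `PATH`; a return value of `None` corresponds to the DOCX-only case.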

Usage

nalanda-docx-format book.docx chapters.yaml nalanda_logo.png -o ./output/

Arguments

Argument            Description
book.docx           Source DOCX book file
chapters.yaml       YAML chapter lookup dictionary
nalanda_logo.png    Nalanda University logo image (PNG)
-o / --output       Output directory (default: ./output/)
-v / --verbose      Enable verbose (DEBUG) logging

Chapter YAML Format

chapter_1:
  chapter_title: "༄༅། །ཆོས་ཀྱི་དབྱིངས་སུ་བསྟོད་པ།"
  chapter_url: "https://example.com/chapter1"
chapter_2:
  chapter_title: "༄༅། །དཔེ་མེད་པར་བསྟོད་པ།"
  chapter_url: "https://example.com/chapter2"

Each entry must have:

  • chapter_title — the exact Tibetan title text as it appears in the DOCX (font: Monlam Uni Ouchen5, 18pt)
  • chapter_url — URL to encode in the QR code
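
Because titles are matched as exact text, it can help to check the YAML entries and normalize the titles before comparing. The sketch below assumes the YAML has already been parsed into a plain dict; `build_title_lookup` and `match_title` are hypothetical names, and stripping surrounding whitespace is an added assumption rather than documented behavior.

```python
import unicodedata


def build_title_lookup(chapters: dict) -> dict:
    """Map NFC-normalized chapter_title -> (chapter_id, chapter_url)."""
    lookup = {}
    for chapter_id, entry in chapters.items():
        # Every entry must carry both required keys.
        missing = {"chapter_title", "chapter_url"} - set(entry)
        if missing:
            raise ValueError(f"{chapter_id}: missing {sorted(missing)}")
        # NFC gives canonically equivalent code point sequences one form.
        key = unicodedata.normalize("NFC", entry["chapter_title"].strip())
        lookup[key] = (chapter_id, entry["chapter_url"])
    return lookup


def match_title(text: str, lookup: dict):
    """Return (chapter_id, chapter_url) for a detected title, or None."""
    return lookup.get(unicodedata.normalize("NFC", text.strip()))
```

With the example file above loaded into `chapters`, `match_title(detected_text, build_title_lookup(chapters))` returns the chapter id and URL when the detected paragraph text is canonically equivalent to a `chapter_title`, and `None` otherwise.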

How It Works

  1. Opens the DOCX as a ZIP archive and parses the XML directly using lxml
  2. Scans all paragraphs for runs styled with Monlam Uni Ouchen5 at 18pt (the title font)
  3. Matches detected titles against the YAML chapter lookup (NFC-normalized for Tibetan text)
  4. For each match, generates a QR code and builds a header table with three columns:
    • Logo (10% width)
    • Styled title text (80% width)
    • QR code (10% width)
  5. Replaces the original title paragraph with the header table
  6. Saves the modified DOCX and optionally converts to PDF
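
Steps 1 and 2 can be illustrated with a small standard-library sketch. It is not the tool's implementation: it uses `xml.etree.ElementTree` in place of lxml, the function names are hypothetical, and it only inspects formatting set directly on a run (formatting inherited from a paragraph or character style is ignored). WordprocessingML stores font sizes in half-points, so 18pt appears as `36`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
TITLE_FONT = "Monlam Uni Ouchen5"
TITLE_HALF_POINTS = "36"  # w:sz / w:szCs are in half-points: 18pt -> 36


def _is_title_run(run) -> bool:
    """True if the run directly sets the title font and the 18pt size."""
    rpr = run.find("w:rPr", NS)
    if rpr is None:
        return False
    fonts = rpr.find("w:rFonts", NS)
    # The font name may sit in w:ascii, w:hAnsi, w:cs, or w:eastAsia.
    has_font = fonts is not None and TITLE_FONT in fonts.attrib.values()
    sizes = [
        el.get(f"{{{W}}}val")
        for tag in ("w:sz", "w:szCs")
        for el in rpr.findall(tag, NS)
    ]
    return has_font and TITLE_HALF_POINTS in sizes


def find_title_paragraphs(docx_bytes: bytes) -> list:
    """Return the text of each paragraph containing title-styled runs."""
    # A DOCX file is a ZIP archive; the body lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    titles = []
    for para in root.iter(f"{{{W}}}p"):
        runs = [r for r in para.findall("w:r", NS) if _is_title_run(r)]
        if runs:
            text = "".join(
                t.text or "" for r in runs for t in r.findall("w:t", NS)
            )
            titles.append(text)
    return titles
```

For example, `find_title_paragraphs(open("book.docx", "rb").read())` would list candidate titles, which could then be compared against the YAML lookup in step 3.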

Output

The tool produces:

  • output/<bookname>.docx — modified DOCX with header tables
  • output/<bookname>.pdf — PDF version (if LibreOffice or Pandoc is available)
  • output/qr_<chapter_id>.png — generated QR code images

Running Tests

PYTHONPATH=src pytest

License

MIT License
