Vietnamese text processing and audio alignment project.
- Sentence Splitting: Split Vietnamese text files into one sentence per line
- Sentence Counting: Count sentences in Vietnamese text files
- Audio Alignment: Align Vietnamese audio with text and cut into sentence segments
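As a rough illustration of the first two features, the sketch below splits text on sentence-ending punctuation and counts the result. It is a minimal sketch, not the logic in src/sentence_splitter.py: the names `split_sentences` and `count_sentences` and the regex rule are assumptions made for this example, and the rule is naive (it will also split after abbreviations such as "TP." and inside decimals).

```python
import re

# Hypothetical rule: a sentence is a run of non-terminator characters followed
# by one or more terminators (. ! ? …) and any closing quotes or brackets.
# A trailing fragment without a terminator is kept as its own sentence.
_SENTENCE = re.compile(r'[^.!?…]+(?:[.!?…]+["”’)\]]*|$)')


def split_sentences(text):
    """Return the sentences found in text, one string per sentence."""
    # Collapse line breaks and repeated spaces so wrapped lines are rejoined.
    normalized = " ".join(text.split())
    sentences = (match.group().strip() for match in _SENTENCE.finditer(normalized))
    return [sentence for sentence in sentences if sentence]


def count_sentences(text):
    """Count sentences using the same naive rule as split_sentences."""
    return len(split_sentences(text))


if __name__ == "__main__":
    sample = "Tôi đi học. Bạn có khỏe không? Trời đẹp quá!"
    # Print one sentence per line, mirroring the processed file layout.
    print("\n".join(split_sentences(sample)))
```

Writing the returned list with one sentence per line, using UTF-8 encoding, gives the same layout as the processed files described in this README.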
Install the Python dependencies:

    pip install -r requirements.txt

Note: For audio processing, you also need ffmpeg:
- Windows: Download from ffmpeg.org or use choco install ffmpeg
- macOS: brew install ffmpeg
- Linux: sudo apt-get install ffmpeg (Ubuntu/Debian)
Split a text file into sentences (one per line):

    python src/sentence_splitter.py <input_file> [-o output_file]

Example:

    python src/sentence_splitter.py "data/Thiên thần nhỏ của tôi - Nguyễn Nhật Ánh.txt"

Count sentences in the data files:

    cd src
    python -c "from main import count_sentences_in_data_files; count_sentences_in_data_files()"

WhisperX provides more accurate word-level timestamps than basic Whisper:

    python src/align_vietnamese_audio_whisperx.py <audio_file> <sentences_file> [output_dir] [model] [device]

Example:

    python src/align_vietnamese_audio_whisperx.py audio.wav "data/Text-ThienThanNhoCuaToi/Track 1.txt" output_audio base cpu

Parameters:
- audio_file: Path to audio file (wav, mp3, etc.)
- sentences_file: Path to processed text file (one sentence per line)
- output_dir: Output directory for audio segments (default: audio_segments)
- model: Whisper model size - tiny, base, small, medium, large (default: base)
- device: cpu or cuda (default: cpu)

Basic Whisper alignment (faster, but with less accurate timestamps than WhisperX):

    python src/align_vietnamese_audio.py <audio_file> <sentences_file> [output_dir] [model]

Output (both methods):
- sentence_00001.wav, sentence_00002.wav, ... - Audio segments
- sentence_00001.txt, sentence_00002.txt, ... - Corresponding text files
- timestamps.txt - Timestamps for each sentence
- transcription.txt - Full transcription
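The cutting step can be sketched as follows: once each sentence has a start and end time, the audio is sliced at those times and written out with the numbered file names listed above. This is a minimal sketch under stated assumptions, not the code in the alignment scripts: the function name `cut_segments` is hypothetical, the spans are assumed to be (start, end) pairs in seconds, and it uses only the standard-library `wave` module, so it handles uncompressed WAV input only.

```python
import wave
from pathlib import Path


def cut_segments(audio_path, spans, output_dir="audio_segments"):
    """Write one WAV file per (start_sec, end_sec) span and return the paths."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(audio_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for index, (start, end) in enumerate(spans, start=1):
            # Convert seconds to frame offsets and clamp them to the file.
            first = max(0, min(total, int(start * rate)))
            last = max(first, min(total, int(end * rate)))
            src.setpos(first)
            frames = src.readframes(last - first)
            # Assumed naming scheme: sentence_00001.wav, sentence_00002.wav, ...
            target = out_dir / f"sentence_{index:05d}.wav"
            with wave.open(str(target), "wb") as dst:
                # Copy channel count, sample width and rate from the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(target)
    return written
```

For example, `cut_segments("audio.wav", [(0.0, 2.4), (2.4, 5.1)], "output_audio")` would write two segments. The span values here are made up for illustration; in practice they would come from the aligner's timestamps.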
For the highest accuracy, consider using MFA (Montreal Forced Aligner):
- Install MFA: conda install -c conda-forge montreal-forced-aligner
- Download Vietnamese acoustic model: mfa model download acoustic vietnamese_mfa
- Use the MFA command-line tool (mfa align) for alignment
MFA provides the most accurate forced alignment but requires more setup.
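The alignment step can be sketched as below. MFA's align command takes, in order, a corpus directory (each audio file next to a same-named transcript), a pronunciation dictionary, an acoustic model, and an output directory. The helper name `build_mfa_align_command` and the directory names are hypothetical, and the dictionary name `vietnamese_mfa` is an assumption: MFA needs a dictionary in addition to the acoustic model, so check MFA's model listings for the exact name before relying on it.

```python
import shlex


def build_mfa_align_command(corpus_dir, output_dir,
                            dictionary="vietnamese_mfa",
                            acoustic_model="vietnamese_mfa"):
    """Build the argument list for MFA's align command without running it."""
    # Argument order: corpus directory, dictionary, acoustic model, output directory.
    # The default dictionary name is an assumption, not taken from this project.
    return ["mfa", "align", str(corpus_dir), dictionary, acoustic_model, str(output_dir)]


if __name__ == "__main__":
    command = build_mfa_align_command("output_audio", "mfa_alignments")
    # Show the command as it would be typed in a shell.
    print(shlex.join(command))
```

With MFA installed and the models downloaded, the list can be passed to `subprocess.run(command, check=True)`.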
NLP-midterm/
├── data/ # Text files
│ ├── *.txt # Original text files
│ └── *.processed.txt # Processed (one sentence per line)
├── src/
│ ├── sentence_splitter.py # Sentence splitting script
│ ├── main.py # Main processing and counting
│ ├── align_vietnamese_audio.py # Basic Whisper alignment (faster)
│ └── align_vietnamese_audio_whisperx.py # WhisperX alignment (more accurate)
└── requirements.txt # Python dependencies
- All text files use UTF-8 encoding to support Vietnamese diacritics
- For best accuracy among the included scripts: Use WhisperX (align_vietnamese_audio_whisperx.py) - provides better word-level timestamps
- For fastest processing: Use basic Whisper (align_vietnamese_audio.py)
- For highest accuracy: Consider MFA (Montreal Forced Aligner) - requires separate installation
- Recommended Whisper models: base or small for Vietnamese
- Audio loading uses librosa (no ffmpeg required for MP3 files)
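On the UTF-8 note above: Vietnamese letters can be stored either as single precomposed code points or as a base letter plus combining marks, and the two forms look identical but compare as different strings. The sketch below reads a file as UTF-8 and normalizes it to NFC so that comparisons behave predictably. The helper name `read_vietnamese_text` is hypothetical, and the normalization step is a suggestion made for this example, not something the project is known to do.

```python
import unicodedata
from pathlib import Path


def read_vietnamese_text(path):
    """Read a UTF-8 text file and return its contents in NFC form."""
    # Pass the encoding explicitly so the platform default is never used.
    raw = Path(path).read_text(encoding="utf-8")
    # NFC merges base letters and combining marks into precomposed characters.
    return unicodedata.normalize("NFC", raw)
```

Passing `encoding="utf-8"` explicitly matters most on Windows, where the default text encoding is often not UTF-8.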