A secure file redaction service that automatically detects and redacts sensitive information from various file types.
- Python 3.8 or higher
- Node.js 14 or higher
- pip3
- System dependencies (installed automatically by setup script):
  - tesseract-ocr
  - python3-dev
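
Before running the setup script, it can help to confirm that the prerequisites above are present. The following is a minimal sketch, not part of the project: the function name `check_prerequisites` is hypothetical, the executable names (`node`, `pip3`, `tesseract`) are assumed to match what your system installs, and it only checks that the tools are on the PATH, not the Node.js version.

```python
import shutil
import sys

# Assumed executable names; some systems install these tools under other names.
REQUIRED_TOOLS = ["node", "pip3", "tesseract"]


def check_prerequisites():
    """Return a mapping of requirement name to whether it appears to be available."""
    # The running interpreter is the only version we can check directly.
    status = {"python>=3.8": sys.version_info >= (3, 8)}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on the PATH.
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as missing can be installed with the manual steps in this README.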
If the setup script doesn't work for your system, you can install dependencies manually:
- Install system dependencies:
  - For Ubuntu/Debian:
      sudo apt-get install tesseract-ocr python3-dev
  - For Arch Linux:
      sudo pacman -Sy tesseract
  - For Fedora:
      sudo dnf install tesseract
  - For macOS:
      brew install tesseract
- Create and activate Python virtual environment:
    python3 -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
- Install Python dependencies:
    pip install --upgrade pip
    pip install -r redaction/requirements.txt
    python -m spacy download en_core_web_lg
- Install Node.js dependencies:
    cd server
    npm install
- Start the server:
    npm start

The server will be running at http://localhost:3000
- Images: PNG, JPEG
- Documents: PDF, DOCX, PPTX, XLSX
- Text: TXT, RTF, CSV, JSON, XML
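
As an illustration of how the supported types above could be routed to different handlers, here is a minimal sketch that classifies a file by its extension. It is not the service's actual code: the `SUPPORTED` table and the `classify` function are hypothetical, and treating both `.jpg` and `.jpeg` as JPEG is an assumption.

```python
from pathlib import Path

# Hypothetical lookup table built from the supported types listed in this README.
SUPPORTED = {
    "image": {".png", ".jpg", ".jpeg"},  # assumption: both JPEG suffixes are accepted
    "document": {".pdf", ".docx", ".pptx", ".xlsx"},
    "text": {".txt", ".rtf", ".csv", ".json", ".xml"},
}


def classify(filename):
    """Return 'image', 'document', or 'text', or None for unsupported files."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the extension
    for category, extensions in SUPPORTED.items():
        if suffix in extensions:
            return category
    return None
```

For example, `classify("scan.PNG")` returns `"image"`, while `classify("notes.md")` returns `None`. Note that an extension check alone does not verify file contents.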
- Automatic detection and redaction of sensitive information
- Support for multiple file types
- Real-time processing
- User-friendly interface
See the LICENSE file for details.
- Smart Detection: Identifies 20+ PII types (emails, phones, IDs, etc.) using Microsoft Presidio
- Accurate Redaction: Maintains content structure after redaction
- Web Interface: Simple drag-and-drop UI
- Secure Processing: Files processed in-memory (never stored permanently)
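
To show what structure-preserving redaction can look like, here is a simplified sketch that masks matches with same-length placeholders, so line breaks and character offsets stay unchanged. It is a stand-in using plain regular expressions, not Microsoft Presidio: the `PATTERNS` table and the `redact` function are hypothetical, and the two patterns are deliberately rough (the phone pattern, for instance, can also match other long digit sequences).

```python
import re

# Illustrative patterns only; a real detector such as Presidio covers far more cases.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Digits separated by spaces, dots, dashes, or parentheses (never newlines).
    "PHONE": re.compile(r"\+?\d[\d ().-]{7,}\d"),
}


def redact(text):
    """Mask each match with asterisks of the same length, entirely in memory."""
    for pattern in PATTERNS.values():
        # A same-length mask keeps the surrounding layout and offsets intact.
        text = pattern.sub(lambda m: "*" * len(m.group()), text)
    return text


if __name__ == "__main__":
    print(redact("Mail bob@example.com or call 555-123-4567."))
```

Because the mask has the same length as the original match, the redacted text can be written back into the same positions of the source document.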
Frontend:
- HTML5/CSS3
- JavaScript (ES6+)
Backend:
- Node.js (Express)
- Python 3.8+ (Flask)
- Key Modules:
  - Microsoft Presidio (analysis)
  - Pytesseract (text extraction)
- Node.js v16+
- Python 3.8+
# Clone repository
git clone https://github.com/nbdevanandan/hack-the-future