SOFTWARE
DocExtr
Document Extractor
Pull text out of documents in many formats, all in one pass.
- Linux
- Windows
- pip
pip install docextr
Per-OS single executables and the Python wheel will be available at launch. The source stays private, and each package ships with an installation and operation guide.
Architecture
Project Nature
An extraction engine for document digitization pipelines handling tens to hundreds of millions of records. The source code stays private, but it ships as executable packages (binaries and Python wheels) with a bundled user guide, so it can be used directly wherever a similar need arises.
The Problem We Solve
In large-scale document digitization environments, extracting data from nested archives (e.g., a zip inside a zip), handling diverse formats, and pulling metadata come up repeatedly. Consistent output in Plain Text or Markdown is required for AI and search system inputs, but existing tools are often limited to single formats or low throughput, making them difficult to apply directly to large-scale pipelines.
Usage
The single executable docextr takes a directory as input and either writes JSONL or loads directly into an RDBMS table. In Python, import docextr and extract from either a file path or in-memory bytes.
CLI
# Directory → JSONL Shard (8 Workers)
docextr run \
--input /data/documents \
--output /data/output \
--workers 8
# Direct load to RDBMS (JSONL omitted)
docextr run \
--input /data/documents \
--output-db "postgres://user:pass@host/db" \
--db-table extraction \
--db-mode replace
Python
import docextr
# (Optional) Specify Python parser path when using HWP OLE recovery or OCR fallback
docextr.set_python_path("/path/to/docextr/python/parsers")
# Extract via file path
result = docextr.extract("/data/report.hwp")
print(result["plain_text"]) # Body text
print(result["markdown"]) # Markdown
print(result["blake3"]) # BLAKE3 Hash
# Extract via in-memory bytes
with open("/data/report.pdf", "rb") as f:
    result = docextr.extract_bytes(f.read(), "report.pdf")
Supported Formats
- Document / Text: txt, md, html, docx, pptx, xlsx, xls, doc, hwp, hwpx, pdf, plus 100+ source/config types (c, cpp, java, py, json, xml, yaml, cmake, etc.)
- Scanned PDF (OCR): PaddleOCR v5 fallback (requires the OCR model; without it, files with no text layer raise an error)
- Archive (Recursive): zip, tar, tar.gz, tar.bz2, 7z (MAX_DEPTH 16, MAX_SIZE 4 GiB)
- Unsupported: rar (license restrictions), encrypted archives
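The recursive archive limits above can be illustrated with a small Python sketch. It is not the engine's implementation (the core is Rust and its source is private); it only shows the general idea of walking nested zips under a depth limit and a shared size budget. The names walk_zip, make_zip, and LimitExceeded are hypothetical, and how the real engine counts depth or size is an assumption here.

```python
import io
import zipfile

MAX_DEPTH = 16            # nesting limit quoted in the format list
MAX_SIZE = 4 * 1024 ** 3  # 4 GiB expansion budget quoted in the format list


class LimitExceeded(Exception):
    """Raised when nesting or total expanded size passes a limit."""


def walk_zip(data, depth=0, budget=None, prefix="", max_depth=MAX_DEPTH):
    """Yield (virtual_path, payload) for each leaf file in nested zips."""
    if budget is None:
        budget = [MAX_SIZE]  # one shared counter of bytes still allowed
    if depth >= max_depth:
        raise LimitExceeded(f"nesting deeper than {max_depth} levels")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Charge the declared size before decompressing the member.
            budget[0] -= info.file_size
            if budget[0] < 0:
                raise LimitExceeded("expanded size exceeds the budget")
            payload = zf.read(info)
            path = prefix + info.filename
            # Recurse by extension so zip-based formats such as docx stay leaves.
            if info.filename.lower().endswith(".zip"):
                yield from walk_zip(payload, depth + 1, budget, path + "!/", max_depth)
            else:
                yield path, payload


def make_zip(files):
    """Build an in-memory zip from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    return buffer.getvalue()
```

Sharing one budget across all nesting levels is what lets a check like this stop a zip bomb that looks small at every individual layer. A production implementation would also cap the bytes actually decompressed, since declared sizes can be forged.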
Output
Record Fields (4 Types)
- plain_text: Plain Text
- markdown: Markdown
- structured_json: Structured JSON (Title, Author, Meta)
- blake3: BLAKE3 hash (64-character hex)
RDBMS Load Targets (5 Types)
- SQLite
- PostgreSQL
- MySQL
- MariaDB
- Oracle
Shard outputs are JSONL or CSV. Refer to the enclosed guide for the full 14-column JSONL schema.
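As a consumer-side illustration, the following sketch checks one JSONL line for the four record fields listed above and for the 64-character hex shape of the hash. The function name check_record is hypothetical, and the exact JSON types of each field are assumptions; the enclosed guide remains the reference for the real schema.

```python
import json
import re

# The four record fields named in the Output section.
RECORD_FIELDS = ("plain_text", "markdown", "structured_json", "blake3")
HEX64 = re.compile(r"[0-9a-fA-F]{64}")


def check_record(line):
    """Return a list of problems found in one JSONL line (empty means OK)."""
    record = json.loads(line)
    problems = [f"missing field: {name}" for name in RECORD_FIELDS if name not in record]
    digest = record.get("blake3")
    # Only the shape is checked here; recomputing BLAKE3 needs a third-party library.
    if isinstance(digest, str) and not HEX64.fullmatch(digest):
        problems.append("blake3 is not 64 hex characters")
    return problems
```

A check like this is cheap enough to run on every shard before handing records to a search index or an embedding job.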
Features
- The core engine is Rust native, with Python fallbacks only for HWP and OCR.
- Recursive traversal of compressed archives up to 16 levels.
- 14-column JSONL — source_path · plain_text · markdown · structured_json · blake3 · file_type · file_size · error · title · author · created · word_count · quality_score · has_structure
- Outputs JSONL/CSV shards or loads directly into an RDBMS.
- Fault-tolerant error handling — failed files still produce a record so the pipeline keeps running.
- Throughput scales with the number of parallel workers (--workers N).
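Because failed files still produce a record, a downstream step usually needs to separate them from successful ones. The sketch below does that by looking at the error column from the 14-column list above. It assumes that error is null or empty on success, which the text does not state explicitly, and split_by_error is a hypothetical helper name.

```python
import json


def split_by_error(lines):
    """Split JSONL lines into (ok, failed) record lists using the error column."""
    ok, failed = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumption: a null or empty error value means the file extracted cleanly.
        if record.get("error"):
            failed.append(record)
        else:
            ok.append(record)
    return ok, failed
```

The failed list can then be written to a retry queue or a report while the ok list continues through the pipeline.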
Specifications & Measurement Environment
- Version: v0.1.0
- Deployment Form: Linux, Windows single executable + Python wheel
- Source: Private (Executable package download only)
- Runtime: Rust core + Python wheel (when the fallback parser is used)
- Python Library: import docextr (native binding)
- Processing Performance: Routes extraction by format and compression state, and runs in parallel across multiple workers.
- Stability Control: Defends against nested zip bombs, and when it hits a corrupted file, isolates just that file and keeps the pipeline running.
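The combination of parallel workers and per-file isolation can be sketched in Python as follows. This is an analogy only: the actual engine is Rust, and nothing here reflects its internals. The helpers safe_extract and run_isolated are hypothetical, and extract_fn stands in for any callable that takes a path and returns a dict, such as docextr.extract.

```python
from concurrent.futures import ThreadPoolExecutor


def safe_extract(extract_fn, path):
    """Run one extraction and always return a record, even on failure."""
    try:
        record = dict(extract_fn(path))
        record.setdefault("error", None)
    except Exception as exc:
        # The failure is captured in the record instead of stopping the batch.
        record = {"plain_text": "", "error": f"{type(exc).__name__}: {exc}"}
    record["source_path"] = path
    return record


def run_isolated(paths, extract_fn, workers=8):
    """Extract all paths in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: safe_extract(extract_fn, p), paths))
```

Catching the error inside the worker, rather than around the whole batch, is what keeps one corrupted file from discarding the results of its neighbors.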
Security & Compliance
- License: Private, Download-only (Source private)
- Trust Boundary: VFS with MAX_DEPTH 16, MAX_SIZE 4 GiB, and zip bomb defense
- Failure Isolation: Panic isolation per Extractor unit + fallback chain
- Output Integrity: A BLAKE3 hash is included with every record
- RDBMS Adapters: SQLite, PostgreSQL, MySQL, MariaDB, Oracle (feature gate)
- Tech Support: A formal maintenance and security-vulnerability patch support channel is provided under an adoption contract.
Getting Started
- Receive: obtain the official distribution package (binary and operation guide included).
- Install: set up the single executable or the Python wheel environment.
- Run: CLI: docextr <input> --output <jsonl|csv|db>
- Load: write JSONL/CSV or load directly into one of the 5 RDBMS targets (fixed 14-column schema).
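For teams that write JSONL shards first and load them later, the following sketch shows what a manual load into SQLite could look like using only the Python standard library. The column names come from the 14-column list in Features; the untyped columns, the load_jsonl helper, and the reading of "replace" as drop-and-recreate are all assumptions, not the documented behavior of the built-in loader.

```python
import json
import sqlite3

# Column names as listed in the Features section.
COLUMNS = (
    "source_path", "plain_text", "markdown", "structured_json", "blake3",
    "file_type", "file_size", "error", "title", "author", "created",
    "word_count", "quality_score", "has_structure",
)


def load_jsonl(conn, table, lines, replace=True):
    """Insert JSONL records into a SQLite table and return the row count."""
    cols = ", ".join(COLUMNS)
    marks = ", ".join("?" for _ in COLUMNS)
    if replace:
        # Assumption: replace mode means the table is dropped and rebuilt.
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    rows = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        row = []
        for name in COLUMNS:
            value = record.get(name)  # missing columns become NULL
            if isinstance(value, (dict, list)):
                value = json.dumps(value, ensure_ascii=False)  # store nested JSON as text
            row.append(value)
        rows.append(row)
    conn.executemany(f'INSERT INTO "{table}" ({cols}) VALUES ({marks})', rows)
    conn.commit()
    return len(rows)
```

Usage is a matter of opening a connection with sqlite3.connect and passing an open shard file as lines, since iterating a text file yields one JSONL record per line.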
Considering Cubiware for your organization?
We will guide you through setup and rollout tailored to your requirements and operating environment. Reach out for a demo or a proposal.