SOFTWARE

DocExtr

Document Extractor

Pull text out of documents in many formats, all in one pass.

v0.1.0 · Rust Core + Python Bindings · Single Executable + Python Wheel

Download v0.1.0 (coming soon)

Linux x86_64 · .tar.gz
Windows x86_64 · .zip

pip · pip install docextr

Per-OS single executables and the Python wheel will be available at launch. The source stays private, and each package ships with an installation and operation guide.

Architecture

The VFS recursively explores nested archives up to 16 levels deep, and a parallel worker pool invokes the format-specific Extractor for each file. Output is either written as JSONL/CSV files or loaded directly into tables in one of the 5 supported RDBMS targets.
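
The traversal idea can be illustrated with a small Python sketch. This is not the Rust VFS itself: it handles zip only, detects nested archives by file extension, and omits the worker pool. The names walk_zip and make_zip are hypothetical; only the limits (16 levels, 4 GiB) come from this page.

```python
import io
import zipfile

MAX_DEPTH = 16            # nesting limit quoted on this page
MAX_SIZE = 4 * 1024**3    # 4 GiB size cap quoted on this page


def walk_zip(data, path="", depth=1, max_depth=MAX_DEPTH, max_size=MAX_SIZE):
    """Yield (virtual_path, payload_or_None, error_or_None) for every leaf entry."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            virtual = f"{path}/{info.filename}" if path else info.filename
            # Check the declared uncompressed size before decompressing.
            if info.file_size > max_size:
                yield virtual, None, "entry exceeds size limit"
                continue
            payload = archive.read(info)
            if virtual.lower().endswith(".zip"):
                # A nested archive: descend unless the depth limit is reached.
                if depth >= max_depth:
                    yield virtual, None, "nesting exceeds depth limit"
                    continue
                yield from walk_zip(payload, virtual, depth + 1, max_depth, max_size)
            else:
                yield virtual, payload, None


def make_zip(entries):
    """Build an in-memory zip from a {name: bytes} mapping (helper for the demo)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


if __name__ == "__main__":
    inner = make_zip({"a.txt": b"hello"})
    outer = make_zip({"inner.zip": inner, "b.txt": b"world"})
    for virtual, payload, error in walk_zip(outer):
        print(virtual, payload, error)
```

In this sketch, entries that break a limit are reported as error tuples rather than raised, so one oversized or over-nested entry does not stop the walk.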

Project Nature

An extraction engine for document digitization pipelines handling tens to hundreds of millions of records. The source code stays private, but it ships as executable packages (binaries and Python wheels) with a bundled user guide, so it can be used directly wherever a similar need arises.

The Problem We Solve

In large-scale document digitization environments, extracting data from nested archives (e.g., a zip inside a zip), handling diverse formats, and pulling metadata come up repeatedly. Consistent output in Plain Text or Markdown is required for AI and search system inputs, but existing tools are often limited to single formats or low throughput, making them difficult to apply directly to large-scale pipelines.

Usage

The single executable docextr takes a directory as input and either writes JSONL or loads directly into an RDBMS table. In Python, import docextr and extract from file paths or from in-memory bytes.

CLI

# Directory → JSONL Shard (8 Workers)
docextr run \
  --input  /data/documents \
  --output /data/output \
  --workers 8

# Direct load to RDBMS (JSONL omitted)
docextr run \
  --input  /data/documents \
  --output-db "postgres://user:pass@host/db" \
  --db-table extraction \
  --db-mode  replace
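
When the CLI is driven from a Python job scheduler, the argument list can be assembled programmatically. The sketch below uses only the flags shown in the two examples above. The helper name build_run_command is hypothetical, and treating --output and --output-db as alternatives is an assumption based on those examples.

```python
import shlex


def build_run_command(input_dir, output_dir=None, db_url=None,
                      db_table=None, db_mode=None, workers=None):
    """Return an argv list for `docextr run` using only the flags shown above."""
    if output_dir is None and db_url is None:
        raise ValueError("pass output_dir or db_url")
    command = ["docextr", "run", "--input", str(input_dir)]
    if output_dir is not None:
        command += ["--output", str(output_dir)]
    if db_url is not None:
        command += ["--output-db", db_url]
        if db_table is not None:
            command += ["--db-table", db_table]
        if db_mode is not None:
            command += ["--db-mode", db_mode]
    if workers is not None:
        command += ["--workers", str(workers)]
    return command


if __name__ == "__main__":
    command = build_run_command("/data/documents", output_dir="/data/output", workers=8)
    # Print the shell-quoted form; pass the list to subprocess.run to execute it.
    print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would run the job without shell quoting issues, provided the docextr executable is on PATH.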

Python

import docextr

# (Optional) Specify Python parser path when using HWP OLE recovery or OCR fallback
docextr.set_python_path("/path/to/docextr/python/parsers")

# Extract via file path
result = docextr.extract("/data/report.hwp")
print(result["plain_text"])   # Body text
print(result["markdown"])     # Markdown
print(result["blake3"])       # BLAKE3 Hash

# Extract via in-memory bytes
with open("/data/report.pdf", "rb") as f:
    result = docextr.extract_bytes(f.read(), "report.pdf")

Supported Formats

Document / Text: txt, md, html, docx, pptx, xlsx, xls, doc, hwp, hwpx, pdf
Source / Config: 100+ types (c, cpp, java, py, json, xml, yaml, cmake, etc.)
Scanned PDF (OCR): PaddleOCR v5 fallback (requires the OCR model; without it, files with no text layer raise an error)
Archive (Recursive): zip, tar, tar.gz, tar.bz2, 7z (MAX_DEPTH 16 · MAX_SIZE 4 GiB)
Unsupported: rar (license restrictions) · encrypted archives

Output

Record Fields (4 Types)

plain_text: Plain text
markdown: Markdown
structured_json: Structured JSON (title, author, metadata)
blake3: BLAKE3 hash (64-character hex)

Direct RDBMS Loading (5 Types)

SQLite
PostgreSQL
MySQL
MariaDB
Oracle

Shard outputs are JSONL or CSV. Refer to the enclosed guide for the full 14-column JSONL schema.
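
For teams that keep the JSONL shards and load them later, the equivalent of a replace-mode load into SQLite might look like the sketch below. It is not the built-in adapter. The function load_records is hypothetical, the column names are the 14 listed under Features, and the untyped columns and drop-and-recreate behavior are assumptions.

```python
import json
import sqlite3

# The 14 columns listed under Features on this page.
COLUMNS = [
    "source_path", "plain_text", "markdown", "structured_json", "blake3",
    "file_type", "file_size", "error", "title", "author", "created",
    "word_count", "quality_score", "has_structure",
]


def to_cell(value):
    """Convert a JSON value into something sqlite3 can store."""
    if isinstance(value, (dict, list)):
        return json.dumps(value, ensure_ascii=False)
    if isinstance(value, bool):
        return int(value)
    return value


def load_records(conn, table, records, replace=True):
    """Insert records into `table`; drop and recreate it first when replace is True."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    column_list = ", ".join(COLUMNS)
    if replace:
        conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({column_list})")
    placeholders = ", ".join("?" for _ in COLUMNS)
    rows = [tuple(to_cell(record.get(name)) for name in COLUMNS) for record in records]
    conn.executemany(
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})", rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = {"source_path": "a.txt", "plain_text": "hi", "has_structure": False}
    print(load_records(conn, "extraction", [sample]))
```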

Features

The core engine is Rust native, with Python fallbacks only for HWP and OCR.
Recursive traversal of compressed archives up to 16 levels.
14-column JSONL — source_path · plain_text · markdown · structured_json · blake3 · file_type · file_size · error · title · author · created · word_count · quality_score · has_structure
Outputs JSONL/CSV shards or loads directly into an RDBMS.
Fault-tolerant error handling — failed files still produce a record so the pipeline keeps running.
Throughput scales with the number of parallel workers (--workers N).
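
Because failed files still produce a record, a downstream consumer usually starts by separating good records from failures. The sketch below checks the 14 columns named above and splits on the error field. It assumes that a successful record carries a null or empty error and that the hash is hexadecimal; split_records and has_valid_hash are hypothetical helpers.

```python
import json
import re

# The 14 columns named in the feature list above.
COLUMNS = [
    "source_path", "plain_text", "markdown", "structured_json", "blake3",
    "file_type", "file_size", "error", "title", "author", "created",
    "word_count", "quality_score", "has_structure",
]
HEX64 = re.compile(r"[0-9a-fA-F]{64}")


def split_records(lines):
    """Parse JSONL lines and return (succeeded, failed) record lists."""
    succeeded, failed = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in COLUMNS if name not in record]
        if missing:
            raise ValueError(f"line {number}: missing columns {missing}")
        # Assumption: a non-empty `error` value marks a failed file.
        (failed if record["error"] else succeeded).append(record)
    return succeeded, failed


def has_valid_hash(record):
    """True when `blake3` looks like a 64-character hex digest."""
    value = record["blake3"]
    return isinstance(value, str) and HEX64.fullmatch(value) is not None


if __name__ == "__main__":
    blank = {name: None for name in COLUMNS}
    lines = [
        json.dumps(dict(blank, source_path="a.txt", blake3="ab" * 32)),
        json.dumps(dict(blank, source_path="b.pdf", error="no text layer")),
    ]
    succeeded, failed = split_records(lines)
    print(len(succeeded), len(failed))
```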

Specifications & Measurement Environment

Version: v0.1.0
Deployment Form: Linux, Windows single executable + Python wheel
Source: Private (Executable package download only)
Runtime: Rust core; Python is needed only when a fallback parser (HWP, OCR) is used
Python Library: import docextr (Native binding)
Processing Performance: Routes extraction by format and compression state, and runs in parallel across multiple workers.
Stability Control: Defends against nested zip bombs, and when it hits a corrupted file, isolates just that file and keeps the pipeline running.

Security & Compliance

License: Private, Download-only (Source private)
Trust Boundary: VFS with MAX_DEPTH 16 · MAX_SIZE 4 GiB · zip bomb defense
Failure Isolation: Panic isolation per Extractor unit + fallback chain
Output Integrity: A BLAKE3 hash is included with every record
RDBMS Adapters: SQLite, PostgreSQL, MySQL, MariaDB, Oracle (feature gate)
Tech Support: A formal maintenance and security-vulnerability patch support channel is provided under an adoption contract.
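
The failure-isolation idea (try each extractor, contain its failure, fall through to the next, and always emit a record) can be sketched in Python as follows. The engine isolates Rust panics, which this example only approximates with exception handling. The function extract_with_fallbacks and the reduced record shape are hypothetical.

```python
def extract_with_fallbacks(source_path, data, extractors):
    """Try (name, callable) extractors in order; never raise, always return a record."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(data)
        except Exception as exc:
            # Contain the failure to this extractor and move on to the next one.
            errors.append(f"{name}: {exc}")
            continue
        return {"source_path": source_path, "plain_text": text, "error": None}
    # Every extractor failed: still emit a record so the pipeline keeps running.
    return {"source_path": source_path, "plain_text": "", "error": "; ".join(errors)}


if __name__ == "__main__":
    def native(data):
        raise ValueError("corrupt header")

    def fallback(data):
        return data.decode("utf-8")

    chain = [("native", native), ("fallback", fallback)]
    print(extract_with_fallbacks("report.hwp", b"hello", chain))
```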

Getting Started

Receive: obtain the official distribution package (binary and operation guide included).
Install: set up the single executable or the Python wheel environment.
Run: from the CLI, docextr run --input <dir> with either --output <dir> or --output-db <url>.
Load: write JSONL/CSV shards or load directly into one of the 5 RDBMS targets (fixed 14-column schema).

Considering Cubiware for your organization?

We will guide you through setup and rollout tailored to your requirements and operating environment. Reach out for a demo or a proposal.