DocExtr

Document Extractor

Pull text out of documents in many formats, all in one pass.

v0.1.0 · Rust Core + Python Bindings · Single Executable + Python Wheel

Download v0.1.0 (coming soon)
  • Linux x86_64 · .tar.gz
  • Windows x86_64 · .zip
  • pip · pip install docextr

Per-OS single executables and the Python wheel will be available at launch. The source stays private, and each package ships with an installation and operation guide.

Architecture

VFS recursively explores nested archives up to 16 levels deep, and a parallel worker pool invokes the format-specific Extractor for each file. The output is written as JSONL/CSV shards or loaded directly into one of the five supported RDBMS targets.
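The recursive archive walk described above can be pictured with a short Python sketch. This is only an illustration built on the standard `zipfile` module, not DocExtr's Rust implementation: the function name `walk_archive`, the shared size budget, and the choice to recognise nested archives by their `.zip` extension are all assumptions made for the example. Only the two limits (16 levels, 4 GiB) come from this page.

```python
import io
import zipfile

MAX_DEPTH = 16            # nesting limit quoted on this page
MAX_SIZE = 4 * 1024 ** 3  # 4 GiB cap quoted on this page


def walk_archive(data, name="root.zip", depth=1, budget=None):
    """Yield (virtual_path, payload) for every leaf file inside nested zips.

    Hypothetical helper: depth counts archive levels, starting at 1 for
    the outermost archive, and `budget` is shared across all levels.
    """
    if budget is None:
        budget = [MAX_SIZE]
    if depth > MAX_DEPTH:
        raise ValueError(f"archive nesting exceeds {MAX_DEPTH} levels at {name}")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Charge the declared uncompressed size before reading anything,
            # so an oversized member is rejected without being inflated.
            budget[0] -= info.file_size
            if budget[0] < 0:
                raise ValueError(f"uncompressed size exceeds limit at {name}")
            payload = zf.read(info)
            path = f"{name}/{info.filename}"
            if info.filename.lower().endswith(".zip"):
                # Nested archive: descend one level with the same budget.
                yield from walk_archive(payload, path, depth + 1, budget)
            else:
                yield path, payload
```

Each yielded pair would then be handed to a format-specific extractor. Checking the declared size before reading is what keeps a deeply nested or highly compressed archive from exhausting memory in this sketch.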

Project Nature

DocExtr is an extraction engine for document digitization pipelines that handle tens to hundreds of millions of records. The source code stays private, but the engine ships as executable packages (binaries and Python wheels) with a bundled user guide, so it can be used directly wherever a similar need arises.

The Problem We Solve

Large-scale document digitization keeps running into the same tasks: extracting data from nested archives (e.g., a zip inside a zip), handling diverse formats, and pulling metadata. AI and search systems need consistent Plain Text or Markdown input, but existing tools are often limited to a single format or to low throughput, which makes them hard to apply directly to large-scale pipelines.

Usage

The single executable docextr takes a directory as input and either writes JSONL shards or loads the results directly into an RDBMS table. In Python, import docextr and extract from a file path or from in-memory bytes.

CLI

# Directory → JSONL Shard (8 Workers)
docextr run \
  --input  /data/documents \
  --output /data/output \
  --workers 8

# Load directly into an RDBMS (no JSONL written)
docextr run \
  --input  /data/documents \
  --output-db "postgres://user:pass@host/db" \
  --db-table extraction \
  --db-mode  replace

Python

import docextr

# (Optional) Specify Python parser path when using HWP OLE recovery or OCR fallback
docextr.set_python_path("/path/to/docextr/python/parsers")

# Extract via file path
result = docextr.extract("/data/report.hwp")
print(result["plain_text"])   # Body text
print(result["markdown"])     # Markdown
print(result["blake3"])       # BLAKE3 Hash

# Extract via in-memory bytes
with open("/data/report.pdf", "rb") as f:
    result = docextr.extract_bytes(f.read(), "report.pdf")
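
For a batch of files, the per-file call above could be wrapped in a small driver that writes one JSON object per line. The sketch below is an assumption-laden illustration, not part of the docextr API: `extract_to_jsonl` and the `path` field are hypothetical, the extractor is passed in as a plain callable, and only the `plain_text`, `markdown`, and `blake3` keys come from the example above. Whether `docextr.extract` is safe and efficient to call from several threads is not stated on this page, so treat the thread pool as a placeholder.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def extract_to_jsonl(paths, extract, out_path, workers=8):
    """Run `extract` over `paths` in parallel and write one record per line.

    `extract` is any callable returning a dict with the keys used below,
    for example docextr.extract. Returns the number of records written.
    """
    paths = list(paths)
    count = 0
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            open(out_path, "w", encoding="utf-8") as out:
        # pool.map preserves input order, so each result lines up with its path.
        for path, result in zip(paths, pool.map(extract, paths)):
            record = {
                "path": path,  # hypothetical field, not taken from the schema
                "plain_text": result["plain_text"],
                "markdown": result["markdown"],
                "blake3": result["blake3"],
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

With the wheel installed, a call might look like `extract_to_jsonl(paths, docextr.extract, "out.jsonl")`. For large directories the CLI shown earlier is likely the better fit, since it handles sharding and worker management itself.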

Supported Formats

Output

Record Fields (4 Types)

Direct RDBMS Loading (5 Types)
  • SQLite
  • PostgreSQL
  • MySQL
  • MariaDB
  • Oracle

Shard outputs are JSONL or CSV. Refer to the enclosed guide for the full 14-column JSONL schema.
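
To make the direct-loading idea concrete, here is a minimal sketch that loads JSONL records into SQLite using only the Python standard library. It is not how DocExtr's adapters work internally. The three columns are the fields shown in the Python usage example; the full 14-column schema is documented only in the bundled guide and is not reproduced here. Dropping and recreating the table is an assumed reading of the `replace` mode from the CLI example.

```python
import json
import sqlite3


def load_jsonl_into_sqlite(lines, conn, table="extraction"):
    """Load JSONL records into a SQLite table, replacing earlier contents.

    Hypothetical helper. `table` is interpolated into SQL, so it must be
    a trusted identifier, never user input.
    """
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(
        f'CREATE TABLE "{table}" (blake3 TEXT, plain_text TEXT, markdown TEXT)'
    )
    rows = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        rows.append((rec.get("blake3"), rec.get("plain_text"), rec.get("markdown")))
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?, ?)', rows)
    conn.commit()
    return len(rows)
```

Passing an open file object as `lines` works because files iterate line by line, e.g. `load_jsonl_into_sqlite(open("out.jsonl", encoding="utf-8"), sqlite3.connect("out.db"))`.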

Features

Specifications & Measurement Environment

Version: v0.1.0
Deployment Form: Single executable for Linux and Windows + Python wheel
Source: Private (executable package download only)
Runtime: Rust core + Python wheel (when the fallback parser is used)
Python Library: import docextr (native binding)
Processing Performance: Routes extraction by format and compression state, and runs in parallel across multiple workers.
Stability Control: Defends against nested zip bombs; when it hits a corrupted file, it isolates just that file and keeps the pipeline running.

Security & Compliance

License: Private, download-only (source private)
Trust Boundary: VFS · MAX_DEPTH 16 · MAX_SIZE 4GiB · zip bomb defense
Failure Isolation: Panic isolation per Extractor unit + fallback chain
Output Integrity: A BLAKE3 hash is included with every record
RDBMS Adapters: SQLite, PostgreSQL, MySQL, MariaDB, Oracle (feature gate)
Tech Support: A formal maintenance and security-vulnerability patch support channel is provided under an adoption contract.
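
The Failure Isolation entry above combines two ideas: a failure in one extractor must not take down the run, and a fallback extractor gets a chance before the file is given up on. The Python sketch below illustrates that pattern under stated assumptions. The function names and record shape are hypothetical, and `try`/`except` stands in for the panic isolation that the Rust core performs.

```python
def extract_with_fallback(data, extractors):
    """Try each (name, fn) pair in order and return the first success.

    Errors from earlier extractors are collected rather than raised, so the
    caller can see why a fallback was needed.
    """
    errors = []
    for name, fn in extractors:
        try:
            return {"extractor": name, "text": fn(data), "errors": errors}
        except Exception as exc:  # contain the failure to this extractor
            errors.append(f"{name}: {exc}")
    # Every extractor failed: report it in the record instead of raising.
    return {"extractor": None, "text": None, "errors": errors}


def run_batch(items, extractors):
    """Process every (name, data) item; one bad file never stops the batch."""
    return {name: extract_with_fallback(data, extractors) for name, data in items}
```

A file that defeats every extractor still produces a record, with `extractor` set to `None` and the collected error messages, so the rest of the batch proceeds and the failure stays visible downstream.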

Getting Started

  1. Receive the official distribution package (including binary and operation guide)
  2. Install — set up the single executable or the Python wheel environment.
  3. Run — CLI: docextr run --input <dir> --output <dir>, or --output-db <url> to load directly into a database.
  4. Load — write JSONL/CSV or load directly into one of the 5 RDBMS targets (fixed 14-column schema).

Considering Cubiware for your organization?

We will guide you through setup and rollout tailored to your requirements and operating environment. Reach out for a demo or a proposal.