oomol-lab / pdf-craft

PDF Craft converts PDF files into various other formats. The project focuses on processing PDF files of scanned books.

4,959 stars
322 forks
42 issues
Python, Shell

Repository Overview (README excerpt)

PDF Craft

Introduction

pdf-craft converts PDF files into various other formats, with a focus on handling scanned book PDFs. This project is based on DeepSeek OCR for document recognition. It supports the recognition of complex content such as tables and formulas. With GPU acceleration, pdf-craft can complete the entire conversion process from PDF to Markdown or EPUB locally.

During the conversion, pdf-craft automatically identifies document structure, accurately extracts body text, and filters out interfering elements like headers and footers. For academic or technical documents containing footnotes, formulas, and tables, pdf-craft handles them properly, preserving these important elements (including images and other assets within footnotes). When converting to EPUB, the table of contents is automatically generated. The final Markdown or EPUB files maintain the content integrity and readability of the original book.

Lightweight and Fast

Starting from the official v1.0.0 release, pdf-craft fully embraces DeepSeek OCR and no longer relies on an LLM for text correction. This change brings significant performance improvements: the entire conversion process is completed locally without network requests, eliminating the long waits and occasional network failures of the old version.

However, the new version has also removed the LLM text correction feature. If your use case still requires this functionality, you can continue using the old version v0.2.8.

Online Demo

We provide an online demo platform that lets you experience PDF Craft's conversion capabilities without any installation. You can directly upload PDF files and convert them.

Quick Start

Installation

The quick-setup installation commands cover only the Python package. To actually use pdf-craft, you need to **install Poppler** for PDF parsing (required for all use cases) and **configure a CUDA environment** for OCR recognition (required for actual conversion). Please refer to the Installation Guide for detailed instructions.
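The two prerequisites named in the installation notes can be checked before a first run. The following is a minimal sketch, not part of pdf-craft: it looks on the system PATH for `pdftoppm` and `pdfinfo`, two command-line tools that ship with Poppler, and it assumes the OCR stack runs on PyTorch when probing for CUDA. The function name `check_prerequisites` is hypothetical.

```python
import shutil


def check_prerequisites() -> dict[str, bool]:
    """Report whether Poppler and a CUDA-capable PyTorch appear to be available."""
    status = {
        # pdftoppm and pdfinfo are command-line tools distributed with Poppler.
        "poppler": shutil.which("pdftoppm") is not None
        and shutil.which("pdfinfo") is not None,
        "cuda": False,
    }
    try:
        # Assumption: the OCR models run on PyTorch, so torch is the right probe.
        import torch

        status["cuda"] = bool(torch.cuda.is_available())
    except ImportError:
        # PyTorch is not installed, so CUDA cannot be used from Python.
        pass
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A missing entry only means the tool was not found from the current environment; the Installation Guide remains the reference for how to set each one up.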
Quick Start and Detailed Usage

Both sections cover two conversions: Convert to Markdown and Convert to EPUB.

Model Management

pdf-craft depends on DeepSeek OCR models, which are automatically downloaded from Hugging Face on first run. You can control model storage and loading behavior through two parameters.

Pre-download Models

In production environments, it is recommended to download models in advance to avoid downloading on first run.

Specify Model Cache Path

By default, models are downloaded to the system's Hugging Face cache directory. You can customize the cache location through a dedicated parameter.

Offline Mode

If you have pre-downloaded the models, you can disable network downloads and ensure only local models are used.

API Reference

OCR Models

The OCR model parameter accepts one of five sizes:
• Smallest model, fastest speed
• Small model
• Base model
• Large model
• Largest model, highest quality (default)

Table Rendering Methods
• HTML format (default)
• Clipping format (directly clips table images from the original PDF scan)

Formula Rendering Methods
• MathML format (default)
• SVG format
• Clipping format (directly clips formula images from the original PDF scan)

Inline LaTeX

An EPUB-only parameter controls whether to preserve inline LaTeX expressions in the output. When enabled, inline mathematical formulas are preserved as LaTeX code, which can be rendered by compatible EPUB readers.

Table of Contents Detection

pdf-craft handles table of contents (TOC) extraction in one of two modes:
• No TOC pages assumed (default for Markdown): The conversion generates the TOC from document headings only, without detecting or processing TOC pages.
• TOC pages assumed (default for EPUB): The conversion uses statistical analysis to detect TOC pages and extract chapter structure.
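To illustrate what generating a TOC from document headings only can look like, here is a minimal sketch that nests Markdown headings by level. It shows the general idea and is not pdf-craft's implementation; `TocEntry` and `build_toc` are hypothetical names.

```python
import re
from dataclasses import dataclass, field

# An ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


@dataclass
class TocEntry:
    level: int
    title: str
    children: list["TocEntry"] = field(default_factory=list)


def build_toc(markdown: str) -> list[TocEntry]:
    """Nest headings so that each entry holds the deeper headings that follow it."""
    roots: list[TocEntry] = []
    stack: list[TocEntry] = []  # chain of currently open ancestors
    in_fence = False
    for line in markdown.splitlines():
        # Skip fenced code blocks so '#' comments are not read as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match is None:
            continue
        entry = TocEntry(level=len(match.group(1)), title=match.group(2))
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1].level >= entry.level:
            stack.pop()
        (stack[-1].children if stack else roots).append(entry)
        stack.append(entry)
    return roots
```

The stack keeps the work linear in the number of lines, and a heading that skips a level (for example, a level-3 heading directly under a level-1 heading) is still attached to its nearest shallower ancestor.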
For books with complex chapter hierarchies, you can configure an optional parameter to enable LLM-powered chapter title analysis, which provides more accurate TOC hierarchy detection.

LLM-Enhanced TOC Extraction

To use LLM-enhanced TOC extraction, you need to configure an LLM instance.

Custom PDF Handler

By default, pdf-craft uses Poppler for PDF parsing and rendering. If Poppler is not in your system PATH, you can specify a custom path. If no path is specified, pdf-craft uses Poppler from your system PATH. For advanced use cases, you can also implement the PDF handler protocol to use alternative PDF libraries.

Error Handling

Two parameters provide flexible error handling options. You can use each of them in two ways:

**1. Boolean Mode** - Simple on/off control: When enabled, processing continues when errors occur on individual pages, inserting a placeholder message instead of stopping the entire conversion.

**2. Custom Function Mode** - Fine-grained control: This allows you to implement custom logic for deciding which specific errors should be ignored during conversion.

Related Open Source Libraries

epub-translator uses AI large language models to automatically translate EPUB e-books while fully preserving the original book's format, illustrations, table of contents, and layout. It also generates bilingual versions for convenient language learning or international sharing. When combined with this library, you can convert and translate scanned PDF books. For a demonstration, see this video: Convert PDF scanned books to EPUB format and translate to bilingual books.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Starting from v1.0.0, pdf-craft has fully migrated to DeepSeek OCR (MIT license), removing the previous AGPL-3.0 dependency and allowing the entire project to be released under the more permissive MIT license. Note that pdf-craft has a transitive dependency on easydict (LGPLv3) via DeepSeek OCR.
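Returning to the two modes described under Error Handling: the sketch below shows how a boolean flag and a custom filter function can drive the same skip-or-raise decision. It does not use pdf-craft's API. `convert_pages`, `ignore_only`, and the placeholder text are hypothetical, and the idea that a filter receives the exception and returns a boolean is an assumption made for illustration.

```python
from typing import Callable, Iterable, Union

# Assumption: a custom filter takes the raised exception and returns True to ignore it.
ErrorFilter = Callable[[Exception], bool]


def ignore_only(*error_types: type[Exception]) -> ErrorFilter:
    """Build a filter that ignores only the listed exception types."""

    def should_ignore(error: Exception) -> bool:
        return isinstance(error, error_types)

    return should_ignore


def convert_pages(
    pages: Iterable[str],
    convert_page: Callable[[str], str],
    ignore_errors: Union[bool, ErrorFilter] = False,
) -> list[str]:
    """Convert pages one by one, replacing ignored failures with a placeholder."""
    results = []
    for index, page in enumerate(pages, start=1):
        try:
            results.append(convert_page(page))
        except Exception as error:
            # Boolean mode: one switch for all errors.
            # Function mode: ask the filter about this specific error.
            if callable(ignore_errors):
                ignore = ignore_errors(error)
            else:
                ignore = bool(ignore_errors)
            if not ignore:
                raise
            results.append(f"[page {index} skipped: {error}]")
    return results
```

With `ignore_errors=True` every page-level failure becomes a placeholder, while `ignore_errors=ignore_only(ValueError)` skips only `ValueError` and lets anything else stop the run.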
Thanks to the community for their support and contributions!

Acknowledgments
• DeepSeekOCR
• doc-page-extractor
• pyahocorasick