
allenai / olmocr

Toolkit for linearizing PDFs for LLM datasets/training

17,026 stars
1,364 forks
60 issues
Languages: Python, Shell, HTML

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing allenai/olmocr in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/allenai/olmocr)

Repository Overview (README excerpt)


A toolkit for converting PDFs and other image-based document formats into clean, readable, plain text. Try the online demo: https://olmocr.allenai.org/

Features:

• Convert PDF, PNG, and JPEG based documents into clean Markdown
• Support for equations, tables, handwriting, and complex formatting
• Automatically removes headers and footers
• Converts into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
• Efficient: less than $200 USD per million pages converted
• Based on a 7B-parameter VLM, so it requires a GPU

News

• October 21, 2025 - v0.4.0 - New model release; boosts olmOCR-bench score by ~4 points using synthetic data and introduces RL training.
• August 13, 2025 - v0.3.0 - New model release; fixes auto-rotation detection and hallucinations on blank documents.
• July 24, 2025 - v0.2.1 - New model release; scores 3 points higher on olmOCR-Bench, runs significantly faster because it defaults to FP8, and needs far fewer retries per document.
• July 23, 2025 - v0.2.0 - New cleaned-up trainer code that makes it much simpler to train olmOCR models yourself.
• June 17, 2025 - v0.1.75 - Switched from sglang to a vLLM-based inference pipeline; updated Docker image to CUDA 12.8.
• May 23, 2025 - v0.1.70 - Official Docker support and images are now available! See Docker usage.
• May 19, 2025 - v0.1.68 - olmOCR-Bench launch, scoring 77.4. Launch includes a 2-point performance boost in the olmOCR pipeline due to bug fixes with prompts.
• Mar 17, 2025 - v0.1.60 - Performance improvements due to better temperature selection in sampling.
• Feb 25, 2025 - v0.1.58 - Initial public launch and demo.

Benchmark

**olmOCR-Bench**: We also ship a comprehensive benchmark suite covering over 7,000 test cases across 1,400 documents to help measure the performance of OCR systems.
| Model | ArXiv | Old scans math | Tables | Old scans | Headers & footers | Multi column | Long tiny text | Base | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0±1.1 |
| Marker 1.10.1 | 83.8 | 66.8 | 72.9 | 33.5 | 86.6 | 80.0 | 85.7 | 99.3 | 76.1±1.1 |
| MinerU 2.5.4* | 76.6 | 54.6 | 84.9 | 33.7 | 96.6 | 78.2 | 83.5 | 93.7 | 75.2±1.1 |
| DeepSeek-OCR | 77.2 | 73.6 | 80.2 | 33.3 | 96.1 | 66.4 | 79.4 | 99.8 | 75.7±1.0 |
| Nanonets-OCR2-3B | 75.4 | 46.1 | 86.8 | 40.9 | 32.1 | 81.9 | 93.0 | 99.6 | 69.5±1.1 |
| PaddleOCR-VL* | 85.7 | 71.0 | 84.1 | 37.8 | 97.0 | 79.9 | 85.7 | 98.5 | 80.0±1.0 |
| Infinity-Parser 7B* | 84.4 | 83.8 | 85.0 | 47.9 | 88.7 | 84.2 | 86.4 | 99.8 | 82.5±? |
| Chandra OCR 0.1.0* | 82.2 | 80.3 | 88.0 | 50.4 | 90.8 | 81.2 | 92.3 | 99.9 | 83.1±0.9 |
| olmOCR v0.4.0 | 83.0 | 82.3 | 84.9 | 47.7 | 96.1 | 83.7 | 81.9 | 99.7 | 82.4±1.1 |

Installation

System Dependencies

You will need to install poppler-utils and additional fonts for rendering PDF images. Install dependencies (Ubuntu/Debian).

Python Installation

Set up a conda environment and install olmocr. The requirements for running olmOCR are difficult to install into an existing Python environment, so please create a clean environment to install into. Choose the installation option that matches your use case:

**Option 1: Remote Inference (Lightweight)** - If you plan to use a remote vLLM server, install the base package. This avoids installing heavy GPU dependencies like PyTorch (~2 GB+).

**Option 2: Local GPU Inference** - For running inference on your own GPU. Requirements:

• Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB of GPU RAM
• 30 GB of free disk space

**Option 3: Beaker Cluster Execution** - For submitting jobs to Beaker clusters.

**Option 4: Benchmark Suite** - For running the olmOCR benchmark suite.

**Combined Installation** - You can combine multiple options.

**Troubleshooting** - If you run into errors about open-file limits, update your ulimit.

Usage Examples

For quick testing, try the web demo.
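As a minimal sketch of a local conversion, assuming the `python -m olmocr.pipeline` entry point from olmOCR's upstream README (the `--markdown` and `--pdfs` flag names are assumptions; verify with `--help`):

```shell
# Convert a single PDF into a local workspace; Dolma JSONL (and, with
# --markdown, per-document markdown files) land under ./localworkspace
python -m olmocr.pipeline ./localworkspace --markdown --pdfs path/to/document.pdf
```

The same invocation accepts multiple paths or a shell glob after `--pdfs` for batch conversion.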
**Convert a Single PDF (Local GPU):**

**Convert an Image File:**

**Convert Multiple PDFs:**

**Use Remote Inference Server:**

With the markdown flag enabled, results will be stored as markdown files inside the workspace.

> **Note:** You can also use instead of if you prefer.

Viewing Results

The workspace folder will then contain both Dolma and markdown files (if using the markdown flag).

Using an Inference Provider or External Server

If you have a vLLM server already running elsewhere (or any inference platform implementing the OpenAI API), you can point olmOCR at it instead of spawning a local instance.

**Installation for Remote Inference:**

**Using an External Server:** The model name served by vLLM needs to match the model value you pass to olmOCR.

**Example vLLM Server Launch:**

Verified External Providers

We have tested these external model providers and confirmed that they work:

| Provider | $/1M Input tokens | $/1M Output tokens | Example Command |
|---|---|---|---|
| Cirrascale | $0.07 | $0.15 | |
| DeepInfra | $0.09 | $0.19 | |
| Parasail | $0.10 | $0.20 | |

Notes on arguments:

• Endpoint: defines the OpenAI-compatible endpoint to send requests to.
• API key: your API key, passed in via the Authorization Bearer HTTP header.
• Concurrent requests: the maximum number of requests that will be in flight to the inference provider at one time.
• Page groups: the maximum number of page groups that will be processed at once. You may want to set this low so that you finish one group before moving on.
• Pages per group: you may want a smaller number of pages per group, as many external providers have lower concurrent-request limits.
• Model: the model identifier; different providers use different names, and if you run locally you can use the default.
• Other arguments work the same as with local inference.

Multi-node / Cluster Usage

If you want to convert millions of PDFs using multiple nodes running in parallel, olmOCR supports reading PDFs from AWS S3 and coordinating work using an AWS S3 output bu…
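The external-server workflow described above can be sketched as follows. The `--server`, `--api_key`, and `--model` flag names are assumptions inferred from the argument descriptions, and the endpoint and model values are placeholders:

```shell
# Point olmOCR at an already-running OpenAI-compatible endpoint instead of
# spawning a local vLLM instance. Flag names are assumed; check
# `python -m olmocr.pipeline --help` for the exact spelling.
python -m olmocr.pipeline ./workspace --markdown \
    --pdfs ./pdfs/*.pdf \
    --server https://your-provider.example/v1 \
    --api_key "$PROVIDER_API_KEY" \
    --model the-served-model-name
```

For a provider from the table above, substitute its base URL and its advertised model name; the model value must match what the server reports as its served model.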