Best Open Source OCR Libraries

A curated list of the most popular GitHub repositories tagged with ocr. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.

#1 tesseract-ocr/tesseract

Tesseract Open Source OCR Engine (main repository)

72,951 · C++

#2 PaddlePaddle/PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

72,460 · Python

#3 opendatalab/MinerU

Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

56,399 · Python

#4 hiroi-sora/Umi-OCR

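Free, open-source, offline OCR software. Supports screenshot capture and batch image import, PDF document recognition, exclusion of watermarks and headers/footers, and scanning and generating QR codes. Includes built-in language packs for multiple languages.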

42,616 · Python

#5 siyuan-note/siyuan

A privacy-first, self-hosted, fully open-source personal knowledge management application, written in TypeScript and Go.

41,915 · TypeScript

#6 naptha/tesseract.js

Pure JavaScript OCR for more than 100 languages 📖🎉🖥

37,925 · JavaScript

#7 paperless-ngx/paperless-ngx

A community-supported supercharged document management system: scan, index and archive all your documents

37,418 · Python

#8 ShareX/ShareX

ShareX is a free and open-source application that enables users to capture or record any area of their screen with a single keystroke. It also supports uploading images, text, and various file types to a wide range of destinations.

35,901 · C#

#9 ocrmypdf/OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

32,961 · Python

#10 JaidedAI/EasyOCR

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

29,096 · Python

#11 pot-app/pot-desktop

🌈 Cross-platform software for selected-text translation and OCR.

17,365 · JavaScript

#12 lukas-blecher/LaTeX-OCR

pix2tex: Using a ViT to convert images of equations into LaTeX code.

16,255 · Python

#13 run-llama/liteparse

A fast, helpful, and open-source document parser

11,684 · Rust

#14 ripperhe/Bob

Bob is a translation and OCR application for macOS.

9,579

#15 zyddnys/manga-image-translator

Translate manga and images: one-click translation of text inside all kinds of images. https://cotrans.touhou.ai/ (no longer working)

9,549 · Python

#16 pymupdf/PyMuPDF

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF (and other) documents.

9,256 · Python

#17 bytedance/Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

8,865 · Python

#18 Zipstack/unstract

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

6,710 · Python

#19 oomol-lab/pdf-craft

PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books.

4,959 · Python

#20 datalab-to/chandra

OCR model that handles complex tables, forms, and handwriting with full layout.

4,949 · Python

#21 JabRef/jabref

Desktop app for managing BibTeX and BibLaTeX (.bib) libraries

4,436 · Java

#22 SylphxAI/pdf-reader-mcp

📄 The PDF intelligence layer for AI agents — Agent Document Twin, evidence-first extraction, visual crops, OCR provenance, trust reports, and benchmark-gated releases. MCP server for Claude, Cursor, VS Code, and any MCP client.

831 · TypeScript
