Best Open Source PDF Parser Libraries
A curated list of the most popular GitHub repositories tagged with pdf parser. Select any project to visualize its architecture and dive into the codebase using RepoMind's AI engine.
#1 PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
70,987 · Python
#2 opendatalab/MinerU
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
54,581 · Python
#3 py-pdf/pypdf
A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
9,822 · Python
#4 bytedance/Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
8,827 · Python