AI Architecture Analysis
This repository is indexed by RepoMind. By analyzing run-llama/liteparse in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.
Repository Overview (README excerpt)
LiteParse

LiteParse is a standalone OSS PDF parsing tool focused exclusively on **fast and light** parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.

**Hitting the limits of local parsing?** For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you'll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.

> 👉 Sign up for LlamaParse free

Overview

• **Fast Text Parsing**: Spatial text parsing using PDF.js
• **Flexible OCR System**:
  • **Built-in**: Tesseract.js (zero setup, works out of the box!)
  • **HTTP Servers**: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
  • **Standard API**: Simple, well-defined OCR API specification
• **Screenshot Generation**: Generate high-quality page screenshots for LLM agents
• **Multiple Output Formats**: JSON and Text
• **Bounding Boxes**: Precise text positioning information
• **Standalone Binary**: No cloud dependencies, runs entirely locally
• **Multi-platform**: Linux, macOS (Intel/ARM), Windows

Installation

CLI Tool

Option 1: Global Install (Recommended)

Install globally via npm to use the command anywhere:

Then use it:

For macOS and Linux users, can be also installed via :

Option 2: Install from Source

You can clone the repo and install the CLI globally from source:

Agent Skill

You can use as an agent skill, downloading it with the CLI tool:

Or copy-pasting the file to your own skills setup.

Usage

Parse Files

Batch Parsing

You can also parse an entire directory of documents:

Generate Screenshots

Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.
Library Usage

Install as a dependency in your project:

Buffer / Uint8Array Input

You can pass raw bytes directly instead of a file path. PDF buffers are parsed with **zero disk I/O**; no temp files are written:

Non-PDF buffers (images, Office documents) are written to a temp directory for format conversion.

Screenshots also work with buffer input:

CLI Options

Parse Command

Batch Parse Command

Screenshot Command

OCR Setup

Default: Tesseract.js

By default, Tesseract.js downloads language data from the internet on first use. For offline or air-gapped environments, set the environment variable to a directory containing pre-downloaded files:

You can also pass in the library config:

Optional: HTTP OCR Servers

For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines:

• EasyOCR
• PaddleOCR

You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see ). The API requires:

• POST endpoint
• Accepts and parameters
• Returns JSON:

See the example servers in and as templates. For the complete OCR API specification, see .

Multi-Format Input Support

LiteParse supports **automatic conversion** of various document formats to PDF before parsing. This makes it unique compared to other PDF-only parsing tools!
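To make the conversion step above concrete, here is a minimal TypeScript sketch of how an input file might be routed to a converter before parsing. Everything in it is an assumption for illustration: the `pickConverter` and `extensionOf` helpers, the `Converter` type, and the short extension lists are hypothetical and do not reflect LiteParse's actual code or its full list of supported formats.

```typescript
// Hypothetical routing sketch. The names and extension lists below are
// illustrative assumptions, not LiteParse's actual API or format list.
type Converter = "none" | "libreoffice" | "imagemagick";

// Small sample lists only; the real tool may accept more (or different) formats.
const OFFICE_EXTENSIONS = new Set(["docx", "pptx", "xlsx"]);
const IMAGE_EXTENSIONS = new Set(["png", "jpg", "jpeg"]);

/** Return the lower-cased extension without the dot, or "" if there is none. */
function extensionOf(fileName: string): string {
  // Strip any directory part, accepting both "/" and "\" separators.
  const base = fileName.split(/[\\/]/).pop() ?? "";
  const dot = base.lastIndexOf(".");
  // dot > 0 excludes dotfiles such as ".bashrc", which have no extension.
  return dot > 0 ? base.slice(dot + 1).toLowerCase() : "";
}

/** Decide which external tool (if any) must turn the input into a PDF first. */
function pickConverter(fileName: string): Converter {
  const ext = extensionOf(fileName);
  if (ext === "pdf") return "none"; // PDFs are parsed directly
  if (OFFICE_EXTENSIONS.has(ext)) return "libreoffice";
  if (IMAGE_EXTENSIONS.has(ext)) return "imagemagick";
  throw new Error(`Unsupported input format: ${fileName}`);
}
```

A caller could use the result to decide whether to invoke LibreOffice or ImageMagick before handing the resulting PDF to the parser. Unknown extensions fail fast instead of being passed through silently.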
Supported Input Formats

Office Documents (via LibreOffice)

• **Word**: , , , ,
• **PowerPoint**: , , ,
• **Spreadsheets**: , , , , ,

Just install the dependency and LiteParse will automatically convert these formats to PDF for parsing:

> _For Windows, you might need to add the path to the directory containing LibreOffice CLI executable (generally ) to the environment variables and re-start the machine._

Images (via ImageMagick)

• **Formats**: , , , , , , ,

Just install ImageMagick and LiteParse will convert images to PDF for parsing (with OCR):

Environment Variables

| Variable | Description |
|----------|-------------|
| | Path to a directory containing Tesseract files. Used for offline/air-gapped environments where Tesseract.js cannot download language data from the internet. |
| | Override the temp directory used for format conversion and intermediate files. Defaults to the OS temp directory ( ). Useful in containerized or read-only filesystem environments. |

Configuration

You can configure parsing options via CLI flags or a JSON config file. The config file allows you to set sensible defaults and override as needed.

Config File Example

Create a file:

For HTTP OCR servers, just add :

Use with:

Development

We provide a fairly rich / that we recommend using to help with development + coding agents.

License

Apache 2.0

Credits

Built on top of:

• PDF.js - PDF parsing engine
• Tesseract.js - In-process OCR engine
• EasyOCR - HTTP OCR server (optional)
• PaddleOCR - HTTP OCR server (optional)
• Sharp - Image processing