
EvolvingLMMs-Lab / NEO

NEO Series: Native Vision-Language Models from First Principles

View on GitHub
680 stars
24 forks
1 issue
Python · Jupyter Notebook · Shell

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing EvolvingLMMs-Lab/NEO in our AI interface, you can generate architecture diagrams, visualize control flows, and run automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

To optimize performance, source files are loaded only when you start an analysis.
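To make the on-demand pattern concrete, here is a minimal Python sketch of the idea. It is illustrative only, not RepoMind's actual engine, and every name in it (`build_context`, `max_chars`, the example file paths) is hypothetical: the agent names the files it needs for the current question, and each file's full text is placed into the prompt, instead of retrieving pre-embedded chunks as a classic RAG pipeline would.

```python
from pathlib import Path

def build_context(repo_root: str, requested_files: list[str],
                  max_chars: int = 200_000) -> str:
    """Assemble a prompt context from whole source files (illustrative sketch).

    Unlike chunk-based RAG, each requested file is loaded in full, so the
    model sees intact function and class boundaries.
    """
    sections: list[str] = []
    used = 0
    for rel in requested_files:
        path = Path(repo_root) / rel
        if not path.is_file():
            continue  # skip files the agent asked for that do not exist
        text = path.read_text(encoding="utf-8", errors="replace")
        if used + len(text) > max_chars:
            break  # enforce a hard context budget
        sections.append(f"### FILE: {rel}\n{text}")
        used += len(text)
    return "\n\n".join(sections)

# Example: the agent decides it needs the training entry point and a config.
# (These file names are hypothetical.)
prompt_context = build_context(".", ["train.py", "configs/neo_2b.yaml"])
```

Deferring the file reads until a question arrives is what the note above refers to: nothing is loaded, embedded, or indexed ahead of time, so the cost is paid only when an analysis actually starts.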

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/EvolvingLMMs-Lab/NEO)

Repository Overview (README excerpt)


# NEO Series: Native Vision-Language Models

**2025/09**: From Pixels to Words -- Towards Native Vision-Language Primitives at Scale (ICLR 2026)

## 📜 News

- [2026/01] 🔥🔥🔥 The training code of NEO is released!
- [2025/10] The paper, weights, and evaluation code of **NEO** are released!
- [2025/09] 💥💥💥 **NEO** has been completed!

## 📋 Todo List

- [x] Evaluation guide
- [x] Training guide

## 💡 Motivation

- **What constraints set native VLMs apart from modular ones, and to what extent can they be overcome?**
- **How can native VLMs be made more accessible and democratized, thereby accelerating their progress?**

## 💡 Highlights

- 🔥 **Native Architecture:** NEO introduces a native VLM primitive that unifies pixel-word encoding, alignment, and reasoning within a dense, monolithic model architecture.
- 🔥 **Superior Efficiency:** With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming native ones.
- 🔥 **Promising Roadmap:** NEO pioneers a promising route toward scalable and powerful native VLMs, paired with diverse reusable components that foster a cost-effective and extensible ecosystem.

## 🤖 Model Zoo

We release 2B and 9B **NEO** models at the Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT) stages. (A minimal checkpoint-loading sketch follows the benchmark tables below.)

| Model Name | Model Weight |
| ---------- | ------------ |
| **NEO-2B-PT** | 🤗 NEO-2B-PT HF link |
| **NEO-2B-MT** | 🤗 NEO-2B-MT HF link |
| **NEO-2B-SFT** | 🤗 NEO-2B-SFT HF link |
| **NEO-9B-PT** | 🤗 NEO-9B-PT HF link |
| **NEO-9B-MT** | 🤗 NEO-9B-MT HF link |
| **NEO-9B-SFT** | 🤗 NEO-9B-SFT HF link |

## 📊 Benchmark Results

> **Table notes:**
> - "#Data" = data scale for pre-training (PT) / mid-training (MT) / supervised fine-tuning (SFT).
> - "†" = vision-language models using Reinforcement Learning (RL).
> - "Any Res." = any resolution; "Tile-wise" = image split into tiles; "Any Rat." = any aspect ratio; "Fix Res." = fixed resolution.
> - "MoE" = Mixture-of-Experts; "DaC" = Divide-and-Conquer.
> - **Bold** = best score in each column.

**2B-scale models**

| Model | Base LLM | #Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
|:--|:--|:--|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 🔻 **Modular VLMs (2B)** | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-1.5B | --·--·-- | Any Res. | M-RoPE | 41.1 | 74.9 | 49.5 | 48.0 | -- | -- | 41.7 | 74.7 | **90.1** | 73.5 | 65.5 | **79.7** | 80.9 |
| InternVL2.5 | InternLM2.5-1.8B | >6B·100M·16M | Tile-wise | 1D-RoPE | 43.6 | 74.7 | 60.8 | 53.7 | -- | **90.6** | 42.6 | 74.9 | 88.7 | 79.2 | 60.9 | 74.3 | 80.4 |
| InternVL3† | Qwen2.5-1.5B | >6B·100M·22M | Tile-wise | 1D-RoPE | **48.6** | **81.1** | **62.2** | **60.7** | -- | 89.6 | 42.5 | **78.7** | 88.3 | **80.2** | **66.1** | 77.0 | **83.5** |
| *Qwen2.5-VL†* | *Qwen2.5-3B* | --·--·-- | *Any Res.* | *M-RoPE* | *51.2* | *79.1* | *61.8* | *55.9* | -- | -- | *46.3* | *81.6* | *93.9* | *84.0* | *77.1* | *79.3* | *79.7* |
| Encoder-Based | Qwen3-1.7B | >6B·40M·4M | Tile-wise | 1D-RoPE | 47.1 | 75.8 | 37.4 | 52.7 | **73.6** | 87.0 | **44.4** | 77.4 | 89.9 | 78.4 | 65.9 | 73.3 | **83.5** |
| 🔻 **Native VLMs (2B)** | | | | | | | | | | | | | | | | | |
| Mono-InternVL | InternLM2-1.8B | 1.2B·143M·7M | Tile-wise | 1D-RoPE | 33.7 | 65.5 | 40.1 | -- | 67.4 | -- | 34.8 | 68.6 | 80.0 | 73.7 | 43.0 | 72.6 | 76.7 |
| Mono-InternVL-1.5 | InternLM2-1.8B | 400M·150M·7M | Tile-wise | 1D-RoPE | 39.1 | 64.0 | **54.0** | -- | 66.9 | -- | 32.5 | 67.4 | 81.7 | 72.2 | 47.9 | 73.7 | **80.1** |
| HoVLE | InternLM2-1.8B | 550M·50M·7M | Tile-wise | 1D-RoPE | 32.2 | 73.3 | 43.8 | -- | 70.9 | 87.4 | 38.4 | 73.0 | 86.1 | 78.6 | 55.7 | 70.9 | 74.0 |
| OneCAT | Qwen2.5-1.5B | 436M·70M·13M | Any Res. | M-RoPE | 39.0 | 72.4 | 42.4 | -- | 70.9 | -- | -- | 72.4 | 87.1 | 76.2 | 56.3 | 67.0 | -- |
| **NEO** | Qwen3-1.7B | 345M·40M·4M | Any Res. | Native-RoPE | **48.6** | **76.0** | 49.6 | **54.2** | **74.2** | **87.5** | **43.1** | **80.1** | **89.9** | **81.2** | **63.2** | **74.0** | 77.1 |

**8B-scale models**

| Model | Base LLM | #Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
|:--|:--|:--|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 🔻 **Modular VLMs (8B)** | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-7B | --·--·-- | Any Res. | M-RoPE | 54.1 | 83.0 | 62.0 | 60.7 | -- | 88.1 | 50.6 | 83.0 | 94.5 | 83.0 | 76.5 | 84.3 | 86.6 |
| InternVL2.5 | InternLM2.5-7B | >6B·50M·4M | Tile-wise | 1D-RoPE | 56.0 | **84.6** | 62.8 | 64.4 | -- | 90.6 | 50.1 | 84.5 | 93.0 | 84.8 | 77.6 | 79.1 | 82.2 |
| Qwen2.5-VL† | Qwen2.5-7B | --·--·-- | Any Res. | M-RoPE | 55.0 | 83.5 | 67.1 | 63.9 | -- | 86.4 | **52.9** | 83.9 | **95.7** | **87.3** | **82.6** | **84.9** | 86.4 |
| InternVL3† | Qwen2.5-7B | >6B·100M·22M | Tile-wise | 1D-RoPE | **62.7** | 83.4 | **81.3** | **68.2** | -- | **91.1** | 49.9 | **85.2** | 92.7 | 86.6 | 76.8 | 80.2 | **88.0** |
| Encoder-Based | Qwen3-8B | >6B·40M·4M | Tile-wise | 1D-RoPE | 54.1 | 84.0 | 60.0 | 63.5 | **76.2** | 87.8 | 51.4 | 82.9 | 92.1 | 83.5 | 75.0 | 77.1 | 85.3 |
| 🔻 **Native VLMs (8B)** | | | | | | | | | | | | | | | | | |
| Fuyu | Persimmon-8B | --·--·-- | Any Res. | 1D-RoPE | 27.9 | 10.7 | 21.4 | -- | 59.3 | 84.0 | -- | 64.5 | -- | -- | -- | -- | 36.6 |
| Chameleon | from scratch | 1.4B·0M·1.8M | Fix Res. | 1D-RoPE | 25.4 | 31.1 | 8.3 | -- | 30.6 | 19.4 | 17.1 | 46.0 | 1.5 | 2.9 | 5.0 | 4.8 | 0.7 |
| EVE | Vicuna-7B | 33M·0M·1.8M | Any Rat. | 1D-RoPE | 32.6 | 52.3 | 25.7 | -- | 64.6 | 85.0 | 26.4 | 61.0 | 53.0 | 59.1 | 25.0 | 56.8 | 39.8 |
| SOLO | Mistral-7B | 44M·0M·2M | Any Res. | 1D-RoPE | -- | 67.7 | 30.4 | -- | 64.4 | 78.6 | -- | 61.4 | -- | -- | -- | -- | 12.6 |

*(README excerpt truncated here: "Em…")*
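As a companion to the model zoo above, here is a minimal sketch of loading one of the released checkpoints with Hugging Face `transformers`. The repo id below is hypothetical (the excerpt shows only "HF link" placeholders), and `trust_remote_code=True` is an assumption based on NEO's custom native architecture; take the exact identifiers and inference code from the project README.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repo id -- the excerpt above only shows "HF link" placeholders,
# so substitute the real Hugging Face path from the NEO README.
MODEL_ID = "EvolvingLMMs-Lab/NEO-2B-SFT"

# trust_remote_code=True is an assumption: a native (encoder-free) VLM like
# NEO likely ships custom modeling code rather than a stock transformers class.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    device_map="auto",  # spread weights across available devices
)
```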