
EvolvingLMMs-Lab / NEO

NEO Series: Native Vision-Language Models from First Principles

View on GitHub
680 stars
24 forks
1 issue
Python · Jupyter Notebook · Shell

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing EvolvingLMMs-Lab/NEO in our AI interface, you can generate architecture diagrams, visualize control flows, and run automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

To optimize performance, source files are loaded only when you start an analysis.
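To make the on-demand pattern concrete, here is a minimal Python sketch of the idea. It is illustrative only, not RepoMind's actual engine, and every name in it (`build_context`, `max_chars`, the example file paths) is hypothetical: the agent names the files it needs for the current question, and each file's full text is placed into the prompt, instead of retrieving pre-embedded chunks as a classic RAG pipeline would.

```python
from pathlib import Path

def build_context(repo_root: str, requested_files: list[str],
                  max_chars: int = 200_000) -> str:
    """Assemble a prompt context from whole source files (illustrative sketch).

    Unlike chunk-based RAG, each requested file is loaded in full, so the
    model sees intact function and class boundaries.
    """
    sections: list[str] = []
    used = 0
    for rel in requested_files:
        path = Path(repo_root) / rel
        if not path.is_file():
            continue  # skip files the agent asked for that do not exist
        text = path.read_text(encoding="utf-8", errors="replace")
        if used + len(text) > max_chars:
            break  # enforce a hard context budget
        sections.append(f"### FILE: {rel}\n{text}")
        used += len(text)
    return "\n\n".join(sections)

# Example: the agent decides it needs the training entry point and a config.
# (These file names are hypothetical.)
prompt_context = build_context(".", ["train.py", "configs/neo_2b.yaml"])
```

Deferring the file reads until a question arrives is what the note above refers to: nothing is loaded, embedded, or indexed ahead of time, so the cost is paid only when an analysis actually starts.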

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/EvolvingLMMs-Lab/NEO)

Repository Overview (README excerpt)


# NEO Series: Native Vision-Language Models

**2025/09**: From Pixels to Words -- Towards Native Vision-Language Primitives at Scale (ICLR 2026)

## 📜 News

- [2026/01] 🔥🔥🔥 The training code of NEO is released!
- [2025/10] The paper, weights, and evaluation code of **NEO** are released!
- [2025/09] 💥💥💥 **NEO** has been completed!

## 📋 Todo List

- [x] Evaluation guide
- [x] Training guide

## 💡 Motivation

- **What constraints set native VLMs apart from modular ones, and to what extent can they be overcome?**
- **How can native VLMs be made more accessible and democratized, thereby accelerating their progress?**

## 💡 Highlights

- 🔥 **Native Architecture:** NEO introduces a native VLM primitive that unifies pixel-word encoding, alignment, and reasoning within a dense, monolithic model architecture.
- 🔥 **Superior Efficiency:** With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming native ones.
- 🔥 **Promising Roadmap:** NEO pioneers a promising route toward scalable and powerful native VLMs, paired with diverse reusable components that foster a cost-effective and extensible ecosystem.

## 🤖 Model Zoo

We release 2B and 9B **NEO** models at the Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT) stages. (A minimal checkpoint-loading sketch follows the benchmark tables below.)

| Model Name | Model Weight |
| ---------- | ------------ |
| **NEO-2B-PT** | 🤗 NEO-2B-PT HF link |
| **NEO-2B-MT** | 🤗 NEO-2B-MT HF link |
| **NEO-2B-SFT** | 🤗 NEO-2B-SFT HF link |
| **NEO-9B-PT** | 🤗 NEO-9B-PT HF link |
| **NEO-9B-MT** | 🤗 NEO-9B-MT HF link |
| **NEO-9B-SFT** | 🤗 NEO-9B-SFT HF link |

## 📊 Benchmark Results

> **Table notes:**
> - "#Data" = data scale for pre-training (PT) / mid-training (MT) / supervised fine-tuning (SFT).
> - "†" = vision-language models using Reinforcement Learning (RL).
> - "Any Res." = any resolution; "Tile-wise" = image split into tiles; "Any Rat." = any aspect ratio; "Fix Res." = fixed resolution.
> - "MoE" = Mixture-of-Experts; "DaC" = Divide-and-Conquer.
> - **Bold** = best score in each column.

**2B-scale models**

| Model | Base LLM | #Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
|:--|:--|:--|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 🔻 **Modular VLMs (2B)** | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-1.5B | --·--·-- | Any Res. | M-RoPE | 41.1 | 74.9 | 49.5 | 48.0 | -- | -- | 41.7 | 74.7 | **90.1** | 73.5 | 65.5 | **79.7** | 80.9 |
| InternVL2.5 | InternLM2.5-1.8B | >6B·100M·16M | Tile-wise | 1D-RoPE | 43.6 | 74.7 | 60.8 | 53.7 | -- | **90.6** | 42.6 | 74.9 | 88.7 | 79.2 | 60.9 | 74.3 | 80.4 |
| InternVL3† | Qwen2.5-1.5B | >6B·100M·22M | Tile-wise | 1D-RoPE | **48.6** | **81.1** | **62.2** | **60.7** | -- | 89.6 | 42.5 | **78.7** | 88.3 | **80.2** | **66.1** | 77.0 | **83.5** |
| *Qwen2.5-VL†* | *Qwen2.5-3B* | --·--·-- | *Any Res.* | *M-RoPE* | *51.2* | *79.1* | *61.8* | *55.9* | -- | -- | *46.3* | *81.6* | *93.9* | *84.0* | *77.1* | *79.3* | *79.7* |
| Encoder-Based | Qwen3-1.7B | >6B·40M·4M | Tile-wise | 1D-RoPE | 47.1 | 75.8 | 37.4 | 52.7 | **73.6** | 87.0 | **44.4** | 77.4 | 89.9 | 78.4 | 65.9 | 73.3 | **83.5** |
| 🔻 **Native VLMs (2B)** | | | | | | | | | | | | | | | | | |
| Mono-InternVL | InternLM2-1.8B | 1.2B·143M·7M | Tile-wise | 1D-RoPE | 33.7 | 65.5 | 40.1 | -- | 67.4 | -- | 34.8 | 68.6 | 80.0 | 73.7 | 43.0 | 72.6 | 76.7 |
| Mono-InternVL-1.5 | InternLM2-1.8B | 400M·150M·7M | Tile-wise | 1D-RoPE | 39.1 | 64.0 | **54.0** | -- | 66.9 | -- | 32.5 | 67.4 | 81.7 | 72.2 | 47.9 | 73.7 | **80.1** |
| HoVLE | InternLM2-1.8B | 550M·50M·7M | Tile-wise | 1D-RoPE | 32.2 | 73.3 | 43.8 | -- | 70.9 | 87.4 | 38.4 | 73.0 | 86.1 | 78.6 | 55.7 | 70.9 | 74.0 |
| OneCAT | Qwen2.5-1.5B | 436M·70M·13M | Any Res. | M-RoPE | 39.0 | 72.4 | 42.4 | -- | 70.9 | -- | -- | 72.4 | 87.1 | 76.2 | 56.3 | 67.0 | -- |
| **NEO** | Qwen3-1.7B | 345M·40M·4M | Any Res. | Native-RoPE | **48.6** | **76.0** | 49.6 | **54.2** | **74.2** | **87.5** | **43.1** | **80.1** | **89.9** | **81.2** | **63.2** | **74.0** | 77.1 |

**8B-scale models**

| Model | Base LLM | #Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
|:--|:--|:--|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 🔻 **Modular VLMs (8B)** | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-7B | --·--·-- | Any Res. | M-RoPE | 54.1 | 83.0 | 62.0 | 60.7 | -- | 88.1 | 50.6 | 83.0 | 94.5 | 83.0 | 76.5 | 84.3 | 86.6 |
| InternVL2.5 | InternLM2.5-7B | >6B·50M·4M | Tile-wise | 1D-RoPE | 56.0 | **84.6** | 62.8 | 64.4 | -- | 90.6 | 50.1 | 84.5 | 93.0 | 84.8 | 77.6 | 79.1 | 82.2 |
| Qwen2.5-VL† | Qwen2.5-7B | --·--·-- | Any Res. | M-RoPE | 55.0 | 83.5 | 67.1 | 63.9 | -- | 86.4 | **52.9** | 83.9 | **95.7** | **87.3** | **82.6** | **84.9** | 86.4 |
| InternVL3† | Qwen2.5-7B | >6B·100M·22M | Tile-wise | 1D-RoPE | **62.7** | 83.4 | **81.3** | **68.2** | -- | **91.1** | 49.9 | **85.2** | 92.7 | 86.6 | 76.8 | 80.2 | **88.0** |
| Encoder-Based | Qwen3-8B | >6B·40M·4M | Tile-wise | 1D-RoPE | 54.1 | 84.0 | 60.0 | 63.5 | **76.2** | 87.8 | 51.4 | 82.9 | 92.1 | 83.5 | 75.0 | 77.1 | 85.3 |
| 🔻 **Native VLMs (8B)** | | | | | | | | | | | | | | | | | |
| Fuyu | Persimmon-8B | --·--·-- | Any Res. | 1D-RoPE | 27.9 | 10.7 | 21.4 | -- | 59.3 | 84.0 | -- | 64.5 | -- | -- | -- | -- | 36.6 |
| Chameleon | from scratch | 1.4B·0M·1.8M | Fix Res. | 1D-RoPE | 25.4 | 31.1 | 8.3 | -- | 30.6 | 19.4 | 17.1 | 46.0 | 1.5 | 2.9 | 5.0 | 4.8 | 0.7 |
| EVE | Vicuna-7B | 33M·0M·1.8M | Any Rat. | 1D-RoPE | 32.6 | 52.3 | 25.7 | -- | 64.6 | 85.0 | 26.4 | 61.0 | 53.0 | 59.1 | 25.0 | 56.8 | 39.8 |
| SOLO | Mistral-7B | 44M·0M·2M | Any Res. | 1D-RoPE | -- | 67.7 | 30.4 | -- | 64.4 | 78.6 | -- | 61.4 | -- | -- | -- | -- | 12.6 |

*(README excerpt truncated here: "Em…")*
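As a companion to the model zoo above, here is a minimal sketch of loading one of the released checkpoints with Hugging Face `transformers`. The repo id below is hypothetical (the excerpt shows only "HF link" placeholders), and `trust_remote_code=True` is an assumption based on NEO's custom native architecture; take the exact identifiers and inference code from the project README.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repo id -- the excerpt above only shows "HF link" placeholders,
# so substitute the real Hugging Face path from the NEO README.
MODEL_ID = "EvolvingLMMs-Lab/NEO-2B-SFT"

# trust_remote_code=True is an assumption: a native (encoder-free) VLM like
# NEO likely ships custom modeling code rather than a stock transformers class.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    device_map="auto",  # spread weights across available devices
)
```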