
BradyFU / Awesome-Multimodal-Large-Language-Models

✨✨ Latest Advances on Multimodal Large Language Models

17,472 stars
1,117 forks
99 issues

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing BradyFU/Awesome-Multimodal-Large-Language-Models in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

To optimize performance, source files are only loaded when you start an analysis.
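The sketch below illustrates the on-demand idea in miniature: rather than retrieving pre-chunked snippets the way a classic RAG pipeline would, the agent names the files it needs and whole files are read into the context until a budget is reached. This is an illustrative assumption, not RepoMind's actual implementation; the names `load_files_on_demand` and `MAX_CONTEXT_CHARS` are hypothetical.

```python
from pathlib import Path

# Assumed context budget for a single analysis (illustrative, not RepoMind's real limit).
MAX_CONTEXT_CHARS = 200_000


def load_files_on_demand(repo_root: str, requested: list[str]) -> str:
    """Concatenate the full text of each requested file until the budget is hit."""
    context_parts: list[str] = []
    used = 0
    for rel_path in requested:
        path = Path(repo_root) / rel_path
        if not path.is_file():
            continue  # the agent asked for a file that does not exist; skip it
        text = path.read_text(encoding="utf-8", errors="replace")
        if used + len(text) > MAX_CONTEXT_CHARS:
            break  # stop before overflowing the context window
        context_parts.append(f"=== {rel_path} ===\n{text}")
        used += len(text)
    return "\n\n".join(context_parts)


if __name__ == "__main__":
    # Example: the agent decides it needs the README before answering a question.
    context = load_files_on_demand(".", ["README.md"])
    print(f"Loaded {len(context)} characters of on-demand context")
```

Loading whole files keeps related code together at the cost of a larger context window, which is the trade-off the Agentic CAG description points to.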

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/BradyFU/Awesome-Multimodal-Large-Language-Models)
Preview: Analyzed by RepoMind badge

Repository Overview (README excerpt)


# Awesome-Multimodal-Large-Language-Models

## ✨ Highlights of NJU-MiG

> 🔥🔥 **Surveys of MLLMs** | **💬 WeChat (MLLM WeChat group)**

- 🌟 **A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges** A total of **83 pages** and **750+ references**! [📖 Paper] [🌟 Project]
- 🌟 **MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs** arXiv 2025, Paper, Project
- **A Survey on Multimodal Large Language Models** NSR 2024, Paper, Project

---

> 🔥🔥 **VITA Series Omni MLLMs** | **💬 WeChat (VITA WeChat group)**

- 🌟 **VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction** NeurIPS 2025 Highlight, Paper, Project
- 🌟 **VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation** arXiv 2025, Paper, Project
- 🌟 **VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting** arXiv 2025, Paper, Project
- **VITA: Towards Open-Source Interactive Omni Multimodal LLM** arXiv 2024, Paper, Project
- **Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy** arXiv 2025, Paper, Project
- **VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model** NeurIPS 2025, Paper, Project

---

> 🔥🔥 **MME Series MLLM Benchmarks**

- 🌟 **MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs** arXiv 2025, Paper, Project
- **MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models** NeurIPS 2025 DB Highlight, Paper, Dataset, Eval Tool, ✒️ Citation
- **Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis** CVPR 2025, Paper, Project, Dataset
- **MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?** ICLR 2025, Paper, Project, Dataset

---

## Table of Contents

- Awesome Papers
  - Multimodal Instruction Tuning
  - Multimodal Hallucination
  - Multimodal In-Context Learning
  - Multimodal Chain-of-Thought
  - LLM-Aided Visual Reasoning
  - Foundation Models
  - Evaluation
  - Multimodal RLHF
  - Others
- Awesome Datasets
  - Datasets of Pre-Training for Alignment
  - Datasets of Multimodal Instruction Tuning
  - Datasets of In-Context Learning
  - Datasets of Multimodal Chain-of-Thought
  - Datasets of Multimodal RLHF
  - Benchmarks for Evaluation
  - Others

---

## Awesome Papers

### Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
|:--------|:--------:|:--------:|:--------:|:--------:|
| **InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing** | arXiv | 2026-03-10 | Github | Local Demo |
| **Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion** | arXiv | 2026-03-06 | Github | - |
| **Beyond Language Modeling: An Exploration of Multimodal Pretraining** | arXiv | 2026-03-03 | - | - |
| **Gemini 3.1 Pro: A smarter model for your most complex tasks** | Blog | 2026-02-19 | - | - |
| **Qwen3.5: Towards Native Multimodal Agents** | Blog | 2026-02-16 | Github | Demo |
| **MiniCPM-o 4.5** | Blog | 2026-02-06 | Github | Demo |
| **DeepSeek-OCR 2: Visual Causal Flow** | DeepSeek | 2026-01-27 | Github | - |
| **Seed1.8 Model Card: Towards Generalized Real-World Agency** | Bytedance Seed | 2025-12-18 | - | - |
| **Introducing GPT-5.2** | OpenAI | 2025-12-11 | - | - |
| **Introducing Mistral 3** | Blog | 2025-12-02 | Huggingface | - |
| **Qwen3-VL Technical Report** | arXiv | 2025-11-26 | Github | Demo |
| **Emu3.5: Native Multimodal Models are World Learners** | arXiv | 2025-10-30 | Github | - |
| **VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting** | arXiv | 2025-10-21 | Github | Local Demo |
| **DeepSeek-OCR: Contexts Optical Compression** | arXiv | 2025-10-21 | Github | - |
| **OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM** | arXiv | 2025-10-17 | Github | - |
| **NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching** | arXiv | 2025-10-16 | - | - |
| **InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue** | arXiv | 2025-10-15 | Github | - |
| **VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation** | arXiv | 2025-10-10 | Github | - |
| **LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training** | arXiv | 2025-10-09 | Github | Demo |
| **Qwen3-Omni Technical Report** | arXiv | 2025-09-22 | Github | Demo |
| **InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency** | arXiv | 2025-08-27 | Github | Demo |
| **MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone** | - | 2025-08-26 | Github | Demo |
| **Thyme: Think Beyond Images** | arXiv | 2025-08-18 | Github | Demo |
| **Introducing GPT-5** | OpenAI | 2025-08-07 | - | - |
| **dots.vlm1** | rednote-hilab | 2025-08-06 | Github | Demo |
| **Step3: Cost-Effective Multimodal Intelligence** | StepFun | 2025-07-31 | Github | Demo |
| **GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning** | arXiv | 2025-07-02 | Github | Demo |
| **DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World** | arXiv | 2025-06-30 | Github | - |
| **Qwen VLo: From "Understanding" the World to "Depicting" It** | Qwen | 2025-06-26 | - | Demo |
| **MMSearch-R1: Incentivizing LMMs to Search** | arXiv | 2025-06-25 | Github | - |
| **Show-o2: Improved Native Unified Multimodal Models** | arXiv | 2025-06-18 | Github | - |
| **Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities** | Google | 2025-06-17 | - | - |
| **Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning** | arXiv | 2025-06-16 | Github | - |
| **MiMo-VL Technical Report** | arXiv | 2025-06-04 | Github | - |

…