UX-Decoder / Segment-Everything-Everywhere-All-At-Once
[NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"
AI Architecture Analysis
This repository is indexed by RepoMind. By analyzing UX-Decoder/Segment-Everything-Everywhere-All-At-Once in our AI interface, you can generate architecture diagrams, visualize control flow, and run automated security audits across the codebase.
Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.
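As a rough illustration of the on-demand loading idea (in contrast to chunk-based retrieval), the sketch below concatenates whole files into a single model context. This is a minimal, hypothetical sketch: `load_files_on_demand` and the example paths are illustrative and are not part of RepoMind or of this repository.

```python
from pathlib import Path

def load_files_on_demand(repo_root: str, relevant_paths: list[str], budget_chars: int = 200_000) -> str:
    """Concatenate whole source files into a single context string.

    Unlike chunk-based RAG, each file is included in full, so the model sees
    complete class and function definitions rather than isolated fragments.
    """
    parts, used = [], 0
    for rel_path in relevant_paths:
        text = (Path(repo_root) / rel_path).read_text(encoding="utf-8", errors="ignore")
        if used + len(text) > budget_chars:
            break  # stop once the context budget would be exceeded
        parts.append(f"### {rel_path}\n{text}")
        used += len(text)
    return "\n\n".join(parts)

# Hypothetical usage: pull whole decoder files into context before asking an architecture question.
# context = load_files_on_demand("Segment-Everything-Everywhere-All-At-Once", ["path/to/decoder.py"])
```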
Repository Overview (README excerpt)
*SEEM:* Segment Everything Everywhere All at Once :grapes: \[Read our arXiv Paper\] :apple: \[Try our Demo\]

We introduce **SEEM**, which can **S**egment **E**verything **E**verywhere with **M**ulti-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types, including visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio). It can also work with any combination of prompts or generalize to custom prompts!

by Xueyan Zou*, Jianwei Yang*, Hao Zhang*, Feng Li*, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao^, Yong Jae Lee^, in **NeurIPS 2023**.

A brief introduction of all the generic and interactive segmentation tasks we can do!

:rocket: Updates

• **[2023.11.2]** SEEM is applied in LLaVA-Interactive: an all-in-one demo for Image Chat, Segmentation, Generation and Editing. Experience the future of interactive image editing with visual chat. [Project Page] [Demo] [Code] [Paper]
• **[2023.10.23]** SEEM is used in Set-of-Mark Prompting: a brand-new visual prompting technique for GPT-4V! It totally unleashes the extraordinary visual grounding power of GPT-4V! [Project Page] [Code] [Paper]
• **[2023.10.10]** We release the training log for SEEM-Large-v1 and the log for SEEM-Tiny-v1!
• **[2023.10.04]** We are excited to release :white_check_mark: training/evaluation/demo code, :white_check_mark: new checkpoints, and :white_check_mark: comprehensive readmes for ***both X-Decoder and SEEM***!
• **[2023.09.25]** Our work has been accepted to NeurIPS 2023!
• **[2023.07.27]** We are excited to release our X-Decoder training code! We will release its descendant SEEM training code very soon!
• **[2023.07.10]** We release Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. Code and checkpoint are available!
• **[2023.05.02]** We have released the SEEM Focal-L and X-Decoder Focal-L checkpoints and configs!
• **[2023.04.28]** We have updated the arXiv paper to show *better interactive segmentation results than SAM*, which was trained on 50× more data than ours!
• **[2023.04.26]** We have released the Demo Code and the SEEM-Tiny checkpoint! Please try the one-line demo to get started!
• **[2023.04.20]** SEEM Referring Video Segmentation is out! Please try the Video Demo and take a look at the NeRF examples.
:bookmark_tabs: Catalog

We release the following contents for **both SEEM and X-Decoder** :exclamation:

• [x] Demo Code
• [x] Model Checkpoint
• [x] Comprehensive User Guide
• [x] Training Code
• [x] Evaluation Code

:point_right: **One-Line SEEM Demo with Linux:**

:round_pushpin: *[New]* **Getting Started:**

• INSTALL.md
• DATASET.md
• TRAIN.md
• EVAL.md

:round_pushpin: *[New]* **Latest Checkpoints and Numbers:**

|                 |            |           | COCO |       |        | Ref-COCOg |        |        | VOC     |         | SBD     |         |
|-----------------|------------|-----------|------|-------|--------|-----------|--------|--------|---------|---------|---------|---------|
| Method          | Checkpoint | Backbone  | PQ ↑ | mAP ↑ | mIoU ↑ | cIoU ↑    | mIoU ↑ | AP50 ↑ | NoC85 ↓ | NoC90 ↓ | NoC85 ↓ | NoC90 ↓ |
| X-Decoder       | ckpt       | Focal-T   | 50.8 | 39.5  | 62.4   | 57.6      | 63.2   | 71.6   | -       | -       | -       | -       |
| X-Decoder-oq201 | ckpt       | Focal-L   | 56.5 | 46.7  | 67.2   | 62.8      | 67.5   | 76.3   | -       | -       | -       | -       |
| SEEM_v0         | ckpt       | Focal-T   | 50.6 | 39.4  | 60.9   | 58.5      | 63.5   | 71.6   | 3.54    | 4.59    | *       | *       |
| SEEM_v0         | -          | Davit-d3  | 56.2 | 46.8  | 65.3   | 63.2      | 68.3   | 76.6   | 2.99    | 3.89    | 5.93    | 9.23    |
| SEEM_v0         | ckpt       | Focal-L   | 56.2 | 46.4  | 65.5   | 62.8      | 67.7   | 76.2   | 3.04    | 3.85    | *       | *       |
| SEEM_v1         | ckpt       | SAM-ViT-B | 52.0 | 43.5  | 60.2   | 54.1      | 62.2   | 69.3   | 2.53    | 3.23    | *       | *       |
| SEEM_v1         | ckpt       | SAM-ViT-L | 49.0 | 41.6  | 58.2   | 53.8      | 62.2   | 69.5   | 2.40    | 2.96    | *       | *       |
| SEEM_v1         | ckpt/log   | Focal-T   | 50.8 | 39.4  | 60.7   | 58.5      | 63.7   | 72.0   | 3.19    | 4.13    | *       | *       |
| SEEM_v1         | ckpt/log   | Focal-L   | 56.1 | 46.3  | 65.8   | 62.4      | 67.8   | 76.0   | 2.66    | 3.44    | *       | *       |

**SEEM_v0:** supports single interactive object training and inference
**SEEM_v1:** supports multiple interactive objects training and inference
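For context on the interactive-segmentation columns: NoC85 and NoC90 are the standard Number-of-Clicks metrics, i.e. the average number of corrective clicks needed before the predicted mask reaches 85% or 90% IoU with the ground truth (lower is better, capped at a maximum click budget). Below is a minimal sketch of how such a metric is commonly computed; the helper names are hypothetical, and the per-click IoU values would come from a model's interactive inference loop, which is not shown.

```python
import numpy as np

def noc_for_image(iou_per_click: list[float], target_iou: float = 0.85, max_clicks: int = 20) -> int:
    """Number of Clicks (NoC): the first click index at which IoU reaches the target.

    iou_per_click[k] is the IoU of the predicted mask after k+1 clicks. If the
    target IoU is never reached within the budget, the maximum is charged.
    """
    for clicks, iou in enumerate(iou_per_click[:max_clicks], start=1):
        if iou >= target_iou:
            return clicks
    return max_clicks

def mean_noc(per_image_ious: list[list[float]], target_iou: float = 0.85) -> float:
    """Average NoC over a dataset, as reported in the NoC85 / NoC90 columns above."""
    return float(np.mean([noc_for_image(ious, target_iou) for ious in per_image_ious]))

# Toy example: one image reaches 85% IoU on the 3rd click, another on the 1st.
print(mean_noc([[0.60, 0.78, 0.88, 0.91], [0.90]], target_iou=0.85))  # -> 2.0
```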
:fire: **Related projects:**

• FocalNet and DaViT: we use FocalNet and DaViT as the vision backbones.
• UniCL: we use the unified contrastive learning technique for learning image-text representations.
• X-Decoder: we built SEEM on X-Decoder, a generalist decoder that can perform multiple tasks with a single model.

:fire: **Other projects you may find interesting:**

• Semantic-SAM: a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity.
• OpenSeed: strong open-set segmentation methods.
• Grounding SAM: combining Grounding DINO and Segment Anything; Grounding DINO is a strong open-set detection model.
• X-GPT: a conversational visual agent supported by X-Decoder.
• LLaVA: Large Language and Vision Assistant.

:bulb: Highlights

Inspired by the appealing universal interface of LLMs, we advocate a universal, interactive multi-modal interface for any type of segmentation with **ONE SINGLE MODEL**. We emphasize **4** important features of **SEEM** below.

• **Versatility**: works with various types of prompts, for example clicks, boxes, polygons, scribbles, texts, and referring images;
• **Compositionality**: deals with any composition of prompts;
• **Interactivity**: interacts with the user over multiple rounds, thanks to SEEM's memory prompt that stores the session history;
• **Semantic awareness**: gives a semantic label to any predicted mask.

:unicorn: How to use the demo

• Try our default examples first;
• Upload an image;
• Select at least one prompt type of your choice (if you want to use a referred region from another image, please check "Example" and upload that image in the referring-image panel);
• Remember to provide the actual prompt for each prompt type you select, otherwise…
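To make the prompt-composition idea above concrete, here is a hedged sketch of what bundling visual and language prompts for a single request could look like. `PromptBundle` and `run_demo_step` are hypothetical illustrations, not the actual demo code of this repository.

```python
from dataclasses import dataclass, field

@dataclass
class PromptBundle:
    """A composition of the prompt types described above (any subset may be set)."""
    points: list[tuple[int, int]] = field(default_factory=list)            # click coordinates (x, y)
    boxes: list[tuple[int, int, int, int]] = field(default_factory=list)   # boxes as (x0, y0, x1, y1)
    text: str | None = None             # free-form referring text
    referring_image: str | None = None  # path to the image holding the referred region

def run_demo_step(image_path: str, prompts: PromptBundle) -> None:
    """Illustrative only: check that at least one prompt type carries an actual value,
    mirroring the requirement stated in the demo instructions above."""
    if not (prompts.points or prompts.boxes or prompts.text or prompts.referring_image):
        raise ValueError("Select at least one prompt type and provide its value.")
    print(f"Would segment {image_path} with prompts: {prompts}")

run_demo_step("path/to/image.jpg", PromptBundle(points=[(320, 240)], text="the dog on the left"))
```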