
mlc-ai / web-llm

High-performance In-browser LLM Inference Engine

17,587 stars
1,222 forks
157 issues
TypeScript · SCSS · HTML

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing mlc-ai/web-llm in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/mlc-ai/web-llm)
Preview: Analyzed by RepoMind

Repository Overview (README excerpt)


WebLLM

**High-Performance In-Browser LLM Inference Engine.**

Documentation | Blogpost | Paper | Examples

**Overview**

WebLLM is a high-performance in-browser LLM inference engine that brings language model inference directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support, accelerated with WebGPU.

WebLLM is **fully compatible with the OpenAI API.** That is, you can use the same OpenAI API on **any open-source model** locally, with functionality including streaming, JSON mode, function calling (WIP), and more. This opens up many opportunities to build AI assistants for everyone while preserving privacy and enjoying GPU acceleration.

You can use WebLLM as a base npm package and build your own web application on top of it by following the examples below. This project is a companion to MLC LLM, which enables universal deployment of LLMs across hardware environments.

**Check out WebLLM Chat to try it out!**

**Key Features**

• **In-Browser Inference**: WebLLM is a high-performance, in-browser language model inference engine that leverages WebGPU for hardware acceleration, enabling powerful LLM operations directly within web browsers without server-side processing.
• **Full OpenAI API Compatibility**: Seamlessly integrate your app with WebLLM using the OpenAI API, with functionality such as streaming, JSON mode, logit-level control, seeding, and more.
• **Structured JSON Generation**: WebLLM supports state-of-the-art JSON-mode structured generation, implemented in the WebAssembly portion of the model library for optimal performance. Check the WebLLM JSON Playground on Hugging Face to try generating JSON output with a custom JSON schema.
• **Extensive Model Support**: WebLLM natively supports a range of models including Llama 3, Phi 3, Gemma, Mistral, Qwen (通义千问), and many others, making it versatile for various AI tasks. For the complete supported model list, check MLC Models.
• **Custom Model Integration**: Easily integrate and deploy custom models in MLC format, allowing you to adapt WebLLM to specific needs and scenarios and enhancing flexibility in model deployment.
• **Plug-and-Play Integration**: Easily integrate WebLLM into your projects using package managers like npm and Yarn, or directly via CDN, complete with comprehensive examples and a modular design for connecting with UI components.
• **Streaming & Real-Time Interactions**: Supports streaming chat completions, allowing real-time output generation that enhances interactive applications like chatbots and virtual assistants.
• **Web Worker & Service Worker Support**: Optimize UI performance and manage model lifecycles efficiently by offloading computations to separate worker threads or service workers.
• **Chrome Extension Support**: Extend the functionality of web browsers through custom Chrome extensions using WebLLM, with examples available for building both basic and advanced extensions.

**Built-in Models**

Check the complete list of available models on MLC Models. WebLLM supports a subset of these models, and the list can be accessed at `prebuiltAppConfig.model_list`. Here are the primary families of models currently supported:

• **Llama**: Llama 3, Llama 2, Hermes-2-Pro-Llama-3
• **Phi**: Phi 3, Phi 2, Phi 1.5
• **Gemma**: Gemma-2B
• **Mistral**: Mistral-7B-v0.3, Hermes-2-Pro-Mistral-7B, NeuralHermes-2.5-Mistral-7B, OpenHermes-2.5-Mistral-7B
• **Qwen (通义千问)**: Qwen2 0.5B, 1.5B, 7B

If you need more models, request a new model by opening an issue, or check Custom Models for how to compile and use your own models with WebLLM.

**Jumpstart with Examples**

Learn how to use WebLLM to integrate large language models into your application and generate chat completions through the simple Chatbot example. For an advanced example of a larger, more complicated project, check WebLLM Chat. More examples for different use cases are available in the examples folder.
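Each prebuilt model is identified by a model ID string, so the supported families listed above can be filtered programmatically. A minimal sketch of that filtering follows; the model IDs in the literal array are illustrative stand-ins for the entries that, in a real app, would come from WebLLM's prebuilt app config list:

```typescript
// Illustrative stand-ins for the prebuilt model list (in a real app, read the
// IDs from WebLLM's prebuiltAppConfig.model_list instead of a literal).
const modelIds: string[] = [
  "Llama-3-8B-Instruct-q4f32_1-MLC",
  "Phi-3-mini-4k-instruct-q4f16_1-MLC",
  "Mistral-7B-Instruct-v0.3-q4f16_1-MLC",
];

// Pick out one model family by substring match on the model ID.
function modelsMatching(ids: string[], family: string): string[] {
  return ids.filter((id) => id.includes(family));
}

console.log(modelsMatching(modelIds, "Llama-3"));
// → ["Llama-3-8B-Instruct-q4f32_1-MLC"]
```

This is a convenient way to present only one family (or one quantization) in a model picker UI.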
**Get Started**

WebLLM offers a minimalist and modular interface to access the chatbot in the browser. The package is designed in a modular way to hook into any UI component.

**Installation**

Install the package with your package manager (npm or Yarn), then import the module in your code.

**CDN Delivery**

Thanks to jsdelivr.com, WebLLM can be imported directly through a URL and works out of the box on cloud development platforms like jsfiddle.net, CodePen.io, and Scribbler. It can also be dynamically imported with `import(...)`.

**Create MLCEngine**

Most operations in WebLLM are invoked through the `MLCEngine` interface. You can create an engine instance and load a model by calling the `CreateMLCEngine` factory function. (Note that loading a model requires downloading it, which can take a significant amount of time on the very first, uncached run. You should handle this asynchronous call properly.) Under the hood, this factory function first creates an engine instance (synchronous) and then loads the model (asynchronous); you can also do these two steps separately in your application.

**Chat Completion**

After successfully initializing the engine, you can invoke chat completions using OpenAI-style chat APIs through the `engine.chat.completions` interface. For the full list of parameters and their descriptions, check the section below and the OpenAI API reference. (Note: the `model` parameter is not supported and will be ignored here; instead, specify the model via `CreateMLCEngine` or `engine.reload` as shown in Create MLCEngine above.)

**Streaming**

WebLLM also supports streaming chat completion. To use it, simply pass `stream: true` to the chat completion call.

**Advanced Usage**

**Using Workers**

You can put the heavy computation in a worker script to optimize your application's performance. To do so, you need to:

• Create a handler in the worker thread that communicates with the frontend while handling the requests.
• Create a Worker Engine in your main application, which under the hood sends messages to the handler in the worker thread.

For detailed implementations of the different kinds of workers, check the following sections.
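The engine-creation, chat-completion, and streaming flow described above can be sketched as follows. The WebLLM-specific calls (`CreateMLCEngine`, `engine.chat.completions.create`) require a browser with WebGPU, so they appear only in comments; the executable part mocks the streamed response with an async generator to show the OpenAI-style consumption loop. The model ID in the comment is illustrative:

```typescript
// In a browser, the real flow looks roughly like this:
//   import { CreateMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC");
//   const chunks = await engine.chat.completions.create({
//     messages: [{ role: "user", content: "Hello!" }],
//     stream: true, // request incremental output
//   });
// The loop below consumes such a stream. A mock generator stands in for the
// engine so the snippet runs anywhere, without WebGPU.

interface ChatChunk {
  choices: { delta: { content?: string } }[];
}

// Yields OpenAI-style streaming chunks, each carrying an incremental delta.
async function* mockChunks(): AsyncGenerator<ChatChunk> {
  for (const piece of ["Hello", " from", " WebLLM"]) {
    yield { choices: [{ delta: { content: piece } }] };
  }
}

// Accumulate the deltas into the full reply, as a chat UI would.
async function collectReply(stream: AsyncIterable<ChatChunk>): Promise<string> {
  let reply = "";
  for await (const chunk of stream) {
    reply += chunk.choices[0]?.delta.content ?? "";
  }
  return reply;
}

collectReply(mockChunks()).then((reply) => console.log(reply));
// → "Hello from WebLLM"
```

In a real app you would append each delta to the UI as it arrives rather than waiting for the full reply, which is the point of streaming.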
**Dedicated Web Worker**

WebLLM comes with API support for WebWorker so you ca…
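The two-sided worker setup described above (a handler in the worker thread, a worker engine in the main thread) is at heart a request/reply message-passing pattern. A minimal in-process sketch of that pattern follows; the `FakeChannel` class is a hypothetical stand-in for the `Worker` postMessage boundary, and the handler here just echoes rather than calling a real engine:

```typescript
// Sketch of the worker pattern: the handler lives on the "worker side" and the
// main thread only talks to it through a channel. FakeChannel is an in-process
// stand-in for the Worker postMessage/onmessage boundary so this runs anywhere.
type Handler = (msg: string) => string;

class FakeChannel {
  constructor(private handler: Handler) {}
  // Mirrors worker.postMessage + onmessage: send a request, receive a reply.
  post(msg: string): string {
    return this.handler(msg);
  }
}

// "Worker side": handle requests (a real handler would invoke the engine and
// keep the heavy computation off the UI thread).
const handler: Handler = (prompt) => `echo: ${prompt}`;

// "Main side": the app sees only the channel, never the engine directly.
const channel = new FakeChannel(handler);
console.log(channel.post("Hello")); // → "echo: Hello"
```

The design benefit is the same in both the sketch and the real API: the frontend stays responsive because inference happens behind a message boundary.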