tenstorrent / tt-metal
:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Repository Overview (README excerpt)
Hardware | Install | Discord | Join Us | Bounty $

**TT-NN** is a Python & C++ Neural Network OP library.

API Reference | Model Demos

Featured Models

The Models team is focused on developing the following models, optimizing them for performance, accuracy, and compatibility. Follow each model link for more details.

>[!IMPORTANT]
> For a **full model list** see the **Model Matrix**, or visit the **Developer Hub**.

>[!NOTE]
> Performance Metrics:
> - Time to First Token (TTFT) measures the time (in milliseconds) it takes to generate the first output token after input is received.
> - T/S/U (Tokens per Second per User): Represents the throughput of first-token generation after prefill. It is calculated as 1 / inter-token latency.
> - T/S (Tokens per Second): Represents total token throughput, calculated as T/S = T/S/U x batch size.
> - TP (Tensor Parallel) and DP (Data Parallel): Indicate the parallelization factors across multiple devices.
> - Reported LLM Performance: Based on an input sequence length of 128 tokens for all models.
> - Performance Data Source: Metrics were collected using the tt-metal model demos (linked above). Results may vary when using other runtimes such as the vLLM inference server.
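The two throughput formulas in the note above can be sketched in a few lines of Python. This is a minimal illustration, not part of the tt-metal demos: the function names `tokens_per_second_per_user` and `total_tokens_per_second` are hypothetical, and the latency and batch size in the usage lines are made-up inputs, not measurements.

```python
import math


def tokens_per_second_per_user(inter_token_latency_s: float) -> float:
    """T/S/U = 1 / inter-token latency (latency given in seconds)."""
    if inter_token_latency_s <= 0:
        raise ValueError("inter-token latency must be positive")
    return 1.0 / inter_token_latency_s


def total_tokens_per_second(tsu: float, batch_size: int) -> float:
    """T/S = T/S/U x batch size."""
    if batch_size < 1:
        raise ValueError("batch size must be at least 1")
    return tsu * batch_size


# Illustrative inputs only (hypothetical, not measured values):
# a 50 ms inter-token latency served to a batch of 32 users.
tsu = tokens_per_second_per_user(0.050)
ts = total_tokens_per_second(tsu, 32)
print(f"T/S/U = {tsu:.1f}, T/S = {ts:.1f}")
```

Because T/S scales linearly with batch size under this definition, a batch of 1 gives T/S equal to T/S/U.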
Llama 3.3 70B (TP=32)

| Batch | Hardware | TTFT (ms) | T/S/U | Target T/S/U | T/S | TT-Metalium Release | vLLM Tenstorrent Repo Release |
|-------|----------|-----------|-------|--------------|-----|---------------------|-------------------------------|
| 32 | Galaxy (Wormhole) | 53 | 72.5 | 80 | 2268.8 | v0.65.0-rc7 | 59be953 |

Qwen 2.5 7B (TP=2)

| Batch | Hardware | TTFT (ms) | T/S/U | Target T/S/U | T/S | TT-Metalium Release | vLLM Tenstorrent Repo Release |
|-------|----------|-----------|-------|--------------|-----|---------------------|-------------------------------|
| 32 | n300 (Wormhole) | 109 | 22.1 | 30 | 707.2 | v0.62.0-rc35 | ced0161 |

Qwen 2.5 72B (TP=8)

| Batch | Hardware | TTFT (ms) | T/S/U | Target T/S/U | T/S | TT-Metalium Release | vLLM Tenstorrent Repo Release |
|-------|----------|-----------|-------|--------------|-----|---------------------|-------------------------------|
| 32 | QuietBox (Wormhole) | 223 | 15.4 | 20 | 492.8 | v0.62.0-rc25 | e7c329b |

Whisper (distil-large-v3)

| Batch | Hardware | TTFT (ms) | T/S/U | Target T/S/U | T/S | TT-Metalium Release |
|-------|----------|-----------|-------|--------------|-----|---------------------|
| 1 | n150 (Wormhole) | 163 | 105.0 | 45 | 105.0 | v0.65.0-dev20251208 |
| 1 | p150 (Blackhole) | 63 | 263.4 | | 263.4 | v0.65.0-dev20251208 |

Mixtral 8x7B (TP=8)

| Batch | Hardware | TTFT (ms) | T/S/U | Target T/S/U | T/S | TT-Metalium Release |
|-------|----------|-----------|-------|--------------|-----|---------------------|
| 32 | QuietBox (Wormhole) | 122 | 24.9 | 33 | 796.8 | v0.62.0-dev20251015 |

Blackhole software optimization is under active development. Please join us in shaping the future of open source AI! [\[Discord\]](https://discord.gg/tenstorrent) [\[Developer Hub\]](https://tenstorrent.com/developers)

For more information regarding vLLM installation and environment creation visit the Tenstorrent vLLM repository.
Model Updates

For the latest model updates and features, please see MODEL_UPDATES.md

Model Bring-Up and Testing

For information on initial model procedures, please see Model Bring-Up and Testing

TT-NN Tech Reports

- Advanced Performance Optimizations for Models (updated March 4th, 2025)
- ViT Implementation in TT-NN on GS (updated Sept 22nd, 2024)
- LLMs Bring up in TT-NN (updated Oct 29th, 2024)
- CNN Bring up & Optimization in TT-NN (updated Jan 22nd, 2025)

Benchmarks

- Matrix Multiply FLOPS on Wormhole and Blackhole (updated June 17th, 2025)

---

**TT-Metalium** is our low-level programming model, enabling kernel development for Tenstorrent hardware.

Programming Guide | API Reference

Getting started

Get started with simple kernels.

TT-Metalium Tech Reports

- Matrix Engine (updated Sept 6th, 2024)
- Data Formats (updated Sept 7th, 2024)
- Reconfiguring Data Formats (updated Oct 17th, 2024)
- Handling special floating-point numbers (updated Oct 5th, 2024)
- Allocator (updated Dec 19th, 2024)
- Tensor Layouts (updated Sept 6th, 2024)
- Saturating DRAM Bandwidth (updated Sept 6th, 2024)
- Flash Attention on Wormhole (updated Sept 6th, 2024)
- CNNs on TT Architectures (updated Sept 6th, 2024)
- Ethernet and Multichip Basics (updated Sept 20th, 2024)
- Blackhole Bring-Up Programming Guide (updated Dec 18th, 2024)
- Sub-Devices (updated Jan 7th, 2025)

Scaleout Tech Reports

- Programming Mesh of Devices (Scale-Up) (updated Jan 6th, 2026)
- Programming Multiple Meshes (Scale-Out) (updated Jan 19th, 2026)
- TT-Fabric Architecture (updated Dec 1st, 2025)
- TT-Distributed Architecture (updated Oct 20th, 2025)

TT-Metalium Programming Examples

Hello World

- Hello World! Compute Kernel
- Hello World! Data Movement Kernel

Add Integers

- Add 2 Integers in Baby RiscV
- Add 2 Integers in Compute Kernel

Simple Tensor Manipulation

- Sharding
- Padding

DRAM Data Movement

- Dram Loopback Data Movement

Eltwise

- Eltwise Unary OP in Vector Engine (SFPU)
- Eltwise Binary OP in Matrix Engine (FPU)

Matmul

- Matmul OP on a Single_core
- Matmul OP on Multi_core (Basic)
- Matmul Multi_core Reuse (Optimized)
- Matmul Multi_core Multi-Cast (Optimized)

Tools and Instruments

TT-NN Visualizer

A comprehensive tool for visualizing and analyzing model execution, offering interactive graphs, memory plots, tensor details, buffer overviews, operation flow graphs, and multi-instance support with file or SSH-based report loading.

TT-Exalens

The TT-Exalens repository describes TT-Lensium, a low-level debugging tool for Tenstorrent hardware. It allows developers to access and communicate with Wormhole and Blackhole devices.

TT-SMI

The TT-SMI repository describes the Tenstorrent System Management Interface. This command line utility can interact with Tenstorrent devices on host. TT-SMI provides a…