
tenstorrent / tt-forge

Tenstorrent's MLIR-based compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware through an open-source, general, and performant compiler.

190 stars
28 forks
92 issues
Python · JavaScript · Shell

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing tenstorrent/tt-forge in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/tenstorrent/tt-forge)

Repository Overview (README excerpt)

Crawler view

Hardware | Documentation | Discord | Join Us | Bounty $

TT-Forge is Tenstorrent's open-source AI compiler stack, built on TT-Metalium. It brings together frontends, an MLIR compiler, a kernel DSL, and a model library to make running AI workloads on Tenstorrent hardware straightforward: 800+ model variants are tested in CI and thousands more run internally; if it fits in memory, it should run. Models like GPT-OSS 120B, Llama 3 70B, Stable Diffusion XL, Whisper, and YOLOv12 all run today from PyTorch, JAX, and ONNX. Inference and training, custom kernels to full models: all open source.

**Contents:** Sub-Projects · Run a Model · Train a Model · Write a Custom Kernel · Tested Models · Architecture · FAQ

## TT-Forge Sub-Projects

| Project | What It Does | Links |
|---------|--------------|-------|
| **TT-XLA** | Primary frontend for **PyTorch** and **JAX** models. Uses the PJRT interface to compile models into StableHLO graphs for TT-MLIR. Supports single and multi-chip. | Docs · Demos |
| **TT-Forge-ONNX** | TVM-based frontend for **ONNX**, **TensorFlow**, and **PaddlePaddle** models. Single-chip only. | Docs · Demos |
| **TT-MLIR** | Core MLIR-based compiler. Defines TTIR, TTNN, and TTKernel dialects, applies optimization passes (fusion, sharding, layout), and lowers to TT-Metalium. | Docs · Tools |
| **TT-Lang** | Python DSL for custom high-performance kernels. Write fused ops in Python with built-in simulation, profiling, and AI-assisted translation from Triton-class DSLs. *(Early preview)* | Docs |
| **TT-Blacksmith** | Optimized training recipes and experiments. 40+ examples spanning PyTorch, JAX, and Lightning across vision models, LLMs, and NLP. | Docs · Experiments |
| **TT-Forge-Models** | 800+ model variants continuously tested in CI. Standardized loaders for LLMs, vision, NLP, multimodal, detection, segmentation, speech, and more. | Repo |

## Run a Model

Get ResNet-50 running on Tenstorrent hardware in minutes:

> **Note:** Wheels are hosted on Tenstorrent's package index. The flag is required until packages are available on public PyPI. See the full Getting Started Guide for all setup options.

For ONNX models, see TT-Forge-ONNX. More demos in TT-XLA demos.

## Train a Model

A standard PyTorch training example runs on Tenstorrent hardware via TT-Blacksmith:

40+ ready-to-run training recipes: Llama, Gemma, Qwen, ViT, NeRF, and more. See the experiments table.

## Write a Custom Kernel

TT-Lang *(early preview)* lets you write high-performance kernels in Python instead of low-level C++. Here's a matmul with bias:

Python in, optimized hardware code out. The compiler handles NOC addressing, register allocation, and memory management. See the full matmul example and the TT-Lang repo for tutorials, simulation, and profiling tools.

## Tested Models

800+ model variants in the model library, continuously tested in CI, with thousands more run internally. Highlights:

| Category | Models |
|----------|--------|
| **LLMs** | Llama 3.1/3.2 (1B–70B), Qwen 2.5/3 (0.5B–32B), Falcon-3 (1B–10B), Phi-1/2/3/3.5, Gemma 1.1/2 (2B–7B), Mistral/Ministral (7B–24B), Mamba 2.8B |
| **Vision** | ResNet-50, ViT, EfficientNet, MobileNetV1/V2, Swin, VoVNet, SegFormer, U-Net, UFLD/UFLDv2, MNIST |
| **NLP / Encoders** | BERT, ALBERT, BGE-M3, Qwen3-Embedding, RoBERTa, SqueezeBERT |
| **Multimodal** | BLIP (vision-language), Stable Diffusion XL (UNet) |
| **Multi-chip (N300+)** | Llama 3.1 8B/70B, Falcon-3 7B/10B, Mistral 7B/Nemo/Small 24B, Qwen 2.5/3 up to 32B |

See the full benchmark suite and demos for the complete list.

## Architecture

Interactive Tenstorrent Software Architecture Diagram: an overview of Tenstorrent's open-source AI software ecosystem. Click on components to navigate to their repositories.

## FAQ

- **Can the user set dtype? How?**
  - Datatypes are generally inferred by the front-end framework.
  - However, certain front ends provide options to override the default datatype selection; see the next bullet for an example.
  - Enable bfp8 conversion using compile options. The model **MUST** be cast to bfloat16 before compilation.
- **How to set shard configs?**
  - In tt-xla, sharding can be configured using the function from the module. Here's an example of how to set shard configurations ([See example model](https://github.co _...truncated for preview_
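The bfloat16 requirement in the FAQ above can be illustrated with plain PyTorch. This is a minimal sketch only: the actual compile-option names for enabling bfp8 conversion are frontend-specific and documented in tt-xla, so only the standard dtype cast is shown here.

```python
import torch
import torch.nn as nn

# Sketch: cast a model to bfloat16 BEFORE handing it to the compiler.
# (The bfp8-conversion compile options themselves are frontend-specific
# and intentionally omitted.)
model = nn.Linear(4, 4)
model = model.to(torch.bfloat16)  # casts all parameters and buffers

x = torch.randn(2, 4, dtype=torch.bfloat16)
out = model(x)  # forward pass runs in bfloat16
```

After the cast, every parameter and the output tensor carry `torch.bfloat16`, which is what the bfp8 conversion path expects as its input precision.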
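The "Train a Model" section above notes that standard PyTorch training runs via TT-Blacksmith. As a hedged sketch of what "standard" means here, the loop below is an ordinary PyTorch training loop on a toy regression problem; all Tenstorrent-specific device selection and recipe configuration live in the TT-Blacksmith experiments and are omitted, so this runs on CPU as written.

```python
import torch
import torch.nn as nn

# Generic PyTorch training loop (CPU). TT-Blacksmith recipes build on
# this same pattern; hardware/backend setup is omitted from this sketch.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(64, 8)
y = x.sum(dim=1, keepdim=True)  # toy target: sum of features

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

The point of the sub-project split is that this loop stays unchanged; moving it onto Tenstorrent hardware is a matter of the recipe's device configuration, not of rewriting the training code.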