# KellerJordan / modded-nanogpt

NanoGPT (124M) in 2 minutes
## Modded-NanoGPT

This repository hosts the *NanoGPT speedrun*, in which we (collaboratively|competitively) search for the fastest algorithm that uses 8 NVIDIA H100 GPUs to train a language model to 3.28 cross-entropy loss on the FineWeb validation set. The target (3.28 validation loss on FineWeb) follows Andrej Karpathy's GPT-2 replication in llm.c, which attains that loss after running for 45 minutes. The speedrun code also descends from llm.c's PyTorch trainer, which itself descends from NanoGPT, hence the name of the repo.

Thanks to the efforts of many contributors, this repo now contains a training algorithm which attains the target performance in:

- under 90 seconds on 8xH100 (the llm.c GPT-2 replication needed 45 minutes)
- under 400M tokens (the llm.c GPT-2 replication needed 10B)

This improvement in training speed has been brought about by the following techniques:

- Modernized architecture: rotary embeddings, QK-Norm, and ReLU²
- The Muon optimizer [writeup] [repo]
- FP8 matmul for the head, with asymmetric rescaling and logit softcapping
- Initialization of projections to zero (muP-like)
- Skip connections from the embedding to every block, as well as from block 3 to block 6
- Extra embeddings which are mixed into the values in attention layers (inspired by Zhou et al. 2024)
- Flash Attention 3 with a long-short sliding-window attention pattern (inspired by Gemma 2) and window-size warmup with YaRN
- Aligning training batch starts with EoS and setting a max document length
- Accumulating gradients for 2 steps for the embedding and lm_head before updating parameters
- Enabling the model to back out contributions from the first 2/3 of layers before prediction
- Polar Express implementation in Muon
- Smear module to enable a 1-token lookback
- Sparse attention gate
- NorMuon
- Cautious weight decay with a schedule tied to the LR
- Exponential decay of the residual stream
- Batch size schedule
- Max sequence length schedule
- Partial key offset
- Multi-token prediction
- Untying the embedding and lm_head at 2/3 of training
- Additional gating on value embeddings and the skip connection
- Paired head attention
- Bigram hash embedding
- Partitioned hyperconnections

As well as many systems optimizations.

Contributors list (growing with each new record): @bozavlado; @brendanh0gan; @fernbear.bsky.social; @Grad62304977; @jxbz; @kellerjordan0; @KoszarskyB; @leloykun; @YouJiacheng; @jadenj3o; @KonstantinWilleke, @alexrgilbert, @adricarda, @tuttyfrutyee, @vdlad; @ryanyang0, @vagrawal, @classiclarryd, @byronxu99, @varunneal, @EmelyanenkoK, @bernard24/https://www.hiverge.ai/, @Gusarich, @li_zichong, @akash5474, @snimu, @roeeshenberg, @ChrisJMcCormick, @dominikkallusky, @acutkosky, @manikbhandari, @andrewbriand, @jrauvola, @soren_dunn_, @photon_mz, @srashedll, @dhrvji, @EmmettBicker, @dualverse-ai, @sisovicm, @moof2x, @samacqua

---

### Running the current record

To run the current record, run the following commands. Add `torchrun` to your PATH if `./run.sh` gives an error.

**Note: torch.compile will add around 7 minutes of latency the first time you run the code.**

Official records are timed on 8 NVIDIA H100 GPUs from https://app.primeintellect.ai/. PrimeIntellect has generously sponsored recent validation runs.
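Several of the architecture modernizations above are small, self-contained pieces. As a rough illustration only (not the repo's actual code; the class and function names below are invented for this sketch), ReLU² activations, zero-initialized projections, and QK-Norm could look like this in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLU2MLP(nn.Module):
    """MLP block with the squared-ReLU (ReLU^2) activation and a
    zero-initialized output projection (muP-like), so the block
    contributes nothing to the residual stream at initialization."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc_in = nn.Linear(dim, hidden, bias=False)
        self.fc_out = nn.Linear(hidden, dim, bias=False)
        nn.init.zeros_(self.fc_out.weight)  # zero-init projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(F.relu(self.fc_in(x)).square())

def qk_norm(q: torch.Tensor, k: torch.Tensor):
    """QK-Norm: RMS-normalize queries and keys over the head dimension
    before computing attention scores, which bounds attention logits."""
    return F.rms_norm(q, (q.size(-1),)), F.rms_norm(k, (k.size(-1),))
```

With the output projection zeroed, each residual block starts as the identity, which pairs well with aggressive learning rates early in training.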
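At the heart of the Muon optimizer is a step that replaces the momentum matrix with an approximately semi-orthogonal one via a quintic Newton-Schulz iteration. The sketch below follows the publicly documented form of that iteration (the coefficients are taken from the Muon writeup); treat it as illustrative rather than the repo's exact implementation:

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to G (i.e. U V^T
    from its SVD) with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon writeup
    # Dividing by the Frobenius norm guarantees spectral norm <= 1,
    # which the iteration needs to converge.
    X = G / (G.norm() + 1e-7)
    if X.size(0) > X.size(1):  # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```

After a few iterations every singular value lands near 1, so the update has uniform "strength" across directions regardless of the raw gradient's conditioning.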
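Logit softcapping, mentioned above alongside the FP8 head (and tuned in the record history, where the cap was lowered from 30 to 15), smoothly squashes logits into a bounded range. A minimal sketch of the standard tanh formulation popularized by Gemma 2; the exact rescaling used in this repo may differ:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    """Soft-cap logits into (-cap, cap) via tanh: near-identity for small
    values, saturating smoothly (with nonzero gradient) for large ones."""
    return cap * torch.tanh(logits / cap)
```

Unlike hard clipping, the tanh form keeps gradients flowing through extreme logits, which also helps keep values representable in low-precision (e.g. FP8) matmuls.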
### Alternative: Running with Docker (recommended for precise timing)

For cases where the CUDA or NCCL versions aren't compatible with your current system setup, Docker can be a helpful alternative. This approach standardizes the versions of CUDA, NCCL, cuDNN, and Python, reducing dependency issues and simplifying setup. Note: an NVIDIA driver must already be installed on the system (useful if only the NVIDIA driver and Docker are available). To get an interactive container, you can use

---

### World record history

The following is the historical progression of world speed records for the following competitive task:

> *Train a neural network to ≤3.28 validation loss on FineWeb using 8x NVIDIA H100s.*

Note: The 3.28 target was selected to match Andrej Karpathy's GPT-2 (small) reproduction.

| # | Record time | Description | Date | Log | Contributors |
| - | - | - | - | - | - |
| 1 | 45 minutes | llm.c baseline | 05/28/24 | log | @karpathy, llm.c contributors |
| 2 | 31.4 minutes | Tuned learning rate & rotary embeddings | 06/06/24 | log | @kellerjordan0 |
| 3 | 24.9 minutes | Introduced the Muon optimizer | 10/04/24 | none | @kellerjordan0, @jxbz |
| 4 | 22.3 minutes | Muon improvements | 10/11/24 | log | @kellerjordan0, @bozavlado |
| 5 | 15.2 minutes | Pad embeddings, ReLU², zero-init projections, QK-norm | 10/14/24 | log | @Grad62304977, @kellerjordan0 |
| 6 | 13.1 minutes | Distributed the overhead of Muon | 10/18/24 | log | @kellerjordan0 |
| 7 | 12.0 minutes | Upgraded to PyTorch 2.5.0 | 10/18/24 | log | @kellerjordan0 |
| 8 | 10.8 minutes | Untied embedding and head | 11/03/24 | log | @Grad62304977, @kellerjordan0 |
| 9 | 8.2 minutes | Value and embedding skip connections, momentum warmup, logit softcap | 11/06/24 | log | @Grad62304977, @kellerjordan0 |
| 10 | 7.8 minutes | Bfloat16 activations | 11/08/24 | log | @kellerjordan0 |
| 11 | 7.2 minutes | U-net pattern skip connections & double lr | 11/10/24 | log | @brendanh0gan |
| 12 | 5.03 minutes | 1024-ctx dense causal attention → 64K-ctx FlexAttention | 11/19/24 | log | @KoszarskyB |
| 13 | 4.66 minutes | Attention window warmup | 11/24/24 | log | @fernbear.bsky.social |
| 14 | 4.41 minutes | Value embeddings | 12/04/24 | log | @KoszarskyB |
| 15 | 3.95 minutes | U-net pattern value embeddings, assorted code optimizations | 12/08/24 | log | @leloykun, @YouJiacheng |
| 16 | 3.80 minutes | Split value embeddings, block sliding window, separate block mask | 12/10/24 | log | @YouJiacheng |
| 17 | 3.57 minutes | Sparsify value embeddings, improve rotary embeddings, drop an attn layer | 12/17/24 | log | @YouJiacheng |
| 18 | 3.4 minutes | Lower logit softcap from 30 to 15 | 01/04/25 | log | @KoszarskyB |
| 19 | 3.142 minutes | FP8 head, offset logits, lr decay to 0.1 instead of 0.0 | 01/13/25 | log | @YouJiacheng |
| 20 | 2.992 minutes | Merged QKV weights, long-short attention, attention scale, l… | | | |