naklecha / llama3-from-scratch
llama3 implementation one matrix multiplication at a time
## llama3 implemented from scratch
in this file, i implemented llama3 from scratch, one tensor and matrix multiplication at a time.

also, i'm going to load tensors directly from the model file that meta provided for llama3. you need to download the weights before running this file. here is the official link to download the weights: https://llama.meta.com/llama-downloads/

## tokenizer
i'm not going to implement a bpe tokenizer (but andrej karpathy has a really clean implementation)

link to his implementation: https://github.com/karpathy/minbpe

```
hello world!
```

## reading the model file
normally, reading this depends on how the model classes are written and the variable names inside them. but since we are implementing llama3 from scratch we will read the file one tensor at a time.

```
[
    "tok_embeddings.weight",
    "layers.0.attention.wq.weight",
    "layers.0.attention.wk.weight",
    "layers.0.attention.wv.weight",
    "layers.0.attention.wo.weight",
    "layers.0.feed_forward.w1.weight",
    "layers.0.feed_forward.w3.weight",
    "layers.0.feed_forward.w2.weight",
    "layers.0.attention_norm.weight",
    "layers.0.ffn_norm.weight",
    "layers.1.attention.wq.weight",
    "layers.1.attention.wk.weight",
    "layers.1.attention.wv.weight",
    "layers.1.attention.wo.weight",
    "layers.1.feed_forward.w1.weight",
    "layers.1.feed_forward.w3.weight",
    "layers.1.feed_forward.w2.weight",
    "layers.1.attention_norm.weight",
    "layers.1.ffn_norm.weight",
    "layers.2.attention.wq.weight"
]
```

```
{'dim': 4096, 'n_layers': 32, 'n_heads': 32, 'n_kv_heads': 8, 'vocab_size': 128256, 'multiple_of': 1024, 'ffn_dim_multiplier': 1.3, 'norm_eps': 1e-05, 'rope_theta': 500000.0}
```

we use this config to infer details about the model like:
- the model has 32 transformer layers
- each multi-head attention block has 32 heads
- the vocab size, and so on

## converting text to tokens
here we use tiktoken (i think an openai library) as the tokenizer

```
[128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
[' ', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']
```

## converting tokens to their embedding
I'M SORRY, but this is the only part of the codebase where i use an inbuilt neural network module.

anyway, our [17x1] tokens are now [17x4096], i.e. 17 embeddings (one for each token) of length 4096.

note: keep track of the shapes, it makes it much easier to understand everything

```
torch.Size([17, 4096])
```

## we then normalize the embedding using rms normalization
please note, after this step the shapes don't change, the values are just normalized.

things to keep in mind: we need a norm_eps (from the config) because we don't want to accidentally set rms to 0 and divide by 0.

here is the formula: rms(x) = sqrt(mean(x²) + norm_eps)

## building the first layer of the transformer

### normalization
you will see me accessing layer.0 from the model dict (this is the first layer). anyway, after normalizing our shapes are still [17x4096], same as the embedding but normalized.

```
torch.Size([17, 4096])
```

### attention implemented from scratch
let's load the attention heads of the first layer of the transformer

> when we load the query, key, value and output vectors from the model we notice the shapes to be [4096x4096], [1024x4096], [1024x4096], [4096x4096]
> at first glance this is weird because ideally we want each q, k, v and o for each head individually
> the authors of the code bundled them together because it makes it easy to parallelize the attention head multiplications
> im going to unwrap everything...
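the unwrapping described above can be sketched as follows. this is a minimal sketch with random stand-in tensors (the real weights come from meta's checkpoint, which isn't loaded here) — only the shapes match the real model:

```python
import torch

dim, n_heads, head_dim = 4096, 32, 128  # from the config: head_dim = dim // n_heads

# stand-in for model["layers.0.attention.wq.weight"], real shape [4096, 4096]
wq = torch.randn(n_heads * head_dim, dim)

# unwrap the bundled weight into one [128 x 4096] matrix per head
wq_per_head = wq.view(n_heads, head_dim, dim)
print(wq_per_head.shape)  # torch.Size([32, 128, 4096])

# stand-in for the 17 normalized token embeddings
token_embeddings = torch.randn(17, dim)

# queries for the first head: [17 x 4096] @ [4096 x 128] -> [17 x 128]
q_head0 = token_embeddings @ wq_per_head[0].T
print(q_head0.shape)  # torch.Size([17, 128])
```

note that the `view` costs nothing — the bundled [4096x4096] matrix and the per-head [32x128x4096] tensor share the same storage.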
```
torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])
```

### unwrapping query
in the next section we will unwrap the queries from the multiple attention heads, the resulting shape is [32x128x4096].

here, 32 is the number of attention heads in llama3, 128 is the size of the query vector and 4096 is the size of the token embedding.

```
torch.Size([32, 128, 4096])
```

### im going to implement the first head of the first layer
here i access the query weight matrix of the first head of the first layer, the size of this query weight matrix is [128x4096]

```
torch.Size([128, 4096])
```

we now multiply the query weights with the token embedding to receive a query for each token. here you can see the resulting shape is [17x128], this is because we have 17 tokens and for each token there is a 128-length query.

```
torch.Size([17, 128])
```

### positional encoding
we are now at a stage where we have a query vector for each token in our prompt, but if you think about it -- each individual query vector has no idea about its position in the prompt.

query: "the answer to the ultimate question of life, the universe, and everything is "

in our prompt we have used "the" three times, and we need the query vectors of all 3 "the" tokens to be different (each of size [1x128]) based on their positions in the prompt. we perform these rotations using RoPE (rotary positional embedding).

### RoPE
watch this video (this is what i watched) to understand the math: https://www.youtube.com/watch?v=o29P0Kpobz0&t=530s

```
torch.Size([17, 64, 2])
```

in the above step, we split the query vectors into pairs, and we apply a rotational angle shift to each pair!

we now have a vector of size [17x64x2], this is the 128-length query split into 64 pairs for each token in the prompt! each of those 64 pairs will be rotated by m*theta where m is the position of the token for which we are rotating the query!
### using dot product of complex numbers to rotate a vector

```
tensor([0.0000, 0.0156, 0.0312, 0.0469, 0.0625, 0.0781, 0.0938, 0.1094, 0.1250,
        0.1406, 0.1562, 0.1719, 0.1875, 0.2031, 0.2188, 0.2344, 0.2500, 0.2656,
        0.2812, 0.2969, 0.3125, 0.3281, 0.3438, 0.3594, 0.3750, 0.3906, 0.4062,
        0.4219, 0.4375, 0.4531, 0.4688, 0.4844, 0.5000, 0.5156, 0.5312, 0.5469,
        0.5625, 0.5781, 0.5938, 0.6094, 0.6250, 0.6406, 0.6562, 0.6719, 0.6875,
        0.7031, 0.7188, 0.7344, 0.7500, 0.7656, 0.7812, 0.7969, 0.8125, 0.8281,
        0.8438, 0.8594, 0.87…
```
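the tensor printed above is just arange(64)/64 — one value per query pair. a sketch of how it turns into per-pair rotation frequencies, assuming the standard RoPE formula theta^(-2i/d) with rope_theta taken from the config:

```python
import torch

rope_theta = 500000.0  # from the config

# the printed tensor: 64 evenly spaced values in [0, 1), one per pair
zero_to_one = torch.arange(64) / 64
print(zero_to_one[:4])  # tensor([0.0000, 0.0156, 0.0312, 0.0469])

# each value becomes a frequency; pair 0 rotates fastest, pair 63 slowest,
# so different pairs encode position at different wavelengths
freqs = 1.0 / (rope_theta ** zero_to_one)
```

the token at position m then rotates pair i by the angle m * freqs[i], which is what gives the three "the" tokens in the prompt three different query vectors.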