What I Learned at the Small LLM Training Talk at SCaLE 23x
I went into this talk expecting a theoretical overview of small language models. I left with a concrete picture of what it actually costs, and what actually breaks, when you try to train one yourself.
The speaker was David von Thenen, and the session covered building a nanoGPT-style causal language model from scratch using open tooling. The project is on GitHub at github.com/davidvonthenen/2026-scale-23x-slm.
The architecture walkthrough covered the full transformer loop. Input tokens become vectors. Attention layers decide which previous tokens matter for the current prediction. The transformer block processes context in parallel and builds relationships. The prediction head turns the current state into a ranked list of likely next tokens. The autoregressive loop appends the chosen token and repeats. What makes a model "small" is not the architecture, it is parameter count, context window, and vector dimensions. Those directly control reasoning capacity.
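The autoregressive loop described above can be sketched in a few lines of PyTorch. This is my own minimal illustration, not code from the speaker's repo: `TinyLM` is a toy stand-in for the real transformer, and the window and greedy-pick choices are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for the model: anything mapping token IDs
# (batch, seq) -> logits (batch, seq, vocab) fits this loop.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        return self.head(self.embed(ids))

@torch.no_grad()
def generate(model, ids, max_new_tokens, context_window=32):
    # The autoregressive loop: predict, pick, append, repeat.
    for _ in range(max_new_tokens):
        ctx = ids[:, -context_window:]     # small model = small window
        logits = model(ctx)                # (batch, seq, vocab)
        # Greedy pick from the ranked list of likely next tokens.
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

model = TinyLM()
out = generate(model, torch.tensor([[1, 2, 3]]), max_new_tokens=5)
```

The loop is the same regardless of model size; what "small" changes is the vocabulary, the context window, and the width of the vectors inside `TinyLM`.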
The data story was practically useful. They picked FineWeb-Edu as their training corpus, which is 10TB of educational content. You cannot load 10TB into RAM. The solution is a StreamingTokenDataset, which tokenizes and serves the corpus as a stream, loading chunks on demand rather than all at once. The alternative is running out of memory on the first epoch, which is what happens when you try to train a 1B parameter model naively. That data volume roughly maps to a 1B parameter model when using the OpenAI GPTConfig.
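The streaming idea looks roughly like this. The repo's actual StreamingTokenDataset internals were not shown in the talk, so this is a hedged sketch using PyTorch's `IterableDataset`: `StreamingTokenSketch` and its toy token iterator are my own names, standing in for a reader that decodes token shards off disk.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingTokenSketch(IterableDataset):
    """Yield fixed-size (input, target) token blocks from an iterator
    instead of materializing the whole corpus in RAM."""
    def __init__(self, token_iter_fn, block_size=8):
        self.token_iter_fn = token_iter_fn  # callable returning a fresh token iterator
        self.block_size = block_size

    def __iter__(self):
        buf = []
        # In the real pipeline this iterator would stream tokens
        # decoded from FineWeb-Edu shards on disk.
        for tok in self.token_iter_fn():
            buf.append(tok)
            if len(buf) == self.block_size + 1:
                yield torch.tensor(buf[:-1]), torch.tensor(buf[1:])
                buf = buf[-1:]  # carry the boundary token into the next block

# Toy stream: token IDs 0..99 stand in for a tokenized corpus.
stream = StreamingTokenSketch(lambda: iter(range(100)), block_size=8)
loader = DataLoader(stream, batch_size=4)
xb, yb = next(iter(loader))  # (4, 8) inputs and next-token targets
```

Only one block plus a small buffer is ever resident, which is the whole point: memory use is bounded by batch size, not corpus size.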
Tokenization uses OpenAI's tiktoken, which implements GPT-2 byte-pair encoding. Language models do not learn from text directly. Tiktoken breaks text into subword tokens, which become the integer IDs that flow into the network. The key thing to get right early is consistency: if the tokenizer you use for training does not match the tokenizer you use at inference time, the output is garbage. That is one of the top failure modes in practice.
Training is predict, compare, adjust. The loss score goes down over iterations. PyTorch checkpoints let you save and resume. The knobs that matter most are learning rate and batch size. Unstable learning rates are a common failure. Over-checkpointing is a problem on single-GPU rigs because saving large checkpoints repeatedly adds up to significant time and storage pressure.
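Predict, compare, adjust maps onto a PyTorch loop almost word for word. A minimal sketch with a toy model and random data, to show where the two big knobs and the checkpointing sit; the filename and the save interval are illustrative, not the speaker's settings.

```python
import torch
import torch.nn as nn

# Toy model and data just to show the loop shape; the real run uses
# the transformer and streaming loader described earlier.
model = nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # learning rate: knob one
loss_fn = nn.CrossEntropyLoss()

for step in range(50):
    x = torch.randn(8, 4)                  # batch size: knob two
    target = torch.randint(0, 4, (8,))
    logits = model(x)                      # predict
    loss = loss_fn(logits, target)         # compare
    opt.zero_grad()
    loss.backward()
    opt.step()                             # adjust

    if step % 25 == 0:                     # checkpoint sparingly on a single-GPU rig
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "opt": opt.state_dict()}, "ckpt.pt")

# Resuming restores both the weights and the optimizer state.
state = torch.load("ckpt.pt")
model.load_state_dict(state["model"])
```

Saving the optimizer state alongside the weights is what makes resume seamless; checkpointing every step is what turns a large model into the storage-pressure problem the talk warned about.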
The fine-tuning section answered a question I have thought about before. When should you fine-tune instead of using RAG? The answer he gave was clear. RAG is for retrieval. You use it when the model is good enough at reasoning but needs access to specific data it does not have. Fine-tuning changes the model's behavior. If you want the model to know SQL, or respond in a specific style, or understand a domain it was not trained on, you fine-tune. RAG will not teach a model new skills. Fine-tuning will. The distinction matters when you are deciding whether to build a retrieval pipeline or commit to a training run.
The cost question came up during Q&A. Why fine-tune instead of just using a larger base model with a bigger dataset? The answer was partly cost and partly data sovereignty. Using a flagship model means your data goes to that provider, and the costs scale with usage. A fine-tuned local model runs on your hardware indefinitely. Something like Qwen2.5 7B can be fine-tuned and deployed for around $100 in compute cost. The equivalent in API calls to a frontier model for the same volume of inference would be orders of magnitude more.
Quantization is what makes the trained model practical on commodity hardware. Dynamic int8 quantization converts model weights from 32-bit floats to 8-bit integers during inference. The context window, attention heads, and layer structure stay the same. What changes is how precisely the weights are stored. You lose some accuracy. You gain significantly faster inference and dramatically lower memory requirements. The goal is running a capable model on a standard CPU without a GPU, which is exactly the homelab use case.
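In PyTorch, dynamic int8 quantization is a one-call transformation. A sketch with a stand-in model; in practice you would load the trained checkpoint instead, and the exact speed and accuracy trade-off depends on the model.

```python
import torch
import torch.nn as nn

# Stand-in model; the real run loads the trained transformer here.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Dynamic int8 quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time. The layer
# structure and interfaces are unchanged; only weight precision drops.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = qmodel(torch.randn(1, 64))  # same call, smaller weights, CPU-only
```

This runs without a GPU, which is why it fits the homelab deployment story: the quantized model is a drop-in replacement at inference time.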
The combination of streaming data ingestion, good tokenizer hygiene, careful learning rate management, and post-training quantization is what separates a training run that completes from one that runs out of memory or produces noise. None of these are exotic techniques. They are just the table stakes for doing this kind of work seriously.
Related: KwaaiNet: Distributed AI Inference · What I Learned at the Docling Workshop