Recently, I started diving into LLMs. After working through several blogs and tutorials, I decided to build one from scratch.
I built a 124M-parameter language model and named it GLaMA (Generalized Lightweight Autoregressive Model with Attention). And yes, ChatGPT helped with the name.
It is a decoder-only transformer. The architecture layers LLaMA-style improvements on top of the classic GPT-2 design.
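For reference, a 124M budget matches the classic GPT-2 small shape (12 layers, 12 heads, 768-dim embeddings, 50,257-token vocabulary). GLaMA's exact dimensions may differ, since LLaMA-style changes such as RoPE and SwiGLU shift the breakdown, but a back-of-the-envelope count under the GPT-2 small assumption lands right at 124M:

```python
# Back-of-the-envelope parameter count for a GPT-2-small-shaped model.
# These dimensions are the public GPT-2 small config; GLaMA's may differ.
vocab, ctx, d, layers = 50257, 1024, 768, 12
d_mlp = 4 * d

tok_emb = vocab * d                # token embedding (tied with the LM head)
pos_emb = ctx * d                  # learned positions (RoPE would drop this)
per_layer = (
    2 * d                          # ln1 (scale + bias)
    + d * 3 * d + 3 * d            # fused QKV projection
    + d * d + d                    # attention output projection
    + 2 * d                        # ln2
    + d * d_mlp + d_mlp            # MLP up-projection
    + d_mlp * d + d                # MLP down-projection
)
final_ln = 2 * d

total = tok_emb + pos_emb + layers * per_layer + final_ln
print(f"{total / 1e6:.1f}M parameters")  # 124.4M parameters
```

Most of the budget sits in the embedding table and the MLPs; the attention projections are a comparatively small slice.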
To train this small model, I curated a small high-quality dataset of roughly 1B tokens from mixed sources. For a model of this size, clean data matters more than sheer volume.
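I won't reproduce the exact pipeline here, but quality filtering for a corpus like this usually combines a few cheap heuristics. A minimal sketch, where the thresholds and rules are illustrative assumptions rather than the actual filters used:

```python
# Illustrative document-quality filter; thresholds are hypothetical,
# not the actual GLaMA data pipeline.
def keep_document(text: str) -> bool:
    words = text.split()
    if len(words) < 50:                      # drop very short fragments
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:                    # drop markup- or symbol-heavy text
        return False
    if len(set(words)) / len(words) < 0.3:   # drop highly repetitive text
        return False
    return True

docs = [
    "spam " * 100,                               # repetitive: rejected
    "A short line.",                             # too short: rejected
    " ".join(f"token{i}" for i in range(100)),   # passes all checks
]
clean = [d for d in docs if keep_document(d)]    # keeps only the third doc
```

In practice you would add deduplication and language identification on top, but even simple filters like these remove a surprising amount of junk.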
I trained the model on about 3B tokens in total (roughly three passes over the 1B-token dataset), running 24k steps, which took roughly 6 hours on a single A100 GPU.
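Those numbers pin down the rest of the schedule: 3B tokens over 24k steps works out to 125k tokens per optimizer step (reachable with gradient accumulation), and a sustained throughput of about 139k tokens per second over the 6 hours. A quick sanity check:

```python
# Derive the implied schedule from the headline training numbers.
total_tokens = 3_000_000_000
steps = 24_000
hours = 6

tokens_per_step = total_tokens // steps       # tokens per optimizer step
throughput = total_tokens / (hours * 3600)    # sustained tokens per second
epochs = total_tokens / 1_000_000_000         # passes over the 1B-token dataset

print(tokens_per_step, round(throughput), epochs)  # 125000 138889 3.0
```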
In the end, the model reached a perplexity of ~16 and a HellaSwag accuracy of ~27%.
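For intuition, perplexity is just the exponential of the mean cross-entropy loss, so a perplexity of 16 corresponds to a validation loss of about 2.77 nats: on average the model is as uncertain as if it were choosing uniformly among 16 plausible next tokens.

```python
import math

# Perplexity is exp(mean cross-entropy loss), so the two are interchangeable.
perplexity = 16
loss = math.log(perplexity)       # mean cross-entropy in nats
print(f"loss = {loss:.2f} nats")  # loss = 2.77 nats

# And back the other way: exp(loss) recovers the perplexity.
assert abs(math.exp(loss) - perplexity) < 1e-9
```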
For context, that is quite close to the original GPT-2 small, especially considering the smaller dataset and limited training time. That result honestly surprised me.
With more compute, the same recipe should scale to a larger model and longer training runs.
Training a language model from scratch forces you to understand many things: dataset curation, hyperparameter choices, and optimization details.
It is very different from just fine-tuning from an existing checkpoint.
Most importantly — it was fun.
If you are interested, check out the code on GitHub.