Recently, I started diving into LLMs. After working through several blogs and tutorials, I decided to build one from scratch.
I built a 124M-parameter language model and named it GLaMA (Generalized Lightweight Autoregressive Model with Attention). And yes, ChatGPT helped with the name.
It is a decoder-only transformer. The architecture layers LLaMA-style improvements on top of the classic GPT-2 design.
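For reference, a 124M budget matches the classic GPT-2 small shape (12 layers, 12 heads, 768-dim embeddings, 50,257-token vocabulary). GLaMA's exact dimensions may differ, since LLaMA-style changes such as RoPE and SwiGLU shift the breakdown, but a back-of-the-envelope count under the GPT-2 small assumption lands right at 124M:

```python
# Back-of-the-envelope parameter count for a GPT-2-small-shaped model.
# These dimensions are the public GPT-2 small config; GLaMA's may differ.
vocab, ctx, d, layers = 50257, 1024, 768, 12
d_mlp = 4 * d

tok_emb = vocab * d                # token embedding (tied with the LM head)
pos_emb = ctx * d                  # learned positions (RoPE would drop this)
per_layer = (
    2 * d                          # ln1 (scale + bias)
    + d * 3 * d + 3 * d            # fused QKV projection
    + d * d + d                    # attention output projection
    + 2 * d                        # ln2
    + d * d_mlp + d_mlp            # MLP up-projection
    + d_mlp * d + d                # MLP down-projection
)
final_ln = 2 * d

total = tok_emb + pos_emb + layers * per_layer + final_ln
print(f"{total / 1e6:.1f}M parameters")  # 124.4M parameters
```

Most of the budget sits in the embedding table and the MLPs; the attention projections are a comparatively small slice.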
To train this small model, I curated a small high-quality dataset of roughly 1B tokens from mixed sources. For a model of this size, clean data matters more than sheer volume.
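I won't reproduce the exact pipeline here, but quality filtering for a corpus like this usually combines a few cheap heuristics. A minimal sketch, where the thresholds and rules are illustrative assumptions rather than the actual filters used:

```python
# Illustrative document-quality filter; thresholds are hypothetical,
# not the actual GLaMA data pipeline.
def keep_document(text: str) -> bool:
    words = text.split()
    if len(words) < 50:                      # drop very short fragments
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:                    # drop markup- or symbol-heavy text
        return False
    if len(set(words)) / len(words) < 0.3:   # drop highly repetitive text
        return False
    return True

docs = [
    "spam " * 100,                               # repetitive: rejected
    "A short line.",                             # too short: rejected
    " ".join(f"token{i}" for i in range(100)),   # passes all checks
]
clean = [d for d in docs if keep_document(d)]    # keeps only the third doc
```

In practice you would add deduplication and language identification on top, but even simple filters like these remove a surprising amount of junk.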
I trained the model on about 3B tokens in total (roughly three passes over the 1B-token dataset), running 24k steps, which took roughly 6 hours on a single A100 GPU.
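Those numbers pin down the rest of the schedule: 3B tokens over 24k steps works out to 125k tokens per optimizer step (reachable with gradient accumulation), and a sustained throughput of about 139k tokens per second over the 6 hours. A quick sanity check:

```python
# Derive the implied schedule from the headline training numbers.
total_tokens = 3_000_000_000
steps = 24_000
hours = 6

tokens_per_step = total_tokens // steps       # tokens per optimizer step
throughput = total_tokens / (hours * 3600)    # sustained tokens per second
epochs = total_tokens / 1_000_000_000         # passes over the 1B-token dataset

print(tokens_per_step, round(throughput), epochs)  # 125000 138889 3.0
```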
In the end, the model reached a perplexity of ~16 and a HellaSwag accuracy of ~27%.
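For intuition, perplexity is just the exponential of the mean cross-entropy loss, so a perplexity of 16 corresponds to a validation loss of about 2.77 nats: on average the model is as uncertain as if it were choosing uniformly among 16 plausible next tokens.

```python
import math

# Perplexity is exp(mean cross-entropy loss), so the two are interchangeable.
perplexity = 16
loss = math.log(perplexity)       # mean cross-entropy in nats
print(f"loss = {loss:.2f} nats")  # loss = 2.77 nats

# And back the other way: exp(loss) recovers the perplexity.
assert abs(math.exp(loss) - perplexity) < 1e-9
```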
For context, that is quite close to the original GPT-2 small, especially considering the smaller dataset and limited training time. That result honestly surprised me.
With more compute, the same recipe should scale to a larger model and longer training runs.
Training a language model from scratch forces you to understand many things: dataset curation, hyperparameter choices, and optimization details.
It is very different from just fine-tuning from an existing checkpoint.
Most importantly — it was fun.
If you are interested, check out the code on GitHub.