AI Engineering by Nirant
May 27, 2025 at 12:58 PM
I knew this to be true for BERT-era and ResNet-style models, and I was pleasantly surprised that it holds for LLM pretraining too: the choice of pretraining data and tokenizer has the largest impact on scaling trends. Even switching from Llama (Transformer) to Mamba (State-Space Model) barely changes loss-to-loss relationships! In contrast, architecture, model size, context length, and optimizer settings have negligible impact. This suggests architectures can be freely optimized for efficiency, while data curation is the real key to strong generalization. Source: the loss-line work by the Brendel group https://brendel-group.github.io/llm-line/
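To make "loss-to-loss relationship" concrete, here is a minimal, hedged sketch (not the paper's exact procedure or functional form): fit a shifted power law mapping training loss on one data mix to validation loss on another, then check whether models with different architectures land on the same fitted curve. All numbers and the functional form below are illustrative assumptions.

```python
# Hedged sketch of a loss-to-loss fit; form and data are illustrative
# assumptions, not reproduced from the llm-line paper.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: training loss on data mix A vs. validation
# loss on data mix B, across a range of model sizes / training budgets.
train_loss_a = np.array([3.9, 3.4, 3.0, 2.7, 2.5, 2.35])
val_loss_b = np.array([4.3, 3.7, 3.2, 2.9, 2.7, 2.55])

def shifted_power_law(l_train, k, kappa, e_train, e_test):
    """Assumed form: L_test = k * (L_train - e_train)^kappa + e_test."""
    return k * np.clip(l_train - e_train, 1e-6, None) ** kappa + e_test

params, _ = curve_fit(
    shifted_power_law, train_loss_a, val_loss_b,
    p0=[1.0, 1.0, 1.5, 1.5], maxfev=20_000,
)

# If the claim in the post holds, points from a Transformer and a
# state-space model trained on the same data should fall on roughly
# the same curve, while changing the data mix or tokenizer shifts it.
print("fitted (k, kappa, e_train, e_test):", np.round(params, 3))
```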