Open-sourcing Codebase Scaling for Non-commercial Research
We are releasing our model training codebase Scaling under the Open Aleph License, which explicitly allows non-commercial research and educational use. Alongside the public GitHub repository, this blog post provides details on the design and engineering choices we made when developing our training codebase. Scaling was used to develop our concurrently released new models Pharia-1-LLM-control and Pharia-1-LLM-control-aligned.
Introduction
Over the last few years, the compute budget for training state-of-the-art deep learning models, especially LLMs, has increased dramatically. This is due to growth in both model and data size, with model sizes reaching hundreds of billions of parameters and training sets growing to multiple trillions of tokens. Engineering model training code that scales to these high-compute scenarios is a non-trivial challenge: the memory and compute requirements of LLM training demand parallelization across many GPUs to make training feasible.
Parallelization introduces considerable complexity into a standard training loop, and the resulting code is notoriously difficult to maintain and extend. While many research labs and companies publish model weights, their training code is often proprietary and not made available to the research community. Public libraries for large-scale training, on the other hand, are often difficult to understand or adapt.
We are thus happy to release Scaling, Aleph Alpha’s parallel training codebase.
Scaling contains the following features:
- Frameworks: We provide general building blocks for distributed model training (scaling.core), as well as a fully-fledged suite for large-scale transformer training (scaling.transformer).
- Efficient training: The codebase supports data, pipeline, and model (tensor) parallelism, mixed-precision training, and modern performance optimizations such as optimizer state sharding and activation checkpointing.
- Code quality: We use rigorous typing, Pydantic classes, and extensive automated testing. This keeps the code clean, makes development safer, and reduces the potential for bugs.
Scaling is released under the Open Aleph License, enabling use for non-commercial research and educational purposes. We hope that making our codebase available to the machine learning community will assist independent research in the area of large-scale training.
Scaling was developed entirely in-house, building on our own research and engineering work; it is not forked from any existing open-source codebase. Scaling will continue to be actively developed and improved over time to make sure its training performance measures up to modern standards. We will also incorporate innovations from our research team to make our work transparent and reproducible.
In the following, we describe the two modules scaling.core and scaling.transformer in more detail. Compared to larger codebases such as Megatron-LM or DeepSpeed, Scaling focuses on conciseness and extensibility. This design choice reduces complexity, making maintenance and optimization easier while preserving performance and scalability.
A general distributed training library: scaling.core
scaling.core is the largely model-agnostic engine room of the Scaling library. Built on top of PyTorch, it enables parallel training of machine learning models at scale. It supports the following parallelization techniques:
- Data parallelism. A batch of training data is split across multiple model copies that perform independent forward-backward passes, after which the resulting weight gradients are aggregated. This makes it possible to train with smaller batch sizes per GPU, reducing activation and gradient memory.
- Pipeline parallelism. The layers of a model are grouped into sequential “pipeline stages”, with the computation for each stage happening on a different device. Activations are then communicated to the subsequent stage and gradients to the previous one. Each stage only has to materialize its part of the model, drastically reducing memory consumption.
- Model (tensor) parallelism. Large matrix multiplications account for much of the compute required by modern deep neural networks. These core operations can be split across devices, with each device owning a slice of the weight matrix and performing the corresponding sub-computations (see the sketch after this list).
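To make the splitting idea concrete, here is a minimal, generic sketch of a column-parallel linear layer in plain PyTorch. It is not the Scaling implementation, and the class name is a hypothetical stand-in; Scaling's own ColumnParallelLinear and RowParallelLinear classes (discussed below) follow the same principle.

```python
# Generic illustration only -- this is not the Scaling implementation. Each rank owns a
# slice of the weight's output dimension, computes its part of the matmul, and the full
# output is reassembled with an all-gather. Assumes torch.distributed is initialized.
import torch
import torch.distributed as dist
import torch.nn as nn


class NaiveColumnParallelLinear(nn.Module):  # hypothetical name, for illustration
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0, "output dim must divide across ranks"
        # Each rank materializes only its shard of the full weight matrix.
        self.weight = nn.Parameter(torch.empty(out_features // world_size, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The local matmul produces only this rank's slice of the output features.
        local_out = torch.nn.functional.linear(x, self.weight)
        # Reassemble the full output by gathering all slices. A real implementation uses
        # an autograd-aware all-gather so that gradients flow back correctly.
        slices = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(slices, local_out)
        return torch.cat(slices, dim=-1)
```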
All three techniques can be used simultaneously, giving rise to 3D parallelism. Beyond that, Scaling implements further important performance optimizations, such as mixed-precision training, sharded optimizer states (known as ZeRO), and activation checkpointing.
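Activation checkpointing in particular is easy to illustrate with plain PyTorch: intermediate activations inside a checkpointed block are discarded during the forward pass and recomputed during the backward pass, trading extra compute for memory. The snippet below is a generic sketch using torch.utils.checkpoint, not Scaling's own implementation.

```python
# Generic sketch of activation checkpointing with plain PyTorch (not the Scaling API).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Only the block's input is kept; the intermediate activations of the two linear layers
# are recomputed during the backward pass instead of being stored.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```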
We took care to ensure that users can build their own 3D-parallel training applications without requiring intricate knowledge of the underlying implementation. To give an example, all you need to make a custom model architecture work in Scaling is to implement it as an instance of our ParallelModule class. ParallelModule works very similarly to torch.nn.Sequential, assuming a sequential model structure that can be broken down into layers. The layers themselves have no special requirements and are plain PyTorch modules. In conjunction with our trainer and optimizer abstractions, which can usually be used out of the box, ParallelModule then enables a data- and pipeline-parallel training loop. To make use of model parallelism, one additional step is needed: replacing all instances of torch.nn.Linear in your model with the ColumnParallelLinear or RowParallelLinear classes we provide. Given the general complexity of distributed training, there are of course a few intricacies that we do not cover here. We invite the reader to check out our repository and the comprehensive resources within to get started.
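To sketch what this looks like in practice, consider the example below. The import path and the constructor signature of ParallelModule are assumptions made for illustration and may differ from the actual Scaling API; the comments mark where the parallel linear classes would be swapped in for tensor parallelism.

```python
# Illustrative sketch only: the import path and constructor signature of ParallelModule
# are assumptions and may not match the actual Scaling API.
import torch
import torch.nn as nn
from scaling.core import ParallelModule  # hypothetical import path


class MLPBlock(nn.Module):
    """A plain PyTorch module -- layers have no Scaling-specific requirements."""

    def __init__(self, hidden: int):
        super().__init__()
        # For tensor parallelism, these would be replaced with Scaling's
        # ColumnParallelLinear and RowParallelLinear, respectively.
        self.up = nn.Linear(hidden, 4 * hidden)
        self.down = nn.Linear(4 * hidden, hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))


# ParallelModule behaves much like torch.nn.Sequential over a list of layers; the
# pipeline engine can then assign consecutive layers to different pipeline stages.
model = ParallelModule(layers=[MLPBlock(1024) for _ in range(8)])  # hypothetical signature
```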
A suite for LLM training: scaling.transformer
As transformers are the go-to architecture for language modeling and the main use case for large-scale distributed training, we provide a training suite for large language models in scaling.transformer.
Using the building blocks provided in scaling.core, we implement a state-of-the-art transformer architecture and training loop. The model architecture can be configured using a concise but flexible config class, which allows users to realize a multitude of transformer variants. To highlight just a few architecture options, we support
- multi-query and grouped-query attention,
- different MLP types (e.g., SwiGLU),
- rotary positional embeddings,
- parameter-efficient fine-tuning methods,
and many other features of modern LLM architectures. In addition to the architecture, we provide LLM-specific data loading and preprocessing functionality.
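To give a rough impression of this config-driven approach, a Pydantic-style architecture config might look roughly like the sketch below. The class and field names are hypothetical stand-ins invented for illustration and do not reflect the actual Scaling schema.

```python
# Hypothetical illustration of a config-class-style architecture definition; the class
# and field names are invented for this sketch and do not reflect the actual Scaling schema.
from typing import Optional

from pydantic import BaseModel


class TransformerArchitectureSketch(BaseModel):
    hidden_size: int = 4096
    num_layers: int = 32
    num_attention_heads: int = 32
    num_kv_heads: int = 8            # fewer KV heads than attention heads -> grouped-query attention
    mlp_type: str = "swiglu"         # selects the feed-forward variant, e.g. a SwiGLU MLP
    rotary_embeddings: bool = True   # use rotary positional embeddings
    lora_rank: Optional[int] = None  # e.g. low-rank adapters for parameter-efficient fine-tuning


# Varying a handful of fields is enough to realize different transformer variants.
config = TransformerArchitectureSketch(num_kv_heads=1)  # multi-query attention, for instance
```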
Naturally, scaling.transformer supports fully distributed training with all parallelization and optimization options provided by the scaling.core components.
A transformer training or fine-tuning run can be launched via the unified entry-point script train.py, which receives a configuration file in YAML format, providing a simple and traceable workflow for experimentation.
Finally, scaling.transformer provides lightweight inference functionality that can be used to test and evaluate trained models. It supports KV caching for improved inference performance.
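The idea behind KV caching can be illustrated generically: the key and value projections of already-generated tokens are stored, so each decoding step only computes projections for the newest token and attends over the cached state. The sketch below is a plain PyTorch, single-head illustration of the concept, not the Scaling inference API.

```python
# Generic single-head sketch of KV caching during autoregressive decoding
# (not the Scaling inference API).
import torch


def attend_with_cache(q, k_new, v_new, cache):
    # The cache holds the keys and values of all previously generated tokens.
    if cache is not None:
        k = torch.cat([cache[0], k_new], dim=1)
        v = torch.cat([cache[1], v_new], dim=1)
    else:
        k, v = k_new, v_new
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v, (k, v)


# Each decoding step only computes projections for the newest token; keys and values of
# earlier tokens come from the cache.
d_head, cache = 64, None
for step in range(4):
    q = torch.randn(1, 1, d_head)      # query for the newest token
    k_new = torch.randn(1, 1, d_head)  # key for the newest token
    v_new = torch.randn(1, 1, d_head)  # value for the newest token
    out, cache = attend_with_cache(q, k_new, v_new, cache)
```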