Scaling Laws for Neural Language Models¶
The paper "Scaling Laws for Neural Language Models" by Jared Kaplan and collaborators explores the relationship between the size of neural language models (like Transformers), the data they train on, & the compute used for training. The study reveals valuable guidelines for building better models.
Key Findings¶
1. Performance Scaling Laws¶
- The performance of language models (measured using cross-entropy loss) improves predictably as:
- Model size increases.
- Dataset size grows.
- Compute budget scales up.
- These improvements follow power-law trends spanning many orders of magnitude, while architectural details like network depth and width have minimal impact compared to overall scale.
2. Optimal Compute Utilization¶
- Larger models are more sample-efficient, meaning they require less data and fewer training steps to achieve the same performance as smaller models.
- The most compute-efficient way to train involves:
- Using very large models.
- Training on moderate-sized datasets.
- Stopping training before the model fully converges.
3. Overfitting Behavior¶
- Overfitting occurs when the model learns the training data too well and struggles to generalize to new data.
- This can be avoided by scaling dataset size proportionally to model size. For example:
- If the model size increases 8x, the dataset size should increase 5x.
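To see where the 8x/5x rule of thumb comes from, here is a minimal Python check, assuming the sublinear relation D ∝ N^0.74 quoted later in this summary:

```python
# If dataset size should scale as D ∝ N^0.74 (see the combined scaling
# relationships below), an 8x larger model needs roughly 5x more data.
model_growth = 8
data_growth = model_growth ** 0.74
print(f"8x model size -> ~{data_growth:.1f}x dataset size")  # ~4.7x, i.e. about 5x
```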
4. Batch Size and Training Efficiency¶
- The optimal batch size (the number of examples processed together) is determined by the gradient noise scale.
- As models get better (loss decreases), the critical batch size increases.
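The paper sets the critical batch size using the gradient noise scale, building on prior work on large-batch training. As a rough illustration of that idea, here is a minimal sketch that estimates a simple noise scale from per-example gradients; the function, the NumPy usage, and the toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def simple_noise_scale(per_example_grads: np.ndarray) -> float:
    """Estimate the 'simple' gradient noise scale tr(Sigma) / |g|^2.

    per_example_grads: array of shape (batch_size, n_params), one flattened
    gradient per training example. Illustrative only.
    """
    g = per_example_grads.mean(axis=0)                          # mean gradient
    trace_sigma = ((per_example_grads - g) ** 2).sum(axis=1).mean()  # trace of per-example covariance
    return float(trace_sigma / (g @ g))

# Toy usage: noisy per-example gradients scattered around a shared direction.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1000)
grads = true_grad + 0.5 * rng.normal(size=(64, 1000))
print(f"Estimated noise scale ≈ {simple_noise_scale(grads):.2f}")
```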
5. Generalization¶
- Models trained on one dataset perform well on others, too. The performance on new data correlates strongly with their performance on the training dataset.
- This shows that scaling models improves not only training performance but also generalization.
6. Compute-Efficient Scaling¶
- When compute resources are limited:
- It’s better to train larger models for shorter durations (stopping before full convergence).
- This approach is more effective than training smaller models to convergence.
Practical Implications¶
- Better Scaling: Training larger models with an optimized balance of compute and data leads to better results, but with diminishing returns as scale increases.
- Efficiency Gains: Current training practices may not be efficient. Smarter compute allocation as outlined in this paper can achieve similar or better performance with fewer resources.
The paper's headline scaling plots quantify these trends. Test loss decreases predictably as compute increases, following a power law:
$$
L = \left(\frac{C_{\text{min}}}{2.3 \cdot 10^8}\right)^{-0.050}
$$
This indicates diminishing returns as compute grows.
Test loss reduces with larger datasets, following another power-law:
$$
L = \left(\frac{D}{5.4 \cdot 10^{13}}\right)^{-0.095}
$$
Larger datasets contribute significantly to performance improvement.
Increasing the number of parameters leads to a steady decrease in test loss, again following a power-law:
$$
L = \left(\frac{N}{8.8 \cdot 10^{13}}\right)^{-0.076}
$$
Larger models are more effective in reducing loss.
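The three fits can be evaluated directly. Below is a minimal Python sketch using the constants exactly as quoted above (compute in PF-days, dataset size in tokens, model size in non-embedding parameters):

```python
# Power-law fits quoted above; each returns the predicted test loss (nats/token).
def loss_from_compute(c_min_pf_days: float) -> float:
    return (c_min_pf_days / 2.3e8) ** -0.050

def loss_from_data(tokens: float) -> float:
    return (tokens / 5.4e13) ** -0.095

def loss_from_params(non_embedding_params: float) -> float:
    return (non_embedding_params / 8.8e13) ** -0.076

# Example: a 10x increase in compute multiplies the predicted loss by 10**-0.050 ≈ 0.89.
print(loss_from_compute(1.0), loss_from_compute(10.0))
```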
Larger models (with more parameters) require fewer samples (tokens processed) to achieve the same level of test loss.
For example, models with 10^9 parameters converge much faster (with fewer tokens) compared to smaller models with 10^3 parameters.
This shows that larger models are more sample-efficient, meaning they can learn effectively with less data.
The optimal model size depends on the target test loss and compute budget.
Compute-efficient training stops well before full convergence (the steep downward slopes), which means training doesn't need to be pushed to the absolute minimum test loss to achieve good results.
Larger models are better at leveraging compute, which is shown by the smoother progression of performance across increasing compute budgets.
The majority of compute resources should go toward increasing model size (blue region), as larger models yield the most significant improvements in performance.
Increasing batch size (orange region) also helps but contributes less compared to model size.
Extending serial training steps (green region) provides minimal benefits and should receive the least priority.
Larger models (e.g., 708M parameters in yellow) achieve significantly lower test loss compared to smaller models (e.g., 393.2K parameters in purple) for the same dataset size.
Increasing the dataset size consistently improves performance across all models, but the improvement becomes less significant as the dataset grows very large (diminishing returns).
Larger models benefit more from additional data, making them better at leveraging larger datasets to achieve lower loss.
Models with more parameters (yellow and green curves) achieve lower test loss compared to smaller models (blue and purple curves) when trained with the same number of steps.
After an initial transient period, learning curves stabilize, showing predictable behavior across different model sizes.
The graph suggests that the number of training steps required to reach optimal performance increases with model size, but larger models ultimately perform better.
The paper trains language models on WebText2, an extended version of the WebText [RWC+19] dataset. The text is tokenized using byte-pair encoding (BPE) [SHB15] with a vocabulary size of 50,257.
The models are optimized using the autoregressive log-likelihood (i.e., cross-entropy loss) averaged over a 1,024-token context, which serves as the principal performance metric. The loss is recorded on the WebText2 test distribution as well as a selection of other text distributions.
The study primarily trains decoder-only [LSP+18, RNSS18] Transformer [VSP+17] models. For comparison, LSTM models and Universal Transformers [DGV+18] are also trained.
Summary of Scaling Laws¶
The test loss of a Transformer trained to autoregressively model language follows predictable power-law relationships when performance is constrained by:
- The number of non-embedding parameters (N),
- The dataset size (D), or
- The compute budget (C_min).
These relationships are described as follows:
1. Model Size (N)¶
For models with a limited number of parameters, trained to convergence on large datasets: L(N) = (Nc / N)^α_N, where:
- α_N ≈ 0.076,
- Nc ≈ 8.8 × 10^13.
Interpretation:
- Larger models (more parameters) result in lower test loss.
- Doubling the number of parameters reduces test loss by a factor of 2^(-α_N) ≈ 0.95, showing diminishing returns.
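The 0.95 factor follows directly from the power-law form:

$$
\frac{L(2N)}{L(N)} = \frac{(N_c / 2N)^{\alpha_N}}{(N_c / N)^{\alpha_N}} = 2^{-\alpha_N} = 2^{-0.076} \approx 0.95
$$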
2. Dataset Size (D)¶
For large models trained on limited datasets with early stopping: L(D) = (Dc / D)^α_D, where:
- α_D ≈ 0.095,
- Dc ≈ 5.4 × 10^13.
Interpretation:
- Increasing dataset size reduces test loss, but gains diminish as datasets grow larger.
3. Compute Budget (C_min)¶
For training with limited compute and an optimally-sized model: L(C_min) = (C_c / C_min)^α_min, where:
- α_min ≈ 0.050,
- C_c ≈ 3.1 × 10^8 (in PF-days).
Interpretation:
- More compute reduces test loss, but the returns diminish with increasing compute.
Combined Scaling Relationships¶
Dataset Size vs. Model Size:
- To avoid overfitting, dataset size should grow sublinearly with model size: D ∝ N^(α_N / α_D) ≈ N^0.74.
Simultaneous Dependence on N and D:
- Test loss is governed by the combined relationship: L(N, D) = [(N_c / N)^(α_N / α_D) + D_c / D]^α_D.
Learning Curves Over Time:
- For a fixed number of parameter updates (S): L(N, S) = (N_c / N)^α_N + (S_c / S_min(S))^α_S, where α_S ≈ 0.76, S_c is a fitted constant, and S_min(S) is the minimum number of optimization steps needed to reach a given loss.
Scaling with Compute Budget:
- For a fixed compute budget (C), the optimal allocation is:
N ∝ C^(α_min / α_N),
B ∝ C^(α_min / α_B),
S ∝ C^(α_min / α_S).
- Compute should primarily be spent on increasing model size, with smaller increases in batch size (and hence the data processed) and only minimal growth in serial training steps.
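A small sketch of these allocation rules, using only the exponents quoted in this summary (α_min ≈ 0.050, α_N ≈ 0.076, α_S ≈ 0.76); α_B is not quoted numerically above, so batch size is left out:

```python
# How the optimal model size and serial steps scale with compute, using the
# exponents quoted in this summary. alpha_B is not quoted here, so batch size
# is omitted from this sketch.
ALPHA_MIN, ALPHA_N, ALPHA_S = 0.050, 0.076, 0.76

def optimal_scaling(compute_multiplier: float) -> dict:
    """Factors by which model size N and serial steps S should grow with compute."""
    return {
        "model_size": compute_multiplier ** (ALPHA_MIN / ALPHA_N),    # ~C^0.66
        "serial_steps": compute_multiplier ** (ALPHA_MIN / ALPHA_S),  # ~C^0.066
    }

# A 10x compute increase: model size grows ~4.6x, serial steps barely grow (~1.2x),
# consistent with "spend most extra compute on bigger models".
print(optimal_scaling(10.0))
```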
Operation | Parameters | FLOPs per Token | Summary
---|---|---|---
Embedding | (n_vocab + n_ctx) * d_model | 4 * d_model | Embeds the input tokens and positional information.
Attention: QKV | n_layer * d_model * 3 * d_attn | 2 * n_layer * d_model * 3 * d_attn | Computes query, key, and value matrices for multi-head attention.
Attention: Masking | --- | 2 * n_layer * n_ctx * d_attn | Applies masking for sequence dependencies (e.g., causal attention).
Attention: Projection | n_layer * d_attn * d_model | 2 * n_layer * d_attn * d_model | Projects attention outputs back to the model dimension.
Feedforward | n_layer * 2 * d_model * d_ff | 2 * n_layer * 2 * d_model * d_ff | Performs the feed-forward transformation after attention.
De-embedding | --- | 2 * d_model * n_vocab | Maps model outputs back to vocabulary logits for prediction.
Embedding:¶
- What it does: Converts input tokens (words, subwords, or characters) into numerical vectors that the model can process. It also adds positional information to account for the sequence order.
- Parameters: The count depends on the vocabulary size (`n_vocab`), the input context size (`n_ctx`), and the embedding dimension (`d_model`).
- Compute cost (FLOPs): Roughly 4 * d_model per token, negligible compared to the attention and feed-forward costs.
Attention: QKV Computation:¶
- What it does: Creates the query (Q), key (K), and value (V) matrices used to calculate how different tokens in the input relate to each other.
- Parameters: Depends on the number of layers (`n_layer`), the embedding dimension (`d_model`), and the attention dimension (`d_attn`).
- Compute cost (FLOPs): Roughly two FLOPs per parameter per token, i.e. 2 * n_layer * d_model * 3 * d_attn.
Attention: Masking:¶
- What it does: Ensures that the model only considers valid relationships between tokens, such as preventing future tokens from being seen in autoregressive tasks like language generation.
- Parameters: No additional parameters needed.
- Compute cost (FLOPs): Proportional to the number of layers (`n_layer`), the context length (`n_ctx`), and the attention dimension (`d_attn`).
Attention: Projection:¶
- What it does: Transforms the output of the attention mechanism back into the model's main working dimension (`d_model`).
- Parameters: Depends on the number of layers (`n_layer`), the attention dimension (`d_attn`), and the embedding dimension (`d_model`).
- Compute cost (FLOPs): Scales with the number of these parameters.
Feedforward Layer:¶
- What it does: Applies additional transformations to the token representations after attention to extract deeper features.
- Parameters: Proportional to the number of layers (`n_layer`), the embedding dimension (`d_model`), and the intermediate feed-forward size (`d_ff`).
- Compute cost (FLOPs): About twice the parameter count per token, since each weight contributes a multiply and an add.
De-embedding:¶
- What it does: Converts the processed token representations back into probabilities over the vocabulary for predictions (e.g., generating the next word).
- Parameters: No separate parameters are counted in the table above; the output projection is a d_model × n_vocab matrix that is typically tied to the input embedding.
- Compute cost (FLOPs): 2 * d_model * n_vocab per token.
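Summing the rows of the table gives the usual closed-form estimates for a decoder-only Transformer. Here is a minimal Python sketch under the table's conventions (d_attn is the total attention dimension across heads; biases, layer norms, and other sub-leading terms are ignored); the example values are illustrative, not taken from the paper:

```python
def transformer_counts(n_layer: int, d_model: int, d_attn: int, d_ff: int,
                       n_ctx: int, n_vocab: int) -> dict:
    """Parameter and per-token FLOPs estimates obtained by summing the table rows."""
    embedding_params = (n_vocab + n_ctx) * d_model
    qkv_params = n_layer * d_model * 3 * d_attn
    proj_params = n_layer * d_attn * d_model
    ff_params = n_layer * 2 * d_model * d_ff
    # Non-embedding parameter count N = 2 * d_model * n_layer * (2 * d_attn + d_ff)
    n_non_embedding = qkv_params + proj_params + ff_params

    flops_per_token = (
        4 * d_model                       # Embedding (negligible)
        + 2 * qkv_params                  # Attention: QKV
        + 2 * n_layer * n_ctx * d_attn    # Attention: masking (context-dependent)
        + 2 * proj_params                 # Attention: projection
        + 2 * ff_params                   # Feedforward
        + 2 * d_model * n_vocab           # De-embedding
    )
    return {"embedding_params": embedding_params,
            "non_embedding_params": n_non_embedding,
            "forward_flops_per_token": flops_per_token}

# Illustrative GPT-2-small-like shape (values assumed for the example):
print(transformer_counts(n_layer=12, d_model=768, d_attn=768, d_ff=3072,
                         n_ctx=1024, n_vocab=50257))
```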
Feed-Forward Ratio (Left Panel):
- Performance changes only slightly as the feed-forward ratio (`d_ff / d_model`) deviates from the standard value (`d_ff = 4 * d_model`).
- Increasing the number of attention heads (`n_head`) can mitigate small performance losses.

Aspect Ratio (Middle Panel):
- Performance remains stable across a wide range of aspect ratios (`d_model / n_layer`), even when this ratio varies by a factor of 40.
- This indicates that the model's shape (how parameters are distributed across layers) is not critical to achieving good performance.

Attention Head Dimension (Right Panel):
- Performance is minimally affected by changes in the attention head dimension (`d_model / n_head`).
- The resulting loss differences are small: under the L ∝ C_min^(-0.050) trend, compensating for a 1% increase in loss costs only about a 22% increase in compute, which puts these shape effects in perspective.
Overall Finding:¶
Model performance depends only mildly on architectural choices like feed-forward ratio, aspect ratio, and attention head dimension, as long as the total number of parameters is fixed.
Figure 6: Impact of Layers and Parameters on Test Loss¶
With Embedding Parameters (Left Panel):
- Models with fewer layers perform worse, even when the total parameter count is similar.
- Increasing the number of layers improves performance significantly, with diminishing returns for very deep models (>6 layers).
Excluding Embedding Parameters (Right Panel):
- When embedding parameters are excluded, test loss aligns to a single trendline across different layer configurations.
- Only models with fewer than 2 layers or extreme depth-to-width ratios deviate from this trend, showing poor performance.
Overall Finding:¶
Performance scales predictably with the total number of non-embedding parameters, independent of the specific number of layers, as long as the model configuration is reasonable. Including embedding parameters introduces variability that masks the underlying trend.
Unified Summary¶
- Model shape (e.g., feed-forward ratio, aspect ratio, attention head dimension) has a minor impact on performance when the total parameter count is fixed.
- Performance scaling laws hold primarily for non-embedding parameters, with deeper models generally performing better, provided extreme configurations (e.g., very shallow or very deep models) are avoided.
Types of Embedding Parameters¶
1. Vocabulary Embedding Parameters¶
- These parameters map each token in the vocabulary (`n_vocab`) to a dense vector representation of size `d_model`.
- Formula: Vocabulary Embedding Parameters = n_vocab × d_model
- Example:
  - For `n_vocab = 50,000` and `d_model = 512`: Vocabulary Embedding Parameters = 50,000 × 512 = 25,600,000 parameters.
2. Positional Embedding Parameters¶
- These parameters encode the position of each token in the input sequence to account for word order.
- Formula: Positional Embedding Parameters = n_ctx × d_model
- Example:
  - For `n_ctx = 1024` (sequence length) and `d_model = 512`: Positional Embedding Parameters = 1024 × 512 = 524,288 parameters.
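Both example counts can be reproduced in a few lines; the helper below is just an illustration of the two formulas above:

```python
# Reproducing the embedding parameter counts from the examples above.
def embedding_params(n_vocab: int, n_ctx: int, d_model: int) -> dict:
    return {
        "vocabulary": n_vocab * d_model,   # token embedding matrix
        "positional": n_ctx * d_model,     # learned positional embeddings
    }

counts = embedding_params(n_vocab=50_000, n_ctx=1024, d_model=512)
print(counts)  # {'vocabulary': 25600000, 'positional': 524288}
```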
What Are Shallow Models?¶
- Definition: Shallow models have a small number of layers (`n_layer`), regardless of the size of each layer (`d_model`).
- Characteristics:
  - They process fewer hierarchical features and rely heavily on embeddings.
  - Performance is typically worse compared to deeper models with the same total parameters.
Depth-to-Width Ratios¶
- Definition: The ratio between the number of layers (`n_layer`) and the size of each layer (`d_model`).
- Balanced Ratios:
  - Reasonable configurations, like 12 layers with `d_model = 512`, tend to perform well.
- Extreme Ratios:
  - Very deep but narrow: too many layers with a small `d_model` (e.g., 24 layers, `d_model = 128`).
  - Very wide but shallow: too few layers with a large `d_model` (e.g., 1 layer, `d_model = 2048`).
  - These configurations often underperform due to inefficient parameter usage or limited capacity for feature extraction (see the parameter-count sketch below).
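For a rough sense of how these shapes compare in size, here is a sketch that counts non-embedding parameters, assuming d_attn = d_model and the standard d_ff = 4 * d_model mentioned earlier (so N ≈ 12 * n_layer * d_model²):

```python
# Non-embedding parameter counts for the example shapes above, assuming
# d_attn = d_model and d_ff = 4 * d_model, so N ≈ 12 * n_layer * d_model**2.
def non_embedding_params(n_layer: int, d_model: int) -> int:
    return 12 * n_layer * d_model ** 2

configs = {
    "deep & narrow (24 layers, d_model=128)": non_embedding_params(24, 128),
    "wide & shallow (1 layer, d_model=2048)": non_embedding_params(1, 2048),
    "balanced (12 layers, d_model=512)": non_embedding_params(12, 512),
}
for name, n in configs.items():
    print(f"{name}: {n / 1e6:.1f}M parameters")
```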
Transformers vs. LSTMs:
- Transformers consistently outperform LSTMs as the number of parameters increases.
- Transformers leverage long-range dependencies more effectively, resulting in lower test loss compared to LSTMs, which plateau in performance.
Per-Token Test Loss:
- LSTMs struggle with tokens beyond the first 100, showing limited ability to utilize long contexts.
- Transformers improve performance across the entire sequence, highlighting their strength in handling long-range dependencies.
Key Insight: Transformers are more efficient and scalable than LSTMs, making them better suited for tasks requiring long-context understanding.
Test loss decreases smoothly across all data distributions (e.g., Internet Books, Wikipedia, Common Crawl) as the number of non-embedding parameters increases.
There is only a small and slowly growing offset between the test loss on the WebText2 test set (the primary training distribution) and other data distributions.
Larger models generalize better, achieving lower test loss across all data distributions.
The model’s ability to generalize (perform well on unseen data) improves steadily with increasing model size.
The offset between the training distribution (WebText2) and other distributions indicates that generalization is largely consistent across diverse datasets, with minor degradation.
The test loss on other distributions (e.g., Books, Wikipedia) correlates closely with the test loss on the training distribution (WebText2).
This correlation holds throughout the training process (dashed lines for ongoing training) and for converged models (solid points at the end of training).
Performance on other distributions does not improve beyond what is achieved on the training distribution.
Generalization depends almost entirely on how well the model performs on the training distribution.
This finding suggests that improving performance on the training data directly translates to better performance on unseen data, without requiring special techniques for generalization.
Generalization Improves with Model Size:
- Larger models generalize better to unseen distributions, reducing test loss across a variety of datasets.
- The gap between the training distribution and other datasets is small and grows slowly with model size.
Strong Correlation Between Training and Generalization:
- Performance on unseen data is tightly linked to performance on the training data.
- Generalization is not dependent on the phase of training but rather on the overall performance on the training distribution.
Implications¶
- Scalability:
- Increasing model size is an effective way to improve both training and generalization performance.
- Consistency:
- A well-trained model on one distribution (e.g., WebText2) is likely to perform similarly well on related distributions without requiring additional fine-tuning.
- Focus on Training:
- Optimizing the model for the training distribution is sufficient for achieving strong generalization, simplifying the training process.
Proportional Scaling¶
- To achieve optimal performance, model size (N) and dataset size (D) should be scaled together.
- Increasing N without sufficient D leads to a data bottleneck and overfitting.
Predictability of Overfitting¶
- Overfitting is governed by the ratio N^(α_N / α_D) / D.
- This ratio allows for precise predictions of when a model will begin to overfit based on its size (N) and dataset size (D).
Power-Law Scaling¶
- For large datasets, test loss follows a power-law relationship with model size, confirming the scaling laws described earlier.
Regularization¶
All models are regularized using:
- 10% dropout to prevent overfitting.
- Early stopping, determined by monitoring the test loss and halting training when the loss stops decreasing.
Key Findings from Figure 9¶
The results align closely with the scaling law:
L(N, D) = [(N_c / N)^(α_N / α_D) + D_c / D]^α_D
Parameter Fits:¶
- α_N = 0.076
- α_D = 0.103
- N_c = 6.4 × 10¹³
- D_c = 1.8 × 10¹³
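Putting the fitted constants into the combined scaling law as written above gives a small calculator for the expected loss and the overfitting penalty; this is a sketch of the fitted formula, not the paper's code:

```python
# Combined L(N, D) fit with the constants quoted above.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

def overfitting_penalty(n_params: float, n_tokens: float) -> float:
    """delta_L: fractional loss increase relative to the infinite-data limit."""
    infinite_data_loss = (N_C / n_params) ** ALPHA_N
    return loss(n_params, n_tokens) / infinite_data_loss - 1

# A 1e9-parameter model on a 22B-token dataset: the penalty is only ~2%.
print(loss(1e9, 2.2e10), overfitting_penalty(1e9, 2.2e10))
```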
Exception:¶
- When the dataset size is drastically reduced (e.g., 2 × 10⁷ tokens), overfitting occurs very early in training.
- Such small datasets, where each epoch consists of only 40 parameter updates, may represent a different regime for language modeling.
Comparing Finite and Infinite Data Limits¶
For sufficiently large datasets (e.g., D = 22B tokens), overfitting is negligible for all but the largest models, making this dataset representative of D = ∞.
Difference in Loss:¶
The difference in loss between finite D and infinite D is defined as:
δL(N, D) ≡ (L(N, D) - L(N, ∞)) / L(N, ∞)
Dependence on N and D:¶
Empirical observations reveal that δL depends on the combination of N and D:
δL ≈ [1 + (N / N_c)^(α_N / α_D) * (D_c / D)]^α_D - 1
Dataset Size vs. Model Size¶
To avoid overfitting, the dataset size should scale with the model size. The threshold for avoiding overfitting when training to within a loss variation of 0.02 is:
D ≥ 5 × 10³ × N^0.74
Implications:¶
- Models smaller than 10⁹ parameters can be trained with minimal overfitting on the 22B token WebText2 dataset.
- Larger models encounter mild overfitting but remain manageable.
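A quick sketch of this threshold, which also recovers the 22B-token figure for a 10^9-parameter model:

```python
# Minimum dataset size to keep the overfitting penalty below ~2%,
# per the threshold above: D >= 5e3 * N^0.74.
def min_tokens(n_params: float) -> float:
    return 5e3 * n_params ** 0.74

for n in (1e6, 1e8, 1e9, 1e11):
    print(f"N = {n:.0e}: need at least {min_tokens(n):.2e} tokens")
# A 1e9-parameter model needs ~2e10 tokens, roughly the 22B-token WebText2 dataset.
```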
Sublinear Growth:
- Dataset size can grow sub-linearly with model size while avoiding overfitting.
- Example: Doubling the model size does not require doubling the dataset size.
Non-Optimal Regularization:
- The study did not optimize regularization (e.g., dropout) for varying dataset and model sizes.
- There is room for improvement in minimizing overfitting.
Compute-Efficiency:
- Avoiding overfitting does not necessarily correspond to maximally compute-efficient training.
- Some mild overfitting may be tolerable in practical settings.
Optimal Model Size and Compute Efficiency¶
- Models sized between 0.6x and 2.2x of the compute-efficient optimum (N_efficient) can be trained effectively with only about 20% more compute than the most efficient size.
- Straying outside this range (much smaller or larger models) leads to significantly higher compute costs and reduced efficiency.
- Smaller models (N < N_efficient) require more training steps, as they have limited learning capacity per step.
- Larger models (N > N_efficient) require fewer steps, but the benefits plateau due to diminishing returns, making them less compute-efficient.
Test Loss vs. Compute (PF-Days)¶
Test loss (L) decreases predictably with compute and follows these power-law trends:
- With compute allocated optimally (compute-efficient training): L = (C_min / 2.3 × 10^8)^(-0.050)
- When models are instead trained to convergence: L = (C / 2.0 × 10^7)^(-0.057)
A performance bump is observed during the transition from 1-layer to 2-layer networks at 10⁻⁵ PF-days, emphasizing the importance of adequate model complexity.
Shallow models (e.g., 1-layer) do not follow the power-law scaling trends, highlighting the need for minimum depth for effective scaling.
Compute-efficient training demonstrates reliable extrapolation of performance for larger models, supported by predictable power-law relationships.
Compute budgets (C_min) are best utilized by increasing model size (N) and batch size, with minimal reliance on extending serial optimization steps. For every 10x increase in compute, model size grows by roughly 5x while the amount of data processed grows by only about 2x. These findings point to a compute-efficient strategy of prioritizing larger models and larger batches.
Summary: Contradictions and a Conjecture¶
Scaling Laws Have Limits: Current trends suggest model performance improves predictably with more compute, data, and model size, but these trends must eventually plateau due to the non-zero entropy of natural language.
Contradiction in Predictions: Compute-efficient training predicts a faster improvement in loss, L(C_min), than the slower growth of data can support, indicating the scaling laws will break down before reaching extreme scales.
Theoretical Limits:
- Compute: ~10^4 PF-days.
- Model Size: ~10^12 parameters.
- Dataset Size: ~10^12 tokens.
- Loss: ~1.7 nats/token (potentially representing the entropy of natural language).
- These values are uncertain but provide a theoretical boundary where scaling laws may no longer apply.
Plateauing Performance: As compute and data grow, loss trends are expected to level off, signaling the entropy limit of natural language and the exhaustion of extractable information.
Key Conjecture: The intersection point of scaling laws may indicate the point where all reliable information from natural language data has been extracted. Future improvements would require new methods or data sources.
Practical Insight: Artificially introducing noise to test the limits of performance could provide deeper insights into how models approach this theoretical boundary.