A Large-Scale Exploration of 𝜇-Transfer (2024)

Lucas Dax Lingle
lucasdaxlingle@gmail.com

Abstract

Large neural network models have become a mainstay of natural language processing and computer vision, yet their initialization and learning rates are set in a largely heuristic fashion, potentially varying from paper to paper and from one model size to the next. The μ-Parameterization (μP) offers a potential solution to these challenges, yielding scaling rules for model initialization and learning rates, and reportedly enabling zero-shot hyperparameter transfer from small to large models in a variety of cases.

Despite the evident promise, the μP scaling rules are not yet widely adopted, perhaps due to their higher implementation complexity, many variations, or complex theoretical background. This work investigates μP empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does μ-Transfer yield optimal learning rates in practice? Across models with 2M to 10B parameters, we show that μ-Transfer works as intended for the majority of important cases, but also identify some surprising cases where it may not.

1 Introduction

Despite the emergence of transformers as the primary architecture for language and vision (OpenAI et al., 2024; Anthropic, 2024; Gemini Team et al., 2023; Reid et al., 2024; Touvron et al., 2023a; b; Jiang et al., 2024; Parmar et al., 2024; Dehghani et al., 2023; Liu et al., 2023), there is still no universal method for setting their initialization, learning rate, or architectural hyperparameters. Further, the hyperparameters selected for large models might be far from optimal due to the expense of conducting hyperparameter sweeps at scale.

The μ-Parameterization (μP) (Yang & Hu, 2021; Yang et al., 2022; 2023b) offers a general method for scaling initializations and learning rates, based on a Gaussian Process interpretation of deep neural networks. Empirically, μP is also reported to enable zero-shot hyperparameter transfer from small proxy models to large target models (Yang et al., 2022; 2023b), using width as the direction of scaling. This 'μ-transfer' technique offers a promise of stable training and optimal hyperparameters at scale with low expense.

However, while the initial report on μ-transfer demonstrated approximate preservation of hyperparameter optima, this was only shown at a relatively small scale (Yang et al., 2022), with the sole large-scale experiment being oriented as a benchmark. As a result, there is a lack of convincing empirical evidence that hyperparameter optima are preserved under μ-transfer when the target model is very large. In this absence, it seems possible the optimum could drift or jump, as transfer could be disrupted by emergent outliers (Dettmers, 2022).

A second open question is whether μ-transfer is compatible with the techniques used in practice, such as decoupled weight decay (Loshchilov & Hutter, 2017) or multiplicative nonlinearities (Shazeer, 2020; So et al., 2021). While the initial report aims to delineate the compatible techniques, there is a need for further exploration and empirical verification. Perhaps pending such verification, many recent large models do not report using μ-transfer.

A few recent works have adopted μP (Dey et al., 2023; Hu et al., 2024; XAI, 2024), but they do not settle the open questions above; such an investigation would require extensive hyperparameter sweeps at scale. Inspired by these works, this paper aims to shed further light on μ-transfer, studying its reliability on transformer models from 2M to 10B parameters in a variety of settings.

Table 1: Validation loss for each experiment group, by model width and base learning rate α.

Experiment Group          | Width | 2^-10 | 2^-8  | 2^-6  | 2^-4  | 2^-2
Baseline μP               | 128   | 3.846 | 3.743 | 3.695 | 3.884 | 4.143
                          | 512   | 3.114 | 2.993 | 2.953 | 3.221 | 3.506
                          | 2048  | 2.711 | 2.553 | 2.511 | 2.563 | 3.244
Projection Biases         | 128   | 3.838 | 3.735 | 3.705 | 3.911 | 4.269
                          | 512   | 3.108 | 2.986 | 2.947 | 2.970 | 3.557
                          | 2048  | 2.710 | 2.552 | 2.529 | 2.672 | 3.418
Zero Query Init           | 128   | 3.836 | 3.743 | 3.694 | 3.877 | 4.167
                          | 512   | 3.115 | 2.992 | 2.949 | 3.135 | 3.532
                          | 2048  | 2.711 | 2.553 | 2.510 | 2.551 | 3.272
SP Unembedding Init       | 128   | 3.861 | 3.765 | 3.699 | 3.896 | 4.161
                          | 512   | 3.119 | 2.990 | 2.951 | 3.265 | 3.582
                          | 2048  | 2.716 | 2.554 | 2.509 | 2.564 | 7.471
Cosine Schedule           | 128   | 3.846 | 3.743 | 3.695 | 3.906 | 4.143
                          | 512   | 3.114 | 2.995 | 2.955 | 3.225 | 3.506
                          | 2048  | 2.712 | 2.558 | 2.518 | 2.572 | 3.244
Embedding Normalization   | 128   | 3.834 | 3.743 | 3.693 | 4.012 | 4.120
                          | 512   | 3.115 | 2.993 | 2.954 | 3.028 | 3.506
                          | 2048  | 2.710 | 2.553 | 2.512 | 2.564 | 7.316
SwiGLU Nonlinearity       | 128   | 3.800 | 3.740 | 3.715 | 4.090 | 7.024
                          | 512   | 3.070 | 2.975 | 2.953 | 3.175 | 6.863
                          | 2048  | 2.677 | 2.536 | 2.505 | 2.553 | 4.571
Squared ReLU Nonlinearity | 128   | 3.808 | 3.735 | 3.686 | 3.999 | 4.484
                          | 512   | 3.071 | 2.964 | 2.929 | 3.184 | 7.299
                          | 2048  | 2.666 | 2.516 | 2.482 | 2.532 | 3.259
Multi-Query Attention     | 128   | 3.811 | 3.708 | 3.667 | 3.881 | 4.121
                          | 512   | 3.101 | 2.979 | 2.940 | 3.187 | 3.518
                          | 2048  | 2.715 | 2.564 | 2.521 | 2.546 | 3.257
4x Larger Batch           | 128   | 3.844 | 3.735 | 3.697 | 3.716 | 10.380
                          | 512   | 3.141 | 2.990 | 2.965 | 3.305 | 10.373
                          | 2048  | 2.745 | 2.556 | 2.541 | 2.697 | 7.197
4x Smaller Batch          | 128   | 3.855 | 3.774 | 3.736 | 3.945 | 4.104
                          | 512   | 3.120 | 3.011 | 2.977 | 3.024 | 3.521
                          | 2048  | 2.714 | 2.568 | 2.527 | 2.549 | 3.223
RMSNorm Gains (Vector)    | 128   | 3.842 | 3.744 | 3.689 | 3.670 | 3.681
                          | 512   | 3.101 | 2.992 | 2.951 | 2.950 | 3.412
                          | 2048  | 2.692 | 2.553 | 2.609 | 2.605 | 3.169
RMSNorm Gains (Scalar)    | 128   | 3.843 | 3.749 | 3.692 | 3.670 | 4.471
                          | 512   | 3.106 | 3.000 | 2.961 | 2.959 | 3.515
                          | 2048  | 2.704 | 2.570 | 2.525 | 2.542 | 3.334
SP Attention Scale        | 128   | 3.836 | 3.758 | 3.905 | 4.140 | 4.597
                          | 512   | 3.104 | 2.993 | 2.962 | 3.449 | 4.184
                          | 2048  | 2.706 | 2.555 | 2.525 | 3.306 | 7.280
Decoupled Weight Decay    | 128   | 3.760 | 3.679 | 3.694 | 3.741 | 4.011
                          | 512   | 3.057 | 2.963 | 2.957 | 3.139 | 3.373
                          | 2048  | 2.686 | 2.535 | 2.502 | 3.123 | 6.594
Lion Optimizer            | 128   | 3.708 | 3.736 | 4.057 | 4.344 | 10.380
                          | 512   | 2.952 | 2.947 | 3.416 | 3.961 | 10.285
                          | 2048  | 2.519 | 2.511 | 3.151 | 10.377 | 10.377

2 Background and Notation

This paper focuses on decoder-only transformer models, which process sequences of tokens z ∈ {0, …, V−1}^C, where V is called the vocabulary size and C the context length. This architecture has three components: embeddings, transformer blocks, and unembeddings. We describe a pre-norm transformer decoder (Radford et al., 2019) of depth L.

2.1 Embeddings

The token sequence z ∈ {0, …, V−1}^C is used to index into an embedding matrix W^E ∈ ℝ^{V×M}, where M is called the model width. The resulting real-valued vectors are written as rows of an activation matrix X^0 ∈ ℝ^{C×M} according to the formula X^0_i = W^E_{z_i}.

2.2 Transformer Blocks

A transformer block consists of two residual blocks (He et al., 2016), denoted MHA and MLP, which are added to a 'residual stream' according to the formula

X^ℓ = X^{ℓ-1} + MHA(X^{ℓ-1}) + MLP(X^{ℓ-1} + MHA(X^{ℓ-1})).    (1)

The MHA residual block performs multi-head self-attention, defined by a head width D ∈ ℕ and a number of heads H ∈ ℕ. For each head, MHA uses a distinct set of projections W^{AQ}, W^{AK}, W^{AV} ∈ ℝ^{M×D} to perform the following computations given input X ∈ ℝ^{C×M}:

Y = LayerNorm(X)    (2)
Q = Y W^{AQ}    (3)
K = Y W^{AK}    (4)
V = Y W^{AV}    (5)
S = τ^{-1} Q K^⊤ + 𝐌    (6)
P = Softmax(S)    (7)
O = P V    (8)

LayerNorm and Softmax are applied row-wise, τ^{-1} > 0 is a scalar constant commonly set to 1/√D, and 𝐌 is a causal mask given by 𝐌_{i,j} = −∞ if i < j and 𝐌_{i,j} = 0 otherwise. The heads' outputs O are concatenated together, and then projected using one additional matrix W^{AO} ∈ ℝ^{HD×M} to form the residual MHA(X). This residual is summed onto the residual stream as in Equation 1, and the sum is processed by the MLP residual block.

The MLP residual block applies a multi-layer perceptron to each row individually. It is defined via a hidden width F and an element-wise activation φ. It uses two trainable projections W^{FI} ∈ ℝ^{M×F} and W^{FO} ∈ ℝ^{F×M}. Given an input tensor X, it defines the residual:

Y = LayerNorm(X)    (9)
O = φ(Y W^{FI}) W^{FO}    (10)

This residual is likewise summed onto the residual stream, following Equation 1.
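
For readers who prefer code, the block above is small enough to write out directly. Below is a minimal single-sequence JAX sketch of Equations 1-10, using the gain-free RMSNorm from Section 4.1.2 in place of LayerNorm; the function names, parameter shapes, and dictionary layout are illustrative assumptions, not the paper's actual Flax implementation.

import jax
import jax.numpy as jnp

def rms_norm(x, eps=1e-6):
    # RMS LayerNorm without trainable gains, applied to each row.
    return x * jax.lax.rsqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)

def mha(x, wq, wk, wv, wo, tau_inv):
    # wq, wk, wv: (H, M, D); wo: (H*D, M); x: (C, M).
    c = x.shape[0]
    y = rms_norm(x)                                          # Eq. 2
    q = jnp.einsum('cm,hmd->hcd', y, wq)                     # Eq. 3
    k = jnp.einsum('cm,hmd->hcd', y, wk)                     # Eq. 4
    v = jnp.einsum('cm,hmd->hcd', y, wv)                     # Eq. 5
    mask = jnp.where(jnp.tril(jnp.ones((c, c), bool)), 0.0, -jnp.inf)
    s = tau_inv * jnp.einsum('hcd,hkd->hck', q, k) + mask    # Eq. 6
    p = jax.nn.softmax(s, axis=-1)                           # Eq. 7
    o = jnp.einsum('hck,hkd->hcd', p, v)                     # Eq. 8
    return o.transpose(1, 0, 2).reshape(c, -1) @ wo          # concat heads, project

def mlp(x, wfi, wfo):
    # wfi: (M, F); wfo: (F, M).
    y = rms_norm(x)                                          # Eq. 9
    return jax.nn.relu(y @ wfi) @ wfo                        # Eq. 10 with phi = ReLU

def block(x, params, tau_inv):
    # Eq. 1: two residual additions onto the stream.
    x = x + mha(x, params['wq'], params['wk'], params['wv'], params['wo'], tau_inv)
    return x + mlp(x, params['wfi'], params['wfo'])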

2.3 Unembedding

The unembedding layer uses a matrix W^U ∈ ℝ^{M×V} to produce the probabilities for next-token prediction. The layer's input is the residual stream output X^L, and its output is

Y = LayerNorm(X^L)    (11)
U = Softmax(Y W^U)    (12)

Due to the softmax, each row of U ∈ ℝ^{C×V} defines a probability mass function over tokens in the vocabulary. The model is trained on the cross-entropy loss −(1/C) Σ_{i=0}^{C−1} log U_{i, z_{i+1}}.
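
Continuing the sketch above (and reusing rms_norm from it), the unembedding and loss can be written as follows; treating position i as predicting token z[i+1], so that only the first C−1 positions carry a target, is an assumption about label layout rather than a detail taken from the paper.

def unembed_loss(x_l, w_u, z):
    # x_l: (C, M) residual-stream output; w_u: (M, V); z: (C,) token ids.
    y = rms_norm(x_l)                                  # Eq. 11
    logp = jax.nn.log_softmax(y @ w_u, axis=-1)        # log of the rows of U, Eq. 12
    c = logp.shape[0]
    # Cross entropy: position i is scored on the following token z[i+1].
    return -jnp.mean(logp[jnp.arange(c - 1), z[1:]])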

3 μ-Transfer

The μ-Parameterization (μP) (Yang & Hu, 2021; Yang et al., 2022; 2023b) refers to a specific family of initializations and learning rates that reportedly allow hyperparameter transfer from small to large models. This paper investigates μP for transformers with respect to width. We do not consider depthwise μP (Yang et al., 2023b) because it requires one linear layer per residual block, while transformers require at least two.

The general formulation of μP when training with Adam (Kingma & Ba, 2014) and using an i.i.d. Gaussian initialization is given by Yang et al. (2022). The first three columns of Table 2 display these rules for transformers. These columns use big-theta notation: formally, f(x) = Θ(g(x)) if there exist x_0 ∈ ℝ and c, C > 0 such that c·g(x) ≤ f(x) ≤ C·g(x) for all x > x_0.

Table 2: μP scaling rules for transformers. The first three columns give the general rules in big-Θ notation; the last two give the exact rules used in this paper.

Param  | Init Variance (Θ) | Adam LR (Θ) | Init Variance (Exact) | Adam LR (Exact)
W^E    | 1                 | 1           | 1                     | α
W^AQ   | 1/M               | 1/M         | 1/M                   | αP/M
W^AK   | 1/M               | 1/M         | 1/M                   | αP/M
W^AV   | 1/M               | 1/M         | 1/M                   | αP/M
W^AO   | 1/(HD)            | 1/(HD)      | 1/M                   | αP/M
W^FI   | 1/M               | 1/M         | 1/M                   | αP/M
W^FO   | 1/F               | 1/F         | 0.25/M                | αP/M
W^U    | 1/M^2             | 1/M         | 1/M^2                 | αP/M

In the remainder of this paper, we assume HD = M and F = 4M. In our experiments we fix a proxy model width P = 128 and head width D = 128, and follow the specific scaling rules in the last two columns of Table 2, where α denotes the base learning rate, so named because it is the learning rate for all parameters when M = P. These relative scaling rules are a special case of those in Appendix B.1 of Yang et al. (2022).
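
To make the exact rules concrete, the helper below maps each parameter matrix from Table 2 to its initialization variance and Adam learning rate under the assumptions above (HD = M, F = 4M, proxy width P = 128); the dictionary keys mirror the notation of Section 2 and are only illustrative.

def mup_rules(m, alpha, p=128):
    # Exact rules from the last two columns of Table 2.
    hidden = dict(init_var=1.0 / m, lr=alpha * p / m)
    return {
        'W_E':  dict(init_var=1.0, lr=alpha),                   # embedding
        'W_AQ': hidden, 'W_AK': hidden, 'W_AV': hidden,          # attention input projections
        'W_AO': hidden,                                          # attention output (1/(HD) = 1/M)
        'W_FI': hidden,                                          # MLP input
        'W_FO': dict(init_var=0.25 / m, lr=alpha * p / m),       # MLP output (1/F = 0.25/M)
        'W_U':  dict(init_var=1.0 / m ** 2, lr=alpha * p / m),   # unembedding
    }

At M = P every learning rate reduces to α, matching the definition of the base learning rate; at larger widths the non-embedding learning rates shrink in proportion to P/M.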

In addition, μP uses an attention scale of τ^{-1} = Θ(1/D) instead of the usual τ^{-1} = 1/√D. For simplicity, we use τ^{-1} = 1/D, since in preliminary experiments we observed only a small improvement from using smaller multiples of 1/D. Note that for D fixed across model widths M, any constant τ^{-1} ≠ 0 technically complies with μP (Yang et al., 2022), but in the experiments τ^{-1} will be shown to have a major impact on performance and transfer.

It is also possible to add scalar multipliers throughout the network as hyperparameters. For simplicity, we focus on μ-Transfer of the base learning rate.

4 Experiments

4.1 Experimental Setup

4.1.1 Implementation

Our experiments are implemented using Jax/Flax. Training is performed on TPU pod slices, using the fully-sharded data parallelism (FSDP) strategy from Xu et al. (2021) to reduce memory overhead. Models train on the Colossal Clean Crawled Corpus (C4) dataset, using the T5 tokenizer (Raffel et al., 2019) with context length C = 256.

The experiments use a bitwise-deterministic training pipeline, with shards of data written to disk in a random-access format similar to Nvidia Megatron (Shoeybi et al., 2020). Distributed model checkpoints are saved periodically, and the reported validation loss is computed on the best-performing checkpoint. The same seed is used for all experiments; due to time constraints, each experiment is run once.

4.1.2 Configuration

We use the following default configuration, deviating from it only if specifically mentioned. The depth is fixed at L = 24, and we consider model widths M ∈ {128, 512, 2048}, yielding three model sizes ranging from 4.7M to 1.2B non-embedding parameters. The head width is fixed at D = 128, the number of heads is H = M/D, and the MLP hidden width is F = 4M. The models use RMS LayerNorm without gains (Zhang & Sennrich, 2019), linear projections without biases (Raffel et al., 2019), RoPE on the queries and keys (Su et al., 2021), and ReLU for the MLP nonlinearity (Vaswani et al., 2017; Raffel et al., 2019).

By default, we use 2^18 tokens per batch, float32 parameters, and bfloat16 activations/gradients (during evaluation, output logits are computed in float32). The optimizer is AdamW (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.98, ε = 10^{-9}, with default weight decay 0.0 and gradient clip 1.0. Models train for 125K steps total, with 10K steps of learning rate warmup followed by linear decay to zero.
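
As a rough illustration of these defaults, the optax sketch below builds the warmup-then-linear-decay schedule and AdamW settings described above; the per-parameter μP scaling from Table 2 would still have to be applied on top (for example via parameter-wise learning rate multipliers), and this particular construction is our assumption rather than the paper's code.

import optax

def make_optimizer(base_lr, warmup_steps=10_000, total_steps=125_000):
    # 10K steps of linear warmup, then linear decay to zero by step 125K.
    schedule = optax.join_schedules(
        [optax.linear_schedule(0.0, base_lr, warmup_steps),
         optax.linear_schedule(base_lr, 0.0, total_steps - warmup_steps)],
        boundaries=[warmup_steps])
    return optax.chain(
        optax.clip_by_global_norm(1.0),                      # gradient clip 1.0
        optax.adamw(schedule, b1=0.9, b2=0.98, eps=1e-9,
                    weight_decay=0.0))                       # default weight decay 0.0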

4.2 Primary Experiments

In our primary experiments, we sweep the base learning rate α ∈ {2^{-2j} : j ∈ ℕ, 1 ≤ j ≤ 5} for each model size and experiment setting, and we report all results. This allows us to investigate the impact of various experimental conditions on model quality and on the transferability of learning rates under the μP scaling rules. We focus on learning rate transfer because it is the main hyperparameter of interest for large transformer models.

4.2.1 Baseline

In our first experiment group, we establish model-quality baselines against which the other experiment groups can be compared. In addition, we verify that μ-Transfer works reliably even with mixed-precision training. For this purpose, we utilize the Google Brain floating point format, bfloat16, for the activations and gradients. This format is supported by Google TPUs and recent Nvidia GPUs, and was used with μP in a contemporary work (Dey et al., 2023).

As shown in Table 1, the learning rates transfer reliably across model sizes under μP. Despite each model being 4x wider (and 16x larger) than the last, the smallest model's optimal base learning rate α directly predicts the optimum in our sweeps for the larger models.

4.2.2 Projection Biases

It is not a priori clear if trainable bias vectors in linear layers are beneficial for model quality, and several prior works omit them (Raffel et al., 2019; Shazeer, 2020; Chowdhery et al., 2023). Here, we test their benefit and their impact on learning rate transferability under μP.

As shown in Table 1, the learning rates appear to transfer across model sizes under μP. However, for the smallest and largest models, biases do not appear to improve quality versus the baseline when the learning rate is optimal.

4.2.3 RMSNorm Gains

It is not a priori clear if trainable scale vectors ('gains') in RMSNorm (Zhang & Sennrich, 2019) are beneficial for model quality, and many frameworks offer the option to omit them. This ablation tests their benefit and their impact on learning rate transferability under μP. We also test a variant where the trainable gain vector is replaced with a trainable scalar multiplier, similar to Elhage et al. (2023).

As shown in Table 1, optimal learning rates for these models do not reliably transfer when using Θ(1) learning rate scaling for the gains, despite the fact that the 'coordinate size' of the features before and after RMS normalization is Θ(1) with respect to width by design. In addition to the lack of transfer in these experiments, we find trainable gains harm the quality of the largest μP models when the base learning rate α is optimal.

4.2.4 Query Initialization

The usual μP initialization for the query projections W^{AQ} is Gaussian with variance Θ(1/M). One alternative is to use zero-initialized query projections, which yield equal attention weights over all past timesteps at initialization. This change was recommended by Yang et al. (2022) to improve transfer, so we investigate its effects as well.

As shown in Table 1, the learning rates transfer across model sizes when using μP with zero-initialized query projections. There is also a slight yet consistent improvement in loss.

4.2.5 SP Attention Scale

The usual attention scale τ^{-1} = 1/√D was first proposed by Vaswani et al. (2017) and has generally been used since. However, μP proposes τ^{-1} = Θ(1/D), and we use τ^{-1} = 1/D. Notably, in our experiments we scale the model width M and keep the head width D fixed across model sizes, so the attention scale should not actually matter for purposes of transfer; any difference between 1/√D and 1/D can be treated as a constant multiplier. Nonetheless, we investigate the effect of using the standard 1/√D attention scale.

As shown in Table 1, the SP attention scale 1/√D appears quite suboptimal, harming performance relative to the baselines across all model sizes. Interestingly, this 11.3× larger attention scale also prevented transfer of the optimal learning rate, despite the constant attention head width D = 128 across models. Given this result, our recommendation is to use τ^{-1} ≤ 1/D if applying μ-transfer with small proxy models.

4.2.6 SP Unembedding Initialization

The μP initialization for the unembedding matrix W^U is a Gaussian distribution with variance Θ(1/M^2), while the so-called standard parameterization (SP) uses 1/M (Yang et al., 2022). We thus ablate the impact of using the standard initialization on performance and transfer.

As shown in Table 1, despite using SP initialization for the unembedding projection, the learning rates empirically transfer across model sizes. There also appears to be a small but consistent improvement from using the SP initialization for the unembedding projection.

4.2.7 Cosine Schedule

It is not immediately clear whether the linear learning rate schedule in the baseline settings is the optimal choice. In this ablation, we instead use a cosine schedule, decaying to zero.

As shown in Table 1, the learning rates transfer across model sizes, showing that μP is compatible with a cosine schedule as well as the linear schedule used by the baseline. Our tentative conclusion is that the schedule is unlikely to interfere with learning rate transfer.

4.2.8 Weight Decay

Decoupled weight decay for Adam (Loshchilov & Hutter, 2017) is often used to train transformers. In common libraries such as PyTorch and Optax, its decay rate λ is multiplied by the current learning rate rather than by just the schedule multiplier; the latter convention was recently investigated by Wortsman et al. (2023) under the name 'independent weight decay'. We focus on the version used by these libraries, as we found the 'independent' variant inevitably led to instability in the larger models that was not seen in the smaller ones, since the optimal learning rate tended to decrease with model size even while λ stayed constant.
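
To make the distinction explicit, the two conventions can be written as single-tensor update rules, where adam_step is the usual Adam direction, lr_t is the base learning rate times the schedule at step t, and the function names are ours:

def decoupled_update(param, adam_step, lr_t, lam):
    # PyTorch/Optax convention: the decay rate lam is multiplied by the full
    # current learning rate, so the decay weakens whenever the base LR shrinks.
    return param - lr_t * (adam_step + lam * param)

def independent_update(param, adam_step, lr_t, schedule_t, lam):
    # 'Independent' decay (Wortsman et al., 2023): lam is multiplied only by the
    # schedule multiplier, so its strength does not track the base LR.
    return param - lr_t * adam_step - schedule_t * lam * param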

As shown in Table 1, when using decoupled weight decay, the optimal base learning rate α may not always transfer from the smallest model to the larger ones. On the other hand, these experiments use the strongest typical setting of λ = 0.1, and yet the optimum for α is not visibly changed between the two larger models. Furthermore, this optimum for α also matches the optimum for the baselines, which is the same across model sizes.

Based on these observations, decoupled weight decay appears unlikely to significantly alter the optimal learning rate for the target model when λ ≤ 0.1 and M ≫ P, so one strategy would be to omit it when training the proxy model and apply it to the target model only.

4.2.9 Embedding Normalization

We consider using normalized embeddings following Peng et al. (2023) and Gemma (2024), using RMSNorm without trainable gains (Zhang & Sennrich, 2019). We do not change the learning rate from the setting in Table 2, nor do we adjust the initialization.

As shown in Table 1, the optimal learning rate transfers across models using μP; however, the improvement in model quality over the baseline is negligible. We briefly investigated using lower initialization variances, but found this harmed stability with the width-constant embedding learning rate of μP Adam, possibly due to similar effects.

4.2.10 Multiplicative Nonlinearities

Multiplicative nonlinearities such as SwiGLU (Shazeer, 2020) and Squared ReLU (So et al., 2021) are increasingly used in MLP blocks to improve transformer quality (Touvron et al., 2023a; b; Elsen et al., 2023; Peng et al., 2023; Parmar et al., 2024). In this experiment, we investigate both of the aforementioned nonlinearities, which are notably 'superlinear' and thus may create outliers that interfere with μ-Transfer, as discussed by Yang et al. (2022). For SwiGLU, we use F = 5M, so the MLP has 7.5M^2 parameters.

As shown in Table 1, the SwiGLU and Squared ReLU nonlinearities both allow μ-transfer of the learning rate across model sizes. The outcomes here contrast nicely with the RMSNorm gains experiments, since transfer occurs despite the multiplicative interactions.

4.3 Lion Optimizer

We empirically investigate whether the Lion optimizer (Chen et al., 2023b; a) is compatible with μ-Transfer. This optimizer is at least twice as memory-efficient as the Adam optimizer, and was reported to yield models of similar quality, including transformers (Chen et al., 2023b). A notable property of this optimizer is that its updates are constrained to {−1, +1} per coordinate, yielding a coordinate size of Θ(1) per step. As a result, a Θ(1/M) transfer rule for the weight learning rates, similar to μP Adam, might be appropriate (Yang et al., 2022).
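
For reference, a minimal single-tensor sketch of the Lion update, which makes the Θ(1) coordinate size of its steps explicit; the default betas here are the commonly reported ones and are not necessarily those used in our sweep.

import jax.numpy as jnp

def lion_update(param, grad, m, lr, b1=0.9, b2=0.99, wd=0.0):
    direction = jnp.sign(b1 * m + (1.0 - b1) * grad)   # every entry is -1 or +1
    new_param = param - lr * (direction + wd * param)  # sign step plus decoupled decay
    new_m = b2 * m + (1.0 - b2) * grad                 # momentum tracks raw gradients
    return new_param, new_m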

As shown in Table 1, the Lion optimizer did not admit transfer of the base learning rate from the smallest model size. The scaling rules do appear to preserve the optimal α between the larger models, but further investigations beyond the scope of this project are needed.

4.4 Multi-Query Attention

Multi-Query Attention (Shazeer, 2019) and its grouped generalization (Ainslie et al., 2023) are increasingly used in transformer LLMs (Chowdhery et al., 2023; Touvron et al., 2023b; Almazrouei et al., 2023; Gemini Team et al., 2023; Jiang et al., 2024). These techniques aim to improve the inference speed of transformers by sharing keys/values across multiple heads. This ablation investigates the impact of the shared keys/values on μ-Transfer. Similar to Shazeer (2019), we approximately correct for the resulting parameter reduction by setting F = 5M.
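
A hedged sketch of the multi-query variant, written against the mha helper from Section 2: the H query heads are kept, while a single (M, D) key projection and a single (M, D) value projection are shared across all of them.

def mqa(x, wq, wk, wv, wo, tau_inv):
    # wq: (H, M, D) per-head queries; wk, wv: (M, D) shared; wo: (H*D, M).
    c = x.shape[0]
    y = rms_norm(x)
    q = jnp.einsum('cm,hmd->hcd', y, wq)
    k = y @ wk                                   # one key head shared by all query heads
    v = y @ wv                                   # one value head shared by all query heads
    mask = jnp.where(jnp.tril(jnp.ones((c, c), bool)), 0.0, -jnp.inf)
    s = tau_inv * jnp.einsum('hcd,kd->hck', q, k) + mask
    p = jax.nn.softmax(s, axis=-1)
    o = jnp.einsum('hck,kd->hcd', p, v)
    return o.transpose(1, 0, 2).reshape(c, -1) @ wo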

As shown in Table 1, multi-query attention is compatible with μ-Transfer.

4.4.1 4x Larger Batch

Large-batch training can reduce wall time, but may also have a considerable influence on the training dynamics (McCandlish et al., 2018; You et al., 2019). In this section, we consider scaling up the batch size by 4× while keeping the number of training tokens the same. For this ablation, we adopt the learning rate scaling rule from You et al. (2019); Malladi et al. (2022), so that each formula in Table 2 is scaled by 2×.
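
Concretely, this square-root rule multiplies every learning rate from Table 2 by the square root of the batch-size ratio, which gives the 2× factor here and the 0.5× factor in the next section; a one-line sketch, with the function name ours:

import math

def batch_adjusted_lr(table2_lr, batch_ratio):
    # batch_ratio = new tokens per batch / default tokens per batch.
    # 4x larger batch -> 2x learning rate; 4x smaller batch -> 0.5x.
    return table2_lr * math.sqrt(batch_ratio)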

As shown in Table 1, the 4× larger batch size admits transfer of the learning rate via μP.

4.4.2 4x Smaller Batch

An important question is whether μ-Transfer requires a minimum batch size to work (Dey et al., 2023). In this section, we consider scaling the batch size down by 4×, while keeping the number of training tokens the same. For this ablation, we again adopt the learning rate scaling rule from You et al. (2019); Malladi et al. (2022), so that each formula in Table 2 is scaled by 0.5×.

As shown in Table 1, the 4× smaller batch size admits transfer of the learning rate via μP.

4.5 Comparison with SP

Next, we verify the claim from Yang et al. (2022) that μP models outperform models using the standard parameterization (SP). For the SP baseline, we use attention scale 1/√D, unembedding initialization variance 1/M, trainable biases in linear layers, and trainable gains in RMSNorm layers. The learning rate for each model size is swept over. All other hyperparameters are identical to the μP baseline.

Table 3: Validation loss for the standard parameterization (SP) baseline, by model width and learning rate.

Width | 2^-10 | 2^-8  | 2^-6  | 2^-4  | 2^-2
128   | 3.841 | 3.757 | 3.706 | 3.879 | 4.030
512   | 3.013 | 2.967 | 2.987 | 3.383 | 7.403
2048  | 2.738 | 2.902 | 7.247 | 7.477 | 7.314

As shown in Table 3, the optimal loss at each model width is worse than the corresponding loss for the baseline μP model given in Table 1. This suggests that using μP not only facilitates hyperparameter transfer, but can improve the model loss as well.

4.6 Large-Scale Transfer Experiment

In this experiment, we combine the architectural choices that both transferred and improved performance, and investigate whether μ-Transfer continues to work as desired at scale.

We use depth L = 12 and consider widths M ∈ {128, 512, 2048, 8192}, yielding models with approximately 2M, 40M, 600M, and 10B non-embedding parameters, respectively. We use zero-initialized queries (Yang et al., 2022) and the Squared ReLU nonlinearity (So et al., 2021). We use 2^21 tokens per batch, training for 90K steps. We use Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.95, ε = 10^{-8}, decoupled weight decay 0.1, and gradient clip 1.0.

Table 4: Validation loss for the large-scale transfer experiment, by parameter count, model width, and base learning rate.

Params | Width | 2^-8  | 2^-6  | 2^-4
2M     | 128   | 3.791 | 3.766 | 3.814
40M    | 512   | 3.016 | 2.983 | 3.004
600M   | 2048  | 2.513 | 2.459 | 2.466
10B    | 8192  | 2.238 | 2.167 | 2.169

As shown in Table 4, the optimal learning rate transfers from a model about 5000× smaller. This shows that μ-Transfer continues to predict the optimal learning rate at scale. Moreover, the outcome suggests 'emergent outliers' may not be a source of interference for μ-Transfer, given that these reportedly appear at around 7B parameters (Dettmers, 2022).

5 Related Works

The μ-Parameterization (μP) is proposed as part of the Tensor Programs series (Yang, 2019; 2020; 2021; Yang & Hu, 2021; Yang et al., 2022; Littwin & Yang, 2023; Yang et al., 2023b; a).

The empirical demonstration of zero-shot hyperparameter transfer under μP was given in Yang et al. (2022). Notably, the largest model trained in that report had 6.7B parameters. It used FP32 computation for numerical stability, as well as a different position encoding mechanism and a different learning rate schedule than the FP16 baseline. This left an open question of whether μ-Parameterized transformers could be made stable at large scale. Moreover, the large-scale experiment did not demonstrate that μ-Transfer predicted the hyperparameter optima for the 6.7B target model.

Some recent works have adopted μP for hyperparameter tuning (Dey et al., 2023; Hu et al., 2024; XAI, 2024), but did not provide any evidence that the hyperparameter optimum is preserved under μ-Transfer when the target model is very large. Furthermore, Dey et al. (2023) trained a suite of models, but only used μP for transformers with up to 2.7B parameters, while their largest model using a standard parameterization had 13B parameters. This left open the question of whether μ-Transfer works reliably on larger-scale target models.

Some recent works suggest using μ-Transfer to avoid large-scale experiments entirely (Yao & Wang, 2023; Fan et al., 2024). These works use power laws to predict the loss of larger models from smaller ones (Kaplan et al., 2020; Hoffmann et al., 2022; So et al., 2021), and use μP for hyperparameter tuning when fitting the smaller reference models. However, similar to the other works, these papers do not study whether μ-Transfer correctly predicts the optimal hyperparameters themselves when the target model is very large.

A potential alternative to μP was recently proposed by DeepSeek-AI et al. (2024), namely a scaling law for the optimal learning rate solely in terms of compute budget. However, empirically derived scaling laws are strongly affected by the choice of independent variables and the fitted data, so the fitted scaling law may not transfer to other setups.

Other recent work has investigated transformer training instabilities via small proxy models (Wortsman et al., 2023), proposing architectural adjustments that reduce the sensitivity of the loss to the learning rate rather than predicting the optimal one; their method notably applies to depthwise scaling of transformer models. Automatic Gradient Descent and hypergradient methods (Bernstein et al., 2023; Baydin et al., 2017; Chandra et al., 2019) tune the learning rate of optimizers during training. These methods have a high implementation complexity and might also incur a performance hit versus vanilla hyperparameter tuning, which μP makes affordable via proxy models.

6 Conclusion

This paper studied the reliability of μ-Transfer of learning rates, focusing on transformers. In our experiments, μ-Transfer worked as desired in most cases, including with multiplicative nonlinearities, multi-query attention, and large/small-batch training. However, μP did not admit transfer when using trainable gain parameters or too large an attention scale. The simple μP recipe used in this work also outperforms the 'standard parameterization' commonly used for transformers.

Lastly, we found that μ-Transfer from a 2M-parameter model predicted the optimal learning rate from a sweep at the scale of 10B parameters. To the best of our knowledge, this is the largest target model for which this property has been verified. We hope these findings are helpful to the research community and inspire further work on hyperparameter transfer.

Acknowledgments

The author thanks Oleg Filatov, Stella Biderman, Lucas Nestler, and Hailey Schoelkopf for helpful remarks during the project and Oleg for feedback on the manuscript. The author is grateful to Google TPU Research Cloud for supporting the experiments with Cloud TPUs.

References

  • Ainslie et al. (2023) Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023. URL https://arxiv.org/abs/2305.13245.
  • Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The Falcon series of open language models, 2023. URL https://arxiv.org/abs/2311.16867.
  • Anthropic (2024) Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024. Accessed: 2024-03-13.
  • Baydin et al. (2017) Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent, 2017. URL https://arxiv.org/abs/1703.04782.
  • Bernstein et al. (2023) Jeremy Bernstein, Chris Mingard, Kevin Huang, Navid Azizan, and Yisong Yue. Automatic gradient descent: Deep learning without hyperparameters, 2023. URL https://arxiv.org/abs/2304.05187.
  • Chandra et al. (2019) Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley, and Erik Meijer. Gradient descent: The ultimate optimizer, 2019. URL https://arxiv.org/abs/1909.13371.
  • Chen et al. (2023a) Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves constrained optimization: As Lyapunov predicts, 2023a. URL https://arxiv.org/abs/2310.05898.
  • Chen et al. (2023b) Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms, 2023b. URL https://arxiv.org/abs/2302.06675.
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023. URL http://jmlr.org/papers/v24/22-1144.html.
  • DeepSeek-AI et al. (2024) DeepSeek-AI, Xiao Bi, Deli Chen, et al. DeepSeek LLM: Scaling open-source language models with longtermism, 2024. URL https://arxiv.org/abs/2401.02954.
  • Dehghani et al. (2023) Mostafa Dehghani, Josip Djolonga, Basil Mustafa, et al. Scaling vision transformers to 22 billion parameters, 2023. URL https://arxiv.org/abs/2302.05442.
  • Dettmers (2022) Tim Dettmers. LLM.int8() and emergent features. https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/, 2022. Accessed: 2024-03-09.
  • Dey et al. (2023) Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster, 2023. URL https://arxiv.org/abs/2304.03208.
  • Elhage et al. (2023) Nelson Elhage, Robert Lasenby, and Christopher Olah. Privileged bases in the transformer residual stream. https://transformer-circuits.pub/2023/privileged-basis/index.html, 2023. Accessed: 2024-03-09.
  • Elsen et al. (2023) Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, and Arushi Somani. Releasing Persimmon-8B, 2023. URL https://www.adept.ai/blog/persimmon-8b.
  • Fan et al. (2024) Siqi Fan, Xiusheng Huang, Xuezhi Fang, Yiqun Yao, Xiang Li, Ziyi Ni, Xin Jiang, Xuying Meng, Peng Han, Shuo Shang, Kang Liu, Aixin Sun, and Yequan Wang. NanoLM: An affordable LLM study benchmark via accurate loss prediction across scales, 2024. URL https://openreview.net/forum?id=mao3y822aM.
  • Gemini Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, et al. Gemini: A family of highly capable multimodal models, 2023. URL https://arxiv.org/abs/2312.11805.
  • Gemma (2024) Gemma Team. Gemma: Open models based on Gemini research and technology, 2024. URL https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks, 2016. URL https://arxiv.org/abs/1603.05027.
  • Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.
  • Hu et al. (2024) Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Kaihuo Zhang, Yuxiang Huang, Zhenning Dai, Baitao Gong, Chongyi Wang, Yuan Yao, Jie Zhou, Jie Cai, Xinrong Zhang, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of end-side large language models. https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4, 2024. Accessed: 2024-03-09.
  • Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024. URL https://arxiv.org/abs/2401.04088.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
  • Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. URL https://arxiv.org/abs/1412.6980.
  • Littwin & Yang (2023) Etai Littwin and Greg Yang. Adaptive optimization in the $\infty$-width limit. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zgVDqw9ZUES.
  • Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URL https://arxiv.org/abs/2304.08485.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101.
  • Malladi et al. (2022) Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora. On the SDEs and scaling rules for adaptive gradient algorithms, 2022. URL https://arxiv.org/abs/2205.10287.
  • McCandlish et al. (2018) Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. CoRR, abs/1812.06162, 2018. URL http://arxiv.org/abs/1812.06162.
  • Nguyen & Salazar (2019) Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention, 2019. URL https://arxiv.org/abs/1910.05895.
  • OpenAI etal. (2024)OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya,FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman,Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom,Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, JakeBerdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, OlegBoiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, MilesBrundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, BrittanyCarey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, FotisChantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, BenChess, Chester Cho, Casey Chu, HyungWon Chung, Dave Cummings, JeremiahCurrier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, DamienDeville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, AdrienEcoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix,SimónPosada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges,Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes,Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross,ShixiangShane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, YuchenHe, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey,Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, JoostHuizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang,Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan,Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, NitishShirish Keskar,Tabarak Khan, Logan Kilpatrick, JongWook Kim, Christina Kim, Yongjik Kim,JanHendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, ŁukaszKondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, GretchenKrueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, JadeLeung, Daniel Levy, ChakMing Li, Rachel Lim, Molly Lin, Stephanie Lin,Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, KimMalfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, KatieMayer, Andrew Mayne, Bob McGrew, ScottMayer McKinney, Christine McLeavey,Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, LukeMetz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, DanielMossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, ReiichiroNakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, LongOuyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, AshleyPantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, AlexPassos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe deAvilaBelbutePeres, Michael Petrov, HenriquePonde deOliveiraPinto, Michael,Pokorny, Michelle Pokrass, VitchyrH. Pong, Tolly Powell, Alethea Power,Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, AdityaRamesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, BobRotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, ShibaniSanturkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman,Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker,Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin,Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher,FelipePetroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, NikolasTezak, MadeleineB. 
Thompson, Phil Tillet, Amin Tootoonchian, ElizabethTseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan FelipeCerón Uribe,Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright,JustinJay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJWeinmann,Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, DaveWillner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, SherwinWu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan,Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao,Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph.Gpt-4 technical report, 2024.URL https://arxiv.org/abs/2303.08774.
  • Parmar et al. (2024) Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, and Bryan Catanzaro. Nemotron-4 15B technical report, 2024. URL https://arxiv.org/abs/2402.16819.
  • Peng et al. (2023) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. RWKV: Reinventing RNNs for the transformer era, 2023. URL https://arxiv.org/abs/2305.13048.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Last visited on 2023/09/07.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019. URL https://arxiv.org/abs/1910.10683.
  • Reid etal. (2024)Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, TimothyLillicrap, Jean baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, OrhanFirat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, SebastianBorgeaud, Andrew Dai, Katie Millican, Ethan Dyer, Mia Glaese, ThibaultSottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, JamesMolloy, Jilin Chen, Michael Isard, Paul Barham, Tom Hennigan, Ross McIlroy,Melvin Johnson, Johan Schalkwyk, Eli Collins, Eliza Rutherford, EricaMoreira, Kareem Ayoub, Megha Goel, Clemens Meyer, Gregory Thornton, ZhenYang, Henryk Michalewski, Zaheer Abbas, Nathan Schucher, Ankesh Anand,Richard Ives, James Keeling, Karel Lenc, Salem Haykal, Siamak Shakeri, PranavShyam, Aakanksha Chowdhery, Roman Ring, Stephen Spencer, Eren Sezener, LukeVilnis, Oscar Chang, Nobuyuki Morioka, George Tucker, CeZheng, OliverWoodman, Nithya Attaluri, Tomas Kocisky, Evgenii Eltyshev, XiChen, TimothyChung, Vittorio Selo, Siddhartha Brahma, Petko Georgiev, Ambrose Slone,Zhenkai Zhu, James Lottes, Siyuan Qiao, Ben Caine, Sebastian Riedel, AlexTomala, Martin Chadwick, Juliette Love, Peter Choy, Sid Mittal, Neil Houlsby,Yunhao Tang, Matthew Lamm, Libin Bai, Qiao Zhang, Luheng He, Yong Cheng,Peter Humphreys, Yujia Li, Sergey Brin, Albin Cassirer, Yingjie Miao, LukasZilka, Taylor Tobin, Kelvin Xu, Lev Proleev, Daniel Sohn, Alberto Magni,LisaAnne Hendricks, Isabel Gao, Santiago Ontañón, Oskar Bunyan, NathanByrd, Abhanshu Sharma, Biao Zhang, Mario Pinto, Rishika Sinha, Harsh Mehta,Dawei Jia, Sergi Caelles, Albert Webson, Alex Morris, Becca Roelofs, YifanDing, Robin Strudel, Xuehan Xiong, Marvin Ritter, Mostafa Dehghani, RahmaChaabouni, Abhijit Karmarkar, Guangda Lai, Fabian Mentzer, Bibo Xu, YaGuangLi, Yujing Zhang, TomLe Paine, Alex Goldin, Behnam Neyshabur, Kate Baumli,Anselm Levskaya, Michael Laskin, Wenhao Jia, JackW. 
Rae, Kefan Xiao, AntoineHe, Skye Giordano, Lakshman Yagati, Jean-Baptiste Lespiau, Paul Natsev,Sanjay Ganapathy, Fangyu Liu, Danilo Martins, Nanxin Chen, Yunhan Xu, MeganBarnes, Rhys May, Arpi Vezer, Junhyuk Oh, Ken Franko, Sophie Bridgers, RuizheZhao, Boxi Wu, Basil Mustafa, Sean Sechrist, Emilio Parisotto,ThanumalayanSankaranarayana Pillai, Chris Larkin, Chenjie Gu, ChristinaSorokin, Maxim Krikun, Alexey Guseynov, Jessica Landon, Romina Datta,Alexander Pritzel, Phoebe Thacker, Fan Yang, Kevin Hui, Anja Hauth, Chih-KuanYeh, David Barker, Justin Mao-Jones, Sophia Austin, Hannah Sheahan, ParkerSchuh, James Svensson, Rohan Jain, Vinay Ramasesh, Anton Briukhov, Da-WoonChung, Tamara von Glehn, Christina Butterfield, Priya Jhakra, MatthewWiethoff, Justin Frye, Jordan Grimstad, Beer Changpinyo, CharlineLe Lan,Anna Bortsova, Yonghui Wu, Paul Voigtlaender, Tara Sainath, Charlotte Smith,Will Hawkins, Kris Cao, James Besley, Srivatsan Srinivasan, Mark Omernick,Colin Gaffney, Gabriela Surita, Ryan Burnell, Bogdan Damoc, Junwhan Ahn,Andrew Brock, Mantas Pajarskas, Anastasia Petrushkina, Seb Noury, LorenzoBlanco, Kevin Swersky, Arun Ahuja, Thi Avrahami, Vedant Misra, RaouldeLiedekerke, Mariko Iinuma, Alex Polozov, Sarah York, George vandenDriessche, Paul Michel, Justin Chiu, Rory Blevins, Zach Gleicher, AdriàRecasens, Alban Rrustemi, Elena Gribovskaya, Aurko Roy, Wiktor Gworek, SébArnold, Lisa Lee, James Lee-Thorp, Marcello Maggioni, Enrique Piqueras,Kartikeya Badola, Sharad Vikram, Lucas Gonzalez, Anirudh Baddepudi, EvanSenter, Jacob Devlin, James Qin, Michael Azzam, Maja Trebacz, Martin Polacek,Kashyap Krishnakumar, Shuo yiin Chang, Matthew Tung, Ivo Penchev, RishabhJoshi, Kate Olszewska, Carrie Muir, Mateo Wirth, AleJakse Hartman, JoshNewlan, Sheleem Kashem, Vijay Bolina, Elahe Dabir, Joost van Amersfoort,Zafarali Ahmed, James Cobon-Kerr, Aishwarya Kamath, ArnarMar Hrafnkelsson,LeHou, Ian Mackinnon, Alexandre Frechette, Eric Noland, Xiance Si, EmanuelTaropa, Dong Li, Phil Crone, Anmol Gulati, Sébastien Cevey, Jonas Adler, AdaMa, David Silver, Simon Tokumine, Richard Powell, Stephan Lee, Michael Chang,Samer Hassan, Diana Mincu, Antoine Yang, Nir Levine, Jenny Brennan, MingqiuWang, Sarah Hodkinson, Jeffrey Zhao, Josh Lipschultz, Aedan Pope, MichaelB.Chang, Cheng Li, LaurentEl Shafey, Michela Paganini, Sholto Douglas, BerndBohnet, Fabio Pardo, Seth Odoom, Mihaela Rosca, CiceroNogueira dos Santos,Kedar Soparkar, Arthur Guez, Tom Hudson, Steven Hansen, ChulayuthAsawaroengchai, Ravi Addanki, Tianhe Yu, Wojciech Stokowiec, Mina Khan,Justin Gilmer, Jaehoon Lee, CarrieGrimes Bostock, Keran Rong, JonathanCaton, Pedram Pejman, Filip Pavetic, Geoff Brown, Vivek Sharma, MarioLučić, Rajkumar Samuel, Josip Djolonga, Amol Mandhane, LarsLowe Sjösund,Elena Buchatskaya, Elspeth White, Natalie Clay, Jiepu Jiang, Hyeontaek Lim,Ross Hemsley, Jane Labanowski, NicolaDe Cao, David Steiner, SayedHadiHashemi, Jacob Austin, Anita Gergely, Tim Blyth, Joe Stanton, KaushikShivakumar, Aditya Siddhant, Anders Andreassen, Carlos Araya, Nikhil Sethi,Rakesh Shivanna, Steven Hand, Ankur Bapna, Ali Khodaei, Antoine Miech,Garrett Tanzer, Andy Swing, Shantanu Thakoor, Zhufeng Pan, Zachary Nado,Stephanie Winkler, Dian Yu, Mohammad Saleh, Loren Maggiore, Iain Barr, MinhGiang, Thais Kagohara, Ivo Danihelka, Amit Marathe, Vladimir Feinberg,Mohamed Elhawaty, Nimesh Ghelani, Dan Horgan, Helen Miller, Lexi Walker,Richard Tanburn, Mukarram Tariq, Disha Shrivastava, Fei Xia, Chung-ChengChiu, Zoe Ashwood, Khuslen Baatarsukh, Sina Samangooei, Fred 
Alcober, AxelStjerngren, Paul Komarek, Katerina Tsihlas, Anudhyan Boral, Ramona Comanescu,Jeremy Chen, Ruibo Liu, Dawn Bloxwich, Charlie Chen, Yanhua Sun, FangxiaoyuFeng, Matthew Mauger, Xerxes Dotiwalla, Vincent Hellendoorn, Michael Sharman,Ivy Zheng, Krishna Haridasan, Gabe Barth-Maron, Craig Swanson, DominikaRogozińska, Alek Andreev, PaulKishan Rubenstein, Ruoxin Sang, Dan Hurt,Gamaleldin Elsayed, Renshen Wang, Dave Lacey, Anastasija Ilić, Yao Zhao,Lora Aroyo, Chimezie Iwuanyanwu, Vitaly Nikolaev, Balaji Lakshminarayanan,Sadegh Jazayeri, RaphaëlLopez Kaufman, Mani Varadarajan, Chetan Tekur, DougFritz, Misha Khalman, David Reitter, Kingshuk Dasgupta, Shourya Sarcar, TinaOrnduff, Javier Snaider, Fantine Huot, Johnson Jia, Rupert Kemp, Nejc Trdin,Anitha Vijayakumar, Lucy Kim, Christof Angermueller, LiLao, Tianqi Liu,Haibin Zhang, David Engel, Somer Greene, Anaïs White, Jessica Austin, LillyTaylor, Shereen Ashraf, Dangyi Liu, Maria Georgaki, Irene Cai, YanaKulizhskaya, Sonam Goenka, Brennan Saeta, Kiran Vodrahalli, Christian Frank,Dario deCesare, Brona Robenek, Harry Richardson, Mahmoud Alnahlawi,Christopher Yew, Priya Ponnapalli, Marco Tagliasacchi, Alex Korchemniy, YelinKim, Dinghua Li, Bill Rosgen, Zoe Ashwood, Kyle Levin, Jeremy Wiesner,Praseem Banzal, Praveen Srinivasan, Hongkun Yu, Çağlar Ünlü, David Reid,Zora Tung, Daniel Finchelstein, Ravin Kumar, Andre Elisseeff, Jin Huang, MingZhang, Rui Zhu, Ricardo Aguilar, Mai Giménez, Jiawei Xia, Olivier Dousse,Willi Gierke, SoheilHassas Yeganeh, Damion Yates, Komal Jalan, LuLi, EriLatorre-Chimoto, DucDung Nguyen, Ken Durden, Praveen Kallakuri, Yaxin Liu,Matthew Johnson, Tomy Tsai, Alice Talbert, Jasmine Liu, Alexander Neitz, ChenElkind, Marco Selvi, Mimi Jasarevic, LivioBaldini Soares, Albert Cui, PidongWang, AlekWenjiao Wang, Xinyu Ye, Krystal Kallarackal, Lucia Loher, Hoi Lam,Josef Broder, Dan Holtmann-Rice, Nina Martin, Bramandia Ramadhana, DanielToyama, Mrinal Shukla, Sujoy Basu, Abhi Mohan, Nick Fernando, Noah Fiedel,Kim Paterson, Hui Li, Ankush Garg, Jane Park, DongHyun Choi, Diane Wu,Sankalp Singh, Zhishuai Zhang, Amir Globerson, Lily Yu, John Carpenter,Félix deChaumontQuitry, Carey Radebaugh, Chu-Cheng Lin, Alex Tudor,Prakash Shroff, Drew Garmon, Dayou Du, Neera Vats, Han Lu, Shariq Iqbal, AlexYakubovich, Nilesh Tripuraneni, James Manyika, Haroon Qureshi, Nan Hua,Christel Ngani, MariaAbi Raad, Hannah Forbes, Anna Bulanova, Jeff Stanway,Mukund Sundararajan, Victor Ungureanu, Colton Bishop, Yunjie Li, BalajiVenkatraman, BoLi, Chloe Thornton, Salvatore Scellato, Nishesh Gupta,Yicheng Wang, Ian Tenney, Xihui Wu, Ashish Shenoy, Gabriel Carvajal,DianaGage Wright, Ben Bariach, Zhuyun Xiao, Peter Hawkins, Sid Dalmia,Clement Farabet, Pedro Valenzuela, Quan Yuan, Chris Welty, Ananth Agarwal,Mia Chen, Wooyeol Kim, Brice Hulse, Nandita Dukkipati, Adam Paszke, AndrewBolt, Elnaz Davoodi, Kiam Choo, Jennifer Beattie, Jennifer Prendki, HarshaVashisht, Rebeca Santamaria-Fernandez, LuisC. 
Cobo, Jarek Wilkiewicz, DavidMadras, Ali Elqursh, Grant Uy, Kevin Ramirez, Matt Harvey, Tyler Liechty,Heiga Zen, Jeff Seibert, ClaraHuiyi Hu, Mohamed Elhawaty, Andrey Khorlin,Maigo Le, Asaf Aharoni, Megan Li, Lily Wang, Sandeep Kumar, Alejandro Lince,Norman Casagrande, Jay Hoover, DaliaEl Badawy, David Soergel, Denis Vnukov,Matt Miecnikowski, Jiri Simsa, Anna Koop, Praveen Kumar, Thibault Sellam,Daniel Vlasic, Samira Daruki, Nir Shabat, John Zhang, Guolong Su, JiagengZhang, Jeremiah Liu, YiSun, Evan Palmer, Alireza Ghaffarkhah, XiXiong,Victor Cotruta, Michael Fink, Lucas Dixon, Ashwin Sreevatsa, AdrianGoedeckemeyer, Alek Dimitriev, Mohsen Jafari, Remi Crocker, NicholasFitzGerald, Aviral Kumar, Sanjay Ghemawat, Ivan Philips, Frederick Liu,Yannie Liang, Rachel Sterneck, Alena Repina, Marcus Wu, Laura Knight, MarinGeorgiev, Hyo Lee, Harry Askham, Abhishek Chakladar, Annie Louis, Carl Crous,Hardie Cate, Dessie Petrova, Michael Quinn, Denese Owusu-Afriyie, AchintyaSinghal, Nan Wei, Solomon Kim, Damien Vincent, Milad Nasr, ChristopherA.Choquette-Choo, Reiko Tojo, Shawn Lu, Diego deLasCasas, Yuchung Cheng,Tolga Bolukbasi, Katherine Lee, Saaber Fatehi, Rajagopal Ananthanarayanan,Miteyan Patel, Charbel Kaed, Jing Li, Jakub Sygnowski, ShreyasRammohanBelle, Zhe Chen, Jaclyn Konzelmann, Siim Põder, Roopal Garg, VinodKoverkathu, Adam Brown, Chris Dyer, Rosanne Liu, Azade Nova, Jun Xu, SlavPetrov, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, and Oriol Vinyals.Gemini 1.5: Unlocking multimodal understanding across millions oftokens of context, 2024.URL https://arxiv.org/abs/2403.05530.
  • Shazeer (2019) Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019. URL https://arxiv.org/abs/1911.02150.
  • Shazeer (2020) Noam Shazeer. GLU variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202.
  • Shoeybi et al. (2020) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053.
  • So et al. (2021) David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. Searching for efficient transformers for language modeling. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 6010–6022. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/2f3c6a4cd8af177f6456e7e51a916ff3-Paper.pdf.
  • Su et al. (2021) Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a. URL https://arxiv.org/abs/2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b. URL https://arxiv.org/abs/2307.09288.
  • van Laarhoven (2017) Twan van Laarhoven. L2 regularization versus batch and weight normalization, 2017. URL https://arxiv.org/abs/1706.05350.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Wortsman et al. (2023) Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Small-scale proxies for large-scale transformer training instabilities, 2023. URL https://arxiv.org/abs/2309.14322.
  • XAI (2024) XAI. Grok-1, 2024. URL https://github.com/xai-org/grok-1.
  • Xu et al. (2021) Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. GSPMD: General and scalable parallelization for ML computation graphs, 2021. URL https://arxiv.org/abs/2105.04663.
  • Yang (2019) Greg Yang. Tensor Programs I: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes, 2019. URL https://arxiv.org/abs/1910.12478.
  • Yang (2020) Greg Yang. Tensor Programs II: Neural tangent kernel for any architecture, 2020. URL https://arxiv.org/abs/2006.14548.
  • Yang (2021) Greg Yang. Tensor Programs III: Neural matrix laws, 2021. URL https://arxiv.org/abs/2009.10685.
  • Yang & Hu (2021) Greg Yang and Edward J. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11727–11737. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21c.html.
  • Yang et al. (2022) Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer, 2022. URL https://arxiv.org/abs/2203.03466.
  • Yang et al. (2023a) Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning, 2023a. URL https://arxiv.org/abs/2310.17813.
  • Yang et al. (2023b) Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor Programs VI: Feature learning in infinite-depth neural networks, 2023b. URL https://arxiv.org/abs/2310.02244.
  • Yao & Wang (2023) Yiqun Yao and Yequan Wang. Research without re-search: Maximal update parametrization yields accurate loss prediction across scales, 2023. URL https://arxiv.org/abs/2304.06875.
  • You et al. (2019) Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019. URL http://arxiv.org/abs/1904.00962.
  • Zhang & Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. CoRR, abs/1910.07467, 2019. URL http://arxiv.org/abs/1910.07467.