A Large-Scale Exploration of 𝜇-Transfer (2024)

Lucas Dax Lingle
lucasdaxlingle@gmail.com

Abstract

Large neural network models have become a mainstay of natural language processing and computer vision, yet their initialization and learning rates are set in a largely heuristic fashion, potentially varying from paper to paper and from one model size to the next. The μ-Parameterization (μP) offers a potential solution to these challenges, yielding scaling rules for model initialization and learning rates, and reportedly enabling zero-shot hyperparameter transfer from small to large models in a variety of cases.

Despite the evident promise, the μP scaling rules are not yet widely adopted, perhaps due to their higher implementation complexity, many variations, or complex theoretical background. This work investigates μP empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does μ-Transfer yield optimal learning rates in practice? Across models with 2M to 10B parameters, we show that μ-Transfer works as intended for the majority of important cases, but also identify some surprising cases where it may not.

1 Introduction

Despite the emergence of transformers as the primary architecture for language and vision (OpenAI et al., 2024; Anthropic, 2024; Gemini Team et al., 2023; Reid et al., 2024; Touvron et al., 2023a; b; Jiang et al., 2024; Parmar et al., 2024; Dehghani et al., 2023; Liu et al., 2023), there is still no universal method for setting their initialization, learning rate, or architectural hyperparameters. Further, the hyperparameters selected for large models might be far from optimal due to the expense of conducting hyperparameter sweeps at scale.

The μ-Parameterization (μP) (Yang & Hu, 2021; Yang et al., 2022; 2023b) offers a general method for scaling initializations and learning rates, based on a Gaussian Process interpretation of deep neural networks. Empirically, μP is also reported to enable zero-shot hyperparameter transfer from small proxy models to large target models (Yang et al., 2022; 2023b), using width as the direction of scaling. This 'μ-transfer' technique offers a promise of stable training and optimal hyperparameters at scale with low expense.

However, while the initial report on μ-transfer demonstrated approximate preservation of hyperparameter optima, this was only shown at a relatively small scale (Yang et al., 2022), with the sole large-scale experiment being oriented as a benchmark. As a result, there is a lack of convincing empirical evidence that hyperparameter optima are preserved under μ-transfer when the target model is very large. In this absence, it seems possible the optimum could drift or jump, as transfer could be disrupted by emergent outliers (Dettmers, 2022).

A second open question is whether μ-transfer is compatible with the techniques used in practice, such as decoupled weight decay (Loshchilov & Hutter, 2017) or multiplicative nonlinearities (Shazeer, 2020; So et al., 2021). While the initial report aims to delineate the compatible techniques, there is a need for further exploration and empirical verification. Perhaps pending such verification, many recent large models do not report using μ-transfer.

A few recent works have adopted μP (Dey et al., 2023; Hu et al., 2024; XAI, 2024), but they do not settle the open questions above; such an investigation would require extensive hyperparameter sweeps at scale. Inspired by these works, this paper aims to shed further light on μ-transfer, studying its reliability on transformer models from 2M to 10B parameters in a variety of settings.

Table 1: Validation loss for each experiment group, by model width and base learning rate α.

Experiment Group          | Width | 2^-10 | 2^-8  | 2^-6  | 2^-4  | 2^-2
Baseline μP               | 128   | 3.846 | 3.743 | 3.695 | 3.884 | 4.143
                          | 512   | 3.114 | 2.993 | 2.953 | 3.221 | 3.506
                          | 2048  | 2.711 | 2.553 | 2.511 | 2.563 | 3.244
Projection Biases         | 128   | 3.838 | 3.735 | 3.705 | 3.911 | 4.269
                          | 512   | 3.108 | 2.986 | 2.947 | 2.970 | 3.557
                          | 2048  | 2.710 | 2.552 | 2.529 | 2.672 | 3.418
Zero Query Init           | 128   | 3.836 | 3.743 | 3.694 | 3.877 | 4.167
                          | 512   | 3.115 | 2.992 | 2.949 | 3.135 | 3.532
                          | 2048  | 2.711 | 2.553 | 2.510 | 2.551 | 3.272
SP Unembedding Init       | 128   | 3.861 | 3.765 | 3.699 | 3.896 | 4.161
                          | 512   | 3.119 | 2.990 | 2.951 | 3.265 | 3.582
                          | 2048  | 2.716 | 2.554 | 2.509 | 2.564 | 7.471
Cosine Schedule           | 128   | 3.846 | 3.743 | 3.695 | 3.906 | 4.143
                          | 512   | 3.114 | 2.995 | 2.955 | 3.225 | 3.506
                          | 2048  | 2.712 | 2.558 | 2.518 | 2.572 | 3.244
Embedding Normalization   | 128   | 3.834 | 3.743 | 3.693 | 4.012 | 4.120
                          | 512   | 3.115 | 2.993 | 2.954 | 3.028 | 3.506
                          | 2048  | 2.710 | 2.553 | 2.512 | 2.564 | 7.316
SwiGLU Nonlinearity       | 128   | 3.800 | 3.740 | 3.715 | 4.090 | 7.024
                          | 512   | 3.070 | 2.975 | 2.953 | 3.175 | 6.863
                          | 2048  | 2.677 | 2.536 | 2.505 | 2.553 | 4.571
Squared ReLU Nonlinearity | 128   | 3.808 | 3.735 | 3.686 | 3.999 | 4.484
                          | 512   | 3.071 | 2.964 | 2.929 | 3.184 | 7.299
                          | 2048  | 2.666 | 2.516 | 2.482 | 2.532 | 3.259
Multi-Query Attention     | 128   | 3.811 | 3.708 | 3.667 | 3.881 | 4.121
                          | 512   | 3.101 | 2.979 | 2.940 | 3.187 | 3.518
                          | 2048  | 2.715 | 2.564 | 2.521 | 2.546 | 3.257
4x Larger Batch           | 128   | 3.844 | 3.735 | 3.697 | 3.716 | 10.380
                          | 512   | 3.141 | 2.990 | 2.965 | 3.305 | 10.373
                          | 2048  | 2.745 | 2.556 | 2.541 | 2.697 | 7.197
4x Smaller Batch          | 128   | 3.855 | 3.774 | 3.736 | 3.945 | 4.104
                          | 512   | 3.120 | 3.011 | 2.977 | 3.024 | 3.521
                          | 2048  | 2.714 | 2.568 | 2.527 | 2.549 | 3.223
RMSNorm Gains (Vector)    | 128   | 3.842 | 3.744 | 3.689 | 3.670 | 3.681
                          | 512   | 3.101 | 2.992 | 2.951 | 2.950 | 3.412
                          | 2048  | 2.692 | 2.553 | 2.609 | 2.605 | 3.169
RMSNorm Gains (Scalar)    | 128   | 3.843 | 3.749 | 3.692 | 3.670 | 4.471
                          | 512   | 3.106 | 3.000 | 2.961 | 2.959 | 3.515
                          | 2048  | 2.704 | 2.570 | 2.525 | 2.542 | 3.334
SP Attention Scale        | 128   | 3.836 | 3.758 | 3.905 | 4.140 | 4.597
                          | 512   | 3.104 | 2.993 | 2.962 | 3.449 | 4.184
                          | 2048  | 2.706 | 2.555 | 2.525 | 3.306 | 7.280
Decoupled Weight Decay    | 128   | 3.760 | 3.679 | 3.694 | 3.741 | 4.011
                          | 512   | 3.057 | 2.963 | 2.957 | 3.139 | 3.373
                          | 2048  | 2.686 | 2.535 | 2.502 | 3.123 | 6.594
Lion Optimizer            | 128   | 3.708 | 3.736 | 4.057 | 4.344 | 10.380
                          | 512   | 2.952 | 2.947 | 3.416 | 3.961 | 10.285
                          | 2048  | 2.519 | 2.511 | 3.151 | 10.377 | 10.377

2 Background and Notation

This paper focuses on decoder-only transformer models, which process sequences of tokens z ∈ {0, …, V−1}^C, where V is called the vocabulary size and C the context length. This architecture has three components: embeddings, transformer blocks, and unembeddings. We describe a pre-norm transformer decoder (Radford et al., 2019) of depth L.

2.1 Embeddings

The token sequence z ∈ {0, …, V−1}^C is used to index into an embedding matrix W^E ∈ ℝ^{V×M}, where M is called the model width. The resulting real-valued vectors are written as rows of an activation matrix X^0 ∈ ℝ^{C×M} according to the formula X^0_i = W^E_{z_i}.

2.2 Transformer Blocks

A transformer block consists of two residual blocks (He et al., 2016), denoted MHA and MLP, which are added to a 'residual stream' according to the formula

X^ℓ = X^{ℓ-1} + MHA(X^{ℓ-1}) + MLP(X^{ℓ-1} + MHA(X^{ℓ-1})).    (1)

The MHA residual block performs multi-head self-attention, defined by a head width D ∈ ℕ and a number of heads H ∈ ℕ. For each head, MHA uses a distinct set of projections W^{AQ}, W^{AK}, W^{AV} ∈ ℝ^{M×D} to perform the following computations given input X ∈ ℝ^{C×M}:

Y = LayerNorm(X)    (2)
Q = Y W^{AQ}    (3)
K = Y W^{AK}    (4)
V = Y W^{AV}    (5)
S = τ^{-1} Q K^⊤ + 𝐌    (6)
P = Softmax(S)    (7)
O = P V    (8)

LayerNorm and Softmax are applied row-wise, τ^{-1} > 0 is a scalar constant commonly set to 1/√D, and 𝐌 is a causal mask given by 𝐌_{i,j} = −∞ if i < j and 𝐌_{i,j} = 0 otherwise. The heads' outputs O are concatenated together, and then projected using one additional matrix W^{AO} ∈ ℝ^{HD×M} to form the residual MHA(X). This residual is summed onto the residual stream as in Equation 1, and the sum is processed by the MLP residual block.

The MLP residual block applies a multi-layer perceptron to each row individually. It is defined via a hidden width F and an element-wise activation φ. It uses two trainable projections W^{FI} ∈ ℝ^{M×F} and W^{FO} ∈ ℝ^{F×M}. Given an input tensor X, it defines the residual:

Y = LayerNorm(X)    (9)
O = φ(Y W^{FI}) W^{FO}    (10)

This residual is likewise summed onto the residual stream, following Equation 1.
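
For readers who prefer code, the block above is small enough to write out directly. Below is a minimal single-sequence JAX sketch of Equations 1-10, using the gain-free RMSNorm from Section 4.1.2 in place of LayerNorm; the function names, parameter shapes, and dictionary layout are illustrative assumptions, not the paper's actual Flax implementation.

import jax
import jax.numpy as jnp

def rms_norm(x, eps=1e-6):
    # RMS LayerNorm without trainable gains, applied to each row.
    return x * jax.lax.rsqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)

def mha(x, wq, wk, wv, wo, tau_inv):
    # wq, wk, wv: (H, M, D); wo: (H*D, M); x: (C, M).
    c = x.shape[0]
    y = rms_norm(x)                                          # Eq. 2
    q = jnp.einsum('cm,hmd->hcd', y, wq)                     # Eq. 3
    k = jnp.einsum('cm,hmd->hcd', y, wk)                     # Eq. 4
    v = jnp.einsum('cm,hmd->hcd', y, wv)                     # Eq. 5
    mask = jnp.where(jnp.tril(jnp.ones((c, c), bool)), 0.0, -jnp.inf)
    s = tau_inv * jnp.einsum('hcd,hkd->hck', q, k) + mask    # Eq. 6
    p = jax.nn.softmax(s, axis=-1)                           # Eq. 7
    o = jnp.einsum('hck,hkd->hcd', p, v)                     # Eq. 8
    return o.transpose(1, 0, 2).reshape(c, -1) @ wo          # concat heads, project

def mlp(x, wfi, wfo):
    # wfi: (M, F); wfo: (F, M).
    y = rms_norm(x)                                          # Eq. 9
    return jax.nn.relu(y @ wfi) @ wfo                        # Eq. 10 with phi = ReLU

def block(x, params, tau_inv):
    # Eq. 1: two residual additions onto the stream.
    x = x + mha(x, params['wq'], params['wk'], params['wv'], params['wo'], tau_inv)
    return x + mlp(x, params['wfi'], params['wfo'])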

2.3 Unembedding

The unembedding layer uses a matrix W^U ∈ ℝ^{M×V} to produce the probabilities for next-token prediction. The layer's input is the residual stream output X^L, and its output is

Y = LayerNorm(X^L)    (11)
U = Softmax(Y W^U)    (12)

Due to the softmax, each row of U ∈ ℝ^{C×V} defines a probability mass function over tokens in the vocabulary. The model is trained on the cross-entropy loss −(1/C) Σ_{i=0}^{C−1} log U_{i, z_{i+1}}.
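
Continuing the sketch above (and reusing rms_norm from it), the unembedding and loss can be written as follows; treating position i as predicting token z[i+1], so that only the first C−1 positions carry a target, is an assumption about label layout rather than a detail taken from the paper.

def unembed_loss(x_l, w_u, z):
    # x_l: (C, M) residual-stream output; w_u: (M, V); z: (C,) token ids.
    y = rms_norm(x_l)                                  # Eq. 11
    logp = jax.nn.log_softmax(y @ w_u, axis=-1)        # log of the rows of U, Eq. 12
    c = logp.shape[0]
    # Cross entropy: position i is scored on the following token z[i+1].
    return -jnp.mean(logp[jnp.arange(c - 1), z[1:]])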

3 μ-Transfer

The μ-Parameterization (μP) (Yang & Hu, 2021; Yang et al., 2022; 2023b) refers to a specific family of initializations and learning rates that reportedly allow hyperparameter transfer from small to large models. This paper investigates μP for transformers with respect to width. We do not consider depthwise μP (Yang et al., 2023b) because it requires one linear layer per residual block, while transformers require at least two.

The general formulation of μP when training with Adam (Kingma & Ba, 2014) and using an i.i.d. Gaussian initialization is given by Yang et al. (2022). The first three columns of Table 2 display these rules for transformers. These columns use big-theta notation: formally, f(x) = Θ(g(x)) if there exist x_0 ∈ ℝ and c, C > 0 such that c·g(x) ≤ f(x) ≤ C·g(x) for all x > x_0.

Table 2: μP scaling rules for transformers. The first three columns give the general rules in big-Θ notation; the last two give the exact rules used in this paper.

Param  | Init Variance (Θ) | Adam LR (Θ) | Init Variance (Exact) | Adam LR (Exact)
W^E    | 1                 | 1           | 1                     | α
W^AQ   | 1/M               | 1/M         | 1/M                   | αP/M
W^AK   | 1/M               | 1/M         | 1/M                   | αP/M
W^AV   | 1/M               | 1/M         | 1/M                   | αP/M
W^AO   | 1/(HD)            | 1/(HD)      | 1/M                   | αP/M
W^FI   | 1/M               | 1/M         | 1/M                   | αP/M
W^FO   | 1/F               | 1/F         | 0.25/M                | αP/M
W^U    | 1/M^2             | 1/M         | 1/M^2                 | αP/M

In the remainder of this paper, we assume HD = M and F = 4M. In our experiments we fix a proxy model width P = 128 and head width D = 128, and follow the specific scaling rules in the last two columns of Table 2, where α denotes the base learning rate, so named because it is the learning rate for all parameters when M = P. These relative scaling rules are a special case of those in Appendix B.1 of Yang et al. (2022).
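
To make the exact rules concrete, the helper below maps each parameter matrix from Table 2 to its initialization variance and Adam learning rate under the assumptions above (HD = M, F = 4M, proxy width P = 128); the dictionary keys mirror the notation of Section 2 and are only illustrative.

def mup_rules(m, alpha, p=128):
    # Exact rules from the last two columns of Table 2.
    hidden = dict(init_var=1.0 / m, lr=alpha * p / m)
    return {
        'W_E':  dict(init_var=1.0, lr=alpha),                   # embedding
        'W_AQ': hidden, 'W_AK': hidden, 'W_AV': hidden,          # attention input projections
        'W_AO': hidden,                                          # attention output (1/(HD) = 1/M)
        'W_FI': hidden,                                          # MLP input
        'W_FO': dict(init_var=0.25 / m, lr=alpha * p / m),       # MLP output (1/F = 0.25/M)
        'W_U':  dict(init_var=1.0 / m ** 2, lr=alpha * p / m),   # unembedding
    }

At M = P every learning rate reduces to α, matching the definition of the base learning rate; at larger widths the non-embedding learning rates shrink in proportion to P/M.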

In addition, μP uses an attention scale of τ^{-1} = Θ(1/D) instead of the usual τ^{-1} = 1/√D. For simplicity, we use τ^{-1} = 1/D, since in preliminary experiments we observed only a small improvement from using smaller multiples of 1/D. Note that for D fixed across model widths M, any constant τ^{-1} ≠ 0 technically complies with μP (Yang et al., 2022), but in the experiments τ^{-1} will be shown to have a major impact on performance and transfer.

It is also possible to add scalar multipliers throughout the network as hyperparameters. For simplicity, we focus on μ-Transfer of the base learning rate.

4 Experiments

4.1 Experimental Setup

4.1.1 Implementation

Our experiments are implemented using Jax/Flax. Training is performed on TPU pod slices, using the fully-sharded data parallelism (FSDP) strategy from Xu et al. (2021) to reduce memory overhead. Models train on the Colossal Clean Crawled Corpus (C4) dataset, using the T5 tokenizer (Raffel et al., 2019) with context length C = 256.

The experiments use a bitwise-deterministic training pipeline, with shards of data written to disk in a random-access format similar to Nvidia Megatron (Shoeybi et al., 2020). Distributed model checkpoints are saved periodically, and the reported validation loss is computed on the best-performing checkpoint. The same seed is used for all experiments; due to time constraints, each experiment is run once.

4.1.2 Configuration

We use the following default configuration, deviating from it only if specifically mentioned. The depth is fixed at L = 24, and we consider model widths M ∈ {128, 512, 2048}, yielding three model sizes ranging from 4.7M to 1.2B non-embedding parameters. The head width is fixed at D = 128, the number of heads is H = M/D, and the MLP hidden width is F = 4M. The models use RMS LayerNorm without gains (Zhang & Sennrich, 2019), linear projections without biases (Raffel et al., 2019), RoPE on the queries and keys (Su et al., 2021), and ReLU for the MLP nonlinearity (Vaswani et al., 2017; Raffel et al., 2019).

By default, we use 2^18 tokens per batch, float32 parameters, and bfloat16 activations/gradients (during evaluation, output logits are computed in float32). The optimizer is AdamW (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.98, ε = 10^{-9}, with default weight decay 0.0 and gradient clip 1.0. Models train for 125K steps total, with 10K steps of learning rate warmup followed by linear decay to zero.
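
As a rough illustration of these defaults, the optax sketch below builds the warmup-then-linear-decay schedule and AdamW settings described above; the per-parameter μP scaling from Table 2 would still have to be applied on top (for example via parameter-wise learning rate multipliers), and this particular construction is our assumption rather than the paper's code.

import optax

def make_optimizer(base_lr, warmup_steps=10_000, total_steps=125_000):
    # 10K steps of linear warmup, then linear decay to zero by step 125K.
    schedule = optax.join_schedules(
        [optax.linear_schedule(0.0, base_lr, warmup_steps),
         optax.linear_schedule(base_lr, 0.0, total_steps - warmup_steps)],
        boundaries=[warmup_steps])
    return optax.chain(
        optax.clip_by_global_norm(1.0),                      # gradient clip 1.0
        optax.adamw(schedule, b1=0.9, b2=0.98, eps=1e-9,
                    weight_decay=0.0))                       # default weight decay 0.0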

4.2 Primary Experiments

In our primary experiments, we sweep the base learning rate α ∈ {2^{-2j} : j ∈ ℕ, 1 ≤ j ≤ 5} for each model size and experiment setting, and we report all results. This allows us to investigate the impact of various experimental conditions on model quality and on the transferability of learning rates under the μP scaling rules. We focus on learning rate transfer because it is the main hyperparameter of interest for large transformer models.

4.2.1 Baseline

In our first experiment group, we establish model-quality baselines against which the other experiment groups can be compared. In addition, we verify that μ-Transfer works reliably even with mixed-precision training. For this purpose, we utilize the Google Brain floating point format, bfloat16, for the activations and gradients. This format is supported by Google TPUs and recent Nvidia GPUs, and was used with μP in a contemporary work (Dey et al., 2023).

As shown in Table 1, the learning rates transfer reliably across model sizes under μP. Despite each model being 4x wider (and 16x larger) than the last, the smallest model's optimal base learning rate α directly predicts the optimum in our sweeps for the larger models.

4.2.2 Projection Biases

It is not a priori clear if trainable bias vectors in linear layers are beneficial for model quality, and several prior works omit them (Raffel et al., 2019; Shazeer, 2020; Chowdhery et al., 2023). Here, we test their benefit and their impact on learning rate transferability under μP.

As shown in Table 1, the learning rates appear to transfer across model sizes under μP. However, for the smallest and largest models, biases do not appear to improve quality versus the baseline when the learning rate is optimal.

4.2.3 RMSNorm Gains

It is not a priori clear if trainable scale vectors ('gains') in RMSNorm (Zhang & Sennrich, 2019) are beneficial for model quality, and many frameworks offer the option to omit them. This ablation tests their benefit and their impact on learning rate transferability under μP. We also test a variant where the trainable gain vector is replaced with a trainable scalar multiplier, similar to Elhage et al. (2023).

As shown in Table 1, optimal learning rates for these models do not reliably transfer when using Θ(1) learning rate scaling for the gains, despite the fact that the 'coordinate size' of the features before and after RMS normalization is Θ(1) with respect to width by design. In addition to the lack of transfer in these experiments, we find trainable gains harm the quality of the largest μP models when the base learning rate α is optimal.

4.2.4 Query Initialization

The usual μP initialization for the query projections W^{AQ} is Gaussian with variance Θ(1/M). One alternative is to use zero-initialized query projections, which yield equal attention weights over all past timesteps at initialization. This change was recommended by Yang et al. (2022) to improve transfer, so we investigate its effects as well.

As shown in Table 1, the learning rates transfer across model sizes when using μP with zero-initialized query projections. There is also a slight yet consistent improvement in loss.

4.2.5 SP Attention Scale

The usual attention scale τ^{-1} = 1/√D was first proposed by Vaswani et al. (2017) and has generally been used since. However, μP proposes τ^{-1} = Θ(1/D), and we use τ^{-1} = 1/D. Notably, in our experiments we scale the model width M and keep the head width D fixed across model sizes, so the attention scale should not actually matter for purposes of transfer; any difference between 1/√D and 1/D can be treated as a constant multiplier. Nonetheless, we investigate the effect of using the standard 1/√D attention scale.

As shown in Table 1, the SP attention scale 1/√D appears quite suboptimal, harming performance relative to the baselines across all model sizes. Interestingly, this 11.3× larger attention scale also prevented transfer of the optimal learning rate, despite the constant attention head width D = 128 across models. Given this result, our recommendation is to use τ^{-1} ≤ 1/D if applying μ-transfer with small proxy models.

4.2.6 SP Unembedding Initialization

The μP initialization for the unembedding matrix W^U is a Gaussian distribution with variance Θ(1/M^2), while the so-called standard parameterization (SP) uses 1/M (Yang et al., 2022). We thus ablate the impact of using the standard initialization on performance and transfer.

As shown in Table 1, despite using SP initialization for the unembedding projection, the learning rates empirically transfer across model sizes. There also appears to be a small but consistent improvement from using the SP initialization for the unembedding projection.

4.2.7 Cosine Schedule

It is not immediately clear whether the linear learning rate schedule in the baseline settings is the optimal choice. In this ablation, we instead use a cosine schedule, decaying to zero.

As shown in Table 1, the learning rates transfer across model sizes, showing that μP is compatible with a cosine schedule as well as the linear schedule used by the baseline. Our tentative conclusion is that the schedule is unlikely to interfere with learning rate transfer.

4.2.8 Weight Decay

Decoupled weight decay for Adam (Loshchilov & Hutter, 2017) is often used to train transformers. In common libraries such as PyTorch and Optax, its decay rate λ is multiplied by the current learning rate rather than by just the schedule multiplier; the latter convention was recently investigated by Wortsman et al. (2023) under the name 'independent weight decay'. We focus on the version used by these libraries, as we found the 'independent' variant inevitably led to instability in the larger models that was not seen in the smaller ones, since the optimal learning rate tended to decrease with model size even while λ stayed constant.
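
To make the distinction explicit, the two conventions can be written as single-tensor update rules, where adam_step is the usual Adam direction, lr_t is the base learning rate times the schedule at step t, and the function names are ours:

def decoupled_update(param, adam_step, lr_t, lam):
    # PyTorch/Optax convention: the decay rate lam is multiplied by the full
    # current learning rate, so the decay weakens whenever the base LR shrinks.
    return param - lr_t * (adam_step + lam * param)

def independent_update(param, adam_step, lr_t, schedule_t, lam):
    # 'Independent' decay (Wortsman et al., 2023): lam is multiplied only by the
    # schedule multiplier, so its strength does not track the base LR.
    return param - lr_t * adam_step - schedule_t * lam * param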

As shown in Table 1, when using decoupled weight decay, the optimal base learning rate α may not always transfer from the smallest model to the larger ones. On the other hand, these experiments use the strongest typical setting of λ = 0.1, and yet the optimum for α is not visibly changed between the two larger models. Furthermore, this optimum for α also matches the optimum for the baselines, which is the same across model sizes.

Based on these observations, decoupled weight decay appears unlikely to significantly alter the optimal learning rate for the target model when λ ≤ 0.1 and M ≫ P, so one strategy would be to omit it when training the proxy model and apply it to the target model only.

4.2.9 Embedding Normalization

We consider using normalized embeddings following Peng et al. (2023) and Gemma (2024), using RMSNorm without trainable gains (Zhang & Sennrich, 2019). We do not change the learning rate from the setting in Table 2, nor do we adjust the initialization.

As shown in Table 1, the optimal learning rate transfers across models using μP; however, the improvement in model quality over the baseline is negligible. We briefly investigated using lower initialization variances, but found this harmed stability with the width-constant embedding learning rate of μP Adam, possibly due to similar effects.

4.2.10 Multiplicative Nonlinearities

Multiplicative nonlinearities such as SwiGLU (Shazeer, 2020) and Squared ReLU (So et al., 2021) are increasingly used in MLP blocks to improve transformer quality (Touvron et al., 2023a; b; Elsen et al., 2023; Peng et al., 2023; Parmar et al., 2024). In this experiment, we investigate both of the aforementioned nonlinearities, which are notably 'superlinear' and thus may create outliers that interfere with μ-Transfer, as discussed by Yang et al. (2022). For SwiGLU, we use F = 5M, so the MLP has 7.5M^2 parameters.

As shown in Table 1, the SwiGLU and Squared ReLU nonlinearities both allow μ-transfer of the learning rate across model sizes. The outcomes here contrast nicely with the RMSNorm gains experiments, since transfer occurs despite the multiplicative interactions.

4.3 Lion Optimizer

We empirically investigate whether the Lion optimizer (Chen et al., 2023b; a) is compatible with μ-Transfer. This optimizer is at least twice as memory-efficient as the Adam optimizer, and was reported to yield models of similar quality, including transformers (Chen et al., 2023b). A notable property of this optimizer is that its updates are constrained to {−1, +1} per coordinate, yielding a coordinate size of Θ(1) per step. As a result, a Θ(1/M) transfer rule for the weight learning rates, similar to μP Adam, might be appropriate (Yang et al., 2022).
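
For reference, a minimal single-tensor sketch of the Lion update, which makes the Θ(1) coordinate size of its steps explicit; the default betas here are the commonly reported ones and are not necessarily those used in our sweep.

import jax.numpy as jnp

def lion_update(param, grad, m, lr, b1=0.9, b2=0.99, wd=0.0):
    direction = jnp.sign(b1 * m + (1.0 - b1) * grad)   # every entry is -1 or +1
    new_param = param - lr * (direction + wd * param)  # sign step plus decoupled decay
    new_m = b2 * m + (1.0 - b2) * grad                 # momentum tracks raw gradients
    return new_param, new_m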

As shown in Table 1, the Lion optimizer did not admit transfer of the base learning rate from the smallest model size. The scaling rules do appear to preserve the optimal α between the larger models, but further investigations beyond the scope of this project are needed.

4.4 Multi-Query Attention

Multi-Query Attention (Shazeer, 2019) and its grouped generalization (Ainslie et al., 2023) are increasingly used in transformer LLMs (Chowdhery et al., 2023; Touvron et al., 2023b; Almazrouei et al., 2023; Gemini Team et al., 2023; Jiang et al., 2024). These techniques aim to improve the inference speed of transformers by sharing keys/values across multiple heads. This ablation investigates the impact of the shared keys/values on μ-Transfer. Similar to Shazeer (2019), we approximately correct for the resulting parameter reduction by setting F = 5M.
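
A hedged sketch of the multi-query variant, written against the mha helper from Section 2: the H query heads are kept, while a single (M, D) key projection and a single (M, D) value projection are shared across all of them.

def mqa(x, wq, wk, wv, wo, tau_inv):
    # wq: (H, M, D) per-head queries; wk, wv: (M, D) shared; wo: (H*D, M).
    c = x.shape[0]
    y = rms_norm(x)
    q = jnp.einsum('cm,hmd->hcd', y, wq)
    k = y @ wk                                   # one key head shared by all query heads
    v = y @ wv                                   # one value head shared by all query heads
    mask = jnp.where(jnp.tril(jnp.ones((c, c), bool)), 0.0, -jnp.inf)
    s = tau_inv * jnp.einsum('hcd,kd->hck', q, k) + mask
    p = jax.nn.softmax(s, axis=-1)
    o = jnp.einsum('hck,kd->hcd', p, v)
    return o.transpose(1, 0, 2).reshape(c, -1) @ wo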

As shown in Table 1, multi-query attention is compatible with μ-Transfer.

4.4.1 4x Larger Batch

Large-batch training can reduce wall time, but may also have a considerable influence on the training dynamics (McCandlish et al., 2018; You et al., 2019). In this section, we consider scaling up the batch size by 4× while keeping the number of training tokens the same. For this ablation, we adopt the learning rate scaling rule from You et al. (2019); Malladi et al. (2022), so that each formula in Table 2 is scaled by 2×.
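
Concretely, this square-root rule multiplies every learning rate from Table 2 by the square root of the batch-size ratio, which gives the 2× factor here and the 0.5× factor in the next section; a one-line sketch, with the function name ours:

import math

def batch_adjusted_lr(table2_lr, batch_ratio):
    # batch_ratio = new tokens per batch / default tokens per batch.
    # 4x larger batch -> 2x learning rate; 4x smaller batch -> 0.5x.
    return table2_lr * math.sqrt(batch_ratio)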

As shown in Table 1, the 4× larger batch size admits transfer of the learning rate via μP.

4.4.2 4x Smaller Batch

An important question is whether μ-Transfer requires a minimum batch size to work (Dey et al., 2023). In this section, we consider scaling the batch size down by 4×, while keeping the number of training tokens the same. For this ablation, we again adopt the learning rate scaling rule from You et al. (2019); Malladi et al. (2022), so that each formula in Table 2 is scaled by 0.5×.

As shown in Table 1, the 4× smaller batch size admits transfer of the learning rate via μP.

4.5 Comparison with SP

Next, we verify the claim from Yang et al. (2022) that μP models outperform models using the standard parameterization (SP). For the SP baseline, we use attention scale 1/√D, unembedding initialization variance 1/M, trainable biases in linear layers, and trainable gains in RMSNorm layers. The learning rate for each model size is swept over. All other hyperparameters are identical to the μP baseline.

Table 3: Validation loss for the standard parameterization (SP) baseline, by model width and learning rate.

Width | 2^-10 | 2^-8  | 2^-6  | 2^-4  | 2^-2
128   | 3.841 | 3.757 | 3.706 | 3.879 | 4.030
512   | 3.013 | 2.967 | 2.987 | 3.383 | 7.403
2048  | 2.738 | 2.902 | 7.247 | 7.477 | 7.314

As shown in Table 3, the optimal loss at each model width is worse than the corresponding loss for the baseline μP model given in Table 1. This suggests that using μP not only facilitates hyperparameter transfer, but can improve the model loss as well.

4.6 Large-Scale Transfer Experiment

In this experiment, we combine the architectural choices that both transferred and improved performance, and investigate whether μ-Transfer continues to work as desired at scale.

We use depth L = 12 and consider widths M ∈ {128, 512, 2048, 8192}, yielding models with approximately 2M, 40M, 600M, and 10B non-embedding parameters, respectively. We use zero-initialized queries (Yang et al., 2022) and the Squared ReLU nonlinearity (So et al., 2021). We use 2^21 tokens per batch, training for 90K steps. We use Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.95, ε = 10^{-8}, decoupled weight decay 0.1, and gradient clip 1.0.

Table 4: Validation loss for the large-scale transfer experiment, by parameter count, model width, and base learning rate.

Params | Width | 2^-8  | 2^-6  | 2^-4
2M     | 128   | 3.791 | 3.766 | 3.814
40M    | 512   | 3.016 | 2.983 | 3.004
600M   | 2048  | 2.513 | 2.459 | 2.466
10B    | 8192  | 2.238 | 2.167 | 2.169

As shown in Table 4, the optimal learning rate transfers from a model about 5000× smaller. This shows that μ-Transfer continues to predict the optimal learning rate at scale. Moreover, the outcome suggests 'emergent outliers' may not be a source of interference for μ-Transfer, given that these reportedly appear at around 7B parameters (Dettmers, 2022).

5 Related Works

The μ-Parameterization (μP) is proposed as part of the Tensor Programs series (Yang, 2019; 2020; 2021; Yang & Hu, 2021; Yang et al., 2022; Littwin & Yang, 2023; Yang et al., 2023b; a).

The empirical demonstration of zero-shot hyperparameter transfer under μP was given in Yang et al. (2022). Notably, the largest model trained in that report had 6.7B parameters. It used FP32 computation for numerical stability, as well as a different position encoding mechanism and a different learning rate schedule than the FP16 baseline. This left an open question of whether μ-Parameterized transformers could be made stable at large scale. Moreover, the large-scale experiment did not demonstrate that μ-Transfer predicted the hyperparameter optima for the 6.7B target model.

Some recent works have adopted μP for hyperparameter tuning (Dey et al., 2023; Hu et al., 2024; XAI, 2024), but did not provide any evidence that the hyperparameter optimum is preserved under μ-Transfer when the target model is very large. Furthermore, Dey et al. (2023) trained a suite of models, but only used μP for transformers with up to 2.7B parameters, while their largest model using a standard parameterization had 13B parameters. This left open the question of whether μ-Transfer works reliably on larger-scale target models.

Some recent works suggest using μ-Transfer to avoid large-scale experiments entirely (Yao & Wang, 2023; Fan et al., 2024). These works use power laws to predict the loss of larger models from smaller ones (Kaplan et al., 2020; Hoffmann et al., 2022; So et al., 2021), and use μP for hyperparameter tuning when fitting the smaller reference models. However, similar to the other works, these papers do not study whether μ-Transfer correctly predicts the optimal hyperparameters themselves when the target model is very large.

A potential alternative to μP was recently proposed by DeepSeek-AI et al. (2024), namely a scaling law for the optimal learning rate solely in terms of compute budget. However, empirically derived scaling laws are strongly affected by the choice of independent variables and the fitted data, so the fitted scaling law may not transfer to other setups.

Other recent work has investigated transformer training instabilities via small proxy models (Wortsman et al., 2023), proposing architectural adjustments that reduce the sensitivity of the loss to the learning rate rather than predicting the optimal one; their method notably applies to depthwise scaling of transformer models. Automatic Gradient Descent and hypergradient methods (Bernstein et al., 2023; Baydin et al., 2017; Chandra et al., 2019) tune the learning rate of optimizers during training. These methods have a high implementation complexity and might also incur a performance hit versus vanilla hyperparameter tuning, which μP makes affordable via proxy models.

6 Conclusion

This paper studied the reliability of μ-Transfer of learning rates, focusing on transformers. In our experiments, μ-Transfer worked as desired in most cases, including with multiplicative nonlinearities, multi-query attention, and large/small-batch training. However, μP did not admit transfer when using trainable gain parameters or too large an attention scale. The simple μP recipe used in this work also outperforms the 'standard parameterization' commonly used for transformers.

Lastly, we found that μ-Transfer from a 2M-parameter model predicted the optimal learning rate from a sweep at the scale of 10B parameters. To the best of our knowledge, this is the largest target model for which this property has been verified. We hope these findings are helpful to the research community and inspire further work on hyperparameter transfer.

Acknowledgments

The author thanks Oleg Filatov, Stella Biderman, Lucas Nestler, and Hailey Schoelkopf for helpful remarks during the project and Oleg for feedback on the manuscript. The author is grateful to Google TPU Research Cloud for supporting the experiments with Cloud TPUs.

References

  • Ainslie et al. (2023) Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023. URL https://arxiv.org/abs/2305.13245.
  • Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The Falcon series of open language models, 2023. URL https://arxiv.org/abs/2311.16867.
  • Anthropic (2024) Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024. Accessed: 2024-03-13.
  • Baydin et al. (2017) Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent, 2017. URL https://arxiv.org/abs/1703.04782.
  • Bernstein et al. (2023) Jeremy Bernstein, Chris Mingard, Kevin Huang, Navid Azizan, and Yisong Yue. Automatic gradient descent: Deep learning without hyperparameters, 2023. URL https://arxiv.org/abs/2304.05187.
  • Chandra et al. (2019) Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley, and Erik Meijer. Gradient descent: The ultimate optimizer, 2019. URL https://arxiv.org/abs/1909.13371.
  • Chen et al. (2023a) Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves constrained optimization: As Lyapunov predicts, 2023a. URL https://arxiv.org/abs/2310.05898.
  • Chen et al. (2023b) Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms, 2023b. URL https://arxiv.org/abs/2302.06675.
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023. URL http://jmlr.org/papers/v24/22-1144.html.
  • DeepSeek-AI et al. (2024) DeepSeek-AI, Xiao Bi, Deli Chen, et al. DeepSeek LLM: Scaling open-source language models with longtermism, 2024. URL https://arxiv.org/abs/2401.02954.
  • Dehghani et al. (2023) Mostafa Dehghani, Josip Djolonga, Basil Mustafa, et al. Scaling vision transformers to 22 billion parameters, 2023. URL https://arxiv.org/abs/2302.05442.
  • Dettmers (2022) Tim Dettmers. LLM.int8() and emergent features. https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/, 2022. Accessed: 2024-03-09.
  • Dey et al. (2023) Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster, 2023. URL https://arxiv.org/abs/2304.03208.
  • Elhage et al. (2023) Nelson Elhage, Robert Lasenby, and Christopher Olah. Privileged bases in the transformer residual stream. https://transformer-circuits.pub/2023/privileged-basis/index.html, 2023. Accessed: 2024-03-09.
  • Elsen et al. (2023) Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, and Arushi Somani. Releasing Persimmon-8B, 2023. URL https://www.adept.ai/blog/persimmon-8b.
  • Fan et al. (2024) Siqi Fan, Xiusheng Huang, Xuezhi Fang, Yiqun Yao, Xiang Li, Ziyi Ni, Xin Jiang, Xuying Meng, Peng Han, Shuo Shang, Kang Liu, Aixin Sun, and Yequan Wang. NanoLM: An affordable LLM study benchmark via accurate loss prediction across scales, 2024. URL https://openreview.net/forum?id=mao3y822aM.
  • Gemini Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, et al. Gemini: A family of highly capable multimodal models, 2023. URL https://arxiv.org/abs/2312.11805.
  • Gemma (2024) Gemma Team. Gemma: Open models based on Gemini research and technology, 2024. URL https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks, 2016. URL https://arxiv.org/abs/1603.05027.
  • Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.
  • Hu et al. (2024) Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Kaihuo Zhang, Yuxiang Huang, Zhenning Dai, Baitao Gong, Chongyi Wang, Yuan Yao, Jie Zhou, Jie Cai, Xinrong Zhang, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of end-side large language models. https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20?pvs=4, 2024. Accessed: 2024-03-09.
  • Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024. URL https://arxiv.org/abs/2401.04088.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
  • Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. URL https://arxiv.org/abs/1412.6980.
  • Littwin & Yang (2023) Etai Littwin and Greg Yang. Adaptive optimization in the $\infty$-width limit. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zgVDqw9ZUES.
  • Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URL https://arxiv.org/abs/2304.08485.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101.
  • Malladi et al. (2022) Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora. On the SDEs and scaling rules for adaptive gradient algorithms, 2022. URL https://arxiv.org/abs/2205.10287.
  • McCandlish et al. (2018) Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. CoRR, abs/1812.06162, 2018. URL http://arxiv.org/abs/1812.06162.
  • Nguyen & Salazar (2019) Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention, 2019. URL https://arxiv.org/abs/1910.05895.
  • OpenAI etal. (2024)OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya,FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman,Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom,Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, JakeBerdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, OlegBoiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, MilesBrundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, BrittanyCarey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, FotisChantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, BenChess, Chester Cho, Casey Chu, HyungWon Chung, Dave Cummings, JeremiahCurrier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, DamienDeville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, AdrienEcoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix,SimónPosada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges,Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes,Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross,ShixiangShane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, YuchenHe, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey,Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, JoostHuizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang,Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan,Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, NitishShirish Keskar,Tabarak Khan, Logan Kilpatrick, JongWook Kim, Christina Kim, Yongjik Kim,JanHendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, ŁukaszKondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, GretchenKrueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, JadeLeung, Daniel Levy, ChakMing Li, Rachel Lim, Molly Lin, Stephanie Lin,Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, KimMalfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, KatieMayer, Andrew Mayne, Bob McGrew, ScottMayer McKinney, Christine McLeavey,Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, LukeMetz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, DanielMossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, ReiichiroNakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, LongOuyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, AshleyPantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, AlexPassos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe deAvilaBelbutePeres, Michael Petrov, HenriquePonde deOliveiraPinto, Michael,Pokorny, Michelle Pokrass, VitchyrH. Pong, Tolly Powell, Alethea Power,Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, AdityaRamesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, BobRotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, ShibaniSanturkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman,Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker,Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin,Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher,FelipePetroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, NikolasTezak, MadeleineB. 
Thompson, Phil Tillet, Amin Tootoonchian, ElizabethTseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan FelipeCerón Uribe,Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright,JustinJay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJWeinmann,Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, DaveWillner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, SherwinWu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan,Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao,Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph.Gpt-4 technical report, 2024.URL https://arxiv.org/abs/2303.08774.
  • Parmar et al. (2024) Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, and Bryan Catanzaro. Nemotron-4 15B technical report, 2024. URL https://arxiv.org/abs/2402.16819.
  • Peng et al. (2023) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. RWKV: Reinventing RNNs for the transformer era, 2023. URL https://arxiv.org/abs/2305.13048.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Last visited on 2023/09/07.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019. URL https://arxiv.org/abs/1910.10683.
  • Reid etal. (2024)Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, TimothyLillicrap, Jean baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, OrhanFirat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, SebastianBorgeaud, Andrew Dai, Katie Millican, Ethan Dyer, Mia Glaese, ThibaultSottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, JamesMolloy, Jilin Chen, Michael Isard, Paul Barham, Tom Hennigan, Ross McIlroy,Melvin Johnson, Johan Schalkwyk, Eli Collins, Eliza Rutherford, EricaMoreira, Kareem Ayoub, Megha Goel, Clemens Meyer, Gregory Thornton, ZhenYang, Henryk Michalewski, Zaheer Abbas, Nathan Schucher, Ankesh Anand,Richard Ives, James Keeling, Karel Lenc, Salem Haykal, Siamak Shakeri, PranavShyam, Aakanksha Chowdhery, Roman Ring, Stephen Spencer, Eren Sezener, LukeVilnis, Oscar Chang, Nobuyuki Morioka, George Tucker, CeZheng, OliverWoodman, Nithya Attaluri, Tomas Kocisky, Evgenii Eltyshev, XiChen, TimothyChung, Vittorio Selo, Siddhartha Brahma, Petko Georgiev, Ambrose Slone,Zhenkai Zhu, James Lottes, Siyuan Qiao, Ben Caine, Sebastian Riedel, AlexTomala, Martin Chadwick, Juliette Love, Peter Choy, Sid Mittal, Neil Houlsby,Yunhao Tang, Matthew Lamm, Libin Bai, Qiao Zhang, Luheng He, Yong Cheng,Peter Humphreys, Yujia Li, Sergey Brin, Albin Cassirer, Yingjie Miao, LukasZilka, Taylor Tobin, Kelvin Xu, Lev Proleev, Daniel Sohn, Alberto Magni,LisaAnne Hendricks, Isabel Gao, Santiago Ontañón, Oskar Bunyan, NathanByrd, Abhanshu Sharma, Biao Zhang, Mario Pinto, Rishika Sinha, Harsh Mehta,Dawei Jia, Sergi Caelles, Albert Webson, Alex Morris, Becca Roelofs, YifanDing, Robin Strudel, Xuehan Xiong, Marvin Ritter, Mostafa Dehghani, RahmaChaabouni, Abhijit Karmarkar, Guangda Lai, Fabian Mentzer, Bibo Xu, YaGuangLi, Yujing Zhang, TomLe Paine, Alex Goldin, Behnam Neyshabur, Kate Baumli,Anselm Levskaya, Michael Laskin, Wenhao Jia, JackW. 
Rae, Kefan Xiao, AntoineHe, Skye Giordano, Lakshman Yagati, Jean-Baptiste Lespiau, Paul Natsev,Sanjay Ganapathy, Fangyu Liu, Danilo Martins, Nanxin Chen, Yunhan Xu, MeganBarnes, Rhys May, Arpi Vezer, Junhyuk Oh, Ken Franko, Sophie Bridgers, RuizheZhao, Boxi Wu, Basil Mustafa, Sean Sechrist, Emilio Parisotto,ThanumalayanSankaranarayana Pillai, Chris Larkin, Chenjie Gu, ChristinaSorokin, Maxim Krikun, Alexey Guseynov, Jessica Landon, Romina Datta,Alexander Pritzel, Phoebe Thacker, Fan Yang, Kevin Hui, Anja Hauth, Chih-KuanYeh, David Barker, Justin Mao-Jones, Sophia Austin, Hannah Sheahan, ParkerSchuh, James Svensson, Rohan Jain, Vinay Ramasesh, Anton Briukhov, Da-WoonChung, Tamara von Glehn, Christina Butterfield, Priya Jhakra, MatthewWiethoff, Justin Frye, Jordan Grimstad, Beer Changpinyo, CharlineLe Lan,Anna Bortsova, Yonghui Wu, Paul Voigtlaender, Tara Sainath, Charlotte Smith,Will Hawkins, Kris Cao, James Besley, Srivatsan Srinivasan, Mark Omernick,Colin Gaffney, Gabriela Surita, Ryan Burnell, Bogdan Damoc, Junwhan Ahn,Andrew Brock, Mantas Pajarskas, Anastasia Petrushkina, Seb Noury, LorenzoBlanco, Kevin Swersky, Arun Ahuja, Thi Avrahami, Vedant Misra, RaouldeLiedekerke, Mariko Iinuma, Alex Polozov, Sarah York, George vandenDriessche, Paul Michel, Justin Chiu, Rory Blevins, Zach Gleicher, AdriàRecasens, Alban Rrustemi, Elena Gribovskaya, Aurko Roy, Wiktor Gworek, SébArnold, Lisa Lee, James Lee-Thorp, Marcello Maggioni, Enrique Piqueras,Kartikeya Badola, Sharad Vikram, Lucas Gonzalez, Anirudh Baddepudi, EvanSenter, Jacob Devlin, James Qin, Michael Azzam, Maja Trebacz, Martin Polacek,Kashyap Krishnakumar, Shuo yiin Chang, Matthew Tung, Ivo Penchev, RishabhJoshi, Kate Olszewska, Carrie Muir, Mateo Wirth, AleJakse Hartman, JoshNewlan, Sheleem Kashem, Vijay Bolina, Elahe Dabir, Joost van Amersfoort,Zafarali Ahmed, James Cobon-Kerr, Aishwarya Kamath, ArnarMar Hrafnkelsson,LeHou, Ian Mackinnon, Alexandre Frechette, Eric Noland, Xiance Si, EmanuelTaropa, Dong Li, Phil Crone, Anmol Gulati, Sébastien Cevey, Jonas Adler, AdaMa, David Silver, Simon Tokumine, Richard Powell, Stephan Lee, Michael Chang,Samer Hassan, Diana Mincu, Antoine Yang, Nir Levine, Jenny Brennan, MingqiuWang, Sarah Hodkinson, Jeffrey Zhao, Josh Lipschultz, Aedan Pope, MichaelB.Chang, Cheng Li, LaurentEl Shafey, Michela Paganini, Sholto Douglas, BerndBohnet, Fabio Pardo, Seth Odoom, Mihaela Rosca, CiceroNogueira dos Santos,Kedar Soparkar, Arthur Guez, Tom Hudson, Steven Hansen, ChulayuthAsawaroengchai, Ravi Addanki, Tianhe Yu, Wojciech Stokowiec, Mina Khan,Justin Gilmer, Jaehoon Lee, CarrieGrimes Bostock, Keran Rong, JonathanCaton, Pedram Pejman, Filip Pavetic, Geoff Brown, Vivek Sharma, MarioLučić, Rajkumar Samuel, Josip Djolonga, Amol Mandhane, LarsLowe Sjösund,Elena Buchatskaya, Elspeth White, Natalie Clay, Jiepu Jiang, Hyeontaek Lim,Ross Hemsley, Jane Labanowski, NicolaDe Cao, David Steiner, SayedHadiHashemi, Jacob Austin, Anita Gergely, Tim Blyth, Joe Stanton, KaushikShivakumar, Aditya Siddhant, Anders Andreassen, Carlos Araya, Nikhil Sethi,Rakesh Shivanna, Steven Hand, Ankur Bapna, Ali Khodaei, Antoine Miech,Garrett Tanzer, Andy Swing, Shantanu Thakoor, Zhufeng Pan, Zachary Nado,Stephanie Winkler, Dian Yu, Mohammad Saleh, Loren Maggiore, Iain Barr, MinhGiang, Thais Kagohara, Ivo Danihelka, Amit Marathe, Vladimir Feinberg,Mohamed Elhawaty, Nimesh Ghelani, Dan Horgan, Helen Miller, Lexi Walker,Richard Tanburn, Mukarram Tariq, Disha Shrivastava, Fei Xia, Chung-ChengChiu, Zoe Ashwood, Khuslen Baatarsukh, Sina Samangooei, Fred 
Alcober, AxelStjerngren, Paul Komarek, Katerina Tsihlas, Anudhyan Boral, Ramona Comanescu,Jeremy Chen, Ruibo Liu, Dawn Bloxwich, Charlie Chen, Yanhua Sun, FangxiaoyuFeng, Matthew Mauger, Xerxes Dotiwalla, Vincent Hellendoorn, Michael Sharman,Ivy Zheng, Krishna Haridasan, Gabe Barth-Maron, Craig Swanson, DominikaRogozińska, Alek Andreev, PaulKishan Rubenstein, Ruoxin Sang, Dan Hurt,Gamaleldin Elsayed, Renshen Wang, Dave Lacey, Anastasija Ilić, Yao Zhao,Lora Aroyo, Chimezie Iwuanyanwu, Vitaly Nikolaev, Balaji Lakshminarayanan,Sadegh Jazayeri, RaphaëlLopez Kaufman, Mani Varadarajan, Chetan Tekur, DougFritz, Misha Khalman, David Reitter, Kingshuk Dasgupta, Shourya Sarcar, TinaOrnduff, Javier Snaider, Fantine Huot, Johnson Jia, Rupert Kemp, Nejc Trdin,Anitha Vijayakumar, Lucy Kim, Christof Angermueller, LiLao, Tianqi Liu,Haibin Zhang, David Engel, Somer Greene, Anaïs White, Jessica Austin, LillyTaylor, Shereen Ashraf, Dangyi Liu, Maria Georgaki, Irene Cai, YanaKulizhskaya, Sonam Goenka, Brennan Saeta, Kiran Vodrahalli, Christian Frank,Dario deCesare, Brona Robenek, Harry Richardson, Mahmoud Alnahlawi,Christopher Yew, Priya Ponnapalli, Marco Tagliasacchi, Alex Korchemniy, YelinKim, Dinghua Li, Bill Rosgen, Zoe Ashwood, Kyle Levin, Jeremy Wiesner,Praseem Banzal, Praveen Srinivasan, Hongkun Yu, Çağlar Ünlü, David Reid,Zora Tung, Daniel Finchelstein, Ravin Kumar, Andre Elisseeff, Jin Huang, MingZhang, Rui Zhu, Ricardo Aguilar, Mai Giménez, Jiawei Xia, Olivier Dousse,Willi Gierke, SoheilHassas Yeganeh, Damion Yates, Komal Jalan, LuLi, EriLatorre-Chimoto, DucDung Nguyen, Ken Durden, Praveen Kallakuri, Yaxin Liu,Matthew Johnson, Tomy Tsai, Alice Talbert, Jasmine Liu, Alexander Neitz, ChenElkind, Marco Selvi, Mimi Jasarevic, LivioBaldini Soares, Albert Cui, PidongWang, AlekWenjiao Wang, Xinyu Ye, Krystal Kallarackal, Lucia Loher, Hoi Lam,Josef Broder, Dan Holtmann-Rice, Nina Martin, Bramandia Ramadhana, DanielToyama, Mrinal Shukla, Sujoy Basu, Abhi Mohan, Nick Fernando, Noah Fiedel,Kim Paterson, Hui Li, Ankush Garg, Jane Park, DongHyun Choi, Diane Wu,Sankalp Singh, Zhishuai Zhang, Amir Globerson, Lily Yu, John Carpenter,Félix deChaumontQuitry, Carey Radebaugh, Chu-Cheng Lin, Alex Tudor,Prakash Shroff, Drew Garmon, Dayou Du, Neera Vats, Han Lu, Shariq Iqbal, AlexYakubovich, Nilesh Tripuraneni, James Manyika, Haroon Qureshi, Nan Hua,Christel Ngani, MariaAbi Raad, Hannah Forbes, Anna Bulanova, Jeff Stanway,Mukund Sundararajan, Victor Ungureanu, Colton Bishop, Yunjie Li, BalajiVenkatraman, BoLi, Chloe Thornton, Salvatore Scellato, Nishesh Gupta,Yicheng Wang, Ian Tenney, Xihui Wu, Ashish Shenoy, Gabriel Carvajal,DianaGage Wright, Ben Bariach, Zhuyun Xiao, Peter Hawkins, Sid Dalmia,Clement Farabet, Pedro Valenzuela, Quan Yuan, Chris Welty, Ananth Agarwal,Mia Chen, Wooyeol Kim, Brice Hulse, Nandita Dukkipati, Adam Paszke, AndrewBolt, Elnaz Davoodi, Kiam Choo, Jennifer Beattie, Jennifer Prendki, HarshaVashisht, Rebeca Santamaria-Fernandez, LuisC. 
Cobo, Jarek Wilkiewicz, DavidMadras, Ali Elqursh, Grant Uy, Kevin Ramirez, Matt Harvey, Tyler Liechty,Heiga Zen, Jeff Seibert, ClaraHuiyi Hu, Mohamed Elhawaty, Andrey Khorlin,Maigo Le, Asaf Aharoni, Megan Li, Lily Wang, Sandeep Kumar, Alejandro Lince,Norman Casagrande, Jay Hoover, DaliaEl Badawy, David Soergel, Denis Vnukov,Matt Miecnikowski, Jiri Simsa, Anna Koop, Praveen Kumar, Thibault Sellam,Daniel Vlasic, Samira Daruki, Nir Shabat, John Zhang, Guolong Su, JiagengZhang, Jeremiah Liu, YiSun, Evan Palmer, Alireza Ghaffarkhah, XiXiong,Victor Cotruta, Michael Fink, Lucas Dixon, Ashwin Sreevatsa, AdrianGoedeckemeyer, Alek Dimitriev, Mohsen Jafari, Remi Crocker, NicholasFitzGerald, Aviral Kumar, Sanjay Ghemawat, Ivan Philips, Frederick Liu,Yannie Liang, Rachel Sterneck, Alena Repina, Marcus Wu, Laura Knight, MarinGeorgiev, Hyo Lee, Harry Askham, Abhishek Chakladar, Annie Louis, Carl Crous,Hardie Cate, Dessie Petrova, Michael Quinn, Denese Owusu-Afriyie, AchintyaSinghal, Nan Wei, Solomon Kim, Damien Vincent, Milad Nasr, ChristopherA.Choquette-Choo, Reiko Tojo, Shawn Lu, Diego deLasCasas, Yuchung Cheng,Tolga Bolukbasi, Katherine Lee, Saaber Fatehi, Rajagopal Ananthanarayanan,Miteyan Patel, Charbel Kaed, Jing Li, Jakub Sygnowski, ShreyasRammohanBelle, Zhe Chen, Jaclyn Konzelmann, Siim Põder, Roopal Garg, VinodKoverkathu, Adam Brown, Chris Dyer, Rosanne Liu, Azade Nova, Jun Xu, SlavPetrov, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, and Oriol Vinyals.Gemini 1.5: Unlocking multimodal understanding across millions oftokens of context, 2024.URL https://arxiv.org/abs/2403.05530.
  • Shazeer (2019) Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019. URL https://arxiv.org/abs/1911.02150.
  • Shazeer (2020) Noam Shazeer. GLU variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202.
  • Shoeybi et al. (2020) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053.
  • So et al. (2021) David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. Searching for efficient transformers for language modeling. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 6010–6022. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/2f3c6a4cd8af177f6456e7e51a916ff3-Paper.pdf.
  • Su et al. (2021) Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a. URL https://arxiv.org/abs/2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b. URL https://arxiv.org/abs/2307.09288.
  • van Laarhoven (2017) Twan van Laarhoven. L2 regularization versus batch and weight normalization, 2017. URL https://arxiv.org/abs/1706.05350.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Wortsman et al. (2023) Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. Small-scale proxies for large-scale transformer training instabilities, 2023. URL https://arxiv.org/abs/2309.14322.
  • XAI (2024) XAI. Grok-1, 2024. URL https://github.com/xai-org/grok-1.
  • Xu et al. (2021) Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. GSPMD: General and scalable parallelization for ML computation graphs, 2021. URL https://arxiv.org/abs/2105.04663.
  • Yang (2019) Greg Yang. Tensor Programs I: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes, 2019. URL https://arxiv.org/abs/1910.12478.
  • Yang (2020) Greg Yang. Tensor Programs II: Neural tangent kernel for any architecture, 2020. URL https://arxiv.org/abs/2006.14548.
  • Yang (2021) Greg Yang. Tensor Programs III: Neural matrix laws, 2021. URL https://arxiv.org/abs/2009.10685.
  • Yang & Hu (2021) Greg Yang and Edward J. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11727–11737. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21c.html.
  • Yang et al. (2022) Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer, 2022. URL https://arxiv.org/abs/2203.03466.
  • Yang et al. (2023a) Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning, 2023a. URL https://arxiv.org/abs/2310.17813.
  • Yang et al. (2023b) Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor Programs VI: Feature learning in infinite-depth neural networks, 2023b. URL https://arxiv.org/abs/2310.02244.
  • Yao & Wang (2023) Yiqun Yao and Yequan Wang. Research without re-search: Maximal update parametrization yields accurate loss prediction across scales, 2023. URL https://arxiv.org/abs/2304.06875.
  • You et al. (2019) Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019. URL http://arxiv.org/abs/1904.00962.
  • Zhang & Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. CoRR, abs/1910.07467, 2019. URL http://arxiv.org/abs/1910.07467.