Using Degeneracy in the Loss Landscape for Mechanistic Interpretability (2024)

Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn
Apollo Research; Cadenza Labs; Independent
Correspondence to Lucius Bushnaq <lucius@apolloresearch.ai>

Abstract

Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and parameterizations with more degeneracy are likely to generalize further. We identify three ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; and ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric for identifying modules in a network that is based on this argument. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies arising from linear dependence of activations or Jacobians.

1 Introduction

Mechanistic Interpretability aims to understand the algorithms implemented by neural networks (Olah et al., 2017; Elhage et al., 2021; Räuker et al., 2023; Olah et al., 2020; Meng et al., 2023; Geiger et al., 2021; Wang et al., 2022; Conmy et al., 2024). A key challenge in mechanistic interpretability is that neurons tend to fire on many unrelated inputs (Fusi et al., 2016; Nguyen et al., 2016; Olah et al., 2017; Geva et al., 2021; Goh et al., 2021), and any apparent circuits in the model often do not show a single clear functionality and do not have clear boundaries separating them from the rest of the network (Conmy et al., 2023; Chan et al., 2022).

We suggest that a central problem for current methods of reverse engineering networks is that neural networks are degenerate: there are many different choices of parameters that implement the same function (Wei et al., 2022; Watanabe, 2009). For example, in a transformer attention head, only the product $W_{OV}=W_O W_V$ of the $W_V$ and $W_O$ matrices influences the network's output; thus, many different choices of $W_O$ and $W_V$ are parameterizations of the same network (Elhage et al., 2021). This degeneracy makes parameters and activations an obfuscated view of a network's computational features, hindering interpretability. While we have workarounds for known architecture-dependent degeneracies such as the $W_{OV}$ case, Singular Learning Theory (SLT, Watanabe, 2009, 2013) suggests that we should expect additional degeneracy in trained networks that generalize well.
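As a minimal illustration of this kind of reparameterization freedom (a toy numpy sketch with made-up dimensions, not taken from any particular model), inserting an invertible matrix between $W_V$ and $W_O$ changes the individual weight matrices but leaves $W_{OV}$, and hence the head's behaviour, unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4  # illustrative sizes

W_V = rng.normal(size=(d_head, d_model))   # value projection
W_O = rng.normal(size=(d_model, d_head))   # output projection

# Insert any invertible matrix A between the two factors.
A = rng.normal(size=(d_head, d_head)) + 3.0 * np.eye(d_head)
W_V_alt = A @ W_V
W_O_alt = W_O @ np.linalg.inv(A)

# The product W_OV is identical, so the two parameterizations implement the same function.
assert np.allclose(W_O @ W_V, W_O_alt @ W_V_alt)
```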

SLT quantifies the degeneracy of the loss landscape around a solution using the local learning coefficient (LLC) (Lau et al., 2023; Watanabe, 2009, 2013). More degenerate solutions lie in broader 'basins' of the loss landscape, where many alternative parameterizations implement a similar function. Networks with lower LLCs are more degenerate, implement more general algorithms, and generalize better to new data (Watanabe, 2009, 2013). These predictions of SLT are only straightforwardly applicable to the global minimum of the loss landscape; a generalization is required to apply these insights to real networks.

In this paper we make the following contributions. First, in Section 2 we propose changes to SLT to make it useful for interpretability on real networks. Then, in Section 3 we characterize three ways in which neural networks can be degenerate. In Section 4, we prove a link between some of these degeneracies and sparsity in the interactions between features. In Section 5, we develop a technique for searching for modularity based on its relation to degeneracy in the loss landscape. Finally, in Section 6, we propose a practical technique for removing some of these degeneracies in the form of the interaction basis.

2 Singular learning theory and the effective parameter count

If a neural network's parameterisation is degenerate, this means there are many choices of parameters that achieve the same loss. At a global minimum of the loss landscape, more degeneracy in the parameterisation implies that the network lies in a broader basin of the loss. We can quantify how broad this basin is using Singular Learning Theory (SLT; Watanabe 2009, 2013; Wei et al. 2022).

In Section 2.1, we provide an overview of the key concepts from SLT that we will make use of. In Section 2.2 we explain why the tools of SLT are not completely suitable for identifying degeneracy in model internals. As a proposal to resolve some of these limitations, we introduce the behavioral loss in Section 2.2.1, and finite data singular learning theory in Section 2.2.2. Together, these concepts will allow us to define the effective parameter count, a measure of the number of computationally-relevant parameters in the network. If we achieved our goal of a fully parameterisation-invariant representation of a neural network, its explicit parameter count would equal its effective parameter count.

2.1 Background: the local learning coefficient

The most important quantity in SLT is the learning coefficient $\lambda$. We define a data distribution $x\sim X$ and a family of models with $N$ parameters, parameterised by a vector $\theta$ in a parameter space $\Theta\subseteq\mathbb{R}^N$. We also define a population loss function $L(\theta|X)$, normalised so that $L(\theta_0|X)=0$ at the global minimum $\theta_0=\operatorname{arg\,min}_{\theta}L(\theta|X)$. Then $\lambda$ is defined as follows (see Watanabe (2009) for a more rigorous definition of the learning coefficient):

$$\lambda := \lim_{\epsilon\to 0}\left[\epsilon\frac{\mathrm{d}}{\mathrm{d}\epsilon}\log V(\epsilon)\right]\,, \tag{1}$$

where $V(\epsilon)$ is the volume of the region of parameter space $\Theta$ with loss less than $\epsilon$:

$$V(\epsilon) := \int_{\{\theta\in\Theta:\,L(\theta)<\epsilon\}}\mathrm{d}\theta \tag{2}$$

The learning coefficient quantifies the way the volume of a region of low loss changes as we ‘zoom in’ to lower and lower loss. It is a measure of basin broadness, and SLT predicts that networks are biased towards points in the loss landscape with lower learning coefficient.

Since the loss landscape can have many different solutions with minimum loss, this definition does not necessarily single out a region corresponding to a single solution. Therefore, Lau et al. (2023) introduce the local learning coefficient (LLC, denoted $\hat{\lambda}$) as a way to use the machinery of SLT to study the loss landscape geometry in the neighbourhood of a particular local minimum $\theta^*$, by restricting the volume in the definition of the learning coefficient to a neighbourhood $\Theta_{\theta^*}\subset\Theta$ of that minimum satisfying $\theta^*=\operatorname{arg\,min}_{\theta\in\Theta_{\theta^*}}L(\theta|X)$. Then we define the local volume:

$$V_{\theta^*}(\epsilon)=\int_{\{\theta\in\Theta_{\theta^*}:\,L(\theta)<L(\theta^*)+\epsilon\}}\mathrm{d}\theta \tag{3}$$

and the local learning coefficient:

$$\hat{\lambda}(\theta^*)=\lim_{\epsilon\to 0}\left[\epsilon\frac{\mathrm{d}}{\mathrm{d}\epsilon}\log V_{\theta^*}(\epsilon)\right]\,. \tag{4}$$

To see why the LLC can be thought of as counting the degeneracy in the network, consider a network with $N$ parameters and $N_{\text{free}}$ degrees of freedom in the parameterisation (such that $N_{\text{free}}$ of the parameters can be varied freely, together or independently, without affecting the loss). Then, we can approximate the loss by a Taylor series around the minimum:

$$L(\theta|X)=L(\theta^*)+\frac{1}{2}(\theta-\theta^*)^{T}H(\theta^*)(\theta-\theta^*)+O(\|\theta-\theta^*\|^{3}) \tag{5}$$

where $H(\theta^*)$ is the Hessian at the minimum. Consider the case where all functionally relevant parameters contribute a quadratic term to the loss at leading order, and the degrees of freedom correspond to parameters on which the loss does not depend at all. In this case, Murfet (2020) explicitly calculates the LLC, showing that it equals $\frac{1}{2}(N-N_{\text{free}})$: that is, the LLC counts the number of functionally relevant parameters in the model.

There is a sense in which, for such a model, the nominal parameter count is misleading: if there are $N_{\text{free}}$ degrees of freedom, then there are effectively only $N-N_{\text{free}}$ actual parameters in the model. Indeed, this is the right perspective to take when selecting a model class to fit data. Watanabe (2013) demonstrates that for models with parameter-function maps that are not one-to-one, the Bayesian Information Criterion (Schwarz, 1978), which predicts which model fit to given data will generalize best (Hoogland, 2023), should be modified: the parameter count $N$ of the model should be replaced with $2\lambda$.

In this simple example, the LLC is equal to half the rank of the Hessian at the minimum, and one might wonder whether these two quantities are always related in this way. It turns out that they coincide only when the loss landscape can be written locally as a sum of quadratic terms, which is not always the case. For example, the loss landscape could be locally quartic in some directions, or the set of points with zero loss may form complicated self-intersecting shapes such as a cross. In these cases, it is the LLC, not the rank of the Hessian, that measures how much freedom there is to change the parameters and how well we expect a particular model to generalise.
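As a concrete illustration of the gap between the two quantities, consider the two-parameter loss $L(w_1,w_2)=w_1^{2}+w_2^{4}$. The Hessian at the origin has rank 1, suggesting a half-rank of $\frac{1}{2}$, but the volume of the sublevel set scales as $V(\epsilon)\propto\epsilon^{\frac{1}{2}}\cdot\epsilon^{\frac{1}{4}}$, so the learning coefficient is $\lambda=\frac{1}{2}+\frac{1}{4}=\frac{3}{4}$: the quartic direction contributes to $\lambda$ even though it is invisible to the Hessian.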

2.2 Modifying SLT for interpretability

We would like to use the local learning coefficient to quantify the number of degrees of freedom in the parameterisation of a neural network: the number of ways the parameters of a neural network can be changed while still implementing the same function, or at least a highly similar function. However, there are some obstacles to using the LLC for this purpose:

1. The LLC $\hat{\lambda}(\theta^*)$ measures the size of the region of equal loss around a particular local minimum $\theta^*\in\Theta$ in the loss landscape. This loss landscape is defined by a loss function and a dataset of inputs and labels. Unless the network achieves optimal loss on this dataset, points in the region could have equal loss even though they correspond to different functions, if these functions achieve the same average performance over the dataset. We do not want our measure of the number of degrees of freedom to include different functions which achieve the same overall loss.

2. The local learning coefficient is only well defined at a local minimum of the loss, but we frequently want to interpret neural networks that have not been trained to convergence and are not at a minimum of the loss on their training distribution.

3. We would like to be able to consider two very similar but not identical functions to be the same function if they only differ in ways that can be considered noise. This is partly because, after finite training time, a network will not have fully converged on the cleanest version of an algorithm without any noise (indeed, sometimes it is possible to remove this noise and improve performance; Nanda et al., 2023). However, the formal approach of SLT studies models in the limit of infinite data. This turns out to correspond to taking the limit $\epsilon\to 0$ in the definition of the LLC (Equation 4): after infinite data, the LLC is determined by the scaling of the volume function at loss equal to $L(\theta^*)$. This means that the LLC contains information only about exact degeneracies in the parameterization, i.e. only about different parameterisations that lie exactly at the local minimum. Instead, we would prefer to work with a modified LLC which quantifies the number of parameterization choices that correspond to approximately identical functions.

We introduce the behavioral loss as a resolution to problems (1) and (2), and finite data SLT as a resolution to problem (3).

2.2.1 Behavioral loss

In this section, we describe how we can define the local learning coefficient of a network to avoid problems 1 and 2 listed above. We want to define a new loss function and corresponding loss landscape for the sake of the SLT formalism (we do not train with this loss) such that all the parameter choices in a region with zero loss correspond to the same function on the training dataset: the same map from inputs to outputs. This loss function, which we call the behavioral loss $L_B$, is defined with respect to an original neural network with an original set of parameters $\theta^*$, and quantifies how similar the function $\mathbf{f}_\theta$ implemented by a different set of parameters $\theta$ is to the original function $\mathbf{f}_{\theta^*}$:

$$L_B(\theta|\theta^*,\mathcal{D})=\frac{1}{n}\sum_{x\in\mathcal{D}}\left\|\mathbf{f}_\theta(x)-\mathbf{f}_{\theta^*}(x)\right\|^{2} \tag{6}$$

where $\mathcal{D}$ is the training dataset and $\|\mathbf{v}\|$ denotes the $\ell^2$-norm of $\mathbf{v}$ (we arbitrarily chose an MSE loss here; conceptually, we only require a loss that is non-negative and satisfies identity of indiscernibles, $L=0\iff\forall x:\mathbf{f}_\theta(x)=\mathbf{f}_{\theta^*}(x)$; when studying an LLM, for example, it may be more suitable to use the KL divergence). By definition, this loss landscape always has a global minimum at the parameters the model actually uses, $\theta=\theta^*$, solving problem 2 above. Additionally, parameter choices which achieve zero behavioral loss must have the same input-output behaviour as $\mathbf{f}_{\theta^*}$ on the entire training dataset, solving problem 1. Note that achieving zero behavioral loss relative to a model with parameters $\theta^*$ is a stricter requirement than achieving the same loss as that model on the training data. Therefore, the behavioral-loss LLC $\hat{\lambda}_B$ will be equal to or higher than the training-loss LLC $\hat{\lambda}$.
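To make the definition concrete, here is a minimal numpy sketch of Equation 6; the `model` function and its toy two-layer linear form are illustrative assumptions rather than anything from the paper. It also checks that a pure reparameterization of the reference network attains zero behavioral loss:

```python
import numpy as np

def behavioral_loss(model, theta, theta_star, dataset):
    """MSE between the outputs of model(theta, .) and the reference model(theta_star, .),
    averaged over the dataset (Equation 6)."""
    total = 0.0
    for x in dataset:
        diff = model(theta, x) - model(theta_star, x)
        total += np.sum(diff ** 2)
    return total / len(dataset)

# Toy 'network': two stacked linear maps, theta = (W0, W1).
def model(theta, x):
    W0, W1 = theta
    return W1 @ (W0 @ x)

rng = np.random.default_rng(0)
theta_star = (rng.normal(size=(3, 4)), rng.normal(size=(2, 3)))

# A reparameterization that leaves the implemented function unchanged...
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
theta_equiv = (A @ theta_star[0], theta_star[1] @ np.linalg.inv(A))

# ...achieves (numerically) zero behavioral loss relative to the reference network.
dataset = [rng.normal(size=4) for _ in range(100)]
print(behavioral_loss(model, theta_equiv, theta_star, dataset))  # ~0 up to float error
```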

2.2.2 Singular learning theory at finite data

Next, we want to resolve the problem that standard SLT formulae concern only the limit of infinite data, in which the model is certainly at a local minimum of the loss landscape. We would like to think of a neural network trained on a finite amount of data as implementing a core algorithm we are interested in reverse engineering, plus some amount of 'noise' which may vary with the parameterisation and which is not important to interpret. For example, in a modular addition transformer (Nanda et al., 2023), there are parts of the network which can be ablated to improve loss: these parts may be present because the model has not yet fully converged to a minimum. In this case, if we have two transformers trained on modular addition which have the same input-output behaviour after we have ablated parts to improve performance, then we would like to consider these models as implementing the same function 'up to' noise, even before we ablate those parts.

In this section, we sketch how to modify SLT so that the LLC becomes a measure of how many different parameterisations implement nearly the same function, rather than exactly the same function. In this way, we can numerically vary how much the functions two different parameterisations implement are allowed to differ from each other on the training data.

We start by explaining why SLT takes the limit $\epsilon\to 0$ in the definition of the learning coefficient (Equation 1). SLT is a theory of Bayesian learning machines: learning machines which start with a prior over parameters that is nonzero everywhere, $\varphi:\Theta\to(0,1)$, and which learn by performing a Bayesian update on each datapoint they observe. After a dataset $\mathcal{D}_n$ of $n$ datapoints, the posterior distribution over parameters is:

$$p(\theta|\mathcal{D}_n)=\frac{e^{-nL(\theta|\mathcal{D}_n)}\varphi(\theta)}{p(\mathcal{D}_n)}\,, \tag{7}$$

where $L(\theta|\mathcal{D}_n)$ is the negative log likelihood of the dataset given the model $\mathbf{f}_\theta$, which we identify with the loss function when making a connection between Bayesian learning and SGD (Murphy, 2012), and $p(\mathcal{D}_n)$ is a normalisation factor.

The exponential dependence on $n$ ensures that in the limit $n\to\infty$, a Bayesian learning machine's posterior is only nonzero at points of minimum loss. This means that the asymptotic behaviour of the learning machine depends only on properties of the loss landscape that are asymptotically close to zero loss. This is the reason we take $\epsilon\to 0$ in the definition of the learning coefficient.

However, since the parameters $\theta^*$ we find after finitely many steps of SGD correspond to an algorithm plus noise, we want to consider the size of the region of parameter space that achieves a behavioral loss smaller than the noise scale. From a Bayesian learning perspective, Equation 7 shows that for a large but finite number of datapoints, most of the posterior concentrates around the regions of low loss, but it does not fully concentrate on the region of exactly minimum loss.
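The following minimal numpy sketch (with an arbitrary one-parameter loss $L(w)=w^4$ and a uniform prior, both illustrative assumptions) shows this numerically: as $n$ grows, the posterior of Equation 7 concentrates, but at any finite $n$ it still occupies a region of small but nonzero loss.

```python
import numpy as np

# Posterior over a single parameter w for an illustrative loss L(w) = w^4,
# with a uniform prior on [-1, 1] (Equation 7).
w = np.linspace(-1.0, 1.0, 200_001)
dw = w[1] - w[0]
L = w ** 4

for n in [1e2, 1e4, 1e6]:
    post = np.exp(-n * L)
    post /= post.sum() * dw                      # normalise the posterior density
    width = np.sqrt(np.sum(post * w ** 2) * dw)  # posterior standard deviation
    mean_loss = np.sum(post * L) * dw            # typical loss under the posterior
    print(f"n = {n:.0e}   width ≈ {width:.3f}   E_post[L] ≈ {mean_loss:.1e}")
# The posterior narrows like n^(-1/4) and its typical loss shrinks like ~1/n,
# but it never sits exactly at zero loss for finite n.
```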

Therefore, we simply refrain from taking the limit as the loss scale $\epsilon$ goes to $0$ in the definition of the learning coefficient, and instead consider the learning coefficient at a particular loss scale:

$$\lambda(\epsilon):=\epsilon\frac{\mathrm{d}}{\mathrm{d}\epsilon}\log V(\epsilon) \tag{8}$$

To understand how the learning coefficient can vary with $\epsilon$, consider an illustrative example: an extremely simple setup with a single parameter $w\in\mathbb{R}$ and a loss function $L(w)=c^{2}w^{2}+w^{4}$ with $c\ll 1$. This is a toy model of a scenario where there is a very small quadratic term in the loss, which is only 'visible' to the learning coefficient when we zoom in to very small loss values. To see this, we calculate how the volume (Equation 2) depends on the loss scale $\epsilon$. For large $\epsilon\gg c^{4}$, the quartic term dominates the loss at the boundary of the region, and the region of loss less than $\epsilon$ is roughly the interval $[-\epsilon^{\frac{1}{4}},\epsilon^{\frac{1}{4}}]$. This gives $V(\epsilon)\approx 2\epsilon^{\frac{1}{4}}$, so the learning coefficient is $\lambda(\epsilon\gg c^{4})=\frac{1}{4}$, the same as if the quadratic term were not present. On the other hand, for small enough $\epsilon\ll c^{4}$, the quadratic term becomes visible: $V(\epsilon)\approx 2\epsilon^{\frac{1}{2}}/c$, so $\lambda(\epsilon\ll c^{4})=\frac{1}{2}$.
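A minimal numerical sketch of this example (grid-based, with an arbitrary choice $c=10^{-2}$) reproduces the two regimes by estimating $\lambda(\epsilon)$ as the log-log slope of the volume function:

```python
import numpy as np

# Toy loss from the text: L(w) = c^2 w^2 + w^4 with c << 1.
c = 1e-2
w = np.linspace(-1.0, 1.0, 2_000_001)
dw = w[1] - w[0]
L = c ** 2 * w ** 2 + w ** 4

def volume(eps):
    """V(eps): total length of {w : L(w) < eps}, estimated on the grid (Equation 2)."""
    return np.count_nonzero(L < eps) * dw

def lam(eps, factor=2.0):
    """lambda(eps) = d log V / d log eps, via a central difference on a log scale (Equation 8)."""
    return (np.log(volume(eps * factor)) - np.log(volume(eps / factor))) / (2.0 * np.log(factor))

for eps in [1e-4, 1e-6, 1e-10, 1e-12]:
    print(f"eps = {eps:.0e}   lambda(eps) ≈ {lam(eps):.2f}")
# Prints roughly 0.25 for eps >> c^4 = 1e-8 (the quartic regime)
# and roughly 0.5 for eps << c^4 (where the small quadratic term becomes visible).
```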

Determining how to choose an appropriate cutoff $\epsilon$ is still an open problem. We suggest that researchers choose the value of the behavioral loss cutoff in the context of the question they would like to answer. For example, if one trains multiple models with different seeds on the same task, then an appropriate loss cutoff may be on the order of the variance between the seeds.

Finally, we are able to quantify the amount of degeneracy in a neural network. We define the effective parameter count of a neural network $\mathbf{f}_{\theta^*}$ at noise scale $\epsilon$ as twice the local learning coefficient $\lambda_B(\epsilon)$ of the behavioral loss with respect to the network at that noise scale:

$$N_{\text{eff}}(\epsilon):=2\lambda_B(\epsilon) \tag{9}$$

We conjecture that a fully parameterisation-invariant representation of a neural network which captures all of its behaviour up to noise scale $\epsilon$ would require $N_{\text{eff}}(\epsilon)$ parameters.

3 Internal structures that contribute to degeneracy

In this section, we show three ways in which the internal structure of neural networks can induce degrees of re-parametrization freedom $N_{\text{free}}$ in the loss landscape. Since $N_{\text{eff}}=N-N_{\text{free}}$, this is equivalent to showing three ways in which the internal structure of neural networks determines their effective parameter count. We do not expect these three sources of re-parametrization freedom to offer a complete account of all degeneracy in real networks; they are merely a starting point for relating the degeneracy of networks to their computational structure.

For ease of presentation, most of the expressions in this section are only derived for the example case of fully connected networks. They can be generalized to transformers, though we do not show this explicitly here.

In Section 3.1, we show a relationship between the effective parameter count and the dimensions of the spaces spanned by the network's activation vectors (Section 3.1.1) and Jacobians (Section 3.1.2) recorded over the training data. In Section 3.2, we show a relationship between the number of distinct nonlinearities implemented in a layer of the network on the training data and the effective parameter count.

3.1 Activations and Jacobians

In this section, we show how a network having low dimensional hidden activations or Jacobians leads to re-parametrisation freedom.

We begin by bringing the network's Hessian, which gives the first non-zero term in the Taylor expansion of the loss around an optimum (see Equation 5), into a more convenient form. Each local free direction in the loss landscape corresponds to an eigenvector of the Hessian with zero eigenvalue (the reverse does not hold, due to higher-order terms in the expansion in Equation 5; see Watanabe, 2009, 2013). Therefore, the rank of the Hessian can be used to obtain a lower bound on the learning coefficient.

Consider the Hessian of a fully connected network with parameters $\theta=\theta^*$, network inputs $x$ and network outputs $\mathbf{f}_\theta(x)$, on a behavioural loss $L_B(\theta|\theta^*,\mathcal{D})$ evaluated on a dataset consisting of $|\mathcal{D}|=n$ inputs. Using the chain rule, the Hessian at the global minimum $\theta=\theta^*$ can be written as:

$$\left.\frac{\partial^{2}L_B(\theta|\theta^*,\mathcal{D})}{\partial\theta^{l}_{i,j}\,\partial\theta^{l'}_{i',j'}}\right|_{\theta=\theta^*}=\sum_{x\in\mathcal{D}}\sum_{k,k'}\left.\frac{\partial^{2}L_B(\theta|\theta^*,\mathcal{D})}{\partial f^{l_{\text{final}}}_{k}\,\partial f^{l_{\text{final}}}_{k'}}\right|_{\theta=\theta^*}\frac{\partial f^{l_{\text{final}}}_{k}(x)}{\partial\theta^{l'}_{i',j'}}\frac{\partial f^{l_{\text{final}}}_{k'}(x)}{\partial\theta^{l}_{i,j}} \tag{10}$$
$$\overset{\text{MSE loss}}{=}\frac{1}{n}\sum_{x\in\mathcal{D}}\sum_{k}\frac{\partial f^{l_{\text{final}}}_{k}(x)}{\partial\theta^{l'}_{i',j'}}\frac{\partial f^{l_{\text{final}}}_{k}(x)}{\partial\theta^{l}_{i,j}}$$

for $l=1,\dots,l_{\text{final}}$; $j=1,\dots,d^{l}$; $i=1,\dots,d^{l+1}$. In the second line, we have used that the loss function is the MSE from the outputs at $\theta=\theta^*$ to simplify the expression, and that the first derivatives of the loss are zero at the minimum (if we were to use a different behavioural loss such as the KL divergence, the term $\left.\frac{\partial^{2}L}{\partial f^{l_{\text{final}}}_{k}\,\partial f^{l_{\text{final}}}_{k'}}\right|_{\theta=\theta^*}$ would not be proportional to $\delta_{kk'}$; different output activations, i.e. logits for a language model, would then be weighted differently, but the story of this section would be largely the same). Thus, the Hessian is equal to a Gram matrix of the network's weight gradients $\frac{\partial f^{l_{\text{final}}}_{k}}{\partial\theta^{l}_{i,j}}$, and linear dependence between entries of the weight gradients over the training set $\mathcal{D}$ corresponds to zero eigenvalues of the Hessian.

We can apply the chain rule again to rewrite the gradient vector on each datapoint as an outer product of Jacobians and activations:

$$\frac{\partial f^{l_{\text{final}}}_{k}(x)}{\partial\theta^{l}_{i,j}}=\frac{\partial f^{l_{\text{final}}}_{k}(x)}{\partial p^{l+1}_{i}}\,f^{l}_{j}(x) \tag{11}$$

where the Jacobian is taken with respect to the preactivations of layer $l+1$:

$$\mathbf{p}^{l+1}(x)=W^{l}\,\mathbf{f}^{l}(x)\,. \tag{12}$$

Thus, every degree of linear dependence in the activations $f^{l}_{j}$ or Jacobians $\frac{\partial f^{l_{\text{final}}}_{k}}{\partial p^{l+1}_{i}}$ in a layer $l$ of the network also causes degrees of linear dependence in the weight gradients $\frac{\partial f^{l_{\text{final}}}_{k}(x)}{\partial\theta^{l}_{i,j}}$ of the network, potentially resulting in re-parametrisation freedom. In the next two sections, we explore how linear dependence in the activations and Jacobians, respectively, impacts the effective parameter count.
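The following numpy sketch illustrates Equations 10 and 11 on a toy one-hidden-layer ReLU network (all sizes and the rank-deficient data are illustrative assumptions). The inputs only span a 2-dimensional subspace, so the activations feeding the first weight matrix are linearly dependent, and the corresponding block of the behavioral-loss Hessian, built as a Gram matrix of per-datapoint weight gradients, acquires zero eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 4, 5, 3, 200

# Inputs that only span a 2-dimensional subspace of the 4-dimensional input space.
basis = rng.normal(size=(2, d_in))
X = rng.normal(size=(n, 2)) @ basis

W0 = rng.normal(size=(d_hidden, d_in))   # layer whose weight gradients we examine
W1 = rng.normal(size=(d_out, d_hidden))  # readout

# Per-datapoint gradients of each output w.r.t. the entries of W0, via Equation 11:
# d f_k / d W0_{ij} = (d f_k / d p^1_i) * x_j, with p^1 = W0 x and f = W1 relu(p^1).
grads = np.zeros((n, d_out, d_hidden, d_in))
for a, x in enumerate(X):
    p1 = W0 @ x
    J = W1 * (p1 > 0)                    # Jacobian d f_k / d p^1_i, shape (d_out, d_hidden)
    grads[a] = J[:, :, None] * x[None, None, :]

# The W0 block of the Hessian of the behavioral (MSE) loss is the Gram matrix of these gradients.
G = grads.reshape(n * d_out, d_hidden * d_in)
H = G.T @ G / n

eigvals = np.linalg.eigvalsh(H)
print("near-zero Hessian eigenvalues:", int(np.sum(eigvals < 1e-10 * eigvals.max())))
# Expect at least 2 * d_hidden = 10 of them: a block of d_hidden free directions
# for each of the 2 input directions never seen in the data.
```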

3.1.1 Activation vectors spanning a low dimensional subspace

Looking at Equation 11, each degree of linear dependence of the activations $f^{l}_{j}$ in a hidden layer $l$ of width $d^{l}$ over the training dataset $\mathcal{D}$,

$$\sum_{j}c_{j}f^{l}_{j}(x)=0\quad\forall x\in\mathcal{D}\,, \tag{13}$$

corresponds to $d^{l+1}$ linearly dependent entries in the weight gradients $\frac{\partial f^{l_{\text{final}}}_{k}}{\partial\theta^{l}_{i,j}}$, $d^{l+1}$ eigenvectors of the Hessian with eigenvalue zero, and $d^{l+1}$ fully independent free directions in the loss landscape that span a fully free $d^{l+1}$-dimensional hyperplane. So the effective parameter count $N_{\text{eff}}$ will be lower than the nominal number of parameters $N$ in the model by $d^{l+1}$ for each such degree of linear dependence in the hidden representations.

More generally, we can take a PCA of the activation vectors in layer $l$ by diagonalising the Gram matrix of activations:

$$G^{l}:=\frac{1}{n}\sum_{x\in\mathcal{D}}\mathbf{f}^{l}(x)\,\mathbf{f}^{l}(x)^{T}=:{U^{l}}^{T}D^{l}_{G}U^{l} \tag{14}$$

If there is linear dependence between the activations on the dataset, some of the eigenvalues of $G^{l}$ will be zero. If we transform into rotated layer coordinates $\tilde{\mathbf{f}}^{l}(x)=U^{l}\mathbf{f}^{l}(x)$, $\tilde{W}^{l}=W^{l}{U^{l}}^{T}$, then the parameters of the transformed weight matrix in the columns which connect to the directions with zero variance can be changed freely without changing the product $\tilde{W}^{l}\tilde{\mathbf{f}}^{l}$.

In practice, a Gram matrix of activation vectors will never have eigenvalues that are exactly zero. However, if a particular direction has root-mean-square activation $\sqrt{\frac{1}{n}\sum_{x\in\mathcal{D}}\left(\tilde{f}^{l}_{j}(x)\right)^{2}}=O(\epsilon^{k})$ for some $\epsilon\ll 1$, the transformed parameters inside $\tilde{W}^{l}$ that connect to it can be changed by $O(\epsilon^{\frac{1}{2}-k})$ while only impacting the loss $L$ by $O(\epsilon)$.

This suggests that, under the finite-data SLT picture introduced in Section 2.2.2, singular values of the set of activation vectors that are smaller than $\epsilon^{\frac{1}{2}}$ at noise scale $\epsilon$ result in a lower effective parameter count, with $d^{l+1}$ fewer effective parameters for every such small singular value. So, if we view the PCA components in a layer $l$ as the 'elementary variables' of that layer, then the fewer elementary variables the network has in total, the lower its effective parameter count will be.
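Here is a minimal numpy sketch of this procedure under the same kind of illustrative assumptions as above: activations that only vary in a low-dimensional subspace (plus tiny noise), a PCA via the Gram matrix of Equation 14, and a perturbation of the rotated weight columns that connect to the near-zero principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, d_out, n = 4, 3, 500

# Activations f^l that only vary in a 2-dimensional subspace, plus tiny noise.
basis = rng.normal(size=(2, d_act))
F = rng.normal(size=(n, 2)) @ basis + 1e-6 * rng.normal(size=(n, d_act))
W = rng.normal(size=(d_out, d_act))            # next weight matrix, p^{l+1} = W f^l

# PCA of the activations via the Gram matrix (Equation 14).
G = F.T @ F / n
eigvals, U = np.linalg.eigh(G)                 # columns of U are principal directions
F_rot = F @ U                                  # activations in the rotated basis
W_rot = W @ U                                  # rotated weights, so W_rot @ F_rot.T == W @ F.T

# Freely change the columns of W_rot that connect to (near-)zero principal components.
small = eigvals < 1e-8 * eigvals.max()
W_mod = W_rot.copy()
W_mod[:, small] += rng.normal(size=(d_out, int(small.sum())))

# The layer's outputs are unchanged on the training data, up to the noise scale.
print(np.max(np.abs(W_mod @ F_rot.T - W_rot @ F_rot.T)))   # ~1e-6
```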

Relationship to weight norm

One might be concerned that linear dependencies between the activation vectors on the training dataset might not hold for activation vectors outside the training dataset, such that the entries of the weight matrix that we are treating as free do in fact affect the off-distribution outputs of the network.

However, SOTA optimisers often use weight decay or $\ell^{2}$ weight regularisation during training to improve network generalization (Loshchilov and Hutter, 2019). This biases training towards networks with a smaller total $\ell^{2}$ weight norm, $\|\theta\|_{2}^{2}=\sum_{l=1}^{l_{\text{final}}}\|W^{l}\|_{F}^{2}$. Since the Frobenius norm $\|W^{l}\|_{F}$ is invariant under orthogonal transformations, the weight regularisation can equivalently be thought of as biasing training towards low $\|\tilde{W}^{l}\|_{F}$. Since the entries of $\tilde{W}^{l}$ which connect to the zero principal components do not affect the output, training will be biased to push them to zero. This is an example of weight regularisation improving generalisation performance: if, at inference time, an activation vector has variation in a direction not seen during training, a regularised model ignores that component of the activation vector.

3.1.2 Jacobians spanning a low dimensional subspace

We have shown that if the activation vectors in some layer are linearly dependent over a dataset, then some parameters are free to vary without affecting outputs on that dataset. A similar story can be told when the Jacobians $J^{l}_{ij}=\frac{\partial f^{l_{\text{final}}}_{i}(x)}{\partial p^{l+1}_{j}}$ do not span the full space of the layer. As with the activations, we look for zero eigenvalues of the Gram matrix of the Jacobians:

$$K^{l}:=\frac{1}{n}\sum_{x\in\mathcal{D}}{J^{l}}^{T}J^{l}=:{R^{l}}^{T}D^{l}_{P}R^{l} \tag{15}$$

Any zero eigenvalue of this Gram matrix leads to $d^{l}$ zero eigenvalues of the Hessian, analogous to the previous section. We can transform into rotated layer coordinates $\tilde{W}^{l}=R^{l}W^{l}$, $\tilde{J}^{l}=J^{l}{R^{l}}^{T}$, and the parameters of the transformed weight matrix in the rows which connect to the directions with zero variance can be changed freely without changing the product $\tilde{J}^{l}\tilde{W}^{l}$. However, unlike with the activation PCA components, the $d^{l}$ free directions in the Hessian arising from Jacobians spanning a low-dimensional subspace may not always correspond to $d^{l}$ full degrees of freedom in the parametrization. This is due to the potential presence of terms above second order in the perturbative expansion around the loss optimum (see Equation 5), which can cause the loss to change if the parameters are varied along those directions despite the corresponding Hessian eigenvalues being zero (Watanabe, 2009).
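A minimal numpy sketch under contrived illustrative assumptions (positive inputs and weights, so every ReLU always fires and the Jacobians equal a fixed wide-to-narrow readout matrix on all datapoints) shows the corresponding freedom in the rows of the rotated weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 6, 5, 2, 300

# out = C @ relu(W @ x); positive inputs and weights mean every ReLU always fires,
# so the Jacobian d out / d p equals C on each datapoint and spans only d_out = 2 directions.
X = np.abs(rng.normal(size=(n, d_in)))
W = np.abs(rng.normal(size=(d_hidden, d_in)))
C = rng.normal(size=(d_out, d_hidden))

P = X @ W.T                                          # preactivations p^{l+1}
Js = np.array([(P[a] > 0) * C for a in range(n)])    # per-datapoint Jacobians

# Gram matrix of the Jacobians (Equation 15): here it has d_hidden - d_out = 3 zero eigenvalues.
K = np.einsum('aki,akj->ij', Js, Js) / n
eigvals, R = np.linalg.eigh(K)                       # columns of R span preactivation space

# Perturb the rotated weight rows that connect to the (near-)zero eigendirections.
null = eigvals < 1e-10 * eigvals.max()
delta = 1e-3 * rng.normal(size=(int(null.sum()), d_in))
W_perturbed = W + R[:, null] @ delta

out = np.maximum(X @ W.T, 0) @ C.T
out_perturbed = np.maximum(X @ W_perturbed.T, 0) @ C.T
print(np.max(np.abs(out - out_perturbed)))           # ~0: the change is invisible to the outputs
```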

Jacobians between hidden layers

Note that we can decompose each Jacobian from layer $l$ to layer $l_\text{final}$ into a product of Jacobians between adjacent layers by the chain rule:

$$\frac{\partial\mathbf{f}^{l_\text{final}}(x)}{\partial\mathbf{p}^{l+1}} = \frac{\partial\mathbf{f}^{l_\text{final}}(x)}{\partial\mathbf{f}^{l_\text{final}-1}}\,\frac{\partial\mathbf{f}^{l_\text{final}-1}(x)}{\partial\mathbf{f}^{l_\text{final}-2}}\cdots\frac{\partial\mathbf{f}^{l+2}(x)}{\partial\mathbf{f}^{l+1}}\,\frac{\partial\mathbf{f}^{l+1}(x)}{\partial\mathbf{p}^{l+1}}\,. \qquad (16)$$

Thus, any rank drop in a gram matrix of Jacobians from layer $l+k$ to layer $l+k+1$ necessarily also leads to a rank drop in the gram matrix of the Jacobians from layer $l$ to layer $l_\text{final}$, and thus to $d^l$ zero eigenvalues in the Hessian.
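A minimal numerical illustration of this point, using arbitrary random matrices in place of real per-layer Jacobians:

```python
# Minimal check (our own toy example) that a rank drop in one factor of the
# chain-rule product in equation 16 caps the rank of the end-to-end Jacobian,
# and hence of its gram matrix.
import numpy as np

rng = np.random.default_rng(1)
d = 8
J_late  = rng.normal(size=(d, d))                            # full rank w.h.p.
J_mid   = rng.normal(size=(d, 5)) @ rng.normal(size=(5, d))  # rank 5 by construction
J_early = rng.normal(size=(d, d))                            # full rank w.h.p.

J_total = J_late @ J_mid @ J_early                           # end-to-end Jacobian
print(np.linalg.matrix_rank(J_mid), np.linalg.matrix_rank(J_total))   # 5 5
print(np.linalg.matrix_rank(J_total.T @ J_total))            # 5, so d - 5 = 3 zero eigenvalues
```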

3.2 Synchronized nonlinearities

In this section, we demonstrate a third example of internal structure that affects the effective parameter count of the model. The two examples presented in the previous sections can be thought of as showing how a network with fewer relevant variables in a layer's representation is more degenerate. The example in this section shows how a network that performs “fewer operations” is more degenerate.

In a dense layer with piecewise linear activation functions (ReLU or LeakyReLU), the effective parameter count is reduced if two neurons are ‘on’ and ‘off’ on the same set of datapoints. We call neurons with this property synchronized with each other. For simplicity, in this section we consider a dense feedforward network with ReLU nonlinearities at each layer and the same hidden width $d$ throughout.

We define the neuron firing pattern

$$r^l_i(x) = \frac{f^l_i(x)}{p^l_i(x)}\ \text{ if } p^l_i(x)\neq 0,\quad \text{else } r^l_i(x) = 1\,, \qquad (17)$$

where $p^l_i(x) = \sum_j W^{l-1}_{i,j} f^{l-1}_j(x)$ is the preactivation of neuron $i$. We call two neurons $i$ and $j$ synchronized if they always fire simultaneously on the training data, $r^l_i(x) = r^l_j(x)\ \forall x\in\mathcal{D}$.
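The sketch below (our own helper functions, with hypothetical names) computes firing patterns according to equation 17 for a ReLU layer and groups neurons that are synchronized with each other; it is only meant to illustrate the definitions.

```python
# Sketch: compute ReLU firing patterns as in equation 17 and group neurons of a
# layer that share the same firing pattern over a dataset.
import numpy as np

def firing_patterns(preacts):
    """preacts: array of shape (n_datapoints, d). r = 1 where the neuron passes
    its preactivation through (p > 0, or p == 0 by the convention of eq. 17)."""
    r = (preacts > 0).astype(float)
    r[preacts == 0] = 1.0
    return r

def synchronized_groups(preacts):
    """Group neurons with identical firing patterns over the dataset."""
    r = firing_patterns(preacts)
    groups = {}
    for i in range(r.shape[1]):
        groups.setdefault(tuple(r[:, i]), []).append(i)
    return list(groups.values())

# Toy layer: neurons 0 and 1 are synchronized because their incoming weight
# rows are positive multiples of each other.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
W = rng.normal(size=(3, 4))
W[1] = 2.5 * W[0]
P = X @ W.T                          # preactivations, shape (100, 3)
print(synchronized_groups(P))        # e.g. [[0, 1], [2]]
```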

All synchronized

As a pedagogical aid, and to demonstrate a point about how the effective parameter count is invariant to linear layer transitions, we first consider the case in which all the neurons in layer $l+1$ are synchronized together in the same firing pattern $r^{l+1}(x)$. Then we can write:

$$\mathbf{f}^{l+2}(x) = \operatorname{ReLU}\!\left(W^{l+1}\operatorname{ReLU}\!\left(W^{l}\mathbf{f}^{l}(x)\right)\right) = \operatorname{ReLU}\!\left(W^{l+1}\,r^{l+1}(x)\,W^{l}\mathbf{f}^{l}(x)\right),$$

meaning $W^l$ and $W^{l+1}$ effectively act as a single $d\times d$ matrix $\tilde{W} = W^{l+1}W^{l}$. Thus, any setting of the weights $W^l$ and $W^{l+1}$ that yields the same $\tilde{W}$ does not change the network's outputs on the training data, so long as we avoid changing any of the $r^{l+1}_i(x)$. We can ensure that the $r^{l+1}_i(x)$ do not change as we vary the weights by restricting ourselves to alternative weight matrices

$$W^{l+1}\to W^{l+1}C^{-1},\quad W^{l}\to CW^{l}\qquad\text{with }C\text{ invertible and }C_{i,j}\geq 0\ \forall\, i,j\,. \qquad (18)$$

Note that a linear layer (without an activation function, i.e. $f_i = p_i$) is just a special case of all neurons being synchronized, with $r^{l+1}_i(x) = 1$ for all $i$ and $x$. When $W^l$ is full rank, the drop in the effective parameter count from full synchronisation is the number of parameters in layer $l$. So we see that, from the perspective of the effective parameter count, linear transitions ‘do not cost anything’: including the linear transition in the model does not meaningfully increase the effective parameter count compared to skipping the layer entirely. We are simply passing variables to the next layer without computing anything new with them (see Aoyagi (2024) for a more complete treatment of effective parameter counts in deep linear networks).
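As a sanity check on equation 18, the toy example below (our own construction and variable names) builds a layer whose neurons are all synchronized by design and verifies numerically that the reparameterization leaves the outputs on the dataset unchanged.

```python
# Numerical check: when all neurons in layer l+1 are synchronized, replacing
# W^{l+1} -> W^{l+1} C^{-1} and W^l -> C W^l with an invertible, entrywise
# nonnegative C does not change the outputs on the data (equation 18).
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0.0)

d, n = 4, 50
X = rng.normal(size=(n, d))

# All neurons in layer l+1 synchronized: every row of W_l is a positive multiple
# of the same base row, so all preactivations share a sign on each input.
base = rng.normal(size=d)
W_l  = np.outer(rng.uniform(0.5, 2.0, size=d), base)
W_l1 = rng.normal(size=(d, d))

C = rng.uniform(0.1, 1.0, size=(d, d))       # entrywise positive, invertible w.h.p.

out_before = relu(relu(X @ W_l.T) @ W_l1.T)
out_after  = relu(relu(X @ (C @ W_l).T) @ (W_l1 @ np.linalg.inv(C)).T)
print(np.allclose(out_before, out_after))    # True
```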

Synchronized blocks

Now, we consider the general case of arbitrary neuron pairs in a layer being synchronized or approximately synchronized. We can organise the neurons into sets $S_a$, $a = 1,\dots,a_\text{max}$, such that all neurons in a set share the same firing pattern $r^{l+1}_{S_a}(x)$. We call these sets synchronized blocks. This grouping is well-defined because synchronisation is a transitive property: if $r^{l+1}_1(x) = r^{l+1}_2(x)$ and $r^{l+1}_1(x) = r^{l+1}_3(x)$, then $r^{l+1}_2(x) = r^{l+1}_3(x)$.

Each neuron belongs to one block, so $\sum_{a=1}^{a_\text{max}}|S_a| = d$. Then we have:

$$f^{l+2}_i(x) = \operatorname{ReLU}\!\left(\sum_j\sum_{a=1}^{a_\text{max}} r^{l+1}_{S_a}(x)\sum_{k\in S_a} W^{l+1}_{ik} W^{l}_{kj} f^{l}_j(x)\right). \qquad (19)$$

We can replace $W^{l+1}\to W^{l+1}C^{-1}$, $W^{l}\to CW^{l}$, where the matrix $C$ has a block-diagonal structure

$$C = \begin{pmatrix} C_{[1]} & & 0\\ & \ddots & \\ 0 & & C_{[a_\text{max}]} \end{pmatrix}\quad\text{with invertible blocks } C_{[a]}\in\mathbb{R}^{|S_a|\times|S_a|}\ \text{ and }\ C_{[a],k',k} > 0\ \forall\, k,k'\in(1,\dots,|S_a|)\,.$$

Just as we do not expect activations and gradients to have exact rank drops, we do not expect exact neuron synchronisation to be common in real models. Instead, we can consider two neurons to be approximately synchronized if their activations only meaningfully differ on a few datapoints. Numerically, we can define:

$$|r^{l+1}_a|^2 := \frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}\sum_{i,i'\in S_a}\left(r^{l+1}_i(x)\,p^{l+1}_i(x) - r^{l+1}_i(x)\,p^{l+1}_{i'}(x)\right)^2. \qquad (20)$$

If $|r^{l+1}_a|^2$ is non-zero but small, choosing different weight matrices as above will only increase the loss by an amount of order $O(|r^{l+1}_a|^2)$.
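A sketch of how this quantity could be computed in practice, following equation 20 as stated; the helper names and the toy data are ours.

```python
# Sketch implementation of the block score in equation 20, as stated: `preacts`
# holds the preactivations p^{l+1} over the dataset and `block` the neuron
# indices of one candidate block S_a.
import numpy as np

def firing_pattern(p):
    r = (p > 0).astype(float)
    r[p == 0] = 1.0
    return r

def block_score(preacts, block):
    P = preacts[:, block]                    # shape (n, |S_a|)
    R = firing_pattern(P)
    n = P.shape[0]
    total = 0.0
    for i in range(len(block)):
        for j in range(len(block)):
            total += np.sum((R[:, i] * P[:, i] - R[:, i] * P[:, j]) ** 2)
    return total / n

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
w = rng.normal(size=5)
# Two neurons whose gated preactivations barely differ: the score is small.
preacts = np.stack([X @ w, 1.01 * (X @ w) + 0.01 * rng.normal(size=200)], axis=1)
print(block_score(preacts, [0, 1]))
```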

Degeneracy counting

For each pair of synchronized neurons $r^{l+1}_i(x), r^{l+1}_{i'}(x)$, we can set a pair of off-diagonal entries $C_{k,k'}, C_{k',k}$ in $C$ to arbitrary positive values when we change the weights to $W^{l+1}\to W^{l+1}C^{-1}$, $W^{l}\to CW^{l}$. If $W^l$ is full rank, rows $k$ and $k'$ of $W^l$ are linearly independent, so this synchronized pair results in two free directions in parameter space. Thus, each synchronized block contributes one free direction per ordered pair of neurons in the block, which we can count as the number of neurons in each block squared, summed over blocks:

$$N^{l+1} = \sum_{a=1}^{a_\text{max}} |S_a|^2.$$

We then see that $N^{l+1}$ is highest if all the neuron firing patterns are synchronized, and lowest when all neurons have different firing patterns.

However, $W^l$ is not always full rank. Further, if we want to combine the degrees of freedom from neuron synchronisation with the other degrees of freedom discussed in this section, we have to be careful to avoid double-counting. If the activations in layer $l$ lie in a low-dimensional subspace, then some of the $d^2$ degrees of freedom above may already have been accounted for. If we remove those double-counted degrees of freedom and control for the rank of $W^l$, each synchronized block $S_a$ only provides additional degrees of freedom equal to the square of the dimension of the space spanned over the dataset $\mathcal{D}$ by the preactivations of the block, which we denote

$$s^{l+1}_a := \dim\!\left(\operatorname{span}\{p^{l+1}_k \mid k\in S_a\}\right)\,. \qquad (21)$$

So, more generally, the additional amount by which this degeneracy lowers the effective parameter count will be

$$N^{l+1} = \sum_a (s^{l+1}_a)^2\,. \qquad (22)$$

The trivial case of self-synchronisation, $i = i'$, is not excluded from this formula. It corresponds to the generic freedom to vary the diagonal entries $C_{k,k}$ of $C$ in a ReLU layer: scaling all the weights going into a neuron by $C_{k,k}\in\mathbb{R}^{+}$ and scaling all the weights going out of the neuron by $1/C_{k,k}$ does not change network behavior.
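The counting in equations 21 and 22 can be illustrated with a short sketch (ours, not the paper's code). It assumes the synchronized blocks have already been identified, for instance with a firing-pattern grouping as above, and it ignores the double-counting caveats discussed in the text.

```python
# Sketch of the degeneracy count in equations 21 and 22: for each synchronized
# block, s_a is the rank of the block's preactivations over the dataset, and
# the count is the sum of s_a squared.
import numpy as np

def degeneracy_count(preacts, blocks, tol=1e-8):
    """preacts: (n_datapoints, d) preactivations of layer l+1.
    blocks: list of lists of neuron indices (assumed synchronized blocks)."""
    count = 0
    for block in blocks:
        s_a = np.linalg.matrix_rank(preacts[:, block], tol=tol)
        count += s_a ** 2
    return count

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))
W = rng.normal(size=(4, 6))
W[1] = 1.7 * W[0]                      # block {0, 1}: proportional rows, so s_a = 1
preacts = X @ W.T                      # shape (300, 4)
blocks = [[0, 1], [2, 3]]              # assume these blocks were identified beforehand
print(degeneracy_count(preacts, blocks))   # 1**2 + 2**2 = 5
```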

Attention

A similar dynamic holds in the attention layers of transformers, with the attention patterns of different attention heads playing the role of the $\operatorname{ReLU}$ activation patterns. If two different attention heads $h_1, h_2$ in the same attention layer have synchronized attention patterns on the training dataset, their value matrices $W^{h_1}_V, W^{h_2}_V$ can be changed to add elements in the span of the value vectors of one head to the other head, with the output matrices $W^{h_1}_O, W^{h_2}_O$ that project results back into the residual stream being modified to undo the change. If $W^{h_1}_V, W^{h_2}_V$ are full rank, this results in $2d^2_\text{head}$ degrees of freedom in the loss landscape for each pair of synchronized attention heads, in addition to the generic $d^2_\text{head}$ degrees of freedom per attention head that are present in every transformer model. If $W^{h_1}_V$ or $W^{h_2}_V$ is not full rank, we account for this in the same way as we did for the neurons above.

4 Interaction sparsity from parameterisation-invariance

In the introduction, we argued that if we can represent a neural network in a parameterisation-invariant way, then this representation is likely to be a good starting point for reverse-engineering the computation in the network. The intuition behind this claim is that, in the standard representation, parts of the network which do not affect the outputs act to obfuscate and hide the relevant computational structure; once these are stripped away, computational structure is likely to become easier to see. One way this could manifest is through the new representation having greater interaction sparsity.

In this section, we demonstrate that picking the right representation can indeed lead to sparser interactions throughout the network. Specifically, we show that we can find a representation such that, for every drop in the effective parameter count caused by either (a) activation vectors not spanning the activation space (Section 3.1.1) or (b) neuron synchronisation (Section 3.2), there is at least one pair of basis directions in adjacent layers of the network that do not interact.

The role of this section is to provide a first example of a representation of a network which has been made invariant to some reparameterisations, and show that this representation has correspondingly fewer interactions between variables. The algorithm sketch used to find the representation here is not very suitable for selecting sparsely connected bases in practical applications, since it is somewhat cumbersome to extend to non-exact linear dependencies. We introduce a way to choose a basis for the activations spaces that is more suitable for practical applications in Section 6.

Consider a dense feedforward network with ReLU activation functions, with $N_\text{free}$ degrees of freedom in its parameterization that arise from a combination of

  1. The gram matrix of activation vectors in some layers being low rank, see Section 3.1.1.

  2. Blocks of neurons being synchronized, see Section 3.2.

We will now show that we can find a representation of the network that

  1. exploits the degrees of freedom due to low-dimensional activations to sparsify interactions through a re-parametrisation, and

  2. exploits the degrees of freedom from neuron synchronisation to sparsify interactions through a coordinate transformation, without losing the sparsity gained in step 1.

Sparsifying using low dimensional activations

Here, we show how to exploit the degrees of freedom in the network due to low-dimensional activations in the input layer to sparsify interactions.

Suppose that the gram matrix of the input-layer activations $\mathbf{f}^{(1)}(x)$, $G^{(1)}_{ij} = \frac{1}{n}\sum_x f^{(1)}_i(x) f^{(1)}_j(x)$, is not full rank. This means that we can take a set of $\operatorname{rank}(G^{(1)})$ neurons as a basis for the space spanned by the activations. This will be fewer neurons than the width $d^{(1)}$ of the input layer. Writing

$$\forall j\in\left(\operatorname{rank}(G^{(1)})+1,\dots,d^{(1)}\right):\quad f^{(1)}_j = \sum_{i=1}^{\operatorname{rank}(G^{(1)})} (c_j)_i\, f^{(1)}_i\,, \qquad (23)$$

we can replace the weights $W^{(1)}$ with new weights

$$\tilde{W}^{(1)}_{ij} := \begin{cases} W^{(1)}_{ij} + \sum_{k=\operatorname{rank}(G^{(1)})+1}^{d^{(1)}} (c_k)_j\, W^{(1)}_{ik} & 1\leq j\leq\operatorname{rank}(G^{(1)})\\ 0 & \operatorname{rank}(G^{(1)}) < j\leq d^{(1)} \end{cases} \qquad (24)$$

In this way we can disconnect $(d^{(1)} - \operatorname{rank}(G^{(1)}))$ neurons from the next layer without changing the activations in layer 2 at all on the training dataset, since $\tilde{W}^{(1)}\mathbf{f}^{(1)} = W^{(1)}\mathbf{f}^{(1)}$. For every degree of linear dependence we may have had in layer 1, we now have $d^{(2)}$ weights set to zero, where $d^{(2)}$ is the width of the second MLP layer. Since two neurons that are connected by a weight of 0 do not interact, this means that we can associate each drop in the effective parameter count caused by linear dependence between activations in layer 1 with a pair of nodes in the interaction graph which do not interact.
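The following minimal sketch (our own construction) carries out the replacement of equations 23 and 24 for a toy input layer with two linearly dependent neurons, assuming for simplicity that a spanning subset of neurons comes first in the ordering; it verifies that the layer-2 preactivations are unchanged and counts the zeroed weights.

```python
# Sketch of the re-parametrization in equations 23-24: fold the dependent
# neurons' weights into the spanning neurons and zero out their columns.
import numpy as np

rng = np.random.default_rng(6)
n, d1, d2 = 200, 5, 4
F1 = rng.normal(size=(n, d1))
F1[:, 3] = 2.0 * F1[:, 0] - 1.0 * F1[:, 1]     # neurons 3 and 4 are linear
F1[:, 4] = 0.5 * F1[:, 2]                      # combinations of neurons 0-2
W1 = rng.normal(size=(d2, d1))

rank = np.linalg.matrix_rank(F1)               # 3
# Express the dependent activations in terms of the first `rank` neurons
# (assumes, as in the text, that a spanning subset comes first).
C, *_ = np.linalg.lstsq(F1[:, :rank], F1[:, rank:], rcond=None)   # (rank, d1 - rank)

W1_tilde = np.zeros_like(W1)
W1_tilde[:, :rank] = W1[:, :rank] + W1[:, rank:] @ C.T
print(np.allclose(F1 @ W1.T, F1 @ W1_tilde.T))      # True: same preactivations
print((W1_tilde == 0).sum(), "weights set to zero")  # (d1 - rank) * d2 = 8
```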

Sparsifying using synchronized neurons

Now, we show that we can exploit the degrees of freedom in the network from the synchronisation of neurons in the first hidden layer to sparsify interactions without losing any of the sparsity we gained in the previous step.

Taking the example of the second layer $\mathbf{f}^{(2)}$, we want to find a new coordinate basis $\hat{\mathbf{f}}^{(2)} = C^{(2)}\mathbf{f}^{(2)}$ in which, for each drop in the effective parameter count caused by neuron synchronisation, there is at least one pair of variables $(\hat{f}^{(2)}_i, f^{(1)}_j)$ that does not interact.

To choose this basis, we start by finding all pairs of neuron firing patterns $r^{(2)}_i(x)$ in layer 2 that are synchronized and group them into synchronized blocks. Continuing with the notation of Section 3.2, we denote the blocks of synchronized neurons $S_a$, $a\in(1,\dots,a_\text{max})$, with sizes $|S_a|$, and we use $M_{[a]}$ to denote the $|S_a|\times|S_a|$ matrix with entries $M_{ij}$ for $i,j\in S_a$. Then, we choose the transformation $C^{(2)}$ to be block diagonal,

$$C^{(2)} = \begin{pmatrix} C^{(2)}_{[1]} & & 0\\ & \ddots & \\ 0 & & C^{(2)}_{[a_\text{max}]} \end{pmatrix}\,, \qquad (25)$$

with the blocks given by the inverse (more precisely, the pseudoinverse, since $\tilde{W}^{(1)}_{[a]}$ need not be invertible) of the $|S_a|\times|S_a|$ blocks of $\tilde{W}^{(1)}$:

$$C^{(2)}_{[a]} = \left(\tilde{W}^{(1)}_{[a]}\right)^{-1}\,, \qquad (26)$$
$$\tilde{W}^{(1)}_{[a]} := \begin{pmatrix} W^{(1)}_{\sigma_{a-1}+1,\,\sigma_{a-1}+1} & \cdots & W^{(1)}_{\sigma_{a},\,\sigma_{a-1}+1}\\ \vdots & \ddots & \vdots\\ W^{(1)}_{\sigma_{a-1}+1,\,\sigma_{a}} & \cdots & W^{(1)}_{\sigma_{a},\,\sigma_{a}} \end{pmatrix} \quad\text{for }\sigma_a = \sum_{b=1}^{a}|S_b| \qquad (27)$$

This coordinate transformation will set one interaction to zero per drop in the effective parameter count caused by neuron synchronisation. To see this, we first note that $C^{(2)}$ commutes with the nonlinearity applied between layers 1 and 2:

$$\forall x:\quad C^{(2)}\operatorname{ReLU}\!\left(W^{(1)}\mathbf{f}^{(1)}(x)\right) = \operatorname{ReLU}\!\left(C^{(2)}W^{(1)}\mathbf{f}^{(1)}(x)\right) \qquad (28)$$

The product $\hat{W}^{(1)} = C^{(2)}\tilde{W}^{(1)}$ will thus have diagonal blocks equal to the identity, $\hat{W}^{(1)}_{[a]} = \mathbf{I}_{|S_a|}$. This means $\hat{W}^{(1)}$ will have at least an additional $\sum_a (s^{(2)}_a)^2 - d^{(2)}$ entries that are zero: one non-interacting pair of nodes per degree of non-generic parametrization freedom caused by neuron synchronization (see equations 21 and 22). These absent interactions are distinct from the ones found in the previous step, which were due to the activation vectors in layer 1 not spanning the full activation space. Thus, the absent interactions add up to at least the number of degrees of freedom in the loss landscape stemming from low-dimensional activations in the input layer $\mathbf{f}^{(1)}$ or synchronized neurons in the first hidden layer $\mathbf{f}^{(2)}$.
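A short sketch of the coordinate change in equations 25 to 27, with toy numbers and names of our choosing: building the block-diagonal $C^{(2)}$ from pseudoinverses of the diagonal blocks of $\tilde{W}^{(1)}$ and checking that $\hat{W}^{(1)} = C^{(2)}\tilde{W}^{(1)}$ has identity diagonal blocks, and hence the promised within-block zeros.

```python
# Sketch: block-diagonal C^(2) built from pseudoinverses of the diagonal blocks
# of W~^(1), so that W-hat = C^(2) @ W~^(1) has identity diagonal blocks.
import numpy as np

rng = np.random.default_rng(7)
d1, d2 = 6, 6
W1_tilde = rng.normal(size=(d2, d1))

blocks = [[0, 1], [2], [3, 4, 5]]          # assumed synchronized blocks in layer 2

C2 = np.zeros((d2, d2))
for block in blocks:
    idx = np.ix_(block, block)
    C2[idx] = np.linalg.pinv(W1_tilde[idx])   # (pseudo)inverse of the diagonal block

W_hat = C2 @ W1_tilde
for block in blocks:
    print(np.round(W_hat[np.ix_(block, block)], 6))   # identity blocks

zeros_within_blocks = sum(len(b) ** 2 - len(b) for b in blocks)
print(zeros_within_blocks, "within-block interactions set to zero")
```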

Repeat for every layer

Now, we can repeat the previous two steps for all layers, moving recursively from the input layer to the output layer. We check whether the activation vectors in layer 2 fail to span the activation space and pick new weights $\tilde{W}^{(2)}$ accordingly. Then we check whether any neurons in layer 3 are synchronized and transform $\hat{\mathbf{f}}^{(3)} = C^{(3)}\mathbf{f}^{(3)}$ accordingly. We repeat this for every layer in the network.

We thus obtain new weight matrices, and a new basis for the activations of every layer in the network. Treating the new basis vectors in each layer as nodes in a graph, we can build a graph representing the interactions in the network. This graph will have two properties:

  1. It has at least one interaction that is zero for every drop in the effective parameter count introduced by neuron synchronisation or by activation vectors spanning a low-dimensional subspace.

  2. It is invariant to reparameterisations that exploit these degeneracies.

5 Modularity may contribute to degeneracy

A core goal of interpretability is breaking up a neural network into smaller parts, such that we can understand the entire network by understanding the individual parts. In this section we propose a particular notion of modularity that could be used to identify these smaller parts. We argue that this notion of modularity is likely to occur in real networks due to its relation to the LLC.

The core claim of this section is that more modular networks are biased towards a lower LLC. We argue that if the modules in a network interact less (i.e., the network is more modular), this yields a higher total degeneracy and thus a lower LLC. Each module has internal degeneracies: if two modules do not interact, then the degeneracies in each are independent of each other, so the total amount of degeneracy in the network (from these modules) is at least the sum of the amounts of degeneracy within each module. However, if the modules are interacting, then the degeneracies may interact with each other, and the total amount of degeneracy in the network can be smaller. Therefore, networks which have non-interacting or weakly interacting modules typically have more degeneracy and thus a lower LLC, which means that neural networks are biased towards solutions which are modular.

The argument in this section does not preclude non-modular networks from having a lower LLC than modular networks in any specific instance. Instead, this section presents an argument that, all else equal, modularity is associated with a lower effective parameter count. This argument could fail in practice if more modularity turns out to increase the effective parameter count of models for a different reason, or if real neural networks simply do not have low-loss modular solutions.

In Section 5.1 we define interacting and non-interacting degeneracies, and show that the total degeneracy is higher when individual degeneracies do not interact. In Section 5.2 we quantify how modularity affects the LLC by studying a series of increasingly realistic scenarios. First, we consider the case of two modules which do not interact at all in Section 5.2.1. Then we explore how to modify the analysis for modules which share a small number of interacting variables in Section 5.2.2. Finally, in Section 5.2.3 we extend our analysis to allow the strength of interactions to vary. We arrive at a modularity metric which can be used to search for modules in a computational graph.

5.1 Interacting and non-interacting degeneracies

[Figure 1: the loss landscape $L(w_1, w_2) = w_1^2 w_2^2$ discussed below, whose set of minima forms a cross rather than a plane.]

If a network's parameterization has a degeneracy, then there is some way the parameters of the network can change without changing the input-output behavior of the network. This change corresponds to a direction in parameter space along which the behavioral loss stays zero. We call such a direction a free direction in parameter space. It is also possible for a parameterization to have multiple degeneracies and thus multiple free directions.

We call a set of free directions non-interacting if traversing along one free direction does not affect whether the other directions remain free. In this case, the set of non-interacting free directions spans an entire free subspace of the parameter space. In a parameter space with $\theta = (w_1, w_2, w_3)$ and loss given by $L(w_1, w_2, w_3) = w_1^2$, we are free to pick any values of $w_2$ and $w_3$ while remaining at the minimum of the loss, provided that $w_1 = 0$. The region of constant loss is a 2-dimensional plane.

The set of free directions is called interacting if traversing along one free direction does affect whether other directions remain free. For an extreme example, consider the loss function $L(w_1, w_2) = w_1^2 w_2^2$ (Figure 1) at its minimum $(0,0)$. In this case there are two free directions, but when we traverse along one free direction the other direction ceases to be free. The region of constant loss does not span a full subspace (a 2-dimensional plane); here it resembles a cross (see Figure 1), which is a 1-dimensional object.

We can explicitly calculate the number of degrees of freedom (the difference between the nominal parameter count and the effective parameter count, equation 9) in each of these two loss landscapes. We find that the first landscape has two degrees of freedom but the second has only one. These are the two extremes of fully non-interacting and fully interacting free directions. It is also possible to construct intermediate loss landscapes in which the number of degrees of freedom arising from two free directions is a non-integer value between 1 and 2. In general, for a given set of free directions, the effective parameter count is lowest in the fully non-interacting case.
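For concreteness, here is a sketch of the volume-scaling computation behind these two counts, restricted to a bounded box around the minimum; it assumes the convention that the effective parameter count equals twice the learning coefficient $\lambda$ (we do not restate equation 9 here).

$$
\begin{aligned}
L(w_1,w_2,w_3) = w_1^2:&\quad \operatorname{Vol}\{L<\epsilon\}\propto\epsilon^{1/2} \;\Rightarrow\; \lambda = \tfrac{1}{2},\; N_\text{eff} = 2\lambda = 1,\; N_\text{free} = 3-1 = 2,\\
L(w_1,w_2) = w_1^2 w_2^2:&\quad \operatorname{Vol}\{L<\epsilon\}\propto\epsilon^{1/2}\log(1/\epsilon) \;\Rightarrow\; \lambda = \tfrac{1}{2},\; N_\text{eff} = 2\lambda = 1,\; N_\text{free} = 2-1 = 1.
\end{aligned}
$$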

5.2 Degeneracies in separate modules only interact if the modules are interacting

In this section we quantify the effect of perfect and near-perfect modularity on the effective parameter count, and equivalently on the LLC. We show that a network consisting of non-interacting modules has a low effective parameter count, and that a network whose modules interact through a single variable has an only slightly higher effective parameter count.

Consider a modular neural network $\mathbf{f}_\theta(x)$ consisting of two parallel modules $M_1$ and $M_2$. The modules take in different variables $x_1, x_2$ from the input $x = (x_1, x_2)$, and the output of the network is the concatenation of the module outputs, $\mathbf{f}_\theta(x) = (M_1(x_1), M_2(x_2))$. We assign every activation direction in the network to either $M_1$ or $M_2$.

We split the parameter space $\Theta$ into three subspaces: $\Theta = \Theta_1 \oplus \Theta_2 \oplus \Theta_{1\leftrightarrow 2}$. The parameters $\theta_1 \in \Theta_1$ are the parameters inside $M_1$ (i.e. parameters that affect interactions between two activations within $M_1$), $\theta_2 \in \Theta_2$ are the parameters inside $M_2$, and $\theta_{1\leftrightarrow 2} \in \Theta_{1\leftrightarrow 2}$ are the parameters which affect interactions between activations of both modules.

5.2.1 Non-interacting case

We start by analyzing a network consisting of two perfectly separated modules; the values of activations in $M_1$ have no effect on activations in $M_2$, i.e. $\theta_{1\leftrightarrow 2} = 0$, and the network output is given by

$$\mathbf{f}_\theta(x) = \left(M_1(\theta_1, x_1),\, M_2(\theta_2, x_2)\right). \qquad (29)$$

Consider now two free directions in parameter space, where one lies entirely in $\Theta_1$ and the other lies entirely in $\Theta_2$. Since $M_1$ and $M_2$ share no variables and do not interact, there is no way for a change to parameters along one free direction to affect the freedom of the other direction. Therefore, one-dimensional degeneracies that are in different disconnected modules must be non-interacting. By contrast, if $M_1$ and $M_2$ were connected, their free directions could interact.

We break up the behavioral loss with respect to this network into three terms:

$$L_B(\theta \mid \theta^*, \mathcal{D}) = L_1(\theta_1 \mid \theta^*_1, \mathcal{D}) + L_2(\theta_2 \mid \theta^*_2, \mathcal{D}) + L_{1\leftrightarrow 2}(\theta_1, \theta_2, \theta_{1\leftrightarrow 2} \mid \theta^*_1, \theta^*_2, 0, \mathcal{D}) \qquad (30)$$

$L_1$ and $L_2$ are the parts of the behavioral loss that involve only $\theta_1$ and $\theta_2$ respectively, and $L_{1\leftrightarrow 2}$ contains all the other parts. So long as we ensure $\theta_{1\leftrightarrow 2}=0$, we have $L_{1\leftrightarrow 2}=0$. Then a calculation shows that the overall number of degrees of freedom ($N_{\text{free}} = N - N_{\text{eff}}$) for this behavioral loss, restricted to the subspace in which $\theta_{1\leftrightarrow 2}=0$, is equal to the sum of the number of degrees of freedom in each module.

There could be additional free directions that involve moving $\theta_{1\leftrightarrow 2}$ away from $0$. These free directions are not guaranteed to be non-interacting with the free directions in each module, and our argument says nothing about how large the additional contributions to the effective parameter count from varying $\theta_{1\leftrightarrow 2}$ may be.
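As a concrete sanity check on this counting argument, the sketch below builds a toy behavioral loss for two separated modules that each compute a product $a_i b_i x_i$, so that each module has exactly one free direction at a generic point, and counts near-zero Hessian eigenvalues. The setup and all names are hypothetical illustrations, not code from this work.

```python
# Toy check that degrees of freedom add up across perfectly separated modules.
# Each module computes a_i * b_i * x_i, so only the product a_i * b_i matters:
# at a generic point each module contributes one free (zero-curvature) direction.
import torch

torch.manual_seed(0)
xs = torch.randn(64, 2, dtype=torch.float64)        # column 0 feeds M1, column 1 feeds M2
theta_star = torch.tensor([1.3, 0.7, -0.5, 2.0], dtype=torch.float64)  # (a1, b1, a2, b2)

def f(theta, x):
    a1, b1, a2, b2 = theta
    return torch.stack([a1 * b1 * x[:, 0], a2 * b2 * x[:, 1]], dim=1)

def behavioral_loss(theta):
    # MSE between outputs at theta and at the reference point theta_star
    return ((f(theta, xs) - f(theta_star, xs)) ** 2).mean()

def count_free_directions(loss_fn, point, tol=1e-8):
    H = torch.autograd.functional.hessian(loss_fn, point)
    return int((torch.linalg.eigvalsh(H).abs() < tol).sum())

def loss_m1(theta1):  # behavioral loss restricted to module 1's parameters
    return ((theta1[0] * theta1[1] * xs[:, 0] - theta_star[0] * theta_star[1] * xs[:, 0]) ** 2).mean()

def loss_m2(theta2):  # behavioral loss restricted to module 2's parameters
    return ((theta2[0] * theta2[1] * xs[:, 1] - theta_star[2] * theta_star[3] * xs[:, 1]) ** 2).mean()

n_joint = count_free_directions(behavioral_loss, theta_star.clone())
n_1 = count_free_directions(loss_m1, theta_star[:2].clone())
n_2 = count_free_directions(loss_m2, theta_star[2:].clone())
print(n_joint, n_1 + n_2)  # expect 2 == 1 + 1
```

Because the behavioral loss is block diagonal in the two modules' parameters when $\theta_{1\leftrightarrow 2}=0$, the joint count is simply the sum of the per-module counts.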

5.2.2 Adding in interactions between modules

Next, we consider the case where a small set of activations $v_1, \dots, v_m$ inside $M_1$ causally affect the value of some activations inside $M_2$ (because not all of the parameters in $\theta_{1\leftrightarrow 2}$ are 0). The two modules now interact with each other. In that case, the only degeneracies in $M_1$ which are guaranteed not to interact with the degeneracies in $M_2$ are those which do not affect the value of any of the $v_i$.

Picture $M_1$ as a causal graph, where the nodes are activations and the edges are weights or nonlinearities. The nodes inside $M_1$ are connected to the ‘outside’ of $M_1$ via (a) the input layer, where $M_1$ takes in inputs, (b) the output layer, where $M_1$ passes on its outputs, and (c) the ‘mediating’ nodes $v_i$, where variations affect what happens inside $M_2$. The free directions inside $M_1$ that are guaranteed not to interact with free directions outside $M_1$ are those that leave this entire interaction surface invariant: the directions which do not change any of the mediating nodes as we traverse along them. Each mediating node is an additional constraint on which free directions are guaranteed to be non-interacting. The more approximately independent nodes there are on the interaction surface, the fewer free directions in $M_1$ we should generically expect to satisfy all of these constraints.

In the previous section, we argued that the number of degrees of freedom of the network with non-interacting modules, restricted to the subset of parameter space in which $\theta_{1\leftrightarrow 2}=0$, equals the sum of the degrees of freedom in each module. In this section, $\theta_{1\leftrightarrow 2}^* \neq 0$, and restricting to the subset of parameter space in which $\theta_{1\leftrightarrow 2}=\theta_{1\leftrightarrow 2}^*$ is no longer sufficient to make the argument go through, because the degeneracies interact.

To fix the argument, we introduce the constrained loss function for parameters in $M_1$:

$L_{1,C}(\theta_1 \mid \theta_1^*, \mathcal{D}, v_1, \dots, v_m) = L_1(\theta_1 \mid \theta_1^*, \mathcal{D}) + \frac{1}{n}\sum_{i=1}^{m}\sum_{x\in\mathcal{D}}\left(v_i(\theta_1^*, x) - v_i(\theta_1, x)\right)^2$  (31)

This loss function is the same as the part of the behavioral loss that depends only on parameters in $M_1$, except that it has extra MSE terms added to ensure that the points with very small loss also preserve the values of $v_1, \dots, v_m$ on all datapoints. This means its learning coefficient is higher than for the unconstrained behavioral loss. The key property of the constrained loss landscape is that its free directions are guaranteed to be non-interacting with free directions in the loss landscape of $L_2$. Therefore, we can say that the total effective parameter count of the network consisting of two interacting modules, when constrained to the subspace $\theta_{1\leftrightarrow 2}=\theta_{1\leftrightarrow 2}^*$, really is twice the sum of the learning coefficients for the loss functions $L_2$ and $L_{1,C}$.[9]

[9] For simplicity, we have considered the case in which nodes in $M_1$ affect nodes in $M_2$ but not the converse. If we wanted interactions to be bidirectional, we could modify the argument of this section by introducing a second constrained loss function $L_{2,C}$.
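To make this concrete, here is a minimal sketch of how the constrained loss of equation 31 could be computed. The function names (`module1`, `mediating_nodes`) are hypothetical placeholders, and $L_1$ is stood in for by an MSE between $M_1$'s outputs at $\theta_1$ and at $\theta_1^*$; this is only an illustration under those assumptions, not this paper's implementation.

```python
# Sketch of the constrained loss L_{1,C} from equation 31 (hypothetical interfaces).
import torch

def constrained_loss(theta1, theta1_star, dataset, module1, mediating_nodes):
    """module1(theta, x): outputs of M1 on a batch x.
    mediating_nodes(theta, x): tensor of shape (batch, m) with the values v_i(theta, x)."""
    # Stand-in for L_1: match M1's outputs at theta1 to those at theta1_star.
    l1 = ((module1(theta1, dataset) - module1(theta1_star, dataset)) ** 2).mean()
    # Extra MSE terms: also preserve every mediating activation on every datapoint.
    v = mediating_nodes(theta1, dataset)
    v_star = mediating_nodes(theta1_star, dataset)
    constraint = ((v - v_star) ** 2).sum(dim=1).mean()   # (1/n) sum_i sum_x (v_i* - v_i)^2
    return l1 + constraint
```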

As before, there could be additional free directions that involve moving $\theta_{1\leftrightarrow 2}$ away from $\theta_{1\leftrightarrow 2}^*$, which may interact with the free directions in each module. Since we have not characterized the effect of these free directions on the effective parameter count, we cannot confidently conclude that networks with more separated modules reliably have lower effective parameter counts overall. For example, it may be that on most real-world loss landscapes there are many more non-modular solutions than modular ones, and that the point in parameter space with lowest loss and lowest effective parameter count is typically not modular. However, we are not aware of any compelling reason why non-modular networks would have an advantage in terms of low effective parameter counts to offset the advantage of modular networks discussed in this section.

5.2.3 Varying the strength of an interaction

In the previous section, we discussed the case where two modules interact via $m$ nodes. However, this model had no notion of how strong an interaction is: every node inside $M_1$ either is on the interaction surface or it is not, and every node on the interaction surface affects the nodes inside $M_2$ by the same amount. In real networks, the extent to which one activation can affect another is continuous. Therefore, we'd like to be able to answer questions like the following:

Suppose that we have two networks, both consisting of two modules $M_1$ and $M_2$. In the first network, there is a single node inside $M_1$ that strongly influences $M_2$, and in the second there are two nodes inside $M_1$ that both weakly influence $M_2$. Which of these two networks is likely to have a lower effective parameter count?

In this section we'll attempt to answer this question. To do so, we will make use of the notion of an effective parameter count at a finite loss cutoff $\epsilon$ (Section 2.2.2). We show that the magnitudes of the connections through different independent mediating nodes $v_1, v_2$ appear to add approximately logarithmically when determining the effective ‘size’ of the total interaction surface between modules.

As before, we consider two modules $M_1$ and $M_2$, connected through a number of mediating variables $v_1, \dots, v_m$ that are part of $M_1$ and which $M_2$ depends on. Let each of these mediating variables connect to $M_2$ through a single weight, $w_1, \dots, w_m$.[10]

[10] We could also consider $w_i$ to be the sum of the weights connecting node $v_i$ to $M_2$.

If $w_i$ is sufficiently small relative to the loss cutoff $\epsilon$, the connection between modules via $v_i$ is so weak that it can be treated as no connection at all from the perspective of interactions between free directions in different modules. This is the case if traversing along both free directions simultaneously increases the loss by an amount smaller than $\epsilon$.

Quantitatively, if we traverse along a free direction in $\Theta_1$ that changes the value of $v_i(\theta_1 \mid x)$, then for small enough $\epsilon$ (and a network with locally smooth-enough activation functions), the resulting change in the MSE loss of the whole network $L$ will be proportional to $w_i^2$. If $w_i = O\!\left(\epsilon^{1/2}\right)$, the connection is ‘effectively zero’ relative to the given cutoff $\epsilon$, in the sense that the volume of points with $L(\theta) < \epsilon$ is not substantially impacted by the terms in the loss involving $w_i$.

Now we consider larger connections $w_i = \epsilon^{k_i}$ with $k_i \in (0, \frac{1}{2})$. We can model this situation by taking the size of $w_i$ into account in the constrained loss (equation 31). We define the weighted constrained loss as a sum over mean squared errors for preserving each mediating variable, weighted by the size of the corresponding connection:

$L_{1,C}(\theta_1 \mid \theta_1^*, \theta_{1\leftrightarrow 2}^*, \mathcal{D}, v_1, \dots, v_m) = L_1(\theta_1 \mid \theta_1^*, \mathcal{D}) + \frac{1}{n}\sum_{i=1}^{m}\epsilon^{2k_i}\sum_{x\in\mathcal{D}}\left(v_i(\theta_1^*, x) - v_i(\theta_1, x)\right)^2$  (32)

where we have made $L_{1,C}$ depend on $\theta^*_{1\leftrightarrow 2}$ because the $w_i$ are parameters in $\theta^*_{1\leftrightarrow 2}$. We are then interested in how much smaller the learning coefficient for the loss landscape $L_1$ is than the learning coefficient for the landscape $L_{1,C}$, as a function of the loss cutoff $\epsilon$. This depends heavily on the details of the model. If the constraints are completely independent, we could perhaps model the presence of each constraint as destroying some number $\gamma_i$ of degrees of freedom compared to the model in which the constraints were not present (and the modules were fully non-interacting):

$N_{\text{eff},C} = N_{\text{eff}} + \sum_{i=1}^{m}\gamma_i\,.$

Now, we seek an expression for $\gamma_i$ in terms of $w_i$. Since we require $L_B(\theta) < \epsilon$, and each term in $L_B$ is positive, each weighted constraint term must itself be smaller than $\epsilon$. Rearranging, we find that

$\frac{1}{n}\sum_{x\in\mathcal{D}}\left(v_i(\theta_1^*, x) - v_i(\theta_1, x)\right)^2 < \epsilon^{1-2k_i} = \tilde{\epsilon}_i\,.$  (33)

Therefore, the weights $\epsilon^{2k_i}$ of each constraint effectively correspond to measuring the volume of points satisfying that constraint at a larger loss cutoff $\tilde{\epsilon}_i = \epsilon^{1-2k_i}$. Now, we make the assumption that if all the weights were 1, then each constraint would be responsible for removing a similar number $\tilde{\gamma}$ of degrees of freedom from the network. In other words, each constraint would restrict the volume of parameter space that achieves loss less than $\epsilon$ by the same amount. Rescaling this region by the factor $\epsilon^{1-2k_i}$, we find that:

$\gamma_i = \left(1 - 2k_i\right)\tilde{\gamma} = \left(1 - 2\,\frac{\log w_i}{\log\epsilon}\right)\tilde{\gamma}\,.$  (34)

Therefore, the size of the logarithm of the weight $w_i$ relative to the logarithm of the cutoff $\epsilon$ becomes a prefactor reducing the number of degrees of freedom removed by constraint $i$. If $w_i = 1$, then $\gamma_i = \tilde{\gamma}$, and if $w_i \leq \epsilon^{1/2}$, then $\gamma_i = 0$.[11]

[11] For $w_i < \epsilon^{1/2}$, the connection is effectively zero at the resolution available at loss cutoff $\epsilon$.
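Equation 34 is easy to read off numerically. The helper below is a hypothetical illustration (not from this work): it returns the fraction of $\tilde{\gamma}$ degrees of freedom that a connection of strength $w$ removes at cutoff $\epsilon$, clipped to the regime where the formula applies.

```python
# Fraction of gamma-tilde degrees of freedom removed by a mediating connection
# of size w at loss cutoff epsilon, following equation 34 (illustrative helper).
import math

def dof_removed_fraction(w, epsilon):
    """Returns (1 - 2*log(w)/log(epsilon)), clipped to [0, 1]."""
    k = math.log(w) / math.log(epsilon)       # w = epsilon**k
    return min(max(1.0 - 2.0 * k, 0.0), 1.0)

eps = 1e-6
print(dof_removed_fraction(1.0, eps))             # 1.0: a full-strength connection
print(dof_removed_fraction(eps ** 0.5, eps))      # 0.0: effectively no connection
print(dof_removed_fraction(eps ** 0.25, eps))     # 0.5: a 'half-strength' constraint
```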

With this in mind, let us return to the question introduced at the start of this section. We will call the network with two weak interactions between modules network $A$, with two mediating nodes $v_{A,1}, v_{A,2}$ and mediating weights $w_{A,1} = w_{A,2}$. Likewise, we denote the network with one strong interaction between modules by network $B$, with one mediating node $v_{B,1}$ and one mediating weight $w_B$. How large must $w_B$ be compared to $w_{A,1}$ and $w_{A,2}$ for the interactions between modules in network $B$ to effectively remove the same number of degrees of freedom as the interactions between modules in network $A$? Using equation 34, we find that

$\log\!\left(\frac{w_B}{\epsilon^{1/2}}\right) = \log\!\left(\frac{w_{A,1}}{\epsilon^{1/2}}\right) + \log\!\left(\frac{w_{A,2}}{\epsilon^{1/2}}\right)\,.$  (35)

So, the analysis in this section implies that connections through different mediating nodes should be considered to add together logarithmically for the purpose of estimating the number of interaction terms between degrees of freedom that live in different modules. In practice, the constraints that different mediating variables impose on the weighted constrained loss (equation 32) are rarely completely independent, so this should be seen as a rough approximation to be used as a starting guess for the relevant scale of the problem.

If circuits in neural networks correspond to modules, the analysis in this section implies that we could identify circuits in networks by searching for a partition of the interaction graph of the network into modules which minimises the sum of logs of cutoff-normalised interaction strengths between modules.
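As an illustration of what such a search could look like, here is a minimal sketch (a hypothetical setup, not this paper's method) that scores a candidate partition by summing $\log(w/\epsilon^{1/2})$ over cross-module interactions stronger than the cutoff scale; a search procedure would then minimise this score over partitions.

```python
# Score a partition of an interaction graph by the sum of logs of
# cutoff-normalised interaction strengths crossing module boundaries.
import numpy as np

def cross_module_cost(strengths, partition, epsilon):
    """strengths: (n, n) array of pairwise interaction strengths w_ij >= 0.
    partition: length-n array of module labels.
    Only edges stronger than sqrt(epsilon) contribute, matching gamma_i = 0 below that scale."""
    cutoff = np.sqrt(epsilon)
    cost = 0.0
    n = strengths.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if partition[i] != partition[j] and strengths[i, j] > cutoff:
                cost += np.log(strengths[i, j] / cutoff)
    return cost

# Two strongly connected blocks with weak cross-links: the block-respecting
# partition scores ~0, while a partition that mixes the blocks scores much higher.
rng = np.random.default_rng(0)
W = np.abs(rng.normal(size=(6, 6))) * 1e-3
W[:3, :3] += 1.0
W[3:, 3:] += 1.0
print(cross_module_cost(W, [0, 0, 0, 1, 1, 1], epsilon=1e-4))
print(cross_module_cost(W, [0, 1, 0, 1, 0, 1], epsilon=1e-4))
```

A real search would of course not enumerate partitions by hand; the point is only that the score being minimised is the logarithmic quantity suggested by equation 35.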

6 The Interaction Basis

In this section, we propose a technique for representing a neural network as an interaction graph that is invariant to reparameterisations that exploit the freedoms in Sections 3.1.1 and 3.1.2. The technique consists of performing a basis transformation in each layer of the network to represent the activations in a different basis that we call the Interaction Basis.

This basis transformation removes degeneracies in activations and Jacobians of the layer, making the basis smaller. The basis is also intended to ‘disentangle’ interactions between adjacent layers as much as possible. While we do not know whether it accomplishes this in general, we do show that it does so when the layer transitions are linear: in that case, the layer transition becomes diagonal (appendix A). The interaction basis is invariant to invertible linear transformations,[12] meaning the basis itself is a largely coordinate-independent object, much like an eigendecomposition (see Section 6.2).

[12] Technically, as we will see, it is only invariant up to the uniqueness of the eigenvectors of a certain matrix. But in practice that usually just amounts to a freedom under reflections of coordinate axes.

We conjecture that if we apply the interaction basis transformation to a real neural network, the resulting representation is likely to be more interpretable. In a companion paper, Bushnaq etal. (2024), we develop the interaction basis further and test this hypothesis.

6.1 Motivating the interaction basis

To find a transformation of a network's weights and activations that is invariant to reparameterisations based on low-rank activations or low-rank Jacobians, we take equation 10 and use equation 11 to rewrite it as

$H^{l,l'}_{ij,i'j'}(\theta^*) = \left.\frac{\partial^2 L}{\partial\theta^l_{i,j}\,\partial\theta^{l'}_{i',j'}}\right|_{\theta=\theta^*} = \frac{1}{n}\sum_{x\in\mathcal{D}} f^l_j(x)\, f^{l'}_{j'}(x)\sum_k \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l+1}_i}\,\frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l'+1}_{i'}}\,.$  (36)

Next, we make two presumptions of independence (Christiano et al., 2022), assuming that

  1. We can take expectations over the activations and Jacobians in each layer independently;

  2. Different layers are somewhat independent, such that the Hessian eigenvectors can be largely localised to a particular layer.

Both of these assumptions are investigated by Martens and Grosse (2020), who test their validity in small networks and use them to derive a cheap approximation to the Hessian and its inverse.

This allows us to approximate the Hessian as

$H^{l,l'}_{ij,i'j'}(\theta^*) \approx \delta_{l,l'}\left[\frac{1}{n}\sum_{x\in\mathcal{D}} f^l_j(x)\, f^l_{j'}(x)\right]\left[\frac{1}{n}\sum_{x\in\mathcal{D}}\sum_k \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l+1}_i}\,\frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l+1}_{i'}}\right]\,.$  (37)

This effectively turns the Hessian into a product of two matrices: a Gram matrix of activations in each layer

$G^l_{jj'} = \frac{1}{n}\sum_{x\in\mathcal{D}} f^l_j(x)\, f^l_{j'}(x)$  (38)

and a Gram matrix of Jacobians with respect to the next layer’s preactivations

$K^l_{ii'} = \frac{1}{n}\sum_{x\in\mathcal{D}}\sum_k \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l+1}_i}\,\frac{\partial f^{l_{\text{final}}}_k(x)}{\partial p^{l+1}_{i'}}\,.$  (39)

We can then find the eigenvectors of this approximated Hessian by separately diagonalising these two matrices.
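For concreteness, the sketch below computes $G^l$ and $K^l$ for the hidden layer of a small, hypothetical two-layer network, using autograd for the Jacobians in equation 39. The architecture and all names are illustrative assumptions, not the models studied in this work.

```python
# Compute the activation Gram matrix G^l (eq. 38) and the Jacobian Gram matrix K^l (eq. 39)
# for the hidden layer of a toy two-layer network: x -> relu(W1 x) -> tanh(W2 relu(W1 x)).
import torch

torch.manual_seed(0)
d_in, d_l, d_out, n = 4, 5, 3, 256
W1, W2 = torch.randn(d_l, d_in), torch.randn(d_out, d_l)
X = torch.randn(n, d_in)

def activations_l(x):          # f^l(x), the layer-l activations
    return torch.relu(W1 @ x)

def output_from_preacts(p):    # f^{l_final} as a function of the next preactivations p^{l+1}
    return torch.tanh(p)

G = torch.zeros(d_l, d_l)
K = torch.zeros(d_out, d_out)
for x in X:
    f_l = activations_l(x)
    G += torch.outer(f_l, f_l)
    p_next = W2 @ f_l                                   # preactivations p^{l+1}
    J = torch.autograd.functional.jacobian(output_from_preacts, p_next)
    K += J.T @ J                                        # sum_k (df_k/dp_i)(df_k/dp_i')
G /= n
K /= n
```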

We would like to find a basis for $f^l$ that excludes directions connected exclusively to zero eigenvectors of the Hessian. That is, we want to exclude directions in $f^l$ that lie along zero eigenvectors of $G^l$, and directions that are mapped by the weight matrix $W^l$ to lie along zero eigenvectors of $K^l$.

To do this, we can backpropagate the Jacobians in equation 39 one step further to include the weight matrices $W^l$:

$M^l_{ii'} = \frac{1}{n}\sum_{x\in\mathcal{D}}\sum_k \frac{\partial f^{l_{\text{final}}}_k(x)}{\partial f^l_i}\,\frac{\partial f^{l_{\text{final}}}_k(x)}{\partial f^l_{i'}}\,,$  (40)

and then search for a basis for $f^l$ that diagonalises $M^l$ and $G^l$ at the same time. This basis will have one fewer basis vector for each zero eigenvalue of the Gram matrices of the activations and Jacobians, respectively. It will also exclude directions that lie in the null space of $W^l$.

The matrices $G^l, M^l$ are symmetric, so we can write $G^l = {U^l}^T D_G^l U^l$ and $M^l = {V^l}^T D_M^l V^l$ for diagonal $D_G, D_M$ and orthogonal $U^l, V^l$.

We can find a basis transformation $\hat{\mathbf{f}}^l = C^l \mathbf{f}^l$ in which both $G^l$ and $M^l$ are diagonal, in two steps:

  1. Apply a whitening transformation with respect to $G^l$: $\tilde{\mathbf{f}}^l = \left({D^l_G}^{1/2}\right)^{+} U^l \mathbf{f}^l$, where the plus denotes the Moore-Penrose pseudoinverse. If the activations in layer $l$ do not span the full activation space, then the Gram matrix $G^l$ is not full rank, and some diagonal entries of $D^l_G$ are zero. By choosing this pseudoinverse, we effectively eliminate all the degeneracies from low-rank activations from our final basis. In this basis, $\tilde{G}^l_{ij} = \delta_{ij}$.

  2. Now that $G^l$ is whitened, we can apply the transformation by $V^l$ which diagonalises $M^l$ without un-diagonalising $G^l$, since the identity matrix is isotropic.[13] At this point both $M^l$ and $G^l$ are diagonal, and $C^l$ is defined up to multiplication by a diagonal matrix. We choose to multiply at the end by $\left({D^l_M}^{1/2}\right)^{+}$ because this eliminates degeneracies from low-rank Jacobians.

[13] We need to be careful which coordinate basis we are working in: the entries of $V^l$ in the basis that whitens $G^l$ and in the standard basis are different.

We call the basis $\hat{\mathbf{f}}^l = \left({D^l_M}^{1/2}\right)^{+} V^l \left({D^l_G}^{1/2}\right)^{+} U^l\, \mathbf{f}^l$ the interaction basis. Basis vectors in this basis are aligned with the directions that affect the output most. In the case of a deep linear network, transforming to the interaction basis provably performs an SVD of each weight matrix, resulting in basis directions which are aligned with the principal components of the output of the network (see appendix A).
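The following sketch is a minimal numpy illustration of these two steps, assuming we already have the Gram matrices $G^l$ (equation 38) and $M^l$ (equation 40) for a layer; pseudoinverses handle any zero eigenvalues.

```python
# Sketch of the interaction basis C^l = (D_M^{1/2})^+ V^l (D_G^{1/2})^+ U^l from the
# Gram matrices G (activations, eq. 38) and M (Jacobians w.r.t. f^l, eq. 40).
import numpy as np

def pinv_sqrt(diag_vals, tol=1e-8):
    # Pseudoinverse of the square root of a diagonal matrix, given its diagonal.
    inv = np.where(diag_vals > tol, 1.0 / np.sqrt(np.clip(diag_vals, tol, None)), 0.0)
    return np.diag(inv)

def interaction_basis(G, M, tol=1e-8):
    # Step 1: whiten with respect to G. With eigh, G = Q_G @ diag(d_G) @ Q_G.T,
    # so the paper's U^l corresponds to Q_G.T.
    d_G, Q_G = np.linalg.eigh(G)
    whiten = pinv_sqrt(d_G, tol) @ Q_G.T                 # (D_G^{1/2})^+ U^l
    # Step 2: gradients transform with the (pseudo)inverse transpose of the activation
    # map, so M in the whitened coordinates is pinv(whiten).T @ M @ pinv(whiten).
    M_white = np.linalg.pinv(whiten).T @ M @ np.linalg.pinv(whiten)
    d_M, Q_M = np.linalg.eigh(M_white)
    # Rotate to diagonalise M, then rescale by (D_M^{1/2})^+.
    return pinv_sqrt(d_M, tol) @ Q_M.T @ whiten          # rows are interaction basis directions
```

Transforming a batch of layer-$l$ activations stored as rows is then a single matrix multiplication, `f_hat = activations @ C.T`.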

We made two simplifying assumptions of independence about the Hessian to motivate this basis. While they have been used in other contexts to some success, these are still strong assumptions. Future work might investigate alternative techniques for finding a basis without these assumptions. This might only be possible with an overcomplete basis, which could connect the framework of this paper to superposition.

6.2 Invariance to linear transformations

The Interaction Basis is largely a coordinate-independent object, in the sense that it is invariant under linear transformations. If we apply a transformation $\mathbf{f}^l \to \mathbf{f}_R^l = R\,\mathbf{f}^l$, $W^l \to W_R^l = W^l R^{-1}$ to the activation space, the final interaction basis is unchanged ($\hat{\mathbf{f}}^l_R = \hat{\mathbf{f}}^l$) for any $R \in \mathrm{GL}_{d^l}(\mathbb{R})$, up to trivial axis reflections, unless $M^l$ has repeated eigenvalues.

To show this, first note that in the whitened basis $\tilde{\mathbf{f}}^l = \left({D^l_G}^{1/2}\right)^{+} U^l \mathbf{f}^l$, $G^l$ is by definition always transformed to the identity matrix:

$\tilde{G}^l = \left({D^l_G}^{1/2}\right)^{+} G^l \left(\left({D^l_G}^{1/2}\right)^{+}\right)^{T} = \mathbf{I}\,.$  (41)

So if we whiten after applying the transformation $R$, $\tilde{\mathbf{f}}^l_R$ can only differ from $\tilde{\mathbf{f}}^l$ by an orthogonal transformation. Call this orthogonal matrix $Q_R$. In the whitened basis, $M_R^l$ will then be:

$M_R^l = Q_R\, M^l\, Q_R^T\,.$  (42)

So $M^l_R$ and $M^l$ only differ by an orthogonal transformation. The interaction basis is the eigenbasis of $M^l_R$ and of $M^l$, respectively. So long as a real matrix does not have degenerate eigenvalues, its eigendecomposition is basis invariant up to reflections, once a convention for the eigenvector normalisation is chosen. So if $M^l$ does not have repeated eigenvalues, we end up in the same interaction basis, up to reflections, whether we apply the transformation $R$ first or not. If $M^l$ does have repeated eigenvalues, the basis is still identical up to orthogonal transformations within the eigenspaces of $M^l$.
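This invariance is straightforward to check numerically. The sketch below uses random, hypothetical data and the `interaction_basis` function from the previous sketch: it applies a random invertible $R$ to the activations, transforms the stand-in output gradients with $R^{-T}$ accordingly, and confirms that the interaction basis coordinates agree up to sign flips.

```python
# Numerical check of invariance under f^l -> R f^l (toy data; assumes interaction_basis above).
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
F = rng.normal(size=(n, d))                              # layer-l activations f^l(x)
Grads = rng.normal(size=(n, d)) * np.arange(1, d + 1)    # stand-in gradients d f^{l_final}/d f^l

G, M = F.T @ F / n, Grads.T @ Grads / n

R = rng.normal(size=(d, d))                              # invertible with probability 1
F_R = F @ R.T                                            # activations transform as f -> R f
Grads_R = Grads @ np.linalg.inv(R)                       # gradients transform as g -> R^{-T} g
G_R, M_R = F_R.T @ F_R / n, Grads_R.T @ Grads_R / n

f_hat = F @ interaction_basis(G, M).T
f_hat_R = F_R @ interaction_basis(G_R, M_R).T
signs = np.sign(np.sum(f_hat * f_hat_R, axis=0))         # allow per-direction reflections
print(np.allclose(f_hat, f_hat_R * signs, atol=1e-6))    # expect True
```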

7 Related Work

Explaining generalisation

The inductive biases that lead deep neural networks to generalise well past their training data have been an object of extensive study (Zhang et al., 2021). Attempts to understand generalisation involve studying simplicity biases (Mingard et al., 2021) and are closely related to attempts to quantify model complexity, for example via VC dimension (Vapnik, 1998), Rademacher complexity (Mohri et al., 2018), or less widely known methods (Liang et al., 2019; Novak et al., 2018). This paper is heavily influenced by Singular Learning Theory (Watanabe, 2009), which uses the local learning coefficient (Lau et al., 2023) to quantify the effective number of parameters in the model via the flatness of minima in the loss landscape. The flatness of minima has been found to predict model generalisation, for example in Li et al. (2018) for networks trained on CIFAR-10. SLT has been used to study the formation of internal structure in neural networks (Chen et al., 2023; Hoogland et al., 2024). Understanding the internals of neural networks through the geometry of their loss landscapes was also proposed as a research direction in Hoogland et al. (2023).

Local structure of the loss landscape

Other works have investigated the structure of neural network loss landscapes and their degeneracies around solutions found in training. Martens and Grosse (2020) proposed that the Hessian matrix of MLPs can be approximated as factorising into independent outer products of activations and gradients, and that its eigenvectors can be approximated as being localised in a particular layer of the network. This approximation was later extended to CNNs, RNNs, and transformers in Grosse and Martens (2016); Martens et al. (2018); Grosse et al. (2023). The approximation was used by Wang et al. (2019) to compress models by pruning weights along directions with small Hessian eigenvalues. For deep linear networks, an analytical expression for the learning coefficient was derived in Aoyagi (2024). Generic degeneracies in the loss shared by all models with an MLP ReLU architecture were investigated in Carrol (2021), and degeneracies of one-hidden-layer MLPs with tanh activation functions in Farrugia-Roberts (2022). It has also been found that minima in the loss landscape can often be connected by a continuous path of minimum loss, for example in Draxler et al. (2019) for models trained on CIFAR.

Selection for modularity

In Filan et al. (2021), it was found that MLPs and CNNs trained on vision tasks showed more modularity in the weights connecting their neurons than comparable random networks. The observed tendency for biological networks created by evolution to be modular has been widely investigated, with various explanations for the phenomenon being proposed. Clune et al. (2013) offer a good overview of this work for machine learning researchers, and suggest direct minimisation of connection costs between components as a primary driver of modularity in biological networks. Kashtan and Alon (2005) propose that genetic algorithms select systems to be modular because this makes them more robust to modular changes in the systems' environments. In Liu et al. (2023), connection costs were used to regularise MLPs trained on various tasks, including modular addition, to be more modular in their weights, in order to make them more interpretable.

8 Conclusion

We introduced the idea that the presence of degeneracy in neural networks' parameterizations may be a source of challenges for reverse engineering them. We identified some of the sources of this degeneracy, and suggested a technique (the interaction basis) for removing this degeneracy from the representation of the network. We argued that this representation is likely to have sparser interactions, and we introduced a formula for searching for modules in the new representation of the network based on a toy model of how modularity affects degeneracy. The follow-up paper Bushnaq et al. (2024) tests a variant of the interaction basis, finding that it results in representations which are sparse, modular and interpretable on toy models, but much less useful when applied to LLMs.

9 Contribution Statement

LB developed the ideas in this paper with contributions from JM and KH. JM and LB developed the presentation of these ideas together. JM led the writing, with substantial support from LB, and feedback from SH and NGD. SH, DB, and NGD ran experiments to provide feedback on early versions of the interaction basis. CW ran experiments to test neuron synchronisation.

10 Acknowledgements

We thank Daniel Murfet, Tom McGrath, James Fox, and Lawrence Chan for comments on the manuscript, Dmitry Vaintrob for suggesting the concept of finite data SLT, and Vivek Hebbar, Jesse Hoogland, and Linda Linsefors for valuable discussions. Apollo Research is a fiscally sponsored project of Rethink Priorities.

References

  • Aoyagi (2024) Miki Aoyagi. Consideration on the learning efficiency of multiple-layered neural networks with linear units. Neural Networks, 172:106132, April 2024. doi: 10.1016/j.neunet.2024.106132.
  • Bushnaq et al. (2024) Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, and Marius Hobbhahn. The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks. arXiv preprint arXiv:2405.10928, May 2024.
  • Carroll (2021) Liam Carroll. Phase transitions in neural networks. Master’s thesis, School of Computing and Information Systems, The University of Melbourne, October 2021. URL http://therisingsea.org/notes/MSc-Carroll.pdf.
  • Carroll (2023) Liam Carroll. DSLT 1. The RLCT measures the effective dimension of neural networks, June 2023. URL https://www.alignmentforum.org/posts/4eZtmwaqhAgdJQDEg/dslt-1-the-rlct-measures-the-effective-dimension-of-neural.
  • Chan et al. (2022) Lawrence Chan, Adria Garriga-Alonso, Nix Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing: A method for rigorously testing interpretability hypotheses. Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
  • Chen et al. (2023) Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet. Dynamical versus Bayesian phase transitions in a toy model of superposition. arXiv preprint arXiv:2310.06301, 2023.
  • Christiano et al. (2022) Paul Christiano, Eric Neyman, and Mark Xu. Formalizing the presumption of independence. arXiv preprint arXiv:2211.06738, 2022.
  • Clune et al. (2013) Jeff Clune, Jean-Baptiste Mouret, and Hod Lipson. The evolutionary origins of modularity. Proceedings of the Royal Society B: Biological Sciences, 280(1755):20122863, March 2013. ISSN 1471-2954. doi: 10.1098/rspb.2012.2863. URL http://dx.doi.org/10.1098/rspb.2012.2863.
  • Conmy et al. (2023) Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023.
  • Conmy et al. (2024) Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36, 2024.
  • Draxler et al. (2019) Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A. Hamprecht. Essentially no barriers in neural network energy landscape, 2019.
  • Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
  • Farrugia-Roberts (2022) Matthew Farrugia-Roberts. Structural degeneracy in neural networks. Master’s thesis, School of Computing and Information Systems, The University of Melbourne, December 2022. URL https://far.in.net/mthesis.
  • Filan et al. (2021) Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, and Stuart Russell. Clusterability in neural networks, 2021.
  • Fusi et al. (2016) Stefano Fusi, Earl K. Miller, and Mattia Rigotti. Why neurons mix: high dimensionality for higher cognition. Current Opinion in Neurobiology, 37:66–74, 2016. ISSN 0959-4388. doi: 10.1016/j.conb.2016.01.010. URL https://www.sciencedirect.com/science/article/pii/S0959438816000118.
  • Geiger et al. (2021) Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.
  • Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer Feed-Forward Layers Are Key-Value Memories, September 2021. URL http://arxiv.org/abs/2012.14913. arXiv:2012.14913 [cs].
  • Goh et al. (2021) Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons.
  • Grosse and Martens (2016) Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers, 2016.
  • Grosse et al. (2023) Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, and Samuel R. Bowman. Studying large language model generalization with influence functions, 2023.
  • Hoogland (2023) Jesse Hoogland. Neural networks generalise because of this one weird trick. https://www.lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick, January 2023.
  • Hoogland et al. (2023) Jesse Hoogland, Alexander Gietelink Oldenziel, Daniel Murfet, and Stan van Wingerden. Towards developmental interpretability, July 2023. URL https://www.alignmentforum.org/posts/TjaeCWvLZtEDAS5Ex/towards-developmental-interpretability.
  • Hoogland et al. (2024) Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet. The developmental landscape of in-context learning, 2024.
  • Kashtan and Alon (2005) Nadav Kashtan and Uri Alon. Spontaneous evolution of modularity and network motifs. Proceedings of the National Academy of Sciences of the United States of America, 102:13773–8, October 2005. doi: 10.1073/pnas.0503610102.
  • Lau et al. (2023) Edmund Lau, Daniel Murfet, and Susan Wei. Quantifying degeneracy in singular models via the learning coefficient. arXiv preprint arXiv:2308.12108, 2023.
  • Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets, 2018.
  • Liang et al. (2019) Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 888–896. PMLR, 2019.
  • Liu et al. (2023) Ziming Liu, Eric Gan, and Max Tegmark. Seeing is believing: Brain-inspired modular training for mechanistic interpretability, 2023.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
  • Martens and Grosse (2020) James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature, 2020.
  • Martens et al. (2018) James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyMTkQZAb.
  • Meng et al. (2023) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2023.
  • Mingard et al. (2021) Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A. Louis. Is SGD a Bayesian sampler? Well, almost. Journal of Machine Learning Research, 22(79):1–64, 2021.
  • Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
  • Murfet (2020) Daniel Murfet. Singular learning theory IV: the RLCT. http://www.therisingsea.org/notes/metauni/slt4.pdf, April 2020. Lecture notes.
  • Murphy (2012) Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  • Nanda et al. (2023) Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023.
  • Nguyen et al. (2016) Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks, 2016.
  • Novak et al. (2018) Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760, 2018.
  • Olah et al. (2017) Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
  • Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.
  • Räuker et al. (2023) Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks, 2023.
  • Schwarz (1978) Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, pages 461–464, 1978.
  • Vapnik (1998) Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
  • Wang et al. (2019) Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis, 2019.
  • Wang et al. (2022) Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593, 2022.
  • Watanabe (2009) Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory, volume 25. Cambridge University Press, 2009.
  • Watanabe (2013) Sumio Watanabe. A widely applicable Bayesian information criterion. The Journal of Machine Learning Research, 14(1):867–897, 2013.
  • Wei et al. (2022) Susan Wei, Daniel Murfet, Mingming Gong, Hui Li, Jesse Gell-Redman, and Thomas Quella. Deep learning is singular, and that’s good. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Zhang et al. (2021) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

Appendix A The local interaction basis on deep linear networks

In this appendix, we show that the local interaction basis, a modified interaction basis in which gradients to the final layer are replaced with gradients to the immediately subsequent layer in order to sparsify interactions between adjacent layers, diagonalizes the interactions between neural network layers when the layer transitions are linear. In the experimental follow-up to this paper, Bushnaq et al. [2024] discuss the local interaction basis in more detail before testing it on real networks. The derivation for the non-local interaction basis follows the same structure.

In the absence of nonlinearities, a deep neural network is just a series of matrix multiplications (once an extra component with a constant value of 1 is appended to the activation vectors, to include the bias). The sparsest way to describe this series of matrix multiplications is to multiply out the network into a single matrix, and then to rotate the inputs into the right singular basis of this matrix and the outputs into its left singular basis. To see that transforming to the local interaction basis does indeed perform an SVD for deep linear networks, consider the penultimate layer of the network. We neglect mean centering to make this derivation cleaner, and start by transforming in layer $l_{\text{final}}-1$ to a basis which whitens the activations:

\begin{align*}
f^{l_{\text{final}}} &= W^{l_{\text{final}}-1} f^{l_{\text{final}}-1} \\
&= \underbrace{W^{l_{\text{final}}-1} \left(U^{l_{\text{final}}-1}\right)^{T} \left(D^{l_{\text{final}}-1}\right)^{\frac{1}{2}}}_{W^{\prime\, l_{\text{final}}-1}} \; \underbrace{\left(\left(D^{l_{\text{final}}-1}\right)^{\frac{1}{2}}\right)^{+} U^{l_{\text{final}}-1} f^{l_{\text{final}}-1}}_{f^{\prime\, l_{\text{final}}-1}}
\end{align*}

We’ve wrapped these transformations into definitions of $W^{\prime\, l_{\text{final}}-1}$ and $f^{\prime\, l_{\text{final}}-1}$. We’ll show that the other transformations perform an SVD of $W^{\prime\, l_{\text{final}}-1}$. First, we have to transform to the (uncentered) PCA basis in the final layer.

\begin{align*}
G^{l_{\text{final}}}_{ij} &= \frac{1}{n} \sum_{x} f^{l_{\text{final}}}_{i}(x)\, f^{l_{\text{final}}}_{j}(x) \\
&= \frac{1}{n} \sum_{x} W^{l_{\text{final}}-1}_{ik} f^{l_{\text{final}}-1}_{k}(x)\, W^{l_{\text{final}}-1}_{jm} f^{l_{\text{final}}-1}_{m}(x) \\
G^{l_{\text{final}}} &= W^{l_{\text{final}}-1} G^{l_{\text{final}}-1} \left(W^{l_{\text{final}}-1}\right)^{T} \\
&= W^{\prime\, l_{\text{final}}-1} \left(W^{\prime\, l_{\text{final}}-1}\right)^{T}
\end{align*}

where we have leveraged that $G^{l_{\text{final}}-1} = \left(U^{l_{\text{final}}-1}\right)^{T} D^{l_{\text{final}}-1} U^{l_{\text{final}}-1}$ by definition in the last step. Writing $W^{\prime\, l_{\text{final}}-1} = U_{W^{\prime}} \Sigma_{W^{\prime}} V_{W^{\prime}}^{T}$, we have that $G^{l_{\text{final}}} = U_{W^{\prime}} \Sigma_{W^{\prime}}^{2} U_{W^{\prime}}^{T}$, so $U^{l_{\text{final}}} = U_{W^{\prime}}^{T}$. Since there is no layer after the final layer, the $M$ matrix is not defined for the final layer, so the LI basis in the final layer is just the PCA basis. (This is also true in the non-local interaction basis, since $\frac{\partial f^{l_{\text{final}}}_{i}(x)}{\partial f^{l_{\text{final}}}_{j}} = \delta_{ij}$.)

\begin{align}
\hat{f}^{l_{\text{final}}} = U^{l_{\text{final}}} f^{l_{\text{final}}} = U_{W^{\prime}}^{T} W^{\prime\, l_{\text{final}}-1} f^{\prime\, l_{\text{final}}-1} \tag{43}
\end{align}

For the final part of the transformation into the LIB, we need to calculate $M$, which depends on the Jacobian of the LIB functions in the next layer with respect to the whitened PCA functions in the current layer:

\begin{align*}
M^{l_{\text{final}}-1}_{j,j^{\prime}} &= \frac{1}{n} \sum_{x} \frac{\partial \hat{f}^{l_{\text{final}}}_{i}(x)}{\partial f^{\prime\, l_{\text{final}}-1}_{j}} \frac{\partial \hat{f}^{l_{\text{final}}}_{i}(x)}{\partial f^{\prime\, l_{\text{final}}-1}_{j^{\prime}}} \\
M^{l_{\text{final}}-1} &= \left(W^{\prime\, l_{\text{final}}-1}\right)^{T} U_{W^{\prime}} U_{W^{\prime}}^{T} W^{\prime\, l_{\text{final}}-1} = \left(W^{\prime\, l_{\text{final}}-1}\right)^{T} W^{\prime\, l_{\text{final}}-1} \\
&= V_{W^{\prime}} \Sigma_{W^{\prime}}^{2} V_{W^{\prime}}^{T} \\
&=: \left(V^{l_{\text{final}}-1}\right)^{T} \Lambda^{l_{\text{final}}-1} V^{l_{\text{final}}-1}
\end{align*}

so $V^{l_{\text{final}}-1} = V_{W^{\prime}}^{T}$ and $\Lambda^{l_{\text{final}}-1} = \Sigma_{W^{\prime}}^{2}$. Now,

\begin{align*}
\hat{f}^{l_{\text{final}}-1} &= C^{l_{\text{final}}-1} f^{l_{\text{final}}-1} \\
&= \left(\Lambda^{l_{\text{final}}-1}\right)^{\frac{1}{2}} V^{l_{\text{final}}-1} f^{\prime\, l_{\text{final}}-1}
\end{align*}

Using equation 43, we have:

\begin{align}
\hat{f}^{l_{\text{final}}} &= U_{W^{\prime}}^{T} W^{\prime\, l_{\text{final}}-1} \left(V^{l_{\text{final}}-1}\right)^{T} \left(\left(\Lambda^{l_{\text{final}}-1}\right)^{\frac{1}{2}}\right)^{+} \hat{f}^{l_{\text{final}}-1} \nonumber \\
&= \Sigma_{W^{\prime}} \left(\left(\Lambda^{l_{\text{final}}-1}\right)^{\frac{1}{2}}\right)^{+} \hat{f}^{l_{\text{final}}-1} \tag{44} \\
&= \hat{f}^{l_{\text{final}}-1} \nonumber
\end{align}

For layers which are not the final layer in the network, the procedure is very similar. As before, we have:

\[
f^{\prime l} := \left(\left(D^{l}\right)^{\frac{1}{2}}\right)^{+} U^{l} f^{l} \qquad\qquad W^{\prime l} := W^{l} \left(U^{l}\right)^{T} \left(D^{l}\right)^{\frac{1}{2}}
\]
\[
G^{l+1} = W^{\prime l} \left(W^{\prime l}\right)^{T}, \qquad U^{l+1} = U_{W^{\prime l}}^{T}
\]

Now, we need to remember that $\hat{f}^{l+1} = C^{l+1} f^{l+1}$:

\begin{align*}
f^{\prime l+1} &= \left(\left(D^{l+1}\right)^{\frac{1}{2}}\right)^{+} U^{l+1} W^{\prime l} f^{\prime l} \\
&= \Sigma_{W^{\prime l}}^{+} U_{W^{\prime l}}^{T} W^{\prime l} f^{\prime l} \\
&= V_{W^{\prime l}}^{T} f^{\prime l} \\
\hat{f}^{l+1} &= \left(\Lambda^{l+1}\right)^{\frac{1}{2}} V^{l+1} f^{\prime l+1} \\
&= \left(\Lambda^{l+1}\right)^{\frac{1}{2}} V^{l+1} V_{W^{\prime l}}^{T} f^{\prime l} \\
M^{l}_{j,j^{\prime}} &= \frac{1}{n} \sum_{x} \frac{\partial \hat{f}^{l+1}_{i}(x)}{\partial f^{\prime l}_{j}} \frac{\partial \hat{f}^{l+1}_{i}(x)}{\partial f^{\prime l}_{j^{\prime}}} \\
M^{l} &= V_{W^{\prime l}} \left(V^{l+1}\right)^{T} \Lambda^{l+1} V^{l+1} V_{W^{\prime l}}^{T}
\end{align*}

Once again, note that this expression gives $M^{l}$ in explicitly diagonalized form, so

\[
V^{l} = V^{l+1} V_{W^{\prime l}}^{T}, \qquad\qquad \Lambda^{l} = \Lambda^{l+1}
\]

So, $V^{l}$ is exactly what we need in order to diagonalize the relationship, and we end up with

\begin{align}
\hat{f}^{l+1} &= \left(\Lambda^{l+1}\right)^{\frac{1}{2}} V^{l+1} V_{W^{\prime l}}^{T} \left(V^{l}\right)^{T} \left(\left(\Lambda^{l}\right)^{\frac{1}{2}}\right)^{+} \hat{f}^{l} \nonumber \\
&= \hat{f}^{l} \tag{45}
\end{align}

So the LIB coordinates in each layer are the same as in the final layer: they are the final-layer activations rotated into the (uncentered) PCA basis, but without whitening.
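
As a sanity check on this derivation, the following minimal numpy sketch (our own illustration, not code from the paper; all variable names are ours) builds a small random deep linear network, constructs the local interaction basis layer by layer exactly as above (uncentered PCA in the final layer, whitening, and the eigendecomposition of $M^{l}$), and verifies that the resulting coordinates $\hat{f}^{l}$ agree across layers, i.e. equation 45. Because each eigendecomposition fixes the signs of its eigenvectors independently, the agreement is checked up to a sign per coordinate.

import numpy as np

rng = np.random.default_rng(0)
n, d, L = 2000, 6, 3                       # datapoints, layer width, number of linear layers

X = rng.normal(size=(d, n))                # f^0: inputs as columns
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]

fs = [X]                                   # forward pass: f^{l+1} = W^l f^l
for W in Ws:
    fs.append(W @ fs[-1])

def eig_desc(A):
    """Symmetric eigendecomposition A = U^T diag(D) U, eigenvalues descending, eigenvectors as rows of U."""
    D, U = np.linalg.eigh(A)
    return D[::-1], U[:, ::-1].T

# Final layer: the LIB is just the uncentered PCA basis, \hat f^{l_final} = U^{l_final} f^{l_final}.
_, U_last = eig_desc(fs[-1] @ fs[-1].T / n)
fhat = [U_last @ fs[-1]]
C_next = U_last                            # \hat f^{l+1} = C^{l+1} f^{l+1}

# Walk backwards through the layers, mirroring the derivation above.
for l in range(L - 1, -1, -1):
    D_l, U_l = eig_desc(fs[l] @ fs[l].T / n)
    Dhalf = np.diag(np.sqrt(np.clip(D_l, 0.0, None)))
    # Jacobian of \hat f^{l+1} with respect to the whitened activations f'^l = (D^{1/2})^+ U f^l;
    # it is data-independent for a purely linear network, and equals C^{l+1} W'^l.
    J = C_next @ Ws[l] @ U_l.T @ Dhalf
    Lam, V_l = eig_desc(J.T @ J)           # M^l = V^T diag(Lambda) V
    C_l = np.diag(np.sqrt(np.clip(Lam, 0.0, None))) @ V_l @ np.linalg.pinv(Dhalf) @ U_l
    fhat.insert(0, C_l @ fs[l])
    C_next = C_l

# Check \hat f^{l+1} = \hat f^l (equation 45). Each eigendecomposition fixes signs
# independently, so we compare up to the sign of each coordinate.
for l in range(L):
    match = np.allclose(np.abs(fhat[l]), np.abs(fhat[l + 1]), atol=1e-6)
    print(f"layer {l}: LIB coordinates agree with layer {l + 1} (up to signs): {match}")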
