# Autoregressive Transformers and Lessons From Enactivism

## Mental models and recent progress

The purpose of this post is to elaborate on mental models of recent AI progress and autoregressive transformers that I have developed over the past year or so. It grew out of a desire to reconcile several competing explanatory models. On the one hand, we have the idea that feature spaces capture complex co-occurrence statistics about tokens in ordered sequences. On the other, we have a view of the data as residing in a Euclidean embedding space with some geometric or topological structure. And on a third hand (my first foot?), I have a nagging affinity for theories of cognition that utilize dynamical systems theory, like enactivism and embodied cognition. I am mostly writing this to codify and clarify the half-baked ideas in my head that may merit a more complete narrative, and some recent work has encouraged me to finally write things down. To be honest, this post does not include many equations. Some of the statements can be formalized and demonstrated, but that is for other posts. This post involves a lot of storytelling about recent progress in ML/AI; the specific claims about attention and transformers are located towards the bottom.

It has been quite a while since my last blog post. In the field of machine learning at large, the dominance of unsupervised learning, transformers, and extremely large-scale training has become practically indisputable. Or at least more obvious in some subset of groups. When I was last active in blogging and paper-writing, I was most interested in the dynamics of optimization algorithms in large-scale training problems. For example, I was interested in the fundamental limits of data parallelism for SGD. People were seeing that, for any given problem, there is a “critical batch size” beyond which further data parallelism yields only marginal gains in computational efficiency during training. It was not clear how specific factors like input distribution entropy or over-parameterization affected this phenomenon, although the question was being explored in certain theoretical settings. The results of these initial, less well-known efforts were further developed into more hyped-up themes like double-descent curves, which began to provide a language for challenging or extending incumbent frameworks like VC theory and the bias-variance tradeoff.

However, the empirical advances in this time far outstripped the progress of our theoretical understanding. For example, when GPT-3 100x-ed the parameter count of GPT-2, the language model demonstrated an emergent meta-learning capability. Seriously: meta-learning is an emergent property of autoregressive transformers when you use a stochastic optimization algorithm to maximize the likelihood of the data under the parameters. That’s crazy. But it often feels as though we lack the theoretical language to really understand the dynamics of these systems in an intuitive way that explains these complex, emergent phenomena. It was hard enough to come up with a useful definition of meta-learning when we had gradients and ensembles. That’s why I used a quadratic model to understand a simplified version of MAML. How are theoreticians supposed to catch up to the empiricists? And as likelihood maximization dominates in both the empirical ML world and modern cognitive science theories such as Clark’s predictive processing and Friston’s free energy principle, the mathematically inclined ML theoreticians must feel a bit left out of the party.

And the need to understand these models feels urgent. In a single year, Google went from reporting that 10% of its queries used BERT, to practically 100%. Most people I talk to don’t think of Google searches as “AI”, but these models are everywhere. When was the last time you went a day without reading content from a recommender system? Powerful, opaque AI systems already determine to a large extent the content we consume, modify, and produce on a daily basis. I often see discussions in safety research about weighing present dangers versus future dangers. For example: should we focus on how near-term AI systems will interact with existing social power structures, or should we focus on the longer-term existential risks of creating superintelligent entities that might harm humanity or challenge our assumptions about sentience? For one thing, I think that this is a false choice. Long-term harms to humanity from these systems will likely appear through social power structures since these sociotechnical dynamics form the context in which these algorithms are embedded. And our inability to cogently explain the internal structure of these systems is dangerous for its own reasons. We run the risk of underestimating the real capabilities that are being implicitly exploited by those who control access to computational resources. And we run a parallel risk of using anthropocentric conceptions of intelligence and consciousness to ascribe these systems properties that lead us on an intellectually enticing breadcrumb trail away from the material harms these systems are causing to marginalized groups.

There is a middle ground to achieve here, and I am not claiming to have achieved it, but I think that the least we can do is work to find it by developing a shared language to explain the immense practical success of these systems. This, to me, is the important work of ML theory right now. There are times in research history when theory drives or inspires the development of new algorithms and techniques, and there are other times when theory is playing catch-up to the systems. We are in that latter state right now. It is like Sutton’s “bitter lesson” taken to the extreme, since the emergent properties of the scaled up system seem to be passing some highly significant threshold in raw capability and economic utility, and they show no sign of slowing down.

## Competing semantics

Right now, there is a major disconnect in how different groups are discussing the systems themselves. Each group has its own semantics to describe the operation of the system and its integration into a broader environment. These different semantics are sort of like different subjective “stances” one might take towards the object of inquiry, and they are reminiscent of the “folk psychology” of Dennett’s intentional systems. The appropriateness of a subjective stance depends on its explanatory and predictive power in a given context. So the same researcher might adopt a different stance depending on the problem at hand. I call these “stances” and “semantics” rather than theoretical frameworks because they are usually not purely mathematical. Their histories often intertwine with specific movements in areas like cognitive science.

Some researchers describe these AI systems with what might be called a “semantics of linear prediction”. Likelihood maximization and optimization play a big role. They ask: how well does the model fit the data? Does the model disentangle the latent space and discover a simpler linear structure? There is an intuitive sense in which our current conception of feature embeddings is the intellectual descendant of the propositional logic of older rule-based AI systems. Propositions are expressed through mixtures of basis elements in vector spaces. Simpler systems like word2vec laid the initial groundwork for this epistemic view, but it has been picked up wholesale for modern semi-supervised and unsupervised techniques without many changes. The remaining question to me is explaining how next-token prediction in autoregressive models leads to feature embeddings that are massively multi-purpose. Sure, the data are linearly separable in the token prediction space. But we then have to explain why this necessarily means that the data are linearly separable in a much larger space of semantically meaningful task distributions.

Other researchers may use more of a “brain semantics”. Or maybe one could swap “brain” with “information”. For this group, predictive processing theories of mind like those of Andy Clark are valuable, and the question of whether backpropagation is biologically plausible is significant. It is quite cool that predictive coding asymptotically approximates backprop, for example. I would say that the cognitive science analog of this group is something like a fusion of old-school connectionism and a modern computationalist theory of mind (despite Andy Clark’s sincere efforts to emphasize embodiment), as opposed to the behaviorism that legitimized the rule-based approaches mentioned above. Maybe we could call it neo-connectionism. Here, intelligence can be practically reduced to an agent’s predictive power in its environment. What we call intelligence might just result from the mishmash of certain real phenomenological patterns with basic information-theoretic properties of stochastic systems, like the asymptotic equipartition property.

Other researchers prioritize a view of the AI system in a broader sociotechnical context. The philosophical inheritance of this side comes from other sources, like Nagel’s “The View From Nowhere” and Grice’s theory of communicative intent in meaning. This work often involves analyzing the sociopolitical power structures in AI research and industry, as well as the ways in which other people are materially impacted by these technologies. Critical theory is extremely valuable here. I highly recommend reading Abeba Birhane and Jelle Van Dijk’s “A Misdirected Application Of AI Ethics” for an example in this vein. I think these researchers see claims like “this model is performing intelligent actions”, “this system is a moral patient”, and “this system has agency” as (mostly) orthogonal and deserving of (mostly) independent investigations. As a result, regardless of whether the AI system is intelligent or has agency, it is of immediate importance to understand its role in a broader system that does include moral patients and free agents. In my experience, theories of mind like enactivism or embodied cognition are more popular with this group compared to computationalism or behaviorism. I enjoy theories of enactivism and embodied cognition because they have an interesting overlap with my personal affinity for Buddhist metaphysics, so sometimes I’ll come across an essay that’s like a crossover episode guest-starring Evan Thompson.

As an aside, I think that sometimes researchers in the last group move towards the Searle-y argument that our AI systems will never be intelligent or conscious. I have had some crazy experiences with these models that really challenged their status as “stochastic parrots” in my mind. It was shocking to hear the first disembodied human-like voices emerge from the early versions of Jukebox, and to hear some really beautiful music take shape over time as the quality improved. iGPT actually has basic physics knowledge in areas like fluid dynamics. And even in the early days of GPT-3, it was quite capable of inferring complex literary themes and motifs from writers like Haruki Murakami and Virginia Woolf. I remember producing the transcript of a comedic debate between Descartes and Berkeley on monism vs. dualism, moderated by a drunk Socrates. Maybe one day I will post interesting outputs here from the earliest versions of GPT-3. There were times when I used GPT-3 through a simple CLI and felt an unmistakable “user illusion” that connected me to something like the Broca’s area of a human collective. And despite the fact that I worked at OpenAI, I don’t consider myself much of a futurist or a sci-fi lover. I was never much of an Isaac Asimov fan. My friends keep encouraging me to watch The Expanse and I have failed them for years. So I don’t think this is just me trying to fulfill a nerdy childhood fantasy.

## An empirical story

In my (biased) opinion, our best understanding of many recent ML/AI systems comes from extensive empirical analysis. Theoretical frameworks are not really pulling their weight yet. This empirical trend started with some work to explain the critical batch size problem using second-order statistics of the loss landscape in parameter space (e.g. OpenAI’s ‘noise scale’). But people were struggling to adapt the existing theoretical tools for convex problems to the highly non-convex loss landscape of deep neural networks. We have progressed a lot in our understanding of non-convex stochastic optimization since then. Tools from fields like statistical physics have helped us a lot. But I don’t think they have fully accounted for the paradigm shift that really occurred with the dominance of unsupervised learning and transformers.

For example, there was a long line of work that analyzed optimization dynamics by assuming that the gradient covariance matrix was co-diagonalizable with the Hessian of the loss. People also discussed whether optimization occurred in a low-dimensional subspace, i.e. if the gradient covariance matrix was low-rank. This was empirically verified in image classification problems with convolutional architectures. But it doesn’t appear that this line of research will be as fruitful for unsupervised learning in transformers, where the covariance and Hessian both seem to be less degenerate. And besides, most people use AdamW optimizers now instead of plain SGD, and very little theoretical work has analyzed AdamW dynamics.

Several research groups took a different approach. In particular, people kept noticing that power law relationships popped up everywhere. With the right kind of setup, you could even predict model performance based on quantities like parameter count and dataset size. This empirical approach has been very fruitful and is what motivated OpenAI to train GPT-3 in the first place. I know many researchers who were absolutely floored when they first saw that these power laws hold for so many orders of magnitude. It is entirely possible that they will continue to hold until we surpass human performance on a significant fraction of economically valuable activities. That is a scary proposition. It also validates the bets that companies like Amazon, Facebook, Google, NVIDIA, OpenAI, Microsoft, and Tesla are making with their infrastructure and financial commitments.
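
To make the shape of these fits concrete, here is a minimal sketch of how such a power law can be fit and extrapolated. The loss values below are synthetic, generated from an invented exponent purely for illustration; real scaling-law studies fit measured losses across many training runs.

```python
import numpy as np

# Hypothetical (parameter count, final loss) pairs standing in for real
# training runs; the exponent 0.076 below is invented for illustration.
params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = 5.0 * params ** -0.076  # pretend these were measured empirically

# Fitting L(N) = a * N^(-b) is linear regression in log-log space:
# log L = log a - b * log N.
slope, log_a = np.polyfit(np.log(params), np.log(loss), deg=1)
b = -slope

# Extrapolate the fitted law to a model 10x larger than any observed run.
predicted = np.exp(log_a) * 1e11 ** -b
```

Since fitting $$L(N) = aN^{-b}$$ is just a two-parameter linear regression in log-log space, the fit itself is trivial; the surprising empirical fact is that real measurements stay on that line across so many orders of magnitude.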

So on the one hand, we have a mathematical theory to think about these problems, rooted in language borrowed from areas like statistical physics and high-dimensional probability theory. On the other hand, we have a meticulous quantitative analysis of these models through predictive tools like power laws. In an ideal world, there will also be some way to fuse these two things together. In the language of physics, we should be able to pair a microscopic story with its macroscopic summary. But even if that operation is successful, I am not confident that it will provide a rich enough language to help us predict and understand the emergence of certain qualitative properties of the system.

## Step changes and unpredictable emergent behaviors

From a safety perspective, one of the scarier things about the scaling laws was the emergence of qualitative behaviors that did not appear in smaller models or earlier architectures. This was scary because our quantitative predictions did not give us the ability to establish a consensus bet that certain capabilities would emerge after we trained a sufficiently good model. In other words: these models are predictably better, but in unpredictable ways. This leads to speculation of the form “if GPT-3 can already do ABC, then will GPT-$$n$$ be able to do XYZ, for $$n > 3$$?”

The most obvious example is the meta-learning capability of GPT-3. Meta-learning and cross-task generalization have long been a holy grail of AI research, and I don’t think many people who explicitly worked in that research area expected a simple likelihood maximization approach to work. Sometimes it feels as though we are in a regime of progress where the best question is not “is this system intelligent?” but “what do people mean when they call this system’s behavior intelligent?” We are being forced to pick apart and modularize our assumptions about cognition and intelligence by breaking them down into specific qualitative behaviors. Meta-learning seems to have come and gone, and the goalposts shifted. But there are plenty of other features that may or may not emerge as we continue to develop these models. Common sense, logical consistency, multimodal reasoning, and online learning all come to mind.

Some of these features might already be on the way, and others might never appear. But, again, one reason these scaling laws are scary is that we can’t predict which qualitative behaviors will come next. Yet they are motivating people to invest billions of real dollars into scale. And we have no way of confirming which of these behaviors are considered desirable merely as a result of anthropocentrism. For example, is “common sense” reasoning necessary, or even desirable, for most economically valuable tasks? We fail at logical consistency more often than we’d like to admit. On an uncharitable reading of the symbol grounding problem, multimodal models might only buy us sample efficiency. And it may be that online learning will be achieved with the same gradient-based approaches we already use for pre-training and fine-tuning. But right now, we have no good way of systematically measuring these qualitative behaviors in a way that provides satisfactory criteria for intelligent systems.

A desirable “semantics” has two main properties: it should clearly explain the internal operations of a system, and it should be able to predict how complex qualitative behaviors emerge over time. Although the relative significance of explanation versus prediction seems to be unsettled in discourses about the philosophy of science, I lean towards a weak conviction that an explanation should permit us to predict. I want to have a mental model of these systems that makes these qualitative behaviors obvious in retrospect, like “of course GPT-3 should be a few-shot learner” or “of course iGPT should produce good features for image classification”.

## Dynamics of autoregressive transformers and other partially-founded claims

Again, this section is not really claiming to have the answers to questions posed by the above narrative. And many of the problems are more social than mathematical-theoretical. But I think it’s worth taking a stab at a different kind of analysis in which certain qualitative behaviors become more obvious given the right theoretical picture of a system.

So if we take this opportunity to try a slightly more general approach, with a move from statistical physics to the more abstract dynamical systems theory, we have a nice opportunity to reconnect with modern theories of cognition. There is definitely a sort of abstraction boundary sweet spot to be achieved with this kind of talk. I want to keep vocabulary like “attractor sets” while ignoring regularity conditions about Polish spaces in proofs. We can potentially start to close a bit of the gap that has grown between some of the groups of researchers I mentioned above. Dynamical systems theory is also the root of the enactivist theories of mind that appeal to those who remained unconvinced by Clark’s computationalism and Ryle’s logical behaviorism.

Although this application of the dynamical systems framework does not deal with any fancy question like embodiment, I hope that it at least demonstrates a significant structural difference between contemporary AI systems and the rule-based ones of antiquity. And, perhaps more importantly, it demonstrates a potential structural difference between these AI systems and “mere” pattern matchers. However, it will undoubtedly be unappealing to many. For the overtly philosophical, it may appear too mathematical. For the overtly mathematical, it may seem handwavy or insufficiently rigorous.

For now, let’s explore the first desirable property for an interpretative framework: explaining the internal operations of the system. I think that because it sits so comfortably with enactivist theories of cognition, this theory will also be capable of predicting other emergent behaviors like meta-learning, but I’m leaving that to future work. Here are some claims about these systems, and these claims mostly build on each other.

### Update (08 Dec. 2020)

I have since expanded on some of these claims, especially 1, 2, 5, and 6, in a more mathematical follow-up post which you can find here. In fact, maybe you’ll want to skip to that immediately.

### Claim 1: Self-attention modules represent token relationships through a “propositional geometry”

Self-attention modules implicitly define a Euclidean simplex that represents conjunctions of hypotheses about correlations between groups of tokens. So, logical propositions about token correlations emerge from an interpretation of the attention module as a fuzzy key-value lookup. It turns out that a relation between two tokens can be encoded as an odds ratio for the association between a key and query under Gaussian priors. When you collect all the odds ratios for a given token in a vector, you have a conjunction of hypotheses about token correlations, expressed in the language of vector similarity. You move from a “value vector”-focused view of the attention feature space to a view that emphasizes the row vectors of $$\sigma(QK^{\intercal})$$, where $$\sigma$$ denotes the row-wise softmax. These row vectors define a Euclidean simplex. If you compute the distance between output vectors from the attention operation, you are actually computing a Mahalanobis distance on this simplex with the quadratic form $$VV^{\intercal}$$. Cool!
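
The distance identity in this claim is easy to check numerically. Here is a toy sketch with a single attention head, where random matrices stand in for learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8  # sequence length, head dimension

Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Attention weights: each row of A is a point on the probability simplex.
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

O = A @ V  # attention outputs

# Squared Euclidean distance between outputs i and j equals the
# Mahalanobis-style distance between the simplex points A_i and A_j
# under the quadratic form V V^T.
i, j = 0, 1
delta = A[i] - A[j]
maha_sq = delta @ (V @ V.T) @ delta
eucl_sq = np.sum((O[i] - O[j]) ** 2)
```

The check works because the output is $$O = AV$$, so the identity holds by construction; the non-trivial content of the claim is the interpretation of the rows of $$A$$ as conjunctions of hypotheses about token correlations.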

### Claim 2: Likelihood maximization and disentanglement

Maximizing the likelihood of the data with gradient descent disentangles these competing hypotheses about the data. This also reveals a simple duality between feature and parameter spaces. Instead of doing your analysis by taking derivatives with respect to the basic $$W_Q, W_K, W_V$$ linear transforms, you can just work in $$Q, K, V$$ spaces. This duality simplifies the analysis of the disentanglement under optimization dynamics.
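
One way to see the duality is that the gradient with respect to $$W_Q$$ factors entirely through $$Q = XW_Q$$. Here is a minimal numerical sketch, using a deliberately simple scalar loss as an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, d_head = 4, 6, 3

X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_head))
Q = X @ W_Q

# Toy scalar loss L = 0.5 * ||Q||^2, chosen so that dL/dQ = Q exactly.
dL_dQ = Q

# Chain rule: the gradient with respect to W_Q factors through Q,
# dL/dW_Q = X^T (dL/dQ). Analysis done "in Q space" transfers to W_Q space.
dL_dW_Q = X.T @ dL_dQ

# Finite-difference check of one entry of dL/dW_Q.
eps = 1e-6
W_pert = W_Q.copy()
W_pert[0, 0] += eps
L0 = 0.5 * np.sum((X @ W_Q) ** 2)
L1 = 0.5 * np.sum((X @ W_pert) ** 2)
fd = (L1 - L0) / eps
```

The same factorization holds for $$W_K$$ and $$W_V$$, which is why the analysis can be carried out directly in the $$Q, K, V$$ spaces.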

### Claim 3: More hypotheses than we can handle

There are several mechanisms that explain the massive number of possible configurations of hypotheses available in this hypothesis space. Some contributing factors: floating point precision, network depth and compositionality, high-dimensional probability, and shattering set size for linear classifiers (i.e. decision boundary complexity in high-dimensional spaces).

### Claim 4: Semantically meaningful concepts emerge due to linear separability

Semantically meaningful classification tasks should be linearly separable in a feature space designed to predict the next token using attention modules. This would explain the success of certain models like iGPT. This claim probably takes a bit more work.
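
As a sketch of what “linearly separable in feature space” cashes out to operationally, here is a toy linear probe. The “features” below are synthetic Gaussian clusters, not outputs of an actual model; the real version of this experiment would freeze a pretrained network and train only the linear layer on its features.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16           # feature dimension (stand-in for a frozen model's features)
n_per_class = 50

# Hypothetical "features" for two semantic classes, drawn from Gaussians
# purely for illustration.
feats_a = rng.normal(loc=+1.0, size=(n_per_class, d))
feats_b = rng.normal(loc=-1.0, size=(n_per_class, d))
X = np.vstack([feats_a, feats_b])
y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

# A linear probe: logistic regression trained with plain gradient descent.
w = np.zeros(d)
for _ in range(200):
    margins = y * (X @ w)
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

accuracy = np.mean(np.sign(X @ w) == y)
```

If the claim holds, a probe like this trained on frozen autoregressive features should recover semantically meaningful class labels without touching the network's weights, which is roughly the iGPT result being referenced.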

### Claim 5: Autoregressive models estimate dynamical systems by interpolating dynamics between observations

The observations in this case consist of the actual tokens, which have a one-to-one correspondence with their embedding vectors. The basis vectors of the linear embedding layer function as attractors in the pre-logit feature space. Maximizing the token likelihood given a fixed embedding is like optimizing a dynamical system such that its stationary distribution is a mixture of Gaussians centered about the embedding vectors. The individual layers of the network further refine this trajectory to include intermediate states. This also explains the utility of residual connections, since they enable the state to stabilize around an attractor during its trajectory.
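
Here is a caricature of this picture as code: orthonormal “embedding” vectors act as attractors, and a residual update flows the state toward the nearest one. Everything here (the layer, the update rule, the orthonormal embeddings) is a toy stand-in chosen for illustration, not an actual transformer component.

```python
import numpy as np

d, vocab = 8, 5

# Toy "embedding matrix": five orthonormal token embeddings (rows) acting
# as attractors in an 8-dimensional pre-logit feature space.
E = np.eye(vocab, d)

def layer(x, E, step=0.3, beta=10.0):
    """One caricature 'layer': a residual update nudging the state toward a
    softmax-weighted mixture of the embedding vectors."""
    logits = beta * (E @ x)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return x + step * (w @ E - x)  # the residual connection stabilizes the flow

# Start between two token embeddings, closer to token 0.
x = 0.9 * E[0] + 0.3 * E[1]
for _ in range(100):
    x = layer(x, E)

dist_to_nearest = np.min(np.linalg.norm(E - x, axis=1))
```

Starting between two token embeddings, the state settles onto the nearer attractor, which is the qualitative behavior the claim attributes to the layer-by-layer refinement of the trajectory.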

### Claim 6: Semantically meaningful concepts are attractor sets in the model’s interpolated state space

Moreover, since the attractor sets are configurations of hypotheses about token correlations, they can be regarded as the structure of the “internal state representation” of the model. In essence, we are able to find a structured, intricate internal representation of the world in the state space of a highly non-linear dynamical system. This sounds very enactivist to me.

From a philosophical perspective, we are nowhere near an answer to the “feels like something” problem of the system. But I think this will at least show that the similarities between our current AI systems and our theories of enacted cognition are stronger than you might expect. And if we can explain qualitative behaviors like attention, intelligence, and belief using an enactivist theory of mind, then we can bring that understanding back to our practical understanding of AI systems.

Maybe I will write a bit more about each of these claims individually. Most of them can be demonstrated using relatively straightforward math, so long as you’re comfortable with some handwaving assumptions.