Autoregressive transformers and lessons from enactivism

December 6, 2020 · Noah Golmant

Note to the reader: this post captures some thoughts on AI progress from 2020. It may feel very stale now.

Mental models and recent progress

The purpose of this post is to rant about some mental models of AI progress and autoregressive transformers that I built up over the past year or so. I want to reconcile several competing explanatory models in my head. On one hand, we have the idea that feature spaces capture complex co-occurrence statistics about tokens in ordered sequences. On the other, we have a view of the data as residing in a Euclidean embedding space with some geometric structure. And on a third hand, we have theories of cognition that cross over into dynamical systems theory, like enactivism. I am mostly writing this to clarify some half-baked ideas in my head, and some recent work has encouraged me to finally write things down.

It’s been quite a while since my last blog post. In the field of machine learning, the dominance of unsupervised learning, transformers, and extremely large-scale training has become practically indisputable. When I was last active in blogging and paper-writing, I was most interested in the dynamics of optimization algorithms in large-scale training problems. For example, I was interested in the fundamental limits of data parallelism for SGD. People were seeing that, for any given problem, there is a “critical batch size” beyond which larger batches yield only marginal gains in computational efficiency during training. It was not clear how specific factors like input distribution entropy or over-parameterization affected this phenomenon, although this question was being explored in certain theoretical conditions. The results of these initial, less well-known efforts were further developed into more hyped-up themes like double-descent curves that began to provide a language to challenge or extend incumbent frameworks like VC theory and the bias-variance tradeoff.

However, the empirical advances in this time far outstripped the progress of our theoretical understanding. For example, when GPT-3 100x-ed the parameter count of GPT-2, the language model demonstrated an emergent meta-learning capability. Seriously: meta-learning is an emergent property of autoregressive transformers when you use a stochastic optimization algorithm to maximize the likelihood of the data under the parameters. That’s crazy. But it often feels as though we lack the theoretical language to really understand the dynamics of these systems in an intuitive way that explains these complex, emergent phenomena. It was hard enough to come up with a useful definition of meta-learning when we had gradients and ensembles. That’s why I used a quadratic model to understand a simplified version of MAML. How are theoreticians supposed to catch up to the empiricists? And as likelihood maximization dominates in both the empirical ML world and modern cognitive science theories such as Clark’s predictive processing and Friston’s free energy principle, the mathematically inclined ML theoreticians must feel a bit left out of the party.
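To make the objective in question concrete: for a token sequence $x_1, \dots, x_T$ drawn from a data distribution $\mathcal{D}$, an autoregressive model with parameters $\theta$ is typically trained to maximize the log-likelihood under the chain-rule factorization. The symbols here are assumptions chosen for illustration, and this is the standard textbook form rather than a description of any specific training setup.

$$
\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \right]
$$

Nothing in this expression mentions tasks, prompts, or few-shot examples, which is part of what makes the emergence of meta-learning so surprising.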

And the need to understand these models feels urgent. In a single year, Google went from reporting that 10% of its queries used BERT, to practically 100%. Most people I talk to don’t think of Google searches as “AI”, but these models are everywhere. When was the last time you went a day without reading content from a recommender system? Powerful, opaque AI systems already determine to a large extent the content we consume, modify, and produce on a daily basis. I often see discussions in safety research about weighing present dangers versus future dangers. For example: should we focus on how near-term AI systems will interact with existing social power structures, or should we focus on the longer-term existential risks of creating superintelligent entities that might harm humanity or challenge our assumptions about sentience? For one thing, I think that this is a false choice. Long-term harms to humanity from these systems will likely appear through social power structures since these sociotechnical dynamics form the context in which these algorithms are embedded. And our inability to cogently explain the internal structure of these systems is dangerous for its own reasons. We run the risk of underestimating the real capabilities that are being implicitly exploited by those who control access to computational resources. We run a parallel risk of using anthropocentric conceptions of intelligence and consciousness to ascribe these systems properties that lead us on an intellectually enticing breadcrumb trail away from the material harms these systems are causing to marginalized groups.

There must be some middle ground to achieve here. I think that the least we can do is work towards a shared language to explain the immense practical success of these systems. There are times in research history when theory drives or inspires the development of new algorithms and techniques, and there are other times when theory is playing catch-up to the systems. We are in that latter state right now. It is like Sutton’s “bitter lesson” taken to the extreme, since the emergent properties of the scaled-up system seem to be passing some highly significant threshold in raw capability and economic utility, and they show no sign of slowing down.

Competing semantics

Right now, there is a major disconnect in how different groups discuss these systems. Each group has its own semantics to describe the operation of the system and its integration into a broader environment. These different semantics are sort of like different subjective “stances” one might take towards the object of inquiry, and they are reminiscent of the “folk psychology” of Dennett’s intentional systems. The appropriateness of a subjective stance depends on its explanatory and predictive power in a given context. So the same researcher might adopt a different stance depending on the problem at hand. I call these “stances” and “semantics” rather than theoretical frameworks because they are usually not purely mathematical. Their histories often intertwine with specific movements in areas like cognitive science.

Some researchers describe these AI systems with a “semantics of linear prediction”. Likelihood maximization and optimization play a big role. They ask: how well does the model fit the data? Does the model disentangle the latent space and discover a simpler linear structure? There is an intuitive sense in which our current conception of feature embeddings is the intellectual descendant of the propositional logic of older rule-based AI systems. Concepts are expressed through mixtures of basis elements in vector spaces. Simpler systems like word2vec laid the initial groundwork for this epistemic view, but it has been picked up wholesale for modern semi-supervised and unsupervised techniques without many changes. The remaining question, to me, is how next-token prediction in autoregressive models leads to feature embeddings that are massively multi-purpose. Sure, the data are linearly separable in the token prediction space. But we then have to explain how this necessarily means that the data are linearly separable in a much larger space of semantically meaningful task distributions.

Other researchers may use more of a “brain semantics”. Or maybe one could swap “brain” with “information”. For this group, predictive processing theories of mind like those of Andy Clark are valuable, and the question of whether backpropagation is biologically plausible is significant. It is quite cool that predictive coding asymptotically approximates backprop, for example. I would say that the cognitive science analog of this group is something like a fusion of old-school connectionism and a modern computationalist theory of mind (despite Andy Clark’s sincere efforts to emphasize embodiment), as opposed to the behaviorism that legitimized the rule-based approaches mentioned above. Here, intelligence can be practically reduced to an agent’s predictive power in its environment.

Other researchers prioritize a view of the AI system in a broader sociotechnical context. The philosophical inheritance of this side comes from other sources, like Nagel’s “The View From Nowhere” and Grice’s theory of communicative intent in meaning. This work often involves analyzing the sociopolitical power structures in AI research and industry, as well as the ways in which other people are materially impacted by these technologies. Critical theory is valuable here. I highly recommend reading Abeba Birhane and Jelle van Dijk’s “A Misdirected Application Of AI Ethics” for an example in this vein. I think these researchers see claims like “this model performs intelligent actions”, “this system is a moral patient”, and “this system has agency” as (mostly) orthogonal. As a result, regardless of whether the AI system is intelligent or has agency, it is of immediate importance to understand its role in a broader system that definitely does include moral patients and free agents. Theories of mind like enactivism or embodied cognition are more popular with this group than computationalism or behaviorism.

I think that sometimes researchers in the last group move towards the Searle-y argument that our AI systems will never be intelligent or conscious. In 2019/2020 I had some crazy experiences with these models that really challenged their status as “stochastic parrots” in my mind. It was shocking to hear the first disembodied human-like voices emerge from the early versions of Jukebox, and to hear some really beautiful music take shape over time as the quality improved. iGPT actually has basic physics knowledge in areas like fluid dynamics. And even in the early days of GPT-3, it was quite capable of inferring complex literary themes and motifs from writers like Haruki Murakami and Virginia Woolf. There were times when I used GPT-3 through a simple CLI and felt an unmistakable “user illusion”, but with my Broca’s area instead of my visual cortex. And despite the fact that I worked at OpenAI, I don’t consider myself much of a futurist or a sci-fi lover. I’m not fitting my experience to a personally fulfilling trajectory for science.

An empirical story

In my (biased) opinion, our best understanding of many recent ML/AI systems comes from extensive empirical analysis. Theoretical frameworks are not really pulling their weight yet. This empirical trend started with some work to explain the critical batch size problem using second-order statistics of the loss landscape in parameter space (e.g. OpenAI’s ‘noise scale’). But people were struggling to adapt the existing theoretical tools for convex problems to the highly non-convex loss landscape of deep neural networks. We have progressed a lot in our understanding of non-convex stochastic optimization since then. Tools from fields like statistical physics have helped us a lot. But I don’t think they have fully accounted for the paradigm shift that really occurred with the dominance of unsupervised learning and transformers.
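As a rough illustration of what a “second-order statistic” means here, the sketch below computes a naive plug-in estimate of the simple noise scale, the trace of the per-example gradient covariance divided by the squared norm of the mean gradient. The function name and the toy gradients are hypothetical, and using the sample mean in place of the true gradient biases the estimate, so treat this as a sketch of the quantity rather than the estimator used in practice.

```python
def simple_noise_scale(per_example_grads):
    """Naive plug-in estimate of tr(Sigma) / |G|^2.

    per_example_grads: list of equal-length lists, one gradient per example.
    Sigma is the per-example gradient covariance and G the mean gradient.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])

    # Mean gradient across examples, standing in for the true gradient G.
    mean = [sum(g[i] for g in per_example_grads) / n for i in range(dim)]

    # Trace of the covariance = sum of per-coordinate (unbiased) variances.
    trace_cov = sum(
        sum((g[i] - mean[i]) ** 2 for g in per_example_grads) / (n - 1)
        for i in range(dim)
    )

    grad_sq_norm = sum(m * m for m in mean)
    return trace_cov / grad_sq_norm


# Hypothetical toy gradients (four examples, two parameters).
toy_grads = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [2.0, -2.0]]
noise_scale = simple_noise_scale(toy_grads)
print(noise_scale)
```

Intuitively, a large ratio means the per-example gradients are noisy relative to their mean, so averaging over a bigger batch still buys a cleaner update. A small ratio means extra examples in the batch are mostly redundant.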

For example, there was a long line of work that analyzed optimization dynamics by assuming that the gradient covariance matrix was co-diagonalizable with the Hessian of the loss. People also discussed whether optimization occurred in a low-dimensional subspace, i.e. if the gradient covariance matrix was low-rank. This was empirically verified in image classification problems with convolutional architectures. But it doesn’t appear that this line of research will be as fruitful for unsupervised learning in transformers, where the covariance and Hessian both seem to be less degenerate. And besides, most people use AdamW optimizers now instead of plain SGD, and very little theoretical work has analyzed AdamW dynamics.

Several research groups took a different approach. In particular, people kept noticing that power law relationships popped up everywhere. With the right kind of setup, you could even predict model performance based on quantities like parameter count and dataset size. This empirical approach has been very fruitful and is what motivated OpenAI to train GPT-3 in the first place. I know many researchers who were absolutely floored when they first saw that these power laws hold for so many orders of magnitude. It is entirely possible that they will continue to hold until we surpass human performance on a significant fraction of economically valuable activities. That is a scary proposition. It also validates the bets that companies like Amazon, Facebook, Google, NVIDIA, OpenAI, Microsoft, and Tesla are making with their infrastructure and financial commitments.
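To make the prediction step concrete, here is a minimal sketch of fitting a power law of the form loss = a * size^(-alpha) by least squares in log-log space and then extrapolating. The data points, constants, and function names are all hypothetical, and the sketch ignores details such as an irreducible loss term that real scaling-law fits may include.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Ordinary least squares for the line y = intercept + slope * x.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # log(loss) = log(a) - alpha * log(size), so alpha = -slope.
    return math.exp(intercept), -slope


def predict_loss(a, alpha, size):
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-alpha)


# Hypothetical, noise-free measurements generated from a known power law.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in sizes]

a, alpha = fit_power_law(sizes, losses)
extrapolated = predict_loss(a, alpha, 1e11)  # two orders of magnitude beyond the data
print(a, alpha, extrapolated)
```

On a log-log plot a power law is a straight line, which is why a simple linear regression suffices here. The striking empirical claim is that the line keeps holding well outside the range it was fit on.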

So on the one hand, we have a mathematical theory to think about these problems, rooted in language borrowed from areas like statistical physics and high-dimensional probability theory. On the other hand, we have a meticulous quantitative analysis of these models through predictive tools like power laws. In an ideal world, there will also be some way to fuse these two things together. In the language of physics, we should be able to pair a microscopic story with its macroscopic summary. But even if that operation is successful, I am not confident that it will provide a rich enough language to help us predict and understand the emergence of certain qualitative properties of the system.

Step changes and unpredictable emergent behaviors

From a safety perspective, one of the scarier things about the scaling laws was the emergence of qualitative behaviors that did not appear in smaller models or earlier architectures. This was scary because our quantitative predictions did not give us the ability to establish a consensus bet that certain capabilities would emerge after we trained a sufficiently good model. In other words: these models are predictably better, but in unpredictable ways. This leads to speculation of the form “if GPT-3 can already do ABC, then will GPT-$n$ be able to do XYZ, for $n > 3$?”

The most obvious example is the meta-learning capability of GPT-3. Meta-learning and cross-task generalization have long been a holy grail of AI research, and I don’t think many people who explicitly worked in that research area expected a simple likelihood maximization approach to work. Sometimes it feels as though we are in a regime of progress where the best question is not “is this system intelligent?” but “what do people mean when they call this system’s behavior intelligent?” We are being forced to pick apart and modularize our assumptions about cognition and intelligence by breaking them down into specific qualitative behaviors. Meta-learning seems to have come and gone, and the goalposts shifted. But there are plenty of other features that may or may not emerge as we continue to develop these models. Common sense, logical consistency, multimodal reasoning, and online learning all come to mind.

Some of these features might already be on the way and others might never appear. But, again, one reason these scaling laws are scary is that we can’t predict what qualitative behaviors will come next. Yet they are motivating people to invest billions of real dollars into scale. And we have no way of confirming which desirable behaviors are desirable merely as a result of anthropocentrism. For example, is “common sense” reasoning necessary, or even desirable, for most economically valuable tasks? We fail at logical consistency more often than we’d like to admit. From an uncharitable reading of the symbol grounding problem, multimodal models might only buy us sample efficiency. And it may be that online learning will be achieved with the same gradient-based approaches we already use for pre-training and fine-tuning. But right now, we have no good way of systematically measuring these qualitative behaviors in a way that provides satisfactory criteria for intelligent systems.

A desirable “semantics” has two main properties: it should clearly explain the internal operations of a system, and it should be able to predict how complex qualitative behaviors emerge over time. Although the relative significance of explanation versus prediction seems to be unsettled in discourses about the philosophy of science, I lean towards a weak conviction that an explanation should permit us to predict. I want to have a mental model of these systems that makes these qualitative behaviors obvious in retrospect, like “of course GPT-3 should be a few-shot learner” or “of course iGPT should produce good features for image classification”.