Back in 2018 I published pytorch-hessian-eigenthings, a niche open-source package for GPU-accelerated curvature analysis of PyTorch models. I built it originally for my own research. Many generalization phenomena in deep learning, such as the flat-minima hypothesis (Hochreiter & Schmidhuber 1997, Keskar et al. 2017), low-rank Hessian claims (Sagun et al. 2017, Papyan 2019), and edge-of-stability dynamics (Cohen et al. 2021), have been linked to the eigenstructure of the loss Hessian. But the full Hessian costs memory quadratic in the parameter count, which is hopeless for anything bigger than a toy MLP. The library used Hessian-vector products together with iterative methods (Lanczos, power iteration) to recover the leading eigendecomposition in linear memory instead.
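For readers who haven't seen the trick: a Hessian-vector product needs only one extra backward pass and linear memory. Here's a minimal sketch in raw PyTorch autograd (illustrative only, not the library's actual API):

```python
import torch

def hvp(loss: torch.Tensor, params: list[torch.Tensor], vec: torch.Tensor) -> torch.Tensor:
    """Hessian-vector product H @ vec via double backward.

    One extra backward pass and O(#params) memory: the n x n
    Hessian itself is never materialized.
    """
    # First backward, keeping the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    # Second backward: d/dtheta (grad(theta) . vec) = H @ vec.
    hv = torch.autograd.grad(flat_grad @ vec, params)
    return torch.cat([h.reshape(-1) for h in hv])
```

Power iteration and Lanczos only ever consume this matvec, which is what keeps the whole pipeline linear in the parameter count.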
I stepped away from the project for years, but it ended up getting used by other researchers doing curvature analysis. Coming back to it recently, I found the original implementation had aged the way an enthusiastic undergrad's codebase usually does: shaky numerical edges, no real test suite, an immature public API. I've had a few years of more professional engineering experience since then, so I thought I'd try to improve it.
I just shipped a v1.0 rewrite. The best additions:

- **torch.compile cross-entropy HVP kernel.** The vanilla autograd path materializes intermediates that are quadratic in the vocabulary size, which blows up for foundation-model-scale vocabularies. The fused kernel sidesteps that (~3.4× speedup with Triton on CUDA, ~2.6× with torch.compile, both with ~2× peak-memory reduction over eager).
- **A real test suite with closed-form cases.**

That last point is the one I care about most, honestly. The original library shipped with some manually run validation scripts on quadratics, and I gained confidence by checking its results against known trends in larger models, like tracing out eigenvalue spectra on ResNets. But "the eigenvalues look reasonable" is a bad acceptance test for code that's going to end up in other people's research artifacts. Writing the closed-form tests was also a useful exercise in re-deriving things I'd half-forgotten. For example, the GGN for a softmax-cross-entropy head reduces to something quite clean, and pinning that down in a test is a more honest form of documentation.
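To make "quite clean" concrete: the output-space curvature factor of the GGN for a softmax-cross-entropy head is diag(p) - p pᵀ, where p is the softmax output, and since cross-entropy is convex in the logits this is also the exact logit Hessian. That makes for a tight closed-form check. An illustrative sketch of that kind of test (not the repo's actual test code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5)        # one example, five classes
target = torch.tensor([2])

def loss_fn(z: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(z.unsqueeze(0), target)

# Exact Hessian of the loss w.r.t. the logits, via autograd.
H = torch.autograd.functional.hessian(loss_fn, logits)

# Closed form: diag(p) - p p^T with p = softmax(logits). For
# cross-entropy the loss is convex in the logits, so the GGN's
# output-space factor and the exact logit Hessian coincide.
p = torch.softmax(logits, dim=0)
G = torch.diag(p) - torch.outer(p, p)

assert torch.allclose(H, G, atol=1e-6)
```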
A few thoughts from the rewrite itself, in case they’re useful to anyone doing the same thing to their own old code:
- **Decouple the operator from the solver.** Separating `LinearOperator` (how do you apply the operator to a vector?) from `EigenSolver` (how do you iterate?) made adding GGN and Fisher operators essentially free, and made stochastic Lanczos quadrature (SLQ) much easier to integrate; see the sketch below.
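To illustrate the shape of that split, here is a minimal sketch of the idea; the names `LinearOperator`, `apply`, and `power_iteration` are illustrative, not the package's actual interfaces:

```python
from typing import Protocol
import torch

class LinearOperator(Protocol):
    """How do you apply the operator to a vector? Nothing else."""
    size: int
    def apply(self, vec: torch.Tensor) -> torch.Tensor: ...

class MatrixOperator:
    """Dense matrix operator, only to keep the sketch self-contained;
    an HVP-, GGN-, or Fisher-backed operator exposes the same members."""
    def __init__(self, mat: torch.Tensor):
        self.mat, self.size = mat, mat.shape[0]

    def apply(self, vec: torch.Tensor) -> torch.Tensor:
        return self.mat @ vec

def power_iteration(op: LinearOperator, steps: int = 100):
    """How do you iterate? The solver sees only op.apply(), so any
    curvature operator plugs in without solver changes."""
    v = torch.randn(op.size)
    v /= v.norm()
    for _ in range(steps):
        av = op.apply(v)
        v = av / av.norm()
    return torch.dot(v, op.apply(v)), v   # Rayleigh quotient, eigvec

# Smoke test: leading eigenpair of a small symmetric matrix.
a = torch.randn(8, 8)
eigval, eigvec = power_iteration(MatrixOperator(a + a.T))
```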
I'm hoping to use the rewrite as scaffolding for some follow-up analysis. The thing I'm playing with right now is how much the update directions of different optimizers agree with one another. I've been reading up on newer optimizers like Muon, Shampoo, and K-FAC, and revisiting natural gradient descent, and I'm running the analysis on Pythia checkpoints. The 1B+ models need cheap, accurate curvature operators at a scale where the 2018 code would have melted, and 70B+ models like Llama will additionally need the finite-difference HVP approach and the fused kernel.

If you're working in this space and have suggestions, requests, or pointers to recent work I should be aware of, I'd appreciate a heads-up. I've been out of the field for a while and my reading list has gaps. The repo is here; it builds on prior work from HessianFlow, Accelerated Stochastic Power Iteration, PyHessian, curvlinops, and HessFormer.