Back in 2018 I published pytorch-hessian-eigenthings, a niche open-source package for GPU-accelerated curvature analysis of PyTorch models. I built it originally for my own research. Many generalization phenomena in deep learning, such as the flat-minima hypothesis (Hochreiter & Schmidhuber 1997, Keskar et al. 2017), low-rank Hessian claims (Sagun et al. 2017, Papyan 2019), and edge-of-stability dynamics (Cohen et al. 2021), have been linked to the eigenstructure of the loss Hessian. But the full Hessian costs memory quadratic in the parameter count, which is hopeless for anything bigger than a toy MLP. The library used Hessian-vector products together with iterative methods (Lanczos, power iteration) to recover the leading eigendecomposition in linear memory instead.
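For readers who haven't seen the trick: a Hessian-vector product needs only one extra backward pass and linear memory. Here's a minimal sketch in raw PyTorch autograd (illustrative only, not the library's actual API):

```python
import torch

def hvp(loss: torch.Tensor, params: list[torch.Tensor], vec: torch.Tensor) -> torch.Tensor:
    """Hessian-vector product H @ vec via double backward.

    One extra backward pass and O(#params) memory: the n x n
    Hessian itself is never materialized.
    """
    # First backward, keeping the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    # Second backward: d/dtheta (grad(theta) . vec) = H @ vec.
    hv = torch.autograd.grad(flat_grad @ vec, params)
    return torch.cat([h.reshape(-1) for h in hv])
```

Power iteration and Lanczos only ever consume this matvec, which is what keeps the whole pipeline linear in the parameter count.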
I stepped away from the project for years, but it ended up getting used by other researchers doing curvature analysis. Coming back to it recently, I found the original implementation had aged the way an enthusiastic undergrad's codebase usually does: shaky numerical edges, no real test suite, an immature public API. I've had a few years of more professional engineering experience since then, so I thought I'd try to improve it.
I just shipped a v1.0 rewrite. The best additions:

- **torch.compile cross-entropy HVP kernel.** The vanilla autograd path materializes intermediates that are quadratic in the vocabulary size, which blows up for foundation-model-scale vocabularies. The fused kernel sidesteps that (~3.4× speedup with Triton on CUDA, ~2.6× with torch.compile, both with ~2× peak-memory reduction over eager).
- **A real test suite with closed-form cases.**

That last point is the one I care about most, honestly. The original library shipped with some manually run validation scripts on quadratics, and I gained confidence by checking its results against known trends in larger models, like tracing out eigenvalue spectra on ResNets. But "the eigenvalues look reasonable" is a bad acceptance test for code that's going to end up in other people's research artifacts. Writing the closed-form tests was also a useful exercise in re-deriving things I'd half-forgotten. For example, the GGN for a softmax-cross-entropy head reduces to something quite clean, and pinning that down in a test is a more honest form of documentation.
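To make "quite clean" concrete: the output-space curvature factor of the GGN for a softmax-cross-entropy head is diag(p) - p pᵀ, where p is the softmax output, and since cross-entropy is convex in the logits this is also the exact logit Hessian. That makes for a tight closed-form check. An illustrative sketch of that kind of test (not the repo's actual test code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5)        # one example, five classes
target = torch.tensor([2])

def loss_fn(z: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(z.unsqueeze(0), target)

# Exact Hessian of the loss w.r.t. the logits, via autograd.
H = torch.autograd.functional.hessian(loss_fn, logits)

# Closed form: diag(p) - p p^T with p = softmax(logits). For
# cross-entropy the loss is convex in the logits, so the GGN's
# output-space factor and the exact logit Hessian coincide.
p = torch.softmax(logits, dim=0)
G = torch.diag(p) - torch.outer(p, p)

assert torch.allclose(H, G, atol=1e-6)
```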
A few thoughts from the rewrite itself, in case they’re useful to anyone doing the same thing to their own old code:
- **Decouple the operator from the solver.** Separating `LinearOperator` (how do you apply the operator to a vector?) from `EigenSolver` (how do you iterate?) made adding GGN and Fisher operators essentially free, and made stochastic Lanczos quadrature (SLQ) much easier to integrate; see the sketch below.
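To illustrate the shape of that split, here is a minimal sketch of the idea; the names `LinearOperator`, `apply`, and `power_iteration` are illustrative, not the package's actual interfaces:

```python
from typing import Protocol
import torch

class LinearOperator(Protocol):
    """How do you apply the operator to a vector? Nothing else."""
    size: int
    def apply(self, vec: torch.Tensor) -> torch.Tensor: ...

class MatrixOperator:
    """Dense matrix operator, only to keep the sketch self-contained;
    an HVP-, GGN-, or Fisher-backed operator exposes the same members."""
    def __init__(self, mat: torch.Tensor):
        self.mat, self.size = mat, mat.shape[0]

    def apply(self, vec: torch.Tensor) -> torch.Tensor:
        return self.mat @ vec

def power_iteration(op: LinearOperator, steps: int = 100):
    """How do you iterate? The solver sees only op.apply(), so any
    curvature operator plugs in without solver changes."""
    v = torch.randn(op.size)
    v /= v.norm()
    for _ in range(steps):
        av = op.apply(v)
        v = av / av.norm()
    return torch.dot(v, op.apply(v)), v   # Rayleigh quotient, eigvec

# Smoke test: leading eigenpair of a small symmetric matrix.
a = torch.randn(8, 8)
eigval, eigvec = power_iteration(MatrixOperator(a + a.T))
```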
I'm hoping to use the rewrite as scaffolding for some follow-up analysis. The thing I'm playing with right now is how much the update directions of different optimizers agree with one another. I've been reading up on newer optimizers like Muon, Shampoo, and K-FAC, and revisiting natural gradient descent, and I'm running the analysis on Pythia checkpoints. The 1B+ models need cheap, accurate curvature operators at a scale where the 2018 code would have melted, and 70B+ models like Llama will additionally need the finite-difference HVP approach and the fused kernel.

If you're working in this space and have suggestions, requests, or pointers to recent work I should be aware of, I'd appreciate a heads-up. I've been out of the field for a while and my reading list has gaps. The repo is here; it builds on prior work from HessianFlow, Accelerated Stochastic Power Iteration, PyHessian, curvlinops, and HessFormer.