Persistent Homology as an Integrationist Framework for LLM Interpretability

I hope to write a longer post about this work, but for now I wanted to share our paper, in which we propose persistent homology as a mathematically principled framework for characterizing the topology and geometry of LLM latent spaces across multiple scales. Our results uncover topological signatures of adversarial influence that remain consistent across model families, sizes, layers, and fundamentally distinct attack modes.
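To give a feel for what this looks like in practice, here is a toy sketch, not the exact pipeline from the paper: it computes Vietoris-Rips persistence diagrams for two point clouds standing in for layer activations and compares them with a bottleneck distance. The `ripser` and `persim` packages and the synthetic arrays are assumptions for the example; real use would extract hidden states from a model.

```python
# Toy sketch: persistence diagrams for two activation point clouds.
# ripser/persim are assumed to be installed; the random arrays below are
# stand-ins for real per-token hidden states extracted from an LLM layer.
import numpy as np
from ripser import ripser
from persim import bottleneck

rng = np.random.default_rng(0)

# (n_tokens x hidden_dim) activation clouds: a "benign" one and a perturbed one.
benign_acts = rng.normal(size=(200, 64))
perturbed_acts = benign_acts + rng.normal(scale=0.5, size=(200, 64))

# Vietoris-Rips persistent homology up to H1 (components and loops),
# which summarizes the cloud's shape across all distance scales at once.
dgms_benign = ripser(benign_acts, maxdim=1)["dgms"]
dgms_perturbed = ripser(perturbed_acts, maxdim=1)["dgms"]

# Compare the H1 diagrams; a large distance indicates that the perturbation
# changed the multi-scale topology of the representation space.
print("H1 bottleneck distance:", bottleneck(dgms_benign[1], dgms_perturbed[1]))
```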

Our approach shifts the focus from reductionist interpretability methods, which provide local or global explanations in isolation, to a more integrative framework that models representational spaces across multiple levels of abstraction. Existing representation-engineering and interpretability approaches, such as Sparse Autoencoders and linear probes, have been successful at isolating certain conceptual directions and features, but they typically impose restrictive geometric assumptions and produce findings that are model-, layer-, or context-specific.
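To make the contrast concrete, the simplest version of that geometric assumption is a linear probe, which commits to a single direction in activation space for a concept. The sketch below is purely illustrative, with synthetic stand-in data rather than anything from the paper.

```python
# Illustrative only: a linear probe reduces a concept to one direction.
# The synthetic activations and labels below are placeholders, not real data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic "hidden states" (n_examples x hidden_dim) and binary concept labels.
acts = rng.normal(size=(500, 64))
true_direction = rng.normal(size=64)
labels = (acts @ true_direction > 0).astype(int)  # concept constructed to be linear

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
concept_direction = probe.coef_[0]  # the single direction the probe commits to
print("Probe accuracy:", probe.score(acts, labels))
```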

Ultimately, we propose a shift away from interpreting individual features or concepts themselves, especially when they are not coordinate-free, towards understanding the representational substrate that precedes and gives rise to them.

I believe this perspective is important if we are to succeed in developing scalable, effective, and robust approaches to understanding and managing the behavior and capabilities of AI systems. It also honors the reality that a concept doesn't need to be "spoken" or emitted in order to exist, which has consequences for how we evaluate these systems (i.e., we should not focus solely on the tokens themselves).

This is joint work with Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, and Anthea Monod.

Read the paper here: https://arxiv.org/abs/2505.20435