Rare Event Analysis of Large Language Models – Rare Event Analysis for LLMs

Rare Event Analysis of Large Language Models

Author

Jake McAllister Dorman

Published

June 1, 2026

\[ \newcommand{\traj}[1]{X_{0:#1}} \newcommand{\realtraj}{\mathbf{x}} \newcommand{\unrealtraj}{\mathbf{X}} \newcommand{\realtoken}{x} \newcommand{\unrealtoken}{X} \newcommand{\obs}{\phi} \newcommand{\UnwantedSet}{\mathcal{D}} \newcommand{\KL}{D_{\text{KL}}} \newcommand{\KLopen}[2]{D_{\text{KL}}(#1 \,| \,#2)} \newcommand{\bias}{\lambda} \newcommand{\EV}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\pdf}{p} \newcommand{\Maxent}{\Prob'_{\bias}} \newcommand{\Tilted}{\tilde{\Prob}_{\bias}} \newcommand{\Tiltednorm}{\Prob_{\bias}} \newcommand{\base}{\Prob} \newcommand{\transition}{X_{t}|X_{0:t-1}} \newcommand{\weights}{\mathbf{w}} \newcommand{\model}{\mathcal{M}_{\weights}} \newcommand{\probmodel}{\Prob_{\mathcal{M}}} \newcommand{\pdfmodel}{p_{\mathcal{M}}} \newcommand{\argmin}{\operatorname{argmin}} \newcommand{\logsumexp}{\operatorname{logsumexp}} \newcommand{\dataavg}{\overline{O}} \]

We are pleased to announce that our paper Rare Event Analysis of Large Language Models has been awarded an oral spotlight position at ICML 2026. In this work, we use methodology from Statistical Physics and Computational Chemistry to sample and estimate the likelihood of rare events in large language models (LLMs): these are rare, often unwanted completions to some prompt, such as illegal or dangerous content.

Contents

In this post, we will give a broad overview of our paper, covering:

Other resources, including the video abstract, poster and oral presentations (when released) are available in Further Resources.

Problem Setup

Suppose we have some Language Model with a Probability Density Function (PDF) \(\pdfmodel,\) and some prompt \(\realtraj_{1:\tau}.\) Then we can sample completions to this prompt from the model as \[ \realtraj_{\tau + 1: T} \sim \probmodel(\unrealtraj \, | \, \realtraj_{1:\tau}), \] where \(\realtraj_{\tau + 1: T}\) is our sampled value and \(\unrealtraj\) is a random variable (RV) defined over all possible completions from the model.

Next, suppose we have some observable \(\obs(\realtraj_{\tau + 1 : T}) \in \mathbb{R}\) of our completions, and we define some set \(\UnwantedSet\) over the space of this observable, such that if our completion \(\realtraj_{\tau + 1 : T}\) is considered “unwanted”, we have that \(\obs(\realtraj_{\tau + 1 : T}) \in \UnwantedSet.\) We would like to estimate the probability, \[ \begin{align} \probmodel(\obs(\realtraj_{\tau + 1 : T}) \in \UnwantedSet \, | \, \realtraj_{1:\tau}) = \EV_{\realtraj_{\tau + 1: T} \sim \probmodel(\unrealtraj \, | \, \realtraj_{1:\tau})} \left[I \left[ \obs(\realtraj_{\tau + 1 : T}) \in \UnwantedSet \right] \right], \end{align} \tag{1}\] i.e. the probability of producing an unwanted completion for this prompt.

We could try to estimate this probability using a standard sampling estimate (which we refer to as Direct Sampling), \[ \begin{align} \probmodel(\obs(\realtraj_{\tau + 1 : T}) \in \UnwantedSet \, | \, \realtraj_{1:\tau}) \approx \frac{1}{N} \sum_{i=1}^N \left[I \left[ \obs(\realtraj^{(i)}_{\tau + 1 : T}) \in \UnwantedSet \right] \right] \,. \end{align} \tag{2}\] However, if this probability is sufficiently small, it is likely that we will see no samples in \(\UnwantedSet\) at all, meaning our probability estimate is 0. If we take \(N\) samples during testing, then we only expect to see completions with a probability of around \(\tfrac{1}{N}\) or higher. Therefore, for example, if the true probability of our unwanted event is \(\tfrac{1}{10 \times N}\), we are unlikely to see any of these dangerous samples during the testing phase, leading to the belief that the model is safe. Yet, if this LLM then generates \(1000 \times N\) samples post deployment (which is likely, given the scale of population usage of LLMs), it is expected that 100 users will encounter a dangerous, unwanted proposal despite the model being marked as “safe” during testing.

Figure 1: During testing, no dangerous completions are observed leading to models being considered safe. However, post deployment, orders of magnitude more completions are generated, and then some completions that are missed during testing become likely to occur for at least one user.

Importance Sampling

Clearly, we need a way to sample these rare events that is more efficient than directly sampling the model. The key to this is Importance Sampling (IS). Note that for any expectation, we can write: \[ \begin{align} \EV_{p} [f(x)] & = \int f(x) p(x) dx, \\ & = \int p^*(x) \frac{p(x)}{p^*(x)} f(x) dx, \\ & = \EV_{p^*} \left[w(x) f(x)\right], \end{align} \tag{3}\]

where we have defined \(w(x) = \frac{p(x)}{p^*(x)},\) and assumed \(p^*(x)\) is non-zero wherever \(p(x)\) is non-zero (i.e. \(p^*\) covers \(p\)). Hence, rather than using the sampling estimate shown in Equation 2, we can take samples from some other distribution \(p^*(x)\), and evaluate

\[ \begin{align} \probmodel(\obs(\realtraj_{\tau + 1 : T}) \in \UnwantedSet \, | \, \realtraj_{1:\tau}) \approx \frac{1}{N} \sum_{i=1}^N \left[w(\realtraj^{(i)}_{\tau + 1 : T}) \cdot I \left[ \obs(\realtraj^{(i)}_{\tau + 1 : T}) \in \UnwantedSet \right] \right]. \end{align} \tag{4}\]

If we choose some distribution \(p^*\) which has higher coverage of the rare regions of \(p\), we can make it significantly more likely that we obtain non-zero estimates of our desired probability. This is known as Importance Sampling (IS).

Umbrella Sampling

We can take this a step further by instead sampling from several different distributions, \(p_k,\) where \(k=1, \dots, K.\) Then, we can define the mixture distribution as \[ p^*(x) = \sum_{k=1}^K \alpha_k p_k(x), \] where \(\sum_k \alpha_k = 1\) and \(\alpha_k \ge 0\). This means that if we choose distributions \(p_k\) that together cover the entire space of the observable \(\obs\), we can get an estimate for the entire distribution of our observable, under the original model.

Note that generally, \(\alpha_k = \frac{N_k}{N}\) is chosen, where \(N_k\) is the number of samples taken from distribution \(p_k\), and \(N = \sum_k N_k.\) In our work, we take the same number of samples from each distribution, and as such choose \(\alpha_k = \frac{1}{K}\) for all \(k\).

Figure 2: Sampling directly from the target dynamics gives many samples in the bulk, but few in the tails. Instead, we sample from various distributions, such that the mixture of these distributions has coverage over the whole space of the observable. Then, to calculate expectations under the original model, we have to calculate a weight for each sample \(w(x)\), here denoted by the size of the points in the final plot.

The Importance of the Right Distribution

It is important to note how much the accuracy of the importance sampling estimate depends on the choice of the distribution \(p^*.\) As a simple example to demonstrate this, imagine we are able to split our set \(\UnwantedSet\) into two disjoint sets, \(D_1\) and \(D_2\), such that \(\UnwantedSet = D_1 \cup D_2\) and \(D_1 \cap D_2 = \emptyset.\) Now suppose also that we have: \[ \begin{align} \probmodel(\phi(x) \in D_1) & \sim \mathcal{O}(10^{-6}), \\ \probmodel(\phi(x) \in D_2) & \sim \mathcal{O}(10^{-15}). \end{align} \] and therefore, \[ \probmodel(\phi(x) \in D) \sim \mathcal{O}(10^{-6}). \] If we choose \(p^*\) such that it gives very high probability to \(D_2\) and very low probability to \(D_1\), we will almost certainly see only samples from \(D_2\) and no samples from \(D_1\). This would lead our probability estimate for \(D\) being \(\mathcal{O}(10^{-15}),\) which is several orders of magnitude lower than the true value.

Exponentially Reweighted Distributions

So what distribution should we sample from? We want to be sample the “closest” samples to our original distribution, that have the rare characteristics we care about. One way to formulate this is through the optimization problem, \[ \begin{align} p^* = \argmin_q \KL(q \, | \, \pdfmodel), \quad \text{Subject to } \EV_{q} [\phi'(x)] & = F, \\ \sum_x q(x) & = 1, \end{align} \] where \(\KL\) is the Kullback-Leibler (KL) divergence, which denotes how different two distribution are, and \(F\) is the mean that we wish to enforce our Importance Sampling distribution to be centered about. Note that the expectation is taken over a new observable, \(\phi'(x).\) This is because we do not necessarily need to choose the same observable as we are looking to evaluate in our original model. For example, If our original metric \(\phi\) is expensive (i.e. an AI safety metric that requires model calls), we could instead choose the metric we bias towards, \(\phi'\), to be some cheaper, correlated observable. Nonetheless, in this work, we always choose \(\phi = \phi'\), and will therefore use \(\phi\) in our notation exclusively from this point onwards.

If we solve this optimization problem, we recover the Exponentially Reweighted or Biased distribution, \[ \begin{align} p^*_\lambda (x) = \frac{1}{Z(\lambda)} e^{-\lambda \phi(x)} \pdfmodel(x), \end{align} \] towards our biasing observable \(\phi,\) where \(\lambda\) is a Lagrange multiplier that we call the bias, and \(Z(\lambda)\) is a normalization constant known as the Partition Function. Each \(\lambda\) corresponds monatonically to a unique Expectation \(F\) of our importance sampling distribution. Hence, we can vary \(\lambda\) to bias our importance sampling distribution towards rare areas of our original distribution.

Figure 3: A distribution and it’s exponential reweighting for three values of \(\lambda\). By varying the bias \(\lambda\), we can shift the importance sampling distribution towards rarer values under our original distribution.

The key difficulty presented by the biased distribution is the Partition Function \(Z(\lambda).\) Calculating this requires solving an integral over all possible values of \(x\), which is very difficult for high dimensional states, such as tokenized text. As such, it is unfeasible to directly sample from this distribution.

Markov Chain Monte Carlo

Instead of directly sampling the biased distributions, we use Markov Chain Monte Carlo (MCMC). The idea is to create a Markov Chain (i.e. determine a transition function) which has our biased distribution \(p^*_\lambda\) as the stationary distribution. This way, if we traverse the chain long enough, our samples are guaranteed to be from the biased distribution.

Metropolis Hasting Algorithm

Suppose at step \(s\) in our Markov chain, we have state \(x\). The Metropolis Hastings (MH) Algorithm first samples a new state \(x'\) from a known proposal function \(\pi(x' \, | \, x).\) This is then accepted with some acceptance probability \(A(x' \, | \, x)\). Hence, our transition function is:

\[ p_\text{MH}(x' | x) = \begin{cases} \pi(x' \, | \, x) A(x' \, | \, x) & \text{if } x' \neq x, \\ \pi(x \, | \, x) + \sum_{x'' \neq x} \pi(x'' \, | \, x) (1 - A(x'' \, | \, x)) & \text{if } x' = x. \end{cases} \]

To ensure that our biased distribution is stationary, we must enforce the Detailed Balance Condition, \[ p^*_\lambda(x) \tau(x' | x) = p^*_\lambda(x') \tau(x | x'). \] Rearranging this gives the following condition on our acceptance function: \[ \frac{A(x' \, | \, x)}{A(x \, | \, x')} = \frac{p^*_\lambda(x') \pi(x \, | \, x')}{p^*_\lambda(x) \pi(x' \, | \, x)}. \] One choice that satisfies this constraint is the Metropolis Criterion, given by \[ A(x' \, | \, x) = \min [1, \alpha], \quad \alpha = \frac{p^*_\lambda(x') \pi(x \, | \, x')}{p^*_\lambda(x) \pi(x' \, | \, x)}. \]

We can choose any proposal \(\pi\) as long as all states with \(p_\lambda^*(x') > 0\) also have \(\pi(x' \, | \,x) > 0\) for some \(x\).

The key reason that MH is a viable choice is that the acceptance parameter depends only on ratios of the target distribution and not its normalisation. For example, substituting in the definition of our biased distribution we get, \[ \begin{align} \alpha & = \frac{ Z^{-1}(\lambda) \cdot e^{- \lambda \phi(x')} \cdot p(x') \cdot \pi(x \, | \, x')}{Z^{-1}(\lambda) \cdot e^{- \lambda \phi(x)} \cdot p(x) \cdot \pi(x' \, | \, x)} \\ & = e^{- \lambda (\phi(x') - \phi(x))} \cdot \frac{p(x') \pi(x \, | \, x')}{p(x) \pi(x' \, | \, x)}. \end{align} \tag{5}\] This means that we can sample the distribution without knowing \(Z(\lambda).\)

Transition Path Sampling

The variant of MCMC we use is known as Transition Path Sampling (TPS), which is particularly useful for states which are ordered trajectories, such as text. At step \(s\), we have some completion \(\realtraj_{t+1:T}\) generated from a prompt \(\realtraj_{1:t}.\) First, we split the completion at some point \(\tau \sim U[t+1, T],\) and regenerate the second half of the trajectory from some regeneration model, \[ \realtraj'_{\tau + 1 : T'} \sim p_\text{regen} (\unrealtraj_{\tau + 1 : T} \, | \, \realtraj_{1:t}, \realtraj_{t+1 : \tau}). \] Our new proposal is then, \[ \realtraj'_{\tau + 1 : T'} = (\realtraj_{t+1:\tau}, \realtraj'_{\tau + 1 : T'}). \]

In our work, we have assumed that we do not have access to the model weights, such that our methodology can be applied to closed-access models that only provide interfaces for generation. As such, for our choice of regeneration model, we use the original dynamics of the model. This way, the acceptance probability defined in Equation 5 simplifies even further to \[ \alpha = e^{- \lambda (\phi(x') - \phi(x))}, \]

meaning the probabilities of regeneration are no longer needed. If one assumed access to the model logits, this regeneration function could likely be optimized through e.g. model abliteration or finetuning, which should decrease the correlation time of the markov chain.

For a more detailed view of how we use Markov Chain Monte Carlo, see Appendix B.1 and B.2 of the paper.

Figure 4: At each step, the completion is cut at some point and regenerated from this point onwards. The current completion and proposal are then evaluated according to the observable \(\phi,\) and these values are used to calculate the acceptance probability. If accepted, the state at the next step is the new proposal. Otherwise, the new state stays the same as it was in the previous step.

We use TPS with annealing, which means gradually increasing the bias \(\lambda\) as the markov chain progresses. This allows us to sample multiple biases from within a single chain, and improves convergence since each fixed bias starts closer to its target distribution. Annealing is covered in more depth in Appendix C.1 of the paper.

Ensuring Convergence

We must ensure that the samples we use in Importance Sampling are valid sample of the biased distribution, otherwise the calculated weights will be incorrect. We use two methods to remove unconverged samples:

  1. Firstly, we remove the first 10% of each bias in the annealing process, to allow time for the chains to converge. This is known as burn-in.
  2. Secondly, for each unique bias, we calculate the Gelman Rubin Statistic (an MCMC convergence metric) over multiple independent chains, and discard all samples from biases that have a \(\text{GR} > 1.1.\)

For more information on these, see Appendices C.2 and C.3 of the paper.

Importance Sampling with the Exponentially Reweighted Distribution

Using annealed TPS, we can gather \(N\) samples for each of our \(K\) (accepted) biases \(\lambda_1, \dots, \lambda_K\). We can arrange these samples in arrays of shape \(K \times N\) as follows: \[ \begin{align} \boldsymbol{X} = \begin{bmatrix} \realtraj_{\lambda_1}^{(1)} & \dots & \realtraj_{\lambda_1}^{(N)} \\ \vdots & \ddots & \vdots \\ \realtraj_{\lambda_K}^{(1)} & \dots & \realtraj_{\lambda_K}^{(N)} \end{bmatrix}, \quad \boldsymbol{\Phi} = \begin{bmatrix} \phi_{\lambda_1}^{(1)} & \dots & \phi_{\lambda_1}^{(N)} \\ \vdots & \ddots & \vdots \\ \phi_{\lambda_K}^{(1)} & \dots & \phi_{\lambda_K}^{(N)} \end{bmatrix}, \end{align} \] where each \(\phi_{\lambda_k}^{(n)} = \phi(\realtraj_{\lambda_k}^{(n)}).\) We need to calculate importance sampling weights for each of these samples. Our weights are defined as, \[ \begin{align} w(\realtraj_{\lambda_k}^{(n)}) & = \frac{p(\realtraj_{\lambda_k}^{(n)})}{p_{\lambda_k}^*(\realtraj_{\lambda_k}^{(n)})} \,, \\ & = \frac{p(\realtraj_{\lambda_k}^{(n)})}{Z^{-1}(\lambda_k) \cdot \exp(- \lambda_k \, \phi_{\lambda_k}^{(n)}) \cdot p(\realtraj_{\lambda_k}^{(n)})} \,, \\ & = Z(\lambda_k) \cdot \exp(\lambda_k \phi_{\lambda_k}^{(n)}) \,. \end{align} \] The probabilities have cancelled out, meaning we do not need the model logits to calculate the importance sampling weights. However, we are left with the partition function, for which we must construct a sampling estimate. We do this using the Multistate Bennett Acceptance Ratio (MBAR), which defines a set of non-linear equations for the partition function at each \(\lambda_k\), which can then be solved numerically for \(\hat{Z}_1 , \dots , \hat{Z}_K.\) For our purposes, we can just think of this as a function that returns our estimates for the partition function, \[ (\hat{Z}_1 , \dots , \hat{Z}_K) = \text{MBAR}(\boldsymbol{\Phi}), \] given the samples of our observable. We can then estimate our importance sampling weights as \[ \hat{w}_{\lambda_k}^{(n)} = \hat{Z}_k \cdot \exp(\lambda_k \phi_{\lambda_k}^{(n)}). \] Finally, we can use Equation 1 to estimate our probability of getting an unwanted sample from our model as, \[ \probmodel(\obs(\realtraj_{\tau + 1 : T}) \in \UnwantedSet \, | \, \realtraj_{1:\tau}) \approx \sum_{k=1}^K \sum_{n=1}^N \hat{w}_{\lambda_k}^{(n)} I \left[ \phi_{\lambda_k}^{(n)} \in \UnwantedSet \right]. \tag{6}\]

Results

Since this methodology is general enough to be applied to many problems, we use an example that is accessible to a wide audience, and does not require extensive domain specific knowledge. We study rare completions of the TinyStories-8M model under two observables:

  • The Automated Readability Index (ARI), a linguistic measure of text complexity. Since the TinyStories models are trained on childrens stories, overly complicated text can be seen as unwanted or misaligned completions from the model.
  • The log-probability assigned to the completion by the model.

The use of the TinyStories model also allows comparison to the training data, since this is open-source. For each observable, we run ten chains each towards positive and negative biases, with 10 biases per chain and 40,000 steps per bias. We also include an additional 200,000 completions generated through direct sampling. The full annealing schedule, as well as the data discarded due to unconvergence, can be seen in Figure 5.

Figure 5: Cumulative mean ARI (a) and Log Probability (b) trajectories for TinyStories-8M models during Annealed TPS. Note that the cumulative means reset at each change of \(\lambda\).

We use the accepted samples from Figure 5 to create histograms over the space of the two observables. This can be done by calculating Equation 6 using the sets \(\UnwantedSet = [b_l, b_{l+1}]\) for \(b_1, < \dots, < b_L.\) The results for this can be seen in Figure 6.

Figure 6: Histograms for the ARI (b) and Log Probabilities (c) observables (right) computed using MBAR and Direct Sampling. These are compared against the training data and unweighted biased samples obtained via TPS in plots (a) and (c).

As can be seen from the histograms, we are able to produce probability estimates for regions in which no samples are observed using direct sampling, including regions with probabilities as low as \(10^{-14},\) a \(10^8\) improvement over what is possible when using direct sampling with the same number of completions.

For further analysis of the histograms, including rigorous error analysis, please see section 5 of the paper.

Conclusion and Discussion

We have developed a framework for sampling and estimating the likelihood of rare events in LLMs. We have demonstrated this framework by studying the rare events of the TinyStories-8M model with respect to two observables, the ARI and Log Probability of completions.

What we have discussed here is a brief overview. In the paper, we provide many further details. This includes rigorous error analysis for both the MBAR and Direct Sampling histograms, and showing how access to these rare completions can allow one to probe the model behaviour in out of distribution regimes. This can then be used to produce further filters for these unwanted samples.

There are a number of algorithmic improvements that could be explored to further improve this framework. Some include:

  • More sophisticated MCMC proposal distributions, such as infilling,
  • Parallel tempering (replica exchange),
  • Targeting the exponentially reweighted dynamics via RL finetuning.

If you have any comments or questions regarding this work, please direct these to jake.dorman@nottingham.ac.uk, and see Further Resources for more details on this work.