Langevin sampling

Introduction

One of the cornerstones of score-based generative models is the method of sampling from the score function of a distribution via Langevin dynamics. Our aim here is to review the method of sampling via Langevin dynamics based on the score function.

Langevin dynamics

The velocity of a particle moving in a fluid has long been known to be reduced by friction forces with the surrounding fluid particles. For relatively slowly moving particles, when the surrouding fluid flow is essentially laminar, this friction force is regarded to be proportional to the velocity, in what is known as the Stokes law. When the motion is relatively fast and the flow around the particle is turbulent, this friction tends to be proportional to the square of the velocity.

If this were the only force, though, a particle initially at rest on a still fluid would remain forever at rest. That is not the case, as observed by the botanist Robert Brown (1828). His observations led to what is now known as Brownian motion, and which is formally modeled as a Wiener process (see e.g. Einstein (1905) and Mörters and Peres (2010)). A Wiener process describes the (random) position of a particle which is initially at rest and is put into motion by the erratic collisions with the nearby fluid particles.

When in motion, both forces are actually in effect, a deterministic one dependent on the velocity and a random one due to irregular collisions. In a short time scale, the inertia forces are negligible and we recover the Brownian motion. For larger times scales, the Langevin is more appropriate since it takes both forces into account.

Langevin equation

We start with the Langevin equation for a free particle and then introduce the equation with a potential field.

Langevin equation for a free particle

In the Langevin model, both viscous and random collision forces affect the momentum of the particle (Paul Langevin (1908)). In this model, the position $x_t$ of a particle of mass $m$ at time $t$ is given by

\[ m \ddot x_t = - a \mu \dot x_t + \alpha \xi_t,\]

where $a$ is a caractheristic length of the particle; $\mu$ is the molecular viscosity, associated with the frictional drag force assumed proportional to the velocity; and $\alpha$ is a proportionality coefficient associated with a white noise term $\xi_t$ modeling the random collisions with the fluid particles, i.e. $\xi_t$ is a Gaussian (distribution) process with zero mean, $\mathbb{E}[\xi_t] = 0,$ and delta correlated, $\mathbb{E}[\xi_t\xi_s] = \delta_0(t - s).$ The two coefficients are connected by

\[ \alpha = \sqrt{2a\mu k_B T},\]

where $k_B$ is the Boltzmann constant and $T$ is the temperature of the fluid.

The white noise is highly irregular, so the equation above is made rigorous with the theory of stochastic differential equations when casted in the form

\[ \mathrm{d}Y_t = -\nu Y_t \;\mathrm{d}t + \sigma \;\mathrm{d}W_t,\]

where $\{Y_t\}_t$ is a stochastic processes representing the evolution of the velocity in time; $\nu = a\mu / m$ is a kinematic damping factor (with dimension $1/\texttt{time}$); $\sigma=\alpha/m$ is called the diffusion parameter; and $\{W_t\}_t$ is a Wiener process, whose formal derivative represents the white noise. The solution $\{Y_t\}_t$ of the equation above is known as the Ornstein-Uhlenbeck stochastic process. The relation between $\sigma$ and $\nu$ becomes

\[ \sigma = \sqrt{\frac{2\nu k_B T}{m}}.\]

Langevin equation with an energy potential

This is all fine for a nearly free particle, affected only by friction and by smaller nearby particles. More generally, one may also consider a particle under an extra force field with potential $U=U(x)$. In this case, the equation is modified to

\[ m \ddot x_t = - a\mu \dot x_t - m\nabla U(x_t) + \alpha \xi_t,\]

Rigorous stochastic formulation

The rigorous stochastic formulation takes the form of a system,

\[ \begin{cases} \mathrm{d}X_t = Y_t\;\mathrm{d}t, \\ \mathrm{d}Y_t = (-\nu Y_t - \nabla U(X_t))\;\mathrm{d}t + \sigma \;\mathrm{d}W_t. \end{cases}\]

These are called the Langevin equation or Langevin system.

Physical dimensions

We want to check the physical dimensions of the terms in the Langevin equation

\[ m \ddot x_t = - a \mu \dot x_t - m\nabla U(x_t) + \alpha \xi_t,\]

where

\[ \alpha = \sqrt{2a\mu k_B T}.\]

Basic physical dimensions

We denote the physical dimensions by

\[ \quad M = \texttt{ mass,} \quad L = \texttt{ length,} \quad T = \texttt{ time,} \quad \Theta = \texttt{ temperature.} \]

Physical dimension of variables and parameters

The variable $x_t$ denotes position at time $t,$ hence its physical dimension, $[x_t],$ is length,

\[ [x_t] = L.\]

The time derivative has dimension of one over time, i.e.

\[ \left[\frac{\mathrm{d}}{\mathrm{d} t}\right] = \frac{1}{T}.\]

The Dirac delta $\delta_0(\cdot)$ in the definition of white noise is the (time) derivative of the Heaviside function $H(t) = \chi_{[0, \infty)}(t).$ The Heaviside is adimensional, so

\[ [\delta_0] = \frac{1}{T}.\]

Since the dimension of the correlation is

\[ \left[\mathbb{E}[\xi_t\xi_s]\right] = [\xi_t]^2\]

and

\[ [\xi_t]^2 = \left[\mathbb{E}[\xi_t\xi_s]\right] = \left[\delta_0\right] = \frac{1}{T},\]

we find that

\[ [\xi_t] = \frac{1}{\sqrt{T}}.\]

For consistency reasons, the dimension of $\mu$ has to be

\[ [\mu] = \frac{M}{LT}.\]

The potential field is the potential energy over the mass of the particle under the potential field, thus

\[ [U(x)] = \frac{L^2}{T^2}.\]

The dimension of the Boltzmann constant is that of energy over temperature, i.e.

\[ [k_B] = \frac{ML^2}{T^2\Theta}.\]

With that, the dimension of $\alpha$ becomes

\[ [\alpha] = \sqrt{[a][\mu] [k_B] [T]} = \sqrt{L \frac{M}{LT} \frac{ML^2}{T^2\Theta} \Theta} = \sqrt{\frac{M^2L^2}{T^3}} = \frac{ML}{T^{3/2}}.\]

Physical dimensions of the Langevin equation

Now we check that all the terms in the Langevin equation have the same physical dimension of force, since it is an expression of Newton's second law of motion. Indeed, the first term has dimension

\[ [m\ddot x_t] = [m] \left[\frac{\mathrm{d}^2}{\mathrm{d}t^2}\right] [x_t] = M \frac{1}{T^2} L = \frac{ML}{T^2}.\]

The second term has dimension

\[ [a\mu \dot x_t] = [a] [\mu] \left[\frac{\mathrm{d}}{\mathrm{d}t}\right] [x_t] = L \frac{M}{LT} \frac{1}{T} L = \frac{ML}{T^2}.\]

The potential term has dimension

\[ [m\nabla U(x_t)] = M \frac{1}{L} \frac{L^2}{T^2} = \frac{ML}{T^2}.\]

The last term has dimension

\[ [\alpha \xi_t] = [\alpha] [\xi_t] = \frac{ML}{T^{3/2}}\frac{1}{T^{1/2}} = \frac{ML}{T^2}.\]

Physical dimension of the stochastic equation for the random velocity

Looking now at the equation

\[ \mathrm{d}Y_t = -\nu Y_t \;\mathrm{d}t + \sigma \;\mathrm{d}W_t,\]

we first observe that

\[ [\nu] = \left[\frac{a\mu}{m}\right] = L \frac{M}{LT} \frac{1}{M} = \frac{L}{T}\]

and

\[ [\sigma] = \left[\frac{\alpha}{m}\right] = \frac{ML}{T^{3/2}} \frac{1}{M} = \frac{L}{T^{3/2}}.\]

Since the $W_t \sim \mathcal{N}(0, t) = \sqrt{t}\mathcal{N}(0, 1),$ we see that

\[ [\mathrm{d}W_t] = [W_t] = \sqrt{T}.\]

This is consistent with the idea that the white noise $\{\xi_t\}_t$ is formally the time derivative of the Wiener process $\{W_t\}_t.$

Now, we see that

\[ [\mathrm{d}Y_t] = [Y_t] = \frac{L}{T},\]

while

\[ [\nu Y_t \;\mathrm{d}t] = \frac{1}{T} \frac{L}{T} T = \frac{L}{T},\]

and

\[ [\sigma \;\mathrm{d}W_t] = \frac{L}{T^{3/2}} T^{1/2} = \frac{L}{T},\]

establishing the consistency of the equation in terms of physical dimension.

The overdamped limit

In the Langevin equation

\[ m \ddot x_t = - a \mu \dot x_t - m\nabla U(x_t) + \alpha \xi_t, \qquad \alpha = \sqrt{2a\mu k_B T},\]

the term $m \ddot x_t$ represents the inertial force. When the motion is relatively slow, this inertia might be negligible when compared with the drag force. Dropping this term yields the equation

\[ 0 = - \nu \dot x_t - \nabla U(x_t) + \alpha \xi_t,\]

which we can write in the usual form of the overdamped Langevin equation

\[ \mu \dot x_t = - \nabla U(x_t) + \alpha \xi_t.\]

The corresponding stochastic system reduces to a single equation

\[ \nu \mathrm{d}X_t = - \nabla U(X_t)\;\mathrm{d}t + \sigma \;\mathrm{d}W_t.\]

For this approximation to be valid, we look at the ratio between the inertial force $m\ddot x_t$ and the damping force $-a\mu \dot x_t.$ We have

\[ \frac{m\ddot x_t}{a\mu \dot x_t} \sim \frac{1}{\nu}\frac{\Delta Y_t}{Y_t\Delta t}.\]

The inertial force is negligible when

\[ \frac{\Delta Y_t}{Y_t} \ll \nu \Delta t,\]

which can be interpreted as when the relative variations $\Delta Y_t/Y_t$ in velocity are relatively small or when the damping is relatively large. This is consistent with saying that we obtain the first order equation in the overdamped regime.

In the absence of a force field $U=U(x)$, we are left with

\[ \nu \mathrm{d}X_t = \sigma \;\mathrm{d}W_t.\]

which is the Brownian motion equation, with the solution

\[ X_t = \frac{\sigma}{\nu} = \sqrt{\frac{2k_B T}{a\mu}} W_t.\]

In this way, we recover the Brownian motion from the Langevin equation as the overdamped limit without a force field.

Notice the right hand side above has indeed the dimension of length,

\[ \left[\sqrt{\frac{2k_B T}{a\mu}} W_t \right] = \sqrt{\frac{ML^2}{\Theta T^2}\Theta \frac{1}{L}\frac{LT}{M}} \sqrt{T} = \sqrt{\frac{L^2}{T}} \sqrt{T} = L.\]

Adimensionalization

With a proper rescaling, we obtain the adimensional overdamped Langevin equation

\[ \mathrm{d}\tilde X_t = - \nabla_{\tilde x}{\tilde U}({\tilde X}_{\tilde t})\;\mathrm{d}\tilde t + \sqrt{2}\;\mathrm{d}{\tilde W}_{\tilde t}.\]

We do the details of this change of variables below. See also Section 6.5 of Grigorios Pavliotis (2014).

We start with the equation

\[ \nu \mathrm{d}X_t = - \nabla U(X_t)\;\mathrm{d}t + \sigma \;\mathrm{d}W_t,\]

and divide it by $a\nu,$ so that

\[ \frac{1}{a}\mathrm{d}X_t = - \frac{1}{a\nu}\nabla U(X_t)\;\mathrm{d}t + \frac{\sigma}{a\nu} \;\mathrm{d}W_t.\]

Since $\sigma = \sqrt{2\nu k_B T / m}$ and $\nu = a\mu / m,$ we write

\[ \frac{\sigma}{a\nu} = \frac{\sqrt{2\nu k_B T / m}}{a\nu} = \sqrt{\frac{2\nu k_B T}{a^2 m \nu^2}} = \sqrt{\frac{2 k_B T}{a^2 m \nu}} = \sqrt{\frac{2k_B T}{a^3\mu}}.\]

Thus,

\[ \frac{1}{a}\mathrm{d}X_t = - \frac{1}{a\nu}\nabla U(X_t)\;\mathrm{d}t + \sqrt{\frac{2k_B T}{a^3\mu}}\;\mathrm{d}W_t.\]

The time and length scales are changed to

\[ \tilde t = \nu t, \quad \tilde x = \frac{x}{a}.\]

Accordingly, we set

\[ \tilde X_{\tilde t} = \frac{1}{a} \tilde X_{\tilde t / \nu} = \frac{1}{a} X_t.\]

Thus,

\[ \frac{1}{a} \mathrm{d}X_t = \mathrm{d}{\tilde X}_{\tilde t}.\]

We also define the adimensional potential

\[ \tilde U(\tilde x) = \frac{1}{a^2\nu^2}U(a\tilde x) = \frac{1}{a^2\nu^2} U(x).\]

Thus,

\[ - \frac{1}{a\nu}\nabla_x U(X_t)\;\mathrm{d}t = - \frac{1}{a\nu} \nabla_x \left(a^2\nu^2 {\tilde U}\left(\frac{X_t}{a}\right) \right)\;\frac{\mathrm{d}\tilde t}{\nu} = - a\nabla_x \left( {\tilde U}\left(\frac{X_t}{a}\right)\right)\;\mathrm{d}\tilde t = - a\nabla_{\tilde x} {\tilde U}({\tilde X}_{\tilde t})\frac{1}{a}\;\mathrm{d}\tilde t,\]

so that

\[ - \frac{1}{a\nu}\nabla_x U(X_t)\;\mathrm{d}t = - \nabla_{\tilde x} {\tilde U}({\tilde X}_{\tilde t})\;\mathrm{d}\tilde t.\]

Finally we set

\[ {\tilde W}_{\tilde t} = \sqrt{\frac{k_B T}{a^3\mu}} W_{\tilde t / \nu} = \sqrt{\frac{k_B T}{a^3\mu}} W_t,\]

so that

\[ \sqrt{\frac{2k_B T}{a^3\mu}} \;\mathrm{d}W_t = \sqrt{2}\mathrm{d}{\tilde W}_{\tilde t}. \]

Notice that

\[ \left[ \sqrt{\frac{2k_B T}{a^3\mu}} \right] = \sqrt{\frac{ML^2}{\Theta T^2} \Theta\frac{LT}{L^3 M}} = \sqrt{\frac{1}{T}}, \qquad \left[\mathrm{d}W_t\right] = \sqrt{T},\]

so ${\tilde W}_{\tilde t}$ is indeed adimensional.

Combining the terms, we obtain the adimensional equation

\[ \mathrm{d}{\tilde X}_{\tilde t} = - \nabla_{\tilde x} {\tilde U}({\tilde X}_{\tilde t})\;\mathrm{d}\tilde t + \sqrt{2}\;\mathrm{d}{\tilde W}_{\tilde t}.\]

The factor $\sqrt{2}$ could have been included in the definition of ${\tilde W}_{\tilde t},$ but it was left out because it will compensate the one-half term coming from the Itô formula and both eventually cancel out when computing the invariant distribution.

The limit as $\nu \rightarrow \infty.$

If one does the rescaling above in the full equation

\[ m \ddot x_t = - a\mu \dot x_t - m\nabla U(x_t) + \alpha \xi_t,\]

one obtains the overdamped limit exactly when taking the limit $\nu \rightarrow \infty.$ We leave this as an exercise.

The asymptotic distribution

In the inviscid deterministic case, the Langevin equation reads

\[ m \ddot x_t = - m \nabla U(x_t),\]

and the level sets

\[ \left\{(x, v); \frac{1}{2}v^2 + U(x) = c\right\}\]

of the total energy are invariant by the solution group. In the viscous deterministic case

\[ m \ddot x_t = - a\mu \dot x_t - m\nabla U(x_t),\]

the solutions tend to the equilibria states $\min_x U(x)$, or more precisely to the variety

\[ \{(x, v); v = \nabla U(x) = 0\}.\]

In the full viscous, stochastic equation, the tendency to go to the equilibria is balanced by the diffusion, and the system tends to a stochastic equilibrium represented by a distribution with probability density function

\[ p_U(x, v) = \frac{1}{Z} e^{-\frac{m}{k_B T}\left( \frac{v^2}{2} + U(x)\right)},\]

where $Z$ is a normalization constant to have $p_U(x, v)$ integrate to $1$.

A similar behavior occurs in the overdamped Langevin equation, for which the equilibrium distribution is given by

\[ p_U(x) = \frac{1}{Z_0} e^{-\frac{m}{k_B T} U(x)},\]

where $Z_0$ is another normalization constant.

We now discuss in more details the equilibrium distribution in the overdamped equation. For that, we look at the Fokker-Plank equation for

\[ \nu\mathrm{d} X_t = - \nabla U(X_t)\;\mathrm{d}t + \sqrt{\frac{2\nu k_B T}{m}}\;\mathrm{d}W_{t},\]

In general, a stochastic differential equation of the form

\[ \mathrm{d} X_t = \mu(t, X_t) \;\mathrm{d}t + \sigma(t, X_t)\;\mathrm{d}W_{t}\]

is associated with the Fokker-Planck equation

\[ \frac{\partial}{\partial t} p(t, x) = - \nabla_x \cdot \left(\mu(t, x) p(t, x)\right) + \frac{1}{2} \Delta_x \left( \sigma(t, x)^2 p(t, x) \right),\]

where $p(t, x)$ is such that $x \mapsto p(t, x)$ is the probability density function of the solution $X_t$, i.e. the marginal distribution at time $t$ of the process $\{X_t\}_t$. For the overdamped Langevin equation above, the Fokker-Planck equation takes the form

\[ \frac{\partial}{\partial t} p(t, x) = \frac{1}{\nu}\nabla_x \cdot \left(\nabla U(x) p(t, x)\right) + \frac{k_B T}{\nu m}\Delta_x p(t, x),\]

which can also be written as

\[ \frac{\partial}{\partial t} p(t, x) = \frac{1}{\nu}\nabla_x \cdot \left(\nabla U(x) p(t, x) + \frac{k_B T}{m}\nabla_x p(t, x) \right),\]

At the equilibrium, the probability density function $p(t, x) = p_\infty(x)$ is independent of time and satisfies

\[ \nabla_x \cdot \left(\nabla U(x) p_\infty(x) + \frac{k_B T}{m}\nabla_x p_\infty(x) \right) = 0.\]

In the one-dimensional case, this means that

\[ \frac{\partial U(x)}{\partial x} p_\infty(x) + \frac{k_B T}{m}\frac{\partial p_\infty(x)}{\partial t} = C,\]

for some constant $C$. This can be written in the form of the first order differential equation

\[ \frac{\partial p_\infty(x)}{\partial t} + \frac{m}{k_B T}\frac{\partial U(x)}{\partial x} p_\infty(x) = \frac{m}{k_B T} C.\]

This equation can be solved using the integrating factor

\[ e^{\frac{m}{k_B T} U(x)}.\]

But assuming that the terms on the left hand side of the first order differential equation vanish when $|x| \rightarrow \infty$, the constant $C$ must be zero. In this case, the solution simplifies to

\[ p_\infty(x) = C_0 e^{-\frac{m}{k_B T} U(x)},\]

for some other constant $C_0$, which we may write as $C_0 = 1/Z_0$ and obtain the expression

\[ p_\infty(x) = \frac{1}{Z_0} e^{-\frac{m}{k_B T} U(x)}.\]

The constant $Z_0$ is a normalization factor to yield that $p(x)$ is a proper probability density function, with total mass $1$, i.e.

\[ Z_0 = \int_{-\infty}^\infty e^{-\frac{m}{k_B T} U(x)}\;\mathrm{d}x.\]

One can check that the PDF above is also a solution in the multi-dimensional case.

When the potential $U=U(x)$ grows sufficiently rapidly as $|x|\rightarrow \infty$, this distribution is the unique equilibrium. In the case of thermodynamics, this corresponds to thermodynamic equilibrium and this is known as the Boltzmann distribution. The condition that the potential grows sufficiently rapidly at infinite means that the potential well is deep enough to confine the particle.

Abstracting away from the physical model and considering the overdamped Langevin equation in the form

\[ \mathrm{d} X_t = - \nabla U(X_t)\;\mathrm{d}t + \sqrt{2\gamma}\;\mathrm{d}W_{t},\]

for some constant $\gamma > 0$, then the Fokker-Planck equation reads

\[ \frac{\partial}{\partial t} p(t, x) = \nabla_x \cdot \left(\nabla U(x) p(t, x)\right) + \gamma\Delta_x p(t, x),\]

and the stationary distribution takes the form

\[ p_\infty(x) = \frac{1}{Z_0} e^{-\frac{U(x)}{\gamma}}.\]

Convergence to the asymptotic distribution

There are many rigorous results concerning the convergence to the equilibrium distribution, discussing conditions for the convergence, metrics, and rate of convergence. We will discuss them in more details in due course.

Sampling from the score function via Langevin dynamics

Now, suppose we take for the potential $U=U(x)$ minus a multiple of the logpdf of a random variable $X$ with probability density function $p_X(x)$ up to an arbitrary constant, i.e.

\[ U(x) = - \gamma \log p_X(x) + C,\]

for some constants $\gamma$ and $C$. The score function is connected to the gradient of the potential by

\[ \psi_X(x) = \nabla_x \log p_X(x) = - \frac{1}{\gamma} \nabla_x U(x).\]

In this case, we can write the PDF as

\[ p_X(x) = e^{\log p_X(x)} = e^{-U(x)/\gamma + C/\gamma} = \frac{1}{Z} e^{-U(x)/\gamma},\]

for a (normalizing) constant $Z = e^{-C/\gamma}$. Then we see that the PDF $p_X(x)$ is exactly the equilibrium of the Langevin equation

\[ \mathrm{d}X_t = -\nabla U(X_t)\;\mathrm{d}t + \sqrt{2\gamma}\mathrm{d}W_t.\]

which can be written as

\[ \mathrm{d}X_t = \gamma\psi_X(X_t)\;\mathrm{d}t + \sqrt{2\gamma}\mathrm{d}W_t.\]

This lead to a sampling method to draw samples from a distribution using its score function, as introduced by Gareth Roberts and Richard Tweedie (1996).

As mentioned above, questions about the conditions for the convergence, rate of converge and convergence metric are of great importance, and they are also important for sampling purposes. Other relevant question concern the stability of the equilibrium solution, when for instance an approximate score function is used. This is also relevant when the modeled score function (say via a neural network) might even not be exactly the gradient of a potential. We will leave these questions to another opportunity.

Numerical example

One-dimensional numerical example

We first illustrate the Langevin sampling by drawing samples for a univariate Gaussian mixture distribution.

Now we draw samples from it using the overdamped Langevin equation with $\gamma = 0.5$. We start with samples from a normal distribution $\mathcal{N}(5, 1^2)$ and evolve them according to the equation, from the initial time $t_0 = 0$ up to time $t_f = 10$.

Here is $tx$ plot of the ensemble of solutions of the stochastic overdamped Langevin equation.

We obtain the following histogram at the final time:

We may also visualize a surface plot and a heatmap of the evolution of the Fokker-Planck equation for the density.

With an animation just for fun.

Two-dimensional numerical example

Let's do a two-dimensional example, now. We consider a trimodal bivariate normal distribution and use the two-dimensional overdamped Langevin equations to obtain samples from the score function of the distribution.

Here is a visualization of the distribution and its score vector field.

Now we draw samples from it using the overdamped Langevin equation with $\gamma = 0.5$, and starting with samples from a standard normal distribution, evolving the particles from the initial time $t_0 = 0$ up to time $t_f = 10$.

Let us see the location of the generated sample points and their histogram.

scatter!(xt[1, :, end], xt[2, :, end], markersize=2, markercolor=:white, label=false) # hid2

Observe how many more particles are trapped near the modal point in the middle. In the target distribution, this is the smallest nodal point, but it seems the highest. Given enough time, though, the particles do tend to the proper distribution.

Let us see an animation for fun.

Now we draw samples starting from a uniform distribution of points on the square $[-6, 6]^2$, which encompasses all three modal regions.

Here is what we get.

scatter!(xt[1, :, end], xt[2, :, end], markersize=2, markercolor=:white, label=false) # hid2

Notice this is closer to the desired distribution. With the initial points spread out over the modal regions, the particles have less trouble moving to the asymptotic distribution.

Again, let us see an animation.