Denoising score matching of Pascal Vincent

Introduction

Aim

Explore the denoising score matching method proposed by Pascal Vincent (2011) and illustrate it by fiting a multi-layer perceptron to model the score function of a one-dimensional synthetic Gaussian-mixture distribution.

Motivation

The motivation is to continue building a solid background on score-matching diffusion.

Background

Aapo Hyvärinen (2005) proposed fitting directly the score of a model distribution. This is obtained, in theory, by minimizing an explicit score matching objective function. However, this function requires knowing the supposedly unknown target score function. The trick used by Aapo Hyvärinen (2005) was then to do an integration by parts and rewrite the optimization problem in terms of an implicit score matching objective function, which yields the same minima and does not require further information from the target distribution other than some sample points.

The implicit score matching method requires, however, the derivative of the score function of the model distribution, which is costly to compute in general.

Then, Vincent (2011) explored the idea of using non-parametric Parzen density estimation to directly approximate the explicit score matching objective, making a connection with denoising autoenconders (proposed earlir by Pascal himself, as a co-author in Vincent, Larochelle, Lajoie, Bengio, and Manzagol(2010)), and proposing the denoising score matching method.

Objetive function for denoising score matching

The original explicit and implicit score matching

The score-matching method from Aapo Hyvärinen (2005) aims to fit the score function $\psi(\mathbf{x}; {\boldsymbol{\theta}})$ of the model distribution to the score function $\psi_X(\mathbf{x})$ of a random variable $\mathbf{X}$ by minimizing the implicit score matching objective

\[ J_{\mathrm{ISM}}({\boldsymbol{\theta}}) = \int_{\mathbb{R}} p_{\mathbf{X}}(\mathbf{x}) \left( \frac{1}{2}\left\|\boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}})\right\|^2 + \boldsymbol{\nabla}_{\mathbf{x}} \cdot \boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}}) \right)\;\mathrm{d}\mathbf{x},\]

which is equivalent to minimizing the explicit score matching objective

\[ J_{\mathrm{ESM}}({\boldsymbol{\theta}}) = \frac{1}{2}\int_{\mathbb{R}^d} p_{\mathbf{X}}(\mathbf{x}) \left\|\boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}}) - \boldsymbol{\psi}_{\mathbf{X}}(\mathbf{x})\right\|^2\;\mathrm{d}\mathbf{x},\]

due to the following identity obtained via integration by parts in the expectation

\[ J_{\mathrm{ESM}}({\boldsymbol{\theta}}) = {\tilde J}_{\mathrm{ISM}}({\boldsymbol{\theta}}) + C_\sigma,\]

where $C_\sigma$ is constant with respect to the parameters $\boldsymbol{\theta}$. The advantage of ${\tilde J}_{\mathrm{ISM}}({\boldsymbol{\theta}})$ is that it does not involve the unknown score function of $X$. It does, however, involve the gradient of the modeled score function, which is expensive to compute.

In practice, this is further approximated by the empirical distribution ${\tilde p}_0$ given by

\[ {\tilde p}_0(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^N \delta(\mathbf{x} - \mathbf{x}_n),\]

so the implemented implicit score matching objective is

\[ {\tilde J}_{\mathrm{ISM{\tilde p}_0}}({\boldsymbol{\theta}}) = \frac{1}{N}\sum_{n=1}^N \left( \frac{1}{2}\left\|\boldsymbol{\psi}(\mathbf{x}_n; {\boldsymbol{\theta}})\right\|^2 + \boldsymbol{\nabla}_{\mathbf{x}} \cdot \boldsymbol{\psi}(\mathbf{x}_n; {\boldsymbol{\theta}}) \right).\]

Using Parzen estimation

Aapo Hyvärinen (2005) briefly mentions that minimizing $J_{\mathrm{ESM}}({\boldsymbol{\theta}})$ directly is "basically a non-parametric estimation problem", but dismisses it for the "simple trick of partial integration to compute the objective function very easily". As we have seen, the trick is fine for model functions for which we can compute the gradient without much trouble, but for modeling it with a neural network, for instance, it becomes computationally expensive.

A few years later, Vincent (2011) considered the idea of using a Parzel kernel density estimation

\[ {\tilde p}_\sigma(\mathbf{x}) = \frac{1}{\sigma^d}\int_{\mathbb{R}^d} K\left(\frac{\mathbf{x} - \tilde{\mathbf{x}}}{\sigma}\right) \;\mathrm{d}{\tilde p}_0(\tilde{\mathbf{x}}) = \frac{1}{\sigma^d N}\sum_{n=1}^N K\left(\frac{\mathbf{x} - \mathbf{x}_n}{\sigma}\right),\]

where $\sigma > 0$ is a kernel window parameter and $K(\mathbf{x})$ is a kernel density properly normalized to have mass one. In this way, the explicit score matching objective function is approximated by

\[ {\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\mathbf{x}) \left\|\boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\mathbf{x}}\log {\tilde p}_\sigma(\mathbf{x})\right\|^2\;\mathrm{d}\mathbf{x}.\]

Denoising autoencoder

However, Pascal Vincent (2011) did not use this as a final objective function. Pascal further simplified the objective function ${\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}})$ by expanding the gradient of the logpdf of the Parzen estimator, writing a double integral with a conditional probability, simplifying the computation of the gradient of the logarithm of the Parzen estimation, which involves the log of a sum, to the gradient of the logarithm of the conditional probability, which involves the log of a single kernel.

In this way, Pascal arrived at the (Parzen) denoising score matching objective function

\[ {\tilde J}_{\mathrm{DSM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) = \frac{1}{2}\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x})\left\| \boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) \right\|^2 \mathrm{d}\mathbf{x}\,\mathrm{d}\tilde{\mathbf{x}}\]

where ${\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x})$ is the conditional density

\[ {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = \frac{1}{\sigma^d}K\left(\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma}\right).\]

Notice that the empirical distribution is not a further approximation to this objective function. It comes directly from the Parzen estimator. So, we can write

\[ {\tilde J}_{\mathrm{DSM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) = \frac{1}{2}\frac{1}{N}\sum_{n=1}^N \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}_n) \left\| \boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}_n) \right\|^2 \mathrm{d}\tilde{\mathbf{x}}.\]

We do need, however, for the sake of implementation, to approximate the (inner) expectation with respect to the conditional distribution. This is achieved by drawing a certain number of sample points from the conditional distribution associated with ${\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}_n)$.

Denoising autoencoder with the standard Gaussian kernel

At this point, choosing the kernel of the Parzen estimation to be the standard Gaussian kernel

\[ G(\mathbf{x}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} \mathbf{x}^2},\]

the conditional distribution is a Normal distribution with mean $\mathbf{x}_n$ and variance $\sigma^2$. Hence, for each $n=1, \ldots, N$, we draw $M$ sample points $\tilde{\mathbf{x}}_{n,m}$, $m=1, \ldots, M$, according to

\[ \tilde{\mathbf{x}}_{n,m} \sim \mathcal{N}(\mathbf{x}_n, \sigma^2), \quad m=1, \ldots, M.\]

Then, using the associated empirical distribution

\[ {\tilde p}_{\sigma, 0}(\tilde{\mathbf{x}}|\mathbf{x}_n) = \frac{1}{M}\sum_{m=1}^M \delta(\tilde{\mathbf{x}} - \tilde{\mathbf{x}}_{n,m}),\]

we arrive at the empirical denoising score matching objective

\[ {\tilde J}_{\mathrm{DSM{\tilde p}_{\sigma, 0}}}({\boldsymbol{\theta}}) = \frac{1}{2}\frac{1}{NM}\sum_{n=1}^N \sum_{m=1}^M \left\| \boldsymbol{\psi}(\mathbf{x}_{n, m}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}_{n,m}|\mathbf{x}_n) \right\|^2 \mathrm{d}\tilde{\mathbf{x}}.\]

With the Gaussian kernel, we have

\[ \begin{align*} \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}_{n,m}|\mathbf{x}_n) & = \boldsymbol{\nabla}_{\tilde{\mathbf{x}}} \log\left( \frac{1}{\sqrt{2\pi}\sigma^d} e^{-\frac{1}{2} \left(\frac{\tilde{\mathbf{x}} - \mathbf{x}_n}{\sigma}\right)^2} \right) \!\!\Bigg|_{\;\tilde{\mathbf{x}}=\tilde{\mathbf{x}}_{n, m}} \\ & = \boldsymbol{\nabla}_{\tilde{\mathbf{x}}} \left( -\frac{1}{2} \left(\frac{\tilde{\mathbf{x}} - \mathbf{x}_n}{\sigma}\right)^2 - \log(\sqrt{2\pi}\sigma^d) \right)\!\!\Bigg|_{\;\tilde{\mathbf{x}}=\tilde{\mathbf{x}}_{n, m}} \\ & = \left( - \frac{\tilde{\mathbf{x}} - \mathbf{x}_n}{\sigma^2} \right)\!\!\Bigg|_{\;\tilde{\mathbf{x}}=\tilde{\mathbf{x}}_{n, m}} \\ & = - \frac{\tilde{\mathbf{x}}_{n, m} - \mathbf{x}_n}{\sigma^2} \\ & = \frac{\mathbf{x}_n - \tilde{\mathbf{x}}_{n, m}}{\sigma^2} \end{align*}\]

Thus, in this case, the empirical denoising score matching objective reads

\[ {\tilde J}_{\mathrm{DSM{\tilde p}_{\sigma, 0}}}({\boldsymbol{\theta}}) = \frac{1}{2}\frac{1}{NM}\sum_{n=1}^N \sum_{m=1}^M \left\| \boldsymbol{\psi}(\mathbf{x}_{n, m}; {\boldsymbol{\theta}}) - \frac{\mathbf{x}_n - \tilde{\mathbf{x}}_{n, m}}{\sigma^2} \right\|^2 \mathrm{d}\tilde{\mathbf{x}}.\]

Often, with $N$ sufficiently large, it suffices to take $M=1$, i.e. a single "corrupted" sample $\tilde{\mathbf{x}}_n$, for each "clean" sample point $\mathbf{x}_n$. In this way, the double summation becomes a single summation and the computation is just as fast as the one with the score of the unconditional Parzen estimation, with the benefit similar to the denosing autoencoders.

Energy based modeling

The model distribution is chosen by Vincent (2011) in the form

\[ p_{\boldsymbol{\theta}}(\mathbf{x}) = \frac{1}{Z(\boldsymbol{\theta})} e^{-U(\mathbf{x}; \boldsymbol{\theta})},\]

for an energy potential $U(\mathbf{x}; \boldsymbol{\theta})$ of the form

\[ U(\mathbf{x}; \boldsymbol{\theta}) = - \frac{1}{\sigma^2}\left( \mathbf{c}\cdot \mathbf{x} - \frac{1}{2}\|\mathbf{x}\|^2 + \sum_{j=1}^d \operatorname{softplus}\left(\mathbf{W}_j\cdot \mathbf{x} + b_j\right)\right),\]

where

\[ \boldsymbol{\theta} = (\mathbf{W}, \mathbf{b}, \mathbf{c}), \quad \mathbf{W}\in \mathbb{R}^{d\times d}, \;\mathbf{b}, \mathbf{c}\in\mathbf{R}^d,\]

with $\mathbf{W}_j$ being the rows of the matrix $\mathbf{W}$, and $\operatorname{sigmoid}()$ is an activation function. For this model, the score function can be computed explicitly Vincent (2011), being

\[ \boldsymbol{\nabla}_{\mathbf{x}} p(\mathbf{x}; \boldsymbol{\theta}) = \frac{1}{\sigma^2}\left( \mathbf{W}^{\mathrm{tr}}\operatorname{sigmoid}\left(\mathbf{W}\mathbf{x} + \mathbf{b}\right) + \mathbf{c} - \mathbf{x}\right).\]

When substitute for in the denoising score matching objective, we obtain a loss function directly in terms of the parameters $\boldsymbol{\theta} = (\mathbf{W}, \mathbf{b}, \mathbf{c})$.

Connection with denoising autoencoder

This brings us to the connection, discussed by Vincent (2011), with denoising autoencoders, which were proposed a bit earlier by Vincent, Larochelle, Lajoie, Bengio, and Manzagol(2010).

In an autoencoder, as introduced by Kramer (1991), one has two models, an encoder model $\mathbf{y} = \mathbf{f}_{\boldsymbol{\xi}}(\mathbf{x})$ and a decoder $\mathbf{x} = \mathbf{g}_{\boldsymbol{\eta}}(\mathbf{y})$, and the objective is to be able to encode and decode the sample, and recover the original sample as close as possible, i.e.

\[ J(\boldsymbol{\xi}, \boldsymbol{\eta}) = \frac{1}{N} \sum_{n=1}^N \left\|\mathbf{x}_n - \mathbf{g}_{\boldsymbol{\eta}}(\mathbf{f}_{\boldsymbol{\xi}}(\mathbf{x}_n))\right\|.\]

Of course, this is useful when the latent space of the encoded information $\mathbf{y}$ is much smaller than the sample space, otherwise we just need to model the identity operator.

In a denoising autoencoder, proposed by Vincent, Larochelle, Lajoie, Bengio, and Manzagol(2010), one first "corrupts" the sample points according to some distribution law, say

\[ \tilde{\mathbf{x}}_n \sim \mathcal{P}(\tilde{\mathbf{x}}|\mathbf{x}_n),\]

and then use the corrupted sample to train the encoder/decoder pair with the objective function

\[ J_{\mathcal{P}}(\boldsymbol{\xi}, \boldsymbol{\eta}) = \frac{1}{N} \sum_{n=1}^N \left\|\mathbf{x}_n - \mathbf{g}_{\boldsymbol{\eta}}(\mathbf{f}_{\boldsymbol{\xi}}(\tilde{\mathbf{x}}_n)) \right\|.\]

The idea is that the encoder/decoder model learns to better "reconstruct" the information even from "imperfect" information. Think about the case of encoding/decoding a handwritten message, where the letters are not "perfect" according to any font style.

In Vincent (2011), by choosing the score model of the form

\[ \boldsymbol{\psi}(\mathbf{x}, \boldsymbol{\theta}) = \frac{1}{\sigma^2}\left( \mathbf{W}^{\mathrm{tr}} \operatorname{sigmoid}\left(\mathbf{W}\mathbf{x} + \mathbf{b}\right) + \mathbf{c} - \mathbf{x}\right),\]

where $\boldsymbol{\theta} = (\mathbf{W}, \mathbf{b}, \mathbf{c})$ are the parameters and $\operatorname{sigmoid}()$ is an activation function, and choosing the noise according to

\[ \mathcal{P}(\tilde{\mathbf{x}}|\mathbf{x}_n) = \mathcal{N}(\mathbf{x}_n, \sigma^2),\]

one obtains the connection

\[ {\tilde J}_{\mathrm{DSM{\tilde p}_{\sigma, 0}}}({\boldsymbol{\theta}}) = \frac{1}{2\sigma^4} {\tilde J}_{\mathcal{P}}(\boldsymbol{\theta}),\]

where ${\tilde J}_{\mathcal{P}}(\boldsymbol{\theta})$ is similar to the denoising autoencoder objective $J_{\mathcal{P}}(\boldsymbol{\xi}, \boldsymbol{\eta})$, and is defined by

\[ {\tilde J}_{\mathcal{P}}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^N \left\|\mathbf{x}_n - \mathbf{h}_{\boldsymbol{\theta}}(\tilde{\mathbf{x}}_n) \right\|,\]

where $\mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x})$ is almost of the form of an encoder/decoder, namely

\[ \mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x}) = \mathbf{g}_{\boldsymbol{\theta}}(\mathbf{f}_{\boldsymbol{\theta}}(\mathbf{x})) - \mathbf{x},\]

with

\[ {f}_{\boldsymbol{\theta}}(\mathbf{x}) = \mathrm{sigmoid}\left(\mathbf{W}\mathbf{x} + \mathbf{b}\right), \quad \mathbf{g}_{\boldsymbol{\theta}}(\mathbf{y}) = \mathbf{W}^{\mathrm{tr}}\mathbf{y} + \mathbf{c}.\]

Remember that here we are not trying to encode/decode the variate $\mathbf{x}$ itself, but its score function, so the above structure is compatible with that.

Vincent (2011) does not mention explicitly that what we denoted above by $\mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x})$ is not exactly of the form $\mathbf{g}_{\boldsymbol{\theta}}(\mathbf{f}_{\boldsymbol{\theta}}(\mathbf{x}))$ and actually seems to suggest they are of the same form, for the sake of the connection with a denoising autoenconder. But we see here that it is not. Let us not freak out about that, though. This is good enough to draw some connection between denoising score matching and denoising autoencoder and to have this as an inspiration.

Proof that ${\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) = {\tilde J}_{\mathrm{DSM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) + C_\sigma$

We start by renaming the dummy variable in the expression for ${\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}})$, writing

\[ {\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}) \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}})\right\|^2\;\mathrm{d}\tilde{\mathbf{x}}.\]

Then, we expand the integrand of ${\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}})$ and write

\[ \begin{align*} {\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) & = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}) \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}})\right\|^2\;\mathrm{d}\tilde{\mathbf{x}} \\ & = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}) \left( \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}})\right\|^2 - 2\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) \cdot \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}) + \left\|\boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}})\right\|^2\right)\mathrm{d}\tilde{\mathbf{x}}. \end{align*}\]

The last term is constant with respect to the trainable parameters $\boldsymbol{\theta}$, so we just write

\[ {\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}) \left( \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}})\right\|^2 - 2\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) \cdot \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}})\right)\mathrm{d}\tilde{\mathbf{x}} + C_{\sigma, 1},\]

where

\[ C_{\sigma, 1} = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}) \left\|\boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}})\right\|^2\mathrm{d}\tilde{\mathbf{x}}.\]

Now, notice we can write

\[ {\tilde p}_\sigma(\tilde{\mathbf{x}}) = \frac{1}{\sigma^d}\int_{\mathbb{R}^d} K\left(\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma}\right) \;\mathrm{d}{\tilde p}_0(\mathbf{x}) = \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) \;\mathrm{d}{\tilde p}_0(\mathbf{x}) = \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \;\mathrm{d}\mathbf{x},\]

where ${\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x})$ is the conditional density

\[ {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) = \frac{1}{\sigma^d}K\left(\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma}\right).\]

Thus, the first term in the objective function becomes

\[ \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}) \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}})\right\|^2\;\mathrm{d}\tilde{\mathbf{x}} = \frac{1}{2}\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}})\right\|^2\;\mathrm{d}\mathbf{x}\,\mathrm{d}\tilde{\mathbf{x}}.\]

It remains to treat the second term. For that, we use that

\[ \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}) = \frac{1}{{\tilde p}_\sigma(\tilde{\mathbf{x}})} \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}{\tilde p}_\sigma(\tilde{\mathbf{x}}).\]

Thus,

Now we write that

\[ \begin{align*} \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}{\tilde p}_\sigma(\tilde{\mathbf{x}}) & = \boldsymbol{\nabla}_{\tilde{\mathbf{x}}} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \;\mathrm{d}\mathbf{x} \\ & = \int_{\mathbb{R}^d} \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}{\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \;\mathrm{d}\mathbf{x} \\ & = \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) \boldsymbol{\nabla}_{\tilde{\mathbf{x}}} \log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \;\mathrm{d}\mathbf{x}. \end{align*}\]

Hence,

\[ \begin{align*} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}) \boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) \cdot \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}})\mathrm{d}\tilde{\mathbf{x}} & = \int_{\mathbb{R}^d} \boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) \cdot \left(\int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) \boldsymbol{\nabla}_{\tilde{\mathbf{x}}} \log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \;\mathrm{d}\mathbf{x}\right)\mathrm{d}\tilde{\mathbf{x}} \\ & = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) \cdot \boldsymbol{\nabla}_{\tilde{\mathbf{x}}} \log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) \;\mathrm{d}\mathbf{x}\,\mathrm{d}\tilde{\mathbf{x}} \end{align*}\]

Putting the terms together, we find that

\[ \begin{align*} {\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) & = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}) \left( \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}})\right\|^2 - 2\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) \cdot \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}})\right)\mathrm{d}\tilde{\mathbf{x}} + C_{\sigma, 1} \\ & = \frac{1}{2}\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \left( \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}})\right\|^2 - 2\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) \cdot \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x})\right)\;\mathrm{d}\mathbf{x}\,\mathrm{d}\tilde{\mathbf{x}} + C_{\sigma, 1} \end{align*}\]

Now we add and subtract the constant (with respect to the parameters $\boldsymbol{\theta}$)

\[ C_{\sigma, 2} = \frac{1}{2}\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \left\|\boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x})\right\|^2 \;\mathrm{d}\mathbf{x}\,\mathrm{d}\tilde{\mathbf{x}}.\]

With that, we finally obtain the desired relation

\[ \begin{align*} {\tilde J}_{\mathrm{ESM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) & = \frac{1}{2}\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x}) \Bigg( \left\|\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}})\right\|^2 \\ & \qquad\qquad - 2\boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) \cdot \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) + \left\|\boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x})\right\|^2\Bigg)\;\mathrm{d}\mathbf{x}\,\mathrm{d}\tilde{\mathbf{x}} + C_{\sigma, 1} - C_{\sigma, 2} \\ & = \frac{1}{2}\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) {\tilde p}_0(\mathbf{x})\left\| \boldsymbol{\psi}(\tilde{\mathbf{x}}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\tilde{\mathbf{x}}}\log {\tilde p}_\sigma(\tilde{\mathbf{x}}|\mathbf{x}) \right\|^2 \mathrm{d}\mathbf{x}\,\mathrm{d}\tilde{\mathbf{x}} + C_{\sigma, 1} - C_{\sigma, 2} \\ & = {\tilde J}_{\mathrm{DSM{\tilde p}_\sigma}}({\boldsymbol{\theta}}) + C_\sigma, \end{align*}\]

where

\[ C_\sigma = C_{\sigma, 1} - C_{\sigma, 2}.\]

Numerical example

We illustrate, numerically, the use of the denoising (explicit) score matching objective ${\tilde J}_{\mathrm{DSM{\tilde p}_\sigma}}$ to model a synthetic univariate Gaussian mixture distribution. However, instead using an energy based method and model an energy potential of the density as done by Vincent (2011), we model directly the score function.

Julia language setup

We use the Julia programming language for the numerical simulations, with suitable packages.

Packages

using StatsPlots
using Random
using Distributions
using Lux # artificial neural networks explicitly parametrized
using Optimisers
using Zygote # automatic differentiation
using Markdown

Reproducibility

We set the random seed for reproducibility purposes.

rng = Xoshiro(12345)

Data

We build the usual target model and draw samples from it.

target_prob = MixtureModel([Normal(-3, 1), Normal(3, 1)], [0.1, 0.9])

xrange = range(-10, 10, 200)
dx = Float64(xrange.step)
xx = permutedims(collect(xrange))
target_pdf = pdf.(target_prob, xrange')
target_score = gradlogpdf.(target_prob, xrange')

sample_points = permutedims(rand(rng, target_prob, 1024))

1×1024 Matrix{Float64}:
 2.30308  2.84284  3.4103  3.68232  …  1.71428  2.75491  3.14101  2.48846

Visualizing the sample data drawn from the distribution and the PDF.

Visualizing the score function.

The neural network model

The neural network we consider is a simple feed-forward neural network made of a single hidden layer, obtained as a chain of a couple of dense layers. This is implemented with the LuxDL/Lux.jl package.

We will see that we don't need a big neural network in this simple example. We go as low as it works.

model = Chain(Dense(1 => 8, relu), Dense(8 => 1))

Chain(
    layer_1 = Dense(1 => 8, relu),                # 16 parameters
    layer_2 = Dense(8 => 1),                      # 9 parameters
)         # Total: 25 parameters,
          #        plus 0 states.

The LuxDL/Lux.jl package uses explicit parameters, that are initialized (or obtained) with the Lux.setup function, giving us the parameters and the state of the model.

ps, st = Lux.setup(rng, model) # initialize and get the parameters and states of the model

((layer_1 = (weight = Float32[-0.003556765; -1.8715183; … ; 0.66702616; -0.9373461;;], bias = Float32[0.18317068, 0.5787344, -0.18110967, 0.9307035, -0.43067825, -0.46645045, -0.8246051, -0.9340805]), layer_2 = (weight = Float32[0.27326134 -0.2086962 … 0.42855448 0.5658726], bias = Float32[0.09530755])), (layer_1 = NamedTuple(), layer_2 = NamedTuple()))

Loss function

Here it is how we implement the empirical denoising score matching objective

\[ {\tilde J}_{\mathrm{DSM{\tilde p}_{\sigma, 0}}}({\boldsymbol{\theta}}) = \frac{1}{2}\frac{1}{NM}\sum_{n=1}^N \sum_{m=1}^M \left\| \boldsymbol{\psi}(\mathbf{x}_{n, m}; {\boldsymbol{\theta}}) - \frac{\mathbf{x}_n - \tilde{\mathbf{x}}_{n, m}}{\sigma^2} \right\|^2 \mathrm{d}\tilde{\mathbf{x}}.\]

First we precompute the matrix $(\mathbf{a}_{n,m})_{n,m}$ given by

\[ \mathbf{a}_{n,m} = \frac{\mathbf{x}_n - \tilde{\mathbf{x}}_{n, m}}{\sigma^2}.\]

Then, at each iteration of the optimization process, we take the current parameters $\boldsymbol{\theta}$ and apply the model to the perturbed points $\tilde{\mathbf{x}}_{n, m}$ to obtain the predicted scores $\{\boldsymbol{\psi}_{n,m}^{\boldsymbol{\theta}}\}$ with values

\[ \boldsymbol{\psi}_{n,m}^{\boldsymbol{\theta}} = \boldsymbol{\psi}(\tilde{\mathbf{x}}_{n,m}, \boldsymbol{\theta}),\]

and then compute half the mean square distance between the two matrices:

\[ \frac{1}{2}\sum_{m=1}^M \sum_{n=1}^N \left\| \boldsymbol{\psi}_{n,m}^{\boldsymbol{\theta}} - \mathbf{a}_{n,m}\right|^2.\]

In the implementation below, we just use $M=1$, so the matrices $(\boldsymbol{\psi}_{n,m}^{\boldsymbol{\theta}})_{n,m}$ and $(\mathbf{a}_{n,m})_{n,m}$ are actually just vectors. Besides, this is a scalar example, i.e. with $d=1$, so they are indeed plain real-valued vectors $(\psi_{n,1})_n$ and $(a_{n,1})_n$.

In general, though, these objects $(\boldsymbol{\psi}_{n,m}^{\boldsymbol{\theta}})_{n,m}$ and $(\mathbf{a}_{n,m})_{n,m}$ are $\mathbb{R}^d$-vector-valued matrices.

So, here is how we prepare the data.

sigma = 0.3
noised_sample_points = sample_points .+ sigma .* randn(size(sample_points))
dsm_target = ( sample_points .- noised_sample_points ) ./ sigma ^ 2
data = (noised_sample_points, dsm_target)

and here is the implementation of the loss function

function loss_function_dsm(model, ps, st, data)
    noised_sample_points, dsm_target = data
    y_score_pred, st = Lux.apply(model, noised_sample_points, ps, st)
    loss = mean(abs2, y_score_pred .- dsm_target) / 2
    return loss, st, ()
end

Optimization setup

Optimization method

We use the Adam optimiser.

opt = Adam(0.01)

tstate_org = Lux.Training.TrainState(model, ps, st, opt)

TrainState(
    Chain(
        layer_1 = Dense(1 => 8, relu),            # 16 parameters
        layer_2 = Dense(8 => 1),                  # 9 parameters
    ),
    number of parameters: 25
    number of states: 0
    optimizer: Optimisers.Adam(eta=0.01, beta=(0.9, 0.999), epsilon=1.0e-8)
    step: 0
)

Automatic differentiation in the optimization

As mentioned, we setup differentiation in LuxDL/Lux.jl with the FluxML/Zygote.jl library.

vjp_rule = Lux.Training.AutoZygote()

ADTypes.AutoZygote()

Processor

We use the CPU instead of the GPU.

dev_cpu = cpu_device()
## dev_gpu = gpu_device()

(::MLDataDevices.CPUDevice{Missing}) (generic function with 1 method)

Check differentiation

Check if Zygote via Lux is working fine to differentiate the loss functions for training.

Lux.Training.compute_gradients(vjp_rule, loss_function_dsm, data, tstate_org)

((layer_1 = (weight = Float32[-0.4709084; -0.021179952; … ; -0.7667468; 0.06211383;;], bias = Float32[-0.2026222, 0.0025454229, -0.08840098, -0.1808644, 0.26040265, 0.060871985, -0.29630455, -0.009166527]), layer_2 = (weight = Float32[-0.129691 -0.19699347 … -0.6232718 -0.08775806], bias = Float32[-0.7414961])), 5.753983413410022, (), Lux.Training.TrainState{Nothing, Nothing, Lux.Chain{@NamedTuple{layer_1::Lux.Dense{typeof(NNlib.relu), Int64, Int64, Nothing, Nothing, Static.True}, layer_2::Lux.Dense{typeof(identity), Int64, Int64, Nothing, Nothing, Static.True}}, Nothing}, @NamedTuple{layer_1::@NamedTuple{weight::Matrix{Float32}, bias::Vector{Float32}}, layer_2::@NamedTuple{weight::Matrix{Float32}, bias::Vector{Float32}}}, @NamedTuple{layer_1::@NamedTuple{}, layer_2::@NamedTuple{}}, Optimisers.Adam{Float64, Tuple{Float64, Float64}, Float64}, @NamedTuple{layer_1::@NamedTuple{weight::Optimisers.Leaf{Optimisers.Adam{Float64, Tuple{Float64, Float64}, Float64}, Tuple{Matrix{Float32}, Matrix{Float32}, Tuple{Float32, Float32}}}, bias::Optimisers.Leaf{Optimisers.Adam{Float64, Tuple{Float64, Float64}, Float64}, Tuple{Vector{Float32}, Vector{Float32}, Tuple{Float32, Float32}}}}, layer_2::@NamedTuple{weight::Optimisers.Leaf{Optimisers.Adam{Float64, Tuple{Float64, Float64}, Float64}, Tuple{Matrix{Float32}, Matrix{Float32}, Tuple{Float32, Float32}}}, bias::Optimisers.Leaf{Optimisers.Adam{Float64, Tuple{Float64, Float64}, Float64}, Tuple{Vector{Float32}, Vector{Float32}, Tuple{Float32, Float32}}}}}}(nothing, nothing, Lux.Chain{@NamedTuple{layer_1::Lux.Dense{typeof(NNlib.relu), Int64, Int64, Nothing, Nothing, Static.True}, layer_2::Lux.Dense{typeof(identity), Int64, Int64, Nothing, Nothing, Static.True}}, Nothing}((layer_1 = Dense(1 => 8, relu), layer_2 = Dense(8 => 1)), nothing), (layer_1 = (weight = Float32[-0.003556765; -1.8715183; … ; 0.66702616; -0.9373461;;], bias = Float32[0.18317068, 0.5787344, -0.18110967, 0.9307035, -0.43067825, -0.46645045, -0.8246051, -0.9340805]), layer_2 = (weight = Float32[0.27326134 -0.2086962 … 0.42855448 0.5658726], bias = Float32[0.09530755])), (layer_1 = NamedTuple(), layer_2 = NamedTuple()), Optimisers.Adam(eta=0.01, beta=(0.9, 0.999), epsilon=1.0e-8), (layer_1 = (weight = Leaf(Adam(eta=0.01, beta=(0.9, 0.999), epsilon=1.0e-8), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(Adam(eta=0.01, beta=(0.9, 0.999), epsilon=1.0e-8), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999)))), layer_2 = (weight = Leaf(Adam(eta=0.01, beta=(0.9, 0.999), epsilon=1.0e-8), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(Adam(eta=0.01, beta=(0.9, 0.999), epsilon=1.0e-8), (Float32[0.0], Float32[0.0], (0.9, 0.999))))), 0))

Training loop

Here is the typical main training loop suggest in the LuxDL/Lux.jl tutorials, but sligthly modified to save the history of losses per iteration.

function train(tstate, vjp, data, loss_function, epochs, numshowepochs=20, numsavestates=0)
    losses = zeros(epochs)
    tstates = [(0, tstate)]
    for epoch in 1:epochs
        grads, loss, stats, tstate = Lux.Training.compute_gradients(vjp,
            loss_function, data, tstate)
        if ( epochs ≥ numshowepochs > 0 ) && rem(epoch, div(epochs, numshowepochs)) == 0
            println("Epoch: $(epoch) || Loss: $(loss)")
        end
        if ( epochs ≥ numsavestates > 0 ) && rem(epoch, div(epochs, numsavestates)) == 0
            push!(tstates, (epoch, tstate))
        end
        losses[epoch] = loss
        tstate = Lux.Training.apply_gradients(tstate, grads)
    end
    return tstate, losses, tstates
end

train (generic function with 3 methods)

Training

Now we train the model with the objective function ${\tilde J}_{\mathrm{ESM{\tilde p}_\sigma{\tilde p}_0}}({\boldsymbol{\theta}})$.

@time tstate, losses, tstates = train(tstate_org, vjp_rule, data, loss_function_dsm, 500, 20, 125)

┌ Warning: Mixed-Precision `matmul_cpu_fallback!` detected and Octavian.jl cannot be used for this set of inputs (C [Matrix{Float64}]: A [Matrix{Float32}] x B [Matrix{Float64}]). Falling back to generic implementation. This may be slow.
└ @ LuxLib.Impl ~/.julia/packages/LuxLib/R8Czx/src/impl/matmul.jl:190
Epoch: 25 || Loss: 5.490694994578582
Epoch: 50 || Loss: 5.411018088033561
Epoch: 75 || Loss: 5.366161486046216
Epoch: 100 || Loss: 5.349692989318008
Epoch: 125 || Loss: 5.344661337361436
Epoch: 150 || Loss: 5.3427350409095915
Epoch: 175 || Loss: 5.341208180768765
Epoch: 200 || Loss: 5.340844349396627
Epoch: 225 || Loss: 5.3406587303000865
Epoch: 250 || Loss: 5.340497241277021
Epoch: 275 || Loss: 5.340339649884561
Epoch: 300 || Loss: 5.340102702177624
Epoch: 325 || Loss: 5.339372582664066
Epoch: 350 || Loss: 5.338481038512005
Epoch: 375 || Loss: 5.337603304167694
Epoch: 400 || Loss: 5.336665694737605
Epoch: 425 || Loss: 5.335643781956049
Epoch: 450 || Loss: 5.33453568692666
Epoch: 475 || Loss: 5.333349510246203
Epoch: 500 || Loss: 5.332096933358944
  0.179752 seconds (282.10 k allocations: 134.874 MiB, 13.10% gc time, 29.25% compilation time)

Results

Testing out the trained model.

y_pred = Lux.apply(tstate.model, xrange', tstate.parameters, tstate.states)[1]

1×200 Matrix{Float64}:
 6.38197  6.28932  6.19666  6.10401  6.01135  …  -5.7164  -5.79946  -5.88251

Visualizing the result.

plot(title="Fitting", titlefont=10)

plot!(xrange, target_score', linewidth=4, label="score function")

scatter!(sample_points', s -> gradlogpdf(target_prob, s), label="data", markersize=2)

plot!(xx', y_pred', linewidth=2, label="predicted MLP")

Just for the fun of it, let us see an animation of the optimization process.

Recovering the PDF of the distribution from the trained score function.

paux = exp.(accumulate(+, y_pred) .* dx)
pdf_pred = paux ./ sum(paux) ./ dx
plot(title="Original PDF and PDF from predicted score function", titlefont=10)
plot!(xrange, target_pdf', label="original")
plot!(xrange, pdf_pred', label="recoverd")

And the animation of the evolution of the PDF.

We also visualize the evolution of the losses.

plot(losses, title="Evolution of the loss", titlefont=10, xlabel="iteration", ylabel="error", legend=false)