Score matching with Parzen estimation
Introduction
Aim
Explore the use of Parzen kernel estimation to approximate the explicit score matching objective, which served as the basis for the implicit score matching objective proposed by Aapo Hyvärinen (2005) and was discussed en passant by Pascal Vincent (2011) on his way to the denoising score matching objective. We illustrate the method by fitting a multi-layer perceptron to model the score function of a one-dimensional synthetic Gaussian-mixture distribution.
Motivation
The motivation is to continue building a solid background on score-matching diffusion.
Background
Aapo Hyvärinen (2005) proposed directly fitting the score of a distribution. This is obtained, in theory, by minimizing an explicit score matching objective function (i.e. the Fisher divergence). However, this objective requires knowing the supposedly unknown target score function. The trick used by Aapo Hyvärinen (2005) was to integrate by parts and rewrite the optimization problem in terms of an implicit score matching objective function, which has the same minima and requires no further information about the target distribution other than some sample points. The implicit score matching method requires, however, the derivative of the score function of the model, which is costly to compute in general.
Then, Pascal Vincent (2011) explored the idea of using non-parametric Parzen density estimation to directly approximate the explicit score matching objective, making a connection with denoising autoencoders and proposing the denoising (explicit) score matching method.
We will detail denoising score matching in a separate note. Here, we stop at the Parzen density estimation idea, which was used by Pascal Vincent (2011) only as a step towards denoising score matching. We call this method the Parzen-estimated (explicit) score matching method. We also model directly the score of the distribution, as proposed by Song and Ermon (2019), instead of the pdf or an energy potential of the pdf, as done earlier.
Objective function approximating the explicit score matching objective
The score-matching method from Aapo Hyvärinen (2005) aims to fit the model score function $\boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}})$ to the score function $\boldsymbol{\psi}_{\mathbf{X}}(\mathbf{x})$ of a random variable $\mathbf{X}$ by minimizing the implicit score matching objective
\[ J_{\mathrm{ISM}}({\boldsymbol{\theta}}) = \int_{\mathbb{R}^d} p_{\mathbf{X}}(\mathbf{x}) \left( \frac{1}{2}\left\|\boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}})\right\|^2 + \boldsymbol{\nabla}_{\mathbf{x}} \cdot \boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}}) \right)\;\mathrm{d}\mathbf{x},\]
which is equivalent to minimizing the explicit score matching objective
\[ J_{\mathrm{ESM}}({\boldsymbol{\theta}}) = \frac{1}{2}\int_{\mathbb{R}^d} p_{\mathbf{X}}(\mathbf{x}) \left\|\boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}}) - \boldsymbol{\psi}_{\mathbf{X}}(\mathbf{x})\right\|^2\;\mathrm{d}\mathbf{x},\]
due to the following identity obtained via integration by parts in the expectation
\[ J_{\mathrm{ESM}}({\boldsymbol{\theta}}) = J_{\mathrm{ISM}}({\boldsymbol{\theta}}) + C,\]
where $C$ is constant with respect to the parameters. The advantage of $J_{\mathrm{ISM}}({\boldsymbol{\theta}})$ is that it does not involve the unknown score function of $\mathbf{X}$. It does, however, involve the divergence of the modeled score function, which is expensive to compute.
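For completeness, the identity follows by expanding the square in $J_{\mathrm{ESM}}$ and integrating the cross term by parts, assuming enough decay at infinity for the boundary terms to vanish. Since $p_{\mathbf{X}}(\mathbf{x})\boldsymbol{\psi}_{\mathbf{X}}(\mathbf{x}) = p_{\mathbf{X}}(\mathbf{x})\boldsymbol{\nabla}_{\mathbf{x}}\log p_{\mathbf{X}}(\mathbf{x}) = \boldsymbol{\nabla}_{\mathbf{x}} p_{\mathbf{X}}(\mathbf{x})$, the cross term becomes
\[ -\int_{\mathbb{R}^d} \boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}}) \cdot \boldsymbol{\nabla}_{\mathbf{x}} p_{\mathbf{X}}(\mathbf{x})\;\mathrm{d}\mathbf{x} = \int_{\mathbb{R}^d} p_{\mathbf{X}}(\mathbf{x}) \, \boldsymbol{\nabla}_{\mathbf{x}} \cdot \boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}})\;\mathrm{d}\mathbf{x},\]
so that $J_{\mathrm{ESM}}({\boldsymbol{\theta}}) = J_{\mathrm{ISM}}({\boldsymbol{\theta}}) + C$ with $C = \frac{1}{2}\int_{\mathbb{R}^d} p_{\mathbf{X}}(\mathbf{x}) \left\|\boldsymbol{\psi}_{\mathbf{X}}(\mathbf{x})\right\|^2\;\mathrm{d}\mathbf{x}$.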
In practice, the integral in $J_{\mathrm{ISM}}$ is further approximated with the empirical distribution ${\tilde p}_0(\mathbf{x})$ given by
\[ {\tilde p}_0(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^N \delta(\mathbf{x} - \mathbf{x}_n),\]
so the implemented objective is the empirical implicit score matching objective
\[ {\tilde J}_{\mathrm{ISM}{\tilde p}_0}({\boldsymbol{\theta}}) = \frac{1}{N}\sum_{n=1}^N \left( \frac{1}{2}\left\|\boldsymbol{\psi}(\mathbf{x}_n; {\boldsymbol{\theta}})\right\|^2 + \boldsymbol{\nabla}_{\mathbf{x}} \cdot \boldsymbol{\psi}(\mathbf{x}_n; {\boldsymbol{\theta}}) \right).\]
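To make the cost concrete, here is a minimal sketch of this empirical objective in one dimension, for a hypothetical toy score model `psi` (not the network used below), with the divergence term reduced to a single derivative computed by automatic differentiation. Training on this objective requires differentiating through that inner derivative with respect to ${\boldsymbol{\theta}}$, which is the part that becomes expensive for larger models.
using Statistics
using Zygote

# hypothetical toy score model: linear plus cubic term in x
psi(x, θ) = θ[1] * x + θ[2] * x^3

# empirical implicit score matching objective (d = 1);
# the divergence term reduces to dψ/dx, obtained via AD
function J_ISM_empirical(θ, xs)
    mean(psi(x, θ)^2 / 2 + Zygote.gradient(z -> psi(z, θ), x)[1] for x in xs)
end

J_ISM_empirical([-0.5, 0.01], randn(100)) # evaluate on a toy sample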
Aapo Hyvärinen (2005) briefly mentions that minimizing $J_{\mathrm{ESM}}({\boldsymbol{\theta}})$ directly is "basically a non-parametric estimation problem", but dismisses it in favor of the "simple trick of partial integration to compute the objective function very easily". As we have seen, the trick is fine for model functions whose gradient can be computed without much trouble, but for models such as neural networks it becomes computationally expensive.
A few years later, Pascal Vincent (2011) considered the idea of using a Parzen kernel density estimation
\[ {\tilde p}_\sigma(\mathbf{x}) = \frac{1}{\sigma^d}\int_{\mathbb{R}^d} K\left(\frac{\mathbf{x} - \tilde{\mathbf{x}}}{\sigma}\right) \;\mathrm{d}{\tilde p}_0(\tilde{\mathbf{x}}) = \frac{1}{\sigma^d N}\sum_{n=1}^N K\left(\frac{\mathbf{x} - \mathbf{x}_n}{\sigma}\right),\]
where $\sigma > 0$ is a kernel window parameter and $K(\mathbf{x})$ is a kernel density (properly normalized to have mass one). In this way, the explicit score matching objective function is approximated by the Parzen-estimated explicit score matching objective
\[ {\tilde J}_{\mathrm{P_\sigma ESM}}({\boldsymbol{\theta}}) = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_\sigma(\mathbf{x}) \left\|\boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\mathbf{x}}\log {\tilde p}_\sigma(\mathbf{x})\right\|^2\;\mathrm{d}\mathbf{x},\]
which is then further approximated with the empirical distribution, yielding the empirical Parzen-estimated explicit score matching
\[ \begin{align*} {\tilde J}_{\mathrm{P_\sigma ESM{\tilde p}_0}}({\boldsymbol{\theta}}) & = \frac{1}{2}\int_{\mathbb{R}^d} {\tilde p}_0(\mathbf{x}) \left\|\boldsymbol{\psi}(\mathbf{x}; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\mathbf{x}}\log {\tilde p}_\sigma(\mathbf{x})\right\|^2\;\mathrm{d}\mathbf{x} \\ & = \frac{1}{2N} \sum_{n=1}^N \left\|\boldsymbol{\psi}(\mathbf{x}_n; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\mathbf{x}}\log {\tilde p}_\sigma(\mathbf{x}_n)\right\|^2. \end{align*}\]
However, Pascal Vincent (2011) did not use this as a final objective function. Vincent further simplified the objective function ${\tilde J}_{\mathrm{P_\sigma ESM}}({\boldsymbol{\theta}})$ by expanding the gradient of the log-pdf of the Parzen estimator, writing a double integral with a conditional probability, and switching the order of integration. We will do this in a follow-up note, but for the moment we stop at ${\tilde J}_{\mathrm{P_\sigma ESM}}({\boldsymbol{\theta}})$ and ${\tilde J}_{\mathrm{P_\sigma ESM{\tilde p}_0}}({\boldsymbol{\theta}})$, use a Gaussian kernel, and see how this works.
Computing the score function with the Parzen estimation amounts to
\[ \begin{align*} \boldsymbol{\nabla}_{\mathbf{x}}\log {\tilde p}_\sigma(\mathbf{x}) & = \boldsymbol{\nabla}_{\mathbf{x}}\log\left( \frac{1}{\sigma^d N}\sum_{n=1}^N K\left(\frac{\mathbf{x} - \mathbf{x}_n}{\sigma}\right)\right) \\ & = \frac{1}{{\tilde p}_\sigma(\mathbf{x})} \frac{1}{\sigma^d N}\sum_{n=1}^N \boldsymbol{\nabla}_{\mathbf{x}} \left(K\left(\frac{\mathbf{x} - \mathbf{x}_n}{\sigma}\right)\right) \\ & = \frac{1}{{\tilde p}_\sigma(\mathbf{x})} \frac{1}{\sigma^d N}\sum_{n=1}^N \frac{1}{\sigma}\left(\boldsymbol{\nabla}_{\mathbf{x}} K\right)\left(\frac{\mathbf{x} - \mathbf{x}_n}{\sigma}\right). \end{align*}\]
If we use the standard Gaussian kernel
\[ G(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{1}{2} \|\mathbf{x}\|^2},\]
then
\[ \left(\boldsymbol{\nabla}_{\mathbf{x}} G\right)(\mathbf{x}) = -\frac{1}{(2\pi)^{d/2}}e^{-\frac{1}{2} \|\mathbf{x}\|^2} \mathbf{x} = -G(\mathbf{x})\mathbf{x},\]
so that
\[ \boldsymbol{\nabla}_{\mathbf{x}}\log {\tilde p}_\sigma(\mathbf{x}) = \frac{1}{{\tilde p}_\sigma(\mathbf{x})} \frac{1}{\sigma^d N}\sum_{n=1}^N G\left(\frac{\mathbf{x} - \mathbf{x}_n}{\sigma}\right)\frac{\mathbf{x}_n - \mathbf{x}}{\sigma^2}.\]
Notice that this can be computed beforehand at the sample points, just with the knowledge of the sample points themselves. Indeed, renaming the summation index from $n$ to $j$ and computing the approximate score function at each sample point $\mathbf{x}_n$ yields
\[ \boldsymbol{\nabla}_{\mathbf{x}}\log {\tilde p}_\sigma(\mathbf{x}_n) = \frac{1}{{\tilde p}_\sigma(\mathbf{x}_n)} \frac{1}{\sigma^d N}\sum_{j=1}^N G\left(\frac{\mathbf{x}_n - \mathbf{x}_j}{\sigma}\right)\frac{\mathbf{x}_j - \mathbf{x}_n}{\sigma^2},\]
where
\[{\tilde p}_\sigma(\mathbf{x}_n) = \frac{1}{\sigma^d N}\sum_{j=1}^N G\left(\frac{\mathbf{x}_n - \mathbf{x}_j}{\sigma}\right).\]
Then, the explicit score matching objective approximated with the Parzen kernel estimator and with the empirical distribution yields the objective
\[ {\tilde J}_{\mathrm{P_\sigma ESM{\tilde p}_0}}({\boldsymbol{\theta}}) = \frac{1}{2}\frac{1}{N} \sum_{n=1}^N \left\|\boldsymbol{\psi}(\mathbf{x}_n; {\boldsymbol{\theta}}) - \boldsymbol{\nabla}_{\mathbf{x}}\log {\tilde p}_\sigma(\mathbf{x}_n)\right\|^2.\]
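In one dimension, with the Parzen scores precomputed at the sample points, this objective is just a (halved) mean squared difference. A minimal framework-free sketch, with a hypothetical score model `psi(x, θ)` (the Lux version used in the experiment is implemented below):
using Statistics

# empirical Parzen-estimated explicit score matching objective (d = 1);
# parzen_scores[n] holds the precomputed score of the Parzen estimate at xs[n]
J_parzen(psi, θ, xs, parzen_scores) =
    mean(abs2(psi(x, θ) - s) for (x, s) in zip(xs, parzen_scores)) / 2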
Numerical example
We illustrate, numerically, the use of the empirical Parzen-estimated explicit score matching objective ${\tilde J}_{\mathrm{P_\sigma ESM{\tilde p}_0}}$ to model a synthetic univariate Gaussian mixture distribution.
Julia language setup
We use the Julia programming language for the numerical simulations, with suitable packages.
Packages
using StatsPlots
using Random
using Distributions
using Lux # artificial neural networks explicitly parametrized
using Optimisers
using Zygote # automatic differentiation
using Markdown
There are several Julia libraries for artificial neural networks and for automatic differentiation (AD). As discussed in a previous note, we will use here the LuxDL/Lux.jl library, which is tailored to the SciML differential equations ecosystem.
Reproducibility
We set the random seed for reproducibility purposes.
rng = Xoshiro(12345)
Data
We build the target model and draw samples from it.
The target model is a univariate random variable denoted by $X$ and defined by a probability distribution. Associated with it, we consider its PDF and its score function.
target_prob = MixtureModel([Normal(-3, 1), Normal(3, 1)], [0.1, 0.9])
xrange = range(-10, 10, 200)
dx = Float64(xrange.step)
xx = permutedims(collect(xrange))
target_pdf = pdf.(target_prob, xrange')
target_score = gradlogpdf.(target_prob, xrange')
sample_points = permutedims(rand(rng, target_prob, 1024))
1×1024 Matrix{Float64}:
2.30308 2.84284 3.4103 3.68232 … 1.71428 2.75491 3.14101 2.48846
Visualizing the sample data drawn from the distribution and the PDF.
Visualizing the score function.
For the Parzen estimated score matching, we need to pre-compute the score function of the Parzen estimation.
# standard Gaussian kernel
G(x) = exp(-x^2 / 2) / √(2π)

# Parzen estimate of the pdf at x, with window parameter sigma (d = 1)
psigma(x, sigma, sample_points) = mean(G( (x - xn) / sigma ) for xn in sample_points) / sigma

# score of the Parzen estimate at x, i.e. the gradient of log psigma
score_parzen(x, sigma, sample_points) = mean(G( (x - xn) / sigma ) * (xn - x) / sigma^2 for xn in sample_points) / psigma(x, sigma, sample_points) / sigma
score_parzen (generic function with 1 method)
The Parzen estimated score function is highly sensitive to the window parameter $\sigma$:
sigma = 0.5
score_parzen_points = map(x -> score_parzen(x, sigma, sample_points), sample_points)
data = (sample_points, score_parzen_points)
([2.303077959422043 2.8428423932782843 … 3.1410080972036334 2.488464630750972], [0.5909442505839475 0.15076935144375442 … -0.10704209580452212 0.4520888778084795])
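To see this sensitivity, here is a quick sketch, reusing `score_parzen` above, comparing the Parzen estimated score for a few window sizes against the target score (the values 0.1 and 2.0 are just illustrative choices; small windows overfit the sample fluctuations, while large windows oversmooth the modes):
plt = plot(title="Parzen estimated score for different σ", titlefont=10)
plot!(plt, xrange, target_score', linewidth=4, label="target score")
for s in (0.1, 0.5, 2.0)
    plot!(plt, xrange, map(x -> score_parzen(x, s, sample_points), xrange), label="σ = $s")
end
plt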
We choose the value $\sigma = 0.5$.
The neural network model
The neural network we consider is a simple feed-forward neural network made of a single hidden layer, obtained as a chain of a couple of dense layers. This is implemented with the LuxDL/Lux.jl package.
We will see that we do not need a big neural network in this simple example; we keep it about as small as still gets the job done.
model = Chain(Dense(1 => 8, relu), Dense(8 => 1))
Chain(
layer_1 = Dense(1 => 8, relu), # 16 parameters
layer_2 = Dense(8 => 1), # 9 parameters
) # Total: 25 parameters,
# plus 0 states.
The LuxDL/Lux.jl package uses explicit parameters, which are initialized (or obtained) with the Lux.setup function, giving us the parameters and the state of the model.
ps, st = Lux.setup(rng, model) # initialize and get the parameters and states of the model
((layer_1 = (weight = Float32[-0.0008383376; -0.4411211; … ; 0.15721959; -0.22093461;;], bias = Float32[0.0; 0.0; … ; 0.0; 0.0;;]), layer_2 = (weight = Float32[0.14955823 0.4725347 … -0.6732873 -0.76267356], bias = Float32[0.0;;])), (layer_1 = NamedTuple(), layer_2 = NamedTuple()))
Loss function
Here is how we implement the objective ${\tilde J}_{\mathrm{P_\sigma ESM{\tilde p}_0}}({\boldsymbol{\theta}})$. We drop the constant factor $1/2$, which does not affect the minimizer.
function loss_function_parzen(model, ps, st, data)
sample_points, score_parzen_points = data
y_score_pred, st = Lux.apply(model, sample_points, ps, st)
loss = mean(abs2, y_score_pred .- score_parzen_points)
return loss, st, ()
end
loss_function_parzen (generic function with 1 method)
Optimization setup
Optimization method
We use the Adam optimiser.
opt = Adam(0.01)
tstate_org = Lux.Training.TrainState(rng, model, opt)
TrainState
model: Lux.Chain{@NamedTuple{layer_1::Lux.Dense{true, typeof(NNlib.relu), typeof(WeightInitializers.glorot_uniform), typeof(WeightInitializers.zeros32)}, layer_2::Lux.Dense{true, typeof(identity), typeof(WeightInitializers.glorot_uniform), typeof(WeightInitializers.zeros32)}}, Nothing}((layer_1 = Dense(1 => 8, relu), layer_2 = Dense(8 => 1)), nothing)
# of parameters: 25
# of states: 0
optimizer: Adam(0.01, (0.9, 0.999), 1.0e-8)
step: 0
Automatic differentiation in the optimization
As mentioned, we set up differentiation in LuxDL/Lux.jl with the FluxML/Zygote.jl library.
vjp_rule = Lux.Training.AutoZygote()
ADTypes.AutoZygote()
Processor
We use the CPU instead of the GPU.
dev_cpu = cpu_device()
## dev_gpu = gpu_device()
(::LuxDeviceUtils.LuxCPUDevice) (generic function with 5 methods)
Check differentiation
Check if Zygote via Lux is working fine to differentiate the loss functions for training.
Lux.Training.compute_gradients(vjp_rule, loss_function_parzen, data, tstate_org)
((layer_1 = (weight = Float32[-2.8955894; 0.24220864; … ; 2.6812541; 8.4831505;;], bias = Float32[-0.95564574; -0.0560816; … ; 0.88490766; 2.7997365;;]), layer_2 = (weight = Float32[-4.793218 -0.12134284 … -7.5171814 -9.92585], bias = Float32[-4.4427743;;])), 5.284951837642634, (), Lux.Training.TrainState{Nothing, Nothing, Lux.Chain{@NamedTuple{layer_1::Lux.Dense{true, typeof(NNlib.relu), typeof(WeightInitializers.glorot_uniform), typeof(WeightInitializers.zeros32)}, layer_2::Lux.Dense{true, typeof(identity), typeof(WeightInitializers.glorot_uniform), typeof(WeightInitializers.zeros32)}}, Nothing}, @NamedTuple{layer_1::@NamedTuple{weight::Matrix{Float32}, bias::Matrix{Float32}}, layer_2::@NamedTuple{weight::Matrix{Float32}, bias::Matrix{Float32}}}, @NamedTuple{layer_1::@NamedTuple{}, layer_2::@NamedTuple{}}, Optimisers.Adam, @NamedTuple{layer_1::@NamedTuple{weight::Optimisers.Leaf{Optimisers.Adam, Tuple{Matrix{Float32}, Matrix{Float32}, Tuple{Float32, Float32}}}, bias::Optimisers.Leaf{Optimisers.Adam, Tuple{Matrix{Float32}, Matrix{Float32}, Tuple{Float32, Float32}}}}, layer_2::@NamedTuple{weight::Optimisers.Leaf{Optimisers.Adam, Tuple{Matrix{Float32}, Matrix{Float32}, Tuple{Float32, Float32}}}, bias::Optimisers.Leaf{Optimisers.Adam, Tuple{Matrix{Float32}, Matrix{Float32}, Tuple{Float32, Float32}}}}}}(nothing, nothing, Lux.Chain{@NamedTuple{layer_1::Lux.Dense{true, typeof(NNlib.relu), typeof(WeightInitializers.glorot_uniform), typeof(WeightInitializers.zeros32)}, layer_2::Lux.Dense{true, typeof(identity), typeof(WeightInitializers.glorot_uniform), typeof(WeightInitializers.zeros32)}}, Nothing}((layer_1 = Dense(1 => 8, relu), layer_2 = Dense(8 => 1)), nothing), (layer_1 = (weight = Float32[0.36434844; -0.2782616; … ; 0.57140595; 0.7544968;;], bias = Float32[0.0; 0.0; … ; 0.0; 0.0;;]), layer_2 = (weight = Float32[0.22010337 0.5554293 … -0.20381103 -0.6448325], bias = Float32[0.0;;])), (layer_1 = NamedTuple(), layer_2 = NamedTuple()), Adam(0.01, (0.9, 0.999), 1.0e-8), (layer_1 = (weight = Leaf(Adam(0.01, (0.9, 0.999), 1.0e-8), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(Adam(0.01, (0.9, 0.999), 1.0e-8), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999)))), layer_2 = (weight = Leaf(Adam(0.01, (0.9, 0.999), 1.0e-8), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(Adam(0.01, (0.9, 0.999), 1.0e-8), (Float32[0.0;;], Float32[0.0;;], (0.9, 0.999))))), 0))
Training loop
Here is the typical main training loop suggested in the LuxDL/Lux.jl tutorials, slightly modified to save the history of losses per iteration.
function train(tstate, vjp, data, loss_function, epochs, numshowepochs=20, numsavestates=0)
losses = zeros(epochs)
tstates = [(0, tstate)]
for epoch in 1:epochs
grads, loss, stats, tstate = Lux.Training.compute_gradients(vjp,
loss_function, data, tstate)
if ( epochs ≥ numshowepochs > 0 ) && rem(epoch, div(epochs, numshowepochs)) == 0
println("Epoch: $(epoch) || Loss: $(loss)")
end
if ( epochs ≥ numsavestates > 0 ) && rem(epoch, div(epochs, numsavestates)) == 0
push!(tstates, (epoch, tstate))
end
losses[epoch] = loss
tstate = Lux.Training.apply_gradients(tstate, grads)
end
return tstate, losses, tstates
end
train (generic function with 3 methods)
Training
Now we train the model with the objective function ${\tilde J}_{\mathrm{P_\sigma ESM{\tilde p}_0}}({\boldsymbol{\theta}})$.
@time tstate, losses, tstates = train(tstate_org, vjp_rule, data, loss_function_parzen, 500, 20, 125)
Epoch: 25 || Loss: 0.4430188778889257
Epoch: 50 || Loss: 0.36148477697952613
Epoch: 75 || Loss: 0.31848269692133824
Epoch: 100 || Loss: 0.27213985651426276
Epoch: 125 || Loss: 0.22408721389654504
Epoch: 150 || Loss: 0.1768749523714307
Epoch: 175 || Loss: 0.13340709669718717
Epoch: 200 || Loss: 0.09615046175945093
Epoch: 225 || Loss: 0.0663155056519743
Epoch: 250 || Loss: 0.04405017376534694
Epoch: 275 || Loss: 0.028544834846467564
Epoch: 300 || Loss: 0.01869552476648377
Epoch: 325 || Loss: 0.012975478634386299
Epoch: 350 || Loss: 0.009753473791430753
Epoch: 375 || Loss: 0.00811123135232417
Epoch: 400 || Loss: 0.00730669631662745
Epoch: 425 || Loss: 0.006895175277809365
Epoch: 450 || Loss: 0.0066665919248817115
Epoch: 475 || Loss: 0.006499004619862312
Epoch: 500 || Loss: 0.006308042755676025
0.079218 seconds (131.04 k allocations: 123.886 MiB, 22.94% gc time, 38.01% compilation time)
Results
Testing out the trained model.
y_pred = Lux.apply(tstate.model, xrange', tstate.parameters, tstate.states)[1]
1×200 Matrix{Float64}:
6.41162 6.31644 6.22126 6.12609 6.03091 … -5.32654 -5.40551 -5.48449
Visualizing the result.
plot(title="Fitting", titlefont=10)
plot!(xrange, target_score', linewidth=4, label="score function")
scatter!(sample_points', s -> gradlogpdf(target_prob, s), label="data", markersize=2)
plot!(xx', y_pred', linewidth=2, label="predicted MLP")
Just for the fun of it, let us see an animation of the optimization process.
Recovering the PDF of the distribution from the trained score function. Since the score is the derivative of the log-pdf, we integrate it cumulatively, exponentiate, and normalize to unit mass.
paux = exp.(accumulate(+, y_pred) .* dx) # cumulative sum approximates the antiderivative of the score, i.e. the log-pdf up to a constant
pdf_pred = paux ./ sum(paux) ./ dx # normalize to unit mass
plot(title="Original PDF and PDF from predicted score function", titlefont=10)
plot!(xrange, target_pdf', label="original")
plot!(xrange, pdf_pred', label="recovered")
And the animation of the evolution of the PDF.
We also visualize the evolution of the losses.
plot(losses, title="Evolution of the loss", titlefont=10, xlabel="iteration", ylabel="error", legend=false)
References
- Pascal Vincent (2011), "A connection between score matching and denoising autoencoders", Neural Computation 23 (7), 1661-1674, doi:10.1162/NECO_a_00142
- Aapo Hyvärinen (2005), "Estimation of non-normalized statistical models by score matching", Journal of Machine Learning Research 6, 695-709
- Y. Song and S. Ermon (2019), "Generative modeling by estimating gradients of the data distribution", NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, no. 1067, 11918-11930