Overview of the main results pertaining to score-based generative methods

Here we review the main results associated with score-based generative methods.

Main results

The Langevin equation dates back to the beginning of the 20th century, proposed by Langevin (1908) in connection with Brownian motion.

The use of the Langevin equation (and especially of the overdamped Langevin equation) to draw samples from a known distribution using only its score function was proposed by Roberts and Tweedie (1996). Rates of convergence were also given and remain an active area of research.
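
Below is a minimal sketch of the resulting sampling scheme, the unadjusted (overdamped) Langevin algorithm, for the standard Gaussian, whose score is known in closed form; the step size, number of steps, and number of chains are arbitrary illustration choices.

```python
import numpy as np

def score_std_gaussian(x):
    # Score of the standard Gaussian: grad log p(x) = -x.
    return -x

def unadjusted_langevin(score, x0, eps=1e-2, n_steps=1000, rng=None):
    # Overdamped Langevin iteration:
    #   x_{k+1} = x_k + eps * score(x_k) + sqrt(2 * eps) * z_k,  z_k ~ N(0, I).
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
    return x

# Run many independent chains at once; their empirical distribution should
# approach the standard Gaussian (mean close to 0, standard deviation close to 1).
samples = unadjusted_langevin(score_std_gaussian, np.zeros(10_000))
print(samples.mean(), samples.std())
```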

The idea of fitting a distribution via the score function was then proposed by Hyvärinen (2005), via implicit score matching. The implicit score matching objective is obtained via integration by parts from the explicit score matching objective, which requires the unknown score of the target distribution. The implicit objective eliminates this unknown target score, but requires computing the divergence of the model score, i.e. the trace of the Hessian of the logarithm of the model pdf. This works fine for specific models whose derivatives can be readily computed, as considered by Hyvärinen (2005) (and subsequently by Köster and Hyvärinen (2010)), but it might otherwise be computationally intensive.
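
As a concrete illustration, here is a sketch of the implicit score matching objective in PyTorch, assuming a small hypothetical score network `score_net` mapping a batch of points to a batch of score vectors (the network, data, and dimensions are illustration choices); the exact divergence requires one backward pass per coordinate, which is what makes this objective expensive in high dimension.

```python
import torch

def implicit_score_matching_loss(score_net, x):
    # J(theta) = E[ tr(d s_theta/dx) + 0.5 * ||s_theta(x)||^2 ],
    # where s_theta(x) approximates the score grad_x log p(x).
    x = x.requires_grad_(True)
    s = score_net(x)                          # shape (batch, dim)
    loss = 0.5 * (s ** 2).sum(dim=1)
    # Exact divergence of the score: one backward pass per coordinate.
    for i in range(x.shape[1]):
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0][:, i]
        loss = loss + grad_i
    return loss.mean()

# Hypothetical model and data, for illustration only.
dim = 2
score_net = torch.nn.Sequential(
    torch.nn.Linear(dim, 64), torch.nn.SiLU(), torch.nn.Linear(64, dim)
)
print(implicit_score_matching_loss(score_net, torch.randn(128, dim)))
```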

Kingma and LeCun (2010) used an energy-based framework, writing the pdf as the exponential of an energy function modeled as a neural network, and used automatic differentiation to compute the Hessian of the log-pdf via the Hessian of the energy function. In doing so, they also proposed a regularized implicit score matching objective, adding a penalization in terms of the Hessian for stability purposes. This penalization does not add much to the complexity of the method, but the underlying method is already computationally demanding, since it is based on implicit score matching and needs the derivatives of the score function, a cost that increases dramatically with the dimension of the problem.
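
A sketch of this idea, assuming a hypothetical scalar energy network `energy_net` with the pdf proportional to exp(-E(x)): the score is obtained from the energy by automatic differentiation, and a curvature penalty (taken here, as an illustrative choice, to be the sum of squared diagonal second derivatives) is added to the implicit objective.

```python
import torch

def regularized_ism_loss(energy_net, x, lam=0.1):
    # Energy-based model: p_theta(x) ~ exp(-E_theta(x)), so the score is
    # s_theta(x) = -grad_x E_theta(x), computed by automatic differentiation.
    x = x.requires_grad_(True)
    energy = energy_net(x).sum()
    s = -torch.autograd.grad(energy, x, create_graph=True)[0]
    loss = 0.5 * (s ** 2).sum(dim=1)
    penalty = torch.zeros_like(loss)
    for i in range(x.shape[1]):
        # Diagonal of the score Jacobian, i.e. second derivatives of -E_theta.
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0][:, i]
        loss = loss + grad_i             # implicit score matching term
        penalty = penalty + grad_i ** 2  # curvature penalty (illustrative choice)
    return (loss + lam * penalty).mean()

# Hypothetical scalar energy network and data.
energy_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Softplus(), torch.nn.Linear(64, 1)
)
print(regularized_ism_loss(energy_net, torch.randn(128, 2)))
```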

Vincent (2011) proposed the denoising score matching method, working with the explicit score matching objective and using Parzen kernel density estimation to approximate the unknown target distribution and its score, thus avoiding the need to differentiate the score of the model pdf. One can directly compute the score of the Parzen estimate, which involves the logarithm of a sum of kernels and can be worked out explicitly beforehand, so the optimization process is relatively straightforward. Vincent went further, though, writing the objective function as a double integral with a conditional distribution, so the score that needs to be computed is that of a single kernel, simplifying the computation. This is also interpreted in connection with denoising autoencoders, where one uses the original sample and a corrupted sample, associated with the conditional distribution over a single kernel of the Parzen estimate. The double integral is not a burden in practice, since a single corrupted sample for each clean sample turns out to be sufficient, rendering it effectively a single integral, which becomes a simple sum over the sample data.
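
The resulting objective is simple to implement. Here is a sketch with a Gaussian kernel of width `sigma`, assuming a hypothetical score network `score_net`; each clean sample is corrupted once, and the model score at the corrupted point is regressed onto the score of the single conditional kernel.

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma=0.1):
    # Corrupt each sample with a Gaussian kernel of width sigma:
    #   x_tilde = x + sigma * z,  z ~ N(0, I).
    # The score of the conditional (single-kernel) density q(x_tilde | x) is
    #   grad_{x_tilde} log q(x_tilde | x) = -(x_tilde - x) / sigma^2 = -z / sigma,
    # and the model score is regressed onto it.
    z = torch.randn_like(x)
    x_tilde = x + sigma * z
    target = -z / sigma
    return ((score_net(x_tilde) - target) ** 2).sum(dim=1).mean()

# Hypothetical model and data, for illustration only.
score_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2)
)
print(denoising_score_matching_loss(score_net, torch.randn(128, 2)))
```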

Song, Garg, Shi, and Ermon (2020) addressed the implicit score matching approach again and proposed a sliced (implicit) score matching method to reduce its computational cost for high-dimensional problems. The trick is to take directional derivatives only along a few random directions at each sample point.
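
A sketch of the sliced objective, assuming a hypothetical score network `score_net`: for each random direction, only one extra backward pass (a vector-Jacobian product) is needed, instead of one backward pass per coordinate.

```python
import torch

def sliced_score_matching_loss(score_net, x, n_slices=1):
    # E_v E_x [ v^T (d s_theta/dx) v + 0.5 * (v^T s_theta(x))^2 ],
    # with random directions v ~ N(0, I).
    x = x.requires_grad_(True)
    s = score_net(x)
    loss = 0.0
    for _ in range(n_slices):
        v = torch.randn_like(x)
        grad_sv = torch.autograd.grad((s * v).sum(), x, create_graph=True)[0]
        loss = loss + (grad_sv * v).sum(dim=1) + 0.5 * (s * v).sum(dim=1) ** 2
    return (loss / n_slices).mean()

# Hypothetical model and data, for illustration only.
score_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2)
)
print(sliced_score_matching_loss(score_net, torch.randn(128, 2)))
```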

A few months later, Pang, Xu, Li, Song, Ermon, and Zhu (2020) proposed the finite-difference (implicit) score matching method, using finite differences to approximate the derivatives that appear in the loss function. The authors suggest using finite differences not only for implicit score matching, but also for sliced score matching, denoising score matching, sliced score matching with variance reduction, and so on. The use of finite differences greatly reduces the computational cost of the optimization process.
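
As an illustration of the idea, here is a sketch that replaces the directional derivative in the sliced objective above by a central finite difference; the paper derives specific estimators for several objectives, so this is only a schematic version, and the spacing `delta` is an arbitrary choice.

```python
import torch

def finite_difference_sliced_loss(score_net, x, delta=1e-3, n_slices=1):
    # Approximate the directional derivative with a central finite difference,
    #   v^T (d s_theta/dx) v  ~  v^T (s_theta(x + delta v) - s_theta(x - delta v)) / (2 delta),
    # so no differentiation of the score network is needed in the loss.
    loss = 0.0
    for _ in range(n_slices):
        v = torch.randn_like(x)
        s = score_net(x)
        s_plus = score_net(x + delta * v)
        s_minus = score_net(x - delta * v)
        fd_term = ((s_plus - s_minus) * v).sum(dim=1) / (2 * delta)
        loss = loss + fd_term + 0.5 * (s * v).sum(dim=1) ** 2
    return (loss / n_slices).mean()

# Hypothetical model and data, for illustration only.
score_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2)
)
print(finite_difference_sliced_loss(score_net, torch.randn(128, 2)))
```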

A few years earlier, Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli (2015) introduced diffusion probabilistic models, which were further improved as denoising diffusion probabilistic models (DDPM) in Ho, Jain, and Abbeel (2020). The idea is to embed the target distribution into a Markov chain and model the reverse process of that chain. The forward Markov chain gradually adds noise to the data and drives it to a tractable noisy distribution. The model then starts from the tractable distribution and reverts back to an approximation of the target distribution; in other words, it denoises a noisy sample towards a sample of the desired distribution. This is a much more complex task, which greatly increases the dimension of the problem, but it yields more stability to both the training and the generative processes. The original work of Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli (2015) had a more complex model and an expensive loss function; Ho, Jain, and Abbeel (2020) simplified both, turning it into a practical and efficient model. Further on, Nichol and Dhariwal (2021) showed improvements with a nonlinear variance schedule for the forward process and a partial training of the variance of the reverse process. It is worth mentioning that, in principle, this is maximum likelihood estimation, not score matching, but it turns out that the term actually modeled in the reverse Markov chain is essentially a discretized approximation of the score function.
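
A sketch of the simplified training objective of Ho, Jain, and Abbeel (2020), assuming a hypothetical noise-prediction network `eps_net(x_t, t)` and a linear variance schedule (both the schedule and the number of steps are illustrative choices).

```python
import torch

def ddpm_simple_loss(eps_net, x0, alphas_bar):
    # Forward (noising) process at a randomly chosen step t:
    #   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I).
    # The simplified loss trains the network to predict the added noise eps.
    batch = x0.shape[0]
    t = torch.randint(0, len(alphas_bar), (batch,))
    a_bar = alphas_bar[t].view(batch, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return ((eps_net(x_t, t) - eps) ** 2).mean()

# Illustrative linear variance schedule with T = 1000 steps.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1 - betas, dim=0)

# Placeholder noise-prediction network (a real one would also depend on t).
eps_net = lambda x_t, t: torch.zeros_like(x_t)
print(ddpm_simple_loss(eps_net, torch.randn(16, 2), alphas_bar))
```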

Song and Ermon (2019) then proposed modeling the score function directly as a neural network (instead of the pdf or the energy potential), with a noise conditional score network (NCSN) model based on a sequence of perturbations of the data at different noise levels, similar to denoising score matching, and a corresponding sampling scheme based on annealed Langevin dynamics.
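
A sketch of annealed Langevin sampling, assuming a hypothetical trained noise-conditional score network `score_net(x, sigma)`; the noise levels, the base step size, and the number of steps per level are illustrative choices.

```python
import torch

def annealed_langevin_sampling(score_net, sigmas, shape, eps=2e-5, n_steps=100):
    # Run a few Langevin steps at each noise level, from the largest sigma down
    # to the smallest, with step size alpha_i = eps * sigma_i^2 / sigma_L^2.
    x = torch.randn(shape)
    sigma_last = sigmas[-1]
    for sigma in sigmas:
        alpha = eps * (sigma / sigma_last) ** 2
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma) + (alpha ** 0.5) * z
    return x

# Placeholder noise-conditional score (standard Gaussian data assumed, for illustration).
score_net = lambda x, sigma: -x / (1 + sigma ** 2)
sigmas = torch.logspace(1, -2, steps=10)   # decreasing noise levels from 10 to 0.01
print(annealed_langevin_sampling(score_net, sigmas, (1000, 2)).std())
```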

In a subsequent work, Song, Meng, and Ermon (2021) improved upon the DDPM model and introduced denoising diffusion implicit models (DDIM), which are based on non-Markovian processes but amount to the same training strategy and model. The advantage, then, is that sampling is greatly expedited, since the non-Markovian reverse process can take much larger steps from the tractable distribution back towards the data. We do not detail this method here, though.

Then came Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole (2020), with score-based diffusion models via stochastic differential equations, unifying the ideas of the DDPM and NCSN methods in a continuous-time framework: noise is added to the data via a stochastic differential equation (SDE), and samples are generated with a corresponding reverse SDE.
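
A sketch of the sampling side of this framework, for the variance-exploding SDE, using a simple Euler-Maruyama discretization of the reverse SDE; the time-dependent score network `score_net(x, t)`, the noise scales, and the number of steps are all hypothetical illustration choices.

```python
import math
import torch

def reverse_sde_sampler(score_net, shape, sigma_min=0.01, sigma_max=50.0, n_steps=1000):
    # Variance-exploding forward SDE with sigma(t) = sigma_min * (sigma_max/sigma_min)^t,
    # i.e. dX = g(t) dW with g(t) = sigma(t) * sqrt(2 * log(sigma_max/sigma_min)).
    # The reverse SDE, dX = -g(t)^2 * score(X, t) dt + g(t) dW_bar, is integrated
    # from t = 1 down to t = 0 with the Euler-Maruyama scheme.
    x = sigma_max * torch.randn(shape)    # start from the tractable noisy prior
    dt = -1.0 / n_steps
    t = 1.0
    for _ in range(n_steps):
        sigma_t = sigma_min * (sigma_max / sigma_min) ** t
        g = sigma_t * math.sqrt(2 * math.log(sigma_max / sigma_min))
        z = torch.randn_like(x)
        x = x - (g ** 2) * score_net(x, t) * dt + g * math.sqrt(-dt) * z
        t = t + dt
    return x

# Placeholder time-dependent score (standard Gaussian data assumed, for illustration).
def score_net(x, t):
    sigma_t = 0.01 * (50.0 / 0.01) ** t
    return -x / (1 + sigma_t ** 2)

print(reverse_sde_sampler(score_net, (1000, 2)).std())
```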

Finally, we have the paper from the NVIDIA team, Karras, Aittala, Aila, and Laine (2022), with the aim of "elucidating the design space of diffusion-based generative models", as the title says.

References

  1. P. Langevin (1908), "Sur la théorie du mouvement brownien [On the Theory of Brownian Motion]". C. R. Acad. Sci. Paris. 146: 530–533
  2. G. O. Roberts, R. L. Tweedie (1996), "Exponential Convergence of Langevin Distributions and Their Discrete Approximations", Bernoulli, Vol. 2, No. 4, 341-363, doi:10.2307/3318418
  3. Aapo Hyvärinen (2005), "Estimation of non-normalized statistical models by score matching", Journal of Machine Learning Research 6, 695-709
  4. U. Köster, A. Hyvärinen (2010), "A two-layer model of natural stimuli estimated with score matching", Neural. Comput. 22 (no. 9), 2308-33, doi:10.1162/NECO_a_00010
  5. Durk P. Kingma, Yann LeCun (2010), "Regularized estimation of image statistics by Score Matching", Advances in Neural Information Processing Systems 23 (NIPS 2010)
  6. Pascal Vincent (2011), "A connection between score matching and denoising autoencoders", Neural Computation, 23 (7), 1661-1674, doi:10.1162/NECO_a_00142
  7. Y. Song, S. Garg, J. Shi, S. Ermon (2020), "Sliced Score Matching: A Scalable Approach to Density and Score Estimation", Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR 115:574-584 – see also the arxiv version
  8. T. Pang, K. Xu, C. Li, Y. Song, S. Ermon, J. Zhu (2020), "Efficient Learning of Generative Models via Finite-Difference Score Matching", NeurIPS - see also the arxiv version
  9. J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, S. Ganguli (2015), "Deep unsupervised learning using nonequilibrium thermodynamics", ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, 2256-2265
  10. Y. Song and S. Ermon (2019), "Generative modeling by estimating gradients of the data distribution", NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, no. 1067, 11918-11930
  11. J. Ho, A. Jain, P. Abbeel (2020), "Denoising diffusion probabilistic models", in Advances in Neural Information Processing Systems 33, NeurIPS2020
  12. A. Q. Nichol and P. Dhariwal (2021), "Improved denoising diffusion probabilistic models", Proceedings of the 38th International Conference on Machine Learning (ICML 2021), PMLR 139
  13. J. Song, C. Meng, S. Ermon (2021), "Denoising diffusion implicit models", ICLR 2021 Conference
  14. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole (2020), "Score-based generative modeling through stochastic differential equations", arXiv:2011.13456
  15. T. Karras, M. Aittala, T. Aila, S. Laine (2022), "Elucidating the design space of diffusion-based generative models", Advances in Neural Information Processing Systems 35 (NeurIPS 2022)