Finite Step Unconditional Markov Diffusion Models
Let $p_\theta(x)$ be the parametric model of the data $x$; then we can optimize $\theta$ by maximizing the likelihood $p_\theta(x)$, or equivalently the log-likelihood $\log p_\theta(x)$.

From now on, we use $q$'s to denote the forward physical distributions and $p$'s the backward variational ansatz. The parameters $\theta$ are implied in $p$.
Let $x_1, \dots, x_T$ be the diffusion steps, with $x_0$ the data. The joint density of the forward process can be expanded sequentially if the process is Markov,

$$q(x_{0:T}) = q(x_0)\,\prod_{t=1}^{T} q(x_t \mid x_{t-1}).$$

The reverse-process variational ansatz can be similarly constructed,

$$p_\theta(x_{0:T}) = p(x_T)\,\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),$$

which can be interpreted as a series of consecutive priors for the physical observation $x_0$.
If we marginalize the latent variables $x_{1:T}$, we get the objective function, or observable likelihood,

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T},$$

where the expansion of $p_\theta(x_{0:T})$ assumes the Markov property of the backward process. However, even without the Markov assumption, we can insert an identity and get

$$p_\theta(x_0) = \int q(x_{1:T} \mid x_0)\,\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\, dx_{1:T} = \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right].$$
The log-likelihood averaged over the data distribution is

$$\mathbb{E}_{q(x_0)}\big[\log p_\theta(x_0)\big] = \int q(x_0)\,\log p_\theta(x_0)\, dx_0.$$

Using the concavity of the logarithm (Jensen's inequality),

$$\log p_\theta(x_0) = \log \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \ge -\,\mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right],$$

so that $\mathbb{E}_{q(x_0)}[\log p_\theta(x_0)] \ge -L$, where

$$L = \mathbb{E}_{q(x_{0:T})}\left[\log\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right].$$
Regardless of the Markov property of either process, maximizing the likelihood lower bound is equivalent to minimizing the KL divergence between the variational ansatz and the physical forward process; see the proof below.
If the forward process is Markov, then we have

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}).$$

Similarly, if we assume the backward joint density can be expanded as

$$p_\theta(x_{0:T}) = p(x_T)\,\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),$$

then the loss $L$ splits into per-step terms, as follows.
The posterior of the physical forward process may be represented, via Bayes' rule and the Markov property, as

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}.$$

Thus, the forward-process joint density has a posterior expansion,

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) = q(x_1 \mid x_0)\,\prod_{t=2}^{T} q(x_{t-1} \mid x_t, x_0)\,\frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)},$$

where the last factor telescopes,

$$\prod_{t=2}^{T} \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} = \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)}.$$
The ratio of the forward and backward densities can then be expanded in the following fashion,

$$\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} = \frac{q(x_T \mid x_0)}{p(x_T)}\,\prod_{t=2}^{T}\frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\,\frac{1}{p_\theta(x_0 \mid x_1)},$$

whose logarithm reads

$$\log\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} = \log\frac{q(x_T \mid x_0)}{p(x_T)} + \sum_{t=2}^{T}\log\frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} - \log p_\theta(x_0 \mid x_1).$$

Taking the expectation over $q(x_{0:T})$ gives

$$L = L_T + \sum_{t=2}^{T} L_{t-1} + L_0,$$

where

$$L_T = \mathbb{E}_q\, D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big), \qquad L_{t-1} = \mathbb{E}_q\, D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big), \qquad L_0 = -\,\mathbb{E}_q\big[\log p_\theta(x_0 \mid x_1)\big],$$
and the total loss becomes a typical variational-inference loss: the sum of a data negative log-likelihood and a series of prior KL divergences.

So far we have not assumed any specific distributions; the objective is based purely on the Markov property.
For the particular case of Gaussian diffusion models, we assume:

- the terminal distribution is normal, $p(x_T) = \mathcal{N}(x_T;\, 0,\, I)$;
- the forward transition process is Gaussian, $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I\big)$, with a variance schedule $\beta_t \in (0, 1)$.
It is useful to introduce the additional notations

$$\alpha_t = 1 - \beta_t, \qquad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s.$$

Iteratively applying the transition formula, we get the closed-form marginal

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar\alpha_t}\,x_0,\, (1-\bar\alpha_t)\, I\big).$$
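As a concrete illustration, here is a minimal NumPy sketch of this closed-form marginal. The linear $\beta_t$ schedule from $10^{-4}$ to $0.02$ and $T = 1000$ are common choices in the literature, but they are assumptions here, not something fixed by the derivation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_t; zero-based: index i stores beta_{i+1}
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(ab_t) x_0, (1 - ab_t) I) in one shot."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)          # a toy 8-dimensional "data" point
xt, eps = q_sample(x0, t=500, rng=rng)
```

Because the marginal is available in closed form, no sequential simulation of the $T$ forward steps is needed.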
We can compute the posterior of the physical process using Bayes' rule,

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)},$$

where the RHS is a product and ratio of Gaussians. One can show that the posterior is indeed Gaussian after an easy but lengthy calculation,

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \tilde\mu_t(x_t, x_0),\, \tilde\beta_t I\big),$$

where

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t$$

and

$$\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t.$$
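Continuing the NumPy sketch above (same arrays and zero-based indexing, so position $t$ holds $\beta_{t+1}$), the posterior coefficients translate directly into code:

```python
def q_posterior(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0); bar{alpha}_0 = 1 by convention."""
    alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(alpha_bar_prev) * betas[t] / (1.0 - alpha_bars[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t])
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t]) * betas[t]
    return mean, var
```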
Since the target distribution is Gaussian, it is a good idea to choose a Gaussian distribution as the variational ansatz,

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 I\big),$$

so that each step contributes

$$D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) = \frac{d}{2}\left(\frac{\tilde\beta_t}{\sigma_t^2} - 1 + \log\frac{\sigma_t^2}{\tilde\beta_t}\right) + \frac{\|\tilde\mu_t - \mu_\theta\|^2}{2\sigma_t^2},$$

where we used the KL divergence between two Gaussian distributions in $d$ dimensions.
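For reference, the closed form used above is the standard KL divergence between two isotropic Gaussians in $d$ dimensions:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, \sigma_1^2 I)\,\big\|\,\mathcal{N}(\mu_2, \sigma_2^2 I)\big) = \frac{d}{2}\left(\frac{\sigma_1^2}{\sigma_2^2} - 1 + \log\frac{\sigma_2^2}{\sigma_1^2}\right) + \frac{\|\mu_1 - \mu_2\|^2}{2\sigma_2^2}.$$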
Clearly, we have the exact solution for the variance part, $\sigma_t^2 = \tilde\beta_t$, and therefore

$$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right].$$

Finally, the objective becomes

$$L = L_T + \sum_{t=2}^{T}\mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde\mu_t - \mu_\theta\big\|^2\right] + L_0.$$

It can also be shown that $L_T$ contains no learnable parameters, since both $q(x_T \mid x_0)$ and $p(x_T)$ are fixed, so it can be dropped from the optimization.
During training, the model learns the backward transition distribution $p_\theta(x_{t-1} \mid x_t)$. The backward iteration is essentially sampling $x_T \sim \mathcal{N}(0, I)$ and then, for $t = T, \dots, 1$, calculating

$$x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).$$
Training

Sample $x_0 \sim q(x_0)$ together with a time step $t$ and noise $\epsilon$, construct the noised sample, and minimize the corresponding loss term; the step-by-step algorithm is spelled out at the end of this note.
Inference
The reconstruction formula is given in the previous section, "sampling the backward process."
We will derive the lower bound of the maximum-likelihood objective and then show that it is related to a KL divergence up to a constant.
Proof. The original approach would expand the joint density,

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}.$$

Note that $\int q(x_{1:T} \mid x_0)\, dx_{1:T} = 1$; then we can reinterpret the integral as an expectation,

$$p_\theta(x_0) = \int q(x_{1:T} \mid x_0)\,\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\, dx_{1:T} = \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right].$$

Applying Jensen's inequality and averaging over $q(x_0)$,

$$\mathbb{E}_{q(x_0)}\big[\log p_\theta(x_0)\big] \ge \mathbb{E}_{q(x_{0:T})}\left[\log\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] = -\,D_{\mathrm{KL}}\big(q(x_{0:T})\,\|\,p_\theta(x_{0:T})\big) - H\big[q(x_0)\big].$$

The entropy $H[q(x_0)]$ depends only on the data distribution and does not contain model parameters. Thus, maximizing the log-likelihood lower bound is equivalent to minimizing the KL divergence between the forward-process joint density and the backward-process joint density.
To invert the forward transition, one would like $q(x_{t-1} \mid x_t)$; however, without knowledge of the initial state $x_0$, there could be infinitely many possibilities. Therefore, we fix the initial state and get probabilities conditioned on the fixed $x_0$,

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)},$$

or the equivalent equality for $t \ge 2$,

$$q(x_t \mid x_{t-1}) = q(x_{t-1} \mid x_t, x_0)\,\frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}.$$
Using the posterior expansion of $q(x_{1:T} \mid x_0)$, the total KL divergence decomposes as

$$D_{\mathrm{KL}}\big(q(x_{0:T})\,\|\,p_\theta(x_{0:T})\big) = -H\big[q(x_0)\big] + L_T + \sum_{t=2}^{T} L_{t-1} + L_0,$$

with

$$L_T = \mathbb{E}_q\, D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big),$$

$$L_{t-1} = \mathbb{E}_q\, D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)$$

for $t = 2, \dots, T$, and

$$L_0 = -\,\mathbb{E}_q\big[\log p_\theta(x_0 \mid x_1)\big].$$
The term $L_T$ is a constant, since both $q(x_T \mid x_0)$ and $p(x_T)$ are fixed distributions. If we add back the entropy term $H[q(x_0)]$, we recover the variational loss

$$L = L_T + \sum_{t=2}^{T} L_{t-1} + L_0,$$

where the first learnable term is the usual likelihood term $L_0 = -\mathbb{E}_q[\log p_\theta(x_0 \mid x_1)]$, and the remaining terms $L_{t-1}$ compare the variational ansatz against the forward posterior step by step.
Let $\epsilon_t \sim \mathcal{N}(0, I)$ be i.i.d. noise; the forward process can be written as

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

Thus, we can generate $x_t$ for any $t$ without actually doing the iterative calculations. A similar property holds for any Markov process, cf. the Feynman–Kac formula.
Expressing $x_0$ in terms of $x_t$ and the noise, $x_0 = \big(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\big)/\sqrt{\bar\alpha_t}$, the posterior mean simplifies to

$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right).$$

Here the model parameters are the mean $\mu_\theta$ and the variance $\sigma_t^2$ of the backward ansatz. The variance will eventually contribute only to an effective learning rate; we will treat $\sigma_t$ as a hyperparameter instead of learning it from stochastic gradient descent. The only learnable parameter is then $\mu_\theta$.
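As a quick numerical sanity check of this simplification (reusing `q_sample`, `q_posterior`, and the arrays from the NumPy sketches above), the two expressions for $\tilde\mu_t$ agree:

```python
t = 500
mean_from_x0, _ = q_posterior(x0, xt, t)    # tilde{mu}_t from (x_0, x_t)
mean_from_eps = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) \
                / np.sqrt(alphas[t])         # tilde{mu}_t from (x_t, eps)
assert np.allclose(mean_from_x0, mean_from_eps)
```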
Recall the objective for each time step,

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\left[\frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right) - \mu_\theta(x_t, t)\right\|^2\right],$$

where $\epsilon$ is the noise that generated $x_t$ from $x_0$.
Next, we reparameterize $\mu_\theta$ to separate out the explicit dependence on $x_t$ and $t$ and let the model focus only on the implicit dependence, i.e.

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right),$$

where $\epsilon_\theta(x_t, t)$ is the model output, and

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon.$$
Thus, the generic loss term for step $t$ is

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\,\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\right].$$

Note that we still have the freedom to choose $\sigma_t$, which controls the importance of each step. In the literature, however, one usually takes the heuristic approach of ignoring the weight factor and keeping only the noise-matching loss,

$$L_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\right],$$

where $t \sim \mathrm{Uniform}\{1, \dots, T\}$.
The training loop is then:

1. Sample $x_0 \sim q(x_0)$, $t \sim \mathrm{Uniform}\{1, \dots, T\}$, and $\epsilon \sim \mathcal{N}(0, I)$.
2. Construct $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$.
3. Feed $(x_t, t)$ to the model to get $\epsilon_\theta(x_t, t)$.
4. Minimize $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ with respect to $\theta$.
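Below is a minimal PyTorch sketch of one training step under the assumptions above; the tiny MLP, with the normalized time step concatenated to its input, is a hypothetical stand-in for the usual time-conditioned U-Net noise predictor.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # same linear schedule as before
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Hypothetical noise predictor eps_theta(x_t, t): input is x_t (8-dim)
# concatenated with t/T, output is the predicted noise.
model = torch.nn.Sequential(
    torch.nn.Linear(9, 64), torch.nn.SiLU(), torch.nn.Linear(64, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x0):
    t = torch.randint(0, T, (x0.shape[0],))            # 1. sample t (0-based)
    eps = torch.randn_like(x0)                         #    ... and noise
    ab = alpha_bars[t].unsqueeze(-1)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps      # 2. construct x_t
    t_feat = t.float().unsqueeze(-1) / T
    eps_pred = model(torch.cat([xt, t_feat], dim=-1))  # 3. feed to model
    loss = ((eps - eps_pred) ** 2).mean()              # 4. L_simple
    opt.zero_grad(); loss.backward(); opt.step()       #    minimize
    return loss.item()

loss = train_step(torch.randn(32, 8))                  # toy batch of 8-dim "data"
```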
The inference (sampling) loop is:

1. Sample $x_T \sim \mathcal{N}(0, I)$.
2. Loop over $t = T, \dots, 1$:
   - Sample $z \sim \mathcal{N}(0, I)$ if $t > 1$, else set $z = 0$.
   - Compute $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$.
3. Return $x_0$.
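And a matching sampling sketch, reusing `model`, `betas`, `alphas`, and `alpha_bars` from the training snippet; choosing $\sigma_t^2 = \beta_t$ is one common convention ($\tilde\beta_t$ is the other), assumed here for simplicity.

```python
@torch.no_grad()
def sample(n=16, dim=8):
    x = torch.randn(n, dim)                           # 1. x_T ~ N(0, I)
    for t in reversed(range(T)):                      # 2. loop t = T..1 (0-based)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_feat = torch.full((n, 1), t / T)
        eps_pred = model(torch.cat([x, t_feat], dim=-1))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_pred) / torch.sqrt(alphas[t])  # mu_theta(x_t, t)
        x = mean + torch.sqrt(betas[t]) * z           #    compute x_{t-1}
    return x                                          # 3. return x_0

x0_generated = sample()
```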