Diffusion Models for Imaging and Vision
VAE (Variational Auto-Encoder)
Setting
- Input $x$ (e.g. image) $\rightarrow$ Encoder $\rightarrow$ Latent Variables $z$ $\rightarrow$ Decoder $\rightarrow$ Generated $\hat{x}$
- Autoencoder: an input variable $x$ and a latent variable $z$
- latent space/latent feature space/embedding space: an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another
- “Variational”: use probability distributions $p(x)$ for $x$ and $p(z)$ for the latent variable $z$
    - $p(z|x)$: associated with the encoder. It is not the encoder itself, but the encoder has to behave consistently with $p(z|x)$. We have no access to it.
    - $p(x|z)$: associated with the decoder. We have no access to it either.
- Proxy distributions:
    - $q_{\phi}(z|x)$: proxy for $p(z|x)$. Assume it is Gaussian; we need to estimate the mean and variance of the Gaussian $q_{\phi}(z|x)$.
    - $p_{\theta}(x|z)$: proxy for $p(x|z)$. Also assumed Gaussian, but with a fixed variance, so only the mean needs to be estimated: a decoder neural network turns $z$ into $x$, which lets us evaluate how good the generated image $x$ is.
Evidence Lower Bound (Similar to Maximum Likelihood Estimation)
- Loss function: evidence lower bound (ELBO): $ELBO(x) = E_{q_{\phi}(z|x)}[\log{\frac{p(x, z)}{q_{\phi}(z|x)}}]$
- $\log{p(x)} = \int q_{\phi}(z|x)\log{p(x)}\,dz = \int q_{\phi}(z|x)\log{\frac{p(x, z)}{p(z|x)}}\,dz = \int q_{\phi}(z|x)\log{\frac{p(x, z)}{q_{\phi}(z|x)}}\,dz + \int q_{\phi}(z|x)\log{\frac{q_{\phi}(z|x)}{p(z|x)}}\,dz = E_{q_{\phi}(z|x)}[\log{\frac{p(x, z)}{q_{\phi}(z|x)}}] + D_{KL}(q_{\phi}(z|x)\|p(z|x))$
- ELBO is a lower bound for the log-evidence $\log{p(x)}$, since the KL divergence is always non-negative
- The gap between $\log{p(x)}$ and $ELBO(x)$ is exactly this KL divergence, so we need to ensure $q_{\phi}(z|x)$ is close to $p(z|x)$
- $ELBO(x) = E_{q_{\phi}(z|x)}[\log{p_{\theta}(x|z)}] - D_{KL}(q_{\phi}(z|x)|p(z))$
- Use $p(x, z) = p(x|z)p(z)$, split the expectations, and KL definition
- Replace $p(x|z)$ with proxy $p_{\theta}(x|z)$
- Previous assumptions: $p_{\theta}(x|z)$, $q_{\phi}(z|x)$, and $p(z)$ are Gaussian distributions
- ELBO = “how good the decoder is” - “KL divergence for encoder”
- Reconstruction (given $x$ and $z\sim q_{\phi}(z|x)$, find $\theta$ to maximize $\log{p_{\theta}(x|z)}$): the decoder should produce a good image $x$
- Prior matching (given $(z, x)$, find $\phi$ to minimize the KL divergence): KL divergence $\downarrow$ similarity $\uparrow$. Since we maximize ELBO, the KL term carries a negative sign.
Training VAE
- Ground truth pairs $(x, z)$, where $x$ is the clean image and $z$ is the corresponding latent code sampled from $q_{\phi}(z|x)$ (assumed Gaussian)
- A deep neural network such that $\mu = \mu_{\phi}(x)$ and $\sigma^{2} = \sigma_{\phi}^{2}(x)$
- The $l$-th sample $z^{(l)}\sim\mathcal{N}(z|\mu_{\phi}(x^{(l)}), \sigma_{\phi}^{2}(x^{(l)})I)$
- $\mu_{\phi}(x^{(l)})$ and $\sigma_{\phi}^{2}(x^{(l)})$ are functions of $x^{(l)}$ $\rightarrow$ different $x$ give different Gaussians
- High-dimensional Gaussian $x\sim\mathcal{N}(x|\mu, \Sigma)$
- Sampling process via a transformation of white noise: $x = \mu + \Sigma^{1/2}w$
- $w\sim\mathcal{N}(0, I)$; $\Sigma^{1/2}$ is computed via eigen-decomposition
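The sampling rule $x = \mu + \Sigma^{1/2}w$ can be sketched in a few lines of numpy; the mean, covariance, and sample count below are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target Gaussian: mean mu, symmetric positive-definite covariance Sigma.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Sigma^{1/2} via eigen-decomposition: Sigma = U diag(s) U^T.
s, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(s)) @ U.T

# Transform white noise w ~ N(0, I): x = mu + Sigma^{1/2} w.
w = rng.standard_normal((100_000, 2))
x = mu + w @ Sigma_half.T  # Sigma_half is symmetric; transpose kept explicit

# Sample mean and covariance now match mu and Sigma up to Monte Carlo error.
```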
- Decoder $p_{\theta}(x|z)$
- Assume $(\hat{x} - x)\sim\mathcal{N}(0, \sigma_{\text{dec}}^{2}I)$, where $\hat{x} = \text{decode}_{\theta}(z)$
- The distribution $p_{\theta}(x|z)$: $\log{p_{\theta}(x|z)} = \log{\mathcal{N}(x|\text{decode}_{\theta}(z), \sigma_{\text{dec}}^{2}I)} = -\frac{\|x - \text{decode}_{\theta}(z)\|^{2}}{2\sigma_{\text{dec}}^{2}} - \log{\sqrt{(2\pi\sigma_{\text{dec}}^{2})^{D}}}$
- $D$ is the dimension of $x$
- The 2nd term is constant in $\theta$ and can be ignored
- This equation shows that maximizing the likelihood term in ELBO amounts to minimizing the $L_{2}$ loss between the decoded image $\hat{x}$ and the ground truth $x$
Loss Function
- Approximate the expectation by Monte Carlo simulation: $E_{q_{\phi}(z|x)}[\log{p_{\theta}(x|z)}]\approx \frac{1}{L}\sum_{l = 1}^{L}\log{p_{\theta}(x^{(l)}|z^{(l)})}$
- Training loss: $\arg\max_{\phi, \theta}\frac{1}{L}\sum_{l = 1}^{L}\left[\log{p_{\theta}(x^{(l)}|z^{(l)})}-D_{KL}(q_{\phi}(z|x^{(l)})\|p(z))\right]$
    - KL divergence term: the KL divergence between two $d$-dimensional Gaussians has a closed form in terms of $\mu$ and $\sigma$
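A minimal numpy sketch of this loss for a single sample, assuming a diagonal Gaussian $q_{\phi}(z|x)$ parameterized by `mu` and `log_var`, a standard Gaussian prior $p(z)$, and a fixed decoder variance `sigma_dec` (the function name and signature are hypothetical; a real VAE would backpropagate through the encoder/decoder networks):

```python
import numpy as np

def neg_elbo(x, x_hat, mu, log_var, sigma_dec=1.0):
    """Negative ELBO for one sample.

    Reconstruction: -log p_theta(x|z) up to an additive constant
        = ||x - x_hat||^2 / (2 * sigma_dec^2).
    Prior matching: KL( N(mu, diag(exp(log_var))) || N(0, I) )
        in closed form for diagonal Gaussians.
    """
    recon = np.sum((x - x_hat) ** 2) / (2.0 * sigma_dec ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl

# Perfect reconstruction with q matching the prior gives zero loss.
loss = neg_elbo(np.zeros(3), np.zeros(3), np.zeros(2), np.zeros(2))
```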
Inference with VAE
Generate the decoded image $\hat{x}$ by sampling a latent $z$ from $p(z) = \mathcal{N}(0, I)$ and passing it through the decoder with parameters $\theta$
Denoising Diffusion Probabilistic Model (DDPM)
- “Diffusion models are incremental updates where the assembly of the whole gives us the encoder-decoder structure. The transition from one state to another is realized by a denoiser.”
- Variational diffusion model
- $x_{0}$: original image, same as $x$ in VAE
- $x_{T}$: latent variable, same as $z$ in VAE
- $x_{1}$, …, $x_{T-1}$: intermediate states, also latent variables, not white Gaussian
Building Blocks
- Transition block
- Forward transition: ideally, with $x_{t-1}$ and $p(x_{t}|x_{t-1})$, we could get $x_{t}$. However, the distribution $p(x_{t}|x_{t-1})$ remains inaccessible, so we use the proxy Gaussian $q_{\phi}(x_{t}|x_{t-1})$
- Reverse transition: similarly, the distribution $p(x_{t}|x_{t+1})$ is unknown. The proxy Gaussian $p_{\theta}(x_{t}|x_{t+1})$ is used here; the mean of this Gaussian is estimated by a neural network.
- Initial block: only has a reverse transition from $x_{1}$ to $x_{0}$. With the proxy $p_{\theta}(x_{0}|x_{1})$ and $x_{1}$, $x_{0}$ can be estimated.
- Final block: only has a forward transition from $x_{T-1}$ to $x_{T}$. With the proxy $q_{\phi}(x_{T}|x_{T-1})$ and $x_{T-1}$, $x_{T}$ can be estimated.
- Understanding the transition distribution $q_{\phi}(x_{t}|x_{t-1})$:
- Definition: $q_{\phi}(x_{t}|x_{t-1}) = \mathcal{N}(x_{t}|\sqrt{\alpha_{t}}x_{t-1}, (1-\alpha_{t})I)$
- Mean is $\sqrt{\alpha_{t}}x_{t-1}$, covariance is $(1-\alpha_{t})I$
- The scaling factor $\sqrt{\alpha_{t}}$ controls the signal magnitude, preventing explosion or vanishing
- Equivalently, $x_{t} = \sqrt{\alpha_{t}}x_{t-1} + \sqrt{1-\alpha_{t}}\,\epsilon$, where $\epsilon\sim\mathcal{N}(0, I)$
- Derived from Gaussian mixture model
The Magical Scalars $\sqrt{\alpha_{t}}$ and $1-\alpha_{t}$
- $q_{\phi}(x_{t}|x_{t-1}) = \mathcal{N}(x_{t}|ax_{t-1}, b^{2}I)$, where $a = \sqrt{\alpha_{t}}$, $b = \sqrt{1-\alpha_{t}}$
- $x_{t} = ax_{t-1} + b\epsilon_{t-1}$ $\rightarrow$ $x_{t} = a^{t}x_{0} + b[\epsilon_{t-1} + a\epsilon_{t-2}+\cdots+a^{t-1}\epsilon_{0}]$
- Require $w_{t} = b[\epsilon_{t-1} + a\epsilon_{t-2}+\cdots+a^{t-1}\epsilon_{0}]$ to be $\sim\mathcal{N}(0, I)$
- Computing the covariance of $w_{t}$ and letting $t\rightarrow \infty$ gives $b = \sqrt{1-a^{2}}$
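A quick numerical sanity check of this choice: with $b = \sqrt{1-a^{2}}$, repeatedly applying $x_{t} = ax_{t-1} + b\epsilon$ keeps the variance pinned near 1 (the value of $a$, the step count, and the sample count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

a = 0.95
b = np.sqrt(1 - a ** 2)  # the derived choice b = sqrt(1 - a^2)

# Start from unit-variance x_0 and apply many forward steps.
x = rng.standard_normal(200_000)
for _ in range(50):
    x = a * x + b * rng.standard_normal(x.shape)

# The variance has neither exploded nor vanished: it stays ~1.
var = x.var()
```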
Distribution $q_{\phi}(x_{t}|x_{0})$
- $q_{\phi}(x_{t}|x_{0}) = \mathcal{N}(x_{t}|\sqrt{\bar{\alpha}_{t}}\,x_{0}, (1-\bar{\alpha}_{t})I)$, where $\bar{\alpha}_{t} = \prod_{i = 1}^{t}\alpha_{i}$
- Similarly, calculate the covariance for the noise part
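This closed form is easy to verify by simulation: iterating the one-step transitions from a fixed $x_{0}$ should reproduce the mean and variance of $q(x_{t}|x_{0})$ (the $\alpha$ schedule and sample count below are invented for the check):

```python
import numpy as np

rng = np.random.default_rng(0)

alphas = np.array([0.99, 0.98, 0.97, 0.96, 0.95])  # invented alpha_1..alpha_5
alpha_bar = np.prod(alphas)                        # \bar{alpha}_t

# Iterate the one-step transitions x_t = sqrt(alpha_t) x_{t-1} + sqrt(1-alpha_t) eps
# starting from a fixed x_0 = 3.
x = np.full(200_000, 3.0)
for a in alphas:
    x = np.sqrt(a) * x + np.sqrt(1 - a) * rng.standard_normal(x.shape)

# Compare against q(x_t|x_0) = N(sqrt(alpha_bar) x_0, (1 - alpha_bar) I).
mean_err = abs(x.mean() - np.sqrt(alpha_bar) * 3.0)
var_err = abs(x.var() - (1 - alpha_bar))
```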
Evidence Lower Bound
- $ELBO_{\phi, \theta}(x) = E_{q_{\phi}(x_{1}|x_{0})}[\log{p_{\theta}(x_{0}|x_{1})}] - E_{q_{\phi}(x_{T-1}|x_{0})}[D_{KL}(q_{\phi}(x_{T}|x_{T-1})||p(x_{T}))] - \sum_{t = 1}^{T-1}E_{q_{\phi}(x_{t-1}, x_{t+1}|x_{0})}[D_{KL}(q_{\phi}(x_{t}|x_{t-1})||p_{\theta}(x_{t}|x_{t+1}))]$
- 3 terms for evaluating the initial block, final block, and transition blocks
- Reconstruction: The expectation is taken with respect to the samples drawn from $q_{\phi}(x_{1}|x_{0})$, which is distribution that generates intermediate latent variable $x_{1}$.
- Prior matching: KL divergence measures the difference between $q_{\phi}(x_{T}|x_{T-1})$ (forward transition) and $p(x_{T})$ (assumed $\mathcal{N}(0, I)$)
- Consistency: forward transition $q_{\phi}(x_{t}|x_{t-1})$ vs. reverse transition $p_{\theta}(x_{t}|x_{t+1})$; KL divergence measures the deviation
Rewrite the Consistency Term
- Bayes theorem conditioned on $x_{0}$: $q(x_{t}|x_{t-1}, x_{0}) = \frac{q(x_{t-1}|x_{t}, x_{0})q(x_{t}|x_{0})}{q(x_{t-1}|x_{0})}$. This switches the direction from $q(x_{t}|x_{t-1}, x_{0})$ to $q(x_{t-1}|x_{t}, x_{0})$
- Updated $ELBO_{\phi, \theta}(x) = E_{q_{\phi}(x_{1}|x_{0})}[\log{p_{\theta}(x_{0}|x_{1})}]-D_{KL}(q_{\phi}(x_{T}|x_{0})||p(x_{T})) - \sum_{t = 2}^{T}E_{q_{\phi}(x_{t}|x_{0})}[D_{KL}(q_{\phi}(x_{t-1}|x_{t}, x_{0})||p_{\theta}(x_{t-1}|x_{t}))]$
- Reconstruction: same as before. Maximize the log-likelihood
- Prior matching: simplified to the KL divergence between $q_{\phi}(x_{T}|x_{0})$ and $p(x_{T})$, conditioned on $x_{0}$
- Consistency: index runs from $2$ to $T$; instead of matching the forward transition to the reverse transition, use $q_{\phi}$ to construct a reverse transition and match it with $p_{\theta}$
Derivation of $q_{\phi}(x_{t-1}|x_{t}, x_{0})$, which is a Gaussian
- Mean and variance
- Mean: $\mu_{q}(x_{t}, x_{0}) = \frac{(1-\bar{\alpha}_{t-1})\sqrt{\alpha_{t}}}{1-\bar{\alpha}_{t}}x_{t} + \frac{(1-\alpha_{t})\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t}}x_{0}$
- Variance: $\Sigma_{q}(t) = \frac{(1-\alpha_{t})(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}I$
- Mean and variance are completely characterized by $x_{t}$ and $x_{0}$
- Calculated with Bayes theorem
- Pick $p_{\theta}(x_{t-1}|x_{t}) = \mathcal{N}(x_{t-1}|\mu_{\theta}(x_{t}), \sigma_{q}^{2}(t)I)$
- Mean is determined by neural network; variance is chosen to be $\sigma_{q}^{2}(t)$
- Consistency term $D_{KL} = \frac{\|\mu_{q}(x_{t}, x_{0}) - \mu_{\theta}(x_{t})\|^{2}}{2\sigma_{q}^{2}(t)}$
- The KL divergence reduces to an $L_{2}$ loss between the two mean vectors
- From here on, all subscripts $\phi$ are dropped: the forward process has no learnable parameters; it just adds different levels of white noise to each $x_{t}$
- ELBO is optimized over $\theta$ through network
- Sampling: $q(x_{t}|x_{0}) = \mathcal{N}(x_{t}|\sqrt{\bar{\alpha}_{t}}\,x_{0}, (1-\bar{\alpha}_{t})I)$
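The KL-to-$L_{2}$ reduction above can be checked numerically: for two Gaussians sharing the covariance $\sigma^{2}I$, the general Gaussian KL formula collapses to $\|\mu_{q}-\mu_{\theta}\|^{2}/(2\sigma^{2})$ (the vectors and variance below are arbitrary):

```python
import numpy as np

def kl_gauss_shared_cov(mu_q, mu_p, sigma2):
    """KL( N(mu_q, sigma2*I) || N(mu_p, sigma2*I) ) from the general formula
    0.5 * [tr(S_p^{-1} S_q) - d + (mu_p-mu_q)^T S_p^{-1} (mu_p-mu_q) + log|S_p|/|S_q|];
    the trace and log-det terms cancel when the covariances are equal."""
    d = len(mu_q)
    trace_term = d * (sigma2 / sigma2)                 # = d
    quad_term = np.sum((mu_p - mu_q) ** 2) / sigma2
    logdet_term = 0.0                                  # equal covariances
    return 0.5 * (trace_term - d + quad_term + logdet_term)

mu_q = np.array([1.0, 2.0, 3.0])
mu_p = np.array([0.5, 2.0, 2.0])
sigma2 = 0.3
kl = kl_gauss_shared_cov(mu_q, mu_p, sigma2)  # equals ||mu_q - mu_p||^2 / (2*sigma2)
```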
Training and Inference
- Training
- With the definition $\mu_{\theta}(x_{t}) = \frac{(1-\bar{\alpha}_{t-1})\sqrt{\alpha_{t}}}{1-\bar{\alpha}_{t}} x_{t} + \frac{(1-\alpha_{t})\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t}}\hat{x}_{\theta}(x_{t})$, where $\hat{x}_{\theta}(x_{t})$ is another network
- With some simplification and substitution, ELBO becomes: $-\sum_{t=1}^{T}E_{q(x_{t}|x_{0})}[\frac{1}{2\sigma_{q}^{2}(t)}\frac{(1-\alpha_{t})^{2}\bar{\alpha}_{t-1}}{(1-\bar{\alpha}_{t})^{2}}\|\hat{x}_{\theta}(x_{t})-x_{0}\|^{2}]$
- Forward sampling process: with the Gaussian assumption, one-step data generation $x_{t} = \sqrt{\bar{\alpha}_{t}}\,x_{0} + \sqrt{1-\bar{\alpha}_{t}}\,z$, where $z\sim\mathcal{N}(0, I)$ [//]: # (INSERT IMAGE HERE)
- $x_{t-1} = \frac{(1-\bar{\alpha}_{t-1})\sqrt{\alpha_{t}}}{1-\bar{\alpha}_{t}}x_{t} + \frac{(1-\alpha_{t})\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t}}\hat{x}_{\theta}(x_{t}) + \sigma_{q}(t)z$, where $z\sim\mathcal{N}(0, I)$ [//]: # (INSERT IMAGE HERE)
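Putting the update equation above into a reverse (sampling) loop looks roughly like the sketch below; `x_hat_theta` is only a stand-in for the trained $\hat{x}_{\theta}$ network, and the $\alpha_{t}$ schedule is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
alphas = np.linspace(0.999, 0.98, T)  # invented noise schedule
alpha_bars = np.cumprod(alphas)

def x_hat_theta(x_t, t):
    """Placeholder for the trained network that predicts x_0 from x_t.
    A real model would be a U-Net; this dummy just undoes the mean scaling."""
    return x_t / np.sqrt(alpha_bars[t])

def ddpm_sample(dim=4):
    x = rng.standard_normal(dim)  # x_T ~ N(0, I)
    for t in range(T - 1, 0, -1):
        a_t, ab_t, ab_prev = alphas[t], alpha_bars[t], alpha_bars[t - 1]
        coef_xt = (1 - ab_prev) * np.sqrt(a_t) / (1 - ab_t)
        coef_x0 = (1 - a_t) * np.sqrt(ab_prev) / (1 - ab_t)
        sigma_q = np.sqrt((1 - a_t) * (1 - ab_prev) / (1 - ab_t))
        z = rng.standard_normal(dim) if t > 1 else 0.0  # no noise at the last step
        x = coef_xt * x + coef_x0 * x_hat_theta(x, t) + sigma_q * z
    return x

sample = ddpm_sample()
```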
Derivation Based on Noise Vector
- Predicts the noise instead of the signal
- The gradient step in training updates $\theta$ with the loss $\sum_{t}\|\hat{\epsilon}_{\theta}(x_{t})-\epsilon_{0}\|^{2}$
- Inference: $x_{t-1}$ is updated with $x_{t-1} = \frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\hat{\epsilon}_{\theta}(x_{t})\right) + \sigma_{q}(t)z$
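The noise-prediction update can be written as a one-step helper; this is a sketch where `eps_hat` stands in for the output of the trained network $\hat{\epsilon}_{\theta}$:

```python
import numpy as np

def ddpm_step_eps(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t, z):
    """One reverse step in the noise-prediction parameterization:
    x_{t-1} = (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * eps_hat) / sqrt(alpha_t)
              + sigma_t * z,   with z ~ N(0, I)."""
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    return mean + sigma_t * z

# With eps_hat = 0 and no noise, the step is a pure rescaling by 1/sqrt(alpha_t).
x_t = np.array([1.0, -1.0])
out = ddpm_step_eps(x_t, np.zeros(2), 0.99, 0.5, 0.0, np.zeros(2))
```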
Inversion by Direct Denoising (InDI): linear combinations $x_{t-1} = (\text{something})\cdot x_{t}+(\text{something else})\cdot \text{denoise}(x_{t}) + \text{noise}$
- $\text{denoise}(x_{t})$
- Based on statistics: the minimum mean squared error (MMSE) denoiser is $\text{denoise}(y) = \arg\min_{g}E[\|g(y)-x\|^{2}] = E[x|y]$, where $x$ is the clean image and $y$ is the noisy image
- For diffusion model, $\text{denoise}(x_{t}) = E[x_{t-1}|x_{t}]$
- Given $p_{\theta}(x_{t-1}|x_{t})$, the optimal denoiser is the conditional mean $E_{p_{\theta}}[x_{t-1}|x_{t}]$
- From an image quality perspective, MMSE is not a good metric: it averages over plausible images, which favors blur
- MMSE denoiser is equivalent to the conditional expectation of the posterior distribution
- Incremental denoising steps
- Considering a small step $\tau$ previous to time $t$, where $0\leq \tau <t\leq 1$
- $x_{t} = (1-t)x_{0}+ty$, $0\leq t\leq 1$
- $E[x_{t-\tau}|x_{t}] = (1-\frac{\tau}{t})x_{t}+\frac{\tau}{t}E[x_{0}|x_{t}]$
- With $\hat{x}_{t-\tau} = E[x_{t-\tau}|x_{t}]$ and $\text{denoise}(\hat{x}_{t}) = E[x_{0}|x_{t}]$, the inference step is $\hat{x}_{t-\tau} = (1-\frac{\tau}{t})\hat{x}_{t}+ \frac{\tau}{t}\text{denoise}(\hat{x}_{t})$
- With a noisy image $y$ and a denoiser, the above equation lets us retrieve $\hat{x}_{t-\tau}$, … down to $\hat{x}_{0}$.
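The incremental denoising iteration can be sketched as below; the `denoise` argument stands in for a trained MMSE denoiser (here an oracle that always returns the clean image, purely for a toy check):

```python
import numpy as np

def indi_restore(y, denoise, steps=5):
    """Deterministic InDI iteration: start at x_1 = y and repeatedly apply
    x_{t-tau} = (1 - tau/t) * x_t + (tau/t) * denoise(x_t),
    stepping t from 1 down to 0 in increments of tau = 1/steps."""
    x = np.array(y, dtype=float)
    tau = 1.0 / steps
    for k in range(steps):
        t = (steps - k) / steps  # current time level: 1, 1-tau, ..., tau
        x = (1 - tau / t) * x + (tau / t) * denoise(x)
    return x

# Toy check: with an "oracle" denoiser that always returns the clean image,
# the final step (where tau/t = 1) lands exactly on the clean image.
clean = np.array([1.0, 2.0, 3.0])
y = clean + np.array([0.4, -0.3, 0.2])  # a "noisy" observation
restored = indi_restore(y, lambda x: clean, steps=5)
print(np.allclose(restored, clean))  # → True
```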
- Training
- $\min_{\theta}E_{x,y}E_{t\sim\text{Uniform}[0,1]}[\|\text{denoise}_{\theta}(x_{t})-x\|^{2}]$
- Training one denoiser for all time steps $t$
- Connection with denoising score-matching
- $\frac{dx_{t}}{dt} = \lim_{\tau\rightarrow0}\frac{x_{t}-x_{t-\tau}}{\tau} = \frac{x_{t}-\text{denoise}(x_{t})}{t}$, which is an ordinary differential equation (ODE)
- The incremental denoising iteration is equivalent to the denoising score-matching.
- Adding stochastic steps
- Add stochastic perturbation to the incremental denoising iterations with a sequence of noise levels $\{\sigma_{t} : 0\leq t\leq 1\}$
- $x_{t} = (1-t)x+ty+\sqrt{t}\sigma_{t}\epsilon$
- For training: $\min_{\theta}E_{x, y}E_{t\sim\text{Uniform}}E_{\epsilon}[\|\text{denoise}_{\theta}(x_{t})-x\|^{2}]$
- For inference: $\hat{x}_{t-\tau} = (1-\frac{\tau}{t})\hat{x}_{t}+\frac{\tau}{t}\text{denoise}(\hat{x}_{t}) + (t-\tau)\sqrt{\sigma_{t-\tau}^{2}-\sigma_{t}^{2}}\,\epsilon$