AI Reading Notes: Image And Video Gen
Overall highlights
- The common approach is to first encode an image (or add noise to it) into a low-dimensional latent representation, and then generate an image by denoising/decoding from random noise
- This works because one complicated distribution can be modeled as a sequence of transformations applied to a simple Gaussian distribution
- The loss function usually involves a KL divergence between the encoder/forward (noising) model, which reflects the real training data distribution, and the decoder/denoising model
- There are many variations of the loss function: some optimize a bound on the log-likelihood built from KL-divergence terms, others derive the loss from variational Bayesian inference on a graphical model
- The diffusion model is the most popular choice because it is both flexible and tractable, though generating images with it is expensive
- Self-attention and cross-attention components can be added inside a diffusion step to
- Predict the mean and variance of the Gaussian distribution
- Cross-reference different patches of an image
- Cross-reference image and text
- Cross-reference different video frames to achieve frame-level consistency
- CLIP is a popular model that embeds text and images into a shared space; it is trained on text-image pairs and is widely used to condition text-to-image generation
- An image generation pipeline can chain several diffusion models, one diffusion model per resolution-upsampling stage
- To reduce the sampling cost during generation, some parameters can be tuned so the next value is computed directly from trained parameters or predicted means
- Another speedup technique is to skip or reduce the number of steps in the diffusion process
- Convolution is a common technique for changing image and video size, in both 2D and 3D (spatio-temporal) space
- Popular architectures and components used in various places of a large model and in various ways:
- U-Net, Transformer, self-attention, cross-attention, CLIP, diffusion models, diffusion pipelines, convolution, residual connections
From GAN to WGAN
Source: https://lilianweng.github.io/posts/2017-08-20-gan/
A discriminator model learns to tell fake images from real images
A generator model learns to generate fake images realistic enough to fool the discriminator
Goal: make the generator output images that are as realistic as possible
Loss function:
Minimize the discriminator's classification loss while maximizing the generator's chance of fooling the discriminator; the two are trained as a minimax game (objective written out below)
Limitation: unstable training, slow convergence
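For reference, the minimax objective from the post, where $D$ is the discriminator, $G$ the generator, and $p_z$ the noise prior:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$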
From Autoencoder to Beta-VAE
Source: https://lilianweng.github.io/posts/2018-08-12-vae/
- Train an encoder-decoder to generate images; the idea is compress then decompress
- Images are high-dimensional data but can be compressed into a low-dimensional space because each image obeys many constraints, e.g. object-to-object relationships, light rays and shading
- Denoising Autoencoder: corrupt the input image with noise and train the model to reconstruct the clean image; this is similar to masking tokens in the input when training a language model
- Sparse Autoencoder: select only top k activated nodes for next layer calculation
- Contractive Autoencoder: adds a term in the loss function to penalize the representation being too sensitive to the input
- VAE (Variational Autoencoder): model the process with variational Bayesian methods and a graphical model, which maps naturally onto an encoder-decoder architecture
- By minimizing the loss, we are maximizing the lower bound (ELBO) of the probability of generating real data samples
- Reparameterization Trick:
- The loss function requires computing an expectation, which involves sampling data
- Gradient descent / backpropagation cannot flow through a random sampling operation
- So instead of sampling directly, assume a Gaussian distribution and write the sample as the mean plus the standard deviation times standard noise
- The random sampling part is reparameterized into a separate noise variable that can be ignored during backpropagation but still provides stochasticity at generation time (see the sketch after this list)
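A minimal PyTorch sketch of the trick (names here are illustrative, not from the post): the sample is rewritten as mean + std * noise, so gradients flow through the mean and variance while the noise stays outside the graph.

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so backprop can flow through
    mu and log_var (the encoder outputs) without differentiating
    through a sampling operation.
    """
    std = torch.exp(0.5 * log_var)   # sigma recovered from log-variance
    eps = torch.randn_like(std)      # pure noise, no gradient needed
    return mu + std * eps

# usage: mu and log_var would come from the encoder
mu, log_var = torch.zeros(4, 8), torch.zeros(4, 8)
z = reparameterize(mu, log_var)      # differentiable w.r.t. mu and log_var
```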
- Beta-VAE: put a weight β on the KL term that pushes the encoder's latent distribution toward the prior, encouraging disentangled latents
- VQ-VAE and VQ-VAE-2: make the hidden state z discrete and finite; z takes values from a limited codebook, organized hierarchically in VQ-VAE-2
- Temporal Difference VAE: works with sequential data
Flow-based Deep Generative Models
Source: https://lilianweng.github.io/posts/2018-10-13-flow-models/#made
Here is a quick summary of the difference between GAN, VAE, and flow-based generative models:
- Generative adversarial networks: GAN frames data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish the real data from the fake samples produced by the generator model, and the two models are trained as if playing a minimax game.
- Variational autoencoders: VAE implicitly optimizes the log-likelihood of the data by maximizing the evidence lower bound (ELBO).
- Flow-based generative models: a flow-based generative model is constructed from a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution p(x), so the loss function is simply the negative log-likelihood.
- Normalizing Flows
- Good density estimation is critical in machine learning: training with backpropagation needs a probability distribution whose density and derivatives can be computed easily and quickly
- Gaussian distributions are often used in latent-variable generative models because they are simple, even though real-world distributions are much more complicated
- We can obtain a more complicated distribution by gradually transforming a Gaussian distribution in multiple invertible steps
- With $\mathbf{x} = \mathbf{z}_K = f_K \circ \dots \circ f_1(\mathbf{z}_0)$, the final log-likelihood $\log p(\mathbf{x})$ is
  $$\log p(\mathbf{x}) = \log \pi_0(\mathbf{z}_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i}{\partial \mathbf{z}_{i-1}} \right|$$
- where each $f_i$ should satisfy
- It is easily invertible.
- Its Jacobian determinant is easy to compute.
- One implementation is RealNVP (affine coupling layers)
- Since neither the inverses of s and t nor their Jacobians need to be computed, s and t can be arbitrarily complex, e.g. neural networks (see the coupling-layer sketch below)
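A rough sketch of one RealNVP-style affine coupling layer (the s and t nets here are placeholder MLPs, not the paper's exact architecture): half the dimensions pass through unchanged and condition the affine transform of the other half, so the inverse and log-determinant stay cheap even when s and t are deep networks.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y1 = x1; y2 = x2 * exp(s(x1)) + t(x1)  (RealNVP-style coupling)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        self.s = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.half), nn.Tanh())
        self.t = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.half))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.s(x1), self.t(x1)
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)            # log|det J| is just the sum of s; no Jacobian of s or t needed
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.s(y1), self.t(y1)
        x2 = (y2 - t) * torch.exp(-s)     # exact inverse without inverting s or t themselves
        return torch.cat([y1, x2], dim=1)
```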
- Autoregressive Flows
- To model sequential data, each output depends only on the data observed in the past, not on future values
- implementations
- MADE:
- Processes all states at the same time by feeding them into the model together
- In each layer, assign an order (degree) to each node; a node may only receive input from previous-layer nodes whose order is ≤ its own, which guarantees the autoregressive constraint (see the mask sketch after this block)
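A toy illustration of how MADE-style masks can be built (the degree assignment here is a simple random scheme; this is a sketch, not the paper's exact recipe): a hidden unit with degree k may only receive from inputs with degree ≤ k, and output i may only receive from hidden units with degree < i, so output i never depends on inputs i or later.

```python
import numpy as np

def made_masks(n_in: int, n_hidden: int, seed: int = 0):
    """Build connectivity masks that enforce the autoregressive property."""
    rng = np.random.default_rng(seed)
    deg_in = np.arange(1, n_in + 1)                              # input/output unit i has degree i
    deg_hidden = rng.integers(1, n_in, size=n_hidden)            # hidden degrees in [1, n_in - 1]
    mask_hidden = (deg_hidden[:, None] >= deg_in[None, :])       # hidden unit sees inputs with degree <= its own
    mask_out = (deg_in[:, None] > deg_hidden[None, :])           # output i sees hidden units with degree < i
    return mask_hidden.astype(float), mask_out.astype(float)

# usage: multiply these masks elementwise with the layer weight matrices
m_hidden, m_out = made_masks(n_in=5, n_hidden=16)
```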
- PixelRNN / PixelCNN:
- For image data, each pixel depends only on the pixels that come before it in raster order (earlier rows, or the same row at lower columns)
- Different convolution techniques are used to increase efficiency (e.g. smaller convolution windows)
- WaveNet
- For 1-D audio data
- Each output depends on a number of inputs from previous time steps
- One option is to make the output depend on the n immediately preceding inputs, but n can't be too large, and this fails for long sequences where an output may need input from far in the past
- The solution is dilated convolution, where the n inputs the output depends on are sampled from far back in the sequence instead of being the immediately preceding n inputs (see the sketch below)
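A minimal illustration of the dilated-convolution idea (not WaveNet's full gated residual block): stacking 1-D causal convolutions with exponentially growing dilation lets the receptive field reach far into the past without a huge kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Stack of 1-D convolutions with dilations 1, 2, 4, ...; left-padding keeps them causal."""
    def __init__(self, channels: int = 16, layers: int = 4, kernel: int = 2):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=kernel, dilation=2 ** i)
            for i in range(layers)
        ])
        self.kernel = kernel

    def forward(self, x):                        # x: (batch, channels, time)
        for i, conv in enumerate(self.convs):
            pad = (self.kernel - 1) * (2 ** i)   # pad only on the left, so no future leakage
            x = conv(F.pad(x, (pad, 0)))
        return x

# receptive field grows to 2^layers time steps
y = DilatedCausalStack()(torch.randn(1, 16, 100))
```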
- Masked Autoregressive Flow / Inverse Autoregressive Flow
- Both are normalizing flows whose transformation layers are built as autoregressive neural networks
- Goal: given the known base distribution p(z), generate x and estimate p(x)
- Masked Autoregressive Flow (MAF): compute x_i from all previous x and the current z_i, so data generation is slow (sequential) while density estimation is fast because it only depends on the known p(z) and can be done in one pass
- Inverse Autoregressive Flow (IAF): compute x_i from all previous z, so data generation is fast because all x follow from the known z in one pass, while density estimation of a given x is slow because the corresponding z values must be recovered sequentially (equations below)
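Written out, the two flows differ only in what the shift and scale networks condition on (a standard way to state it; $\mu_i, \sigma_i$ denote the learned autoregressive functions):

- MAF: $x_i = z_i \odot \sigma_i(x_{1:i-1}) + \mu_i(x_{1:i-1})$ — generating $x$ is sequential, but recovering $z$ from a given $x$ (density estimation) is one parallel pass
- IAF: $x_i = z_i \odot \sigma_i(z_{1:i-1}) + \mu_i(z_{1:i-1})$ — generating $x$ from known $z$ is one parallel pass, but recovering $z$ from a given $x$ is sequential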
Diffusion Model
Source: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- Basic idea
- Add Gaussian noise to the input image step by step until it becomes white noise
- In the generation step, reverse the process by denoising from white noise step by step
- The loss function measures the difference between the forward (noising) distribution and the learned denoising model
- The forward Gaussian process is fixed and applied to real training images, so it reflects the training data distribution
- The goal is to learn the denoising model
- Another training option is to train an estimator of the score, i.e. the gradient of log q(x) (score matching), instead of predicting the noise directly; the forward step and simplified loss are written out below
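For reference (notation as in the source post, with $\alpha_t = 1-\beta_t$, $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$, and $\epsilon_\theta$ the noise-prediction network), a forward step, the closed-form jump from $\mathbf{x}_0$, and the simplified training loss:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right)$$
$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon, \quad \boldsymbol\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
$$L_{\text{simple}} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol\epsilon}\left[\left\| \boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \right\|^2\right]$$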
- Tricks
- Make the forward noise schedule follow a cosine function of t instead of a linear one
- Learn the denoising model's variance as well, instead of fixing it
- Additional image-class information can be added to the denoising process (classifier guidance)
- First train an image classifier
- During denoising, given a target class, add the classifier's gradient, scaled by a guidance weight, to the predicted noise / Gaussian mean at each step (formula below)
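In noise-prediction form, classifier guidance shifts the predicted noise by the scaled classifier gradient, where $s$ is the guidance weight and $p_\phi(y \mid \mathbf{x}_t)$ the separately trained classifier:

$$\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t, t) = \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) - s\,\sqrt{1-\bar\alpha_t}\;\nabla_{\mathbf{x}_t} \log p_\phi(y \mid \mathbf{x}_t)$$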
- Speeding up the diffusion process, 3 options
- Denoising Diffusion Implicit Model (DDIM): by setting the per-step sampling variance to 0, each reverse step computes the next value deterministically instead of sampling, while the noise level still decreases along the process; this increases inference speed (see the update formula after this list)
- Skip some sampling steps
- Iteratively halve the number of sampling steps by having a teacher model teach a student model at each round (progressive distillation)
- Consistency model: train a function that maps x_t at any t directly back to the clean x_0; 2 sub-options
- Distill it from an existing diffusion model, with the loss comparing the function's outputs along the teacher's trajectory
- Train the function independently from scratch
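The DDIM update mentioned above: first estimate $\mathbf{x}_0$ from the current noise prediction, then step; with $\sigma_t = 0$ the step is deterministic and can be taken on a sparse subsequence of timesteps:

$$\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\left(\frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar\alpha_t}}\right) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\boldsymbol\epsilon_\theta(\mathbf{x}_t, t) + \sigma_t\,\boldsymbol\epsilon_t$$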
- Latent variable space: use an autoencoder to compress the image into a latent space first, then run the diffusion model there (latent diffusion)
- Scale up Generation Resolution and Quality
- Use multiple diffusion models, one for each resolution-upsampling stage
- Use Gaussian noise augmentation for the low-resolution stage
- Use Gaussian blur augmentation for the high-resolution stage
- A CLIP model supplies the text (and image) embeddings used for conditioning, e.g. via cross-attention
- Model Architecture
- U-Net, and ControlNet, which copies part of the network to inject an additional conditioning image through convolutions
- Transformer backbones (DiT) that predict the noise and variance
- summary
- Pro: both tractable and flexible, while many other model families only have one of the two
- Con: quite expensive in time and compute because generation relies on a long Markov chain of sampling steps
Diffusion Model For Video
Source: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/
Interesting model architecture and implementation ideas
- Extend 2D diffusion model to 3D. The extra dimension is time
- Create separate Spatial and Temporal diffusion model and cascade them together
- Spatiotemporal SR layers contain pseudo-3D convolution layers and pseudo-3D attention layers (see the sketch after this block)
- A pseudo-3D attention layer consists of a spatial attention layer followed by a separate temporal attention layer
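A rough sketch of the factorized "pseudo-3D" idea (illustrative, not any specific paper's exact layer): a 2-D convolution over each frame followed by a 1-D convolution across frames approximates a full 3-D convolution at much lower cost, and lets the spatial part be initialized from a pre-trained image model.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized space/time convolution: 2-D conv per frame, then 1-D conv across frames."""
    def __init__(self, channels: int, k_space: int = 3, k_time: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, k_space, padding=k_space // 2)
        self.temporal = nn.Conv1d(channels, channels, k_time, padding=k_time // 2)

    def forward(self, x):                                                # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.spatial(x.transpose(1, 2).reshape(b * t, c, h, w))      # per-frame 2-D conv
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1)              # -> (B, H, W, C, T)
        y = self.temporal(y.reshape(b * h * w, c, t))                    # per-pixel 1-D conv over time
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)           # back to (B, C, T, H, W)

out = Pseudo3DConv(8)(torch.randn(2, 8, 16, 32, 32))
```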
- Divide a video into small spatio-temporal patches and apply attention over these patches
- Divide the whole model into multiple layers; each layer upsamples to a different resolution and contains both a spatial and a temporal diffusion model
- One goal is video editing: given a text input and a video input, generate a new video
- To incorporate the additional video input, copy the current downsampling model's parameters into a separate model that transforms the input video into low-dimensional latent states
- Then apply cross-attention between the text embedding and the video embedding
- Project a video into one long image with each frame as a tile, run the diffusion model on it, and add attention components that join different frames to keep them consistent
- Add a frame interpolation network, increasing the effective frame rate by interpolating between generated frames. This is a fine-tuned model for the task of predicting masked frames for video upsampling.
- During training, split a video into a content component (represented by text) and a structure component (frames of the input video) and apply cross-attention between text and video
- During inference, only text is given as input and the video is generated from it
- Pre-train text to image diffusion model, freeze it, add temporal diffusion layer and fine tune on video data.
- Enforce temporally coherent reconstructions across frames with a video-aware discriminator that judges frame quality during decoding
- We can pretrain on text to image data, pre-train on curated video data separately and finally fine tune on high quality video data
- adapt a pre-trained text-to-image model to output videos without any training
- Generate raw frames with motion info
- Define a direction function for controlling the global scene and camera motion
- Generate the first frame randomly and downsample it using the diffusion model
- Combine the first frame with the direction function, then upsample to produce raw frames that carry the motion
- Run the diffusion model to generate the full video based on the raw motion frames
- Reprogram frame-level self-attention as a cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object (see the sketch below)
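A simplified sketch of the cross-frame attention idea (shapes and the single-head formulation are illustrative assumptions): every frame's queries attend to keys and values taken from the first frame, which anchors appearance and identity across frames.

```python
import torch

def cross_frame_attention(q, k, v):
    """q, k, v: (frames, tokens, dim). Keys/values are replaced by frame 0's,
    so each frame attends to the first frame instead of to itself."""
    k0 = k[:1].expand_as(k)                 # broadcast frame-0 keys to every frame
    v0 = v[:1].expand_as(v)                 # broadcast frame-0 values to every frame
    attn = torch.softmax(q @ k0.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0

frames, tokens, dim = 8, 64, 32
q = torch.randn(frames, tokens, dim)
out = cross_frame_attention(q, q, q)        # self-attention rewired onto the first frame
```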