AI Reading Notes: Image And Video Gen

 

Overall highlights

  • A common approach is to first encode an image (or add noise to it) so it becomes a low-dimensional latent representation, and then generate a new image by denoising/decoding from random noise
    • This works because one complicated data distribution can be modeled as a sequence of transformations of a simple Gaussian distribution
  • The loss function usually involves a KL divergence between the encoder/forward (noising) model, which reflects the real training data distribution, and the decoder/denoising model
    • There are many variations of the loss function. Some optimize a variational bound expressed with KL divergence terms; others derive the loss from variational Bayesian methods and graphical models.
  • The diffusion model is the most popular approach because it is both flexible and tractable, though generating images with it is expensive
  • Self-attention and cross-attention components can be added inside a single diffusion step to (see the cross-attention sketch after this list)
    • Predict the Gaussian distribution's mean and variance
    • Cross-reference different patches of an image
    • Cross-reference image and text
    • Cross-reference different video frames to achieve frame-level consistency
  • CLIP is a popular model linking text and images; it is trained on paired image-text data and is widely used to condition text-to-image generation
  • An image generation pipeline can chain several diffusion models, one diffusion model per resolution upscaling stage
  • To reduce the computation cost of sampling during generation, we can tune some parameters and compute the next value directly from the trained parameters or predicted means
  • Another speedup technique is to skip or reduce steps in the diffusion process
  • Convolution is a common technique to transform image and video size in both 2D and 3D space
  • Popular architectures and components used in various places of a large model, in various ways:
    • U-Net, Transformer, self-attention, cross-attention, CLIP, diffusion models, diffusion model pipelines, convolution, residual connections
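
A minimal sketch of the cross-attention idea above, assuming PyTorch; the dimensions, tensor shapes, and module name are illustrative placeholders, not taken from any specific model:

    # Cross-attention block of the kind used inside a diffusion step to let image
    # features attend to text embeddings (dimensions below are illustrative).
    import torch
    import torch.nn as nn

    class CrossAttention(nn.Module):
        def __init__(self, img_dim=320, txt_dim=768, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(embed_dim=img_dim, num_heads=n_heads,
                                              kdim=txt_dim, vdim=txt_dim,
                                              batch_first=True)
            self.norm = nn.LayerNorm(img_dim)

        def forward(self, img_tokens, txt_tokens):
            # img_tokens: (batch, h*w, img_dim) flattened image/latent patches
            # txt_tokens: (batch, seq_len, txt_dim) text encoder output (e.g. CLIP)
            attended, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
            return self.norm(img_tokens + attended)   # residual connection

    x = torch.randn(2, 64 * 64, 320)   # flattened latent image features
    c = torch.randn(2, 77, 768)        # text embedding sequence
    out = CrossAttention()(x, c)       # shape (2, 4096, 320)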


From GAN to WGAN

Source: https://lilianweng.github.io/posts/2017-08-20-gan/


A discriminator model learns to tell fake images from real images.

A generator model learns to generate fake images that look as real as possible, to fool the discriminator.

Goal: make the generator output images that look as real as possible.

Loss function:

The discriminator is trained to minimize its classification loss, while the generator is trained to maximize its chance of fooling the discriminator; the two play a minimax game.
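
For reference, the standard GAN minimax objective (the well-known formula, not written out in the original notes):

    \min_G \max_D \; V(D, G) =
      \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] +
      \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]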

Limitations: unstable training and slow convergence.


From Autoencoder to Beta-VAE

Source: https://lilianweng.github.io/posts/2018-08-12-vae/



  • Train an encoder-decoder to generate images. The idea is compress and decompress.
  • Images are high-dimensional data but can be compressed into a low-dimensional space because each image has many constraints, e.g. objects and object relationships, light rays and shade
  • Denoising Autoencoder: add noise to the input image to augment the training data. It is like masking tokens in the input when training a language model.
  • Sparse Autoencoder: select only the top-k activated nodes for the next layer's computation
  • Contractive Autoencoder: adds a term to the loss function to penalize the representation for being too sensitive to the input
  • VAE (Variational Autoencoder): model the process with the methods of variational Bayes and graphical models, which is analogous to an encoder-decoder model
    • By minimizing the loss, we are maximizing the lower bound (ELBO) of the probability of generating real data samples.
    • Reparameterization trick (see the sketch after this list):
      • The loss function requires computing an expectation, which involves sampling latent variables.
      • Gradient-descent backpropagation cannot flow through a random sampling operation.
      • So instead of sampling directly, assume a Gaussian distribution and write the sample as z = μ + σ ⊙ ε, where the mean μ and standard deviation σ are the encoder's outputs.
      • The random part ε is reparameterized into an external input, so it can be ignored during backpropagation while still supplying randomness in the generation phase.
    • Beta-VAE: adds a weight β on the KL term between the encoder's posterior and the prior, encouraging disentangled latent representations
    • VQ-VAE and VQ-VAE-2: make the hidden state z discrete and finite; z can only take values from a limited (in VQ-VAE-2, hierarchical) codebook
    • Temporal Difference VAE (TD-VAE): works with sequential data
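
A minimal sketch of the reparameterization trick, assuming PyTorch; the tensor sizes are illustrative:

    import torch

    def reparameterize(mu, log_var):
        """Sample z = mu + sigma * eps so gradients flow through mu and sigma."""
        std = torch.exp(0.5 * log_var)   # sigma, predicted by the encoder
        eps = torch.randn_like(std)      # external randomness; no gradient flows into it
        return mu + std * eps

    # Illustrative encoder outputs: batch of 4, latent dimension 16
    mu = torch.zeros(4, 16, requires_grad=True)
    log_var = torch.zeros(4, 16, requires_grad=True)
    z = reparameterize(mu, log_var)
    z.sum().backward()                   # gradients reach mu and log_var despite the sampling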


Flow-based Deep Generative Models

Source: https://lilianweng.github.io/posts/2018-10-13-flow-models/#made


Here is a quick summary of the difference between GAN, VAE, and flow-based generative models:

  1. Generative adversarial networks: GAN provides a smart solution to model the data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish the real data from the fake samples that are produced by the generator model. Two models are trained as they are playing a minimax game.

  2. Variational autoencoders: VAE implicitly optimizes the log-likelihood of the data by maximizing the evidence lower bound (ELBO).

  3. Flow-based generative models: A flow-based generative model is constructed by a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution p(x), and therefore the loss function is simply the negative log-likelihood.

  • Normalizing flow

    • Good density estimation is critical in machine learning, and backpropagation needs probability distributions whose densities and derivatives are easy to compute
    • A Gaussian distribution is often used in latent variable generative models because it is simple, even though real-world distributions are much more complicated
    • We can obtain a more complicated distribution by gradually transforming a Gaussian distribution through multiple invertible steps


    • With x = z_K = f_K ∘ … ∘ f_1(z_0), the final log-likelihood is

      log p(x) = log π_0(z_0) − Σ_{i=1..K} log |det(∂f_i / ∂z_{i−1})|

    • where each f_i should satisfy

      • It is easily invertible.
      • Its Jacobian determinant is easy to compute.
    • One implementation is RealNVP, built from affine coupling layers (see the coupling-layer sketch after this list)

      • Since neither the inverse of s and t nor the Jacobian of s and t needs to be calculated, s and t can be arbitrarily complex, e.g. a neural network


  • Autoregressive Flows

    • To model sequential data, each output only depends on the data observed in the past, but not on the future ones.
    • implementations
      • MADE
        • processes all positions at the same time by feeding them into the model together
        • In each layer, assign an order to each node. A node can only receive input from previous-layer nodes whose order is ≤ its own. This guarantees the autoregressive constraint
      • PixelRNN / PixelCNN
        • For image data, a pixel in the next layer only depends on previous-layer pixels that come before it (an earlier row, or the same row and an earlier column)
        • Different convolution techniques are used to increase computational efficiency (e.g. smaller convolution windows)
      • WaveNet
        • For 1-D audio data
        • Each output depends on a number of inputs from previous time steps.
        • One option is for the output to depend on the n immediately preceding inputs, but n can't be too large, and that doesn't work for long sequences where an output may need input from long ago
        • One solution is dilated convolution, where the n inputs it depends on are spread (with gaps) far back into the sequence instead of being the immediately preceding n inputs
  • Masked Autoregressive Flow / Inverse Autoregressive Flow

    • a type of normalizing flow where the transformation layer is built as an autoregressive neural network.
    • Goal: knowing z's distribution p(z), estimate x and its density p(x)

    • Masked Autoregressive Flow (MAF): estimate x_i based on all previous x and the current z_i. So data generation is slow (sequential over i), while density estimation is fast because it only depends on the known p(z) and can be done in one pass
    • Inverse Autoregressive Flow (IAF): estimate x_i based on the previous z's. So data generation is fast because it depends only on the known z and runs in one pass, while density estimation is slow because the z's have to be recovered from x sequentially (see the formulas after this list)
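
For concreteness, the standard affine autoregressive transforms behind MAF and IAF (well-known formulations, not written out in these notes):

    % MAF: parameters depend on the previously generated x (slow sampling, fast density)
    x_i = z_i \exp(\alpha_i) + \mu_i, \qquad (\mu_i, \alpha_i) = f_\theta(x_{1:i-1})

    % IAF: parameters depend on the previously known z (fast sampling, slow density)
    x_i = z_i \exp(\alpha_i) + \mu_i, \qquad (\mu_i, \alpha_i) = f_\theta(z_{1:i-1})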
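
And a minimal sketch of a RealNVP-style affine coupling layer, assuming PyTorch; network sizes are illustrative. It shows why both the inverse and the log-determinant are cheap even when s and t are arbitrary neural networks:

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        """RealNVP-style coupling layer: split x into (x1, x2); x1 passes through
        unchanged, x2 is scaled/shifted by networks s and t that only see x1."""
        def __init__(self, dim=4):
            super().__init__()
            half = dim // 2
            self.s = nn.Sequential(nn.Linear(half, 64), nn.Tanh(), nn.Linear(64, half))
            self.t = nn.Sequential(nn.Linear(half, 64), nn.Tanh(), nn.Linear(64, half))

        def forward(self, x):
            x1, x2 = x.chunk(2, dim=-1)
            s, t = self.s(x1), self.t(x1)
            y2 = x2 * torch.exp(s) + t
            log_det = s.sum(dim=-1)            # Jacobian is triangular: det = prod(exp(s))
            return torch.cat([x1, y2], dim=-1), log_det

        def inverse(self, y):
            y1, y2 = y.chunk(2, dim=-1)
            s, t = self.s(y1), self.t(y1)      # inversion never inverts s or t themselves
            x2 = (y2 - t) * torch.exp(-s)
            return torch.cat([y1, x2], dim=-1)

    layer = AffineCoupling(dim=4)
    x = torch.randn(8, 4)
    y, log_det = layer(x)
    assert torch.allclose(layer.inverse(y), x, atol=1e-4)   # exact invertibility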


Diffusion Model

Source: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/


  • Basic idea
    • Add Gaussian noise to the input image step by step until it becomes white noise.
    • In the generation step, reverse the process by denoising from white noise step by step.
    • The loss function is the difference between the forward (noising) distribution and the learned denoising distribution (see the training sketch after this list).
      • The forward Gaussian model's parameters are fixed by a noise schedule and the statistics of the training input
      • The goal is to learn the denoising model
    • Another training option is to train an estimator of the gradient of log q(x), i.e. the score function
  • Tricks
    • Make the forward noise schedule follow a sinusoidal (cosine-based) function of t instead of a linear one
    • Train the denoising model's variance as well, instead of fixing it
  • Additional image-category information can be added to the denoising process (classifier guidance; see the formula after this list)
    • First train an image classifier
    • During the denoising process, given a target class, add the classifier's gradient, scaled by a weight, to the mean of the Gaussian at each reverse step
  • Speed up the diffusion process; several options
    • Denoising Diffusion Implicit Model (DDIM): by setting some variance parameters to 0, instead of sampling at each reverse step we can compute the next value deterministically while keeping the same noise trajectory. It increases inference speed
    • Skip some sampling steps
    • Iteratively halve the number of sampling steps by using a teacher model to teach a student model at each stage (progressive distillation)
    • Consistency model: train a function that maps x_t at any t directly back to the start of the trajectory x_0. Two sub-options
      • Loss as the difference between the original diffusion model's trajectory and the function's outputs (distillation)
      • Train the function independently
  • Latent variable space: use an autoencoder to compress the image into a latent space first before sending it to the diffusion model
  • Scale up generation resolution and quality
    • Use multiple diffusion models in a cascade, one per resolution stage
    • Use Gaussian noise augmentation on the conditioning input at low resolution
    • Use Gaussian blur augmentation at high resolution
    • Use a CLIP text encoder to embed the text prompt and feed it in via cross-attention
  • Model architecture
    • U-Net; ControlNet adds a trainable copy with convolution layers to incorporate an additional conditioning image
    • Transformer (e.g. a diffusion Transformer) to predict the noise and variance
  • summary
    • Pro: both tractable and flexible, while many other models lack one or the other
    • Cons: quite expensive in time and compute because it relies on a long Markov chain of sampling steps
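
A minimal sketch of the forward noising process and the noise-prediction training loss in the DDPM style, assuming PyTorch; eps_model and the schedule values are illustrative placeholders:

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)            # illustrative linear noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product, \bar{alpha}_t

    def training_loss(eps_model, x0):
        """One training step: noise x0 to a random step t, ask the model to predict the noise."""
        b = x0.shape[0]
        t = torch.randint(0, T, (b,))
        eps = torch.randn_like(x0)
        a_bar = alphas_bar[t].view(b, 1, 1, 1)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # closed-form forward q(x_t | x_0)
        eps_pred = eps_model(x_t, t)                         # the denoising network predicts eps
        return torch.mean((eps - eps_pred) ** 2)             # "simple" DDPM loss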
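
And for the classifier-guidance bullet above, the standard mean shift (a known result from the guided-diffusion literature, not spelled out in these notes):

    \tilde{\mu}_\theta(x_t, t)
      = \mu_\theta(x_t, t)
      + w \,\Sigma_\theta(x_t, t)\, \nabla_{x_t} \log p_\phi(y \mid x_t)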


Diffusion Model For Video

Source: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/

Interesting model architecture and implementation ideas

  • Extend 2D diffusion model to 3D. The extra dimension is time
  • Create separate spatial and temporal diffusion models and cascade them together
    • Spatiotemporal super-resolution layers contain pseudo-3D convolution layers and pseudo-3D attention layers
    • Pseudo-3D attention layers contain separate spatial and temporal attention layers (see the sketch after this list)
  • Divide a video into small temporal and spatial patches and apply attention over these patches
  • Divide the whole model into multiple layers; each layer upsamples a different resolution, and each layer contains both spatial and temporal diffusion models
  • One goal is video editing: given a text input and a video input, generate a new video.
    • To incorporate the additional video input, copy the current downsampling model's parameters to create a separate model that transforms the input video into low-dimensional latent states.
    • Then apply cross-attention between the text embedding and the video embedding
  • Project a video into one long picture with each frame as part of the picture, run the diffusion model on it, and also add attention components that join different frames to achieve consistency among frames
  • Add a frame interpolation network, increasing the effective frame rate by interpolating between generated frames. This is a fine-tuned model for the task of predicting masked frames for video upsampling.
  • During training, divide a video into a content component (represented by text) and a structure component (snapshots of the input video), and run cross-attention between text and video
    • During inference, text is the input and the video is generated based on it
  • Pre-train a text-to-image diffusion model, freeze it, add temporal diffusion layers, and fine-tune on video data.
    • Enforce temporally coherent reconstructions across frames with a video-aware discriminator that judges which frames are good during decoding
    • We can pre-train on text-image data, pre-train on curated video data separately, and finally fine-tune on high-quality video data
  • Adapt a pre-trained text-to-image model to output videos without any training
    • Generate raw frames with motion info
      • Define a direction function for controlling the global scene and camera motion
      • Generate the first frame randomly and downsample it using the diffusion model
      • Combine the first frame and the direction function, and finally upsample them to generate frames with motion
    • Run the diffusion model to generate the full video based on the raw frames with motion
      • Reprogram frame-level self-attention as cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object.
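
A minimal sketch of the pseudo-3D (factorized spatial + temporal) attention idea referenced in the list above, assuming PyTorch; shapes, dimensions, and the class name are illustrative:

    import torch
    import torch.nn as nn

    class Pseudo3DAttention(nn.Module):
        """Factorized spatio-temporal attention: attend within each frame (spatial),
        then attend across frames at each spatial location (temporal)."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):
            # x: (batch, frames, tokens_per_frame, dim)
            b, f, n, d = x.shape
            xs = x.reshape(b * f, n, d)                 # spatial attention within each frame
            xs = xs + self.spatial(xs, xs, xs)[0]
            xt = xs.reshape(b, f, n, d).permute(0, 2, 1, 3).reshape(b * n, f, d)
            xt = xt + self.temporal(xt, xt, xt)[0]      # temporal attention across frames
            return xt.reshape(b, n, f, d).permute(0, 2, 1, 3)

    video = torch.randn(2, 8, 16 * 16, 256)             # 8 frames of 16x16 latent tokens
    out = Pseudo3DAttention()(video)                     # same shape: (2, 8, 256, 256)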
