AI Reading Notes: Image And Video Gen

 

Overall highlights

  • A common approach is to encode an image (or gradually add noise to it) into a low-dimensional latent representation first, and then generate new images by denoising/decoding random noise back into image space
    • This works because one complicated data distribution can be modeled as a sequence of transformations of a simple Gaussian distribution
  • The loss function usually involves a KL divergence between the encoder/forward (noising) model, which reflects the real training-data distribution, and the decoder/denoising model
    • There are many variations of the loss function. Some minimize an upper bound on the negative log-likelihood derived via KL divergence (the negative ELBO); others derive the loss from variational Bayesian methods and graphical models.
  • The diffusion model is the most popular approach because it is both flexible and tractable, though generating images with it is expensive.
  • Self-attention and cross-attention components can be added inside a diffusion step to (see the cross-attention sketch after this list):
    • Predict the Gaussian distribution's mean and variance
    • Cross-reference different patches of an image
    • Cross-reference image and text
    • Cross-reference different video frames to achieve frame-level consistency
  • CLIP is a popular model that learns a joint text–image embedding from text–image pair data; it is widely used to condition and guide text-to-image generation
  • An image-generation pipeline can chain several diffusion models, one per resolution upscaling stage.
  • To reduce the cost of sampling during generation, we can fix some parameters and compute the next value directly from trained parameters or predicted means
  • Another speed-up technique is to skip or reduce steps in the diffusion process
  • Convolution is a common technique for resizing images and videos, in both 2D and 3D (spatiotemporal) space
  • Popular architectures and components used in various places of a large model, in various ways:
    • U-Net, Transformer, self-attention, cross-attention, CLIP, diffusion models, diffusion pipelines, convolution, residual connections
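As a concrete illustration of the cross-attention idea above, here is a minimal sketch, assuming PyTorch; the dimensions and the class name are made up for illustration. Image latent tokens (queries) attend to text embeddings (keys/values):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Image latent tokens (queries) attend to text tokens (keys/values).

    Single-head for clarity; real models add multiple heads, normalization, etc.
    """
    def __init__(self, latent_dim=320, text_dim=768):
        super().__init__()
        self.q = nn.Linear(latent_dim, latent_dim)
        self.k = nn.Linear(text_dim, latent_dim)
        self.v = nn.Linear(text_dim, latent_dim)
        self.scale = latent_dim ** -0.5

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (batch, num_patches, latent_dim)
        # text_tokens:  (batch, num_words,   text_dim)
        q, k, v = self.q(image_tokens), self.k(text_tokens), self.v(text_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # each image patch becomes a weighted mix of text features

# Usage: 64 image patches conditioned on a 10-token prompt
out = CrossAttention()(torch.randn(2, 64, 320), torch.randn(2, 10, 768))
```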


From GAN to WGAN

Source: https://lilianweng.github.io/posts/2017-08-20-gan/


A discriminator model learns to tell fake images from real ones.

A generator model learns to generate fake images realistic enough to fool the discriminator.

Goal: make the generator's output images as realistic as possible.

Loss function:

Minimize the discriminator's classification loss, while maximizing the generator's chance of successfully fooling the discriminator (a minimax game; see the sketch below).

Limitation: unstable training, slow convergence
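A minimal sketch of one training step of this minimax game, assuming PyTorch and toy 2-D data (the model sizes, data, and learning rates here are arbitrary):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # generator: noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator: sample -> logit
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)

real = torch.randn(64, 2) + 3.0          # stand-in for a batch of real data
fake = G(torch.randn(64, 8))             # generator output from random noise

# Discriminator step: label real as 1 and fake as 0 (minimize its classification loss)
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator label fakes as real
g_loss = bce(D(fake), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```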


From Autoencoder to Beta-VAE

Source: https://lilianweng.github.io/posts/2018-08-12-vae/



  • Train an encoder–decoder pair to generate images. The idea is to compress, then decompress.
  • Images are high-dimensional data but can be compressed into a low-dimensional space because each image obeys many constraints, e.g. object–object relationships, light rays and shading
  • Denoising Autoencoder: add noise to the input image to augment the training data. It is similar to masking tokens in the input when training a language model
  • Sparse Autoencoder: keep only the top-k activated nodes for the next layer's computation
  • Contractive Autoencoder: adds a term in the loss function to penalize the representation being too sensitive to the input
  • VAE (Variational Autoencoder): model the process with variational Bayesian and graphical-model methods, which map naturally onto an encoder–decoder architecture
    • By minimizing the loss, we are maximizing the lower bound (ELBO) on the log-probability of generating real data samples.
    • Reparameterization Trick (see the sketch after this list):
      • The loss function requires an expectation, which involves sampling data.
      • Backpropagation cannot flow through a random sampling operation.
      • So instead of sampling the latent directly, assume a Gaussian posterior and write the sample as z = μ + σ ⊙ ε with ε ~ N(0, I)
      • The randomness is pushed into ε, which has no learnable parameters and can be ignored during backpropagation, while μ and σ receive gradients; the same machinery is reused at generation time
    • Beta-VAE: weight the KL term between the approximate posterior and the prior by a factor β > 1 to encourage disentangled latents
    • VQ-VAE and VQ-VAE-2: make the hidden state z discrete and finite; z can only take values from a limited codebook, organized hierarchically in VQ-VAE-2
    • Temporal Difference VAE (TD-VAE): works with sequential data
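A minimal sketch of the reparameterization trick, assuming PyTorch (the layer sizes and the class name are made up for illustration):

```python
import torch
import torch.nn as nn

class TinyVAEHead(nn.Module):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness lives entirely in eps, which has no learnable parameters,
    so gradients can flow through mu and log_var during backpropagation.
    """
    def __init__(self, hidden=128, latent=16):
        super().__init__()
        self.mu = nn.Linear(hidden, latent)
        self.log_var = nn.Linear(hidden, latent)

    def forward(self, h):
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)                 # sampled outside the gradient path
        z = mu + torch.exp(0.5 * log_var) * eps
        # Closed-form KL(q(z|x) || N(0, I)); scaling it by beta > 1 gives Beta-VAE
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
        return z, kl.mean()

z, kl = TinyVAEHead()(torch.randn(4, 128))         # z feeds the decoder, kl joins the loss
```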


Flow-based Deep Generative Models

Source: https://lilianweng.github.io/posts/2018-10-13-flow-models/#made


Here is a quick summary of the difference between GAN, VAE, and flow-based generative models:

  1. Generative adversarial networks: GAN provides a smart solution to model the data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish the real data from the fake samples that are produced by the generator model. Two models are trained as they are playing a minimax game.

  2. Variational autoencoders: VAE implicitly optimizes the log-likelihood of the data by maximizing the evidence lower bound (ELBO).

  3. Flow-based generative models: A flow-based generative model is constructed from a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution p(x), and therefore the loss function is simply the negative log-likelihood.

  • Normalizing flow

    • Good density estimation is critical in machine learning; simple probability distributions make likelihoods and their gradients easy and fast to compute during backpropagation
    • The Gaussian distribution is often used in latent-variable generative models because it is simple, even though real-world distributions are much more complicated
    • We can obtain a more complicated distribution by gradually transforming a Gaussian distribution through multiple invertible steps


    • With x = z_K = f_K ∘ … ∘ f_1(z_0), the final log-likelihood log p(x) is
      log p(x) = log π(z_0) − Σ_{i=1}^{K} log |det(∂f_i / ∂z_{i−1})|


    • where each f_i should satisfy

      • It is easily invertible.
      • Its Jacobian determinant is easy to compute.
    • One implementation is RealNVP (see the coupling-layer sketch after this list)

      • Since neither the inverses of the scale and shift networks s and t nor their Jacobians ever need to be computed, s and t can be arbitrarily complex, e.g. neural networks


  • Autoregressive Flows

    • To model sequential data, each output only depends on the data observed in the past, but not on the future ones.
    • implementations
      • MADE
        • processes all sequence positions at the same time by feeding them into the model together
        • In each layer, assign a unique order to every node; a node may only receive input from previous-layer nodes whose order is ≤ its own. This enforces the autoregressive constraint
      • PixelRNN / PixelCNN
        • For image data, a pixel in the next layer depends only on previous-layer pixels that come before it (earlier rows, or the same row and earlier columns)
        • Different convolution techniques are used to improve efficiency (e.g. smaller convolution windows)
      • WaveNet
        • For 1-D audio data
        • Each output depends on a number of inputs from earlier time steps.
        • One option is to make each output depend on the n immediately preceding inputs, but n cannot be too large, and this fails for long sequences where an output may need input from far in the past
        • The solution is dilated convolution, where the n inputs are taken with gaps from far back in the sequence instead of only the n immediately preceding positions
  • Masked Autoregressive Flow / Inverse Autoregressive Flow

    • is a type of normalizing flow where the transformation layer is built as an autoregressive neural network.
    • Goal: knowing z's distribution p(z), estimate x and p(x)


    • Masked Autoregressive Flow: estimate x_i from all previous x and the current z. Data generation is slow (sequential in x), while density estimation is fast because it only depends on the known p(z)
    • Inverse Autoregressive Flow: estimate x_i from all previous z. Data generation is fast because it depends on already-known z in one pass, while density estimation is slow because the z's have to be recovered sequentially before their densities can be accumulated
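A minimal sketch of a RealNVP-style affine coupling layer, assuming PyTorch (the dimensions and the small MLPs used for s and t are made up for illustration):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible coupling step: half the dimensions pass through unchanged,
    the other half is scaled/shifted by s(x1), t(x1). Neither inverting s/t nor
    computing their Jacobians is ever required."""
    def __init__(self, dim=4):
        super().__init__()
        self.half = dim // 2
        self.s = nn.Sequential(nn.Linear(self.half, 64), nn.Tanh(), nn.Linear(64, self.half))
        self.t = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(), nn.Linear(64, self.half))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.s(x1), self.t(x1)
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)        # Jacobian is triangular, so log|det| = sum of s
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.s(y1), self.t(y1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)

x = torch.randn(8, 4)
y, log_det = AffineCoupling()(x)
# For density estimation: log p(x) = log_prior(y) + log_det, summed over stacked layers
```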


Diffusion Model

Source: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/


  • Basic idea
    • Add Gaussian noise to input image until it becomes white noise.
    • And in generation step, reverse the process by denoising from white noise step by step.
    • The loss function measures the difference between the forward (noising) distribution and the learned denoising model.
      • The forward Gaussian's parameters are computed in closed form from the training inputs and a fixed noise schedule
      • The goal is to learn the denoising model
    • Another training option is to train an estimator of the score, i.e. the gradient of log q(x) (score matching)
  • Tricks
    • Make the forward noise schedule a cosine function of the step t instead of a linear one
    • Also learn the denoising model's variance instead of fixing it
  • Additional image-class information can be added to the denoising process (classifier guidance)
    • First train an image classifier
    • During denoising, given a target class, add the classifier's gradient, scaled by a guidance weight, to the predicted noise/mean of the Gaussian at each step
  • Speeding up the diffusion process, several options:
    • Denoising Diffusion Implicit Model (DDIM): by setting the sampling noise to 0, instead of sampling at each reverse step we can compute the next value deterministically while still tracking the predicted noise along the way. It increases inference speed (see the sketch after this list)
    • Skip some sampling steps
    • Iteratively halve the number of sampling steps by distilling a teacher model into a student model at each round (progressive distillation)
    • Consistency model: train a function that maps x_t at any t directly back to the trajectory's origin x_0. Two sub-options:
      • Distillation: the loss is the difference between the original diffusion model's trajectory and the function's output
      • Train the function independently, without a pre-trained diffusion model
  • Latent variable space: use an autoencoder to compress the image into a latent space first, before sending it to the diffusion model (latent diffusion)
  • Scale up Generation Resolution and Quality
    • Use multiple diffusion models, one per resolution stage (cascaded generation)
    • Use Gaussian noise for low resolution
    • Use Gaussian blur for high resolution
    • Use CLIP embeddings to bridge text and image (e.g. a prior maps the CLIP text embedding to an image embedding) and feed the text into cross-attention layers
  • Model Architecture
    • U-Net; ControlNet, which adds a trainable copy of the encoder (with convolutions) to incorporate an additional conditioning image
    • Transformer backbones that predict the noise and variance
  • summary
    • pro: both tractable and flexible, while many other model families lack one or the other
    • cons: quite expensive in time and compute because generation relies on a long Markov chain of sampling steps
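A minimal sketch of the forward noising process, the noise-prediction training loss, and a deterministic DDIM update, assuming PyTorch and a hypothetical `model(x_t, t)` that predicts the added noise (schedule values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # forward noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product of (1 - beta)

def training_loss(model, x0):
    """Sample a random step t, noise x0 in closed form, and regress the noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps    # q(x_t | x_0)
    return torch.mean((model(x_t, t) - eps) ** 2)

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev):
    """Deterministic DDIM update: predict x_0, then jump straight to t_prev
    (the 'compute the next value directly' speed-up mentioned above)."""
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    eps = model(x_t, torch.full((x_t.shape[0],), t))
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```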


Diffusion Model For Video

Source: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/

Interesting model architecture and implementation ideas

  • Extend the 2D diffusion model to 3D, where the extra dimension is time
  • Create separate spatial and temporal diffusion models and cascade them together
    • Spatiotemporal super-resolution (SR) layers contain pseudo-3D convolution layers and pseudo-3D attention layers
    • A pseudo-3D attention layer is a spatial attention layer followed by a separate temporal attention layer (see the sketch at the end of this section)
  • Divide a video into small spatio-temporal patches and apply attention over these patches
  • Divide the whole model into multiple layers; each layer upsamples to a different resolution and contains both spatial and temporal diffusion components
  • One goal is video editing: given a text input and a video input, generate a new video.
    • To incorporate the additional video input, copy the current downsampling model's parameters into a separate model that transforms the input video into low-dimensional latent states.
    • Then apply cross-attention between the text embedding and the video embedding
  • Project a video onto one long image with each frame as a tile, run the diffusion model on it, and add attention components that join different frames to keep them consistent with each other
  • Add a frame interpolation network, increasing the effective frame rate by interpolating between generated frames. This is a fine-tuned model for the task of predicting masked frames for video upsampling.
  • During training, divide a video into a content component (represented by text) and a structure component (a snapshot of the input video) and run cross-attention between the text and the video
    • During inference, only text is input, and the video is generated from it
  • Pre-train a text-to-image diffusion model, freeze it, add temporal diffusion layers, and fine-tune on video data.
    • Enforce temporally coherent reconstructions across frames with a video-aware discriminator that judges frame quality during decoding
    • We can pre-train on text-image data and on curated video data separately, and finally fine-tune on high-quality video data
  • Adapt a pre-trained text-to-image model to output videos without any training
    • Generate raw frames with motion information
      • Define a direction function for controlling the global scene and camera motion
      • Generate the first frame randomly and downsample it using the diffusion model
      • Combine the first frame with the direction function, then upsample to generate frames with motion
    • Run the diffusion model to generate the full video based on the raw motion frames
      • Reprogram frame-level self-attention into a cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object.
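A minimal sketch of the factorized (pseudo-3D) attention idea referenced above, assuming PyTorch and made-up token shapes: spatial attention within each frame, then temporal attention across frames at each spatial position:

```python
import torch
import torch.nn as nn

class Pseudo3DAttention(nn.Module):
    """Factorized video attention: attend over patches within each frame,
    then over frames at each patch position, instead of full 3D attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, f, p, d = x.shape                           # (batch, frames, patches, dim)
        xs = x.reshape(b * f, p, d)                    # spatial: patches within one frame
        xs, _ = self.spatial(xs, xs, xs)
        xt = xs.reshape(b, f, p, d).transpose(1, 2).reshape(b * p, f, d)
        xt, _ = self.temporal(xt, xt, xt)              # temporal: frames at one position
        return xt.reshape(b, p, f, d).transpose(1, 2)  # back to (batch, frames, patches, dim)

video_tokens = torch.randn(2, 8, 64, 256)              # 8 frames, 64 patches per frame
out = Pseudo3DAttention()(video_tokens)
```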
