AI Reading Notes: Image And Video Gen
Overall highlights
- The common approach is to first encode an image (or add noise to it) into a low-dimensional latent representation, and then generate an image by denoising/decoding from random noise
- This works because one complicated distribution can be modeled as a sequence of transformations applied to a simple Gaussian distribution
- The loss function usually involves a KL divergence between the encoder/forward (noising) model, which reflects the real training data distribution, and the decoder/denoising model
- There are many variations of the loss function: some optimize a bound on the log-likelihood built from KL-divergence terms, others derive the loss from variational Bayesian inference on a graphical model
- The diffusion model is the most popular choice because it is both flexible and tractable, though generating images with it is expensive
- Self-attention and cross-attention components can be added inside a diffusion step to
- Predict the mean and variance of the Gaussian distribution
- Cross-reference different patches of an image
- Cross-reference image and text
- Cross-reference different video frames to achieve frame-level consistency
- CLIP is a popular model that embeds text and images into a shared space; it is trained on text-image pairs and is widely used to condition text-to-image generation
- An image generation pipeline can chain several diffusion models, one diffusion model per resolution-upsampling stage
- To reduce the sampling cost during generation, some parameters can be tuned so the next value is computed directly from trained parameters or predicted means
- Another speedup technique is to skip or reduce the number of steps in the diffusion process
- Convolution is a common technique for changing image and video size, in both 2D and 3D (spatio-temporal) space
- Popular architectures and components used in various places of a large model and in various ways:
- U-Net, Transformer, self-attention, cross-attention, CLIP, diffusion models, diffusion pipelines, convolution, residual connections
From GAN to WGAN
Source: https://lilianweng.github.io/posts/2017-08-20-gan/
A discriminator model learns to tell fake images from real images
A generator model learns to generate fake images realistic enough to fool the discriminator
Goal: make the generator output images that are as realistic as possible
Loss function:
Minimize the discriminator's classification loss while maximizing the generator's chance of fooling the discriminator; the two are trained as a minimax game (objective written out below)
Limitation: unstable training, slow convergence
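For reference, the minimax objective from the post, where $D$ is the discriminator, $G$ the generator, and $p_z$ the noise prior:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$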
From Autoencoder to Beta-VAE
Source: https://lilianweng.github.io/posts/2018-08-12-vae/
- Train an encoder-decoder to generate images; the idea is compress then decompress
- Images are high-dimensional data but can be compressed into a low-dimensional space because each image obeys many constraints, e.g. object-to-object relationships, light rays and shading
- Denoising Autoencoder: corrupt the input image with noise and train the model to reconstruct the clean image; this is similar to masking tokens in the input when training a language model
- Sparse Autoencoder: select only top k activated nodes for next layer calculation
- Contractive Autoencoder: adds a term in the loss function to penalize the representation being too sensitive to the input
- VAE (Variational Autoencoder): model the process with variational Bayesian methods and a graphical model, which maps naturally onto an encoder-decoder architecture
- By minimizing the loss, we are maximizing the lower bound (ELBO) of the probability of generating real data samples
- Reparameterization Trick:
- The loss function requires computing an expectation, which involves sampling data
- Gradient descent / backpropagation cannot flow through a random sampling operation
- So instead of sampling directly, assume a Gaussian distribution and write the sample as the mean plus the standard deviation times standard noise
- The random sampling part is reparameterized into a separate noise variable that can be ignored during backpropagation but still provides stochasticity at generation time (see the sketch after this list)
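A minimal PyTorch sketch of the trick (names here are illustrative, not from the post): the sample is rewritten as mean + std * noise, so gradients flow through the mean and variance while the noise stays outside the graph.

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so backprop can flow through
    mu and log_var (the encoder outputs) without differentiating
    through a sampling operation.
    """
    std = torch.exp(0.5 * log_var)   # sigma recovered from log-variance
    eps = torch.randn_like(std)      # pure noise, no gradient needed
    return mu + std * eps

# usage: mu and log_var would come from the encoder
mu, log_var = torch.zeros(4, 8), torch.zeros(4, 8)
z = reparameterize(mu, log_var)      # differentiable w.r.t. mu and log_var
```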
- Beta-VAE: put a weight β on the KL term that pushes the encoder's latent distribution toward the prior, encouraging disentangled latents
- VQ-VAE and VQ-VAE-2: make the hidden state z discrete and finite; z takes values from a limited codebook, organized hierarchically in VQ-VAE-2
- Temporal Difference VAE: works with sequential data
Flow-based Deep Generative Models
Source: https://lilianweng.github.io/posts/2018-10-13-flow-models/#made
Here is a quick summary of the difference between GAN, VAE, and flow-based generative models:
- Generative adversarial networks: GAN frames data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish the real data from the fake samples produced by the generator model, and the two models are trained as if playing a minimax game.
- Variational autoencoders: VAE implicitly optimizes the log-likelihood of the data by maximizing the evidence lower bound (ELBO).
- Flow-based generative models: a flow-based generative model is constructed from a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution p(x), so the loss function is simply the negative log-likelihood.
- Normalizing Flows
- Good density estimation is critical in machine learning: training with backpropagation needs a probability distribution whose density and derivatives can be computed easily and quickly
- Gaussian distributions are often used in latent-variable generative models because they are simple, even though real-world distributions are much more complicated
- We can obtain a more complicated distribution by gradually transforming a Gaussian distribution in multiple invertible steps
- With $\mathbf{x} = \mathbf{z}_K = f_K \circ \dots \circ f_1(\mathbf{z}_0)$, the final log-likelihood $\log p(\mathbf{x})$ is
  $$\log p(\mathbf{x}) = \log \pi_0(\mathbf{z}_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i}{\partial \mathbf{z}_{i-1}} \right|$$
- where each $f_i$ should satisfy
- It is easily invertible.
- Its Jacobian determinant is easy to compute.
- One implementation is RealNVP (affine coupling layers)
- Since neither the inverses of s and t nor their Jacobians need to be computed, s and t can be arbitrarily complex, e.g. neural networks (see the coupling-layer sketch below)
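A rough sketch of one RealNVP-style affine coupling layer (the s and t nets here are placeholder MLPs, not the paper's exact architecture): half the dimensions pass through unchanged and condition the affine transform of the other half, so the inverse and log-determinant stay cheap even when s and t are deep networks.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y1 = x1; y2 = x2 * exp(s(x1)) + t(x1)  (RealNVP-style coupling)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        self.s = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.half), nn.Tanh())
        self.t = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim - self.half))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.s(x1), self.t(x1)
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)            # log|det J| is just the sum of s; no Jacobian of s or t needed
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.s(y1), self.t(y1)
        x2 = (y2 - t) * torch.exp(-s)     # exact inverse without inverting s or t themselves
        return torch.cat([y1, x2], dim=1)
```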
- Autoregressive Flows
- To model sequential data, each output depends only on the data observed in the past, not on future values
- implementations
- MADE:
- Processes all states at the same time by feeding them into the model together
- In each layer, assign an order (degree) to each node; a node may only receive input from previous-layer nodes whose order is ≤ its own, which guarantees the autoregressive constraint (see the mask sketch after this block)
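A toy illustration of how MADE-style masks can be built (the degree assignment here is a simple random scheme; this is a sketch, not the paper's exact recipe): a hidden unit with degree k may only receive from inputs with degree ≤ k, and output i may only receive from hidden units with degree < i, so output i never depends on inputs i or later.

```python
import numpy as np

def made_masks(n_in: int, n_hidden: int, seed: int = 0):
    """Build connectivity masks that enforce the autoregressive property."""
    rng = np.random.default_rng(seed)
    deg_in = np.arange(1, n_in + 1)                              # input/output unit i has degree i
    deg_hidden = rng.integers(1, n_in, size=n_hidden)            # hidden degrees in [1, n_in - 1]
    mask_hidden = (deg_hidden[:, None] >= deg_in[None, :])       # hidden unit sees inputs with degree <= its own
    mask_out = (deg_in[:, None] > deg_hidden[None, :])           # output i sees hidden units with degree < i
    return mask_hidden.astype(float), mask_out.astype(float)

# usage: multiply these masks elementwise with the layer weight matrices
m_hidden, m_out = made_masks(n_in=5, n_hidden=16)
```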
- PixelRNN / PixelCNN:
- For image data, each pixel depends only on the pixels that come before it in raster order (earlier rows, or the same row at lower columns)
- Different convolution techniques are used to increase efficiency (e.g. smaller convolution windows)
- WaveNet
- For 1-D audio data
- Each output depends on a number of inputs from previous time steps
- One option is to make the output depend on the n immediately preceding inputs, but n can't be too large, and this fails for long sequences where an output may need input from far in the past
- The solution is dilated convolution, where the n inputs the output depends on are sampled from far back in the sequence instead of being the immediately preceding n inputs (see the sketch below)
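A minimal illustration of the dilated-convolution idea (not WaveNet's full gated residual block): stacking 1-D causal convolutions with exponentially growing dilation lets the receptive field reach far into the past without a huge kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Stack of 1-D convolutions with dilations 1, 2, 4, ...; left-padding keeps them causal."""
    def __init__(self, channels: int = 16, layers: int = 4, kernel: int = 2):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=kernel, dilation=2 ** i)
            for i in range(layers)
        ])
        self.kernel = kernel

    def forward(self, x):                        # x: (batch, channels, time)
        for i, conv in enumerate(self.convs):
            pad = (self.kernel - 1) * (2 ** i)   # pad only on the left, so no future leakage
            x = conv(F.pad(x, (pad, 0)))
        return x

# receptive field grows to 2^layers time steps
y = DilatedCausalStack()(torch.randn(1, 16, 100))
```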
- Masked Autoregressive Flow / Inverse Autoregressive Flow
- Both are normalizing flows whose transformation layers are built as autoregressive neural networks
- Goal: given the known base distribution p(z), generate x and estimate p(x)
- Masked Autoregressive Flow (MAF): compute x_i from all previous x and the current z_i, so data generation is slow (sequential) while density estimation is fast because it only depends on the known p(z) and can be done in one pass
- Inverse Autoregressive Flow (IAF): compute x_i from all previous z, so data generation is fast because all x follow from the known z in one pass, while density estimation of a given x is slow because the corresponding z values must be recovered sequentially (equations below)
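Written out, the two flows differ only in what the shift and scale networks condition on (a standard way to state it; $\mu_i, \sigma_i$ denote the learned autoregressive functions):

- MAF: $x_i = z_i \odot \sigma_i(x_{1:i-1}) + \mu_i(x_{1:i-1})$ — generating $x$ is sequential, but recovering $z$ from a given $x$ (density estimation) is one parallel pass
- IAF: $x_i = z_i \odot \sigma_i(z_{1:i-1}) + \mu_i(z_{1:i-1})$ — generating $x$ from known $z$ is one parallel pass, but recovering $z$ from a given $x$ is sequential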
Diffusion Model
Source: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- Basic idea
- Add Gaussian noise to the input image step by step until it becomes white noise
- In the generation step, reverse the process by denoising from white noise step by step
- The loss function measures the difference between the forward (noising) distribution and the learned denoising model
- The forward Gaussian process is fixed and applied to real training images, so it reflects the training data distribution
- The goal is to learn the denoising model
- Another training option is to train an estimator of the score, i.e. the gradient of log q(x) (score matching), instead of predicting the noise directly; the forward step and simplified loss are written out below
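For reference (notation as in the source post, with $\alpha_t = 1-\beta_t$, $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$, and $\epsilon_\theta$ the noise-prediction network), a forward step, the closed-form jump from $\mathbf{x}_0$, and the simplified training loss:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right)$$
$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon, \quad \boldsymbol\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
$$L_{\text{simple}} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol\epsilon}\left[\left\| \boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) \right\|^2\right]$$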
- Tricks
- Make the forward noise schedule follow a cosine function of t instead of a linear one
- Learn the denoising model's variance as well, instead of fixing it
- Additional image-class information can be added to the denoising process (classifier guidance)
- First train an image classifier
- During denoising, given a target class, add the classifier's gradient, scaled by a guidance weight, to the predicted noise / Gaussian mean at each step (formula below)
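In noise-prediction form, classifier guidance shifts the predicted noise by the scaled classifier gradient, where $s$ is the guidance weight and $p_\phi(y \mid \mathbf{x}_t)$ the separately trained classifier:

$$\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t, t) = \boldsymbol\epsilon_\theta(\mathbf{x}_t, t) - s\,\sqrt{1-\bar\alpha_t}\;\nabla_{\mathbf{x}_t} \log p_\phi(y \mid \mathbf{x}_t)$$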
- Speeding up the diffusion process, 3 options
- Denoising Diffusion Implicit Model (DDIM): by setting the per-step sampling variance to 0, each reverse step computes the next value deterministically instead of sampling, while the noise level still decreases along the process; this increases inference speed (see the update formula after this list)
- Skip some sampling steps
- Iteratively halve the number of sampling steps by having a teacher model teach a student model at each round (progressive distillation)
- Consistency model: train a function that maps x_t at any t directly back to the clean x_0; 2 sub-options
- Distill it from an existing diffusion model, with the loss comparing the function's outputs along the teacher's trajectory
- Train the function independently from scratch
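The DDIM update mentioned above: first estimate $\mathbf{x}_0$ from the current noise prediction, then step; with $\sigma_t = 0$ the step is deterministic and can be taken on a sparse subsequence of timesteps:

$$\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\left(\frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar\alpha_t}}\right) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\boldsymbol\epsilon_\theta(\mathbf{x}_t, t) + \sigma_t\,\boldsymbol\epsilon_t$$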
- Latent variable space: use an autoencoder to compress the image into a latent space first, then run the diffusion model there (latent diffusion)
- Scale up Generation Resolution and Quality
- Use multiple diffusion models, one for each resolution-upsampling stage
- Use Gaussian noise augmentation for the low-resolution stage
- Use Gaussian blur augmentation for the high-resolution stage
- A CLIP model supplies the text (and image) embeddings used for conditioning, e.g. via cross-attention
- Model Architecture
- U-Net, and ControlNet, which copies part of the network to inject an additional conditioning image through convolutions
- Transformer backbones (DiT) that predict the noise and variance
- summary
- Pro: both tractable and flexible, while many other model families only have one of the two
- Con: quite expensive in time and compute because generation relies on a long Markov chain of sampling steps
Diffusion Model For Video
Source: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/
Interesting model architecture and implementation ideas
- Extend 2D diffusion model to 3D. The extra dimension is time
- Create separate Spatial and Temporal diffusion model and cascade them together
- Spatiotemporal SR layers contain pseudo-3D convolution layers and pseudo-3D attention layers (see the sketch after this block)
- A pseudo-3D attention layer consists of a spatial attention layer followed by a separate temporal attention layer
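A rough sketch of the factorized "pseudo-3D" idea (illustrative, not any specific paper's exact layer): a 2-D convolution over each frame followed by a 1-D convolution across frames approximates a full 3-D convolution at much lower cost, and lets the spatial part be initialized from a pre-trained image model.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized space/time convolution: 2-D conv per frame, then 1-D conv across frames."""
    def __init__(self, channels: int, k_space: int = 3, k_time: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, k_space, padding=k_space // 2)
        self.temporal = nn.Conv1d(channels, channels, k_time, padding=k_time // 2)

    def forward(self, x):                                                # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.spatial(x.transpose(1, 2).reshape(b * t, c, h, w))      # per-frame 2-D conv
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1)              # -> (B, H, W, C, T)
        y = self.temporal(y.reshape(b * h * w, c, t))                    # per-pixel 1-D conv over time
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)           # back to (B, C, T, H, W)

out = Pseudo3DConv(8)(torch.randn(2, 8, 16, 32, 32))
```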
- Divide a video into small spatio-temporal patches and apply attention over these patches
- Divide the whole model into multiple layers; each layer upsamples to a different resolution and contains both a spatial and a temporal diffusion model
- One goal is video editing: given a text input and a video input, generate a new video
- To incorporate the additional video input, copy the current downsampling model's parameters into a separate model that transforms the input video into low-dimensional latent states
- Then apply cross-attention between the text embedding and the video embedding
- Project a video into one long image with each frame as a tile, run the diffusion model on it, and add attention components that join different frames to keep them consistent
- Add a frame interpolation network, increasing the effective frame rate by interpolating between generated frames. This is a fine-tuned model for the task of predicting masked frames for video upsampling.
- During training, split a video into a content component (represented by text) and a structure component (frames of the input video) and apply cross-attention between text and video
- During inference, only text is given as input and the video is generated from it
- Pre-train text to image diffusion model, freeze it, add temporal diffusion layer and fine tune on video data.
- Enforce temporally coherent reconstructions across frames with a video-aware discriminator that judges frame quality during decoding
- We can pretrain on text to image data, pre-train on curated video data separately and finally fine tune on high quality video data
- adapt a pre-trained text-to-image model to output videos without any training
- Generate raw frames with motion info
- Define a direction function for controlling the global scene and camera motion
- Generate the first frame randomly and downsample it using the diffusion model
- Combine the first frame with the direction function, then upsample to produce raw frames that carry the motion
- Run the diffusion model to generate the full video based on the raw motion frames
- Reprogram frame-level self-attention as a cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object (see the sketch below)
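A simplified sketch of the cross-frame attention idea (shapes and the single-head formulation are illustrative assumptions): every frame's queries attend to keys and values taken from the first frame, which anchors appearance and identity across frames.

```python
import torch

def cross_frame_attention(q, k, v):
    """q, k, v: (frames, tokens, dim). Keys/values are replaced by frame 0's,
    so each frame attends to the first frame instead of to itself."""
    k0 = k[:1].expand_as(k)                 # broadcast frame-0 keys to every frame
    v0 = v[:1].expand_as(v)                 # broadcast frame-0 values to every frame
    attn = torch.softmax(q @ k0.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0

frames, tokens, dim = 8, 64, 32
q = torch.randn(frames, tokens, dim)
out = cross_frame_attention(q, q, q)        # self-attention rewired onto the first frame
```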