MIT Efficient ML Course Notes and Highlights
Personal highlights
- Memory movement is more expensive than computation
- Network latency is more significant than computation
- with the same memory consumption, we want the network to do as much computation as possible to increase accuracy
- Common techniques: Pruning, Quantization, Distillation
- different level of grouping and granularity
- used in pruning, quantization, parallel execution
- Common evaluation and optimization criteria
- weight significance, activation significance, tensor wise, channel wise, batch wise …
- l2 loss, KL divergence,
- accuracy, latency, number of computation, memory usage
- Common ideas to optimize a neural network structure using above techniques
- make the architecture option a trainable parameter, trained with an additional loss or KL divergence
- Optimize the architecture params with regular weights either together or freeze one and optimize the other iteratively
- iteratively prune / quantize / distill, then fine-tune and evaluate in each round
- ablation study: delete one layer and see the result
- Neural Architectural Search
- design the search space first, e.g. resolution, channel number ..
- search for best architecture
- Pre-train a large network first and evaluate subnet on small devices next
- based on previously collected stats like accuracy - latency relationship , min max value
- different ratio for different layer, e.g. different pruning / quantization ratio for different layer, because different layers’ channel, functions are different.
- during fine-tuning, feedback to later stages is more useful than feedback to early stages, because early stages hold knowledge that has already been learned
- Compare different component’s output directly and calculate loss and gradient besides loss feedback from final output
- A good architecture should have local minima that lie close together
- apply a small perturbation to input and analyze
- should be similar to parent model
- if train from scratch, good model should be sensitive to perturbation to an input
- network augmentation, duplicate some layers or some components to increase network width and depth with studied components so that it is easy to estimate new network metric like latency and accuracy
- freeze some params, finetune others
- e.g. freeze the existing network, add a branch or a cascade layer to be trained during fine tune for a specific task like image editing
- e.g. freeze weights , fine tune only bias
- reinforcement learning
- evolutionary search
- Mixture-of-experts style: keep all options in the network and activate a different option based on the input and a latency estimator
- another option is 2 different models each handles different aspect of input e.g. different sensor data, small image details vs voxelized data
- local minima of weights for different layers after calculating gradient descent on several input samples should be closer
- flatten higher dimensions to lower dimensions, e.g. video to a series of images, 3D matrix to 2D matrix
- expand channels, e.g. audio to image like input, time + frequency info for each period
- shift same data’s different part around with other data during training to reduce dimension needed for computation in neural network
- first token sink: the first several tokens are very important, so even with a sliding window the first several tokens should always be kept in the window.
- Similar idea on prefix tuning
- Computation optimization
- Defer small computations to a later stage so that they can be done together more efficiently, e.g. accumulate small gradient updates and apply them after several iterations
- loop reorg, loop tiling etc
- divide each layer’s input into small patches and calculate each patch separately
- matrix transformation
- on static data like weight to reduce calculation during inference
- LoRA: express the weight update as a product of smaller low-rank matrices so that there are fewer params to fine-tune
- scale down and scale up in different stages so that overall result doesn’t change while computation is optimized
- only do more computation on a small percentage of salient weights e.g. using higher quantization scale
- More data
- Contrastive learning, similar input should produce similar output, solve supervised learning’s limited data problem with unsupervised learning
- Data augmentation, image rotation, image zooming etc
- mask image or text and predict during training
- multi modal
- cross attention, between image transformer and text transformer
- convert image to token first and merge with text token
Course Notes
Convolution:
- channel, height, width, kernel
Memory is much more expensive to move in terms of both energy and latency than computation
Common techniques
- Pruning
- remove node by setting weight to 0
- Iterative pruning and fine tuning reduces accuracy loss
- train on all weights
- Prune weights by a small percent
- fine tune
- iteratively Prune more and then fine tune
- fine grained/unstructured/irregular pruning → coarse-grained/structured/regular pruning
- fine grained unstructured pruning
- pattern based pruning
- e.g. randomly prune but each row only prune 50% weights
- easier to compress to dense matrix
- vector based
- kernel based
- channel based
- uniform shrink - same prune ratio for each channel
- different prune ratio for each channel - higher accuracy than uniform shrink
- Fine grained pruning
- flexible
- higher compress ratio
- harder to accelerate on hardware
- Coarse-grained
- less flexible
- easy to accelerate
- low compress ratio
- pruning criterion
- based on weight magnitude, percentage of zero in a channel, importance of a neuron
- delete one layer/channel, see how sensitive the performance is to decide whether to prune
- regression based:
- Instead of comparing whole network loss, compare each layer’s loss before and after pruning
- Add parameters to control the number of non-zero channels, one param for each channel
- Fix weights and solve param for channel selection
- Fix param, solve weights to minimize reconstruction error
- for convolution network, the size of params of different layers can differ drastically
- Pruning strategy
- Determine pruning ratio based on accuracy requirement and pruning ratio - accuracy statistics
- AutoML: treat pruning as a reinforcement learning problem; the action is choosing the pruning ratio for each layer, and actions are adjusted based on feedback and reward
- NetAdapt , gradually prune each layer and measure accuracy after fine tune until the accuracy matches goal
- Can we start with sparse network and train directly?
- No, because training algorithms like SGD need a dense network and redundancy to converge
- One solution, which works well on some datasets but not on image datasets, is iterative magnitude pruning
- Randomly initialize weights first on a dense network
- train the dense network
- determine which to prune based on training result
- reset the pruned network weights while keeping pruning result and train again
- System support for sparsity
- use some special data structure to store weights including some virtual index pointer
- or store non zero weight index in a separate array and compress weights array into smaller array by removing 0 weights
- principle of efficient AI computing to be lazy:
- avoid redundant computation
- or reject the work quickly
- or delay as much as possible
- Sparse convolution computation
- maintain same sparsity between input and output by only computing some convoluted element and skip other with 0 weights
- to make it easier to parallelize this computation, pad some 0 to some parallel computation so that each computation unit computes same size of block .
- trade computation efficiency for regularity
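The magnitude-based pruning criterion and the "store non-zero weights plus their indices" idea from the pruning notes above can be sketched as follows. This is a minimal illustration in plain Python, not the course's implementation; the function names, the CSR-style layout, and the example matrix are assumptions chosen for clarity.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)          # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]                # largest magnitude that gets pruned
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def to_csr(dense):
    """Compress a dense matrix into non-zero values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Matrix-vector product that only touches the stored non-zero weights."""
    return [
        sum(values[p] * x[col_idx[p]] for p in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


W = [[0.9, -0.05, 0.3, 0.01],
     [-0.02, 0.7, -0.6, 0.04]]
pruned = magnitude_prune(W, sparsity=0.5)
values, col_idx, row_ptr = to_csr(pruned)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0, 1.0])
```

In an iterative pruning schedule, `magnitude_prune` would be called with a gradually increasing `sparsity`, with fine-tuning between calls.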
- Quantization
- representation
- integer
- fixed point
- floating point
- BF16 has better training performance
- training requires higher number range, inference requires higher precision, so this is a tradeoff
- floating point mainly cost a lot in memory and related data movement. so usually we do quantization in storage and floating point math in computation
- mainly for inference and fine tuning, pre training still need full floating point
- Options
- K means based weight quantization
- Cluster weights into several category and calculate each category’s mean
- store category and mean in separate matrix and vector
- Fine tune quantized weights: when compute gradient, also categorize , calculate gradient mean, and update weight mean
- with pruning + quantization, we can achieve higher compression rate with same level of accuracy loss
- 4bits is enough
- huffman coding:
- infrequent weights: use more bits to represent
- frequent weights: use fewer bits to represent
- smaller model can achieve higher compression ratio with high accuracy
- So we can think of designing more compact model at first
- linear quantization
- map the floating point continuous range to discrete range with linear transformation
- symmetric quantization, bias in linear transformation is 0
- We can add the linear transformation into whole neural network’s computation
- the computation is integer arithmetic
- After expanding the linear quantization transformation in neural network’s computation, some element is static and can be precomputed, and some element can be quantized
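A small sketch of the linear quantization mapping described above, assuming the common affine form r ≈ scale × (q − zero_point) with signed 8-bit integers. The helper names and the sample weights are illustrative assumptions; symmetric quantization corresponds to fixing `zero_point = 0`.

```python
def quant_params(r_min, r_max, bits=8):
    """Return (scale, zero_point, q_min, q_max) for asymmetric linear quantization."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (r_max - r_min) / (q_max - q_min)      # real-valued step per integer level
    zero_point = round(q_min - r_min / scale)      # integer that represents real 0.0
    return scale, zero_point, q_min, q_max


def quantize(r, scale, zero_point, q_min, q_max):
    """Map a real value to the nearest integer level, clamped to the integer range."""
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to its real value."""
    return scale * (q - zero_point)


weights = [-1.0, -0.25, 0.0, 0.4, 1.55]
scale, zp, q_min, q_max = quant_params(min(weights), max(weights))
q = [quantize(r, scale, zp, q_min, q_max) for r in weights]
restored = [dequantize(v, scale, zp) for v in q]
```

For values inside the calibrated range, the round-trip error stays within half a quantization step (`scale / 2`).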
- Post training Quantization
- Quantization granularity
- per tensor - coarse grain
- work well for large model
- accuracy drop for small model
- per channel
- per group - most granular
- the more granular, the higher the accuracy
- more memory usage
- Dynamic range for activation quantization, Collect activation stats before model deployment, 2 options
- collect exponential moving average during training time
- run calibration batches and collect min and max
- assume the activations follow a Gaussian-like distribution; if you clip too much, large activations get a larger quantization error; if you don't clip, the centroids are sparsely scattered and less useful
- calculate kl divergence between real and estimated distribution,
- find a threshold where it begins to diverge a lot
- another option is to minimize mean-square-error using Newton Raphson method
- at first the error drops as clipping range drops because too scattered centroid is useless
- Rounding
- rounding to nearest is not optimal
- another option is to learn rounding threshold by minimizing rounding error
- Quantization aware training
- during training, accumulate small gradients inside each layer over multiple steps, so that small movements are not lost
- on boundary of each layer, quantize weights, but still store full precision of weights
- in inference, completely quantize weights
- straight through estimator, pass small gradients through the model as if it had been identity function
- It can greatly improve inference accuracy with quantization
- Binary/Ternary quantization:
- binary
- deterministic quantization: q = +1 if r >= 0, otherwise -1
- stochastic quantization: q = +1 with a probability that depends on r (e.g. a hard sigmoid of r), otherwise -1
- quantization error can be reduced by multiplying the result with a scaling factor
- calculation becomes bit operation
- ternary
- the category threshold is a heuristic parameter times the expected absolute value of the weights
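The effect of the scaling factor in binary quantization can be checked numerically. The sketch below assumes the common choice alpha = mean(|w|), which minimizes the squared error between w and alpha × sign(w); the function names and the sample weights are illustrative assumptions.

```python
def binarize(weights):
    """Deterministic binarization: +1 for w >= 0, otherwise -1."""
    return [1 if w >= 0 else -1 for w in weights]


def scaling_factor(weights):
    """alpha = mean(|w|), the least-squares optimal scale for alpha * sign(w)."""
    return sum(abs(w) for w in weights) / len(weights)


def squared_error(weights, approx):
    """Sum of squared differences between the weights and their approximation."""
    return sum((w - a) ** 2 for w, a in zip(weights, approx))


w = [0.8, -0.3, 0.1, -0.6]
signs = binarize(w)
alpha = scaling_factor(w)
err_plain = squared_error(w, signs)                         # no scaling
err_scaled = squared_error(w, [alpha * s for s in signs])   # with scaling
```

With these sample weights the scaled approximation has a clearly smaller error than the unscaled signs, which is why the scaling factor is kept in full precision next to the 1-bit weights.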
- Mixed-precision quantization
- different number of discrete values for different channel,
- much better than fixed precision one
- Distillation
- Train a smaller student network based on teacher network
- The goal of knowledge distillation is to align the class probability distributions from teacher and student networks
- what to match
- output logits , cross entropy loss, l2 loss etc
- intermediate weights
- train a projector to project student weights to larger teacher weights and calculate loss
- intermediate feature
- KD loss
- gradient
- attention map: gradient of loss to input feature x
- apply a small perturbation to input, the attention map should match between teacher and student
- which means if one pixel is sensitive to input change in teacher, it should be same to student model
- the gradient can be intermediate output of each layer
- sparsity pattern
- Intuition: the teacher and student networks should have similar sparsity patterns after the ReLU activation.
- relation information
- between different layers:
- between teacher and student , channel number is same, only layer number is different,
- Use inner product to extract relational information to generate a matrix of shape Cin x Cout for all layers(inner product of input tensor and output tensor for each module), student and teacher model should match on this based on different loss like l2 loss
- between different samples
- sample clustering should be similar between teacher and student
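For the simplest matching target above (output logits), a distillation loss can be sketched as the KL divergence between temperature-softened teacher and student distributions. The temperature and the T² scaling are assumptions taken from the usual knowledge-distillation formulation, not something these notes specify; the logits are made-up values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL between softened teacher and student outputs, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


teacher = [5.0, 2.0, 0.5]
student = [3.0, 2.5, 1.0]
loss_mismatch = distillation_loss(teacher, student)
loss_match = distillation_loss(teacher, teacher)
```

In training, this term is typically added to the ordinary classification loss on the labels.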
- self and online distillation
- self distillation
- Born-Again Networks in each iteration, further train a model, use both classification objective and distillation objective
- online distillation
- initialize 2 model with different params randomly
- objective: loss and kl divergence between 2 models
- combined approach
- deeper stage feedback is more valuable because it is closer to label output, while earlier stage has better acceleration
- send deeper stage loss to earlier stage to improve earlier stage
- Knowledge distillation tasks
- for object detection in an image
- feature imitation: Exploit teacher’s prediction as an upper bound for the student to achieve. Once the quality of the student surpasses that of the teacher with a certain margin, the loss becomes zero.
- weighted cross entropy loss: give foreground object higher loss while background object lower loss
- Add a 1x1 convolution layer to match the shape.
- Convert bounding box regression to classification problem
- divide the space into multiple buckets,
- classify target object’s bounding box coordinate into different bucket
- so that convert regression problem into classification problem
- and it is easier to compare teacher model and student model’s output because instead of comparing if bounding box coordinates match, we only need to compare if bounding box coordinate’s space bucket match
- Calculate the distillation loss between two probability distributions predicted by the teacher and the student.
- kd for semantic segmentation
- add a discriminator loss
- kd for gan
- distillation loss and reconstruction loss
- kd for attention
- feature map match + attention map match
- network augmentation
- network dropout / data augmentation can prevent overfitting for large models, but it can hurt the accuracy of small models
- core idea is to provide extra supervision during training
- during training, besides the base model, have an augmented model,
- the augmented model mostly share weights with base model, but it also has extra weights
- loss is base model loss + augmented model’s loss
- performance is further improved if it is combined with knowledge distillation
Neural Architecture Search
- Popular building block
- depthwise convolution: one channel to one channel convolution
- 1x1 convolution:
- 1 to 1 on height and width level
- but merge multiple input channel into one output channel
- used to reduce channel dimensions
- resnet50: bottleneck block
- reduce channel size first using 1x1 convolution
- Then apply a larger convolution to save number of params and computations
- ResNeXt: grouped convolution
- divide channels into multiple groups and do conv within each group
- in each group, apply resnet50 like conversion and convolution,
- finally add them together to restore into original params size
- Equivalent to a multi-path block
- mobilenet: depthwise separable convolution,
- optimized for mobile devices
- Use depthwise convolution within each channel to capture spatial information.
- Use 1x1 convolution to fuse/exchange information across different channels.
- MobileNetV2: inverted bottleneck block
- Depthwise convolution has a much lower capacity compared to normal convolution.
- Increase the depthwise convolution's input and output channels to improve its capacity.
- Depthwise convolution’s cost only grows linearly. Therefore, the cost is still affordable.
- However, this design is not memory-efficient for both inference and training.
- Shufflenet
- further reduce cost by replacing 1x1 conv with 1x1 group conv
- exchange info across different groups via channel shuffle
- transformer multi head self attention
- huge design space, manual design unscalable
- param: layers, channels, connectivity, resolution, kernels
- Goal is achieve higher accuracy with less computation
- search space
- cell level
- normal cell
- reduction cell (merge multiple cells)
- once a cell is searched, it can be repeated multiple times
- connection and dependency between different cells can be complicated and irregular which increase computation cost because one computation has to wait for another computation to finish
- use RNN to search space of a single cell
- Left: An RNN controller generates the candidate cells.
- five steps:
- finding two inputs, selecting two input transformation operations (e.g. convolution / pooling / identity), and finally selecting the method to combine the results. These five steps will be repeated for B times.
- Right: A cell generated after one step
- network level
- width: number of channel
- depth: number of repeated cell
- resolution
- kernel: convolution size
- topology connection: different path of down sample or up sample
- Design the search space for TinyML: memory is important for TinyML
- energy constraint, latency constraint
- same memory, the more flops, the better potential performance, because computing is cheap, data movement is expensive
- compare search spaces (resolution x width) by the maximum FLOPs achieved and the probability of reaching that FLOPs level (the number of generated models that can produce it)
- no need to train model to compare
- manually designed search space can achieve better perf than larger design space or random design space
- search strategy
- grid search
- search all combination in a grid in order
- if we already have a good base model and want to make it larger, we can scale different params including depth, width, resolution
- compound scaling of all dimensions together has better result
- random search
- useful to do a sanity check before doing grid search
- reinforcement learning
- Generate an Architecture with probability p,
- train the model with the architecture and generate accuracy
- use the accuracy as a reward to compute the policy gradient for the sampled architecture and update the controller
- gradient descent
- store all options in each node,
- during training, train the probability of each option with gradient descent and store them
- during inference, pick the option with highest prob
- F is a latency prediction model (typically a regressor or a lookup table). With such formulation, we can calculate an additional gradient for the architecture parameters from the latency penalty term.
- cons: all options need to be kept in memory
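The latency penalty in the gradient-based strategy above can be sketched as an expected latency over candidate operations, weighted by a softmax of the architecture parameters. The lookup-table values, the learning rate, and the helper names are hypothetical; the example optimizes only the latency term, whereas a real search would add it to the task loss.

```python
import math


def softmax(alphas):
    """Turn architecture parameters into selection probabilities."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(alphas, latency_table):
    """E[latency] = sum_i p_i * F(op_i), with F given as a lookup table."""
    probs = softmax(alphas)
    return sum(p * lat for p, lat in zip(probs, latency_table))


def latency_gradient(alphas, latency_table):
    """d E[latency] / d alpha_i = p_i * (F(op_i) - E[latency]) for a softmax."""
    probs = softmax(alphas)
    expected = sum(p * lat for p, lat in zip(probs, latency_table))
    return [p * (lat - expected) for p, lat in zip(probs, latency_table)]


latency_ms = [1.0, 2.5, 4.0]      # hypothetical table: 3x3, 5x5, 7x7 conv
alphas = [0.0, 0.0, 0.0]          # start with equal probability per option
initial = expected_latency(alphas, latency_ms)
for _ in range(100):
    grad = latency_gradient(alphas, latency_ms)
    alphas = [a - 0.5 * g for a, g in zip(alphas, grad)]
best = max(range(len(alphas)), key=lambda i: alphas[i])   # option kept at inference
```

With the latency term alone, the parameters drift toward the fastest option; the accuracy loss is what pushes back toward more expensive operations.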
- evolutionary search
- fitness: f(accuracy, efficiency)
- crossover: Randomly choose one operator among two choices (from the parents) for each layer
- accuracy estimation strategy
- training from scratch
- train on training set
- evaluate on evaluation set
- feedback with evaluation result
- cons: high cost
- inherit weight
- based on a trained model
- wider: add more nodes in same layers, calculate new weights based on existing weights so that the updated model is mathematically equivalent to old model
- deeper: duplicate the existing structure and make the network deeper
- generation: generate based on operation of make it wider or deeper instead of always a new network
- hyper network
- convert network architecture into an embedding
- training a hyper network to predict params based on the network embedding
- zero shot nas: estimate accuracy without training model
- zenNas
- randomly initialize weights
- model should be sensitive to perturbation to an input
- but also penalize too large batch normalization variance, model shouldn't be too sensitive
- gradsign
- intuition: local minima of weights for different layers after calculating gradient descent on several input samples should be closer
- Hardware aware NAS
- may need a different model for different hardware, e.g. Raspberry Pi, CPU, GPU, mobile
- iterative train and evaluation on small hardware is too expensive
- option 1 proxy task
- train only for few epoch and evaluate
- cons: some model might converge quickly at the beginning but slow later, some model might converge slow at the beginning
- estimate latency based on FLOPs and params count: not accurate
- use small architecture space e.g. low depth to estimate large architecture’s latency
- option 2 proxylessNas
- overparam the model, 2 types of params,
- model params
- architecture params: params to decide whether to enable some architecture
- iterative train model params and architecture params
- pros: convert to single training process
- pros: prune redundant architecture
- binary architecture params: hold one architecture’s param in memory in each model param training iteration to reduce memory footprint
- Can expand option 1 capability. e.g.
- small architecture space → large architecture space
- FLOPs and params count → full profile
- train only for few epoch and evaluate → full training
- MAC ≠ real hardware efficiency
- some models have higher MAC but lower latency
- GPU latency is less sensitive to more channels but more sensitive to more layers, because a GPU parallelizes well within a layer while layers still run one after another
- latency prediction
- Layer-wise latency profiling: latency lookup table
- dimensions: Architecture, op 1, op 2, op 3, ..
- multiple ops dimension because latency is not linear to op number,
- some ops can be executed in parallel
- Network-wise latency profiling: latency prediction model based on kernel size, resolution, width etc
- cost of training and searching on hardware is high while cost of post searching training is much higher
- Many hardware targets are not GPU-like, e.g. CPU
- Once for all searching
- train a large model first
- Pick a subnet, fine tune on a specific hardware, and evaluate accuracy and performance
- progressively prune the network to evaluate. e.g. prune resolution, prune width, prune depth, prune kernel size
- neural accelerator architecture search (NAAS)
- to design hardware architecture
- 2 types of params to optimize
- architecture sizing, like buffer size
- connectivity params, like parallelism, loop order
- which processing unit to parallelize
- loop order of different processing units
- connectivity params encoding
- they are non numerical, so need encoding
- importance based encoding, put more important dimension / processing unit at higher encoding bit
- iteratively do NAS and NAAS for model and then hardware
TinyML on Microcontrollers
- We need to reduce both weights and activations to fit DNNs for tinyML
- Memory (SRAM) holds input and output activations; storage (flash/DRAM) holds the kernels
- flash usage
- = model size, hold entire model
- SRAM usage
- = Input activation + output activation
- Dynamic, different for each layer
- We care about peak SRAM
- (Weights are not counted since they can be partially fetched)
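The flash and peak-SRAM accounting above can be sketched for a toy network. The layer shapes and the 1-byte (int8) element size are assumptions made up for illustration; only the formulas (flash = total weights, SRAM = input + output activation of the worst layer) come from the notes.

```python
def flash_bytes(layers, bytes_per_weight=1):
    """Flash must hold the entire model: the sum of all layer weights."""
    return sum(layer["weights"] for layer in layers) * bytes_per_weight


def peak_sram_bytes(layers, bytes_per_activation=1):
    """Peak SRAM is set by the layer with the largest input + output activation."""
    return max(layer["in_act"] + layer["out_act"] for layer in layers) * bytes_per_activation


# Hypothetical int8 network: two 3x3 convs (stride 2) and a classifier.
layers = [
    {"name": "conv1", "weights": 3 * 3 * 3 * 16,    "in_act": 64 * 64 * 3,  "out_act": 32 * 32 * 16},
    {"name": "conv2", "weights": 3 * 3 * 16 * 32,   "in_act": 32 * 32 * 16, "out_act": 16 * 16 * 32},
    {"name": "fc",    "weights": 16 * 16 * 32 * 10, "in_act": 16 * 16 * 32, "out_act": 10},
]
flash = flash_bytes(layers)
peak = peak_sram_bytes(layers)
peak_layer = max(layers, key=lambda l: l["in_act"] + l["out_act"])["name"]
```

In this toy case the classifier dominates flash while the first convolution dominates SRAM, which mirrors the observation that the memory peak tends to sit in the early, high-resolution layers.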
- Cloud and mobile CNNs cannot fit tinyML
- We need to reduce both model size and activation size
- MobileNetV2 reduces only model size but not peak activation size
- MCUNet reduces not only model size but also activation size
- MCUNet
- Co-design of both TinyNAS and TinyEngine (efficient inference engine)
- TinyNas: Two-Stage NAS for Tiny Memory Constraints
- First design the design space, then search the subnet
- design space includes resolution, width multiplier (a multiplier on channel number )
- subnet search includes Kernel sizes, Expansion ratios (channel expansion ratios in middle stage), #blocks per stage
- 1. Automated search space optimization
- search width scale and resolution
- the larger the SRAM, the higher the resolution, because it can hold more activations
- the larger the flash, the larger the channel width but the lower the resolution, because the model is larger
- 2. Resource-constrained model specialization
- one shot NAS through weight sharing
- each iteration
- random sample subnetwork
- Jointly fine-tune multiple sub-networks
- and do evolutionary search
- Small child networks are nested in large ones.
- Outperforming Manual&NAS Models
- TinyNAS designs networks with more uniform peak memory for each block, allowing us to fit a larger model at the same amount of memory
- achieve higher accuracy with less memory
- MCUNetV2: Patch-based Inference
- peak usually happens at early layers because early layers has higher resolution
- save memory with patch based inference
- Divide early layers input into multiple patches,
- do multiple layers of computation on same patch together to reduce memory footprint in each iteration
- Problem: repeated convolution computation at the edge of all patches
- Network Redistribution to Reduce Overhead
- less convolution at the early layers and more convolution at later layers
- 1x1 conv at the edge of each patch
- equivalent to old version
- Joint Automated Search for Optimization
- Kernel size in per-patch stage is small to reduce spatial overlapping
- Expansion ratio in middle stage is small to reduce peak memory; large in later stage to boost performance.
- Larger input resolution for resolution-sensitive datasets like VWW(MCUNet: 128x128)
- application
- for vision
- patch based inference allows larger resolution
- object detection is more sensitive to resolution size
- we can even learn new task on mcu
- for audio
- divide audio input into multiple overlapping frame, each frame length = t
- do frequency transform on each frame,
- then generate a 2 dimension feature data. one dimension is time (more like frame index), the other is frequency, value is frequency transformed value.
- this is similar to an image
- then the input can be sent to MCUNet and apply cnn on it
- CNN performs better than DNN (deep neural network)
- for time series/ anomaly detection
- Detect Anomaly with Autoencoders
- during inference, if the input is too different from the output, then it is anomaly
- Parallel Computing Techniques
- loop optimization, Optimize locality and reduce branching overhead
- loop reordering Optimizes locality by reordering the sequence of loops
- improve data locality of caches
- data movement (cache miss) is expensive
- chunk of memory is fetched at a time
- reduce cache miss by loop reordering
- always read different matrix data in the same row first instead of by column.
- loop reordering could cause multiple writes in lowest loop by writing to different target matrix element. But the overall performance is improved
- loop tiling, reduce memory access by partitioning a loop's iteration space
- partition each matrix into small tiles so that one tile can completely fit into a cache
- do calculation on a tile first so that in one loop, a tile is completely used and no need to be reloaded into cache again in later loop
- for multiple level of cache, we can have multiple level of tiles
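Loop reordering and loop tiling for matrix multiplication can be sketched as below. Python is used only to show the loop structure; the cache benefit appears in compiled code, and the tile size here is an arbitrary assumption rather than something tuned to a cache.

```python
def matmul_naive(A, B):
    """Reference i-j-p loop order: walks B column-wise in the inner loop."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Tiled version with i-p-j order inside each tile (row-wise access to B and C)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):                 # loop over tiles
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):        # loop inside one tile
                    for p in range(p0, min(p0 + tile, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            C[i][j] += a * B[p][j]
    return C


A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]
B = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
C_ref = matmul_naive(A, B)
C_tiled = matmul_tiled(A, B, tile=2)
```

The tiled version writes each `C[i][j]` several times (once per `p`), which is the extra-writes trade-off noted for loop reordering, but every access in the innermost loop moves along a row.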
- Loop unrolling: reduces branching overhead at the expense of its binary size.
- overhead: loop iterator boundary check and increment operation. e.g. k from 0 - N, need to check k< N for many times
- solution: replicate the loop body for multiple times, increase binary size but reduce overhead
- SIMD (single instruction, multiple data) programming:
- Performs the same operation on multiple data points simultaneously.
- Complex instruction set computer CISC
- Reduced instruction set computer RISC
- Key features
- Specialized registers that can hold and process multiple data elements.
- one operation on multiple data at the same time
- pros
- Increase computational throughput and speed
- Improve energy efficiency
- multithreading
- shared memory programming
- pros:
- parallelism
- more responsive
- higher resource utilization
- Simplified Program Structure:
- Multithreading can help break down complex problems into simpler, smaller tasks.
- Pthreads: A C library for creating and managing POSIX threads.
- OpenMP: An API for C, C++, and Fortran to support parallel programming using shared-memory model.
- CUDA programming
- Use GPUs to accelerate computation.
- CUDA is a C-like language to express programs that run on GPUs using the compute-mode hardware interface
- hierarchy of CUDA threads
- define block number and then threads per block
- CPU allocate memory and start CUDA thread at the beginning
- Memory Model
- Distinct host and device address spaces
- Data can be moved between address spaces
- host cannot access data in device address
- much faster than CPU programming
- most of the time is still spent on data movement; the CUDA core computation itself only takes a small amount of time
- CUDA Programming on Tensor Cores: Higher throughput and more data types
- Matrix multiplication intrinsics : break down operand A matrix and B matrix into different number of tiles and multiply them in parallel
- inference optimization
- to enhance computation speed and reduce memory usage
- Image to Column (Im2col) convolution:
- Rearranges input data to directly utilize matrix multiplication kernels.
- flatten and expand the input and weight matrix into K^2 *C size where K is kernel size and C is channel length
- Im2col is a technique to convert the image in a form such that Generalized Matrix Multiplication (GEMM) calls for dot products.
- Pro
- Utilize GEMM for convolution
- Con
- Require additional memory space.
- The implicit GEMM can solve the additional memory problem.
- A variant of direct convolution, and operates directly on the input weight and activation tensors.
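The im2col rearrangement can be sketched for a single image with stride 1 and no padding. Each output position becomes one row of K² × C values, so convolution turns into a matrix product with the flattened filters. Function names and the tiny input are illustrative assumptions.

```python
def im2col(x, k):
    """x: [C][H][W] -> one row per output position, each with C*k*k values."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([x[ch][i + di][j + dj]
                         for ch in range(c) for di in range(k) for dj in range(k)])
    return rows


def conv_via_gemm(x, filters, k):
    """filters: [F][C][k][k]. Convolution as (flattened filters) x (im2col rows)."""
    cols = im2col(x, k)
    flat = [[f[ch][di][dj] for ch in range(len(f)) for di in range(k) for dj in range(k)]
            for f in filters]
    return [[sum(a * b for a, b in zip(row, wf)) for row in cols] for wf in flat]


def conv_direct(x, filters, k):
    """Reference sliding-window convolution for comparison."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    return [[sum(x[ch][i + di][j + dj] * f[ch][di][dj]
                 for ch in range(c) for di in range(k) for dj in range(k))
             for i in range(h - k + 1) for j in range(w - k + 1)]
            for f in filters]


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[9, 8, 7], [6, 5, 4], [3, 2, 1]]]          # C=2, H=W=3
filters = [[[[1, 0], [0, -1]], [[0, 1], [1, 0]]]]  # F=1, C=2, k=2
cols = im2col(x, 2)
out_gemm = conv_via_gemm(x, filters, 2)
out_direct = conv_direct(x, filters, 2)
```

Here the 18 input values expand into a 4 × 8 matrix (32 values), which is the additional memory cost listed under "Con".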
- In-place depth-wise convolution:
- Reuse the input buffer to write the output data, so as to reduce peak SRAM memory.
- many neural networks use the Inverted Residual Block with depth-wise convolutions, which reduces model size and FLOPs but significantly increases peak memory (3-6x)
- To reduce the peak memory of depth-wise convolution, we utilize the “in-place” updating policy with a temporary buffer.
- NHWC for point-wise convolution, and NCHW for depth-wise convolution:
- Exploit the appropriate data layout for different types of convolution.
- point-wise conv: conv on channel instead of height and width
- N is batch, h height, w width, c channel
- NCHW vs NHWC: different data storage order. NCHW uses HW as the inner dimensions when storing input data, while NHWC uses the channel as the inner dimension
- So NCHW is better for depth wise conv while NHWC is better for point wise conv due to different locality
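The locality argument can be made concrete by computing flat memory offsets for the two layouts. The helper names and the tensor shape are assumptions for illustration.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index when width is the innermost dimension."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index when channel is the innermost dimension."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 8, 4, 4

# Stride between neighbouring channels at the same pixel (what point-wise conv reads).
channel_stride_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
channel_stride_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)

# Stride between neighbouring pixels in the same channel (what depth-wise conv reads).
pixel_stride_nchw = offset_nchw(0, 0, 0, 1, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
pixel_stride_nhwc = offset_nhwc(0, 0, 0, 1, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)
```

A stride of 1 means consecutive reads hit adjacent memory: NHWC gives that for the channel reduction of point-wise convolution, NCHW for the spatial window of depth-wise convolution.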
- Winograd convolution:
- Reduce the number of multiplications to enhance computing speed for convolution
- apply transformation in advance to both input and weight matrix to reduce matrix multiplication into point-wise multiplication
- the transformation can be done in advance which doesn’t take too much resource
- for example, for 3 x 3 conv, 4 output, traditionally it requires 9 x 4 x C MAC, while with transformed weight and input, it only needs 16 x C MACs
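The multiplication savings can be verified on the one-dimensional building block F(2, 3): two outputs of a 3-tap filter with 4 multiplications instead of 6. Nesting this in two dimensions gives the 16 versus 36 multiplications per channel mentioned above. This is a sketch of the standard Winograd transform written out by hand, with made-up input values.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d, using 4 multiplications."""
    # Filter transform: depends only on the weights, so it can be precomputed.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]
    # Element-wise (point-wise) multiplication: the only 4 multiplications.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3
    # Output transform.
    return [m0 + m1 + m2, m1 - m2 - m3]


def direct_f23(d, g):
    """Same two outputs computed directly with 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


d = [1, 2, 3, 4]
g = [2, 4, 6]
y_fast = winograd_f23(d, g)
y_ref = direct_f23(d, g)
```

The division by 2 lives in the filter transform, so at inference time only the input transform, the point-wise products, and the output transform remain.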
Transformer and LLM
- layer norm and residual connection is added for training stability
- the pre-norm (layer norm before Feedforward network / attention) design is more popular now due to better training stability
- Comparing absolute/relative positional encoding
- Absolute positional encoding fuses the positional information into the input embeddings (thus Q/K/V). The information is propagated through the entire Transformer.
- Relative positional encoding provides relative distance information by impacting the attention scores (either adding a bias or modifying queries and keys), not V.
- Advantage: generalize to sequence length not seen during training, i.e., train short, test long (does not always apply)
- rotary positional embedding (RoPE), used in Llama 2
- adv: able to extend context window by interpolating
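A small sketch of why rotary embeddings carry relative position and how interpolation works. It assumes the common formulation that rotates consecutive pairs of dimensions with frequencies `base ** (-i / d)`. The `scale` parameter is a hypothetical knob standing in for position interpolation (a scale below 1 squeezes long positions into the trained range).

```python
import math

def rope(vec, pos, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of vec by angles proportional to pos * scale."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = (pos * scale) * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# The attention score depends only on the distance between positions (4 here).
score_near = dot(rope(q, 3), rope(k, 7))
score_far = dot(rope(q, 103), rope(k, 107))

# Interpolation: position 8 at scale 0.5 is encoded like position 4.
interpolated = rope(q, 8, scale=0.5)
```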
- KV cache optimizations:
- Multi-Head Attention -> Multi-Query Attention -> Grouped-Query Attention
- The KV cache could be large with long context
- During Transformer decoding (GPT-style), we need to store the Keys and Values of all previous tokens so that we can perform the attention computation, namely the KV cache
- Only need the current query token
- the KV cache size can quickly grow larger than the model size
- Reduce the KV cache memory usage with MQA/GQA
- Reducing the KV cache size by reducing #kv-heads
- Multi-head attention (MHA): N heads for query, N heads for key/value
- Multi-query attention (MQA): N heads for query, 1 head for key/value
- Grouped-query attention (GQA): N heads for query, G heads for key/value (typically G = N/8)
- GQA matches the accuracy of MHA under a large model size
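A back-of-the-envelope sketch of the saving from fewer KV heads. The model configuration (32 layers, 32 query heads, head dimension 128, 4096-token context, FP16) is a made-up example, not a figure from the notes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer and every cached token."""
    # Leading 2 = one tensor for keys plus one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

N = 32  # hypothetical number of query heads
mha = kv_cache_bytes(32, N, 128, 4096)       # N KV heads
gqa = kv_cache_bytes(32, N // 8, 128, 4096)  # G = N/8 KV heads
mqa = kv_cache_bytes(32, 1, 128, 4096)       # 1 KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

The cache scales linearly with the number of KV heads, so GQA with G = N/8 is 8x smaller than MHA and MQA is N times smaller.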
- FFN -> GLU (gated linear unit)
- How to scale up?
- The Chinchilla Law
- We need to scale up both the model size and data size for training to have the best training computation vs. accuracy trade-off
- Note: the trade-off is different if we consider the inference computation trade-off
- You want to train a smaller model longer to save inference costs (e.g., LLaMA)
- Efficient inference algorithms for LLMs
- Quantization: SmoothQuant, AWQ, TinyChat
- smoothQuant (W8A8)
- 8-bit weight, 8-bit activation (W8A8)
- for large LLMs, the activations contain many outliers and span a very large range, so it is hard to quantize the activations and accuracy degrades
- but weights are in small range and easy to quantize
- Solution: for Y = XW, scale X down (e.g., multiply by 0.1) and scale the matching weights up (e.g., by 10x), per input channel. The overall value doesn't change, but X (the activation output) is much easier to quantize while the weights are only slightly more difficult to quantize
- it doesn't increase computation cost because W can be multiplied and stored in advance, and the X scaling can be merged into the LayerNorm calculation of the previous stage
- in the attention calculation, int8 is used. In LayerNorm and softmax, floating point is used
- result: accuracy does not degrade as the model gets larger, latency is reduced, memory is also reduced
- SmoothQuant can losslessly quantize the LLaMA families, further lowering the hardware barrier
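A minimal sketch of the smoothing step. It assumes a per-input-channel scale of the form s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with alpha = 0.5 as an illustrative default. The toy matrices and helper names are made up, and it assumes no channel is all zeros.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth(X, W, alpha=0.5):
    """Move quantization difficulty from activations X to weights W, channel by channel."""
    n_ch = len(W)  # input channels = columns of X = rows of W
    s = []
    for j in range(n_ch):
        x_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        s.append(x_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_ch)] for row in X]  # scaled-down activations
    W_s = [[w * s[j] for w in W[j]] for j in range(n_ch)]      # scaled-up weights, done offline
    return X_s, W_s, s

X = [[0.1, 60.0, -0.2], [0.3, -80.0, 0.1]]  # channel 1 is an activation outlier
W = [[0.2, -0.1], [0.1, 0.3], [-0.3, 0.2]]

X_s, W_s, s = smooth(X, W)
Y, Y_s = matmul(X, W), matmul(X_s, W_s)
print(max(abs(v) for r in X for v in r), max(abs(v) for r in X_s for v in r))
```

The product is unchanged up to floating-point error, while the activation range shrinks, which is what makes 8-bit activation quantization workable.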
- W4A16 for Single-batch serving
- W8A8 quantization is good for batch serving (e.g., batch size 128)
- But single-query LLM inference (e.g., local) is still highly memory-bound
- We need low-bit weight-only quantization (e.g., W4A16) for this setting
- AWQ for Low-bit Weight-only Quantization
- targeting group wise W3/W4 quantization
- 4-bit quantization increases perplexity
- the larger the perplexity, the worse the performance
- Group-wise/block-wise quantization (e.g., 64/128/256) offers a better accuracy-model size trade-off.
- But round-to-nearest (RTN) quantization still leaves a performance gap (e.g., INT3-g128)
- We find that weights are not equally important, keeping only 1% of salient weight channels in FP16 can greatly improve perplexity
- how to select salient weight?
- look at activation distribution, not weights
- simply scaling up the salient weight channels (and scaling down the related inputs) improves perplexity by a similar amount
- Multiplying the salient channels with s > 1 reduces its quantization error
- pruning/sparsity: SpAtten, H2O, MoE
- Wanda: pruning by considering weights and activations
- Use |weight| * ||activation|| as the criteria for pruning
- SpAtten: token pruning & head pruning
- the attention computation itself has no weights, so only tokens and heads can be pruned
- Cascade pruning of unimportant tokens and heads. — prune some unimportant tokens first and then recalculate and prune further
- Tokens with a small cumulative attention are pruned away.
- V pruning: don’t fetch V if QK is small.
- Progressive quantization: low precision first, if not confident => high precision
- in softmax, if with low precision, no significantly unimportant one can be found, then try high precision
- Attention sparsity
- H2O: token pruning in KV cache
- Keep the local tokens and heavy Hitter Tokens (H2) in the cache
- good for static sparsity
- DejaVu (input dependent sparsity)
- Static sparsity: hurts the accuracy with a medium-high sparsity
- Contextual sparsity: small, input-dependent sets of redundant heads and features
- Contextual sparsity exists and can be predicted (using an async predictor head)
- Accelerate inference without hurting model quality
- mixture of expert
- Efficient inference systems for LLMs
- vLLM: store large KV cache similar to operating system’s page structure
- KV cache could be large
- reserving KV cache space up front for each request can waste a lot of memory when QPS is high
- divide KV cache into multiple blocks
- logical page, physical page and a map between logic page and physical page
- physical page is more compact and doesn’t need to follow the same order of logical page
- we can even share prefix sequence, e.g. in copilot, share prefix to generate different autocomplete suggestion
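The logical-to-physical mapping can be sketched as a per-request block table. This is a toy illustration of the paging idea, not vLLM's actual data structures. The class and method names are hypothetical, and prefix sharing is left out.

```python
class PagedKVCache:
    """Toy block table: each request maps logical blocks to arbitrary physical blocks."""

    def __init__(self, num_physical_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))  # ids of unused physical blocks
        self.block_tables = {}                        # request id -> physical block ids
        self.lengths = {}                             # request id -> tokens stored

    def append_token(self, req):
        table = self.block_tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:       # last block is full, or no block yet
            table.append(self.free.pop())  # any free block works; no contiguity needed
        self.lengths[req] = n + 1

    def release(self, req):
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(num_physical_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens need 2 blocks
for _ in range(3):
    cache.append_token("b")  # 3 tokens need 1 block
print(cache.block_tables, len(cache.free))
```

Memory is claimed one block at a time as tokens arrive, so the waste per request is at most one partially filled block.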
- Streaming LLM
- Urgent need for LLMs in streaming applications such as multi-round dialogues, where long interactions are needed
- Challenges:
- Extensive memory consumption during the decoding stage.
- Inability of popular LLMs to generalize to longer text sequences.
- attention sink:
- if the first tokens are evicted, the perplexity increases significantly, no matter what the first tokens are
- because softmax weights must sum to 1 and every later token can attend to the first tokens, a lot of attention ends up on them
- solution: sliding window attention while always keep first N tokens
- slide through tokens
- with paged attention, the first n tokens can be preserved to achieve better performance
- can also pre train with static first n tokens
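The "sliding window plus first N tokens" policy amounts to choosing which positions stay in the KV cache. A minimal sketch follows; the function name and default sizes are illustrative assumptions.

```python
def kept_positions(seq_len, n_sink=4, window=8):
    """Token positions kept in the KV cache: the first n_sink tokens (attention
    sinks) plus the most recent `window` tokens."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # everything still fits
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

print(kept_positions(20, n_sink=2, window=3))
print(len(kept_positions(100000)))  # cache size stays constant as the stream grows
```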
- flash attention
- speculative decoding
- Efficient fine tuning for LLMs
- LoRA / QLoRA
- adapter
- LoRA inserts a parallel low-rank branch next to the frozen weights of the main transformer
- adapter inserts a layer after the FFN
- cannot be fused
- Smaller adapter for transfer learning
- prompt tuning
- train a prompt to be prepended to given prompt
- no need to fine tune separate models for different task, just tune different prompt prefixes
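The LoRA vs adapter contrast above comes down to fusion. A parallel low-rank branch B A can be folded into the frozen weight W after training, so inference cost is unchanged, while a sequential adapter layer cannot. A tiny integer example, with made-up matrices and the usual alpha / r scaling factor omitted:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen pretrained weight (3 x 3)
B = [[1], [2], [0]]                    # trainable, d_out x r with rank r = 1
A = [[0, 1, -1]]                       # trainable, r x d_in
x = [[1], [2], [3]]                    # column-vector input

# During fine-tuning: main path plus parallel low-rank branch.
y_branch = add(matmul(W, x), matmul(B, matmul(A, x)))

# After fine-tuning: fuse once, then run a single matmul at inference time.
W_fused = add(W, matmul(B, A))
y_fused = matmul(W_fused, x)
print(y_branch, y_fused)
```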
Vision transformer
- Apply the standard transformer encoder (ViT)
- Convert 2D image to a sequence of patches
- For a 96 x 96 image, use a 32 x 32 convolution with stride = 32 and padding = 0 to divide the image into a 3 x 3 grid of patches and convert each patch into a single token
- in_channels=3, out_channels = 768
- apply a positional embedding to denote the position of each patch
- feed patch embeddings to standard transformer encoder
- ViT outperforms CNNs when the data size is large
- High resolution is essential for achieving good performances in dense prediction tasks (e.g., segmentation).
- ViT’s computational cost grows quadratically as the input resolution increases.
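A quick sketch of the token arithmetic behind the patching example and the resolution cost. The function names are illustrative, and the "pairs" count only tracks the size of the attention matrix, not full FLOPs.

```python
def num_patch_tokens(height, width, patch):
    """Number of tokens when an image is cut into non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch) * (width // patch)

def attention_pairs(tokens):
    """Entries in the token-to-token attention matrix."""
    return tokens * tokens

small = num_patch_tokens(96, 96, 32)   # 3 x 3 grid
large = num_patch_tokens(192, 192, 32) # doubling each side gives 4x the tokens
print(small, attention_pairs(small))
print(large, attention_pairs(large))   # and 16x the attention matrix
```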
- Efficient ViT and Acceleration Techniques
- Window Attention
- Instead of attention among all patches, divide the feature map into local windows and compute attention only among the patches inside each window
- also gradually downsample the feature map size.
- drawback: no information exchange between windows
- solution: shift the windows in the next layer so that each window covers patches from different previous windows
- FlatFormer: Sparse Window Transformer with equal size grouping
- some inputs are sparse, e.g. point clouds (only a few positions have values)
- instead of group by proximity, just group non empty patch by equal size
- though proximity is not guaranteed, accuracy is still good
- linear attention — efficientViT
- Replace SoftMax with a linear similarity function (ReLU)
- Since everything is then linear, the matrix products can be reordered: compute the small K^T V product first and apply the normalization afterwards, which reduces the complexity from quadratic to linear in the number of tokens
- But ReLU linear attention cannot produce sharp distributions. Thus, it is good at capturing global context information but bad at capturing local information.
- It also lacks multi-scale learning ability.
- Solution:
- multiple ReLU linear attention branches to capture features at different scales
- some branches contain a convolution layer before attention layer which merge nearby tokens into one and generate a Q K V at a different scale.
- sparse attention — SparseViT
- in each layer, prune some image patches based on l2 activation magnitude to filter out blocks with less information
- better than converting the whole image into low resolution first
- During training, fine tune the model with different activation sparsity configuration at each iteration
- During inference, find the optimal layerwise sparsity activation configuration under a latency constraint with evolutionary search
- Self-supervised learning for ViT
- Traditional ViT challenge: need large labeled dataset to outperform CNN but it is costly to have large labeled dataset
- One option is to train on unlabeled dataset first and fine tune on labeled dataset
- Contrastive learning
- Augment the labeled image by rotating, changing color etc
- The model's output for augmented data should be closer to the original data's output and farther from the output for data with a different label
- During training we can also correlate image and text using CLIP
- Masked image modeling
- mask some image patches and predict the masked patches
- For text masking , BERT only masks 15% tokens. The masked token still occupy an input position
- for Image masking, 75% mask ratio
- Use a heavy encoder and a light decoder
- the masked patches are not fed to the encoder, which increases efficiency
- mask tokens for the masked patches are fed to the decoder, which predicts their content
- Multi-modal LLM
- Cross attention (Flamingo)
- 2 models with additional components
- LLM text model and ViT model.
- during training, keep the 2 models frozen
- Additional components
- above ViT model, add a perceiver resampler which always outputs same number of token regardless of input image resolution or input video length
- perceiver resampler is a transformer model, K V is generated from input image, Q is generated from a prelearned static set of tokens
- So the output token number is always equal to size of prelearned tokens
- Above text LLM, add a gated cross attention layer to take both ViT model output and llm output
- Use a tanh gate to control the amount of visual information
- result
- good at in context learning
- good at visual dialog
- Visual token as input (PaLM-E)
- treat visual info as token and input to LLM together with text token. e.g. here is an <apple image>, do …
- Capability: mobile manipulation, motion planning, visual Q&A
- Application: handle corner case of autonomous driving
GAN, Video and Point Cloud
- GAN
- conditional: image + label (text, or segmentation map, or strokes)
- unconditional: no label
- generative models are more expensive than recognition model
- Efficient GAN
- GAN Compression (compress generators with NAS+distillation)
- reconstruction loss + distillation loss (same internal stage output difference) + gan loss
- neural architecture search: automated channel reduction
- create super child model with different channel configurations
- within some constraint, search for best configuration
- Anycost GAN (dynamic cost vs. quality trade-off)
- motivation: convert image to a latent space so that we can easily edit image property like hair color, smile style, skin color etc
- we also want an efficient model to generate edited result quickly for best user exp
- solution: 2 models, one smaller model which generate low quality image quickly and one full model which generates high quality image. 2 options:
- Different resolution
- generative model in multiple resolution, up sample resolution in order, discriminator model with multiple resolutions also
- 3 options:
- Style-GAN: only feed generative result to high resolution discriminator — low quality in lower resolution
- MSG-GAN: feed all resolution generation results to all resolution discriminator — too expensive in training
- sampling based multi resolution: in each iteration, only sample one resolution generator output and send to one resolution discriminator
- Different channel number
- similar to above approach with multi resolution generators and discriminators but different channel numbers in generators, while discriminator doesn’t care about channel numbers still multi resolution
- combine channel numbers as architecture info with discriminator output so that discriminator output is partitioned by architecture info
- Differentiable Augmentation (data efficient training of GANs)
- GAN needs large training data
- GANs Heavily Deteriorate Given Limited Data
- output quality deteriorates with less data
- the discriminator overfits with less data
- Solution: data augmentation, e.g. rotating or zooming the image
- When training the discriminator, apply the same augmentation to both the real images and the generator's outputs before feeding them to D
- When training the generator, also augment its outputs before feeding them to D; the augmentation has to be differentiable so that gradients can flow back to the generator
- Efficient Video Understanding
- recognition model
- Difference between video and image processing: temporal modeling
- 2D CNNs for video understanding
- 3 options
- Input each frame into cnn and get an aggregated score
- feed both the frames and the motion between frames (optical flow) into 2 separate models
- 2D CNN + Post-fusion (e.g., LSTM): add a LSTM layer to model temporal relationship after cnn layer
- Pro
- Compute-efficient. Reuse 2D CNNs from image recognition.
- Con
- Aggregating 2D CNNs cannot model temporal information. Low accuracy on video benchmarks.
- Optical flow is very slow to compute (much slower than the deep network itself).
- Late fusion cannot model low-level temporal relationships
- 3D CNNs for video understanding
- Option 1: feed 3d directly
- Option 2: initialize with pre trained 2D model and inflate its params by duplicating 2d params multiple times e.g. inflate 7 x 7 conv to 7x7x7
- Pros
- Jointly modeling spatiotemporal information
- Can model low-, middle-, high-level information
- Cons
- Large cost (model size, computation) due to the extra temporal dimension
- more params, easy to overfit
- not enough good video data
- Temporal Shift Module (TSM)
- to improve efficiency
- use a 2D CNN model but shift some channels along the temporal dimension, so that the features of one frame also contain channels shifted in from neighboring frames
- 2 versions
- offline bi-directional, shift both forward and backward on temporal dimension
- application: action recognition, fall detection, video recommendation, etc.
- use Relu to fill missing channels (due to being shifted away)
- online uni-directional, shift only one direction for video streaming
- Application: autonomous driving, improving detection with temporal cues, etc.
- cache previous frame
- could also help static object detection in autonomous driving, e.g. when one frame is affected by glare
- efficient point cloud recognition
- mainly for autonomous driving
- point cloud is sampled from Lidar sensor
- a point cloud is represented as 3D coordinates + features; an example feature is the intensity of a point
- points are stored in no particular order, unlike the pixels of an image
- points are also sparse: only 10% of the 3D space has points
- So it takes lots of computation and memory movement to figure out spatial relationship between points during training time
- one option is Voxel CNN, which divides 3d space into blocks and one point for each block so that there are fewer points
- cons: low resolution, less information
- Another option: Point-Voxel CNN:
- 2 models,
- one model handles random points data without its proximity info. It could capture some high resolution details on small objects
- another model handles Voxelized data with low resolution. it could capture large objects
- and merge the 2 models result together
- so the results of the 2 models can complement each other
- it could achieve high accuracy with low memory and low latency
- variation: use high resolution Voxelized data, but divide the space into blocks and only convolve on each block
- third option 3D Neural Architecture Search with SPVConv
- optimize the model with NAS
- fourth option: Range-Point-Voxel Convolution (RPVConv)
- a third branch works on the range image, which is a 2D image of the point cloud formed from the light rays at particular angles
- BEVFusion: multi-task, multi-sensor fusion
- BEV: bird's-eye view
- multi sensor: camera, lidar, radar
- Multiple tasks: detecting the vehicles and pedestrians, segmenting the lanes and drivable regions, etc
- convert camera images and point cloud data into BEV space separately with 2 separate conv models
- merge them into one space with more features
- send to another model for further processing
Diffusion model
- DDPM Denoising Diffusion Probabilistic Models
- forward process: Xt is sampled from a Gaussian distribution centered on a scaled Xt-1, q(Xt | Xt-1) = N(sqrt(1 - beta_t) Xt-1, beta_t I), where beta_t is a predefined param that controls the diffusion speed
- backward process: train to predict noise added
- training algorithm:
- train a unet with theta param to predict added noise.
- the input is a real image + noise calculated from forward process
- output is predicted added noise from unet
- loss is the difference between the predicted noise and the noise that was actually added
- sampling algorithm
- given a Xt, calculate predicted noise using Unet (most computation here)
- subtract calculated noise from Xt
- add fresh white noise, scaled according to the predefined beta_t, to obtain Xt-1
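The forward process and one sampling step can be sketched on a single scalar "pixel". This is a sketch under stated assumptions: a linear beta schedule from 1e-4 to 0.02 over 1000 steps, and noise variance sigma_t^2 = beta_t in the reverse step. The U-Net is replaced by passing the noise prediction in as an argument.

```python
import math

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and cumulative products of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars

def q_sample(x0, t, eps, alpha_bars):
    """Forward process in closed form: jump from x0 straight to x_t."""
    a = alpha_bars[t]
    return math.sqrt(a) * x0 + math.sqrt(1 - a) * eps

def p_sample_step(x_t, t, eps_pred, betas, alpha_bars, z):
    """One reverse step: remove the predicted noise, then add fresh noise z."""
    beta, a_bar = betas[t], alpha_bars[t]
    mean = (x_t - beta / math.sqrt(1 - a_bar) * eps_pred) / math.sqrt(1 - beta)
    sigma = math.sqrt(beta) if t > 0 else 0.0  # no noise is added at the final step
    return mean + sigma * z

betas, alpha_bars = make_schedule()
x0, eps = 0.7, -1.3
x_first = q_sample(x0, 0, eps, alpha_bars)
# With a perfect noise prediction, the last reverse step recovers x0 exactly.
recovered = p_sample_step(x_first, 0, eps, betas, alpha_bars, z=1.0)
print(recovered, alpha_bars[-1])
```

The training loss for one sample would be the squared difference between `eps` and the network's prediction for `q_sample(x0, t, eps, alpha_bars)`.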
- conditional model
- Scalar condition: condition on a single scalar (e.g., class ID “cat”)
- simple spatial addition: convert the scalar to an embedding of batch size x channel size, and add each channel's value to all pixel values of that channel
- adaptive normalization: convert the scalar embedding to a scale and a bias, which are learnable
- Text condition: condition on a sequence of text tokens (e.g., “photo of a moon gate”)
- cross attention, K V comes from text embedding, Q from image
- Pixel-wise condition: condition on a spatial map (e.g., semantic map, canny edge)
- simple concatenation: concat image and control image and train together
- controlnet, freeze original unet, train a separate model with control image input, combine 2 models output and calculate loss to train the control model
- conditional image generation
- option 1: train a separate classifier and guide sampling with the classifier output so that the generated image belongs to one class
- trade diversity for quality
- can only do classification
- option 2: classifier free, no extra classifier model
- can generate image with any text input
- during training, randomly drop out the condition to improve generalization
- during sampling, combine the unconditional noise prediction and the conditional noise prediction
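The sampling-time combination is usually written as an extrapolation from the unconditional prediction towards the conditional one. A minimal sketch, assuming the convention eps = eps_uncond + w * (eps_cond - eps_uncond); some papers parameterize the weight differently.

```python
def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction towards the condition."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # prediction with the condition dropped
eps_cond = [1.0, 3.0]    # prediction with the condition present

print(cfg_noise(eps_uncond, eps_cond, 0.0))  # purely unconditional
print(cfg_noise(eps_uncond, eps_cond, 1.0))  # purely conditional
print(cfg_noise(eps_uncond, eps_cond, 2.0))  # stronger conditioning, less diversity
```

Each sampling step needs both predictions, so it costs 2 forward passes of the same network.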
- latent diffusion model
- VAE + diffusion model,
- use encoder to reduce an image’s resolution into a latent space with low resolution
- then apply diffusion model on latent space
- during sampling, use decoder to increase diffusion model output’s resolution
- app: stable diffusion
- image editing
- stroke based or image to image personalization
- add noise to input image and denoise using diffusion model
- text based editing
- cross attention
- for word swap, just swap V with new word’s V
- for additional text, add additional K
- for scaling like “make something less crowded” , adjust value in K V vector
- image personalization
- do editing for personalized image, e.g. add something to your own dog image
- option 1:Dreambooth: fine tune on personalization task for each image
- option 2: FastComposer
- during training,
- convert text to embedding and concat each word embedding with related images, and pass embedding to diffusion model to generate image
- segment composed image into segmentation map
- the above 2 output diff should be minimized
- during sampling,
- still generate text and image embedding
- sample based on text first, then based on image to inject image
- alpha param to control the level towards original image or target image
- trilemma
- High quality samples, fast sampling, mode coverage / diversity
- Diffusion model: High quality samples, mode coverage / diversity
- VAE: fast sampling, mode coverage / diversity
- GAN: High quality samples, fast sampling
- Fast sampling techniques
- Denoising Diffusion Implicit Model (DDIM)
- Xt-1 relies on both Xt and X0
- X0 is estimated using a formula
- The original DDPM requires small beta (step size) to generate good image
- but DDIM doesn’t require small beta
- so it can skip steps
- progressive distillation:
- teacher and student models: the student learns to match in one step what the teacher produces in 2 steps
- guided diffusion distillation
- in classifier free sampling, each step combines the conditional and unconditional terms, which requires 2 forward passes
- train a child model to predict that combination in only one step
- Acceleration techniques
- sparsity:
- an edit usually covers only a small part of the whole image (e.g., 1%)
- only apply diffusion / convolution on those small edited blocks
- quantization
- different quantization scales for different timesteps in the diffusion model
- different quantization scales for tensors and weights, since the U-Net concatenates samples, which makes the distributions different
Distributed Training
- parallelism technique
- data parallelism: split data among gpus
- Param server: aggregate global gradient
- Worker server: compute local gradient
- steps
- param server copies all weights to each worker’s local model
- each worker computes gradients on its split of the data
- each worker send gradient to param server
- param server aggregate gradients and update weights
- param server sync weights with worker’s local model
- communication primitive
- operations: gather, scatter, reduce, broadcast, all reduce (reduce on all nodes), all gather (gather on all nodes)
- different complexity for data parallelism with different operations
- existing approach has high time and bandwidth complexity on param server since it needs to do O(n) reduce and send weights to all nodes
- variation:
- sequential reduce on each worker instead of on param server (all reduce sequential)
- ring reduce: at each time step, each node sends its data to its neighbor node (all reduce ring)
- parallel all reduce: every node reduces from all other nodes in parallel: O(n^2) bandwidth
- recursive halving reduce: in each iteration, pairs of nodes exchange their data; O(log n) time complexity
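A small simulation of the ring variant, using the common reduce-scatter plus all-gather formulation. This is an assumption about which ring scheme the notes mean. Each worker's gradient is split into n chunks (one number per chunk here), and in every step each worker sends one chunk to its right neighbor.

```python
def ring_all_reduce(worker_data):
    """Simulate ring all-reduce; worker_data[i] is worker i's vector of n chunks."""
    n = len(worker_data)
    data = [list(v) for v in worker_data]

    # Phase 1, reduce-scatter: after n - 1 steps worker i owns the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, chunk, value in sends:  # all sends in a step happen simultaneously
            data[(i + 1) % n][chunk] += value

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] = value
    return data

grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(grads))
```

Every worker sends only 2(n - 1) chunks in total, each 1/n of the gradient, so per-node traffic stays roughly constant as n grows and no single parameter server becomes the bottleneck.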
- Reducing memory in data parallelism: ZeRO-1 / 2 / 3 and FSDP
- shard optimizer states, gradients and weights across multiple gpu,
- communicate missing data between gpu when needed
- pipeline parallelism: run different layer on different GPUs
- Gpipe: Easy Scaling with Micro-Batch Pipeline Parallelism
- micro batch improve device utilization
- the more micro batches, the higher the device utilization
- but too small batch may not occupy whole gpu which causes waste
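The utilization trade-off can be estimated with the usual pipeline "bubble" formula. This is a sketch that assumes every stage takes the same time per micro batch and ignores communication.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of time a pipeline stage sits idle, assuming equal stage times."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)

for m in (1, 4, 32):
    print(m, round(bubble_fraction(4, m), 3))
```

With 4 stages, going from 1 to 32 micro batches cuts the idle fraction from 75% to under 10%. The formula does not capture the opposite effect noted above, where very small micro batches underuse each GPU.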
- tensor parallelism: within one layer, run different tensor computation on different GPUs
- Tensor parallelism: split a weight tensor into N chunks and parallelize
- f and g denote the AllReduce / Identity operations to synchronize.
- The params and activations (blue part) are split across different GPUs.
- comparing different parallelization
- data parallelism
- high gpu utilization, high memory cost, low communication
- pipeline parallelism
- low utilization, low memory cost, medium communication
- tensor parallelism
- high utilization, low memory cost, high communication
- hybrid approach
- 2d parallelism
- outer loop data parallelism, inner loop pipeline parallelism
- or outer loop pipeline parallelism, inner loop tensor parallelism
- 3d parallelism
- Alpa: A Unified Compiler for Distributed Training
- inter operator parallelism, run group of op on different gpu
- intra operator parallelism, divide operand and run same op on different gpu
- algorithm
- design inter op
- design intra op
- cost estimation and iterate
- Understand the bandwidth and latency bottleneck of distributed training
- communication is essential
- Requires synchronization, high communication frequency
- Larger model, larger transfer data size, longer transfer time
- More training nodes, more communication (all-reduce), longer latency
- Gradient compression: overcome the bandwidth bottleneck
- Gradient Pruning: Sparse Communication, Deep Gradient Compression
- Only send top-k gradients (by magnitude)
- Keep the un-pruned part as error feedback (residual) and accumulate it for later use
- Improve training speed
- Works for simple neural networks, but fails on modern models like ResNet
- because the momentum becomes wrong with sparsified gradients, so the update is directed to a wrong optimum
- solution: momentum correction, i.e. accumulate the momentum velocity instead of the gradients
- Warm Up Training
- In the early stages of training, the network is changing rapidly
- Local gradient accumulation and stale gradient will aggravate the problem
- Warm up the learning rate
- Warm up sparsity
- avoid a sudden change in sparsity
- exponentially increasing sparsity in first several epochs → help optimizer adapt to larger sparsity
- when combining warm up and momentum correction, the accuracy matches the baseline
- Problem: sparse gradients get denser during all-reduce. Solution:
- same sparsity pattern, coarse grained sparsity
- possibly prune in the middle of ring all-reduce
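Top-k gradient pruning with error feedback can be sketched as below. This shows plain local gradient accumulation only. Momentum correction and sparsity warm-up are left out, and the function name is hypothetical.

```python
def sparsify_with_residual(grad, residual, k):
    """Send only the k largest-magnitude entries; keep the rest locally for later."""
    acc = [g + r for g, r in zip(grad, residual)]  # add back what was not sent before
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    keep = set(top)
    sent = {i: acc[i] for i in top}                # sparse message: index -> value
    new_residual = [0.0 if i in keep else a for i, a in enumerate(acc)]
    return sent, new_residual

residual = [0.0, 0.0, 0.0, 0.0]
sent1, residual = sparsify_with_residual([0.1, -2.0, 0.3, 0.05], residual, k=1)
sent2, residual = sparsify_with_residual([0.1, 0.0, 0.3, 0.0], residual, k=1)
print(sent1, sent2, residual)
```

In the second step, index 2 is sent because its small gradients have accumulated, which is the point of keeping the residual instead of discarding it.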
- PowerSGD: Low-Rank Gradient Compression
- Motivation: Address the irregular sparse pattern in gradient compression, prevent gradients from getting denser
- Method: instead of fine-grained pruning, adopt low-rank factorization.
- project the 2d array to a 1d array by removing the 0 elements.
- Different server might have different sparsity pattern, but can all be projected to 1d array
- Gradient Quantization: 1-Bit SGD, TernGrad
- 1bit SGD
- quantize each value to 1 bit (1 or 0), plus a scale for that bit, e.g. store n for a scale of 2^n
- accumulate small errors locally
- threshold quantization
- if > t → 1, if < -t → -1
- TernGrad: divide each value by the max value, then assign it to 1, 0, or -1, using the divided number as the probability of being 1 or -1
- Delayed gradient update: overcome the latency bottleneck
- bandwidth is easy to improve by upgrading hardware or compressing payload, latency is hard
- worker continues using stale gradient when the gradient from other workers hasn’t arrived
- upon arrival, when calculating the next gradient, the worker discards its stale local gradient from n steps ago, replaces it with the average of the gradients from all workers n steps ago, and updates with the corrected gradient.
On device training and transfer learning
- what is transfer learning
- Transfer learning is a technique in machine learning in which knowledge learned from a task is re-used in order to boost performance on a related task.
- For example, for image classification, knowledge gained while learning to recognize cars could be applied when trying to recognize trucks.
- we want to do transfer learning at edge device rather than cloud so that it has better customization and privacy. goal is to adapt to local data
- Deep leakage from gradients: gradients are not safe to share
- federated learning:
- batch multiple data samples on device,
- calculate gradients,
- only send gradients to server
- server update the model weights and sync with device local weights
- gradients could leak input data
- Membership Inference [Shokri 2016]
- Given gradients, it’s possible to find whether a data point belongs to the batch.
- Property Inference [Melis 2018]
- Given gradients, it’s possible to find whether a data point with certain property is in the batch
- deep leakage attack via gradient matching for text and image
- given the real gradient, initialize a dummy input as noise and calculate its gradient; the loss is the difference between the calculated gradient and the real gradient; after several iterations, the dummy input approaches the real input
- when the batch contains multiple data points, the attack is harder because there are more variables to solve
- the generated image has lower resolution
- memory bottleneck of on device learning
- training consumes much more memory since it also needs to keep activations in memory for gradient calculation
- one option is to fine tune only the last layer, but accuracy decreases a lot
- another option is to fine tune the last layer + the last batch normalization layer: accuracy does not decrease much and the number of trainable params decreases a lot, but the activation memory doesn't decrease much
- Tiny transfer learning (TinyTL)
- Updating weights requires storing intermediate activations
- Updating biases does not, is memory-efficient
- solution 1: Freeze weights, only fine tune biases
- solution 2: lite residual learning
- Add lite residual modules to increase model capacity
- Key principle - keep activation size small
- Reduce the resolution
- Avoid inverted bottleneck
- fine-tune bias only + lite residual learning: high accuracy, large memory saving
- Compared with dynamic activation pruning, TinyTL saves the memory more effectively
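Why bias-only updates are memory-efficient follows from the backward pass of a linear layer y = W x + b. The weight gradient needs the stored input x, while the bias gradient needs only the gradient arriving from the output. A minimal sketch with made-up numbers:

```python
def weight_grad(x, grad_y):
    """dL/dW[i][j] = grad_y[i] * x[j]: the input activation x must be kept in memory."""
    return [[gy * xi for xi in x] for gy in grad_y]

def bias_grad(grad_y):
    """dL/db = grad_y: no stored activation is needed."""
    return list(grad_y)

x = [1.0, 2.0]            # layer input, saved during the forward pass
grad_y = [3.0, 4.0, 5.0]  # gradient flowing back from the layer output

print(weight_grad(x, grad_y))
print(bias_grad(grad_y))
```

With frozen weights, `x` can be freed right after the forward pass, which is where the activation memory saving comes from.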
- Sparse back-propagation (SparseBP)
- for fine tuning
- some layers and channels can be skipped during backward propagation
- no need to back-propagate to early layers since they only contain fundamental features
- so that we only need to store and compute on a subset of the activations.
- the middle layers use both low activation memory and low weight memory
- because early layers have higher resolution
- while later layers contain more channels
- so middle layers are the place to fine tune
- We update biases for the later layers (related to activation only), and weights for the intermediate layers (related to activation and weights)
- Contribution Analysis: fine-tune only one layer on a downstream task to measure the accuracy improvement (Δ accuracy) as contributions.
- Only fine-tune the layers with large Δaccuracy (contributes more to performance)
- MobilenetV2 prefers first depth-wise conv
- BERT prefers QKV projection and first FFN layers.
- Use evolutionary search to find the sparse back-propagation scheme
- SparseBP reduces memory usage and increases accuracy
- Quantized training with quantization aware scaling (QAS)
- Real quantized graphs save memory, but hard to quantize
- Making training difficult
- Mixed precisions: int8/int32/fp32..
- Lack BatchNorm
- the problem is that this approach makes the gradients too small compared to the weights
- analysis shows that after quantization, the weight / gradient ratio is distorted by the multiplication with the scaling factor
- so when calculating the gradient, just multiply the scaling factor back in to restore its magnitude
- PockEngine: system support for sparse back-propagation
Efficient Fine Tuning and Prompt Engineering
- PockEngine: System Support for Sparse Back-Propagation
- conventional training framework focus on flexibility. They perform inference at compile time and backward propagation at runtime
- PockEngine moves most of the workload from runtime to compile time, which minimizes runtime overhead and opens up opportunities for extensive graph optimizations
- e.g. operator reordering reduces the memory footprint
- We extend PockEngine to support:
- Diverse models (CNN + Transformers)
- Diverse frontends: PyTorch, TensorFlow, JAX
- Diverse hardware backends
- Apple M1
- Raspberry Pi
- Smartphones
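To make the compile-time idea concrete, here is a toy sketch of how a backward plan for sparse back-propagation could be derived before training starts. It is not PockEngine's API or implementation: `plan_backward`, the op names, and the layer names are hypothetical, and the model is assumed to be a plain sequential chain.

```python
def plan_backward(layers, trainable):
    """Return the backward ops needed for a sequential model.

    layers:    layer names in forward order
    trainable: set of layer names whose parameters are updated
    """
    if not any(name in trainable for name in layers):
        return []
    # Gradients never need to flow below the earliest trainable layer.
    first = min(i for i, name in enumerate(layers) if name in trainable)
    plan = []
    for i in range(len(layers) - 1, first - 1, -1):
        name = layers[i]
        if name in trainable:
            plan.append(("param_grad", name))   # gradient w.r.t. this layer's parameters
        if i > first:
            plan.append(("input_grad", name))   # gradient passed to the previous layer
    return plan


layers = ["conv1", "conv2", "conv3", "fc"]
plan = plan_backward(layers, trainable={"conv3", "fc"})
print(plan)
```

Because the plan is fixed ahead of time, the ops for the frozen `conv1` and `conv2` never appear in the training graph, so there is nothing to skip at runtime.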
- Efficient LLM Fine-Tuning
- When shifting from vision models to large language models, the parameters and optimizer states, rather than the activations, dominate GPU memory usage
- BitFit / Adapter / Prompt-Tuning / Prefix-Tuning
- BitFit: fine tune only the bias terms
- On small-to-medium datasets, BitFit is competitive with (and sometimes better than) full fine-tuning.
- On larger datasets, it performs worse than full fine-tuning
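A quick count shows how few parameters BitFit touches. This sketch assumes one standard transformer layer with BERT-base-like sizes (`d_model=768`, `d_ff=3072`) and counts only the attention and FFN projections; LayerNorm parameters and embeddings are left out for simplicity.

```python
def transformer_layer_params(d_model, d_ff):
    """Count weight and bias parameters of one transformer layer.

    Assumes Q, K, V and output projections (d_model x d_model each)
    plus a two-layer FFN (d_model x d_ff and d_ff x d_model).
    """
    weights = 4 * d_model * d_model + 2 * d_model * d_ff
    biases = 4 * d_model + d_ff + d_model
    return weights, biases


weights, biases = transformer_layer_params(d_model=768, d_ff=3072)
bias_fraction = biases / (weights + biases)
print(f"weights: {weights}, biases: {biases}, trainable fraction: {bias_fraction:.4%}")
```

Under these assumptions the bias terms make up roughly 0.1% of the layer, which is why the optimizer state for BitFit is so small.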
- Adapter
- Inserting learnable layers inside transformer architectures
- Adds only a few trainable parameters per task, and new tasks can be added without revisiting previous ones.
- these adapter modules bring extra inference latency during deployment
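A bottleneck adapter can be sketched in a few lines of NumPy. This is an illustrative version only: the sizes, the ReLU non-linearity, and the zero initialization of the up-projection are assumptions, and `adapter` is a hypothetical helper rather than a library function.

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: project down, apply ReLU, project up, add residual."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up


rng = np.random.default_rng(0)
d_model, bottleneck = 64, 8                 # hypothetical sizes

W_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
W_up = np.zeros((bottleneck, d_model))      # zero init, so the adapter starts as identity

h = rng.normal(size=(2, 5, d_model))        # (batch, sequence, hidden) output of a frozen sublayer
out = adapter(h, W_down, W_up)

trainable = W_down.size + W_up.size
print("trainable adapter parameters:", trainable)
```

The two extra matrix multiplications sit on the forward path, which is where the added inference latency comes from.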
- prompt tuning
- We can train a continuous prompt that is prepended to inputs for each task
- We can mix different learned prompts in a single batch
- Accuracy becomes comparable to fine-tuning as the model gets larger.
- prefix tuning
- Prompt-Tuning only adds learnable prompts to the first layer.
- Prefix-Tuning adds tunable prompts to each layer
- Prefix-tuning shows consistent improvement over embedding-only tuning.
- Both methods increase the input length
- This leads to longer inference latency.
- It also takes up available input length and limits the usable sequence length.
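The prompt-tuning mechanics described above (a learned continuous prompt prepended to the input, with different tasks mixed in one batch) can be sketched as follows. All sizes, task names, and the `build_inputs` helper are hypothetical, and the random arrays stand in for a frozen embedding table and trained prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, prompt_len = 100, 16, 4     # hypothetical sizes

embedding = rng.normal(size=(vocab, d_model))   # frozen token embedding table

# One trainable continuous prompt per task (only these would receive gradients).
prompts = {
    "task_a": rng.normal(size=(prompt_len, d_model)),
    "task_b": rng.normal(size=(prompt_len, d_model)),
}

def build_inputs(token_ids, task):
    """Prepend the task's learned prompt to the embedded input tokens."""
    tokens = embedding[token_ids]                       # (seq, d_model)
    return np.concatenate([prompts[task], tokens], axis=0)


# Two different tasks share one batch and one frozen model.
batch = np.stack([
    build_inputs([1, 2, 3], "task_a"),
    build_inputs([4, 5, 6], "task_b"),
])
print(batch.shape)   # sequence dimension grows from 3 to prompt_len + 3
```

The longer sequence dimension is the cost noted above: every prompt position consumes context length and adds attention compute.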
- LoRA / QLoRA / LongLoRA
- LoRA can also be applied to diffusion models
- LongLoRA: Efficient fine-tuning of long-context LLMs
- Shifted Sparse Attention:
- Split attention heads, group tokens, shift the groups.
- split the tokens into smaller groups,
- only do attention within each group,
- in half of the attention heads, shift the groups by half a group size so that information can be exchanged across groups
- Enhanced LoRA:
- We should fine-tune the input embedding and normalization layer.
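The grouping and shifting used by shifted sparse attention can be illustrated with a small mask computation. This is a simplified sketch: it assumes the sequence length is a multiple of the group size, ignores causal masking, and uses hypothetical helper names (`group_ids`, `attention_mask`).

```python
import numpy as np

def group_ids(seq_len, group_size, shifted):
    """Assign each token to an attention group.

    In shifted heads the tokens are rolled by half a group before grouping,
    so each shifted group straddles two neighbouring unshifted groups.
    """
    positions = np.arange(seq_len)
    if shifted:
        positions = (positions - group_size // 2) % seq_len
    return positions // group_size

def attention_mask(seq_len, group_size, shifted):
    """Boolean mask: True where a query token may attend to a key token."""
    g = group_ids(seq_len, group_size, shifted)
    return g[:, None] == g[None, :]


plain = attention_mask(8, 4, shifted=False)    # used by one half of the heads
shifted = attention_mask(8, 4, shifted=True)   # used by the other half

print(group_ids(8, 4, shifted=False))
print(group_ids(8, 4, shifted=True))
print("tokens 3 and 4 connected:", plain[3, 4], shifted[3, 4])
```

Tokens 3 and 4 sit in different groups in the unshifted heads but share a group in the shifted heads, which is how information crosses group boundaries while each head still attends only within a small group.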
- Prompt Engineering
- Zero-Shot / Few-Shot / Chain-of-Thoughts
- Previous (non-large) language models: one model per task
- As model size grows, LLMs show “emergent abilities”: one model for various tasks
- LLM / Diffusion Prompting Examples
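The three prompting styles differ only in how the input text is assembled. The helper below is a hypothetical sketch: the `Q:`/`A:` layout and the trigger phrase "Let's think step by step." are one common convention, not a required format.

```python
def build_prompt(question, examples=(), chain_of_thought=False):
    """Assemble a zero-shot, few-shot, or chain-of-thought style prompt.

    examples:         (question, answer) pairs shown before the real question
    chain_of_thought: append a phrase that asks the model to reason step by step
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    suffix = " Let's think step by step." if chain_of_thought else ""
    parts.append(f"Q: {question}\nA:{suffix}")
    return "\n\n".join(parts)


zero_shot = build_prompt("What is 2 + 3?")
few_shot = build_prompt("What is 2 + 3?", examples=[("What is 1 + 1?", "2")])
cot = build_prompt("What is 2 + 3?", chain_of_thought=True)

print(few_shot)
```

No model weights change in any of these cases; only the text given to the model does.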