Personal highlights

- Memory movement is more expensive than computation.
- Network latency is more significant than computation; for the same memory consumption, we want the network to do as much computation as possible to increase accuracy.
- Common techniques: pruning, quantization, distillation.
- Different levels of grouping and granularity are used in pruning, quantization, and parallel execution.
- Common evaluation and optimization criteria:
  - weight significance, activation significance, tensor-wise, channel-wise, batch-wise, ...
  - L2 loss, KL divergence, accuracy, latency, number of computations, memory usage
- Common ideas for optimizing a neural network structure using the above techniques (a minimal sketch of the iterative round follows this list):
  - Treat an architecture option as a trainable parameter with an additional loss or KL divergence term.
  - Optimize the architecture parameters and the regular weights either together, or freeze one and optimize the other iteratively.
  - Iteratively prune / quantize / distill, and evaluate after fine-tuning in each round.
  - Ablation study: delete...
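
Below is a minimal sketch of the "iteratively prune, fine-tune, then evaluate each round" idea, using PyTorch's built-in pruning utilities and a magnitude-based (weight significance) criterion. The names `model`, `train_loader`, and `evaluate`, along with the round count, pruning amount, and learning rate, are illustrative assumptions, not values from the original notes.

```python
import torch
import torch.nn.utils.prune as prune

def prunable_parameters(model):
    # Collect the conv/linear weights considered for pruning.
    return [(m, "weight") for m in model.modules()
            if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]

def iterative_prune(model, train_loader, evaluate,
                    rounds=3, amount=0.2, ft_epochs=2):
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for r in range(rounds):
        # Prune the globally smallest-magnitude weights
        # (weight-significance criterion at per-weight granularity).
        prune.global_unstructured(prunable_parameters(model),
                                  pruning_method=prune.L1Unstructured,
                                  amount=amount)
        # Fine-tune to recover accuracy before the next pruning round.
        model.train()
        for _ in range(ft_epochs):
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        # Evaluate after fine-tuning in each round, as the notes suggest.
        print(f"round {r}: accuracy = {evaluate(model):.4f}")
    return model
```

The same loop structure generalizes to quantization or distillation rounds by swapping the pruning step for a quantization or teacher-student distillation step, while keeping the fine-tune-then-evaluate cadence.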