References
Pruning
Learning both Weights and Connections for Efficient Neural Networks
Pruning Convolutional Neural Networks for Resource Efficient Inference
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
A Simple and Effective Pruning Approach for Large Language Models
Quantization
Quantizing deep convolutional networks for efficient inference: A whitepaper
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Quantization Mimic: Towards Very Tiny CNN for Object Detection