Model Compression Techniques

Pruning Techniques

Method Comparison

| Method | Weight Update | Calibration Data | Pruning Metric | Complexity | Fine-Tuning | Support for LLM/Attention/Linear | Support for Convolutional Layer |
|---|---|---|---|---|---|---|---|
| Granular-Magnitude | No | Not required | \(\lvert W_{ij} \rvert\) | \(O(1)\) | | | |
| Channel-Wise Magnitude | No | Not required | \(\lvert W_{j} \rvert\) | \(O(1)\) | | | |
| Optimal Brain Compression | Yes | Required | \(\lvert W \rvert^{2} / \mathrm{diag}\left((X X^{T} + \lambda I)^{-1}\right)\) | \(O(d_{\mathrm{hidden}}^{3})\) | | | |
| SparseGPT | Yes | Required | \(\lvert W \rvert^{2} / \mathrm{diag}\left((X X^{T} + \lambda I)^{-1}\right)\) | \(O(d_{\mathrm{hidden}}^{3})\) | | | |
| Wanda | No | Required | \(\lvert W_{ij} \rvert \cdot \lVert X_{j} \rVert_{2}\) | \(O(d_{\mathrm{hidden}}^{2})\) | | | |
| Venum | | Required | \(\lvert W_{ij} \rvert \cdot \lVert X_{j} \rVert_{2}\) | \(O(d_{\mathrm{hidden}}^{2})\) | Minimal (optional) | | |
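The pruning metrics above map directly onto a few lines of PyTorch. The sketch below is only an illustration of the scoring rules, not this package's API; `magnitude_mask` and `wanda_mask` are hypothetical helper names, and the Wanda-style variant uses a single global threshold where the original method ranks weights within each output row.

```python
# Illustrative sketch only: generic PyTorch, not this package's API.
import torch
import torch.nn as nn

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Granular (unstructured) magnitude pruning: keep the largest |W_ij|."""
    scores = weight.abs().flatten()
    k = int(sparsity * scores.numel())                      # number of weights to drop
    threshold = torch.kthvalue(scores, k).values if k > 0 else scores.min() - 1
    return (weight.abs() > threshold).float()               # 1 = keep, 0 = prune

def wanda_mask(weight: torch.Tensor, calib_inputs: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Wanda-style score |W_ij| * ||X_j||_2, computed from calibration activations."""
    # calib_inputs: (num_samples, in_features) activations feeding this layer
    act_norm = calib_inputs.norm(p=2, dim=0)                # ||X_j||_2 per input feature
    scores = weight.abs() * act_norm                        # broadcast over output rows
    k = int(sparsity * scores.numel())
    threshold = torch.kthvalue(scores.flatten(), k).values if k > 0 else scores.min() - 1
    return (scores > threshold).float()

layer = nn.Linear(128, 64)
calib = torch.randn(256, 128)                               # stand-in calibration batch
mask = wanda_mask(layer.weight.data, calib, sparsity=0.5)
layer.weight.data *= mask                                   # zero out pruned weights
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
```

Channel-wise (structured) pruning follows the same pattern, except that whole rows or channels are scored and removed together rather than individual weights.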

Quantization Techniques

Currently, the package only supports Eager Mode Quantization. I look forward to integrating FX Graph Mode Quantization in the near future.

There are three types of quantization supported; a minimal eager-mode sketch of all three follows the list:

  1. Dynamic Quantization:
     - Weights are quantized ahead of time.
     - Activations are read/stored in floating point and quantized on the fly for compute.

  2. Static Quantization:
     - Weights are quantized.
     - Activations are quantized.
     - Calibration is required post-training.

  3. Static Quantization Aware Training:
     - Weights are quantized.
     - Activations are quantized.
     - Quantization numerics are modeled during training.
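To make the three modes concrete, the sketch below calls PyTorch's built-in `torch.ao.quantization` eager-mode utilities directly rather than this package's wrappers; `TinyNet` and the calibration batches are placeholder stand-ins.

```python
# Illustrative PyTorch eager-mode quantization sketch; not this package's wrapper API.
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyNet(nn.Module):
    """Toy model; QuantStub/DeQuantStub mark where tensors enter/leave int8."""
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc1 = nn.Linear(16, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 4)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

calib_data = [torch.randn(8, 16) for _ in range(4)]          # stand-in calibration batches

# 1. Dynamic quantization: weights int8 ahead of time, activations quantized at compute time.
dynamic_model = tq.quantize_dynamic(TinyNet().eval(), {nn.Linear}, dtype=torch.qint8)

# 2. Static (post-training) quantization: observers collect activation ranges during calibration.
static_model = TinyNet().eval()
static_model.qconfig = tq.get_default_qconfig("fbgemm")      # x86 backend; use "qnnpack" on ARM
prepared = tq.prepare(static_model)                          # inserts observers
for batch in calib_data:                                     # calibration pass, no labels needed
    prepared(batch)
static_quantized = tq.convert(prepared)                      # swaps in quantized modules

# 3. Quantization-aware training: fake-quant modules model int8 numerics during training.
qat_model = TinyNet().train()
qat_model.qconfig = tq.get_default_qat_qconfig("fbgemm")
qat_prepared = tq.prepare_qat(qat_model)
# ... run the usual training loop on qat_prepared here ...
qat_quantized = tq.convert(qat_prepared.eval())
```

In eager mode the boundaries of the quantized region must be marked manually with `QuantStub`/`DeQuantStub`, as in the toy model above; FX Graph Mode Quantization automates that placement.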