Model Compression Techniques
Pruning Techniques
Method | Weight Update | Calibration Data | Pruning Metric | Complexity | Fine-Tuning | Support for LLM/Attention/Linear | Support for Convolutional Layer
---|---|---|---|---|---|---|---
Granular-Magnitude | ✗ | ✗ | \(\lvert W_{ij}\rvert\) | \(O(1)\) | ✓ | ✓ | ✓
Channel-Wise Magnitude | ✗ | ✓ | \(\lvert W_{j}\rvert\) | \(O(1)\) | ✓ | ✓ | ✓
Optimal Brain Compression | ✗ | ✓ | \(\left[\lvert W\rvert^{2}/\operatorname{diag}\!\left((XX^{\top}+\lambda I)^{-1}\right)\right]_{ij}\) | \(O(d_{\text{hidden}}^{3})\) | ✗ | ✓ | ✓
SparseGPT | ✓ | ✓ | \(\left[\lvert W\rvert^{2}/\operatorname{diag}\!\left((XX^{\top}+\lambda I)^{-1}\right)\right]_{ij}\) | \(O(d_{\text{hidden}}^{3})\) | ✗ | ✓ | ✗
Wanda | ✗ | ✓ | \(\lvert W_{ij}\rvert\cdot\lVert X_{j}\rVert_{2}\) | \(O(d_{\text{hidden}}^{2})\) | ✗ | ✓ | ✗
Venum | ✗ | ✓ | \(\lvert W_{ij}\rvert\cdot\lVert X_{j}\rVert_{2}\) | \(O(d_{\text{hidden}}^{2})\) | Minimal (optional) | ✓ | ✓
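To make the metrics concrete, here is a minimal PyTorch sketch of two of them: granular magnitude pruning (\(\lvert W_{ij}\rvert\)) and the activation-aware score used by Wanda and Venum (\(\lvert W_{ij}\rvert\cdot\lVert X_{j}\rVert_{2}\)). The helper names are hypothetical and are not part of this package's API.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Granular magnitude pruning: zero the smallest |W_ij| entries globally."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def wanda_prune(weight: torch.Tensor, calib_inputs: torch.Tensor,
                sparsity: float) -> torch.Tensor:
    """Wanda-style pruning: score each weight by |W_ij| * ||X_j||_2 and
    drop the lowest-scoring fraction within each output row.

    weight:       (out_features, in_features)
    calib_inputs: (n_samples, in_features) calibration activations
    """
    col_norms = calib_inputs.norm(p=2, dim=0)   # ||X_j||_2 per input feature
    scores = weight.abs() * col_norms           # |W_ij| * ||X_j||_2
    k = int(weight.shape[1] * sparsity)
    drop = scores.topk(k, dim=1, largest=False).indices
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, drop, False)               # False = pruned
    return weight * mask
```

For example, `wanda_prune(layer.weight.data, X, 0.5)` keeps, in each output row, the half of the weights with the highest activation-weighted magnitude; this is why the method needs calibration data but no weight update.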
Quantization Techniques
Currently, the package only supports Eager Mode Quantization. I look forward to integrating FX Graph Mode Quantization in the near future.
There are three types of quantization supported:
Dynamic Quantization (first sketch below):
- Weights are quantized ahead of time.
- Activations are read/stored in floating point and quantized on the fly for compute.

Static Quantization (second sketch below):
- Weights are quantized.
- Activations are quantized.
- Calibration is required post-training.

Static Quantization Aware Training (third sketch below):
- Weights are quantized.
- Activations are quantized.
- Quantization numerics are modeled during training.
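A minimal sketch of dynamic quantization using PyTorch's stock eager-mode API (the toy model and shapes are made up for illustration; this is not the package's own wrapper):

```python
import torch
import torch.nn as nn

# Toy float model; nn.Linear is the main beneficiary of dynamic quantization.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Weights of the listed module types are converted to int8 ahead of time;
# activations stay in floating point and are quantized on the fly for compute.
dq_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
out = dq_model(torch.randn(4, 128))
```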
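Static quantization additionally quantizes activations, which requires observers and a post-training calibration pass. A sketch under the same assumptions (hypothetical toy model, stock PyTorch eager-mode API):

```python
import torch
import torch.nn as nn

class StubbedNet(nn.Module):
    """Eager-mode static quantization needs explicit float<->int8 boundaries."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # float -> quantized
        self.fc = nn.Linear(128, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()  # quantized -> float

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = StubbedNet().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)  # inserts observers

# Post-training calibration: run representative inputs so the observers
# can record the activation ranges used to pick quantization parameters.
for _ in range(16):
    prepared(torch.randn(4, 128))

quantized = torch.ao.quantization.convert(prepared)  # int8 weights + activations
```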
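Quantization aware training follows the same stub-wrapped pattern, but fake-quantize modules are inserted before training so the int8 numerics are modeled in the forward pass. A sketch with a dummy loss, again using stock PyTorch eager-mode calls rather than this package's interface:

```python
import torch
import torch.nn as nn

class StubbedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(128, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = StubbedNet().train()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare_qat(model)  # inserts fake-quant modules

# Fake-quantize ops model int8 rounding in the forward pass, so training
# adapts the weights to the quantization error.
opt = torch.optim.SGD(prepared.parameters(), lr=1e-3)
for _ in range(10):
    loss = prepared(torch.randn(4, 128)).sum()  # dummy loss for illustration
    opt.zero_grad()
    loss.backward()
    opt.step()

quantized = torch.ao.quantization.convert(prepared.eval())
```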