Model Compression Techniques

Pruning Techniques

Method Comparison

| Method | Weight Update | Calibration Data | Pruning Metric | Complexity | Fine-Tuning | Support for LLM/Attention/Linear | Support for Convolutional Layer |
|---|---|---|---|---|---|---|---|
| Granular-Magnitude | No | Not required | \(\lvert W_{ij} \rvert\) | \(O(1)\) | | | |
| Channel-Wise Magnitude | No | Not required | \(\lvert W_{j} \rvert\) | \(O(1)\) | | | |
| Optimal Brain Compression | Yes | Required | \(\lvert W \rvert^{2} / \mathrm{diag}\left((X X^{T} + \lambda I)^{-1}\right)\) | \(O(d_{\mathrm{hidden}}^{3})\) | | | |
| SparseGPT | Yes | Required | \(\lvert W \rvert^{2} / \mathrm{diag}\left((X X^{T} + \lambda I)^{-1}\right)\) | \(O(d_{\mathrm{hidden}}^{3})\) | | | |
| Wanda | No | Required | \(\lvert W_{ij} \rvert \cdot \lVert X_{j} \rVert_{2}\) | \(O(d_{\mathrm{hidden}}^{2})\) | | | |
| Venum | | Required | \(\lvert W_{ij} \rvert \cdot \lVert X_{j} \rVert_{2}\) | \(O(d_{\mathrm{hidden}}^{2})\) | Minimal (optional) | | |
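The pruning metrics above map directly onto a few lines of PyTorch. The sketch below is only an illustration of the scoring rules, not this package's API; `magnitude_mask` and `wanda_mask` are hypothetical helper names, and the Wanda-style variant uses a single global threshold where the original method ranks weights within each output row.

```python
# Illustrative sketch only: generic PyTorch, not this package's API.
import torch
import torch.nn as nn

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Granular (unstructured) magnitude pruning: keep the largest |W_ij|."""
    scores = weight.abs().flatten()
    k = int(sparsity * scores.numel())                      # number of weights to drop
    threshold = torch.kthvalue(scores, k).values if k > 0 else scores.min() - 1
    return (weight.abs() > threshold).float()               # 1 = keep, 0 = prune

def wanda_mask(weight: torch.Tensor, calib_inputs: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Wanda-style score |W_ij| * ||X_j||_2, computed from calibration activations."""
    # calib_inputs: (num_samples, in_features) activations feeding this layer
    act_norm = calib_inputs.norm(p=2, dim=0)                # ||X_j||_2 per input feature
    scores = weight.abs() * act_norm                        # broadcast over output rows
    k = int(sparsity * scores.numel())
    threshold = torch.kthvalue(scores.flatten(), k).values if k > 0 else scores.min() - 1
    return (scores > threshold).float()

layer = nn.Linear(128, 64)
calib = torch.randn(256, 128)                               # stand-in calibration batch
mask = wanda_mask(layer.weight.data, calib, sparsity=0.5)
layer.weight.data *= mask                                   # zero out pruned weights
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
```

Channel-wise (structured) pruning follows the same pattern, except that whole rows or channels are scored and removed together rather than individual weights.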

Quantization Techniques

Currently, the package only supports Eager Mode Quantization. I look forward to integrating FX Graph Mode Quantization in the near future.

There are three types of quantization supported; a minimal eager-mode sketch of all three follows the list:

  1. Dynamic Quantization:
     - Weights are quantized ahead of time.
     - Activations are read/stored in floating point and quantized on the fly for compute.

  2. Static Quantization:
     - Weights are quantized.
     - Activations are quantized.
     - Calibration is required post-training.

  3. Static Quantization Aware Training:
     - Weights are quantized.
     - Activations are quantized.
     - Quantization numerics are modeled during training.
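To make the three modes concrete, the sketch below calls PyTorch's built-in `torch.ao.quantization` eager-mode utilities directly rather than this package's wrappers; `TinyNet` and the calibration batches are placeholder stand-ins.

```python
# Illustrative PyTorch eager-mode quantization sketch; not this package's wrapper API.
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyNet(nn.Module):
    """Toy model; QuantStub/DeQuantStub mark where tensors enter/leave int8."""
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc1 = nn.Linear(16, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 4)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

calib_data = [torch.randn(8, 16) for _ in range(4)]          # stand-in calibration batches

# 1. Dynamic quantization: weights int8 ahead of time, activations quantized at compute time.
dynamic_model = tq.quantize_dynamic(TinyNet().eval(), {nn.Linear}, dtype=torch.qint8)

# 2. Static (post-training) quantization: observers collect activation ranges during calibration.
static_model = TinyNet().eval()
static_model.qconfig = tq.get_default_qconfig("fbgemm")      # x86 backend; use "qnnpack" on ARM
prepared = tq.prepare(static_model)                          # inserts observers
for batch in calib_data:                                     # calibration pass, no labels needed
    prepared(batch)
static_quantized = tq.convert(prepared)                      # swaps in quantized modules

# 3. Quantization-aware training: fake-quant modules model int8 numerics during training.
qat_model = TinyNet().train()
qat_model.qconfig = tq.get_default_qat_qconfig("fbgemm")
qat_prepared = tq.prepare_qat(qat_model)
# ... run the usual training loop on qat_prepared here ...
qat_quantized = tq.convert(qat_prepared.eval())
```

In eager mode the boundaries of the quantized region must be marked manually with `QuantStub`/`DeQuantStub`, as in the toy model above; FX Graph Mode Quantization automates that placement.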