Live Quantization: ~source code
Original (float32) → Quantized (int8) → Dequantized, with the round-trip error reported as mean squared error (MSE).
What is Quantization?
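Quantization is the process of reducing the number of bits used to represent a number, typically by mapping float32 values to int8. This cuts storage from 32 bits to 8 bits per value at the cost of some precision, which is why the round-trip error is worth measuring.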
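To make the size reduction concrete, here is a minimal sketch that compares the storage needed for the same number of values in float32 and int8. The helper name `storage_bytes` and the tensor size are hypothetical and chosen only for illustration.

```python
import struct

# Standard-size struct formats: "<f" is a 32-bit float, "<b" is a signed 8-bit integer.
FLOAT32 = "<f"
INT8 = "<b"


def storage_bytes(count, fmt):
    """Bytes needed to store `count` values of the given struct format."""
    return count * struct.calcsize(fmt)


n_values = 1_000_000  # hypothetical tensor size, for illustration only
float32_bytes = storage_bytes(n_values, FLOAT32)
int8_bytes = storage_bytes(n_values, INT8)
compression_ratio = float32_bytes / int8_bytes

print(f"float32: {float32_bytes} bytes, int8: {int8_bytes} bytes")
print(f"compression ratio: {compression_ratio}")
```

The ratio only reflects raw storage; any scale or zero-point that must be kept alongside the int8 values adds a small constant overhead.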
Formulas used
Absolute Maximum Quantization
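\begin{align*}
\mathbf{X}_{\text{quant}} &= \text{round}\Biggl ( \frac{127}{\max|\mathbf{X}| + \varepsilon} \cdot \mathbf{X} \Biggr ) \\
\mathbf{X}_{\text{dequant}} &= \frac{\max|\mathbf{X}| + \varepsilon}{127} \cdot \mathbf{X}_{\text{quant}}
\end{align*}
Here ε is a small positive constant that keeps the scale finite when every entry of X is zero. The same scale is used in both directions, so dequantization undoes the scaling exactly and only the rounding error remains.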
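The absolute-maximum round trip can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the demo's actual source: the function names, the default `eps` value, and the sample input are all hypothetical, and dequantization divides by the same scale (including `eps`) that was used to quantize.

```python
def absmax_quantize(values, eps=1e-8):
    """Quantize floats to the int8 range [-127, 127] by absolute-maximum scaling.

    `eps` is an assumed small constant that avoids division by zero
    when every input is 0.
    """
    max_abs = max(abs(v) for v in values)
    scale = 127 / (max_abs + eps)
    # Python's round() sends exact .5 ties to the nearest even integer.
    quantized = [round(scale * v) for v in values]
    # Undo the scaling; only the rounding error remains.
    dequantized = [q / scale for q in quantized]
    return quantized, dequantized


def mse(original, reconstructed):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


weights = [0.1, -0.4, 1.2, -2.0]  # hypothetical sample input
q, dq = absmax_quantize(weights)
print("quantized:  ", q)
print("dequantized:", dq)
print("MSE:        ", mse(weights, dq))
```

Because the scale is symmetric around zero, the value -128 is never produced, so one of the 256 int8 codes goes unused.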
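Zero-Point Quantization
\begin{align*}
\text{scale} &= \frac{255}{\max(\mathbf{X}) - \min(\mathbf{X})} \\
\text{zeropoint} &= -\text{round}\bigl ( \text{scale} \cdot \min(\mathbf{X}) \bigr ) - 128 \\
\mathbf{X}_{\text{quant}} &= \text{round}\bigl ( \text{scale} \cdot \mathbf{X} + \text{zeropoint} \bigr ) \\
\mathbf{X}_{\text{dequant}} &= \frac{\mathbf{X}_{\text{quant}} - \text{zeropoint}}{\text{scale}}
\end{align*}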
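For comparison, here is a minimal sketch of zero-point quantization in the scale-and-zero-point form given in the cited Labonne post: scale = 255 / (max(X) - min(X)), zeropoint = -round(scale · min(X)) - 128, with the input shifted so that its minimum maps to -128 and its maximum to 127. The function name, the fallback for a zero range, the clamp to [-128, 127], and the sample input are assumptions made for illustration.

```python
def zeropoint_quantize(values):
    """Quantize floats to int8 [-128, 127] with an asymmetric scale and zero point."""
    lo, hi = min(values), max(values)
    value_range = hi - lo
    if value_range == 0:
        value_range = 1.0  # assumed fallback so a constant input does not divide by zero
    scale = 255 / value_range
    zeropoint = -round(scale * lo) - 128
    # Clamp as a safeguard in case rounding pushes a value just outside the int8 range.
    quantized = [max(-128, min(127, round(scale * v + zeropoint))) for v in values]
    dequantized = [(q - zeropoint) / scale for q in quantized]
    return quantized, dequantized, zeropoint


weights = [-1.0, 0.0, 0.5, 3.0]  # hypothetical sample input with an asymmetric range
q, dq, zp = zeropoint_quantize(weights)
print("zero point: ", zp)
print("quantized:  ", q)
print("dequantized:", dq)
```

Unlike the absolute-maximum scheme, this mapping uses the full int8 range even when the input is not centered on zero, at the cost of storing a zero point alongside the scale.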
Reference
Labonne, Maxime. "Introduction to Weight Quantization." Maxime Labonne's Blog, July 6, 2023.