Live Quantization

Live demo panels: Original (float32), Quantized (int8), Dequantized, Error (MSE)

What is Quantization?

Quantization reduces the number of bits used to represent each number, typically by mapping float32 values (32 bits) to int8 values (8 bits).
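To make the bit reduction concrete, here is a small sketch that compares the storage needed for the same number of values held as float32 and as int8. The sample values are hypothetical and the int8 buffer is only a zero-filled placeholder; the point is the per-element size, not the quantized contents.

```python
from array import array

# Hypothetical sample values, chosen only for illustration.
values = [0.1, -1.2, 3.0, 2.5]

# Typecode "f" stores C floats (4 bytes each on common platforms).
as_float32 = array("f", values)
# Typecode "b" stores signed chars (1 byte each, range -128..127).
as_int8 = array("b", [0] * len(values))

float32_bytes = as_float32.itemsize * len(as_float32)
int8_bytes = as_int8.itemsize * len(as_int8)

print(f"float32: {float32_bytes} bytes, int8: {int8_bytes} bytes")
```

The int8 buffer is a quarter of the size, which is where the memory savings of quantization come from; the cost is the rounding error introduced when values are mapped onto 256 integer levels.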

Formulas used

Absolute Maximum Quantization

\begin{align*} \mathbf{X}_{\text{quant}} &= \text{round}\Biggl ( \frac{127}{\max|\mathbf{X}| + \epsilon} \cdot \mathbf{X} \Biggr ) \\ \mathbf{X}_{\text{dequant}} &= \frac{\max|\mathbf{X}|}{127} \cdot \mathbf{X}_{\text{quant}} \end{align*}
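The absolute-maximum formulas can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the demo's actual source: the function names `absmax_quantize` and `mse`, the default `eps=1e-8`, and the sample values are all hypothetical. The small `eps` plays the role of ε in the quantization step, guarding against division by zero when every input is 0.

```python
def absmax_quantize(xs, eps=1e-8):
    """Quantize a non-empty list of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(x) for x in xs)
    # Scale so that the largest magnitude maps to (almost exactly) 127.
    scale = 127 / (max_abs + eps)
    quant = [round(scale * x) for x in xs]
    # Dequantize with the inverse scale, max|X| / 127.
    dequant = [max_abs / 127 * q for q in quant]
    return quant, dequant


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical sample values, chosen only for illustration.
xs = [0.1, -1.2, 3.0, 2.5]
quant, dequant = absmax_quantize(xs)

print(quant)             # [4, -51, 127, 106]
print(dequant)
print(mse(xs, dequant))  # small but nonzero reconstruction error
```

Because the scale is symmetric around zero, 0.0 always maps to the integer 0, and the value with the largest magnitude is reconstructed almost exactly.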
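
Zero-Point Quantization

\begin{align*} \text{scale} &= \frac{255}{\max(\mathbf{X}) - \min(\mathbf{X})} \\ \text{zeropoint} &= -\text{round}\bigl ( \text{scale} \cdot \min(\mathbf{X}) \bigr ) - 128 \\ \mathbf{X}_{\text{quant}} &= \text{round}\bigl ( \text{scale} \cdot \mathbf{X} + \text{zeropoint} \bigr ) \\ \mathbf{X}_{\text{dequant}} &= \frac{\mathbf{X}_{\text{quant}} - \text{zeropoint}}{\text{scale}} \end{align*}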
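A minimal sketch of zero-point quantization in its common asymmetric form, assuming scale = 255 / (max(X) - min(X)) and zeropoint = -round(scale * min(X)) - 128, so that the input range [min(X), max(X)] maps onto [-128, 127]. The function name `zeropoint_quantize`, the clipping step, the guard for constant inputs, and the sample values are assumptions added for illustration.

```python
def zeropoint_quantize(xs):
    """Quantize a non-empty list of floats to int8 with an asymmetric range."""
    x_min, x_max = min(xs), max(xs)
    value_range = x_max - x_min
    if value_range == 0:
        value_range = 1.0  # assumption: avoid division by zero for constant input
    scale = 255 / value_range
    # Integer offset that shifts min(X) to roughly -128.
    zero_point = -round(scale * x_min) - 128
    # Clip to the int8 range in case rounding lands on -129 or 128.
    quant = [max(-128, min(127, round(scale * x + zero_point))) for x in xs]
    dequant = [(q - zero_point) / scale for q in quant]
    return quant, dequant, scale, zero_point


# Hypothetical sample values, chosen only for illustration.
xs = [0.1, -1.2, 3.0, 2.5]
quant, dequant, scale, zero_point = zeropoint_quantize(xs)

print(zero_point)  # -55
print(quant)       # [-49, -128, 127, 97]
print(dequant)
```

Unlike the absolute-maximum scheme, this mapping uses all 256 integer levels even when the inputs are not centered on zero, which tends to help for skewed distributions such as the outputs of a ReLU.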

Reference