quantization int8

算法
相关函数
- cmath
- algorithm
  - std::min_element / std::max_element
参考文献

算法

8-bit 与float转换

\[real\_value = (int8\_value - zero\_point)\times scale\]

简写为：$ r = S(q-Z) $

int8_value的weight范围是[-127, 127]，zero_point为0；activations/inputs范围是[-128, 127]，zero_point范围是[-128, 127]

threshold

threshold理解为某个tensor的元素最大值，则：

\[Y = F(X) \text{(float运算)} => y = f(x) \text{(int8运算)}\]

其中 $ x = X \times \frac{128}{threshold_x}, Y = y \times \frac{threshold_y}{128} $

per-axis 与 per-tensor

per-axis，表示某个维度每一片都有一个scale和zero_point，比如per-channel表示每个channel都有一个scale和zero_point
per-tensor，表示整个tensor用一个scale和zero_point

Scale转换

\[M = 2^{-n}M_0，其中M_0取值[0.5,1], n是一个非负数\]

有$ y = x \times M $，且y与x都是整型，M是浮点型，通过以上公式可以将其转换为整型运算。当multiplier为int32时 $ Multiplier = 2^{31}M_0 $，这样Multiplier至少有30位精度。

举例说明：

\[\begin{align} &y = x \times 0.1234 \\ &=> y = x \times 0.9872 \times 2^{-3} \\ &=> y = x \times (0.9872 \times 2^{31}) \times 2^{-34} \\ &=> y = x \times \frac{2119995857}{1 \ll 34} \\ &=> y = (x \times 2119995857) \gg 34 \end{align}\]

Add推导

\[\begin{align} &Y= X_1 + X_2 + X_3 \\ &=> y \frac{thy}{128} = x_1 \frac{thx_1}{128} + x_2 \frac{thx_2}{128} + x_3 \frac{thx_3}{128} \\ &=> y = x_1 \frac{thx_1}{thy} + x_2 \frac{thx_2}{thy} + x_3 \frac{thx_3}{thy} \\ &=> y = x_1 M_1 + x_2 M_2 + x_3 M_3，取 M_1、M_2、M_3中最大Shift \\ &=> y = x_1 \frac{M_1}{1 \ll Shift} + x_2 \frac{M_2}{1 \ll Shift} + x_3 \frac{M_3}{1 \ll Shift} \\ &=> y = (x_1 \times{M_1} + x_2 \times {M_2} + x_3 \times {M_3} ) \gg Shift \end{align}\]

矩阵乘法推导

有两N x N矩阵r1和r2，r3=r1 x r2，为了简化，令zero_point都为0，则浮点到整型运算推导过程如下：

\[\begin{align} &r_a^{(i,j)} = S_a \times q_a^{i,j} \\ &=> S_3 q_3^{i,k} = \sum_{j=1}^{N}S_1q_1^{i,j}S_2q_2^{j,k} \\ &=> q_3^{i,k} = M \sum_{j=1}^{N}q_1^{i,j}q_2^{j,k}，其中 M := \frac{S_1S_2}{S_3} = 2^{-n}M_0 \end{align}\]

相关函数

cmath

std::round

double round(double x)

四舍五入，比如：std::round(7.479) = 7, std::round(7.579) = 8

std::floor

double floor(double x)

取整，但<= x，比如：std::floor(7.579) = 7

std::frexp

double frexp(double x, int *y)

二进制浮点表达转换，若w = std::frexp(x, &y)，则x = w * (2^y)，w范围：(-1.0, -0.5] U [0.5, 1.0)

algorithm

std::min_element / std::max_element

template< class ForwardIt >
ForwardIt min_element( ForwardIt first, ForwardIt last );
template< class ForwardIt, class Compare >
ForwardIt min_element( ForwardIt first, ForwardIt last, Compare comp );

查找最小/最大元素

参考文献

TensorFlow Lite 8-bit quantization specification

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

算法