算法
8-bit 与float转换
\[real\_value = (int8\_value - zero\_point)\times scale \\ 也可写为:r = S(q-Z)\]int8_value
的weight范围是[-127, 127]
,zero_point
为0;activations/inputs
范围是[-128, 127]
,zero_point
范围是[-128, 127]
threshold
threshold理解为某个tensor的元素最大值,则:
\(Y = F(X) (float运算) \\
=> y = f(x),其中 x = X \times \frac{128}{threshold_x}, Y = y \times \frac{threshold_y}{128} (int8运算)\)
per-axis 与 per-tensor
per-axis
,表示某个维度每一片都有一个scale和zero_point
,比如per-channel
表示每个channel都有一个scale和zero_point
per-tensor
,表示整个tensor用一个scale和zero_point
Scale转换
\[M = 2^{-n}M_0,其中M_0取值[0.5,1], n是一个非负数\]有y = x * M
,且y与x都是整型,M是浮点型,通过以上公式可以将其转换为整型运算。当multiplier为int32时
\(Multiplier = 2^{31}M_0\)
,这样Multiplier至少有30位精度。
举例说明:
\(y = x \times 0.1234 \\
=> y = x \times 0.9872 \times 2^{-3} \\
=> y = x \times (0.9872 \times 2^{31}) \times 2^{-34} \\
=> y = x \times \frac{2119995857}{1 \ll 34} \\
=> y = (x \times 2119995857) \gg 34\)
Add推导
\[Y= X_1 + X_2 + X_3 (float) \\ => y \frac{thy}{128} = x_1 \frac{thx_1}{128} + x_2 \frac{thx_2}{128} + x_3 \frac{thx_3}{128} \\ => y = x_1 \frac{thx_1}{thy} + x_2 \frac{thx_2}{thy} + x_3 \frac{thx_3}{thy} \\ => y = x_1 M_1 + x_2 M_2 + x3 M_3 \\ 取 M_1、M_2、M_3中最大Shift \\ => y = x_1 \frac{M_1}{1 \ll Shift} + x_2 \frac{M_2}{1 \ll Shift} + x_3 \frac{M_3}{1 \ll Shift} \\ => y = (x_1 \times{M_1} + x_2 \times {M_2} + x_3 \times {M_3} ) \gg Shift\]矩阵乘法推导
有两N x N
矩阵r1
和r2
,r3=r1 x r2
,为了简化,令zero_point
都为0,则浮点到整型运算推导过程如下:
\(r_a^{(i,j)} = S_a \times q_a^{i,j} \\
S_3 q_3^{i,k} = \sum_{j=1}^{N}S_1q_1^{i,j}S_2q_2^{j,k} \\
q_3^{i,k} = M \sum_{j=1}^{N}q_1^{i,j}q_2^{j,k} \\
其中 M := \frac{S_1S_2}{S_3} = 2^{-n}M_0\)
相关函数
cmath
std::round
double round(double x)
四舍五入,比如:std::round(7.479) = 7
, std::round(7.579) = 8
std::floor
double floor(double x)
取整,但<= x
,比如:std::floor(7.579) = 7
std::frexp
double frexp(double x, int *y)
二进制浮点表达转换,若w = std::frexp(x, &y)
,则x = w * (2^y)
,w范围:(-1.0, -0.5] U [0.5, 1.0)
algorithm
std::min_element / std::max_element
template< class ForwardIt >
ForwardIt min_element( ForwardIt first, ForwardIt last );
template< class ForwardIt, class Compare >
ForwardIt min_element( ForwardIt first, ForwardIt last, Compare comp );
查找最小/最大元素
参考文献
TensorFlow Lite 8-bit quantization specification
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference