quantization int8
Algorithm
8-bit and float conversion
real_value = (int8_value − zero_point) × scale, abbreviated as r = S(q − Z)
For weights, int8_value ranges over [-127, 127] and zero_point is 0; for activations/inputs, int8_value ranges over [-128, 127] and zero_point also lies in [-128, 127].
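A minimal sketch of the affine mapping r = S(q − Z) above (function names are illustrative, not from any particular library):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize a float to int8 with the affine mapping q = round(r / S) + Z,
// clamped to the int8 range. Weights use the symmetric range [-127, 127]
// with Z = 0; activations may use the full [-128, 127] range.
int8_t Quantize(float r, float scale, int32_t zero_point,
                int32_t qmin = -128, int32_t qmax = 127) {
  int32_t q = static_cast<int32_t>(std::round(r / scale)) + zero_point;
  return static_cast<int8_t>(std::min(qmax, std::max(qmin, q)));
}

// Dequantize back to float: r = S * (q - Z).
float Dequantize(int8_t q, float scale, int32_t zero_point) {
  return scale * (static_cast<int32_t>(q) - zero_point);
}
```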
Threshold
Taking threshold to be the maximum absolute value of a tensor's elements, a float computation Y = F(X) maps to an int8 computation y = f(x), where
x = X × 128 / threshold_x,  Y = y × threshold_y / 128
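A sketch of this threshold-based symmetric quantization (names are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Compute the threshold of a tensor as max |X_i|.
float Threshold(const std::vector<float>& tensor) {
  float t = 0.0f;
  for (float v : tensor) t = std::max(t, std::fabs(v));
  return t;
}

// Symmetric, threshold-based quantization: x = round(X * 128 / threshold),
// clamped to the int8 range.
int8_t QuantizeWithThreshold(float X, float threshold) {
  int32_t x = static_cast<int32_t>(std::round(X * 128.0f / threshold));
  return static_cast<int8_t>(std::min(127, std::max(-128, x)));
}
```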
per-axis and per-tensor
per-axis means each slice along some dimension has its own scale and zero_point; for example, per-channel means each channel has its own scale and zero_point.
per-tensor means the whole tensor shares a single scale and zero_point.
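The difference can be sketched as follows, assuming symmetric int8 weights laid out as [channels][elements_per_channel] and a scale that maps max |w| to 127 (layout and helper names are assumptions for illustration):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Per-tensor: one scale for the entire weight tensor.
float PerTensorScale(const std::vector<std::vector<float>>& w) {
  float m = 0.0f;
  for (const auto& ch : w)
    for (float v : ch) m = std::max(m, std::fabs(v));
  return m / 127.0f;
}

// Per-channel (a common per-axis case): one scale per channel, so a
// channel with small weights is not crushed by a large-valued channel.
std::vector<float> PerChannelScales(const std::vector<std::vector<float>>& w) {
  std::vector<float> scales;
  for (const auto& ch : w) {
    float m = 0.0f;
    for (float v : ch) m = std::max(m, std::fabs(v));
    scales.push_back(m / 127.0f);
  }
  return scales;
}
```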
Scale conversion
M = 2^(-n) × M0, where M0 ∈ [0.5, 1) and n is a non-negative integer. Given y = x × M, where x and y are integers but M is a float, this decomposition turns the multiplication into pure integer arithmetic. With an int32 multiplier, Multiplier = 2^31 × M0, so the multiplier carries at least 30 bits of precision.
An example:
y = x × 0.1234
  = x × 0.9872 × 2^(-3)
  = x × (0.9872 × 2^31) × 2^(-34)
  = x × 2119995857 / (1 ≪ 34)
  = (x × 2119995857) ≫ 34
Add derivation
Y = X1 + X2 + X3
⇒ y × th_y / 128 = x1 × th_x1 / 128 + x2 × th_x2 / 128 + x3 × th_x3 / 128
⇒ y = x1 × (th_x1 / th_y) + x2 × (th_x2 / th_y) + x3 × (th_x3 / th_y)
⇒ y = x1 × M1 + x2 × M2 + x3 × M3, taking the largest Shift among M1, M2, M3
⇒ y = x1 × M1′ / (1 ≪ Shift) + x2 × M2′ / (1 ≪ Shift) + x3 × M3′ / (1 ≪ Shift), where Mi′ = round(Mi × (1 ≪ Shift))
⇒ y = (x1 × M1′ + x2 × M2′ + x3 × M3′) ≫ Shift
Matrix multiplication derivation
Given two N × N matrices r1 and r2 with r3 = r1 × r2, and, for simplicity, all zero_points set to 0, the float-to-integer derivation is:
r3 = r1 × r2
⇒ S3 × q3[i][k] = Σ_j (S1 × q1[i][j]) × (S2 × q2[j][k])
⇒ q3[i][k] = (S1 × S2 / S3) × Σ_j q1[i][j] × q2[j][k]
The floating-point factor M = S1 × S2 / S3 is then converted to an integer multiply-and-shift exactly as in the Scale conversion above.
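The multiplier decomposition above can be sketched with std::frexp; this reproduces the 0.1234 example from the Scale conversion section (names are illustrative, and production code must additionally handle M0 rounding up to exactly 2^31, M ≥ 1, and round-to-nearest shifting):

```cpp
#include <cmath>
#include <cstdint>

// Decompose a float multiplier M in (0, 1) into an int32 fixed-point
// multiplier and a right shift: M = M0 * 2^exponent with M0 in [0.5, 1),
// quantized_multiplier = round(M0 * 2^31), right_shift = 31 - exponent,
// so that y = (x * quantized_multiplier) >> right_shift approximates x * M.
void QuantizeMultiplier(double M, int32_t* quantized_multiplier,
                        int* right_shift) {
  int exponent = 0;
  double M0 = std::frexp(M, &exponent);  // M = M0 * 2^exponent
  *quantized_multiplier = static_cast<int32_t>(std::round(M0 * (1ll << 31)));
  *right_shift = 31 - exponent;
}

// Apply y = (x * multiplier) >> shift using a 64-bit intermediate so the
// int32 * int32 product does not overflow.
int32_t MultiplyByQuantizedMultiplier(int32_t x, int32_t multiplier,
                                      int shift) {
  return static_cast<int32_t>((static_cast<int64_t>(x) * multiplier) >> shift);
}
```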
Related functions
cmath
std::round
double round(double x)
Rounds half away from zero, e.g. std::round(7.479) = 7, std::round(7.579) = 8
std::floor
double floor(double x)
Rounds down to the largest integer <= x, e.g. std::floor(7.579) = 7
std::frexp
double frexp(double x, int *y)
Decomposes x into a binary fraction and exponent: if w = std::frexp(x, &y), then x = w × 2^y, with w in (-1.0, -0.5] ∪ [0.5, 1.0) for nonzero x.
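A small check of this decomposition, using the 0.1234 value from the Scale conversion section (wrapper name is illustrative):

```cpp
#include <cmath>
#include <utility>

// Return {w, y} from std::frexp such that x = w * 2^y and, for nonzero x,
// |w| lies in [0.5, 1.0). For 0.1234 this yields w = 0.9872, y = -3.
std::pair<double, int> FrexpDemo(double x) {
  int y = 0;
  double w = std::frexp(x, &y);
  return {w, y};
}
```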
algorithm
std::min_element / std::max_element
template< class ForwardIt >
ForwardIt min_element( ForwardIt first, ForwardIt last );
template< class ForwardIt, class Compare >
ForwardIt min_element( ForwardIt first, ForwardIt last, Compare comp );
Finds the smallest/largest element in a range and returns an iterator to it.
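For example, these can be used to find a tensor's value range when computing a quantization threshold max(|min|, |max|) (helper name is an illustration):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Find the tensor's min and max with std::min_element / std::max_element,
// then take the larger absolute value as the symmetric threshold.
float RangeThreshold(const std::vector<float>& tensor) {
  auto mn = std::min_element(tensor.begin(), tensor.end());
  auto mx = std::max_element(tensor.begin(), tensor.end());
  return std::max(std::fabs(*mn), std::fabs(*mx));
}
```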
References
TensorFlow Lite 8-bit quantization specification
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference