quantization int8

6 minute read

Algorithm

8-bit and float conversion

real_value = (int8_value − zero_point) × scale

Abbreviated as: r = S(q − Z)

For weights, int8_value ranges over [-127, 127] and zero_point is 0; for activations/inputs, int8_value ranges over [-128, 127] and zero_point ranges over [-128, 127].
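In code, this mapping can be sketched as follows (the function names `quantize` and `dequantize` are illustrative, not from any particular library):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize a real value to int8 given scale S and zero_point Z:
// q = clamp(round(r / S) + Z, -128, 127).
inline int8_t quantize(float r, float scale, int32_t zero_point) {
    int32_t q = static_cast<int32_t>(std::round(r / scale)) + zero_point;
    return static_cast<int8_t>(std::min(127, std::max(-128, q)));
}

// Dequantize back to float: r = S * (q - Z).
inline float dequantize(int8_t q, float scale, int32_t zero_point) {
    return scale * static_cast<float>(q - zero_point);
}
```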

threshold

Take threshold to be the maximum (absolute) element value of a tensor; then:

Y = F(X) (float arithmetic) ⇒ y = f(x) (int8 arithmetic)

where x = X × 128 / threshold_x and Y = y × threshold_y / 128
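A minimal sketch of this threshold-based mapping (the helper names `to_int8` and `to_float` are my own):

```cpp
#include <cmath>
#include <cstdint>

// Map a float value into int8 using a per-tensor threshold:
// x = round(X * 128 / threshold), i.e. scale = threshold / 128.
inline int8_t to_int8(float X, float threshold) {
    float q = std::round(X * 128.0f / threshold);
    if (q > 127.0f) q = 127.0f;   // 128 itself would overflow int8
    if (q < -128.0f) q = -128.0f;
    return static_cast<int8_t>(q);
}

// Recover the float value: X ≈ x * threshold / 128.
inline float to_float(int8_t x, float threshold) {
    return static_cast<float>(x) * threshold / 128.0f;
}
```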

per-axis vs per-tensor

  • per-axis: each slice along a given dimension has its own scale and zero_point; for example, per-channel means each channel gets its own scale and zero_point
  • per-tensor: the entire tensor shares a single scale and zero_point
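The two granularities can be sketched as follows (assuming symmetric quantization onto [-127, 127] and, for the per-channel case, a row-major [channels][elems_per_channel] weight layout; both assumptions are mine):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Per-tensor: one scale for the whole weight tensor (symmetric, zero_point = 0).
float per_tensor_scale(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs / 127.0f;
}

// Per-channel: one scale per channel; `w` is assumed flattened row-major
// as [channels][elems_per_channel].
std::vector<float> per_channel_scales(const std::vector<float>& w,
                                      int channels, int elems_per_channel) {
    std::vector<float> scales(channels);
    for (int c = 0; c < channels; ++c) {
        float max_abs = 0.0f;
        for (int i = 0; i < elems_per_channel; ++i)
            max_abs = std::max(max_abs, std::fabs(w[c * elems_per_channel + i]));
        scales[c] = max_abs / 127.0f;
    }
    return scales;
}
```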

Scale conversion

M = 2^(-n) × M0, where M0 ∈ [0.5, 1) and n is a non-negative integer

With y = x × M, where x and y are integers and M is a float, the decomposition above converts the multiplication into pure integer arithmetic. When the multiplier is stored as an int32, Multiplier = round(2^31 × M0); since M0 ∈ [0.5, 1), the Multiplier carries at least 30 bits of precision.

For example:

y = x × 0.1234
⇒ y = x × 0.9872 × 2^(-3)
⇒ y = x × (0.9872 × 2^31) × 2^(-31) × 2^(-3)
⇒ y = x × 2119995857 × 2^(-34)
⇒ y = (x × 2119995857) >> 34
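This decomposition can be sketched with `std::frexp` (the function names are illustrative; a production implementation would also handle the edge case where M0 rounds up to exactly 1.0, which is ignored here):

```cpp
#include <cmath>
#include <cstdint>

// Decompose a positive multiplier M into M0 * 2^exp via frexp, then store
// M0 as a 32-bit fixed-point integer round(M0 * 2^31).
void quantize_multiplier(double M, int32_t* multiplier, int* shift) {
    int exp = 0;
    double M0 = std::frexp(M, &exp);  // M = M0 * 2^exp, M0 in [0.5, 1)
    *multiplier = static_cast<int32_t>(std::round(M0 * (1ll << 31)));
    *shift = -exp;                    // M = multiplier * 2^(-31) * 2^(-shift)
}

// y = x * M in integer arithmetic: multiply by the fixed-point M0, then
// shift right by 31 + shift (truncating, so the result can be off by one
// versus exact rounding).
int64_t scaled_mul(int64_t x, int32_t multiplier, int shift) {
    return (x * multiplier) >> (31 + shift);
}
```

Running `quantize_multiplier(0.1234, ...)` reproduces the worked example above: multiplier = 2119995857, shift = 3, so the total right shift is 34.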

Add derivation

Y = X1 + X2 + X3
⇒ y × th_y/128 = x1 × th_x1/128 + x2 × th_x2/128 + x3 × th_x3/128
⇒ y = x1 × th_x1/th_y + x2 × th_x2/th_y + x3 × th_x3/th_y
⇒ y = x1 × M1 + x2 × M2 + x3 × M3, where M1, M2, M3 share a common Shift: Mi ≈ Mi' × 2^(-Shift)
⇒ y = x1 × M1' × 2^(-Shift) + x2 × M2' × 2^(-Shift) + x3 × M3' × 2^(-Shift)
⇒ y = (x1 × M1' + x2 × M2' + x3 × M3') >> Shift
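A sketch of the resulting integer-only add (Shift = 16 is an illustrative choice, and the function names are mine; the products are accumulated in int64 before the final shift):

```cpp
#include <cmath>
#include <cstdint>

constexpr int kShift = 16;  // illustrative common shift for all Mi

// Pre-quantize Mi = th_xi / th_y as Mi' = round(Mi * 2^Shift).
inline int32_t quantize_add_multiplier(double th_xi, double th_y) {
    return static_cast<int32_t>(std::round(th_xi / th_y * (1 << kShift)));
}

// y = (x1*M1' + x2*M2' + x3*M3') >> Shift, per the derivation above.
inline int32_t int_add3(int32_t x1, int32_t x2, int32_t x3,
                        int32_t M1, int32_t M2, int32_t M3) {
    int64_t acc = static_cast<int64_t>(x1) * M1
                + static_cast<int64_t>(x2) * M2
                + static_cast<int64_t>(x3) * M3;
    return static_cast<int32_t>(acc >> kShift);
}
```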

Matrix multiplication derivation

Given two N × N matrices r1 and r2 with r3 = r1 × r2, and setting all zero_points to 0 for simplicity, the derivation from float to integer arithmetic is:

r(i,j)^a = S_a × q(i,j)^a
⇒ S3 × q(i,k)^3 = Σ_{j=1..N} S1 × q(i,j)^1 × S2 × q(j,k)^2
⇒ q(i,k)^3 = M × Σ_{j=1..N} q(i,j)^1 × q(j,k)^2, where M := S1 × S2 / S3 = 2^(-n) × M0
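Putting the derivation together, an integer-only matmul can be sketched as follows (zero_points fixed at 0 as above; names are illustrative, and int32 accumulation assumes N is small enough not to overflow):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// N x N int8 matmul: accumulate int8 products in int32, then requantize
// with M = S1*S2/S3 decomposed via frexp into M0 * 2^exp, with M0 stored
// as fixed-point int32.
std::vector<int8_t> int8_matmul(const std::vector<int8_t>& q1,
                                const std::vector<int8_t>& q2,
                                int N, double S1, double S2, double S3) {
    int exp = 0;
    double M0 = std::frexp(S1 * S2 / S3, &exp);
    int32_t mult = static_cast<int32_t>(std::round(M0 * (1ll << 31)));
    int right_shift = 31 - exp;  // acc * M == (acc * mult) >> right_shift
    std::vector<int8_t> q3(N * N);
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k) {
            int32_t acc = 0;
            for (int j = 0; j < N; ++j)
                acc += static_cast<int32_t>(q1[i * N + j]) * q2[j * N + k];
            int64_t scaled = (static_cast<int64_t>(acc) * mult) >> right_shift;
            if (scaled > 127) scaled = 127;     // clamp back into int8
            if (scaled < -128) scaled = -128;
            q3[i * N + k] = static_cast<int8_t>(scaled);
        }
    return q3;
}
```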

Related functions

cmath

std::round

double round(double x)

Rounds to the nearest integer (halfway cases away from zero), e.g. std::round(7.479) = 7, std::round(7.579) = 8

std::floor

double floor(double x)

Rounds down to the largest integer value not greater than x, e.g. std::floor(7.579) = 7

std::frexp

double frexp(double x, int *y)

Decomposes a floating-point number into a normalized fraction and a power of two: if w = std::frexp(x, &y), then x = w × 2^y, with w in (-1.0, -0.5] ∪ [0.5, 1.0)

algorithm

std::min_element / std::max_element

template< class ForwardIt >
ForwardIt min_element( ForwardIt first, ForwardIt last );
template< class ForwardIt, class Compare >
ForwardIt min_element( ForwardIt first, ForwardIt last, Compare comp );

Finds the smallest / largest element in a range
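For example, these can be combined to compute a tensor's threshold (the maximum absolute value used in the sections above; the helper name is mine):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Threshold of a tensor = max(|min element|, |max element|).
float tensor_threshold(const std::vector<float>& t) {
    float lo = *std::min_element(t.begin(), t.end());
    float hi = *std::max_element(t.begin(), t.end());
    return std::max(std::fabs(lo), std::fabs(hi));
}
```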

References

TensorFlow Lite 8-bit quantization specification

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference