Qwen3VL Analysis
Source code: qwen3_vl
Vision Part
Qwen3VL uses an image patch size of 16. For example, an 800x640 image gives a thw of [1, 50, 40] (800/16 = 50, 640/16 = 40).
Some important parameters:
hidden_size = 1152
num_heads = 16
1. fast_pos_embed_interpolate
This adopts a finer-grained positional-embedding strategy: the embeddings are derived flexibly from thw. The original training size is 768x768, i.e. an original thw of [1, 48, 48], corresponding to num_position_embeddings = 2304; in the code, self.num_grid_per_side = 2304**0.5 = 48.
Then h_idxs splits the range [0, 47] into 50 evenly spaced points and w_idxs splits [0, 47] into 40 points:
h_idxs: tensor([ 0.0000, 0.9592, 1.9184, 2.8776, 3.8367, 4.7959, 5.7551, 6.7143,
7.6735, 8.6327, 9.5918, 10.5510, 11.5102, 12.4694, 13.4286, 14.3878,
15.3469, 16.3061, 17.2653, 18.2245, 19.1837, 20.1429, 21.1020, 22.0612,
23.0204, 23.9796, 24.9388, 25.8980, 26.8571, 27.8163, 28.7755, 29.7347,
30.6939, 31.6531, 32.6122, 33.5714, 34.5306, 35.4898, 36.4490, 37.4082,
38.3673, 39.3265, 40.2857, 41.2449, 42.2041, 43.1633, 44.1224, 45.0816,
46.0408, 47.0000])
w_idxs: tensor([ 0.0000, 1.2051, 2.4103, 3.6154, 4.8205, 6.0256, 7.2308, 8.4359,
9.6410, 10.8462, 12.0513, 13.2564, 14.4615, 15.6667, 16.8718, 18.0769,
19.2821, 20.4872, 21.6923, 22.8974, 24.1026, 25.3077, 26.5128, 27.7179,
28.9231, 30.1282, 31.3333, 32.5385, 33.7436, 34.9487, 36.1538, 37.3590,
38.5641, 39.7692, 40.9744, 42.1795, 43.3846, 44.5897, 45.7949, 47.0000])
Bilinear interpolation then yields idx_tensor and weight_tensor:
idx_tensor shape: torch.Size([4, 2000])
idx_tensor: tensor([[ 0, 1, 2, ..., 2300, 2301, 2303],
[ 1, 2, 3, ..., 2301, 2302, 2303],
[ 48, 49, 50, ..., 2300, 2301, 2303],
[ 49, 50, 51, ..., 2301, 2302, 2303]])
weight_tensor shape: torch.Size([4, 2000])
weight_tensor: tensor([[1.0000, 0.7949, 0.5897, ..., 0.4103, 0.2051, 1.0000],
[0.0000, 0.2051, 0.4103, ..., 0.5897, 0.7949, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])
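For reference, idx_tensor and weight_tensor above can be reproduced with a minimal bilinear-interpolation sketch (variable names are illustrative, not verbatim from the source):

import torch

num_grid_per_side, h, w = 48, 50, 40

h_idxs = torch.linspace(0, num_grid_per_side - 1, h)  # the h_idxs printed above
w_idxs = torch.linspace(0, num_grid_per_side - 1, w)  # the w_idxs printed above

h_floor, w_floor = h_idxs.long(), w_idxs.long()
h_ceil = (h_floor + 1).clamp(max=num_grid_per_side - 1)
w_ceil = (w_floor + 1).clamp(max=num_grid_per_side - 1)
dh, dw = h_idxs - h_floor, w_idxs - w_floor  # fractional offsets

# Four corners per patch: (floor,floor), (floor,ceil), (ceil,floor), (ceil,ceil).
corner_h = torch.stack([h_floor, h_floor, h_ceil, h_ceil])  # [4, 50]
corner_w = torch.stack([w_floor, w_ceil, w_floor, w_ceil])  # [4, 40]
weight_h = torch.stack([1 - dh, 1 - dh, dh, dh])            # [4, 50]
weight_w = torch.stack([1 - dw, dw, 1 - dw, dw])            # [4, 40]

# Flatten the (h, w) grid into indices of the 48x48 = 2304-entry table.
idx_tensor = (corner_h[:, :, None] * num_grid_per_side
              + corner_w[:, None, :]).reshape(4, -1)        # [4, 2000]
weight_tensor = (weight_h[:, :, None]
                 * weight_w[:, None, :]).reshape(4, -1)     # [4, 2000]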
After the weighted lookup into pos_embed, patch_pos_embeds is:
patch_pos_embeds shape: torch.Size([2000, 1152])
patch_pos_embeds: tensor([[ 1.4441, 0.9112, -0.3521, ..., -1.8984, -1.3373, 0.3247],
[-0.3676, 0.1233, 0.9939, ..., -0.3780, -0.4418, -0.8809],
[ 0.1733, 0.8421, -0.2520, ..., 0.5423, -0.8453, -2.3868],
...,
[-0.7435, 0.0577, 0.6804, ..., -0.6957, -1.3058, 0.2319],
[-0.6288, 0.9080, -1.0453, ..., -0.9599, -0.5916, 1.1169],
[ 1.2138, -0.0820, -2.2672, ..., -0.1266, 0.4536, -1.5239]])
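This step amounts to a weighted sum over the four corner embeddings, roughly (pos_embed here being an nn.Embedding(2304, 1152); a sketch, not verbatim source):

patch_pos_embeds = (pos_embed(idx_tensor) * weight_tensor[:, :, None]).sum(dim=0)  # [2000, 1152]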
Then the embeddings are permuted into merge order (here t = 1, h = 50, w = 40, merge_size = 2):
pos_embed = (
    pos_embed.view(t, h // merge_size, merge_size, w // merge_size, merge_size, -1)
    .permute(0, 1, 3, 2, 4, 5)  # group each 2x2 merge block's patches together
    .flatten(0, 4)              # flatten back to a [2000, dim] sequence in merge order
)
The final output is [2000, 1152].
2. rot_pos_emb
This function covers how the image pos_ids are produced and how they are mapped to positional encodings. The ids are in fact assigned in merge order (see the sketch below). Taking thw = [1, 50, 40] as an example, the ids are:
pos_ids shape: torch.Size([2000, 2])
pos_ids: tensor([[ 0, 0],[ 0, 1],[ 1, 0],...,[48, 39],[49, 38],[49, 39]])
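A minimal sketch that reproduces this merge-order id assignment (names are illustrative):

import torch

h, w, m = 50, 40, 2  # patch grid and merge size for this example

hpos = torch.arange(h).unsqueeze(1).expand(-1, w)  # row id of every patch
wpos = torch.arange(w).unsqueeze(0).expand(h, -1)  # column id of every patch

def merge_order(x):  # make the four patches of each 2x2 block consecutive
    return x.reshape(h // m, m, w // m, m).permute(0, 2, 1, 3).flatten()

pos_ids = torch.stack([merge_order(hpos), merge_order(wpos)], dim=-1)  # [2000, 2]
print(pos_ids[:4].tolist())  # [[0, 0], [0, 1], [1, 0], [1, 1]]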
The mapping module Qwen3VLVisionRotaryEmbedding is identical to the one in Qwen2VL:
head_dim = hidden_size // num_heads = 72
self.rotary_pos_emb = Qwen3VLVisionRotaryEmbedding(head_dim // 2)
freq_table = self.rotary_pos_emb(50) # 50 is taken from the maximum id
"""
freq_table shape: torch.Size([50, 18])
freq_table: tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[1.0000e+00, 5.9948e-01, 3.5938e-01, 2.1544e-01, 1.2915e-01, 7.7426e-02,
4.6416e-02, 2.7826e-02, 1.6681e-02, 1.0000e-02, 5.9948e-03, 3.5938e-03,
......
"""
Finally, all pos_ids are mapped through the table to obtain the positional encodings:
embeddings = freq_table[pos_ids]
"""
embeddings shape: torch.Size([2000, 2, 18])
embeddings: tensor([[[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
......
"""
3. position_embeddings
The cos and sin of the positional encodings are then computed:
rotary_pos_emb = embeddings.reshape(seq_len, -1)
emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
position_embeddings = (emb.cos(), emb.sin())
"""
position_embeddings[0] shape: torch.Size([2000, 72])
position_embeddings[0]: tensor([[ 1.0000, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000],
[ 1.0000, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000],
[ 0.5403, 0.8256, 0.9361, ..., 1.0000, 1.0000, 1.0000],
...,
[-0.6401, -0.8771, -0.0285, ..., 0.9998, 0.9999, 1.0000],
[ 0.3006, -0.4532, 0.3249, ..., 0.9998, 0.9999, 1.0000],
[ 0.3006, -0.4532, 0.3249, ..., 0.9998, 0.9999, 1.0000]])
"""
4. cu_seqlens
This exists to build the attention mask and was first introduced in Qwen2.5VL. Since Qwen2.5VL has window attention, cu_seqlens describes the mask for full attention while cu_window_seqlens describes the mask for window attention. Qwen3VL no longer has window attention, so cu_seqlens is effectively redundant.
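For the record, cu_seqlens is just the cumulative patch count per frame, prefixed with 0; a sketch consistent with the Qwen2.5VL construction:

import torch

grid_thw = torch.tensor([[1, 50, 40]])
seqlens = (grid_thw[:, 1] * grid_thw[:, 2]).repeat_interleave(grid_thw[:, 0])
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0), (1, 0))  # tensor([0, 2000])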
5. Qwen3VLVisionBlock
Roughly the same as the Qwen2.5VL block: attention plus MLP, with only a few minor differences:
- norm1 and norm2 use LayerNorm, whereas Qwen2.5VL uses RMSNorm;
- the MLP has two linear layers with a gelu_pytorch_tanh activation, whereas Qwen2.5VL uses three linear layers with silu (a sketch follows this list);
- window attention is gone.
The output of each block is [2000, 1152].
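A sketch of the two-linear MLP described above (the class name and the intermediate size 4304 are assumptions, not verbatim source):

import torch.nn as nn

class VisionMLP(nn.Module):  # illustrative; intermediate size assumed
    def __init__(self, hidden=1152, intermediate=4304):
        super().__init__()
        self.linear_fc1 = nn.Linear(hidden, intermediate)
        self.act = nn.GELU(approximate="tanh")  # gelu_pytorch_tanh
        self.linear_fc2 = nn.Linear(intermediate, hidden)

    def forward(self, x):  # x: [2000, 1152] -> [2000, 1152]
        return self.linear_fc2(self.act(self.linear_fc1(x)))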
In addition, the outputs of the intermediate blocks at deepstack_visual_indexes (layers 8/16/24) each pass through their own Qwen3VLVisionPatchMerger, are converted to [500, 2048], and are saved.
6. Qwen3VLVisionPatchMerger
Basically the same as in Qwen2.5VL, except the normalization is changed to LayerNorm. The final output is [500, 2048].
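A minimal sketch of the merger as described (the class layout is an assumed reconstruction, not verbatim source):

import torch.nn as nn

class PatchMerger(nn.Module):  # illustrative reconstruction
    def __init__(self, dim=1152, out_dim=2048, merge=2):
        super().__init__()
        self.hidden = dim * merge * merge  # 4 patches concatenated: 4608
        self.norm = nn.LayerNorm(dim)      # LayerNorm (RMSNorm in Qwen2.5VL)
        self.mlp = nn.Sequential(nn.Linear(self.hidden, self.hidden), nn.GELU(),
                                 nn.Linear(self.hidden, out_dim))

    def forward(self, x):                                     # x: [2000, 1152]
        return self.mlp(self.norm(x).view(-1, self.hidden))  # [500, 2048]

Note that viewing [2000, 1152] as [500, 4608] groups four consecutive rows, which is exactly why the earlier merge-order permutation matters.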
LLM Part
1. Placement of position_ids
Same as Qwen2.5VL, implemented in get_rope_index, using three dimensions [T, H, W]. The rules are as follows (a toy sketch follows this list):
- Text tokens have identical T/H/W, increasing in steps of 1 until the vision segment begins, e.g. [0, 1, 2, 3, 4].
- For the vision segment, T increments by 1 over the last text T and then stays constant, e.g. [5, 5, 5, 5, 5, 5, 5, 5, ..., 5]; H and W advance in 2x2-merge order; the next text segment then starts 50 higher, e.g. [55, 56, ...].
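A toy reconstruction of the numbers above for 5 text tokens followed by this image (illustration only, not the actual get_rope_index code):

import torch

n_text, grid_h = 5, 50                                      # 5 text tokens; image grid h = 50
text_pos = torch.arange(n_text).unsqueeze(0).expand(3, -1)  # T/H/W identical: 0..4
vision_t = torch.full((500,), n_text)                       # T frozen at 5 for all 500 tokens
next_text_start = n_text + grid_h                           # 5 + 50 = 55, matching [55, 56, ...]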
2. How the image is embedded
- First, the tokenizer converts the text into input_ids, with the image portion held by image_token_id placeholders; in this example the image contributes 500 image_token_id tokens. With, say, 100 text ids and 500 image ids, input_ids has shape [1, 600].
- input_ids passes through the word embedding [151936, 2048], yielding [1, 600, 2048].
- The Vision output [500, 2048] is scattered into the corresponding positions, yielding [1, 600, 2048] (see the sketch below).
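A sketch of that splice with toy shapes (the image_token_id value is an assumption; the real code uses the model's embedding layer and a masked scatter):

import torch

image_token_id = 151655                            # placeholder id (assumed value)
input_ids = torch.randint(0, 151936, (1, 600))
input_ids[0, 100:600] = image_token_id             # 100 text ids + 500 image placeholders

embed_tokens = torch.nn.Embedding(151936, 2048)
inputs_embeds = embed_tokens(input_ids)            # [1, 600, 2048]

image_embeds = torch.randn(500, 2048)              # the Vision output
mask = (input_ids == image_token_id).unsqueeze(-1) # [1, 600, 1]
inputs_embeds = inputs_embeds.masked_scatter(mask, image_embeds)  # [1, 600, 2048]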
3. Block processing
This part is largely the same as Qwen2.5VL: attention + MLP, with output [1, 600, 2048].
In addition, the first three layers each add the saved deepstack_visual features for that layer to the vision portion of their output (see the sketch below).
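A toy illustration of that injection, using the shapes from this example (not the actual decoder code):

import torch

hidden_states = torch.randn(1, 600, 2048)
visual_pos_mask = torch.zeros(1, 600, dtype=torch.bool)
visual_pos_mask[0, 100:600] = True                        # the 500 image positions
deepstack_feats = [torch.randn(500, 2048) for _ in range(3)]

for layer_idx in range(3):                                # only the first three layers
    # hidden_states = decoder_layer(hidden_states, ...)   # normal attention+mlp forward
    hidden_states[visual_pos_mask] += deepstack_feats[layer_idx]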