Qwen3VL Analysis

Source code: qwen3_vl
Vision Part
Qwen3VL uses an image patch size of 16; an 800x640 image, for example, corresponds to thw = [1, 50, 40].
Some important parameters:
hidden_size = 1152
num_heads = 16
1. fast_pos_embed_interpolate
This adopts a more fine-grained positional-encoding strategy: the embeddings are derived flexibly from thw. The original training resolution is 768x768, i.e. an original thw of [1, 48, 48].
That corresponds to num_position_embeddings = 2304, and in the code self.num_grid_per_side = 2304**0.5 = 48.
So h_idxs samples the range [0, 47] at 50 points and w_idxs samples [0, 47] at 40 points, giving:
h_idxs: tensor([ 0.0000, 0.9592, 1.9184, 2.8776, 3.8367, 4.7959, 5.7551, 6.7143,
7.6735, 8.6327, 9.5918, 10.5510, 11.5102, 12.4694, 13.4286, 14.3878,
15.3469, 16.3061, 17.2653, 18.2245, 19.1837, 20.1429, 21.1020, 22.0612,
23.0204, 23.9796, 24.9388, 25.8980, 26.8571, 27.8163, 28.7755, 29.7347,
30.6939, 31.6531, 32.6122, 33.5714, 34.5306, 35.4898, 36.4490, 37.4082,
38.3673, 39.3265, 40.2857, 41.2449, 42.2041, 43.1633, 44.1224, 45.0816,
46.0408, 47.0000])
w_idxs: tensor([ 0.0000, 1.2051, 2.4103, 3.6154, 4.8205, 6.0256, 7.2308, 8.4359,
9.6410, 10.8462, 12.0513, 13.2564, 14.4615, 15.6667, 16.8718, 18.0769,
19.2821, 20.4872, 21.6923, 22.8974, 24.1026, 25.3077, 26.5128, 27.7179,
28.9231, 30.1282, 31.3333, 32.5385, 33.7436, 34.9487, 36.1538, 37.3590,
38.5641, 39.7692, 40.9744, 42.1795, 43.3846, 44.5897, 45.7949, 47.0000])
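As a minimal sketch (assuming plain torch.linspace sampling, which matches the dumps above), the two index grids can be reproduced like this:

import torch

num_grid_per_side = 48      # int(2304 ** 0.5)
t, h, w = 1, 50, 40         # target thw for the 800x640 example

h_idxs = torch.linspace(0, num_grid_per_side - 1, h)  # 50 values spanning [0, 47]
w_idxs = torch.linspace(0, num_grid_per_side - 1, w)  # 40 values spanning [0, 47]
print(h_idxs[:3])  # tensor([0.0000, 0.9592, 1.9184])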
After the bilinear-interpolation computation we obtain idx_tensor and weight_tensor:
idx_tensor shape: torch.Size([4, 2000])
idx_tensor: tensor([[ 0, 1, 2, ..., 2300, 2301, 2303],
[ 1, 2, 3, ..., 2301, 2302, 2303],
[ 48, 49, 50, ..., 2300, 2301, 2303],
[ 49, 50, 51, ..., 2301, 2302, 2303]])
weight_tensor shape: torch.Size([4, 2000])
weight_tensor: tensor([[1.0000, 0.7949, 0.5897, ..., 0.4103, 0.2051, 1.0000],
[0.0000, 0.2051, 0.4103, ..., 0.5897, 0.7949, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])
This bilinear interpolation can be understood as follows: the 48x48 grid points are numbered 0 through 2303; after resizing to the 50x40 grid, each target point records the indices of its four nearest source points together with their bilinear weights.
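A sketch of how idx_tensor and weight_tensor can be derived from h_idxs/w_idxs (reconstructed to match the dumps above; the helper names are assumptions):

import torch

num_grid_per_side, h, w = 48, 50, 40
h_idxs = torch.linspace(0, num_grid_per_side - 1, h)
w_idxs = torch.linspace(0, num_grid_per_side - 1, w)

h_floor, w_floor = h_idxs.floor().long(), w_idxs.floor().long()
h_ceil = (h_floor + 1).clamp(max=num_grid_per_side - 1)
w_ceil = (w_floor + 1).clamp(max=num_grid_per_side - 1)
dh, dw = h_idxs - h_floor, w_idxs - w_floor        # fractional offsets

# flatten each (h, w) neighbor pair into an index on the 48x48 grid
flat = lambda hh, ww: (hh[:, None] * num_grid_per_side + ww[None, :]).flatten()
idx_tensor = torch.stack([
    flat(h_floor, w_floor), flat(h_floor, w_ceil),
    flat(h_ceil, w_floor), flat(h_ceil, w_ceil),
])                                                  # [4, 2000]

outer = lambda a, b: (a[:, None] * b[None, :]).flatten()
weight_tensor = torch.stack([
    outer(1 - dh, 1 - dw), outer(1 - dh, dw),
    outer(dh, 1 - dw), outer(dh, dw),
])                                                  # [4, 2000]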
Looking up pos_embed then gives patch_pos_embeds:
patch_pos_embeds shape: torch.Size([2000, 1152])
patch_pos_embeds: tensor([[ 1.4441, 0.9112, -0.3521, ..., -1.8984, -1.3373, 0.3247],
[-0.3676, 0.1233, 0.9939, ..., -0.3780, -0.4418, -0.8809],
[ 0.1733, 0.8421, -0.2520, ..., 0.5423, -0.8453, -2.3868],
...,
[-0.7435, 0.0577, 0.6804, ..., -0.6957, -1.3058, 0.2319],
[-0.6288, 0.9080, -1.0453, ..., -0.9599, -0.5916, 1.1169],
[ 1.2138, -0.0820, -2.2672, ..., -0.1266, 0.4536, -1.5239]])
Then a merge reordering is applied (a tiny demo follows the snippet):
pos_embed = (
    pos_embed.view(t, h // merge_size, merge_size, w // merge_size, merge_size, -1)
    .permute(0, 1, 3, 2, 4, 5)
    .flatten(0, 4)
)
The final output is [2000, 1152].
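To see what the reordering does, here is a tiny self-contained demo on a 1x4x4 grid with merge_size = 2, using the flattened index as the feature:

import torch

t, h, w, merge_size = 1, 4, 4, 2
x = torch.arange(t * h * w).float()[:, None]        # [16, 1], feature = index
out = (
    x.view(t, h // merge_size, merge_size, w // merge_size, merge_size, -1)
    .permute(0, 1, 3, 2, 4, 5)
    .flatten(0, 4)
)
print(out.flatten().long())
# tensor([ 0,  1,  4,  5,  2,  3,  6,  7,  8,  9, 12, 13, 10, 11, 14, 15])
# row-major order has been regrouped into consecutive 2x2 blocks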
2. rot_pos_emb
This function covers how the image pos_ids are generated and how they are mapped to rotary encodings. The ids are assigned in merge order. With thw = [1, 50, 40], the ids are:
pos_ids shape: torch.Size([2000, 2])
pos_ids: tensor([[ 0, 0],[ 0, 1],[ 1, 0],...,[48, 39],[49, 38],[49, 39]])
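A sketch of generating pos_ids in merge order (reconstructed to be consistent with the dump above):

import torch

h, w, merge_size = 50, 40, 2
hpos = torch.arange(h)[:, None].expand(h, w)        # row index of each patch
wpos = torch.arange(w)[None, :].expand(h, w)        # column index of each patch
pos_ids = (
    torch.stack([hpos, wpos], dim=-1)               # [50, 40, 2]
    .view(h // merge_size, merge_size, w // merge_size, merge_size, 2)
    .permute(0, 2, 1, 3, 4)                         # group by 2x2 merge blocks
    .reshape(-1, 2)
)                                                   # [2000, 2]
print(pos_ids[:4])  # tensor([[0, 0], [0, 1], [1, 0], [1, 1]])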
The mapping function Qwen3VLVisionRotaryEmbedding is identical to the one in Qwen2VL:
head_dim = hidden_size // num_heads  # 1152 // 16 = 72
self.rotary_pos_emb = Qwen3VLVisionRotaryEmbedding(head_dim // 2)
freq_table = self.rotary_pos_emb(50)  # 50 comes from the maximum pos_id dimension (max(h, w))
"""
freq_table shape: torch.Size([50, 18])
freq_table: tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[1.0000e+00, 5.9948e-01, 3.5938e-01, 2.1544e-01, 1.2915e-01, 7.7426e-02,
4.6416e-02, 2.7826e-02, 1.6681e-02, 1.0000e-02, 5.9948e-03, 3.5938e-03,
......
"""
Finally, every pos_id is mapped through the table to obtain its positional encoding:
embeddings = freq_table[pos_ids]
"""
embeddings shape: torch.Size([2000, 2, 18])
embeddings: tensor([[[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
......
"""
3. position_embeddings
The encodings are further expanded into cos and sin components:
rotary_pos_emb = embeddings.reshape(seq_len, -1)
emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
position_embeddings = (emb.cos(), emb.sin())
"""
position_embeddings[0] shape: torch.Size([2000, 72])
position_embeddings[0]: tensor([[ 1.0000, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000],
[ 1.0000, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000],
[ 0.5403, 0.8256, 0.9361, ..., 1.0000, 1.0000, 1.0000],
...,
[-0.6401, -0.8771, -0.0285, ..., 0.9998, 0.9999, 1.0000],
[ 0.3006, -0.4532, 0.3249, ..., 0.9998, 0.9999, 1.0000],
[ 0.3006, -0.4532, 0.3249, ..., 0.9998, 0.9999, 1.0000]])
"""
4. cu_seqlens
This is used to build the attention mask and was first introduced in Qwen2.5VL. Because Qwen2.5VL has window attention, cu_seqlens describes the full-attention mask while cu_window_seqlens describes the window-attention mask. Qwen3VL no longer has window attention, so cu_seqlens is effectively redundant.
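For reference, a sketch of how cu_seqlens is typically built (following the Qwen2.5VL pattern: one block of h*w tokens per frame, cumulative boundaries with a leading 0):

import torch
import torch.nn.functional as F

grid_thw = torch.tensor([[1, 50, 40]])   # this example's single image
seqlens = (grid_thw[:, 1] * grid_thw[:, 2]).repeat_interleave(grid_thw[:, 0])
cu_seqlens = F.pad(seqlens.cumsum(0), (1, 0))   # tensor([0, 2000])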
5. Qwen3VLVisionBlock
Largely the same as the Qwen2.5VL block, consisting of attention plus an MLP, with only a few small differences (an MLP sketch follows this list):
- norm1 and norm2 use LayerNorm, whereas Qwen2.5VL uses RMSNorm;
- the MLP is two linear layers with gelu_pytorch_tanh activation, whereas Qwen2.5VL uses three linear layers with silu;
- there is no window attention anymore.
Each block outputs [2000, 1152].
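A minimal sketch of the two-linear MLP described above (intermediate_size = 4304 is an assumption for illustration):

import torch.nn as nn

class VisionMLP(nn.Module):
    def __init__(self, hidden_size=1152, intermediate_size=4304):  # 4304 assumed
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.act = nn.GELU(approximate="tanh")      # i.e. gelu_pytorch_tanh
        self.fc2 = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):                           # x: [2000, 1152]
        return self.fc2(self.act(self.fc1(x)))      # -> [2000, 1152]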
In addition, the outputs of the blocks listed in deepstack_visual_indexes (layers 8/16/24) are each passed through their own Qwen3VLVisionPatchMerger, converted to [500, 2048], and kept for later use.
6. Qwen3VLVisionPatchMerger
Essentially the same as in Qwen2.5VL, except the normalization layer is switched to LayerNorm. The final output is [500, 2048]; a sketch follows.
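A minimal sketch under these assumptions (LayerNorm followed by a two-linear MLP over the merge_size**2 concatenated patches, mirroring the Qwen2.5VL structure):

import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    def __init__(self, context_dim=1152, out_dim=2048, merge_size=2):
        super().__init__()
        self.hidden = context_dim * merge_size**2   # 1152 * 4 = 4608
        self.ln_q = nn.LayerNorm(context_dim)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden, self.hidden),
            nn.GELU(),
            nn.Linear(self.hidden, out_dim),
        )

    def forward(self, x):                           # x: [2000, 1152]
        x = self.ln_q(x).view(-1, self.hidden)      # group 2x2 patches: [500, 4608]
        return self.mlp(x)                          # [500, 2048]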
LLM Part
1. Encoding input_ids
Image encoding is the same as in Qwen2.5VL, but video encoding differs substantially. For a video with grid_thw = [6, 50, 40], Qwen2.5VL lays the tokens out contiguously, e.g.: <|im_start|>user\n<|vision_start|><|video_pad|><|video_pad|>......<|vision_end|>.
In Qwen3VL, a timestamp is inserted before each temporal chunk, as follows:
'<|im_start|>user\n<0.3 seconds><|vision_start|><|video_pad|><|video_pad|><|video_pad|>......<1.4 seconds><|vision_start|><|video_pad|>......
2. Placement of position_ids
Same as Qwen2.5VL, implemented in get_rope_index using three dimensions [T, H, W]. The rules are as follows (a worked layout follows this list):
- For text, T/H/W are identical and increase by 1, e.g. [0, 1, 2, 3, 4], until the video's T is reached.
- The video's T is the last text T plus 1 and then stays constant, e.g. [5, 5, 5, 5, 5, 5, 5, 5, ..., 5]; H and W increase following the 2x2 merge order; the next text segment starts 13 higher, i.e. [18, 19, ...].
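The worked layout above, written out as tensors (the numbers are taken directly from the rules; this is a sketch only):

import torch

text1_T = torch.arange(0, 5)        # [0, 1, 2, 3, 4]; H and W are identical for text
video_T = torch.full((500,), 5)     # T stays at 5 across one frame's 500 visual tokens
text2_T = torch.arange(18, 21)      # resumes at 5 + 13 = 18: [18, 19, 20]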
3. How images are embedded
- First the text is tokenized into input_ids, where the image region is held by image_token_id placeholders; the image in this example fills 500 image_token_id slots. With 100 text ids and 500 image ids, input_ids has shape [1, 600].
- input_ids passes through the word embedding [151936, 2048], giving [1, 600, 2048].
- The vision output [500, 2048] is written into the corresponding positions, giving [1, 600, 2048]. A sketch follows this list.
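A self-contained sketch of the embed-and-scatter step (image_token_id = 151655 and the random data are assumptions for illustration; the masked_scatter pattern is common in the transformers codebase):

import torch

vocab_size, hidden, image_token_id = 151936, 2048, 151655
embed_tokens = torch.nn.Embedding(vocab_size, hidden)

input_ids = torch.full((1, 600), image_token_id)
input_ids[0, :100] = torch.randint(0, 1000, (100,))   # 100 text ids + 500 image ids

inputs_embeds = embed_tokens(input_ids)               # [1, 600, 2048]
image_embeds = torch.randn(500, hidden)               # vision tower output
mask = (input_ids == image_token_id)[..., None]       # [1, 600, 1]
inputs_embeds = inputs_embeds.masked_scatter(mask, image_embeds)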
4. Block processing
This part is broadly the same as Qwen2.5VL: attention + MLP, producing [1, 600, 2048].
There are two further differences:
- deepstack_visual handling: the first three layers each add the corresponding deepstack_visual result to the vision portion of their output.
- mrope layout: when converting [3, 1, 1, 64] to [1, 1, 1, 64], Qwen3VL interleaves the components in THWTHWTHW order, whereas Qwen2.5VL concatenates them sequentially as TTT...HHH...WWW...; see apply_interleaved_mrope, illustrated in the figure below and in the toy demo that follows.
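A toy demo of the two layouts (section sizes [3, 3, 3] are illustrative only; constant values 0/1/2 make the T/H/W placement visible):

import torch

freqs = torch.stack([torch.full((1, 9), float(i)) for i in range(3)])  # [3, 1, 9]

# Qwen2.5VL: sequential concat, TTT...HHH...WWW
sequential = torch.cat([freqs[0, :, :3], freqs[1, :, 3:6], freqs[2, :, 6:9]], dim=-1)
print(sequential)   # tensor([[0., 0., 0., 1., 1., 1., 2., 2., 2.]])

# Qwen3VL: interleaved, THWTHW... (every third channel from each component)
interleaved = freqs[0].clone()
interleaved[:, 1::3] = freqs[1][:, 1::3]
interleaved[:, 2::3] = freqs[2][:, 2::3]
print(interleaved)  # tensor([[0., 1., 2., 0., 1., 2., 0., 1., 2.]])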