Qwen3VL Explained

10 minute read

Source code: qwen3_vl

Vision Part

Qwen3VL uses an image patch size of 16; for example, an 800x640 image corresponds to a thw of [1, 50, 40].

Some important parameters:

hidden_size = 1152
num_heads = 16

1. fast_pos_embed_interpolate

It adopts a finer-grained positional-encoding strategy, deriving the embeddings flexibly from the thw. The original training resolution is 768x768, i.e. an original thw of [1, 48, 48].

This corresponds to num_position_embeddings = 2304, and in the code self.num_grid_per_side = 2304**0.5 = 48.

So h_idxs divides the range [0, 47] into 50 evenly spaced values, and w_idxs divides [0, 47] into 40:

h_idxs: tensor([ 0.0000,  0.9592,  1.9184,  2.8776,  3.8367,  4.7959,  5.7551,  6.7143,
         7.6735,  8.6327,  9.5918, 10.5510, 11.5102, 12.4694, 13.4286, 14.3878,
        15.3469, 16.3061, 17.2653, 18.2245, 19.1837, 20.1429, 21.1020, 22.0612,
        23.0204, 23.9796, 24.9388, 25.8980, 26.8571, 27.8163, 28.7755, 29.7347,
        30.6939, 31.6531, 32.6122, 33.5714, 34.5306, 35.4898, 36.4490, 37.4082,
        38.3673, 39.3265, 40.2857, 41.2449, 42.2041, 43.1633, 44.1224, 45.0816,
        46.0408, 47.0000])
w_idxs: tensor([ 0.0000,  1.2051,  2.4103,  3.6154,  4.8205,  6.0256,  7.2308,  8.4359,
         9.6410, 10.8462, 12.0513, 13.2564, 14.4615, 15.6667, 16.8718, 18.0769,
        19.2821, 20.4872, 21.6923, 22.8974, 24.1026, 25.3077, 26.5128, 27.7179,
        28.9231, 30.1282, 31.3333, 32.5385, 33.7436, 34.9487, 36.1538, 37.3590,
        38.5641, 39.7692, 40.9744, 42.1795, 43.3846, 44.5897, 45.7949, 47.0000])

Bilinear interpolation then produces idx_tensor and weight_tensor:

idx_tensor shape: torch.Size([4, 2000])
idx_tensor: tensor([[   0,    1,    2,  ..., 2300, 2301, 2303],
        [   1,    2,    3,  ..., 2301, 2302, 2303],
        [  48,   49,   50,  ..., 2300, 2301, 2303],
        [  49,   50,   51,  ..., 2301, 2302, 2303]])
weight_tensor shape: torch.Size([4, 2000])
weight_tensor: tensor([[1.0000, 0.7949, 0.5897,  ..., 0.4103, 0.2051, 1.0000],
        [0.0000, 0.2051, 0.4103,  ..., 0.5897, 0.7949, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]])
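
A sketch of the bilinear lookup that is consistent with the dumps above (index bookkeeping simplified relative to the HF code):

import torch

num_grid_per_side = 48
h_idxs = torch.linspace(0, 47, 50)   # as above
w_idxs = torch.linspace(0, 47, 40)

# Floor/ceil neighbours on the 48x48 grid plus fractional offsets.
h_floor, w_floor = h_idxs.long(), w_idxs.long()
h_ceil = (h_floor + 1).clamp(max=num_grid_per_side - 1)
w_ceil = (w_floor + 1).clamp(max=num_grid_per_side - 1)
dh, dw = h_idxs - h_floor, w_idxs - w_floor

# Flatten the 50x40 grid into 2000 positions; each position gets 4
# candidate indices into the 2304-entry table and 4 blend weights.
base_floor = (h_floor * num_grid_per_side).unsqueeze(1)
base_ceil = (h_ceil * num_grid_per_side).unsqueeze(1)
idx_tensor = torch.stack([
    (base_floor + w_floor).flatten(),   # top-left
    (base_floor + w_ceil).flatten(),    # top-right
    (base_ceil + w_floor).flatten(),    # bottom-left
    (base_ceil + w_ceil).flatten(),     # bottom-right
])                                      # [4, 2000]
weight_tensor = torch.stack([
    ((1 - dh).unsqueeze(1) * (1 - dw)).flatten(),
    ((1 - dh).unsqueeze(1) * dw).flatten(),
    (dh.unsqueeze(1) * (1 - dw)).flatten(),
    (dh.unsqueeze(1) * dw).flatten(),
])                                      # [4, 2000]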

Passing these through pos_embed yields patch_pos_embeds:

patch_pos_embeds shape: torch.Size([2000, 1152])
patch_pos_embeds: tensor([[ 1.4441,  0.9112, -0.3521,  ..., -1.8984, -1.3373,  0.3247],
        [-0.3676,  0.1233,  0.9939,  ..., -0.3780, -0.4418, -0.8809],
        [ 0.1733,  0.8421, -0.2520,  ...,  0.5423, -0.8453, -2.3868],
        ...,
        [-0.7435,  0.0577,  0.6804,  ..., -0.6957, -1.3058,  0.2319],
        [-0.6288,  0.9080, -1.0453,  ..., -0.9599, -0.5916,  1.1169],
        [ 1.2138, -0.0820, -2.2672,  ..., -0.1266,  0.4536, -1.5239]])
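
Continuing the sketch above, the lookup itself is just an embedding gather followed by a weighted sum:

import torch.nn as nn

pos_embed = nn.Embedding(2304, 1152)   # the learned 48x48 positional table

# Gather the four neighbours and blend: [4, 2000, 1152] -> [2000, 1152].
patch_pos_embeds = (pos_embed(idx_tensor) * weight_tensor[:, :, None]).sum(dim=0)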

The result is then permuted into merge order:

pos_embed = (
    pos_embed.view(t, h // merge_size, merge_size, w // merge_size, merge_size, -1)
    .permute(0, 1, 3, 2, 4, 5)  # bring each 2x2 merge window together
    .flatten(0, 4)              # back to [t * h * w, hidden_size], now in merge order
)

The final output is [2000, 1152].

2. rot_pos_emb

This function determines how the image pos_ids are generated and how they are mapped to positional encodings. The ids are in fact assigned in merge order. With thw = [1, 50, 40], the resulting ids are:

pos_ids shape: torch.Size([2000, 2])
pos_ids: tensor([[ 0,  0],[ 0,  1],[ 1,  0],...,[48, 39],[49, 38],[49, 39]])
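
A sketch that reproduces these ids, assuming merge_size = 2:

import torch

t, h, w, merge_size = 1, 50, 40, 2

# Row/column index of every patch, then reshuffled into merge order
# with the same view/permute trick as the positional embeddings.
hpos = torch.arange(h).unsqueeze(1).expand(-1, w)
wpos = torch.arange(w).unsqueeze(0).expand(h, -1)
pos_ids = torch.stack([hpos, wpos], dim=-1)                        # [50, 40, 2]
pos_ids = (
    pos_ids.reshape(h // merge_size, merge_size, w // merge_size, merge_size, 2)
    .permute(0, 2, 1, 3, 4)
    .reshape(-1, 2)
    .repeat(t, 1)
)                                                                   # [2000, 2]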

The mapping function Qwen3VLVisionRotaryEmbedding is identical to the one in Qwen2VL:

head_dim = hidden_size // num_heads = 72 
self.rotary_pos_emb = Qwen3VLVisionRotaryEmbedding(head_dim // 2)
freq_table = self.rotary_pos_emb(50) # 50 = max(h, w) of the grid, covering ids 0..49
"""
freq_table shape: torch.Size([50, 18])
freq_table: tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.0000e+00, 5.9948e-01, 3.5938e-01, 2.1544e-01, 1.2915e-01, 7.7426e-02,
         4.6416e-02, 2.7826e-02, 1.6681e-02, 1.0000e-02, 5.9948e-03, 3.5938e-03,
         ......
"""

Finally, all pos_ids are run through the table to obtain the positional encodings:

embeddings = freq_table[pos_ids]
"""
embeddings shape: torch.Size([2000, 2, 18])
embeddings: tensor([[[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,
          0.0000e+00, 0.0000e+00],
         ......
"""

3. position_embeddings

The encodings are then expanded into cos and sin tables:

rotary_pos_emb = embeddings.reshape(seq_len, -1)
emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
position_embeddings = (emb.cos(), emb.sin())
"""
position_embeddings[0] shape: torch.Size([2000, 72])
position_embeddings[0]: tensor([[ 1.0000,  1.0000,  1.0000,  ...,  1.0000,  1.0000,  1.0000],
        [ 1.0000,  1.0000,  1.0000,  ...,  1.0000,  1.0000,  1.0000],
        [ 0.5403,  0.8256,  0.9361,  ...,  1.0000,  1.0000,  1.0000],
        ...,
        [-0.6401, -0.8771, -0.0285,  ...,  0.9998,  0.9999,  1.0000],
        [ 0.3006, -0.4532,  0.3249,  ...,  0.9998,  0.9999,  1.0000],
        [ 0.3006, -0.4532,  0.3249,  ...,  0.9998,  0.9999,  1.0000]])
"""

4. cu_seqlens

This exists to build the attention mask and was first introduced in Qwen2.5VL. Because Qwen2.5VL has window attention, cu_seqlens describes the full-attention mask while cu_window_seqlens describes the window-attention mask. Qwen3VL no longer has window attention, so cu_seqlens is effectively redundant.
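
For reference, a sketch of how cu_seqlens is built, following the Qwen2VL-style code (one cumulative boundary per temporal frame):

import torch

grid_thw = torch.tensor([[1, 50, 40]])

cu_seqlens = torch.repeat_interleave(
    grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]
).cumsum(dim=0)
cu_seqlens = torch.nn.functional.pad(cu_seqlens, (1, 0), value=0)
# tensor([   0, 2000]): a single image forms one full-attention block.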

5. Qwen3VLVisionBlock

The block is roughly the same as Qwen2.5VL's, consisting of attention plus an MLP, with only a few minor differences (a rough sketch follows below):

  • norm1 and norm2 use LayerNorm; Qwen2.5VL uses RMSNorm
  • The MLP has two linear layers with gelu_pytorch_tanh activation; Qwen2.5VL has three linear layers with silu activation
  • There is no longer any window attention

Each block's output is [2000, 1152].
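
A rough, non-authoritative sketch of the block layout (the attention module is a stand-in and intermediate_size is an assumed placeholder, not taken from the config):

import torch
import torch.nn as nn

class VisionBlockSketch(nn.Module):
    def __init__(self, hidden_size: int = 1152, intermediate_size: int = 4304) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)   # LayerNorm, not RMSNorm
        self.norm2 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads=16)  # stand-in
        # Two-layer MLP with tanh-approximated GELU (gelu_pytorch_tanh).
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(approximate="tanh"),
            nn.Linear(intermediate_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))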

In addition, the block outputs at the intermediate deepstack_visual_indexes layers (layers 8/16/24) are each passed through their own Qwen3VLVisionPatchMerger, converted to [500, 2048], and saved for later.

6. Qwen3VLVisionPatchMerger

Basically the same as Qwen2.5VL's, except the normalization is now LayerNorm instead of RMSNorm. The final output is [500, 2048].
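
A sketch of the merger shape (layer names and ordering simplified; the 4608 intermediate width follows from 1152 x 2 x 2):

import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    def __init__(self, context_dim: int = 1152, out_dim: int = 2048, merge_size: int = 2):
        super().__init__()
        self.hidden = context_dim * merge_size**2     # 4608
        self.norm = nn.LayerNorm(context_dim)         # LayerNorm (RMSNorm in Qwen2.5VL)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden, self.hidden),
            nn.GELU(),
            nn.Linear(self.hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # [2000, 1152] -> [500, 4608] -> [500, 2048]
        return self.mlp(self.norm(x).view(-1, self.hidden))

merged = PatchMergerSketch()(torch.randn(2000, 1152))   # torch.Size([500, 2048])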

LLM Part

1. Layout of position_ids

Same as Qwen2.5VL, this is implemented in get_rope_index using three dimensions [T, H, W], with the following rules (a toy example follows the list):

  • For text, T/H/W are identical and increase in steps of 1 until the vision T begins, e.g. [0, 1, 2, 3, 4]
  • The vision T is the last text T plus 1 and then stays constant; H and W increase following the 2x2 merge order; the next text segment starts 50 higher, e.g. T = [5, 5, 5, 5, 5, 5, 5, 5, ..., 5], and the following text is [55, 56, ...]
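
A toy illustration of these rules (not the actual get_rope_index code): 3 text tokens followed by an image whose merged grid is 2x2, i.e. 4 vision tokens:

import torch

text_len, llm_h, llm_w = 3, 2, 2

# Text: T/H/W identical, increasing by 1.
text_pos = torch.arange(text_len).unsqueeze(0).expand(3, -1)

# Image: T stays at last text T + 1, H/W enumerate the merged grid.
st = text_len
t_pos = torch.full((llm_h * llm_w,), st)
h_pos = torch.arange(llm_h).repeat_interleave(llm_w) + st
w_pos = torch.arange(llm_w).repeat(llm_h) + st

position_ids = torch.cat([text_pos, torch.stack([t_pos, h_pos, w_pos])], dim=1)
# tensor([[0, 1, 2, 3, 3, 3, 3],
#         [0, 1, 2, 3, 3, 4, 4],
#         [0, 1, 2, 3, 4, 3, 4]])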

2. How the image is embedded

  • First the text goes through the tokenizer to become input_ids, with the image region held as placeholders via image_token_id; the image in this example fills in 500 image_token_id entries. With, say, 100 text ids and 500 image ids, input_ids is [1, 600]
  • input_ids pass through the word embedding [151936, 2048] to give [1, 600, 2048]
  • The vision output [500, 2048] is scattered into the corresponding positions (sketched below), giving [1, 600, 2048]
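
A sketch of the scatter step, assuming the masked_scatter-style replacement used by the HF multimodal models (the token id here is a placeholder value for illustration):

import torch

image_token_id = 151655                                  # placeholder value
input_ids = torch.randint(0, 1000, (1, 600))
input_ids[0, 100:] = image_token_id                      # 100 text + 500 image ids

embed = torch.nn.Embedding(151936, 2048)
inputs_embeds = embed(input_ids)                         # [1, 600, 2048]
image_embeds = torch.randn(500, 2048)                    # the vision output

mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(mask, image_embeds)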

3. Block processing

This part is largely the same as Qwen2.5VL: attention + MLP, with output [1, 600, 2048].

In addition, the first three layers each add one of the saved deepstack_visual results onto the vision portion of their output, in order.
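
A toy sketch of that merge (shapes from the running example; the mask construction is hypothetical):

import torch

hidden_states = torch.randn(1, 600, 2048)                # decoder layer output
deepstack_feature = torch.randn(500, 2048)               # saved from layer 8/16/24

visual_pos_mask = torch.zeros(1, 600, dtype=torch.bool)
visual_pos_mask[0, 100:] = True                          # the 500 image slots

# Add the deepstack feature onto the vision positions only.
hidden_states[visual_pos_mask] = hidden_states[visual_pos_mask] + deepstack_feature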