Qwen3VL Explained

10 minute read

Source code: qwen3_vl

Vision Part

Qwen3VL uses an image patch size of 16; for example, an 800x640 image corresponds to a thw of [1, 50, 40].

Some important parameters:

hidden_size = 1152
num_heads = 16

1. fast_pos_embed_interpolate

It adopts a finer-grained positional-encoding strategy, deriving the embeddings flexibly from the thw. The original training resolution is 768x768, i.e. an original thw of [1, 48, 48].

This corresponds to num_position_embeddings = 2304, and in the code self.num_grid_per_side = 2304**0.5 = 48.

So h_idxs divides the range [0, 47] into 50 evenly spaced values, and w_idxs divides [0, 47] into 40:

h_idxs: tensor([ 0.0000,  0.9592,  1.9184,  2.8776,  3.8367,  4.7959,  5.7551,  6.7143,
         7.6735,  8.6327,  9.5918, 10.5510, 11.5102, 12.4694, 13.4286, 14.3878,
        15.3469, 16.3061, 17.2653, 18.2245, 19.1837, 20.1429, 21.1020, 22.0612,
        23.0204, 23.9796, 24.9388, 25.8980, 26.8571, 27.8163, 28.7755, 29.7347,
        30.6939, 31.6531, 32.6122, 33.5714, 34.5306, 35.4898, 36.4490, 37.4082,
        38.3673, 39.3265, 40.2857, 41.2449, 42.2041, 43.1633, 44.1224, 45.0816,
        46.0408, 47.0000])
w_idxs: tensor([ 0.0000,  1.2051,  2.4103,  3.6154,  4.8205,  6.0256,  7.2308,  8.4359,
         9.6410, 10.8462, 12.0513, 13.2564, 14.4615, 15.6667, 16.8718, 18.0769,
        19.2821, 20.4872, 21.6923, 22.8974, 24.1026, 25.3077, 26.5128, 27.7179,
        28.9231, 30.1282, 31.3333, 32.5385, 33.7436, 34.9487, 36.1538, 37.3590,
        38.5641, 39.7692, 40.9744, 42.1795, 43.3846, 44.5897, 45.7949, 47.0000])

Bilinear interpolation then produces idx_tensor and weight_tensor:

idx_tensor shape: torch.Size([4, 2000])
idx_tensor: tensor([[   0,    1,    2,  ..., 2300, 2301, 2303],
        [   1,    2,    3,  ..., 2301, 2302, 2303],
        [  48,   49,   50,  ..., 2300, 2301, 2303],
        [  49,   50,   51,  ..., 2301, 2302, 2303]])
weight_tensor shape: torch.Size([4, 2000])
weight_tensor: tensor([[1.0000, 0.7949, 0.5897,  ..., 0.4103, 0.2051, 1.0000],
        [0.0000, 0.2051, 0.4103,  ..., 0.5897, 0.7949, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]])
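
A sketch of the bilinear lookup that is consistent with the dumps above (index bookkeeping simplified relative to the HF code):

import torch

num_grid_per_side = 48
h_idxs = torch.linspace(0, 47, 50)   # as above
w_idxs = torch.linspace(0, 47, 40)

# Floor/ceil neighbours on the 48x48 grid plus fractional offsets.
h_floor, w_floor = h_idxs.long(), w_idxs.long()
h_ceil = (h_floor + 1).clamp(max=num_grid_per_side - 1)
w_ceil = (w_floor + 1).clamp(max=num_grid_per_side - 1)
dh, dw = h_idxs - h_floor, w_idxs - w_floor

# Flatten the 50x40 grid into 2000 positions; each position gets 4
# candidate indices into the 2304-entry table and 4 blend weights.
base_floor = (h_floor * num_grid_per_side).unsqueeze(1)
base_ceil = (h_ceil * num_grid_per_side).unsqueeze(1)
idx_tensor = torch.stack([
    (base_floor + w_floor).flatten(),   # top-left
    (base_floor + w_ceil).flatten(),    # top-right
    (base_ceil + w_floor).flatten(),    # bottom-left
    (base_ceil + w_ceil).flatten(),     # bottom-right
])                                      # [4, 2000]
weight_tensor = torch.stack([
    ((1 - dh).unsqueeze(1) * (1 - dw)).flatten(),
    ((1 - dh).unsqueeze(1) * dw).flatten(),
    (dh.unsqueeze(1) * (1 - dw)).flatten(),
    (dh.unsqueeze(1) * dw).flatten(),
])                                      # [4, 2000]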

Passing these through pos_embed yields patch_pos_embeds:

patch_pos_embeds shape: torch.Size([2000, 1152])
patch_pos_embeds: tensor([[ 1.4441,  0.9112, -0.3521,  ..., -1.8984, -1.3373,  0.3247],
        [-0.3676,  0.1233,  0.9939,  ..., -0.3780, -0.4418, -0.8809],
        [ 0.1733,  0.8421, -0.2520,  ...,  0.5423, -0.8453, -2.3868],
        ...,
        [-0.7435,  0.0577,  0.6804,  ..., -0.6957, -1.3058,  0.2319],
        [-0.6288,  0.9080, -1.0453,  ..., -0.9599, -0.5916,  1.1169],
        [ 1.2138, -0.0820, -2.2672,  ..., -0.1266,  0.4536, -1.5239]])
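
Continuing the sketch above, the lookup itself is just an embedding gather followed by a weighted sum:

import torch.nn as nn

pos_embed = nn.Embedding(2304, 1152)   # the learned 48x48 positional table

# Gather the four neighbours and blend: [4, 2000, 1152] -> [2000, 1152].
patch_pos_embeds = (pos_embed(idx_tensor) * weight_tensor[:, :, None]).sum(dim=0)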

The result is then permuted into merge order:

pos_embed = (
    pos_embed.view(t, h // merge_size, merge_size, w // merge_size, merge_size, -1)
    .permute(0, 1, 3, 2, 4, 5)  # bring each 2x2 merge window together
    .flatten(0, 4)              # back to [t * h * w, hidden_size], now in merge order
)

The final output is [2000, 1152].

2. rot_pos_emb

This function determines how the image pos_ids are generated and how they are mapped to positional encodings. The ids are in fact assigned in merge order. With thw = [1, 50, 40], the resulting ids are:

pos_ids shape: torch.Size([2000, 2])
pos_ids: tensor([[ 0,  0],[ 0,  1],[ 1,  0],...,[48, 39],[49, 38],[49, 39]])
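
A sketch that reproduces these ids, assuming merge_size = 2:

import torch

t, h, w, merge_size = 1, 50, 40, 2

# Row/column index of every patch, then reshuffled into merge order
# with the same view/permute trick as the positional embeddings.
hpos = torch.arange(h).unsqueeze(1).expand(-1, w)
wpos = torch.arange(w).unsqueeze(0).expand(h, -1)
pos_ids = torch.stack([hpos, wpos], dim=-1)                        # [50, 40, 2]
pos_ids = (
    pos_ids.reshape(h // merge_size, merge_size, w // merge_size, merge_size, 2)
    .permute(0, 2, 1, 3, 4)
    .reshape(-1, 2)
    .repeat(t, 1)
)                                                                   # [2000, 2]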

The mapping function Qwen3VLVisionRotaryEmbedding is identical to the one in Qwen2VL:

head_dim = hidden_size // num_heads = 72 
self.rotary_pos_emb = Qwen3VLVisionRotaryEmbedding(head_dim // 2)
freq_table = self.rotary_pos_emb(50) # 50 = max(h, w) of the grid, covering ids 0..49
"""
freq_table shape: torch.Size([50, 18])
freq_table: tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
         0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.0000e+00, 5.9948e-01, 3.5938e-01, 2.1544e-01, 1.2915e-01, 7.7426e-02,
         4.6416e-02, 2.7826e-02, 1.6681e-02, 1.0000e-02, 5.9948e-03, 3.5938e-03,
         ......
"""

Finally, all pos_ids are run through the table to obtain the positional encodings:

embeddings = freq_table[pos_ids]
"""
embeddings shape: torch.Size([2000, 2, 18])
embeddings: tensor([[[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,
          0.0000e+00, 0.0000e+00],
         ......
"""

3. position_embeddings

The encodings are then expanded into cos and sin tables:

rotary_pos_emb = embeddings.reshape(seq_len, -1)
emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
position_embeddings = (emb.cos(), emb.sin())
"""
position_embeddings[0] shape: torch.Size([2000, 72])
position_embeddings[0]: tensor([[ 1.0000,  1.0000,  1.0000,  ...,  1.0000,  1.0000,  1.0000],
        [ 1.0000,  1.0000,  1.0000,  ...,  1.0000,  1.0000,  1.0000],
        [ 0.5403,  0.8256,  0.9361,  ...,  1.0000,  1.0000,  1.0000],
        ...,
        [-0.6401, -0.8771, -0.0285,  ...,  0.9998,  0.9999,  1.0000],
        [ 0.3006, -0.4532,  0.3249,  ...,  0.9998,  0.9999,  1.0000],
        [ 0.3006, -0.4532,  0.3249,  ...,  0.9998,  0.9999,  1.0000]])
"""

4. cu_seqlens

This exists to build the attention mask and was first introduced in Qwen2.5VL. Because Qwen2.5VL has window attention, cu_seqlens describes the full-attention mask while cu_window_seqlens describes the window-attention mask. Qwen3VL no longer has window attention, so cu_seqlens is effectively redundant.
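
For reference, a sketch of how cu_seqlens is built, following the Qwen2VL-style code (one cumulative boundary per temporal frame):

import torch

grid_thw = torch.tensor([[1, 50, 40]])

cu_seqlens = torch.repeat_interleave(
    grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]
).cumsum(dim=0)
cu_seqlens = torch.nn.functional.pad(cu_seqlens, (1, 0), value=0)
# tensor([   0, 2000]): a single image forms one full-attention block.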

5. Qwen3VLVisionBlock

The block is roughly the same as Qwen2.5VL's, consisting of attention plus an MLP, with only a few minor differences (a rough sketch follows below):

  • norm1 and norm2 use LayerNorm; Qwen2.5VL uses RMSNorm
  • The MLP has two linear layers with gelu_pytorch_tanh activation; Qwen2.5VL has three linear layers with silu activation
  • There is no longer any window attention

Each block's output is [2000, 1152].
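
A rough, non-authoritative sketch of the block layout (the attention module is a stand-in and intermediate_size is an assumed placeholder, not taken from the config):

import torch
import torch.nn as nn

class VisionBlockSketch(nn.Module):
    def __init__(self, hidden_size: int = 1152, intermediate_size: int = 4304) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)   # LayerNorm, not RMSNorm
        self.norm2 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads=16)  # stand-in
        # Two-layer MLP with tanh-approximated GELU (gelu_pytorch_tanh).
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(approximate="tanh"),
            nn.Linear(intermediate_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))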

In addition, the block outputs at the intermediate deepstack_visual_indexes layers (layers 8/16/24) are each passed through their own Qwen3VLVisionPatchMerger, converted to [500, 2048], and saved for later.

6. Qwen3VLVisionPatchMerger

Basically the same as Qwen2.5VL's, except the normalization is now LayerNorm instead of RMSNorm. The final output is [500, 2048].
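
A sketch of the merger shape (layer names and ordering simplified; the 4608 intermediate width follows from 1152 x 2 x 2):

import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    def __init__(self, context_dim: int = 1152, out_dim: int = 2048, merge_size: int = 2):
        super().__init__()
        self.hidden = context_dim * merge_size**2     # 4608
        self.norm = nn.LayerNorm(context_dim)         # LayerNorm (RMSNorm in Qwen2.5VL)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden, self.hidden),
            nn.GELU(),
            nn.Linear(self.hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # [2000, 1152] -> [500, 4608] -> [500, 2048]
        return self.mlp(self.norm(x).view(-1, self.hidden))

merged = PatchMergerSketch()(torch.randn(2000, 1152))   # torch.Size([500, 2048])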

LLM Part

1. Layout of position_ids

Same as Qwen2.5VL, this is implemented in get_rope_index using three dimensions [T, H, W], with the following rules (a toy example follows the list):

  • For text, T/H/W are identical and increase in steps of 1 until the vision T begins, e.g. [0, 1, 2, 3, 4]
  • The vision T is the last text T plus 1 and then stays constant; H and W increase following the 2x2 merge order; the next text segment starts 50 higher, e.g. T = [5, 5, 5, 5, 5, 5, 5, 5, ..., 5], and the following text is [55, 56, ...]
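
A toy illustration of these rules (not the actual get_rope_index code): 3 text tokens followed by an image whose merged grid is 2x2, i.e. 4 vision tokens:

import torch

text_len, llm_h, llm_w = 3, 2, 2

# Text: T/H/W identical, increasing by 1.
text_pos = torch.arange(text_len).unsqueeze(0).expand(3, -1)

# Image: T stays at last text T + 1, H/W enumerate the merged grid.
st = text_len
t_pos = torch.full((llm_h * llm_w,), st)
h_pos = torch.arange(llm_h).repeat_interleave(llm_w) + st
w_pos = torch.arange(llm_w).repeat(llm_h) + st

position_ids = torch.cat([text_pos, torch.stack([t_pos, h_pos, w_pos])], dim=1)
# tensor([[0, 1, 2, 3, 3, 3, 3],
#         [0, 1, 2, 3, 3, 4, 4],
#         [0, 1, 2, 3, 4, 3, 4]])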

2. How the image is embedded

  • First the text goes through the tokenizer to become input_ids, with the image region held as placeholders via image_token_id; the image in this example fills in 500 image_token_id entries. With, say, 100 text ids and 500 image ids, input_ids is [1, 600]
  • input_ids pass through the word embedding [151936, 2048] to give [1, 600, 2048]
  • The vision output [500, 2048] is scattered into the corresponding positions (sketched below), giving [1, 600, 2048]
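
A sketch of the scatter step, assuming the masked_scatter-style replacement used by the HF multimodal models (the token id here is a placeholder value for illustration):

import torch

image_token_id = 151655                                  # placeholder value
input_ids = torch.randint(0, 1000, (1, 600))
input_ids[0, 100:] = image_token_id                      # 100 text + 500 image ids

embed = torch.nn.Embedding(151936, 2048)
inputs_embeds = embed(input_ids)                         # [1, 600, 2048]
image_embeds = torch.randn(500, 2048)                    # the vision output

mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(mask, image_embeds)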

3. Block processing

This part is largely the same as Qwen2.5VL: attention + MLP, with output [1, 600, 2048].

In addition, the first three layers each add one of the saved deepstack_visual results onto the vision portion of their output, in order.
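
A toy sketch of that merge (shapes from the running example; the mask construction is hypothetical):

import torch

hidden_states = torch.randn(1, 600, 2048)                # decoder layer output
deepstack_feature = torch.randn(500, 2048)               # saved from layer 8/16/24

visual_pos_mask = torch.zeros(1, 600, dtype=torch.bool)
visual_pos_mask[0, 100:] = True                          # the 500 image slots

# Add the deepstack feature onto the vision positions only.
hidden_states[visual_pos_mask] = hidden_states[visual_pos_mask] + deepstack_feature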