Deploying Qwen2.5-Omni Omni-Modal Tasks Locally with OpenVINO
Yang Yicheng
Introduction
Qwen2.5-Omni is the new-generation end-to-end multimodal flagship model in the Qwen model family. Designed for all-round multimodal perception, it seamlessly handles text, image, audio, and video inputs, and generates both text and natural synthesized speech through real-time streaming responses.
Its key features include:
- All-new omni architecture: Qwen2.5-Omni is built on the new Thinker-Talker architecture, an end-to-end multimodal design that supports cross-modal understanding of text/image/audio/video while generating text and natural speech responses in a streaming fashion.
- Real-time audio and video interaction: the architecture is designed for fully real-time interaction, with support for chunked input and immediate output.
- Natural, fluent speech generation: it surpasses many existing streaming and non-streaming alternatives in the naturalness and stability of generated speech.
- Strong omni-modal performance: when benchmarked against single-modality models of comparable size, it delivers outstanding results; Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio on audio tasks and is on par with Qwen2.5-VL-7B.
- Excellent end-to-end speech instruction following: Qwen2.5-Omni follows spoken instructions as effectively as it handles text input, performing strongly on benchmarks such as MMLU (general knowledge) and GSM8K (mathematical reasoning).
OpenVINO™ has now largely completed its adaptation of Qwen2.5-Omni to optimize inference performance on Intel platforms. Let's walk through how to use OpenVINO™ to deploy Qwen2.5-Omni omni-modal tasks locally.
Step 1: Environment Setup
This example is written as a Jupyter Notebook, so we first need a matching Python environment. The base environment can be set up by following the link below; choose the steps that match your operating system.
https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-getting-started
Figure: Getting-started page for the base environment
Besides the Transformers library and the qwen-omni-utils[decord] package required by the original model, we also need to install the latest OpenVINO™ runtime and the NNCF toolkit for model compression and task-pipeline migration.
pip install -q "git+https://github.com/huggingface/transformers" \
"torchvision" "accelerate" "qwen-omni-utils[decord]" "gradio>=4.19" --no-cache-dir --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q "openvino==2025.1.0" "nncf>=2.16.0"
Step 2: Model Download and Conversion
Qwen2.5-Omni uses a dual-module Thinker-Talker architecture. The Thinker acts as the brain: it processes multimodal inputs such as text, audio, and video, and produces high-level semantic representations together with the corresponding text. The Talker works like the vocal organs: it receives, in a streaming fashion, the semantic representations and text emitted by the Thinker in real time, and fluently synthesizes discrete speech units. The Thinker additionally contains dedicated encoders for image and audio inputs, the Vision Encoder and the Audio Encoder, while the Talker is followed by a Token2Wav module that converts the generated audio tokens into a waveform.
Figure: Qwen2.5-Omni model architecture
Before rebuilding the task pipeline with OpenVINO™, we need to convert the sub-modules listed above into the OpenVINO™ IR model format. OpenVINO™ provides the openvino.convert_model function, which turns a PyTorch model object into an OpenVINO™ model object. Because this interface calls torch.jit.trace to reconstruct a static graph of the model, an additional example_input has to be supplied to emulate the model's original input.
import openvino as ov
import torch

ov_model = ov.convert_model(model, example_input=torch.rand(1, 3, 224, 224))
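The returned model object can then be serialized to the IR format (an .xml/.bin pair) with ov.save_model, which by default also compresses weights to FP16; the file name here is only illustrative:
ov.save_model(ov_model, "vision_encoder.xml")  # writes vision_encoder.xml + vision_encoder.bin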
For this example we provide the convert_qwen2_5_omni_model function, which converts every model module in Qwen2.5-Omni to OpenVINO IR files in a single call. It also integrates the NNCF quantization tool, so by passing in a compression configuration developers can apply weight quantization to the Thinker and Talker models.
import nncf
compression_configuration = {
"mode": nncf.CompressWeightsMode.INT4_ASYM,
"group_size": 128,
"ratio": 0.8,
}
convert_qwen2_5_omni_model(model_id, model_dir, compression_configuration)
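Under the hood, the helper passes this configuration to NNCF's weight-compression API. As a rough sketch of what happens to each exported IR (the file names here are hypothetical; the real call sites live inside convert_qwen2_5_omni_model):
import nncf
import openvino as ov

core = ov.Core()
thinker = core.read_model("thinker_model.xml")  # hypothetical IR path
compressed = nncf.compress_weights(
    thinker,
    mode=nncf.CompressWeightsMode.INT4_ASYM,  # 4-bit asymmetric weight quantization
    group_size=128,                           # quantization parameters shared per group of 128 weights
    ratio=0.8,                                # ~80% of weights go to INT4, the rest stay in INT8
)
ov.save_model(compressed, "thinker_model_int4.xml")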
Step 3: Rebuilding the Task Pipeline
In the third step, we rebuild the entire Qwen2.5-Omni task pipeline on top of the OpenVINO™ IR models we just exported. In principle this only requires replacing the PyTorch objects in the original pipeline with their OpenVINO™ counterparts. However, the Thinker and Talker models exported in the previous step are stateful, meaning their KV cache is managed by OpenVINO™ during autoregressive decoding, so the original past-key-values handling needs a few small changes: before each autoregressive round begins, the model's internal KV cache is reset by calling self.request.reset_state(). In this example we also wrap the complete pipeline in an OVQwen2_5OmniModel class, which initializes the rebuilt task object.
from qwen2_5_omni_helper import OVQwen2_5OmniModel
from transformers import Qwen2_5OmniProcessor
ov_model = OVQwen2_5OmniModel(model_dir, thinker_device=thinker_device.value, talker_device=talker_device.value, token2wav_device=token2wav_device.value)
processor = Qwen2_5OmniProcessor.from_pretrained(model_dir)
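The thinker_device, talker_device, and token2wav_device values above come from the notebook's device-selection widgets; plain strings such as "CPU" or "GPU" work just as well. To illustrate the stateful pattern described earlier, here is a heavily simplified greedy decoder over a stateful OpenVINO model. It is not the actual OVQwen2_5OmniModel code: it assumes an output tensor named "logits" and ignores inputs such as attention masks and position ids that a real pipeline must also feed.
import numpy as np
import openvino as ov

class StatefulDecoder:
    def __init__(self, model_path, device="CPU"):
        core = ov.Core()
        self.request = core.compile_model(model_path, device).create_infer_request()

    def generate(self, input_ids, max_new_tokens=32, eos_token_id=None):
        self.request.reset_state()  # drop the KV cache left over from the previous round
        tokens = list(input_ids)
        next_input = np.array([input_ids], dtype=np.int64)  # full prompt on the first step
        for _ in range(max_new_tokens):
            logits = self.request.infer({"input_ids": next_input})["logits"]
            next_token = int(logits[0, -1].argmax())
            if next_token == eos_token_id:
                break
            tokens.append(next_token)
            next_input = np.array([[next_token]], dtype=np.int64)  # only the new token; the state holds the rest
        return tokens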
Step 4: Running Multimodal Tasks
Following the officially provided examples, we built the following reference tasks with OpenVINO™:
- Text input, audio output
conversation = [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
],
},
{
"role": "user",
"content": [
{"type": "text", "text": "What is the answer for 1+1? Explain it."},
],
},
]
print("Question:\nWhat is the answer for 1+1? Explain it.")
print("Answer:")
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True, use_audio_in_video=False)
from transformers import TextStreamer  # streams decoded tokens to stdout as they are generated

text_ids, audio = ov_model.generate(
    **inputs, stream_config=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True), return_audio=True, thinker_max_new_tokens=256
)
Output:
Question:
What is the answer for 1+1? Explain it.
Answer:
[===start thinker===]
Well, 1 + 1 is 2. It's a basic addition. You know, when you have one thing and you add another one, you end up with two things. It's like if you have one apple and you get another apple, you'll have two apples. So, the answer is 2. If you have any other math questions or just want to chat more, feel free to let me know.
Setting `pad_token_id` to `eos_token_id`:8292 for open-end generation.
[===start talker===]
[===start token2wav===]
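The audio tensor returned alongside the text can be written straight to disk. The official Qwen2.5-Omni examples use soundfile with a 24 kHz sample rate, which we assume here as well:
import soundfile as sf

sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)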
- Text + image understanding
conversation = [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
],
},
{
"role": "user",
"content": [
{"type": "image", "image": "cat.png"},
{"type": "text", "text": "What is unusual on this picture?"},
],
},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
from qwen_omni_utils import process_mm_info  # extracts the audio/image/video items from the conversation

audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=False)
text_ids = ov_model.generate(
**inputs, stream_config=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True), return_audio=False, thinker_max_new_tokens=256
)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
Output:
Answer:
[===start thinker===]
Well, it's not really unusual that a cat is in a box. Cats love boxes! But what might seem a bit odd is that the cat is lying on its back in the box. Usually, cats like to curl up in boxes, not lie on their backs. It could be that the cat is just really comfortable in that box and decided to relax that way. What do you think about it?
- Text + audio understanding
question = "Translate the audio to French. "
conversation = [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
],
},
{
"role": "user",
"content": [
{"type": "text", "text": question},
{"type": "audio", "audio": "Trailer.wav"},
],
},
]
print(f"Question:\n{question}")
display(IPython.display.Audio("Trailer.wav"))
print("Answer:")
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audio=audios, return_tensors="pt", padding=True, use_audio_in_video=True)
text_ids = ov_model.generate(
**inputs, stream_config=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True), return_audio=False, thinker_max_new_tokens=256
)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
Output:
Question:
Translate the audio to French.
Answer:
[===start thinker===]
Quelle que soit votre format.如果还有其他翻译相关的问题或者别的事,都可以跟我说哦。
- Text + video understanding
question = "Describe the video"
conversation = [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
],
},
{
"role": "user",
"content": [
{"type": "text", "text": question},
{"type": "video", "video": "coco.mp4"},
],
},
]
print(f"Question:\n{question}")
display(IPython.display.Video("coco.mp4"))
print("Answer:")
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=False)
text_ids = ov_model.generate(
**inputs, stream_config=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True), return_audio=False, thinker_max_new_tokens=256
)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
Output:
Question:
Describe the video
Answer:
qwen-vl-utils using decord to read video.
[===start thinker===]
Well, in the video, there's a black dog on a leash. It's walking on a sidewalk. The dog is wearing a collar and it seems to be moving at a steady pace. There's also a person walking next to the dog, but we can only see part of them, like their legs and feet. The sidewalk is made of concrete and there are some leaves scattered around. The background has a white wall. It looks like a normal day, just a dog being walked. So, what do you think about it? Do you have any other questions?
In addition, this example includes an interactive application built with Gradio. You can run it at the end of the notebook and upload your own images to chat about. Below is a demo of this example deployed on the integrated GPU of a 2nd-generation Intel Core Ultra processor.
Figure: Gradio demo
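For reference, the image-chat part of such a demo can be wired up in a few lines. This is only a sketch that reuses the ov_model, processor, and process_mm_info objects from the steps above; it is not the notebook's actual Gradio code:
import gradio as gr

def answer(image_path, question):
    conversation = [
        {"role": "user", "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": question},
        ]},
    ]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
    inputs = processor(text=text, images=images, return_tensors="pt", padding=True)
    text_ids = ov_model.generate(**inputs, return_audio=False, thinker_max_new_tokens=256)
    return processor.batch_decode(text_ids, skip_special_tokens=True)[0]

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="filepath"), gr.Textbox(label="Question")],
    outputs="text",
)
demo.launch()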
Summary
The release of the Qwen2.5-Omni series brings powerful multimodal data processing and response capabilities. With OpenVINO™, the Qwen2.5-Omni omni-modal model can run locally, efficiently, and with a lower resource footprint, unlocking the potential of the heterogeneous processors in AI PCs.
Reference example
- qwen2.5-omni-chatbot: https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/qwen2.5-omni-chatbot