如何利用OpenVINO™工具套件高效部署混元系列模型

本文将介绍如何利用OpenVINO™工具套件在本地部署混元系列模型。

英特尔开发人员专区

712人浏览 · 2025-08-08 10:39:11

英特尔开发人员专区 · 2025-08-08 10:39:11 发布

模型介绍

混元是腾讯开源的高效大语言模型系列，专为多样化计算环境中的灵活部署而设计。从边缘设备到高并发生产系统，这些模型凭借先进的量化支持和超长上下文能力，在各种场景下都能提供最优性能。系列模型包括预训练和指令微调两种变体，参数规模涵盖0.5B、1.8B、4B和7B。混元官方仓库：https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-7B-Instruct。

OpenVINO™作为一个跨平台的深度学习模型部署工具，可以极大优化大语言的模型的推理性能，在充分激活硬件算力同时，降低对于内存资源的占用。本文将介绍如何利用OpenVINO™工具套件在本地部署混元系列模型。

内容列表

1. 环境准备

2. 模型下载和转换

3. 模型部署

第一步，环境准备

通过以下命令可以搭建基于Python的模型部署环境。

python -m venv py_venv ./py_venv/Scripts/activate.bat pip install --pre -U openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightlypip install nncfpip install git+https://github.com/openvino-dev-samples/optimum-intel.git@hunyuanpip install git+https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca

该示例在以下环境中已得到验证：

· 硬件环境:

o Intel® Core™ Ultra 7 258V

♣ iGPU Driver：32.0.101.6972

♣ NPU Driver：32.0.100.4181

♣ Memory: 32GB

· 操作系统：

o Windows 11 24H2 (26100.4061)

· OpenVINO™版本:

o openvino 2025.2.0

o openvino-genai 2025.2.0.0

o openvino-tokenizers 2025.2.0.0

· Transformers版本:

o https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca

第二步，模型下载和转换

在部署模型之前，我们首先需要将原始的PyTorch模型转换为OpenVINO™的IR静态图格式，并对其进行压缩，以实现更轻量化的部署和最佳的性能表现。通过Optimum提供的命令行工具optimum-cli，我们可以一键完成模型的格式转换和权重量化任务：

optimum-cli export openvino --model tencent/Hunyuan-4B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.8 --trust-remote-code <model_dir>

开发者可以根据模型的输出结果，调整其中的量化参数，包括：

· --model：为模型在HuggingFace上的model id，这里我们也提前下载原始模型，并将model id替换为原始模型的本地路径，针对国内开发者，推荐使用ModelScope魔搭社区作为原始模型的下载渠道，具体加载方式可以参考ModelScope官方指南：https://www.modelscope.cn/docs/models/download

· --weight-format：量化精度，可以选择fp32,fp16,int8,int4,int4_sym_g128,int4_asym_g128,int4_sym_g64,int4_asym_g64

· --group-size：权重里共享量化参数的通道数量

· --ratio：int4/int8权重比例，默认为1.0，0.6表示60%的权重以int4表，40%以int8表示

· --sym：是否开启对称量化

此外我们建议使用以下参数对运行在NPU上的模型进行量化，以达到性能和精度的平衡。

optimum-cli export openvino --model <model id> --task text-generation-with-past --weight-format int4 --sym --group-size -1 --backup-precision int8_sym --trust-remote-code <model_dir>

这里的--backup-precision是指混合量化精度中，8bit参数的量化策略。

第三步，模型部署

目前我们推荐是用openvino-genai来部署大语言以及生成式AI任务，它同时支持Python和C++两种编程语言，安装容量不到200MB，支持流式输出以及多种采样策略。

· GenAI API部署示例

import argparseimport openvino_genai def streamer(subword):   print(subword, end='', flush=True)   # Return flag corresponds whether generation should be stopped.   return openvino_genai.StreamingStatus.RUNNING def main():   parser = argparse.ArgumentParser()   parser.add_argument('model_dir', help='Path to the model directory')   parser.add_argument('device', nargs='?', default='NPU', help='Device to run the model on (default: CPU)')   args = parser.parse_args()    device = args.devicepipe = openvino_genai.LLMPipeline(args.model_dir, device)   tokenizer = pipe.get_tokenizer()   tokenizer.set_chat_template("{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}<｜hy_begin▁of▁sentence｜>{{ system_message }}<｜hy_place▁holder▁no▁3｜>{% else %}{% set loop_messages = messages %}<｜hy_begin▁of▁sentence｜>{% endif %}{% for message in loop_messages %}{% if message['role'] == 'user' %}<｜hy_User｜>{{ message['content'] }}{% elif message['role'] == 'assistant' %}<｜hy_Assistant｜>{{ message['content'] }}<｜hy_place▁holder▁no▁2｜>{% endif %}{% endfor %}{% if add_generation_prompt %}<｜hy_Assistant｜>{% else %}<｜hy_place▁holder▁no▁8｜>{% endif %}{% if add_generation_prompt and enable_thinking is defined and not enable_thinking %}<think>\n\n</think>\n{% endif %}")    config = openvino_genai.GenerationConfig()   config.max_new_tokens = 10204    pipe.start_chat()   while True:       try:           prompt = input('question:\n')       except EOFError:           break       pipe.generate(prompt, config, streamer)       print('\n----------')   pipe.finish_chat() if '__main__' == __name__:   main()

其中，'model_dir'为OpenVINO™ IR格式的模型文件夹路径，'device'为模型部署设备，支持CPU,GPU以及NPU。此外，openvino-genai提供了chat模式的构建方法，通过声明pipe.start_chat()以及pipe.finish_chat()，多轮聊天中的历史数据将被以kvcache的形态，在内存中进行管理，从而提升运行效率。

开发者可以通过该示例中方法调整chat template，以关闭和开启thinking模式，具体方式可以参考官方文档（https://huggingface.co/tencent/Hunyuan-4B-Instruct）。由于目前OpenVINO™ Tokenizer还没有完全支持Hunyuan-7B-Instruct模型默认的chat template格式，因此我们需要手动替换原始的chat template，对其进行简化，具体方法如下：

 tokenizer = pipe.get_tokenizer() tokenizer.set_chat_template("{% for message in messages %}{% if message['role'] == 'system' %}<|startoftext|>{{ message['content'] }}<|extra_4|>{% elif message['role'] == 'assistant' %}<|startoftext|>{{ message['content'] }}<|eos|>{% else %}<|startoftext|>{{ message['content'] }}<|extra_0|>{% endif %}{% endfor %}{{- '<think>\n\n</think>\n' }}")

chat模式输出结果示例：

关于该示例的后续更新，可以关注OpenVINO™ notebooks仓库：https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot

总结

可以看到，利用OpenVINO™工具套件，我们可以非常轻松地将转换后的混元系列模型部署在Intel的硬件平台上，从而进一步在本地构建起各类基于LLM的服务和应用。

参考资料

openvino-genai 示例：https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/text_generation/chat_sample.py
llm-chatbot notebook示例: https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot
openvino-genai仓库： https://github.com/openvinotoolkit/openvino.genai
魔搭社区OpenVINO™专区：https://www.modelscope.cn/organization/OpenVINO
OpenVINO™ Model Hub：https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/model-hub.html