How to Build a Multimodal RAG Application Locally with OpenVINO

This article shows how to use OpenVINO and LlamaIndex to build a RAG pipeline for video understanding tasks.

Intel Developer Zone · Published 2025-02-14 13:11:15

Author: Yang Yicheng (杨亦诚)

Introduction

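Retrieval-Augmented Generation (RAG) systems can reduce the memory footprint of an LLM task and improve its inference performance by filtering a knowledge base down to the key information the task needs. Thanks to mature tools for text parsing, indexing, and retrieval, building a RAG pipeline over text content is now fairly routine. Building one over video content is much harder.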

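Video combines image, audio, and text elements, so it demands more data processing, and more complex processing. This article shows how to use OpenVINO and LlamaIndex to build a RAG pipeline for video understanding tasks.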

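A truly multimodal video understanding RAG system has to handle each modality in the video, such as speech and visual content. This example presents a multimodal RAG pipeline designed for video analysis. It uses a Whisper model to transcribe the speech in the video to text, a CLIP model to generate multimodal embedding vectors, and a vision language model (VLM) to process the retrieved images and text together with the user's request. The figure below shows how the pipeline works.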

Figure: How the video understanding RAG pipeline works

Source code: https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/multimodal-rag

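Contents

Environment setup
Model download and conversion
Video data extraction and processing
Creating the multimodal vector index
Multimodal vector retrieval
Answer generation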

Step 1: Environment setup

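The example is written as a Jupyter Notebook, so a matching Python environment is required. Install the base environment by following the guide at the link below, choosing the steps for your operating system.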

https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-getting-started

Figure: Navigation page for installing the base environment

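This example also depends on the OpenVINO™ integrations for LlamaIndex, which must be installed separately: the llama-index-embeddings-openvino library, which generates multimodal vectors for images and text, and the llama-index-multi-modal-llms-openvino library, which runs multimodal vision inference.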
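
Before running the notebook, it can help to confirm that both integration packages are present. The sketch below is an illustration rather than part of the original notebook. It uses only the standard library to look up the two distribution names mentioned above, and `find_missing` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as given in the text above.
REQUIRED_PACKAGES = [
    "llama-index-embeddings-openvino",
    "llama-index-multi-modal-llms-openvino",
]


def find_missing(packages):
    """Return the subset of `packages` that is not installed in this environment."""
    missing = []
    for name in packages:
        try:
            version(name)  # raises PackageNotFoundError if the distribution is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required integration packages are installed.")
```

Any package the check reports as missing can be installed with pip under the same name.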

Step 2: Model download and conversion

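With the environment in place, download each model used in the pipeline in turn: the speech recognition (ASR) model, the CLIP multimodal embedding model, and the vision language model (VLM).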

Because numeric precision affects model accuracy, this example downloads an already converted int8 ASR model directly from the OpenVINO™ Hugging Face repository (https://huggingface.co/OpenVINO/distil-whisper-large-v3-int8-ov).

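from pathlib import Path

import huggingface_hub as hf_hub

asr_model_id = "OpenVINO/distil-whisper-large-v3-int8-ov"
# Use the last component of the model ID as the local directory name.
asr_model_path = asr_model_id.split("/")[-1]

# Download the pre-converted model only if it is not already on disk.
if not Path(asr_model_path).exists():
    hf_hub.snapshot_download(asr_model_id, local_dir=asr_model_path)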

The CLIP and VLM models, by contrast, are downloaded in their original form and then converted and quantized with the Optimum Intel command-line tool.

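# optimum_cli is a notebook helper that wraps the Optimum Intel command-line tool.
from cmd_helper import optimum_cli

clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
clip_model_path = clip_model_id.split("/")[-1]

# Convert the original model only if no converted copy exists locally.
if not Path(clip_model_path).exists():
    optimum_cli(clip_model_id, clip_model_path)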
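
The snippet above covers only CLIP, and the body of the `optimum_cli` helper is not shown. As a rough illustration of what such a conversion involves, the sketch below builds (but does not run) `optimum-cli export openvino` command lines for both models. `build_export_command`, the VLM output directory name, and the choice of int4 weights with `--trust-remote-code` for the VLM are assumptions made for this example, not details confirmed by the notebook.

```python
import shlex


def build_export_command(model_id, output_dir, weight_format=None, trust_remote_code=False):
    """Build the argv list for an `optimum-cli export openvino` call (does not run it)."""
    cmd = ["optimum-cli", "export", "openvino", "--model", model_id]
    if weight_format is not None:
        # Weight compression, for example "int8" or "int4".
        cmd += ["--weight-format", weight_format]
    if trust_remote_code:
        # Needed for models that ship custom modeling code on the Hub.
        cmd.append("--trust-remote-code")
    cmd.append(output_dir)
    return cmd


clip_cmd = build_export_command(
    "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
    "CLIP-ViT-B-32-laion2B-s34B-b79K",
)
vlm_cmd = build_export_command(
    "microsoft/Phi-3.5-vision-instruct",
    "Phi-3.5-vision-instruct-int4",  # hypothetical output directory name
    weight_format="int4",
    trust_remote_code=True,
)

print(shlex.join(clip_cmd))
print(shlex.join(vlm_cmd))
# To actually run an export: subprocess.run(clip_cmd, check=True)
```

Keeping command construction separate from execution makes it easy to inspect the exact export arguments before starting a long conversion.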

Step 3: Video data extraction and processing

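Next, a third-party tool extracts the audio track and image frames from the video file, and the ASR model transcribes the audio to text so that it can be vectorized later. This step uses a popular-science video about the Gaussian distribution as the example (https://www.youtube.com/watch?v=d_qvLDhkg00). The following snippet initializes the ASR model and runs speech recognition on the audio. The recognition result is saved locally as a .txt file.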

from optimum.intel import OVModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

# asr_device (the device selection) and en_raw_speech (the audio extracted from
# the video) are defined earlier in the notebook.
asr_model = OVModelForSpeechSeq2Seq.from_pretrained(asr_model_path, device=asr_device.value)
asr_processor = AutoProcessor.from_pretrained(asr_model_path)

pipe = pipeline(
    "automatic-speech-recognition",
    model=asr_model,
    tokenizer=asr_processor.tokenizer,
    feature_extractor=asr_processor.feature_extractor,
)

# return_timestamps=True also returns per-chunk start and end times.
result = pipe(en_raw_speech, return_timestamps=True)
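
With `return_timestamps=True`, the Transformers ASR pipeline returns a dictionary containing the full `text` and a list of `chunks`, each with a `timestamp` pair. The sketch below shows one possible way to turn that structure into a timestamped transcript file. The sample data, the `format_transcript` helper, and the output file name are hypothetical, and the notebook may simply save the plain text instead.

```python
import tempfile
from pathlib import Path


def format_transcript(result):
    """Turn an ASR pipeline result into one timestamped line per chunk."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time of the final chunk can be None when the audio is cut off.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return "\n".join(lines)


# Stand-in for the pipeline output; real values come from `pipe(...)`.
sample_result = {
    "text": " Hello world. This is a test.",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello world."},
        {"timestamp": (1.5, None), "text": " This is a test."},
    ],
}

transcript = format_transcript(sample_result)

# Hypothetical output location, written to the system temp directory here.
transcript_path = Path(tempfile.gettempdir()) / "transcript.txt"
transcript_path.write_text(transcript, encoding="utf-8")
print(transcript)
```

Keeping the timestamps in the saved text makes it possible to relate a retrieved passage back to a position in the video.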

Step 4: Creating the multimodal vector index

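This is the most critical step in the whole RAG chain: the text and images obtained from the video file are converted to vectors and stored in a vector database. The quality of these vectors directly affects recall accuracy in the later retrieval step. The CLIP model is initialized first, which the OpenVINO™ integration for LlamaIndex makes straightforward.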

from llama_index.embeddings.huggingface_openvino import OpenVINOClipEmbedding

clip_model = OpenVINOClipEmbedding(model_id_or_path=clip_model_path, device=clip_device.value)

The vector store component provided by LlamaIndex can then be called directly to build the index quickly and to initialize the retriever.

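import qdrant_client
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, StorageContext, Settings
from llama_index.core.node_parser import SentenceSplitter

# Assumption: an in-memory Qdrant instance with one collection per modality.
# The article does not show the notebook's actual storage setup.
client = qdrant_client.QdrantClient(":memory:")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(vector_store=text_store, image_store=image_store)

# Assumption: output_folder holds the transcript and frames produced in step 3.
documents = SimpleDirectoryReader(output_folder).load_data()

Settings.embed_model = clip_model

index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model=Settings.embed_model,
    transformations=[SentenceSplitter(chunk_size=300, chunk_overlap=30)],
)

# Return the 2 most similar text chunks and the 5 most similar frames per query.
retriever_engine = index.as_retriever(similarity_top_k=2, image_similarity_top_k=5)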
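
The `SentenceSplitter(chunk_size=300, chunk_overlap=30)` transformation cuts the transcript into overlapping chunks before embedding. The sketch below illustrates the overlap idea only, under simplifying assumptions: it slides a fixed window over a list of whitespace-style tokens, whereas the real splitter counts tokenizer tokens and prefers sentence boundaries. `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(words, chunk_size=300, chunk_overlap=30):
    """Split a list of tokens into windows that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(words):
            break
    return chunks


# 700 dummy tokens standing in for a transcript.
words = [f"w{i}" for i in range(700)]
chunks = split_with_overlap(words)
print([len(c) for c in chunks])  # [300, 300, 160]
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, which helps recall at a small storage cost.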

Step 5: Multimodal vector retrieval

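Traditional text RAG recalls the key text in a vector database by text similarity. Multimodal RAG additionally searches the image vectors to return the keyframes most relevant to the question, which the VLM then interprets. Here the user's question is embedded as a vector, and the vector engine returns the text chunks and video frames most similar to it. LlamaIndex provides components that reduce these steps to a few function calls.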

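from llama_index.core import SimpleDirectoryReader

query_str = "tell me more about gaussian function"

# retrieve is a helper defined in the notebook; it returns the paths of the
# retrieved frames and the retrieved text chunks.
img, txt = retrieve(retriever_engine=retriever_engine, query_str=query_str)
# Load the retrieved frames as image documents for the VLM.
image_documents = SimpleDirectoryReader(input_dir=output_folder, input_files=img).load_data()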
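
To make the retrieval step concrete, the sketch below ranks text and image entries against a query by cosine similarity, with separate top-k limits for each modality (mirroring `similarity_top_k=2` and `image_similarity_top_k=5`). It is a toy stand-in under stated assumptions: the 3-dimensional vectors, payloads, and the `top_k` helper are invented for illustration, while the real pipeline uses CLIP embeddings and a vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k):
    """Return the k (score, payload) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), payload) for vec, payload in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for CLIP embeddings in a shared text-image space.
text_items = [
    ([1.0, 0.0, 0.0], "text chunk A"),
    ([0.0, 1.0, 0.0], "text chunk B"),
    ([0.7, 0.7, 0.0], "text chunk C"),
]
image_items = [
    ([0.9, 0.1, 0.0], "frame0001.png"),
    ([0.0, 0.0, 1.0], "frame0002.png"),
]

query_vec = [1.0, 0.1, 0.0]  # stand-in for the embedded user question
txt_hits = top_k(query_vec, text_items, k=2)
img_hits = top_k(query_vec, image_items, k=5)

print([payload for _, payload in txt_hits])  # ['text chunk A', 'text chunk C']
print([payload for _, payload in img_hits])  # ['frame0001.png', 'frame0002.png']
```

Because text and frames are embedded into the same space, one query vector can rank both collections.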

Running the code shows the retrieved text chunks and keyframes.

Figure: Keyframes and related text chunks returned by retrieval

Step 6: Answer generation

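The last step of the multimodal RAG pipeline sends the user's question, together with the retrieved text and images, to the VLM to generate an answer. This example uses Microsoft's Phi-3.5-vision-instruct multimodal model and the OpenVINO™ multimodal component integrated with LlamaIndex to interpret the images and text. Note that retrieval usually returns several keyframes, so the chosen vision model must support multi-image input. The following code initializes the VLM.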

from llama_index.multi_modal_llms.openvino import OpenVINOMultiModal

vlm = OpenVINOMultiModal(
    model_id_or_path=vlm_int4_model_path,
    device=vlm_device.value,
    messages_to_prompt=messages_to_prompt,
    trust_remote_code=True,
    generate_kwargs={"do_sample": False, "eos_token_id": processor.tokenizer.eos_token_id},
)
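
The `messages_to_prompt` callback is not shown in this article. For Phi-3.5-vision-instruct, multi-image prompts reference each image with a numbered placeholder such as `<|image_1|>`. The sketch below covers only that placeholder-building part. `build_user_content` is a hypothetical helper, and the actual callback also has to load the images and apply the model's chat template.

```python
def build_user_content(question, num_images):
    """Prefix the question with one numbered image placeholder per retrieved frame."""
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return placeholders + question


content = build_user_content("tell me more about gaussian function", num_images=2)
print(content)
```

Numbering starts at 1, and the placeholder order should match the order in which the images are passed to the processor.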

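With the VLM object initialized, the context and images are passed to the model to generate the final answer. The example also includes a Gradio-based interactive demo for reference.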
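
The prompt sent to the VLM combines the retrieved text with the user's question. The sketch below shows one way the `qa_tmpl_str` and `context_str` values might be assembled. The template wording and the sample retrieved text are assumptions for illustration, and the notebook's actual template may differ.

```python
# Hypothetical template; the notebook's actual wording may differ.
qa_tmpl_str = (
    "Answer the query using only the provided images and the context "
    "retrieved from the video.\n"
    "---------------------\n"
    "Context: {context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Stand-in for the text chunks returned by the retrieval step.
retrieved_text = [
    "The Gaussian function has a bell-shaped curve.",
    "Its width is controlled by the standard deviation.",
]
query_str = "tell me more about gaussian function"

# Join the retrieved chunks into a single context string.
context_str = "\n".join(retrieved_text)
prompt = qa_tmpl_str.format(context_str=context_str, query_str=query_str)
print(prompt)
```

Ending the template with "Answer: " cues the model to begin its reply directly, and the explicit separators keep retrieved context apart from the question.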

response = vlm.stream_complete(
    prompt=qa_tmpl_str.format(context_str=context_str, query_str=query_str),
    image_documents=image_documents,
)

for r in response:
    print(r.delta, end="")

The output is as follows:

“A Gaussian function, also known as a normal distribution, is a type of probability distribution that is symmetric and bell-shaped. It is characterized by its mean and standard deviation, which determine the center and spread of the distribution, respectively. The Gaussian function is widely used in statistics and probability theory due to its unique properties and applications in various fields such as physics, engineering, and finance. The function is defined by the equation e to the negative x squared, where x represents the input variable. The graph of a Gaussian function is a smooth curve that approaches the x-axis as it moves away from the center, creating a bell-like shape. The function is also known for its property of being able to describe the distribution of random variables, making it a fundamental concept in probability theory and statistics.”

Summary

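In video understanding tasks, sending every video frame to the VLM places a heavy burden on its performance and resource usage. Multimodal RAG retrieves the keyframes first, which shrinks the VLM's input and improves the efficiency and accuracy of the whole system. The OpenVINO™ components integrated with LlamaIndex provide a complete solution while running every model in the pipeline smoothly on a local PC.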

References

OpenVINO Notebooks: https://github.com/openvinotoolkit/openvino_notebooks
LlamaIndex and OpenVINO multimodal model example: Local Multimodal pipeline with OpenVINO - LlamaIndex
LlamaIndex and OpenVINO embedding model example: https://docs.llamaindex.ai/en/stable/examples/embeddings/openvino/