Featured · Importance 4/5

Qwen2-Audio Released: A Multimodal Model for Voice Chat and Audio Analysis

Qwen Team Blog · nearly 2 years ago · about 4 min read

Summary

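Qwen2-Audio is Alibaba Cloud's next-generation audio-language model. It supports voice chat and audio analysis, understands spoken instructions directly without an ASR module, and surpasses previous state-of-the-art models on multiple benchmarks.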

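To achieve the goal of building an AGI system, a model should be able to understand information from different modalities. Thanks to the rapid development of large language models, LLMs are now capable of language understanding and reasoning.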

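Previously, we extended our LLM, Qwen, to more modalities, including vision and audio, and built Qwen-VL and Qwen-Audio. Today we release Qwen2-Audio, the next version of Qwen-Audio, which accepts audio and text inputs and generates text outputs.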

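Qwen2-Audio has the following features:

Voice chat: for the first time, users can give instructions to the audio-language model by voice, with no ASR module required.

Audio analysis: the model can analyze audio information, including speech, sound, and music, following text instructions.

Multilingual: the model supports more than 8 languages and dialects, such as Chinese, English, Cantonese, French, Italian, Spanish, German, and Japanese.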

We have open-sourced the weights of Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct on Hugging Face and ModelScope, and built a demo for users to interact with.

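Below are some examples that show the model in action, covering voice chat and audio analysis.

Performance

We ran a series of experiments on benchmark datasets, including LibriSpeech, Common Voice 15, Fleurs, Aishell2, CoVoST2, Meld, Vocalsound, and AIR-Benchmark, to compare Qwen2-Audio with our previously released Qwen-Audio and with the state-of-the-art model for each task. The figure below illustrates how Qwen2-Audio performs on each task. Across all tasks, Qwen2-Audio significantly surpasses either the previous SOTA or Qwen-Audio.

The table below lists more detailed results on these datasets.

Architecture

The following illustrates the training architecture. Specifically, we start from the Qwen language model and an audio encoder as the base models. We then apply, in order, multi-task pretraining for audio-language alignment, followed by supervised fine-tuning and direct preference optimization, so that the model masters downstream tasks and captures human preferences.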
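The post names direct preference optimization (DPO) as the last training stage but does not give the objective or any hyperparameters. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name, its arguments, and the value of `beta` are assumptions made for this example and are not taken from the Qwen2-Audio training setup.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All arguments except beta are summed log-probabilities of a full response.
    beta=0.1 is an illustrative value, not one reported for Qwen2-Audio.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy favors the chosen response relative to the reference model.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls as the policy raises the chosen response's likelihood, relative to the reference model, above that of the rejected response. This is the sense in which the stage can be said to model human preferences.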

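How to Use

Qwen2-Audio is now officially supported by Hugging Face Transformers. We recommend installing the latest version of transformers from source:

    pip install git+https://github.com/huggingface/transformers

We demonstrate how to use Qwen2-Audio-7B-Instruct for voice chat and audio analysis.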
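Before loading the model, it may help to confirm that the installed transformers build exposes the Qwen2-Audio class used in the examples. The helper below is a small sketch for that check. The function name is made up for this illustration, and the check only looks for the class name without downloading any weights.

```python
def has_qwen2_audio_support():
    """Return True if the installed transformers exposes the Qwen2-Audio model class."""
    try:
        import transformers
        # Older releases do not define this attribute at all.
        return hasattr(transformers, "Qwen2AudioForConditionalGeneration")
    except Exception:
        # transformers is missing or failed to import; treat this as "no support".
        return False


if __name__ == "__main__":
    print("Qwen2-Audio support available:", has_qwen2_audio_support())
```

If this prints `False`, reinstalling transformers from source as described above is the likely fix.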

Below is an example of voice chat:

    from io import BytesIO
    from urllib.request import urlopen

    import librosa
    from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
    model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

    # Each user turn holds only audio; the instruction is spoken inside the clip.
    conversation = [
        {"role": "user", "content": [
            {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
        ]},
        {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
        {"role": "user", "content": [
            {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
        ]},
    ]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    # Download every audio clip and resample it to the rate the feature extractor expects.
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele["audio_url"]).read()),
                            sr=processor.feature_extractor.sampling_rate,
                        )[0]
                    )

    inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
    # Move all input tensors to the model's device.
    inputs = inputs.to(model.device)

    generate_ids = model.generate(**inputs, max_length=256)
    # Drop the prompt tokens so that only the newly generated reply is decoded.
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

In voice chat mode, the user input is audio only, with no text; the user's instruction is contained in the audio itself.
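The nested loop in the example walks the conversation and picks out every audio entry in order of appearance. The sketch below isolates that traversal as a pure function that returns only the URLs, so it can be checked without downloading anything or loading the model. The name `collect_audio_urls` is introduced for this illustration and is not part of transformers.

```python
def collect_audio_urls(conversation):
    """Return the audio URLs found in a chat-style conversation, in order of appearance."""
    urls = []
    for message in conversation:
        content = message["content"]
        # Assistant and system turns carry a plain string; only list content can hold audio.
        if isinstance(content, list):
            for element in content:
                if element["type"] == "audio":
                    urls.append(element["audio_url"])
    return urls


conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]

for url in collect_audio_urls(conversation):
    print(url)
```

Preserving the order presumably matters, because the clips passed to the processor are expected to line up with the audio entries in the templated prompt.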

Next is an example of audio analysis:

    from io import BytesIO
    from urllib.request import urlopen

    import librosa
    from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
    model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

    # User turns may combine an audio clip with a text instruction, or carry text alone.
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
            {"type": "text", "text": "What's that sound?"},
        ]},
        {"role": "assistant", "content": "It is the sound of glass shattering."},
        {"role": "user", "content": [
            {"type": "text", "text": "What can you do when you hear that?"},
        ]},
        {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
        {"role": "user", "content": [
            {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
            {"type": "text", "text": "What does the person say?"},
        ]},
    ]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    # Download every audio clip and resample it to the rate the feature extractor expects.
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audios.append(
                        librosa.load(
                            BytesIO(urlopen(ele["audio_url"]).read()),
                            sr=processor.feature_extractor.sampling_rate,
                        )[0]
                    )

    inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
    # Move all input tensors to the model's device.
    inputs = inputs.to(model.device)

    generate_ids = model.generate(**inputs, max_length=256)
    # Drop the prompt tokens so that only the newly generated reply is decoded.
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

By contrast, in audio analysis mode a text instruction is supplied alongside the audio. Switching between the two modes only requires changing the user input; nothing else, such as the system prompt, needs to be adjusted.
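Since the two modes differ only in what a user turn contains, one small builder can produce both kinds of turn. The sketch below assumes the content-list message format used in the examples. `make_user_turn` and the `example.com` URLs are hypothetical and serve only as illustration.

```python
def make_user_turn(audio_url=None, text=None):
    """Build one user message in the content-list format used by the examples."""
    if audio_url is None and text is None:
        raise ValueError("a user turn needs audio, text, or both")
    content = []
    if audio_url is not None:
        content.append({"type": "audio", "audio_url": audio_url})
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# Voice chat: audio only, the instruction is spoken inside the clip.
voice_chat_turn = make_user_turn(audio_url="https://example.com/question.wav")

# Audio analysis: the same audio entry plus a written instruction.
analysis_turn = make_user_turn(
    audio_url="https://example.com/sound.mp3",
    text="What's that sound?",
)

print(voice_chat_turn)
print(analysis_turn)
```

Either turn can be appended to a conversation list and passed through the same templating and generation steps, without changing the rest of the pipeline.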

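Next Steps

This time we bring a new audio-language model, Qwen2-Audio, which supports both voice chat and audio analysis and understands more than 8 languages and dialects. In the near future, we plan to train improved Qwen2-Audio models on larger pretraining datasets, enabling the model to support longer audio (over 30 seconds).

We also plan to build larger Qwen2-Audio models to explore the scaling laws of audio-language models.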
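The plan to support audio longer than 30 seconds suggests that the current release is best suited to clips up to roughly that length. This is an inference, not something the post states. One possible workaround, which the post does not recommend, is to split a long recording into consecutive windows and query the model once per window. The helper name and the 30-second default below are assumptions made for this sketch.

```python
def split_into_chunks(samples, sampling_rate, max_seconds=30.0):
    """Split a 1-D sequence of audio samples into consecutive chunks of at most max_seconds."""
    if sampling_rate <= 0 or max_seconds <= 0:
        raise ValueError("sampling_rate and max_seconds must be positive")
    chunk_len = int(sampling_rate * max_seconds)
    if chunk_len == 0:
        raise ValueError("max_seconds is too short for this sampling rate")
    # Slicing works for Python lists and NumPy arrays alike; the last chunk may be shorter.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


if __name__ == "__main__":
    # 75 seconds of silence at 16 kHz, standing in for a real waveform.
    waveform = [0.0] * (16000 * 75)
    print([len(chunk) for chunk in split_into_chunks(waveform, 16000)])
```

Hard cuts like these can split a word or sound event across two windows, so overlapping windows or silence-aware splitting may work better in practice.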

Original Source

Qwen2-Audio: Chat with Your Voice!

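This article was machine-translated and polished with AI, and is for reference only. For the underlying facts, defer to the original text.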
