Alibaba Cloud's Large Audio Model: Speech Recognition with Qwen-Audio

Video introduction to this episode

1 Qwen-Audio: a multi-task audio model

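Qwen-Audio is a multi-purpose large model for audio tasks that Alibaba Cloud open-sourced on November 30, 2023. It is built on OpenAI's Whisper Large-v2 model and the Qwen 7B large language model, and it covers a range of speech processing capabilities. Its main features are: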

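Multilingual speech recognition: recognizes and transcribes speech in multiple languages.
Speech translation: translates spoken content from one language into another.
Audio scene analysis: analyzes environmental information and background sounds in the audio.
Speech-based understanding and reasoning: goes beyond transcription to understand what is said and reason about it.
Speech editing: provides tools for editing and modifying speech recordings.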

Together, these features make Qwen-Audio a capable tool for a wide range of speech tasks.

Demo page: https://qwen-audio.github.io/Qwen-Audio/

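Alibaba Cloud open-sourced two audio models, Qwen-Audio and Qwen-Audio-Chat, aimed at different use cases. Qwen-Audio targets specific speech processing tasks such as speech recognition, while Qwen-Audio-Chat is better suited to multi-turn, speech-based dialogue. For speech recognition, the main points about Qwen-Audio are: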

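Librispeech: Qwen-Audio shows strong speech recognition performance on the Librispeech dataset.
Aishell1 and Aishell2: on these two Mandarin speech datasets, the model achieves state-of-the-art (SOTA) recognition results.
Word-level timestamps: Qwen-Audio can return a timestamp for each word while performing speech recognition.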

2 Environment setup and speech recognition (Windows)

Step 1: Download the code (requires git)

git clone https://github.com/QwenLM/Qwen-Audio.git

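Step 2: Set up the environment (requires Anaconda or another Python environment manager)
From inside the cloned Qwen-Audio directory, install the required libraries: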
```
pip install -r requirements.txt
```
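After the install finishes, it can help to confirm that the packages imported by the Step 3 script are actually present in the active environment. The following is a minimal sketch using only the Python standard library. The helper name `installed_version` is made up for illustration, and the check covers only `torch` and `transformers` because those are the two packages the Step 3 script imports.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# torch and transformers are the two packages imported by the Step 3 script.
for name in ("torch", "transformers"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

If either package is reported as missing, the most likely cause is that `pip` ran in a different environment from the one now active.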

Step 3: Prepare the code
For more code, see the official repository:
https://github.com/QwenLM/Qwen-Audio

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.
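
The second-turn reply above embeds its timestamps as `<|seconds|>` markers inside ordinary text. To use them in a program, they need to be pulled out as numbers. The following is a small sketch of one way to do that with a regular expression. The function name `extract_timestamps` is hypothetical, and the marker format is assumed from the single sample reply shown above rather than from a published specification.

```python
import re

# Matches markers such as <|2.33|> and captures the number of seconds.
TIMESTAMP = re.compile(r"<\|(\d+(?:\.\d+)?)\|>")

def extract_timestamps(response):
    """Return every <|seconds|> marker in the reply as a float, in order."""
    return [float(m) for m in TIMESTAMP.findall(response)]

reply = 'The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.'
start, end = extract_timestamps(reply)
print(start, end)  # 2.33 3.26
```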

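The Step 3 script downloads the model automatically on first run. Default storage location (shown for the Qwen-Audio-Chat model loaded above; UserName is your Windows account name):
C:\Users\UserName\.cache\huggingface\hub\models--Qwen--Qwen-Audio-Chat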
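
If the model does not appear at that location, the cache may have been redirected by an environment variable. The sketch below is a simplified re-implementation of the usual Hugging Face defaults: `HF_HUB_CACHE` first, then `HF_HOME`, then `.cache/huggingface` in the user's home directory. It does not call the `huggingface_hub` library, and it ignores other overrides such as `XDG_CACHE_HOME` or a `cache_dir` argument. The function names `hub_cache_dir` and `model_cache_dir` are made up for illustration.

```python
import os
from pathlib import Path

def hub_cache_dir():
    """Resolve the Hugging Face hub cache directory from the environment."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME")
    base = Path(hf_home) if hf_home else Path.home() / ".cache" / "huggingface"
    return base / "hub"

def model_cache_dir(repo_id):
    # The hub cache stores each model under "models--<org>--<name>".
    return hub_cache_dir() / ("models--" + repo_id.replace("/", "--"))

print(model_cache_dir("Qwen/Qwen-Audio-Chat"))
```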

3 FAQ

Q1: On Windows, using the GPU fails with: AssertionError: Torch not compiled with CUDA enabled

Solution: reinstall PyTorch
```
pip uninstall torch
pip cache purge
pip install torch -f https://download.pytorch.org/whl/torch_stable.html
```
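The first command uninstalls the old PyTorch, the second clears the pip cache, and the third installs the latest version from the PyTorch wheel index, where CUDA-enabled builds are published.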
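
After reinstalling, a quick check can confirm whether the new build actually sees the GPU before loading the full model again. This is a minimal sketch, not part of the Qwen-Audio repository. The function name `cuda_status` is made up for illustration, and the messages are only a rough diagnosis.

```python
import importlib.util

def cuda_status():
    """Describe whether the installed torch build can use a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    if torch.cuda.is_available():
        return f"CUDA is usable: {torch.cuda.get_device_name(0)}"
    if torch.version.cuda is None:
        # CPU-only wheels report no CUDA version at all.
        return "CPU-only torch build: reinstall a CUDA-enabled build"
    return f"torch was built for CUDA {torch.version.cuda}, but no usable GPU or driver was found"

print(cuda_status())
```

A "CPU-only" result after the reinstall suggests that pip picked a CPU wheel again, which is the situation that produces the AssertionError above.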

Related discussion: https://github.com/pytorch/pytorch/issues/30664

Q2: Why is only the first part of a long recording recognized?

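Qwen-Audio's audio encoder is trained from Whisper's encoder, and Whisper accepts at most 30 seconds of audio per input. (I have not yet confirmed that this is the cause.)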
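
If the 30-second limit is indeed the cause, one possible workaround is to cut a long recording into consecutive pieces of at most 30 seconds and recognize each piece separately. The sketch below does this for PCM WAV files using only the standard library. It is not taken from the Qwen-Audio repository, the function name `split_wav` is hypothetical, and other formats such as FLAC would need converting to WAV first. Cutting at fixed positions can also split a word in half, so results near the boundaries may suffer.

```python
import wave

def split_wav(path, out_prefix, chunk_seconds=30):
    """Split a PCM WAV file into consecutive chunks of at most chunk_seconds."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the recording
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width and rate; the frame count in
                # the header is corrected automatically when the file closes.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Each resulting file could then be passed to `model.chat` one at a time, in the same way as the Step 3 example, and the transcripts joined in order.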

Q3: Is there a web UI or an API service?

Web UI for local deployment (the repository includes a script for this): web_demo_audio.py

Online demos (based on Qwen-Audio-Chat):
ModelScope:
https://modelscope.cn/studios/qwen/Qwen-Audio-Chat-Demo/summary
Hugging Face:
https://huggingface.co/spaces/Qwen/Qwen-Audio

Alibaba Cloud also offers an API through DashScope (it may support Qwen-Audio, but I have not confirmed this):
https://dashscope.aliyun.com/

I have not looked into other general-purpose web UIs or APIs for speech.

4 Local fine-tuning

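To improve the model's performance on a specific target dataset (that is, your own local data), you can fine-tune it. Alibaba has not yet released fine-tuning code for Qwen-Audio, but the fine-tuning code for Qwen or Qwen-VL can serve as a starting point for your own adaptation. I plan to produce a course later that covers large-model fine-tuning techniques and methods in detail.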