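How can you easily batch-extract text from Bilibili videos with the OpenAI Whisper model?

As everyone knows, Bilibili is young people's favorite site for learning. But when you watch videos to study on the subway, on the bus, or in spare moments, you usually can't take notes as you go. To solve this, I built a speech-to-text pipeline based on OpenAI's Whisper model. It batch-downloads video collections and converts them to text automatically.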


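1. Why choose the OpenAI Whisper model?

Whisper is a speech recognition model released by OpenAI. Its accuracy and multilingual support have made it a common first choice among developers. Here it automatically transcribes a video's audio track into text, which is especially helpful for anyone who can't take notes while studying.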

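2. Environment setup: preparing the project from scratch

Preparing the folders

Python version: 3.8. If you don't have an environment yet, you can bring one up with a Docker container: docker pull python:3.8

Copy and run the following commands: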

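# Create the folders
mkdir -p /mnt/d/workspace/b-site-video-reader/
mkdir -p /mnt/d/workspace/audios/texts/
cd /mnt/d/workspace/b-site-video-reader/
# Create the virtual environment (requires the virtualenv package)
virtualenv -p /usr/bin/python3 py38-venv
# Activate the virtual environment
source /mnt/d/workspace/b-site-video-reader/py38-venv/bin/activate
# Install ffmpeg, which is used to extract the audio
apt update && apt install -y ffmpeg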

Installing dependencies in the virtual environment

Save the following to requirements.txt, then run pip install -r requirements.txt

aiohappyeyeballs==2.4.0
aiohttp==3.10.6
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.5.0
async-timeout==4.0.3
attrs==24.2.0
beautifulsoup4==4.12.3
biliass==1.3.7
bilibili-api==4.1.0
bilili==1.4.15
certifi==2024.8.30
charset-normalizer==3.3.2
cssutils==2.11.1
dataclasses-json==0.6.7
decorator==4.4.2
distro==1.9.0
exceptiongroup==1.2.2
frozenlist==1.4.1
greenlet==3.1.1
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
idna==3.10
imageio==2.35.1
imageio-ffmpeg==0.5.1
jiter==0.5.0
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.16
langchain-community==0.2.17
langchain-core==0.2.41
langchain-openai==0.1.25
langchain-text-splitters==0.2.4
langsmith==0.1.128
marshmallow==3.22.0
more-itertools==10.5.0
moviepy==1.0.3
multidict==6.1.0
mypy-extensions==1.0.0
numpy==1.24.4
openai==1.47.1
orjson==3.10.7
packaging==24.1
pillow==10.4.0
proglog==0.1.10
protobuf==4.25.5
pydantic==2.9.2
pydantic_core==2.23.4
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
sniffio==1.3.1
soupsieve==2.6
SQLAlchemy==2.0.35
tenacity==8.5.0
tiktoken==0.7.0
tqdm==4.66.5
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.3
websockets==13.1
yarl==1.12.1
zhconv==1.4.3
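
Before starting the service, it may help to confirm that the environment is complete. The sketch below is not part of the original project: the function names and the REQUIRED_MODULES list are illustrative assumptions, and it only checks that a few import names resolve and that an ffmpeg executable is on PATH.

```python
import importlib.util
import shutil

# Assumption: import names (not pip names) of packages that start.py imports.
REQUIRED_MODULES = ["openai", "zhconv"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ffmpeg_available():
    """Return True when an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def report():
    """Print a short summary of what is missing, if anything."""
    print("missing modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("ffmpeg on PATH:", ffmpeg_available())
```

Calling report() inside the activated virtual environment prints both checks. Note that start.py itself uses the hard-coded path /usr/bin/ffmpeg rather than a PATH lookup.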

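3. Batch-downloading Bilibili videos and converting them to text

Starting the service

On the command line, run python3 start.py to start it.

The start.py file below starts an HTTP server on port 8000: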

import http.server
import socketserver
import os
import json
import threading
import time
from openai import OpenAI
import zhconv

PORT = 8000

# Speech to text
def mp3_2_text(audio_file_name, suffix):
    """Transcribe an MP3 file and save the result as a .txt file."""
    client = OpenAI(
        api_key="<your API key>",
        base_url="<your model provider's base URL; if you don't have one, try https://oneapi.xty.app/v1>"
    )

    # Accept either a bare file name or a full path; only the file name is used,
    # so the transcript always lands in the texts directory.
    audio_file_name = os.path.basename(audio_file_name)
    path = f'/mnt/d/workspace/audios/{suffix}'
    output_path = f'/mnt/d/workspace/audios/texts/{suffix}'
    os.makedirs(output_path, exist_ok=True)
    speech_path = os.path.join(path, audio_file_name)
    print(speech_path)
    retry_limit = 3
    retry_count = 0

    while retry_count < retry_limit:
        try:
            # The context manager closes the audio file even if the request fails.
            with open(speech_path, "rb") as audio_file:
                transcript = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=audio_file,
                    response_format="json",
                    # Prompt: "The following are Mandarin sentences; please return Simplified Chinese."
                    prompt="以下是普通话的句子，请返回简体中文字。"
                )
            # Normalize any Traditional Chinese characters to Simplified Chinese.
            simplified_text = zhconv.convert(transcript.text, 'zh-hans')
            print(simplified_text)
            output_file_path = os.path.join(output_path, f"{audio_file_name}.txt")
            with open(output_file_path, 'w', encoding='utf-8') as text_file:
                text_file.write(simplified_text)

            print(f'Transcript saved to: {output_file_path}')
            return audio_file_name
        except Exception as e:
            print(f'Error: {e}')
            retry_count += 1
            print(f'Retry attempt {retry_count}/{retry_limit}')
            time.sleep(2)

    print(f'Failed to transcribe {audio_file_name} after {retry_limit} attempts')
    return audio_file_name

def download_videos(url, suffix):
    """Download the video(s) at url with bilili, extract MP3 audio, and transcribe each file."""
    FFMPEG_PATH = '/usr/bin/ffmpeg'
    path = f'/mnt/d/workspace/audios/m4s/{suffix}/'
    output_path = f'/mnt/d/workspace/audios/{suffix}/'
    output_path_text_result = f'/mnt/d/workspace/audios/texts/{suffix}/'
    # Create each directory independently so an existing one does not block the others.
    for directory in (path, output_path, output_path_text_result):
        os.makedirs(directory, exist_ok=True)
    bilili = '/mnt/d/workspace/b-site-video-reader/py38-venv/bin/bilili'

    # Download and mux the streams: bilili -t dash <video url>
    os.system(f"{bilili} -t dash -d '{path}' -y -w -q 16 '{url}' > /dev/null 2>&1")
    for i in os.listdir(path):
        videos_dir = os.path.join(path, i, 'Videos')
        if not os.path.isdir(videos_dir):
            continue
        for j in os.listdir(videos_dir):
            if not j.endswith('.mp4'):
                continue
            # Replace only the extension, so "mp4" elsewhere in the name is left alone.
            output_file = os.path.splitext(j.replace(' ', '_'))[0] + '.mp3'
            output_file_dest = os.path.join(output_path, output_file)
            text_output_file_path = os.path.join(output_path_text_result, f"{output_file}.txt")
            if os.path.exists(output_file_dest):
                print(f'Skipping MP3 and later steps: {output_file}\tURL: {url}\t'
                      f'text file exists: {os.path.exists(text_output_file_path)} ({text_output_file_path})')
                continue
            os.system(f"{FFMPEG_PATH} -loglevel quiet -i '{os.path.join(videos_dir, j)}' -y -f mp3 '{output_file_dest}'")
            try:
                if os.path.exists(text_output_file_path):
                    print(f'Text file for {output_file} already exists!')
                else:
                    # Pass the bare file name; mp3_2_text derives the directories from suffix.
                    mp3_2_text(output_file, suffix)
            except Exception as e:
                print(f'Error, please check: {output_file}\tURL: {url}\t{e}')
    print(f'{url} processed')
                           

class Handler(http.server.SimpleHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers.get('Content-Length', 0))  # Length of the request body
        post_data = self.rfile.read(content_length)  # Read the request body
        response = f"Received POST request: {post_data.decode()}"
        req = json.loads(post_data.decode())
        # Resolve the folder suffix once so every URL in this request shares it.
        suffix = req.get('suffix', int(time.time()))
        for url in req['urls']:
            # One background thread per URL; all downloads start at once.
            server_thread = threading.Thread(target=download_videos, args=(url, suffix))
            server_thread.daemon = True
            server_thread.start()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(response.encode())

with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print(f"Serving at port {PORT}")
    httpd.serve_forever()
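
start.py launches one thread per URL, so a long list starts every download and transcription at once. One possible way to cap the load is a bounded thread pool. The sketch below is not part of the original start.py: MAX_WORKERS and enqueue_downloads are hypothetical names, and worker stands in for download_videos.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap; tune it to your CPU, disk, and API rate limits.
MAX_WORKERS = 2
executor = ThreadPoolExecutor(max_workers=MAX_WORKERS)


def enqueue_downloads(urls, suffix, worker):
    """Queue worker(url, suffix) for each URL; at most MAX_WORKERS run at the same time."""
    return [executor.submit(worker, url, suffix) for url in urls]
```

Inside do_POST, the thread-creation loop could then become enqueue_downloads(req['urls'], suffix, download_videos). Unlike the daemon threads in start.py, pool workers are joined at interpreter exit, so the process waits for queued jobs before shutting down.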

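Fetching video content

Usage: put the addresses you want to download and process in the urls field; suffix is the folder suffix. All URLs in one request start downloading at the same time, so keep an eye on the load on your machine: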

curl -d '{"suffix":"20240925测试", "urls":["https://www.bilibili.com/video/BV1AE421w7dF/?spm_id_from=333.337.search-card.all.click"]}' http://localhost:8000
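
The same request can be sent from Python with only the standard library. This is a minimal sketch: build_request and submit are illustrative names, and the default endpoint assumes the server from start.py is running locally on port 8000.

```python
import json
import urllib.request


def build_request(urls, suffix, endpoint="http://localhost:8000"):
    """Build the JSON POST request that the handler in start.py expects."""
    body = json.dumps({"suffix": suffix, "urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )


def submit(urls, suffix, endpoint="http://localhost:8000"):
    """Send the request and return the server's text reply."""
    with urllib.request.urlopen(build_request(urls, suffix, endpoint)) as resp:
        return resp.read().decode("utf-8")
```

For example, submit(["https://www.bilibili.com/video/BV1AE421w7dF/"], "20240925测试") queues the same video as the curl command above. The server replies as soon as the jobs are started, not when they finish.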

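Getting the results

The output text files are written under /mnt/d/workspace/audios/texts/20240925测试/. Give it a try: follow the steps in this article to set up the environment, and see how much easier it makes studying!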
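
To load the finished transcripts back into Python, a small helper along these lines may be enough. It is a sketch, not part of the original project: read_transcripts is a hypothetical name, and the default root simply mirrors the output directory used in start.py.

```python
from pathlib import Path


def read_transcripts(suffix, root="/mnt/d/workspace/audios/texts"):
    """Return {file name: text} for every .txt transcript under root/suffix."""
    folder = Path(root) / str(suffix)
    return {
        p.name: p.read_text(encoding="utf-8")
        for p in sorted(folder.glob("*.txt"))
    }
```

Calling read_transcripts("20240925测试") returns a dictionary keyed by file name, such as "<video title>.mp3.txt", which is how start.py names its output files.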