萌宠菠菠乐园
2024-10-06 05:14

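Personalized voice customization by fine-tuning sambert-hifigan

Basic concepts

TTS (text-to-speech): converts text to speech. Input: text. Output: audio.

Voice Cloning: imitates the voice of a specific person. Input: text plus audio of the target speaker. Output: audio of that text spoken in the target voice.

Voice Conversion: alters an existing speech recording so that it sounds like another person, or is spoken with a different emotion or style. Input: audio 1 plus audio 2. Output: audio whose content matches audio 1 and whose timbre resembles audio 2 (for example, a voice changer).

Overall sambert-hifigan framework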

[Figure: overall sambert-hifigan framework]

Source: https://www.modelscope.cn/models/damo/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/summary

Input: text

Output: audio

Experiments

Dataset

The table below lists the duration of each dataset used.

Name    Dataset_time
A       < 5 min
B       1 hour
C       1 hour
D       1 hour
E       3-5 hours
F       3-5 hours

Slicing long audio

Long recordings are sliced mainly with whisper. The whisper model size can be changed (for example to large or small) depending on the machine's configuration.

    from pathlib import Path

    import librosa
    import numpy as np
    import whisper
    from scipy.io import wavfile


    def split_long_audio(model, filepaths, character_name, save_dir="data_dir", out_sr=44100):
        """Slice long recordings into per-segment WAV files using whisper timestamps."""
        if isinstance(filepaths, str):
            filepaths = [filepaths]

        save_path = Path(save_dir) / character_name
        save_path.mkdir(exist_ok=True, parents=True)

        for file_idx, filepath in enumerate(filepaths):
            print(f"Transcribing file {file_idx}: '{filepath}' to segments...")
            result = model.transcribe(filepath, word_timestamps=True, task="transcribe",
                                      beam_size=5, best_of=5)
            segments = result["segments"]

            wav, sr = librosa.load(filepath, sr=None, mono=True)
            # Trimming drops leading silence, so whisper timestamps (relative to the
            # untrimmed file) are shifted by the trimmed offset before slicing.
            wav, trim_idx = librosa.effects.trim(wav, top_db=20)
            offset = trim_idx[0] / sr

            # Resample, then peak-normalize to [-1, 1] (skipped for silent audio).
            wav2 = librosa.resample(wav, orig_sr=sr, target_sr=out_sr)
            peak = np.abs(wav2).max()
            if peak > 0:
                wav2 = wav2 / peak

            for i, seg in enumerate(segments):
                start_time = max(seg["start"] - offset, 0.0)
                end_time = max(seg["end"] - offset, 0.0)
                wav_seg = wav2[int(start_time * out_sr):int(end_time * out_sr)]
                out_fpath = save_path / f"{character_name}_{file_idx}_{i}.wav"
                # Write 16-bit PCM.
                wavfile.write(out_fpath, rate=out_sr,
                              data=(wav_seg * np.iinfo(np.int16).max).astype(np.int16))


    whisper_size = "medium"
    whisper_model = whisper.load_model(whisper_size)
    # Replace "filename.wav" with the name of your uploaded WAV file.
    split_long_audio(whisper_model, "filename.wav", "test", "dataset_raw")
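Whisper segments can be very short (a single interjection) or quite long, and such clips may be less useful as training material. As a minimal sketch, the segment list returned by `model.transcribe` could be filtered by duration before the clips are written. The `filter_segments` helper and the 1 to 15 second bounds are assumptions for illustration, not values taken from the experiments in this post.

```python
def filter_segments(segments, min_sec=1.0, max_sec=15.0):
    """Keep only segments whose duration lies within [min_sec, max_sec].

    `segments` is a list of dicts with 'start' and 'end' keys in seconds,
    the same shape as result['segments'] from whisper's transcribe().
    The default bounds are illustrative assumptions.
    """
    kept = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if min_sec <= duration <= max_sec:
            kept.append(seg)
    return kept


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 0.4},   # shorter than min_sec, dropped
        {"start": 0.4, "end": 6.2},   # within bounds, kept
        {"start": 6.2, "end": 30.0},  # longer than max_sec, dropped
    ]
    print(filter_segments(demo))
```

In the slicing function this would amount to looping over `filter_segments(segments)` instead of `segments`.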

Audio labeling

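    # Import path as used in ModelScope's voice customization example.
    from modelscope.tools import run_auto_label

    input_wav = "./test_wavs/"
    output_data = "./output_training_data/"
    ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.5")

Acoustic model configuration changes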

    train_max_steps: 1002000
    # speaker_list must contain the name of the speaker being trained;
    # in this example the voice of speaker A is used for training.
    linguistic_unit: {cleaners: english_cleaners,
      lfeat_type_list: 'sy,tone,syllable_flag,word_segment,emo_category,speaker_category',
      speaker_list: 'A,F74,FBYN,FRXL,M7,xiaoyu'}

Vocoder configuration changes

    save_interval_steps: 2000
    train_max_steps: 2500000

Fine-tuning
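The `train_max_steps` values in the two configs appear to be absolute step counts that include the steps already in the pretrained checkpoint, because the commands in this section resume from `checkpoint_980000.pth` (sambert) and `checkpoint_2000000.pth` (hifigan). That reading is an assumption, not something the configs state. Under it, the number of additional fine-tuning steps can be worked out as in this sketch; `extra_steps` is a hypothetical helper, and the file-name pattern is inferred from the checkpoint names used in this post.

```python
import re


def extra_steps(train_max_steps, resume_checkpoint):
    """Return how many steps remain when resuming from a checkpoint.

    Assumes the checkpoint file name ends with its step count,
    e.g. 'checkpoint_980000.pth' (an assumption based on the names above).
    """
    match = re.search(r"checkpoint[_-](\d+)\.pth$", resume_checkpoint)
    if match is None:
        raise ValueError(f"cannot read step count from {resume_checkpoint!r}")
    return train_max_steps - int(match.group(1))


if __name__ == "__main__":
    print(extra_steps(1002000, "sambert/ckpt/checkpoint_980000.pth"))   # 22000
    print(extra_steps(2500000, "hifigan/ckpt/checkpoint_2000000.pth"))  # 500000
```

With the values in this post, that gives a budget of 22,000 further steps for sambert and up to 500,000 for hifigan.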

    # Feature extraction
    python kantts/preprocess/data_process.py \
        --voice_input_dir /data/software/anchor_voice/zhubo_output_training_data \
        --voice_output_dir training_stage/zhubo_feats \
        --audio_config kantts/configs/audio_config_16k.yaml \
        --speaker A

    # Train the acoustic model (multisp)
    CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_sambert.py \
        --model_config speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/sambert/config.yaml \
        --resume_path speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/sambert/ckpt/checkpoint_980000.pth \
        --root_dir training_stage/zhubo_feats \
        --stage_dir training_stage/zhubo_sambert_ckpt

    # Train the vocoder
    CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py \
        --model_config speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/config.yaml \
        --resume_path speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2000000.pth \
        --root_dir training_stage/zhubo_feats \
        --stage_dir training_stage/zhubo_hifigan_ckpt

    # Inference
    CUDA_VISIBLE_DEVICES=0 python kantts/bin/text_to_wav.py \
        --txt test.txt \
        --output_dir res/zhubo_syn \
        --res_zip speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/resource.zip \
        --am_ckpt training_stage/zhubo_sambert_ckpt/ckpt/checkpoint_1000000.pth \
        --voc_ckpt training_stage/zhubo_hifigan_ckpt/ckpt/checkpoint-3473.pth \
        --speaker A

Fine-tuning lessons

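A high-quality dataset is far better than a low-quality one.

The dataset should preferably be longer than 30 min.

A low-quality dataset brings little improvement to training, even if it is long.

hifigan does not have to be fine-tuned. If it is, many iterations are unnecessary, because its loss decreases very slowly.

When the output has tremolo or electrical noise and the loss is still high, training for more epochs can improve the generated audio; alternatively, train the vocoder.

The larger the dataset, the longer each epoch takes.

Three ways to improve results: 1. build a high-quality dataset; 2. train the acoustic model for more epochs; 3. fine-tune the vocoder.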
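The 30-minute guideline can be checked before training. The sketch below sums the duration of the WAV clips in a folder using only Python's standard `wave` module, which reads uncompressed PCM WAV files (the format written by the slicing script above). The helper names and the idea of running this check are assumptions for illustration, not part of the original workflow.

```python
import wave
from pathlib import Path


def total_duration_seconds(wav_dir):
    """Sum the duration of every .wav file directly inside wav_dir."""
    total = 0.0
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            # Duration = number of frames / frames per second.
            total += wf.getnframes() / wf.getframerate()
    return total


def meets_minimum(wav_dir, min_minutes=30.0):
    """Return True if wav_dir holds at least min_minutes of audio."""
    return total_duration_seconds(wav_dir) >= min_minutes * 60
```

For example, `meets_minimum("dataset_raw/test")` would report whether the sliced clips for the `test` speaker reach 30 minutes in total.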

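sambert fine-tuning loss statistics

Several models were fine-tuned on a local machine with an RTX 4090. Time_consume is the number of epochs actually run and the time spent (training was stopped once further epochs no longer improved the results).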
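The Time_consume entries in the statistics that follow have the form "377 epoch/5 min". To compare runs, they can be converted to time per epoch; the sketch below does this with a hypothetical `minutes_per_epoch` helper, assuming every entry follows that pattern.

```python
import re


def minutes_per_epoch(time_consume):
    """Parse a string such as '377 epoch/5 min' and return minutes per epoch."""
    match = re.fullmatch(r"\s*(\d+)\s*epoch\s*/\s*(\d+)\s*min\s*", time_consume)
    if match is None:
        raise ValueError(f"unexpected format: {time_consume!r}")
    epochs, minutes = int(match.group(1)), int(match.group(2))
    return minutes / epochs


if __name__ == "__main__":
    # Entries for sambert A (< 5 min of data) and sambert F (3-5 hours of data).
    for name, entry in [("A", "377 epoch/5 min"), ("F", "500epoch/163min")]:
        print(name, round(minutes_per_epoch(entry) * 60, 1), "s/epoch")
```

For sambert this works out to roughly 0.8 s per epoch for dataset A and about 19.6 s per epoch for dataset F, consistent with the observation that larger datasets take longer per epoch.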

model    Name  Time_consume       Loss    mel_loss  dur_loss  pitch_loss  energy_loss  x_band_width  h_band_width  Batch_size
sambert  A     377 epoch/5 min    0.5664  0.1509    0.0451    0.1040      0.0935       17.9310       17.9310       3.6250
sambert  B     200 epoch/12 min   0.5860  0.1511    0.0609    0.0955      0.1053       20.1870       20.1870       4
sambert  C     168 epoch/17 min   0.6747  0.1587    0.0822    0.1088      0.1401       21.9500       21.9500       3.993
sambert  D     252 epoch/19 min   0.7172  0.1661    0.0835    0.1101      0.1584       20.807        20.807        3.964
sambert  E     75 epoch/17 min    0.8920  0.1786    0.1269    0.1416      0.2284       15.6060       15.6060       3.9960
sambert  F     500 epoch/163 min  0.6349  0.1436    0.0805    0.1040      0.1331       21.1130       21.1130       3.9940

hifigan fine-tuning statistics

Model    Name  Time_consume       mel_loss  adversarial_loss  feature_matching_loss  generator_loss  real_loss  fake_loss  discriminator_loss
Hifigan  A     100 epoch/9 min    0.2129    6.1258            7.0702                 27.5839         1.6158     1.5822     3.1981
Hifigan  B     137 epoch/10 min   0.2125    6.2978            7.3480                 28.3807         1.5831     1.5771     3.1602
Hifigan  C     207 epoch/23 min   0.1943    5.9018            6.3661                 25.1379         1.6605     1.6345     3.2950
Hifigan  D     190 epoch/15 min   0.2117    6.4751            7.5379                 28.8336         1.5494     1.5520     3.1014
Hifigan  E     53 epoch/18 min    0.2050    5.5553            7.1131                 26.9672         1.7102     1.6516     3.3617
Hifigan  F     292 epoch/103 min  0.1961    6.9237            8.8899                 31.0619         1.4539     1.5054     2.9716

Optimization directions

Dataset annotation and high-quality datasets

Multi-speaker joint inference on a single piece of text