Walking through the TTS pipeline with a FastSpeech2 Mandarin synthesis project, part 3: speech synthesis (synthesize.py)

Date: 2019-06-13 22:09:29

1. Reference GitHub repository:

GitHub - roedoejet/FastSpeech2: An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

2. Python command used for speech synthesis:

python3 synthesize.py --text "你好" --restore_step 400000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml

Notes:

--restore_step must match the checkpoint step of the trained model you are actually loading.

--text accepts only Chinese characters; pinyin input is not supported.

3. Walkthrough of the synthesis code

3.1 Overall structure of the code:

There are 4 regular functions:

def read_lexicon(lex_path):

def preprocess_english(text, preprocess_config):

def preprocess_mandarin(text, preprocess_config):

def synthesize(model, step, configs, vocoder, batchs, control_values):

and one main block:

if __name__ == "__main__":

3.2 Breaking the code down piece by piece:

3.2.1 The main block

Define the command-line arguments that control synthesis:

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--restore_step", type=int, required=True)
        parser.add_argument(
            "--mode",
            type=str,
            choices=["batch", "single"],
            required=True,
            help="Synthesize a whole dataset or a single sentence",
        )
        parser.add_argument(
            "--source",
            type=str,
            default=None,
            help="path to a source file with format like train.txt and val.txt, for batch mode only",
        )
        parser.add_argument(
            "--text",
            type=str,
            default=None,
            help="raw text to synthesize, for single-sentence mode only",
        )
        parser.add_argument(
            "--speaker_id",
            type=int,
            default=0,
            help="speaker ID for multi-speaker synthesis, for single-sentence mode only",
        )
        parser.add_argument(
            "-p",
            "--preprocess_config",
            type=str,
            required=True,
            help="path to preprocess.yaml",
        )
        parser.add_argument(
            "-m", "--model_config", type=str, required=True, help="path to model.yaml"
        )
        parser.add_argument(
            "-t", "--train_config", type=str, required=True, help="path to train.yaml"
        )
        parser.add_argument(
            "--pitch_control",
            type=float,
            default=1.0,
            help="control the pitch of the whole utterance, larger value for higher pitch",
        )
        parser.add_argument(
            "--energy_control",
            type=float,
            default=1.0,
            help="control the energy of the whole utterance, larger value for larger volume",
        )
        parser.add_argument(
            "--duration_control",
            type=float,
            default=1.0,
            help="control the speed of the whole utterance, larger value for slower speaking rate",
        )
        args = parser.parse_args()

Check the inputs for each mode: batch mode requires --source and no --text, single mode requires --text and no --source.

        # Check source texts
        if args.mode == "batch":
            assert args.source is not None and args.text is None
        if args.mode == "single":
            assert args.source is None and args.text is not None
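The assert statements above stop the script with a bare AssertionError and are skipped entirely when Python runs with -O. As a minimal sketch of an alternative (not what the repository does), the same rule can be reported through parser.error, which prints a usage message and exits. The parse_args helper below is a hypothetical name, and only three of the script's options are reproduced.

```python
import argparse


def parse_args(argv):
    """Parse a reduced set of the script's options and validate the mode."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", type=str, choices=["batch", "single"], required=True)
    parser.add_argument("--source", type=str, default=None)
    parser.add_argument("--text", type=str, default=None)
    args = parser.parse_args(argv)

    # Same rule as the assert statements, but reported as a usage error.
    if args.mode == "batch" and (args.source is None or args.text is not None):
        parser.error("batch mode requires --source and does not accept --text")
    if args.mode == "single" and (args.text is None or args.source is not None):
        parser.error("single mode requires --text and does not accept --source")
    return args


args = parse_args(["--mode", "single", "--text", "你好"])
print(args.mode, args.text)
```

Calling parse_args(["--mode", "batch"]) would exit with a usage error instead, because --source is missing.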

Read the three config files:

        # Read Config
        preprocess_config = yaml.load(
            open(args.preprocess_config, "r"), Loader=yaml.FullLoader
        )
        model_config = yaml.load(open(args.model_config, "r"), Loader=yaml.FullLoader)
        train_config = yaml.load(open(args.train_config, "r"), Loader=yaml.FullLoader)
        configs = (preprocess_config, model_config, train_config)

Build the model and the vocoder with get_model and get_vocoder from model.py in the utils folder:

        # Get model
        model = get_model(args, configs, device, train=False)

        # Load vocoder
        vocoder = get_vocoder(model_config, device)

Because preprocess_config["preprocessing"]["text"]["language"] was set to "zh", the script calls preprocess_mandarin to preprocess the text.

Note: for English or another language, both preprocess_config["preprocessing"]["text"]["language"] and the preprocess function used in synthesize.py have to be adjusted accordingly.

        # Preprocess texts
        if args.mode == "batch":
            # Get dataset
            dataset = TextDataset(args.source, preprocess_config)
            batchs = DataLoader(
                dataset,
                batch_size=8,
                collate_fn=dataset.collate_fn,
            )
        if args.mode == "single":
            ids = raw_texts = [args.text[:100]]
            speakers = np.array([args.speaker_id])
            if preprocess_config["preprocessing"]["text"]["language"] == "en":
                texts = np.array([preprocess_english(args.text, preprocess_config)])
            elif preprocess_config["preprocessing"]["text"]["language"] == "zh":
                texts = np.array([preprocess_mandarin(args.text, preprocess_config)])
            text_lens = np.array([len(texts[0])])
            batchs = [(ids, raw_texts, speakers, texts, text_lens, max(text_lens))]

        control_values = args.pitch_control, args.energy_control, args.duration_control
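In the single-sentence branch, a language value other than "en" or "zh" leaves texts unassigned, so the following line fails with a NameError. One way to make the dispatch explicit is a lookup table that raises a clear error. The sketch below is only an illustration: PREPROCESSORS and preprocess_text are hypothetical names, and the two preprocess functions are placeholders that return tagged strings rather than phoneme sequences.

```python
def preprocess_english(text, preprocess_config):
    # Placeholder standing in for the real function in synthesize.py.
    return "en:" + text


def preprocess_mandarin(text, preprocess_config):
    # Placeholder standing in for the real function in synthesize.py.
    return "zh:" + text


# Hypothetical lookup table: language code -> preprocess function.
PREPROCESSORS = {"en": preprocess_english, "zh": preprocess_mandarin}


def preprocess_text(text, preprocess_config):
    """Pick the preprocess function from the configured language."""
    language = preprocess_config["preprocessing"]["text"]["language"]
    try:
        preprocess = PREPROCESSORS[language]
    except KeyError:
        raise ValueError("unsupported language: {!r}".format(language)) from None
    return preprocess(text, preprocess_config)


config = {"preprocessing": {"text": {"language": "zh"}}}
print(preprocess_text("你好", config))
```

Supporting another language then means adding one entry to the table next to its preprocess function.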

Call the synthesize function to run the final speech synthesis:

synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)

3.2.2 The preprocess_mandarin function

Call read_lexicon to load the lexicon (in my setup, "./lexicon/pinyin-lexicon-r.txt"):

    def preprocess_mandarin(text, preprocess_config):
        lexicon = read_lexicon(preprocess_config["path"]["lexicon_path"])

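Use the Python pinyin library pypinyin to convert the text to pinyin syllables, then look each syllable up in the lexicon and collect its phonemes in the phones list. A syllable that is not in the lexicon is replaced by "sp".

Note: style=Style.TONE3 is tone style 3, where the tone is written as a digit [1-4] after each syllable, e.g. 中国 -> ``zhong1 guo2``. With neutral_tone_with_five=True, the neutral tone is written as 5.

        phones = []
        pinyins = [
            p[0]
            for p in pinyin(
                text, style=Style.TONE3, strict=False, neutral_tone_with_five=True
            )
        ]
        for p in pinyins:
            if p in lexicon:
                phones += lexicon[p]
            else:
                phones.append("sp")

        phones = "{" + " ".join(phones) + "}"
        print("Raw Text Sequence: {}".format(text))
        print("Phoneme Sequence: {}".format(phones))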
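The lookup loop can be tried in isolation. The sketch below repeats it on a hand-written syllable list instead of calling pypinyin, and on a two-entry toy lexicon; both are assumptions made for illustration, and the real entries come from the lexicon file. The helper name pinyins_to_phones is hypothetical.

```python
def pinyins_to_phones(pinyins, lexicon):
    """Map pinyin syllables to a brace-wrapped phoneme string."""
    phones = []
    for p in pinyins:
        if p in lexicon:
            phones += lexicon[p]   # known syllable: append its phonemes
        else:
            phones.append("sp")    # unknown token: fall back to "sp"
    return "{" + " ".join(phones) + "}"


# Toy lexicon (assumed entries, not read from the lexicon file).
toy_lexicon = {"ni3": ["n", "i3"], "hao3": ["h", "ao3"]}

print(pinyins_to_phones(["ni3", "hao3"], toy_lexicon))
# → {n i3 h ao3}
print(pinyins_to_phones(["ni3", "，", "hao3"], toy_lexicon))
# → {n i3 sp h ao3}
```

The second call shows the fallback: any token that is missing from the lexicon, here a comma, ends up as "sp".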

Call text_to_sequence from __init__.py in the text folder to turn the phoneme string into a sequence of symbol IDs, and return that sequence:

        sequence = np.array(
            text_to_sequence(
                phones, preprocess_config["preprocessing"]["text"]["text_cleaners"]
            )
        )

        return np.array(sequence)

3.2.3 The preprocess_english function

It follows the same pattern as preprocess_mandarin, so it is not explained in detail here.

    def preprocess_english(text, preprocess_config):
        text = text.rstrip(punctuation)
        lexicon = read_lexicon(preprocess_config["path"]["lexicon_path"])

        g2p = G2p()
        phones = []
        words = re.split(r"([,;.\-\?\!\s+])", text)
        for w in words:
            if w.lower() in lexicon:
                phones += lexicon[w.lower()]
            else:
                phones += list(filter(lambda p: p != " ", g2p(w)))
        phones = "{" + "}{".join(phones) + "}"
        phones = re.sub(r"\{[^\w\s]?\}", "{sp}", phones)
        phones = phones.replace("}{", " ")

        print("Raw Text Sequence: {}".format(text))
        print("Phoneme Sequence: {}".format(phones))
        sequence = np.array(
            text_to_sequence(
                phones, preprocess_config["preprocessing"]["text"]["text_cleaners"]
            )
        )

        return np.array(sequence)
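Two regular-expression steps in preprocess_english are easy to misread: the split that keeps punctuation as separate tokens, and the substitution that turns punctuation tokens into "sp". The sketch below runs both on small inputs. The phoneme list is hand-written for illustration, not produced by the lexicon or by G2p, and tokens_to_phone_string is a hypothetical helper name.

```python
import re


def tokens_to_phone_string(phones):
    """Apply the wrap, substitute, and join steps from preprocess_english."""
    wrapped = "{" + "}{".join(phones) + "}"
    # A token that is empty or a single punctuation mark becomes {sp}.
    wrapped = re.sub(r"\{[^\w\s]?\}", "{sp}", wrapped)
    return wrapped.replace("}{", " ")


# The capturing group keeps the separators in the result list.
words = re.split(r"([,;.\-\?\!\s+])", "Hello, world")
print(words)
# → ['Hello', ',', '', ' ', 'world']

# Hand-written phonemes with a comma token in the middle.
phones = ["HH", "AH0", "L", "OW1", ",", "W", "ER1", "L", "D"]
print(tokens_to_phone_string(phones))
# → {HH AH0 L OW1 sp W ER1 L D}
```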

3.2.4 The read_lexicon function

Read the lexicon from lexicon_path. If a word appears on more than one line, only its first pronunciation is kept.

Note:

The lexicon must follow this format, one entry per line:

    WORDA PHONEA PHONEB
    WORDA PHONEC
    WORDB PHONEB PHONEC

    def read_lexicon(lex_path):
        lexicon = {}
        with open(lex_path) as f:
            for line in f:
                temp = re.split(r"\s+", line.strip("\n"))
                word = temp[0]
                phones = temp[1:]
                if word.lower() not in lexicon:
                    lexicon[word.lower()] = phones
        return lexicon
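To see what read_lexicon returns, the sketch below writes the three sample lines to a temporary file and parses them. The function body is the one from synthesize.py except for an explicit encoding="utf-8" argument, which is an addition made here so the example behaves the same on every platform. The file name toy-lexicon.txt is made up.

```python
import os
import re
import tempfile


def read_lexicon(lex_path):
    """Parse 'WORD PHONE PHONE ...' lines into a dict of phoneme lists."""
    lexicon = {}
    with open(lex_path, encoding="utf-8") as f:
        for line in f:
            temp = re.split(r"\s+", line.strip("\n"))
            word = temp[0]
            phones = temp[1:]
            # Keys are lowercased; a repeated word keeps its first entry.
            if word.lower() not in lexicon:
                lexicon[word.lower()] = phones
    return lexicon


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-lexicon.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("WORDA PHONEA PHONEB\nWORDA PHONEC\nWORDB PHONEB PHONEC\n")
    lexicon = read_lexicon(path)

print(lexicon)
# → {'worda': ['PHONEA', 'PHONEB'], 'wordb': ['PHONEB', 'PHONEC']}
```

The second WORDA line is ignored, and the keys are lowercase, which is why preprocess_english looks words up with w.lower().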

3.2.5 The synthesize function

3.2.5.1 Inputs of the synthesize function

The inputs are prepared in the main block; see the explanation of the main block in 3.2.1 above.

    if __name__ == "__main__":
        # ...
        synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)

3.2.5.2 Body of the synthesize function

For each batch, synthesize calls to_device from tools.py in the utils folder to move the data to the device, runs the model built in the main block without gradient tracking, and finally calls synth_samples, also from tools.py in the utils folder, to produce the final audio.

    def synthesize(model, step, configs, vocoder, batchs, control_values):
        preprocess_config, model_config, train_config = configs
        pitch_control, energy_control, duration_control = control_values

        for batch in batchs:
            batch = to_device(batch, device)
            with torch.no_grad():
                # Forward
                output = model(
                    *(batch[2:]),
                    p_control=pitch_control,
                    e_control=energy_control,
                    d_control=duration_control
                )
                synth_samples(
                    batch,
                    output,
                    vocoder,
                    model_config,
                    preprocess_config,
                    train_config["path"]["result_path"],
                )
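The call model(*(batch[2:]), ...) relies on the layout of the batch tuple built in the main block: (ids, raw_texts, speakers, texts, text_lens, max_text_len). Slicing from index 2 drops the two bookkeeping fields and passes the rest positionally. The sketch below shows this with a stand-in function. fake_model and its parameter names are assumptions for illustration only, plain lists replace the NumPy arrays, and the symbol IDs are made up.

```python
def fake_model(speakers, texts, text_lens, max_text_len,
               p_control=1.0, e_control=1.0, d_control=1.0):
    """Stand-in for the real model: echo back what it received."""
    return {
        "speakers": speakers,
        "texts": texts,
        "text_lens": text_lens,
        "max_text_len": max_text_len,
        "controls": (p_control, e_control, d_control),
    }


# Same field order as the single-sentence batch in the main block.
ids = raw_texts = ["你好"]
batch = (ids, raw_texts, [0], [[12, 34, 56, 78]], [4], 4)

# batch[2:] skips ids and raw_texts; the rest are positional arguments.
output = fake_model(*(batch[2:]), p_control=1.0, e_control=1.0, d_control=1.2)
print(output["speakers"], output["max_text_len"], output["controls"])
# → [0] 4 (1.0, 1.0, 1.2)
```

The full batch, including ids and raw_texts, is still handed to synth_samples, which receives batch as its first argument.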

4. Output of the synthesis code

The audio and the spectrogram of the synthesized audio are written to the configured result_path (here, ./output/result/AISHELL3).
