Research Objective
To present a voice conversion (VC) method that uses recently proposed probabilistic models, recurrent temporal restricted Boltzmann machines (RTRBMs), to capture high-order temporal dependencies in an acoustic sequence, and a neural network (NN) to convert the emphasized features of the source speaker into those of the target speaker.
Research Findings
The proposed voice conversion method, which combines speaker-dependent RTRBMs with an NN, outperforms conventional methods, especially in terms of MCD, regardless of speaker gender. The method effectively captures and converts abstractions of each speaker's unique characteristics, demonstrating high performance and stability.
Research Limitations
The method may face over-smoothing or over-fitting problems, especially when the amount of training data is insufficient for the number of model parameters.
1. Experimental Design and Method Selection:
The methodology involves using RTRBMs for each speaker to capture high-order temporal dependencies and a neural network for feature conversion.
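The core of the approach is the RTRBM's recurrent hidden layer: each hidden state depends on both the current acoustic frame and the previous hidden state, which is what lets the model capture temporal dependencies. The following is a minimal numpy sketch of the RTRBM mean-field hidden recurrence only, not the paper's implementation; the function name and weight shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtrbm_hidden_sequence(V, W, U, b_h):
    """Mean-field hidden activations of an RTRBM over a feature sequence.

    V: (T, n_vis) acoustic frames; W: (n_hid, n_vis) visible-to-hidden
    weights; U: (n_hid, n_hid) recurrent hidden-to-hidden weights;
    b_h: (n_hid,) hidden bias. Each hidden state is conditioned on the
    current frame and the previous hidden state.
    """
    T = V.shape[0]
    n_hid = b_h.shape[0]
    H = np.zeros((T, n_hid))
    h_prev = np.zeros(n_hid)
    for t in range(T):
        h_prev = sigmoid(W @ V[t] + U @ h_prev + b_h)
        H[t] = h_prev
    return H
```

In the full method these hidden projections, computed by a speaker-dependent RTRBM for each speaker, are the features the conversion NN operates on.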
2. Sample Selection and Data Sources:
Acoustic features from the ATR Japanese speech database were used, with parallel data from source and target speakers processed by dynamic programming.
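The dynamic-programming step pairs each source-speaker frame with a target-speaker frame so the two utterances of the same sentence line up in time. A minimal dynamic time warping (DTW) sketch in plain numpy, assuming Euclidean frame distance (the paper's exact cost function and implementation may differ):

```python
import numpy as np

def dtw_align(X, Y):
    """Align two parallel utterances X (n, d) and Y (m, d) by dynamic
    programming; returns matched (source_frame, target_frame) index pairs."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack from the end to recover the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The resulting frame pairs form the parallel training data for the conversion network.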
3. List of Experimental Equipment and Materials:
24-dimensional MFCC features calculated from STRAIGHT spectra were used as input vectors.
4. Experimental Procedures and Operational Workflow:
The process includes training RTRBMs for each speaker, training an NN with projected features, and fine-tuning the entire network.
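The middle stage, training an NN on projected features, amounts to learning a regression from source-speaker hidden projections to target-speaker hidden projections on aligned frames. A simplified plain-numpy stand-in with one tanh hidden layer and gradient descent; the architecture, sizes, and hyperparameters here are illustrative assumptions, not the paper's:

```python
import numpy as np

def train_conversion_nn(Hs, Ht, n_hidden=16, lr=0.05, epochs=500, seed=0):
    """Fit a one-hidden-layer NN mapping source projections Hs (T, d_s)
    to aligned target projections Ht (T, d_t) by full-batch gradient
    descent on squared error. Returns the learned weights."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (Hs.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, Ht.shape[1]))
    b2 = np.zeros(Ht.shape[1])
    for _ in range(epochs):
        Z = np.tanh(Hs @ W1 + b1)      # hidden activations
        P = Z @ W2 + b2                # predicted target projections
        E = P - Ht                     # prediction error
        gW2 = Z.T @ E / len(Hs)
        gb2 = E.mean(axis=0)
        dZ = (E @ W2.T) * (1.0 - Z**2)  # backprop through tanh
        gW1 = Hs.T @ dZ / len(Hs)
        gb1 = dZ.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2
```

In the full workflow this conversion network sits between the two RTRBM projections, and the stacked system is then fine-tuned end to end.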
5. Data Analysis Methods:
Performance was evaluated using MCD (mel-cepstral distortion) for objective criteria and MOS (mean opinion score) for subjective criteria.
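The objective criterion, mel-cepstral distortion, measures the average spectral distance between converted and target cepstra over aligned frames. A minimal sketch of the standard MCD computation (the 0th, energy, coefficient is conventionally excluded); the function name is an assumption:

```python
import numpy as np

def mel_cepstral_distortion(C_ref, C_conv):
    """Frame-averaged mel-cepstral distortion in dB between two aligned
    cepstral sequences of shape (frames, dims), excluding the 0th
    (energy) coefficient: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)."""
    diff = C_ref[:, 1:] - C_conv[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))
```

Lower MCD indicates converted speech closer to the target speaker, which is the axis on which the proposed method outperforms the baselines.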