I have recently been researching TTS (Text-To-Speech) and came across three near state-of-the-art, out-of-the-box solutions: LightSpeech (from Microsoft), FastSpeech2 (partly from Microsoft), and NeMo (from NVIDIA).

The test text is the following paragraph:

The Home Depot, Inc. is the world’s largest home improvement retailer based on net sales for fiscal 2021. We offer our customers a wide assortment of building materials, home improvement products, lawn and garden products, décor products, and facilities maintenance, repair and operations products and provide a number of services, including home improvement installation services and tool and equipment rental. As of the end of fiscal 2021, we operated 2,317 stores located throughout the U.S. (including the Commonwealth of Puerto Rico and the territories of the U.S. Virgin Islands and Guam), Canada, and Mexico. The Home Depot stores average approximately 104,000 square feet of enclosed space, with approximately 24,000 additional square feet of outside garden area. We also maintain a network of distribution and fulfillment centers, as well as a number of e-commerce websites in the U.S., Canada and Mexico. When we refer to “The Home Depot,” the “Company,” “we,” “us” or “our” in this report, we are referring to The Home Depot, Inc. and its consolidated subsidiaries.

The output of FastSpeech2:

It has a lot of noise and sounds metallic.

The output of LightSpeech:

It sounds a little better, more like a human than a robot.

The output of NeMo:

This is the best result of the three solutions.

This test is just a summary of my research and doesn’t show that one algorithm is better than the others, since the training process heavily affects the final result. But at least NeMo comes closest to being usable in a production scenario.