I simply used the Python API of Text-to-Speech AI to convert a PDF file into mp3 audio, as in this example:

from google.cloud import texttospeech
from PyPDF2 import PdfReader

client = texttospeech.TextToSpeechClient()

reader = PdfReader("xxx.pdf")

voice = texttospeech.VoiceSelectionParams(
    language_code="cmn-CN", name="cmn-CN-Wavenet-B", ssml_gender=texttospeech.SsmlVoiceGender.MALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=0.8,
)

text = ""
# try the first 10 pages
for page in reader.pages[:10]:
    text += page.extract_text()

print(len(text))
synthesis_input = texttospeech.SynthesisInput(text=text)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
    print("Written")

Very simple, right? But it reported an error:

google.api_core.exceptions.InvalidArgument: 400 Either `input.text` or `input.ssml` is longer than the limit of 5000 bytes. This limit is different from quotas. To fix, reduce the byte length of the characters in this request, or consider using the Long Audio API: https://cloud.google.com/text-to-speech/docs/create-audio-text-long-audio-synthesis.
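Note that the limit is stated in bytes, while `print(len(text))` in the script counts characters. Assuming the API measures the UTF-8 encoding of the input (the error message does not say which encoding it uses), a common Chinese character takes 3 bytes, so Chinese text hits the limit at roughly a third of the character count you might expect. A quick check, with a made-up sample string:

```python
# Hypothetical sample text, only to show the character/byte difference.
sample = "你好，世界"

char_count = len(sample)                  # number of characters
byte_count = len(sample.encode("utf-8"))  # number of UTF-8 bytes

print(char_count)  # 5
print(byte_count)  # 15
```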

It seems the request is too long. Let’s use the “Long Audio API”:

from google.cloud import texttospeech
from PyPDF2 import PdfReader

client = texttospeech.TextToSpeechLongAudioSynthesizeClient()

reader = PdfReader("xxx.pdf")

voice = texttospeech.VoiceSelectionParams(
    language_code="cmn-CN", name="cmn-CN-Wavenet-B", ssml_gender=texttospeech.SsmlVoiceGender.MALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    speaking_rate=0.8,
)

text = ""
for page in reader.pages[:10]:
    text += page.extract_text()

print(len(text))
synthesis_input = texttospeech.SynthesisInput(text=text)
request = texttospeech.SynthesizeLongAudioRequest(
    parent="projects/robin-00000/locations/us",
    input=synthesis_input, voice=voice, audio_config=audio_config,
    # LINEAR16 output is WAV data, so the object is named .wav rather than .mp3
    output_gcs_uri="gs://robin_tts/xxx.wav"
)

operation = client.synthesize_long_audio(request=request)
result = operation.result(timeout=300)
print(result)

It still didn’t work:

google.api_core.exceptions.InvalidArgument: 400 The long audio API does not support the language zh. Supported languages: en, es.

Okay, so it doesn’t support Chinese. Then what should I do if I want to convert a Chinese PDF to mp3? Convert it page by page into 500 mp3 files? That is terrible. And even the short mp3 it did generate clearly sounds like a machine, not a human.
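For what it is worth, the splitting does not have to follow page boundaries. A minimal sketch of one possible workaround: cut the text into pieces of at most 5000 UTF-8 bytes, preferably at sentence ends, call the ordinary `synthesize_speech` once per piece, and append every result to a single file. The names `split_by_bytes` and `synthesize_in_chunks` are made up for this illustration. It also assumes that the limit is measured in UTF-8 bytes and that players cope with MP3 segments simply written one after another, neither of which was verified here.

```python
def split_by_bytes(text, limit=5000, breakers="。！？.!?\n"):
    """Split text into pieces whose UTF-8 encoding is at most `limit` bytes.

    Cuts right after the last sentence-ending character seen in the
    current piece when there is one, otherwise cuts at the limit.
    """
    if limit < 4:
        raise ValueError("limit must fit at least one UTF-8 character (4 bytes)")
    chunks = []
    current = ""     # piece being built
    size = 0         # UTF-8 byte length of `current`
    last_break = -1  # index in `current` just after the last sentence end
    for ch in text:
        n = len(ch.encode("utf-8"))
        while size + n > limit:
            cut = last_break if last_break > 0 else len(current)
            chunks.append(current[:cut])
            current = current[cut:]
            size = len(current.encode("utf-8"))
            last_break = -1
        current += ch
        size += n
        if ch in breakers:
            last_break = len(current)
    if current:
        chunks.append(current)
    return chunks


def synthesize_in_chunks(client, text, voice, audio_config, out_path):
    """Synthesize each piece separately and append the audio to one file."""
    # Imported here so that split_by_bytes can be used without the library.
    from google.cloud import texttospeech

    with open(out_path, "wb") as out:
        for chunk in split_by_bytes(text):
            if not chunk.strip():
                continue  # skip pieces that contain only whitespace
            response = client.synthesize_speech(
                input=texttospeech.SynthesisInput(text=chunk),
                voice=voice,
                audio_config=audio_config,
            )
            out.write(response.audio_content)
```

With the `client`, `voice`, `audio_config` and `text` from the first script, the call would be `synthesize_in_chunks(client, text, voice, audio_config, "output.mp3")`. This only gets around the size limit; it does nothing for the robotic voice.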

Google has state-of-the-art deep learning technology, but some of its cloud products (such as Vertex AI and this Text-to-Speech) are ridiculously hard to use.

After some searching (at least Google Search is as good as ever), I found NaturalReader. Surprisingly, it supports Chinese, and the voice sounds as good as a real human. The only problem is that it is very expensive for individual users.