.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/tacotron2_pipeline_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_tutorials_tacotron2_pipeline_tutorial.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_tacotron2_pipeline_tutorial.py:


Text-to-Speech with Tacotron2
=============================

**Author**: Yao-Yuan Yang, Moto Hira

.. GENERATED FROM PYTHON SOURCE LINES 11-45

Overview
--------

This tutorial shows how to build a text-to-speech pipeline using the
pretrained Tacotron2 in torchaudio.

The text-to-speech pipeline goes as follows:

1. Text preprocessing

   First, the input text is encoded into a list of symbols. In this
   tutorial, we use English characters as the symbols.

2. Spectrogram generation

   From the encoded text, a spectrogram is generated. We use the
   ``Tacotron2`` model for this.

3. Time-domain conversion

   The last step is converting the spectrogram into a waveform.
   A component that generates speech from a spectrogram is called a
   vocoder. In this tutorial, three different vocoders are used:
   :py:class:`~torchaudio.models.WaveRNN`,
   :py:class:`~torchaudio.transforms.GriffinLim`, and
   Nvidia's WaveGlow.

The following figure illustrates the whole process.

.. image:: https://download.pytorch.org/torchaudio/tutorial-assets/tacotron2_tts_pipeline.png

All the related components are bundled in
:py:class:`torchaudio.pipelines.Tacotron2TTSBundle`, but this tutorial
also covers the process under the hood.

.. GENERATED FROM PYTHON SOURCE LINES 47-50

Preparation
-----------

.. GENERATED FROM PYTHON SOURCE LINES 50-62

.. code-block:: default

    import torch
    import torchaudio

    # Fix the seed so that the sampling-based steps are repeatable.
    torch.random.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print(torch.__version__)
    print(torchaudio.__version__)
    print(device)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2.10.0.dev20251013+cu126
    2.8.0a0+1d65bbe
    cuda

.. GENERATED FROM PYTHON SOURCE LINES 64-69

.. code-block:: default

    import IPython
    import matplotlib.pyplot as plt

.. GENERATED FROM PYTHON SOURCE LINES 70-73

Text Processing
---------------

.. GENERATED FROM PYTHON SOURCE LINES 76-90

Character-based encoding
~~~~~~~~~~~~~~~~~~~~~~~~

In this section, we go through how the character-based encoding works.

Because the pretrained Tacotron2 model expects a specific symbol table,
``torchaudio`` provides the matching functionality. However, we first
implement the encoding manually to aid understanding.

First, we define the set of symbols
``'_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we map each
character of the input text to the index of the corresponding symbol
in the table. Symbols that are not in the table are ignored.

.. GENERATED FROM PYTHON SOURCE LINES 90-105

.. code-block:: default

    symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
    # Map each symbol to its position in the table.
    look_up = {s: i for i, s in enumerate(symbols)}
    symbols = set(symbols)


    def text_to_sequence(text):
        text = text.lower()
        # Characters outside the symbol table are silently dropped.
        return [look_up[s] for s in text if s in symbols]


    text = "Hello world! Text to speech!"
    print(text_to_sequence(text))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2, 11, 31, 16, 35, 31, 11, 31, 26, 11, 30, 27, 16, 16, 14, 19, 2]

.. GENERATED FROM PYTHON SOURCE LINES 106-111

As mentioned above, the symbol table and indices must match
what the pretrained Tacotron2 model expects. ``torchaudio`` provides the same
transform along with the pretrained model. You can
instantiate and use this transform as follows.

.. GENERATED FROM PYTHON SOURCE LINES 111-121

.. code-block:: default

    processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

    text = "Hello world! Text to speech!"
    processed, lengths = processor(text)

    print(processed)
    print(lengths)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    tensor([[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15,  2, 11, 31, 16, 35, 31, 11,
             31, 26, 11, 30, 27, 16, 16, 14, 19,  2]])
    tensor([28], dtype=torch.int32)

.. GENERATED FROM PYTHON SOURCE LINES 122-129

Note: The output of our manual encoding and the output of the ``torchaudio``
``text_processor`` match, which means we correctly re-implemented what the
library does internally. The processor takes either a single text or a list
of texts as input. When a list of texts is provided, the returned ``lengths``
variable holds the valid length of each processed token sequence in the
output batch.

The intermediate representation can be retrieved as follows:

.. GENERATED FROM PYTHON SOURCE LINES 129-133

.. code-block:: default

    print([processor.tokens[i] for i in processed[0, : lengths[0]]])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', 't', 'e', 'x', 't', ' ', 't', 'o', ' ', 's', 'p', 'e', 'e', 'c', 'h', '!']

.. GENERATED FROM PYTHON SOURCE LINES 134-151

Spectrogram Generation
----------------------

``Tacotron2`` is the model we use to generate a spectrogram from the
encoded text. For the details of the model, please refer to the
Tacotron2 paper, *Natural TTS Synthesis by Conditioning WaveNet on Mel
Spectrogram Predictions*.

It is easy to instantiate a Tacotron2 model with pretrained weights.
Note, however, that the input to a Tacotron2 model needs to be processed
by the matching text processor.

:py:class:`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching
models and processors together so that it is easy to create the pipeline.

For the available bundles and their usage, please refer to
:py:class:`~torchaudio.pipelines.Tacotron2TTSBundle`.

.. GENERATED FROM PYTHON SOURCE LINES 151-168

.. code-block:: default

    bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
    processor = bundle.get_text_processor()
    tacotron2 = bundle.get_tacotron2().to(device)

    text = "Hello world! Text to speech!"

    with torch.inference_mode():
        processed, lengths = processor(text)
        processed = processed.to(device)
        lengths = lengths.to(device)
        # ``infer`` returns the spectrogram, its lengths, and the alignments.
        spec, _, _ = tacotron2.infer(processed, lengths)


    _ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

.. image-sg:: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_001.png
   :alt: tacotron2 pipeline tutorial
   :srcset: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Downloading: "https://download.pytorch.org/torchaudio/models/tacotron2_english_characters_1500_epochs_wavernn_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/tacotron2_english_characters_1500_epochs_wavernn_ljspeech.pth

Note that the ``Tacotron2.infer`` method performs multinomial sampling,
so the process of generating the spectrogram is not deterministic:
repeated calls on the same input produce spectrograms of different lengths.

.. GENERATED FROM PYTHON SOURCE LINES 172-186

.. code-block:: default

    def plot():
        fig, ax = plt.subplots(3, 1)
        # Run inference three times on the same input to show the variation.
        for i in range(3):
            with torch.inference_mode():
                spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
            print(spec[0].shape)
            ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")


    plot()

.. image-sg:: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_002.png
   :alt: tacotron2 pipeline tutorial
   :srcset: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    torch.Size([80, 183])
    torch.Size([80, 196])
    torch.Size([80, 184])

.. GENERATED FROM PYTHON SOURCE LINES 187-196

Waveform Generation
-------------------

Once the spectrogram is generated, the last step is to recover the
waveform from the spectrogram using a vocoder.

``torchaudio`` provides vocoders based on ``GriffinLim`` and
``WaveRNN``.

.. GENERATED FROM PYTHON SOURCE LINES 199-205

WaveRNN Vocoder
~~~~~~~~~~~~~~~

Continuing from the previous section, we can instantiate the matching
WaveRNN model from the same bundle.

.. GENERATED FROM PYTHON SOURCE LINES 205-221

.. code-block:: default

    bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH

    processor = bundle.get_text_processor()
    tacotron2 = bundle.get_tacotron2().to(device)
    vocoder = bundle.get_vocoder().to(device)

    text = "Hello world! Text to speech!"

    with torch.inference_mode():
        processed, lengths = processor(text)
        processed = processed.to(device)
        lengths = lengths.to(device)
        spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
        # From here on, ``lengths`` holds the waveform lengths.
        waveforms, lengths = vocoder(spec, spec_lengths)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Downloading: "https://download.pytorch.org/torchaudio/models/wavernn_10k_epochs_8bits_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/wavernn_10k_epochs_8bits_ljspeech.pth

.. GENERATED FROM PYTHON SOURCE LINES 223-239

.. code-block:: default

    def plot(waveforms, spec, sample_rate):
        waveforms = waveforms.cpu().detach()

        # Top panel: waveform. Bottom panel: the spectrogram it was generated from.
        fig, [ax1, ax2] = plt.subplots(2, 1)
        ax1.plot(waveforms[0])
        ax1.set_xlim(0, waveforms.size(-1))
        ax1.grid(True)
        ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
        return IPython.display.Audio(waveforms[0:1], rate=sample_rate)


    plot(waveforms, spec, vocoder.sample_rate)

.. image-sg:: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_003.png
   :alt: tacotron2 pipeline tutorial
   :srcset: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_003.png
   :class: sphx-glr-single-img

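The generated waveform can also be written to disk rather than only played
back. The following is a minimal sketch that uses the standard-library
``wave`` module. ``write_wav`` is a hypothetical helper (it is not part of
``torchaudio``), the sine tone stands in for ``waveforms[0].cpu().tolist()``,
and the 22050 Hz rate is a placeholder for ``vocoder.sample_rate``. The sketch
assumes mono samples in the range [-1, 1].

.. code-block:: python

    import math
    import os
    import struct
    import tempfile
    import wave


    def write_wav(path, samples, sample_rate):
        """Write mono float samples in [-1.0, 1.0] as 16-bit PCM (hypothetical helper)."""
        frames = bytearray()
        for s in samples:
            s = max(-1.0, min(1.0, s))  # clip to the valid range
            frames += struct.pack("<h", int(round(s * 32767)))
        with wave.open(path, "wb") as f:
            f.setnchannels(1)  # mono
            f.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
            f.setframerate(sample_rate)
            f.writeframes(bytes(frames))


    # Stand-in for the vocoder output: one second of a 440 Hz tone.
    sample_rate = 22050  # placeholder for ``vocoder.sample_rate``
    samples = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

    path = os.path.join(tempfile.gettempdir(), "tts_output.wav")
    write_wav(path, samples, sample_rate)

With the tensors from this tutorial, the call would look like
``write_wav(path, waveforms[0].cpu().tolist(), int(vocoder.sample_rate))``,
assuming the vocoder output is already scaled to [-1, 1].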
.. GENERATED FROM PYTHON SOURCE LINES 240-248

Griffin-Lim Vocoder
~~~~~~~~~~~~~~~~~~~

Using the Griffin-Lim vocoder works the same way as using WaveRNN. You can
instantiate the vocoder object with the
:py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
method and pass it the spectrogram.

.. GENERATED FROM PYTHON SOURCE LINES 248-262

.. code-block:: default

    bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH

    processor = bundle.get_text_processor()
    tacotron2 = bundle.get_tacotron2().to(device)
    vocoder = bundle.get_vocoder().to(device)

    # ``text`` is reused from the WaveRNN example above.
    with torch.inference_mode():
        processed, lengths = processor(text)
        processed = processed.to(device)
        lengths = lengths.to(device)
        spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Downloading: "https://download.pytorch.org/torchaudio/models/tacotron2_english_characters_1500_epochs_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/tacotron2_english_characters_1500_epochs_ljspeech.pth

.. code-block:: default

    plot(waveforms, spec, vocoder.sample_rate)

.. image-sg:: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_004.png
   :alt: tacotron2 pipeline tutorial
   :srcset: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_004.png
   :class: sphx-glr-single-img

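The number of spectrogram frames largely determines how long the synthesized
audio is, because each frame corresponds to a fixed number of waveform samples
(the hop length). The sketch below works through that arithmetic.
``estimate_duration`` is a hypothetical helper, and the hop length of 256 and
the sample rate of 22050 Hz are illustrative assumptions, not values read from
the bundle. The frame counts are the ones printed in the spectrogram section.

.. code-block:: python

    def estimate_duration(num_frames, hop_length, sample_rate):
        """Approximate duration in seconds of audio generated from ``num_frames`` frames."""
        return num_frames * hop_length / sample_rate


    # Assumed, illustrative settings. Check the actual model configuration.
    hop_length = 256
    sample_rate = 22050

    for frames in (183, 196, 184):
        seconds = estimate_duration(frames, hop_length, sample_rate)
        print(f"{frames} frames -> about {seconds:.2f} s")

The estimate ignores edge effects such as padding at the start and end of the
signal, so the true length can differ by a fraction of a frame.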
.. GENERATED FROM PYTHON SOURCE LINES 269-276

WaveGlow Vocoder
~~~~~~~~~~~~~~~~

WaveGlow is a vocoder published by Nvidia. The pretrained weights are
published on Torch Hub. You can instantiate the model using the
``torch.hub`` module.

.. GENERATED FROM PYTHON SOURCE LINES 276-300

.. code-block:: default

    # Workaround to load model mapped on GPU
    # https://stackoverflow.com/a/61840832
    waveglow = torch.hub.load(
        "NVIDIA/DeepLearningExamples:torchhub",
        "nvidia_waveglow",
        model_math="fp32",
        pretrained=False,
    )
    checkpoint = torch.hub.load_state_dict_from_url(
        "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth",  # noqa: E501
        progress=False,
        map_location=device,
    )
    # Drop the "module." prefix from the checkpoint keys so they match the model.
    state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}

    waveglow.load_state_dict(state_dict)
    waveglow = waveglow.remove_weightnorm(waveglow)
    waveglow = waveglow.to(device)
    waveglow.eval()

    with torch.no_grad():
        waveforms = waveglow.infer(spec)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /pytorch/audio/ci_env/lib/python3.11/site-packages/torch/hub.py:335: UserWarning: You are about to download and run code from an untrusted repository. In a future release, this won't be allowed. To add the repository to your trusted list, change the command to load(..., trust_repo=False) and a command prompt will appear asking for an explicit confirmation of trust, or load(..., trust_repo=True), which will assume that the prompt is to be answered with 'yes'. You can also use load(..., trust_repo='check') which will only prompt for confirmation if the repo is not already trusted. This will eventually be the default behaviour
      warnings.warn(
    Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
    /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Classification/ConvNets/image_classification/models/common.py:13: UserWarning: pytorch_quantization module not found, quantization will not be available
      warnings.warn(
    /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Classification/ConvNets/image_classification/models/efficientnet.py:17: UserWarning: pytorch_quantization module not found, quantization will not be available
      warnings.warn(
    /pytorch/audio/ci_env/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
      WeightNorm.apply(module, name, dim)
    Downloading: "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth" to /root/.cache/torch/hub/checkpoints/nvidia_waveglowpyt_fp32_20190306.pth

.. GENERATED FROM PYTHON SOURCE LINES 302-304

.. code-block:: default

    plot(waveforms, spec, 22050)

.. image-sg:: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_005.png
   :alt: tacotron2 pipeline tutorial
   :srcset: /tutorials/images/sphx_glr_tacotron2_pipeline_tutorial_005.png
   :class: sphx-glr-single-img

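The WaveGlow loading code above rewrites the checkpoint keys with
``key.replace("module.", "")``. Checkpoints saved from a model wrapped in
``torch.nn.DataParallel`` carry a ``module.`` prefix on every key, which is
presumably why the prefix has to go here. The sketch below shows the same idea
on a toy dictionary. The keys are hypothetical and ``strip_prefix`` is an
illustrative helper, not a ``torch`` function. Unlike ``str.replace``, it
removes only a leading prefix, so a key that merely contains ``module.``
elsewhere is left intact.

.. code-block:: python

    def strip_prefix(state_dict, prefix="module."):
        """Return a copy of ``state_dict`` with a leading ``prefix`` removed from each key."""
        return {
            (key[len(prefix):] if key.startswith(prefix) else key): value
            for key, value in state_dict.items()
        }


    # Hypothetical keys standing in for a checkpoint saved from a wrapped model.
    checkpoint = {
        "module.upsample.weight": 1,
        "module.WN.0.start.bias": 2,
        "submodule.weight": 3,  # contains "module." but not as a prefix
    }

    cleaned = strip_prefix(checkpoint)
    print(sorted(cleaned))

For the WaveGlow checkpoint used in this tutorial, the simpler ``replace``
call in the original code is what the tutorial relies on. The prefix-only
variant is just a more conservative option for other checkpoints.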
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (1 minute 16.512 seconds)


.. _sphx_glr_download_tutorials_tacotron2_pipeline_tutorial.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: tacotron2_pipeline_tutorial.py <tacotron2_pipeline_tutorial.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: tacotron2_pipeline_tutorial.ipynb <tacotron2_pipeline_tutorial.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    Gallery generated by Sphinx-Gallery