.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/ctc_forced_alignment_api_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_ctc_forced_alignment_api_tutorial.py:


CTC forced alignment API tutorial
=================================

**Author**: `Xiaohui Zhang `__, `Moto Hira `__

.. warning::

   Starting with version 2.8, we are refactoring TorchAudio to transition it
   into a maintenance phase. As a result:

   - The APIs described in this tutorial are deprecated in 2.8 and will be removed in 2.9.
   - The decoding and encoding capabilities of PyTorch for both audio and video
     are being consolidated into TorchCodec.

   Please see https://github.com/pytorch/audio/issues/3902 for more information.

Forced alignment is the process of aligning a transcript with speech.
This tutorial shows how to align a transcript to speech using
:py:func:`torchaudio.functional.forced_align`, which was developed as part of the work on
`Scaling Speech Technology to 1,000+ Languages `__.

:py:func:`~torchaudio.functional.forced_align` has custom CPU and CUDA
implementations that are more performant and more accurate than a vanilla Python
implementation. It can also handle missing parts of the transcript with the
special ``<star>`` token.

There is also a high-level API, :py:class:`torchaudio.pipelines.Wav2Vec2FABundle`,
which wraps the pre/post-processing explained in this tutorial and makes it easy
to run forced alignment.
`Forced alignment for multilingual data <./forced_alignment_for_multilingual_data_tutorial.html>`__
uses this API to illustrate how to align non-English transcripts.

.. GENERATED FROM PYTHON SOURCE LINES 37-39

Preparation
-----------

.. GENERATED FROM PYTHON SOURCE LINES 39-46

.. code-block:: default

    import torch
    import torchaudio

    print(torch.__version__)
    print(torchaudio.__version__)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2.8.0+cu126
    2.8.0

.. GENERATED FROM PYTHON SOURCE LINES 48-52

.. code-block:: default

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    cuda

.. GENERATED FROM PYTHON SOURCE LINES 54-60

.. code-block:: default

    import IPython
    import matplotlib.pyplot as plt

    import torchaudio.functional as F

.. GENERATED FROM PYTHON SOURCE LINES 61-64

First we prepare the speech data and the transcript we are going to use.

.. GENERATED FROM PYTHON SOURCE LINES 64-70

.. code-block:: default

    SPEECH_FILE = torchaudio.utils.download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
    waveform, _ = torchaudio.load(SPEECH_FILE)
    TRANSCRIPT = "i had that curiosity beside me at this moment".split()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /pytorch/audio/examples/tutorials/ctc_forced_alignment_api_tutorial.py:65: UserWarning: torchaudio.utils.download.download_asset has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release.
      SPEECH_FILE = torchaudio.utils.download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
    /pytorch/audio/src/torchaudio/_backend/utils.py:213: UserWarning: In 2.9, this function's implementation will be changed to use torchaudio.load_with_torchcodec` under the hood. Some parameters like ``normalize``, ``format``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's decoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder.
      warnings.warn(
    /pytorch/audio/src/torchaudio/_backend/ffmpeg.py:88: UserWarning: torio.io._streaming_media_decoder.StreamingMediaDecoder has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release.
      s = torchaudio.io.StreamReader(src, format, None, buffer_size)

.. GENERATED FROM PYTHON SOURCE LINES 71-90

Generating emissions
~~~~~~~~~~~~~~~~~~~~

:py:func:`~torchaudio.functional.forced_align` takes emission and token
sequences and outputs timestamps of the tokens and their scores.

The emission represents the frame-wise probability distribution over tokens,
and it can be obtained by passing the waveform to an acoustic model.

Tokens are numerical expressions of the transcript. There are many ways to
tokenize a transcript, but here we simply map each letter to an integer, which
is how the labels were constructed when the acoustic model we are going to use
was trained.

We will use a pre-trained Wav2Vec2 model,
:py:data:`torchaudio.pipelines.MMS_FA`, to obtain the emission and tokenize
the transcript.

.. GENERATED FROM PYTHON SOURCE LINES 90-98

.. code-block:: default

    bundle = torchaudio.pipelines.MMS_FA

    model = bundle.get_model(with_star=False).to(device)
    with torch.inference_mode():
        emission, _ = model(waveform.to(device))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Downloading: "https://dl.fbaipublicfiles.com/mms/torchaudio/ctc_alignment_mling_uroman/model.pt" to /root/.cache/torch/hub/checkpoints/model.pt
      0%|          | 0.00/1.18G [00:00

After plotting the emission, the transcript is tokenized by looking up each
character in the dictionary provided by the bundle (``DICTIONARY``, with
``LABELS`` as the corresponding list of characters), and the emission and
token sequence are passed to :py:func:`~torchaudio.functional.forced_align`
(wrapped in a small ``align`` helper) to obtain the frame-level alignment
``aligned_tokens`` and the per-frame scores ``alignment_scores``, expressed
as probabilities.

The frame-level alignment contains blank tokens and repeated tokens. When the
same token appears after a blank token, it is not treated as a repeat, but as
a new occurrence::

    a a b -> a b
    a - - b -> a b
    a a - b -> a b
    a - a b -> a a b
              ^^^  ^^^

.. GENERATED FROM PYTHON SOURCE LINES 201-209

Token-level alignments
~~~~~~~~~~~~~~~~~~~~~~

The next step is to resolve the repetition, so that each alignment does not
depend on previous alignments.
:py:func:`torchaudio.functional.merge_tokens` computes the
:py:class:`~torchaudio.functional.TokenSpan` objects, which represent which
token from the transcript is present at what time span.

.. GENERATED FROM PYTHON SOURCE LINES 212-220

.. code-block:: default

    token_spans = F.merge_tokens(aligned_tokens, alignment_scores)

    print("Token\tTime\tScore")
    for s in token_spans:
        print(f"{LABELS[s.token]}\t[{s.start:3d}, {s.end:3d})\t{s.score:.2f}")

.. rst-class:: sphx-glr-script-out

..
code-block:: none Token Time Score i [ 32, 33) 1.00 h [ 35, 37) 0.96 a [ 37, 38) 1.00 d [ 41, 42) 1.00 t [ 44, 45) 1.00 h [ 45, 46) 1.00 a [ 47, 48) 1.00 t [ 50, 51) 1.00 c [ 54, 55) 1.00 u [ 58, 60) 0.98 r [ 63, 64) 1.00 i [ 65, 66) 1.00 o [ 72, 73) 1.00 s [ 79, 80) 1.00 i [ 83, 84) 1.00 t [ 85, 86) 1.00 y [ 88, 89) 1.00 b [ 93, 94) 1.00 e [ 95, 96) 1.00 s [101, 102) 1.00 i [110, 111) 1.00 d [113, 114) 1.00 e [114, 115) 0.85 m [116, 117) 1.00 e [119, 120) 1.00 a [124, 125) 1.00 t [127, 128) 1.00 t [129, 130) 1.00 h [130, 131) 1.00 i [132, 133) 1.00 s [136, 137) 1.00 m [141, 142) 1.00 o [144, 145) 1.00 m [148, 149) 1.00 e [151, 152) 1.00 n [153, 154) 1.00 t [155, 156) 1.00 .. GENERATED FROM PYTHON SOURCE LINES 221-225 Word-level alignments ~~~~~~~~~~~~~~~~~~~~~ Now let’s group the token-level alignments into word-level alignments. .. GENERATED FROM PYTHON SOURCE LINES 225-240 .. code-block:: default def unflatten(list_, lengths): assert len(list_) == sum(lengths) i = 0 ret = [] for l in lengths: ret.append(list_[i : i + l]) i += l return ret word_spans = unflatten(token_spans, [len(word) for word in TRANSCRIPT]) .. GENERATED FROM PYTHON SOURCE LINES 241-244 Audio previews ~~~~~~~~~~~~~~ .. GENERATED FROM PYTHON SOURCE LINES 244-261 .. code-block:: default # Compute average score weighted by the span length def _score(spans): return sum(s.score * len(s) for s in spans) / sum(len(s) for s in spans) def preview_word(waveform, spans, num_frames, transcript, sample_rate=bundle.sample_rate): ratio = waveform.size(1) / num_frames x0 = int(ratio * spans[0].start) x1 = int(ratio * spans[-1].end) print(f"{transcript} ({_score(spans):.2f}): {x0 / sample_rate:.3f} - {x1 / sample_rate:.3f} sec") segment = waveform[:, x0:x1] return IPython.display.Audio(segment.numpy(), rate=sample_rate) num_frames = emission.size(1) .. GENERATED FROM PYTHON SOURCE LINES 262-267 .. code-block:: default # Generate the audio for each segment print(TRANSCRIPT) IPython.display.Audio(SPEECH_FILE) .. rst-class:: sphx-glr-script-out .. code-block:: none ['i', 'had', 'that', 'curiosity', 'beside', 'me', 'at', 'this', 'moment'] .. raw:: html


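The previews rely on IPython, so they only render inside a notebook. As an optional variation, the same frame-to-sample conversion can be used to write a segment to disk instead; the sketch below reuses ``waveform``, ``word_spans``, ``num_frames``, and ``bundle`` from above, and the helper name ``save_word`` and the output path ``"word_0.wav"`` are just illustrative examples.

.. code-block:: default

    def save_word(waveform, spans, num_frames, path, sample_rate=bundle.sample_rate):
        # Convert frame indices in the emission to sample indices in the waveform
        ratio = waveform.size(1) / num_frames
        x0 = int(ratio * spans[0].start)
        x1 = int(ratio * spans[-1].end)
        # Write the segment out as a WAV file
        torchaudio.save(path, waveform[:, x0:x1], sample_rate)

    save_word(waveform, word_spans[0], num_frames, "word_0.wav")
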
.. GENERATED FROM PYTHON SOURCE LINES 269-272 .. code-block:: default preview_word(waveform, word_spans[0], num_frames, TRANSCRIPT[0]) .. rst-class:: sphx-glr-script-out .. code-block:: none i (1.00): 0.644 - 0.664 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 274-277 .. code-block:: default preview_word(waveform, word_spans[1], num_frames, TRANSCRIPT[1]) .. rst-class:: sphx-glr-script-out .. code-block:: none had (0.98): 0.704 - 0.845 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 279-282 .. code-block:: default preview_word(waveform, word_spans[2], num_frames, TRANSCRIPT[2]) .. rst-class:: sphx-glr-script-out .. code-block:: none that (1.00): 0.885 - 1.026 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 284-287 .. code-block:: default preview_word(waveform, word_spans[3], num_frames, TRANSCRIPT[3]) .. rst-class:: sphx-glr-script-out .. code-block:: none curiosity (1.00): 1.086 - 1.790 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 289-292 .. code-block:: default preview_word(waveform, word_spans[4], num_frames, TRANSCRIPT[4]) .. rst-class:: sphx-glr-script-out .. code-block:: none beside (0.97): 1.871 - 2.314 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 294-297 .. code-block:: default preview_word(waveform, word_spans[5], num_frames, TRANSCRIPT[5]) .. rst-class:: sphx-glr-script-out .. code-block:: none me (1.00): 2.334 - 2.414 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 299-302 .. code-block:: default preview_word(waveform, word_spans[6], num_frames, TRANSCRIPT[6]) .. rst-class:: sphx-glr-script-out .. code-block:: none at (1.00): 2.495 - 2.575 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 304-307 .. code-block:: default preview_word(waveform, word_spans[7], num_frames, TRANSCRIPT[7]) .. rst-class:: sphx-glr-script-out .. code-block:: none this (1.00): 2.595 - 2.756 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 309-312 .. code-block:: default preview_word(waveform, word_spans[8], num_frames, TRANSCRIPT[8]) .. rst-class:: sphx-glr-script-out .. code-block:: none moment (1.00): 2.837 - 3.138 sec .. raw:: html


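Instead of previewing the segments one by one, the same spans can be turned into plain word-level timestamps, for example to export them for later use. Below is a minimal sketch reusing ``word_spans``, ``TRANSCRIPT``, ``waveform``, ``num_frames``, and ``_score`` from above; the helper name ``word_timestamps`` is just an example.

.. code-block:: default

    def word_timestamps(word_spans, transcript, waveform, num_frames, sample_rate=bundle.sample_rate):
        # Seconds of audio covered by a single emission frame
        seconds_per_frame = waveform.size(1) / num_frames / sample_rate
        entries = []
        for spans, word in zip(word_spans, transcript):
            start = spans[0].start * seconds_per_frame
            end = spans[-1].end * seconds_per_frame
            entries.append((word, start, end, _score(spans)))
        return entries

    for word, start, end, score in word_timestamps(word_spans, TRANSCRIPT, waveform, num_frames):
        print(f"{word}\t{start:.3f}\t{end:.3f}\t{score:.2f}")
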
.. GENERATED FROM PYTHON SOURCE LINES 313-318

Visualization
~~~~~~~~~~~~~

Now let's look at the alignment result and segment the original speech into words.

.. GENERATED FROM PYTHON SOURCE LINES 318-344

.. code-block:: default

    def plot_alignments(waveform, token_spans, emission, transcript, sample_rate=bundle.sample_rate):
        ratio = waveform.size(1) / emission.size(1) / sample_rate

        fig, axes = plt.subplots(2, 1)
        axes[0].imshow(emission[0].detach().cpu().T, aspect="auto")
        axes[0].set_title("Emission")
        axes[0].set_xticks([])

        axes[1].specgram(waveform[0], Fs=sample_rate)
        for t_spans, chars in zip(token_spans, transcript):
            t0, t1 = t_spans[0].start + 0.1, t_spans[-1].end - 0.1
            axes[0].axvspan(t0 - 0.5, t1 - 0.5, facecolor="None", hatch="/", edgecolor="white")
            axes[1].axvspan(ratio * t0, ratio * t1, facecolor="None", hatch="/", edgecolor="white")
            axes[1].annotate(f"{_score(t_spans):.2f}", (ratio * t0, sample_rate * 0.51), annotation_clip=False)

            for span, char in zip(t_spans, chars):
                t0 = span.start * ratio
                axes[1].annotate(char, (t0, sample_rate * 0.55), annotation_clip=False)

        axes[1].set_xlabel("time [second]")
        axes[1].set_xlim([0, None])
        fig.tight_layout()

.. GENERATED FROM PYTHON SOURCE LINES 346-349

.. code-block:: default

    plot_alignments(waveform, word_spans, emission, TRANSCRIPT)

.. image-sg:: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_002.png
   :alt: Emission
   :srcset: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 350-360

Inconsistent treatment of ``blank`` token
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When splitting the token-level alignments into words, you will notice that
some blank tokens are treated differently, and this makes the interpretation
of the result somewhat ambiguous.

This is easy to see when we plot the scores. The following figure shows word
regions and non-word regions, with the frame-level scores of non-blank tokens.

.. GENERATED FROM PYTHON SOURCE LINES 361-382

.. code-block:: default

    def plot_scores(word_spans, scores):
        fig, ax = plt.subplots()
        span_xs, span_hs = [], []
        ax.axvspan(word_spans[0][0].start - 0.05, word_spans[-1][-1].end + 0.05, facecolor="paleturquoise", edgecolor="none", zorder=-1)
        for t_span in word_spans:
            for span in t_span:
                for t in range(span.start, span.end):
                    span_xs.append(t + 0.5)
                    span_hs.append(scores[t].item())
                ax.annotate(LABELS[span.token], (span.start, -0.07))
            ax.axvspan(t_span[0].start - 0.05, t_span[-1].end + 0.05, facecolor="mistyrose", edgecolor="none", zorder=-1)
        ax.bar(span_xs, span_hs, color="lightsalmon", edgecolor="coral")
        ax.set_title("Frame-level scores and word segments")
        ax.set_ylim(-0.1, None)
        ax.grid(True, axis="y")
        ax.axhline(0, color="black")
        fig.tight_layout()

    plot_scores(word_spans, alignment_scores)

.. image-sg:: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_003.png
   :alt: Frame-level scores and word segments
   :srcset: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 383-420

In this plot, the blank tokens are the highlighted areas without a vertical bar.
You can see that there are blank tokens which are interpreted as part of a word
(highlighted red), while the others (highlighted blue) are not.

One reason for this is that the model was trained without a label for word
boundaries. The blank tokens are treated not just as repetition but also as
silence between words.

But then, a question arises.
Should the frames at or immediately after the end of a word be treated as
silence or as a repeat?

In the example above, if you go back to the previous plot of the spectrogram
and word regions, you see that after the "y" in "curiosity", there is still
some activity in multiple frequency buckets. Would it be more accurate if
those frames were included in the word?

Unfortunately, CTC does not provide a comprehensive solution to this. Models
trained with CTC are known to exhibit a "peaky" response, that is, they tend
to spike at an occurrence of a label, but the spike does not last for the
duration of the label. (Note: pre-trained Wav2Vec2 models tend to spike at the
beginning of label occurrences, but this is not always the case.)

:cite:`zeyer2021does` provides an in-depth analysis of the peaky behavior of
CTC. We encourage those interested in understanding it further to refer to the
paper. The following quote from the paper describes the exact issue we are
facing here:

*Peaky behavior can be problematic in certain cases,*
*e.g. when an application requires to not use the blank label,*
*e.g. to get meaningful time accurate alignments of phonemes*
*to a transcription.*

.. GENERATED FROM PYTHON SOURCE LINES 422-435

Advanced: Handling transcripts with ``<star>`` token
-----------------------------------------------------

Now let's look at how, when the transcript is partially missing, we can
improve alignment quality using the ``<star>`` token, which is capable of
modeling any token.

Here we use the same English example as above, but we remove the beginning
text ``"i had that curiosity beside me at"`` from the transcript. Aligning
audio with such a transcript results in wrong alignments for the existing
word "this". However, this issue can be mitigated by using the ``<star>``
token to model the missing text.

.. GENERATED FROM PYTHON SOURCE LINES 437-438

First, we extend the dictionary to include the ``<star>`` token.

.. GENERATED FROM PYTHON SOURCE LINES 438-441

.. code-block:: default

    DICTIONARY["*"] = len(DICTIONARY)

.. GENERATED FROM PYTHON SOURCE LINES 442-445

Next, we extend the emission tensor with the extra dimension corresponding to
the ``<star>`` token.

.. GENERATED FROM PYTHON SOURCE LINES 445-453

.. code-block:: default

    star_dim = torch.zeros((1, emission.size(1), 1), device=emission.device, dtype=emission.dtype)
    emission = torch.cat((emission, star_dim), 2)
    assert len(DICTIONARY) == emission.shape[2]
    plot_emission(emission[0])

.. image-sg:: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_004.png
   :alt: Frame-wise class probabilities
   :srcset: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 454-456

The following function combines all the processes and computes word segments
from the emission in one go.

.. GENERATED FROM PYTHON SOURCE LINES 456-466

.. code-block:: default

    def compute_alignments(emission, transcript, dictionary):
        tokens = [dictionary[char] for word in transcript for char in word]
        alignment, scores = align(emission, tokens)
        token_spans = F.merge_tokens(alignment, scores)
        word_spans = unflatten(token_spans, [len(word) for word in transcript])
        return word_spans

.. GENERATED FROM PYTHON SOURCE LINES 467-469

Full Transcript
~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 469-473

.. code-block:: default

    word_spans = compute_alignments(emission, TRANSCRIPT, DICTIONARY)
    plot_alignments(waveform, word_spans, emission, TRANSCRIPT)

.. image-sg:: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_005.png
   :alt: Emission
   :srcset: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_005.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /pytorch/audio/examples/tutorials/ctc_forced_alignment_api_tutorial.py:146: UserWarning: torchaudio.functional._alignment.forced_align has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release.
      alignments, scores = F.forced_align(emission, targets, blank=0)

.. GENERATED FROM PYTHON SOURCE LINES 474-478

Partial Transcript with ``<star>`` token
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we replace the first part of the transcript with the ``<star>`` token.

.. GENERATED FROM PYTHON SOURCE LINES 478-483

.. code-block:: default

    transcript = "* this moment".split()
    word_spans = compute_alignments(emission, transcript, DICTIONARY)
    plot_alignments(waveform, word_spans, emission, transcript)

.. image-sg:: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_006.png
   :alt: Emission
   :srcset: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_006.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /pytorch/audio/examples/tutorials/ctc_forced_alignment_api_tutorial.py:146: UserWarning: torchaudio.functional._alignment.forced_align has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release.
      alignments, scores = F.forced_align(emission, targets, blank=0)

.. GENERATED FROM PYTHON SOURCE LINES 485-488

.. code-block:: default

    preview_word(waveform, word_spans[0], num_frames, transcript[0])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    * (1.00): 0.000 - 2.595 sec

.. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 490-493 .. code-block:: default preview_word(waveform, word_spans[1], num_frames, transcript[1]) .. rst-class:: sphx-glr-script-out .. code-block:: none this (1.00): 2.595 - 2.756 sec .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 495-498 .. code-block:: default preview_word(waveform, word_spans[2], num_frames, transcript[2]) .. rst-class:: sphx-glr-script-out .. code-block:: none moment (1.00): 2.837 - 3.138 sec .. raw:: html


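The ``<star>`` token is not restricted to the beginning of the transcript. As a hypothetical variation on the example above, unknown text on both sides of a single known word can be modeled the same way, reusing ``compute_alignments`` and the extended ``DICTIONARY``; the specific transcript below is just an illustration.

.. code-block:: default

    # Hypothetical variant: only "curiosity" is known, and "*" stands in for
    # the missing text before and after it.
    transcript = "* curiosity *".split()
    word_spans = compute_alignments(emission, transcript, DICTIONARY)
    plot_alignments(waveform, word_spans, emission, transcript)
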
.. GENERATED FROM PYTHON SOURCE LINES 499-505 Partial Transcript without ```` token ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ As a comparison, the following aligns the partial transcript without using ```` token. It demonstrates the effect of ```` token for dealing with deletion errors. .. GENERATED FROM PYTHON SOURCE LINES 505-510 .. code-block:: default transcript = "this moment".split() word_spans = compute_alignments(emission, transcript, DICTIONARY) plot_alignments(waveform, word_spans, emission, transcript) .. image-sg:: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_007.png :alt: Emission :srcset: /tutorials/images/sphx_glr_ctc_forced_alignment_api_tutorial_007.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none /pytorch/audio/examples/tutorials/ctc_forced_alignment_api_tutorial.py:146: UserWarning: torchaudio.functional._alignment.forced_align has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release. alignments, scores = F.forced_align(emission, targets, blank=0) .. GENERATED FROM PYTHON SOURCE LINES 511-519 Conclusion ---------- In this tutorial, we looked at how to use torchaudio’s forced alignment API to align and segment speech files, and demonstrated one advanced usage: How introducing a ```` token could improve alignment accuracy when transcription errors exist. .. GENERATED FROM PYTHON SOURCE LINES 522-528 Acknowledgement --------------- Thanks to `Vineel Pratap `__ and `Zhaoheng Ni `__ for developing and open-sourcing the forced aligner API. .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 7.698 seconds) .. _sphx_glr_download_tutorials_ctc_forced_alignment_api_tutorial.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: ctc_forced_alignment_api_tutorial.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: ctc_forced_alignment_api_tutorial.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_