.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/audio_data_augmentation_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_tutorials_audio_data_augmentation_tutorial.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_audio_data_augmentation_tutorial.py:

Audio Data Augmentation
=======================

**Author**: Moto Hira

``torchaudio`` provides a variety of ways to augment audio data.
In this tutorial, we look into ways of applying effects, filters,
RIR (room impulse response), and codecs.
At the end, we synthesize noisy speech heard over a phone from clean speech.

.. GENERATED FROM PYTHON SOURCE LINES 15-25

.. code-block:: default

    import torch
    import torchaudio
    import torchaudio.functional as F

    print(torch.__version__)
    print(torchaudio.__version__)

    import matplotlib.pyplot as plt

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2.10.0.dev20251013+cu126
    2.8.0a0+1d65bbe

.. GENERATED FROM PYTHON SOURCE LINES 26-31

Preparation
-----------

First, we import the modules and download the audio assets we use
in this tutorial.

.. GENERATED FROM PYTHON SOURCE LINES 31-42

.. code-block:: default

    from IPython.display import Audio
    from torchaudio.utils import _download_asset

    SAMPLE_WAV = _download_asset("tutorial-assets/steam-train-whistle-daniel_simon.wav")
    SAMPLE_RIR = _download_asset("tutorial-assets/Lab41-SRI-VOiCES-rm1-impulse-mc01-stu-clo-8000hz.wav")
    SAMPLE_SPEECH = _download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042-8000hz.wav")
    SAMPLE_NOISE = _download_asset("tutorial-assets/Lab41-SRI-VOiCES-rm1-babb-mc01-stu-clo-8000hz.wav")

.. GENERATED FROM PYTHON SOURCE LINES 43-46

Loading the data
----------------

.. GENERATED FROM PYTHON SOURCE LINES 46-51

.. code-block:: default

    waveform1, sample_rate = torchaudio.load(SAMPLE_WAV, channels_first=False)
    print(waveform1.shape, sample_rate)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    torch.Size([109368, 2]) 44100

.. GENERATED FROM PYTHON SOURCE LINES 52-54

Let’s listen to the audio.

.. GENERATED FROM PYTHON SOURCE LINES 54-75

.. code-block:: default

    def plot_waveform(waveform, sample_rate, title="Waveform", xlim=None):
        waveform = waveform.numpy()

        num_channels, num_frames = waveform.shape
        time_axis = torch.arange(0, num_frames) / sample_rate

        figure, axes = plt.subplots(num_channels, 1)
        if num_channels == 1:
            axes = [axes]
        for c in range(num_channels):
            axes[c].plot(time_axis, waveform[c], linewidth=1)
            axes[c].grid(True)
            if num_channels > 1:
                axes[c].set_ylabel(f"Channel {c+1}")
            if xlim:
                axes[c].set_xlim(xlim)
        figure.suptitle(title)

.. GENERATED FROM PYTHON SOURCE LINES 77-96

.. code-block:: default

    def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
        waveform = waveform.numpy()

        num_channels, _ = waveform.shape

        figure, axes = plt.subplots(num_channels, 1)
        if num_channels == 1:
            axes = [axes]
        for c in range(num_channels):
            axes[c].specgram(waveform[c], Fs=sample_rate)
            if num_channels > 1:
                axes[c].set_ylabel(f"Channel {c+1}")
            if xlim:
                axes[c].set_xlim(xlim)
        figure.suptitle(title)

.. GENERATED FROM PYTHON SOURCE LINES 97-102

Because the file was loaded with ``channels_first=False``, ``waveform1`` has shape
``(time, channel)``. The plotting helpers and ``Audio`` expect ``(channel, time)``,
so we pass the transposed tensor.

.. code-block:: default

    plot_waveform(waveform1.T, sample_rate, title="Original", xlim=(-0.1, 3.2))
    plot_specgram(waveform1.T, sample_rate, title="Original", xlim=(0, 3.04))
    Audio(waveform1.T, rate=sample_rate)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_001.png
         :alt: Original
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_001.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_002.png
         :alt: Original
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_002.png
         :class: sphx-glr-multi-img
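
In addition to playing the audio inline with ``Audio``, you can write a tensor back
to disk and listen with any player. The snippet below is a minimal sketch; the output
path ``./original_copy.wav`` is arbitrary, and ``channels_first=False`` is passed
because ``waveform1`` is stored as ``(time, channel)``.

.. code-block:: default

    # Write the loaded waveform back out as a WAV file.
    # `channels_first=False` tells torchaudio.save that the tensor layout
    # is (time, channel), matching how the file was loaded above.
    torchaudio.save("./original_copy.wav", waveform1, sample_rate, channels_first=False)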


.. GENERATED FROM PYTHON SOURCE LINES 103-118

Simulating room reverberation
-----------------------------

`Convolution reverb <https://en.wikipedia.org/wiki/Convolution_reverb>`__
is a technique used to make clean audio sound as though it has been
produced in a different environment.

Using a Room Impulse Response (RIR), for instance, we can make clean speech
sound as though it has been uttered in a conference room.

For this process, we need RIR data. The following data are from the VOiCES
dataset, but you can record your own: just turn on your microphone and clap
your hands.

.. GENERATED FROM PYTHON SOURCE LINES 118-124

.. code-block:: default

    rir_raw, sample_rate = torchaudio.load(SAMPLE_RIR)
    plot_waveform(rir_raw, sample_rate, title="Room Impulse Response (raw)")
    plot_specgram(rir_raw, sample_rate, title="Room Impulse Response (raw)")
    Audio(rir_raw, rate=sample_rate)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_003.png
         :alt: Room Impulse Response (raw)
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_003.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_004.png
         :alt: Room Impulse Response (raw)
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_004.png
         :class: sphx-glr-multi-img
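
If you want to try recording your own impulse response as suggested above, the snippet
below is a minimal sketch of one way to do it. It assumes the third-party
``sounddevice`` package (``pip install sounddevice``), which is not part of
``torchaudio`` and is not used anywhere else in this tutorial; ``my_rir_raw`` is a
name introduced only for this example, and any recording method that produces a
``(channel, time)`` tensor works just as well.

.. code-block:: default

    import sounddevice as sd

    # Record a few seconds from the default microphone while you clap once.
    record_seconds = 3
    recording = sd.rec(int(record_seconds * sample_rate), samplerate=sample_rate, channels=1)
    sd.wait()  # block until the recording is finished

    # sounddevice returns a NumPy array of shape (time, channel);
    # transpose it to (channel, time) to match torchaudio.load.
    my_rir_raw = torch.from_numpy(recording).T.float()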


.. GENERATED FROM PYTHON SOURCE LINES 125-128

First, we need to clean up the RIR: we extract the main impulse and
normalize it to unit energy by dividing by its L2 norm.

.. GENERATED FROM PYTHON SOURCE LINES 128-134

.. code-block:: default

    rir = rir_raw[:, int(sample_rate * 1.01) : int(sample_rate * 1.3)]
    rir = rir / torch.linalg.vector_norm(rir, ord=2)

    plot_waveform(rir, sample_rate, title="Room Impulse Response")

.. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_005.png
   :alt: Room Impulse Response
   :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_005.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 135-138

Then, using :py:func:`torchaudio.functional.fftconvolve`, we convolve the
speech signal with the RIR.

.. GENERATED FROM PYTHON SOURCE LINES 138-142

.. code-block:: default

    speech, _ = torchaudio.load(SAMPLE_SPEECH)
    augmented = F.fftconvolve(speech, rir)

.. GENERATED FROM PYTHON SOURCE LINES 143-146

Original
~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 146-151

.. code-block:: default

    plot_waveform(speech, sample_rate, title="Original")
    plot_specgram(speech, sample_rate, title="Original")
    Audio(speech, rate=sample_rate)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_006.png
         :alt: Original
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_006.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_007.png
         :alt: Original
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_007.png
         :class: sphx-glr-multi-img


.. GENERATED FROM PYTHON SOURCE LINES 152-155

RIR applied
~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 155-161

.. code-block:: default

    plot_waveform(augmented, sample_rate, title="RIR Applied")
    plot_specgram(augmented, sample_rate, title="RIR Applied")
    Audio(augmented, rate=sample_rate)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_008.png
         :alt: RIR Applied
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_008.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_009.png
         :alt: RIR Applied
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_009.png
         :class: sphx-glr-multi-img
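
Note that :py:func:`torchaudio.functional.fftconvolve` performs a full convolution by
default, so ``augmented`` is ``rir.shape[-1] - 1`` frames longer than ``speech``. If
you need the augmented clip to have the same length as the original (for example, to
reuse existing labels or alignments), a simple option is to trim the reverberant tail,
as sketched below; ``aligned`` is a name introduced only for this example.

.. code-block:: default

    # Keep only the first speech.shape[-1] frames so the augmented waveform
    # has the same number of frames as the original speech.
    aligned = augmented[..., : speech.shape[-1]]
    print(speech.shape, augmented.shape, aligned.shape)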


.. GENERATED FROM PYTHON SOURCE LINES 162-178

Adding background noise
-----------------------

To introduce background noise to audio data, we can add a noise Tensor to
the Tensor representing the audio data according to some desired
signal-to-noise ratio (SNR)
[`wikipedia <https://en.wikipedia.org/wiki/Signal-to-noise_ratio>`__],
which determines the intensity of the audio data relative to that of the
noise in the output.

.. math::

   \mathrm{SNR} = \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}

.. math::

   \mathrm{SNR_{dB}} = 10 \log_{10} \mathrm{SNR}

To add noise to audio data at the desired SNRs, we use
:py:func:`torchaudio.functional.add_noise`.

.. GENERATED FROM PYTHON SOURCE LINES 178-187

.. code-block:: default

    speech, _ = torchaudio.load(SAMPLE_SPEECH)
    noise, _ = torchaudio.load(SAMPLE_NOISE)
    noise = noise[:, : speech.shape[1]]

    snr_dbs = torch.tensor([20, 10, 3])
    noisy_speeches = F.add_noise(speech, noise, snr_dbs)

.. GENERATED FROM PYTHON SOURCE LINES 188-191

Background noise
~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 191-196

.. code-block:: default

    plot_waveform(noise, sample_rate, title="Background noise")
    plot_specgram(noise, sample_rate, title="Background noise")
    Audio(noise, rate=sample_rate)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_010.png
         :alt: Background noise
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_010.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_011.png
         :alt: Background noise
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_011.png
         :class: sphx-glr-multi-img


.. GENERATED FROM PYTHON SOURCE LINES 197-200

SNR 20 dB
~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 200-206

.. code-block:: default

    snr_db, noisy_speech = snr_dbs[0], noisy_speeches[0:1]
    plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    Audio(noisy_speech, rate=sample_rate)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_012.png
         :alt: SNR: 20 [dB]
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_012.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_013.png
         :alt: SNR: 20 [dB]
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_013.png
         :class: sphx-glr-multi-img


.. GENERATED FROM PYTHON SOURCE LINES 207-210

SNR 10 dB
~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 210-216

.. code-block:: default

    snr_db, noisy_speech = snr_dbs[1], noisy_speeches[1:2]
    plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    Audio(noisy_speech, rate=sample_rate)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_014.png
         :alt: SNR: 10 [dB]
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_014.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_015.png
         :alt: SNR: 10 [dB]
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_015.png
         :class: sphx-glr-multi-img


.. GENERATED FROM PYTHON SOURCE LINES 217-220

SNR 3 dB
~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 220-225

.. code-block:: default

    snr_db, noisy_speech = snr_dbs[2], noisy_speeches[2:3]
    plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    Audio(noisy_speech, rate=sample_rate)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_016.png
         :alt: SNR: 3 [dB]
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_016.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_017.png
         :alt: SNR: 3 [dB]
         :srcset: /tutorials/images/sphx_glr_audio_data_augmentation_tutorial_017.png
         :class: sphx-glr-multi-img
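
For reference, the scaling that :py:func:`torchaudio.functional.add_noise` applies
follows directly from the SNR definition above: the noise is scaled so that the ratio
of signal power to scaled-noise power equals the requested SNR. The snippet below is
an illustrative sketch of that computation for a single SNR value, not the library's
actual implementation; ``add_noise_manual`` is a helper name introduced only for this
example.

.. code-block:: default

    def add_noise_manual(speech, noise, snr_db):
        # Power is the mean squared amplitude along the time axis.
        speech_power = speech.pow(2).mean(dim=-1, keepdim=True)
        noise_power = noise.pow(2).mean(dim=-1, keepdim=True)

        # Convert the SNR from dB to a linear ratio, then solve
        # SNR = speech_power / (scale**2 * noise_power) for scale.
        snr = 10 ** (snr_db / 10)
        scale = torch.sqrt(speech_power / (snr * noise_power))
        return speech + scale * noise

    # The result should agree with F.add_noise up to numerical precision.
    manual = add_noise_manual(speech, noise, 10.0)
    reference = F.add_noise(speech, noise, torch.tensor([10.0]))
    print(torch.allclose(manual, reference, atol=1e-5))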


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 7.726 seconds)

.. _sphx_glr_download_tutorials_audio_data_augmentation_tutorial.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: audio_data_augmentation_tutorial.py <audio_data_augmentation_tutorial.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: audio_data_augmentation_tutorial.ipynb <audio_data_augmentation_tutorial.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_