.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "tutorials/audio_feature_extractions_tutorial.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_tutorials_audio_feature_extractions_tutorial.py: Audio Feature Extractions ========================= **Author**: `Moto Hira `__ ``torchaudio`` implements feature extractions commonly used in the audio domain. They are available in ``torchaudio.functional`` and ``torchaudio.transforms``. ``functional`` implements features as standalone functions. They are stateless. ``transforms`` implements features as objects, using implementations from ``functional`` and ``torch.nn.Module``. They can be serialized using TorchScript. .. GENERATED FROM PYTHON SOURCE LINES 19-30 .. code-block:: default import torch import torchaudio import torchaudio.functional as F import torchaudio.transforms as T print(torch.__version__) print(torchaudio.__version__) import matplotlib.pyplot as plt .. rst-class:: sphx-glr-script-out .. code-block:: none 2.10.0.dev20251013+cu126 2.8.0a0+1d65bbe .. GENERATED FROM PYTHON SOURCE LINES 31-42 Overview of audio features -------------------------- The following diagram shows the relationship between common audio features and torchaudio APIs to generate them. .. image:: https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png For the complete list of available features, please refer to the documentation. .. GENERATED FROM PYTHON SOURCE LINES 45-47 Preparation ----------- .. GENERATED FROM PYTHON SOURCE LINES 47-89 .. 
code-block:: default from IPython.display import Audio from matplotlib.patches import Rectangle from torchaudio.utils import _download_asset torch.random.manual_seed(0) SAMPLE_SPEECH = _download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav") def plot_waveform(waveform, sr, title="Waveform", ax=None): waveform = waveform.numpy() num_channels, num_frames = waveform.shape time_axis = torch.arange(0, num_frames) / sr if ax is None: _, ax = plt.subplots(num_channels, 1) ax.plot(time_axis, waveform[0], linewidth=1) ax.grid(True) ax.set_xlim([0, time_axis[-1]]) ax.set_title(title) def plot_spectrogram(specgram, title=None, ylabel="freq_bin", ax=None): if ax is None: _, ax = plt.subplots(1, 1) if title is not None: ax.set_title(title) ax.set_ylabel(ylabel) power_to_db = T.AmplitudeToDB("power", 80.0) ax.imshow(power_to_db(specgram), origin="lower", aspect="auto", interpolation="nearest") def plot_fbank(fbank, title=None): fig, axs = plt.subplots(1, 1) axs.set_title(title or "Filter bank") axs.imshow(fbank, aspect="auto") axs.set_ylabel("frequency bin") axs.set_xlabel("mel bin") .. GENERATED FROM PYTHON SOURCE LINES 90-96 Spectrogram ----------- To get the frequency make-up of an audio signal as it varies with time, you can use :py:func:`torchaudio.transforms.Spectrogram`. .. GENERATED FROM PYTHON SOURCE LINES 96-106 .. code-block:: default # Load audio SPEECH_WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_SPEECH) # Define transform spectrogram = T.Spectrogram(n_fft=512) # Perform transform spec = spectrogram(SPEECH_WAVEFORM) .. GENERATED FROM PYTHON SOURCE LINES 108-114 .. code-block:: default fig, axs = plt.subplots(2, 1) plot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title="Original waveform", ax=axs[0]) plot_spectrogram(spec[0], title="spectrogram", ax=axs[1]) fig.tight_layout() .. 
image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_001.png :alt: Original waveform, spectrogram :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 116-119 .. code-block:: default Audio(SPEECH_WAVEFORM.numpy(), rate=SAMPLE_RATE) .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 120-139

The effect of the ``n_fft`` parameter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The core of spectrogram computation is the (short-time) Fourier
transform, and the ``n_fft`` parameter corresponds to :math:`N` in
the following definition of the discrete Fourier transform.

.. math::

   X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk}

(For the details of the Fourier transform, please refer to
`Wikipedia `__.)

The value of ``n_fft`` determines the resolution of the frequency axis.
However, with a higher ``n_fft`` value, the energy is distributed among
more bins, so when you visualize it, it might look more blurry, even
though it has higher resolution. The following illustrates this:

.. GENERATED FROM PYTHON SOURCE LINES 141-149

.. note::

   ``hop_length`` determines the time axis resolution.
   By default (i.e. ``hop_length=None`` and ``win_length=None``),
   the value of ``n_fft // 4`` is used.
   Here we use the same ``hop_length`` value across different ``n_fft``
   so that they have the same number of elements in the time axis.

.. GENERATED FROM PYTHON SOURCE LINES 150-160

.. code-block:: default

   n_ffts = [32, 128, 512, 2048]
   hop_length = 64

   specs = []
   for n_fft in n_ffts:
       spectrogram = T.Spectrogram(n_fft=n_fft, hop_length=hop_length)
       spec = spectrogram(SPEECH_WAVEFORM)
       specs.append(spec)

.. GENERATED FROM PYTHON SOURCE LINES 162-169

.. code-block:: default

   fig, axs = plt.subplots(len(specs), 1, sharex=True)
   for i, (spec, n_fft) in enumerate(zip(specs, n_ffts)):
       plot_spectrogram(spec[0], ylabel=f"n_fft={n_fft}", ax=axs[i])
       axs[i].set_xlabel(None)
   fig.tight_layout()

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_002.png
   :alt: audio feature extractions tutorial
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_002.png
   :class: sphx-glr-single-img

..
GENERATED FROM PYTHON SOURCE LINES 170-179

When comparing signals, it is desirable to use the same sampling rate;
however, if you must use different sampling rates, care must be taken
when interpreting the meaning of ``n_fft``.
Recall that ``n_fft`` determines the resolution of the frequency axis
for a given sampling rate. In other words, what each bin on the
frequency axis represents is subject to the sampling rate.

As we have seen above, changing the value of ``n_fft`` does not change
the coverage of the frequency range for the same input signal.

.. GENERATED FROM PYTHON SOURCE LINES 182-184

Let's downsample the audio and apply the spectrogram with the same
``n_fft`` value.

.. GENERATED FROM PYTHON SOURCE LINES 185-191

.. code-block:: default

   # Downsample to half of the original sample rate
   speech2 = torchaudio.functional.resample(SPEECH_WAVEFORM, SAMPLE_RATE, SAMPLE_RATE // 2)

   # Upsample to the original sample rate
   speech3 = torchaudio.functional.resample(speech2, SAMPLE_RATE // 2, SAMPLE_RATE)

.. GENERATED FROM PYTHON SOURCE LINES 193-201

.. code-block:: default

   # Apply the same spectrogram
   spectrogram = T.Spectrogram(n_fft=512)

   spec0 = spectrogram(SPEECH_WAVEFORM)
   spec2 = spectrogram(speech2)
   spec3 = spectrogram(speech3)

.. GENERATED FROM PYTHON SOURCE LINES 203-212

.. code-block:: default

   # Visualize it
   fig, axs = plt.subplots(3, 1)
   plot_spectrogram(spec0[0], ylabel="Original", ax=axs[0])
   axs[0].add_patch(Rectangle((0, 3), 212, 128, edgecolor="r", facecolor="none"))
   plot_spectrogram(spec2[0], ylabel="Downsampled", ax=axs[1])
   plot_spectrogram(spec3[0], ylabel="Upsampled", ax=axs[2])
   fig.tight_layout()

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_003.png
   :alt: audio feature extractions tutorial
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_003.png
   :class: sphx-glr-single-img

..
GENERATED FROM PYTHON SOURCE LINES 213-222

In the above visualization, the second plot ("Downsampled") might give
the impression that the spectrogram is stretched.
This is because the meaning of the frequency bins is different from
that of the original.
Even though they have the same number of bins, in the second plot the
frequency axis only covers up to half of the original sampling rate.
This becomes clearer if we resample the downsampled signal again so
that it has the same sample rate as the original.

.. GENERATED FROM PYTHON SOURCE LINES 225-232

GriffinLim
----------

To recover a waveform from a spectrogram, you can use
:py:class:`torchaudio.transforms.GriffinLim`.

The same set of parameters used for the spectrogram must be used.

.. GENERATED FROM PYTHON SOURCE LINES 232-242

.. code-block:: default

   # Define transforms
   n_fft = 1024
   spectrogram = T.Spectrogram(n_fft=n_fft)
   griffin_lim = T.GriffinLim(n_fft=n_fft)

   # Apply the transforms
   spec = spectrogram(SPEECH_WAVEFORM)
   reconstructed_waveform = griffin_lim(spec)

.. GENERATED FROM PYTHON SOURCE LINES 244-250

.. code-block:: default

   _, axes = plt.subplots(2, 1, sharex=True, sharey=True)
   plot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title="Original", ax=axes[0])
   plot_waveform(reconstructed_waveform, SAMPLE_RATE, title="Reconstructed", ax=axes[1])
   Audio(reconstructed_waveform, rate=SAMPLE_RATE)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_004.png
   :alt: Original, Reconstructed
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_004.png
   :class: sphx-glr-single-img

.. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 251-260

Mel Filter Bank
---------------

:py:func:`torchaudio.functional.melscale_fbanks` generates the filter
bank for converting frequency bins to mel-scale bins.

Since this function does not require input audio/features, there is no
equivalent transform in :py:mod:`torchaudio.transforms`.

.. GENERATED FROM PYTHON SOURCE LINES 260-274

.. code-block:: default

   n_fft = 256
   n_mels = 64
   sample_rate = 6000

   mel_filters = F.melscale_fbanks(
       int(n_fft // 2 + 1),
       n_mels=n_mels,
       f_min=0.0,
       f_max=sample_rate / 2.0,
       sample_rate=sample_rate,
       norm="slaney",
   )

.. GENERATED FROM PYTHON SOURCE LINES 276-280

.. code-block:: default

   plot_fbank(mel_filters, "Mel Filter Bank - torchaudio")

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_005.png
   :alt: Mel Filter Bank - torchaudio
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_005.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 281-289

MelSpectrogram
--------------

Generating a mel-scale spectrogram involves generating a spectrogram
and performing mel-scale conversion. In ``torchaudio``,
:py:class:`torchaudio.transforms.MelSpectrogram` provides this
functionality.

.. GENERATED FROM PYTHON SOURCE LINES 289-310

.. code-block:: default

   n_fft = 1024
   win_length = None
   hop_length = 512
   n_mels = 128

   mel_spectrogram = T.MelSpectrogram(
       sample_rate=sample_rate,
       n_fft=n_fft,
       win_length=win_length,
       hop_length=hop_length,
       center=True,
       pad_mode="reflect",
       power=2.0,
       norm="slaney",
       n_mels=n_mels,
       mel_scale="htk",
   )

   melspec = mel_spectrogram(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 312-316

.. code-block:: default

   plot_spectrogram(melspec[0], title="MelSpectrogram - torchaudio", ylabel="mel freq")

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_006.png
   :alt: MelSpectrogram - torchaudio
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_006.png
   :class: sphx-glr-single-img

..
GENERATED FROM PYTHON SOURCE LINES 317-320 MFCC ---- .. GENERATED FROM PYTHON SOURCE LINES 320-340 .. code-block:: default n_fft = 2048 win_length = None hop_length = 512 n_mels = 256 n_mfcc = 256 mfcc_transform = T.MFCC( sample_rate=sample_rate, n_mfcc=n_mfcc, melkwargs={ "n_fft": n_fft, "n_mels": n_mels, "hop_length": hop_length, "mel_scale": "htk", }, ) mfcc = mfcc_transform(SPEECH_WAVEFORM) .. GENERATED FROM PYTHON SOURCE LINES 342-345 .. code-block:: default plot_spectrogram(mfcc[0], title="MFCC") .. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_007.png :alt: MFCC :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_007.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 346-349 LFCC ---- .. GENERATED FROM PYTHON SOURCE LINES 349-368 .. code-block:: default n_fft = 2048 win_length = None hop_length = 512 n_lfcc = 256 lfcc_transform = T.LFCC( sample_rate=sample_rate, n_lfcc=n_lfcc, speckwargs={ "n_fft": n_fft, "win_length": win_length, "hop_length": hop_length, }, ) lfcc = lfcc_transform(SPEECH_WAVEFORM) plot_spectrogram(lfcc[0], title="LFCC") .. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_008.png :alt: LFCC :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_008.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 369-372 Pitch ----- .. GENERATED FROM PYTHON SOURCE LINES 372-375 .. code-block:: default pitch = F.detect_pitch_frequency(SPEECH_WAVEFORM, SAMPLE_RATE) .. GENERATED FROM PYTHON SOURCE LINES 377-396 .. 
code-block:: default def plot_pitch(waveform, sr, pitch): figure, axis = plt.subplots(1, 1) axis.set_title("Pitch Feature") axis.grid(True) end_time = waveform.shape[1] / sr time_axis = torch.linspace(0, end_time, waveform.shape[1]) axis.plot(time_axis, waveform[0], linewidth=1, color="gray", alpha=0.3) axis2 = axis.twinx() time_axis = torch.linspace(0, end_time, pitch.shape[1]) axis2.plot(time_axis, pitch[0], linewidth=2, label="Pitch", color="green") axis2.legend(loc=0) plot_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch) .. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_009.png :alt: Pitch Feature :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_009.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 5.693 seconds) .. _sphx_glr_download_tutorials_audio_feature_extractions_tutorial.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: audio_feature_extractions_tutorial.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: audio_feature_extractions_tutorial.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_