Vad¶
- class torchaudio.transforms.Vad(sample_rate: int, trigger_level: float = 7.0, trigger_time: float = 0.25, search_time: float = 1.0, allowed_gap: float = 0.25, pre_trigger_time: float = 0.0, boot_time: float = 0.35, noise_up_time: float = 0.1, noise_down_time: float = 0.01, noise_reduction_amount: float = 1.35, measure_freq: float = 20.0, measure_duration: Optional[float] = None, measure_smooth_time: float = 0.4, hp_filter_freq: float = 50.0, lp_filter_freq: float = 6000.0, hp_lifter_freq: float = 150.0, lp_lifter_freq: float = 2000.0)[source]¶
- Voice Activity Detector. Similar to SoX implementation. - Attempts to trim silence and quiet background sounds from the ends of recordings of speech. The algorithm currently uses a simple cepstral power measurement to detect voice, so may be fooled by other things, especially music. - The effect can trim only from the front of the audio, so in order to trim from the back, the reverse effect must also be used. - Parameters:
- sample_rate (int) – Sample rate of audio signal. 
- trigger_level (float, optional) – The measurement level used to trigger activity detection. This may need to be changed depending on the noise level, signal level, and other characteristics of the input audio. (Default: 7.0) 
- trigger_time (float, optional) – The time constant (in seconds) used to help ignore short bursts of sound. (Default: 0.25) 
- search_time (float, optional) – The amount of audio (in seconds) to search for quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 1.0) 
- allowed_gap (float, optional) – The allowed gap (in seconds) between quiteter/shorter bursts of audio to include prior to the detected trigger point. (Default: 0.25) 
- pre_trigger_time (float, optional) – The amount of audio (in seconds) to preserve before the trigger point and any found quieter/shorter bursts. (Default: 0.0) 
- boot_time (float, optional) The algorithm (python:internally) – estimation/reduction in order to detect the start of the wanted audio. This option sets the time for the initial noise estimate. (Default: 0.35) 
- noise_up_time (float, optional) – for when the noise level is increasing. (Default: 0.1) 
- noise_down_time (float, optional) – for when the noise level is decreasing. (Default: 0.01) 
- noise_reduction_amount (float, optional) – the detection algorithm (e.g. 0, 0.5, …). (Default: 1.35) 
- measure_freq (float, optional) – processing/measurements. (Default: 20.0) 
- measure_duration – (float or None, optional) Measurement duration. (Default: Twice the measurement period; i.e. with overlap.) 
- measure_smooth_time (float, optional) – spectral measurements. (Default: 0.4) 
- hp_filter_freq (float, optional) – at the input to the detector algorithm. (Default: 50.0) 
- lp_filter_freq (float, optional) – at the input to the detector algorithm. (Default: 6000.0) 
- hp_lifter_freq (float, optional) – in the detector algorithm. (Default: 150.0) 
- lp_lifter_freq (float, optional) – in the detector algorithm. (Default: 2000.0) 
 
 - Example
- >>> waveform, sample_rate = torchaudio.load("test.wav", normalize=True) >>> waveform_reversed, sample_rate = apply_effects_tensor(waveform, sample_rate, [["reverse"]]) >>> transform = transforms.Vad(sample_rate=sample_rate, trigger_level=7.5) >>> waveform_reversed_front_trim = transform(waveform_reversed) >>> waveform_end_trim, sample_rate = apply_effects_tensor( >>> waveform_reversed_front_trim, sample_rate, [["reverse"]] >>> ) 
- Reference:
 - forward(waveform: Tensor) Tensor[source]¶
- Parameters:
- waveform (Tensor) – Tensor of audio of dimension (channels, time) or (time) Tensor of shape (channels, time) is treated as a multi-channel recording of the same event and the resulting output will be trimmed to the earliest voice activity in any channel.