MVDR
- class torchaudio.transforms.MVDR(ref_channel: int = 0, solution: str = 'ref_channel', multi_mask: bool = False, diag_loading: bool = True, diag_eps: float = 1e-07, online: bool = False)
- Minimum Variance Distortionless Response (MVDR) module that performs MVDR beamforming with Time-Frequency masks.
- Based on https://github.com/espnet/espnet/blob/master/espnet2/enh/layers/beamformer.py
- We provide three solutions of MVDR beamforming. One is based on reference channel selection [Souden et al., 2009] (solution=ref_channel).
\[\mathbf{w}_{\text{MVDR}}(f) = \frac{\mathbf{\Phi}_{\mathbf{NN}}^{-1}(f)\,\mathbf{\Phi}_{\mathbf{SS}}(f)}{\text{Trace}\left(\mathbf{\Phi}_{\mathbf{NN}}^{-1}(f)\,\mathbf{\Phi}_{\mathbf{SS}}(f)\right)}\,\bm{u}\]
- where \(\mathbf{\Phi}_{\mathbf{SS}}\) and \(\mathbf{\Phi}_{\mathbf{NN}}\) are the covariance matrices of speech and noise, respectively. \(\bm{u}\) is a one-hot vector that selects the reference channel.
- The other two solutions are based on the steering vector (solution=stv_evd or solution=stv_power).
\[\mathbf{w}_{\text{MVDR}}(f) = \frac{\mathbf{\Phi}_{\mathbf{NN}}^{-1}(f)\,\bm{v}(f)}{\bm{v}^{\mathsf{H}}(f)\,\mathbf{\Phi}_{\mathbf{NN}}^{-1}(f)\,\bm{v}(f)}\]
- where \(\bm{v}\) is the acoustic transfer function or the steering vector. \((\cdot)^{\mathsf{H}}\) denotes the Hermitian conjugate operation.
- We apply either eigenvalue decomposition [Higuchi et al., 2016] or the power method [Mises and Pollaczek-Geiringer, 1929] to get the steering vector from the PSD matrix of speech.
- After estimating the beamforming weight, the enhanced Short-time Fourier Transform (STFT) is obtained by
\[\hat{\mathbf{S}} = \mathbf{w}^{\mathsf{H}}\mathbf{Y}, \quad \mathbf{w} \in \mathbb{C}^{M \times F}\]
- where \(\mathbf{Y}\) and \(\hat{\mathbf{S}}\) are the STFT of the multi-channel noisy speech and the single-channel enhanced speech, respectively.
- For online streaming audio, we provide a recursive method [Higuchi et al., 2017] to update the PSD matrices of speech and noise, respectively.
- Parameters:
- ref_channel (int, optional) – Reference channel for beamforming. (Default: 0)
- solution (str, optional) – Solution to compute the MVDR beamforming weights. Options: [ref_channel, stv_evd, stv_power]. (Default: ref_channel)
- multi_mask (bool, optional) – If True, only accepts multi-channel Time-Frequency masks. (Default: False)
- diag_loading (bool, optional) – If True, applies diagonal loading to the covariance matrix of the noise. (Default: True)
- diag_eps (float, optional) – The coefficient multiplied by the identity matrix for diagonal loading. It is only effective when diag_loading is set to True. (Default: 1e-7)
- online (bool, optional) – If True, updates the MVDR beamforming weights based on the previous covariance matrices. (Default: False)
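The two weight formulas in the class description can be sketched directly in plain PyTorch. This is an illustrative sketch, not the library's implementation: the helper names (mvdr_weights_ref_channel, mvdr_weights_steering, apply_weights) are hypothetical, the tensors are unbatched with shape (freq, channel, channel), and diagonal loading is simplified to adding diag_eps times the identity matrix, which may differ from what the module does internally.

```python
import torch


def mvdr_weights_ref_channel(psd_s, psd_n, ref_channel=0, diag_eps=1e-7):
    """Reference-channel formula. psd_s, psd_n: (freq, channel, channel)."""
    n_channel = psd_n.shape[-1]
    # Simplified diagonal loading (assumption): add diag_eps * I to Phi_NN.
    psd_n = psd_n + diag_eps * torch.eye(n_channel).to(psd_n.dtype)
    # Phi_NN^{-1} Phi_SS via a linear solve rather than an explicit inverse.
    numerator = torch.linalg.solve(psd_n, psd_s)
    trace = numerator.diagonal(dim1=-2, dim2=-1).sum(dim=-1)
    # One-hot vector u that selects the reference channel.
    u = torch.zeros(n_channel, dtype=psd_n.dtype)
    u[ref_channel] = 1
    return (numerator / trace[..., None, None]) @ u  # (freq, channel)


def mvdr_weights_steering(psd_n, stv, diag_eps=1e-7):
    """Steering-vector formula. psd_n: (freq, channel, channel), stv: (freq, channel)."""
    n_channel = psd_n.shape[-1]
    psd_n = psd_n + diag_eps * torch.eye(n_channel).to(psd_n.dtype)
    # Phi_NN^{-1} v
    numerator = torch.linalg.solve(psd_n, stv[..., None]).squeeze(-1)
    # v^H Phi_NN^{-1} v
    denominator = (stv.conj() * numerator).sum(dim=-1, keepdim=True)
    return numerator / denominator  # (freq, channel)


def apply_weights(weights, specgram):
    """S_hat(f, t) = w(f)^H Y(f, t). weights: (freq, channel), specgram: (channel, freq, time)."""
    return torch.einsum("fc,cft->ft", weights.conj(), specgram)


# Synthetic data: a rank-1 speech PSD v v^H and a Hermitian positive-definite noise PSD.
torch.manual_seed(0)
n_freq, n_channel, n_time = 5, 4, 20
stv = torch.randn(n_freq, n_channel, dtype=torch.cdouble)
psd_s = stv[:, :, None] * stv[:, None, :].conj()
a = torch.randn(n_freq, n_channel, n_channel, dtype=torch.cdouble)
psd_n = a @ a.conj().transpose(-1, -2) + torch.eye(n_channel).to(torch.cdouble)

w_ref = mvdr_weights_ref_channel(psd_s, psd_n, ref_channel=1)
w_stv = mvdr_weights_steering(psd_n, stv)

specgram = torch.randn(n_channel, n_freq, n_time, dtype=torch.cdouble)
enhanced = apply_weights(w_ref, specgram)  # single-channel output, (freq, time)
```

With a rank-1 speech PSD, the steering-vector weights satisfy the distortionless constraint w^H v = 1, while the reference-channel weights map v onto its reference-channel entry. Both properties make convenient sanity checks for a sketch like this one.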
 
- Note: To improve numerical stability, the input spectrogram is converted to double precision (torch.complex128 or torch.cdouble) for internal computation. The output spectrogram is converted back to the dtype of the input spectrogram to be compatible with other modules.
- Note: If you use the stv_evd solution, the gradients for the same input may not be identical if the eigenvalues of the PSD matrix are not distinct (i.e. some eigenvalues are close or identical).
- forward(specgram: Tensor, mask_s: Tensor, mask_n: Optional[Tensor] = None) -> Tensor
- Perform MVDR beamforming.
- Parameters:
- specgram (torch.Tensor) – Multi-channel complex-valued spectrum. Tensor with dimensions (…, channel, freq, time) 
- mask_s (torch.Tensor) – Time-Frequency mask of target speech. Tensor with dimensions (…, freq, time) if multi_mask is False, or with dimensions (…, channel, freq, time) if multi_mask is True.
- mask_n (torch.Tensor or None, optional) – Time-Frequency mask of noise. Tensor with dimensions (…, freq, time) if multi_mask is False, or with dimensions (…, channel, freq, time) if multi_mask is True. (Default: None)
 
- Returns:
- Single-channel complex-valued enhanced spectrum with dimensions (…, freq, time). 
- Return type: