torchaudio.models.wav2vec2_model
- torchaudio.models.wav2vec2_model(extractor_mode: str, extractor_conv_layer_config: Optional[List[Tuple[int, int, int]]], extractor_conv_bias: bool, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_pos_conv_kernel: int, encoder_pos_conv_groups: int, encoder_num_layers: int, encoder_num_heads: int, encoder_attention_dropout: float, encoder_ff_interm_features: int, encoder_ff_interm_dropout: float, encoder_dropout: float, encoder_layer_norm_first: bool, encoder_layer_drop: float, aux_num_out: Optional[int]) → Wav2Vec2Model
- Builds a custom Wav2Vec2Model.
- Note: The “feature extractor” below corresponds to ConvFeatureExtractionModel in the original fairseq implementation. This is referred to as the “(convolutional) feature encoder” in the wav2vec 2.0 paper [Baevski et al., 2020]. The “encoder” below corresponds to TransformerEncoder, which is referred to as the “Transformer” in the paper.
- Parameters:
- extractor_mode (str) – Operation mode of the feature extractor. Valid values are "group_norm" or "layer_norm". If "group_norm", then a single normalization is applied in the first convolution block. Otherwise, all the convolution blocks will have layer normalization. This option corresponds to extractor_mode from fairseq.
- extractor_conv_layer_config (list of integer tuples or None) – Configuration of the convolution layers in the feature extractor, given as a list of per-layer settings, i.e. [(output_channel, kernel_size, stride), ...]. If None is provided, then the following default value is used: [(512, 10, 5), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 2, 2), (512, 2, 2)]. This option corresponds to conv_feature_layers from fairseq.
- extractor_conv_bias (bool) – Whether to include a bias term in each convolution operation. This option corresponds to conv_bias from fairseq.
- encoder_embed_dim (int) – The dimension of the embedding in the encoder. This option corresponds to encoder_embed_dim from fairseq.
- encoder_projection_dropout (float) – The dropout probability applied after the input feature is projected to encoder_embed_dim. This option corresponds to dropout_input from fairseq.
- encoder_pos_conv_kernel (int) – The kernel size of the convolutional positional embeddings. This option corresponds to conv_pos from fairseq.
- encoder_pos_conv_groups (int) – The number of groups of the convolutional positional embeddings. This option corresponds to conv_pos_groups from fairseq.
- encoder_num_layers (int) – The number of self-attention layers in the transformer block. This option corresponds to encoder_layers from fairseq.
- encoder_num_heads (int) – The number of heads in the self-attention layers. This option corresponds to encoder_attention_heads from fairseq.
- encoder_attention_dropout (float) – The dropout probability applied after softmax in the self-attention layer. This option corresponds to attention_dropout from fairseq.
- encoder_ff_interm_features (int) – The dimension of the hidden features in the feed-forward layer. This option corresponds to encoder_ffn_embed_dim from fairseq.
- encoder_ff_interm_dropout (float) – The dropout probability applied in the feed-forward layer. This option corresponds to activation_dropout from fairseq.
- encoder_dropout (float) – The dropout probability applied at the end of the feed-forward layer. This option corresponds to dropout from fairseq.
- encoder_layer_norm_first (bool) – Controls the order of layer norm in the transformer layer and in each encoder layer. If True, in the transformer layer, layer norm is applied before features are fed to the encoder layers, and in each encoder layer, two layer norms are applied before and after self-attention. If False, in the transformer layer, layer norm is applied after features are fed to the encoder layers, and in each encoder layer, two layer norms are applied after self-attention, before and after feed forward. This option corresponds to layer_norm_first from fairseq.
- encoder_layer_drop (float) – Probability of dropping each encoder layer during training. This option corresponds to layerdrop from fairseq.
- aux_num_out (int or None) – When provided, attach an extra linear layer on top of encoder, which can be used for fine-tuning. 
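
To make the effect of extractor_conv_layer_config more concrete, the sketch below estimates how the default convolution stack turns raw samples into frames. It assumes unpadded 1-D convolutions with dilation 1 and a 16 kHz input; the helper names frame_geometry and num_frames are illustrative and are not part of torchaudio.

```python
from typing import List, Tuple

# Default extractor_conv_layer_config as listed in the parameter description:
# one (output_channel, kernel_size, stride) tuple per convolution layer.
DEFAULT_CONV_LAYERS: List[Tuple[int, int, int]] = [
    (512, 10, 5),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 2, 2),
    (512, 2, 2),
]


def frame_geometry(conv_layers: List[Tuple[int, int, int]]) -> Tuple[int, int]:
    """Return (hop, receptive_field) in input samples.

    Assumes unpadded convolutions with dilation 1 (an assumption of this sketch).
    """
    hop = 1              # distance between consecutive output frames
    receptive_field = 1  # input samples seen by one output frame
    for _, kernel_size, stride in conv_layers:
        receptive_field += (kernel_size - 1) * hop
        hop *= stride
    return hop, receptive_field


def num_frames(num_samples: int, conv_layers: List[Tuple[int, int, int]]) -> int:
    """Number of output frames for a waveform of num_samples samples."""
    length = num_samples
    for _, kernel_size, stride in conv_layers:
        # Standard output-length formula for an unpadded convolution.
        length = max(0, (length - kernel_size) // stride + 1)
    return length


hop, receptive_field = frame_geometry(DEFAULT_CONV_LAYERS)
frames_in_one_second = num_frames(16000, DEFAULT_CONV_LAYERS)
print(hop, receptive_field, frames_in_one_second)
```

Under these assumptions, each frame covers 400 input samples and consecutive frames are 320 samples apart, so one second of 16 kHz audio yields 49 frames.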
 
- Returns:
- The resulting model. 
- Return type: