Transformer Layers#
Transformer layers use self-attention mechanisms to process sequences in parallel, enabling efficient training on long sequences. They are the foundation of modern NLP models (BERT, GPT) and increasingly used in vision and other domains.
Transformer: Complete encoder-decoder architecture
TransformerEncoder/Decoder: Standalone encoder or decoder stacks
TransformerEncoderLayer/DecoderLayer: Individual transformer blocks
MultiheadAttention: Core attention mechanism used throughout
Key parameters:
d_model: Dimension of the model (embedding dimension)
nhead: Number of attention heads
num_encoder_layers/num_decoder_layers: Number of stacked layers
dim_feedforward: Dimension of feedforward network
dropout: Dropout rate for regularization
Transformer#
Complete encoder-decoder transformer architecture.
-
class Transformer : public torch::nn::ModuleHolder<TransformerImpl>#
A ModuleHolder subclass for TransformerImpl.
See the documentation for TransformerImpl class to learn what methods it provides, and examples of how to use Transformer with torch::nn::TransformerOptions. See the documentation for ModuleHolder to learn about PyTorch's module storage semantics.
Public Types
-
using Impl = TransformerImpl#
-
class TransformerImpl : public torch::nn::Cloneable<TransformerImpl>#
A transformer model.
User is able to modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.
See https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html to learn about the exact behavior of this transformer model.
See the documentation for torch::nn::Transformer class to learn what constructor arguments are supported for this transformer model.
Example:
Transformer trans(TransformerOptions(512, 8));
Public Functions
-
explicit TransformerImpl(TransformerOptions options_)#
-
Tensor forward(const Tensor &src, const Tensor &tgt, const Tensor &src_mask = {}, const Tensor &tgt_mask = {}, const Tensor &memory_mask = {}, const Tensor &src_key_padding_mask = {}, const Tensor &tgt_key_padding_mask = {}, const Tensor &memory_key_padding_mask = {})#
Forward function for the Transformer module.
Args:
src: the sequence to the encoder (required).
tgt: the sequence to the decoder (required).
src_mask: the additive mask for the src sequence (optional).
tgt_mask: the additive mask for the tgt sequence (optional).
memory_mask: the additive mask for the encoder output (optional).
src_key_padding_mask: the ByteTensor mask for src keys per batch (optional).
tgt_key_padding_mask: the ByteTensor mask for tgt keys per batch (optional).
memory_key_padding_mask: the ByteTensor mask for memory keys per batch (optional).
Shape:
src: (S, N, E)
tgt: (T, N, E)
src_mask: (S, S)
tgt_mask: (T, T)
memory_mask: (T, S)
src_key_padding_mask: (N, S)
tgt_key_padding_mask: (N, T)
memory_key_padding_mask: (N, S)
output: (T, N, E)
where S is the source sequence length, T is the target sequence length, N is the batch size, and E is the feature (embedding) dimension.
Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
[src/tgt/memory]_key_padding_mask marks specified elements in the key to be ignored by the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, positions with the value of True will be ignored while positions with the value of False will be unchanged.
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of the transformer is the same as the input sequence (i.e. target) length of the decoder.
-
virtual void reset() override#
reset() must perform initialization of all members with reference semantics, most importantly parameters, buffers and submodules.
-
void reset_parameters()#
Public Members
-
TransformerOptions options#
options with which this Transformer was constructed
Public Static Functions
-
static Tensor generate_square_subsequent_mask(int64_t sz)#
Generate a square mask for the sequence.
The masked positions are filled with -inf in float type. Unmasked positions are filled with 0.0 in float type.
Note: This function will always return a CPU tensor.
This function requires the platform to support IEEE 754, since -inf is guaranteed to be valid only when IEEE 754 is supported. If the platform doesn't support IEEE 754, this function fills the mask with the smallest float number instead of -inf, and a one-time warning is emitted as well.
Friends
- friend struct torch::nn::AnyModuleHolder
Example:
auto transformer = torch::nn::Transformer(
torch::nn::TransformerOptions()
.d_model(512)
.nhead(8)
.num_encoder_layers(6)
.num_decoder_layers(6)
.dim_feedforward(2048)
.dropout(0.1));
TransformerEncoder#
Stack of encoder layers for processing source sequences.
-
class TransformerEncoder : public torch::nn::ModuleHolder<TransformerEncoderImpl>#
A ModuleHolder subclass for TransformerEncoderImpl.
See the documentation for TransformerEncoderImpl class to learn what methods it provides, and examples of how to use TransformerEncoder with torch::nn::TransformerEncoderOptions. See the documentation for ModuleHolder to learn about PyTorch's module storage semantics.
Public Types
-
using Impl = TransformerEncoderImpl#
-
class TransformerEncoderImpl : public torch::nn::Cloneable<TransformerEncoderImpl>#
TransformerEncoder module.
See https://pytorch.org/docs/main/generated/torch.nn.TransformerEncoder.html to learn about the exact behavior of this encoder module.
See the documentation for torch::nn::TransformerEncoder class to learn what constructor arguments are supported for this encoder module.
Example:
TransformerEncoderLayer encoderLayer(TransformerEncoderLayerOptions(512, 8).dropout(0.1));
TransformerEncoder encoder(TransformerEncoderOptions(encoderLayer, 6).norm(LayerNorm(LayerNormOptions({2}))));
Public Functions
-
inline TransformerEncoderImpl(TransformerEncoderLayer encoder_layer, int64_t num_layers)#
-
explicit TransformerEncoderImpl(TransformerEncoderOptions options_)#
-
Tensor forward(const Tensor &src, const Tensor &src_mask = {}, const Tensor &src_key_padding_mask = {})#
-
virtual void reset() override#
reset() must perform initialization of all members with reference semantics, most importantly parameters, buffers and submodules.
-
void reset_parameters()#
Public Members
-
TransformerEncoderOptions options#
options with which this TransformerEncoder was constructed
-
ModuleList layers = nullptr#
module list that contains all the encoder layers
Friends
- friend struct torch::nn::AnyModuleHolder
TransformerDecoder#
Stack of decoder layers for generating target sequences.
-
class TransformerDecoder : public torch::nn::ModuleHolder<TransformerDecoderImpl>#
A ModuleHolder subclass for TransformerDecoderImpl.
See the documentation for TransformerDecoderImpl class to learn what methods it provides, and examples of how to use TransformerDecoder with torch::nn::TransformerDecoderOptions. See the documentation for ModuleHolder to learn about PyTorch's module storage semantics.
Public Types
-
using Impl = TransformerDecoderImpl#
-
class TransformerDecoderImpl : public torch::nn::Cloneable<TransformerDecoderImpl>#
TransformerDecoder is a stack of N decoder layers.
See https://pytorch.org/docs/main/generated/torch.nn.TransformerDecoder.html to learn about the exact behavior of this decoder module.
See the documentation for torch::nn::TransformerDecoderOptions class to learn what constructor arguments are supported for this decoder module.
Example:
TransformerDecoderLayer decoder_layer(TransformerDecoderLayerOptions(512, 8).dropout(0.1));
TransformerDecoder transformer_decoder(TransformerDecoderOptions(decoder_layer, 6).norm(LayerNorm(LayerNormOptions({2}))));
const auto memory = torch::rand({10, 32, 512});
const auto tgt = torch::rand({20, 32, 512});
auto out = transformer_decoder(tgt, memory);
Public Functions
-
inline TransformerDecoderImpl(TransformerDecoderLayer decoder_layer, int64_t num_layers)#
-
explicit TransformerDecoderImpl(TransformerDecoderOptions options_)#
-
virtual void reset() override#
reset() must perform initialization of all members with reference semantics, most importantly parameters, buffers and submodules.
-
void reset_parameters()#
-
Tensor forward(const Tensor &tgt, const Tensor &memory, const Tensor &tgt_mask = {}, const Tensor &memory_mask = {}, const Tensor &tgt_key_padding_mask = {}, const Tensor &memory_key_padding_mask = {})#
Pass the inputs (and mask) through the decoder layer in turn.
Args:
tgt: the sequence to the decoder layer (required).
memory: the sequence from the last layer of the encoder (required).
tgt_mask: the mask for the tgt sequence (optional).
memory_mask: the mask for the memory sequence (optional).
tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
memory_key_padding_mask: the mask for the memory keys per batch (optional).
Public Members
-
TransformerDecoderOptions options#
The options used to configure this module.
-
ModuleList layers = {nullptr}#
module list that contains the cloned decoder layers
Friends
- friend struct torch::nn::AnyModuleHolder
TransformerEncoderLayer#
Single encoder layer with self-attention and feedforward network.
-
class TransformerEncoderLayerImpl : public torch::nn::Cloneable<TransformerEncoderLayerImpl>#
TransformerEncoderLayer module.
See https://pytorch.org/docs/main/generated/torch.nn.TransformerEncoderLayer.html to learn about the exact behavior of this encoder layer module.
See the documentation for torch::nn::TransformerEncoderLayer class to learn what constructor arguments are supported for this encoder layer module.
Example:
TransformerEncoderLayer encoderLayer(TransformerEncoderLayerOptions(512, 8).dropout(0.1));
Public Functions
-
inline TransformerEncoderLayerImpl(int64_t d_model, int64_t nhead)#
-
explicit TransformerEncoderLayerImpl(TransformerEncoderLayerOptions options_)#
-
Tensor forward(const Tensor &src, const Tensor &src_mask = {}, const Tensor &src_key_padding_mask = {})#
-
virtual void reset() override#
reset() must perform initialization of all members with reference semantics, most importantly parameters, buffers and submodules.
-
void reset_parameters()#
Public Members
-
TransformerEncoderLayerOptions options#
options with which this TransformerEncoderLayer was constructed
-
MultiheadAttention self_attn = nullptr#
the self-attention module used by this layer
Friends
- friend struct torch::nn::AnyModuleHolder
TransformerDecoderLayer#
Single decoder layer with self-attention, cross-attention, and feedforward network.
MultiheadAttention#
Scaled dot-product attention with multiple parallel heads.
-
class MultiheadAttention : public torch::nn::ModuleHolder<MultiheadAttentionImpl>#
A ModuleHolder subclass for MultiheadAttentionImpl.
See the documentation for MultiheadAttentionImpl class to learn what methods it provides, and examples of how to use MultiheadAttention with torch::nn::MultiheadAttentionOptions. See the documentation for ModuleHolder to learn about PyTorch's module storage semantics.
Public Types
-
using Impl = MultiheadAttentionImpl#
-
class MultiheadAttentionImpl : public torch::nn::Cloneable<MultiheadAttentionImpl>#
Applies multi-head attention over query, key and value inputs.
See https://pytorch.org/docs/main/nn.html#torch.nn.MultiheadAttention to learn about the exact behavior of this module.
See the documentation for torch::nn::MultiheadAttentionOptions class to learn what constructor arguments are supported for this module.
Example:
MultiheadAttention model(MultiheadAttentionOptions(20, 10).bias(false));
Public Functions
-
inline MultiheadAttentionImpl(int64_t embed_dim, int64_t num_heads)#
-
explicit MultiheadAttentionImpl(const MultiheadAttentionOptions &options_)#
-
std::tuple<Tensor, Tensor> forward(const Tensor &query, const Tensor &key, const Tensor &value, const Tensor &key_padding_mask = {}, bool need_weights = true, const Tensor &attn_mask = {}, bool average_attn_weights = true)#
-
virtual void reset() override#
reset() must perform initialization of all members with reference semantics, most importantly parameters, buffers and submodules.
-
void _reset_parameters()#
Friends
- friend struct torch::nn::AnyModuleHolder