Models

A Model defines the neural network’s forward() method and encapsulates all of the learnable parameters in the network. Each model also provides a set of named architectures that define the precise network configuration (e.g., embedding dimension, number of layers, etc.).

Both the model type and architecture are selected via the --arch command-line argument. Once selected, a model may expose additional command-line arguments for further configuration.

Note

All fairseq Models extend BaseFairseqModel, which in turn extends torch.nn.Module. Thus any fairseq Model can be used as a stand-alone Module in other PyTorch code.
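
For instance, the standard torch.nn.Module API applies unchanged. A minimal sketch, assuming model is an already-built fairseq Model (e.g. returned by build_model() or loaded from a checkpoint):

model.eval()   # standard nn.Module train/eval switching
model.cuda()   # standard nn.Module device placement
num_params = sum(p.numel() for p in model.parameters())
print('learnable parameters:', num_params)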

Convolutional Neural Networks (CNN)

class fairseq.models.fconv.FConvModel(encoder, decoder)[source]

A fully convolutional model, i.e. a convolutional encoder and a convolutional decoder, as described in “Convolutional Sequence to Sequence Learning” (Gehring et al., 2017).

Parameters:
  • encoder (FConvEncoder) – the encoder
  • decoder (FConvDecoder) – the decoder

The Convolutional model provides the following named architectures and command-line arguments:

usage: 
        [--arch {fconv,fconv_iwslt_de_en,fconv_wmt_en_ro,fconv_wmt_en_de,fconv_wmt_en_fr}]
        [--dropout D] [--encoder-embed-dim N] [--encoder-embed-path STR]
        [--encoder-layers EXPR] [--decoder-embed-dim N]
        [--decoder-embed-path STR] [--decoder-layers EXPR]
        [--decoder-out-embed-dim N] [--decoder-attention EXPR]
        [--share-input-output-embed]

Named architectures

--arch Possible choices: fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr

Additional command-line arguments

--dropout dropout probability
--encoder-embed-dim encoder embedding dimension
--encoder-embed-path path to pre-trained encoder embedding
--encoder-layers encoder layers [(dim, kernel_size), …]
--decoder-embed-dim decoder embedding dimension
--decoder-embed-path path to pre-trained decoder embedding
--decoder-layers decoder layers [(dim, kernel_size), …]
--decoder-out-embed-dim decoder output embedding dimension
--decoder-attention decoder attention [True, …]
--share-input-output-embed

share input and output embeddings (requires --decoder-out-embed-dim and --decoder-embed-dim to be equal)

Default: False

static add_args(parser)[source]

Add model-specific arguments to the parser.

classmethod build_model(args, task)[source]

Build a new model instance.

class fairseq.models.fconv.FConvEncoder(dictionary, embed_dim=512, embed_dict=None, max_positions=1024, convolutions=((512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3)), dropout=0.1)[source]

Convolutional encoder consisting of len(convolutions) layers.

Parameters:
  • dictionary (Dictionary) – encoding dictionary
  • embed_dim (int, optional) – embedding dimension
  • embed_dict (str, optional) – filename from which to load pre-trained embeddings
  • max_positions (int, optional) – maximum supported input sequence length
  • convolutions (list, optional) – the convolutional layer structure. Each list item i corresponds to convolutional layer i. Layers are given as (out_channels, kernel_width, [residual]). Residual connections are added between layers when residual=1 (which is the default behavior).
  • dropout (float, optional) – dropout to be applied before each conv layer
forward(src_tokens, src_lengths)[source]
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – lengths of each source sentence of shape (batch)
Returns:

  • encoder_out (tuple): a tuple with two elements, where the first element is the last encoder layer’s output and the second element is the same quantity summed with the input embedding (used for attention). The shape of both tensors is (batch, src_len, embed_dim).
  • encoder_padding_mask (ByteTensor): the positions of padding elements of shape (batch, src_len)

Return type:

dict

max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out, new_order)[source]

Reorder encoder output according to new_order.

Parameters:
  • encoder_out – output from the forward() method
  • new_order (LongTensor) – desired order
Returns:

encoder_out rearranged according to new_order

class fairseq.models.fconv.FConvDecoder(dictionary, embed_dim=512, embed_dict=None, out_embed_dim=256, max_positions=1024, convolutions=((512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3), (512, 3)), attention=True, dropout=0.1, share_embed=False, positional_embeddings=True, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0)[source]

Convolutional decoder.

forward(prev_output_tokens, encoder_out_dict=None, incremental_state=None)[source]
Parameters:
  • prev_output_tokens (LongTensor) – previous decoder outputs of shape (batch, tgt_len), for input feeding/teacher forcing
  • encoder_out (Tensor, optional) – output from the encoder, used for encoder-side attention
  • incremental_state (dict) – dictionary used for storing state during Incremental decoding
Returns:

  • the last decoder layer’s output of shape (batch, tgt_len, vocab)
  • the last decoder layer’s attention weights of shape (batch, tgt_len, src_len)

Return type:

tuple

max_positions()[source]

Maximum output length supported by the decoder.

reorder_incremental_state(incremental_state, new_order)[source]

Reorder incremental state.

This should be called when the order of the input has changed from the previous time step. A typical use case is beam search, where the input order changes between time steps based on the selection of beams.

upgrade_state_dict(state_dict)[source]

Upgrade a (possibly old) state dict for new versions of fairseq.
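
The classes above can also be combined directly in Python. A minimal sketch, assuming src_dict and tgt_dict are fairseq Dictionary objects (e.g. task.source_dictionary and task.target_dictionary) and using illustrative, not recommended, layer sizes:

from fairseq.models.fconv import FConvDecoder, FConvEncoder, FConvModel

encoder = FConvEncoder(src_dict, embed_dim=256,
                       convolutions=((256, 3),) * 4, dropout=0.1)
decoder = FConvDecoder(tgt_dict, embed_dim=256, out_embed_dim=256,
                       convolutions=((256, 3),) * 4, attention=True, dropout=0.1)
model = FConvModel(encoder, decoder)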

Long Short-Term Memory (LSTM) networks

class fairseq.models.lstm.LSTMModel(encoder, decoder)[source]
static add_args(parser)[source]

Add model-specific arguments to the parser.

classmethod build_model(args, task)[source]

Build a new model instance.

class fairseq.models.lstm.LSTMEncoder(dictionary, embed_dim=512, hidden_size=512, num_layers=1, dropout_in=0.1, dropout_out=0.1, bidirectional=False, left_pad=True, pretrained_embed=None, padding_value=0.0)[source]

LSTM encoder.

forward(src_tokens, src_lengths)[source]

Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – lengths of each source sentence of shape (batch)
max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out, new_order)[source]

Reorder encoder output according to new_order.

Parameters:
  • encoder_out – output from the forward() method
  • new_order (LongTensor) – desired order
Returns:

encoder_out rearranged according to new_order

class fairseq.models.lstm.LSTMDecoder(dictionary, embed_dim=512, hidden_size=512, out_embed_dim=512, num_layers=1, dropout_in=0.1, dropout_out=0.1, attention=True, encoder_output_units=512, pretrained_embed=None, share_input_output_embed=False, adaptive_softmax_cutoff=None)[source]

LSTM decoder.

forward(prev_output_tokens, encoder_out_dict, incremental_state=None)[source]
Parameters:
  • prev_output_tokens (LongTensor) – previous decoder outputs of shape (batch, tgt_len), for input feeding/teacher forcing
  • encoder_out (Tensor, optional) – output from the encoder, used for encoder-side attention
  • incremental_state (dict) – dictionary used for storing state during Incremental decoding
Returns:

  • the last decoder layer’s output of shape (batch, tgt_len, vocab)
  • the last decoder layer’s attention weights of shape (batch, tgt_len, src_len)

Return type:

tuple

max_positions()[source]

Maximum output length supported by the decoder.

reorder_incremental_state(incremental_state, new_order)[source]

Reorder incremental state.

This should be called when the order of the input has changed from the previous time step. A typical use case is beam search, where the input order changes between time steps based on the selection of beams.
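
As with the convolutional model above, an LSTM model can be assembled directly from the encoder and decoder classes. A minimal sketch using the documented constructor defaults (src_dict and tgt_dict are assumed fairseq Dictionary objects):

from fairseq.models.lstm import LSTMDecoder, LSTMEncoder, LSTMModel

encoder = LSTMEncoder(src_dict, embed_dim=512, hidden_size=512, num_layers=1)
decoder = LSTMDecoder(tgt_dict, embed_dim=512, hidden_size=512, num_layers=1,
                      attention=True, encoder_output_units=512)
model = LSTMModel(encoder, decoder)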

Transformer (self-attention) networks

class fairseq.models.transformer.TransformerModel(encoder, decoder)[source]

Transformer model from “Attention Is All You Need” (Vaswani et al., 2017).

Parameters:
  • encoder (TransformerEncoder) – the encoder
  • decoder (TransformerDecoder) – the decoder

The Transformer model provides the following named architectures and command-line arguments:

usage: 
        [--arch {transformer,transformer_iwslt_de_en,transformer_wmt_en_de,transformer_vaswani_wmt_en_de_big,transformer_vaswani_wmt_en_fr_big,transformer_wmt_en_de_big,transformer_wmt_en_de_big_t2t}]
        [--dropout D] [--attention-dropout D] [--activation-dropout D]
        [--encoder-embed-path STR] [--encoder-embed-dim N]
        [--encoder-ffn-embed-dim N] [--encoder-layers N]
        [--encoder-attention-heads N] [--encoder-normalize-before]
        [--encoder-learned-pos] [--decoder-embed-path STR]
        [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
        [--decoder-layers N] [--decoder-attention-heads N]
        [--decoder-learned-pos] [--decoder-normalize-before]
        [--share-decoder-input-output-embed] [--share-all-embeddings]
        [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR]
        [--adaptive-softmax-dropout D] [--activation-fn {relu,gelu,gelu_fast}]

Named architectures

--arch Possible choices: transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t

Additional command-line arguments

--dropout dropout probability
--attention-dropout dropout probability for attention weights
--activation-dropout, --relu-dropout dropout probability after activation in FFN.
--encoder-embed-path path to pre-trained encoder embedding
--encoder-embed-dim encoder embedding dimension
--encoder-ffn-embed-dim encoder embedding dimension for FFN
--encoder-layers num encoder layers
--encoder-attention-heads num encoder attention heads
--encoder-normalize-before

apply layernorm before each encoder block

Default: False

--encoder-learned-pos

use learned positional embeddings in the encoder

Default: False

--decoder-embed-path path to pre-trained decoder embedding
--decoder-embed-dim decoder embedding dimension
--decoder-ffn-embed-dim decoder embedding dimension for FFN
--decoder-layers num decoder layers
--decoder-attention-heads num decoder attention heads
--decoder-learned-pos

use learned positional embeddings in the decoder

Default: False

--decoder-normalize-before

apply layernorm before each decoder block

Default: False

--share-decoder-input-output-embed

share decoder input and output embeddings

Default: False

--share-all-embeddings

share encoder, decoder and output embeddings (requires shared dictionary and embed dim)

Default: False

--no-token-positional-embeddings

if set, disables positional embeddings (outside self attention)

Default: False

--adaptive-softmax-cutoff comma separated list of adaptive softmax cutoff points. Must be used with adaptive_loss criterion
--adaptive-softmax-dropout sets adaptive softmax dropout for the tail projections
--activation-fn

Possible choices: relu, gelu, gelu_fast

Which activation function to use

static add_args(parser)[source]

Add model-specific arguments to the parser.

classmethod build_model(args, task)[source]

Build a new model instance.

class fairseq.models.transformer.TransformerEncoder(args, dictionary, embed_tokens)[source]

Transformer encoder consisting of args.encoder_layers layers. Each layer is a TransformerEncoderLayer.

Parameters:
  • args (argparse.Namespace) – parsed command-line arguments
  • dictionary (Dictionary) – encoding dictionary
  • embed_tokens (torch.nn.Embedding) – input embedding
forward(src_tokens, src_lengths)[source]
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (torch.LongTensor) – lengths of each source sentence of shape (batch)
Returns:

  • encoder_out (Tensor): the last encoder layer’s output of shape (src_len, batch, embed_dim)
  • encoder_padding_mask (ByteTensor): the positions of padding elements of shape (batch, src_len)

Return type:

dict

max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out, new_order)[source]

Reorder encoder output according to new_order.

Parameters:
  • encoder_out – output from the forward() method
  • new_order (LongTensor) – desired order
Returns:

encoder_out rearranged according to new_order

upgrade_state_dict_named(state_dict, name)[source]

Upgrade a (possibly old) state dict for new versions of fairseq.

class fairseq.models.transformer.TransformerEncoderLayer(args)[source]

Encoder layer block.

In the original paper each operation (multi-head attention or FFN) is postprocessed with: dropout -> add residual -> layernorm. In the tensor2tensor code they suggest that learning is more robust when preprocessing each layer with layernorm and postprocessing with: dropout -> add residual. We default to the approach in the paper, but the tensor2tensor approach can be enabled by setting args.encoder_normalize_before to True.

Parameters:args (argparse.Namespace) – parsed command-line arguments
forward(x, encoder_padding_mask)[source]
Parameters:
  • x (Tensor) – input to the layer of shape (seq_len, batch, embed_dim)
  • encoder_padding_mask (ByteTensor) – binary ByteTensor of shape (batch, src_len) where padding elements are indicated by 1.
Returns:

encoded output of shape (seq_len, batch, embed_dim)

upgrade_state_dict_named(state_dict, name)[source]

Rename layer norm states from …layer_norms.0.weight to …self_attn_layer_norm.weight and …layer_norms.1.weight to …final_layer_norm.weight
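
The tensor2tensor-style pre-norm behavior described for this layer is normally switched on with --encoder-normalize-before (and --decoder-normalize-before for the decoder), but it can also be baked into a named architecture via register_model_architecture() (see “Adding new models” below). A sketch with a hypothetical architecture name, assuming the module-level base_architecture() helper in fairseq.models.transformer to fill in the remaining defaults:

from fairseq.models import register_model_architecture
from fairseq.models.transformer import base_architecture

@register_model_architecture('transformer', 'transformer_prenorm_example')
def transformer_prenorm_example(args):
    # enable layernorm-before-block (tensor2tensor style) in encoder and decoder
    args.encoder_normalize_before = getattr(args, 'encoder_normalize_before', True)
    args.decoder_normalize_before = getattr(args, 'decoder_normalize_before', True)
    base_architecture(args)  # fill remaining transformer defaults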

class fairseq.models.transformer.TransformerDecoder(args, dictionary, embed_tokens, no_encoder_attn=False, final_norm=True)[source]

Transformer decoder consisting of args.decoder_layers layers. Each layer is a TransformerDecoderLayer.

Parameters:
  • args (argparse.Namespace) – parsed command-line arguments
  • dictionary (Dictionary) – decoding dictionary
  • embed_tokens (torch.nn.Embedding) – output embedding
  • no_encoder_attn (bool, optional) – whether to skip attending to encoder outputs (default: False).
  • final_norm (bool, optional) – apply layer norm to the output of the final decoder layer (default: True).
forward(prev_output_tokens, encoder_out=None, incremental_state=None)[source]
Parameters:
  • prev_output_tokens (LongTensor) – previous decoder outputs of shape (batch, tgt_len), for input feeding/teacher forcing
  • encoder_out (Tensor, optional) – output from the encoder, used for encoder-side attention
  • incremental_state (dict) – dictionary used for storing state during Incremental decoding
Returns:

  • the last decoder layer’s output of shape (batch, tgt_len, vocab)
  • the last decoder layer’s attention weights of shape (batch, tgt_len, src_len)

Return type:

tuple

max_positions()[source]

Maximum output length supported by the decoder.

upgrade_state_dict_named(state_dict, name)[source]

Upgrade a (possibly old) state dict for new versions of fairseq.

class fairseq.models.transformer.TransformerDecoderLayer(args, no_encoder_attn=False)[source]

Decoder layer block.

In the original paper each operation (multi-head attention, encoder attention or FFN) is postprocessed with: dropout -> add residual -> layernorm. In the tensor2tensor code they suggest that learning is more robust when preprocessing each layer with layernorm and postprocessing with: dropout -> add residual. We default to the approach in the paper, but the tensor2tensor approach can be enabled by setting args.decoder_normalize_before to True.

Parameters:
  • args (argparse.Namespace) – parsed command-line arguments
  • no_encoder_attn (bool, optional) – whether to skip attending to encoder outputs (default: False).
forward(x, encoder_out, encoder_padding_mask, incremental_state, prev_self_attn_state=None, prev_attn_state=None, self_attn_mask=None, self_attn_padding_mask=None)[source]
Parameters:
  • x (Tensor) – input to the layer of shape (seq_len, batch, embed_dim)
  • encoder_padding_mask (ByteTensor) – binary ByteTensor of shape (batch, src_len) where padding elements are indicated by 1.
Returns:

encoded output of shape (seq_len, batch, embed_dim)

Adding new models

fairseq.models.register_model(name)[source]

New model types can be added to fairseq with the register_model() function decorator.

For example:

@register_model('lstm')
class LSTM(FairseqModel):
    (...)

Note

All models must implement the BaseFairseqModel interface. Typically you will extend FairseqModel for sequence-to-sequence tasks or FairseqLanguageModel for language modeling tasks.

Parameters:name (str) – the name of the model
fairseq.models.register_model_architecture(model_name, arch_name)[source]

New model architectures can be added to fairseq with the register_model_architecture() function decorator. After registration, model architectures can be selected with the --arch command-line argument.

For example:

@register_model_architecture('lstm', 'lstm_luong_wmt_en_de')
def lstm_luong_wmt_en_de(args):
    args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 1000)
    (...)

The decorated function should take a single argument args, which is an argparse.Namespace of arguments parsed from the command line. The decorated function should modify these arguments in-place to match the desired architecture.

Parameters:
  • model_name (str) – the name of the Model (Model must already be registered)
  • arch_name (str) – the name of the model architecture (--arch)
class fairseq.models.BaseFairseqModel[source]

Base class for fairseq models.

static add_args(parser)[source]

Add model-specific arguments to the parser.

classmethod build_model(args, task)[source]

Build a new model instance.

get_normalized_probs(net_output, log_probs, sample=None)[source]

Get normalized probabilities (or log probs) from a net’s output.

get_targets(sample, net_output)[source]

Get targets from either the sample or the net’s output.

load_state_dict(state_dict, strict=True)[source]

Copies parameters and buffers from state_dict into this module and its descendants.

Overrides the method in nn.Module. Compared with that method this additionally “upgrades” state_dicts from old checkpoints.
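
For example, restoring parameters from a saved checkpoint might look like the sketch below; the 'model' key reflects the usual fairseq checkpoint layout and is an assumption here, and model is an already-constructed instance:

import torch

state = torch.load('checkpoint_best.pt', map_location='cpu')
model.load_state_dict(state['model'], strict=True)  # also upgrades old state dicts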

make_generation_fast_(**kwargs)[source]

Optimize model for faster generation.

max_decoder_positions()[source]

Maximum length supported by the decoder.

max_positions()[source]

Maximum length supported by the model.

prepare_for_onnx_export_(**kwargs)[source]

Make model exportable via ONNX trace.

upgrade_state_dict(state_dict)[source]

Upgrade old state dicts to work with newer code.

upgrade_state_dict_named(state_dict, name)[source]

Upgrade old state dicts to work with newer code.

Parameters:
  • state_dict (dict) – state dictionary to upgrade, in place
  • name (str) – the state dict key corresponding to the current module
class fairseq.models.FairseqModel(encoder, decoder)[source]

Base class for encoder-decoder models.

Parameters:
  • encoder (FairseqEncoder) – the encoder
  • decoder (FairseqDecoder) – the decoder
forward(src_tokens, src_lengths, prev_output_tokens)[source]

Run the forward pass for an encoder-decoder model.

First feed a batch of source tokens through the encoder. Then, feed the encoder output and previous decoder outputs (i.e., input feeding/teacher forcing) to the decoder to produce the next outputs:

encoder_out = self.encoder(src_tokens, src_lengths)
return self.decoder(prev_output_tokens, encoder_out)
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – source sentence lengths of shape (batch)
  • prev_output_tokens (LongTensor) – previous decoder outputs of shape (batch, tgt_len), for input feeding/teacher forcing
Returns:

the decoder’s output, typically of shape (batch, tgt_len, vocab)

max_positions()[source]

Maximum length supported by the model.

class fairseq.models.FairseqLanguageModel(decoder)[source]

Base class for decoder-only models.

Parameters:decoder (FairseqDecoder) – the decoder
forward(src_tokens, src_lengths)[source]

Run the forward pass for a decoder-only model.

Feeds a batch of tokens through the decoder to predict the next tokens.

Parameters:
  • src_tokens (LongTensor) – tokens on which to condition the decoder, of shape (batch, tgt_len)
  • src_lengths (LongTensor) – source sentence lengths of shape (batch)
Returns:

the decoder’s output, typically of shape (batch, seq_len, vocab)
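
A minimal sketch of scoring with a decoder-only model (lm, src_tokens and src_lengths are assumed to exist):

net_output = lm(src_tokens, src_lengths)
# convert the net's output to log-probabilities over the vocabulary
lprobs = lm.get_normalized_probs(net_output, log_probs=True)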

max_positions()[source]

Maximum length supported by the model.

remove_head()[source]

Removes the head of the model (e.g., the softmax layer) to conserve space when it is not needed.

supported_targets
class fairseq.models.FairseqEncoder(dictionary)[source]

Base class for encoders.

forward(src_tokens, src_lengths)[source]
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – lengths of each source sentence of shape (batch)
max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out, new_order)[source]

Reorder encoder output according to new_order.

Parameters:
  • encoder_out – output from the forward() method
  • new_order (LongTensor) – desired order
Returns:

encoder_out rearranged according to new_order
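
During beam search this is typically used to tile each sentence's encoder output beam_size times before decoding starts. An illustrative sketch (the index construction is for illustration only, not fairseq's exact generator code):

import torch

beam_size = 5
encoder_out = encoder(src_tokens, src_lengths)
# repeat each batch element beam_size times: [0, 0, ..., 1, 1, ..., 2, 2, ...]
new_order = torch.arange(src_tokens.size(0)).view(-1, 1).repeat(1, beam_size).view(-1)
encoder_out = encoder.reorder_encoder_out(encoder_out, new_order)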

upgrade_state_dict(state_dict)[source]

Upgrade a (possibly old) state dict for new versions of fairseq.

class fairseq.models.CompositeEncoder(encoders)[source]

A wrapper around a dictionary of FairseqEncoder objects.

We run forward on each encoder and return a dictionary of outputs. The first encoder’s dictionary is used for initialization.

Parameters:encoders (dict) – a dictionary of FairseqEncoder objects.
forward(src_tokens, src_lengths)[source]
Parameters:
  • src_tokens (LongTensor) – tokens in the source language of shape (batch, src_len)
  • src_lengths (LongTensor) – lengths of each source sentence of shape (batch)
Returns:

the outputs from each Encoder

Return type:

dict

max_positions()[source]

Maximum input length supported by the encoder.

reorder_encoder_out(encoder_out, new_order)[source]

Reorder encoder output according to new_order.

upgrade_state_dict(state_dict)[source]

Upgrade a (possibly old) state dict for new versions of fairseq.
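
A sketch of wrapping two already-constructed encoders over the same source dictionary (conv_encoder and lstm_encoder are assumptions):

from fairseq.models import CompositeEncoder

composite = CompositeEncoder({'conv': conv_encoder, 'lstm': lstm_encoder})
outputs = composite(src_tokens, src_lengths)  # one output per wrapped encoder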

class fairseq.models.FairseqDecoder(dictionary)[source]

Base class for decoders.

forward(prev_output_tokens, encoder_out)[source]
Parameters:
  • prev_output_tokens (LongTensor) – previous decoder outputs of shape (batch, tgt_len), for input feeding/teacher forcing
  • encoder_out (Tensor, optional) – output from the encoder, used for encoder-side attention
Returns:

  • the last decoder layer’s output of shape (batch, tgt_len, vocab)
  • the last decoder layer’s attention weights of shape (batch, tgt_len, src_len)

Return type:

tuple

get_normalized_probs(net_output, log_probs, sample)[source]

Get normalized probabilities (or log probs) from a net’s output.

max_positions()[source]

Maximum input length supported by the decoder.

upgrade_state_dict(state_dict)[source]

Upgrade a (possibly old) state dict for new versions of fairseq.

Incremental decoding

class fairseq.models.FairseqIncrementalDecoder(dictionary)[source]

Base class for incremental decoders.

Incremental decoding is a special mode at inference time where the Model only receives a single timestep of input corresponding to the immediately previous output token (for input feeding) and must produce the next output incrementally. Thus the model must cache any long-term state that is needed about the sequence, e.g., hidden states, convolutional states, etc.

Compared to the standard FairseqDecoder interface, the incremental decoder interface allows forward() functions to take an extra keyword argument (incremental_state) that can be used to cache state across time-steps.

The FairseqIncrementalDecoder interface also defines the reorder_incremental_state() method, which is used during beam search to select and reorder the incremental state based on the selection of beams.

To learn more about how incremental decoding works, refer to this blog.
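
A minimal greedy decoding sketch built on this interface (decoder, encoder_out, bos, batch_size and max_len are assumptions; fairseq's own generation additionally implements beam search and calls reorder_incremental_state() when beams are reordered):

import torch

incremental_state = {}
tokens = torch.full((batch_size, 1), bos, dtype=torch.long)
for step in range(max_len):
    # cached state in incremental_state means only the newest timestep is computed
    out, attn = decoder(tokens, encoder_out, incremental_state=incremental_state)
    next_token = out[:, -1, :].argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)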

forward(prev_output_tokens, encoder_out, incremental_state=None)[source]
Parameters:
  • prev_output_tokens (LongTensor) – previous decoder outputs of shape (batch, tgt_len), for input feeding/teacher forcing
  • encoder_out (Tensor, optional) – output from the encoder, used for encoder-side attention
  • incremental_state (dict) – dictionary used for storing state during Incremental decoding
Returns:

  • the last decoder layer’s output of shape (batch, tgt_len, vocab)
  • the last decoder layer’s attention weights of shape (batch, tgt_len, src_len)

Return type:

tuple

reorder_incremental_state(incremental_state, new_order)[source]

Reorder incremental state.

This should be called when the order of the input has changed from the previous time step. A typical use case is beam search, where the input order changes between time steps based on the selection of beams.

set_beam_size(beam_size)[source]

Sets the beam size in the decoder and all children.