Data Loading and Utilities

Datasets

Datasets define the data format and provide helpers for creating mini-batches.

class fairseq.data.FairseqDataset[source]

A dataset that provides helpers for batching.

collater(samples)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch suitable for forwarding with a Model
Return type:dict
num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.
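
For illustration, here is a minimal sketch of a custom dataset implementing this interface. The sample dict layout, the right-padding scheme, and the TensorListDataset name are illustrative assumptions, not fairseq requirements:

    import numpy as np
    import torch
    from fairseq.data import FairseqDataset

    class TensorListDataset(FairseqDataset):
        """Hypothetical dataset wrapping a list of 1D LongTensors."""

        def __init__(self, tensors, pad_idx):
            self.tensors = tensors
            self.pad_idx = pad_idx

        def __getitem__(self, index):
            return {"id": index, "tokens": self.tensors[index]}

        def __len__(self):
            return len(self.tensors)

        def collater(self, samples):
            # Right-pad every sample to the length of the longest one.
            max_len = max(len(s["tokens"]) for s in samples)
            batch = torch.full((len(samples), max_len), self.pad_idx, dtype=torch.long)
            for i, s in enumerate(samples):
                batch[i, : len(s["tokens"])] = s["tokens"]
            return {
                "id": torch.LongTensor([s["id"] for s in samples]),
                "ntokens": sum(len(s["tokens"]) for s in samples),
                "net_input": {"src_tokens": batch},
            }

        def num_tokens(self, index):
            # Enforces --max-tokens during batching.
            return len(self.tensors[index])

        def size(self, index):
            # Used when filtering with --max-positions.
            return len(self.tensors[index])

        def ordered_indices(self):
            # Batch by length so that less padding is needed.
            return np.argsort([len(t) for t in self.tensors])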

class fairseq.data.LanguagePairDataset(src, src_sizes, src_dict, tgt=None, tgt_sizes=None, tgt_dict=None, left_pad_source=True, left_pad_target=False, max_source_positions=1024, max_target_positions=1024, shuffle=True, input_feeding=True, remove_eos_from_source=False, append_eos_to_target=False)[source]

A pair of torch.utils.data.Dataset instances holding parallel source and target data.

Parameters:
  • src (torch.utils.data.Dataset) – source dataset to wrap
  • src_sizes (List[int]) – source sentence lengths
  • src_dict (Dictionary) – source vocabulary
  • tgt (torch.utils.data.Dataset, optional) – target dataset to wrap
  • tgt_sizes (List[int], optional) – target sentence lengths
  • tgt_dict (Dictionary, optional) – target vocabulary
  • left_pad_source (bool, optional) – pad source tensors on the left side (default: True).
  • left_pad_target (bool, optional) – pad target tensors on the left side (default: False).
  • max_source_positions (int, optional) – max number of tokens in the source sentence (default: 1024).
  • max_target_positions (int, optional) – max number of tokens in the target sentence (default: 1024).
  • shuffle (bool, optional) – shuffle dataset elements before batching (default: True).
  • input_feeding (bool, optional) – create a shifted version of the targets to be passed into the model for input feeding/teacher forcing (default: True).
  • remove_eos_from_source (bool, optional) – if set, removes eos from end of source if it’s present (default: False).
  • append_eos_to_target (bool, optional) – if set, appends eos to end of target if it’s absent (default: False).
collater(samples)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch with the following keys:
  • id (LongTensor): example IDs in the original input order
  • ntokens (int): total number of tokens in the batch
  • net_input (dict): the input to the Model, containing keys:
    • src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the left if left_pad_source is True.
    • src_lengths (LongTensor): 1D Tensor of the unpadded lengths of each source sentence of shape (bsz)
    • prev_output_tokens (LongTensor): a padded 2D Tensor of tokens in the target sentence, shifted right by one position for input feeding/teacher forcing, of shape (bsz, tgt_len). This key will not be present if input_feeding is False. Padding will appear on the left if left_pad_target is True.
  • target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the left if left_pad_target is True.
Return type:dict
num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.
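
To make the collater output concrete, here is a hedged end-to-end sketch. The toy corpus, the encode() helper, and the use of plain Python lists in place of real dataset objects (e.g., indexed binary datasets) are illustrative assumptions:

    import torch
    from fairseq.data import Dictionary, LanguagePairDataset

    def encode(line, vocab):
        # Hypothetical helper: Dictionary.add_symbol() returns the symbol's index.
        ids = [vocab.add_symbol(w) for w in line.split()]
        return torch.LongTensor(ids + [vocab.eos()])

    src_dict, tgt_dict = Dictionary(), Dictionary()
    src = [encode(s, src_dict) for s in ["hello world", "hello"]]
    tgt = [encode(s, tgt_dict) for s in ["bonjour le monde", "bonjour"]]

    dataset = LanguagePairDataset(
        src, [len(t) for t in src], src_dict,
        tgt=tgt, tgt_sizes=[len(t) for t in tgt], tgt_dict=tgt_dict,
    )
    batch = dataset.collater([dataset[i] for i in range(len(dataset))])
    print(batch["net_input"]["src_tokens"])          # left-padded source tokens
    print(batch["net_input"]["prev_output_tokens"])  # shifted targets for teacher forcing
    print(batch["target"])                           # right-padded target tokens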

class fairseq.data.MonolingualDataset(dataset, sizes, src_vocab, tgt_vocab, add_eos_for_other_targets, shuffle, targets=None, add_bos_token=False)[source]

A wrapper around torch.utils.data.Dataset for monolingual data.

Parameters:
  • dataset (torch.utils.data.Dataset) – dataset to wrap
  • sizes (List[int]) – sentence lengths
  • src_vocab (Dictionary) – source vocabulary
  • tgt_vocab (Dictionary) – target vocabulary
  • shuffle (bool, optional) – shuffle the elements before batching (default: True).
collater(samples)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch with the following keys:
  • id (LongTensor): example IDs in the original input order
  • ntokens (int): total number of tokens in the batch
  • net_input (dict): the input to the Model, containing keys:
    • src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the right.
  • target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the right.
Return type:dict
num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.

Helper Datasets

These datasets wrap other fairseq.data.FairseqDataset instances and provide additional functionality:

class fairseq.data.BacktranslationDataset(tgt_dataset, src_dict, tgt_dict=None, backtranslation_fn=None, output_collater=None, cuda=True, **kwargs)[source]

Sets up a backtranslation dataset that takes a tgt batch, generates a src batch using a tgt-src backtranslation function (backtranslation_fn), and returns the corresponding {generated src, input tgt} batch.

Parameters:
  • tgt_dataset (FairseqDataset) – the dataset to be backtranslated. Only the source side of this dataset will be used. After backtranslation, the source sentences in this dataset will be returned as the targets.
  • src_dict (Dictionary) – the dictionary of backtranslated sentences.
  • tgt_dict (Dictionary, optional) – the dictionary of sentences to be backtranslated.
  • backtranslation_fn (callable, optional) – function to call to generate backtranslations. This is typically the generate method of a SequenceGenerator object. Pass in None when it is not available at initialization time, and use set_backtranslation_fn function to set it when available.
  • output_collater (callable, optional) – function to call on the backtranslated samples to create the final batch (default: tgt_dataset.collater).
  • cuda (bool, optional) – use GPU for generation (default: True).
collater(samples)[source]

Merge and backtranslate a list of samples to form a mini-batch.

Using the samples from tgt_dataset, load a collated target sample to feed to the backtranslation model. Then take the backtranslation with the best score as the source and the original input as the target.

Note: we expect tgt_dataset to provide a function collater() that will collate samples into the format expected by backtranslation_fn. After backtranslation, we will feed the new list of samples (i.e., the (backtranslated source, original source) pairs) to output_collater and return the result.

Parameters:samples (List[dict]) – samples to backtranslate and collate
Returns:a mini-batch with keys coming from output_collater
Return type:dict
num_tokens(index)[source]

Just use the tgt dataset's num_tokens().

ordered_indices()[source]

Just use the tgt dataset's ordered_indices().

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

Note: we use tgt_dataset to approximate the length of the source sentence, since we do not know the actual length until after backtranslation.

supports_prefetch

Whether this dataset supports prefetching.
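
A schematic sketch of the deferred wiring described above for backtranslation_fn. The names mono_tgt_dataset, src_dict, tgt_dict, and generator are placeholders for objects built elsewhere, so this snippet is illustrative rather than runnable as-is:

    from fairseq.data import BacktranslationDataset

    # Placeholders (assumed built elsewhere): mono_tgt_dataset is a FairseqDataset
    # over monolingual target-side text; generator is a tgt->src SequenceGenerator.
    bt_data = BacktranslationDataset(
        tgt_dataset=mono_tgt_dataset,
        src_dict=src_dict,
        tgt_dict=tgt_dict,
        backtranslation_fn=None,  # backward model may not exist yet
    )
    # Once the backward model is available, attach its generate method:
    bt_data.set_backtranslation_fn(generator.generate)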

class fairseq.data.ConcatDataset(datasets, sample_ratios=1)[source]

Concatenate multiple datasets, optionally repeating each one according to sample_ratios to upsample it relative to the others.

collater(samples)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch suitable for forwarding with a Model
Return type:dict
num_tokens(index: int)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return indices sorted by length, so that less padding is needed.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(idx: int)[source]

Return an example’s size as a float or tuple.

supports_prefetch

Whether this dataset supports prefetching.
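
For example (a hedged sketch, where general_data and in_domain_data are placeholder FairseqDataset instances), a small in-domain corpus can be upsampled relative to a large general one:

    from fairseq.data import ConcatDataset

    # Visit the in-domain data four times for every pass over the general data.
    combined = ConcatDataset([general_data, in_domain_data], sample_ratios=[1, 4])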

class fairseq.data.RoundRobinZipDatasets(datasets, eval_key=None)[source]

Zip multiple FairseqDataset instances together.

Shorter datasets are repeated in a round-robin fashion to match the length of the longest one.

Parameters:
  • datasets (Dict[str, FairseqDataset]) – a dictionary of FairseqDataset instances, keyed by name.
  • eval_key (str, optional) – a key used at evaluation time that causes this instance to pass through batches from datasets[eval_key].
collater(samples)[source]

Merge a list of samples to form a mini-batch.

num_tokens(index)[source]

Return an example’s length (number of tokens), used for batching.

ordered_indices()[source]

Ordered indices for batching.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.
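
A hedged sketch for the multilingual case, where en_de_data and en_fr_data are placeholder LanguagePairDataset instances; an OrderedDict gives a stable key order:

    from collections import OrderedDict
    from fairseq.data import RoundRobinZipDatasets

    multi = RoundRobinZipDatasets(
        OrderedDict([("en-de", en_de_data), ("en-fr", en_fr_data)]),
        eval_key=None,  # each batch then contains one sub-batch per key
    )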

class fairseq.data.TransformEosDataset(dataset, eos, append_eos_to_src=False, remove_eos_from_src=False, append_eos_to_tgt=False, remove_eos_from_tgt=False, has_target=True)[source]

A FairseqDataset wrapper that appends EOS to, or removes EOS from, the source and/or target.

Note that the transformation is applied in collater().

Parameters:
  • dataset (FairseqDataset) – dataset to wrap
  • eos (int) – index of the end-of-sentence symbol
  • append_eos_to_src (bool, optional) – append EOS to the end of src
  • remove_eos_from_src (bool, optional) – remove EOS from the end of src
  • append_eos_to_tgt (bool, optional) – append EOS to the end of tgt
  • remove_eos_from_tgt (bool, optional) – remove EOS from the end of tgt
collater(samples)[source]

Merge a list of samples to form a mini-batch.

Parameters:samples (List[dict]) – samples to collate
Returns:a mini-batch suitable for forwarding with a Model
Return type:dict
num_tokens(index)[source]

Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]

Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]

Prefetch the data required for this epoch.

size(index)[source]

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch

Whether this dataset supports prefetching.
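
A hedged sketch, with base_dataset and vocab as placeholders, that strips EOS from the source side at collate time without modifying the underlying data:

    from fairseq.data import TransformEosDataset

    no_src_eos = TransformEosDataset(
        base_dataset,
        eos=vocab.eos(),
        remove_eos_from_src=True,  # applied inside collater()
    )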

Dictionary

class fairseq.data.Dictionary(pad='<pad>', eos='</s>', unk='<unk>', bos='<s>')[source]

A mapping from symbols to consecutive integers

add_symbol(word, n=1)[source]

Adds a word to the dictionary

bos()[source]

Helper to get index of beginning-of-sentence symbol

eos()[source]

Helper to get index of end-of-sentence symbol

finalize(threshold=-1, nwords=-1, padding_factor=8)[source]

Sort symbols by frequency in descending order, ignoring special ones.

Parameters:
  • threshold – defines the minimum word count (default: -1).
  • nwords – defines the total number of words in the final dictionary, including special symbols (default: -1).
  • padding_factor – can be used to pad the dictionary size to be a multiple of 8, which is important on some hardware, e.g., Nvidia Tensor Cores (default: 8).
index(sym)[source]

Returns the index of the specified symbol

classmethod load(f, ignore_utf_errors=False)[source]

Loads the dictionary from a text file with the format:

    <symbol0> <count0>
    <symbol1> <count1>
    ...

pad()[source]

Helper to get index of pad symbol

save(f)[source]

Stores dictionary into a text file

string(tensor, bpe_symbol=None, escape_unk=False)[source]

Helper for converting a tensor of token indices to a string.

Can optionally remove BPE symbols or escape <unk> words.

unk()[source]

Helper to get index of unk symbol

unk_string(escape=False)[source]

Return unknown string, optionally escaped as: <<unk>>

update(new_dict)[source]

Updates counts from a new dictionary.
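
A small runnable sketch of the API above (the toy corpus is illustrative):

    import torch
    from fairseq.data import Dictionary

    d = Dictionary()  # pad/eos/unk/bos are registered automatically
    for word in "the quick brown fox the the".split():
        d.add_symbol(word)
    d.finalize(threshold=1, padding_factor=8)  # sort by frequency; pad size to a multiple of 8

    ids = torch.LongTensor([d.index(w) for w in "the fox".split()] + [d.eos()])
    print(d.string(ids))  # -> "the fox"
    d.save("dict.txt")    # "<symbol> <count>" lines; Dictionary.load("dict.txt") reads them back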

Iterators

class fairseq.data.CountingIterator(iterable, start=0)[source]

Wrapper around an iterable that maintains the iteration count.

Parameters:
  • iterable (iterable) – iterable to wrap
  • start (int, optional) – initial value of the iteration count (default: 0).
count

number of elements consumed from this iterator

Type:int
has_next()[source]

Whether the iterator has more elements (i.e., has not been exhausted).

skip(num_to_skip)[source]

Fast-forward the iterator by skipping num_to_skip elements.
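
A runnable sketch:

    from fairseq.data import CountingIterator

    itr = CountingIterator(list(range(5)))
    next(itr)
    next(itr)
    print(itr.count)       # 2 elements consumed so far
    itr.skip(2)            # fast-forward past two more
    print(itr.has_next())  # True: one element remains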

class fairseq.data.EpochBatchIterator(dataset, collate_fn, batch_sampler, seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=0)[source]

A multi-epoch iterator over a torch.utils.data.Dataset.

Compared to torch.utils.data.DataLoader, this iterator:

  • can be reused across multiple epochs with the next_epoch_itr() method (optionally shuffled between epochs)
  • can be serialized/deserialized with the state_dict() and load_state_dict() methods
  • supports sharding with the num_shards and shard_id arguments
Parameters:
  • dataset (Dataset) – dataset from which to load the data
  • collate_fn (callable) – merges a list of samples to form a mini-batch
  • batch_sampler (Sampler) – an iterator over batches of indices
  • seed (int, optional) – seed for random number generator for reproducibility (default: 1).
  • num_shards (int, optional) – shard the data iterator into N shards (default: 1).
  • shard_id (int, optional) – which shard of the data iterator to return (default: 0).
  • num_workers (int, optional) – how many subprocesses to use for data loading. 0 means the data will be loaded in the main process (default: 0).
  • epoch (int, optional) – the epoch to start the iterator from (default: 0).
end_of_epoch()[source]

Returns whether the most recent epoch iterator has been exhausted

iterations_in_epoch

The number of consumed batches in the current epoch.

load_state_dict(state_dict)[source]

Copies the state of the iterator from the given state_dict.

next_epoch_itr(shuffle=True, fix_batches_to_gpus=False)[source]

Return a new iterator over the dataset.

Parameters:
  • shuffle (bool, optional) – shuffle batches before returning the iterator (default: True).
  • fix_batches_to_gpus (bool, optional) – ensure that batches are always allocated to the same shards across epochs. Requires that dataset supports prefetching (default: False).
state_dict()[source]

Returns a dictionary containing the whole state of the iterator.
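
A hedged sketch of the intended multi-epoch loop; dataset stands for a FairseqDataset and batches for a precomputed list of index lists serving as the batch_sampler:

    from fairseq.data import EpochBatchIterator

    epoch_itr = EpochBatchIterator(
        dataset=dataset,
        collate_fn=dataset.collater,
        batch_sampler=batches,  # e.g., [[0, 1], [2, 3], ...]
        seed=1,
    )
    for _ in range(3):
        itr = epoch_itr.next_epoch_itr(shuffle=True)
        for sample in itr:
            pass  # forward/backward here
    state = epoch_itr.state_dict()  # serializable; save alongside a checkpoint
    # ...and later, epoch_itr.load_state_dict(state) resumes where training left off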

class fairseq.data.GroupedIterator(iterable, chunk_size)[source]

Wrapper around an iterable that returns groups (chunks) of items.

Parameters:
  • iterable (iterable) – iterable to wrap
  • chunk_size (int) – size of each chunk
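
A runnable sketch, wrapping a CountingIterator so the underlying iterable exposes both a length and next(); the short final chunk reflects the padding-free chunking behavior assumed here:

    from fairseq.data import CountingIterator, GroupedIterator

    g = GroupedIterator(CountingIterator(list(range(7))), chunk_size=3)
    print(list(g))  # [[0, 1, 2], [3, 4, 5], [6]]
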
class fairseq.data.ShardedIterator(iterable, num_shards, shard_id, fill_value=None)[source]

A sharded wrapper around an iterable, padded so that all shards have the same length.

Parameters:
  • iterable (iterable) – iterable to wrap
  • num_shards (int) – number of shards to split the iterable into
  • shard_id (int) – which shard to iterate over
  • fill_value (Any, optional) – padding value used when the length of the iterable is not evenly divisible by num_shards (default: None).
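
A runnable sketch, splitting ten elements across two shards so that each shard sees every second element:

    from fairseq.data import ShardedIterator

    shard0 = ShardedIterator(list(range(10)), num_shards=2, shard_id=0)
    shard1 = ShardedIterator(list(range(10)), num_shards=2, shard_id=1)
    print(list(shard0))  # [0, 2, 4, 6, 8]
    print(list(shard1))  # [1, 3, 5, 7, 9]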