Data Loading and Utilities¶

Datasets¶

Datasets define the data format and provide helpers for creating mini-batches.

class fairseq.data.FairseqDataset[source]¶

A dataset that provides helpers for batching.

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch suitable for forwarding with a Model
Return type:	dict

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.LanguagePairDataset(src, src_sizes, src_dict, tgt=None, tgt_sizes=None, tgt_dict=None, left_pad_source=True, left_pad_target=False, max_source_positions=1024, max_target_positions=1024, shuffle=True, input_feeding=True, remove_eos_from_source=False, append_eos_to_target=False)[source]¶

A pair of torch.utils.data.Datasets.

Parameters:

src (torch.utils.data.Dataset) – source dataset to wrap
src_sizes (List[int]) – source sentence lengths
src_dict (Dictionary) – source vocabulary
tgt (torch.utils.data.Dataset, optional) – target dataset to wrap
tgt_sizes (List[int], optional) – target sentence lengths
tgt_dict (Dictionary, optional) – target vocabulary
left_pad_source (bool, optional) – pad source tensors on the left side (default: True).
left_pad_target (bool, optional) – pad target tensors on the left side (default: False).
max_source_positions (int, optional) – max number of tokens in the source sentence (default: 1024).
max_target_positions (int, optional) – max number of tokens in the target sentence (default: 1024).
shuffle (bool, optional) – shuffle dataset elements before batching (default: True).
input_feeding (bool, optional) – create a shifted version of the targets to be passed into the model for input feeding/teacher forcing (default: True).
remove_eos_from_source (bool, optional) – if set, removes eos from end of source if it’s present (default: False).
append_eos_to_target (bool, optional) – if set, appends eos to end of target if it’s absent (default: False).

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch with the following keys: id (LongTensor): example IDs in the original input order ntokens (int): total number of tokens in the batch net_input (dict): the input to the Model, containing keys: src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the left if left_pad_source is `True`. src_lengths (LongTensor): 1D Tensor of the unpadded lengths of each source sentence of shape (bsz) prev_output_tokens (LongTensor): a padded 2D Tensor of tokens in the target sentence, shifted right by one position for input feeding/teacher forcing, of shape (bsz, tgt_len). This key will not be present if input_feeding is `False`. Padding will appear on the left if left_pad_target is `True`. target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the left if left_pad_target is `True`.
Return type:	dict

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.MonolingualDataset(dataset, sizes, src_vocab, tgt_vocab, add_eos_for_other_targets, shuffle, targets=None, add_bos_token=False)[source]¶

A wrapper around torch.utils.data.Dataset for monolingual data.

Parameters:	dataset (torch.utils.data.Dataset) – dataset to wrap sizes (List[int]) – sentence lengths vocab (Dictionary) – vocabulary shuffle (bool, optional) – shuffle the elements before batching (default: True).

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch with the following keys: id (LongTensor): example IDs in the original input order ntokens (int): total number of tokens in the batch net_input (dict): the input to the Model, containing keys: src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the right. target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the right.
Return type:	dict

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

Helper Datasets

These datasets wrap other fairseq.data.FairseqDataset instances and provide additional functionality:

class fairseq.data.BacktranslationDataset(tgt_dataset, src_dict, tgt_dict=None, backtranslation_fn=None, output_collater=None, cuda=True, **kwargs)[source]¶

Sets up a backtranslation dataset which takes a tgt batch, generates a src using a tgt-src backtranslation function (backtranslation_fn), and returns the corresponding {generated src, input tgt} batch.

Parameters:

tgt_dataset (FairseqDataset) – the dataset to be backtranslated. Only the source side of this dataset will be used. After backtranslation, the source sentences in this dataset will be returned as the targets.
src_dict (Dictionary) – the dictionary of backtranslated sentences.
tgt_dict (Dictionary, optional) – the dictionary of sentences to be backtranslated.
backtranslation_fn (callable, optional) – function to call to generate backtranslations. This is typically the generate method of a SequenceGenerator object. Pass in None when it is not available at initialization time, and use set_backtranslation_fn function to set it when available.
output_collater (callable, optional) – function to call on the backtranslated samples to create the final batch (default: tgt_dataset.collater).
cuda – use GPU for generation

collater(samples)[source]¶

Merge and backtranslate a list of samples to form a mini-batch.

Using the samples from tgt_dataset, load a collated target sample to feed to the backtranslation model. Then take the backtranslation with the best score as the source and the original input as the target.

Note: we expect tgt_dataset to provide a function collater() that will collate samples into the format expected by backtranslation_fn. After backtranslation, we will feed the new list of samples (i.e., the (backtranslated source, original source) pairs) to output_collater and return the result.

Parameters:	samples (List[dict]) – samples to backtranslate and collate
Returns:	a mini-batch with keys coming from output_collater
Return type:	dict

num_tokens(index)[source]¶: Just use the tgt dataset num_tokens

ordered_indices()[source]¶: Just use the tgt dataset ordered_indices

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

Note: we use tgt_dataset to approximate the length of the source sentence, since we do not know the actual length until after backtranslation.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.ConcatDataset(datasets, sample_ratios=1)[source]¶

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch suitable for forwarding with a Model
Return type:	dict

num_tokens(index: int)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Returns indices sorted by length. So less padding is needed.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(idx: int)[source]¶: Return an example’s size as a float or tuple.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.RoundRobinZipDatasets(datasets, eval_key=None)[source]¶

Zip multiple FairseqDataset instances together.

Shorter datasets are repeated in a round-robin fashion to match the length of the longest one.

Parameters:	datasets (Dict[FairseqDataset]) – a dictionary of `FairseqDataset` instances. eval_key (str, optional) – a key used at evaluation time that causes this instance to pass-through batches from datasets[eval_key].

collater(samples)[source]¶: Merge a list of samples to form a mini-batch.

num_tokens(index)[source]¶: Return an example’s length (number of tokens), used for batching.

ordered_indices()[source]¶: Ordered indices for batching.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.TransformEosDataset(dataset, eos, append_eos_to_src=False, remove_eos_from_src=False, append_eos_to_tgt=False, remove_eos_from_tgt=False, has_target=True)[source]¶

A FairseqDataset wrapper that appends/prepends/strips EOS.

Note that the transformation is applied in collater().

Parameters:

dataset (FairseqDataset) – dataset to wrap
eos (int) – index of the end-of-sentence symbol
append_eos_to_src (bool, optional) – append EOS to the end of src
remove_eos_from_src (bool, optional) – remove EOS from the end of src
append_eos_to_tgt (bool, optional) – append EOS to the end of tgt
remove_eos_from_tgt (bool, optional) – remove EOS from the end of tgt

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch suitable for forwarding with a Model
Return type:	dict

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

Dictionary¶

class fairseq.data.Dictionary(pad='<pad>', eos='</s>', unk='<unk>', bos='<s>')[source]¶

A mapping from symbols to consecutive integers

add_symbol(word, n=1)[source]¶: Adds a word to the dictionary

bos()[source]¶: Helper to get index of beginning-of-sentence symbol

eos()[source]¶: Helper to get index of end-of-sentence symbol

finalize(threshold=-1, nwords=-1, padding_factor=8)[source]¶

Sort symbols by frequency in descending order, ignoring special ones.

Parameters:	threshold defines the minimum word count (-) – nwords defines the total number of words in the final dictionary, (-) – including special symbols padding_factor can be used to pad the dictionary size to be a (-) – multiple of 8, which is important on some hardware (e.g., Nvidia Tensor Cores).

index(sym)[source]¶: Returns the index of the specified symbol

classmethod load(f, ignore_utf_errors=False)[source]¶

Loads the dictionary from a text file with the format:

` <symbol0> <count0> <symbol1> <count1> ... `

pad()[source]¶: Helper to get index of pad symbol

save(f)[source]¶: Stores dictionary into a text file

string(tensor, bpe_symbol=None, escape_unk=False)[source]¶

Helper for converting a tensor of token indices to a string.

Can optionally remove BPE symbols or escape <unk> words.

unk()[source]¶: Helper to get index of unk symbol

unk_string(escape=False)[source]¶: Return unknown string, optionally escaped as: <<unk>>

update(new_dict)[source]¶: Updates counts from new dictionary.

Iterators¶

class fairseq.data.CountingIterator(iterable, start=0)[source]¶

Wrapper around an iterable that maintains the iteration count.

Parameters:	iterable (iterable) – iterable to wrap

count¶

number of elements consumed from this iterator

Type:	int

has_next()[source]¶: Whether the iterator has been exhausted.

skip(num_to_skip)[source]¶: Fast-forward the iterator by skipping num_to_skip elements.

class fairseq.data.EpochBatchIterator(dataset, collate_fn, batch_sampler, seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=0)[source]¶

A multi-epoch iterator over a torch.utils.data.Dataset.

Compared to torch.utils.data.DataLoader, this iterator:

can be reused across multiple epochs with the next_epoch_itr() method (optionally shuffled between epochs)
can be serialized/deserialized with the state_dict() and load_state_dict() methods
supports sharding with the num_shards and shard_id arguments

Parameters:

dataset (Dataset) – dataset from which to load the data
collate_fn (callable) – merges a list of samples to form a mini-batch
batch_sampler (Sampler) – an iterator over batches of indices
seed (int, optional) – seed for random number generator for reproducibility (default: 1).
num_shards (int, optional) – shard the data iterator into N shards (default: 1).
shard_id (int, optional) – which shard of the data iterator to return (default: 0).
num_workers (int, optional) – how many subprocesses to use for data loading. 0 means the data will be loaded in the main process (default: 0).
epoch (int, optional) – the epoch to start the iterator from (default: 0).

end_of_epoch()[source]¶: Returns whether the most recent epoch iterator has been exhausted

iterations_in_epoch¶: The number of consumed batches in the current epoch.

load_state_dict(state_dict)[source]¶: Copies the state of the iterator from the given state_dict.

next_epoch_itr(shuffle=True, fix_batches_to_gpus=False)[source]¶

Return a new iterator over the dataset.

Parameters:	shuffle (bool, optional) – shuffle batches before returning the iterator (default: True). fix_batches_to_gpus – ensure that batches are always allocated to the same shards across epochs. Requires that `dataset` supports prefetching (default: False).

state_dict()[source]¶: Returns a dictionary containing a whole state of the iterator.

class fairseq.data.GroupedIterator(iterable, chunk_size)[source]¶

Wrapper around an iterable that returns groups (chunks) of items.

Parameters:	iterable (iterable) – iterable to wrap chunk_size (int) – size of each chunk

class fairseq.data.ShardedIterator(iterable, num_shards, shard_id, fill_value=None)[source]¶

A sharded wrapper around an iterable, padded to length.

Parameters:	iterable (iterable) – iterable to wrap num_shards (int) – number of shards to split the iterable into shard_id (int) – which shard to iterator over fill_value (Any, optional) – padding value when the iterable doesn’t evenly divide num_shards (default: None).