Data Loading and Utilities¶

Datasets¶

Datasets define the data format and provide helpers for creating mini-batches.

class fairseq.data.FairseqDataset[source]¶

A dataset that provides helpers for batching.

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch suitable for forwarding with a Model
Return type:	dict

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.
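
The interplay between ordered_indices(), num_tokens() and collater() can be sketched without fairseq installed. The class below is a hypothetical stand-in that only mirrors the method names listed above (it does not subclass FairseqDataset), and batch_by_max_tokens is a simplified illustration of a token budget, not the library's batching routine.

```python
from typing import Dict, List


class ToyDataset:
    """Stand-in mirroring the FairseqDataset method names (not a subclass)."""

    def __init__(self, sentences: List[List[int]], pad: int = 0):
        self.sentences = sentences
        self.pad = pad  # hypothetical padding index for this sketch

    def __getitem__(self, index: int) -> Dict:
        return {"id": index, "tokens": self.sentences[index]}

    def __len__(self) -> int:
        return len(self.sentences)

    def num_tokens(self, index: int) -> int:
        # Consulted by the batching sketch below to respect a token budget.
        return len(self.sentences[index])

    def size(self, index: int) -> int:
        # Would be consulted when filtering out over-long examples.
        return len(self.sentences[index])

    def ordered_indices(self) -> List[int]:
        # Sort by length so that similar-length samples share a batch.
        return sorted(range(len(self)), key=self.num_tokens)

    def collater(self, samples: List[Dict]) -> Dict:
        # Right-pad every sample to the longest one in the batch.
        width = max(len(s["tokens"]) for s in samples)
        return {
            "id": [s["id"] for s in samples],
            "ntokens": sum(len(s["tokens"]) for s in samples),
            "tokens": [
                s["tokens"] + [self.pad] * (width - len(s["tokens"]))
                for s in samples
            ],
        }


def batch_by_max_tokens(dataset: ToyDataset, max_tokens: int) -> List[List[int]]:
    """Group ordered indices so that (longest sample * batch size) <= max_tokens."""
    batches, current, longest = [], [], 0
    for idx in dataset.ordered_indices():
        new_longest = max(longest, dataset.num_tokens(idx))
        if current and new_longest * (len(current) + 1) > max_tokens:
            # Adding this sample would exceed the budget: start a new batch.
            batches.append(current)
            current, new_longest = [], dataset.num_tokens(idx)
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


data = ToyDataset([[5, 6, 7], [8], [9, 10], [11, 12, 13, 14]])
batches = batch_by_max_tokens(data, max_tokens=6)
minibatches = [data.collater([data[i] for i in batch]) for batch in batches]
```

The budget here counts padded positions (longest sample times batch size), which is one plausible reading of a token limit; the real batching code may count differently.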

class fairseq.data.LanguagePairDataset(src, src_sizes, src_dict, tgt=None, tgt_sizes=None, tgt_dict=None, left_pad_source=True, left_pad_target=False, max_source_positions=1024, max_target_positions=1024, shuffle=True, input_feeding=True, remove_eos_from_source=False, append_eos_to_target=False, align_dataset=None, append_bos=False)[source]¶

A pair of torch.utils.data.Datasets.

Parameters:

src (torch.utils.data.Dataset) – source dataset to wrap
src_sizes (List[int]) – source sentence lengths
src_dict (Dictionary) – source vocabulary
tgt (torch.utils.data.Dataset, optional) – target dataset to wrap
tgt_sizes (List[int], optional) – target sentence lengths
tgt_dict (Dictionary, optional) – target vocabulary
left_pad_source (bool, optional) – pad source tensors on the left side (default: True).
left_pad_target (bool, optional) – pad target tensors on the left side (default: False).
max_source_positions (int, optional) – max number of tokens in the source sentence (default: 1024).
max_target_positions (int, optional) – max number of tokens in the target sentence (default: 1024).
shuffle (bool, optional) – shuffle dataset elements before batching (default: True).
input_feeding (bool, optional) – create a shifted version of the targets to be passed into the model for teacher forcing (default: True).
remove_eos_from_source (bool, optional) – if set, removes eos from end of source if it’s present (default: False).
append_eos_to_target (bool, optional) – if set, appends eos to end of target if it’s absent (default: False).
align_dataset (torch.utils.data.Dataset, optional) – dataset containing alignments.
append_bos (bool, optional) – if set, prepends bos to the beginning of the source/target sentence (default: False).

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch with the following keys:

id (LongTensor): example IDs in the original input order
ntokens (int): total number of tokens in the batch
net_input (dict): the input to the Model, containing keys:
    src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the left if left_pad_source is `True`.
    src_lengths (LongTensor): 1D Tensor of the unpadded lengths of each source sentence of shape (bsz)
    prev_output_tokens (LongTensor): a padded 2D Tensor of tokens in the target sentence, shifted right by one position for teacher forcing, of shape (bsz, tgt_len). This key will not be present if input_feeding is `False`. Padding will appear on the left if left_pad_target is `True`.
target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the left if left_pad_target is `True`.
Return type:	dict
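
The padding sides and the right-shifted prev_output_tokens can be illustrated with plain lists. This is a minimal sketch, not the library's collater: PAD and EOS are hypothetical indices, pad_batch is a helper invented for the example, the shift is assumed to work by moving the trailing EOS to the front, and ntokens is assumed to count target tokens.

```python
from typing import List

PAD, EOS = 1, 2  # hypothetical symbol indices chosen for this sketch


def pad_batch(
    seqs: List[List[int]], left_pad: bool, move_eos_to_beginning: bool = False
) -> List[List[int]]:
    """Pad sequences to a common width, optionally shifting them right by one."""
    width = max(len(s) for s in seqs)
    rows = []
    for seq in seqs:
        if move_eos_to_beginning:
            # Shift right by one: the trailing EOS becomes the first input token.
            assert seq[-1] == EOS
            seq = [EOS] + seq[:-1]
        padding = [PAD] * (width - len(seq))
        rows.append(padding + seq if left_pad else seq + padding)
    return rows


sources = [[10, 11, 12, EOS], [13, EOS]]
targets = [[20, 21, EOS], [22, 23, 24, EOS]]

batch = {
    "id": [0, 1],
    # Assumption: the token count refers to target tokens.
    "ntokens": sum(len(t) for t in targets),
    "net_input": {
        "src_tokens": pad_batch(sources, left_pad=True),
        "src_lengths": [len(s) for s in sources],
        "prev_output_tokens": pad_batch(
            targets, left_pad=False, move_eos_to_beginning=True
        ),
    },
    "target": pad_batch(targets, left_pad=False),
}
```

With left_pad=True the padding precedes the source tokens, while the targets are padded on the right, matching the defaults left_pad_source=True and left_pad_target=False.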

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.MonolingualDataset(dataset, sizes, src_vocab, tgt_vocab, add_eos_for_other_targets, shuffle, targets=None, add_bos_token=False)[source]¶

A wrapper around torch.utils.data.Dataset for monolingual data.

Parameters:

dataset (torch.utils.data.Dataset) – dataset to wrap
sizes (List[int]) – sentence lengths
vocab (Dictionary) – vocabulary
shuffle (bool, optional) – shuffle the elements before batching (default: True).

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch with the following keys:

id (LongTensor): example IDs in the original input order
ntokens (int): total number of tokens in the batch
net_input (dict): the input to the Model, containing keys:
    src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the right.
target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the right.
Return type:	dict

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

Helper Datasets

These datasets wrap other fairseq.data.FairseqDataset instances and provide additional functionality:

class fairseq.data.BacktranslationDataset(tgt_dataset, src_dict, tgt_dict=None, backtranslation_fn=None, output_collater=None, cuda=True, **kwargs)[source]¶

Sets up a backtranslation dataset which takes a tgt batch, generates a src using a tgt-src backtranslation function (backtranslation_fn), and returns the corresponding {generated src, input tgt} batch.

Parameters:

tgt_dataset (FairseqDataset) – the dataset to be backtranslated. Only the source side of this dataset will be used. After backtranslation, the source sentences in this dataset will be returned as the targets.
src_dict (Dictionary) – the dictionary of backtranslated sentences.
tgt_dict (Dictionary, optional) – the dictionary of sentences to be backtranslated.
backtranslation_fn (callable, optional) – function to call to generate backtranslations. This is typically the generate method of a SequenceGenerator object. Pass in None when it is not available at initialization time, and use set_backtranslation_fn function to set it when available.
output_collater (callable, optional) – function to call on the backtranslated samples to create the final batch (default: tgt_dataset.collater).
cuda (bool, optional) – use GPU for generation (default: True).

collater(samples)[source]¶

Merge and backtranslate a list of samples to form a mini-batch.

Using the samples from tgt_dataset, load a collated target sample to feed to the backtranslation model. Then take the backtranslation with the best score as the source and the original input as the target.

Note: we expect tgt_dataset to provide a function collater() that will collate samples into the format expected by backtranslation_fn. After backtranslation, we will feed the new list of samples (i.e., the (backtranslated source, original source) pairs) to output_collater and return the result.

Parameters:	samples (List[dict]) – samples to backtranslate and collate
Returns:	a mini-batch with keys coming from output_collater
Return type:	dict
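
The collate, backtranslate, re-collate flow described above can be sketched with toy callables. Every function here is hypothetical: toy_collate stands in for tgt_dataset.collater, toy_backtranslate for backtranslation_fn (assumed to return a list of scored hypotheses per sentence), and toy_output_collater for output_collater.

```python
from typing import Callable, Dict, List


def backtranslate_and_collate(
    samples: List[Dict],
    collate_tgt: Callable[[List[Dict]], Dict],
    backtranslation_fn: Callable[[Dict], List[List[Dict]]],
    output_collater: Callable[[List[Dict]], Dict],
) -> Dict:
    # 1. Collate the original samples into the generator's input format.
    collated = collate_tgt(samples)
    # 2. Generate candidate backtranslations (one n-best list per sample).
    hypotheses = backtranslation_fn(collated)
    # 3. Pair the best-scoring hypothesis (new source) with the original input.
    new_samples = []
    for sample, hypos in zip(samples, hypotheses):
        best = max(hypos, key=lambda h: h["score"])
        new_samples.append(
            {"id": sample["id"], "source": best["tokens"], "target": sample["source"]}
        )
    # 4. Build the final mini-batch.
    return output_collater(new_samples)


def toy_collate(samples: List[Dict]) -> Dict:
    return {
        "id": [s["id"] for s in samples],
        "src_tokens": [s["source"] for s in samples],
    }


def toy_backtranslate(batch: Dict) -> List[List[Dict]]:
    # Hypothetical generator: "translates" by reversing the tokens.
    return [
        [
            {"tokens": list(reversed(toks)), "score": -0.1},
            {"tokens": toks[:1], "score": -2.0},
        ]
        for toks in batch["src_tokens"]
    ]


def toy_output_collater(samples: List[Dict]) -> Dict:
    return {
        "source": [s["source"] for s in samples],
        "target": [s["target"] for s in samples],
    }


samples = [{"id": 0, "source": [4, 5, 6]}, {"id": 1, "source": [7, 8]}]
batch = backtranslate_and_collate(
    samples, toy_collate, toy_backtranslate, toy_output_collater
)
```

Note how the original "source" field of each input sample ends up as the target of the returned batch, which is the role reversal the parameter description of tgt_dataset refers to.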

num_tokens(index)[source]¶: Delegates to num_tokens() of tgt_dataset.

ordered_indices()[source]¶: Delegates to ordered_indices() of tgt_dataset.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶

Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

Note: we use tgt_dataset to approximate the length of the source sentence, since we do not know the actual length until after backtranslation.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.ConcatDataset(datasets, sample_ratios=1)[source]¶

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch suitable for forwarding with a Model
Return type:	dict

num_tokens(index: int)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return indices sorted by length, so that less padding is needed.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

set_epoch(epoch)[source]¶: Will receive the updated epoch number at the beginning of the epoch.

size(idx: int)[source]¶: Return an example’s size as a float or tuple.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.ResamplingDataset(dataset, weights=None, replace=True, size_ratio=1.0, batch_by_size=True, seed=0, epoch=0)[source]¶

Randomly samples from a given dataset at each epoch.

Sampling is done with or without replacement, depending on the “replace” parameter.

Optionally, the epoch size can be rescaled. This is potentially desirable to increase per-epoch coverage of the base dataset (since sampling with replacement means that many items in the dataset will be left out). In the case of sampling without replacement, size_ratio should be strictly less than 1.

Parameters:

dataset (Dataset) – dataset on which to sample.
weights (List[float]) – list of probability weights (default: None, which corresponds to uniform sampling).
replace (bool) – sampling mode; True for “with replacement”, or False for “without replacement” (default: True)
size_ratio (float) – the ratio to subsample to; must be positive (default: 1.0).
batch_by_size (bool) – whether or not to batch by sequence length (default: True).
seed (int) – RNG seed to use (default: 0).
epoch (int) – starting epoch number (default: 0).
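
How replace, size_ratio, seed and epoch interact can be sketched with the standard library. The function resample_indices is hypothetical and is not how the class is implemented; in particular, rounding the epoch size up and combining seed and epoch into one RNG seed are assumptions, and weighted sampling without replacement is left out.

```python
import math
import random
from typing import List, Optional, Sequence


def resample_indices(
    n: int,
    weights: Optional[Sequence[float]] = None,
    replace: bool = True,
    size_ratio: float = 1.0,
    seed: int = 0,
    epoch: int = 0,
) -> List[int]:
    """Draw the indices that make up one epoch of a resampled dataset."""
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    k = math.ceil(n * size_ratio)  # assumption: the epoch size is rounded up
    # Seeding with both values makes each epoch reproducible but different.
    rng = random.Random(f"{seed}-{epoch}")
    if replace:
        return rng.choices(range(n), weights=weights, k=k)
    if weights is not None:
        raise NotImplementedError("weighted sampling without replacement omitted")
    if k > n:
        raise ValueError("cannot draw more than n items without replacement")
    return rng.sample(range(n), k)


with_replacement = resample_indices(10, seed=0, epoch=1)
subset = resample_indices(10, replace=False, size_ratio=0.5, seed=0, epoch=1)
skewed = resample_indices(3, weights=[0.0, 0.0, 1.0], seed=0, epoch=1)
```

With replacement, some indices typically repeat and others are missing, which is why a size_ratio above 1 can help per-epoch coverage.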

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

set_epoch(epoch)[source]¶: Will receive the updated epoch number at the beginning of the epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

class fairseq.data.RoundRobinZipDatasets(datasets, eval_key=None)[source]¶

Zip multiple FairseqDataset instances together.

Shorter datasets are repeated in a round-robin fashion to match the length of the longest one.

Parameters:

datasets (Dict[FairseqDataset]) – a dictionary of `FairseqDataset` instances.
eval_key (str, optional) – a key used at evaluation time that causes this instance to pass-through batches from datasets[eval_key].
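
The round-robin repetition can be sketched with modular indexing. This is a simplified illustration using plain lists in place of datasets; round_robin_item is a hypothetical helper, and the real class may additionally reorder each dataset before zipping.

```python
from typing import Dict, List, Sequence


def round_robin_item(datasets: Dict[str, Sequence], index: int) -> Dict:
    """Return one item per dataset, wrapping around the shorter ones."""
    return {key: ds[index % len(ds)] for key, ds in datasets.items()}


datasets = {"en-de": ["a", "b", "c", "d", "e"], "en-fr": ["x", "y"]}
longest = max(len(ds) for ds in datasets.values())
zipped: List[Dict] = [round_robin_item(datasets, i) for i in range(longest)]
```

The zipped length equals that of the longest dataset, and the two "en-fr" items are reused in turn.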

collater(samples)[source]¶: Merge a list of samples to form a mini-batch.

num_tokens(index)[source]¶: Return an example’s length (number of tokens), used for batching.

ordered_indices()[source]¶: Ordered indices for batching.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

class fairseq.data.TransformEosDataset(dataset, eos, append_eos_to_src=False, remove_eos_from_src=False, append_eos_to_tgt=False, remove_eos_from_tgt=False, has_target=True)[source]¶

A FairseqDataset wrapper that appends/prepends/strips EOS.

Note that the transformation is applied in collater().

Parameters:

dataset (FairseqDataset) – dataset to wrap
eos (int) – index of the end-of-sentence symbol
append_eos_to_src (bool, optional) – append EOS to the end of src
remove_eos_from_src (bool, optional) – remove EOS from the end of src
append_eos_to_tgt (bool, optional) – append EOS to the end of tgt
remove_eos_from_tgt (bool, optional) – remove EOS from the end of tgt
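
The effect of the append and remove flags on a single sequence can be sketched as follows. The helper transform_eos is hypothetical; treating an already present (or already absent) EOS as a no-op and rejecting both flags at once are choices made for this sketch, not confirmed behavior of the class.

```python
from typing import List


def transform_eos(
    tokens: List[int], eos: int, append_eos: bool = False, remove_eos: bool = False
) -> List[int]:
    """Append or strip a trailing EOS index on one token sequence."""
    if append_eos and remove_eos:
        raise ValueError("cannot both append and remove EOS")
    if append_eos and (not tokens or tokens[-1] != eos):
        return tokens + [eos]
    if remove_eos and tokens and tokens[-1] == eos:
        return tokens[:-1]
    return list(tokens)


EOS = 2  # hypothetical end-of-sentence index
appended = transform_eos([4, 5], EOS, append_eos=True)
stripped = transform_eos([4, 5, EOS], EOS, remove_eos=True)
unchanged = transform_eos([4, 5], EOS, remove_eos=True)
```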

collater(samples)[source]¶

Merge a list of samples to form a mini-batch.

Parameters:	samples (List[dict]) – samples to collate
Returns:	a mini-batch suitable for forwarding with a Model
Return type:	dict

num_tokens(index)[source]¶: Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.

ordered_indices()[source]¶: Return an ordered list of indices. Batches will be constructed based on this order.

prefetch(indices)[source]¶: Prefetch the data required for this epoch.

size(index)[source]¶: Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.

supports_prefetch¶: Whether this dataset supports prefetching.

Dictionary¶

class fairseq.data.Dictionary(pad='<pad>', eos='</s>', unk='<unk>', bos='<s>', extra_special_symbols=None)[source]¶

A mapping from symbols to consecutive integers.

add_from_file(f)[source]¶: Loads a pre-existing dictionary from a text file and adds its symbols to this instance.

add_symbol(word, n=1)[source]¶: Adds a word to the dictionary

bos()[source]¶: Helper to get index of beginning-of-sentence symbol

eos()[source]¶: Helper to get index of end-of-sentence symbol

finalize(threshold=-1, nwords=-1, padding_factor=8)[source]¶

Sort symbols by frequency in descending order, ignoring special ones.

Parameters:

threshold – defines the minimum word count
nwords – defines the total number of words in the final dictionary, including special symbols
padding_factor – can be used to pad the dictionary size to be a multiple of 8, which is important on some hardware (e.g., Nvidia Tensor Cores).
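
The three finalize() parameters can be illustrated with a standalone function. This is a sketch under assumptions rather than the method itself: finalize_vocab is a hypothetical name, the special symbols are passed in explicitly, ties are broken alphabetically, and the filler symbol names are placeholders invented for the example.

```python
from typing import Dict, List


def finalize_vocab(
    counts: Dict[str, int],
    specials: List[str],
    threshold: int = -1,
    nwords: int = -1,
    padding_factor: int = 8,
) -> List[str]:
    """Order symbols by frequency, then prune and pad the vocabulary."""
    if nwords <= 0:
        nwords = len(specials) + len(counts)
    budget = max(0, nwords - len(specials))  # nwords includes special symbols
    # Most frequent first; ties broken alphabetically for determinism.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [sym for sym, count in by_frequency[:budget] if count >= threshold]
    symbols = list(specials) + kept
    filler = 0
    while padding_factor > 1 and len(symbols) % padding_factor != 0:
        # Placeholder entries bring the size up to a multiple of padding_factor.
        symbols.append(f"madeupword{filler:04d}")
        filler += 1
    return symbols


vocab = finalize_vocab(
    {"the": 10, "cat": 3, "sat": 1},
    specials=["<s>", "<pad>", "</s>", "<unk>"],
    threshold=2,
)
```

Here "sat" falls below the threshold, leaving six symbols, so two filler entries are added to reach a size of eight.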

index(sym)[source]¶: Returns the index of the specified symbol

classmethod load(f)[source]¶

Loads the dictionary from a text file with the format:

    <symbol0> <count0>
    <symbol1> <count1>
    ...
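
Assuming the file holds one `<symbol> <count>` pair per line, a reader for this format could look like the sketch below. The function read_dictionary_lines is hypothetical and only parses the pairs; it does not build a Dictionary.

```python
import io
from typing import IO, List, Tuple


def read_dictionary_lines(f: IO[str]) -> List[Tuple[str, int]]:
    """Parse '<symbol> <count>' lines into (symbol, count) pairs."""
    entries = []
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last space so the count is always the final field.
        symbol, count = line.rsplit(" ", 1)
        entries.append((symbol, int(count)))
    return entries


entries = read_dictionary_lines(io.StringIO("the 120\ncat 7\n"))
```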

pad()[source]¶: Helper to get index of pad symbol

save(f)[source]¶: Stores dictionary into a text file

string(tensor, bpe_symbol=None, escape_unk=False)[source]¶

Helper for converting a tensor of token indices to a string.

Can optionally remove BPE symbols or escape <unk> words.

unk()[source]¶: Helper to get index of unk symbol

unk_string(escape=False)[source]¶: Return unknown string, optionally escaped as: <<unk>>

update(new_dict)[source]¶: Updates counts from new dictionary.
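
The core idea of mapping symbols to consecutive integers, with a fallback for unknown words, can be sketched in a few lines. TinyDictionary is a hypothetical stand-in written for this example; it omits the pad, eos and bos symbols and is not the library class.

```python
from typing import Dict, List


class TinyDictionary:
    """Minimal symbol <-> index mapping with an unknown-word fallback."""

    def __init__(self, unk: str = "<unk>"):
        self.symbols: List[str] = []
        self.indices: Dict[str, int] = {}
        self.counts: List[int] = []
        self.unk_index = self.add_symbol(unk)

    def add_symbol(self, word: str, n: int = 1) -> int:
        # Existing words only have their count increased.
        if word in self.indices:
            idx = self.indices[word]
            self.counts[idx] += n
            return idx
        idx = len(self.symbols)  # next consecutive integer
        self.indices[word] = idx
        self.symbols.append(word)
        self.counts.append(n)
        return idx

    def index(self, sym: str) -> int:
        # Unknown symbols map to the index of the unk symbol.
        return self.indices.get(sym, self.unk_index)

    def string(self, ids: List[int]) -> str:
        return " ".join(self.symbols[i] for i in ids)


d = TinyDictionary()
for word in "the cat saw the dog".split():
    d.add_symbol(word)
ids = [d.index(w) for w in "the bird saw".split()]
text = d.string(ids)
```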

Iterators¶

class fairseq.data.CountingIterator(iterable, start=0)[source]¶

Wrapper around an iterable that maintains the iteration count.

Parameters:	iterable (iterable) – iterable to wrap

count¶

number of elements consumed from this iterator

Type:	int

has_next()[source]¶: Whether the iterator has more elements, i.e. it has not yet been exhausted.

skip(num_to_skip)[source]¶: Fast-forward the iterator by skipping num_to_skip elements.

take(n)[source]¶: Truncates the iterator to n elements at most.
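
The behavior of count, has_next(), skip() and take() can be sketched with a small wrapper. CountingIter is a hypothetical re-implementation written for illustration; it materializes the iterable into a list for simplicity, which the real class need not do.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class CountingIter(Iterator[T]):
    """Iterator wrapper that tracks how many elements were consumed."""

    def __init__(self, iterable: Iterable[T], start: int = 0):
        items: List[T] = list(iterable)
        self.count = start                # elements consumed so far
        self.total = start + len(items)   # count at which iteration stops
        self._itr = iter(items)

    def __iter__(self) -> "CountingIter[T]":
        return self

    def __next__(self) -> T:
        if not self.has_next():
            raise StopIteration
        item = next(self._itr)
        self.count += 1
        return item

    def has_next(self) -> bool:
        return self.count < self.total

    def skip(self, num_to_skip: int) -> "CountingIter[T]":
        # Fast-forward by consuming (and discarding) elements.
        for _ in range(num_to_skip):
            if not self.has_next():
                break
            next(self)
        return self

    def take(self, n: int) -> "CountingIter[T]":
        # Stop once the overall count reaches n.
        self.total = min(self.total, n)
        return self


it = CountingIter(range(10))
it.skip(3)
first = next(it)
it.take(6)
rest = list(it)
```

In this sketch take(n) limits the total count rather than the number of remaining elements, so after skipping three items and consuming one, take(6) leaves two more.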

class fairseq.data.EpochBatchIterator(dataset, collate_fn, batch_sampler, seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=0)[source]¶

A multi-epoch iterator over a torch.utils.data.Dataset.

Compared to torch.utils.data.DataLoader, this iterator:

can be reused across multiple epochs with the next_epoch_itr() method (optionally shuffled between epochs)
can be serialized/deserialized with the state_dict() and load_state_dict() methods
supports sharding with the num_shards and shard_id arguments

Parameters:

dataset (Dataset) – dataset from which to load the data
collate_fn (callable) – merges a list of samples to form a mini-batch
batch_sampler (Sampler) – an iterator over batches of indices
seed (int, optional) – seed for random number generator for reproducibility (default: 1).
num_shards (int, optional) – shard the data iterator into N shards (default: 1).
shard_id (int, optional) – which shard of the data iterator to return (default: 0).
num_workers (int, optional) – how many subprocesses to use for data loading. 0 means the data will be loaded in the main process (default: 0).
epoch (int, optional) – the epoch to start the iterator from (default: 0).

end_of_epoch() → bool[source]¶: Returns whether the most recent epoch iterator has been exhausted

iterations_in_epoch¶: The number of consumed batches in the current epoch.

load_state_dict(state_dict)[source]¶: Copies the state of the iterator from the given state_dict.

next_epoch_itr(shuffle=True, fix_batches_to_gpus=False)[source]¶

Return a new iterator over the dataset.

Parameters:

shuffle (bool, optional) – shuffle batches before returning the iterator (default: True).
fix_batches_to_gpus (bool, optional) – ensure that batches are always allocated to the same shards across epochs. Requires that `dataset` supports prefetching (default: False).

state_dict()[source]¶: Returns a dictionary containing a whole state of the iterator.
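
Reuse across epochs and resumption from a saved state can be sketched as follows. TinyEpochIterator is a hypothetical, heavily simplified stand-in: it iterates over precomputed batches of indices, ignores sharding and worker processes, and assumes the shuffle order is derived from seed plus epoch so that it can be reproduced after loading a state.

```python
import random
from typing import Dict, Iterator, List, Optional


class TinyEpochIterator:
    """Multi-epoch iterator over fixed batches with resumable state."""

    def __init__(self, batches: List[List[int]], seed: int = 1, epoch: int = 0):
        self.frozen_batches = [list(b) for b in batches]
        self.seed = seed
        self.epoch = epoch
        self.iterations_in_epoch = 0
        self._resume_offset: Optional[int] = None

    def _batches_for(self, epoch: int, shuffle: bool) -> List[List[int]]:
        batches = list(self.frozen_batches)
        if shuffle:
            # Same seed and epoch always give the same order.
            random.Random(self.seed + epoch).shuffle(batches)
        return batches

    def next_epoch_itr(self, shuffle: bool = True) -> Iterator[List[int]]:
        offset = self._resume_offset
        if offset is None:
            self.epoch += 1  # fresh epoch
            offset = 0
        self._resume_offset = None
        self.iterations_in_epoch = offset
        return self._iterate(self._batches_for(self.epoch, shuffle)[offset:])

    def _iterate(self, batches: List[List[int]]) -> Iterator[List[int]]:
        for batch in batches:
            self.iterations_in_epoch += 1
            yield batch

    def state_dict(self) -> Dict[str, int]:
        return {
            "epoch": self.epoch,
            "iterations_in_epoch": self.iterations_in_epoch,
        }

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.epoch = state["epoch"]
        self._resume_offset = state["iterations_in_epoch"]


batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
itr = TinyEpochIterator(batches, seed=1)
epoch_itr = itr.next_epoch_itr()
seen = [next(epoch_itr), next(epoch_itr)]
state = itr.state_dict()

restored = TinyEpochIterator(batches, seed=1)
restored.load_state_dict(state)
remaining = list(restored.next_epoch_itr())
```

The restored iterator continues the interrupted epoch with the batches that had not been consumed yet, in the same shuffled order.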

class fairseq.data.GroupedIterator(iterable, chunk_size)[source]¶

Wrapper around an iterable that returns groups (chunks) of items.

Parameters:

iterable (iterable) – iterable to wrap
chunk_size (int) – size of each chunk
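
Grouping into chunks can be sketched with a generator. The function grouped is a hypothetical equivalent written for illustration; yielding a shorter final chunk when the items do not divide evenly is an assumption of this sketch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def grouped(iterable: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to chunk_size items."""
    chunk: List[T] = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Leftover items form a final, shorter chunk.
        yield chunk


chunks = list(grouped(range(7), 3))
```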

class fairseq.data.ShardedIterator(iterable, num_shards, shard_id, fill_value=None)[source]¶

A sharded wrapper around an iterable, padded to length.

Parameters:

iterable (iterable) – iterable to wrap
num_shards (int) – number of shards to split the iterable into
shard_id (int) – which shard to iterate over
fill_value (Any, optional) – padding value when the iterable doesn’t evenly divide num_shards (default: None).
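
Sharding with padding can be sketched using itertools.islice. The function shard is hypothetical and assumes a strided split (shard i takes items i, i + num_shards, ...), with every shard padded to the same length using fill_value.

```python
import itertools
import math
from typing import Any, List, Sequence


def shard(
    items: Sequence[Any], num_shards: int, shard_id: int, fill_value: Any = None
) -> List[Any]:
    """Return the items of one shard, padded so all shards have equal length."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    sharded_len = math.ceil(len(items) / num_shards)
    # Take every num_shards-th item, starting at shard_id.
    picked = list(itertools.islice(items, shard_id, len(items), num_shards))
    return picked + [fill_value] * (sharded_len - len(picked))


shards = [shard(list(range(7)), 3, i, fill_value=-1) for i in range(3)]
```

Equal shard lengths keep parallel workers in step, with the fill value marking positions that carry no data.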