Data Loading and Utilities
Datasets
Datasets define the data format and provide helpers for creating mini-batches.
-
class
fairseq.data.
FairseqDataset
[source]¶ A dataset that provides helpers for batching.
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch suitable for forwarding with a Model Return type: dict
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
ordered_indices
()[source]¶ Return an ordered list of indices. Batches will be constructed based on this order.
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
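The batching helpers described above can be illustrated with a small standalone sketch. It does not import fairseq: `ToyDataset`, its `pad` index, and the use of plain Python lists in place of tensors are all assumptions made for illustration, and the length-sorted ordering is one plausible choice rather than the library's behaviour.

```python
from typing import Dict, List


class ToyDataset:
    """Stand-in for a FairseqDataset subclass; plain lists replace tensors."""

    def __init__(self, sentences: List[List[int]], pad: int = 0):
        self.sentences = sentences
        self.pad = pad  # hypothetical padding index

    def __getitem__(self, index: int) -> Dict:
        return {"id": index, "tokens": self.sentences[index]}

    def __len__(self) -> int:
        return len(self.sentences)

    def num_tokens(self, index: int) -> int:
        # Token count of one sample, the quantity a --max-tokens budget sums over.
        return len(self.sentences[index])

    def size(self, index: int) -> int:
        # Size compared against --max-positions when filtering.
        return len(self.sentences[index])

    def ordered_indices(self) -> List[int]:
        # Sort by length so neighbouring samples need little padding.
        return sorted(range(len(self)), key=self.num_tokens)

    def collater(self, samples: List[Dict]) -> Dict:
        # Right-pad every sample to the longest one in the batch.
        width = max(len(s["tokens"]) for s in samples)
        padded = [s["tokens"] + [self.pad] * (width - len(s["tokens"])) for s in samples]
        return {
            "id": [s["id"] for s in samples],
            "ntokens": sum(len(s["tokens"]) for s in samples),
            "tokens": padded,
        }


data = ToyDataset([[5, 6, 7], [8], [9, 10]])
order = data.ordered_indices()
batch = data.collater([data[i] for i in order[:2]])
```

Here `order` lists the shortest sentence first, so the first batch groups the two shortest samples and needs only one padding symbol.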
-
-
class
fairseq.data.
LanguagePairDataset
(src, src_sizes, src_dict, tgt=None, tgt_sizes=None, tgt_dict=None, left_pad_source=True, left_pad_target=False, max_source_positions=1024, max_target_positions=1024, shuffle=True, input_feeding=True, remove_eos_from_source=False, append_eos_to_target=False, align_dataset=None, append_bos=False)[source]¶ A pair of torch.utils.data.Datasets.
Parameters: - src (torch.utils.data.Dataset) – source dataset to wrap
- src_sizes (List[int]) – source sentence lengths
- src_dict (Dictionary) – source vocabulary
- tgt (torch.utils.data.Dataset, optional) – target dataset to wrap
- tgt_sizes (List[int], optional) – target sentence lengths
- tgt_dict (Dictionary, optional) – target vocabulary
- left_pad_source (bool, optional) – pad source tensors on the left side (default: True).
- left_pad_target (bool, optional) – pad target tensors on the left side (default: False).
- max_source_positions (int, optional) – max number of tokens in the source sentence (default: 1024).
- max_target_positions (int, optional) – max number of tokens in the target sentence (default: 1024).
- shuffle (bool, optional) – shuffle dataset elements before batching (default: True).
- input_feeding (bool, optional) – create a shifted version of the targets to be passed into the model for teacher forcing (default: True).
- remove_eos_from_source (bool, optional) – if set, removes eos from end of source if it’s present (default: False).
- append_eos_to_target (bool, optional) – if set, appends eos to end of target if it’s absent (default: False).
- align_dataset (torch.utils.data.Dataset, optional) – dataset containing alignments.
- append_bos (bool, optional) – if set, appends bos to the beginning of source/target sentence.
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch with the following keys: - id (LongTensor): example IDs in the original input order
- ntokens (int): total number of tokens in the batch
- net_input (dict): the input to the Model, containing keys:
  - src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the left if left_pad_source is True.
  - src_lengths (LongTensor): 1D Tensor of the unpadded lengths of each source sentence of shape (bsz)
  - prev_output_tokens (LongTensor): a padded 2D Tensor of tokens in the target sentence, shifted right by one position for teacher forcing, of shape (bsz, tgt_len). This key will not be present if input_feeding is False. Padding will appear on the left if left_pad_target is True.
- target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the left if left_pad_target is True.
Return type: dict
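The layout of such a mini-batch may be easier to see in a toy example. The sketch below uses plain lists instead of LongTensors; `PAD`, `EOS`, `pad_rows`, and `shift_right` are hypothetical names, and rotating the trailing EOS to the front is an assumed way of producing the right-shifted targets.

```python
from typing import List

PAD, EOS = 1, 2  # assumed special-symbol indices for this sketch


def pad_rows(rows: List[List[int]], left_pad: bool) -> List[List[int]]:
    """Pad every row to the longest row, on the left or on the right."""
    width = max(len(r) for r in rows)
    if left_pad:
        return [[PAD] * (width - len(r)) + r for r in rows]
    return [r + [PAD] * (width - len(r)) for r in rows]


def shift_right(target: List[int]) -> List[int]:
    """Move the final symbol (EOS) to the front, shifting the rest right by one."""
    return [target[-1]] + target[:-1]


sources = [[11, 12, EOS], [13, EOS]]
targets = [[21, EOS], [22, 23, 24, EOS]]

batch = {
    "id": [0, 1],
    "net_input": {
        # left_pad_source=True: padding appears on the left
        "src_tokens": pad_rows(sources, left_pad=True),
        "src_lengths": [len(s) for s in sources],
        # left_pad_target=False: padding appears on the right
        "prev_output_tokens": pad_rows([shift_right(t) for t in targets], left_pad=False),
    },
    "target": pad_rows(targets, left_pad=False),
}
```

At every unpadded position, `prev_output_tokens` holds the token that precedes the corresponding entry of `target`, which is what teacher forcing requires.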
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
ordered_indices
()[source]¶ Return an ordered list of indices. Batches will be constructed based on this order.
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
-
class
fairseq.data.
MonolingualDataset
(dataset, sizes, src_vocab, tgt_vocab, add_eos_for_other_targets, shuffle, targets=None, add_bos_token=False)[source]¶ A wrapper around torch.utils.data.Dataset for monolingual data.
Parameters: - dataset (torch.utils.data.Dataset) – dataset to wrap
- sizes (List[int]) – sentence lengths
- vocab (Dictionary) – vocabulary
- shuffle (bool, optional) – shuffle the elements before batching (default: True).
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch with the following keys: - id (LongTensor): example IDs in the original input order
- ntokens (int): total number of tokens in the batch
- net_input (dict): the input to the Model, containing keys:
- src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the right.
- target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the right.
Return type: dict
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
ordered_indices
()[source]¶ Return an ordered list of indices. Batches will be constructed based on this order.
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
Helper Datasets
These datasets wrap other fairseq.data.FairseqDataset
instances and
provide additional functionality:
-
class
fairseq.data.
BacktranslationDataset
(tgt_dataset, src_dict, tgt_dict=None, backtranslation_fn=None, output_collater=None, cuda=True, **kwargs)[source]¶ Sets up a backtranslation dataset which takes a tgt batch, generates a src using a tgt-src backtranslation function (backtranslation_fn), and returns the corresponding {generated src, input tgt} batch.
Parameters: - tgt_dataset (FairseqDataset) – the dataset to be backtranslated. Only the source side of this dataset will be used. After backtranslation, the source sentences in this dataset will be returned as the targets.
- src_dict (Dictionary) – the dictionary of backtranslated sentences.
- tgt_dict (Dictionary, optional) – the dictionary of sentences to be backtranslated.
- backtranslation_fn (callable, optional) – function to call to generate backtranslations. This is typically the generate method of a SequenceGenerator object. Pass in None when it is not available at initialization time, and use the set_backtranslation_fn function to set it when available.
- output_collater (callable, optional) – function to call on the backtranslated samples to create the final batch (default: tgt_dataset.collater).
- cuda – use GPU for generation
-
collater
(samples)[source]¶ Merge and backtranslate a list of samples to form a mini-batch.
Using the samples from tgt_dataset, load a collated target sample to feed to the backtranslation model. Then take the backtranslation with the best score as the source and the original input as the target.
Note: we expect tgt_dataset to provide a function collater() that will collate samples into the format expected by backtranslation_fn. After backtranslation, we will feed the new list of samples (i.e., the (backtranslated source, original source) pairs) to output_collater and return the result.
Parameters: samples (List[dict]) – samples to backtranslate and collate Returns: a mini-batch with keys coming from output_collater Return type: dict
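The collate, backtranslate, re-collate flow described above can be sketched without fairseq. Everything in this block is hypothetical: `collate` stands in for both `tgt_dataset.collater` and `output_collater`, and `fake_backtranslation_fn` simply reverses the tokens so that the "translation" is visible. The hypothesis format (a list of scored candidates per sentence) is an assumption.

```python
from typing import Callable, Dict, List


def collate(samples: List[Dict]) -> Dict:
    """Minimal stand-in for tgt_dataset.collater and output_collater."""
    return {
        "id": [s["id"] for s in samples],
        "source": [s["source"] for s in samples],
        "target": [s.get("target") for s in samples],
    }


def fake_backtranslation_fn(batch: Dict) -> List[List[Dict]]:
    """Hypothetical generator returning scored hypotheses per input sentence."""
    return [
        [{"tokens": src[::-1], "score": -0.1}, {"tokens": src, "score": -2.0}]
        for src in batch["source"]
    ]


def backtranslate_and_collate(
    samples: List[Dict],
    backtranslation_fn: Callable[[Dict], List[List[Dict]]],
    output_collater: Callable[[List[Dict]], Dict],
) -> Dict:
    collated = collate(samples)               # format expected by backtranslation_fn
    hypotheses = backtranslation_fn(collated)
    new_samples = []
    for sample_id, original, hypos in zip(collated["id"], collated["source"], hypotheses):
        best = max(hypos, key=lambda h: h["score"])  # keep the best-scoring hypothesis
        new_samples.append({"id": sample_id, "source": best["tokens"], "target": original})
    return output_collater(new_samples)


samples = [{"id": 0, "source": [4, 5, 6]}, {"id": 1, "source": [7, 8]}]
batch = backtranslate_and_collate(samples, fake_backtranslation_fn, collate)
```

The original sentences end up on the target side of the batch, and the generated sentences become the sources.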
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
Note: we use tgt_dataset to approximate the length of the source sentence, since we do not know the actual length until after backtranslation.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
-
class
fairseq.data.
ConcatDataset
(datasets, sample_ratios=1)[source]¶
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch suitable for forwarding with a Model Return type: dict
-
num_tokens
(index: int)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
-
-
class
fairseq.data.
ResamplingDataset
(dataset, weights=None, replace=True, size_ratio=1.0, batch_by_size=True, seed=0, epoch=0)[source]¶ Randomly samples from a given dataset at each epoch.
Sampling is done with or without replacement, depending on the “replace” parameter.
Optionally, the epoch size can be rescaled. This is potentially desirable to increase per-epoch coverage of the base dataset (since sampling with replacement means that many items in the dataset will be left out). In the case of sampling without replacement, size_ratio should be strictly less than 1.
Parameters: - dataset (Dataset) – dataset on which to sample.
- weights (List[float]) – list of probability weights (default: None, which corresponds to uniform sampling).
- replace (bool) – sampling mode; True for “with replacement”, or False for “without replacement” (default: True)
- size_ratio (float) – the ratio to subsample to; must be positive (default: 1.0).
- batch_by_size (bool) – whether or not to batch by sequence length (default: True).
- seed (int) – RNG seed to use (default: 0).
- epoch (int) – starting epoch number (default: 0).
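The interaction between replace, size_ratio, seed, and epoch can be sketched with the standard library alone. `resample_indices` is a hypothetical helper, not part of fairseq; deriving the RNG from a "seed:epoch" string and rounding the epoch size up are assumptions of this sketch, and weights are only honoured when sampling with replacement.

```python
import math
import random
from typing import List, Optional, Sequence


def resample_indices(
    n: int,
    weights: Optional[Sequence[float]] = None,
    replace: bool = True,
    size_ratio: float = 1.0,
    seed: int = 0,
    epoch: int = 0,
) -> List[int]:
    """Draw the indices that make up one epoch of a resampled dataset."""
    assert size_ratio > 0, "size_ratio must be positive"
    rng = random.Random(f"{seed}:{epoch}")  # a different draw for every epoch
    size = math.ceil(n * size_ratio)        # rescaled epoch size
    if replace:
        # With replacement: some items repeat and others are left out.
        return rng.choices(range(n), weights=weights, k=size)
    # Without replacement: cannot draw more items than the dataset holds.
    assert size <= n, "without replacement, size_ratio must not exceed 1"
    return rng.sample(range(n), k=size)


with_replacement = resample_indices(10, seed=0, epoch=1)
same_epoch_again = resample_indices(10, seed=0, epoch=1)
half_epoch = resample_indices(10, replace=False, size_ratio=0.5)
```

Calling the helper twice with the same seed and epoch yields the same indices, which keeps resumed training runs reproducible.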
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
class
fairseq.data.
RoundRobinZipDatasets
(datasets, eval_key=None)[source]¶ Zip multiple FairseqDataset instances together. Shorter datasets are repeated in a round-robin fashion to match the length of the longest one.
Parameters: - datasets (Dict[FairseqDataset]) – a dictionary of FairseqDataset instances.
- eval_key (str, optional) – a key used at evaluation time that causes this instance to pass-through batches from datasets[eval_key].
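A minimal sketch of the round-robin idea, assuming plain lists in place of FairseqDataset instances and a simple modulo to repeat the shorter ones. `round_robin_item` and the dataset keys are hypothetical, and the library may map indices differently.

```python
from typing import Dict, List


def round_robin_item(datasets: Dict[str, List], index: int) -> Dict:
    """Return one item per dataset, wrapping around in the shorter datasets."""
    return {key: data[index % len(data)] for key, data in datasets.items()}


datasets = {"en-de": ["a", "b", "c", "d", "e"], "en-fr": ["x", "y"]}
longest = max(len(data) for data in datasets.values())
zipped = [round_robin_item(datasets, i) for i in range(longest)]
```

The zipped dataset has as many items as the longest input, and the two-item dataset is cycled through to fill every position.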
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
-
class
fairseq.data.
TransformEosDataset
(dataset, eos, append_eos_to_src=False, remove_eos_from_src=False, append_eos_to_tgt=False, remove_eos_from_tgt=False, has_target=True)[source]¶ A FairseqDataset wrapper that appends/prepends/strips EOS. Note that the transformation is applied in collater().
Parameters: - dataset (FairseqDataset) – dataset to wrap
- eos (int) – index of the end-of-sentence symbol
- append_eos_to_src (bool, optional) – append EOS to the end of src
- remove_eos_from_src (bool, optional) – remove EOS from the end of src
- append_eos_to_tgt (bool, optional) – append EOS to the end of tgt
- remove_eos_from_tgt (bool, optional) – remove EOS from the end of tgt
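The append and remove options can be pictured as a small per-sentence transformation. `transform_eos` below is a hypothetical helper operating on plain lists; it is deliberately lenient (it appends only when EOS is absent and strips only when EOS is present), which is a choice of this sketch and not necessarily how the wrapper treats unexpected input.

```python
from typing import List


def transform_eos(
    tokens: List[int], eos: int, append_eos: bool = False, remove_eos: bool = False
) -> List[int]:
    """Append or strip a trailing EOS symbol, returning a new list."""
    result = list(tokens)
    if append_eos and (not result or result[-1] != eos):
        result.append(eos)
    if remove_eos and result and result[-1] == eos:
        result.pop()
    return result


EOS = 2  # assumed index of the end-of-sentence symbol
appended = transform_eos([4, 5], EOS, append_eos=True)
stripped = transform_eos([4, 5, EOS], EOS, remove_eos=True)
```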
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch suitable for forwarding with a Model Return type: dict
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
ordered_indices
()[source]¶ Return an ordered list of indices. Batches will be constructed based on this order.
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
Dictionary
-
class
fairseq.data.
Dictionary
(pad='<pad>', eos='</s>', unk='<unk>', bos='<s>', extra_special_symbols=None)[source]¶ A mapping from symbols to consecutive integers
-
add_from_file
(f)[source]¶ Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
-
finalize
(threshold=-1, nwords=-1, padding_factor=8)[source]¶ Sort symbols by frequency in descending order, ignoring special ones.
Parameters: - threshold – defines the minimum word count
- nwords – defines the total number of words in the final dictionary, including special symbols
- padding_factor – can be used to pad the dictionary size to be a multiple of 8, which is important on some hardware (e.g., Nvidia Tensor Cores).
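How the three parameters interact can be sketched as follows. This is a standalone approximation, not the library's code: `finalize_vocab`, the order of the special symbols, the alphabetical tie-break, and the `<filler…>` names used for padding are all assumptions.

```python
from collections import Counter
from typing import Dict, List


def finalize_vocab(
    counts: Dict[str, int],
    specials: List[str],
    threshold: int = -1,
    nwords: int = -1,
    padding_factor: int = 8,
) -> List[str]:
    """Return the final symbol list: specials, frequent words, then filler."""
    # Sort by descending frequency; ties are broken alphabetically here.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [symbol for symbol, count in ranked if count >= threshold]
    if nwords >= 0:
        # nwords counts the special symbols too.
        kept = kept[: max(nwords - len(specials), 0)]
    symbols = list(specials) + kept
    if padding_factor > 1:
        filler = 0
        while len(symbols) % padding_factor != 0:
            symbols.append(f"<filler{filler}>")  # hypothetical filler name
            filler += 1
    return symbols


counts = Counter("the cat sat on the mat the end".split())
specials = ["<s>", "<pad>", "</s>", "<unk>"]
vocab = finalize_vocab(counts, specials, threshold=2)
```

With a threshold of 2 only "the" survives, so the four special symbols plus one word are padded with three filler entries to reach a size of 8.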
-
classmethod
load
(f)[source]¶ Loads the dictionary from a text file with the format:
<symbol0> <count0>
<symbol1> <count1>
...
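A file in this format can be parsed with a few lines of standard-library code. The sketch below is not the library's loader: `load_symbol_counts` is a hypothetical helper that returns a plain dict, and it assumes each line holds a symbol, a single space, and an integer count.

```python
import io
from typing import Dict, TextIO


def load_symbol_counts(f: TextIO) -> Dict[str, int]:
    """Parse lines of the form '<symbol> <count>' into a dict."""
    counts: Dict[str, int] = {}
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last space so the count is always the final field.
        symbol, count = line.rsplit(" ", 1)
        counts[symbol] = int(count)
    return counts


text = "the 3\ncat 1\n"
counts = load_symbol_counts(io.StringIO(text))
```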
-
Iterators
-
class
fairseq.data.
CountingIterator
(iterable, start=0)[source]¶ Wrapper around an iterable that maintains the iteration count.
Parameters: iterable (iterable) – iterable to wrap
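The idea of an iterator that tracks its own position can be sketched in a few lines. `CountingSketch` is a hypothetical class written for illustration and the attribute name `count` is an assumption; it mirrors the (iterable, start) signature above but is not the library's implementation.

```python
from typing import Any, Iterable, Iterator


class CountingSketch:
    """Wrap an iterable and remember how many items have been consumed."""

    def __init__(self, iterable: Iterable[Any], start: int = 0):
        self.count = start  # items consumed so far, including any offset
        self._iterator = iter(iterable)

    def __iter__(self) -> Iterator[Any]:
        return self

    def __next__(self) -> Any:
        item = next(self._iterator)  # StopIteration propagates unchanged
        self.count += 1
        return item


letters = CountingSketch("abc")
first = next(letters)
second = next(letters)
```

After two calls to `next`, `letters.count` is 2, which is the kind of bookkeeping needed to resume an interrupted epoch at the right batch.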
-
class
fairseq.data.
EpochBatchIterator
(dataset, collate_fn, batch_sampler, seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=0)[source]¶ A multi-epoch iterator over a torch.utils.data.Dataset.
Compared to torch.utils.data.DataLoader, this iterator:
- can be reused across multiple epochs with the next_epoch_itr() method (optionally shuffled between epochs)
- can be serialized/deserialized with the state_dict() and load_state_dict() methods
- supports sharding with the num_shards and shard_id arguments
Parameters: - dataset (Dataset) – dataset from which to load the data
- collate_fn (callable) – merges a list of samples to form a mini-batch
- batch_sampler (Sampler) – an iterator over batches of indices
- seed (int, optional) – seed for random number generator for reproducibility (default: 1).
- num_shards (int, optional) – shard the data iterator into N shards (default: 1).
- shard_id (int, optional) – which shard of the data iterator to return (default: 0).
- num_workers (int, optional) – how many subprocesses to use for data loading. 0 means the data will be loaded in the main process (default: 0).
- epoch (int, optional) – the epoch to start the iterator from (default: 0).
-
iterations_in_epoch
¶ The number of consumed batches in the current epoch.
-
next_epoch_itr
(shuffle=True, fix_batches_to_gpus=False)[source]¶ Return a new iterator over the dataset.
Parameters: - shuffle (bool, optional) – shuffle batches before returning the iterator (default: True).
- fix_batches_to_gpus – ensure that batches are always allocated to the same shards across epochs. Requires that dataset supports prefetching (default: False).
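Per-epoch shuffling and sharding can be approximated with list slicing. The sketch below is an assumption-laden illustration: `epoch_batches` is a hypothetical function, the RNG is derived from a "seed:epoch" string, and shards are taken by striding, which may differ from what the library does (for example, it does not equalise shard lengths).

```python
import random
from typing import List


def epoch_batches(
    batches: List[List[int]],
    epoch: int,
    seed: int = 1,
    num_shards: int = 1,
    shard_id: int = 0,
    shuffle: bool = True,
) -> List[List[int]]:
    """Return the batches that one shard sees in the given epoch."""
    batches = list(batches)  # copy so the caller's order is untouched
    if shuffle:
        # Every shard derives the same permutation from (seed, epoch).
        random.Random(f"{seed}:{epoch}").shuffle(batches)
    return batches[shard_id::num_shards]


all_batches = [[0, 1], [2, 3], [4], [5, 6]]
shard0 = epoch_batches(all_batches, epoch=1, num_shards=2, shard_id=0, shuffle=False)
shard1 = epoch_batches(all_batches, epoch=1, num_shards=2, shard_id=1, shuffle=False)
shuffled0 = epoch_batches(all_batches, epoch=2, num_shards=2, shard_id=0)
shuffled1 = epoch_batches(all_batches, epoch=2, num_shards=2, shard_id=1)
```

Because all shards shuffle with the same seed and epoch, they agree on the permutation, and each batch lands in exactly one shard.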
-
class
fairseq.data.
GroupedIterator
(iterable, chunk_size)[source]¶ Wrapper around an iterable that returns groups (chunks) of items.
Parameters: - iterable (iterable) – iterable to wrap
- chunk_size (int) – size of each chunk
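Chunking an iterable is straightforward with `itertools.islice`. The generator below is a hypothetical stand-in written for illustration, not the library's class; a typical use of such grouping is to collect several mini-batches before a single optimizer step.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def grouped(iterable: Iterable[Any], chunk_size: int) -> Iterator[List[Any]]:
    """Yield consecutive chunks of at most chunk_size items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # source exhausted
        yield chunk


chunks = list(grouped(range(7), 3))
```

The final chunk may be shorter than `chunk_size` when the number of items is not a multiple of it.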