Data Loading and Utilities¶
Datasets¶
Datasets define the data format and provide helpers for creating mini-batches.
class fairseq.data.FairseqDataset[source]¶
A dataset that provides helpers for batching.
batch_by_size(indices, max_tokens=None, max_sentences=None, required_batch_size_multiple=1)[source]¶
Given an ordered set of indices, return batches according to max_tokens, max_sentences and required_batch_size_multiple.
collater(samples)[source]¶
Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate
Returns: a mini-batch suitable for forwarding with a Model
Return type: dict
filter_indices_by_size(indices, max_sizes)[source]¶
Filter a list of sample indices, removing those that are longer than specified in max_sizes.
WARNING: don’t modify this method; override it in child classes instead.
Parameters: indices (np.array) – original array of sample indices; max_sizes (int or list[int] or tuple[int]) – max sample size, which may be defined separately for src and tgt
Returns: the filtered sample array (np.array) and a list of removed indices
get_batch_shapes()[source]¶
Return a list of valid batch shapes, for example:
[(8, 512), (16, 256), (32, 128)]
The first dimension of each tuple is the batch size and can be None to automatically infer the max batch size based on --max-tokens. The second dimension of each tuple is the max supported length as given by fairseq.data.FairseqDataset.num_tokens().
This will be used by fairseq.data.FairseqDataset.batch_by_size() to restrict batch shapes. This is useful on TPUs to avoid too many dynamic shapes (and recompilations).
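For illustration, a subclass might pin batching to a small menu of shapes as in the sketch below (the class name and the particular shapes are arbitrary; only the relevant override is shown):

```python
from fairseq.data import FairseqDataset


class TpuFriendlyDataset(FairseqDataset):
    """Hypothetical subclass that limits batches to a few fixed shapes."""

    def get_batch_shapes(self):
        # Each tuple is (batch_size, max_length). A batch size of None lets
        # batch_by_size() infer it from --max-tokens; restricting the set of
        # shapes keeps TPU recompilations to a minimum.
        return [(None, 512), (16, 256), (32, 128)]
```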
num_tokens(index)[source]¶
Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.
ordered_indices()[source]¶
Return an ordered list of indices. Batches will be constructed based on this order.
size(index)[source]¶
Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.
supports_fetch_outside_dataloader¶
Whether this dataset supports fetching outside the workers of the dataloader.
supports_prefetch¶
Whether this dataset supports prefetching.
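To make the contract above concrete, here is a minimal, hypothetical FairseqDataset built over in-memory token tensors (a sketch only; real fairseq datasets wrap binarized/indexed data and produce richer batches):

```python
import torch
from fairseq.data import FairseqDataset


class TokenListDataset(FairseqDataset):
    """Hypothetical example: wraps a list of 1D LongTensors of token IDs."""

    def __init__(self, tensors, pad_idx):
        self.tensors = tensors
        self.pad_idx = pad_idx

    def __getitem__(self, index):
        return self.tensors[index]

    def __len__(self):
        return len(self.tensors)

    def num_tokens(self, index):
        # used to enforce --max-tokens during batching
        return self.tensors[index].numel()

    def size(self, index):
        # used when filtering with --max-positions
        return self.tensors[index].numel()

    def collater(self, samples):
        # right-pad every sample to the longest one in the mini-batch
        max_len = max(s.numel() for s in samples)
        batch = samples[0].new_full((len(samples), max_len), self.pad_idx)
        for i, s in enumerate(samples):
            batch[i, : s.numel()] = s
        return {
            "ntokens": sum(s.numel() for s in samples),
            "net_input": {"src_tokens": batch},
        }
```

Right-padding is used here only for simplicity; the bundled datasets below typically left-pad the source (see left_pad_source).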
class fairseq.data.LanguagePairDataset(src, src_sizes, src_dict, tgt=None, tgt_sizes=None, tgt_dict=None, left_pad_source=True, left_pad_target=False, shuffle=True, input_feeding=True, remove_eos_from_source=False, append_eos_to_target=False, align_dataset=None, constraints=None, append_bos=False, eos=None, num_buckets=0, src_lang_id=None, tgt_lang_id=None, pad_to_multiple=1)[source]¶
A pair of torch.utils.data.Datasets.
Parameters: - src (torch.utils.data.Dataset) – source dataset to wrap
- src_sizes (List[int]) – source sentence lengths
- src_dict (Dictionary) – source vocabulary
- tgt (torch.utils.data.Dataset, optional) – target dataset to wrap
- tgt_sizes (List[int], optional) – target sentence lengths
- tgt_dict (Dictionary, optional) – target vocabulary
- left_pad_source (bool, optional) – pad source tensors on the left side (default: True).
- left_pad_target (bool, optional) – pad target tensors on the left side (default: False).
- shuffle (bool, optional) – shuffle dataset elements before batching (default: True).
- input_feeding (bool, optional) – create a shifted version of the targets to be passed into the model for teacher forcing (default: True).
- remove_eos_from_source (bool, optional) – if set, removes eos from end of source if it’s present (default: False).
- append_eos_to_target (bool, optional) – if set, appends eos to end of target if it’s absent (default: False).
- align_dataset (torch.utils.data.Dataset, optional) – dataset containing alignments.
- constraints (Tensor, optional) – 2d tensor with a concatenated, zero- delimited list of constraints for each sentence.
- append_bos (bool, optional) – if set, appends bos to the beginning of source/target sentence.
- num_buckets (int, optional) – if set to a value greater than 0, then batches will be bucketed into the given number of batch shapes.
- src_lang_id (int, optional) – source language ID; if set, the collated batch will contain a field ‘src_lang_id’ in ‘net_input’ which indicates the source language of the samples.
- tgt_lang_id (int, optional) – target language ID; if set, the collated batch will contain a field ‘tgt_lang_id’ which indicates the target language of the samples.
collater(samples, pad_to_length=None)[source]¶
Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate; pad_to_length (dict, optional) – a dictionary of the form {'source': ..., 'target': ...} giving lengths to pad the source and target to
Returns: a mini-batch with the following keys:
- id (LongTensor): example IDs in the original input order
- ntokens (int): total number of tokens in the batch
- net_input (dict): the input to the Model, containing keys:
  - src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the left if left_pad_source is True.
  - src_lengths (LongTensor): 1D Tensor of the unpadded lengths of each source sentence of shape (bsz)
  - prev_output_tokens (LongTensor): a padded 2D Tensor of tokens in the target sentence, shifted right by one position for teacher forcing, of shape (bsz, tgt_len). This key will not be present if input_feeding is False. Padding will appear on the left if left_pad_target is True.
  - src_lang_id (LongTensor): a long Tensor containing the source language ID of each sample in the batch
- target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the left if left_pad_target is True.
- tgt_lang_id (LongTensor): a long Tensor containing the target language ID of each sample in the batch
Return type: dict
filter_indices_by_size(indices, max_sizes)[source]¶
Filter a list of sample indices, removing those that are longer than specified in max_sizes.
Parameters: indices (np.array) – original array of sample indices; max_sizes (int or list[int] or tuple[int]) – max sample size, which may be defined separately for src and tgt
Returns: the filtered sample array (np.array) and a list of removed indices
get_batch_shapes()[source]¶
Return a list of valid batch shapes, for example:
[(8, 512), (16, 256), (32, 128)]
The first dimension of each tuple is the batch size and can be None to automatically infer the max batch size based on --max-tokens. The second dimension of each tuple is the max supported length as given by fairseq.data.FairseqDataset.num_tokens().
This will be used by fairseq.data.FairseqDataset.batch_by_size() to restrict batch shapes. This is useful on TPUs to avoid too many dynamic shapes (and recompilations).
num_tokens(index)[source]¶
Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.
ordered_indices()[source]¶
Return an ordered list of indices. Batches will be constructed based on this order.
size(index)[source]¶
Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.
supports_prefetch¶
Whether this dataset supports prefetching.
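A small usage sketch (toy data only; the plain Python lists below stand in for fairseq’s binarized datasets, and the vocabularies are built on the fly with add_symbol()):

```python
import torch
from fairseq.data import Dictionary, LanguagePairDataset

src_dict, tgt_dict = Dictionary(), Dictionary()
src_ids = [src_dict.add_symbol(w) for w in ["guten", "tag"]]
tgt_ids = [tgt_dict.add_symbol(w) for w in ["good", "day"]]

# Any indexable dataset of 1D LongTensors can serve as src/tgt here.
src = [torch.LongTensor(src_ids + [src_dict.eos()]),
       torch.LongTensor([src_ids[0], src_dict.eos()])]
tgt = [torch.LongTensor(tgt_ids + [tgt_dict.eos()]),
       torch.LongTensor([tgt_ids[0], tgt_dict.eos()])]

pair_ds = LanguagePairDataset(
    src, [len(t) for t in src], src_dict,
    tgt=tgt, tgt_sizes=[len(t) for t in tgt], tgt_dict=tgt_dict,
)

# Collate two examples; the keys follow the collater() description above.
batch = pair_ds.collater([pair_ds[0], pair_ds[1]])
print(batch["net_input"]["src_tokens"].shape)  # (bsz, src_len)
```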
class fairseq.data.MonolingualDataset(dataset, sizes, src_vocab, tgt_vocab, add_eos_for_other_targets, shuffle, targets=None, add_bos_token=False)[source]¶
A wrapper around torch.utils.data.Dataset for monolingual data.
Parameters: - dataset (torch.utils.data.Dataset) – dataset to wrap
- sizes (List[int]) – sentence lengths
- vocab (Dictionary) – vocabulary
- shuffle (bool, optional) – shuffle the elements before batching (default: True).
collater(samples)[source]¶
Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate
Returns: a mini-batch with the following keys:
- id (LongTensor): example IDs in the original input order
- ntokens (int): total number of tokens in the batch
- net_input (dict): the input to the Model, containing keys:
  - src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the right.
- target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the right.
Return type: dict
num_tokens(index)[source]¶
Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.
ordered_indices()[source]¶
Return an ordered list of indices. Batches will be constructed based on this order.
size(index)[source]¶
Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.
supports_prefetch¶
Whether this dataset supports prefetching.
Helper Datasets¶
These datasets wrap other fairseq.data.FairseqDataset instances and provide additional functionality:
class fairseq.data.BacktranslationDataset(tgt_dataset, src_dict, tgt_dict=None, backtranslation_fn=None, output_collater=None, cuda=True, **kwargs)[source]¶
Sets up a backtranslation dataset which takes a tgt batch, generates a src using a tgt-src backtranslation function (backtranslation_fn), and returns the corresponding {generated src, input tgt} batch.
Parameters: - tgt_dataset (FairseqDataset) – the dataset to be backtranslated. Only the source side of this dataset will be used. After backtranslation, the source sentences in this dataset will be returned as the targets.
- src_dict (Dictionary) – the dictionary of backtranslated sentences.
- tgt_dict (Dictionary, optional) – the dictionary of sentences to be backtranslated.
- backtranslation_fn (callable, optional) – function to call to generate backtranslations. This is typically the generate() method of a SequenceGenerator object. Pass in None when it is not available at initialization time, and use set_backtranslation_fn() to set it when it becomes available.
- output_collater (callable, optional) – function to call on the backtranslated samples to create the final batch (default: tgt_dataset.collater).
- cuda – use GPU for generation
collater(samples)[source]¶
Merge and backtranslate a list of samples to form a mini-batch.
Using the samples from tgt_dataset, load a collated target sample to feed to the backtranslation model. Then take the backtranslation with the best score as the source and the original input as the target.
Note: we expect tgt_dataset to provide a collater() function that collates samples into the format expected by backtranslation_fn. After backtranslation, the new list of samples (i.e., the (backtranslated source, original source) pairs) is fed to output_collater and the result is returned.
Parameters: samples (List[dict]) – samples to backtranslate and collate
Returns: a mini-batch with keys coming from output_collater
Return type: dict
size(index)[source]¶
Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.
Note: we use tgt_dataset to approximate the length of the source sentence, since we do not know the actual length until after backtranslation.
supports_prefetch¶
Whether this dataset supports prefetching.
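For orientation, here is a hedged wiring sketch; tgt_only_dataset, src_dict, tgt_dict, generator and model are placeholders for objects built elsewhere (e.g. a FairseqDataset over monolingual target text and a trained tgt→src SequenceGenerator/model pair):

```python
from fairseq.data import BacktranslationDataset


def build_backtranslation_dataset(tgt_only_dataset, src_dict, tgt_dict, generator, model):
    """Hedged sketch: wrap a monolingual target-side dataset for online
    backtranslation; all arguments are assumed to be constructed elsewhere."""
    bt_ds = BacktranslationDataset(
        tgt_dataset=tgt_only_dataset,
        src_dict=src_dict,
        tgt_dict=tgt_dict,
        backtranslation_fn=None,  # the generator may not exist yet at construction time
        cuda=False,
    )
    # Attach the tgt->src generator once it is available, as described above.
    bt_ds.set_backtranslation_fn(
        lambda sample: generator.generate([model], sample)
    )
    return bt_ds
```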
class fairseq.data.ConcatDataset(datasets, sample_ratios=1)[source]¶
can_reuse_epoch_itr_across_epochs¶
Whether we can reuse the fairseq.data.EpochBatchIterator for this dataset across epochs.
This needs to return False if the sample sizes can change across epochs, in which case we may need to regenerate batches at each epoch. If your dataset relies on set_epoch(), then you should consider setting this to False.
collater(samples, **extra_args)[source]¶
Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate
Returns: a mini-batch suitable for forwarding with a Model
Return type: dict
num_tokens(index: int)[source]¶
Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.
supports_prefetch¶
Whether this dataset supports prefetching.
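A short usage sketch; primary and secondary are assumed to be existing FairseqDataset instances:

```python
from fairseq.data import ConcatDataset


def concat_with_upsampling(primary, secondary):
    """Hedged sketch: concatenate two FairseqDatasets, repeating `secondary`
    3x per epoch so the smaller corpus is better represented."""
    return ConcatDataset([primary, secondary], sample_ratios=[1, 3])
```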
class fairseq.data.ResamplingDataset(dataset, weights=None, replace=True, size_ratio=1.0, batch_by_size=True, seed=0, epoch=1)[source]¶
Randomly samples from a given dataset at each epoch.
Sampling is done with or without replacement, depending on the “replace” parameter.
Optionally, the epoch size can be rescaled. This is potentially desirable to increase per-epoch coverage of the base dataset (since sampling with replacement means that many items in the dataset will be left out). In the case of sampling without replacement, size_ratio should be strictly less than 1.
Parameters: - dataset (Dataset) – dataset on which to sample.
- weights (List[float]) – list of probability weights (default: None, which corresponds to uniform sampling).
- replace (bool) – sampling mode; True for “with replacement”, or False for “without replacement” (default: True)
- size_ratio (float) – the ratio to subsample to; must be positive (default: 1.0).
- batch_by_size (bool) – whether or not to batch by sequence length (default: True).
- seed (int) – RNG seed to use (default: 0).
- epoch (int) – starting epoch number (default: 1).
can_reuse_epoch_itr_across_epochs¶
Whether we can reuse the fairseq.data.EpochBatchIterator for this dataset across epochs.
This needs to return False if the sample sizes can change across epochs, in which case we may need to regenerate batches at each epoch. If your dataset relies on set_epoch(), then you should consider setting this to False.
num_tokens(index)[source]¶
Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.
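A usage sketch; dataset is assumed to be an existing FairseqDataset:

```python
from fairseq.data import ResamplingDataset


def oversample(dataset, ratio=2.0, seed=0, epoch=1):
    """Hedged sketch: draw ratio * len(dataset) samples with replacement,
    re-drawn each epoch (set_epoch() advances the per-epoch RNG)."""
    return ResamplingDataset(
        dataset,
        replace=True,
        size_ratio=ratio,
        seed=seed,
        epoch=epoch,
    )
```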
class fairseq.data.RoundRobinZipDatasets(datasets, eval_key=None)[source]¶
Zip multiple FairseqDataset instances together.
Shorter datasets are repeated in a round-robin fashion to match the length of the longest one.
Parameters: - datasets (Dict[FairseqDataset]) – a dictionary of FairseqDataset instances.
- eval_key (str, optional) – a key used at evaluation time that causes this instance to pass-through batches from datasets[eval_key].
size(index)[source]¶
Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.
supports_prefetch¶
Whether this dataset supports prefetching.
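A multilingual-style sketch; de_en and fr_en are assumed to be existing LanguagePairDataset instances:

```python
from collections import OrderedDict

from fairseq.data import RoundRobinZipDatasets


def zip_language_pairs(de_en, fr_en):
    """Hedged sketch: zip two language-pair datasets; the shorter one is
    repeated round-robin so every index yields one sample per pair."""
    return RoundRobinZipDatasets(
        OrderedDict([("de-en", de_en), ("fr-en", fr_en)]),
        eval_key=None,  # set to e.g. "de-en" at eval time to pass batches through unchanged
    )
```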
class fairseq.data.TransformEosDataset(dataset, eos, append_eos_to_src=False, remove_eos_from_src=False, append_eos_to_tgt=False, remove_eos_from_tgt=False, has_target=True)[source]¶
A FairseqDataset wrapper that appends/prepends/strips EOS.
Note that the transformation is applied in collater().
Parameters: - dataset (FairseqDataset) – dataset to wrap
- eos (int) – index of the end-of-sentence symbol
- append_eos_to_src (bool, optional) – append EOS to the end of src
- remove_eos_from_src (bool, optional) – remove EOS from the end of src
- append_eos_to_tgt (bool, optional) – append EOS to the end of tgt
- remove_eos_from_tgt (bool, optional) – remove EOS from the end of tgt
collater(samples)[source]¶
Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate
Returns: a mini-batch suitable for forwarding with a Model
Return type: dict
num_tokens(index)[source]¶
Return the number of tokens in a sample. This value is used to enforce --max-tokens during batching.
ordered_indices()[source]¶
Return an ordered list of indices. Batches will be constructed based on this order.
size(index)[source]¶
Return an example’s size as a float or tuple. This value is used when filtering a dataset with --max-positions.
supports_prefetch¶
Whether this dataset supports prefetching.
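A usage sketch; dataset and eos_idx are assumed to come from an existing FairseqDataset and its Dictionary:

```python
from fairseq.data import TransformEosDataset


def strip_source_eos(dataset, eos_idx):
    """Hedged sketch: wrap a FairseqDataset so that collater() strips the
    EOS symbol from the end of every source sentence."""
    return TransformEosDataset(
        dataset,
        eos=eos_idx,
        remove_eos_from_src=True,
    )
```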
Dictionary¶
class fairseq.data.Dictionary(*, bos='<s>', pad='<pad>', eos='</s>', unk='<unk>', extra_special_symbols=None)[source]¶
A mapping from symbols to consecutive integers.
add_from_file(f)[source]¶
Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
finalize(threshold=-1, nwords=-1, padding_factor=8)[source]¶
Sort symbols by frequency in descending order, ignoring special ones.
Parameters: - threshold – defines the minimum word count
- nwords – defines the total number of words in the final dictionary, including special symbols
- padding_factor – can be used to pad the dictionary size to be a multiple of 8, which is important on some hardware (e.g., Nvidia Tensor Cores).
classmethod load(f)[source]¶
Loads the dictionary from a text file with one symbol per line, in the format: `<symbol0> <count0>`, `<symbol1> <count1>`, ...
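A minimal round-trip sketch (the output path data-bin/dict.txt is arbitrary):

```python
from fairseq.data import Dictionary

d = Dictionary()  # starts with the default specials: <s>, <pad>, </s>, <unk>
for tok in "the cat sat on the mat".split():
    d.add_symbol(tok)

# Sort by frequency and pad the size to a multiple of 8 (see finalize() above).
d.finalize(padding_factor=8)

d.save("data-bin/dict.txt")            # one "<symbol> <count>" pair per line
d2 = Dictionary.load("data-bin/dict.txt")
assert d2.index("cat") == d.index("cat")
```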
Iterators¶
class fairseq.data.CountingIterator(iterable, start=None, total=None)[source]¶
Wrapper around an iterable that maintains the iteration count.
Parameters: - iterable (iterable) – iterable to wrap
- start (int, optional) – starting iteration count
- total (int, optional) – override the iterator length returned by len(); can be used to truncate the iterator.
class fairseq.data.EpochBatchIterator(dataset, collate_fn, batch_sampler, seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=1, buffer_size=0, timeout=0, disable_shuffling=False)[source]¶
A multi-epoch iterator over a torch.utils.data.Dataset.
Compared to torch.utils.data.DataLoader, this iterator:
- can be reused across multiple epochs with the next_epoch_itr() method (optionally shuffled between epochs)
- can be serialized/deserialized with the state_dict() and load_state_dict() methods
- supports sharding with the num_shards and shard_id arguments
Parameters: - dataset (Dataset) – dataset from which to load the data
- collate_fn (callable) – merges a list of samples to form a mini-batch
- batch_sampler (Sampler or callable) – an iterator over batches of indices, or a callable that creates such an iterator. A callable batch_sampler is invoked once per epoch, enabling per-epoch dynamic batch iterators.
- seed (int, optional) – seed for random number generator for reproducibility (default: 1).
- num_shards (int, optional) – shard the data iterator into N shards (default: 1).
- shard_id (int, optional) – which shard of the data iterator to return (default: 0).
- num_workers (int, optional) – how many subprocesses to use for data loading. 0 means the data will be loaded in the main process (default: 0).
- epoch (int, optional) – the epoch to start the iterator from (default: 1).
- buffer_size (int, optional) – the number of batches to keep ready in the queue. This helps speed up data loading. When buffer_size is zero, the default torch.utils.data.DataLoader preloading is used.
- timeout (int, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative (default: 0).
- disable_shuffling (bool, optional) – force disable shuffling (default: False).
iterations_in_epoch¶
The number of consumed batches in the current epoch.
next_epoch_idx¶
Return the epoch index after next_epoch_itr is called.
next_epoch_itr(shuffle=True, fix_batches_to_gpus=False)[source]¶
Return a new iterator over the dataset.
Parameters: - shuffle (bool, optional) – shuffle batches before returning the iterator (default: True).
- fix_batches_to_gpus – ensure that batches are always allocated to the same shards across epochs. Requires that dataset supports prefetching (default: False).
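A training-loop sketch tying the pieces together; dataset is assumed to be any FairseqDataset (see above) and criterion_step stands in for the forward/backward/update step:

```python
from fairseq.data import EpochBatchIterator


def train(dataset, criterion_step, max_tokens=4096, num_epochs=2):
    """Hedged sketch: batch a FairseqDataset by token count and iterate it
    for a few epochs with EpochBatchIterator."""
    # Pre-compute batches of indices, capped at max_tokens tokens per batch.
    batches = dataset.batch_by_size(
        dataset.ordered_indices(), max_tokens=max_tokens
    )
    epoch_itr = EpochBatchIterator(
        dataset=dataset,
        collate_fn=dataset.collater,
        batch_sampler=batches,
        seed=1,
        num_workers=0,
    )
    for _ in range(num_epochs):
        itr = epoch_itr.next_epoch_itr(shuffle=True)
        for batch in itr:
            criterion_step(batch)
    # state_dict() / load_state_dict() allow resuming from a checkpoint.
    return epoch_itr.state_dict()
```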
class fairseq.data.GroupedIterator(iterable, chunk_size)[source]¶
Wrapper around an iterable that returns groups (chunks) of items.
Parameters: - iterable (iterable) – iterable to wrap
- chunk_size (int) – size of each chunk
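A short usage sketch; wrapping the input in a CountingIterator guarantees it exposes both iteration and a length, which GroupedIterator relies on:

```python
from fairseq.data import CountingIterator, GroupedIterator

# Chunk an iterable into groups of 4, e.g. to accumulate gradients over
# several mini-batches per optimizer step.
itr = CountingIterator(list(range(10)))
for chunk in GroupedIterator(itr, chunk_size=4):
    print(chunk)  # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]
```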