Data Loading and Utilities¶
Datasets¶
Datasets define the data format and provide helpers for creating mini-batches.
-
class
fairseq.data.
FairseqDataset
[source]¶ A dataset that provides helpers for batching.
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch suitable for forwarding with a Model Return type: dict
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
ordered_indices
()[source]¶ Return an ordered list of indices. Batches will be constructed based on this order.
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
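As a rough illustration of how ordered_indices() and num_tokens() can drive batching under a --max-tokens budget, here is a minimal pure-Python sketch. The helper batch_by_max_tokens and the toy lengths list are hypothetical and are not part of fairseq; the sketch assumes a padded batch costs (number of samples) × (longest sample), which may differ from the library's own batching logic.

```python
from typing import Callable, List, Sequence


def batch_by_max_tokens(
    indices: Sequence[int],
    num_tokens: Callable[[int], int],
    max_tokens: int,
) -> List[List[int]]:
    """Group indices, in the given order, so each padded batch fits max_tokens."""
    batches: List[List[int]] = []
    current: List[int] = []
    longest = 0  # longest sample in the current batch
    for idx in indices:
        n = num_tokens(idx)
        if n > max_tokens:
            raise ValueError(f"sample {idx} has {n} tokens, above max_tokens={max_tokens}")
        new_longest = max(longest, n)
        # A padded batch costs (number of samples) * (longest sample).
        if current and (len(current) + 1) * new_longest > max_tokens:
            batches.append(current)
            current, new_longest = [], n
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


# Toy data: these lengths stand in for FairseqDataset.num_tokens(index).
lengths = [5, 2, 9, 3, 7]
# Stand-in for ordered_indices(): here, indices sorted by length.
order = sorted(range(len(lengths)), key=lambda i: lengths[i])
batches = batch_by_max_tokens(order, lambda i: lengths[i], max_tokens=12)
print(batches)
```

Sorting by length before batching keeps samples of similar size together, which reduces the padding each batch needs.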
-
-
class
fairseq.data.
LanguagePairDataset
(src, src_sizes, src_dict, tgt=None, tgt_sizes=None, tgt_dict=None, left_pad_source=True, left_pad_target=False, max_source_positions=1024, max_target_positions=1024, shuffle=True, input_feeding=True, remove_eos_from_source=False, append_eos_to_target=False)[source]¶ A pair of torch.utils.data.Datasets.
Parameters: - src (torch.utils.data.Dataset) – source dataset to wrap
- src_sizes (List[int]) – source sentence lengths
- src_dict (Dictionary) – source vocabulary
- tgt (torch.utils.data.Dataset, optional) – target dataset to wrap
- tgt_sizes (List[int], optional) – target sentence lengths
- tgt_dict (Dictionary, optional) – target vocabulary
- left_pad_source (bool, optional) – pad source tensors on the left side (default: True).
- left_pad_target (bool, optional) – pad target tensors on the left side (default: False).
- max_source_positions (int, optional) – max number of tokens in the source sentence (default: 1024).
- max_target_positions (int, optional) – max number of tokens in the target sentence (default: 1024).
- shuffle (bool, optional) – shuffle dataset elements before batching (default: True).
- input_feeding (bool, optional) – create a shifted version of the targets to be passed into the model for input feeding/teacher forcing (default: True).
- remove_eos_from_source (bool, optional) – if set, removes eos from end of source if it’s present (default: False).
- append_eos_to_target (bool, optional) – if set, appends eos to end of target if it’s absent (default: False).
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch with the following keys: - id (LongTensor): example IDs in the original input order
- ntokens (int): total number of tokens in the batch
- net_input (dict): the input to the Model, containing keys:
  - src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the left if left_pad_source is True.
  - src_lengths (LongTensor): 1D Tensor of the unpadded lengths of each source sentence of shape (bsz)
  - prev_output_tokens (LongTensor): a padded 2D Tensor of tokens in the target sentence, shifted right by one position for input feeding/teacher forcing, of shape (bsz, tgt_len). This key will not be present if input_feeding is False. Padding will appear on the left if left_pad_target is True.
- target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the left if left_pad_target is True.
Return type: dict
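The layout of the mini-batch described above can be sketched with plain Python lists instead of tensors. The helper pad_rows and the PAD/EOS index values are hypothetical. The sketch assumes that "shifted right by one position" means moving the trailing EOS to the front of each target, and that ntokens counts target tokens; neither detail is stated in the text above.

```python
from typing import List

PAD, EOS = 1, 2  # hypothetical indices for the pad and eos symbols


def pad_rows(
    seqs: List[List[int]],
    left_pad: bool,
    move_eos_to_beginning: bool = False,
) -> List[List[int]]:
    """Pad every sequence to the length of the longest one."""
    width = max(len(s) for s in seqs)
    rows = []
    for s in seqs:
        if move_eos_to_beginning:
            if s[-1] != EOS:
                raise ValueError("expected each sequence to end with EOS")
            # Shift right by one: the trailing EOS becomes the first token.
            s = [EOS] + s[:-1]
        padding = [PAD] * (width - len(s))
        rows.append(padding + s if left_pad else s + padding)
    return rows


src = [[4, 5, 6, EOS], [7, EOS]]
tgt = [[8, 9, EOS], [10, 11, 12, EOS]]

batch = {
    "id": [0, 1],
    "ntokens": sum(len(t) for t in tgt),  # assumption: counts target tokens
    "net_input": {
        "src_tokens": pad_rows(src, left_pad=True),  # left_pad_source=True
        "src_lengths": [len(s) for s in src],
        "prev_output_tokens": pad_rows(tgt, left_pad=False, move_eos_to_beginning=True),
    },
    "target": pad_rows(tgt, left_pad=False),  # left_pad_target=False
}
print(batch["net_input"]["src_tokens"])
print(batch["net_input"]["prev_output_tokens"])
```

With the default settings, source padding lands on the left and target padding on the right, so prev_output_tokens and target stay aligned position by position.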
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
ordered_indices
()[source]¶ Return an ordered list of indices. Batches will be constructed based on this order.
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
-
class
fairseq.data.
MonolingualDataset
(dataset, sizes, src_vocab, tgt_vocab, add_eos_for_other_targets, shuffle, targets=None, add_bos_token=False)[source]¶ A wrapper around torch.utils.data.Dataset for monolingual data.
Parameters: - dataset (torch.utils.data.Dataset) – dataset to wrap
- sizes (List[int]) – sentence lengths
- vocab (Dictionary) – vocabulary
- shuffle (bool, optional) – shuffle the elements before batching (default: True).
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch with the following keys: - id (LongTensor): example IDs in the original input order
- ntokens (int): total number of tokens in the batch
- net_input (dict): the input to the Model, containing keys:
- src_tokens (LongTensor): a padded 2D Tensor of tokens in the source sentence of shape (bsz, src_len). Padding will appear on the right.
- target (LongTensor): a padded 2D Tensor of tokens in the target sentence of shape (bsz, tgt_len). Padding will appear on the right.
Return type: dict
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
ordered_indices
()[source]¶ Return an ordered list of indices. Batches will be constructed based on this order.
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
Helper Datasets
These datasets wrap other fairseq.data.FairseqDataset
instances and
provide additional functionality:
-
class
fairseq.data.
BacktranslationDataset
(tgt_dataset, src_dict, tgt_dict=None, backtranslation_fn=None, output_collater=None, cuda=True, **kwargs)[source]¶ Sets up a backtranslation dataset which takes a tgt batch, generates a src using a tgt-src backtranslation function (backtranslation_fn), and returns the corresponding {generated src, input tgt} batch.
Parameters: - tgt_dataset (FairseqDataset) – the dataset to be backtranslated. Only the source side of this dataset will be used. After backtranslation, the source sentences in this dataset will be returned as the targets.
- src_dict (Dictionary) – the dictionary of backtranslated sentences.
- tgt_dict (Dictionary, optional) – the dictionary of sentences to be backtranslated.
- backtranslation_fn (callable, optional) – function to call to generate backtranslations. This is typically the generate method of a SequenceGenerator object. Pass in None when it is not available at initialization time, and use set_backtranslation_fn function to set it when available.
- output_collater (callable, optional) – function to call on the backtranslated samples to create the final batch (default: tgt_dataset.collater).
- cuda – use GPU for generation
-
collater
(samples)[source]¶ Merge and backtranslate a list of samples to form a mini-batch.
Using the samples from tgt_dataset, load a collated target sample to feed to the backtranslation model. Then take the backtranslation with the best score as the source and the original input as the target.
Note: we expect tgt_dataset to provide a function collater() that will collate samples into the format expected by backtranslation_fn. After backtranslation, we will feed the new list of samples (i.e., the (backtranslated source, original source) pairs) to output_collater and return the result.
Parameters: samples (List[dict]) – samples to backtranslate and collate Returns: a mini-batch with keys coming from output_collater Return type: dict
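The collate, backtranslate, re-collate flow described above can be sketched in plain Python. Everything here is a hypothetical stand-in: the function backtranslate_and_collate, the toy collaters, and the toy backtranslation function (which just reverses tokens). The sketch assumes backtranslation_fn returns, for each input, a list of hypotheses sorted best-first, each a dict with a "tokens" entry.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]


def backtranslate_and_collate(
    samples: List[Sample],
    collate_fn: Callable[[List[Sample]], dict],
    backtranslation_fn: Callable[[dict], List[List[dict]]],
    output_collater: Callable[[List[Sample]], dict],
) -> dict:
    # 1. Collate the target-side samples into the format the generator expects.
    collated = collate_fn(samples)
    # 2. Generate hypotheses; assumed to be sorted best-first for each input.
    hypos = backtranslation_fn(collated)
    # 3. Best hypothesis becomes the new source, the original input the new target.
    pairs = [
        {"id": s["id"], "source": h[0]["tokens"], "target": s["source"]}
        for s, h in zip(samples, hypos)
    ]
    # 4. Build the final mini-batch.
    return output_collater(pairs)


# Toy stand-ins so the sketch runs without a trained model.
def toy_collate(xs):
    return {"src": [x["source"] for x in xs]}


def toy_backtranslate(b):
    return [[{"tokens": list(reversed(s))}] for s in b["src"]]


def toy_output_collater(pairs):
    return {
        "source": [p["source"] for p in pairs],
        "target": [p["target"] for p in pairs],
    }


samples = [{"id": 0, "source": [1, 2, 3]}, {"id": 1, "source": [4, 5]}]
result = backtranslate_and_collate(samples, toy_collate, toy_backtranslate, toy_output_collater)
print(result)
```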
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
. Note: we use tgt_dataset to approximate the length of the source sentence, since we do not know the actual length until after backtranslation.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
-
class
fairseq.data.
ConcatDataset
(datasets, sample_ratios=1)[source]¶
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch suitable for forwarding with a Model Return type: dict
-
num_tokens
(index: int)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
-
-
class
fairseq.data.
RoundRobinZipDatasets
(datasets, eval_key=None)[source]¶ Zip multiple FairseqDataset instances together. Shorter datasets are repeated in a round-robin fashion to match the length of the longest one.
Parameters: - datasets (Dict[FairseqDataset]) – a dictionary of FairseqDataset instances.
- eval_key (str, optional) – a key used at evaluation time that causes this instance to pass through batches from datasets[eval_key].
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
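The round-robin repetition performed by RoundRobinZipDatasets can be sketched with plain lists. The function round_robin_zip and the dictionary keys are hypothetical. The sketch assumes index i of the zipped dataset maps to index i modulo the length of each member, and it ignores any per-dataset ordering the library may apply.

```python
from typing import Dict, List


def round_robin_zip(datasets: Dict[str, List]) -> List[Dict[str, object]]:
    """Zip datasets of unequal length, wrapping the shorter ones around."""
    longest = max(len(d) for d in datasets.values())
    # Index i maps to i % len(d), so shorter datasets repeat from the start.
    return [
        {key: d[i % len(d)] for key, d in datasets.items()}
        for i in range(longest)
    ]


zipped = round_robin_zip({"en-de": ["a", "b", "c"], "en-fr": ["x", "y"]})
print(zipped)
```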
-
class
fairseq.data.
TransformEosDataset
(dataset, eos, append_eos_to_src=False, remove_eos_from_src=False, append_eos_to_tgt=False, remove_eos_from_tgt=False, has_target=True)[source]¶ A FairseqDataset wrapper that appends/prepends/strips EOS. Note that the transformation is applied in collater().
Parameters: - dataset (FairseqDataset) – dataset to wrap
- eos (int) – index of the end-of-sentence symbol
- append_eos_to_src (bool, optional) – append EOS to the end of src
- remove_eos_from_src (bool, optional) – remove EOS from the end of src
- append_eos_to_tgt (bool, optional) – append EOS to the end of tgt
- remove_eos_from_tgt (bool, optional) – remove EOS from the end of tgt
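The effect of the append and remove flags on a single sample can be sketched as follows. The helper transform_eos and the EOS index are hypothetical, and the error handling is an assumption of this sketch (it raises when asked to remove an EOS that is not there); the wrapper itself may behave differently.

```python
from typing import List


def transform_eos(
    tokens: List[int],
    eos: int,
    append_eos: bool = False,
    remove_eos: bool = False,
) -> List[int]:
    """Return a copy of tokens with a trailing EOS appended or removed."""
    if append_eos and remove_eos:
        raise ValueError("choose at most one of append_eos / remove_eos")
    if append_eos:
        return tokens + [eos]
    if remove_eos:
        if not tokens or tokens[-1] != eos:
            raise ValueError("expected a trailing EOS to remove")
        return tokens[:-1]
    return list(tokens)


EOS = 2  # hypothetical index of the end-of-sentence symbol
sample = {"source": [4, 5, EOS], "target": [6, 7]}
transformed = {
    # Corresponds to remove_eos_from_src=True and append_eos_to_tgt=True.
    "source": transform_eos(sample["source"], EOS, remove_eos=True),
    "target": transform_eos(sample["target"], EOS, append_eos=True),
}
print(transformed)
```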
-
collater
(samples)[source]¶ Merge a list of samples to form a mini-batch.
Parameters: samples (List[dict]) – samples to collate Returns: a mini-batch suitable for forwarding with a Model Return type: dict
-
num_tokens
(index)[source]¶ Return the number of tokens in a sample. This value is used to enforce
--max-tokens
during batching.
-
ordered_indices
()[source]¶ Return an ordered list of indices. Batches will be constructed based on this order.
-
size
(index)[source]¶ Return an example’s size as a float or tuple. This value is used when filtering a dataset with
--max-positions
.
-
supports_prefetch
¶ Whether this dataset supports prefetching.
Dictionary¶
-
class
fairseq.data.
Dictionary
(pad='<pad>', eos='</s>', unk='<unk>', bos='<s>')[source]¶ A mapping from symbols to consecutive integers.
-
finalize
(threshold=-1, nwords=-1, padding_factor=8)[source]¶ Sort symbols by frequency in descending order, ignoring special ones.
Parameters: - threshold – defines the minimum word count
- nwords – defines the total number of words in the final dictionary, including special symbols
- padding_factor – can be used to pad the dictionary size to be a multiple of 8, which is important on some hardware (e.g., Nvidia Tensor Cores).
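How the three parameters interact can be sketched over a plain Counter. The function finalize_sketch, the order of the special symbols, and the filler symbol names are all hypothetical; the sketch only illustrates sorting by frequency, applying the threshold and size cap, and padding the size to a multiple of padding_factor.

```python
from collections import Counter
from typing import List

# Assumed order of the special symbols; they are never re-sorted.
SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]


def finalize_sketch(
    counts: Counter,
    threshold: int = -1,
    nwords: int = -1,
    padding_factor: int = 8,
) -> List[str]:
    if nwords <= 0:
        nwords = len(counts) + len(SPECIALS)
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [sym for sym, c in ranked if c >= threshold][: nwords - len(SPECIALS)]
    symbols = SPECIALS + kept
    # Add filler symbols until the size is a multiple of padding_factor.
    i = 0
    while padding_factor > 1 and len(symbols) % padding_factor != 0:
        symbols.append(f"filler{i:04d}")  # hypothetical filler name
        i += 1
    return symbols


vocab = finalize_sketch(Counter({"the": 10, "cat": 3, "sat": 3, "rare": 1}), threshold=2)
print(vocab)
```

Here "rare" falls below the threshold, leaving seven symbols, so one filler entry is added to reach a size of eight.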
-
classmethod
load
(f, ignore_utf_errors=False)[source]¶ Loads the dictionary from a text file with the format:
<symbol0> <count0>
<symbol1> <count1>
...
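A file in this format can be parsed with a few lines of standard-library Python. The function load_symbols is a hypothetical sketch, not the classmethod itself; it assumes one symbol and count per line, separated by the last space on the line.

```python
import io
from typing import List, TextIO, Tuple


def load_symbols(f: TextIO) -> List[Tuple[str, int]]:
    """Parse '<symbol> <count>' lines into (symbol, count) pairs."""
    entries = []
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last space so the count is always the final field.
        symbol, count = line.rsplit(" ", 1)
        entries.append((symbol, int(count)))
    return entries


entries = load_symbols(io.StringIO("the 10\ncat 3\nsat 3\n"))
print(entries)
```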
-
Iterators¶
-
class
fairseq.data.
CountingIterator
(iterable, start=0)[source]¶ Wrapper around an iterable that maintains the iteration count.
Parameters: iterable (iterable) – iterable to wrap
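A counting wrapper of this kind can be sketched in a few lines. The class name CountingIteratorSketch and its count attribute are assumptions for illustration; in this sketch, start only initializes the counter and does not skip any items.

```python
class CountingIteratorSketch:
    """Wrap an iterable and track how many items have been consumed."""

    def __init__(self, iterable, start=0):
        self.count = start  # number of items consumed so far
        self._it = iter(iterable)

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._it)  # StopIteration propagates at the end
        self.count += 1
        return item


it = CountingIteratorSketch(["a", "b", "c"])
first, second = next(it), next(it)
consumed_midway = it.count
remaining = list(it)
print(consumed_midway, remaining, it.count)
```

Tracking the count makes it possible to record how far into an epoch training has progressed, for example when saving a checkpoint.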
-
class
fairseq.data.
EpochBatchIterator
(dataset, collate_fn, batch_sampler, seed=1, num_shards=1, shard_id=0, num_workers=0, epoch=0)[source]¶ A multi-epoch iterator over a torch.utils.data.Dataset. Compared to torch.utils.data.DataLoader, this iterator:
- can be reused across multiple epochs with the next_epoch_itr() method (optionally shuffled between epochs)
- can be serialized/deserialized with the state_dict() and load_state_dict() methods
- supports sharding with the num_shards and shard_id arguments
Parameters: - dataset (Dataset) – dataset from which to load the data
- collate_fn (callable) – merges a list of samples to form a mini-batch
- batch_sampler (Sampler) – an iterator over batches of indices
- seed (int, optional) – seed for random number generator for reproducibility (default: 1).
- num_shards (int, optional) – shard the data iterator into N shards (default: 1).
- shard_id (int, optional) – which shard of the data iterator to return (default: 0).
- num_workers (int, optional) – how many subprocesses to use for data loading. 0 means the data will be loaded in the main process (default: 0).
- epoch (int, optional) – the epoch to start the iterator from (default: 0).
-
iterations_in_epoch
¶ The number of consumed batches in the current epoch.
-
next_epoch_itr
(shuffle=True, fix_batches_to_gpus=False)[source]¶ Return a new iterator over the dataset.
Parameters: - shuffle (bool, optional) – shuffle batches before returning the iterator (default: True).
- fix_batches_to_gpus (bool, optional) – ensure that batches are always allocated to the same shards across epochs. Requires that dataset supports prefetching (default: False).
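The combination of per-epoch shuffling and sharding can be sketched over a plain list of batches. The function epoch_batches is hypothetical. It assumes the shuffle is seeded with seed + epoch so that each epoch gets a different but reproducible order, and that shard i takes every num_shards-th batch; unlike a real sharded iterator, it does not pad uneven shards.

```python
import random
from typing import List


def epoch_batches(
    batches: List[List[int]],
    seed: int,
    epoch: int,
    num_shards: int = 1,
    shard_id: int = 0,
    shuffle: bool = True,
) -> List[List[int]]:
    """Return the batches one shard would see in a given epoch."""
    batches = list(batches)  # copy so the caller's list is untouched
    if shuffle:
        # Assumption: seed + epoch gives a reproducible order per epoch.
        random.Random(seed + epoch).shuffle(batches)
    # Shard i takes every num_shards-th batch, starting at position i.
    return batches[shard_id::num_shards]


all_batches = [[0, 1], [2], [3, 4], [5]]
unshuffled = epoch_batches(all_batches, seed=1, epoch=1, num_shards=2, shard_id=0, shuffle=False)
shard0 = epoch_batches(all_batches, seed=1, epoch=1, num_shards=2, shard_id=0)
shard1 = epoch_batches(all_batches, seed=1, epoch=1, num_shards=2, shard_id=1)
print(unshuffled, shard0, shard1)
```

Because every shard shuffles with the same seed, the shards stay disjoint and together cover all batches.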
-
class
fairseq.data.
GroupedIterator
(iterable, chunk_size)[source]¶ Wrapper around an iterable that returns groups (chunks) of items.
Parameters: - iterable (iterable) – iterable to wrap
- chunk_size (int) – size of each chunk
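Chunking an iterable can be sketched with a small generator. The function grouped is a hypothetical stand-in for the class; it assumes the final chunk may be shorter than chunk_size when the items do not divide evenly.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def grouped(iterable: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to chunk_size items."""
    chunk: List[T] = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter chunk


chunks = list(grouped(range(7), 3))
print(chunks)
```

Grouping batches this way is useful when several mini-batches are processed together, for example to accumulate gradients over a chunk before an update.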