Command-line Tools
Fairseq provides several command-line tools for training and evaluating models:
- fairseq-preprocess: Data pre-processing: build vocabularies and binarize training data
- fairseq-train: Train a new model on one or multiple GPUs
- fairseq-generate: Translate pre-processed data with a trained model
- fairseq-interactive: Translate raw text with a trained model
- fairseq-score: BLEU scoring of generated translations against reference translations
- fairseq-eval-lm: Language model evaluation
fairseq-preprocess
Data pre-processing: build vocabularies and binarize training data.
usage: fairseq-preprocess [-h] [--no-progress-bar] [--log-interval N]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir DIR] [--tbmf-wrapper]
[--seed N] [--cpu] [--fp16]
[--memory-efficient-fp16]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale D]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--criterion {adaptive_loss,label_smoothed_cross_entropy,composite_loss,masked_lm_loss,binary_cross_entropy,cross_entropy}]
[--optimizer {adadelta,adam,adafactor,adagrad,nag,lamb,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,inverse_sqrt}]
[--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
[--validpref FP] [--testpref FP] [--destdir DIR]
[--thresholdtgt N] [--thresholdsrc N] [--tgtdict FP]
[--srcdict FP] [--nwordstgt N] [--nwordssrc N]
[--alignfile ALIGN] [--dataset-impl FORMAT]
[--joined-dictionary] [--only-source]
[--padding-factor N] [--workers N]
Named Arguments
| Argument | Description | Default |
|---|---|---|
| --no-progress-bar | disable progress bar | False |
| --log-interval | log progress every N batches (when progress bar is disabled) | 1000 |
| --log-format | log format to use. Possible choices: json, none, simple, tqdm | |
| --tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) | "" |
| --tbmf-wrapper | [FB only] | False |
| --seed | pseudo random number generator seed | 1 |
| --cpu | use CPU instead of CUDA | False |
| --fp16 | use FP16 | False |
| --memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 | False |
| --fp16-init-scale | default FP16 loss scale | 128 |
| --fp16-scale-window | number of updates before increasing loss scale | |
| --fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale | 0.0 |
| --min-loss-scale | minimum FP16 loss scale, after which training is stopped | 0.0001 |
| --threshold-loss-scale | threshold FP16 loss scale from below | |
| --user-dir | path to a python module containing custom extensions (tasks and/or architectures) | |
| --criterion | Possible choices: adaptive_loss, label_smoothed_cross_entropy, composite_loss, masked_lm_loss, binary_cross_entropy, cross_entropy | "cross_entropy" |
| --optimizer | Possible choices: adadelta, adam, adafactor, adagrad, nag, lamb, sgd | "nag" |
| --lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, inverse_sqrt | "fixed" |
| --task | task. Possible choices: translation, translation_from_pretrained_xlm, multilingual_translation, semisupervised_translation, cross_lingual_lm, masked_lm, audio_pretraining, translation_moe, language_modeling | "translation" |
| --dataset-impl | output dataset implementation. Possible choices: raw, lazy, cached, mmap | "cached" |
Preprocessing
| Argument | Description | Default |
|---|---|---|
| -s, --source-lang | source language | |
| -t, --target-lang | target language | |
| --trainpref | train file prefix | |
| --validpref | comma separated, valid file prefixes | |
| --testpref | comma separated, test file prefixes | |
| --destdir | destination dir | "data-bin" |
| --thresholdtgt | map words appearing less than threshold times to unknown | 0 |
| --thresholdsrc | map words appearing less than threshold times to unknown | 0 |
| --tgtdict | reuse given target dictionary | |
| --srcdict | reuse given source dictionary | |
| --nwordstgt | number of target words to retain | -1 |
| --nwordssrc | number of source words to retain | -1 |
| --alignfile | an alignment file (optional) | |
| --joined-dictionary | Generate joined dictionary | False |
| --only-source | Only process the source language | False |
| --padding-factor | Pad dictionary size to be multiple of N | 8 |
| --workers | number of parallel workers | 1 |
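A minimal example invocation, as a sketch only: the data location, file prefixes and language codes below are hypothetical, and assume tokenized parallel files such as train.de/train.en, valid.de/valid.en and test.de/test.en.

```sh
# Hypothetical location of tokenized parallel data (train/valid/test .de and .en files)
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 4
```

This writes the source/target dictionaries and binarized splits into --destdir, which fairseq-train and the generation tools then read.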
fairseq-train
Train a new model on one or across multiple GPUs.
usage: fairseq-train [-h] [--no-progress-bar] [--log-interval N]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir DIR] [--tbmf-wrapper] [--seed N]
[--cpu] [--fp16] [--memory-efficient-fp16]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale D]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--criterion {adaptive_loss,label_smoothed_cross_entropy,composite_loss,masked_lm_loss,binary_cross_entropy,cross_entropy}]
[--optimizer {adadelta,adam,adafactor,adagrad,nag,lamb,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,inverse_sqrt}]
[--task TASK] [--num-workers N]
[--skip-invalid-size-inputs-valid-test] [--max-tokens N]
[--max-sentences N] [--required-batch-size-multiple N]
[--dataset-impl FORMAT] [--train-subset SPLIT]
[--valid-subset SPLIT] [--validate-interval N]
[--disable-validation] [--max-tokens-valid N]
[--max-sentences-valid N] [--curriculum N]
[--distributed-world-size N]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}] [--bucket-cap-mb MB]
[--fix-batches-to-gpus] [--find-unused-parameters] --arch
ARCH [--max-epoch N] [--max-update N] [--clip-norm NORM]
[--sentence-avg] [--update-freq N1,N2,...,N_K]
[--lr LR_1,LR_2,...,LR_N] [--min-lr LR] [--use-bmuf]
[--save-dir DIR] [--restore-file RESTORE_FILE]
[--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer]
[--optimizer-overrides DICT] [--save-interval N]
[--save-interval-updates N] [--keep-interval-updates N]
[--keep-last-epochs N] [--no-save]
[--no-epoch-checkpoints] [--no-last-checkpoints]
[--no-save-optimizer-state]
[--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
[--maximize-best-checkpoint-metric]
Named Arguments
| Argument | Description | Default |
|---|---|---|
| --no-progress-bar | disable progress bar | False |
| --log-interval | log progress every N batches (when progress bar is disabled) | 1000 |
| --log-format | log format to use. Possible choices: json, none, simple, tqdm | |
| --tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) | "" |
| --tbmf-wrapper | [FB only] | False |
| --seed | pseudo random number generator seed | 1 |
| --cpu | use CPU instead of CUDA | False |
| --fp16 | use FP16 | False |
| --memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 | False |
| --fp16-init-scale | default FP16 loss scale | 128 |
| --fp16-scale-window | number of updates before increasing loss scale | |
| --fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale | 0.0 |
| --min-loss-scale | minimum FP16 loss scale, after which training is stopped | 0.0001 |
| --threshold-loss-scale | threshold FP16 loss scale from below | |
| --user-dir | path to a python module containing custom extensions (tasks and/or architectures) | |
| --criterion | Possible choices: adaptive_loss, label_smoothed_cross_entropy, composite_loss, masked_lm_loss, binary_cross_entropy, cross_entropy | "cross_entropy" |
| --optimizer | Possible choices: adadelta, adam, adafactor, adagrad, nag, lamb, sgd | "nag" |
| --lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, inverse_sqrt | "fixed" |
| --task | task. Possible choices: translation, translation_from_pretrained_xlm, multilingual_translation, semisupervised_translation, cross_lingual_lm, masked_lm, audio_pretraining, translation_moe, language_modeling | "translation" |
| --dataset-impl | output dataset implementation. Possible choices: raw, lazy, cached, mmap | "cached" |
Dataset and data loading
| Argument | Description | Default |
|---|---|---|
| --num-workers | how many subprocesses to use for data loading | 0 |
| --skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set | False |
| --max-tokens | maximum number of tokens in a batch | |
| --max-sentences, --batch-size | maximum number of sentences in a batch | |
| --required-batch-size-multiple | batch size will be a multiple of this value | 8 |
| --train-subset | data subset to use for training (train, valid, test). Possible choices: train, valid, test | "train" |
| --valid-subset | comma separated list of data subsets to use for validation (train, valid, valid1, test, test1) | "valid" |
| --validate-interval | validate every N epochs | 1 |
| --disable-validation | disable validation | False |
| --max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) | |
| --max-sentences-valid | maximum number of sentences in a validation batch (defaults to --max-sentences) | |
| --curriculum | don't shuffle batches for first N epochs | 0 |
Distributed training
| Argument | Description | Default |
|---|---|---|
| --distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) | 1 |
| --distributed-rank | rank of the current worker | 0 |
| --distributed-backend | distributed backend | "nccl" |
| --distributed-init-method | typically tcp://hostname:port that will be used to establish the initial connection | |
| --distributed-port | port number (not required if using --distributed-init-method) | -1 |
| --device-id, --local_rank | which GPU to use (usually configured automatically) | 0 |
| --distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible | False |
| --ddp-backend | DistributedDataParallel backend. Possible choices: c10d, no_c10d | "c10d" |
| --bucket-cap-mb | bucket size for reduction | 25 |
| --fix-batches-to-gpus | don't shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data | False |
| --find-unused-parameters | disable unused parameter detection (not applicable to the no_c10d ddp-backend) | False |
Model configuration
| Argument | Description | Default |
|---|---|---|
| --arch, -a | Model Architecture. Possible choices: transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_from_pretrained_xlm, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, masked_lm, bert_base, bert_large, xlm_base, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, wav2vec, lightconv_lm, lightconv_lm_gbw, fconv_self_att, fconv_self_att_wp, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, multilingual_transformer, multilingual_transformer_iwslt_de_en | "fconv" |
Optimization
| Argument | Description | Default |
|---|---|---|
| --max-epoch, --me | force stop training at specified epoch | 0 |
| --max-update, --mu | force stop training at specified update | 0 |
| --clip-norm | clip threshold of gradients | 25 |
| --sentence-avg | normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens) | False |
| --update-freq | update parameters every N_i batches, when in epoch i | 1 |
| --lr, --learning-rate | learning rate for the first N epochs; all epochs >N using LR_N (note: this may be interpreted differently depending on --lr-scheduler) | 0.25 |
| --min-lr | stop training when the learning rate reaches this minimum | -1 |
| --use-bmuf | specify global optimizer for syncing models on different GPUs/Shards | False |
Checkpointing
| Argument | Description | Default |
|---|---|---|
| --save-dir | path to save checkpoints | "checkpoints" |
| --restore-file | filename in save-dir from which to load checkpoint | "checkpoint_last.pt" |
| --reset-dataloader | if set, does not reload dataloader state from the checkpoint | False |
| --reset-lr-scheduler | if set, does not load lr scheduler state from the checkpoint | False |
| --reset-meters | if set, does not load meters from the checkpoint | False |
| --reset-optimizer | if set, does not load optimizer state from the checkpoint | False |
| --optimizer-overrides | a dictionary used to override optimizer args when loading a checkpoint | "{}" |
| --save-interval | save a checkpoint every N epochs | 1 |
| --save-interval-updates | save a checkpoint (and validate) every N updates | 0 |
| --keep-interval-updates | keep the last N checkpoints saved with --save-interval-updates | -1 |
| --keep-last-epochs | keep last N epoch checkpoints | -1 |
| --no-save | don't save models or checkpoints | False |
| --no-epoch-checkpoints | only store last and best checkpoints | False |
| --no-last-checkpoints | don't store last checkpoints | False |
| --no-save-optimizer-state | don't save optimizer-state as part of checkpoint | False |
| --best-checkpoint-metric | metric to use for saving "best" checkpoints | "loss" |
| --maximize-best-checkpoint-metric | select the largest metric value for saving "best" checkpoints | False |
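A hedged example of a training run on the binarized data produced above. fairseq-train reads the preprocessed data directory as its positional argument; the architecture and hyper-parameter values below are purely illustrative, not recommendations.

```sh
# Illustrative values only; the data-bin path matches the fairseq-preprocess example above
fairseq-train data-bin/iwslt14.tokenized.de-en \
    --arch fconv_iwslt_de_en \
    --lr 0.25 --clip-norm 0.1 --max-tokens 4000 \
    --max-epoch 50 \
    --save-dir checkpoints/fconv
```

Checkpoints (last, best and per-epoch, subject to the --no-*-checkpoints flags above) are written to --save-dir.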
fairseq-generate
Translate pre-processed data with a trained model.
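Its options are not listed on this page; it shares most of its generation options with fairseq-interactive below (see the Generation group there). A hedged sketch, with hypothetical checkpoint and data paths:

```sh
# Hypothetical paths; flags mirror the Generation options documented under fairseq-interactive
fairseq-generate data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/fconv/checkpoint_best.pt \
    --gen-subset test \
    --beam 5 --remove-bpe
```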
fairseq-interactive
Translate raw text with a trained model. Batches data on-the-fly.
usage: fairseq-interactive [-h] [--no-progress-bar] [--log-interval N]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir DIR] [--tbmf-wrapper]
[--seed N] [--cpu] [--fp16]
[--memory-efficient-fp16]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale D]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--criterion {adaptive_loss,label_smoothed_cross_entropy,composite_loss,masked_lm_loss,binary_cross_entropy,cross_entropy}]
[--optimizer {adadelta,adam,adafactor,adagrad,nag,lamb,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,inverse_sqrt}]
[--task TASK] [--num-workers N]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens N] [--max-sentences N]
[--required-batch-size-multiple N]
[--dataset-impl FORMAT] [--gen-subset SPLIT]
[--num-shards N] [--shard-id ID] [--path FILE]
[--remove-bpe [REMOVE_BPE]] [--quiet]
[--model-overrides DICT] [--results-path RESDIR]
[--beam N] [--nbest N] [--max-len-a N]
[--max-len-b N] [--min-len N] [--match-source-len]
[--no-early-stop] [--unnormalized]
[--no-beamable-mm] [--lenpen LENPEN]
[--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
[--sacrebleu] [--score-reference]
[--prefix-size PS] [--no-repeat-ngram-size N]
[--sampling] [--sampling-topk PS]
[--sampling-topp PS] [--temperature N]
[--diverse-beam-groups N]
[--diverse-beam-strength N] [--print-alignment]
[--buffer-size N] [--input FILE]
Named Arguments
| Argument | Description | Default |
|---|---|---|
| --no-progress-bar | disable progress bar | False |
| --log-interval | log progress every N batches (when progress bar is disabled) | 1000 |
| --log-format | log format to use. Possible choices: json, none, simple, tqdm | |
| --tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) | "" |
| --tbmf-wrapper | [FB only] | False |
| --seed | pseudo random number generator seed | 1 |
| --cpu | use CPU instead of CUDA | False |
| --fp16 | use FP16 | False |
| --memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 | False |
| --fp16-init-scale | default FP16 loss scale | 128 |
| --fp16-scale-window | number of updates before increasing loss scale | |
| --fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale | 0.0 |
| --min-loss-scale | minimum FP16 loss scale, after which training is stopped | 0.0001 |
| --threshold-loss-scale | threshold FP16 loss scale from below | |
| --user-dir | path to a python module containing custom extensions (tasks and/or architectures) | |
| --criterion | Possible choices: adaptive_loss, label_smoothed_cross_entropy, composite_loss, masked_lm_loss, binary_cross_entropy, cross_entropy | "cross_entropy" |
| --optimizer | Possible choices: adadelta, adam, adafactor, adagrad, nag, lamb, sgd | "nag" |
| --lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, inverse_sqrt | "fixed" |
| --task | task. Possible choices: translation, translation_from_pretrained_xlm, multilingual_translation, semisupervised_translation, cross_lingual_lm, masked_lm, audio_pretraining, translation_moe, language_modeling | "translation" |
| --dataset-impl | output dataset implementation. Possible choices: raw, lazy, cached, mmap | "cached" |
Dataset and data loading
| Argument | Description | Default |
|---|---|---|
| --num-workers | how many subprocesses to use for data loading | 0 |
| --skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set | False |
| --max-tokens | maximum number of tokens in a batch | |
| --max-sentences, --batch-size | maximum number of sentences in a batch | |
| --required-batch-size-multiple | batch size will be a multiple of this value | 8 |
| --gen-subset | data subset to generate (train, valid, test) | "test" |
| --num-shards | shard generation over N shards | 1 |
| --shard-id | id of the shard to generate (id < num_shards) | 0 |
Generation
| Argument | Description | Default |
|---|---|---|
| --path | path(s) to model file(s), colon separated | |
| --remove-bpe | remove BPE tokens before scoring (can be set to sentencepiece) | |
| --quiet | only print final scores | False |
| --model-overrides | a dictionary used to override model args at generation that were used during model training | "{}" |
| --results-path | path to save eval results (optional) | |
| --beam | beam size | 5 |
| --nbest | number of hypotheses to output | 1 |
| --max-len-a | generate sequences of maximum length ax + b, where x is the source length | 0 |
| --max-len-b | generate sequences of maximum length ax + b, where x is the source length | 200 |
| --min-len | minimum generation length | 1 |
| --match-source-len | generations should match the source length | False |
| --no-early-stop | continue searching even after finalizing k=beam hypotheses; this is more correct, but increases generation time by 50% | False |
| --unnormalized | compare unnormalized hypothesis scores | False |
| --no-beamable-mm | don't use BeamableMM in attention layers | False |
| --lenpen | length penalty: <1.0 favors shorter, >1.0 favors longer sentences | 1 |
| --unkpen | unknown word penalty: <0 produces more unks, >0 produces fewer | 0 |
| --replace-unk | perform unknown replacement (optionally with alignment dictionary) | |
| --sacrebleu | score with sacrebleu | False |
| --score-reference | just score the reference translation | False |
| --prefix-size | initialize generation by target prefix of given length | 0 |
| --no-repeat-ngram-size | ngram blocking such that this size ngram cannot be repeated in the generation | 0 |
| --sampling | sample hypotheses instead of using beam search | False |
| --sampling-topk | sample from top K likely next words instead of all words | -1 |
| --sampling-topp | sample from the smallest set whose cumulative probability mass exceeds p for next words | -1.0 |
| --temperature | temperature for generation | 1.0 |
| --diverse-beam-groups | number of groups for Diverse Beam Search | -1 |
| --diverse-beam-strength | strength of diversity penalty for Diverse Beam Search | 0.5 |
| --print-alignment | if set, uses attention feedback to compute and print alignment to source tokens | False |
Interactive
| Argument | Description | Default |
|---|---|---|
| --buffer-size | read this many sentences into a buffer before processing them | 0 |
| --input | file to read from; use - for stdin | "-" |
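Because --input defaults to stdin, raw sentences can be piped in directly. A minimal sketch with hypothetical paths; the binarized data directory, which provides the dictionaries, is again the positional argument.

```sh
# Hypothetical checkpoint and data paths; reads one sentence per line from stdin
echo "Hallo Welt ." | fairseq-interactive data-bin/iwslt14.tokenized.de-en \
    --path checkpoints/fconv/checkpoint_best.pt \
    --beam 5
```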
fairseq-score
BLEU scoring of generated translations against reference translations.
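Its arguments are not documented on this page; as an assumption based on the fairseq examples, a typical invocation compares a file of system hypotheses against a reference file:

```sh
# Assumed flags (--sys/--ref); gen.out.sys and gen.out.ref are hypothetical file names
fairseq-score --sys gen.out.sys --ref gen.out.ref
```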
fairseq-eval-lm
Evaluate the perplexity of a trained language model.
usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval N]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir DIR] [--tbmf-wrapper] [--seed N]
[--cpu] [--fp16] [--memory-efficient-fp16]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale D]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--criterion {adaptive_loss,label_smoothed_cross_entropy,composite_loss,masked_lm_loss,binary_cross_entropy,cross_entropy}]
[--optimizer {adadelta,adam,adafactor,adagrad,nag,lamb,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,inverse_sqrt}]
[--task TASK] [--num-workers N]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens N] [--max-sentences N]
[--required-batch-size-multiple N]
[--dataset-impl FORMAT] [--gen-subset SPLIT]
[--num-shards N] [--shard-id ID] [--path FILE]
[--remove-bpe [REMOVE_BPE]] [--quiet]
[--model-overrides DICT] [--results-path RESDIR]
[--output-word-probs] [--output-word-stats]
[--context-window N] [--softmax-batch N]
Named Arguments
| Argument | Description | Default |
|---|---|---|
| --no-progress-bar | disable progress bar | False |
| --log-interval | log progress every N batches (when progress bar is disabled) | 1000 |
| --log-format | log format to use. Possible choices: json, none, simple, tqdm | |
| --tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) | "" |
| --tbmf-wrapper | [FB only] | False |
| --seed | pseudo random number generator seed | 1 |
| --cpu | use CPU instead of CUDA | False |
| --fp16 | use FP16 | False |
| --memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 | False |
| --fp16-init-scale | default FP16 loss scale | 128 |
| --fp16-scale-window | number of updates before increasing loss scale | |
| --fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale | 0.0 |
| --min-loss-scale | minimum FP16 loss scale, after which training is stopped | 0.0001 |
| --threshold-loss-scale | threshold FP16 loss scale from below | |
| --user-dir | path to a python module containing custom extensions (tasks and/or architectures) | |
| --criterion | Possible choices: adaptive_loss, label_smoothed_cross_entropy, composite_loss, masked_lm_loss, binary_cross_entropy, cross_entropy | "cross_entropy" |
| --optimizer | Possible choices: adadelta, adam, adafactor, adagrad, nag, lamb, sgd | "nag" |
| --lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, inverse_sqrt | "fixed" |
| --task | task. Possible choices: translation, translation_from_pretrained_xlm, multilingual_translation, semisupervised_translation, cross_lingual_lm, masked_lm, audio_pretraining, translation_moe, language_modeling | "language_modeling" |
| --dataset-impl | output dataset implementation. Possible choices: raw, lazy, cached, mmap | "cached" |
Dataset and data loading
| Argument | Description | Default |
|---|---|---|
| --num-workers | how many subprocesses to use for data loading | 0 |
| --skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set | False |
| --max-tokens | maximum number of tokens in a batch | |
| --max-sentences, --batch-size | maximum number of sentences in a batch | |
| --required-batch-size-multiple | batch size will be a multiple of this value | 8 |
| --gen-subset | data subset to generate (train, valid, test) | "test" |
| --num-shards | shard generation over N shards | 1 |
| --shard-id | id of the shard to generate (id < num_shards) | 0 |
LM Evaluation
| Argument | Description | Default |
|---|---|---|
| --path | path(s) to model file(s), colon separated | |
| --remove-bpe | remove BPE tokens before scoring (can be set to sentencepiece) | |
| --quiet | only print final scores | False |
| --model-overrides | a dictionary used to override model args at generation that were used during model training | "{}" |
| --results-path | path to save eval results (optional) | |
| --output-word-probs | if set, outputs words and their predicted log probabilities to standard output | False |
| --output-word-stats | if set, outputs word statistics such as word count, average probability, etc | False |
| --context-window | ensures that every evaluated token has access to a context of at least this size, if possible | 0 |
| --softmax-batch | if BxT is more than this, will batch the softmax over vocab to this amount of tokens in order to fit into GPU memory | 9223372036854775807 |
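A hedged sketch with hypothetical paths, scoring the test split of a binarized language-modeling dataset (passed as the positional argument) with a trained checkpoint:

```sh
# Hypothetical checkpoint and data paths; reports perplexity on the selected subset
fairseq-eval-lm data-bin/wikitext-103 \
    --path checkpoints/transformer_lm/checkpoint_best.pt \
    --gen-subset test \
    --max-tokens 2048
```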