Command-line Tools

Fairseq provides several command-line tools for training and evaluating models:


fairseq-preprocess

Data pre-processing: build vocabularies and binarize training data.

usage: fairseq-preprocess [-h] [--no-progress-bar]
                          [--log-interval LOG_INTERVAL]
                          [--log-format {json,none,simple,tqdm}]
                          [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                          [--aim-run-hash AIM_RUN_HASH]
                          [--tensorboard-logdir TENSORBOARD_LOGDIR]
                          [--wandb-project WANDB_PROJECT] [--azureml-logging]
                          [--seed SEED] [--cpu] [--tpu] [--bf16]
                          [--memory-efficient-bf16] [--fp16]
                          [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                          [--fp16-init-scale FP16_INIT_SCALE]
                          [--fp16-scale-window FP16_SCALE_WINDOW]
                          [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                          [--min-loss-scale MIN_LOSS_SCALE]
                          [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                          [--amp] [--amp-batch-retries AMP_BATCH_RETRIES]
                          [--amp-init-scale AMP_INIT_SCALE]
                          [--amp-scale-window AMP_SCALE_WINDOW]
                          [--user-dir USER_DIR]
                          [--empty-cache-freq EMPTY_CACHE_FREQ]
                          [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                          [--model-parallel-size MODEL_PARALLEL_SIZE]
                          [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                          [--profile] [--reset-logging] [--suppress-crashes]
                          [--use-plasma-view] [--plasma-path PLASMA_PATH]
                          [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                          [--tokenizer {moses,nltk,space}]
                          [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                          [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                          [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                          [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                          [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
                          [--validpref FP] [--testpref FP] [--align-suffix FP]
                          [--destdir DIR] [--thresholdtgt N]
                          [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
                          [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
                          [--dataset-impl FORMAT] [--joined-dictionary]
                          [--only-source] [--padding-factor N] [--workers N]

Named Arguments


--no-progress-bar disable progress bar

Default: False

--log-interval log progress every N batches (when progress bar is disabled)

Default: 100

--log-format Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging

--azureml-logging Log scalars to AzureML context

Default: False

--seed pseudo random number generator seed

Default: 1

--cpu use CPU instead of CUDA

Default: False

--tpu use TPU instead of CUDA

Default: False

--bf16 use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16 use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16 use FP16

Default: False

--memory-efficient-fp16 use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale

--fp16-scale-tolerance percentage of updates that can overflow before decreasing the loss scale

Default: 0.0


--on-cpu-convert-precision if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False


--min-loss-scale minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below

--amp use automatic mixed precision

Default: False

--amp-batch-retries number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)

--empty-cache-freq how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file

--profile enable autograd profiler emit_nvtx

Default: False

--reset-logging when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view Store indices and sizes in shared memory

Default: False

--plasma-path path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”


--criterion Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd

--lr-scheduler Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

Default: “translation”

--dataset-impl Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

Default: “mmap”


-s, --source-lang source language
-t, --target-lang target language
--trainpref train file prefix (also used to build dictionaries)
--validpref comma separated, valid file prefixes (words missing from train set are replaced with <unk>)
--testpref comma separated, test file prefixes (words missing from train set are replaced with <unk>)
--align-suffix alignment file suffix

--destdir destination dir

Default: “data-bin”

--thresholdtgt map target words appearing fewer than threshold times to unknown

Default: 0

--thresholdsrc map source words appearing fewer than threshold times to unknown

Default: 0

--tgtdict reuse given target dictionary
--srcdict reuse given source dictionary

--nwordstgt number of target words to retain

Default: -1

--nwordssrc number of source words to retain

Default: -1
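
How a frequency threshold and a word limit shape a vocabulary can be sketched in plain Python. This is an illustration only, not fairseq's dictionary code: `build_vocab` is a hypothetical helper, and it assumes the threshold is applied before the word limit, that ties are broken alphabetically, and that special symbols are ignored.

```python
from collections import Counter


def build_vocab(tokens, threshold=0, nwords=-1):
    """Return the words to keep, most frequent first (hypothetical helper).

    threshold: words seen fewer than this many times are dropped, so they
               would later be mapped to the unknown symbol.
    nwords:    maximum number of words to retain; -1 keeps all of them.
    """
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, count in ranked if count >= threshold]
    if nwords >= 0:
        kept = kept[:nwords]
    return kept


corpus = "the cat sat on the mat the cat".split()
print(build_vocab(corpus, threshold=2))  # only words seen at least twice
print(build_vocab(corpus, nwords=1))     # only the single most frequent word
```

With the defaults (threshold 0 and a limit of -1) every word in the corpus is kept.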

--alignfile an alignment file (optional)

--joined-dictionary Generate joined dictionary

Default: False

--only-source Only process the source language

Default: False

--padding-factor Pad dictionary size to be multiple of N

Default: 8
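
Padding a size to a multiple of N is simple rounding-up arithmetic. The sketch below shows that arithmetic only; `pad_vocab_size` is a hypothetical name, and how fairseq counts special symbols or names its filler entries is not covered here.

```python
def pad_vocab_size(size, padding_factor=8):
    """Round size up to the next multiple of padding_factor (hypothetical helper)."""
    if padding_factor <= 1:
        return size  # nothing to pad
    remainder = size % padding_factor
    return size if remainder == 0 else size + padding_factor - remainder


print(pad_vocab_size(10001))  # rounded up to the next multiple of 8
print(pad_vocab_size(10008))  # already a multiple of 8, unchanged
```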


--workers number of parallel workers

Default: 1


--dict-only if true, only builds a dictionary and then exits

Default: False
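
As a rough illustration of how the options above combine, the following Python sketch assembles a fairseq-preprocess invocation for a translation dataset. It uses only flags documented in this section. The helper name `preprocess_command`, the `corpus` directory, and the assumption that the data lives in files named `train.<lang>`, `valid.<lang>` and `test.<lang>` are hypothetical.

```python
import shlex


def preprocess_command(src, tgt, prefix, destdir="data-bin", workers=1):
    """Build a fairseq-preprocess argument list (hypothetical helper).

    prefix is assumed to be a directory holding train.<lang>,
    valid.<lang> and test.<lang> for both languages.
    """
    return [
        "fairseq-preprocess",
        "--source-lang", src,
        "--target-lang", tgt,
        "--trainpref", f"{prefix}/train",
        "--validpref", f"{prefix}/valid",
        "--testpref", f"{prefix}/test",
        "--destdir", destdir,
        "--workers", str(workers),
    ]


cmd = preprocess_command("de", "en", "corpus", destdir="data-bin/demo", workers=4)
# Print a shell-ready command line; cmd could also be passed to subprocess.run.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.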


fairseq-train

Train a new model on one or across multiple GPUs.

usage: fairseq-train [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                     [--log-format {json,none,simple,tqdm}]
                     [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                     [--aim-run-hash AIM_RUN_HASH]
                     [--tensorboard-logdir TENSORBOARD_LOGDIR]
                     [--wandb-project WANDB_PROJECT] [--azureml-logging]
                     [--seed SEED] [--cpu] [--tpu] [--bf16]
                     [--memory-efficient-bf16] [--fp16]
                     [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                     [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW]
                     [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                     [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                     [--amp-batch-retries AMP_BATCH_RETRIES]
                     [--amp-init-scale AMP_INIT_SCALE]
                     [--amp-scale-window AMP_SCALE_WINDOW]
                     [--user-dir USER_DIR]
                     [--empty-cache-freq EMPTY_CACHE_FREQ]
                     [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                     [--profile] [--reset-logging] [--suppress-crashes]
                     [--use-plasma-view] [--plasma-path PLASMA_PATH]
                     [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                     [--tokenizer {moses,nltk,space}]
                     [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                     [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                     [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                     [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                     [--task TASK] [--num-workers NUM_WORKERS]
                     [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                     [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                     [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                     [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                     [--data-buffer-size DATA_BUFFER_SIZE]
                     [--train-subset TRAIN_SUBSET]
                     [--valid-subset VALID_SUBSET] [--combine-valid-subsets]
                     [--validate-interval VALIDATE_INTERVAL]
                     [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                     [--validate-after-updates VALIDATE_AFTER_UPDATES]
                     [--fixed-validation-seed FIXED_VALIDATION_SEED]
                     [--max-tokens-valid MAX_TOKENS_VALID]
                     [--batch-size-valid BATCH_SIZE_VALID]
                     [--max-valid-steps MAX_VALID_STEPS]
                     [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                     [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                     [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                     [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                     [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                     [--distributed-rank DISTRIBUTED_RANK]
                     [--distributed-backend DISTRIBUTED_BACKEND]
                     [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                     [--distributed-port DISTRIBUTED_PORT]
                     [--device-id DEVICE_ID] [--distributed-no-spawn]
                     [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                     [--ddp-comm-hook {none,fp16}]
                     [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                     [--find-unused-parameters] [--gradient-as-bucket-view]
                     [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                     [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM]
                     [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                     [--localsgd-frequency LOCALSGD_FREQUENCY]
                     [--nprocs-per-node NPROCS_PER_NODE]
                     [--pipeline-balance PIPELINE_BALANCE]
                     [--pipeline-devices PIPELINE_DEVICES]
                     [--pipeline-chunks PIPELINE_CHUNKS]
                     [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                     [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                     [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                     [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                     [--pipeline-checkpoint {always,never,except_last}]
                     [--zero-sharding {none,os}] [--no-reshard-after-forward]
                     [--fp32-reduce-scatter] [--cpu-offload]
                     [--use-sharded-state] [--not-fsdp-flatten-parameters]
                     [--arch ARCH] [--max-epoch MAX_EPOCH]
                     [--max-update MAX_UPDATE]
                     [--stop-time-hours STOP_TIME_HOURS]
                     [--clip-norm CLIP_NORM] [--sentence-avg]
                     [--update-freq UPDATE_FREQ] [--lr LR]
                     [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                     [--skip-remainder-batch] [--save-dir SAVE_DIR]
                     [--restore-file RESTORE_FILE]
                     [--continue-once CONTINUE_ONCE]
                     [--finetune-from-model FINETUNE_FROM_MODEL]
                     [--reset-dataloader] [--reset-lr-scheduler]
                     [--reset-meters] [--reset-optimizer]
                     [--optimizer-overrides OPTIMIZER_OVERRIDES]
                     [--save-interval SAVE_INTERVAL]
                     [--save-interval-updates SAVE_INTERVAL_UPDATES]
                     [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                     [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                     [--keep-last-epochs KEEP_LAST_EPOCHS]
                     [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                     [--no-save] [--no-epoch-checkpoints]
                     [--no-last-checkpoints] [--no-save-optimizer-state]
                     [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                     [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                     [--checkpoint-suffix CHECKPOINT_SUFFIX]
                     [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--write-checkpoints-asynchronously] [--store-ema]
                     [--ema-decay EMA_DECAY]
                     [--ema-start-update EMA_START_UPDATE]
                     [--ema-seed-model EMA_SEED_MODEL]
                     [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]

Named Arguments


--no-progress-bar disable progress bar

Default: False

--log-interval log progress every N batches (when progress bar is disabled)

Default: 100

--log-format Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging

--azureml-logging Log scalars to AzureML context

Default: False

--seed pseudo random number generator seed

Default: 1

--cpu use CPU instead of CUDA

Default: False

--tpu use TPU instead of CUDA

Default: False

--bf16 use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16 use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16 use FP16

Default: False

--memory-efficient-fp16 use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale

--fp16-scale-tolerance percentage of updates that can overflow before decreasing the loss scale

Default: 0.0


--on-cpu-convert-precision if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False


--min-loss-scale minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
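
The loss-scale options above (initial scale, scale window, overflow tolerance, minimum scale) describe a dynamic loss scaling scheme. The sketch below shows the general technique in plain Python so the interaction between the options is easier to follow. It is not fairseq's implementation: the class name, the factor of 2, and the exact bookkeeping are assumptions.

```python
class LossScalerSketch:
    """Simplified dynamic loss scaler (illustrative, hypothetical class)."""

    def __init__(self, init_scale=128.0, scale_window=2000,
                 tolerance=0.0, min_loss_scale=1e-4, scale_factor=2.0):
        self.scale = init_scale
        self.scale_window = scale_window
        self.tolerance = tolerance
        self.min_loss_scale = min_loss_scale
        self.scale_factor = scale_factor  # assumed growth/shrink factor
        self._iter = 0
        self._last_overflow_iter = 0
        self._last_rescale_iter = 0
        self._overflows_since_rescale = 0

    def update(self, overflow):
        """Record one training update and adjust the scale if needed."""
        self._iter += 1
        if overflow:
            self._overflows_since_rescale += 1
            iters = self._iter - self._last_rescale_iter
            # Shrink only once the overflow rate reaches the tolerance.
            if self._overflows_since_rescale / iters >= self.tolerance:
                self.scale /= self.scale_factor
                self._last_rescale_iter = self._iter
                self._overflows_since_rescale = 0
            self._last_overflow_iter = self._iter
            if self.scale < self.min_loss_scale:
                raise FloatingPointError("loss scale fell below the minimum")
        elif (self._iter - self._last_overflow_iter) % self.scale_window == 0:
            # A full window without overflow: try a larger scale.
            self.scale *= self.scale_factor
            self._last_rescale_iter = self._iter


scaler = LossScalerSketch(init_scale=128.0, scale_window=2)
for overflow in [False, False, True, False, False]:
    scaler.update(overflow)
    print(scaler.scale)
```

With a tolerance of 0.0 every overflow shrinks the scale immediately; a larger tolerance lets occasional overflows pass without a reduction.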

--amp use automatic mixed precision

Default: False

--amp-batch-retries number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)

--empty-cache-freq how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file

--profile enable autograd profiler emit_nvtx

Default: False

--reset-logging when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view Store indices and sizes in shared memory

Default: False

--plasma-path path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”


--criterion Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd

--lr-scheduler Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

Default: “translation”



--num-workers how many subprocesses to use for data loading

Default: 1


--skip-invalid-size-inputs-valid-test ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch

--required-batch-size-multiple batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation


--data-buffer-size Number of batches to preload

Default: 10

--train-subset data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)

--ignore-unused-valid-subsets do not raise error if valid subsets are ignored

Default: False


--validate-interval validate every N epochs

Default: 1

--validate-interval-updates validate every N updates

Default: 0

--validate-after-updates don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation

--disable-validation disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate

--curriculum don’t shuffle batches for first N epochs

Default: 0

--gen-subset data subset to generate (train, valid, test)

Default: “test”

--num-shards shard generation over N shards

Default: 1

--shard-id id of the shard to generate (id < num_shards)

Default: 0
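
The relationship between a shard count and a shard id can be pictured as a round-robin split. The sketch below assumes that interpretation; `shard` is a hypothetical helper, and fairseq's own iterator may differ in details such as padding shards to equal length.

```python
from itertools import islice


def shard(items, num_shards=1, shard_id=0):
    """Return every num_shards-th item, starting at offset shard_id."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    return list(islice(items, shard_id, None, num_shards))


batches = list(range(10))
for shard_id in range(3):
    # Each worker would process only its own slice of the batches.
    print(shard_id, shard(batches, num_shards=3, shard_id=shard_id))
```

Taken together, the shards cover every item exactly once, which is why the id must stay below the number of shards.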


--grouped-shuffling shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false; defaults to --grouped-shuffling

--update-ordered-indices-seed if true, increment the seed with the epoch when getting batch iterators; defaults to False.

Default: False



--distributed-world-size total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank rank of the current worker

Default: 0

--distributed-backend distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection

--distributed-port port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0


--distributed-no-spawn do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb bucket size for reduction

Default: 25

--fix-batches-to-gpus don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size.

Default: False


--fast-stat-sync [deprecated] this is now defined per Criterion

Default: False


--heartbeat-timeout kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs

--slowmo-base-algorithm Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of the ‘slowmo_base_algorithm’ parameter for more details

Default: “localsgd”

--localsgd-frequency Local SGD allreduce frequency

Default: 3

--nprocs-per-node number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1


--pipeline-model-parallel if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument

--pipeline-chunks microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument

--pipeline-checkpoint Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter reduce-scatter grads in FP32

Default: False

--cpu-offload offload FP32 params to CPU

Default: False

--use-sharded-state use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters do not flatten parameters for FSDP

Default: False

Model configuration

--arch, -a

Possible choices: transformer_tiny, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_from_pretrained_xlm, transformer_align, transformer_wmt_en_de_big_align, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, roberta, roberta_prenorm, roberta_base, roberta_large, xlm, roberta_enc_dec, xmod_base_13, xmod_base_30, xmod_base_60, xmod_base_75, xmod_base, xmod_large_prenorm, s2t_berard, s2t_berard_256_3_3, s2t_berard_512_3_2, s2t_berard_512_5_3, convtransformer, convtransformer_espnet, s2t_transformer, s2t_transformer_s, s2t_transformer_xs, s2t_transformer_sp, s2t_transformer_m, s2t_transformer_mp, s2t_transformer_l, s2t_transformer_lp, wav2vec, wav2vec2, wav2vec_ctc, wav2vec_seq2seq, xm_transformer, s2t_conformer, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, masked_lm, bert_base, bert_large, xlm_base, tacotron_2, tts_transformer, fastspeech2, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, lightconv_lm, lightconv_lm_gbw, lstm_lm, s2ut_transformer, s2ut_transformer_fisher, s2spect_transformer, s2spect_transformer_fisher, s2ut_conformer, hf_gpt2, hf_gpt2_medium, hf_gpt2_large, hf_gpt2_xl, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_tiny, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, transformer_lm_gpt2_big_wide, transformer_lm_gpt2_bigger, transformer_lm_gpt3_small, transformer_lm_gpt3_medium, transformer_lm_gpt3_large, transformer_lm_gpt3_xl, transformer_lm_gpt3_2_7, transformer_lm_gpt3_6_7, transformer_lm_gpt3_13, transformer_lm_gpt3_175, multilingual_transformer, multilingual_transformer_iwslt_de_en, bart_large, bart_base, mbart_large, 
mbart_base, mbart_base_wmt20, transformer_ulm, transformer_ulm_big, transformer_ulm_tiny, hubert, hubert_ctc, hubert_seq2seq, fconv_self_att, fconv_self_att_wp, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, nonautoregressive_transformer, nonautoregressive_transformer_wmt_en_de, nacrf_transformer, iterative_nonautoregressive_transformer, iterative_nonautoregressive_transformer_wmt_en_de, cmlm_transformer, cmlm_transformer_wmt_en_de, levenshtein_transformer, levenshtein_transformer_wmt_en_de, levenshtein_transformer_vaswani_wmt_en_de_big, levenshtein_transformer_wmt_en_de_big, insertion_transformer, dummy_model, model_parallel_roberta, model_parallel_roberta_v1, model_parallel_roberta_postnorm, model_parallel_roberta_base, model_parallel_roberta_large, transformer_iwslt_de_en_pipeline_parallel, transformer_wmt_en_de_big_pipeline_parallel, transformer_lm_megatron, transformer_lm_megatron_11b

model architecture



--max-epoch force stop training at specified epoch

Default: 0

--max-update force stop training at specified update

Default: 0

--stop-time-hours force stop training after specified cumulative time (if >0)

Default: 0

--clip-norm clip threshold of gradients

Default: 0.0

--sentence-avg normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens)

Default: False

--update-freq update parameters every N_i batches, when in epoch i

Default: 1
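
Updating parameters only every N batches means gradients from N batches are accumulated first, which enlarges the effective batch. The arithmetic can be sketched as below, assuming data-parallel training in which each GPU processes its own batch of at most --max-tokens tokens. `tokens_per_update` is a hypothetical helper, and the result is an upper bound because batches are rarely completely full.

```python
def tokens_per_update(max_tokens, num_gpus=1, update_freq=1):
    """Upper bound on the tokens contributing to one parameter update."""
    return max_tokens * num_gpus * update_freq


# Eight GPUs updating after every batch ...
print(tokens_per_update(4096, num_gpus=8, update_freq=1))
# ... see the same number of tokens per update as one GPU accumulating 8 batches.
print(tokens_per_update(4096, num_gpus=1, update_freq=8))
```

Under these assumptions, raising the update frequency is one way to approximate a multi-GPU batch size on fewer devices.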


--lr learning rate for the first N epochs; all epochs >N using LR_N (note: this may be interpreted differently depending on --lr-scheduler)

Default: 0.25
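
The "value for the first N epochs, then reuse the last one" rule described for the learning rate (and similarly for the per-epoch update frequency) amounts to a clamped list lookup. The sketch below illustrates only that rule; `value_for_epoch` is a hypothetical helper, and, as the note above says, a given learning rate scheduler may interpret the values differently.

```python
def value_for_epoch(values, epoch):
    """Pick the entry for a 1-based epoch, reusing the last entry afterwards."""
    if epoch < 1:
        raise ValueError("epoch is 1-based")
    return values[min(epoch, len(values)) - 1]


lrs = [0.25, 0.1, 0.05]
# Epochs 4 and 5 fall back to the last listed value.
print([value_for_epoch(lrs, epoch) for epoch in range(1, 6)])
```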


--stop-min-lr stop training when the learning rate reaches this minimum

Default: -1.0

--use-bmuf specify global optimizer for syncing models on different GPUs/shards

Default: False


--skip-remainder-batch if set, skip the last (partial) batch of each epoch in training (default is to include it).

Default: False



--save-dir path to save checkpoints

Default: “checkpoints”

--restore-file filename from which to load checkpoint (default: <save-dir>/

Default: “”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset

--reset-dataloader if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters if set, does not load meters from the checkpoint

Default: False

--reset-optimizer if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”

--save-interval save a checkpoint every N epochs

Default: 1

--save-interval-updates save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates keep the last N checkpoints saved with --save-interval-updates

Default: -1

--keep-interval-updates-pattern when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1
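
The interaction between keeping the last N interval checkpoints and the modulo pattern can be worked through on a small example. The function below is a hypothetical illustration of the rule as described above, not fairseq's checkpoint code; treating non-positive values as "keep everything" is an assumption.

```python
def checkpoints_to_delete(updates, keep_interval_updates=-1, pattern=-1):
    """Return the update numbers whose checkpoints would be removed.

    updates: update numbers of the interval checkpoints saved so far.
    """
    if keep_interval_updates <= 0:
        return []  # assumption: non-positive values keep everything
    newest_first = sorted(updates, reverse=True)
    candidates = newest_first[keep_interval_updates:]
    if pattern > 0:
        # Checkpoints at multiples of the pattern are never deleted.
        candidates = [u for u in candidates if u % pattern != 0]
    return sorted(candidates)


saved = [1000, 2000, 3000, 4000, 5000, 6000]
print(checkpoints_to_delete(saved, keep_interval_updates=2))
print(checkpoints_to_delete(saved, keep_interval_updates=2, pattern=2000))
```

In the second call the checkpoints at updates 2000 and 4000 survive because they are multiples of the pattern, even though they are not among the two newest.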


keep last N epoch checkpoints

Default: -1


keep best N checkpoints based on scores

Default: -1


don’t save models or checkpoints

Default: False


only store last and best checkpoints

Default: False


don’t store last checkpoints

Default: False


don’t save optimizer-state as part of checkpoint

Default: False


metric to use for saving “best” checkpoints

Default: “loss”


select the largest metric value for saving “best” checkpoints

Default: False


early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by –validate-interval

Default: -1
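
A minimal sketch of patience-based early stopping, assuming one score per validation run and treating a non-positive patience as "disabled". The helper `should_stop_early` is a hypothetical name, and the `maximize` flag stands in for the "select the largest metric value" option above.

```python
def should_stop_early(scores, patience, maximize=False):
    """Return True once `patience` consecutive validation runs fail to improve."""
    if patience <= 0:
        return False  # early stopping disabled
    best = None
    runs_without_improvement = 0
    for score in scores:
        improved = best is None or (score > best if maximize else score < best)
        if improved:
            best = score
            runs_without_improvement = 0
        else:
            runs_without_improvement += 1
            if runs_without_improvement >= patience:
                return True
    return False


# Validation loss stops improving after the second run.
print(should_stop_early([3.0, 2.5, 2.6, 2.7], patience=2))
```

Because the counter advances once per validation run, validating less often stretches the amount of training that the same patience value covers.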


suffix to add to the checkpoint file name

Default: “”


Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1


load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False

EMA configuration

--store-ema Default: False

decay for exponential moving average model

Default: 0.9999


start EMA update after this many model updates

Default: 0

--ema-seed-model Seed to load EMA model from. Used to load EMA model separately from the actual model.

Do EMA update every this many model updates

Default: 1


If true, store EMA model in fp32 even if model is in fp16

Default: False
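
The EMA options above can be pictured with the following sketch, which tracks an exponential moving average over plain Python lists. How the start point and update interval are handled here is an assumption made for illustration; `ema_step` and `track_ema` are hypothetical helpers, not library functions.

```python
def ema_step(ema_params, model_params, decay=0.9999):
    """One EMA update: ema <- decay * ema + (1 - decay) * model."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, model_params)]


def track_ema(param_history, decay=0.9999, start_update=0, update_freq=1):
    """Run EMA over a list of parameter snapshots, one per model update.

    Assumption: before `start_update` the EMA simply mirrors the model, and
    afterwards it is refreshed every `update_freq` updates.
    """
    ema = list(param_history[0])
    for step, params in enumerate(param_history, start=1):
        if step <= start_update:
            ema = list(params)
        elif step % update_freq == 0:
            ema = ema_step(ema, params, decay)
    return ema


# A small decay is used here only so the effect is visible in a short run.
print(track_ema([[1.0], [0.0], [0.0]], decay=0.5))
```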


Translate pre-processed data with a trained model.

usage: fairseq-generate [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                        [--log-format {json,none,simple,tqdm}]
                        [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                        [--aim-run-hash AIM_RUN_HASH]
                        [--tensorboard-logdir TENSORBOARD_LOGDIR]
                        [--wandb-project WANDB_PROJECT] [--azureml-logging]
                        [--seed SEED] [--cpu] [--tpu] [--bf16]
                        [--memory-efficient-bf16] [--fp16]
                        [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                        [--fp16-init-scale FP16_INIT_SCALE]
                        [--fp16-scale-window FP16_SCALE_WINDOW]
                        [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                        [--min-loss-scale MIN_LOSS_SCALE]
                        [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                        [--amp-batch-retries AMP_BATCH_RETRIES]
                        [--amp-init-scale AMP_INIT_SCALE]
                        [--amp-scale-window AMP_SCALE_WINDOW]
                        [--user-dir USER_DIR]
                        [--empty-cache-freq EMPTY_CACHE_FREQ]
                        [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                        [--model-parallel-size MODEL_PARALLEL_SIZE]
                        [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                        [--profile] [--reset-logging] [--suppress-crashes]
                        [--use-plasma-view] [--plasma-path PLASMA_PATH]
                        [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                        [--tokenizer {moses,nltk,space}]
                        [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                        [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                        [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                        [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                        [--task TASK] [--num-workers NUM_WORKERS]
                        [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                        [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                        [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                        [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                        [--data-buffer-size DATA_BUFFER_SIZE]
                        [--train-subset TRAIN_SUBSET]
                        [--valid-subset VALID_SUBSET]
                        [--validate-interval VALIDATE_INTERVAL]
                        [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                        [--validate-after-updates VALIDATE_AFTER_UPDATES]
                        [--fixed-validation-seed FIXED_VALIDATION_SEED]
                        [--max-tokens-valid MAX_TOKENS_VALID]
                        [--batch-size-valid BATCH_SIZE_VALID]
                        [--max-valid-steps MAX_VALID_STEPS]
                        [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                        [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                        [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                        [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                        [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                        [--distributed-rank DISTRIBUTED_RANK]
                        [--distributed-backend DISTRIBUTED_BACKEND]
                        [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                        [--distributed-port DISTRIBUTED_PORT]
                        [--device-id DEVICE_ID] [--distributed-no-spawn]
                        [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                        [--ddp-comm-hook {none,fp16}]
                        [--bucket-cap-mb BUCKET_CAP_MB]
                        [--fix-batches-to-gpus] [--find-unused-parameters]
                        [--gradient-as-bucket-view] [--fast-stat-sync]
                        [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                        [--slowmo-momentum SLOWMO_MOMENTUM]
                        [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                        [--localsgd-frequency LOCALSGD_FREQUENCY]
                        [--nprocs-per-node NPROCS_PER_NODE]
                        [--pipeline-balance PIPELINE_BALANCE]
                        [--pipeline-devices PIPELINE_DEVICES]
                        [--pipeline-chunks PIPELINE_CHUNKS]
                        [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                        [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                        [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                        [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                        [--pipeline-checkpoint {always,never,except_last}]
                        [--zero-sharding {none,os}]
                        [--no-reshard-after-forward] [--fp32-reduce-scatter]
                        [--cpu-offload] [--use-sharded-state]
                        [--not-fsdp-flatten-parameters] [--path PATH]
                        [--post-process [POST_PROCESS]] [--quiet]
                        [--model-overrides MODEL_OVERRIDES]
                        [--results-path RESULTS_PATH] [--beam BEAM]
                        [--nbest NBEST] [--max-len-a MAX_LEN_A]
                        [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                        [--match-source-len] [--unnormalized]
                        [--no-early-stop] [--no-beamable-mm] [--lenpen LENPEN]
                        [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                        [--sacrebleu] [--score-reference]
                        [--prefix-size PREFIX_SIZE]
                        [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                        [--sampling] [--sampling-topk SAMPLING_TOPK]
                        [--sampling-topp SAMPLING_TOPP]
                        [--constraints [{ordered,unordered}]]
                        [--temperature TEMPERATURE]
                        [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                        [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                        [--diversity-rate DIVERSITY_RATE]
                        [--print-alignment [{hard,soft}]] [--print-step]
                        [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                        [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                        [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                        [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                        [--retain-iter-history] [--retain-dropout]
                        [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                        [--decoding-format {unigram,ensemble,vote,dp,bs}]
                        [--no-seed-provided] [--eos-token EOS_TOKEN]
                        [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
                        [--continue-once CONTINUE_ONCE]
                        [--finetune-from-model FINETUNE_FROM_MODEL]
                        [--reset-dataloader] [--reset-lr-scheduler]
                        [--reset-meters] [--reset-optimizer]
                        [--optimizer-overrides OPTIMIZER_OVERRIDES]
                        [--save-interval SAVE_INTERVAL]
                        [--save-interval-updates SAVE_INTERVAL_UPDATES]
                        [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                        [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                        [--keep-last-epochs KEEP_LAST_EPOCHS]
                        [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                        [--no-save] [--no-epoch-checkpoints]
                        [--no-last-checkpoints] [--no-save-optimizer-state]
                        [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                        [--patience PATIENCE]
                        [--checkpoint-suffix CHECKPOINT_SUFFIX]
                        [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]

Named Arguments


disable progress bar

Default: False


log progress every N batches (when progress bar is disabled)

Default: 100


Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging

Log scalars to AzureML context

Default: False


pseudo random number generator seed

Default: 1


use CPU instead of CUDA

Default: False


use TPU instead of CUDA

Default: False


use bfloat16; implies --tpu

Default: False


use a memory-efficient version of BF16 training; implies --bf16

Default: False


use FP16

Default: False


use a memory-efficient version of FP16 training; implies --fp16

Default: False


don’t flatten FP16 grads tensor

Default: False


default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale

pct of updates that can overflow before decreasing the loss scale

Default: 0.0


if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False


minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below

use automatic mixed precision

Default: False


number of retries of same batch after reducing loss scale with AMP

Default: 2


default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0


number of bytes reserved for gathering stats from workers

Default: 16384


total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file

enable autograd profiler emit_nvtx

Default: False


when using Hydra, reset the logging at the beginning of training

Default: False


suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False


Store indices and sizes in shared memory

Default: False


path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”


Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”


Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”


Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt


Default: “translation”



how many subprocesses to use for data loading

Default: 1


ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch

batch size will be a multiple of this value

Default: 8


maximum sequence length in batch will be a multiple of this value

Default: 1


Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation


Number of batches to preload

Default: 10


data subset to use for training (e.g. train, valid, test)

Default: “train”


comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)

do not raise error if valid subsets are ignored

Default: False


validate every N epochs

Default: 1


validate every N updates

Default: 0


don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate

don’t shuffle batches for first N epochs

Default: 0


data subset to generate (train, valid, test)

Default: “test”


shard generation over N shards

Default: 1


id of the shard to generate (id < num_shards)

Default: 0
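
One common way to split work over N shards is round-robin assignment, where shard `i` takes every N-th item starting at position `i`. The sketch below shows that idea under the stated constraint `id < num_shards`; the helper name `shard` is hypothetical and the real iterator may differ in details such as padding.

```python
from itertools import islice


def shard(items, num_shards, shard_id):
    """Return the items assigned to one shard using round-robin splitting."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    return list(islice(items, shard_id, None, num_shards))


# Ten sentence indices split over three shards.
for sid in range(3):
    print(sid, shard(range(10), num_shards=3, shard_id=sid))
```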


shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false; defaults to --grouped-shuffling

if true, increment the seed with each epoch when getting batch iterators; defaults to False.

Default: False



total number of GPUs across all nodes (default: all visible GPUs)

Default: 1


total number of processes to fork (default: all visible GPUs)

Default: 1


rank of the current worker

Default: 0


distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0


do not spawn multiple processes even if multiple GPUs are visible

Default: False


Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”


Possible choices: none, fp16

communication hook

Default: “none”


bucket size for reduction

Default: 25


don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False


disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False


when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size.

Default: False


[deprecated] this is now defined per Criterion

Default: False


kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1


Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of the ‘slowmo_base_algorithm’ parameter for more details

Default: “localsgd”


Local SGD allreduce frequency

Default: 3


number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1


if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
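
The two consistency rules stated for the balance and device lists (the balance sums to the layer count, and both lists have the same length) can be checked with a short sketch like the one below. `check_pipeline_partition` is a hypothetical helper written for illustration; it also expands the partition into a per-layer device list.

```python
def check_pipeline_partition(balance, devices, num_layers):
    """Validate a balance/devices pair and return the device of each layer."""
    if sum(balance) != num_layers:
        raise ValueError(
            f"balance sums to {sum(balance)}, but the model has {num_layers} layers"
        )
    if len(devices) != len(balance):
        raise ValueError("devices and balance must have the same length")
    placement = []
    for n_layers, device in zip(balance, devices):
        placement.extend([device] * n_layers)  # consecutive layers share a device
    return placement


# Six layers split 2/3/1 over devices 0, 1 and 2.
print(check_pipeline_partition([2, 3, 1], [0, 1, 2], num_layers=6))
```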

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”


Possible choices: none, os

ZeRO sharding

Default: “none”


don’t reshard parameters after forward pass

Default: False


reduce-scatter grads in FP32

Default: False


offload FP32 params to CPU

Default: False


use sharded checkpoint files

Default: False


do not flatten parameters for FSDP

Default: False


--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in

only print final scores

Default: False


a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)

beam size

Default: 5


number of hypotheses to output

Default: 1


generate sequences of maximum length ax + b, where x is the source length

Default: 0


generate sequences of maximum length ax + b, where x is the source length

Default: 200
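
The "ax + b" rule shared by the two length options above works out as in this small sketch, where `a` and `b` take the defaults listed (0 and 200). The function name is hypothetical, and any further cap from the model's maximum positions is deliberately left out.

```python
def max_generation_length(src_len, max_len_a=0.0, max_len_b=200):
    """Maximum output length a*x + b for a source sentence of x tokens."""
    return int(max_len_a * src_len + max_len_b)


print(max_generation_length(30))                              # defaults: fixed cap
print(max_generation_length(30, max_len_a=1.5, max_len_b=10))  # cap grows with the source
```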


minimum generation length

Default: 1


generations should match the source length

Default: False


compare unnormalized hypothesis scores

Default: False



Default: False


don’t use BeamableMM in attention layers

Default: False


length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1
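
To see why values above 1.0 favor longer sentences, consider a sketch in which a hypothesis score is the sum of token log-probabilities divided by length raised to the penalty. This normalization form is an assumption used for illustration, and `normalized_score` is a hypothetical helper.

```python
def normalized_score(token_logprobs, lenpen=1.0):
    """Sum of log-probabilities divided by (length ** lenpen)."""
    return sum(token_logprobs) / (len(token_logprobs) ** lenpen)


short = [-1.0, -1.0]             # 2 tokens, total log-prob -2.0
long = [-0.8, -0.8, -0.8, -0.8]  # 4 tokens, total log-prob -3.2

for lenpen in (0.0, 1.0, 2.0):
    winner = "long" if normalized_score(long, lenpen) > normalized_score(short, lenpen) else "short"
    print(lenpen, winner)
```

Since log-probabilities are negative, dividing by a larger power of the length pulls long hypotheses closer to zero, which is why they start to win as the penalty grows.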


unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)

score with sacrebleu

Default: False


just score the reference translation

Default: False


initialize generation by target prefix of given length

Default: 0


ngram blocking such that this size ngram cannot be repeated in the generation

Default: 0
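
N-gram blocking can be sketched as computing, at each step, the set of next tokens that would complete an n-gram already present in the output. The function below is a plain-Python illustration over token ids, not the batched implementation used during beam search; `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(tokens, n):
    """Return tokens that would repeat an existing n-gram if generated next."""
    if n <= 0 or len(tokens) < n - 1:
        return set()  # blocking disabled, or not enough context yet
    prefix = tuple(tokens[len(tokens) - n + 1:])  # last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])  # token that followed this prefix before
    return banned


# With n=3, the output so far ends in (5, 6), which was previously followed by 7.
print(banned_next_tokens([5, 6, 7, 5, 6], n=3))
```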


sample hypotheses instead of using beam search

Default: False


sample from top K likely next words instead of all words

Default: -1


sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0
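
The two sampling restrictions above can be illustrated on a small next-word distribution. The sketch uses a dict from token to probability and renormalizes what remains; both helper names are hypothetical, and a real implementation would operate on batched tensors.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass exceeds p."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass > p:
            break
    return {tok: prob / mass for tok, prob in kept}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(sorted(top_k_filter(probs, 2)))    # two most likely tokens survive
print(sorted(top_p_filter(probs, 0.9)))  # tokens needed to exceed 0.9 mass
```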


Possible choices: ordered, unordered

enables lexically constrained decoding


temperature for generation

Default: 1.0
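
Temperature is commonly applied by dividing the logits before the softmax: values below 1.0 sharpen the distribution and values above 1.0 flatten it. A minimal sketch, assuming that convention (the helper name is hypothetical):

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```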


number of groups for Diverse Beam Search

Default: -1


strength of diversity penalty for Diverse Beam Search

Default: 0.5


strength of diversity penalty for Diverse Siblings Search

Default: -1.0


Possible choices: hard, soft

if set, uses attention feedback to compute and print alignment to source tokens (valid options are: hard, soft, otherwise treated as hard alignment)


print steps

Default: False

--lm-path path to lm checkpoint for lm fusion

weight for lm probs for lm fusion

Default: 0.0


if > 0.0, penalizes early stopping in decoding.

Default: 0.0


maximum iterations for iterative refinement.

Default: 10


if set, run exactly the maximum number of iterations without early stopping

Default: False


if > 1, model will generate translations varying by the lengths.

Default: 1


if set, the last checkpoint is assumed to be a reranker used to rescore the translations

Default: False


if set, decoding returns the whole history of iterative refinement

Default: False


Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules

Possible choices: unigram, ensemble, vote, dp, bs

special decoding format for advanced decoding.


if set, don’t use a seed for initializing random generators

Default: False

--eos-token EOS token



path to save checkpoints

Default: “checkpoints”


filename from which to load checkpoint (default: <save-dir>/

Default: “”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset

if set, does not reload dataloader state from the checkpoint

Default: False


if set, does not load lr scheduler state from the checkpoint

Default: False


if set, does not load meters from the checkpoint

Default: False


if set, does not load optimizer state from the checkpoint

Default: False


a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”


save a checkpoint every N epochs

Default: 1


save a checkpoint (and validate) every N updates

Default: 0


keep the last N checkpoints saved with --save-interval-updates

Default: -1


when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1


keep last N epoch checkpoints

Default: -1


keep best N checkpoints based on scores

Default: -1


don’t save models or checkpoints

Default: False


only store last and best checkpoints

Default: False


don’t store last checkpoints

Default: False


don’t save optimizer-state as part of checkpoint

Default: False


metric to use for saving “best” checkpoints

Default: “loss”


select the largest metric value for saving “best” checkpoints

Default: False


early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval

Default: -1


suffix to add to the checkpoint file name

Default: “”


Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1


load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False


Translate raw text with a trained model. Batches data on-the-fly.

usage: fairseq-interactive [-h] [--no-progress-bar]
                           [--log-interval LOG_INTERVAL]
                           [--log-format {json,none,simple,tqdm}]
                           [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                           [--aim-run-hash AIM_RUN_HASH]
                           [--tensorboard-logdir TENSORBOARD_LOGDIR]
                           [--wandb-project WANDB_PROJECT] [--azureml-logging]
                           [--seed SEED] [--cpu] [--tpu] [--bf16]
                           [--memory-efficient-bf16] [--fp16]
                           [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                           [--fp16-init-scale FP16_INIT_SCALE]
                           [--fp16-scale-window FP16_SCALE_WINDOW]
                           [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                           [--min-loss-scale MIN_LOSS_SCALE]
                           [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                           [--amp] [--amp-batch-retries AMP_BATCH_RETRIES]
                           [--amp-init-scale AMP_INIT_SCALE]
                           [--amp-scale-window AMP_SCALE_WINDOW]
                           [--user-dir USER_DIR]
                           [--empty-cache-freq EMPTY_CACHE_FREQ]
                           [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                           [--model-parallel-size MODEL_PARALLEL_SIZE]
                           [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                           [--profile] [--reset-logging] [--suppress-crashes]
                           [--use-plasma-view] [--plasma-path PLASMA_PATH]
                           [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                           [--tokenizer {moses,nltk,space}]
                           [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                           [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                           [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                           [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                           [--task TASK] [--num-workers NUM_WORKERS]
                           [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                           [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                           [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                           [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                           [--data-buffer-size DATA_BUFFER_SIZE]
                           [--train-subset TRAIN_SUBSET]
                           [--valid-subset VALID_SUBSET]
                           [--validate-interval VALIDATE_INTERVAL]
                           [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                           [--validate-after-updates VALIDATE_AFTER_UPDATES]
                           [--fixed-validation-seed FIXED_VALIDATION_SEED]
                           [--max-tokens-valid MAX_TOKENS_VALID]
                           [--batch-size-valid BATCH_SIZE_VALID]
                           [--max-valid-steps MAX_VALID_STEPS]
                           [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                           [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                           [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                           [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                           [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                           [--distributed-rank DISTRIBUTED_RANK]
                           [--distributed-backend DISTRIBUTED_BACKEND]
                           [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                           [--distributed-port DISTRIBUTED_PORT]
                           [--device-id DEVICE_ID] [--distributed-no-spawn]
                           [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                           [--ddp-comm-hook {none,fp16}]
                           [--bucket-cap-mb BUCKET_CAP_MB]
                           [--fix-batches-to-gpus] [--find-unused-parameters]
                           [--gradient-as-bucket-view] [--fast-stat-sync]
                           [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                           [--slowmo-momentum SLOWMO_MOMENTUM]
                           [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                           [--localsgd-frequency LOCALSGD_FREQUENCY]
                           [--nprocs-per-node NPROCS_PER_NODE]
                           [--pipeline-balance PIPELINE_BALANCE]
                           [--pipeline-devices PIPELINE_DEVICES]
                           [--pipeline-chunks PIPELINE_CHUNKS]
                           [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                           [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                           [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                           [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                           [--pipeline-checkpoint {always,never,except_last}]
                           [--zero-sharding {none,os}]
                           [--fp32-reduce-scatter] [--cpu-offload]
                           [--not-fsdp-flatten-parameters] [--path PATH]
                           [--post-process [POST_PROCESS]] [--quiet]
                           [--model-overrides MODEL_OVERRIDES]
                           [--results-path RESULTS_PATH] [--beam BEAM]
                           [--nbest NBEST] [--max-len-a MAX_LEN_A]
                           [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                           [--match-source-len] [--unnormalized]
                           [--no-early-stop] [--no-beamable-mm]
                           [--lenpen LENPEN] [--unkpen UNKPEN]
                           [--replace-unk [REPLACE_UNK]] [--sacrebleu]
                           [--score-reference] [--prefix-size PREFIX_SIZE]
                           [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                           [--sampling] [--sampling-topk SAMPLING_TOPK]
                           [--sampling-topp SAMPLING_TOPP]
                           [--constraints [{ordered,unordered}]]
                           [--temperature TEMPERATURE]
                           [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                           [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                           [--diversity-rate DIVERSITY_RATE]
                           [--print-alignment [{hard,soft}]] [--print-step]
                           [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                           [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                           [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                           [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                           [--retain-iter-history] [--retain-dropout]
                           [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                           [--decoding-format {unigram,ensemble,vote,dp,bs}]
                           [--no-seed-provided] [--eos-token EOS_TOKEN]
                           [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
                           [--continue-once CONTINUE_ONCE]
                           [--finetune-from-model FINETUNE_FROM_MODEL]
                           [--reset-dataloader] [--reset-lr-scheduler]
                           [--reset-meters] [--reset-optimizer]
                           [--optimizer-overrides OPTIMIZER_OVERRIDES]
                           [--save-interval SAVE_INTERVAL]
                           [--save-interval-updates SAVE_INTERVAL_UPDATES]
                           [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                           [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                           [--keep-last-epochs KEEP_LAST_EPOCHS]
                           [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                           [--no-save] [--no-epoch-checkpoints]
                           [--no-last-checkpoints] [--no-save-optimizer-state]
                           [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                           [--patience PATIENCE]
                           [--checkpoint-suffix CHECKPOINT_SUFFIX]
                           [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                           [--buffer-size BUFFER_SIZE] [--input INPUT]

Named Arguments


disable progress bar

Default: False


log progress every N batches (when progress bar is disabled)

Default: 100


Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging

Log scalars to AzureML context

Default: False


pseudo random number generator seed

Default: 1


use CPU instead of CUDA

Default: False


use TPU instead of CUDA

Default: False


use bfloat16; implies --tpu

Default: False


use a memory-efficient version of BF16 training; implies --bf16

Default: False


use FP16

Default: False


use a memory-efficient version of FP16 training; implies --fp16

Default: False


don’t flatten FP16 grads tensor

Default: False


default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale

pct of updates that can overflow before decreasing the loss scale

Default: 0.0


if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False


minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below

use automatic mixed precision

Default: False


number of retries of same batch after reducing loss scale with AMP

Default: 2


default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0


number of bytes reserved for gathering stats from workers

Default: 16384


total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file

enable autograd profiler emit_nvtx

Default: False


when using Hydra, reset the logging at the beginning of training

Default: False


suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False


Store indices and sizes in shared memory

Default: False


path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”


Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”


Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”


Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt


Default: “translation”



how many subprocesses to use for data loading

Default: 1


ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch

batch size will be a multiple of this value

Default: 8


maximum sequence length in batch will be a multiple of this value

Default: 1


Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation


Number of batches to preload

Default: 10


data subset to use for training (e.g. train, valid, test)

Default: “train”


comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)

do not raise error if valid subsets are ignored

Default: False


validate every N epochs

Default: 1


validate every N updates

Default: 0


don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate

don’t shuffle batches for first N epochs

Default: 0


data subset to generate (train, valid, test)

Default: “test”


shard generation over N shards

Default: 1


id of the shard to generate (id < num_shards)

Default: 0
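
The relationship between the shard count and the shard id can be sketched as follows. This is a minimal illustration, not fairseq code: the helper name select_shard is hypothetical, and round-robin assignment of items to shards is an assumption. The only constraint taken from the text above is id < num_shards.

```python
def select_shard(items, num_shards=1, shard_id=0):
    """Return the items handled by one shard (round-robin split, assumed)."""
    if num_shards < 1 or not 0 <= shard_id < num_shards:
        raise ValueError("need num_shards >= 1 and 0 <= shard_id < num_shards")
    # Every num_shards-th item, starting at offset shard_id.
    return items[shard_id::num_shards]


sentences = [f"sent{i}" for i in range(7)]
for shard_id in range(3):
    print(shard_id, select_shard(sentences, num_shards=3, shard_id=shard_id))
```

Running one process per shard id, each with the same shard count, would cover every item exactly once under this assumption.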


shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false; defaults to --grouped-shuffling

if true, increment the seed with the epoch when getting batch iterators; defaults to False.

Default: False



total number of GPUs across all nodes (default: all visible GPUs)

Default: 1


total number of processes to fork (default: all visible GPUs)

Default: 1


rank of the current worker

Default: 0


distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0


do not spawn multiple processes even if multiple GPUs are visible

Default: False


Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”


Possible choices: none, fp16

communication hook

Default: “none”


bucket size for reduction

Default: 25


don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False


disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False


when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size.

Default: False


[deprecated] this is now defined per Criterion

Default: False


kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1


Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in for more details

Default: “localsgd”


Local SGD allreduce frequency

Default: 3


number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1


if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
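
The two consistency rules stated for --pipeline-balance and --pipeline-devices can be checked with a short sketch. The helper name plan_pipeline and the contiguous layer numbering are assumptions made for illustration; they are not part of fairseq.

```python
def plan_pipeline(balance, devices, num_layers):
    """Map each partition to a device and a contiguous range of layer indices."""
    if sum(balance) != num_layers:
        raise ValueError(
            f"sum(balance)={sum(balance)} must equal num_layers={num_layers}"
        )
    if len(devices) != len(balance):
        raise ValueError("devices and balance must have the same length")
    plan, first = [], 0
    for n_layers, device in zip(balance, devices):
        # Assumption: partitions take consecutive layers in order.
        plan.append((device, range(first, first + n_layers)))
        first += n_layers
    return plan


for device, layers in plan_pipeline([4, 4, 4], [0, 1, 2], num_layers=12):
    print(f"device {device}: layers {layers.start}..{layers.stop - 1}")
```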

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”


Possible choices: none, os

ZeRO sharding

Default: “none”


don’t reshard parameters after forward pass

Default: False


reduce-scatter grads in FP32

Default: False


offload FP32 params to CPU

Default: False


use sharded checkpoint files

Default: False


do not flatten parameters for FSDP

Default: False


--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in

only print final scores

Default: False


a dictionary used to override model args at generation that were used during model training

Default: “{}”
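
Because the default is the string “{}”, the override value reads as a dictionary literal passed on the command line. The sketch below shows one way such a string could be parsed and merged over stored arguments. It assumes Python-literal syntax; the helper names and the example keys (data, beam) are hypothetical and not taken from fairseq.

```python
import ast


def parse_overrides(text="{}"):
    """Parse a dictionary literal such as "{'data': '/new/path'}"."""
    overrides = ast.literal_eval(text)  # accepts literals only, no arbitrary code
    if not isinstance(overrides, dict):
        raise ValueError("overrides must be a dictionary literal")
    return overrides


def apply_overrides(stored_args, text="{}"):
    """Return a copy of stored_args with the overrides applied on top."""
    return {**stored_args, **parse_overrides(text)}


stored = {"data": "/old/path", "beam": 5}  # hypothetical training-time arguments
print(apply_overrides(stored, "{'data': '/new/path'}"))
```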

--results-path path to save eval results (optional)

beam size

Default: 5


number of hypotheses to output

Default: 1


--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0


--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200
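
The two entries above describe a single bound, ax + b, with a taken from --max-len-a and b from --max-len-b. A minimal worked example follows; the helper name max_generation_length is hypothetical, and truncating the result to an integer is an assumption.

```python
def max_generation_length(source_len, max_len_a=0.0, max_len_b=200):
    """Upper bound a*x + b on the generated length for a source of length x."""
    return int(max_len_a * source_len + max_len_b)


# With the defaults (a=0, b=200) the bound does not depend on the source.
print(max_generation_length(25))
# With a=1.5 and b=10 the bound grows with the source length.
print(max_generation_length(25, max_len_a=1.5, max_len_b=10))
```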


minimum generation length

Default: 1


generations should match the source length

Default: False


compare unnormalized hypothesis scores

Default: False



--no-early-stop

Default: False


don’t use BeamableMM in attention layers

Default: False


length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1
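
The direction of the length penalty can be illustrated with a small sketch. It assumes that a hypothesis score is the sum of token log-probabilities divided by length ** lenpen; this page does not state the formula, so treat that form and the helper name normalized_score as assumptions.

```python
def normalized_score(token_logprobs, lenpen=1.0):
    """Sum of log-probabilities divided by length ** lenpen (assumed form)."""
    return sum(token_logprobs) / (len(token_logprobs) ** lenpen)


shorter = [-0.5, -0.5]             # 2 tokens, total log-prob -1.0
longer = [-0.4, -0.4, -0.4, -0.4]  # 4 tokens, total log-prob -1.6

for lenpen in (0.5, 1.0, 2.0):
    s = normalized_score(shorter, lenpen)
    l = normalized_score(longer, lenpen)
    winner = "shorter" if s > l else "longer"
    print(f"lenpen={lenpen}: shorter={s:.3f} longer={l:.3f} -> {winner}")
```

Log-probabilities are negative, so a larger divisor moves a score toward zero. Raising lenpen therefore helps the longer hypothesis more than the shorter one.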


unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)

score with sacrebleu

Default: False


just score the reference translation

Default: False


initialize generation by target prefix of given length

Default: 0


ngram blocking such that this size ngram cannot be repeated in the generation

Default: 0
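
N-gram blocking can be sketched as follows: before choosing the next token, find every token that would complete an n-gram already present in the output. This is an illustration of the idea rather than fairseq's implementation; the helper name banned_next_tokens is hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size <= 0 or len(generated) < ngram_size - 1:
        return set()  # blocking disabled, or not enough context yet
    # The last n-1 tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["a", "b", "c", "a", "b"]
# With size 3, emitting "c" next would repeat the trigram ("a", "b", "c").
print(banned_next_tokens(tokens, 3))
```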


sample hypotheses instead of using beam search

Default: False


sample from top K likely next words instead of all words

Default: -1


sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0
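
The candidate sets described for top-K and top-p sampling can be sketched in plain Python. The helper names and the toy distribution are hypothetical. A real sampler would also renormalize the kept probabilities and draw from them, which is omitted here.

```python
def top_k_candidates(probs, k):
    """The k most likely tokens, most likely first."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]


def top_p_candidates(probs, p):
    """Smallest set of tokens whose cumulative probability exceeds p."""
    kept, total = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        total += probs[token]
        if total > p:
            break
    return kept


next_word = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(top_k_candidates(next_word, 2))
print(top_p_candidates(next_word, 0.7))
print(top_p_candidates(next_word, 0.9))
```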


Possible choices: ordered, unordered

enables lexically constrained decoding


temperature for generation

Default: 1.0


number of groups for Diverse Beam Search

Default: -1


strength of diversity penalty for Diverse Beam Search

Default: 0.5


strength of diversity penalty for Diverse Siblings Search

Default: -1.0


Possible choices: hard, soft

if set, uses attention feedback to compute and print alignment to source tokens (valid options are: hard, soft, otherwise treated as hard alignment)


print steps

Default: False

--lm-path path to lm checkpoint for lm fusion

weight for lm probs for lm fusion

Default: 0.0


if > 0.0, penalizes early stopping in decoding.

Default: 0.0


maximum iterations for iterative refinement.

Default: 10


if set, run exactly the maximum number of iterations without early stopping

Default: False


if > 1, the model will generate translations of varying lengths.

Default: 1


if set, the last checkpoint is assumed to be a reranker to rescore the translations

Default: False


if set, decoding returns the whole history of iterative refinement

Default: False


Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules

Possible choices: unigram, ensemble, vote, dp, bs

special decoding format for advanced decoding.


if set, don’t use seed for initializing random generators

Default: False

--eos-token EOS token



path to save checkpoints

Default: “checkpoints”


filename from which to load checkpoint (default: <save-dir>/

Default: “”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset

if set, does not reload dataloader state from the checkpoint

Default: False


if set, does not load lr scheduler state from the checkpoint

Default: False


if set, does not load meters from the checkpoint

Default: False


if set, does not load optimizer state from the checkpoint

Default: False


a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”


save a checkpoint every N epochs

Default: 1


save a checkpoint (and validate) every N updates

Default: 0


keep the last N checkpoints saved with --save-interval-updates

Default: -1


when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1
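
The interaction between --keep-interval-updates and --keep-interval-updates-pattern can be sketched as below. The helper name surviving_checkpoints is hypothetical, and treating a non-positive value as "prune nothing" is an assumption based on the -1 defaults.

```python
def surviving_checkpoints(updates, keep_interval_updates=-1, pattern=-1):
    """Update numbers whose checkpoints remain on disk after pruning."""
    ordered = sorted(updates)
    if keep_interval_updates <= 0:
        return ordered  # assumption: non-positive disables pruning
    recent = set(ordered[-keep_interval_updates:])
    return [
        u for u in ordered
        # Keep the last N, plus any update X with X % pattern == 0.
        if u in recent or (pattern > 0 and u % pattern == 0)
    ]


saved = list(range(1000, 11000, 1000))  # checkpoints at 1000, 2000, ..., 10000
print(surviving_checkpoints(saved, keep_interval_updates=2, pattern=5000))
```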


keep last N epoch checkpoints

Default: -1


keep best N checkpoints based on scores

Default: -1


don’t save models or checkpoints

Default: False


only store last and best checkpoints

Default: False


don’t store last checkpoints

Default: False


don’t save optimizer-state as part of checkpoint

Default: False


metric to use for saving “best” checkpoints

Default: “loss”


select the largest metric value for saving “best” checkpoints

Default: False


early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval

Default: -1


suffix to add to the checkpoint file name

Default: “”


Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1


load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False



read this many sentences into a buffer before processing them

Default: 0


file to read from; use - for stdin

Default: “-”


BLEU scoring of generated translations against reference translations.

Command-line script for BLEU scoring.

usage: fairseq-score [-h] [-s SYS] -r REF [-o N] [--ignore-case] [--sacrebleu]

Named Arguments

-s, --sys

system output

Default: “-”

-r, --ref references
-o, --order

consider ngrams up to this order

Default: 4
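
To make "ngrams up to this order" concrete, the sketch below counts every n-gram of length 1 through N in a tokenized sentence. This is only the counting step that BLEU-style metrics build on, not the scorer itself; the helper name ngram_counts is hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, max_order=4):
    """Count all n-grams of length 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts("the cat sat".split(), max_order=2)
for ngram, count in sorted(counts.items()):
    print(" ".join(ngram), count)
```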


--ignore-case

case-insensitive scoring

Default: False


--sacrebleu

score with sacrebleu

Default: False


--sentence-bleu

report sentence-level BLEUs (i.e., with +1 smoothing)

Default: False


Evaluate the perplexity of a trained language model.

usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                       [--log-format {json,none,simple,tqdm}]
                       [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                       [--aim-run-hash AIM_RUN_HASH]
                       [--tensorboard-logdir TENSORBOARD_LOGDIR]
                       [--wandb-project WANDB_PROJECT] [--azureml-logging]
                       [--seed SEED] [--cpu] [--tpu] [--bf16]
                       [--memory-efficient-bf16] [--fp16]
                       [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                       [--fp16-init-scale FP16_INIT_SCALE]
                       [--fp16-scale-window FP16_SCALE_WINDOW]
                       [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                       [--min-loss-scale MIN_LOSS_SCALE]
                       [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                       [--amp-batch-retries AMP_BATCH_RETRIES]
                       [--amp-init-scale AMP_INIT_SCALE]
                       [--amp-scale-window AMP_SCALE_WINDOW]
                       [--user-dir USER_DIR]
                       [--empty-cache-freq EMPTY_CACHE_FREQ]
                       [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                       [--model-parallel-size MODEL_PARALLEL_SIZE]
                       [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                       [--profile] [--reset-logging] [--suppress-crashes]
                       [--use-plasma-view] [--plasma-path PLASMA_PATH]
                       [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                       [--tokenizer {moses,nltk,space}]
                       [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                       [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                       [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                       [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                       [--task TASK] [--num-workers NUM_WORKERS]
                       [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                       [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                       [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                       [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                       [--data-buffer-size DATA_BUFFER_SIZE]
                       [--train-subset TRAIN_SUBSET]
                       [--valid-subset VALID_SUBSET] [--combine-valid-subsets]
                       [--validate-interval VALIDATE_INTERVAL]
                       [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                       [--validate-after-updates VALIDATE_AFTER_UPDATES]
                       [--fixed-validation-seed FIXED_VALIDATION_SEED]
                       [--max-tokens-valid MAX_TOKENS_VALID]
                       [--batch-size-valid BATCH_SIZE_VALID]
                       [--max-valid-steps MAX_VALID_STEPS]
                       [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                       [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                       [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                       [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                       [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                       [--distributed-rank DISTRIBUTED_RANK]
                       [--distributed-backend DISTRIBUTED_BACKEND]
                       [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                       [--distributed-port DISTRIBUTED_PORT]
                       [--device-id DEVICE_ID] [--distributed-no-spawn]
                       [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                       [--ddp-comm-hook {none,fp16}]
                       [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                       [--find-unused-parameters] [--gradient-as-bucket-view]
                       [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                       [--slowmo-momentum SLOWMO_MOMENTUM]
                       [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                       [--localsgd-frequency LOCALSGD_FREQUENCY]
                       [--nprocs-per-node NPROCS_PER_NODE]
                       [--pipeline-balance PIPELINE_BALANCE]
                       [--pipeline-devices PIPELINE_DEVICES]
                       [--pipeline-chunks PIPELINE_CHUNKS]
                       [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                       [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                       [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                       [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                       [--pipeline-checkpoint {always,never,except_last}]
                       [--zero-sharding {none,os}]
                       [--no-reshard-after-forward] [--fp32-reduce-scatter]
                       [--cpu-offload] [--use-sharded-state]
                       [--not-fsdp-flatten-parameters] [--path PATH]
                       [--post-process [POST_PROCESS]] [--quiet]
                       [--model-overrides MODEL_OVERRIDES]
                       [--results-path RESULTS_PATH] [--output-word-probs]
                       [--output-word-stats] [--context-window CONTEXT_WINDOW]
                       [--softmax-batch SOFTMAX_BATCH]

Named Arguments


disable progress bar

Default: False


log progress every N batches (when progress bar is disabled)

Default: 100


Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging

Log scalars to AzureML context

Default: False


pseudo random number generator seed

Default: 1


use CPU instead of CUDA

Default: False


use TPU instead of CUDA

Default: False


use bfloat16; implies --tpu

Default: False


use a memory-efficient version of BF16 training; implies --bf16

Default: False


use FP16

Default: False


use a memory-efficient version of FP16 training; implies --fp16

Default: False


don’t flatten FP16 grads tensor

Default: False


default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale

pct of updates that can overflow before decreasing the loss scale

Default: 0.0


if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False


minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below

use automatic mixed precision

Default: False


number of retries of same batch after reducing loss scale with AMP

Default: 2


default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0


number of bytes reserved for gathering stats from workers

Default: 16384


total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file

enable autograd profiler emit_nvtx

Default: False


when using Hydra, reset the logging at the beginning of training

Default: False


suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False


Store indices and sizes in shared memory

Default: False


path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”


Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”


Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”


Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt


Default: “language_modeling”



how many subprocesses to use for data loading

Default: 1


ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch

batch size will be a multiple of this value

Default: 8


maximum sequence length in batch will be a multiple of this value

Default: 1


Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation


Number of batches to preload

Default: 10


data subset to use for training (e.g. train, valid, test)

Default: “train”


comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)

do not raise error if valid subsets are ignored

Default: False


validate every N epochs

Default: 1


validate every N updates

Default: 0


don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate

don’t shuffle batches for first N epochs

Default: 0


data subset to generate (train, valid, test)

Default: “test”


shard generation over N shards

Default: 1


id of the shard to generate (id < num_shards)

Default: 0


shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false; defaults to --grouped-shuffling

if true, increment the seed with the epoch when getting batch iterators; defaults to False.

Default: False



total number of GPUs across all nodes (default: all visible GPUs)

Default: 1


total number of processes to fork (default: all visible GPUs)

Default: 1


rank of the current worker

Default: 0


distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0


do not spawn multiple processes even if multiple GPUs are visible

Default: False


Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”


Possible choices: none, fp16

communication hook

Default: “none”


bucket size for reduction

Default: 25


don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False


disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False


when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage; the memory saved equals the total size of the gradients.

Default: False


[deprecated] this is now defined per Criterion

Default: False


kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1


Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs
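
The default momentum depends on the total number of GPUs. A minimal sketch of that lookup follows; the text only gives values for 16, 32, 64, and more than 64 GPUs, so treating each value as an upper bound for smaller GPU counts is an assumption, and `default_slowmo_momentum` is a hypothetical helper name.

```python
def default_slowmo_momentum(world_size):
    """Map a GPU count to the documented default SlowMo momentum."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Assumption: counts between the documented points use the next listed value.
    if world_size <= 16:
        return 0.0
    if world_size <= 32:
        return 0.2
    if world_size <= 64:
        return 0.5
    return 0.6


if __name__ == "__main__":
    for gpus in (16, 32, 64, 128):
        print(gpus, default_slowmo_momentum(gpus))
```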

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of the ‘slowmo_base_algorithm’ parameter for more details

Default: “localsgd”


Local SGD allreduce frequency

Default: 3


number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1


if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
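
The balance and devices lists above must satisfy two constraints: the balance entries sum to the number of layers, and the devices list has one entry per partition. A minimal sketch of those checks, which also expands the two lists into a per-layer device assignment; `check_pipeline_partition` is a hypothetical helper, not a fairseq function, and the same check would apply separately to the encoder and decoder variants.

```python
def check_pipeline_partition(balance, devices, num_layers):
    """Validate a balance/devices pair and return the device of each layer."""
    if any(n <= 0 for n in balance):
        raise ValueError("every partition must contain at least one layer")
    if sum(balance) != num_layers:
        raise ValueError(
            f"sum(balance)={sum(balance)} does not match num_layers={num_layers}"
        )
    if len(devices) != len(balance):
        raise ValueError("devices must have one entry per partition")
    # Expand: partition i holds balance[i] consecutive layers on devices[i].
    layer_to_device = []
    for n_layers, device in zip(balance, devices):
        layer_to_device.extend([device] * n_layers)
    return layer_to_device


if __name__ == "__main__":
    # Six layers split 2/3/1; the first and last partition share device 0.
    print(check_pipeline_partition([2, 3, 1], [0, 1, 0], num_layers=6))
```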

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”


Possible choices: none, os

ZeRO sharding

Default: “none”


don’t reshard parameters after forward pass

Default: False


reduce-scatter grads in FP32

Default: False


offload FP32 params to CPU

Default: False


use sharded checkpoint files

Default: False


do not flatten parameters for FSDP

Default: False

LM Evaluation

--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in

only print final scores

Default: False


a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)

if set, outputs words and their predicted log probabilities to standard output

Default: False


if set, outputs word statistics such as word count, average probability, etc

Default: False


ensures that every evaluated token has access to a context of at least this size, if possible

Default: 0
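
A context window guarantees that each scored token sees a minimum amount of preceding text. A minimal sketch of one way to build such evaluation windows, assuming the text is scored in fixed-size blocks and each block is prefixed with tokens from the previous block that are fed to the model but not scored; `context_windows` and `block_size` are hypothetical names, and this is not claimed to be fairseq's implementation.

```python
def context_windows(num_tokens, block_size, context_window):
    """Return (context_start, score_start, end) index triples.

    Tokens in [context_start, score_start) are context only;
    tokens in [score_start, end) are the ones that get scored.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if context_window < 0:
        raise ValueError("context_window must be non-negative")
    windows = []
    for score_start in range(0, num_tokens, block_size):
        end = min(score_start + block_size, num_tokens)
        # "If possible": the first block has no earlier tokens to borrow.
        context_start = max(0, score_start - context_window)
        windows.append((context_start, score_start, end))
    return windows


if __name__ == "__main__":
    for window in context_windows(num_tokens=10, block_size=4, context_window=2):
        print(window)
```

With a context window of 0 (the default) the context and scoring ranges start at the same index, so no extra tokens are prepended.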


if BxT is more than this, will batch the softmax over vocab to this amount of tokens, in order to fit into GPU memory

Default: 9223372036854775807
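
Batching the softmax means normalizing the B×T positions in chunks rather than all at once, so that only one chunk of vocabulary-sized rows is live at a time. A minimal pure-Python sketch of the idea (a real implementation would operate on tensors; `batched_log_softmax` and its arguments are hypothetical names):

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def batched_log_softmax(logits, softmax_batch):
    """Apply log-softmax to B*T rows, at most softmax_batch rows per chunk."""
    if softmax_batch <= 0:
        raise ValueError("softmax_batch must be positive")
    out = []
    for start in range(0, len(logits), softmax_batch):
        chunk = logits[start:start + softmax_batch]
        out.extend(log_softmax(row) for row in chunk)
    return out


if __name__ == "__main__":
    rows = [[0.0, 1.0, 2.0], [3.0, 3.0, 3.0], [-1.0, 0.0, 1.0]]
    print(batched_log_softmax(rows, softmax_batch=2))
```

The chunk size changes only how much is computed at once, not the result. The default value shown above is 2^63 − 1, which in practice means no chunking.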