Command-line Tools

Fairseq provides several command-line tools for training and evaluating models:

fairseq-preprocess

Data pre-processing: build vocabularies and binarize training data.

usage: fairseq-preprocess [-h] [--no-progress-bar]
                          [--log-interval LOG_INTERVAL]
                          [--log-format {json,none,simple,tqdm}]
                          [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                          [--aim-run-hash AIM_RUN_HASH]
                          [--tensorboard-logdir TENSORBOARD_LOGDIR]
                          [--wandb-project WANDB_PROJECT] [--azureml-logging]
                          [--seed SEED] [--cpu] [--tpu] [--bf16]
                          [--memory-efficient-bf16] [--fp16]
                          [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                          [--fp16-init-scale FP16_INIT_SCALE]
                          [--fp16-scale-window FP16_SCALE_WINDOW]
                          [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                          [--on-cpu-convert-precision]
                          [--min-loss-scale MIN_LOSS_SCALE]
                          [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                          [--amp] [--amp-batch-retries AMP_BATCH_RETRIES]
                          [--amp-init-scale AMP_INIT_SCALE]
                          [--amp-scale-window AMP_SCALE_WINDOW]
                          [--user-dir USER_DIR]
                          [--empty-cache-freq EMPTY_CACHE_FREQ]
                          [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                          [--model-parallel-size MODEL_PARALLEL_SIZE]
                          [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                          [--profile] [--reset-logging] [--suppress-crashes]
                          [--use-plasma-view] [--plasma-path PLASMA_PATH]
                          [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                          [--tokenizer {moses,nltk,space}]
                          [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                          [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                          [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                          [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                          [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
                          [--validpref FP] [--testpref FP] [--align-suffix FP]
                          [--destdir DIR] [--thresholdtgt N]
                          [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
                          [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
                          [--dataset-impl FORMAT] [--joined-dictionary]
                          [--only-source] [--padding-factor N] [--workers N]
                          [--dict-only]
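
For example, a minimal invocation that binarizes a tokenized German-English corpus might look like the following; the file prefixes, language pair, and worker count are placeholders to adapt to your data:

    fairseq-preprocess \
        --source-lang de --target-lang en \
        --trainpref data/train --validpref data/valid --testpref data/test \
        --destdir data-bin/de-en \
        --workers 8

This expects files such as data/train.de and data/train.en and writes the binarized datasets and dictionaries to data-bin/de-en.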

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

Default: “mmap”

Preprocessing

-s, --source-lang source language
-t, --target-lang target language
--trainpref train file prefix (also used to build dictionaries)
--validpref comma separated, valid file prefixes (words missing from train set are replaced with <unk>)
--testpref comma separated, test file prefixes (words missing from train set are replaced with <unk>)
--align-suffix alignment file suffix
--destdir

destination dir

Default: “data-bin”

--thresholdtgt

map words appearing less than threshold times to unknown

Default: 0

--thresholdsrc

map words appearing less than threshold times to unknown

Default: 0

--tgtdict reuse given target dictionary
--srcdict reuse given source dictionary
--nwordstgt

number of target words to retain

Default: -1

--nwordssrc

number of source words to retain

Default: -1

--alignfile an alignment file (optional)
--joined-dictionary

Generate joined dictionary

Default: False

--only-source

Only process the source language

Default: False

--padding-factor

Pad dictionary size to be multiple of N

Default: 8

--workers

number of parallel workers

Default: 1

--dict-only

if true, only builds a dictionary and then exits

Default: False
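
Combining --joined-dictionary with --dict-only, for instance, builds a single shared source/target vocabulary without binarizing any data; this sketch assumes the same placeholder corpus as above:

    fairseq-preprocess \
        --source-lang de --target-lang en \
        --trainpref data/train \
        --joined-dictionary --dict-only \
        --destdir data-bin/de-en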

fairseq-train

Train a new model on one or across multiple GPUs.

usage: fairseq-train [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                     [--log-format {json,none,simple,tqdm}]
                     [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                     [--aim-run-hash AIM_RUN_HASH]
                     [--tensorboard-logdir TENSORBOARD_LOGDIR]
                     [--wandb-project WANDB_PROJECT] [--azureml-logging]
                     [--seed SEED] [--cpu] [--tpu] [--bf16]
                     [--memory-efficient-bf16] [--fp16]
                     [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                     [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW]
                     [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                     [--on-cpu-convert-precision]
                     [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                     [--amp-batch-retries AMP_BATCH_RETRIES]
                     [--amp-init-scale AMP_INIT_SCALE]
                     [--amp-scale-window AMP_SCALE_WINDOW]
                     [--user-dir USER_DIR]
                     [--empty-cache-freq EMPTY_CACHE_FREQ]
                     [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                     [--profile] [--reset-logging] [--suppress-crashes]
                     [--use-plasma-view] [--plasma-path PLASMA_PATH]
                     [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                     [--tokenizer {moses,nltk,space}]
                     [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                     [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                     [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                     [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                     [--task TASK] [--num-workers NUM_WORKERS]
                     [--skip-invalid-size-inputs-valid-test]
                     [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                     [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                     [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                     [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                     [--data-buffer-size DATA_BUFFER_SIZE]
                     [--train-subset TRAIN_SUBSET]
                     [--valid-subset VALID_SUBSET] [--combine-valid-subsets]
                     [--ignore-unused-valid-subsets]
                     [--validate-interval VALIDATE_INTERVAL]
                     [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                     [--validate-after-updates VALIDATE_AFTER_UPDATES]
                     [--fixed-validation-seed FIXED_VALIDATION_SEED]
                     [--disable-validation]
                     [--max-tokens-valid MAX_TOKENS_VALID]
                     [--batch-size-valid BATCH_SIZE_VALID]
                     [--max-valid-steps MAX_VALID_STEPS]
                     [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                     [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                     [--grouped-shuffling]
                     [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                     [--update-ordered-indices-seed]
                     [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                     [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                     [--distributed-rank DISTRIBUTED_RANK]
                     [--distributed-backend DISTRIBUTED_BACKEND]
                     [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                     [--distributed-port DISTRIBUTED_PORT]
                     [--device-id DEVICE_ID] [--distributed-no-spawn]
                     [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                     [--ddp-comm-hook {none,fp16}]
                     [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                     [--find-unused-parameters] [--gradient-as-bucket-view]
                     [--fast-stat-sync]
                     [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                     [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM]
                     [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                     [--localsgd-frequency LOCALSGD_FREQUENCY]
                     [--nprocs-per-node NPROCS_PER_NODE]
                     [--pipeline-model-parallel]
                     [--pipeline-balance PIPELINE_BALANCE]
                     [--pipeline-devices PIPELINE_DEVICES]
                     [--pipeline-chunks PIPELINE_CHUNKS]
                     [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                     [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                     [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                     [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                     [--pipeline-checkpoint {always,never,except_last}]
                     [--zero-sharding {none,os}] [--no-reshard-after-forward]
                     [--fp32-reduce-scatter] [--cpu-offload]
                     [--use-sharded-state] [--not-fsdp-flatten-parameters]
                     [--arch ARCH] [--max-epoch MAX_EPOCH]
                     [--max-update MAX_UPDATE]
                     [--stop-time-hours STOP_TIME_HOURS]
                     [--clip-norm CLIP_NORM] [--sentence-avg]
                     [--update-freq UPDATE_FREQ] [--lr LR]
                     [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                     [--skip-remainder-batch] [--save-dir SAVE_DIR]
                     [--restore-file RESTORE_FILE]
                     [--continue-once CONTINUE_ONCE]
                     [--finetune-from-model FINETUNE_FROM_MODEL]
                     [--reset-dataloader] [--reset-lr-scheduler]
                     [--reset-meters] [--reset-optimizer]
                     [--optimizer-overrides OPTIMIZER_OVERRIDES]
                     [--save-interval SAVE_INTERVAL]
                     [--save-interval-updates SAVE_INTERVAL_UPDATES]
                     [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                     [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                     [--keep-last-epochs KEEP_LAST_EPOCHS]
                     [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                     [--no-save] [--no-epoch-checkpoints]
                     [--no-last-checkpoints] [--no-save-optimizer-state]
                     [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                     [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                     [--checkpoint-suffix CHECKPOINT_SUFFIX]
                     [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--load-checkpoint-on-all-dp-ranks]
                     [--write-checkpoints-asynchronously] [--store-ema]
                     [--ema-decay EMA_DECAY]
                     [--ema-start-update EMA_START_UPDATE]
                     [--ema-seed-model EMA_SEED_MODEL]
                     [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
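
As a rough sketch (not tuned hyperparameters), a translation model could be trained on the binarized data produced by fairseq-preprocess like this; the data directory, architecture, and learning-rate settings are illustrative:

    fairseq-train data-bin/de-en \
        --task translation --arch transformer_iwslt_de_en \
        --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
        --criterion label_smoothed_cross_entropy \
        --max-tokens 4096 --max-epoch 50 \
        --save-dir checkpoints/de-en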

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch
--required-batch-size-multiple

batch size will be a multiplier of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiplier of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val if set, combine the validation subsets (see --valid-subset) into a single dataset
--ignore-unused-valid-subsets

do not raise error if valid subsets are ignored

Default: False

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don't validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

--grouped-shuffling

shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false (defaults to the value of --grouped-shuffling)
--update-ordered-indices-seed

if true, increment the seed with the epoch when getting batch iterators; defaults to False.

Default: False

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs

total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook

Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view

when set to True, gradients will be views pointing into different offsets of the allreduce communication buckets, which can reduce peak memory usage by roughly the total size of the gradients.

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--heartbeat-timeout

kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs
--slowmo-base-algorithm

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details

Default: “localsgd”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward

don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter

reduce-scatter grads in FP32

Default: False

--cpu-offload

offload FP32 params to CPU

Default: False

--use-sharded-state

use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters

do not flatten parameters for FSDP

Default: False

Model configuration

--arch, -a

Possible choices: transformer_tiny, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_from_pretrained_xlm, transformer_align, transformer_wmt_en_de_big_align, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, roberta, roberta_prenorm, roberta_base, roberta_large, xlm, roberta_enc_dec, xmod_base_13, xmod_base_30, xmod_base_60, xmod_base_75, xmod_base, xmod_large_prenorm, s2t_berard, s2t_berard_256_3_3, s2t_berard_512_3_2, s2t_berard_512_5_3, convtransformer, convtransformer_espnet, s2t_transformer, s2t_transformer_s, s2t_transformer_xs, s2t_transformer_sp, s2t_transformer_m, s2t_transformer_mp, s2t_transformer_l, s2t_transformer_lp, wav2vec, wav2vec2, wav2vec_ctc, wav2vec_seq2seq, xm_transformer, s2t_conformer, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, masked_lm, bert_base, bert_large, xlm_base, tacotron_2, tts_transformer, fastspeech2, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, lightconv_lm, lightconv_lm_gbw, lstm_lm, s2ut_transformer, s2ut_transformer_fisher, s2spect_transformer, s2spect_transformer_fisher, s2ut_conformer, hf_gpt2, hf_gpt2_medium, hf_gpt2_large, hf_gpt2_xl, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_tiny, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, transformer_lm_gpt2_big_wide, transformer_lm_gpt2_bigger, transformer_lm_gpt3_small, transformer_lm_gpt3_medium, transformer_lm_gpt3_large, transformer_lm_gpt3_xl, transformer_lm_gpt3_2_7, transformer_lm_gpt3_6_7, transformer_lm_gpt3_13, transformer_lm_gpt3_175, multilingual_transformer, multilingual_transformer_iwslt_de_en, bart_large, bart_base, mbart_large, mbart_base, mbart_base_wmt20, transformer_ulm, transformer_ulm_big, transformer_ulm_tiny, hubert, hubert_ctc, hubert_seq2seq, fconv_self_att, fconv_self_att_wp, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, nonautoregressive_transformer, nonautoregressive_transformer_wmt_en_de, nacrf_transformer, iterative_nonautoregressive_transformer, iterative_nonautoregressive_transformer_wmt_en_de, cmlm_transformer, cmlm_transformer_wmt_en_de, levenshtein_transformer, levenshtein_transformer_wmt_en_de, levenshtein_transformer_vaswani_wmt_en_de_big, levenshtein_transformer_wmt_en_de_big, insertion_transformer, dummy_model, model_parallel_roberta, model_parallel_roberta_v1, model_parallel_roberta_postnorm, model_parallel_roberta_base, model_parallel_roberta_large, transformer_iwslt_de_en_pipeline_parallel, transformer_wmt_en_de_big_pipeline_parallel, transformer_lm_megatron, transformer_lm_megatron_11b

model architecture

optimization

--max-epoch

force stop training at specified epoch

Default: 0

--max-update

force stop training at specified update

Default: 0

--stop-time-hours

force stop training after specified cumulative time (if >0)

Default: 0

--clip-norm

clip threshold of gradients

Default: 0.0

--sentence-avg

normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens)

Default: False

--update-freq

update parameters every N_i batches, when in epoch i

Default: 1
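
For example, --update-freq 4 accumulates gradients over 4 batches before each parameter update, which roughly simulates a 4x larger effective batch (or 4x as many GPUs); the surrounding flags are placeholders:

    fairseq-train data-bin/de-en --arch transformer_iwslt_de_en \
        --max-tokens 4096 --update-freq 4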

--lr

learning rate for the first N epochs; all epochs >N using LR_N (note: this may be interpreted differently depending on –lr-scheduler)

Default: 0.25

--stop-min-lr

stop training when the learning rate reaches this minimum

Default: -1.0

--use-bmuf

specify global optimizer for syncing models on different GPUs/shards

Default: False

--skip-remainder-batch

if set, include the last (partial) batch of each epoch in training (default is to skip it).

Default: False

checkpoint

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset
--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”
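
The override string is parsed as a dictionary literal, so it must be quoted on the command line; for example, the following (hypothetical values) forces a new learning rate for the optimizer state loaded from a checkpoint:

    fairseq-train data-bin/de-en --arch transformer_iwslt_de_en \
        --restore-file checkpoints/de-en/checkpoint_last.pt \
        --optimizer-overrides "{'lr': 0.0003}"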

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with --save-interval-updates

Default: -1

--keep-interval-updates-pattern

when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--keep-best-checkpoints

keep best N checkpoints based on scores

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

--patience

early stop training if valid performance doesn't improve for N consecutive validation runs; note that this is influenced by --validate-interval

Default: -1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--load-checkpoint-on-all-dp-ranks

load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False
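
Putting a few of the checkpoint options above together, a space-conscious setup might save every 1000 updates, keep only the five most recent update checkpoints, and skip per-epoch checkpoints entirely (values are illustrative):

    fairseq-train data-bin/de-en --arch transformer_iwslt_de_en \
        --save-dir checkpoints/de-en \
        --save-interval-updates 1000 --keep-interval-updates 5 \
        --no-epoch-checkpoints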

EMA configuration

--store-ema store an exponential moving average (EMA) copy of the model during training. Default: False
--ema-decay

decay for exponential moving average model

Default: 0.9999

--ema-start-update

start EMA update after this many model updates

Default: 0

--ema-seed-model Seed to load EMA model from. Used to load EMA model separately from the actual model.
--ema-update-freq

Do EMA update every this many model updates

Default: 1

--ema-fp32

If true, store EMA model in fp32 even if model is in fp16

Default: False

fairseq-generate

Translate pre-processed data with a trained model.

usage: fairseq-generate [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                        [--log-format {json,none,simple,tqdm}]
                        [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                        [--aim-run-hash AIM_RUN_HASH]
                        [--tensorboard-logdir TENSORBOARD_LOGDIR]
                        [--wandb-project WANDB_PROJECT] [--azureml-logging]
                        [--seed SEED] [--cpu] [--tpu] [--bf16]
                        [--memory-efficient-bf16] [--fp16]
                        [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                        [--fp16-init-scale FP16_INIT_SCALE]
                        [--fp16-scale-window FP16_SCALE_WINDOW]
                        [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                        [--on-cpu-convert-precision]
                        [--min-loss-scale MIN_LOSS_SCALE]
                        [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                        [--amp-batch-retries AMP_BATCH_RETRIES]
                        [--amp-init-scale AMP_INIT_SCALE]
                        [--amp-scale-window AMP_SCALE_WINDOW]
                        [--user-dir USER_DIR]
                        [--empty-cache-freq EMPTY_CACHE_FREQ]
                        [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                        [--model-parallel-size MODEL_PARALLEL_SIZE]
                        [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                        [--profile] [--reset-logging] [--suppress-crashes]
                        [--use-plasma-view] [--plasma-path PLASMA_PATH]
                        [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                        [--tokenizer {moses,nltk,space}]
                        [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                        [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                        [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                        [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                        [--task TASK] [--num-workers NUM_WORKERS]
                        [--skip-invalid-size-inputs-valid-test]
                        [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                        [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                        [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                        [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                        [--data-buffer-size DATA_BUFFER_SIZE]
                        [--train-subset TRAIN_SUBSET]
                        [--valid-subset VALID_SUBSET]
                        [--combine-valid-subsets]
                        [--ignore-unused-valid-subsets]
                        [--validate-interval VALIDATE_INTERVAL]
                        [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                        [--validate-after-updates VALIDATE_AFTER_UPDATES]
                        [--fixed-validation-seed FIXED_VALIDATION_SEED]
                        [--disable-validation]
                        [--max-tokens-valid MAX_TOKENS_VALID]
                        [--batch-size-valid BATCH_SIZE_VALID]
                        [--max-valid-steps MAX_VALID_STEPS]
                        [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                        [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                        [--grouped-shuffling]
                        [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                        [--update-ordered-indices-seed]
                        [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                        [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                        [--distributed-rank DISTRIBUTED_RANK]
                        [--distributed-backend DISTRIBUTED_BACKEND]
                        [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                        [--distributed-port DISTRIBUTED_PORT]
                        [--device-id DEVICE_ID] [--distributed-no-spawn]
                        [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                        [--ddp-comm-hook {none,fp16}]
                        [--bucket-cap-mb BUCKET_CAP_MB]
                        [--fix-batches-to-gpus] [--find-unused-parameters]
                        [--gradient-as-bucket-view] [--fast-stat-sync]
                        [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                        [--broadcast-buffers]
                        [--slowmo-momentum SLOWMO_MOMENTUM]
                        [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                        [--localsgd-frequency LOCALSGD_FREQUENCY]
                        [--nprocs-per-node NPROCS_PER_NODE]
                        [--pipeline-model-parallel]
                        [--pipeline-balance PIPELINE_BALANCE]
                        [--pipeline-devices PIPELINE_DEVICES]
                        [--pipeline-chunks PIPELINE_CHUNKS]
                        [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                        [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                        [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                        [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                        [--pipeline-checkpoint {always,never,except_last}]
                        [--zero-sharding {none,os}]
                        [--no-reshard-after-forward] [--fp32-reduce-scatter]
                        [--cpu-offload] [--use-sharded-state]
                        [--not-fsdp-flatten-parameters] [--path PATH]
                        [--post-process [POST_PROCESS]] [--quiet]
                        [--model-overrides MODEL_OVERRIDES]
                        [--results-path RESULTS_PATH] [--beam BEAM]
                        [--nbest NBEST] [--max-len-a MAX_LEN_A]
                        [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                        [--match-source-len] [--unnormalized]
                        [--no-early-stop] [--no-beamable-mm] [--lenpen LENPEN]
                        [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                        [--sacrebleu] [--score-reference]
                        [--prefix-size PREFIX_SIZE]
                        [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                        [--sampling] [--sampling-topk SAMPLING_TOPK]
                        [--sampling-topp SAMPLING_TOPP]
                        [--constraints [{ordered,unordered}]]
                        [--temperature TEMPERATURE]
                        [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                        [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                        [--diversity-rate DIVERSITY_RATE]
                        [--print-alignment [{hard,soft}]] [--print-step]
                        [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                        [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                        [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                        [--iter-decode-force-max-iter]
                        [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                        [--iter-decode-with-external-reranker]
                        [--retain-iter-history] [--retain-dropout]
                        [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                        [--decoding-format {unigram,ensemble,vote,dp,bs}]
                        [--no-seed-provided] [--eos-token EOS_TOKEN]
                        [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
                        [--continue-once CONTINUE_ONCE]
                        [--finetune-from-model FINETUNE_FROM_MODEL]
                        [--reset-dataloader] [--reset-lr-scheduler]
                        [--reset-meters] [--reset-optimizer]
                        [--optimizer-overrides OPTIMIZER_OVERRIDES]
                        [--save-interval SAVE_INTERVAL]
                        [--save-interval-updates SAVE_INTERVAL_UPDATES]
                        [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                        [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                        [--keep-last-epochs KEEP_LAST_EPOCHS]
                        [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                        [--no-save] [--no-epoch-checkpoints]
                        [--no-last-checkpoints] [--no-save-optimizer-state]
                        [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                        [--maximize-best-checkpoint-metric]
                        [--patience PATIENCE]
                        [--checkpoint-suffix CHECKPOINT_SUFFIX]
                        [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                        [--load-checkpoint-on-all-dp-ranks]
                        [--write-checkpoints-asynchronously]
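
For instance, decoding the binarized test split with a trained checkpoint might look like the following; the paths and beam settings are placeholders:

    fairseq-generate data-bin/de-en \
        --path checkpoints/de-en/checkpoint_best.pt \
        --gen-subset test \
        --batch-size 128 --beam 5 \
        --remove-bpe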

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch
--required-batch-size-multiple

batch size will be a multiplier of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiplier of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val if set, combine the validation subsets (see --valid-subset) into a single dataset
--ignore-unused-valid-subsets

do not raise error if valid subsets are ignored

Default: False

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don't validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to --batch-size)
--max-valid-steps, --nval How many batches to evaluate
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

--grouped-shuffling

shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false (defaults to the value of --grouped-shuffling)
--update-ordered-indices-seed

if true, increment the seed with the epoch when getting batch iterators; defaults to False.

Default: False

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs

total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook

Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view

when set to True, gradients will be views pointing into different offsets of the allreduce communication buckets, which can reduce peak memory usage by roughly the total size of the gradients.

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--heartbeat-timeout

kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs
--slowmo-base-algorithm

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details

Default: “localsgd”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the –pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the –pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”
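
As a rough sketch of how the pipeline options above fit together (the model, data path, and the exact list syntax for the balance/devices values are assumptions, not taken from this reference), a hypothetical 12-layer model split evenly across two GPUs might be launched as:

   fairseq-train data-bin/my_corpus \
       --pipeline-model-parallel \
       --pipeline-balance 6,6 \
       --pipeline-devices 0,1 \
       --pipeline-chunks 4

Here the balance entries sum to the total number of layers (12), and the devices list has the same length as the balance list, as the descriptions above require.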

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward

don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter

reduce-scatter grads in FP32

Default: False

--cpu-offload

offload FP32 params to CPU

Default: False

--use-sharded-state

use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters

do not flatten parameters for FSDP

Default: False

Generation

--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process.
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)
--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-early-stop

deprecated

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)
--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking such that this size ngram cannot be repeated in the generation

Default: 0

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0

--constraints

Possible choices: ordered, unordered

enables lexically constrained decoding

--temperature

temperature for generation

Default: 1.0
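
For sampling-based generation, the sampling options above are typically combined along these lines (shown with fairseq-generate; the data and checkpoint paths and the specific values are illustrative assumptions):

   fairseq-generate data-bin/my_corpus \
       --path checkpoints/checkpoint_best.pt \
       --sampling --sampling-topp 0.9 --temperature 0.8 \
       --beam 1 --nbest 1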

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--diversity-rate

strength of diversity penalty for Diverse Siblings Search

Default: -1.0

--print-alignment

Possible choices: hard, soft

if set, uses attention feedback to compute and print alignment to source tokens (valid options: hard, soft; any other value is treated as hard alignment)

--print-step

print steps

Default: False

--lm-path path to lm checkpoint for lm fusion
--lm-weight

weight for lm probs for lm fusion

Default: 0.0

--iter-decode-eos-penalty

if > 0.0, penalizes early stopping in decoding.

Default: 0.0

--iter-decode-max-iter

maximum iterations for iterative refinement.

Default: 10

--iter-decode-force-max-iter

if set, run exactly the maximum number of iterations without early stopping

Default: False

--iter-decode-with-beam

if > 1, the model will generate translations of varying lengths.

Default: 1

--iter-decode-with-external-reranker

if set, the last checkpoint is assumed to be a reranker used to rescore the translations

Default: False

--retain-iter-history

if set, decoding returns the whole history of iterative refinement

Default: False
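
As an illustrative sketch of iterative refinement decoding with a non-autoregressive model (the data path, checkpoint, and specific values below are assumptions), the iter-decode options above might be used as:

   fairseq-generate data-bin/wmt14_en_de_distill \
       --task translation_lev \
       --path checkpoints/checkpoint_best.pt \
       --gen-subset test \
       --iter-decode-max-iter 9 \
       --iter-decode-eos-penalty 3 \
       --iter-decode-with-beam 1 \
       --remove-bpe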

--retain-dropout

Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules
--decoding-format

Possible choices: unigram, ensemble, vote, dp, bs

special decoding format for advanced decoding.

--no-seed-provided

if set, don’t use a seed for initializing random generators

Default: False

--eos-token EOS token
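
Putting several of the generation options together, a typical beam-search run (e.g. with fairseq-generate; the data and checkpoint paths here are hypothetical) could look like:

   fairseq-generate data-bin/wmt17_en_de \
       --path checkpoints/checkpoint_best.pt \
       --gen-subset test \
       --beam 5 --lenpen 1.2 \
       --max-len-a 1.2 --max-len-b 10 \
       --remove-bpe \
       --results-path results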

checkpoint

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset
--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with –save-interval-updates

Default: -1

--keep-interval-updates-pattern

when used with –keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--keep-best-checkpoints

keep best N checkpoints based on scores

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

--patience

early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by –validate-interval

Default: -1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--load-checkpoint-on-all-dp-ranks

load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False
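
The checkpoint options above are shared across the fairseq tools but mainly take effect during training. As a sketch (all paths and values below are illustrative assumptions), a fairseq-train run that keeps a rolling window of checkpoints might use:

   fairseq-train data-bin/my_corpus \
       --save-dir checkpoints \
       --save-interval-updates 1000 \
       --keep-interval-updates 5 \
       --keep-best-checkpoints 3 \
       --best-checkpoint-metric loss \
       --patience 10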

fairseq-interactive

Translate raw text with a trained model. Batches data on-the-fly.

usage: fairseq-interactive [-h] [--no-progress-bar]
                           [--log-interval LOG_INTERVAL]
                           [--log-format {json,none,simple,tqdm}]
                           [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                           [--aim-run-hash AIM_RUN_HASH]
                           [--tensorboard-logdir TENSORBOARD_LOGDIR]
                           [--wandb-project WANDB_PROJECT] [--azureml-logging]
                           [--seed SEED] [--cpu] [--tpu] [--bf16]
                           [--memory-efficient-bf16] [--fp16]
                           [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                           [--fp16-init-scale FP16_INIT_SCALE]
                           [--fp16-scale-window FP16_SCALE_WINDOW]
                           [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                           [--on-cpu-convert-precision]
                           [--min-loss-scale MIN_LOSS_SCALE]
                           [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                           [--amp] [--amp-batch-retries AMP_BATCH_RETRIES]
                           [--amp-init-scale AMP_INIT_SCALE]
                           [--amp-scale-window AMP_SCALE_WINDOW]
                           [--user-dir USER_DIR]
                           [--empty-cache-freq EMPTY_CACHE_FREQ]
                           [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                           [--model-parallel-size MODEL_PARALLEL_SIZE]
                           [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                           [--profile] [--reset-logging] [--suppress-crashes]
                           [--use-plasma-view] [--plasma-path PLASMA_PATH]
                           [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                           [--tokenizer {moses,nltk,space}]
                           [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                           [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                           [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                           [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                           [--task TASK] [--num-workers NUM_WORKERS]
                           [--skip-invalid-size-inputs-valid-test]
                           [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                           [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                           [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                           [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                           [--data-buffer-size DATA_BUFFER_SIZE]
                           [--train-subset TRAIN_SUBSET]
                           [--valid-subset VALID_SUBSET]
                           [--combine-valid-subsets]
                           [--ignore-unused-valid-subsets]
                           [--validate-interval VALIDATE_INTERVAL]
                           [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                           [--validate-after-updates VALIDATE_AFTER_UPDATES]
                           [--fixed-validation-seed FIXED_VALIDATION_SEED]
                           [--disable-validation]
                           [--max-tokens-valid MAX_TOKENS_VALID]
                           [--batch-size-valid BATCH_SIZE_VALID]
                           [--max-valid-steps MAX_VALID_STEPS]
                           [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                           [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                           [--grouped-shuffling]
                           [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                           [--update-ordered-indices-seed]
                           [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                           [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                           [--distributed-rank DISTRIBUTED_RANK]
                           [--distributed-backend DISTRIBUTED_BACKEND]
                           [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                           [--distributed-port DISTRIBUTED_PORT]
                           [--device-id DEVICE_ID] [--distributed-no-spawn]
                           [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                           [--ddp-comm-hook {none,fp16}]
                           [--bucket-cap-mb BUCKET_CAP_MB]
                           [--fix-batches-to-gpus] [--find-unused-parameters]
                           [--gradient-as-bucket-view] [--fast-stat-sync]
                           [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                           [--broadcast-buffers]
                           [--slowmo-momentum SLOWMO_MOMENTUM]
                           [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                           [--localsgd-frequency LOCALSGD_FREQUENCY]
                           [--nprocs-per-node NPROCS_PER_NODE]
                           [--pipeline-model-parallel]
                           [--pipeline-balance PIPELINE_BALANCE]
                           [--pipeline-devices PIPELINE_DEVICES]
                           [--pipeline-chunks PIPELINE_CHUNKS]
                           [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                           [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                           [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                           [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                           [--pipeline-checkpoint {always,never,except_last}]
                           [--zero-sharding {none,os}]
                           [--no-reshard-after-forward]
                           [--fp32-reduce-scatter] [--cpu-offload]
                           [--use-sharded-state]
                           [--not-fsdp-flatten-parameters] [--path PATH]
                           [--post-process [POST_PROCESS]] [--quiet]
                           [--model-overrides MODEL_OVERRIDES]
                           [--results-path RESULTS_PATH] [--beam BEAM]
                           [--nbest NBEST] [--max-len-a MAX_LEN_A]
                           [--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
                           [--match-source-len] [--unnormalized]
                           [--no-early-stop] [--no-beamable-mm]
                           [--lenpen LENPEN] [--unkpen UNKPEN]
                           [--replace-unk [REPLACE_UNK]] [--sacrebleu]
                           [--score-reference] [--prefix-size PREFIX_SIZE]
                           [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
                           [--sampling] [--sampling-topk SAMPLING_TOPK]
                           [--sampling-topp SAMPLING_TOPP]
                           [--constraints [{ordered,unordered}]]
                           [--temperature TEMPERATURE]
                           [--diverse-beam-groups DIVERSE_BEAM_GROUPS]
                           [--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
                           [--diversity-rate DIVERSITY_RATE]
                           [--print-alignment [{hard,soft}]] [--print-step]
                           [--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
                           [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
                           [--iter-decode-max-iter ITER_DECODE_MAX_ITER]
                           [--iter-decode-force-max-iter]
                           [--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
                           [--iter-decode-with-external-reranker]
                           [--retain-iter-history] [--retain-dropout]
                           [--retain-dropout-modules RETAIN_DROPOUT_MODULES]
                           [--decoding-format {unigram,ensemble,vote,dp,bs}]
                           [--no-seed-provided] [--eos-token EOS_TOKEN]
                           [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
                           [--continue-once CONTINUE_ONCE]
                           [--finetune-from-model FINETUNE_FROM_MODEL]
                           [--reset-dataloader] [--reset-lr-scheduler]
                           [--reset-meters] [--reset-optimizer]
                           [--optimizer-overrides OPTIMIZER_OVERRIDES]
                           [--save-interval SAVE_INTERVAL]
                           [--save-interval-updates SAVE_INTERVAL_UPDATES]
                           [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                           [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                           [--keep-last-epochs KEEP_LAST_EPOCHS]
                           [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                           [--no-save] [--no-epoch-checkpoints]
                           [--no-last-checkpoints] [--no-save-optimizer-state]
                           [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                           [--maximize-best-checkpoint-metric]
                           [--patience PATIENCE]
                           [--checkpoint-suffix CHECKPOINT_SUFFIX]
                           [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                           [--load-checkpoint-on-all-dp-ranks]
                           [--write-checkpoints-asynchronously]
                           [--buffer-size BUFFER_SIZE] [--input INPUT]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match –logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies –tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies –bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies –fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in a batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)
--ignore-unused-valid-subsets

do not raise error if valid subsets are ignored

Default: False

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to –max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to –batch-size)
--max-valid-steps, --nval How many batches to evaluate
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

--grouped-shuffling

shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false (defaults to the value of --grouped-shuffling)
--update-ordered-indices-seed

if true, increment the seed with the epoch when getting batch iterators

Default: False

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs

total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using –distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook

Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to –ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view

when set to True, gradients will be views pointing into different offsets of the allreduce communication buckets. This can reduce peak memory usage; the memory saved equals the total gradient size.

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--heartbeat-timeout

kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs
--slowmo-base-algorithm

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details

Default: “localsgd”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the –pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the –pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the –pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward

don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter

reduce-scatter grads in FP32

Default: False

--cpu-offload

offload FP32 params to CPU

Default: False

--use-sharded-state

use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters

do not flatten parameters for FSDP

Default: False

Generation

--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process.
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)
--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-early-stop

deprecated

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)
--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking such that this size ngram cannot be repeated in the generation

Default: 0

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0

--constraints

Possible choices: ordered, unordered

enables lexically constrained decoding

--temperature

temperature for generation

Default: 1.0

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--diversity-rate

strength of diversity penalty for Diverse Siblings Search

Default: -1.0

--print-alignment

Possible choices: hard, soft

if set, uses attention feedback to compute and print alignment to source tokens (valid options: hard, soft; any other value is treated as hard alignment)

--print-step

print steps

Default: False

--lm-path path to lm checkpoint for lm fusion
--lm-weight

weight for lm probs for lm fusion

Default: 0.0

--iter-decode-eos-penalty

if > 0.0, penalizes early stopping in decoding.

Default: 0.0

--iter-decode-max-iter

maximum iterations for iterative refinement.

Default: 10

--iter-decode-force-max-iter

if set, run exactly the maximum number of iterations without early stopping

Default: False

--iter-decode-with-beam

if > 1, the model will generate translations of varying lengths.

Default: 1

--iter-decode-with-external-reranker

if set, the last checkpoint is assumed to be a reranker used to rescore the translations

Default: False

--retain-iter-history

if set, decoding returns the whole history of iterative refinement

Default: False

--retain-dropout

Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules
--decoding-format

Possible choices: unigram, ensemble, vote, dp, bs

special decoding format for advanced decoding.

--no-seed-provided

if set, don’t use a seed for initializing random generators

Default: False

--eos-token EOS token

checkpoint

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--continue-once continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present
--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset
--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with –save-interval-updates

Default: -1

--keep-interval-updates-pattern

when used with –keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--keep-best-checkpoints

keep best N checkpoints based on scores

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

--patience

early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by –validate-interval

Default: -1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--load-checkpoint-on-all-dp-ranks

load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices)

Default: False

--write-checkpoints-asynchronously, --save-async

Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested.

Default: False

Interactive

--buffer-size

read this many sentences into a buffer before processing them

Default: 0

--input

file to read from; use - for stdin

Default: “-”
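
For example (the model, data directory, and input file below are hypothetical), raw text can be piped through fairseq-interactive with buffered batching:

   cat source.txt | fairseq-interactive data-bin/wmt17_en_de \
       --path checkpoints/checkpoint_best.pt \
       --beam 5 --remove-bpe \
       --buffer-size 64 --batch-size 32 \
       --input -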

fairseq-score

BLEU scoring of generated translations against reference translations.

Command-line script for BLEU scoring.

usage: fairseq-score [-h] [-s SYS] -r REF [-o N] [--ignore-case] [--sacrebleu]
                     [--sentence-bleu]

Named Arguments

-s, --sys

system output

Default: “-”

-r, --ref references
-o, --order

consider ngrams up to this order

Default: 4

--ignore-case

case-insensitive scoring

Default: False

--sacrebleu

score with sacrebleu

Default: False

--sentence-bleu

report sentence-level BLEUs (i.e., with +1 smoothing)

Default: False
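
For instance (file names are hypothetical), to score a system output against a reference file:

   fairseq-score --sys generate.sys.txt --ref generate.ref.txt

   # or, with sacrebleu and case-insensitive scoring
   fairseq-score -s generate.sys.txt -r generate.ref.txt --sacrebleu --ignore-case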

fairseq-eval-lm

Evaluate the perplexity of a trained language model.

usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                       [--log-format {json,none,simple,tqdm}]
                       [--log-file LOG_FILE] [--aim-repo AIM_REPO]
                       [--aim-run-hash AIM_RUN_HASH]
                       [--tensorboard-logdir TENSORBOARD_LOGDIR]
                       [--wandb-project WANDB_PROJECT] [--azureml-logging]
                       [--seed SEED] [--cpu] [--tpu] [--bf16]
                       [--memory-efficient-bf16] [--fp16]
                       [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                       [--fp16-init-scale FP16_INIT_SCALE]
                       [--fp16-scale-window FP16_SCALE_WINDOW]
                       [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                       [--on-cpu-convert-precision]
                       [--min-loss-scale MIN_LOSS_SCALE]
                       [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                       [--amp-batch-retries AMP_BATCH_RETRIES]
                       [--amp-init-scale AMP_INIT_SCALE]
                       [--amp-scale-window AMP_SCALE_WINDOW]
                       [--user-dir USER_DIR]
                       [--empty-cache-freq EMPTY_CACHE_FREQ]
                       [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                       [--model-parallel-size MODEL_PARALLEL_SIZE]
                       [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                       [--profile] [--reset-logging] [--suppress-crashes]
                       [--use-plasma-view] [--plasma-path PLASMA_PATH]
                       [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
                       [--tokenizer {moses,nltk,space}]
                       [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                       [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                       [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                       [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                       [--task TASK] [--num-workers NUM_WORKERS]
                       [--skip-invalid-size-inputs-valid-test]
                       [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                       [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                       [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                       [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                       [--data-buffer-size DATA_BUFFER_SIZE]
                       [--train-subset TRAIN_SUBSET]
                       [--valid-subset VALID_SUBSET] [--combine-valid-subsets]
                       [--ignore-unused-valid-subsets]
                       [--validate-interval VALIDATE_INTERVAL]
                       [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                       [--validate-after-updates VALIDATE_AFTER_UPDATES]
                       [--fixed-validation-seed FIXED_VALIDATION_SEED]
                       [--disable-validation]
                       [--max-tokens-valid MAX_TOKENS_VALID]
                       [--batch-size-valid BATCH_SIZE_VALID]
                       [--max-valid-steps MAX_VALID_STEPS]
                       [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                       [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                       [--grouped-shuffling]
                       [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                       [--update-ordered-indices-seed]
                       [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                       [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                       [--distributed-rank DISTRIBUTED_RANK]
                       [--distributed-backend DISTRIBUTED_BACKEND]
                       [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                       [--distributed-port DISTRIBUTED_PORT]
                       [--device-id DEVICE_ID] [--distributed-no-spawn]
                       [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                       [--ddp-comm-hook {none,fp16}]
                       [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                       [--find-unused-parameters] [--gradient-as-bucket-view]
                       [--fast-stat-sync]
                       [--heartbeat-timeout HEARTBEAT_TIMEOUT]
                       [--broadcast-buffers]
                       [--slowmo-momentum SLOWMO_MOMENTUM]
                       [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                       [--localsgd-frequency LOCALSGD_FREQUENCY]
                       [--nprocs-per-node NPROCS_PER_NODE]
                       [--pipeline-model-parallel]
                       [--pipeline-balance PIPELINE_BALANCE]
                       [--pipeline-devices PIPELINE_DEVICES]
                       [--pipeline-chunks PIPELINE_CHUNKS]
                       [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                       [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                       [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                       [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                       [--pipeline-checkpoint {always,never,except_last}]
                       [--zero-sharding {none,os}]
                       [--no-reshard-after-forward] [--fp32-reduce-scatter]
                       [--cpu-offload] [--use-sharded-state]
                       [--not-fsdp-flatten-parameters] [--path PATH]
                       [--post-process [POST_PROCESS]] [--quiet]
                       [--model-overrides MODEL_OVERRIDES]
                       [--results-path RESULTS_PATH] [--output-word-probs]
                       [--output-word-stats] [--context-window CONTEXT_WINDOW]
                       [--softmax-batch SOFTMAX_BATCH]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--log-file log file to copy metrics to.
--aim-repo path to Aim repository
--aim-run-hash Aim run hash. If skipped, creates or continues run based on save_dir
--tensorboard-logdir path to save logs for tensorboard, should match –logdir of running tensorboard (default: no tensorboard logging)
--wandb-project Weights and Biases project name to use for logging
--azureml-logging

Log scalars to AzureML context

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies –tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies –bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies –fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--on-cpu-convert-precision

if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage.

Default: False

--min-loss-scale

minimum FP16/AMP loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--amp

use automatic mixed precision

Default: False

--amp-batch-retries

number of retries of same batch after reducing loss scale with AMP

Default: 2

--amp-init-scale

default AMP loss scale

Default: 128

--amp-scale-window number of updates before increasing AMP loss scale
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--reset-logging

when using Hydra, reset the logging at the beginning of training

Default: False

--suppress-crashes

suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps)

Default: False

--use-plasma-view

Store indices and sizes in shared memory

Default: False

--plasma-path

path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail.

Default: “/tmp/plasma”

--criterion

Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: moses, nltk, space
--bpe Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt
--optimizer Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd
--lr-scheduler

Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular

Default: “fixed”

--scoring

Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer

Default: “bleu”

--task

Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “language_modeling”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size, --max-sentences number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in a batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta, huffman

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--combine-valid-subsets, --combine-val comma separated list of data subsets to use for validation (e.g. train, valid, test)
--ignore-unused-valid-subsets

do not raise error if valid subsets are ignored

Default: False

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to –max-tokens)
--batch-size-valid, --max-sentences-valid batch size of the validation batch (defaults to –batch-size)
--max-valid-steps, --nval How many batches to evaluate
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

--grouped-shuffling

shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length

Default: False

--update-epoch-batch-itr if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false (defaults to the value of --grouped-shuffling)
--update-ordered-indices-seed

if true, increment the seed with the epoch when getting batch iterators

Default: False

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-num-procs

total number of processes to fork (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using –distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo

DistributedDataParallel backend

Default: “pytorch_ddp”

--ddp-comm-hook

Possible choices: none, fp16

communication hook

Default: “none”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to –ddp-backend=legacy_ddp)

Default: False

--gradient-as-bucket-view

when set to True, gradients will be views pointing into different offsets of the allreduce communication buckets. This can reduce peak memory usage; the memory saved equals the total gradient size.

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--heartbeat-timeout

kill the job if no progress is made in N seconds; set to -1 to disable

Default: -1

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--slowmo-momentum SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs
--slowmo-base-algorithm

Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details

Default: “localsgd”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the –pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the –pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the –pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

--no-reshard-after-forward

don’t reshard parameters after forward pass

Default: False

--fp32-reduce-scatter

reduce-scatter grads in FP32

Default: False

--cpu-offload

offload FP32 params to CPU

Default: False

--use-sharded-state

use sharded checkpoint files

Default: False

--not-fsdp-flatten-parameters

do not flatten parameters for FSDP

Default: False

LM Evaluation

--path path(s) to model file(s), colon separated
--post-process, --remove-bpe post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process.
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)
--output-word-probs

if set, outputs words and their predicted log probabilities to standard output

Default: False

--output-word-stats

if set, outputs word statistics such as word count, average probability, etc

Default: False

--context-window

ensures that every evaluated token has access to a context of at least this size, if possible

Default: 0

--softmax-batch

if BxT is larger than this value, the softmax over the vocabulary is batched into chunks of at most this many tokens, in order to fit into GPU memory

Default: 9223372036854775807
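
As a sketch (the data directory and checkpoint path are hypothetical), perplexity evaluation with a long context window could be run as:

   fairseq-eval-lm data-bin/wikitext-103 \
       --path checkpoints/checkpoint_best.pt \
       --gen-subset test \
       --batch-size 2 \
       --context-window 400 \
       --softmax-batch 1024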