Command-line Tools

Fairseq provides several command-line tools for training and evaluating models:

fairseq-preprocess

Data pre-processing: build vocabularies and binarize training data.

usage: fairseq-preprocess [-h] [--no-progress-bar]
                          [--log-interval LOG_INTERVAL]
                          [--log-format {json,none,simple,tqdm}]
                          [--tensorboard-logdir TENSORBOARD_LOGDIR]
                          [--seed SEED] [--cpu] [--tpu] [--bf16]
                          [--memory-efficient-bf16] [--fp16]
                          [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                          [--fp16-init-scale FP16_INIT_SCALE]
                          [--fp16-scale-window FP16_SCALE_WINDOW]
                          [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                          [--min-loss-scale MIN_LOSS_SCALE]
                          [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                          [--user-dir USER_DIR]
                          [--empty-cache-freq EMPTY_CACHE_FREQ]
                          [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                          [--model-parallel-size MODEL_PARALLEL_SIZE]
                          [--checkpoint-suffix CHECKPOINT_SUFFIX]
                          [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                          [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                          [--profile]
                          [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                          [--tokenizer {nltk,space,moses}]
                          [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                          [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                          [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                          [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                          [-s SRC] [-t TARGET] [--trainpref FP]
                          [--validpref FP] [--testpref FP] [--align-suffix FP]
                          [--destdir DIR] [--thresholdtgt N]
                          [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
                          [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
                          [--dataset-impl FORMAT] [--joined-dictionary]
                          [--only-source] [--padding-factor N] [--workers N]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

Default: “mmap”

Preprocessing

-s, --source-lang source language
-t, --target-lang target language
--trainpref train file prefix
--validpref comma-separated valid file prefixes
--testpref comma-separated test file prefixes
--align-suffix alignment file suffix
--destdir

destination dir

Default: “data-bin”

--thresholdtgt

map words appearing less than threshold times to unknown

Default: 0

--thresholdsrc

map words appearing less than threshold times to unknown

Default: 0

--tgtdict reuse given target dictionary
--srcdict reuse given source dictionary
--nwordstgt

number of target words to retain

Default: -1

--nwordssrc

number of source words to retain

Default: -1

--alignfile an alignment file (optional)
--joined-dictionary

Generate joined dictionary

Default: False

--only-source

Only process the source language

Default: False

--padding-factor

Pad dictionary size to be multiple of N

Default: 8

--workers

number of parallel workers

Default: 1
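
For example, a minimal sketch of binarizing a tokenized German-English corpus with a shared vocabulary (the file prefixes are illustrative; --trainpref data/train expects files such as data/train.de and data/train.en):

fairseq-preprocess --source-lang de --target-lang en \
    --trainpref data/train --validpref data/valid --testpref data/test \
    --destdir data-bin/de-en --joined-dictionary --workers 4

This writes the binarized splits and dictionaries into --destdir, ready to be consumed by fairseq-train.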

fairseq-train

Train a new model on one or across multiple GPUs.

usage: fairseq-train [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                     [--log-format {json,none,simple,tqdm}]
                     [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
                     [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16]
                     [--fp16] [--memory-efficient-fp16]
                     [--fp16-no-flatten-grads]
                     [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW]
                     [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                     [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                     [--user-dir USER_DIR]
                     [--empty-cache-freq EMPTY_CACHE_FREQ]
                     [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE]
                     [--checkpoint-suffix CHECKPOINT_SUFFIX]
                     [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                     [--profile]
                     [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                     [--tokenizer {nltk,space,moses}]
                     [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                     [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                     [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                     [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                     [--num-workers NUM_WORKERS]
                     [--skip-invalid-size-inputs-valid-test]
                     [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                     [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                     [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                     [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                     [--data-buffer-size DATA_BUFFER_SIZE]
                     [--train-subset TRAIN_SUBSET]
                     [--valid-subset VALID_SUBSET]
                     [--validate-interval VALIDATE_INTERVAL]
                     [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                     [--validate-after-updates VALIDATE_AFTER_UPDATES]
                     [--fixed-validation-seed FIXED_VALIDATION_SEED]
                     [--disable-validation]
                     [--max-tokens-valid MAX_TOKENS_VALID]
                     [--batch-size-valid BATCH_SIZE_VALID]
                     [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                     [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                     [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                     [--distributed-rank DISTRIBUTED_RANK]
                     [--distributed-backend DISTRIBUTED_BACKEND]
                     [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                     [--distributed-port DISTRIBUTED_PORT]
                     [--device-id DEVICE_ID] [--distributed-no-spawn]
                     [--ddp-backend {c10d,no_c10d}]
                     [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                     [--find-unused-parameters] [--fast-stat-sync]
                     [--broadcast-buffers]
                     [--distributed-wrapper {DDP,SlowMo}]
                     [--slowmo-momentum SLOWMO_MOMENTUM]
                     [--slowmo-algorithm SLOWMO_ALGORITHM]
                     [--localsgd-frequency LOCALSGD_FREQUENCY]
                     [--nprocs-per-node NPROCS_PER_NODE]
                     [--pipeline-model-parallel]
                     [--pipeline-balance PIPELINE_BALANCE]
                     [--pipeline-devices PIPELINE_DEVICES]
                     [--pipeline-chunks PIPELINE_CHUNKS]
                     [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                     [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                     [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                     [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                     [--pipeline-checkpoint {always,never,except_last}]
                     [--zero-sharding {none,os}] [--arch ARCH]
                     [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                     [--stop-time-hours STOP_TIME_HOURS]
                     [--clip-norm CLIP_NORM] [--sentence-avg]
                     [--update-freq UPDATE_FREQ] [--lr LR] [--min-lr MIN_LR]
                     [--use-bmuf] [--save-dir SAVE_DIR]
                     [--restore-file RESTORE_FILE]
                     [--finetune-from-model FINETUNE_FROM_MODEL]
                     [--reset-dataloader] [--reset-lr-scheduler]
                     [--reset-meters] [--reset-optimizer]
                     [--optimizer-overrides OPTIMIZER_OVERRIDES]
                     [--save-interval SAVE_INTERVAL]
                     [--save-interval-updates SAVE_INTERVAL_UPDATES]
                     [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                     [--keep-last-epochs KEEP_LAST_EPOCHS]
                     [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                     [--no-save] [--no-epoch-checkpoints]
                     [--no-last-checkpoints] [--no-save-optimizer-state]
                     [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                     [--maximize-best-checkpoint-metric] [--patience PATIENCE]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed fixed random seed to use for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid batch size of the validation batch (defaults to --batch-size)
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

enable detection of unused parameters in DDP, which avoids errors when some parameters receive no gradient (not applicable to the no_c10d ddp-backend)

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--distributed-wrapper

Possible choices: DDP, SlowMo

which distributed training wrapper to use

Default: “DDP”

--slowmo-momentum SlowMo momentum term; defaults: 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-algorithm

whether to use LocalSGD or SGP

Default: “LocalSGD”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

Model configuration

--arch, -a

Possible choices: roberta, roberta_base, roberta_large, xlm, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_align, transformer_wmt_en_de_big_align, wav2vec, wav2vec2, wav2vec_ctc, wav2vec_seq2seq, transformer_from_pretrained_xlm, s2t_berard, s2t_berard_256_3_3, s2t_berard_512_3_2, s2t_berard_512_5_3, s2t_transformer, s2t_transformer_s, s2t_transformer_sp, s2t_transformer_m, s2t_transformer_mp, s2t_transformer_l, s2t_transformer_lp, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, masked_lm, bert_base, bert_large, xlm_base, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, nonautoregressive_transformer, nonautoregressive_transformer_wmt_en_de, nacrf_transformer, iterative_nonautoregressive_transformer, iterative_nonautoregressive_transformer_wmt_en_de, cmlm_transformer, cmlm_transformer_wmt_en_de, levenshtein_transformer, levenshtein_transformer_wmt_en_de, levenshtein_transformer_vaswani_wmt_en_de_big, levenshtein_transformer_wmt_en_de_big, insertion_transformer, lightconv_lm, lightconv_lm_gbw, fconv_self_att, fconv_self_att_wp, bart_large, bart_base, mbart_large, mbart_base, mbart_base_wmt20, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, multilingual_transformer, multilingual_transformer_iwslt_de_en, hf_gpt2, hf_gpt2_medium, hf_gpt2_large, hf_gpt2_xl, lstm_lm, dummy_model, model_parallel_roberta, model_parallel_roberta_base, model_parallel_roberta_large, transformer_lm_megatron, transformer_lm_megatron_11b, transformer_iwslt_de_en_pipeline_parallel, transformer_wmt_en_de_big_pipeline_parallel

model architecture

optimization

--max-epoch

force stop training at specified epoch

Default: 0

--max-update

force stop training at specified update

Default: 0

--stop-time-hours

force stop training after specified cumulative time (if >0)

Default: 0

--clip-norm

clip threshold of gradients

Default: 0.0

--sentence-avg

normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens)

Default: False

--update-freq

update parameters every N_i batches, when in epoch i (e.g. --update-freq 8 accumulates gradients over 8 batches before each update, simulating training with 8x as many GPUs)

Default: 1

--lr

learning rate for the first N epochs; all epochs > N use LR_N (note: this may be interpreted differently depending on --lr-scheduler)

Default: 0.25

--min-lr

stop training when the learning rate reaches this minimum

Default: -1.0

--use-bmuf

specify global optimizer for syncing models on different GPUs/shards

Default: False

checkpoint

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset
--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”
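
For example, a sketch of lowering the learning rate stored in the optimizer state when restoring a checkpoint (the 'lr' key is an illustrative optimizer argument; the value of --optimizer-overrides is parsed as a Python dictionary literal):

fairseq-train data-bin/de-en --restore-file checkpoints/checkpoint_last.pt \
    --optimizer-overrides "{'lr': 0.0001}"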

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with --save-interval-updates

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--keep-best-checkpoints

keep best N checkpoints based on scores

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

--patience

early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval

Default: -1
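
Putting these together, a minimal sketch of a translation training run (the positional data directory is the output of fairseq-preprocess; the architecture and hyperparameters are illustrative, not recommended values):

fairseq-train data-bin/de-en \
    --arch transformer_iwslt_de_en \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 4096 --max-update 100000 \
    --save-dir checkpoints/de-en

Checkpoints are written to --save-dir according to --save-interval and --save-interval-updates above.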

fairseq-generate

Translate pre-processed data with a trained model.

usage: fairseq-generate [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                        [--log-format {json,none,simple,tqdm}]
                        [--tensorboard-logdir TENSORBOARD_LOGDIR]
                        [--seed SEED] [--cpu] [--tpu] [--bf16]
                        [--memory-efficient-bf16] [--fp16]
                        [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                        [--fp16-init-scale FP16_INIT_SCALE]
                        [--fp16-scale-window FP16_SCALE_WINDOW]
                        [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                        [--min-loss-scale MIN_LOSS_SCALE]
                        [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                        [--user-dir USER_DIR]
                        [--empty-cache-freq EMPTY_CACHE_FREQ]
                        [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                        [--model-parallel-size MODEL_PARALLEL_SIZE]
                        [--checkpoint-suffix CHECKPOINT_SUFFIX]
                        [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                        [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                        [--profile]
                        [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                        [--tokenizer {nltk,space,moses}]
                        [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                        [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                        [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                        [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                        [--num-workers NUM_WORKERS]
                        [--skip-invalid-size-inputs-valid-test]
                        [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                        [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                        [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                        [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                        [--data-buffer-size DATA_BUFFER_SIZE]
                        [--train-subset TRAIN_SUBSET]
                        [--valid-subset VALID_SUBSET]
                        [--validate-interval VALIDATE_INTERVAL]
                        [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                        [--validate-after-updates VALIDATE_AFTER_UPDATES]
                        [--fixed-validation-seed FIXED_VALIDATION_SEED]
                        [--disable-validation]
                        [--max-tokens-valid MAX_TOKENS_VALID]
                        [--batch-size-valid BATCH_SIZE_VALID]
                        [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                        [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                        [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                        [--distributed-rank DISTRIBUTED_RANK]
                        [--distributed-backend DISTRIBUTED_BACKEND]
                        [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                        [--distributed-port DISTRIBUTED_PORT]
                        [--device-id DEVICE_ID] [--distributed-no-spawn]
                        [--ddp-backend {c10d,no_c10d}]
                        [--bucket-cap-mb BUCKET_CAP_MB]
                        [--fix-batches-to-gpus] [--find-unused-parameters]
                        [--fast-stat-sync] [--broadcast-buffers]
                        [--distributed-wrapper {DDP,SlowMo}]
                        [--slowmo-momentum SLOWMO_MOMENTUM]
                        [--slowmo-algorithm SLOWMO_ALGORITHM]
                        [--localsgd-frequency LOCALSGD_FREQUENCY]
                        [--nprocs-per-node NPROCS_PER_NODE]
                        [--pipeline-model-parallel]
                        [--pipeline-balance PIPELINE_BALANCE]
                        [--pipeline-devices PIPELINE_DEVICES]
                        [--pipeline-chunks PIPELINE_CHUNKS]
                        [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                        [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                        [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                        [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                        [--pipeline-checkpoint {always,never,except_last}]
                        [--zero-sharding {none,os}] [--path PATH]
                        [--remove-bpe [REMOVE_BPE]] [--quiet]
                        [--model-overrides MODEL_OVERRIDES]
                        [--results-path RESULTS_PATH] [--beam N] [--nbest N]
                        [--max-len-a N] [--max-len-b N] [--min-len N]
                        [--match-source-len] [--no-early-stop]
                        [--unnormalized] [--no-beamable-mm] [--lenpen LENPEN]
                        [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                        [--sacrebleu] [--score-reference] [--prefix-size PS]
                        [--no-repeat-ngram-size N] [--sampling]
                        [--sampling-topk PS] [--sampling-topp PS]
                        [--constraints [{ordered,unordered}]]
                        [--temperature N] [--diverse-beam-groups N]
                        [--diverse-beam-strength N] [--diversity-rate N]
                        [--print-alignment] [--print-step] [--lm-path PATH]
                        [--lm-weight N] [--iter-decode-eos-penalty N]
                        [--iter-decode-max-iter N]
                        [--iter-decode-force-max-iter]
                        [--iter-decode-with-beam N]
                        [--iter-decode-with-external-reranker]
                        [--retain-iter-history] [--retain-dropout]
                        [--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]]
                        [--decoding-format {unigram,ensemble,vote,dp,bs}]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed fixed random seed to use for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid batch size of the validation batch (defaults to --batch-size)
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

enable detection of unused parameters in DDP, which avoids errors when some parameters receive no gradient (not applicable to the no_c10d ddp-backend)

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--distributed-wrapper

Possible choices: DDP, SlowMo

which distributed training wrapper to use

Default: “DDP”

--slowmo-momentum SlowMo momentum term; defaults: 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-algorithm

whether to use LocalSGD or SGP

Default: “LocalSGD”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

Generation

--path path(s) to model file(s), colon separated
--remove-bpe remove BPE tokens before scoring (can be set to sentencepiece)
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)
--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200
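
For example, with --max-len-a 1.5 and --max-len-b 10, a source sentence of 20 tokens permits generations of up to 1.5 * 20 + 10 = 40 tokens.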

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--no-early-stop

deprecated

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)
--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking: prevent ngrams of this size from being repeated in the generation

Default: 0

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0

--constraints

Possible choices: ordered, unordered

enables lexically constrained decoding

--temperature

temperature for generation

Default: 1.0

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--diversity-rate

strength of diversity penalty for Diverse Siblings Search

Default: -1.0

--print-alignment

if set, uses attention feedback to compute and print alignment to source tokens

Default: False

--print-step Default: False
--lm-path path to lm checkpoint for lm fusion
--lm-weight

weight for lm probs for lm fusion

Default: 0.0

--iter-decode-eos-penalty

if > 0.0, penalizes early stopping in decoding.

Default: 0.0

--iter-decode-max-iter

maximum iterations for iterative refinement.

Default: 10

--iter-decode-force-max-iter

if set, run exactly the maximum number of iterations without early stopping

Default: False

--iter-decode-with-beam

if > 1, the model will generate multiple translations of varying lengths.

Default: 1

--iter-decode-with-external-reranker

if set, the last checkpoint is assumed to be a reranker used to rescore the translations

Default: False

--retain-iter-history

if set, decoding returns the whole history of iterative refinement

Default: False

--retain-dropout

Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules
--decoding-format Possible choices: unigram, ensemble, vote, dp, bs
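
For example, a sketch of translating the binarized test set with a trained checkpoint (paths are illustrative; the positional data directory is the output of fairseq-preprocess):

fairseq-generate data-bin/de-en \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset test --beam 5 --lenpen 1.2 --remove-bpe \
    --batch-size 128 --results-path results/de-en

To decode with an ensemble, pass several checkpoints to --path, colon separated.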

fairseq-interactive

Translate raw text with a trained model. Batches data on-the-fly.

usage: fairseq-interactive [-h] [--no-progress-bar]
                           [--log-interval LOG_INTERVAL]
                           [--log-format {json,none,simple,tqdm}]
                           [--tensorboard-logdir TENSORBOARD_LOGDIR]
                           [--seed SEED] [--cpu] [--tpu] [--bf16]
                           [--memory-efficient-bf16] [--fp16]
                           [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                           [--fp16-init-scale FP16_INIT_SCALE]
                           [--fp16-scale-window FP16_SCALE_WINDOW]
                           [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                           [--min-loss-scale MIN_LOSS_SCALE]
                           [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                           [--user-dir USER_DIR]
                           [--empty-cache-freq EMPTY_CACHE_FREQ]
                           [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                           [--model-parallel-size MODEL_PARALLEL_SIZE]
                           [--checkpoint-suffix CHECKPOINT_SUFFIX]
                           [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                           [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                           [--profile]
                           [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                           [--tokenizer {nltk,space,moses}]
                           [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                           [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                           [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                           [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                           [--num-workers NUM_WORKERS]
                           [--skip-invalid-size-inputs-valid-test]
                           [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                           [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                           [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                           [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                           [--data-buffer-size DATA_BUFFER_SIZE]
                           [--train-subset TRAIN_SUBSET]
                           [--valid-subset VALID_SUBSET]
                           [--validate-interval VALIDATE_INTERVAL]
                           [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                           [--validate-after-updates VALIDATE_AFTER_UPDATES]
                           [--fixed-validation-seed FIXED_VALIDATION_SEED]
                           [--disable-validation]
                           [--max-tokens-valid MAX_TOKENS_VALID]
                           [--batch-size-valid BATCH_SIZE_VALID]
                           [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                           [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                           [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                           [--distributed-rank DISTRIBUTED_RANK]
                           [--distributed-backend DISTRIBUTED_BACKEND]
                           [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                           [--distributed-port DISTRIBUTED_PORT]
                           [--device-id DEVICE_ID] [--distributed-no-spawn]
                           [--ddp-backend {c10d,no_c10d}]
                           [--bucket-cap-mb BUCKET_CAP_MB]
                           [--fix-batches-to-gpus] [--find-unused-parameters]
                           [--fast-stat-sync] [--broadcast-buffers]
                           [--distributed-wrapper {DDP,SlowMo}]
                           [--slowmo-momentum SLOWMO_MOMENTUM]
                           [--slowmo-algorithm SLOWMO_ALGORITHM]
                           [--localsgd-frequency LOCALSGD_FREQUENCY]
                           [--nprocs-per-node NPROCS_PER_NODE]
                           [--pipeline-model-parallel]
                           [--pipeline-balance PIPELINE_BALANCE]
                           [--pipeline-devices PIPELINE_DEVICES]
                           [--pipeline-chunks PIPELINE_CHUNKS]
                           [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                           [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                           [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                           [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                           [--pipeline-checkpoint {always,never,except_last}]
                           [--zero-sharding {none,os}] [--path PATH]
                           [--remove-bpe [REMOVE_BPE]] [--quiet]
                           [--model-overrides MODEL_OVERRIDES]
                           [--results-path RESULTS_PATH] [--beam N]
                           [--nbest N] [--max-len-a N] [--max-len-b N]
                           [--min-len N] [--match-source-len]
                           [--no-early-stop] [--unnormalized]
                           [--no-beamable-mm] [--lenpen LENPEN]
                           [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                           [--sacrebleu] [--score-reference]
                           [--prefix-size PS] [--no-repeat-ngram-size N]
                           [--sampling] [--sampling-topk PS]
                           [--sampling-topp PS]
                           [--constraints [{ordered,unordered}]]
                           [--temperature N] [--diverse-beam-groups N]
                           [--diverse-beam-strength N] [--diversity-rate N]
                           [--print-alignment] [--print-step] [--lm-path PATH]
                           [--lm-weight N] [--iter-decode-eos-penalty N]
                           [--iter-decode-max-iter N]
                           [--iter-decode-force-max-iter]
                           [--iter-decode-with-beam N]
                           [--iter-decode-with-external-reranker]
                           [--retain-iter-history] [--retain-dropout]
                           [--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]]
                           [--decoding-format {unigram,ensemble,vote,dp,bs}]
                           [--buffer-size N] [--input FILE]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed fixed random seed to use for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid batch size of the validation batch (defaults to --batch-size)
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

enable unused parameter detection (not applicable to the no_c10d ddp-backend)

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--distributed-wrapper

Possible choices: DDP, SlowMo

distributed wrapper to use

Default: “DDP”

--slowmo-momentum SlowMo momentum term; defaults: 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-algorithm

whether to use LocalSGD or SGP

Default: “LocalSGD”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”
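
As a hedged sketch (layer counts and device ids below are hypothetical), a 12-layer model split evenly across two GPUs must satisfy sum(pipeline_balance) = 12 and len(pipeline_devices) = len(pipeline_balance):

    # hypothetical 12-layer model on two GPUs; values are illustrative
    --pipeline-model-parallel \
    --pipeline-balance '[6,6]' \
    --pipeline-devices '[0,1]' \
    --pipeline-chunks 4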

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

Generation

--path path(s) to model file(s), colon separated
--remove-bpe remove BPE tokens before scoring (can be set to sentencepiece)
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override, at generation time, model args that were set during training

Default: “{}”
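
For example, to override a single training-time argument at generation (the key below is hypothetical; any arg recorded with the model can be overridden):

    --model-overrides "{'max_source_positions': 2048}"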

--results-path path to save eval results (optional)
--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200
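
For example, with --max-len-a 0.5 and --max-len-b 10, a 20-token source caps generation at 0.5 * 20 + 10 = 20 tokens; with the defaults (a = 0, b = 200), every generation is capped at 200 tokens regardless of source length.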

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--no-early-stop

deprecated

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)
--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking: prevent ngrams of this size from repeating in the generation (0 disables blocking)

Default: 0

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0
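
A hedged sketch of top-k sampled generation (dataset and checkpoint paths are placeholders), drawing a single sample per input:

    fairseq-interactive data-bin/wmt17_en_de \
        --path checkpoints/checkpoint_best.pt \
        --beam 1 --nbest 1 \
        --sampling --sampling-topk 10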

--constraints

Possible choices: ordered, unordered

enables lexically constrained decoding
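
A hedged sketch of constrained decoding (paths are placeholders): constraint phrases are supplied on the input line itself, tab-separated after the source sentence:

    echo -e "Die Katze schläft .\tthe cat" | \
        fairseq-interactive data-bin/wmt17_de_en \
        --path checkpoints/checkpoint_best.pt \
        --beam 5 --constraints ordered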

--temperature

temperature for generation

Default: 1.0

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--diversity-rate

strength of diversity penalty for Diverse Siblings Search

Default: -1.0

--print-alignment

if set, uses attention feedback to compute and print alignment to source tokens

Default: False

--print-step Default: False
--lm-path path to lm checkpoint for lm fusion
--lm-weight

weight for lm probs for lm fusion

Default: 0.0

--iter-decode-eos-penalty

if > 0.0, penalizes early stopping in decoding.

Default: 0.0

--iter-decode-max-iter

maximum iterations for iterative refinement.

Default: 10

--iter-decode-force-max-iter

if set, run exactly the maximum number of iterations without early stopping

Default: False

--iter-decode-with-beam

if > 1, the model will generate translations of varying lengths.

Default: 1

--iter-decode-with-external-reranker

if set, the last checkpoint is assumed to be a reranker and is used to rescore the translations

Default: False

--retain-iter-history

if set, decoding returns the whole history of iterative refinement

Default: False

--retain-dropout

Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules
--decoding-format Possible choices: unigram, ensemble, vote, dp, bs

Interactive

--buffer-size

read this many sentences into a buffer before processing them

Default: 0

--input

file to read from; use - for stdin

Default: “-”
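
Combining the Generation and Interactive options above, a minimal sketch (dataset, checkpoint, and input paths are hypothetical) that translates a buffered file instead of stdin:

    fairseq-interactive data-bin/wmt17_en_de \
        --path checkpoints/checkpoint_best.pt \
        --beam 5 --remove-bpe \
        --buffer-size 64 --input source.bpe.en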

fairseq-score

BLEU scoring of generated translations against reference translations.

Command-line script for BLEU scoring.

usage: fairseq-score [-h] [-s SYS] -r REF [-o N] [--ignore-case] [--sacrebleu]
                     [--sentence-bleu]

Named Arguments

-s, --sys

system output

Default: “-”

-r, --ref references
-o, --order

consider ngrams up to this order

Default: 4

--ignore-case

case-insensitive scoring

Default: False

--sacrebleu

score with sacrebleu

Default: False

--sentence-bleu

report sentence-level BLEUs (i.e., with +1 smoothing)

Default: False
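
For example, to score a system output against a reference translation (filenames are hypothetical):

    fairseq-score --sys gen.out.sys --ref gen.out.ref

Since --sys defaults to “-”, the system output can instead be piped in on stdin.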

fairseq-eval-lm

Evaluate the perplexity of a trained language model.

usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                       [--log-format {json,none,simple,tqdm}]
                       [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
                       [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16]
                       [--fp16] [--memory-efficient-fp16]
                       [--fp16-no-flatten-grads]
                       [--fp16-init-scale FP16_INIT_SCALE]
                       [--fp16-scale-window FP16_SCALE_WINDOW]
                       [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                       [--min-loss-scale MIN_LOSS_SCALE]
                       [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                       [--user-dir USER_DIR]
                       [--empty-cache-freq EMPTY_CACHE_FREQ]
                       [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                       [--model-parallel-size MODEL_PARALLEL_SIZE]
                       [--checkpoint-suffix CHECKPOINT_SUFFIX]
                       [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                       [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                       [--profile]
                       [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                       [--tokenizer {nltk,space,moses}]
                       [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                       [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                       [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                       [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                       [--num-workers NUM_WORKERS]
                       [--skip-invalid-size-inputs-valid-test]
                       [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                       [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                       [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                       [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                       [--data-buffer-size DATA_BUFFER_SIZE]
                       [--train-subset TRAIN_SUBSET]
                       [--valid-subset VALID_SUBSET]
                       [--validate-interval VALIDATE_INTERVAL]
                       [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                       [--validate-after-updates VALIDATE_AFTER_UPDATES]
                       [--fixed-validation-seed FIXED_VALIDATION_SEED]
                       [--disable-validation]
                       [--max-tokens-valid MAX_TOKENS_VALID]
                       [--batch-size-valid BATCH_SIZE_VALID]
                       [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                       [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                       [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                       [--distributed-rank DISTRIBUTED_RANK]
                       [--distributed-backend DISTRIBUTED_BACKEND]
                       [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                       [--distributed-port DISTRIBUTED_PORT]
                       [--device-id DEVICE_ID] [--distributed-no-spawn]
                       [--ddp-backend {c10d,no_c10d}]
                       [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                       [--find-unused-parameters] [--fast-stat-sync]
                       [--broadcast-buffers]
                       [--distributed-wrapper {DDP,SlowMo}]
                       [--slowmo-momentum SLOWMO_MOMENTUM]
                       [--slowmo-algorithm SLOWMO_ALGORITHM]
                       [--localsgd-frequency LOCALSGD_FREQUENCY]
                       [--nprocs-per-node NPROCS_PER_NODE]
                       [--pipeline-model-parallel]
                       [--pipeline-balance PIPELINE_BALANCE]
                       [--pipeline-devices PIPELINE_DEVICES]
                       [--pipeline-chunks PIPELINE_CHUNKS]
                       [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                       [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                       [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                       [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                       [--pipeline-checkpoint {always,never,except_last}]
                       [--zero-sharding {none,os}] [--path PATH]
                       [--remove-bpe [REMOVE_BPE]] [--quiet]
                       [--model-overrides MODEL_OVERRIDES]
                       [--results-path RESULTS_PATH] [--output-word-probs]
                       [--output-word-stats] [--context-window CONTEXT_WINDOW]
                       [--softmax-batch SOFTMAX_BATCH]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “language_modeling”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in a batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don't validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid batch size of the validation batch (defaults to --batch-size)
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

enable unused parameter detection (not applicable to the no_c10d ddp-backend)

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--distributed-wrapper

Possible choices: DDP, SlowMo

distributed wrapper to use

Default: “DDP”

--slowmo-momentum SlowMo momentum term; defaults: 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-algorithm

whether to use LocalSGD or SGP

Default: “LocalSGD”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

LM Evaluation

--path path(s) to model file(s), colon separated
--remove-bpe remove BPE tokens before scoring (can be set to sentencepiece)
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override, at generation time, model args that were set during training

Default: “{}”

--results-path path to save eval results (optional)
--output-word-probs

if set, outputs words and their predicted log probabilities to standard output

Default: False

--output-word-stats

if set, outputs word statistics such as word count, average probability, etc

Default: False

--context-window

ensures that every evaluated token has access to a context of at least this size, if possible

Default: 0

--softmax-batch

if B x T exceeds this value, batch the softmax over the vocabulary into chunks of at most this many tokens, in order to fit into GPU memory

Default: 9223372036854775807
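
A minimal sketch tying these options together (dataset and checkpoint paths are hypothetical):

    fairseq-eval-lm data-bin/wikitext-103 \
        --path checkpoints/lm/checkpoint_best.pt \
        --batch-size 2 \
        --context-window 2048 \
        --output-word-probs

Here --context-window scores each token with additional preceding context where possible, and --output-word-probs prints per-word log probabilities alongside the reported perplexity.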