Command-line Tools

Fairseq provides several command-line tools for training and evaluating models:

fairseq-preprocess

Data pre-processing: build vocabularies and binarize training data.

usage: fairseq-preprocess [-h] [--no-progress-bar]
                          [--log-interval LOG_INTERVAL]
                          [--log-format {json,none,simple,tqdm}]
                          [--tensorboard-logdir TENSORBOARD_LOGDIR]
                          [--seed SEED] [--cpu] [--tpu] [--bf16]
                          [--memory-efficient-bf16] [--fp16]
                          [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                          [--fp16-init-scale FP16_INIT_SCALE]
                          [--fp16-scale-window FP16_SCALE_WINDOW]
                          [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                          [--min-loss-scale MIN_LOSS_SCALE]
                          [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                          [--user-dir USER_DIR]
                          [--empty-cache-freq EMPTY_CACHE_FREQ]
                          [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                          [--model-parallel-size MODEL_PARALLEL_SIZE]
                          [--checkpoint-suffix CHECKPOINT_SUFFIX]
                          [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                          [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                          [--profile]
                          [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                          [--tokenizer {nltk,space,moses}]
                          [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                          [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                          [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                          [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                          [-s SRC] [-t TARGET] [--trainpref FP]
                          [--validpref FP] [--testpref FP] [--align-suffix FP]
                          [--destdir DIR] [--thresholdtgt N]
                          [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
                          [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
                          [--dataset-impl FORMAT] [--joined-dictionary]
                          [--only-source] [--padding-factor N] [--workers N]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

Default: “mmap”

Preprocessing

-s, --source-lang source language
-t, --target-lang target language
--trainpref train file prefix
--validpref comma-separated valid file prefixes
--testpref comma-separated test file prefixes
--align-suffix alignment file suffix
--destdir

destination dir

Default: “data-bin”

--thresholdtgt

map words appearing less than threshold times to unknown

Default: 0

--thresholdsrc

map words appearing less than threshold times to unknown

Default: 0

--tgtdict reuse given target dictionary
--srcdict reuse given source dictionary
--nwordstgt

number of target words to retain

Default: -1

--nwordssrc

number of source words to retain

Default: -1

--alignfile an alignment file (optional)
--joined-dictionary

Generate joined dictionary

Default: False

--only-source

Only process the source language

Default: False

--padding-factor

Pad dictionary size to be multiple of N

Default: 8

--workers

number of parallel workers

Default: 1
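
For example, a minimal sketch of binarizing a tokenized German-English corpus with a shared vocabulary (the file prefixes are illustrative; --trainpref data/train expects files such as data/train.de and data/train.en):

fairseq-preprocess --source-lang de --target-lang en \
    --trainpref data/train --validpref data/valid --testpref data/test \
    --destdir data-bin/de-en --joined-dictionary --workers 4

This writes the binarized splits and dictionaries into --destdir, ready to be consumed by fairseq-train.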

fairseq-train

Train a new model on one or across multiple GPUs.

usage: fairseq-train [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                     [--log-format {json,none,simple,tqdm}]
                     [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
                     [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16]
                     [--fp16] [--memory-efficient-fp16]
                     [--fp16-no-flatten-grads]
                     [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW]
                     [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                     [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                     [--user-dir USER_DIR]
                     [--empty-cache-freq EMPTY_CACHE_FREQ]
                     [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE]
                     [--checkpoint-suffix CHECKPOINT_SUFFIX]
                     [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                     [--profile]
                     [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                     [--tokenizer {nltk,space,moses}]
                     [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                     [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                     [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                     [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                     [--num-workers NUM_WORKERS]
                     [--skip-invalid-size-inputs-valid-test]
                     [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                     [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                     [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                     [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                     [--data-buffer-size DATA_BUFFER_SIZE]
                     [--train-subset TRAIN_SUBSET]
                     [--valid-subset VALID_SUBSET]
                     [--validate-interval VALIDATE_INTERVAL]
                     [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                     [--validate-after-updates VALIDATE_AFTER_UPDATES]
                     [--fixed-validation-seed FIXED_VALIDATION_SEED]
                     [--disable-validation]
                     [--max-tokens-valid MAX_TOKENS_VALID]
                     [--batch-size-valid BATCH_SIZE_VALID]
                     [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                     [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                     [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                     [--distributed-rank DISTRIBUTED_RANK]
                     [--distributed-backend DISTRIBUTED_BACKEND]
                     [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                     [--distributed-port DISTRIBUTED_PORT]
                     [--device-id DEVICE_ID] [--distributed-no-spawn]
                     [--ddp-backend {c10d,no_c10d}]
                     [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                     [--find-unused-parameters] [--fast-stat-sync]
                     [--broadcast-buffers]
                     [--distributed-wrapper {DDP,SlowMo}]
                     [--slowmo-momentum SLOWMO_MOMENTUM]
                     [--slowmo-algorithm SLOWMO_ALGORITHM]
                     [--localsgd-frequency LOCALSGD_FREQUENCY]
                     [--nprocs-per-node NPROCS_PER_NODE]
                     [--pipeline-model-parallel]
                     [--pipeline-balance PIPELINE_BALANCE]
                     [--pipeline-devices PIPELINE_DEVICES]
                     [--pipeline-chunks PIPELINE_CHUNKS]
                     [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                     [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                     [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                     [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                     [--pipeline-checkpoint {always,never,except_last}]
                     [--zero-sharding {none,os}] [--arch ARCH]
                     [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                     [--stop-time-hours STOP_TIME_HOURS]
                     [--clip-norm CLIP_NORM] [--sentence-avg]
                     [--update-freq UPDATE_FREQ] [--lr LR] [--min-lr MIN_LR]
                     [--use-bmuf] [--save-dir SAVE_DIR]
                     [--restore-file RESTORE_FILE]
                     [--finetune-from-model FINETUNE_FROM_MODEL]
                     [--reset-dataloader] [--reset-lr-scheduler]
                     [--reset-meters] [--reset-optimizer]
                     [--optimizer-overrides OPTIMIZER_OVERRIDES]
                     [--save-interval SAVE_INTERVAL]
                     [--save-interval-updates SAVE_INTERVAL_UPDATES]
                     [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                     [--keep-last-epochs KEEP_LAST_EPOCHS]
                     [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
                     [--no-save] [--no-epoch-checkpoints]
                     [--no-last-checkpoints] [--no-save-optimizer-state]
                     [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                     [--maximize-best-checkpoint-metric] [--patience PATIENCE]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed fixed random seed to use for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid batch size of the validation batch (defaults to --batch-size)
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

enable detection of unused parameters in DDP, which avoids errors when some parameters receive no gradient (not applicable to the no_c10d ddp-backend)

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--distributed-wrapper

Possible choices: DDP, SlowMo

which distributed training wrapper to use

Default: “DDP”

--slowmo-momentum SlowMo momentum term; defaults: 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-algorithm

whether to use LocalSGD or SGP

Default: “LocalSGD”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

Model configuration

--arch, -a

Possible choices: roberta, roberta_base, roberta_large, xlm, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_align, transformer_wmt_en_de_big_align, wav2vec, wav2vec2, wav2vec_ctc, wav2vec_seq2seq, transformer_from_pretrained_xlm, s2t_berard, s2t_berard_256_3_3, s2t_berard_512_3_2, s2t_berard_512_5_3, s2t_transformer, s2t_transformer_s, s2t_transformer_sp, s2t_transformer_m, s2t_transformer_mp, s2t_transformer_l, s2t_transformer_lp, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, masked_lm, bert_base, bert_large, xlm_base, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, nonautoregressive_transformer, nonautoregressive_transformer_wmt_en_de, nacrf_transformer, iterative_nonautoregressive_transformer, iterative_nonautoregressive_transformer_wmt_en_de, cmlm_transformer, cmlm_transformer_wmt_en_de, levenshtein_transformer, levenshtein_transformer_wmt_en_de, levenshtein_transformer_vaswani_wmt_en_de_big, levenshtein_transformer_wmt_en_de_big, insertion_transformer, lightconv_lm, lightconv_lm_gbw, fconv_self_att, fconv_self_att_wp, bart_large, bart_base, mbart_large, mbart_base, mbart_base_wmt20, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, multilingual_transformer, multilingual_transformer_iwslt_de_en, hf_gpt2, hf_gpt2_medium, hf_gpt2_large, hf_gpt2_xl, lstm_lm, dummy_model, model_parallel_roberta, model_parallel_roberta_base, model_parallel_roberta_large, transformer_lm_megatron, transformer_lm_megatron_11b, transformer_iwslt_de_en_pipeline_parallel, transformer_wmt_en_de_big_pipeline_parallel

model architecture

optimization

--max-epoch

force stop training at specified epoch

Default: 0

--max-update

force stop training at specified update

Default: 0

--stop-time-hours

force stop training after specified cumulative time (if >0)

Default: 0

--clip-norm

clip threshold of gradients

Default: 0.0

--sentence-avg

normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens)

Default: False

--update-freq

update parameters every N_i batches, when in epoch i (e.g. --update-freq 8 accumulates gradients over 8 batches before each update, simulating training with 8x as many GPUs)

Default: 1

--lr

learning rate for the first N epochs; all epochs > N use LR_N (note: this may be interpreted differently depending on --lr-scheduler)

Default: 0.25

--min-lr

stop training when the learning rate reaches this minimum

Default: -1.0

--use-bmuf

specify global optimizer for syncing models on different GPUs/shards

Default: False

checkpoint

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--finetune-from-model finetune from a pretrained model; note that meters and lr scheduler will be reset
--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”
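
For example, a sketch of lowering the learning rate stored in the optimizer state when restoring a checkpoint (the 'lr' key is an illustrative optimizer argument; the value of --optimizer-overrides is parsed as a Python dictionary literal):

fairseq-train data-bin/de-en --restore-file checkpoints/checkpoint_last.pt \
    --optimizer-overrides "{'lr': 0.0001}"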

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with --save-interval-updates

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--keep-best-checkpoints

keep best N checkpoints based on scores

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

--patience

early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval

Default: -1
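
Putting these together, a minimal sketch of a translation training run (the positional data directory is the output of fairseq-preprocess; the architecture and hyperparameters are illustrative, not recommended values):

fairseq-train data-bin/de-en \
    --arch transformer_iwslt_de_en \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 4096 --max-update 100000 \
    --save-dir checkpoints/de-en

Checkpoints are written to --save-dir according to --save-interval and --save-interval-updates above.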

fairseq-generate

Translate pre-processed data with a trained model.

usage: fairseq-generate [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                        [--log-format {json,none,simple,tqdm}]
                        [--tensorboard-logdir TENSORBOARD_LOGDIR]
                        [--seed SEED] [--cpu] [--tpu] [--bf16]
                        [--memory-efficient-bf16] [--fp16]
                        [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                        [--fp16-init-scale FP16_INIT_SCALE]
                        [--fp16-scale-window FP16_SCALE_WINDOW]
                        [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                        [--min-loss-scale MIN_LOSS_SCALE]
                        [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                        [--user-dir USER_DIR]
                        [--empty-cache-freq EMPTY_CACHE_FREQ]
                        [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                        [--model-parallel-size MODEL_PARALLEL_SIZE]
                        [--checkpoint-suffix CHECKPOINT_SUFFIX]
                        [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                        [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                        [--profile]
                        [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                        [--tokenizer {nltk,space,moses}]
                        [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                        [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                        [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                        [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                        [--num-workers NUM_WORKERS]
                        [--skip-invalid-size-inputs-valid-test]
                        [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                        [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                        [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                        [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                        [--data-buffer-size DATA_BUFFER_SIZE]
                        [--train-subset TRAIN_SUBSET]
                        [--valid-subset VALID_SUBSET]
                        [--validate-interval VALIDATE_INTERVAL]
                        [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                        [--validate-after-updates VALIDATE_AFTER_UPDATES]
                        [--fixed-validation-seed FIXED_VALIDATION_SEED]
                        [--disable-validation]
                        [--max-tokens-valid MAX_TOKENS_VALID]
                        [--batch-size-valid BATCH_SIZE_VALID]
                        [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                        [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                        [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                        [--distributed-rank DISTRIBUTED_RANK]
                        [--distributed-backend DISTRIBUTED_BACKEND]
                        [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                        [--distributed-port DISTRIBUTED_PORT]
                        [--device-id DEVICE_ID] [--distributed-no-spawn]
                        [--ddp-backend {c10d,no_c10d}]
                        [--bucket-cap-mb BUCKET_CAP_MB]
                        [--fix-batches-to-gpus] [--find-unused-parameters]
                        [--fast-stat-sync] [--broadcast-buffers]
                        [--distributed-wrapper {DDP,SlowMo}]
                        [--slowmo-momentum SLOWMO_MOMENTUM]
                        [--slowmo-algorithm SLOWMO_ALGORITHM]
                        [--localsgd-frequency LOCALSGD_FREQUENCY]
                        [--nprocs-per-node NPROCS_PER_NODE]
                        [--pipeline-model-parallel]
                        [--pipeline-balance PIPELINE_BALANCE]
                        [--pipeline-devices PIPELINE_DEVICES]
                        [--pipeline-chunks PIPELINE_CHUNKS]
                        [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                        [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                        [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                        [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                        [--pipeline-checkpoint {always,never,except_last}]
                        [--zero-sharding {none,os}] [--path PATH]
                        [--remove-bpe [REMOVE_BPE]] [--quiet]
                        [--model-overrides MODEL_OVERRIDES]
                        [--results-path RESULTS_PATH] [--beam N] [--nbest N]
                        [--max-len-a N] [--max-len-b N] [--min-len N]
                        [--match-source-len] [--no-early-stop]
                        [--unnormalized] [--no-beamable-mm] [--lenpen LENPEN]
                        [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                        [--sacrebleu] [--score-reference] [--prefix-size PS]
                        [--no-repeat-ngram-size N] [--sampling]
                        [--sampling-topk PS] [--sampling-topp PS]
                        [--constraints [{ordered,unordered}]]
                        [--temperature N] [--diverse-beam-groups N]
                        [--diverse-beam-strength N] [--diversity-rate N]
                        [--print-alignment] [--print-step] [--lm-path PATH]
                        [--lm-weight N] [--iter-decode-eos-penalty N]
                        [--iter-decode-max-iter N]
                        [--iter-decode-force-max-iter]
                        [--iter-decode-with-beam N]
                        [--iter-decode-with-external-reranker]
                        [--retain-iter-history] [--retain-dropout]
                        [--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]]
                        [--decoding-format {unigram,ensemble,vote,dp,bs}]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed fixed random seed to use for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid batch size of the validation batch (defaults to --batch-size)
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

enable detection of unused parameters in DDP, which avoids errors when some parameters receive no gradient (not applicable to the no_c10d ddp-backend)

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--distributed-wrapper

Possible choices: DDP, SlowMo

which distributed training wrapper to use

Default: “DDP”

--slowmo-momentum SlowMo momentum term; defaults: 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-algorithm

whether to use LocalSGD or SGP

Default: “LocalSGD”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

Generation

--path path(s) to model file(s), colon separated
--remove-bpe remove BPE tokens before scoring (can be set to sentencepiece)
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path path to save eval results (optional)
--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200
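
For example, with --max-len-a 1.5 and --max-len-b 10, a source sentence of 20 tokens permits generations of up to 1.5 * 20 + 10 = 40 tokens.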

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--no-early-stop

deprecated

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)
--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking: prevent ngrams of this size from being repeated in the generation

Default: 0

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0

--constraints

Possible choices: ordered, unordered

enables lexically constrained decoding

--temperature

temperature for generation

Default: 1.0

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--diversity-rate

strength of diversity penalty for Diverse Siblings Search

Default: -1.0

--print-alignment

if set, uses attention feedback to compute and print alignment to source tokens

Default: False

--print-step Default: False
--lm-path path to lm checkpoint for lm fusion
--lm-weight

weight for lm probs for lm fusion

Default: 0.0

--iter-decode-eos-penalty

if > 0.0, penalizes early stopping in decoding.

Default: 0.0

--iter-decode-max-iter

maximum iterations for iterative refinement.

Default: 10

--iter-decode-force-max-iter

if set, run exactly the maximum number of iterations without early stopping

Default: False

--iter-decode-with-beam

if > 1, the model will generate multiple translations of varying lengths.

Default: 1

--iter-decode-with-external-reranker

if set, the last checkpoint is assumed to be a reranker used to rescore the translations

Default: False

--retain-iter-history

if set, decoding returns the whole history of iterative refinement

Default: False

--retain-dropout

Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules
--decoding-format Possible choices: unigram, ensemble, vote, dp, bs
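
For example, a sketch of translating the binarized test set with a trained checkpoint (paths are illustrative; the positional data directory is the output of fairseq-preprocess):

fairseq-generate data-bin/de-en \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset test --beam 5 --lenpen 1.2 --remove-bpe \
    --batch-size 128 --results-path results/de-en

To decode with an ensemble, pass several checkpoints to --path, colon separated.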

fairseq-interactive

Translate raw text with a trained model. Batches data on-the-fly.

usage: fairseq-interactive [-h] [--no-progress-bar]
                           [--log-interval LOG_INTERVAL]
                           [--log-format {json,none,simple,tqdm}]
                           [--tensorboard-logdir TENSORBOARD_LOGDIR]
                           [--seed SEED] [--cpu] [--tpu] [--bf16]
                           [--memory-efficient-bf16] [--fp16]
                           [--memory-efficient-fp16] [--fp16-no-flatten-grads]
                           [--fp16-init-scale FP16_INIT_SCALE]
                           [--fp16-scale-window FP16_SCALE_WINDOW]
                           [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                           [--min-loss-scale MIN_LOSS_SCALE]
                           [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                           [--user-dir USER_DIR]
                           [--empty-cache-freq EMPTY_CACHE_FREQ]
                           [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                           [--model-parallel-size MODEL_PARALLEL_SIZE]
                           [--checkpoint-suffix CHECKPOINT_SUFFIX]
                           [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                           [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                           [--profile]
                           [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                           [--tokenizer {nltk,space,moses}]
                           [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                           [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                           [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                           [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                           [--num-workers NUM_WORKERS]
                           [--skip-invalid-size-inputs-valid-test]
                           [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                           [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                           [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                           [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                           [--data-buffer-size DATA_BUFFER_SIZE]
                           [--train-subset TRAIN_SUBSET]
                           [--valid-subset VALID_SUBSET]
                           [--validate-interval VALIDATE_INTERVAL]
                           [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                           [--validate-after-updates VALIDATE_AFTER_UPDATES]
                           [--fixed-validation-seed FIXED_VALIDATION_SEED]
                           [--disable-validation]
                           [--max-tokens-valid MAX_TOKENS_VALID]
                           [--batch-size-valid BATCH_SIZE_VALID]
                           [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                           [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                           [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                           [--distributed-rank DISTRIBUTED_RANK]
                           [--distributed-backend DISTRIBUTED_BACKEND]
                           [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                           [--distributed-port DISTRIBUTED_PORT]
                           [--device-id DEVICE_ID] [--distributed-no-spawn]
                           [--ddp-backend {c10d,no_c10d}]
                           [--bucket-cap-mb BUCKET_CAP_MB]
                           [--fix-batches-to-gpus] [--find-unused-parameters]
                           [--fast-stat-sync] [--broadcast-buffers]
                           [--distributed-wrapper {DDP,SlowMo}]
                           [--slowmo-momentum SLOWMO_MOMENTUM]
                           [--slowmo-algorithm SLOWMO_ALGORITHM]
                           [--localsgd-frequency LOCALSGD_FREQUENCY]
                           [--nprocs-per-node NPROCS_PER_NODE]
                           [--pipeline-model-parallel]
                           [--pipeline-balance PIPELINE_BALANCE]
                           [--pipeline-devices PIPELINE_DEVICES]
                           [--pipeline-chunks PIPELINE_CHUNKS]
                           [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                           [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                           [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                           [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                           [--pipeline-checkpoint {always,never,except_last}]
                           [--zero-sharding {none,os}] [--path PATH]
                           [--remove-bpe [REMOVE_BPE]] [--quiet]
                           [--model-overrides MODEL_OVERRIDES]
                           [--results-path RESULTS_PATH] [--beam N]
                           [--nbest N] [--max-len-a N] [--max-len-b N]
                           [--min-len N] [--match-source-len]
                           [--no-early-stop] [--unnormalized]
                           [--no-beamable-mm] [--lenpen LENPEN]
                           [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                           [--sacrebleu] [--score-reference]
                           [--prefix-size PS] [--no-repeat-ngram-size N]
                           [--sampling] [--sampling-topk PS]
                           [--sampling-topp PS]
                           [--constraints [{ordered,unordered}]]
                           [--temperature N] [--diverse-beam-groups N]
                           [--diverse-beam-strength N] [--diversity-rate N]
                           [--print-alignment] [--print-step] [--lm-path PATH]
                           [--lm-weight N] [--iter-decode-eos-penalty N]
                           [--iter-decode-max-iter N]
                           [--iter-decode-force-max-iter]
                           [--iter-decode-with-beam N]
                           [--iter-decode-with-external-reranker]
                           [--retain-iter-history] [--retain-dropout]
                           [--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]]
                           [--decoding-format {unigram,ensemble,vote,dp,bs}]
                           [--buffer-size N] [--input FILE]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “translation”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don’t validate until reaching this many updates

Default: 0

--fixed-validation-seed fixed random seed to use for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid batch size of the validation batch (defaults to --batch-size)
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port that will be used to establish initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

enable unused parameter detection (not applicable to the no_c10d ddp-backend)

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--distributed-wrapper

Possible choices: DDP, SlowMo

distributed wrapper to use

Default: “DDP”

--slowmo-momentum SlowMo momentum term; defaults: 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-algorithm

whether to use LocalSGD or SGP

Default: “LocalSGD”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”
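
As a hedged sketch (layer counts and device ids below are hypothetical), a 12-layer model split evenly across two GPUs must satisfy sum(pipeline_balance) = 12 and len(pipeline_devices) = len(pipeline_balance):

    # hypothetical 12-layer model on two GPUs; values are illustrative
    --pipeline-model-parallel \
    --pipeline-balance '[6,6]' \
    --pipeline-devices '[0,1]' \
    --pipeline-chunks 4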

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

Generation

--path path(s) to model file(s), colon separated
--remove-bpe remove BPE tokens before scoring (can be set to sentencepiece)
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override, at generation time, model args that were set during training

Default: “{}”
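
For example, to override a single training-time argument at generation (the key below is hypothetical; any arg recorded with the model can be overridden):

    --model-overrides "{'max_source_positions': 2048}"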

--results-path path to save eval results (optional)
--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200
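
For example, with --max-len-a 0.5 and --max-len-b 10, a 20-token source caps generation at 0.5 * 20 + 10 = 20 tokens; with the defaults (a = 0, b = 200), every generation is capped at 200 tokens regardless of source length.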

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--no-early-stop

deprecated

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk perform unknown replacement (optionally with alignment dictionary)
--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking: prevent ngrams of this size from repeating in the generation (0 disables blocking)

Default: 0

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0
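
A hedged sketch of top-k sampled generation (dataset and checkpoint paths are placeholders), drawing a single sample per input:

    fairseq-interactive data-bin/wmt17_en_de \
        --path checkpoints/checkpoint_best.pt \
        --beam 1 --nbest 1 \
        --sampling --sampling-topk 10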

--constraints

Possible choices: ordered, unordered

enables lexically constrained decoding
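
A hedged sketch of constrained decoding (paths are placeholders): constraint phrases are supplied on the input line itself, tab-separated after the source sentence:

    echo -e "Die Katze schläft .\tthe cat" | \
        fairseq-interactive data-bin/wmt17_de_en \
        --path checkpoints/checkpoint_best.pt \
        --beam 5 --constraints ordered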

--temperature

temperature for generation

Default: 1.0

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--diversity-rate

strength of diversity penalty for Diverse Siblings Search

Default: -1.0

--print-alignment

if set, uses attention feedback to compute and print alignment to source tokens

Default: False

--print-step Default: False
--lm-path path to lm checkpoint for lm fusion
--lm-weight

weight for lm probs for lm fusion

Default: 0.0

--iter-decode-eos-penalty

if > 0.0, penalizes early stopping in decoding.

Default: 0.0

--iter-decode-max-iter

maximum iterations for iterative refinement.

Default: 10

--iter-decode-force-max-iter

if set, run exactly the maximum number of iterations without early stopping

Default: False

--iter-decode-with-beam

if > 1, the model will generate translations of varying lengths.

Default: 1

--iter-decode-with-external-reranker

if set, the last checkpoint is assumed to be a reranker and is used to rescore the translations

Default: False

--retain-iter-history

if set, decoding returns the whole history of iterative refinement

Default: False

--retain-dropout

Use dropout at inference time

Default: False

--retain-dropout-modules if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules
--decoding-format Possible choices: unigram, ensemble, vote, dp, bs

Interactive

--buffer-size

read this many sentences into a buffer before processing them

Default: 0

--input

file to read from; use - for stdin

Default: “-”
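
Combining the Generation and Interactive options above, a minimal sketch (dataset, checkpoint, and input paths are hypothetical) that translates a buffered file instead of stdin:

    fairseq-interactive data-bin/wmt17_en_de \
        --path checkpoints/checkpoint_best.pt \
        --beam 5 --remove-bpe \
        --buffer-size 64 --input source.bpe.en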

fairseq-score

BLEU scoring of generated translations against reference translations.

Command-line script for BLEU scoring.

usage: fairseq-score [-h] [-s SYS] -r REF [-o N] [--ignore-case] [--sacrebleu]
                     [--sentence-bleu]

Named Arguments

-s, --sys

system output

Default: “-”

-r, --ref references
-o, --order

consider ngrams up to this order

Default: 4

--ignore-case

case-insensitive scoring

Default: False

--sacrebleu

score with sacrebleu

Default: False

--sentence-bleu

report sentence-level BLEUs (i.e., with +1 smoothing)

Default: False
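
For example, to score a system output against a reference translation (filenames are hypothetical):

    fairseq-score --sys gen.out.sys --ref gen.out.ref

Since --sys defaults to “-”, the system output can instead be piped in on stdin.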

fairseq-eval-lm

Evaluate the perplexity of a trained language model.

usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                       [--log-format {json,none,simple,tqdm}]
                       [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
                       [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16]
                       [--fp16] [--memory-efficient-fp16]
                       [--fp16-no-flatten-grads]
                       [--fp16-init-scale FP16_INIT_SCALE]
                       [--fp16-scale-window FP16_SCALE_WINDOW]
                       [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                       [--min-loss-scale MIN_LOSS_SCALE]
                       [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                       [--user-dir USER_DIR]
                       [--empty-cache-freq EMPTY_CACHE_FREQ]
                       [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                       [--model-parallel-size MODEL_PARALLEL_SIZE]
                       [--checkpoint-suffix CHECKPOINT_SUFFIX]
                       [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                       [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                       [--profile]
                       [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                       [--tokenizer {nltk,space,moses}]
                       [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
                       [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
                       [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
                       [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
                       [--num-workers NUM_WORKERS]
                       [--skip-invalid-size-inputs-valid-test]
                       [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                       [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                       [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                       [--dataset-impl {raw,lazy,cached,mmap,fasta}]
                       [--data-buffer-size DATA_BUFFER_SIZE]
                       [--train-subset TRAIN_SUBSET]
                       [--valid-subset VALID_SUBSET]
                       [--validate-interval VALIDATE_INTERVAL]
                       [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                       [--validate-after-updates VALIDATE_AFTER_UPDATES]
                       [--fixed-validation-seed FIXED_VALIDATION_SEED]
                       [--disable-validation]
                       [--max-tokens-valid MAX_TOKENS_VALID]
                       [--batch-size-valid BATCH_SIZE_VALID]
                       [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
                       [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                       [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                       [--distributed-rank DISTRIBUTED_RANK]
                       [--distributed-backend DISTRIBUTED_BACKEND]
                       [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                       [--distributed-port DISTRIBUTED_PORT]
                       [--device-id DEVICE_ID] [--distributed-no-spawn]
                       [--ddp-backend {c10d,no_c10d}]
                       [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
                       [--find-unused-parameters] [--fast-stat-sync]
                       [--broadcast-buffers]
                       [--distributed-wrapper {DDP,SlowMo}]
                       [--slowmo-momentum SLOWMO_MOMENTUM]
                       [--slowmo-algorithm SLOWMO_ALGORITHM]
                       [--localsgd-frequency LOCALSGD_FREQUENCY]
                       [--nprocs-per-node NPROCS_PER_NODE]
                       [--pipeline-model-parallel]
                       [--pipeline-balance PIPELINE_BALANCE]
                       [--pipeline-devices PIPELINE_DEVICES]
                       [--pipeline-chunks PIPELINE_CHUNKS]
                       [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                       [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                       [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                       [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                       [--pipeline-checkpoint {always,never,except_last}]
                       [--zero-sharding {none,os}] [--path PATH]
                       [--remove-bpe [REMOVE_BPE]] [--quiet]
                       [--model-overrides MODEL_OVERRIDES]
                       [--results-path RESULTS_PATH] [--output-word-probs]
                       [--output-word-stats] [--context-window CONTEXT_WINDOW]
                       [--softmax-batch SOFTMAX_BATCH]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 100

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)
--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--tpu

use TPU instead of CUDA

Default: False

--bf16

use bfloat16; implies --tpu

Default: False

--memory-efficient-bf16

use a memory-efficient version of BF16 training; implies --bf16

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-no-flatten-grads

don’t flatten FP16 grads tensor

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window number of updates before increasing loss scale
--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale threshold FP16 loss scale from below
--user-dir path to a python module containing custom extensions (tasks and/or architectures)
--empty-cache-freq

how often to clear the PyTorch CUDA cache (0 to disable)

Default: 0

--all-gather-list-size

number of bytes reserved for gathering stats from workers

Default: 16384

--model-parallel-size

total number of GPUs to parallelize model over

Default: 1

--checkpoint-suffix

suffix to add to the checkpoint file name

Default: “”

--checkpoint-shard-count

Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint

Default: 1

--quantization-config-path path to quantization config file
--profile

enable autograd profiler emit_nvtx

Default: False

--criterion

Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy

Default: “cross_entropy”

--tokenizer Possible choices: nltk, space, moses
--bpe Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe
--optimizer Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd
--lr-scheduler

Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt

Default: “fixed”

--scoring

Possible choices: chrf, wer, sacrebleu, bleu

Default: “bleu”

--task

Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt

task

Default: “language_modeling”

dataset_data_loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens maximum number of tokens in a batch
--batch-size number of examples in a batch
--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--required-seq-len-multiple

maximum sequence length in a batch will be a multiple of this value

Default: 1

--dataset-impl

Possible choices: raw, lazy, cached, mmap, fasta

output dataset implementation

--data-buffer-size

Number of batches to preload

Default: 10

--train-subset

data subset to use for training (e.g. train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (e.g. train, valid, test)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--validate-interval-updates

validate every N updates

Default: 0

--validate-after-updates

don't validate until reaching this many updates

Default: 0

--fixed-validation-seed specified random seed for validation
--disable-validation

disable validation

Default: False

--max-tokens-valid maximum number of tokens in a validation batch (defaults to --max-tokens)
--batch-size-valid batch size of the validation batch (defaults to --batch-size)
--curriculum

don’t shuffle batches for first N epochs

Default: 0

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

distributed_training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method typically tcp://hostname:port, used to establish the initial connection
--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

enable unused parameter detection (not applicable to the no_c10d ddp-backend)

Default: False

--fast-stat-sync

[deprecated] this is now defined per Criterion

Default: False

--broadcast-buffers

Copy non-trainable parameters between GPUs, such as batchnorm population statistics

Default: False

--distributed-wrapper

Possible choices: DDP, SlowMo

distributed wrapper to use

Default: “DDP”

--slowmo-momentum SlowMo momentum term; defaults: 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, and 0.6 for more than 64 GPUs
--slowmo-algorithm

whether to use LocalSGD or SGP

Default: “LocalSGD”

--localsgd-frequency

Local SGD allreduce frequency

Default: 3

--nprocs-per-node

number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes

Default: 1

--pipeline-model-parallel

if set, use pipeline model parallelism across GPUs

Default: False

--pipeline-balance partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model
--pipeline-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument
--pipeline-chunks

microbatch count for pipeline model parallelism

Default: 0

--pipeline-encoder-balance partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model
--pipeline-encoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument
--pipeline-decoder-balance partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model
--pipeline-decoder-devices a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument
--pipeline-checkpoint

Possible choices: always, never, except_last

checkpointing mode for pipeline model parallelism

Default: “never”

--zero-sharding

Possible choices: none, os

ZeRO sharding

Default: “none”

LM Evaluation

--path path(s) to model file(s), colon separated
--remove-bpe remove BPE tokens before scoring (can be set to sentencepiece)
--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override, at generation time, model args that were set during training

Default: “{}”

--results-path path to save eval results (optional)
--output-word-probs

if set, outputs words and their predicted log probabilities to standard output

Default: False

--output-word-stats

if set, outputs word statistics such as word count, average probability, etc

Default: False

--context-window

ensures that every evaluated token has access to a context of at least this size, if possible

Default: 0

--softmax-batch

if B x T exceeds this value, batch the softmax over the vocabulary into chunks of at most this many tokens, in order to fit into GPU memory

Default: 9223372036854775807
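
A minimal sketch tying these options together (dataset and checkpoint paths are hypothetical):

    fairseq-eval-lm data-bin/wikitext-103 \
        --path checkpoints/lm/checkpoint_best.pt \
        --batch-size 2 \
        --context-window 2048 \
        --output-word-probs

Here --context-window scores each token with additional preceding context where possible, and --output-word-probs prints per-word log probabilities alongside the reported perplexity.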