Command-line Tools
Fairseq provides several command-line tools for training and evaluating models:
- fairseq-preprocess: Data pre-processing: build vocabularies and binarize training data
- fairseq-train: Train a new model on one or multiple GPUs
- fairseq-generate: Translate pre-processed data with a trained model
- fairseq-interactive: Translate raw text with a trained model
- fairseq-score: BLEU scoring of generated translations against reference translations
- fairseq-eval-lm: Language model evaluation
fairseq-preprocess
Data pre-processing: build vocabularies and binarize training data.
usage: fairseq-preprocess [-h] [--no-progress-bar]
[--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--seed SEED] [--cpu] [--tpu] [--bf16]
[--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile]
[--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {nltk,space,moses}]
[--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
[--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
[--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
[-s SRC] [-t TARGET] [--trainpref FP]
[--validpref FP] [--testpref FP] [--align-suffix FP]
[--destdir DIR] [--thresholdtgt N]
[--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
[--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
[--dataset-impl FORMAT] [--joined-dictionary]
[--only-source] [--padding-factor N] [--workers N]
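As a hedged illustration of how these options combine, the sketch below assembles a typical fairseq-preprocess invocation in Python and prints it without running it. The language pair, file prefixes, and output directory are hypothetical placeholders, and the helper name is an assumption made for this example; only flags that appear in the usage above are used.

```python
import shlex


def build_preprocess_command(src, tgt, data_prefix, destdir, workers=1):
    """Assemble (but do not run) a fairseq-preprocess command line.

    All paths are hypothetical; the flags are taken from the usage above.
    """
    return [
        "fairseq-preprocess",
        "--source-lang", src,
        "--target-lang", tgt,
        "--trainpref", f"{data_prefix}/train",
        "--validpref", f"{data_prefix}/valid",
        "--testpref", f"{data_prefix}/test",
        "--destdir", destdir,
        "--workers", str(workers),
    ]


cmd = build_preprocess_command("de", "en", "data/tokenized", "data-bin/example", workers=4)
print(shlex.join(cmd))
```

With fairseq installed, the same list could be handed to `subprocess.run(cmd, check=True)`; keeping the command as a list avoids shell-quoting problems with unusual paths.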
Named Arguments
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--min-loss-scale | minimum FP16 loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--checkpoint-suffix | suffix to add to the checkpoint file name Default: “” |
--checkpoint-shard-count | Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--criterion | Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: nltk, space, moses |
--bpe | Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe |
--optimizer | Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd |
--lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt Default: “fixed” |
--scoring | Possible choices: chrf, wer, sacrebleu, bleu Default: “bleu” |
--task | Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “translation” |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta output dataset implementation Default: “mmap” |
Preprocessing
-s, --source-lang | source language |
-t, --target-lang | target language |
--trainpref | train file prefix |
--validpref | comma separated, valid file prefixes |
--testpref | comma separated, test file prefixes |
--align-suffix | alignment file suffix |
--destdir | destination dir Default: “data-bin” |
--thresholdtgt | map words appearing less than threshold times to unknown Default: 0 |
--thresholdsrc | map words appearing less than threshold times to unknown Default: 0 |
--tgtdict | reuse given target dictionary |
--srcdict | reuse given source dictionary |
--nwordstgt | number of target words to retain Default: -1 |
--nwordssrc | number of source words to retain Default: -1 |
--alignfile | an alignment file (optional) |
--joined-dictionary | Generate joined dictionary Default: False |
--only-source | Only process the source language Default: False |
--padding-factor | Pad dictionary size to be multiple of N Default: 8 |
--workers | number of parallel workers Default: 1 |
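The interaction of the threshold, word-count, and padding options can be sketched as follows. This is a simplified illustration under stated assumptions, not fairseq's dictionary code: no special symbols are reserved, and the function name and filler-entry naming are invented for the example.

```python
from collections import Counter


def build_vocab(tokens, threshold=0, nwords=-1, padding_factor=8):
    """Simplified vocabulary builder mirroring the documented options.

    Assumptions (not taken from fairseq's code): no special symbols are
    reserved, and filler entries are named "<pad_filler_i>".
    """
    counts = Counter(tokens)
    # --thresholdsrc / --thresholdtgt: words seen fewer than `threshold`
    # times are left out, so they would map to the unknown symbol.
    kept = [word for word, count in counts.most_common() if count >= threshold]
    # --nwordssrc / --nwordstgt: retain only the N most frequent words (-1 keeps all).
    if nwords > 0:
        kept = kept[:nwords]
    # --padding-factor: pad the vocabulary size up to a multiple of N.
    if padding_factor > 1:
        while len(kept) % padding_factor != 0:
            kept.append(f"<pad_filler_{len(kept)}>")
    return kept


vocab = build_vocab("a a a b b c d d d d".split(), threshold=2, padding_factor=4)
print(vocab)
```

In this toy input, "c" occurs once and falls below the threshold of 2, and one filler entry brings the size from 3 up to 4.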
fairseq-train
Train a new model on one GPU or across multiple GPUs.
usage: fairseq-train [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
[--cpu] [--tpu] [--bf16] [--memory-efficient-bf16]
[--fp16] [--memory-efficient-fp16]
[--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile]
[--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {nltk,space,moses}]
[--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
[--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
[--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
[--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
[--find-unused-parameters] [--fast-stat-sync]
[--broadcast-buffers]
[--distributed-wrapper {DDP,SlowMo}]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--arch ARCH]
[--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
[--stop-time-hours STOP_TIME_HOURS]
[--clip-norm CLIP_NORM] [--sentence-avg]
[--update-freq UPDATE_FREQ] [--lr LR] [--min-lr MIN_LR]
[--use-bmuf] [--save-dir SAVE_DIR]
[--restore-file RESTORE_FILE]
[--finetune-from-model FINETUNE_FROM_MODEL]
[--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer]
[--optimizer-overrides OPTIMIZER_OVERRIDES]
[--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES]
[--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-last-epochs KEEP_LAST_EPOCHS]
[--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
[--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-save-optimizer-state]
[--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
[--maximize-best-checkpoint-metric] [--patience PATIENCE]
Named Arguments
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--min-loss-scale | minimum FP16 loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--checkpoint-suffix | suffix to add to the checkpoint file name Default: “” |
--checkpoint-shard-count | Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--criterion | Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: nltk, space, moses |
--bpe | Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe |
--optimizer | Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd |
--lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt Default: “fixed” |
--scoring | Possible choices: chrf, wer, sacrebleu, bleu Default: “bleu” |
--task | Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “translation” |
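The FP16 loss-scale options in the table above describe a dynamic loss-scaling loop. A minimal sketch follows, with loud assumptions: the scale changes by a fixed factor of 2, every overflow lowers the scale (as with --fp16-scale-tolerance 0.0), and the default scale window is arbitrary because the table lists none. The class is illustrative and is not fairseq's implementation.

```python
class DynamicLossScaler:
    """Minimal sketch of dynamic FP16 loss scaling.

    Simplifications (assumptions, not fairseq's implementation): the scale
    changes by a fixed factor of 2, and every overflow lowers the scale.
    """

    def __init__(self, init_scale=128.0, scale_window=1000, min_loss_scale=1e-4):
        self.scale = init_scale                # --fp16-init-scale
        self.scale_window = scale_window       # --fp16-scale-window (default here is arbitrary)
        self.min_loss_scale = min_loss_scale   # --min-loss-scale
        self._clean_updates = 0

    def update(self, overflow):
        """Record one update and return the loss scale to use next."""
        if overflow:
            # Gradients overflowed: halve the scale and restart the window.
            self.scale /= 2.0
            self._clean_updates = 0
            if self.scale < self.min_loss_scale:
                raise FloatingPointError("loss scale fell below the minimum; stopping")
        else:
            # After `scale_window` clean updates in a row, try a larger scale.
            self._clean_updates += 1
            if self._clean_updates % self.scale_window == 0:
                self.scale *= 2.0
        return self.scale


scaler = DynamicLossScaler(init_scale=128.0, scale_window=2)
history = [scaler.update(overflow) for overflow in (False, False, True, False)]
print(history)
```

The scale grows only after a run of clean updates and shrinks immediately on overflow, which is why a small --min-loss-scale acts as a stop condition for runs that keep overflowing.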
dataset_data_loading
--num-workers | how many subprocesses to use for data loading Default: 1 |
--skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set Default: False |
--max-tokens | maximum number of tokens in a batch |
--batch-size | number of examples in a batch |
--required-batch-size-multiple | batch size will be a multiple of this value Default: 8 |
--required-seq-len-multiple | maximum sequence length in batch will be a multiple of this value Default: 1 |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta output dataset implementation |
--data-buffer-size | Number of batches to preload Default: 10 |
--train-subset | data subset to use for training (e.g. train, valid, test) Default: “train” |
--valid-subset | comma separated list of data subsets to use for validation (e.g. train, valid, test) Default: “valid” |
--validate-interval | validate every N epochs Default: 1 |
--validate-interval-updates | validate every N updates Default: 0 |
--validate-after-updates | don’t validate until reaching this many updates Default: 0 |
--fixed-validation-seed | specified random seed for validation |
--disable-validation | disable validation Default: False |
--max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) |
--batch-size-valid | batch size of the validation batch (defaults to --batch-size) |
--curriculum | don’t shuffle batches for first N epochs Default: 0 |
--gen-subset | data subset to generate (train, valid, test) Default: “test” |
--num-shards | shard generation over N shards Default: 1 |
--shard-id | id of the shard to generate (id < num_shards) Default: 0 |
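To make the difference between --max-tokens and --batch-size concrete, the sketch below groups sentences greedily under a token budget. It is a simplified illustration, not fairseq's batching code: the cost of a batch is assumed to be its padded size, --required-batch-size-multiple is ignored, and the function name is invented for the example.

```python
def batch_by_max_tokens(lengths, max_tokens, batch_size=None):
    """Greedily group sentence indices into batches.

    A batch's cost is counted as (number of sentences) x (longest sentence),
    i.e. its padded size. Simplified sketch; a sentence longer than
    `max_tokens` simply ends up alone in its own batch.
    """
    batches, current, longest = [], [], 0
    for idx, length in enumerate(lengths):
        new_longest = max(longest, length)
        too_many_tokens = (len(current) + 1) * new_longest > max_tokens
        too_many_sentences = batch_size is not None and len(current) >= batch_size
        if current and (too_many_tokens or too_many_sentences):
            # Close the current batch and start a new one with this sentence.
            batches.append(current)
            current, new_longest = [], length
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


batches = batch_by_max_tokens([5, 5, 5, 10, 10], max_tokens=20)
print(batches)
```

A token budget lets short sentences form larger batches than long ones, whereas a fixed sentence count would pad every batch to the same number of rows regardless of length.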
distributed_training
--distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) Default: 1 |
--distributed-rank | rank of the current worker Default: 0 |
--distributed-backend | distributed backend Default: “nccl” |
--distributed-init-method | typically tcp://hostname:port that will be used to establish initial connection |
--distributed-port | port number (not required if using --distributed-init-method) Default: -1 |
--device-id, --local_rank | which GPU to use (usually configured automatically) Default: 0 |
--distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible Default: False |
--ddp-backend | Possible choices: c10d, no_c10d DistributedDataParallel backend Default: “c10d” |
--bucket-cap-mb | bucket size for reduction Default: 25 |
--fix-batches-to-gpus | don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data Default: False |
--find-unused-parameters | disable unused parameter detection (not applicable to no_c10d ddp-backend) Default: False |
--fast-stat-sync | [deprecated] this is now defined per Criterion Default: False |
--broadcast-buffers | Copy non-trainable parameters between GPUs, such as batchnorm population statistics Default: False |
--distributed-wrapper | Possible choices: DDP, SlowMo DistributedDataParallel backend Default: “DDP” |
--slowmo-momentum | SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs |
--slowmo-algorithm | whether to use LocalSGD or SGP Default: “LocalSGD” |
--localsgd-frequency | Local SGD allreduce frequency Default: 3 |
--nprocs-per-node | number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes Default: 1 |
--pipeline-model-parallel | if set, use pipeline model parallelism across GPUs Default: False |
--pipeline-balance | partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model |
--pipeline-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument |
--pipeline-chunks | microbatch count for pipeline model parallelism Default: 0 |
--pipeline-encoder-balance | partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model |
--pipeline-encoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument |
--pipeline-decoder-balance | partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model |
--pipeline-decoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument |
--pipeline-checkpoint | Possible choices: always, never, except_last checkpointing mode for pipeline model parallelism Default: “never” |
--zero-sharding | Possible choices: none, os ZeRO sharding Default: “none” |
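The two constraints stated for the pipeline options (the balance entries sum to the layer count, and there is one device per partition) can be checked with a small helper. This is an illustrative sketch: the function name and error messages are not part of fairseq, and partitions are assumed to hold contiguous layers.

```python
def check_pipeline_partition(balance, devices, num_layers):
    """Validate a pipeline partition against the documented constraints.

    `balance[i]` is the number of layers in partition i and `devices[i]` is
    the device index that partition is placed on. Partitions are assumed
    to cover contiguous layer ranges.
    """
    if sum(balance) != num_layers:
        raise ValueError(
            f"sum(balance)={sum(balance)} must equal the number of layers ({num_layers})"
        )
    if len(devices) != len(balance):
        raise ValueError(f"got {len(devices)} devices for {len(balance)} partitions")
    # Map each partition to its device and the half-open layer range it holds.
    placement, start = [], 0
    for n_layers, device in zip(balance, devices):
        placement.append((device, range(start, start + n_layers)))
        start += n_layers
    return placement


placement = check_pipeline_partition(balance=[4, 4, 4], devices=[0, 1, 2], num_layers=12)
for device, layers in placement:
    print(f"device {device}: layers {layers.start}..{layers.stop - 1}")
```

Running a check like this before launching a job surfaces a mismatched balance or device list immediately, rather than after model construction has begun.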
Model configuration
--arch, -a | Possible choices: roberta, roberta_base, roberta_large, xlm, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_align, transformer_wmt_en_de_big_align, wav2vec, wav2vec2, wav2vec_ctc, wav2vec_seq2seq, transformer_from_pretrained_xlm, s2t_berard, s2t_berard_256_3_3, s2t_berard_512_3_2, s2t_berard_512_5_3, s2t_transformer, s2t_transformer_s, s2t_transformer_sp, s2t_transformer_m, s2t_transformer_mp, s2t_transformer_l, s2t_transformer_lp, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, masked_lm, bert_base, bert_large, xlm_base, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, nonautoregressive_transformer, nonautoregressive_transformer_wmt_en_de, nacrf_transformer, iterative_nonautoregressive_transformer, iterative_nonautoregressive_transformer_wmt_en_de, cmlm_transformer, cmlm_transformer_wmt_en_de, levenshtein_transformer, levenshtein_transformer_wmt_en_de, levenshtein_transformer_vaswani_wmt_en_de_big, levenshtein_transformer_wmt_en_de_big, insertion_transformer, lightconv_lm, lightconv_lm_gbw, fconv_self_att, fconv_self_att_wp, bart_large, bart_base, mbart_large, mbart_base, mbart_base_wmt20, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, multilingual_transformer, multilingual_transformer_iwslt_de_en, hf_gpt2, hf_gpt2_medium, hf_gpt2_large, hf_gpt2_xl, lstm_lm, dummy_model, model_parallel_roberta, model_parallel_roberta_base, model_parallel_roberta_large, transformer_lm_megatron, 
transformer_lm_megatron_11b, transformer_iwslt_de_en_pipeline_parallel, transformer_wmt_en_de_big_pipeline_parallel model architecture |
optimization
--max-epoch | force stop training at specified epoch Default: 0 |
--max-update | force stop training at specified update Default: 0 |
--stop-time-hours | force stop training after specified cumulative time (if >0) Default: 0 |
--clip-norm | clip threshold of gradients Default: 0.0 |
--sentence-avg | normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens) Default: False |
--update-freq | update parameters every N_i batches, when in epoch i Default: 1 |
--lr | learning rate for the first N epochs; all epochs >N using LR_N (note: this may be interpreted differently depending on --lr-scheduler) Default: 0.25 |
--min-lr | stop training when the learning rate reaches this minimum Default: -1.0 |
--use-bmuf | specify global optimizer for syncing models on different GPUs/shards Default: False |
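Both --update-freq and --lr are described as per-epoch schedules in which the last entry applies to all later epochs. A minimal sketch of that lookup is shown below; the helper name and the example values are hypothetical, and as the table notes, the learning-rate scheduler may interpret --lr differently.

```python
def value_for_epoch(schedule, epoch):
    """Return the entry of a per-epoch schedule for a 1-based epoch number.

    Epochs beyond the end of the list reuse the last entry, matching the
    "all epochs >N using LR_N" wording above. Helper name is illustrative.
    """
    if epoch < 1:
        raise ValueError("epochs are numbered from 1")
    return schedule[min(epoch, len(schedule)) - 1]


update_freq = [4, 2, 1]  # hypothetical per-epoch values for --update-freq
lr = [0.25, 0.1]         # hypothetical per-epoch values for --lr

for epoch in range(1, 5):
    print(epoch, value_for_epoch(update_freq, epoch), value_for_epoch(lr, epoch))
```

With these values, gradients would be accumulated over 4 batches in epoch 1, 2 batches in epoch 2, and applied after every batch from epoch 3 onward.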
checkpoint
--save-dir | path to save checkpoints Default: “checkpoints” |
--restore-file | filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt) Default: “checkpoint_last.pt” |
--finetune-from-model | finetune from a pretrained model; note that meters and lr scheduler will be reset |
--reset-dataloader | if set, does not reload dataloader state from the checkpoint Default: False |
--reset-lr-scheduler | if set, does not load lr scheduler state from the checkpoint Default: False |
--reset-meters | if set, does not load meters from the checkpoint Default: False |
--reset-optimizer | if set, does not load optimizer state from the checkpoint Default: False |
--optimizer-overrides | a dictionary used to override optimizer args when loading a checkpoint Default: “{}” |
--save-interval | save a checkpoint every N epochs Default: 1 |
--save-interval-updates | save a checkpoint (and validate) every N updates Default: 0 |
--keep-interval-updates | keep the last N checkpoints saved with --save-interval-updates Default: -1 |
--keep-last-epochs | keep last N epoch checkpoints Default: -1 |
--keep-best-checkpoints | keep best N checkpoints based on scores Default: -1 |
--no-save | don’t save models or checkpoints Default: False |
--no-epoch-checkpoints | only store last and best checkpoints Default: False |
--no-last-checkpoints | don’t store last checkpoints Default: False |
--no-save-optimizer-state | don’t save optimizer-state as part of checkpoint Default: False |
--best-checkpoint-metric | metric to use for saving “best” checkpoints Default: “loss” |
--maximize-best-checkpoint-metric | select the largest metric value for saving “best” checkpoints Default: False |
--patience | early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval Default: -1 |
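The early-stopping rule behind --patience, combined with --maximize-best-checkpoint-metric, can be sketched as below. This is an illustration rather than fairseq's code: the class name is invented, and treating non-positive patience values as "disabled" is an assumption based on the default of -1.

```python
class EarlyStopper:
    """Sketch of patience-based early stopping; names are illustrative."""

    def __init__(self, patience=-1, maximize=False):
        self.patience = patience  # --patience (assumed: values <= 0 disable early stopping)
        self.maximize = maximize  # --maximize-best-checkpoint-metric
        self.best = None
        self.runs_without_improvement = 0

    def should_stop(self, metric):
        """Record one validation result and report whether to stop training."""
        if self.patience <= 0:
            return False
        improved = self.best is None or (
            metric > self.best if self.maximize else metric < self.best
        )
        if improved:
            self.best = metric
            self.runs_without_improvement = 0
            return False
        self.runs_without_improvement += 1
        return self.runs_without_improvement >= self.patience


stopper = EarlyStopper(patience=2)
decisions = [stopper.should_stop(loss) for loss in (3.0, 2.5, 2.6, 2.7)]
print(decisions)
```

Because the counter advances once per validation run, validating less often (a larger --validate-interval) lengthens the amount of training that a given patience value tolerates.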
fairseq-generate
Translate pre-processed data with a trained model.
usage: fairseq-generate [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--seed SEED] [--cpu] [--tpu] [--bf16]
[--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile]
[--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {nltk,space,moses}]
[--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
[--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
[--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
[--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}]
[--bucket-cap-mb BUCKET_CAP_MB]
[--fix-batches-to-gpus] [--find-unused-parameters]
[--fast-stat-sync] [--broadcast-buffers]
[--distributed-wrapper {DDP,SlowMo}]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--path PATH]
[--remove-bpe [REMOVE_BPE]] [--quiet]
[--model-overrides MODEL_OVERRIDES]
[--results-path RESULTS_PATH] [--beam N] [--nbest N]
[--max-len-a N] [--max-len-b N] [--min-len N]
[--match-source-len] [--no-early-stop]
[--unnormalized] [--no-beamable-mm] [--lenpen LENPEN]
[--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
[--sacrebleu] [--score-reference] [--prefix-size PS]
[--no-repeat-ngram-size N] [--sampling]
[--sampling-topk PS] [--sampling-topp PS]
[--constraints [{ordered,unordered}]]
[--temperature N] [--diverse-beam-groups N]
[--diverse-beam-strength N] [--diversity-rate N]
[--print-alignment] [--print-step] [--lm-path PATH]
[--lm-weight N] [--iter-decode-eos-penalty N]
[--iter-decode-max-iter N]
[--iter-decode-force-max-iter]
[--iter-decode-with-beam N]
[--iter-decode-with-external-reranker]
[--retain-iter-history] [--retain-dropout]
[--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]]
[--decoding-format {unigram,ensemble,vote,dp,bs}]
Named Arguments
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--min-loss-scale | minimum FP16 loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--checkpoint-suffix | suffix to add to the checkpoint file name Default: “” |
--checkpoint-shard-count | Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--criterion | Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: nltk, space, moses |
--bpe | Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe |
--optimizer | Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd |
--lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt Default: “fixed” |
--scoring | Possible choices: chrf, wer, sacrebleu, bleu Default: “bleu” |
--task | Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “translation” |
dataset_data_loading¶
--num-workers | how many subprocesses to use for data loading Default: 1 |
--skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set Default: False |
--max-tokens | maximum number of tokens in a batch |
--batch-size | number of examples in a batch |
--required-batch-size-multiple | batch size will be a multiple of this value Default: 8 |
--required-seq-len-multiple | maximum sequence length in batch will be a multiple of this value Default: 1 |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta output dataset implementation |
--data-buffer-size | Number of batches to preload Default: 10 |
--train-subset | data subset to use for training (e.g. train, valid, test) Default: “train” |
--valid-subset | comma separated list of data subsets to use for validation (e.g. train, valid, test) Default: “valid” |
--validate-interval | validate every N epochs Default: 1 |
--validate-interval-updates | validate every N updates Default: 0 |
--validate-after-updates | don’t validate until reaching this many updates Default: 0 |
--fixed-validation-seed | specified random seed for validation |
--disable-validation | disable validation Default: False |
--max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) |
--batch-size-valid | batch size of the validation batch (defaults to --batch-size) |
--curriculum | don’t shuffle batches for first N epochs Default: 0 |
--gen-subset | data subset to generate (train, valid, test) Default: “test” |
--num-shards | shard generation over N shards Default: 1 |
--shard-id | id of the shard to generate (id < num_shards) Default: 0 |
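The interaction between --num-shards and --shard-id can be pictured with a small sketch. It assumes a round-robin assignment of batches to shards; the helper name `select_shard` is hypothetical and is not part of fairseq, whose actual assignment strategy may differ.

```python
from itertools import islice


def select_shard(items, num_shards, shard_id):
    """Return the items assigned to one shard (assumed round-robin assignment)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    # Take every num_shards-th item, starting at offset shard_id.
    return list(islice(items, shard_id, None, num_shards))


batches = ["b0", "b1", "b2", "b3", "b4"]
for shard_id in range(2):
    print(shard_id, select_shard(batches, num_shards=2, shard_id=shard_id))
```

Under this assumption every batch is handled by exactly one shard, so running one process per shard id covers the whole subset.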
distributed_training¶
--distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) Default: 1 |
--distributed-rank | rank of the current worker Default: 0 |
--distributed-backend | distributed backend Default: “nccl” |
--distributed-init-method | typically tcp://hostname:port that will be used to establish initial connection |
--distributed-port | port number (not required if using --distributed-init-method) Default: -1 |
--device-id, --local_rank | which GPU to use (usually configured automatically) Default: 0 |
--distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible Default: False |
--ddp-backend | Possible choices: c10d, no_c10d DistributedDataParallel backend Default: “c10d” |
--bucket-cap-mb | bucket size for reduction Default: 25 |
--fix-batches-to-gpus | don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data Default: False |
--find-unused-parameters | disable unused parameter detection (not applicable to no_c10d ddp-backend) Default: False |
--fast-stat-sync | [deprecated] this is now defined per Criterion Default: False |
--broadcast-buffers | Copy non-trainable parameters between GPUs, such as batchnorm population statistics Default: False |
--distributed-wrapper | Possible choices: DDP, SlowMo DistributedDataParallel backend Default: “DDP” |
--slowmo-momentum | SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs |
--slowmo-algorithm | whether to use LocalSGD or SGP Default: “LocalSGD” |
--localsgd-frequency | Local SGD allreduce frequency Default: 3 |
--nprocs-per-node | number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes Default: 1 |
--pipeline-model-parallel | if set, use pipeline model parallelism across GPUs Default: False |
--pipeline-balance | partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model |
--pipeline-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument |
--pipeline-chunks | microbatch count for pipeline model parallelism Default: 0 |
--pipeline-encoder-balance | partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model |
--pipeline-encoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument |
--pipeline-decoder-balance | partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model |
--pipeline-decoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument |
--pipeline-checkpoint | Possible choices: always, never, except_last checkpointing mode for pipeline model parallelism Default: “never” |
--zero-sharding | Possible choices: none, os ZeRO sharding Default: “none” |
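The two consistency rules stated for --pipeline-balance and --pipeline-devices (the balance entries sum to the number of layers, and both lists have the same length) can be checked with a short sketch. The function `check_pipeline_partition` is a hypothetical illustration, not a fairseq API.

```python
def check_pipeline_partition(balance, devices, num_layers):
    """Validate a pipeline partition and return the device of each layer."""
    if sum(balance) != num_layers:
        raise ValueError("sum(balance) must equal the total number of layers")
    if len(devices) != len(balance):
        raise ValueError("devices and balance must have the same length")
    placement = []
    # Partition i holds balance[i] consecutive layers on devices[i].
    for n_layers, device in zip(balance, devices):
        placement.extend([device] * n_layers)
    return placement


# Six layers split into pieces of 2, 3 and 1 layers on devices 0, 1 and 2.
print(check_pipeline_partition([2, 3, 1], [0, 1, 2], num_layers=6))
```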
Generation¶
--path | path(s) to model file(s), colon separated |
--remove-bpe | remove BPE tokens before scoring (can be set to sentencepiece) |
--quiet | only print final scores Default: False |
--model-overrides | a dictionary used to override model args at generation that were used during model training Default: “{}” |
--results-path | path to save eval results (optional) |
--beam | beam size Default: 5 |
--nbest | number of hypotheses to output Default: 1 |
--max-len-a | generate sequences of maximum length ax + b, where x is the source length Default: 0 |
--max-len-b | generate sequences of maximum length ax + b, where x is the source length Default: 200 |
--min-len | minimum generation length Default: 1 |
--match-source-len | generations should match the source length Default: False |
--no-early-stop | deprecated Default: False |
--unnormalized | compare unnormalized hypothesis scores Default: False |
--no-beamable-mm | don’t use BeamableMM in attention layers Default: False |
--lenpen | length penalty: <1.0 favors shorter, >1.0 favors longer sentences Default: 1 |
--unkpen | unknown word penalty: <0 produces more unks, >0 produces fewer Default: 0 |
--replace-unk | perform unknown replacement (optionally with alignment dictionary) |
--sacrebleu | score with sacrebleu Default: False |
--score-reference | just score the reference translation Default: False |
--prefix-size | initialize generation by target prefix of given length Default: 0 |
--no-repeat-ngram-size | ngram blocking such that this size ngram cannot be repeated in the generation Default: 0 |
--sampling | sample hypotheses instead of using beam search Default: False |
--sampling-topk | sample from top K likely next words instead of all words Default: -1 |
--sampling-topp | sample from the smallest set whose cumulative probability mass exceeds p for next words Default: -1.0 |
--constraints | Possible choices: ordered, unordered enables lexically constrained decoding |
--temperature | temperature for generation Default: 1.0 |
--diverse-beam-groups | number of groups for Diverse Beam Search Default: -1 |
--diverse-beam-strength | strength of diversity penalty for Diverse Beam Search Default: 0.5 |
--diversity-rate | strength of diversity penalty for Diverse Siblings Search Default: -1.0 |
--print-alignment | if set, uses attention feedback to compute and print alignment to source tokens Default: False |
--print-step | Default: False |
--lm-path | path to lm checkpoint for lm fusion |
--lm-weight | weight for lm probs for lm fusion Default: 0.0 |
--iter-decode-eos-penalty | if > 0.0, it penalizes early-stopping in decoding. Default: 0.0 |
--iter-decode-max-iter | maximum iterations for iterative refinement. Default: 10 |
--iter-decode-force-max-iter | if set, run exactly the maximum number of iterations without early stop Default: False |
--iter-decode-with-beam | if > 1, model will generate translations varying by the lengths. Default: 1 |
--iter-decode-with-external-reranker | if set, the last checkpoint is assumed to be a reranker to rescore the translations Default: False |
--retain-iter-history | if set, decoding returns the whole history of iterative refinement Default: False |
--retain-dropout | Use dropout at inference time Default: False |
--retain-dropout-modules | if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules |
--decoding-format | Possible choices: unigram, ensemble, vote, dp, bs |
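Two of the generation options above are defined by short formulas: the length cap ax + b (--max-len-a, --max-len-b) and the nucleus used by --sampling-topp. The sketch below illustrates both in plain Python. The function names are hypothetical, and truncating the length cap to an integer is an assumption.

```python
def max_generation_length(src_len, max_len_a=0.0, max_len_b=200):
    """Maximum output length a*x + b, where x is the source length."""
    return int(max_len_a * src_len + max_len_b)


def nucleus(probs, p):
    """Indices of the smallest set of tokens whose cumulative probability exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total > p:
            break
    return kept


print(max_generation_length(10))             # defaults: a=0, b=200
print(max_generation_length(10, 1.5, 5))     # 1.5 * 10 + 5
print(nucleus([0.5, 0.3, 0.15, 0.05], 0.7))  # tokens eligible for sampling
```

With the defaults the cap does not depend on the source length, so raising --max-len-a is how the limit is made proportional to the input.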
fairseq-interactive¶
Translate raw text with a trained model. Batches data on-the-fly.
usage: fairseq-interactive [-h] [--no-progress-bar]
[--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--seed SEED] [--cpu] [--tpu] [--bf16]
[--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile]
[--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {nltk,space,moses}]
[--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
[--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
[--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
[--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}]
[--bucket-cap-mb BUCKET_CAP_MB]
[--fix-batches-to-gpus] [--find-unused-parameters]
[--fast-stat-sync] [--broadcast-buffers]
[--distributed-wrapper {DDP,SlowMo}]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--path PATH]
[--remove-bpe [REMOVE_BPE]] [--quiet]
[--model-overrides MODEL_OVERRIDES]
[--results-path RESULTS_PATH] [--beam N]
[--nbest N] [--max-len-a N] [--max-len-b N]
[--min-len N] [--match-source-len]
[--no-early-stop] [--unnormalized]
[--no-beamable-mm] [--lenpen LENPEN]
[--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
[--sacrebleu] [--score-reference]
[--prefix-size PS] [--no-repeat-ngram-size N]
[--sampling] [--sampling-topk PS]
[--sampling-topp PS]
[--constraints [{ordered,unordered}]]
[--temperature N] [--diverse-beam-groups N]
[--diverse-beam-strength N] [--diversity-rate N]
[--print-alignment] [--print-step] [--lm-path PATH]
[--lm-weight N] [--iter-decode-eos-penalty N]
[--iter-decode-max-iter N]
[--iter-decode-force-max-iter]
[--iter-decode-with-beam N]
[--iter-decode-with-external-reranker]
[--retain-iter-history] [--retain-dropout]
[--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]]
[--decoding-format {unigram,ensemble,vote,dp,bs}]
[--buffer-size N] [--input FILE]
Named Arguments¶
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--min-loss-scale | minimum FP16 loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--checkpoint-suffix | suffix to add to the checkpoint file name Default: “” |
--checkpoint-shard-count | Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--criterion | Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: nltk, space, moses |
--bpe | Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe |
--optimizer | Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd |
--lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt Default: “fixed” |
--scoring | Possible choices: chrf, wer, sacrebleu, bleu Default: “bleu” |
--task | Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “translation” |
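The FP16 loss-scale options (--fp16-init-scale, --fp16-scale-window, --min-loss-scale) describe a dynamic loss scaler. The following is a simplified sketch of that idea, not fairseq's implementation: the class name is hypothetical, the factor of 2 for growing and shrinking the scale is an assumption, and --fp16-scale-tolerance is ignored.

```python
class LossScaler:
    """Simplified dynamic loss scaler (illustrative only)."""

    def __init__(self, init_scale=128.0, scale_window=4, min_loss_scale=1e-4):
        self.scale = init_scale
        self.scale_window = scale_window
        self.min_loss_scale = min_loss_scale
        self._good_updates = 0

    def update(self, overflow):
        if overflow:
            # Assumed behaviour: halve the scale on a gradient overflow.
            self.scale /= 2.0
            self._good_updates = 0
            if self.scale < self.min_loss_scale:
                raise FloatingPointError("loss scale fell below the minimum")
        else:
            self._good_updates += 1
            # Assumed behaviour: double after scale_window clean updates.
            if self._good_updates == self.scale_window:
                self.scale *= 2.0
                self._good_updates = 0


scaler = LossScaler(init_scale=128.0, scale_window=2)
scaler.update(overflow=True)
print(scaler.scale)
scaler.update(overflow=False)
scaler.update(overflow=False)
print(scaler.scale)
```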
dataset_data_loading¶
--num-workers | how many subprocesses to use for data loading Default: 1 |
--skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set Default: False |
--max-tokens | maximum number of tokens in a batch |
--batch-size | number of examples in a batch |
--required-batch-size-multiple | batch size will be a multiple of this value Default: 8 |
--required-seq-len-multiple | maximum sequence length in batch will be a multiple of this value Default: 1 |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta output dataset implementation |
--data-buffer-size | Number of batches to preload Default: 10 |
--train-subset | data subset to use for training (e.g. train, valid, test) Default: “train” |
--valid-subset | comma separated list of data subsets to use for validation (e.g. train, valid, test) Default: “valid” |
--validate-interval | validate every N epochs Default: 1 |
--validate-interval-updates | validate every N updates Default: 0 |
--validate-after-updates | don’t validate until reaching this many updates Default: 0 |
--fixed-validation-seed | specified random seed for validation |
--disable-validation | disable validation Default: False |
--max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) |
--batch-size-valid | batch size of the validation batch (defaults to --batch-size) |
--curriculum | don’t shuffle batches for first N epochs Default: 0 |
--gen-subset | data subset to generate (train, valid, test) Default: “test” |
--num-shards | shard generation over N shards Default: 1 |
--shard-id | id of the shard to generate (id < num_shards) Default: 0 |
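As a rough illustration of --required-batch-size-multiple and --required-seq-len-multiple, the sketch below rounds a batch size down and a sequence length up to the requested multiple. Both helpers are hypothetical, and the rounding directions and the handling of batches smaller than the multiple are assumptions rather than documented behaviour.

```python
def trim_batch_size(n_examples, multiple=8):
    """Largest multiple of `multiple` not exceeding n_examples.

    Batches smaller than `multiple` are kept as-is (assumption).
    """
    if n_examples < multiple:
        return n_examples
    return (n_examples // multiple) * multiple


def pad_seq_len(length, multiple=1):
    """Smallest multiple of `multiple` that is at least `length`."""
    return ((length + multiple - 1) // multiple) * multiple


print(trim_batch_size(21))   # 21 examples with a multiple of 8
print(pad_seq_len(13, 8))    # 13 tokens padded to a multiple of 8
```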
distributed_training¶
--distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) Default: 1 |
--distributed-rank | rank of the current worker Default: 0 |
--distributed-backend | distributed backend Default: “nccl” |
--distributed-init-method | typically tcp://hostname:port that will be used to establish initial connection |
--distributed-port | port number (not required if using --distributed-init-method) Default: -1 |
--device-id, --local_rank | which GPU to use (usually configured automatically) Default: 0 |
--distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible Default: False |
--ddp-backend | Possible choices: c10d, no_c10d DistributedDataParallel backend Default: “c10d” |
--bucket-cap-mb | bucket size for reduction Default: 25 |
--fix-batches-to-gpus | don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data Default: False |
--find-unused-parameters | disable unused parameter detection (not applicable to no_c10d ddp-backend) Default: False |
--fast-stat-sync | [deprecated] this is now defined per Criterion Default: False |
--broadcast-buffers | Copy non-trainable parameters between GPUs, such as batchnorm population statistics Default: False |
--distributed-wrapper | Possible choices: DDP, SlowMo DistributedDataParallel backend Default: “DDP” |
--slowmo-momentum | SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs |
--slowmo-algorithm | whether to use LocalSGD or SGP Default: “LocalSGD” |
--localsgd-frequency | Local SGD allreduce frequency Default: 3 |
--nprocs-per-node | number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes Default: 1 |
--pipeline-model-parallel | if set, use pipeline model parallelism across GPUs Default: False |
--pipeline-balance | partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model |
--pipeline-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument |
--pipeline-chunks | microbatch count for pipeline model parallelism Default: 0 |
--pipeline-encoder-balance | partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model |
--pipeline-encoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument |
--pipeline-decoder-balance | partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model |
--pipeline-decoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument |
--pipeline-checkpoint | Possible choices: always, never, except_last checkpointing mode for pipeline model parallelism Default: “never” |
--zero-sharding | Possible choices: none, os ZeRO sharding Default: “none” |
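The default described for --slowmo-momentum depends on the total GPU count. A minimal sketch of that lookup follows; the function name is hypothetical, and the treatment of GPU counts between the listed values (16, 32, 64) as "up to and including" thresholds is an assumption.

```python
def default_slowmo_momentum(world_size):
    """Default SlowMo momentum for a given total number of GPUs."""
    if world_size <= 16:
        return 0.0
    if world_size <= 32:
        return 0.2
    if world_size <= 64:
        return 0.5
    return 0.6


for gpus in (16, 32, 64, 128):
    print(gpus, default_slowmo_momentum(gpus))
```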
Generation¶
--path | path(s) to model file(s), colon separated |
--remove-bpe | remove BPE tokens before scoring (can be set to sentencepiece) |
--quiet | only print final scores Default: False |
--model-overrides | a dictionary used to override model args at generation that were used during model training Default: “{}” |
--results-path | path to save eval results (optional) |
--beam | beam size Default: 5 |
--nbest | number of hypotheses to output Default: 1 |
--max-len-a | generate sequences of maximum length ax + b, where x is the source length Default: 0 |
--max-len-b | generate sequences of maximum length ax + b, where x is the source length Default: 200 |
--min-len | minimum generation length Default: 1 |
--match-source-len | generations should match the source length Default: False |
--no-early-stop | deprecated Default: False |
--unnormalized | compare unnormalized hypothesis scores Default: False |
--no-beamable-mm | don’t use BeamableMM in attention layers Default: False |
--lenpen | length penalty: <1.0 favors shorter, >1.0 favors longer sentences Default: 1 |
--unkpen | unknown word penalty: <0 produces more unks, >0 produces fewer Default: 0 |
--replace-unk | perform unknown replacement (optionally with alignment dictionary) |
--sacrebleu | score with sacrebleu Default: False |
--score-reference | just score the reference translation Default: False |
--prefix-size | initialize generation by target prefix of given length Default: 0 |
--no-repeat-ngram-size | ngram blocking such that this size ngram cannot be repeated in the generation Default: 0 |
--sampling | sample hypotheses instead of using beam search Default: False |
--sampling-topk | sample from top K likely next words instead of all words Default: -1 |
--sampling-topp | sample from the smallest set whose cumulative probability mass exceeds p for next words Default: -1.0 |
--constraints | Possible choices: ordered, unordered enables lexically constrained decoding |
--temperature | temperature for generation Default: 1.0 |
--diverse-beam-groups | number of groups for Diverse Beam Search Default: -1 |
--diverse-beam-strength | strength of diversity penalty for Diverse Beam Search Default: 0.5 |
--diversity-rate | strength of diversity penalty for Diverse Siblings Search Default: -1.0 |
--print-alignment | if set, uses attention feedback to compute and print alignment to source tokens Default: False |
--print-step | Default: False |
--lm-path | path to lm checkpoint for lm fusion |
--lm-weight | weight for lm probs for lm fusion Default: 0.0 |
--iter-decode-eos-penalty | if > 0.0, it penalizes early-stopping in decoding. Default: 0.0 |
--iter-decode-max-iter | maximum iterations for iterative refinement. Default: 10 |
--iter-decode-force-max-iter | if set, run exactly the maximum number of iterations without early stop Default: False |
--iter-decode-with-beam | if > 1, model will generate translations varying by the lengths. Default: 1 |
--iter-decode-with-external-reranker | if set, the last checkpoint is assumed to be a reranker to rescore the translations Default: False |
--retain-iter-history | if set, decoding returns the whole history of iterative refinement Default: False |
--retain-dropout | Use dropout at inference time Default: False |
--retain-dropout-modules | if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules |
--decoding-format | Possible choices: unigram, ensemble, vote, dp, bs |
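The effect of --lenpen (values below 1.0 favor shorter output, values above 1.0 favor longer output) can be seen in a small sketch. It assumes hypotheses are ranked by the sum of token log-probabilities divided by length raised to the penalty; the function name and the example scores are illustrative only.

```python
def normalized_score(token_logprobs, lenpen=1.0):
    """Sum of log-probabilities divided by length ** lenpen (assumed form)."""
    return sum(token_logprobs) / (len(token_logprobs) ** lenpen)


short = [-0.9, -0.9]             # 2 tokens, total log-probability -1.8
long = [-0.8, -0.8, -0.8, -0.8]  # 4 tokens, total log-probability -3.2

for lenpen in (0.5, 1.0, 1.5):
    better = "short" if normalized_score(short, lenpen) > normalized_score(long, lenpen) else "long"
    print(lenpen, better)
```

Because log-probabilities are negative, dividing by a larger power of the length moves long hypotheses closer to zero, which is why a larger penalty value favors them.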
Interactive¶
--buffer-size | read this many sentences into a buffer before processing them Default: 0 |
--input | file to read from; use - for stdin Default: “-” |
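The buffering behaviour described by --buffer-size can be sketched as a generator that groups input lines into chunks before they are processed. This is an illustration under assumptions: the helper name is chosen for the example, and treating values below 1 as a buffer of one sentence is an assumption.

```python
def buffered_read(lines, buffer_size):
    """Yield lists of up to buffer_size sentences read from an iterable of lines."""
    buffer_size = max(1, buffer_size)  # assumption: sizes below 1 behave like 1
    buffer = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) >= buffer_size:
            yield buffer
            buffer = []
    if buffer:
        # Flush the final, possibly smaller, chunk.
        yield buffer


for chunk in buffered_read(["a\n", "b\n", "c\n"], buffer_size=2):
    print(chunk)
```

Passing `sys.stdin` as `lines` would correspond to the `-` value of --input.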
fairseq-score¶
BLEU scoring of generated translations against reference translations.
Command-line script for BLEU scoring.
usage: fairseq-score [-h] [-s SYS] -r REF [-o N] [--ignore-case] [--sacrebleu]
[--sentence-bleu]
Named Arguments¶
-s, --sys | system output Default: “-” |
-r, --ref | references |
-o, --order | consider ngrams up to this order Default: 4 |
--ignore-case | case-insensitive scoring Default: False |
--sacrebleu | score with sacrebleu Default: False |
--sentence-bleu | report sentence-level BLEUs (i.e., with +1 smoothing) Default: False |
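To make --order and the "+1 smoothing" of --sentence-bleu more concrete, the sketch below computes clipped n-gram precisions up to a given order. It is not fairseq's scorer: the function names are hypothetical, the brevity penalty and the geometric mean are omitted, and adding one to both the match count and the total is only one common reading of "+1 smoothing".

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precisions(sys_tokens, ref_tokens, order=4, smooth=False):
    """Clipped n-gram precisions for n = 1..order."""
    precisions = []
    for n in range(1, order + 1):
        sys_counts = ngram_counts(sys_tokens, n)
        ref_counts = ngram_counts(ref_tokens, n)
        # Each system n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in sys_counts.items())
        total = sum(sys_counts.values())
        if smooth:
            matches, total = matches + 1, total + 1  # assumed +1 smoothing
        precisions.append(matches / total if total else 0.0)
    return precisions


hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(modified_precisions(hyp, ref, order=2))
print(modified_precisions(hyp, ref, order=2, smooth=True))
```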
fairseq-eval-lm¶
Evaluate the perplexity of a trained language model.
usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
[--cpu] [--tpu] [--bf16] [--memory-efficient-bf16]
[--fp16] [--memory-efficient-fp16]
[--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile]
[--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {nltk,space,moses}]
[--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
[--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
[--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
[--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK]
[--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
[--find-unused-parameters] [--fast-stat-sync]
[--broadcast-buffers]
[--distributed-wrapper {DDP,SlowMo}]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--path PATH]
[--remove-bpe [REMOVE_BPE]] [--quiet]
[--model-overrides MODEL_OVERRIDES]
[--results-path RESULTS_PATH] [--output-word-probs]
[--output-word-stats] [--context-window CONTEXT_WINDOW]
[--softmax-batch SOFTMAX_BATCH]
Named Arguments¶
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--min-loss-scale | minimum FP16 loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--checkpoint-suffix | suffix to add to the checkpoint file name Default: “” |
--checkpoint-shard-count | Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--criterion | Possible choices: sentence_prediction, ctc, adaptive_loss, label_smoothed_cross_entropy, composite_loss, nat_loss, masked_lm, sentence_ranking, legacy_masked_lm_loss, cross_entropy, wav2vec, label_smoothed_cross_entropy_with_alignment, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: nltk, space, moses |
--bpe | Possible choices: gpt2, bytes, sentencepiece, subword_nmt, byte_bpe, characters, bert, fastbpe, hf_byte_bpe |
--optimizer | Possible choices: adadelta, adam, adafactor, adagrad, lamb, nag, adamax, sgd |
--lr-scheduler | Possible choices: cosine, reduce_lr_on_plateau, fixed, triangular, polynomial_decay, tri_stage, inverse_sqrt Default: “fixed” |
--scoring | Possible choices: chrf, wer, sacrebleu, bleu Default: “bleu” |
--task | Possible choices: sentence_prediction, translation, translation_from_pretrained_xlm, denoising, multilingual_translation, semisupervised_translation, cross_lingual_lm, multilingual_denoising, translation_from_pretrained_bart, masked_lm, sentence_ranking, speech_to_text, audio_pretraining, legacy_masked_lm, translation_multi_simple_epoch, multilingual_masked_lm, language_modeling, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “language_modeling” |
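The FP16 loss-scale options above (--fp16-init-scale, --fp16-scale-window, --fp16-scale-tolerance, --min-loss-scale) interact with each other, so a small illustration may help. The following is a minimal sketch of dynamic loss scaling under stated assumptions, not fairseq's actual implementation: the class name `LossScaler`, the factor of 2 used for growing and shrinking the scale, and the example `scale_window=2` are all hypothetical choices made for illustration.

```python
class LossScaler:
    """Toy dynamic loss scaler (hypothetical; not fairseq's implementation)."""

    def __init__(self, init_scale=128.0, scale_window=2,
                 tolerance=0.0, min_loss_scale=1e-4):
        self.scale = init_scale               # cf. --fp16-init-scale
        self.scale_window = scale_window      # cf. --fp16-scale-window
        self.tolerance = tolerance            # cf. --fp16-scale-tolerance
        self.min_loss_scale = min_loss_scale  # cf. --min-loss-scale
        self.updates_since_rescale = 0
        self.overflows_since_rescale = 0
        self.clean_updates = 0

    def update(self, overflow):
        """Adjust the scale after one update; `overflow` marks an inf/NaN gradient."""
        self.updates_since_rescale += 1
        if overflow:
            self.clean_updates = 0
            self.overflows_since_rescale += 1
            rate = self.overflows_since_rescale / self.updates_since_rescale
            if rate >= self.tolerance:
                # Too many overflows: halve the scale (factor 2 is an assumption).
                self.scale /= 2.0
                self.updates_since_rescale = 0
                self.overflows_since_rescale = 0
                if self.scale < self.min_loss_scale:
                    raise FloatingPointError("minimum loss scale reached")
        else:
            self.clean_updates += 1
            if self.clean_updates % self.scale_window == 0:
                # A full window without overflow: try a larger scale.
                self.scale *= 2.0


scaler = LossScaler(init_scale=128.0, scale_window=2)
scaler.update(overflow=False)
scaler.update(overflow=False)  # two clean updates: scale doubles
scaler.update(overflow=True)   # overflow with tolerance 0.0: scale halves
```

In this sketch a tolerance of 0.0 means every overflow shrinks the scale, while a larger tolerance lets a fraction of updates overflow before the scale is reduced.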
dataset_data_loading¶
--num-workers | how many subprocesses to use for data loading Default: 1 |
--skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set Default: False |
--max-tokens | maximum number of tokens in a batch |
--batch-size | number of examples in a batch |
--required-batch-size-multiple | batch size will be a multiple of this value Default: 8 |
--required-seq-len-multiple | maximum sequence length in batch will be a multiple of this value Default: 1 |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta output dataset implementation |
--data-buffer-size | Number of batches to preload Default: 10 |
--train-subset | data subset to use for training (e.g. train, valid, test) Default: “train” |
--valid-subset | comma separated list of data subsets to use for validation (e.g. train, valid, test) Default: “valid” |
--validate-interval | validate every N epochs Default: 1 |
--validate-interval-updates | validate every N updates Default: 0 |
--validate-after-updates | don’t validate until reaching this many updates Default: 0 |
--fixed-validation-seed | specified random seed for validation |
--disable-validation | disable validation Default: False |
--max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) |
--batch-size-valid | batch size of the validation batch (defaults to --batch-size) |
--curriculum | don’t shuffle batches for first N epochs Default: 0 |
--gen-subset | data subset to generate (train, valid, test) Default: “test” |
--num-shards | shard generation over N shards Default: 1 |
--shard-id | id of the shard to generate (id < num_shards) Default: 0 |
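As an illustration of how --num-shards and --shard-id relate, the sketch below splits a list of item indices across shards. The round-robin assignment and the function name `shard_indices` are assumptions made for this example; the way fairseq actually assigns work to shards may differ.

```python
def shard_indices(num_items, num_shards=1, shard_id=0):
    """Return the item indices handled by one shard (hypothetical round-robin split)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if not 0 <= shard_id < num_shards:
        # Mirrors the documented constraint: id < num_shards.
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    return list(range(shard_id, num_items, num_shards))


# Three workers, each called with a different shard id, cover all ten items once.
parts = [shard_indices(10, num_shards=3, shard_id=i) for i in range(3)]
```

With the defaults (one shard, id 0) the single shard simply covers every item.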
distributed_training¶
--distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) Default: 1 |
--distributed-rank | rank of the current worker Default: 0 |
--distributed-backend | distributed backend Default: “nccl” |
--distributed-init-method | typically tcp://hostname:port that will be used to establish initial connection |
--distributed-port | port number (not required if using --distributed-init-method) Default: -1 |
--device-id, --local_rank | which GPU to use (usually configured automatically) Default: 0 |
--distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible Default: False |
--ddp-backend | Possible choices: c10d, no_c10d DistributedDataParallel backend Default: “c10d” |
--bucket-cap-mb | bucket size for reduction Default: 25 |
--fix-batches-to-gpus | don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data Default: False |
--find-unused-parameters | disable unused parameter detection (not applicable to no_c10d ddp-backend) Default: False |
--fast-stat-sync | [deprecated] this is now defined per Criterion Default: False |
--broadcast-buffers | Copy non-trainable parameters between GPUs, such as batchnorm population statistics Default: False |
--distributed-wrapper | Possible choices: DDP, SlowMo DistributedDataParallel backend Default: “DDP” |
--slowmo-momentum | SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs |
--slowmo-algorithm | whether to use LocalSGD or SGP Default: “LocalSGD” |
--localsgd-frequency | Local SGD allreduce frequency Default: 3 |
--nprocs-per-node | number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes Default: 1 |
--pipeline-model-parallel | if set, use pipeline model parallelism across GPUs Default: False |
--pipeline-balance | partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model |
--pipeline-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument |
--pipeline-chunks | microbatch count for pipeline model parallelism Default: 0 |
--pipeline-encoder-balance | partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model |
--pipeline-encoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument |
--pipeline-decoder-balance | partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model |
--pipeline-decoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument |
--pipeline-checkpoint | Possible choices: always, never, except_last checkpointing mode for pipeline model parallelism Default: “never” |
--zero-sharding | Possible choices: none, os ZeRO sharding Default: “none” |
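The two constraints stated for --pipeline-balance and --pipeline-devices (the balance entries sum to the number of layers, and both lists have the same length) can be checked with a few lines of code. This is a minimal sketch, not fairseq code: the function name `check_pipeline_partition` is hypothetical, and it assumes that consecutive layers are assigned to partitions in order.

```python
def check_pipeline_partition(balance, devices, num_layers):
    """Validate a balance/devices pair and return the device index for each layer."""
    if sum(balance) != num_layers:
        raise ValueError(
            f"sum(balance)={sum(balance)} must equal the number of layers ({num_layers})"
        )
    if len(devices) != len(balance):
        raise ValueError("devices and balance must have the same length")
    layer_to_device = []
    for n_layers, device in zip(balance, devices):
        # Assumption: each partition holds a contiguous run of layers.
        layer_to_device.extend([device] * n_layers)
    return layer_to_device


# Twelve layers split into three partitions of four layers on devices 0, 1, 2.
placement = check_pipeline_partition([4, 4, 4], [0, 1, 2], num_layers=12)
```

The same check would apply separately to the encoder and decoder variants of these options, using the encoder or decoder layer count.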
LM Evaluation¶
--path | path(s) to model file(s), colon separated |
--remove-bpe | remove BPE tokens before scoring (can be set to sentencepiece) |
--quiet | only print final scores Default: False |
--model-overrides | a dictionary used to override model args at generation that were used during model training Default: “{}” |
--results-path | path to save eval results (optional) |
--output-word-probs | if set, outputs words and their predicted log probabilities to standard output Default: False |
--output-word-stats | if set, outputs word statistics such as word count, average probability, etc. Default: False |
--context-window | ensures that every evaluated token has access to a context of at least this size, if possible Default: 0 |
--softmax-batch | if B×T (batch size × sequence length) exceeds this value, the softmax over the vocabulary is computed in batches of this many tokens in order to fit into GPU memory Default: 9223372036854775807 |
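To make the effect of --context-window more concrete, the sketch below cuts a token stream into evaluation blocks so that each scored token sees at least `context_window` preceding tokens where possible. It is an illustration under stated assumptions, not fairseq's implementation: the function name `context_window_blocks` and the `block_size` parameter (the number of tokens the model sees per sample) are hypothetical.

```python
def context_window_blocks(num_tokens, block_size, context_window=0):
    """Return (context_start, score_start, end) index triples.

    Tokens in [context_start, score_start) serve only as context;
    tokens in [score_start, end) are the ones that get scored.
    """
    if not 0 <= context_window < block_size:
        raise ValueError("context_window must satisfy 0 <= context_window < block_size")
    blocks = []
    score_start = 0
    while score_start < num_tokens:
        # Reach back for context, but never before the start of the stream.
        context_start = max(0, score_start - context_window)
        end = min(num_tokens, context_start + block_size)
        blocks.append((context_start, score_start, end))
        score_start = end
    return blocks


blocks = context_window_blocks(num_tokens=10, block_size=4, context_window=2)
```

In this sketch every token is scored exactly once, and a larger context window means fewer newly scored tokens per block, so evaluation takes more forward passes.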