Command-line Tools
Fairseq provides several command-line tools for training and evaluating models:
- fairseq-preprocess: Data pre-processing: build vocabularies and binarize training data
- fairseq-train: Train a new model on one or multiple GPUs
- fairseq-generate: Translate pre-processed data with a trained model
- fairseq-interactive: Translate raw text with a trained model
- fairseq-score: BLEU scoring of generated translations against reference translations
- fairseq-eval-lm: Language model evaluation
fairseq-preprocess
Data pre-processing: build vocabularies and binarize training data.
usage: fairseq-preprocess [-h] [--no-progress-bar]
[--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--log-file LOG_FILE] [--aim-repo AIM_REPO]
[--aim-run-hash AIM_RUN_HASH]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--wandb-project WANDB_PROJECT] [--azureml-logging]
[--seed SEED] [--cpu] [--tpu] [--bf16]
[--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--on-cpu-convert-precision]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--amp] [--amp-batch-retries AMP_BATCH_RETRIES]
[--amp-init-scale AMP_INIT_SCALE]
[--amp-scale-window AMP_SCALE_WINDOW]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile] [--reset-logging] [--suppress-crashes]
[--use-plasma-view] [--plasma-path PLASMA_PATH]
[--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}]
[--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
[--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
[--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
[--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
[--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
[--validpref FP] [--testpref FP] [--align-suffix FP]
[--destdir DIR] [--thresholdtgt N]
[--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
[--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
[--dataset-impl FORMAT] [--joined-dictionary]
[--only-source] [--padding-factor N] [--workers N]
[--dict-only]
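As a rough illustration of how the options in this synopsis combine, the sketch below assembles a typical invocation as an argument list. The paths and language codes are hypothetical, and it assumes the parallel files are named by prefix plus language suffix (for example data/train.de and data/train.en); only flags listed in the usage above are used.

```python
import shlex

# Hypothetical inputs: data/train.de, data/train.en, data/valid.de, ... (assumed layout).
cmd = [
    "fairseq-preprocess",
    "--source-lang", "de",
    "--target-lang", "en",
    "--trainpref", "data/train",     # also used to build the dictionaries
    "--validpref", "data/valid",
    "--testpref", "data/test",
    "--destdir", "data-bin/example",  # hypothetical output directory
    "--workers", "4",
]

# Print a copy-pastable shell command; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting mistakes when paths contain spaces.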
Named Arguments
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--log-file | log file to copy metrics to. |
--aim-repo | path to Aim repository |
--aim-run-hash | Aim run hash. If skipped, creates or continues run based on save_dir |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--wandb-project | Weights and Biases project name to use for logging |
--azureml-logging | Log scalars to AzureML context Default: False |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--on-cpu-convert-precision | if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage. Default: False |
--min-loss-scale | minimum FP16/AMP loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--amp | use automatic mixed precision Default: False |
--amp-batch-retries | number of retries of same batch after reducing loss scale with AMP Default: 2 |
--amp-init-scale | default AMP loss scale Default: 128 |
--amp-scale-window | number of updates before increasing AMP loss scale |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--reset-logging | when using Hydra, reset the logging at the beginning of training Default: False |
--suppress-crashes | suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps) Default: False |
--use-plasma-view | Store indices and sizes in shared memory Default: False |
--plasma-path | path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail. Default: “/tmp/plasma” |
--criterion | Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: moses, nltk, space |
--bpe | Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt |
--optimizer | Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd |
--lr-scheduler | Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular Default: “fixed” |
--scoring | Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer Default: “bleu” |
--task | Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “translation” |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta, huffman output dataset implementation Default: “mmap” |
Preprocessing
-s, --source-lang | source language |
-t, --target-lang | target language |
--trainpref | train file prefix (also used to build dictionaries) |
--validpref | comma separated, valid file prefixes (words missing from train set are replaced with <unk>) |
--testpref | comma separated, test file prefixes (words missing from train set are replaced with <unk>) |
--align-suffix | alignment file suffix |
--destdir | destination dir Default: “data-bin” |
--thresholdtgt | map words appearing less than threshold times to unknown Default: 0 |
--thresholdsrc | map words appearing less than threshold times to unknown Default: 0 |
--tgtdict | reuse given target dictionary |
--srcdict | reuse given source dictionary |
--nwordstgt | number of target words to retain Default: -1 |
--nwordssrc | number of source words to retain Default: -1 |
--alignfile | an alignment file (optional) |
--joined-dictionary | Generate joined dictionary Default: False |
--only-source | Only process the source language Default: False |
--padding-factor | Pad dictionary size to be multiple of N Default: 8 |
--workers | number of parallel workers Default: 1 |
--dict-only | if true, only builds a dictionary and then exits Default: False |
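The interaction of the threshold, word-count, and padding options can be sketched as follows. This is a simplified illustration, not fairseq's implementation: the special symbols, the filler-entry naming, and the choice to apply the word limit to regular words only are all assumptions made for the example.

```python
from collections import Counter

SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]  # assumed special symbols


def build_vocab(tokens, threshold=0, nwords=-1, padding_factor=8):
    """Prune a vocabulary in the spirit of --thresholdsrc/--nwordssrc/--padding-factor."""
    counts = Counter(tokens)
    # Keep words seen at least `threshold` times, most frequent first.
    kept = [w for w, c in counts.most_common() if c >= threshold]
    if nwords >= 0:
        kept = kept[:nwords]  # simplification: limit applies to regular words only
    vocab = SPECIALS + kept
    # Pad with filler entries until the size is a multiple of padding_factor.
    i = 0
    while padding_factor > 1 and len(vocab) % padding_factor != 0:
        vocab.append(f"filler{i:04d}")  # hypothetical filler name
        i += 1
    return vocab


def encode(sentence, vocab):
    """Map whitespace-separated words to indices; dropped words become <unk>."""
    index = {w: i for i, w in enumerate(vocab)}
    unk = index["<unk>"]
    return [index.get(w, unk) for w in sentence.split()]


vocab = build_vocab("a a a b b c".split(), threshold=2)
print(len(vocab), encode("a c", vocab))
```

Here "c" occurs only once, so with a threshold of 2 it is dropped from the vocabulary and encoded as the unknown symbol.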
fairseq-train
Train a new model on one or multiple GPUs.
usage: fairseq-train [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--log-file LOG_FILE] [--aim-repo AIM_REPO]
[--aim-run-hash AIM_RUN_HASH]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--wandb-project WANDB_PROJECT] [--azureml-logging]
[--seed SEED] [--cpu] [--tpu] [--bf16]
[--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--on-cpu-convert-precision]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
[--amp-batch-retries AMP_BATCH_RETRIES]
[--amp-init-scale AMP_INIT_SCALE]
[--amp-scale-window AMP_SCALE_WINDOW]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile] [--reset-logging] [--suppress-crashes]
[--use-plasma-view] [--plasma-path PLASMA_PATH]
[--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}]
[--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
[--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
[--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
[--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
[--task TASK] [--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET] [--combine-valid-subsets]
[--ignore-unused-valid-subsets]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--max-valid-steps MAX_VALID_STEPS]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--grouped-shuffling]
[--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
[--update-ordered-indices-seed]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-num-procs DISTRIBUTED_NUM_PROCS]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
[--ddp-comm-hook {none,fp16}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
[--find-unused-parameters] [--gradient-as-bucket-view]
[--fast-stat-sync]
[--heartbeat-timeout HEARTBEAT_TIMEOUT]
[--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--no-reshard-after-forward]
[--fp32-reduce-scatter] [--cpu-offload]
[--use-sharded-state] [--not-fsdp-flatten-parameters]
[--arch ARCH] [--max-epoch MAX_EPOCH]
[--max-update MAX_UPDATE]
[--stop-time-hours STOP_TIME_HOURS]
[--clip-norm CLIP_NORM] [--sentence-avg]
[--update-freq UPDATE_FREQ] [--lr LR]
[--stop-min-lr STOP_MIN_LR] [--use-bmuf]
[--skip-remainder-batch] [--save-dir SAVE_DIR]
[--restore-file RESTORE_FILE]
[--continue-once CONTINUE_ONCE]
[--finetune-from-model FINETUNE_FROM_MODEL]
[--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer]
[--optimizer-overrides OPTIMIZER_OVERRIDES]
[--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES]
[--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
[--keep-last-epochs KEEP_LAST_EPOCHS]
[--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
[--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-save-optimizer-state]
[--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
[--maximize-best-checkpoint-metric] [--patience PATIENCE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--load-checkpoint-on-all-dp-ranks]
[--write-checkpoints-asynchronously] [--store-ema]
[--ema-decay EMA_DECAY]
[--ema-start-update EMA_START_UPDATE]
[--ema-seed-model EMA_SEED_MODEL]
[--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
Named Arguments
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--log-file | log file to copy metrics to. |
--aim-repo | path to Aim repository |
--aim-run-hash | Aim run hash. If skipped, creates or continues run based on save_dir |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--wandb-project | Weights and Biases project name to use for logging |
--azureml-logging | Log scalars to AzureML context Default: False |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--on-cpu-convert-precision | if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage. Default: False |
--min-loss-scale | minimum FP16/AMP loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--amp | use automatic mixed precision Default: False |
--amp-batch-retries | number of retries of same batch after reducing loss scale with AMP Default: 2 |
--amp-init-scale | default AMP loss scale Default: 128 |
--amp-scale-window | number of updates before increasing AMP loss scale |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--reset-logging | when using Hydra, reset the logging at the beginning of training Default: False |
--suppress-crashes | suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps) Default: False |
--use-plasma-view | Store indices and sizes in shared memory Default: False |
--plasma-path | path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail. Default: “/tmp/plasma” |
--criterion | Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: moses, nltk, space |
--bpe | Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt |
--optimizer | Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd |
--lr-scheduler | Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular Default: “fixed” |
--scoring | Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer Default: “bleu” |
--task | Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “translation” |
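The loss-scale options above (--fp16-init-scale, --fp16-scale-window, --fp16-scale-tolerance, --min-loss-scale) describe a dynamic loss-scaling loop. The following is a minimal sketch of that idea, not fairseq's own scaler: the factor of 2 for growing and shrinking the scale and the exact bookkeeping are assumptions chosen for illustration.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler; growth/shrink factor of 2 is an assumption."""

    def __init__(self, init_scale=128.0, scale_window=3, tolerance=0.0, min_scale=1e-4):
        self.scale = init_scale
        self.scale_window = scale_window
        self.tolerance = tolerance
        self.min_scale = min_scale
        self.since_rescale = 0  # updates since the scale last changed
        self.overflows = 0      # overflows since the scale last changed
        self.clean = 0          # consecutive overflow-free updates

    def update(self, overflow):
        self.since_rescale += 1
        if overflow:
            self.overflows += 1
            self.clean = 0
            # Shrink only when the share of overflowing updates exceeds the tolerance.
            if self.overflows / self.since_rescale > self.tolerance:
                self.scale /= 2.0
                self.since_rescale = self.overflows = 0
                if self.scale < self.min_scale:
                    raise FloatingPointError("loss scale fell below the minimum")
        else:
            self.clean += 1
            # Grow after `scale_window` overflow-free updates in a row.
            if self.clean % self.scale_window == 0:
                self.scale *= 2.0
                self.since_rescale = self.overflows = 0


scaler = DynamicLossScaler(init_scale=128.0, scale_window=3)
for overflow in [False, False, False, True]:
    scaler.update(overflow)
print(scaler.scale)
```

With a tolerance of 0.0, every overflow shrinks the scale immediately; a larger tolerance lets occasional overflows pass without a rescale.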
dataset_data_loading
--num-workers | how many subprocesses to use for data loading Default: 1 |
--skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set Default: False |
--max-tokens | maximum number of tokens in a batch |
--batch-size, --max-sentences | number of examples in a batch |
--required-batch-size-multiple | batch size will be a multiple of this value Default: 8 |
--required-seq-len-multiple | maximum sequence length in batch will be a multiple of this value Default: 1 |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta, huffman output dataset implementation |
--data-buffer-size | Number of batches to preload Default: 10 |
--train-subset | data subset to use for training (e.g. train, valid, test) Default: “train” |
--valid-subset | comma separated list of data subsets to use for validation (e.g. train, valid, test) Default: “valid” |
--combine-valid-subsets, --combine-val | if set, combine validation subsets when loading them |
--ignore-unused-valid-subsets | do not raise error if valid subsets are ignored Default: False |
--validate-interval | validate every N epochs Default: 1 |
--validate-interval-updates | validate every N updates Default: 0 |
--validate-after-updates | don’t validate until reaching this many updates Default: 0 |
--fixed-validation-seed | specified random seed for validation |
--disable-validation | disable validation Default: False |
--max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) |
--batch-size-valid, --max-sentences-valid | batch size of the validation batch (defaults to --batch-size) |
--max-valid-steps, --nval | How many batches to evaluate |
--curriculum | don’t shuffle batches for first N epochs Default: 0 |
--gen-subset | data subset to generate (train, valid, test) Default: “test” |
--num-shards | shard generation over N shards Default: 1 |
--shard-id | id of the shard to generate (id < num_shards) Default: 0 |
--grouped-shuffling | shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length Default: False |
--update-epoch-batch-itr | if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false; defaults to --grouped-shuffling |
--update-ordered-indices-seed | if true, increment the seed with the epoch when getting batch iterators; defaults to False. Default: False |
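To make the roles of --max-tokens and --batch-size more concrete, here is a small sketch of length-based batching. It is a simplified greedy grouping under the assumption that a batch costs its longest sequence times its sentence count (that is, padded tokens); it ignores --required-batch-size-multiple and is not fairseq's batching code.

```python
def batch_by_size(lengths, max_tokens=None, batch_size=None):
    """Greedily group example indices so each batch respects both limits."""
    # Sort by length so sequences of similar size share a batch (less padding).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, longest = [], [], 0
    for i in order:
        new_longest = max(longest, lengths[i])
        too_many_tokens = (
            max_tokens is not None and new_longest * (len(current) + 1) > max_tokens
        )
        too_many_sents = batch_size is not None and len(current) + 1 > batch_size
        if current and (too_many_tokens or too_many_sents):
            batches.append(current)  # close the batch and start a new one
            current, new_longest = [], lengths[i]
        current.append(i)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


lengths = [5, 3, 8, 2, 7]
print(batch_by_size(lengths, max_tokens=16))
print(batch_by_size(lengths, batch_size=2))
```

Limiting by tokens rather than sentences keeps memory use roughly constant across batches of short and long sequences.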
distributed_training
--distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) Default: 1 |
--distributed-num-procs | total number of processes to fork (default: all visible GPUs) Default: 1 |
--distributed-rank | rank of the current worker Default: 0 |
--distributed-backend | distributed backend Default: “nccl” |
--distributed-init-method | typically tcp://hostname:port that will be used to establish initial connection |
--distributed-port | port number (not required if using --distributed-init-method) Default: -1 |
--device-id, --local_rank | which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically) Default: 0 |
--distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible Default: False |
--ddp-backend | Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo DistributedDataParallel backend Default: “pytorch_ddp” |
--ddp-comm-hook | Possible choices: none, fp16 communication hook Default: “none” |
--bucket-cap-mb | bucket size for reduction Default: 25 |
--fix-batches-to-gpus | don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data Default: False |
--find-unused-parameters | enable unused parameter detection (not applicable to --ddp-backend=legacy_ddp) Default: False |
--gradient-as-bucket-view | when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size. Default: False |
--fast-stat-sync | [deprecated] this is now defined per Criterion Default: False |
--heartbeat-timeout | kill the job if no progress is made in N seconds; set to -1 to disable Default: -1 |
--broadcast-buffers | Copy non-trainable parameters between GPUs, such as batchnorm population statistics Default: False |
--slowmo-momentum | SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, 0.6 for > 64 GPUs |
--slowmo-base-algorithm | Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details Default: “localsgd” |
--localsgd-frequency | Local SGD allreduce frequency Default: 3 |
--nprocs-per-node | number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes Default: 1 |
--pipeline-model-parallel | if set, use pipeline model parallelism across GPUs Default: False |
--pipeline-balance | partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model |
--pipeline-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument |
--pipeline-chunks | microbatch count for pipeline model parallelism Default: 0 |
--pipeline-encoder-balance | partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model |
--pipeline-encoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument |
--pipeline-decoder-balance | partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model |
--pipeline-decoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument |
--pipeline-checkpoint | Possible choices: always, never, except_last checkpointing mode for pipeline model parallelism Default: “never” |
--zero-sharding | Possible choices: none, os ZeRO sharding Default: “none” |
--no-reshard-after-forward | don’t reshard parameters after forward pass Default: False |
--fp32-reduce-scatter | reduce-scatter grads in FP32 Default: False |
--cpu-offload | offload FP32 params to CPU Default: False |
--use-sharded-state | use sharded checkpoint files Default: False |
--not-fsdp-flatten-parameters | do not flatten parameters for FSDP Default: False |
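The two constraints stated for --pipeline-balance and --pipeline-devices (the balance sums to the layer count, and both lists have the same length) can be checked with a few lines of code. The helper below is hypothetical and only illustrates those constraints; it does not reproduce how fairseq parses or applies these arguments.

```python
def assign_layers(num_layers, balance, devices):
    """Map each layer index to a device, validating the documented constraints."""
    if sum(balance) != num_layers:
        raise ValueError("sum of the balance must equal the number of layers")
    if len(devices) != len(balance):
        raise ValueError("need exactly one device per partition")
    placement, layer = {}, 0
    for n_layers, device in zip(balance, devices):
        # The next `n_layers` consecutive layers form one partition on `device`.
        for _ in range(n_layers):
            placement[layer] = device
            layer += 1
    return placement


# Six layers split into partitions of 2 and 4 layers on devices 0 and 1.
print(assign_layers(6, balance=[2, 4], devices=[0, 1]))
```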
Model configuration
--arch, -a | Possible choices: transformer_tiny, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_from_pretrained_xlm, transformer_align, transformer_wmt_en_de_big_align, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, roberta, roberta_prenorm, roberta_base, roberta_large, xlm, roberta_enc_dec, xmod_base_13, xmod_base_30, xmod_base_60, xmod_base_75, xmod_base, xmod_large_prenorm, s2t_berard, s2t_berard_256_3_3, s2t_berard_512_3_2, s2t_berard_512_5_3, convtransformer, convtransformer_espnet, s2t_transformer, s2t_transformer_s, s2t_transformer_xs, s2t_transformer_sp, s2t_transformer_m, s2t_transformer_mp, s2t_transformer_l, s2t_transformer_lp, wav2vec, wav2vec2, wav2vec_ctc, wav2vec_seq2seq, xm_transformer, s2t_conformer, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, masked_lm, bert_base, bert_large, xlm_base, tacotron_2, tts_transformer, fastspeech2, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, lightconv_lm, lightconv_lm_gbw, lstm_lm, s2ut_transformer, s2ut_transformer_fisher, s2spect_transformer, s2spect_transformer_fisher, s2ut_conformer, hf_gpt2, hf_gpt2_medium, hf_gpt2_large, hf_gpt2_xl, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_tiny, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, transformer_lm_gpt2_big_wide, transformer_lm_gpt2_bigger, transformer_lm_gpt3_small, transformer_lm_gpt3_medium, transformer_lm_gpt3_large, transformer_lm_gpt3_xl, transformer_lm_gpt3_2_7, transformer_lm_gpt3_6_7, transformer_lm_gpt3_13, transformer_lm_gpt3_175, multilingual_transformer, multilingual_transformer_iwslt_de_en, bart_large, bart_base, 
mbart_large, mbart_base, mbart_base_wmt20, transformer_ulm, transformer_ulm_big, transformer_ulm_tiny, hubert, hubert_ctc, hubert_seq2seq, fconv_self_att, fconv_self_att_wp, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, nonautoregressive_transformer, nonautoregressive_transformer_wmt_en_de, nacrf_transformer, iterative_nonautoregressive_transformer, iterative_nonautoregressive_transformer_wmt_en_de, cmlm_transformer, cmlm_transformer_wmt_en_de, levenshtein_transformer, levenshtein_transformer_wmt_en_de, levenshtein_transformer_vaswani_wmt_en_de_big, levenshtein_transformer_wmt_en_de_big, insertion_transformer, dummy_model, model_parallel_roberta, model_parallel_roberta_v1, model_parallel_roberta_postnorm, model_parallel_roberta_base, model_parallel_roberta_large, transformer_iwslt_de_en_pipeline_parallel, transformer_wmt_en_de_big_pipeline_parallel, transformer_lm_megatron, transformer_lm_megatron_11b model architecture |
optimization
--max-epoch | force stop training at specified epoch Default: 0 |
--max-update | force stop training at specified update Default: 0 |
--stop-time-hours | force stop training after specified cumulative time (if >0) Default: 0 |
--clip-norm | clip threshold of gradients Default: 0.0 |
--sentence-avg | normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens) Default: False |
--update-freq | update parameters every N_i batches, when in epoch i Default: 1 |
--lr | learning rate for the first N epochs; all epochs >N using LR_N (note: this may be interpreted differently depending on --lr-scheduler) Default: 0.25 |
--stop-min-lr | stop training when the learning rate reaches this minimum Default: -1.0 |
--use-bmuf | specify global optimizer for syncing models on different GPUs/shards Default: False |
--skip-remainder-batch | if set, skip the last (partial) batch of each epoch in training (default is to include it). Default: False |
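Both --update-freq and --lr accept one value per epoch, with the last value reused for all later epochs. A minimal sketch of that lookup is shown below, together with a rough upper bound on tokens per parameter update. The helper names are hypothetical, epochs are assumed to be numbered from 1, and the token estimate assumes every GPU processes a full --max-tokens batch.

```python
def value_for_epoch(values, epoch):
    """Return the per-epoch setting; the last entry covers all later epochs."""
    return values[min(epoch, len(values)) - 1]


def tokens_per_update(max_tokens, update_freq, world_size):
    """Upper bound on tokens contributing to one parameter update."""
    return max_tokens * update_freq * world_size


update_freq = [4, 2, 1]  # e.g. accumulate 4 batches in epoch 1, 2 in epoch 2, then 1
for epoch in (1, 2, 3, 10):
    print(epoch, value_for_epoch(update_freq, epoch))

print(tokens_per_update(max_tokens=4096, update_freq=4, world_size=8))
```

Raising the update frequency is a common way to emulate a larger number of GPUs, since gradients are accumulated over several batches before each update.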
checkpoint
--save-dir | path to save checkpoints Default: “checkpoints” |
--restore-file | filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt) Default: “checkpoint_last.pt” |
--continue-once | continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present |
--finetune-from-model | finetune from a pretrained model; note that meters and lr scheduler will be reset |
--reset-dataloader | if set, does not reload dataloader state from the checkpoint Default: False |
--reset-lr-scheduler | if set, does not load lr scheduler state from the checkpoint Default: False |
--reset-meters | if set, does not load meters from the checkpoint Default: False |
--reset-optimizer | if set, does not load optimizer state from the checkpoint Default: False |
--optimizer-overrides | a dictionary used to override optimizer args when loading a checkpoint Default: “{}” |
--save-interval | save a checkpoint every N epochs Default: 1 |
--save-interval-updates | save a checkpoint (and validate) every N updates Default: 0 |
--keep-interval-updates | keep the last N checkpoints saved with --save-interval-updates Default: -1 |
--keep-interval-updates-pattern | when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0 Default: -1 |
--keep-last-epochs | keep last N epoch checkpoints Default: -1 |
--keep-best-checkpoints | keep best N checkpoints based on scores Default: -1 |
--no-save | don’t save models or checkpoints Default: False |
--no-epoch-checkpoints | only store last and best checkpoints Default: False |
--no-last-checkpoints | don’t store last checkpoints Default: False |
--no-save-optimizer-state | don’t save optimizer-state as part of checkpoint Default: False |
--best-checkpoint-metric | metric to use for saving “best” checkpoints Default: “loss” |
--maximize-best-checkpoint-metric | select the largest metric value for saving “best” checkpoints Default: False |
--patience | early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval Default: -1 |
--checkpoint-suffix | suffix to add to the checkpoint file name Default: “” |
--checkpoint-shard-count | Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint Default: 1 |
--load-checkpoint-on-all-dp-ranks | load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices) Default: False |
--write-checkpoints-asynchronously, --save-async | Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested. Default: False |
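The interaction between --keep-interval-updates and --keep-interval-updates-pattern is easier to see with concrete numbers. The following is a minimal sketch of the retention rule as described in the table, not the library's own pruning code. The function name `updates_to_keep` is hypothetical, and treating a negative --keep-interval-updates as "keep everything" is an assumption based on the default of -1.

```python
def updates_to_keep(saved_updates, keep_last, keep_pattern=-1):
    """Return the update numbers whose checkpoints survive pruning.

    saved_updates: update counts at which interval checkpoints were written.
    keep_last:     value of --keep-interval-updates (negative: keep all, assumed).
    keep_pattern:  value of --keep-interval-updates-pattern (<= 0 disables it).
    """
    ordered = sorted(saved_updates)
    if keep_last < 0:
        return ordered
    # Guard against ordered[-0:], which would select the whole list.
    recent = set(ordered[-keep_last:]) if keep_last > 0 else set()
    return [
        u for u in ordered
        if u in recent or (keep_pattern > 0 and u % keep_pattern == 0)
    ]


# Checkpoints written every 1000 updates up to update 10000.
saved = list(range(1000, 10001, 1000))
print(updates_to_keep(saved, keep_last=2, keep_pattern=5000))
```

In this example the two most recent checkpoints are kept, and the checkpoint at update 5000 is also spared because 5000 % 5000 == 0.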
EMA configuration¶
--store-ema | Default: False |
--ema-decay | decay for exponential moving average model Default: 0.9999 |
--ema-start-update | start EMA update after this many model updates Default: 0 |
--ema-seed-model | Seed to load EMA model from. Used to load EMA model separately from the actual model. |
--ema-update-freq | Perform an EMA update every N model updates Default: 1 |
--ema-fp32 | If true, store EMA model in fp32 even if model is in fp16 Default: False |
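For readers unfamiliar with exponential moving averages, the options above map onto a simple update rule: ema = decay * ema + (1 - decay) * param. The sketch below works on plain Python lists and is only an illustration. The function name `ema_step` is hypothetical, and the handling of --ema-start-update (copying the model weights until that update is reached) is an assumption, not something the table states.

```python
def ema_step(ema, params, num_updates, decay=0.9999, start_update=0, update_freq=1):
    """Return the new EMA weights after one model update.

    ema, params: lists of floats standing in for model tensors.
    num_updates: number of model updates performed so far.
    """
    # Only touch the EMA every `update_freq` model updates (--ema-update-freq).
    if num_updates % update_freq != 0:
        return list(ema)
    # Assumption: before --ema-start-update the EMA simply tracks the model.
    effective_decay = decay if num_updates >= start_update else 0.0
    return [
        effective_decay * e + (1.0 - effective_decay) * p
        for e, p in zip(ema, params)
    ]


ema = [1.0, 2.0]
params = [0.0, 4.0]
print(ema_step(ema, params, num_updates=1, decay=0.9))
```

A decay close to 1, such as the default 0.9999, makes the averaged weights change slowly relative to the live model.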
fairseq-generate¶
Translate pre-processed data with a trained model.
usage: fairseq-generate [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--log-file LOG_FILE] [--aim-repo AIM_REPO]
[--aim-run-hash AIM_RUN_HASH]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--wandb-project WANDB_PROJECT] [--azureml-logging]
[--seed SEED] [--cpu] [--tpu] [--bf16]
[--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--on-cpu-convert-precision]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
[--amp-batch-retries AMP_BATCH_RETRIES]
[--amp-init-scale AMP_INIT_SCALE]
[--amp-scale-window AMP_SCALE_WINDOW]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile] [--reset-logging] [--suppress-crashes]
[--use-plasma-view] [--plasma-path PLASMA_PATH]
[--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}]
[--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
[--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
[--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
[--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
[--task TASK] [--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--combine-valid-subsets]
[--ignore-unused-valid-subsets]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--max-valid-steps MAX_VALID_STEPS]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--grouped-shuffling]
[--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
[--update-ordered-indices-seed]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-num-procs DISTRIBUTED_NUM_PROCS]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
[--ddp-comm-hook {none,fp16}]
[--bucket-cap-mb BUCKET_CAP_MB]
[--fix-batches-to-gpus] [--find-unused-parameters]
[--gradient-as-bucket-view] [--fast-stat-sync]
[--heartbeat-timeout HEARTBEAT_TIMEOUT]
[--broadcast-buffers]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}]
[--no-reshard-after-forward] [--fp32-reduce-scatter]
[--cpu-offload] [--use-sharded-state]
[--not-fsdp-flatten-parameters] [--path PATH]
[--post-process [POST_PROCESS]] [--quiet]
[--model-overrides MODEL_OVERRIDES]
[--results-path RESULTS_PATH] [--beam BEAM]
[--nbest NBEST] [--max-len-a MAX_LEN_A]
[--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
[--match-source-len] [--unnormalized]
[--no-early-stop] [--no-beamable-mm] [--lenpen LENPEN]
[--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
[--sacrebleu] [--score-reference]
[--prefix-size PREFIX_SIZE]
[--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
[--sampling] [--sampling-topk SAMPLING_TOPK]
[--sampling-topp SAMPLING_TOPP]
[--constraints [{ordered,unordered}]]
[--temperature TEMPERATURE]
[--diverse-beam-groups DIVERSE_BEAM_GROUPS]
[--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
[--diversity-rate DIVERSITY_RATE]
[--print-alignment [{hard,soft}]] [--print-step]
[--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
[--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
[--iter-decode-max-iter ITER_DECODE_MAX_ITER]
[--iter-decode-force-max-iter]
[--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
[--iter-decode-with-external-reranker]
[--retain-iter-history] [--retain-dropout]
[--retain-dropout-modules RETAIN_DROPOUT_MODULES]
[--decoding-format {unigram,ensemble,vote,dp,bs}]
[--no-seed-provided] [--eos-token EOS_TOKEN]
[--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
[--continue-once CONTINUE_ONCE]
[--finetune-from-model FINETUNE_FROM_MODEL]
[--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer]
[--optimizer-overrides OPTIMIZER_OVERRIDES]
[--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES]
[--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
[--keep-last-epochs KEEP_LAST_EPOCHS]
[--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
[--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-save-optimizer-state]
[--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
[--maximize-best-checkpoint-metric]
[--patience PATIENCE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--load-checkpoint-on-all-dp-ranks]
[--write-checkpoints-asynchronously]
Named Arguments¶
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm. Log format to use |
--log-file | log file to copy metrics to. |
--aim-repo | path to Aim repository |
--aim-run-hash | Aim run hash. If skipped, creates or continues run based on save_dir |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--wandb-project | Weights and Biases project name to use for logging |
--azureml-logging | Log scalars to AzureML context Default: False |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | percentage of updates that can overflow before decreasing the loss scale Default: 0.0 |
--on-cpu-convert-precision | if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage. Default: False |
--min-loss-scale | minimum FP16/AMP loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--amp | use automatic mixed precision Default: False |
--amp-batch-retries | number of retries of same batch after reducing loss scale with AMP Default: 2 |
--amp-init-scale | default AMP loss scale Default: 128 |
--amp-scale-window | number of updates before increasing AMP loss scale |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--reset-logging | when using Hydra, reset the logging at the beginning of training Default: False |
--suppress-crashes | suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps) Default: False |
--use-plasma-view | Store indices and sizes in shared memory Default: False |
--plasma-path | path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail. Default: “/tmp/plasma” |
--criterion | Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: moses, nltk, space |
--bpe | Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt |
--optimizer | Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd |
--lr-scheduler | Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular Default: “fixed” |
--scoring | Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer Default: “bleu” |
--task | Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “translation” |
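The FP16 loss-scaling options in the table above (--fp16-init-scale, --fp16-scale-window, --fp16-scale-tolerance, --min-loss-scale) describe a dynamic loss scaler. The following toy class is a sketch of how such options typically interact, not the library's implementation. The class name `LossScaler`, the growth and backoff factor of 2, the treatment of the tolerance as a fraction, and the example scale window are all assumptions made for illustration.

```python
class LossScaler:
    """Toy dynamic loss scaler driven by the FP16 options listed above."""

    def __init__(self, init_scale=128.0, scale_window=2000, tolerance=0.0,
                 min_loss_scale=1e-4, factor=2.0):
        self.scale = float(init_scale)
        self.scale_window = scale_window      # example value; the table lists no default
        self.tolerance = tolerance            # assumed to be a fraction in [0, 1]
        self.min_loss_scale = min_loss_scale
        self.factor = factor                  # assumed growth and backoff factor
        self._updates = 0                     # updates since the last rescale
        self._overflows = 0                   # overflows since the last rescale

    def update(self, overflow):
        """Record one update, adjust the scale if needed, and return it."""
        self._updates += 1
        if overflow:
            self._overflows += 1
            # Back off once the overflow share reaches the tolerance.
            if self._overflows / self._updates >= self.tolerance:
                self.scale /= self.factor
                self._updates = self._overflows = 0
                if self.scale < self.min_loss_scale:
                    raise FloatingPointError("loss scale fell below the minimum")
        elif self._updates >= self.scale_window:
            # Enough updates without a backoff: try a larger scale.
            self.scale *= self.factor
            self._updates = self._overflows = 0
        return self.scale


scaler = LossScaler(init_scale=128, scale_window=3)
for overflow in [False, False, False, True]:
    print(scaler.update(overflow))
```

Raising an error when the scale drops below the minimum mirrors the table's statement that training is stopped at that point.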
dataset_data_loading¶
--num-workers | how many subprocesses to use for data loading Default: 1 |
--skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set Default: False |
--max-tokens | maximum number of tokens in a batch |
--batch-size, --max-sentences | number of examples in a batch |
--required-batch-size-multiple | batch size will be a multiple of this value Default: 8 |
--required-seq-len-multiple | maximum sequence length in batch will be a multiple of this value Default: 1 |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta, huffman. Output dataset implementation |
--data-buffer-size | Number of batches to preload Default: 10 |
--train-subset | data subset to use for training (e.g. train, valid, test) Default: “train” |
--valid-subset | comma separated list of data subsets to use for validation (e.g. train, valid, test) Default: “valid” |
--combine-valid-subsets, --combine-val | if set, combine the validation subsets into a single set |
--ignore-unused-valid-subsets | do not raise error if valid subsets are ignored Default: False |
--validate-interval | validate every N epochs Default: 1 |
--validate-interval-updates | validate every N updates Default: 0 |
--validate-after-updates | don’t validate until reaching this many updates Default: 0 |
--fixed-validation-seed | specified random seed for validation |
--disable-validation | disable validation Default: False |
--max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) |
--batch-size-valid, --max-sentences-valid | batch size of the validation batch (defaults to --batch-size) |
--max-valid-steps, --nval | How many batches to evaluate |
--curriculum | don’t shuffle batches for first N epochs Default: 0 |
--gen-subset | data subset to generate (train, valid, test) Default: “test” |
--num-shards | shard generation over N shards Default: 1 |
--shard-id | id of the shard to generate (id < num_shards) Default: 0 |
--grouped-shuffling | shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length Default: False |
--update-epoch-batch-itr | if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false; defaults to --grouped-shuffling |
--update-ordered-indices-seed | if true, increment the seed with the epoch when getting batch iterators; defaults to False. Default: False |
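The --num-shards and --shard-id options above split generation across several independent runs. A minimal sketch of one way to do this follows. The strided assignment shown here is an assumption about how items are distributed, and the function name `shard` is hypothetical. The only constraint taken from the table is that the shard id must be smaller than the number of shards.

```python
def shard(items, num_shards, shard_id):
    """Return the items handled by one shard, using strided assignment."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    return list(items)[shard_id::num_shards]


batches = list(range(10))  # stand-ins for batches of the generation subset
for shard_id in range(3):
    print(shard_id, shard(batches, num_shards=3, shard_id=shard_id))
```

Every item lands in exactly one shard, so running one job per shard id covers the whole subset once.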
distributed_training¶
--distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) Default: 1 |
--distributed-num-procs | total number of processes to fork (default: all visible GPUs) Default: 1 |
--distributed-rank | rank of the current worker Default: 0 |
--distributed-backend | distributed backend Default: “nccl” |
--distributed-init-method | typically tcp://hostname:port that will be used to establish initial connection |
--distributed-port | port number (not required if using --distributed-init-method) Default: -1 |
--device-id, --local_rank | which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically) Default: 0 |
--distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible Default: False |
--ddp-backend | Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo. DistributedDataParallel backend Default: “pytorch_ddp” |
--ddp-comm-hook | Possible choices: none, fp16. Communication hook Default: “none” |
--bucket-cap-mb | bucket size for reduction Default: 25 |
--fix-batches-to-gpus | don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data Default: False |
--find-unused-parameters | enable unused parameter detection (not applicable to --ddp-backend=legacy_ddp) Default: False |
--gradient-as-bucket-view | when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size. Default: False |
--fast-stat-sync | [deprecated] this is now defined per Criterion Default: False |
--heartbeat-timeout | kill the job if no progress is made in N seconds; set to -1 to disable Default: -1 |
--broadcast-buffers | Copy non-trainable parameters between GPUs, such as batchnorm population statistics Default: False |
--slowmo-momentum | SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs; 0.5 for 64 GPUs, 0.6 for > 64 GPUs |
--slowmo-base-algorithm | Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details Default: “localsgd” |
--localsgd-frequency | Local SGD allreduce frequency Default: 3 |
--nprocs-per-node | number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes Default: 1 |
--pipeline-model-parallel | if set, use pipeline model parallelism across GPUs Default: False |
--pipeline-balance | partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model |
--pipeline-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument |
--pipeline-chunks | microbatch count for pipeline model parallelism Default: 0 |
--pipeline-encoder-balance | partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model |
--pipeline-encoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument |
--pipeline-decoder-balance | partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model |
--pipeline-decoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument |
--pipeline-checkpoint | Possible choices: always, never, except_last. Checkpointing mode for pipeline model parallelism Default: “never” |
--zero-sharding | Possible choices: none, os. ZeRO sharding Default: “none” |
--no-reshard-after-forward | don’t reshard parameters after forward pass Default: False |
--fp32-reduce-scatter | reduce-scatter grads in FP32 Default: False |
--cpu-offload | offload FP32 params to CPU Default: False |
--use-sharded-state | use sharded checkpoint files Default: False |
--not-fsdp-flatten-parameters | do not flatten parameters for FSDP Default: False |
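The two constraints stated for --pipeline-balance and --pipeline-devices (the balance must sum to the number of layers, and both lists must have the same length) can be checked with a few lines of code. This is an illustrative sketch only. The function name `assign_layers` and the returned layout are hypothetical and do not reflect how the pipeline is actually built.

```python
def assign_layers(num_layers, balance, devices):
    """Map consecutive layer indices onto devices according to `balance`.

    Returns a list of (device, [layer indices]) pairs, one per partition.
    """
    if sum(balance) != num_layers:
        raise ValueError("sum(balance) must equal the total number of layers")
    if len(balance) != len(devices):
        raise ValueError("devices must have one entry per partition in balance")
    placement = []
    start = 0
    for layers_in_piece, device in zip(balance, devices):
        placement.append((device, list(range(start, start + layers_in_piece))))
        start += layers_in_piece
    return placement


# A 6-layer model split into two pieces of 2 and 4 layers on devices 0 and 1.
print(assign_layers(6, balance=[2, 4], devices=[0, 1]))
```

The same check applies to the encoder and decoder variants, using the encoder or decoder layer count in place of the total.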
Generation¶
--path | path(s) to model file(s), colon separated |
--post-process, --remove-bpe | post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process. |
--quiet | only print final scores Default: False |
--model-overrides | a dictionary used to override model args at generation that were used during model training Default: “{}” |
--results-path | path to save eval results (optional) |
--beam | beam size Default: 5 |
--nbest | number of hypotheses to output Default: 1 |
--max-len-a | generate sequences of maximum length ax + b, where x is the source length Default: 0 |
--max-len-b | generate sequences of maximum length ax + b, where x is the source length Default: 200 |
--min-len | minimum generation length Default: 1 |
--match-source-len | generations should match the source length Default: False |
--unnormalized | compare unnormalized hypothesis scores Default: False |
--no-early-stop | deprecated Default: False |
--no-beamable-mm | don’t use BeamableMM in attention layers Default: False |
--lenpen | length penalty: <1.0 favors shorter, >1.0 favors longer sentences Default: 1 |
--unkpen | unknown word penalty: <0 produces more unks, >0 produces fewer Default: 0 |
--replace-unk | perform unknown replacement (optionally with alignment dictionary) |
--sacrebleu | score with sacrebleu Default: False |
--score-reference | just score the reference translation Default: False |
--prefix-size | initialize generation by target prefix of given length Default: 0 |
--no-repeat-ngram-size | ngram blocking such that this size ngram cannot be repeated in the generation Default: 0 |
--sampling | sample hypotheses instead of using beam search Default: False |
--sampling-topk | sample from top K likely next words instead of all words Default: -1 |
--sampling-topp | sample from the smallest set whose cumulative probability mass exceeds p for next words Default: -1.0 |
--constraints | Possible choices: ordered, unordered. Enables lexically constrained decoding |
--temperature | temperature for generation Default: 1.0 |
--diverse-beam-groups | number of groups for Diverse Beam Search Default: -1 |
--diverse-beam-strength | strength of diversity penalty for Diverse Beam Search Default: 0.5 |
--diversity-rate | strength of diversity penalty for Diverse Siblings Search Default: -1.0 |
--print-alignment | Possible choices: hard, soft. If set, uses attention feedback to compute and print alignment to source tokens (valid options are: hard, soft, otherwise treated as hard alignment) |
--print-step | print steps Default: False |
--lm-path | path to lm checkpoint for lm fusion |
--lm-weight | weight for lm probs for lm fusion Default: 0.0 |
--iter-decode-eos-penalty | if > 0.0, penalizes early stopping in decoding. Default: 0.0 |
--iter-decode-max-iter | maximum iterations for iterative refinement. Default: 10 |
--iter-decode-force-max-iter | if set, run exactly the maximum number of iterations without early stop Default: False |
--iter-decode-with-beam | if > 1, the model will generate translations of varying lengths. Default: 1 |
--iter-decode-with-external-reranker | if set, the last checkpoint is assumed to be a reranker used to rescore the translations Default: False |
--retain-iter-history | if set, decoding returns the whole history of iterative refinement Default: False |
--retain-dropout | Use dropout at inference time Default: False |
--retain-dropout-modules | if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules |
--decoding-format | Possible choices: unigram, ensemble, vote, dp, bs. Special decoding format for advanced decoding. |
--no-seed-provided | if set, don’t use seed for initializing random generators Default: False |
--eos-token | EOS token |
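Two of the generation options above are defined by small formulas: --max-len-a and --max-len-b bound the output length at ax + b for a source of length x, and --sampling-topp restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. The sketch below illustrates both on plain Python values. The function names are hypothetical, truncating the length bound to an integer is an assumption, and real decoding operates on tensors rather than lists.

```python
def max_target_len(src_len, max_len_a=0.0, max_len_b=200):
    """Upper bound a*x + b on the generated length for a source of length x."""
    return int(max_len_a * src_len + max_len_b)  # integer truncation is assumed


def nucleus(probs, top_p):
    """Indices of the smallest token set whose cumulative probability exceeds top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total > top_p:
            break
    return kept


print(max_target_len(10))                      # defaults from the table: a=0, b=200
print(max_target_len(10, max_len_a=1.5, max_len_b=10))
print(nucleus([0.5, 0.3, 0.15, 0.05], top_p=0.7))
```

With the default values a = 0 and b = 200, the bound does not depend on the source length at all.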
checkpoint¶
--save-dir | path to save checkpoints Default: “checkpoints” |
--restore-file | filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt) Default: “checkpoint_last.pt” |
--continue-once | continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present |
--finetune-from-model | finetune from a pretrained model; note that meters and lr scheduler will be reset |
--reset-dataloader | if set, does not reload dataloader state from the checkpoint Default: False |
--reset-lr-scheduler | if set, does not load lr scheduler state from the checkpoint Default: False |
--reset-meters | if set, does not load meters from the checkpoint Default: False |
--reset-optimizer | if set, does not load optimizer state from the checkpoint Default: False |
--optimizer-overrides | a dictionary used to override optimizer args when loading a checkpoint Default: “{}” |
--save-interval | save a checkpoint every N epochs Default: 1 |
--save-interval-updates | save a checkpoint (and validate) every N updates Default: 0 |
--keep-interval-updates | keep the last N checkpoints saved with --save-interval-updates Default: -1 |
--keep-interval-updates-pattern | when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0 Default: -1 |
--keep-last-epochs | keep last N epoch checkpoints Default: -1 |
--keep-best-checkpoints | keep best N checkpoints based on scores Default: -1 |
--no-save | don’t save models or checkpoints Default: False |
--no-epoch-checkpoints | only store last and best checkpoints Default: False |
--no-last-checkpoints | don’t store last checkpoints Default: False |
--no-save-optimizer-state | don’t save optimizer-state as part of checkpoint Default: False |
--best-checkpoint-metric | metric to use for saving “best” checkpoints Default: “loss” |
--maximize-best-checkpoint-metric | select the largest metric value for saving “best” checkpoints Default: False |
--patience | early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval Default: -1 |
--checkpoint-suffix | suffix to add to the checkpoint file name Default: “” |
--checkpoint-shard-count | Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint Default: 1 |
--load-checkpoint-on-all-dp-ranks | load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices) Default: False |
--write-checkpoints-asynchronously, --save-async | Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested. Default: False |
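The --patience rule in the checkpoint options above ("stop if validation performance does not improve for N consecutive validation runs") can be sketched as a small function. This is an illustration under stated assumptions, not the trainer's own logic. The function name `should_stop` is hypothetical, a non-positive patience is assumed to disable early stopping (matching the default of -1), and the `maximize` argument stands in for --maximize-best-checkpoint-metric.

```python
def should_stop(metric_history, patience, maximize=False):
    """Return True once the metric fails to improve `patience` times in a row.

    metric_history: validation metric after each validation run, in order.
    """
    if patience <= 0:  # assumed: non-positive values disable early stopping
        return False
    best = None
    runs_without_improvement = 0
    for value in metric_history:
        improved = best is None or (value > best if maximize else value < best)
        if improved:
            best = value
            runs_without_improvement = 0
        else:
            runs_without_improvement += 1
            if runs_without_improvement >= patience:
                return True
    return False


# Validation loss improves twice, then gets worse two runs in a row.
print(should_stop([3.0, 2.5, 2.6, 2.7], patience=2))
```

Because the count is in validation runs, validating less often (a larger --validate-interval) lengthens the wall-clock time before training stops.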
fairseq-interactive¶
Translate raw text with a trained model. Batches data on-the-fly.
usage: fairseq-interactive [-h] [--no-progress-bar]
[--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--log-file LOG_FILE] [--aim-repo AIM_REPO]
[--aim-run-hash AIM_RUN_HASH]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--wandb-project WANDB_PROJECT] [--azureml-logging]
[--seed SEED] [--cpu] [--tpu] [--bf16]
[--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--on-cpu-convert-precision]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--amp] [--amp-batch-retries AMP_BATCH_RETRIES]
[--amp-init-scale AMP_INIT_SCALE]
[--amp-scale-window AMP_SCALE_WINDOW]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile] [--reset-logging] [--suppress-crashes]
[--use-plasma-view] [--plasma-path PLASMA_PATH]
[--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}]
[--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
[--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
[--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
[--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
[--task TASK] [--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--combine-valid-subsets]
[--ignore-unused-valid-subsets]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--max-valid-steps MAX_VALID_STEPS]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--grouped-shuffling]
[--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
[--update-ordered-indices-seed]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-num-procs DISTRIBUTED_NUM_PROCS]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
[--ddp-comm-hook {none,fp16}]
[--bucket-cap-mb BUCKET_CAP_MB]
[--fix-batches-to-gpus] [--find-unused-parameters]
[--gradient-as-bucket-view] [--fast-stat-sync]
[--heartbeat-timeout HEARTBEAT_TIMEOUT]
[--broadcast-buffers]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}]
[--no-reshard-after-forward]
[--fp32-reduce-scatter] [--cpu-offload]
[--use-sharded-state]
[--not-fsdp-flatten-parameters] [--path PATH]
[--post-process [POST_PROCESS]] [--quiet]
[--model-overrides MODEL_OVERRIDES]
[--results-path RESULTS_PATH] [--beam BEAM]
[--nbest NBEST] [--max-len-a MAX_LEN_A]
[--max-len-b MAX_LEN_B] [--min-len MIN_LEN]
[--match-source-len] [--unnormalized]
[--no-early-stop] [--no-beamable-mm]
[--lenpen LENPEN] [--unkpen UNKPEN]
[--replace-unk [REPLACE_UNK]] [--sacrebleu]
[--score-reference] [--prefix-size PREFIX_SIZE]
[--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE]
[--sampling] [--sampling-topk SAMPLING_TOPK]
[--sampling-topp SAMPLING_TOPP]
[--constraints [{ordered,unordered}]]
[--temperature TEMPERATURE]
[--diverse-beam-groups DIVERSE_BEAM_GROUPS]
[--diverse-beam-strength DIVERSE_BEAM_STRENGTH]
[--diversity-rate DIVERSITY_RATE]
[--print-alignment [{hard,soft}]] [--print-step]
[--lm-path LM_PATH] [--lm-weight LM_WEIGHT]
[--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY]
[--iter-decode-max-iter ITER_DECODE_MAX_ITER]
[--iter-decode-force-max-iter]
[--iter-decode-with-beam ITER_DECODE_WITH_BEAM]
[--iter-decode-with-external-reranker]
[--retain-iter-history] [--retain-dropout]
[--retain-dropout-modules RETAIN_DROPOUT_MODULES]
[--decoding-format {unigram,ensemble,vote,dp,bs}]
[--no-seed-provided] [--eos-token EOS_TOKEN]
[--save-dir SAVE_DIR] [--restore-file RESTORE_FILE]
[--continue-once CONTINUE_ONCE]
[--finetune-from-model FINETUNE_FROM_MODEL]
[--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer]
[--optimizer-overrides OPTIMIZER_OVERRIDES]
[--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES]
[--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
[--keep-last-epochs KEEP_LAST_EPOCHS]
[--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
[--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-save-optimizer-state]
[--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
[--maximize-best-checkpoint-metric]
[--patience PATIENCE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--load-checkpoint-on-all-dp-ranks]
[--write-checkpoints-asynchronously]
[--buffer-size BUFFER_SIZE] [--input INPUT]
Named Arguments¶
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--log-file | log file to copy metrics to. |
--aim-repo | path to Aim repository |
--aim-run-hash | Aim run hash. If skipped, creates or continues run based on save_dir |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--wandb-project | Weights and Biases project name to use for logging |
--azureml-logging | Log scalars to AzureML context Default: False |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--on-cpu-convert-precision | if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage. Default: False |
--min-loss-scale | minimum FP16/AMP loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--amp | use automatic mixed precision Default: False |
--amp-batch-retries | number of retries of same batch after reducing loss scale with AMP Default: 2 |
--amp-init-scale | default AMP loss scale Default: 128 |
--amp-scale-window | number of updates before increasing AMP loss scale |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--reset-logging | when using Hydra, reset the logging at the beginning of training Default: False |
--suppress-crashes | suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps) Default: False |
--use-plasma-view | Store indices and sizes in shared memory Default: False |
--plasma-path | path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail. Default: “/tmp/plasma” |
--criterion | Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: moses, nltk, space |
--bpe | Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt |
--optimizer | Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd |
--lr-scheduler | Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular Default: “fixed” |
--scoring | Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer Default: “bleu” |
--task | Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “translation” |
dataset_data_loading¶
--num-workers | how many subprocesses to use for data loading Default: 1 |
--skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set Default: False |
--max-tokens | maximum number of tokens in a batch |
--batch-size, --max-sentences | number of examples in a batch |
--required-batch-size-multiple | batch size will be a multiple of this value Default: 8 |
--required-seq-len-multiple | maximum sequence length in batch will be a multiple of this value Default: 1 |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta, huffman output dataset implementation |
--data-buffer-size | Number of batches to preload Default: 10 |
--train-subset | data subset to use for training (e.g. train, valid, test) Default: “train” |
--valid-subset | comma separated list of data subsets to use for validation (e.g. train, valid, test) Default: “valid” |
--combine-valid-subsets, --combine-val | comma separated list of data subsets to use for validation (e.g. train, valid, test) |
--ignore-unused-valid-subsets | do not raise error if valid subsets are ignored Default: False |
--validate-interval | validate every N epochs Default: 1 |
--validate-interval-updates | validate every N updates Default: 0 |
--validate-after-updates | don’t validate until reaching this many updates Default: 0 |
--fixed-validation-seed | specified random seed for validation |
--disable-validation | disable validation Default: False |
--max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) |
--batch-size-valid, --max-sentences-valid | batch size of the validation batch (defaults to --batch-size) |
--max-valid-steps, --nval | How many batches to evaluate |
--curriculum | don’t shuffle batches for first N epochs Default: 0 |
--gen-subset | data subset to generate (train, valid, test) Default: “test” |
--num-shards | shard generation over N shards Default: 1 |
--shard-id | id of the shard to generate (id < num_shards) Default: 0 |
--grouped-shuffling | shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length Default: False |
--update-epoch-batch-itr | if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to false; defaults to --grouped-shuffling |
--update-ordered-indices-seed | if true, increment the seed with each epoch when getting batch iterators Default: False |
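As a rough illustration of --required-batch-size-multiple, the sketch below rounds a candidate batch size down to a multiple of the required value. The helper name round_batch_size is hypothetical and the rounding rule is an assumption made for illustration; fairseq's own batching also accounts for --max-tokens and sequence lengths.

```python
def round_batch_size(num_sentences: int, multiple: int = 8) -> int:
    """Round a candidate batch size down to a multiple of `multiple`.

    Hypothetical helper, not fairseq's implementation. Batches smaller
    than `multiple` are returned unchanged so that no sentence is dropped.
    """
    if multiple < 1:
        raise ValueError("multiple must be >= 1")
    if num_sentences < multiple:
        return num_sentences
    return (num_sentences // multiple) * multiple


if __name__ == "__main__":
    for n in (5, 8, 21, 64):
        print(n, "->", round_batch_size(n))
```

Multiples of 8 are a common choice because they tend to map well onto GPU tensor cores when training in half precision.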
distributed_training¶
--distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) Default: 1 |
--distributed-num-procs | total number of processes to fork (default: all visible GPUs) Default: 1 |
--distributed-rank | rank of the current worker Default: 0 |
--distributed-backend | distributed backend Default: “nccl” |
--distributed-init-method | typically tcp://hostname:port that will be used to establish initial connection |
--distributed-port | port number (not required if using --distributed-init-method) Default: -1 |
--device-id, --local_rank | which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically) Default: 0 |
--distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible Default: False |
--ddp-backend | Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo DistributedDataParallel backend Default: “pytorch_ddp” |
--ddp-comm-hook | Possible choices: none, fp16 communication hook Default: “none” |
--bucket-cap-mb | bucket size for reduction Default: 25 |
--fix-batches-to-gpus | don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data Default: False |
--find-unused-parameters | disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp) Default: False |
--gradient-as-bucket-view | when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size. Default: False |
--fast-stat-sync | [deprecated] this is now defined per Criterion Default: False |
--heartbeat-timeout | kill the job if no progress is made in N seconds; set to -1 to disable Default: -1 |
--broadcast-buffers | Copy non-trainable parameters between GPUs, such as batchnorm population statistics Default: False |
--slowmo-momentum | SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, 0.6 for > 64 GPUs |
--slowmo-base-algorithm | Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details Default: “localsgd” |
--localsgd-frequency | Local SGD allreduce frequency Default: 3 |
--nprocs-per-node | number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes Default: 1 |
--pipeline-model-parallel | if set, use pipeline model parallelism across GPUs Default: False |
--pipeline-balance | partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model |
--pipeline-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument |
--pipeline-chunks | microbatch count for pipeline model parallelism Default: 0 |
--pipeline-encoder-balance | partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model |
--pipeline-encoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument |
--pipeline-decoder-balance | partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model |
--pipeline-decoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument |
--pipeline-checkpoint | Possible choices: always, never, except_last checkpointing mode for pipeline model parallelism Default: “never” |
--zero-sharding | Possible choices: none, os ZeRO sharding Default: “none” |
--no-reshard-after-forward | don’t reshard parameters after forward pass Default: False |
--fp32-reduce-scatter | reduce-scatter grads in FP32 Default: False |
--cpu-offload | offload FP32 params to CPU Default: False |
--use-sharded-state | use sharded checkpoint files Default: False |
--not-fsdp-flatten-parameters | do not flatten parameters for FSDP Default: False |
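The two constraints stated for --pipeline-balance and --pipeline-devices (the balance sums to the number of layers, and both lists have the same length) can be checked before launching a job. The function below is a minimal sketch with a hypothetical name; it is not part of fairseq.

```python
from typing import List


def check_pipeline_partition(
    balance: List[int], devices: List[int], num_layers: int
) -> None:
    """Raise ValueError if the partition violates either stated constraint.

    Hypothetical pre-flight check, not a fairseq API.
    """
    if sum(balance) != num_layers:
        raise ValueError(
            f"sum of balance is {sum(balance)}, expected {num_layers} layers"
        )
    if len(devices) != len(balance):
        raise ValueError(
            f"{len(devices)} devices given for {len(balance)} partitions"
        )


if __name__ == "__main__":
    # 12 layers split into three partitions of 4 layers on devices 0, 1, 2.
    check_pipeline_partition([4, 4, 4], [0, 1, 2], num_layers=12)
    print("partition is consistent")
```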
Generation¶
--path | path(s) to model file(s), colon separated |
--post-process, --remove-bpe | post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process. |
--quiet | only print final scores Default: False |
--model-overrides | a dictionary used to override model args at generation that were used during model training Default: “{}” |
--results-path | path to save eval results (optional) |
--beam | beam size Default: 5 |
--nbest | number of hypotheses to output Default: 1 |
--max-len-a | generate sequences of maximum length ax + b, where x is the source length Default: 0 |
--max-len-b | generate sequences of maximum length ax + b, where x is the source length Default: 200 |
--min-len | minimum generation length Default: 1 |
--match-source-len | generations should match the source length Default: False |
--unnormalized | compare unnormalized hypothesis scores Default: False |
--no-early-stop | deprecated Default: False |
--no-beamable-mm | don’t use BeamableMM in attention layers Default: False |
--lenpen | length penalty: <1.0 favors shorter, >1.0 favors longer sentences Default: 1 |
--unkpen | unknown word penalty: <0 produces more unks, >0 produces fewer Default: 0 |
--replace-unk | perform unknown replacement (optionally with alignment dictionary) |
--sacrebleu | score with sacrebleu Default: False |
--score-reference | just score the reference translation Default: False |
--prefix-size | initialize generation by target prefix of given length Default: 0 |
--no-repeat-ngram-size | ngram blocking such that this size ngram cannot be repeated in the generation Default: 0 |
--sampling | sample hypotheses instead of using beam search Default: False |
--sampling-topk | sample from top K likely next words instead of all words Default: -1 |
--sampling-topp | sample from the smallest set whose cumulative probability mass exceeds p for next words Default: -1.0 |
--constraints | Possible choices: ordered, unordered enables lexically constrained decoding |
--temperature | temperature for generation Default: 1.0 |
--diverse-beam-groups | number of groups for Diverse Beam Search Default: -1 |
--diverse-beam-strength | strength of diversity penalty for Diverse Beam Search Default: 0.5 |
--diversity-rate | strength of diversity penalty for Diverse Siblings Search Default: -1.0 |
--print-alignment | Possible choices: hard, soft if set, uses attention feedback to compute and print alignment to source tokens (valid options are: hard, soft, otherwise treated as hard alignment) |
--print-step | print steps Default: False |
--lm-path | path to lm checkpoint for lm fusion |
--lm-weight | weight for lm probs for lm fusion Default: 0.0 |
--iter-decode-eos-penalty | if > 0.0, penalizes early stopping in decoding. Default: 0.0 |
--iter-decode-max-iter | maximum iterations for iterative refinement. Default: 10 |
--iter-decode-force-max-iter | if set, run exactly the maximum number of iterations without early stop Default: False |
--iter-decode-with-beam | if > 1, model will generate translations varying by the lengths. Default: 1 |
--iter-decode-with-external-reranker | if set, the last checkpoint is assumed to be a reranker to rescore the translations Default: False |
--retain-iter-history | if set, decoding returns the whole history of iterative refinement Default: False |
--retain-dropout | Use dropout at inference time Default: False |
--retain-dropout-modules | if set, only retain dropout for the specified modules; if not set, then dropout will be retained for all modules |
--decoding-format | Possible choices: unigram, ensemble, vote, dp, bs special decoding format for advanced decoding. |
--no-seed-provided | if set, don’t use seed for initializing random generators Default: False |
--eos-token | EOS token |
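Two of the generation options above reduce to small formulas: the length limit ax + b from --max-len-a and --max-len-b, and the candidate set used by --sampling-topp. The sketch below illustrates both in plain Python. The function names are hypothetical and this is not fairseq's implementation, which operates on batched tensors.

```python
from typing import List


def max_target_length(src_len: int, max_len_a: float = 0.0, max_len_b: int = 200) -> int:
    """Maximum generation length a*x + b for a source of length x."""
    return int(max_len_a * src_len + max_len_b)


def top_p_candidates(probs: List[float], p: float) -> List[int]:
    """Indices of the smallest set of tokens whose cumulative probability exceeds p.

    Tokens are visited from most to least probable and collected until the
    running total passes p.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total > p:
            break
    return kept


if __name__ == "__main__":
    print(max_target_length(50))            # uses the documented defaults a=0, b=200
    print(max_target_length(50, 1.5, 10))
    print(top_p_candidates([0.1, 0.5, 0.3, 0.1], p=0.7))
```

With the defaults (a = 0, b = 200) the limit does not depend on the source length; setting a above 0 lets longer inputs produce proportionally longer outputs.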
checkpoint¶
--save-dir | path to save checkpoints Default: “checkpoints” |
--restore-file | filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt) Default: “checkpoint_last.pt” |
--continue-once | continues from this checkpoint, unless a checkpoint indicated in ‘restore_file’ option is present |
--finetune-from-model | finetune from a pretrained model; note that meters and lr scheduler will be reset |
--reset-dataloader | if set, does not reload dataloader state from the checkpoint Default: False |
--reset-lr-scheduler | if set, does not load lr scheduler state from the checkpoint Default: False |
--reset-meters | if set, does not load meters from the checkpoint Default: False |
--reset-optimizer | if set, does not load optimizer state from the checkpoint Default: False |
--optimizer-overrides | a dictionary used to override optimizer args when loading a checkpoint Default: “{}” |
--save-interval | save a checkpoint every N epochs Default: 1 |
--save-interval-updates | save a checkpoint (and validate) every N updates Default: 0 |
--keep-interval-updates | keep the last N checkpoints saved with --save-interval-updates Default: -1 |
--keep-interval-updates-pattern | when used with --keep-interval-updates, skips deleting any checkpoints with update X where X % keep_interval_updates_pattern == 0 Default: -1 |
--keep-last-epochs | keep last N epoch checkpoints Default: -1 |
--keep-best-checkpoints | keep best N checkpoints based on scores Default: -1 |
--no-save | don’t save models or checkpoints Default: False |
--no-epoch-checkpoints | only store last and best checkpoints Default: False |
--no-last-checkpoints | don’t store last checkpoints Default: False |
--no-save-optimizer-state | don’t save optimizer-state as part of checkpoint Default: False |
--best-checkpoint-metric | metric to use for saving “best” checkpoints Default: “loss” |
--maximize-best-checkpoint-metric | select the largest metric value for saving “best” checkpoints Default: False |
--patience | early stop training if valid performance doesn’t improve for N consecutive validation runs; note that this is influenced by --validate-interval Default: -1 |
--checkpoint-suffix | suffix to add to the checkpoint file name Default: “” |
--checkpoint-shard-count | Number of shards containing the checkpoint - if the checkpoint is over 300GB, it is preferable to split it into shards to prevent OOM on CPU while loading the checkpoint Default: 1 |
--load-checkpoint-on-all-dp-ranks | load checkpoints on all data parallel devices (default: only load on rank 0 and broadcast to other devices) Default: False |
--write-checkpoints-asynchronously, --save-async | Write checkpoints asynchronously in a separate thread. NOTE: This feature is currently being tested. Default: False |
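The interaction between --keep-interval-updates and --keep-interval-updates-pattern can be sketched as a retention rule over update numbers. The helper below is hypothetical and simplified: it assumes that a non-positive N means "keep everything" and it ignores epoch and best checkpoints, which have their own options.

```python
from typing import List


def checkpoints_to_delete(updates: List[int], keep_last: int, pattern: int = -1) -> List[int]:
    """Return the update numbers whose checkpoints would be removed.

    Hypothetical helper, not fairseq's implementation. Keeps the `keep_last`
    most recent checkpoints and, when `pattern` > 0, also keeps every
    checkpoint whose update number X satisfies X % pattern == 0.
    """
    if keep_last <= 0:  # assumption: non-positive values disable pruning
        return []
    ordered = sorted(updates)
    older = ordered[:-keep_last]
    if pattern > 0:
        older = [u for u in older if u % pattern != 0]
    return older


if __name__ == "__main__":
    saved = [1000, 2000, 3000, 4000, 5000, 6000]
    print(checkpoints_to_delete(saved, keep_last=2))
    print(checkpoints_to_delete(saved, keep_last=2, pattern=2000))
```

The pattern is useful for keeping a sparse long-term history (for example every 2000th update) while still pruning most intermediate checkpoints.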
Interactive¶
--buffer-size | read this many sentences into a buffer before processing them Default: 0 |
--input | file to read from; use - for stdin Default: “-” |
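The buffering described by --buffer-size can be pictured as reading input lines in fixed-size groups before translating each group. The generator below is a minimal sketch with a hypothetical name; treating values below 1 as 1 is an assumption of this sketch, not a documented behavior.

```python
from typing import Iterable, Iterator, List


def buffered_read(lines: Iterable[str], buffer_size: int) -> Iterator[List[str]]:
    """Yield lists of up to `buffer_size` stripped lines.

    Hypothetical helper. Values below 1 are treated as 1 (an assumption).
    """
    size = max(1, buffer_size)
    buffer: List[str] = []
    for line in lines:
        buffer.append(line.strip())
        if len(buffer) >= size:
            yield buffer
            buffer = []
    if buffer:  # flush the final, possibly shorter, group
        yield buffer


if __name__ == "__main__":
    sentences = ["Hello world\n", "How are you?\n", "Goodbye\n"]
    for group in buffered_read(sentences, buffer_size=2):
        print(group)
```

A larger buffer allows more sentences to be batched together at the cost of higher latency before the first output appears.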
fairseq-score¶
BLEU scoring of generated translations against reference translations.
Command-line script for BLEU scoring.
usage: fairseq-score [-h] [-s SYS] -r REF [-o N] [--ignore-case] [--sacrebleu]
[--sentence-bleu]
Named Arguments¶
-s, --sys | system output Default: “-” |
-r, --ref | references |
-o, --order | consider ngrams up to this order Default: 4 |
--ignore-case | case-insensitive scoring Default: False |
--sacrebleu | score with sacrebleu Default: False |
--sentence-bleu | report sentence-level BLEUs (i.e., with +1 smoothing) Default: False |
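The -o/--order option bounds the n-gram orders that contribute to the score. The sketch below counts clipped n-gram matches for orders 1 through N on whitespace-tokenized text, which is the statistic BLEU precisions are built from. It is an illustration only: the function names are hypothetical, and it omits the brevity penalty, smoothing, and the final geometric mean.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(
    sys_tokens: List[str], ref_tokens: List[str], order: int = 4
) -> List[Tuple[int, int]]:
    """For n = 1..order, return (matched, total) system n-gram counts.

    Each system n-gram is credited at most as many times as it occurs in
    the reference (clipping).
    """
    stats = []
    for n in range(1, order + 1):
        sys_counts = ngram_counts(sys_tokens, n)
        ref_counts = ngram_counts(ref_tokens, n)
        matched = sum(min(c, ref_counts[g]) for g, c in sys_counts.items())
        stats.append((matched, sum(sys_counts.values())))
    return stats


if __name__ == "__main__":
    hyp = "the cat sat on the mat".split()
    ref = "the cat is on the mat".split()
    for n, (matched, total) in enumerate(clipped_matches(hyp, ref), start=1):
        print(f"{n}-gram: {matched}/{total}")
```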
fairseq-eval-lm¶
Evaluate the perplexity of a trained language model.
usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--log-file LOG_FILE] [--aim-repo AIM_REPO]
[--aim-run-hash AIM_RUN_HASH]
[--tensorboard-logdir TENSORBOARD_LOGDIR]
[--wandb-project WANDB_PROJECT] [--azureml-logging]
[--seed SEED] [--cpu] [--tpu] [--bf16]
[--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--on-cpu-convert-precision]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
[--amp-batch-retries AMP_BATCH_RETRIES]
[--amp-init-scale AMP_INIT_SCALE]
[--amp-scale-window AMP_SCALE_WINDOW]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile] [--reset-logging] [--suppress-crashes]
[--use-plasma-view] [--plasma-path PLASMA_PATH]
[--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}]
[--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
[--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
[--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
[--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
[--task TASK] [--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET] [--combine-valid-subsets]
[--ignore-unused-valid-subsets]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--max-valid-steps MAX_VALID_STEPS]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--grouped-shuffling]
[--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
[--update-ordered-indices-seed]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-num-procs DISTRIBUTED_NUM_PROCS]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
[--ddp-comm-hook {none,fp16}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
[--find-unused-parameters] [--gradient-as-bucket-view]
[--fast-stat-sync]
[--heartbeat-timeout HEARTBEAT_TIMEOUT]
[--broadcast-buffers]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}]
[--no-reshard-after-forward] [--fp32-reduce-scatter]
[--cpu-offload] [--use-sharded-state]
[--not-fsdp-flatten-parameters] [--path PATH]
[--post-process [POST_PROCESS]] [--quiet]
[--model-overrides MODEL_OVERRIDES]
[--results-path RESULTS_PATH] [--output-word-probs]
[--output-word-stats] [--context-window CONTEXT_WINDOW]
[--softmax-batch SOFTMAX_BATCH]
Named Arguments¶
--no-progress-bar | disable progress bar Default: False |
--log-interval | log progress every N batches (when progress bar is disabled) Default: 100 |
--log-format | Possible choices: json, none, simple, tqdm log format to use |
--log-file | log file to copy metrics to. |
--aim-repo | path to Aim repository |
--aim-run-hash | Aim run hash. If skipped, creates or continues run based on save_dir |
--tensorboard-logdir | path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging) |
--wandb-project | Weights and Biases project name to use for logging |
--azureml-logging | Log scalars to AzureML context Default: False |
--seed | pseudo random number generator seed Default: 1 |
--cpu | use CPU instead of CUDA Default: False |
--tpu | use TPU instead of CUDA Default: False |
--bf16 | use bfloat16; implies --tpu Default: False |
--memory-efficient-bf16 | use a memory-efficient version of BF16 training; implies --bf16 Default: False |
--fp16 | use FP16 Default: False |
--memory-efficient-fp16 | use a memory-efficient version of FP16 training; implies --fp16 Default: False |
--fp16-no-flatten-grads | don’t flatten FP16 grads tensor Default: False |
--fp16-init-scale | default FP16 loss scale Default: 128 |
--fp16-scale-window | number of updates before increasing loss scale |
--fp16-scale-tolerance | pct of updates that can overflow before decreasing the loss scale Default: 0.0 |
--on-cpu-convert-precision | if set, the floating point conversion to fp16/bf16 runs on CPU. This reduces bus transfer time and GPU memory usage. Default: False |
--min-loss-scale | minimum FP16/AMP loss scale, after which training is stopped Default: 0.0001 |
--threshold-loss-scale | threshold FP16 loss scale from below |
--amp | use automatic mixed precision Default: False |
--amp-batch-retries | number of retries of same batch after reducing loss scale with AMP Default: 2 |
--amp-init-scale | default AMP loss scale Default: 128 |
--amp-scale-window | number of updates before increasing AMP loss scale |
--user-dir | path to a python module containing custom extensions (tasks and/or architectures) |
--empty-cache-freq | how often to clear the PyTorch CUDA cache (0 to disable) Default: 0 |
--all-gather-list-size | number of bytes reserved for gathering stats from workers Default: 16384 |
--model-parallel-size | total number of GPUs to parallelize model over Default: 1 |
--quantization-config-path | path to quantization config file |
--profile | enable autograd profiler emit_nvtx Default: False |
--reset-logging | when using Hydra, reset the logging at the beginning of training Default: False |
--suppress-crashes | suppress crashes when training with the hydra_train entry point so that the main method can return a value (useful for sweeps) Default: False |
--use-plasma-view | Store indices and sizes in shared memory Default: False |
--plasma-path | path to run plasma_store, defaults to /tmp/plasma. Paths outside /tmp tend to fail. Default: “/tmp/plasma” |
--criterion | Possible choices: adaptive_loss, composite_loss, cross_entropy, ctc, fastspeech2, hubert, label_smoothed_cross_entropy, latency_augmented_label_smoothed_cross_entropy, label_smoothed_cross_entropy_with_alignment, label_smoothed_cross_entropy_with_ctc, legacy_masked_lm_loss, masked_lm, model, nat_loss, sentence_prediction, sentence_prediction_adapters, sentence_ranking, tacotron2, speech_to_unit, speech_to_spectrogram, speech_unit_lm_criterion, wav2vec, vocab_parallel_cross_entropy Default: “cross_entropy” |
--tokenizer | Possible choices: moses, nltk, space |
--bpe | Possible choices: byte_bpe, bytes, characters, fastbpe, gpt2, bert, hf_byte_bpe, sentencepiece, subword_nmt |
--optimizer | Possible choices: adadelta, adafactor, adagrad, adam, adamax, composite, cpu_adam, lamb, nag, sgd |
--lr-scheduler | Possible choices: cosine, fixed, inverse_sqrt, manual, pass_through, polynomial_decay, reduce_lr_on_plateau, step, tri_stage, triangular Default: “fixed” |
--scoring | Possible choices: bert_score, sacrebleu, bleu, chrf, meteor, wer Default: “bleu” |
--task | Possible choices: multilingual_language_modeling, speech_unit_modeling, hubert_pretraining, translation, multilingual_translation, semisupervised_translation, translation_from_pretrained_xlm, speech_to_text, text_to_speech, frm_text_to_speech, legacy_masked_lm, audio_pretraining, audio_finetuning, sentence_ranking, online_backtranslation, simul_speech_to_text, simul_text_to_text, cross_lingual_lm, span_masked_lm, denoising, multilingual_denoising, multilingual_masked_lm, language_modeling, masked_lm, nlu_finetuning, speech_to_speech, sentence_prediction, translation_from_pretrained_bart, sentence_prediction_adapters, translation_multi_simple_epoch, translation_lev, dummy_lm, dummy_masked_lm, dummy_mt task Default: “language_modeling” |
dataset_data_loading¶
--num-workers | how many subprocesses to use for data loading Default: 1 |
--skip-invalid-size-inputs-valid-test | ignore too long or too short lines in valid and test set Default: False |
--max-tokens | maximum number of tokens in a batch |
--batch-size, --max-sentences | number of examples in a batch |
--required-batch-size-multiple | batch size will be a multiple of this value Default: 8 |
--required-seq-len-multiple | maximum sequence length in batch will be a multiple of this value Default: 1 |
--dataset-impl | Possible choices: raw, lazy, cached, mmap, fasta, huffman output dataset implementation |
--data-buffer-size | Number of batches to preload Default: 10 |
--train-subset | data subset to use for training (e.g. train, valid, test) Default: “train” |
--valid-subset | comma separated list of data subsets to use for validation (e.g. train, valid, test) Default: “valid” |
--combine-valid-subsets, --combine-val | comma separated list of data subsets to use for validation (e.g. train, valid, test) |
--ignore-unused-valid-subsets | do not raise error if valid subsets are ignored Default: False |
--validate-interval | validate every N epochs Default: 1 |
--validate-interval-updates | validate every N updates Default: 0 |
--validate-after-updates | don’t validate until reaching this many updates Default: 0 |
--fixed-validation-seed | specified random seed for validation |
--disable-validation | disable validation Default: False |
--max-tokens-valid | maximum number of tokens in a validation batch (defaults to --max-tokens) |
--batch-size-valid, --max-sentences-valid | batch size of the validation batch (defaults to --batch-size) |
--max-valid-steps, --nval | How many batches to evaluate |
--curriculum | don’t shuffle batches for first N epochs Default: 0 |
--gen-subset | data subset to generate (train, valid, test) Default: “test” |
--num-shards | shard generation over N shards Default: 1 |
--shard-id | id of the shard to generate (id < num_shards) Default: 0 |
--grouped-shuffling | shuffle batches in groups of num_shards to enable similar sequence lengths on each GPU worker when batches are sorted by length Default: False |
--update-epoch-batch-itr | if true, prevents reuse of the epoch batch iterator by setting can_reuse_epoch_itr to False (defaults to --grouped-shuffling) |
--update-ordered-indices-seed | if true, increments the seed with the epoch when getting batch iterators Default: False |
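The interaction of --num-shards and --shard-id can be pictured as a round-robin split of the batch list. The following is a minimal sketch under that assumption, not fairseq's own implementation; the helper name `shard_batches` and the `fill_value` padding are hypothetical and chosen only for illustration.

```python
import math


def shard_batches(batches, num_shards, shard_id, fill_value=None):
    """Return the batches that one shard would process (illustrative only).

    Batches are dealt out round-robin, and shorter shards are padded with
    fill_value so that every shard has the same length.
    """
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    sharded_len = math.ceil(len(batches) / num_shards)
    shard = list(batches[shard_id::num_shards])
    shard += [fill_value] * (sharded_len - len(shard))
    return shard


# Seven batches split over three shards; shard 1 is padded to length 3.
print(shard_batches(list(range(7)), num_shards=3, shard_id=1))  # [1, 4, None]
```

Padding keeps all workers in step when the number of batches is not divisible by the number of shards.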
distributed_training¶
--distributed-world-size | total number of GPUs across all nodes (default: all visible GPUs) Default: 1 |
--distributed-num-procs | total number of processes to fork (default: all visible GPUs) Default: 1 |
--distributed-rank | rank of the current worker Default: 0 |
--distributed-backend | distributed backend Default: “nccl” |
--distributed-init-method | typically tcp://hostname:port that will be used to establish initial connection |
--distributed-port | port number (not required if using --distributed-init-method) Default: -1 |
--device-id, --local_rank | which GPU to use (by default looks for $LOCAL_RANK, usually configured automatically) Default: 0 |
--distributed-no-spawn | do not spawn multiple processes even if multiple GPUs are visible Default: False |
--ddp-backend | Possible choices: c10d, fully_sharded, legacy_ddp, no_c10d, pytorch_ddp, slowmo DistributedDataParallel backend Default: “pytorch_ddp” |
--ddp-comm-hook | Possible choices: none, fp16 communication hook Default: “none” |
--bucket-cap-mb | bucket size for reduction Default: 25 |
--fix-batches-to-gpus | don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data Default: False |
--find-unused-parameters | disable unused parameter detection (not applicable to --ddp-backend=legacy_ddp) Default: False |
--gradient-as-bucket-view | when set to True, gradients will be views pointing to different offsets of allreduce communication buckets. This can reduce peak memory usage, where the saved memory size will be equal to the total gradients size. Default: False |
--fast-stat-sync | [deprecated] this is now defined per Criterion Default: False |
--heartbeat-timeout | kill the job if no progress is made in N seconds; set to -1 to disable Default: -1 |
--broadcast-buffers | Copy non-trainable parameters between GPUs, such as batchnorm population statistics Default: False |
--slowmo-momentum | SlowMo momentum term; by default use 0.0 for 16 GPUs, 0.2 for 32 GPUs, 0.5 for 64 GPUs, 0.6 for > 64 GPUs |
--slowmo-base-algorithm | Base algorithm. Either ‘localsgd’ or ‘sgp’. Please refer to the documentation of ‘slowmo_base_algorithm’ parameter in https://fairscale.readthedocs.io/en/latest/api/experimental/nn/slowmo_ddp.html for more details Default: “localsgd” |
--localsgd-frequency | Local SGD allreduce frequency Default: 3 |
--nprocs-per-node | number of GPUs in each node. An allreduce operation across GPUs in a node is very fast. Hence, we do allreduce across GPUs in a node, and gossip across different nodes Default: 1 |
--pipeline-model-parallel | if set, use pipeline model parallelism across GPUs Default: False |
--pipeline-balance | partition the model into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_balance) should equal the total number of layers in the model |
--pipeline-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-balance argument |
--pipeline-chunks | microbatch count for pipeline model parallelism Default: 0 |
--pipeline-encoder-balance | partition the pipeline parallel encoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_encoder_balance) should equal the total number of encoder layers in the model |
--pipeline-encoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-encoder-balance argument |
--pipeline-decoder-balance | partition the pipeline parallel decoder into N_K pieces, where each piece contains N_i layers. The sum(args.pipeline_decoder_balance) should equal the total number of decoder layers in the model |
--pipeline-decoder-devices | a list of device indices indicating which device to place each of the N_K partitions. The length of this list should equal the length of the --pipeline-decoder-balance argument |
--pipeline-checkpoint | Possible choices: always, never, except_last checkpointing mode for pipeline model parallelism Default: “never” |
--zero-sharding | Possible choices: none, os ZeRO sharding Default: “none” |
--no-reshard-after-forward | don’t reshard parameters after forward pass Default: False |
--fp32-reduce-scatter | reduce-scatter grads in FP32 Default: False |
--cpu-offload | offload FP32 params to CPU Default: False |
--use-sharded-state | use sharded checkpoint files Default: False |
--not-fsdp-flatten-parameters | do not flatten parameters for FSDP Default: False |
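The two constraints stated for --pipeline-balance and --pipeline-devices (the balance must sum to the number of layers, and both lists must have the same length) can be checked before launching a job. The sketch below is a standalone illustration; `check_pipeline_partition` is a hypothetical helper, not part of fairseq, and the per-layer device list it returns is just one way to visualize the partition.

```python
def check_pipeline_partition(balance, devices, num_layers):
    """Validate a pipeline partition and return the device of each layer.

    balance:    number of layers in each partition (N_i for each piece)
    devices:    device index for each partition
    num_layers: total number of layers in the model
    """
    if sum(balance) != num_layers:
        raise ValueError(
            f"sum(balance)={sum(balance)} does not match num_layers={num_layers}"
        )
    if len(devices) != len(balance):
        raise ValueError("devices and balance must have the same length")
    layer_to_device = []
    for n_layers, device in zip(balance, devices):
        layer_to_device.extend([device] * n_layers)
    return layer_to_device


# A 6-layer model split into pieces of 1, 3 and 2 layers on devices 0, 1, 2.
print(check_pipeline_partition([1, 3, 2], [0, 1, 2], num_layers=6))
# [0, 1, 1, 1, 2, 2]
```

The same check applies to the encoder and decoder variants, using the encoder or decoder layer count in place of the total.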
LM Evaluation¶
--path | path(s) to model file(s), colon separated |
--post-process, --remove-bpe | post-process text by removing BPE, letter segmentation, etc. Valid options can be found in fairseq.data.utils.post_process. |
--quiet | only print final scores Default: False |
--model-overrides | a dictionary used to override model args at generation that were used during model training Default: “{}” |
--results-path | path to save eval results (optional) |
--output-word-probs | if set, outputs words and their predicted log probabilities to standard output Default: False |
--output-word-stats | if set, outputs word statistics such as word count, average probability, etc Default: False |
--context-window | ensures that every evaluated token has access to a context of at least this size, if possible Default: 0 |
--softmax-batch | if BxT is more than this, will batch the softmax over vocab to this amount of tokens, in order to fit into GPU memory Default: 9223372036854775807 |
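One way to picture --context-window is as a split of the token stream into scored blocks, each preceded by up to that many already-seen tokens that supply context but are not scored again. The following sketch assumes this reading; the function `windows_with_context` and its `block_size` parameter are hypothetical names for illustration and do not reproduce fairseq's internals.

```python
def windows_with_context(tokens, block_size, context_window):
    """Split tokens into evaluation windows (illustrative only).

    Returns a list of (window_tokens, num_context) pairs. The first
    num_context tokens of each window only provide context; the remaining
    tokens are the ones that would be scored.
    """
    windows = []
    for start in range(0, len(tokens), block_size):
        ctx_start = max(0, start - context_window)
        windows.append((tokens[ctx_start:start + block_size], start - ctx_start))
    return windows


# Ten tokens, blocks of four scored tokens, two tokens of extra context.
for window, num_context in windows_with_context(list(range(10)), 4, 2):
    print(window, num_context)
```

The first window has no extra context because nothing precedes it, which corresponds to the "if possible" qualifier in the option description. Every token is scored exactly once across the windows.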