Command-line Tools

Fairseq provides several command-line tools for training and evaluating models:

fairseq-preprocess

Data pre-processing: build vocabularies and binarize training data.

usage: fairseq-preprocess [-h] [--no-progress-bar] [--log-interval N]
                          [--log-format {json,none,simple,tqdm}]
                          [--tensorboard-logdir DIR] [--tbmf-wrapper]
                          [--seed N] [--cpu] [--fp16]
                          [--memory-efficient-fp16]
                          [--fp16-init-scale FP16_INIT_SCALE]
                          [--fp16-scale-window FP16_SCALE_WINDOW]
                          [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                          [--min-loss-scale D]
                          [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                          [--user-dir USER_DIR]
                          [--criterion {composite_loss,legacy_masked_lm_loss,sentence_prediction,cross_entropy,adaptive_loss,masked_lm,label_smoothed_cross_entropy,sentence_ranking,binary_cross_entropy}]
                          [--tokenizer {space,nltk,moses}]
                          [--bpe {sentencepiece,subword_nmt,fastbpe,gpt2}]
                          [--optimizer {nag,adagrad,adam,adafactor,adamax,sgd,adadelta}]
                          [--lr-scheduler {fixed,polynomial_decay,triangular,reduce_lr_on_plateau,inverse_sqrt,cosine}]
                          [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP]
                          [--validpref FP] [--testpref FP] [--destdir DIR]
                          [--thresholdtgt N] [--thresholdsrc N] [--tgtdict FP]
                          [--srcdict FP] [--nwordstgt N] [--nwordssrc N]
                          [--alignfile ALIGN] [--dataset-impl FORMAT]
                          [--joined-dictionary] [--only-source]
                          [--padding-factor N] [--workers N]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 1000

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir

path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)

Default: “”

--tbmf-wrapper

[FB only]

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window

number of updates before increasing loss scale

--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale

threshold FP16 loss scale from below

--user-dir

path to a python module containing custom extensions (tasks and/or architectures)

--criterion

Possible choices: composite_loss, legacy_masked_lm_loss, sentence_prediction, cross_entropy, adaptive_loss, masked_lm, label_smoothed_cross_entropy, sentence_ranking, binary_cross_entropy

Default: “cross_entropy”

--tokenizer

Possible choices: space, nltk, moses

--bpe

Possible choices: sentencepiece, subword_nmt, fastbpe, gpt2

--optimizer

Possible choices: nag, adagrad, adam, adafactor, adamax, sgd, adadelta

Default: “nag”

--lr-scheduler

Possible choices: fixed, polynomial_decay, triangular, reduce_lr_on_plateau, inverse_sqrt, cosine

Default: “fixed”

--task

Possible choices: translation, multilingual_translation, semisupervised_translation, legacy_masked_lm, translation_moe, sentence_prediction, language_modeling, masked_lm, translation_from_pretrained_xlm, sentence_ranking, cross_lingual_lm, audio_pretraining

task

Default: “translation”

--dataset-impl

Possible choices: raw, lazy, cached, mmap

output dataset implementation

Default: “mmap”

Preprocessing

-s, --source-lang

source language

-t, --target-lang

target language

--trainpref

train file prefix

--validpref

comma separated, valid file prefixes

--testpref

comma separated, test file prefixes

--destdir

destination dir

Default: “data-bin”

--thresholdtgt

map words appearing less than threshold times to unknown

Default: 0

--thresholdsrc

map words appearing less than threshold times to unknown

Default: 0

--tgtdict

reuse given target dictionary

--srcdict

reuse given source dictionary

--nwordstgt

number of target words to retain

Default: -1

--nwordssrc

number of source words to retain

Default: -1
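
As a rough illustration of how --thresholdsrc/--thresholdtgt and --nwordssrc/--nwordstgt shape a dictionary, the Python sketch below filters a toy corpus. It is not fairseq's implementation: the helper names build_vocab and encode and the "<unk>" placeholder are assumptions made for this example, and special symbols are ignored.

```python
from collections import Counter


def build_vocab(tokens, threshold=0, nwords=-1):
    """Keep words seen at least `threshold` times, then at most `nwords` of them.

    nwords=-1 is taken here to mean "no limit" (mirrors the documented default).
    """
    counts = Counter(tokens)
    # most_common() orders by descending frequency
    kept = [word for word, count in counts.most_common() if count >= threshold]
    if nwords >= 0:
        kept = kept[:nwords]
    return kept


def encode(tokens, vocab, unk="<unk>"):
    """Replace every token outside `vocab` with the unknown placeholder."""
    known = set(vocab)
    return [tok if tok in known else unk for tok in tokens]


corpus = "the cat sat on the mat the cat".split()
vocab = build_vocab(corpus, threshold=2)
print(vocab)
print(encode(corpus, vocab))
```

With a threshold of 2, only "the" and "cat" survive in this toy corpus; every other word is mapped to the placeholder.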

--alignfile

an alignment file (optional)

--joined-dictionary

Generate joined dictionary

Default: False

--only-source

Only process the source language

Default: False

--padding-factor

Pad dictionary size to be multiple of N

Default: 8
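
The effect of --padding-factor on the final dictionary size comes down to rounding up to the next multiple of N. A minimal sketch, assuming nothing about how fairseq fills the extra slots (padded_size is a hypothetical helper name):

```python
def padded_size(vocab_size, padding_factor=8):
    """Round `vocab_size` up to the next multiple of `padding_factor`."""
    if padding_factor <= 1:
        return vocab_size  # nothing to pad
    remainder = vocab_size % padding_factor
    if remainder == 0:
        return vocab_size
    return vocab_size + padding_factor - remainder


print(padded_size(32001))  # 32008
print(padded_size(32000))  # 32000
```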

--workers

number of parallel workers

Default: 1
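
To show how the preprocessing options combine, here is a small Python sketch that assembles a fairseq-preprocess command line from the flags listed above. The language codes and the corpus/train and corpus/valid prefixes are placeholders chosen for illustration, and build_preprocess_command is a hypothetical helper, not part of fairseq.

```python
import shlex


def build_preprocess_command(src, tgt, train_prefix, valid_prefix,
                             dest_dir="data-bin", workers=1):
    """Return the argument list for a fairseq-preprocess invocation."""
    return [
        "fairseq-preprocess",
        "--source-lang", src,
        "--target-lang", tgt,
        "--trainpref", train_prefix,
        "--validpref", valid_prefix,
        "--destdir", dest_dir,
        "--workers", str(workers),
    ]


cmd = build_preprocess_command("de", "en", "corpus/train", "corpus/valid", workers=4)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run; shlex.join is only used here to print a copy-pasteable command.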

fairseq-train

Train a new model on one or across multiple GPUs.

usage: fairseq-train [-h] [--no-progress-bar] [--log-interval N]
                     [--log-format {json,none,simple,tqdm}]
                     [--tensorboard-logdir DIR] [--tbmf-wrapper] [--seed N]
                     [--cpu] [--fp16] [--memory-efficient-fp16]
                     [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW]
                     [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                     [--min-loss-scale D]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                     [--user-dir USER_DIR]
                     [--criterion {composite_loss,legacy_masked_lm_loss,sentence_prediction,cross_entropy,adaptive_loss,masked_lm,label_smoothed_cross_entropy,sentence_ranking,binary_cross_entropy}]
                     [--tokenizer {space,nltk,moses}]
                     [--bpe {sentencepiece,subword_nmt,fastbpe,gpt2}]
                     [--optimizer {nag,adagrad,adam,adafactor,adamax,sgd,adadelta}]
                     [--lr-scheduler {fixed,polynomial_decay,triangular,reduce_lr_on_plateau,inverse_sqrt,cosine}]
                     [--task TASK] [--num-workers N]
                     [--skip-invalid-size-inputs-valid-test] [--max-tokens N]
                     [--max-sentences N] [--required-batch-size-multiple N]
                     [--dataset-impl FORMAT] [--train-subset SPLIT]
                     [--valid-subset SPLIT] [--validate-interval N]
                     [--disable-validation] [--max-tokens-valid N]
                     [--max-sentences-valid N] [--curriculum N]
                     [--distributed-world-size N]
                     [--distributed-rank DISTRIBUTED_RANK]
                     [--distributed-backend DISTRIBUTED_BACKEND]
                     [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                     [--distributed-port DISTRIBUTED_PORT]
                     [--device-id DEVICE_ID] [--distributed-no-spawn]
                     [--ddp-backend {c10d,no_c10d}] [--bucket-cap-mb MB]
                     [--fix-batches-to-gpus] [--find-unused-parameters] --arch
                     ARCH [--max-epoch N] [--max-update N] [--clip-norm NORM]
                     [--sentence-avg] [--update-freq N1,N2,...,N_K]
                     [--lr LR_1,LR_2,...,LR_N] [--min-lr LR] [--use-bmuf]
                     [--save-dir DIR] [--restore-file RESTORE_FILE]
                     [--reset-dataloader] [--reset-lr-scheduler]
                     [--reset-meters] [--reset-optimizer]
                     [--optimizer-overrides DICT] [--save-interval N]
                     [--save-interval-updates N] [--keep-interval-updates N]
                     [--keep-last-epochs N] [--no-save]
                     [--no-epoch-checkpoints] [--no-last-checkpoints]
                     [--no-save-optimizer-state]
                     [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                     [--maximize-best-checkpoint-metric]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 1000

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir

path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)

Default: “”

--tbmf-wrapper

[FB only]

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window

number of updates before increasing loss scale

--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001
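
The FP16 options above (--fp16-init-scale, --fp16-scale-window, --fp16-scale-tolerance, --min-loss-scale) describe a dynamic loss scale. The sketch below is one plausible reading of those descriptions, not fairseq's actual scaler: the class name LossScaleSketch, the doubling and halving factor of 2, and the exact bookkeeping are all assumptions.

```python
class LossScaleSketch:
    """Toy dynamic loss scaler; the factor of 2 is an assumption."""

    def __init__(self, scale_window, init_scale=128.0, tolerance=0.0, min_scale=1e-4):
        self.scale = init_scale
        self.scale_window = scale_window
        self.tolerance = tolerance
        self.min_scale = min_scale
        self.updates = 0       # updates since the last rescale
        self.overflows = 0     # overflowing updates since the last rescale
        self.clean_streak = 0  # consecutive updates without overflow

    def step(self, overflow):
        """Record one update, adjust the scale, and return the new scale."""
        self.updates += 1
        if overflow:
            self.overflows += 1
            self.clean_streak = 0
            # Shrink once the share of overflowing updates reaches the tolerance
            if self.overflows / self.updates >= self.tolerance:
                self.scale /= 2.0
                self.updates = self.overflows = 0
                if self.scale < self.min_scale:
                    raise FloatingPointError("loss scale fell below the minimum")
        else:
            self.clean_streak += 1
            # Grow after `scale_window` clean updates in a row
            if self.clean_streak % self.scale_window == 0:
                self.scale *= 2.0
                self.updates = self.overflows = 0
        return self.scale


scaler = LossScaleSketch(scale_window=2)
for overflow in [False, False, True]:
    print(scaler.step(overflow))
```

In this run the scale grows from 128 to 256 after two clean updates and drops back to 128 on the overflow.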

--threshold-loss-scale

threshold FP16 loss scale from below

--user-dir

path to a python module containing custom extensions (tasks and/or architectures)

--criterion

Possible choices: composite_loss, legacy_masked_lm_loss, sentence_prediction, cross_entropy, adaptive_loss, masked_lm, label_smoothed_cross_entropy, sentence_ranking, binary_cross_entropy

Default: “cross_entropy”

--tokenizer

Possible choices: space, nltk, moses

--bpe

Possible choices: sentencepiece, subword_nmt, fastbpe, gpt2

--optimizer

Possible choices: nag, adagrad, adam, adafactor, adamax, sgd, adadelta

Default: “nag”

--lr-scheduler

Possible choices: fixed, polynomial_decay, triangular, reduce_lr_on_plateau, inverse_sqrt, cosine

Default: “fixed”

--task

Possible choices: translation, multilingual_translation, semisupervised_translation, legacy_masked_lm, translation_moe, sentence_prediction, language_modeling, masked_lm, translation_from_pretrained_xlm, sentence_ranking, cross_lingual_lm, audio_pretraining

task

Default: “translation”

--dataset-impl

Possible choices: raw, lazy, cached, mmap

output dataset implementation

Dataset and data loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens

maximum number of tokens in a batch

--max-sentences, --batch-size

maximum number of sentences in a batch

--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8
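
The interaction between --max-tokens and --max-sentences can be pictured with a greedy batching sketch. It assumes that a batch's token count is its longest sentence times its number of sentences (that is, padding counts), which is an assumption rather than something stated above; make_batches is a hypothetical helper, and the rounding implied by --required-batch-size-multiple is deliberately left out.

```python
def make_batches(lengths, max_tokens=None, max_sentences=None):
    """Group sentence indices into batches that respect both limits."""
    # Sorting by length keeps padding small within each batch
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for idx in order:
        size = len(current) + 1
        # `order` is ascending, so the new sentence is the longest in the batch
        over_tokens = max_tokens is not None and lengths[idx] * size > max_tokens
        over_sentences = max_sentences is not None and size > max_sentences
        if current and (over_tokens or over_sentences):
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


lengths = [5, 3, 8, 2, 7, 4]
print(make_batches(lengths, max_tokens=16))
print(make_batches(lengths, max_sentences=2))
```

A sentence longer than max_tokens still ends up in a batch of its own here; filtering such inputs is a separate concern.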

--train-subset

Possible choices: train, valid, test

data subset to use for training (train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (train, valid, valid1, test, test1)

Default: “valid”

--validate-interval

validate every N epochs

Default: 1

--disable-validation

disable validation

Default: False

--max-tokens-valid

maximum number of tokens in a validation batch (defaults to --max-tokens)

--max-sentences-valid

maximum number of sentences in a validation batch (defaults to --max-sentences)

--curriculum

don’t shuffle batches for first N epochs

Default: 0

Distributed training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method

typically tcp://hostname:port that will be used to establish initial connection

--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--distributed-no-spawn

do not spawn multiple processes even if multiple GPUs are visible

Default: False

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

--find-unused-parameters

disable unused parameter detection (not applicable to no_c10d ddp-backend)

Default: False

Model configuration

--arch, -a

Possible choices: fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_from_pretrained_xlm, transformer_lm, transformer_lm_big, transformer_lm_baevski_wiki103, transformer_lm_wiki103, transformer_lm_baevski_gbw, transformer_lm_gbw, transformer_lm_gpt, transformer_lm_gpt2_small, transformer_lm_gpt2_medium, transformer_lm_gpt2_big, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, fconv_self_att, fconv_self_att_wp, masked_lm, bert_base, bert_large, xlm_base, multilingual_transformer, multilingual_transformer_iwslt_de_en, wav2vec, roberta, roberta_base, roberta_large, fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, lightconv_lm, lightconv_lm_gbw

Model Architecture

Default: “fconv”

Optimization

--max-epoch, --me

force stop training at specified epoch

Default: 0

--max-update, --mu

force stop training at specified update

Default: 0

--clip-norm

clip threshold of gradients

Default: 25

--sentence-avg

normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens)

Default: False

--update-freq

update parameters every N_i batches, when in epoch i

Default: 1

--lr, --learning-rate

learning rate for the first N epochs; all epochs >N using LR_N (note: this may be interpreted differently depending on --lr-scheduler)

Default: 0.25
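
The per-epoch list semantics of --lr ("all epochs >N using LR_N") can be sketched as a simple lookup. Applying the same rule to --update-freq is an assumption made here for illustration, and value_for_epoch is a hypothetical helper; as noted above, a particular --lr-scheduler may interpret the list differently.

```python
def value_for_epoch(values, epoch):
    """Pick the entry for a 1-based epoch, reusing the last entry afterwards."""
    if epoch < 1:
        raise ValueError("epochs are counted from 1")
    return values[min(epoch, len(values)) - 1]


learning_rates = [0.25, 0.1]   # e.g. --lr 0.25,0.1
update_freqs = [4, 2, 1]       # e.g. --update-freq 4,2,1

for epoch in range(1, 5):
    print(epoch, value_for_epoch(learning_rates, epoch), value_for_epoch(update_freqs, epoch))
```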

--min-lr

stop training when the learning rate reaches this minimum

Default: -1

--use-bmuf

specify global optimizer for syncing models on different GPUs/shards

Default: False

Checkpointing

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename from which to load checkpoint (default: <save-dir>/checkpoint_last.pt)

Default: “checkpoint_last.pt”

--reset-dataloader

if set, does not reload dataloader state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--reset-meters

if set, does not load meters from the checkpoint

Default: False

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with --save-interval-updates

Default: -1
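
As an illustration of "keep the last N checkpoints", the sketch below decides which update-based checkpoints would be pruned. It assumes that non-positive values (including the default of -1) mean "keep everything"; that reading and the helper name checkpoints_to_delete are assumptions, not fairseq's documented behavior.

```python
def checkpoints_to_delete(update_numbers, keep_last):
    """Return the update numbers whose checkpoints would be removed."""
    if keep_last <= 0:
        return []  # assumed meaning: no pruning
    newest_first = sorted(update_numbers, reverse=True)
    return newest_first[keep_last:]


saved = [1000, 2000, 3000, 4000]
print(checkpoints_to_delete(saved, keep_last=2))   # [2000, 1000]
print(checkpoints_to_delete(saved, keep_last=-1))  # []
```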

--keep-last-epochs

keep last N epoch checkpoints

Default: -1

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--no-last-checkpoints

don’t store last checkpoints

Default: False

--no-save-optimizer-state

don’t save optimizer-state as part of checkpoint

Default: False

--best-checkpoint-metric

metric to use for saving “best” checkpoints

Default: “loss”

--maximize-best-checkpoint-metric

select the largest metric value for saving “best” checkpoints

Default: False

fairseq-generate

fairseq-interactive

Translate raw text with a trained model. Batches data on-the-fly.

usage: fairseq-interactive [-h] [--no-progress-bar] [--log-interval N]
                           [--log-format {json,none,simple,tqdm}]
                           [--tensorboard-logdir DIR] [--tbmf-wrapper]
                           [--seed N] [--cpu] [--fp16]
                           [--memory-efficient-fp16]
                           [--fp16-init-scale FP16_INIT_SCALE]
                           [--fp16-scale-window FP16_SCALE_WINDOW]
                           [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                           [--min-loss-scale D]
                           [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                           [--user-dir USER_DIR]
                           [--criterion {composite_loss,legacy_masked_lm_loss,sentence_prediction,cross_entropy,adaptive_loss,masked_lm,label_smoothed_cross_entropy,sentence_ranking,binary_cross_entropy}]
                           [--tokenizer {space,nltk,moses}]
                           [--bpe {sentencepiece,subword_nmt,fastbpe,gpt2}]
                           [--optimizer {nag,adagrad,adam,adafactor,adamax,sgd,adadelta}]
                           [--lr-scheduler {fixed,polynomial_decay,triangular,reduce_lr_on_plateau,inverse_sqrt,cosine}]
                           [--task TASK] [--num-workers N]
                           [--skip-invalid-size-inputs-valid-test]
                           [--max-tokens N] [--max-sentences N]
                           [--required-batch-size-multiple N]
                           [--dataset-impl FORMAT] [--gen-subset SPLIT]
                           [--num-shards N] [--shard-id ID] [--path FILE]
                           [--remove-bpe [REMOVE_BPE]] [--quiet]
                           [--model-overrides DICT] [--results-path RESDIR]
                           [--beam N] [--nbest N] [--max-len-a N]
                           [--max-len-b N] [--min-len N] [--match-source-len]
                           [--no-early-stop] [--unnormalized]
                           [--no-beamable-mm] [--lenpen LENPEN]
                           [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                           [--sacrebleu] [--score-reference]
                           [--prefix-size PS] [--no-repeat-ngram-size N]
                           [--sampling] [--sampling-topk PS]
                           [--sampling-topp PS] [--temperature N]
                           [--diverse-beam-groups N]
                           [--diverse-beam-strength N] [--print-alignment]
                           [--buffer-size N] [--input FILE]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 1000

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir

path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)

Default: “”

--tbmf-wrapper

[FB only]

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window

number of updates before increasing loss scale

--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale

threshold FP16 loss scale from below

--user-dir

path to a python module containing custom extensions (tasks and/or architectures)

--criterion

Possible choices: composite_loss, legacy_masked_lm_loss, sentence_prediction, cross_entropy, adaptive_loss, masked_lm, label_smoothed_cross_entropy, sentence_ranking, binary_cross_entropy

Default: “cross_entropy”

--tokenizer

Possible choices: space, nltk, moses

--bpe

Possible choices: sentencepiece, subword_nmt, fastbpe, gpt2

--optimizer

Possible choices: nag, adagrad, adam, adafactor, adamax, sgd, adadelta

Default: “nag”

--lr-scheduler

Possible choices: fixed, polynomial_decay, triangular, reduce_lr_on_plateau, inverse_sqrt, cosine

Default: “fixed”

--task

Possible choices: translation, multilingual_translation, semisupervised_translation, legacy_masked_lm, translation_moe, sentence_prediction, language_modeling, masked_lm, translation_from_pretrained_xlm, sentence_ranking, cross_lingual_lm, audio_pretraining

task

Default: “translation”

--dataset-impl

Possible choices: raw, lazy, cached, mmap

output dataset implementation

Dataset and data loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens

maximum number of tokens in a batch

--max-sentences, --batch-size

maximum number of sentences in a batch

--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0
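
The relationship between --num-shards and --shard-id can be sketched as a partition of the work items. The round-robin split used here is an assumption about how items might be divided (shard is a hypothetical helper); the only constraint taken from the text above is id < num_shards.

```python
def shard(items, num_shards=1, shard_id=0):
    """Return the slice of `items` handled by one shard (round-robin split)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    return items[shard_id::num_shards]


sentences = list(range(7))
for shard_id in range(3):
    print(shard_id, shard(sentences, num_shards=3, shard_id=shard_id))
```

Every item lands in exactly one shard, so running all shard ids covers the full subset once.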

Generation

--path

path(s) to model file(s), colon separated

--remove-bpe

remove BPE tokens before scoring (can be set to sentencepiece)

--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path

path to save eval results (optional)

--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200
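
The length limit set by --max-len-a and --max-len-b is the linear formula ax + b from the descriptions above. A tiny sketch with the documented defaults (max_target_length is a hypothetical helper, and truncating to an integer is an assumption):

```python
def max_target_length(source_length, max_len_a=0, max_len_b=200):
    """Upper bound on generated length: a * x + b, with x the source length."""
    return int(max_len_a * source_length + max_len_b)


print(max_target_length(30))                               # 200
print(max_target_length(30, max_len_a=1.5, max_len_b=10))  # 55
```

With the defaults the bound is a constant 200 tokens regardless of the source length.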

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--no-early-stop

deprecated

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1
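
One common way a length penalty is applied, assumed here for illustration rather than taken from fairseq's code, is to divide a hypothesis's summed log-probability by its length raised to lenpen. The sketch shows why values below 1.0 favor shorter outputs and values above 1.0 favor longer ones; the hypothesis scores are made up.

```python
def normalized_score(sum_log_prob, length, lenpen=1.0):
    """Length-normalized score; log-probabilities are negative numbers."""
    return sum_log_prob / (length ** lenpen)


def pick_best(hypotheses, lenpen=1.0):
    """hypotheses: list of (name, sum_log_prob, length) tuples."""
    return max(hypotheses, key=lambda h: normalized_score(h[1], h[2], lenpen))[0]


hyps = [("short", -5.0, 5), ("long", -8.0, 10)]
print(pick_best(hyps, lenpen=0.5))  # short
print(pick_best(hyps, lenpen=1.5))  # long
```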

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk

perform unknown replacement (optionally with alignment dictionary)

--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking such that this size ngram cannot be repeated in the generation

Default: 0
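
N-gram blocking can be sketched as computing, at each step, the set of next tokens that would complete an n-gram already present in the output so far. This is a minimal illustration of the idea, not fairseq's implementation; banned_next_tokens is a hypothetical helper, and treating the default of 0 as "disabled" is an assumption.

```python
def banned_next_tokens(prefix, n):
    """Tokens that would repeat an n-gram already present in `prefix`."""
    if n <= 0 or len(prefix) < n - 1:
        return set()
    # The last n-1 tokens form the start of the n-gram being completed
    context = tuple(prefix[len(prefix) - (n - 1):])
    banned = set()
    for start in range(len(prefix) - n + 1):
        ngram = tuple(prefix[start:start + n])
        if ngram[:-1] == context:
            banned.add(ngram[-1])
    return banned


print(banned_next_tokens([1, 2, 3, 1, 2], n=3))  # {3}
```

Here the output already contains the trigram (1, 2, 3), so after generating "1 2" again the token 3 is blocked.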

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-topp

sample from the smallest set whose cumulative probability mass exceeds p for next words

Default: -1.0
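
The candidate sets behind --sampling-topk and --sampling-topp can be sketched over a plain probability list. The helpers top_k and nucleus are hypothetical names, and treating non-positive values (such as the defaults) as "no restriction" is an assumption made for this example.

```python
def ranked_ids(probs):
    """Token ids ordered from most to least likely."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def top_k(probs, k):
    """Ids of the k most likely tokens (all ids when k <= 0)."""
    ranked = ranked_ids(probs)
    return ranked if k <= 0 else ranked[:k]


def nucleus(probs, p):
    """Smallest set of ids whose cumulative probability exceeds p."""
    ranked = ranked_ids(probs)
    if p <= 0:
        return ranked  # assumed meaning of the default: no restriction
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total > p:
            break
    return kept


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k(probs, 2))      # [0, 1]
print(nucleus(probs, 0.7))  # [0, 1]
```

Sampling would then draw the next word from the kept ids after renormalizing their probabilities.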

--temperature

temperature for generation

Default: 1.0
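
Temperature is usually applied by dividing the logits by the temperature before the softmax; the sketch below assumes that convention (it is not confirmed by the text above) to show how lower values sharpen the distribution and higher values flatten it.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```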

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--print-alignment

if set, uses attention feedback to compute and print alignment to source tokens

Default: False

Interactive

--buffer-size

read this many sentences into a buffer before processing them

Default: 0

--input

file to read from; use - for stdin

Default: “-”

fairseq-score

fairseq-eval-lm

Evaluate the perplexity of a trained language model.

usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval N]
                       [--log-format {json,none,simple,tqdm}]
                       [--tensorboard-logdir DIR] [--tbmf-wrapper] [--seed N]
                       [--cpu] [--fp16] [--memory-efficient-fp16]
                       [--fp16-init-scale FP16_INIT_SCALE]
                       [--fp16-scale-window FP16_SCALE_WINDOW]
                       [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                       [--min-loss-scale D]
                       [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                       [--user-dir USER_DIR]
                       [--criterion {composite_loss,legacy_masked_lm_loss,sentence_prediction,cross_entropy,adaptive_loss,masked_lm,label_smoothed_cross_entropy,sentence_ranking,binary_cross_entropy}]
                       [--tokenizer {space,nltk,moses}]
                       [--bpe {sentencepiece,subword_nmt,fastbpe,gpt2}]
                       [--optimizer {nag,adagrad,adam,adafactor,adamax,sgd,adadelta}]
                       [--lr-scheduler {fixed,polynomial_decay,triangular,reduce_lr_on_plateau,inverse_sqrt,cosine}]
                       [--task TASK] [--num-workers N]
                       [--skip-invalid-size-inputs-valid-test]
                       [--max-tokens N] [--max-sentences N]
                       [--required-batch-size-multiple N]
                       [--dataset-impl FORMAT] [--gen-subset SPLIT]
                       [--num-shards N] [--shard-id ID] [--path FILE]
                       [--remove-bpe [REMOVE_BPE]] [--quiet]
                       [--model-overrides DICT] [--results-path RESDIR]
                       [--output-word-probs] [--output-word-stats]
                       [--context-window N] [--softmax-batch N]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 1000

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir

path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)

Default: “”

--tbmf-wrapper

[FB only]

Default: False

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window

number of updates before increasing loss scale

--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale

threshold FP16 loss scale from below

--user-dir

path to a python module containing custom extensions (tasks and/or architectures)

--criterion

Possible choices: composite_loss, legacy_masked_lm_loss, sentence_prediction, cross_entropy, adaptive_loss, masked_lm, label_smoothed_cross_entropy, sentence_ranking, binary_cross_entropy

Default: “cross_entropy”

--tokenizer

Possible choices: space, nltk, moses

--bpe

Possible choices: sentencepiece, subword_nmt, fastbpe, gpt2

--optimizer

Possible choices: nag, adagrad, adam, adafactor, adamax, sgd, adadelta

Default: “nag”

--lr-scheduler

Possible choices: fixed, polynomial_decay, triangular, reduce_lr_on_plateau, inverse_sqrt, cosine

Default: “fixed”

--task

Possible choices: translation, multilingual_translation, semisupervised_translation, legacy_masked_lm, translation_moe, sentence_prediction, language_modeling, masked_lm, translation_from_pretrained_xlm, sentence_ranking, cross_lingual_lm, audio_pretraining

task

Default: “language_modeling”

--dataset-impl

Possible choices: raw, lazy, cached, mmap

output dataset implementation

Dataset and data loading

--num-workers

how many subprocesses to use for data loading

Default: 1

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens

maximum number of tokens in a batch

--max-sentences, --batch-size

maximum number of sentences in a batch

--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

LM Evaluation

--path

path(s) to model file(s), colon separated

--remove-bpe

remove BPE tokens before scoring (can be set to sentencepiece)

--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path

path to save eval results (optional)

--output-word-probs

if set, outputs words and their predicted log probabilities to standard output

Default: False

--output-word-stats

if set, outputs word statistics such as word count, average probability, etc

Default: False

--context-window

ensures that every evaluated token has access to a context of at least this size, if possible

Default: 0

--softmax-batch

if BxT is more than this, will batch the softmax over vocab to this amount of tokens in order to fit into GPU memory

Default: 9223372036854775807
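
The default above equals 2**63 - 1, so in practice the softmax is only split when a smaller value is given. A rough sketch of the chunking idea, with B x T flattened into a single token count (softmax_chunks is a hypothetical helper and not fairseq's code):

```python
def softmax_chunks(num_tokens, softmax_batch=2**63 - 1):
    """Split `num_tokens` positions into (start, end) ranges of bounded size."""
    return [
        (start, min(start + softmax_batch, num_tokens))
        for start in range(0, num_tokens, softmax_batch)
    ]


print(softmax_chunks(10, softmax_batch=4))  # [(0, 4), (4, 8), (8, 10)]
print(softmax_chunks(10))                   # [(0, 10)]
```

Each range would be passed through the softmax over the vocabulary separately, trading some speed for a lower peak memory footprint.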