Command-line Tools

Fairseq provides several command-line tools for training and evaluating models:

fairseq-preprocess

Data pre-processing: build vocabularies and binarize training data.

usage: fairseq-preprocess [-h] [--no-progress-bar] [--log-interval N]
                          [--log-format {json,none,simple,tqdm}]
                          [--tensorboard-logdir DIR] [--seed N] [--cpu]
                          [--fp16] [--memory-efficient-fp16]
                          [--fp16-init-scale FP16_INIT_SCALE]
                          [--fp16-scale-window FP16_SCALE_WINDOW]
                          [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                          [--min-loss-scale D]
                          [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                          [--user-dir USER_DIR] [--task TASK] [-s SRC]
                          [-t TARGET] [--trainpref FP] [--validpref FP]
                          [--testpref FP] [--destdir DIR] [--thresholdtgt N]
                          [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
                          [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN]
                          [--output-format FORMAT] [--joined-dictionary]
                          [--only-source] [--padding-factor N] [--workers N]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 1000

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir

path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)

Default: “”

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window

number of updates before increasing loss scale

--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale

threshold FP16 loss scale from below

--user-dir

path to a python module containing custom extensions (tasks and/or architectures)

--task

Possible choices: multilingual_translation, semisupervised_translation, translation, translation_moe, language_modeling, translation_from_pretrained_xlm, cross_lingual_lm

task

Default: “translation”

Preprocessing

-s, --source-lang

source language

-t, --target-lang

target language

--trainpref

train file prefix

--validpref

comma separated, valid file prefixes

--testpref

comma separated, test file prefixes

--destdir

destination dir

Default: “data-bin”

--thresholdtgt

map words appearing less than threshold times to unknown

Default: 0

--thresholdsrc

map words appearing less than threshold times to unknown

Default: 0

--tgtdict

reuse given target dictionary

--srcdict

reuse given source dictionary

--nwordstgt

number of target words to retain

Default: -1

--nwordssrc

number of source words to retain

Default: -1
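
The threshold and word-count options above can be illustrated with a minimal Python sketch. This is not fairseq's own code: `build_vocab` and the toy counts are hypothetical, and the sketch assumes that the threshold is applied before the size cap and that a non-positive word count means "keep every remaining word".

```python
from collections import Counter

def build_vocab(counts, threshold=0, nwords=-1):
    """Return the words to keep, most frequent first.

    Assumptions (not taken from fairseq's source): words seen fewer than
    `threshold` times are dropped (they would map to unknown), and a
    non-positive `nwords` means there is no cap on the vocabulary size.
    """
    kept = [(word, n) for word, n in counts.most_common() if n >= threshold]
    if nwords > 0:
        kept = kept[:nwords]
    return [word for word, _ in kept]

# Toy corpus: "the" occurs three times, every other word once.
counts = Counter("the cat sat on the mat the end".split())
frequent_only = build_vocab(counts, threshold=2)
top_two = build_vocab(counts, nwords=2)
```

With the defaults (threshold 0, word count -1) the sketch keeps every word it has seen.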

--alignfile

an alignment file (optional)

--output-format

Possible choices: binary, raw

output format (optional)

Default: “binary”

--joined-dictionary

Generate joined dictionary

Default: False

--only-source

Only process the source language

Default: False

--padding-factor

Pad dictionary size to be multiple of N

Default: 8
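
As a worked example of the padding arithmetic, the following sketch rounds a dictionary size up to the next multiple of N. The function name `padded_size` is hypothetical, and treating a factor of 1 or less as "no padding" is an assumption.

```python
def padded_size(vocab_size, padding_factor=8):
    """Round vocab_size up to the next multiple of padding_factor."""
    if padding_factor <= 1:
        # Assumption: a factor of 1 or less leaves the size unchanged.
        return vocab_size
    remainder = vocab_size % padding_factor
    if remainder == 0:
        return vocab_size
    return vocab_size + padding_factor - remainder
```

For instance, a dictionary of 10001 entries would grow to 10008 with the default factor of 8, while one of 10000 entries is already a multiple of 8 and stays as it is.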

--workers

number of parallel workers

Default: 1
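
To show how the preprocessing options fit together, here is a small Python sketch that assembles a fairseq-preprocess command line from the flags documented above. The helper `preprocess_command`, the language pair, and the `corpus/` paths are hypothetical; the sketch also assumes that each prefix is completed with the language code (for example `corpus/train.de`), which you should check against your own data layout.

```python
import shlex

def preprocess_command(src, tgt, prefix_dir, destdir="data-bin", workers=1):
    """Build the argument list for a fairseq-preprocess invocation."""
    return [
        "fairseq-preprocess",
        "--source-lang", src,
        "--target-lang", tgt,
        "--trainpref", f"{prefix_dir}/train",  # hypothetical file prefix
        "--validpref", f"{prefix_dir}/valid",
        "--testpref", f"{prefix_dir}/test",
        "--destdir", destdir,
        "--workers", str(workers),
    ]

cmd = preprocess_command("de", "en", "corpus", workers=4)
command_line = shlex.join(cmd)  # shell-safe string, ready to print or log
```

Keeping the arguments as a list lets the same value be passed to `subprocess.run(cmd, check=True)` without any shell quoting concerns.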

fairseq-train

Train a new model on one or multiple GPUs.

usage: fairseq-train [-h] [--no-progress-bar] [--log-interval N]
                     [--log-format {json,none,simple,tqdm}]
                     [--tensorboard-logdir DIR] [--seed N] [--cpu] [--fp16]
                     [--memory-efficient-fp16]
                     [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW]
                     [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                     [--min-loss-scale D]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                     [--user-dir USER_DIR] [--task TASK] [--num-workers N]
                     [--skip-invalid-size-inputs-valid-test] [--max-tokens N]
                     [--max-sentences N] [--required-batch-size-multiple N]
                     [--train-subset SPLIT] [--valid-subset SPLIT]
                     [--max-sentences-valid N] [--curriculum N]
                     [--distributed-world-size N]
                     [--distributed-rank DISTRIBUTED_RANK]
                     [--distributed-backend DISTRIBUTED_BACKEND]
                     [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                     [--distributed-port DISTRIBUTED_PORT]
                     [--device-id DEVICE_ID] [--ddp-backend {c10d,no_c10d}]
                     [--bucket-cap-mb MB] [--fix-batches-to-gpus] --arch ARCH
                     [--criterion CRIT] [--max-epoch N] [--max-update N]
                     [--clip-norm NORM] [--sentence-avg]
                     [--update-freq N1,N2,...,N_K] [--optimizer OPT]
                     [--lr LR_1,LR_2,...,LR_N] [--momentum M]
                     [--weight-decay WD]
                     [--lr-scheduler {fixed,triangular,reduce_lr_on_plateau,inverse_sqrt,cosine}]
                     [--lr-shrink LS] [--min-lr LR] [--save-dir DIR]
                     [--restore-file RESTORE_FILE] [--reset-optimizer]
                     [--reset-lr-scheduler] [--optimizer-overrides DICT]
                     [--save-interval N] [--save-interval-updates N]
                     [--keep-interval-updates N] [--keep-last-epochs N]
                     [--no-save] [--no-epoch-checkpoints]
                     [--validate-interval N]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 1000

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir

path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)

Default: “”

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window

number of updates before increasing loss scale

--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001
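
The FP16 loss-scale options describe a dynamic loss-scaling loop. The sketch below is a simplified illustration, not fairseq's implementation: the class name, the factor of 2, and the decision to shrink on every overflow (which corresponds to a tolerance of 0.0) are assumptions, and the window length has to be supplied by the caller.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler for illustration only."""

    def __init__(self, scale_window, init_scale=128.0,
                 min_loss_scale=1e-4, factor=2.0):
        self.scale = init_scale
        self.scale_window = scale_window      # clean updates before growing
        self.min_loss_scale = min_loss_scale  # stop training below this
        self.factor = factor                  # assumed grow/shrink factor
        self._clean_updates = 0               # updates since last overflow

    def update(self, overflow):
        """Adjust the scale after one update and return the new value."""
        if overflow:
            # Gradients overflowed: shrink the scale and restart the window.
            self.scale /= self.factor
            self._clean_updates = 0
            if self.scale < self.min_loss_scale:
                raise FloatingPointError("loss scale fell below the minimum")
        else:
            self._clean_updates += 1
            if self._clean_updates % self.scale_window == 0:
                self.scale *= self.factor
        return self.scale
```

With a window of 2, two clean updates raise the scale from 128 to 256, and a single overflow brings it back to 128.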

--threshold-loss-scale

threshold FP16 loss scale from below

--user-dir

path to a python module containing custom extensions (tasks and/or architectures)

--task

Possible choices: multilingual_translation, semisupervised_translation, translation, translation_moe, language_modeling, translation_from_pretrained_xlm, cross_lingual_lm

task

Default: “translation”

Dataset and data loading

--num-workers

how many subprocesses to use for data loading

Default: 0

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens

maximum number of tokens in a batch

--max-sentences, --batch-size

maximum number of sentences in a batch

--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--train-subset

Possible choices: train, valid, test

data subset to use for training (train, valid, test)

Default: “train”

--valid-subset

comma separated list of data subsets to use for validation (train, valid, valid1, test, test1)

Default: “valid”

--max-sentences-valid

maximum number of sentences in a validation batch (defaults to --max-sentences)

--curriculum

don’t shuffle batches for first N epochs

Default: 0

Distributed training

--distributed-world-size

total number of GPUs across all nodes (default: all visible GPUs)

Default: 1

--distributed-rank

rank of the current worker

Default: 0

--distributed-backend

distributed backend

Default: “nccl”

--distributed-init-method

typically tcp://hostname:port that will be used to establish initial connection

--distributed-port

port number (not required if using --distributed-init-method)

Default: -1

--device-id, --local_rank

which GPU to use (usually configured automatically)

Default: 0

--ddp-backend

Possible choices: c10d, no_c10d

DistributedDataParallel backend

Default: “c10d”

--bucket-cap-mb

bucket size for reduction

Default: 25

--fix-batches-to-gpus

don’t shuffle batches between GPUs; this reduces overall randomness and may affect precision but avoids the cost of re-reading the data

Default: False

Model configuration

--arch, -a

Possible choices: fconv_lm, fconv_lm_dauphin_wikitext103, fconv_lm_dauphin_gbw, fconv, fconv_iwslt_de_en, fconv_wmt_en_ro, fconv_wmt_en_de, fconv_wmt_en_fr, lightconv_lm, lightconv_lm_gbw, lightconv, lightconv_iwslt_de_en, lightconv_wmt_en_de, lightconv_wmt_en_de_big, lightconv_wmt_en_fr_big, lightconv_wmt_zh_en_big, transformer_lm, transformer_lm_big, transformer_lm_wiki103, transformer_lm_gbw, transformer, transformer_iwslt_de_en, transformer_wmt_en_de, transformer_vaswani_wmt_en_de_big, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de_big, transformer_wmt_en_de_big_t2t, transformer_from_pretrained_xlm, lstm, lstm_wiseman_iwslt_de_en, lstm_luong_wmt_en_de, fconv_self_att, fconv_self_att_wp, bert_base, xlm_base, multilingual_transformer, multilingual_transformer_iwslt_de_en

Model Architecture

Default: “fconv”

--criterion

Possible choices: composite_loss, cross_entropy, adaptive_loss, label_smoothed_cross_entropy, masked_lm_loss

Training Criterion

Default: “cross_entropy”

Optimization

--max-epoch, --me

force stop training at specified epoch

Default: 0

--max-update, --mu

force stop training at specified update

Default: 0

--clip-norm

clip threshold of gradients

Default: 25
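
Gradient clipping by norm can be sketched in a few lines of plain Python. This is an illustration of the general technique rather than fairseq's code: the function name is hypothetical, gradients are represented as a flat list of floats, and treating a non-positive threshold as "no clipping" is an assumption.

```python
import math

def clip_by_global_norm(grads, clip_norm=25.0):
    """Rescale grads so their L2 norm does not exceed clip_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if clip_norm > 0 and norm > clip_norm:
        scale = clip_norm / norm
        grads = [g * scale for g in grads]
    return grads, norm

clipped, norm_before = clip_by_global_norm([30.0, 40.0])
```

Here the gradient has norm 50, so with the default threshold of 25 every component is halved; a gradient whose norm is already below the threshold is returned unchanged.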

--sentence-avg

normalize gradients by the number of sentences in a batch (default is to normalize by number of tokens)

Default: False

--update-freq

update parameters every N_i batches, when in epoch i

Default: 1
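
The per-epoch schedule N1,N2,...,N_K can be read as follows. The sketch is hedged: the helper names are hypothetical, and reusing the last value for epochs beyond K is an assumption made by analogy with the --lr description. The effective batch size formula likewise assumes one batch per GPU per step.

```python
def parse_schedule(text, cast=int):
    """Turn a comma-separated option value such as "4,2,1" into a list."""
    return [cast(item) for item in text.split(",")]

def value_for_epoch(schedule, epoch):
    """Pick the entry for a 1-based epoch; later epochs reuse the last one."""
    return schedule[min(epoch, len(schedule)) - 1]

def effective_batch_size(max_sentences, num_gpus, update_freq):
    """Sentences contributing to one parameter update (assumed formula)."""
    return max_sentences * num_gpus * update_freq

update_freq = parse_schedule("4,2,1")
first_epoch = value_for_epoch(update_freq, 1)
tenth_epoch = value_for_epoch(update_freq, 10)
```

Accumulating gradients over several batches in this way approximates training with more GPUs: for example, 32 sentences per batch on 2 GPUs with an update frequency of 4 behaves like a batch of 256 sentences.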

--optimizer

Possible choices: nag, adagrad, adam, adafactor, sgd, adadelta

Optimizer

Default: “nag”

--lr, --learning-rate

learning rate for the first N epochs; all epochs >N using LR_N (note: this may be interpreted differently depending on --lr-scheduler)

Default: 0.25

--momentum

momentum factor

Default: 0.99

--weight-decay, --wd

weight decay

Default: 0.0

--lr-scheduler

Possible choices: fixed, triangular, reduce_lr_on_plateau, inverse_sqrt, cosine

Learning Rate Scheduler

Default: “reduce_lr_on_plateau”

--lr-shrink

learning rate shrink factor for annealing, lr_new = (lr * lr_shrink)

Default: 0.1
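
The annealing rule lr_new = (lr * lr_shrink) can be traced with a short sketch. The function name is hypothetical, and when a given scheduler actually applies the shrink is not modelled here.

```python
import math

def lr_trajectory(lr, lr_shrink=0.1, steps=3):
    """Return the learning rate before and after each of `steps` shrinks."""
    values = [lr]
    for _ in range(steps):
        lr = lr * lr_shrink  # lr_new = lr * lr_shrink
        values.append(lr)
    return values

trajectory = lr_trajectory(0.25, lr_shrink=0.1, steps=2)
```

Starting from the default learning rate of 0.25 with the default shrink factor, two annealing steps give roughly 0.025 and then 0.0025.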

--min-lr

minimum learning rate

Default: -1

Checkpointing

--save-dir

path to save checkpoints

Default: “checkpoints”

--restore-file

filename in save-dir from which to load checkpoint

Default: “checkpoint_last.pt”

--reset-optimizer

if set, does not load optimizer state from the checkpoint

Default: False

--reset-lr-scheduler

if set, does not load lr scheduler state from the checkpoint

Default: False

--optimizer-overrides

a dictionary used to override optimizer args when loading a checkpoint

Default: “{}”

--save-interval

save a checkpoint every N epochs

Default: 1

--save-interval-updates

save a checkpoint (and validate) every N updates

Default: 0

--keep-interval-updates

keep the last N checkpoints saved with --save-interval-updates

Default: -1

--keep-last-epochs

keep last N epoch checkpoints

Default: -1
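
The two retention options follow the same "keep the last N" idea, sketched below. The function is hypothetical and works on plain identifiers (update counts or epoch numbers) rather than real checkpoint files; reading the default of -1 as "keep everything" is an assumption.

```python
def checkpoints_to_delete(saved, keep_last):
    """Return the identifiers that fall outside the last `keep_last` saves.

    `saved` lists checkpoints in the order they were written, oldest first.
    A negative `keep_last` is assumed to disable pruning.
    """
    if keep_last < 0:
        return []
    return saved[:max(len(saved) - keep_last, 0)]

stale = checkpoints_to_delete([1000, 2000, 3000, 4000], keep_last=2)
```

In this example the checkpoints written at updates 1000 and 2000 would be removed, leaving the two most recent ones on disk.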

--no-save

don’t save models or checkpoints

Default: False

--no-epoch-checkpoints

only store last and best checkpoints

Default: False

--validate-interval

validate every N epochs

Default: 1

fairseq-generate

fairseq-interactive

Translate raw text with a trained model. Batches data on-the-fly.

usage: fairseq-interactive [-h] [--no-progress-bar] [--log-interval N]
                           [--log-format {json,none,simple,tqdm}]
                           [--tensorboard-logdir DIR] [--seed N] [--cpu]
                           [--fp16] [--memory-efficient-fp16]
                           [--fp16-init-scale FP16_INIT_SCALE]
                           [--fp16-scale-window FP16_SCALE_WINDOW]
                           [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                           [--min-loss-scale D]
                           [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                           [--user-dir USER_DIR] [--task TASK]
                           [--num-workers N]
                           [--skip-invalid-size-inputs-valid-test]
                           [--max-tokens N] [--max-sentences N]
                           [--required-batch-size-multiple N]
                           [--gen-subset SPLIT] [--num-shards N]
                           [--shard-id ID] [--path FILE]
                           [--remove-bpe [REMOVE_BPE]] [--quiet]
                           [--model-overrides DICT] [--results-path RESDIR]
                           [--beam N] [--nbest N] [--max-len-a N]
                           [--max-len-b N] [--min-len N] [--match-source-len]
                           [--no-early-stop] [--unnormalized]
                           [--no-beamable-mm] [--lenpen LENPEN]
                           [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]]
                           [--sacrebleu] [--score-reference]
                           [--prefix-size PS] [--no-repeat-ngram-size N]
                           [--sampling] [--sampling-topk PS]
                           [--sampling-temperature N]
                           [--diverse-beam-groups N]
                           [--diverse-beam-strength N] [--print-alignment]
                           [--buffer-size N] [--input FILE]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 1000

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir

path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)

Default: “”

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window

number of updates before increasing loss scale

--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale

threshold FP16 loss scale from below

--user-dir

path to a python module containing custom extensions (tasks and/or architectures)

--task

Possible choices: multilingual_translation, semisupervised_translation, translation, translation_moe, language_modeling, translation_from_pretrained_xlm, cross_lingual_lm

task

Default: “translation”

Dataset and data loading

--num-workers

how many subprocesses to use for data loading

Default: 0

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens

maximum number of tokens in a batch

--max-sentences, --batch-size

maximum number of sentences in a batch

--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0
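
Sharding splits the work so that several processes can each generate a disjoint part of the data. The sketch below uses a round-robin split, which is one plausible scheme and an assumption here; the function name is hypothetical.

```python
def shard(items, num_shards=1, shard_id=0):
    """Return the items handled by shard `shard_id` out of `num_shards`."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    # Round-robin assignment: item i goes to shard i % num_shards.
    return items[shard_id::num_shards]

sentences = list(range(7))
parts = [shard(sentences, num_shards=3, shard_id=i) for i in range(3)]
```

Every item lands in exactly one shard, so running one process per shard id covers the whole subset once.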

Generation

--path

path(s) to model file(s), colon separated

--remove-bpe

remove BPE tokens before scoring (can be set to sentencepiece)

--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path

path to save eval results (optional)

--beam

beam size

Default: 5

--nbest

number of hypotheses to output

Default: 1

--max-len-a

generate sequences of maximum length ax + b, where x is the source length

Default: 0

--max-len-b

generate sequences of maximum length ax + b, where x is the source length

Default: 200
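
The two length options combine into a single bound, ax + b, where a is --max-len-a, b is --max-len-b, and x is the source length. A minimal sketch follows; the function name is hypothetical, truncating to an integer is an assumption, and a model's own positional limit may cap the result further.

```python
def max_generation_length(src_len, max_len_a=0.0, max_len_b=200):
    """Upper bound on output length for a source of `src_len` tokens."""
    return int(max_len_a * src_len + max_len_b)

default_bound = max_generation_length(30)
scaled_bound = max_generation_length(20, max_len_a=1.5, max_len_b=10)
```

With the defaults (a = 0, b = 200) the bound is 200 tokens regardless of the source length; with a = 1.5 and b = 10, a 20-token source allows up to 40 output tokens.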

--min-len

minimum generation length

Default: 1

--match-source-len

generations should match the source length

Default: False

--no-early-stop

continue searching even after finalizing k=beam hypotheses; this is more correct, but increases generation time by 50%

Default: False

--unnormalized

compare unnormalized hypothesis scores

Default: False

--no-beamable-mm

don’t use BeamableMM in attention layers

Default: False

--lenpen

length penalty: <1.0 favors shorter, >1.0 favors longer sentences

Default: 1
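
The effect of the length penalty can be seen with a small worked example. The sketch assumes the common normalization in which a hypothesis score is the sum of its token log-probabilities divided by length raised to the power lenpen; the function names and the toy scores are hypothetical.

```python
def normalized_score(sum_logprob, length, lenpen=1.0):
    """Length-normalized hypothesis score (assumed formula)."""
    return sum_logprob / (length ** lenpen)

def best_hypothesis(hyps, lenpen):
    """Pick the (name, sum_logprob, length) tuple with the highest score."""
    return max(hyps, key=lambda h: normalized_score(h[1], h[2], lenpen))

hyps = [("short", -5.0, 5), ("long", -8.0, 10)]
winner_default = best_hypothesis(hyps, lenpen=1.0)[0]
winner_low = best_hypothesis(hyps, lenpen=0.5)[0]
```

With lenpen = 1.0 the longer hypothesis wins (-0.8 against -1.0 per token), while lenpen = 0.5 divides by the square root of the length and the shorter hypothesis wins, matching the description above.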

--unkpen

unknown word penalty: <0 produces more unks, >0 produces fewer

Default: 0

--replace-unk

perform unknown replacement (optionally with alignment dictionary)

--sacrebleu

score with sacrebleu

Default: False

--score-reference

just score the reference translation

Default: False

--prefix-size

initialize generation by target prefix of given length

Default: 0

--no-repeat-ngram-size

ngram blocking such that this size ngram cannot be repeated in the generation

Default: 0
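
N-gram blocking can be sketched as a function that, given the tokens generated so far, returns the tokens that must not come next because they would complete an n-gram that has already appeared. This is an illustration of the technique, not fairseq's implementation; the function name is hypothetical and treating size 0 as "disabled" is an assumption based on the default.

```python
def banned_next_tokens(tokens, n):
    """Tokens that would repeat an existing n-gram if generated next."""
    if n <= 0 or len(tokens) < n - 1:
        return set()
    # The last n-1 generated tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            # This earlier n-gram starts with the same prefix: ban its last token.
            banned.add(tokens[i + n - 1])
    return banned

blocked = banned_next_tokens([1, 2, 3, 1, 2], n=3)
```

In the example the sequence ends in (1, 2), and the trigram (1, 2, 3) has already been produced, so token 3 is blocked at the next step.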

--sampling

sample hypotheses instead of using beam search

Default: False

--sampling-topk

sample from top K likely next words instead of all words

Default: -1

--sampling-temperature

temperature for random sampling

Default: 1
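
Top-K sampling with a temperature can be sketched with the standard library alone. The code below illustrates the general technique, not fairseq's implementation: the function name and the toy logits are hypothetical, and reading a non-positive K as "sample from all words" is an assumption based on the default of -1.

```python
import math
import random

def sample_next(logits, topk=-1, temperature=1.0, rng=random):
    """Sample a token index from `logits`, optionally restricted to the top K."""
    # Rank candidate indices from most to least likely.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if topk > 0:
        indices = indices[:topk]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in indices]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, 1.0, -1.0]
samples = [sample_next(logits, topk=2, temperature=0.7, rng=rng) for _ in range(100)]
```

With K = 2 only the two most likely tokens (indices 0 and 2 here) can ever be drawn, whatever the temperature.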

--diverse-beam-groups

number of groups for Diverse Beam Search

Default: -1

--diverse-beam-strength

strength of diversity penalty for Diverse Beam Search

Default: 0.5

--print-alignment

if set, uses attention feedback to compute and print alignment to source tokens

Default: False

Interactive

--buffer-size

read this many sentences into a buffer before processing them

Default: 0

--input

file to read from; use - for stdin

Default: “-”

fairseq-score

fairseq-eval-lm

Evaluate the perplexity of a trained language model.

usage: fairseq-eval-lm [-h] [--no-progress-bar] [--log-interval N]
                       [--log-format {json,none,simple,tqdm}]
                       [--tensorboard-logdir DIR] [--seed N] [--cpu] [--fp16]
                       [--memory-efficient-fp16]
                       [--fp16-init-scale FP16_INIT_SCALE]
                       [--fp16-scale-window FP16_SCALE_WINDOW]
                       [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                       [--min-loss-scale D]
                       [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
                       [--user-dir USER_DIR] [--task TASK] [--num-workers N]
                       [--skip-invalid-size-inputs-valid-test]
                       [--max-tokens N] [--max-sentences N]
                       [--required-batch-size-multiple N] [--gen-subset SPLIT]
                       [--num-shards N] [--shard-id ID] [--path FILE]
                       [--remove-bpe [REMOVE_BPE]] [--quiet]
                       [--model-overrides DICT] [--results-path RESDIR]
                       [--output-word-probs] [--output-word-stats]
                       [--context-window N] [--softmax-batch N]

Named Arguments

--no-progress-bar

disable progress bar

Default: False

--log-interval

log progress every N batches (when progress bar is disabled)

Default: 1000

--log-format

Possible choices: json, none, simple, tqdm

log format to use

--tensorboard-logdir

path to save logs for tensorboard, should match --logdir of running tensorboard (default: no tensorboard logging)

Default: “”

--seed

pseudo random number generator seed

Default: 1

--cpu

use CPU instead of CUDA

Default: False

--fp16

use FP16

Default: False

--memory-efficient-fp16

use a memory-efficient version of FP16 training; implies --fp16

Default: False

--fp16-init-scale

default FP16 loss scale

Default: 128

--fp16-scale-window

number of updates before increasing loss scale

--fp16-scale-tolerance

pct of updates that can overflow before decreasing the loss scale

Default: 0.0

--min-loss-scale

minimum FP16 loss scale, after which training is stopped

Default: 0.0001

--threshold-loss-scale

threshold FP16 loss scale from below

--user-dir

path to a python module containing custom extensions (tasks and/or architectures)

--task

Possible choices: multilingual_translation, semisupervised_translation, translation, translation_moe, language_modeling, translation_from_pretrained_xlm, cross_lingual_lm

task

Default: “language_modeling”

Dataset and data loading

--num-workers

how many subprocesses to use for data loading

Default: 0

--skip-invalid-size-inputs-valid-test

ignore too long or too short lines in valid and test set

Default: False

--max-tokens

maximum number of tokens in a batch

--max-sentences, --batch-size

maximum number of sentences in a batch

--required-batch-size-multiple

batch size will be a multiple of this value

Default: 8

--gen-subset

data subset to generate (train, valid, test)

Default: “test”

--num-shards

shard generation over N shards

Default: 1

--shard-id

id of the shard to generate (id < num_shards)

Default: 0

LM Evaluation

--path

path(s) to model file(s), colon separated

--remove-bpe

remove BPE tokens before scoring (can be set to sentencepiece)

--quiet

only print final scores

Default: False

--model-overrides

a dictionary used to override model args at generation that were used during model training

Default: “{}”

--results-path

path to save eval results (optional)

--output-word-probs

if set, outputs words and their predicted log probabilities to standard output

Default: False
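
Per-word log probabilities are the raw material for perplexity. As a hedged sketch of the standard definition (not a description of how fairseq-eval-lm formats or aggregates its output), perplexity is the exponential of the average negative log probability; the function name is hypothetical and natural logarithms are assumed.

```python
import math

def perplexity(log_probs):
    """Perplexity from a list of natural-log token probabilities."""
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

A model that gives each token a probability of 0.25 has a perplexity of about 4, which matches the intuition of choosing uniformly among four options.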

--output-word-stats

if set, outputs word statistics such as word count, average probability, etc

Default: False

--context-window

ensures that every evaluated token has access to a context of at least this size, if possible

Default: 0

--softmax-batch

if BxT is more than this, will batch the softmax over vocab to this amount of tokens in order to fit into GPU memory

Default: 9223372036854775807