Evaluating Pre-trained Models¶
First, download a pre-trained model along with its vocabularies:
> curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
This model uses a Byte Pair Encoding (BPE) vocabulary, so we’ll have to apply the encoding to the source text before it can be translated. This can be done with the apply_bpe.py script using the wmt14.en-fr.fconv-py/bpecodes file. @@ is used as a continuation marker, and the original text can be easily recovered with e.g. sed 's/@@ //g' or by passing the --remove-bpe flag to fairseq-generate. Prior to BPE, input text needs to be tokenized using tokenizer.perl from mosesdecoder.
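As a small illustration of the continuation marker, the sketch below removes the @@ markers from a BPE-encoded line with sed. The input string is copied from the S-0 sample line in this guide; the variable names are only illustrative.

```shell
# BPE-encoded source line, copied from the S-0 sample output in this guide.
bpe_text='Why is it rare to discover new marine mam@@ mal species ?'

# Delete every "@@ " continuation marker so that split subwords are rejoined.
plain_text=$(printf '%s\n' "$bpe_text" | sed 's/@@ //g')

echo "$plain_text"
# prints: Why is it rare to discover new marine mammal species ?
```

The result is still tokenized (note the space before the question mark), so a detokenization step would be needed to recover the raw sentence.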
Let’s use fairseq-interactive to generate translations interactively. Here, we use a beam size of 5 and preprocess the input with the Moses tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically remove the BPE continuation markers and detokenize the output.
> MODEL_DIR=wmt14.en-fr.fconv-py
> fairseq-interactive \
--path $MODEL_DIR/model.pt $MODEL_DIR \
--beam 5 --source-lang en --target-lang fr \
--tokenizer moses \
--bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
Why is it rare to discover new marine mammal species?
S-0 Why is it rare to discover new marine mam@@ mal species ?
H-0 -0.0643349438905716 Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0 -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
This generation script produces three types of outputs: a line prefixed with S shows the source sentence after tokenization and BPE; H is the hypothesis along with an average log-likelihood; and P is the positional score per token position, including the end-of-sentence marker, which is omitted from the text.
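To make the relationship between the H and P lines concrete, the following sketch averages the positional scores from the sample P-0 line with awk. It assumes the scores are whitespace-separated after the line prefix, as in the sample above; the variable names are illustrative.

```shell
# Positional scores copied from the P-0 line in the sample output above.
p_line='P-0 -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015'

# Sum fields 2..NF (field 1 is the "P-0" prefix) and divide by the score count.
avg=$(printf '%s\n' "$p_line" | awk '{ s = 0; for (i = 2; i <= NF; i++) s += $i; printf "%.4f", s / (NF - 1) }')

echo "average log-likelihood: $avg"
# prints: average log-likelihood: -0.0643
```

Rounded to four decimal places, this matches the score reported on the H-0 line (-0.0643349438905716).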
See the README for a full list of pre-trained models available.
Training a New Model¶
The following tutorial is for machine translation. For an example of how to use Fairseq for other tasks, such as Language Modeling, please see the examples/ directory.
Data Pre-processing¶
Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German). To pre-process and binarize the IWSLT dataset:
> cd examples/translation/
> bash prepare-iwslt14.sh
> cd ../..
> TEXT=examples/translation/iwslt14.tokenized.de-en
> fairseq-preprocess --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/iwslt14.tokenized.de-en
This will write binarized data that can be used for model training to data-bin/iwslt14.tokenized.de-en.
Training¶
Use fairseq-train to train a new model. Here are a few example settings that work well for the IWSLT 2014 dataset:
> mkdir -p checkpoints/fconv
> CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
--lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv
By default, fairseq-train will use all available GPUs on your machine. Use the CUDA_VISIBLE_DEVICES environment variable to select specific GPUs and/or to change the number of GPU devices that will be used.
Also note that the batch size is specified in terms of the maximum number of tokens per batch (--max-tokens). You may need to use a smaller value depending on the available GPU memory on your system.
Generation¶
Once your model is trained, you can generate translations using fairseq-generate (for binarized data) or fairseq-interactive (for raw text):
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/fconv/checkpoint_best.pt \
--batch-size 128 --beam 5
| [de] dictionary: 35475 types
| [en] dictionary: 24739 types
| data-bin/iwslt14.tokenized.de-en test 6750 examples
| model fconv
| loaded checkpoint checkpoints/fconv/checkpoint_best.pt
S-721 danke .
T-721 thank you .
...
To generate translations with only a CPU, use the --cpu flag. BPE continuation markers can be removed with the --remove-bpe flag.
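Because every output line starts with a type prefix and a sentence ID, standard text tools can separate the line types. The sketch below does this for the two sample lines shown above. It assumes that the ID and the text are separated by a tab character, and it rebuilds the sample with printf rather than reading real output; the variable names are illustrative.

```shell
# Two sample lines from the output above, rebuilt with tab separators
# (assumption: the sentence ID and the text are separated by a tab).
sample_output=$(printf 'S-721\tdanke .\nT-721\tthank you .\n')

# Keep only source (S-) or reference (T-) lines, then drop the ID column.
sources=$(printf '%s\n' "$sample_output" | grep '^S-' | cut -f2-)
references=$(printf '%s\n' "$sample_output" | grep '^T-' | cut -f2-)

echo "source:    $sources"
echo "reference: $references"
```

The same pattern should apply to other line types, with the column passed to cut adjusted for lines that carry an extra score field.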
Advanced Training Options¶
Large mini-batch training with delayed updates¶
The --update-freq option can be used to accumulate gradients from multiple mini-batches and delay updating, creating a larger effective batch size. Delayed updates can also improve training speed by reducing inter-GPU communication costs and by saving idle time caused by variance in workload across GPUs. See Ott et al. (2018) for more details.
To train on a single GPU with an effective batch size that is equivalent to training on 8 GPUs:
> CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
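The equivalence can be checked with simple arithmetic. The sketch below assumes that the effective batch size is at most the per-GPU token budget multiplied by the number of GPUs and by the update frequency, and it reuses the --max-tokens value of 4000 from the training example; the variable names are illustrative.

```shell
# Per-GPU token budget (--max-tokens), taken from the training example.
MAX_TOKENS=4000

# Upper bound on tokens per parameter update:
#   max tokens per GPU * number of GPUs * update frequency
eight_gpus=$((MAX_TOKENS * 8 * 1))        # 8 GPUs, --update-freq 1
one_gpu_delayed=$((MAX_TOKENS * 1 * 8))   # 1 GPU,  --update-freq 8

echo "8 GPUs, update-freq 1: $eight_gpus tokens"
echo "1 GPU,  update-freq 8: $one_gpu_delayed tokens"
```

Both configurations give the same bound of 32000 tokens per update, although actual batches may contain fewer tokens than the budget allows.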
Training with half precision floating point (FP16)¶
Note: FP16 training requires a Volta GPU and CUDA 9.1 or greater.
Recent GPUs enable efficient half precision floating point computation, e.g., using Nvidia Tensor Cores. Fairseq supports FP16 training with the --fp16 flag:
> fairseq-train --fp16 (...)
Lazily loading large training datasets¶
By default, fairseq loads the entire training set into system memory. For large datasets, the --lazy-load option can be used to instead load batches on demand. For optimal performance, use the --num-workers option to control the number of background processes that will load batches.
Distributed training¶
Distributed training in fairseq is implemented on top of torch.distributed. The easiest way to launch jobs is with the torch.distributed.launch tool.
For example, to train a large English-German Transformer model on 2 nodes each with 8 GPUs (in total 16 GPUs), run the following command on each node, replacing node_rank=0 with node_rank=1 on the second node:
> python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
--master_port=1234 \
$(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16 --distributed-no-spawn
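The per-node difference in the command above can be sketched as a small loop that prints the launcher prefix for each node. This is only an illustration: it echoes the commands rather than running them, it omits the fairseq-train arguments (which stay the same on every node), and the variable names are hypothetical.

```shell
# Values taken from the example above: 2 nodes with 8 GPUs each.
NNODES=2
NPROC_PER_NODE=8

# One process is launched per GPU, so the total is nodes * processes per node.
WORLD_SIZE=$((NNODES * NPROC_PER_NODE))
echo "total processes: $WORLD_SIZE"

# Print the launcher prefix for each node; only --node_rank changes.
NODE_RANK=0
while [ "$NODE_RANK" -lt "$NNODES" ]; do
  echo "node $NODE_RANK: python -m torch.distributed.launch --nproc_per_node=$NPROC_PER_NODE --nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=192.168.1.1 --master_port=1234"
  NODE_RANK=$((NODE_RANK + 1))
done
```

Keeping the node rank in a variable makes it harder to launch two nodes with the same rank by accident.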