Optimizers¶

Optimizers update the model parameters based on the gradients.

class fairseq.optim.AMPOptimizer(cfg: omegaconf.dictconfig.DictConfig, params, fp32_optimizer, **kwargs)[source]¶

Wrap an optimizer to support AMP (automatic mixed precision) training.

all_reduce_grads(module)[source]¶: Manually all-reduce gradients (if required).

backward(loss)[source]¶

Computes the sum of gradients of the given tensor w.r.t. graph leaves.

Compared to fairseq.optim.FairseqOptimizer.backward(), this function additionally dynamically scales the loss to avoid gradient underflow.
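To make "dynamically scales the loss" concrete, here is a minimal pure-Python sketch of one common dynamic loss-scaling policy: multiply the loss by a scale before the backward pass, skip the step and shrink the scale when a gradient overflows, and grow the scale again after a run of clean steps. The class `ToyLossScaler` and all of its parameter names are hypothetical and chosen for illustration; this is not the scaler that AMPOptimizer uses internally.

```python
import math


class ToyLossScaler:
    """Hypothetical dynamic loss scaler, for illustration only."""

    def __init__(self, init_scale=128.0, scale_factor=2.0, scale_window=2):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window  # clean steps required before growing
        self._good_steps = 0

    def scale_loss(self, loss):
        # Larger loss -> larger gradients, which keeps tiny values above the
        # smallest magnitude that half precision can represent.
        return loss * self.scale

    def unscale_and_check(self, scaled_grads):
        """Return unscaled grads, or None if the step should be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            # Overflow: shrink the scale and tell the caller to skip the step.
            self.scale /= self.scale_factor
            self._good_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]
        self._good_steps += 1
        if self._good_steps % self.scale_window == 0:
            # Enough clean steps in a row: try a larger scale next time.
            self.scale *= self.scale_factor
        return grads


scaler = ToyLossScaler(init_scale=128.0, scale_window=2)
first = scaler.unscale_and_check([128.0, 256.0])          # clean step
skipped = scaler.unscale_and_check([float("inf"), 1.0])   # overflow
scale_after_overflow = scaler.scale
scaler.unscale_and_check([64.0])                          # clean step
scaler.unscale_and_check([64.0])                          # clean step, scale grows
```

The caller would only invoke the wrapped optimizer's step when `unscale_and_check` returns a list rather than `None`.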

classmethod build_optimizer(cfg: omegaconf.dictconfig.DictConfig, params, **kwargs)[source]¶

Parameters:
cfg (omegaconf.DictConfig) – fairseq args
params (iterable) – iterable of parameters to optimize

clip_grad_norm(max_norm, aggregate_norm_fn=None)[source]¶: Clips gradient norm.

get_lr()[source]¶: Return the current learning rate.

optimizer¶: Return a torch.optim.optimizer.Optimizer instance.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

set_lr(lr)[source]¶: Set the learning rate.

step()[source]¶: Performs a single optimization step.

supports_flat_params¶: Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.FP16Optimizer(cfg: omegaconf.dictconfig.DictConfig, params, fp32_optimizer, fp32_params, **kwargs)[source]¶

Wrap an optimizer to support FP16 (mixed precision) training.
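The constructor takes `fp32_params` and an `fp32_optimizer`, which suggests that updates are applied to an FP32 copy of the weights. The sketch below shows why such a copy can matter, using only the standard library: `struct` format `"e"` rounds a value to IEEE half precision and `"f"` to single precision. The helper names and the numbers are assumptions picked for illustration and do not come from fairseq.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


update = 1e-4   # stands in for lr * grad; smaller than FP16 spacing near 1.0
steps = 100

# Variant 1: keep the weight in FP16 and round after every update.
w16 = to_fp16(1.0)
for _ in range(steps):
    w16 = to_fp16(w16 - update)

# Variant 2: accumulate updates in an FP32 master copy, then cast to FP16.
master = to_fp32(1.0)
for _ in range(steps):
    master = to_fp32(master - update)
w16_from_master = to_fp16(master)

print(w16, master, w16_from_master)
```

In variant 1 every update is rounded away, so the weight never moves. In variant 2 the small updates accumulate in the master copy and show up once the result is cast back to FP16.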

all_reduce_grads(module)[source]¶: Manually all-reduce gradients (if required).

classmethod build_optimizer(cfg: omegaconf.dictconfig.DictConfig, params, **kwargs)[source]¶

Parameters:
cfg (omegaconf.DictConfig) – fairseq args
params (iterable) – iterable of parameters to optimize

get_lr()[source]¶: Return the current learning rate.

optimizer¶: Return a torch.optim.optimizer.Optimizer instance.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

set_lr(lr)[source]¶: Set the learning rate.

supports_flat_params¶: Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.MemoryEfficientFP16Optimizer(cfg: omegaconf.dictconfig.DictConfig, params, optimizer, allow_unsupported=False, **kwargs)[source]¶

Wrap an optimizer to support FP16 (mixed precision) training.

Compared to fairseq.optim.FP16Optimizer, this version does not maintain an FP32 copy of the model. We instead expect the optimizer to convert the gradients to FP32 internally and sync the results back to the FP16 model params. This significantly reduces memory usage but slightly increases the time spent in the optimizer.

Since this wrapper depends on specific functionality in the wrapped optimizer (i.e., on-the-fly conversion of grads to FP32), only certain optimizers can be wrapped. This is determined by the supports_memory_efficient_fp16 property.

all_reduce_grads(module)[source]¶: Manually all-reduce gradients (if required).

classmethod build_optimizer(cfg: omegaconf.dictconfig.DictConfig, params, **kwargs)[source]¶

Parameters:
cfg (omegaconf.DictConfig) – fairseq args
params (iterable) – iterable of parameters to optimize

get_lr()[source]¶: Return the current learning rate.

optimizer¶: Return a torch.optim.optimizer.Optimizer instance.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

set_lr(lr)[source]¶: Set the learning rate.

class fairseq.optim.FairseqOptimizer(cfg)[source]¶

classmethod add_args(parser)[source]¶: Add optimizer-specific arguments to the parser.

all_reduce_grads(module)[source]¶: Manually all-reduce gradients (if required).

average_params()[source]¶

backward(loss)[source]¶: Computes the sum of gradients of the given tensor w.r.t. graph leaves.

broadcast_global_state_dict(state_dict)[source]¶: Broadcasts a global state dict to all ranks. Useful for optimizers that shard state between ranks.

clip_grad_norm(max_norm, aggregate_norm_fn=None)[source]¶: Clips gradient norm.
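As a rough illustration of gradient-norm clipping, the sketch below rescales a flat list of gradient values so that their global L2 norm does not exceed `max_norm`. The function name `clip_by_global_norm`, the small epsilon, and the choice to return the pre-clip norm are assumptions made for this example; it operates on plain Python floats rather than tensors and ignores `aggregate_norm_fn`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Return (possibly rescaled grads, norm measured before clipping)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if max_norm > 0 and total_norm > max_norm:
        # Scale every gradient by the same factor so the direction is kept.
        coef = max_norm / (total_norm + eps)
        grads = [g * coef for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

untouched, small_norm = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Because all gradients share one scaling factor, clipping shortens the update without changing its direction.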

get_lr()[source]¶: Return the current learning rate.

load_state_dict(state_dict, optimizer_overrides=None)[source]¶

Load an optimizer state dict.

In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.
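The precedence rule described here can be sketched with plain dictionaries: restore the checkpointed state first, then let the overrides win for every parameter group. The function `toy_load_state_dict` and the layout of the toy state dict are assumptions for illustration, not fairseq's implementation.

```python
def toy_load_state_dict(param_groups, state_dict, optimizer_overrides=None):
    """Hypothetical loader: checkpoint values first, then current overrides."""
    # Restore what the checkpoint recorded for each parameter group.
    param_groups[:] = [dict(group) for group in state_dict["param_groups"]]
    if optimizer_overrides:
        # Settings of the current run take precedence over the checkpoint.
        for group in param_groups:
            group.update(optimizer_overrides)
    return param_groups


checkpoint = {"param_groups": [{"lr": 0.1, "momentum": 0.9}]}
groups = toy_load_state_dict([], checkpoint, optimizer_overrides={"lr": 0.01})
```

After loading, the learning rate comes from the overrides while the momentum, which was not overridden, still comes from the checkpoint.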

multiply_grads(c)[source]¶: Multiplies grads by a constant c.

optimizer¶: Return a torch.optim.optimizer.Optimizer instance.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

param_groups¶

params¶: Return an iterable of the parameters held by the optimizer.

set_lr(lr)[source]¶: Set the learning rate.

state_dict()[source]¶: Return the optimizer’s state dict.

step(closure=None, scale=1.0, groups=None)[source]¶: Performs a single optimization step.

supports_flat_params¶: Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.
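To illustrate what collapsing parameters into a single contiguous buffer means, the sketch below copies several small parameter lists into one float32 `array` and exposes each original parameter as a zero-copy `memoryview` slice. The helpers `flatten` and `make_views` are hypothetical and use only the standard library; a real implementation would work on tensors.

```python
from array import array


def flatten(params):
    """Copy per-parameter lists into one contiguous float32 buffer."""
    flat = array("f")
    sizes = []
    for p in params:
        sizes.append(len(p))
        flat.extend(p)
    return flat, sizes


def make_views(flat, sizes):
    """Return zero-copy slices of the flat buffer, one per parameter."""
    buf = memoryview(flat)
    views, offset = [], 0
    for n in sizes:
        views.append(buf[offset:offset + n])
        offset += n
    return views


params = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
flat, sizes = flatten(params)
param_views = make_views(flat, sizes)

# A single loop over the flat buffer updates every parameter at once.
for i in range(len(flat)):
    flat[i] *= 0.5

halved = [list(v) for v in param_views]
```

The per-parameter views see the update immediately because they share memory with the flat buffer.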

supports_groups¶

supports_memory_efficient_fp16¶

supports_step_with_scale¶

zero_grad()[source]¶: Clears the gradients of all optimized parameters.

class fairseq.optim.adadelta.Adadelta(args, params)[source]¶

static add_args(parser)[source]¶: Add optimizer-specific arguments to the parser.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

supports_flat_params¶: Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.adagrad.Adagrad(args, params)[source]¶

static add_args(parser)[source]¶: Add optimizer-specific arguments to the parser.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

supports_flat_params¶: Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.adafactor.FairseqAdafactor(args, params)[source]¶

static add_args(parser)[source]¶: Add optimizer-specific arguments to the parser.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

Note: convergence issues have been observed empirically with FP16 enabled. Finding a working configuration might require a search.

class fairseq.optim.adam.FairseqAdam(cfg: fairseq.optim.adam.FairseqAdamConfig, params)[source]¶

Adam optimizer for fairseq.

Important note: this optimizer corresponds to the “AdamW” variant of Adam in its weight decay behavior. As such, it is most closely analogous to torch.optim.AdamW from PyTorch.
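The difference between AdamW-style (decoupled) weight decay and classic L2 regularization can be seen on a single scalar parameter. The sketch below follows the textbook Adam update with bias correction; the function `adam_step` and its default hyperparameters are assumptions for illustration and are not taken from fairseq's implementation.

```python
import math


def adam_step(p, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One textbook Adam step on a scalar parameter p with gradient g."""
    if decoupled:
        # AdamW: shrink the weight directly, outside the adaptive update.
        p = p * (1.0 - lr * weight_decay)
    else:
        # Classic L2: fold the decay term into the gradient.
        g = g + weight_decay * p
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


# Zero gradient isolates the effect of weight decay.
p_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                          weight_decay=0.01, decoupled=True)
p_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                       weight_decay=0.01, decoupled=False)
print(p_adamw, p_l2)
```

With decoupled decay the weight shrinks by the small factor `lr * weight_decay`. With L2 regularization the decay term passes through Adam's normalization, so the first step moves the weight by roughly `lr` regardless of how small `weight_decay` is.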

average_params()[source]¶: Reduce params. This is only used during BMUF (blockwise model-update filtering) distributed training.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

class fairseq.optim.fp16_optimizer.FP16Optimizer(cfg: omegaconf.dictconfig.DictConfig, params, fp32_optimizer, fp32_params, **kwargs)[source]¶

Wrap an optimizer to support FP16 (mixed precision) training.

all_reduce_grads(module)[source]¶: Manually all-reduce gradients (if required).

classmethod build_optimizer(cfg: omegaconf.dictconfig.DictConfig, params, **kwargs)[source]¶

Parameters:
cfg (omegaconf.DictConfig) – fairseq args
params (iterable) – iterable of parameters to optimize

get_lr()[source]¶: Return the current learning rate.

lr_scheduler¶

optimizer¶: Return a torch.optim.optimizer.Optimizer instance.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

set_lr(lr)[source]¶: Set the learning rate.

supports_flat_params¶: Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.

class fairseq.optim.nag.FairseqNAG(cfg: omegaconf.dictconfig.DictConfig, params)[source]¶

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

class fairseq.optim.sgd.SGD(args, params)[source]¶

static add_args(parser)[source]¶: Add optimizer-specific arguments to the parser.

optimizer_config¶: Return a kwarg dictionary that will be used to override optimizer args stored in checkpoints. This allows us to load a checkpoint and resume training using a different set of optimizer args, e.g., with a different learning rate.

supports_flat_params¶: Whether the optimizer supports collapsing of the model parameters/gradients into a single contiguous Tensor.