torch.optim
torch.optim is a package implementing various optimization algorithms. Most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can also be easily integrated in the future.
How to use an optimizer
To use torch.optim you have to construct an optimizer object that will hold the current state and update the parameters based on the computed gradients.
Constructing it
To construct an Optimizer you have to give it an iterable containing the parameters (all should be Variables) to optimize. Then, you can specify optimizer-specific options such as the learning rate, weight decay, etc.
Example:
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
Per-parameter options
Optimizers also support specifying per-parameter options. To do this, instead of passing an iterable of Variables, pass in an iterable of dicts. Each of them will define a separate parameter group, and should contain a params key, containing a list of parameters belonging to it. Other keys should match the keyword arguments accepted by the optimizers, and will be used as optimization options for this group.
Note
You can still pass options as keyword arguments. They will be used as defaults, in the groups that didn't override them. This is useful when you only want to vary a single option, while keeping all others consistent between parameter groups.
For example, this is very useful when one wants to specify per-layer learning rates:
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
This means that model.base's parameters will use the default learning rate of 1e-2, model.classifier's parameters will use a learning rate of 1e-3, and a momentum of 0.9 will be used for all parameters.
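Each group ends up carrying its full set of effective options (defaults included), and the groups stay accessible through the optimizer's param_groups attribute, so options can also be adjusted after construction. A minimal sketch; the halving schedule here is purely illustrative, not part of the API:

optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)

# Each dict in param_groups holds that group's effective options;
# halve every group's learning rate, e.g. once per epoch.
for group in optimizer.param_groups:
    group['lr'] = group['lr'] * 0.5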
Taking an optimization step
All optimizers implement a step() method that updates the parameters. It can be used in two ways:
optimizer.step()
This is a simplified version supported by most optimizers. The function can be called once the gradients are computed, e.g. using backward().
Example:
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
optimizer.step(closure)
Some optimization algorithms such as Conjugate Gradient and LBFGS need to reevaluate the function multiple times, so you have to pass in a closure that allows them to recompute your model. The closure should clear the gradients, compute the loss, and return it.
Example:
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
Algorithms
class torch.optim.Optimizer(params, defaults)
Base class for all optimizers.
Parameters:
- params (iterable) – an iterable of Variables or dicts. Specifies what Variables should be optimized.
- defaults (dict) – a dict containing default values of optimization options (used when a parameter group doesn't specify them).
load_state_dict(state_dict)
Loads the optimizer state.
Parameters:
- state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().
state_dict()
Returns the state of the optimizer as a dict. It contains two entries:
- state – a dict holding current optimization state. Its content differs between optimizer classes.
- param_groups – a dict containing all parameter groups
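Together, state_dict() and load_state_dict() allow checkpointing and resuming an optimizer. A minimal sketch, assuming standard torch.save/torch.load serialization; the 'optim_checkpoint.pth' filename is chosen only for illustration:

# Save the optimizer state alongside your model checkpoint.
torch.save(optimizer.state_dict(), 'optim_checkpoint.pth')

# Later: rebuild the optimizer, then restore its state.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer.load_state_dict(torch.load('optim_checkpoint.pth'))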
class torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
Implements Adadelta algorithm.
It has been proposed in ADADELTA: An Adaptive Learning Rate Method.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9)
- eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
- lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
class torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0)
Implements Adagrad algorithm.
It has been proposed in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-2)
- lr_decay (float, optional) – learning rate decay (default: 0)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Implements Adam algorithm.
It has been proposed in Adam: A Method for Stochastic Optimization.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
class torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Implements Adamax algorithm (a variant of Adam based on infinity norm).
It has been proposed in Adam: A Method for Stochastic Optimization.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 2e-3)
- betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
class torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
Implements Averaged Stochastic Gradient Descent.
It has been proposed in Acceleration of stochastic approximation by averaging.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-2)
- lambd (float, optional) – decay term (default: 1e-4)
- alpha (float, optional) – power for eta update (default: 0.75)
- t0 (float, optional) – point at which to start averaging (default: 1e6)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
class torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)
Implements L-BFGS algorithm.
Warning
This optimizer doesn’t support per-parameter options and parameter groups (there can be only one).
Warning
Right now all parameters have to be on a single device. This will be improved in the future.
Note
This is a very memory intensive optimizer (it requires an additional param_bytes * (history_size + 1) bytes). If it doesn't fit in memory try reducing the history size, or use a different algorithm.
Parameters:
- lr (float) – learning rate (default: 1)
- max_iter (int) – maximal number of iterations per optimization step (default: 20)
- max_eval (int) – maximal number of function evaluations per optimization step (default: max_iter * 1.25).
- tolerance_grad (float) – termination tolerance on first order optimality (default: 1e-5).
- tolerance_change (float) – termination tolerance on function value/parameter changes (default: 1e-9).
- history_size (int) – update history size (default: 100).
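Because L-BFGS reevaluates the objective multiple times per step, it must be driven through the closure form of step() described above. A minimal sketch; model, loss_fn, input and target stand in for your own code, and the hyperparameter values are illustrative:

optimizer = optim.LBFGS(model.parameters(), lr=1, history_size=10)

def closure():
    # Clear gradients, recompute the loss, backpropagate, and return the loss.
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    loss.backward()
    return loss

optimizer.step(closure)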
class torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
Implements RMSprop algorithm.
Proposed by G. Hinton in his course.
The centered version first appears in Generating Sequences With Recurrent Neural Networks.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-2)
- momentum (float, optional) – momentum factor (default: 0)
- alpha (float, optional) – smoothing constant (default: 0.99)
- eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
- centered (bool, optional) – if True, compute the centered RMSprop; the gradient is normalized by an estimation of its variance (default: False)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
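A construction sketch using the centered variant; the hyperparameter values are illustrative only:

optimizer = optim.RMSprop(model.parameters(), lr=1e-2, alpha=0.99, momentum=0.9, centered=True)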
class torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
Implements the resilient backpropagation algorithm.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-2)
- etas (Tuple[float, float], optional) – pair of (etaminus, etaplus), multiplicative decrease and increase factors for the step size (default: (0.5, 1.2))
- step_sizes (Tuple[float, float], optional) – a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))
class torch.optim.SGD(params, lr=<required>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
Implements stochastic gradient descent (optionally with momentum).
Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float) – learning rate
- momentum (float, optional) – momentum factor (default: 0)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
- dampening (float, optional) – dampening for momentum (default: 0)
- nesterov (bool, optional) – enables Nesterov momentum (default: False)
Example
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
Note
The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.
Considering the specific case of Momentum, the update can be written as
\[\begin{split}v = \rho \cdot v + g \\ p = p - \mathrm{lr} \cdot v\end{split}\]
where \(p\), \(g\), \(v\) and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.
This is in contrast to Sutskever et al. and other frameworks which employ an update of the form
\[\begin{split}v = \rho \cdot v + \mathrm{lr} \cdot g \\ p = p - v\end{split}\]
The Nesterov version is analogously modified.
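The practical consequence: in the PyTorch form, changing the learning rate mid-training also rescales the already accumulated velocity, whereas in the Sutskever form the learning rate is baked into the velocity as each gradient arrives. A small scalar sketch of both rules in plain Python; all values are chosen only for illustration:

# One step of each momentum formulation on a scalar parameter p = 1.0.
rho, lr, g, v = 0.9, 0.1, 1.0, 0.0  # momentum, learning rate, gradient, velocity

# PyTorch-style: v accumulates raw gradients, lr is applied at update time.
v_torch = rho * v + g
p_torch = 1.0 - lr * v_torch

# Sutskever-style: lr is folded into the velocity itself.
v_sutskever = rho * v + lr * g
p_sutskever = 1.0 - v_sutskever

print(p_torch, p_sutskever)  # equal while lr stays constant; they diverge once lr changes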