---
myst:
  html_meta:
    description: SGD optimizer in PyTorch C++ — stochastic gradient descent with momentum and weight decay.
    keywords: PyTorch, C++, SGD, gradient descent, momentum, weight decay, optimizer
---

# Gradient Descent Optimizers

These optimizers use gradient descent with optional enhancements like momentum. They are the foundation of neural network training and work well when you can afford careful hyperparameter tuning.

## SGD (Stochastic Gradient Descent)

The classic optimization algorithm. SGD with momentum is often the best choice for convolutional neural networks when properly tuned. It requires more careful learning-rate selection than adaptive methods such as Adam, but frequently achieves better final accuracy.

**When to use:**

- Training CNNs (ResNet, VGG, etc.) where you want maximum accuracy
- When you have time for hyperparameter tuning
- When combined with learning rate schedules (warmup, cosine annealing; see the scheduler sketch below)

**Key parameters:**

- `lr`: Learning rate (typical: 0.01-0.1 for CNNs)
- `momentum`: Accelerates convergence (typical: 0.9)
- `weight_decay`: L2 regularization coefficient
- `nesterov`: Use Nesterov momentum (often improves convergence)

```{doxygenclass} torch::optim::SGD
:members:
:undoc-members:
```

**Example:**

```cpp
// Standard SGD with momentum - good for CNNs
auto optimizer = torch::optim::SGD(
    model->parameters(),
    torch::optim::SGDOptions(0.01)  // learning rate
        .momentum(0.9)              // momentum factor
        .weight_decay(1e-4)         // L2 regularization
        .nesterov(true));           // Nesterov momentum
```
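
The optimizer is driven by the standard `zero_grad()` / `backward()` / `step()` cycle. Here is a minimal sketch of one training step, continuing from the example above; the `model`, `input`, and `target` names are assumptions, not defined here:

```cpp
// One training step (sketch): model, input, and target are assumed to exist.
optimizer.zero_grad();                        // clear gradients from the last step
auto output = model->forward(input);          // forward pass
auto loss = torch::nn::functional::cross_entropy(output, target);
loss.backward();                              // compute gradients
optimizer.step();                             // apply the momentum-SGD update
```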
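
SGD pairs naturally with the learning rate schedules mentioned above. Recent libtorch releases ship `torch::optim::StepLR` in the C++ frontend; a sketch of decaying the learning rate every 30 epochs follows, where the `step_size`/`gamma` values and the `train_one_epoch` helper are illustrative assumptions:

```cpp
// Multiply the learning rate by gamma every step_size epochs (illustrative values).
// StepLR is available in recent libtorch releases; check your version's headers.
torch::optim::StepLR scheduler(optimizer, /*step_size=*/30, /*gamma=*/0.1);

for (int epoch = 0; epoch < 90; ++epoch) {
  train_one_epoch(model, optimizer);  // hypothetical per-epoch training helper
  scheduler.step();                   // advance the schedule once per epoch
}
```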