---
myst:
html_meta:
description: CUDA streams in PyTorch C++ — CUDAStream for asynchronous GPU execution and synchronization.
keywords: PyTorch, C++, CUDA, CUDAStream, stream, asynchronous, GPU, synchronization
---
# CUDA Streams
CUDA streams provide a mechanism for asynchronous execution of operations
on the GPU. Operations queued to the same stream execute in order, while
operations on different streams can execute concurrently.
## CUDAStream
```{doxygenclass} c10::cuda::CUDAStream
:members:
:undoc-members:
```
**Example:**
```cpp
#include <c10/cuda/CUDAStream.h>
// Get the default stream for current device
auto stream = c10::cuda::getDefaultCUDAStream();
// Create a new stream
auto new_stream = c10::cuda::getStreamFromPool();
// Get current stream
auto current = c10::cuda::getCurrentCUDAStream();
// Synchronize
stream.synchronize();
```
## Acquiring CUDA Streams
PyTorch provides several ways to acquire CUDA streams:
1. **From the stream pool** (round-robin allocation):
```cpp
// Normal priority stream
at::cuda::CUDAStream stream = at::cuda::getStreamFromPool();
// High priority stream
at::cuda::CUDAStream high_prio = at::cuda::getStreamFromPool(/*isHighPriority=*/true);
// Stream for specific device
at::cuda::CUDAStream dev1_stream = at::cuda::getStreamFromPool(false, /*device=*/1);
```
2. **Default stream** (where most computation occurs):
```cpp
at::cuda::CUDAStream defaultStream = at::cuda::getDefaultCUDAStream();
```
3. **Current stream** (may differ if changed with guards):
```cpp
at::cuda::CUDAStream currentStream = at::cuda::getCurrentCUDAStream();
```
## Setting CUDA Streams
**Using setCurrentCUDAStream:**
```cpp
torch::Tensor tensor0 = torch::ones({2, 2}, torch::device(torch::kCUDA));
// Get a new stream and set it as current
at::cuda::CUDAStream myStream = at::cuda::getStreamFromPool();
at::cuda::setCurrentCUDAStream(myStream);
// Operations now use myStream
tensor0.sum();
// Restore default stream
at::cuda::setCurrentCUDAStream(at::cuda::getDefaultCUDAStream());
```
**Using CUDAStreamGuard (recommended):**
```cpp
torch::Tensor tensor0 = torch::ones({2, 2}, torch::device(torch::kCUDA));
at::cuda::CUDAStream myStream = at::cuda::getStreamFromPool();
{
  at::cuda::CUDAStreamGuard guard(myStream);
  // Operations use myStream within this scope
  tensor0.sum();
}
// Stream automatically restored to default
```
## Multi-Device Stream Management
**Streams on multiple devices:**
```cpp
// Acquire streams for different devices
at::cuda::CUDAStream stream0 = at::cuda::getStreamFromPool(false, 0);
at::cuda::CUDAStream stream1 = at::cuda::getStreamFromPool(false, 1);
// Set current streams on each device
at::cuda::setCurrentCUDAStream(stream0);
at::cuda::setCurrentCUDAStream(stream1);
// Create tensors on device 0
torch::Tensor tensor0 = torch::ones({2, 2}, torch::device(at::kCUDA));
tensor0.sum(); // Uses stream0
// Switch to device 1
{
  at::cuda::CUDAGuard device_guard(1);
  torch::Tensor tensor1 = torch::ones({2, 2}, torch::device(at::kCUDA));
  tensor1.sum(); // Uses stream1
}
```
**Using CUDAMultiStreamGuard:**
```cpp
torch::Tensor tensor0 = torch::ones({2, 2}, torch::device({torch::kCUDA, 0}));
torch::Tensor tensor1 = torch::ones({2, 2}, torch::device({torch::kCUDA, 1}));
at::cuda::CUDAStream stream0 = at::cuda::getStreamFromPool(false, 0);
at::cuda::CUDAStream stream1 = at::cuda::getStreamFromPool(false, 1);
{
  // Set streams on both devices simultaneously
  at::cuda::CUDAMultiStreamGuard multi_guard({stream0, stream1});
  tensor0.sum(); // Uses stream0 on device 0
  tensor1.sum(); // Uses stream1 on device 1
}
// Both streams restored to defaults
```
```{attention}
`CUDAMultiStreamGuard` does not change the current device index. It only
changes the stream on each passed-in stream's device.
```
## Multi-Device Stream Handling Pattern
The following skeleton shows three common patterns for acquiring and setting
streams across multiple CUDA devices:
```cpp
// Create stream vector on device 0
std::vector<at::cuda::CUDAStream> streams0 =
    {at::cuda::getDefaultCUDAStream(), at::cuda::getStreamFromPool()};
at::cuda::setCurrentCUDAStream(streams0[0]);
// Create stream vector on device 1 using CUDAGuard
std::vector<at::cuda::CUDAStream> streams1;
{
  at::cuda::CUDAGuard device_guard(1);
  streams1.push_back(at::cuda::getDefaultCUDAStream());
  streams1.push_back(at::cuda::getStreamFromPool());
}
at::cuda::setCurrentCUDAStream(streams1[0]);
// Pattern 1: CUDAGuard changes current device only, not streams
{
  at::cuda::CUDAGuard device_guard(1);
  // current device is 1, current stream on device 1 is still streams1[0]
}
// Pattern 2: CUDAStreamGuard changes both current device and current stream
{
  at::cuda::CUDAStreamGuard stream_guard(streams1[1]);
  // current device is 1, current stream is streams1[1]
}
// restored to device 0, stream streams0[0]
// Pattern 3: CUDAMultiStreamGuard sets streams on multiple devices at once
{
  at::cuda::CUDAMultiStreamGuard multi_guard({streams0[1], streams1[1]});
  // current device unchanged (still 0)
  // stream on device 0 is streams0[1], stream on device 1 is streams1[1]
}
// streams restored to streams0[0] and streams1[0]
```