TensorIterator (Python)#

Created On: May 21, 2026 | Last Updated On: May 21, 2026

torch._tensor_iterator is a thin Python surface over ATen’s at::TensorIterator build pipeline. It is a developer tool: it lets you inspect the result of a TensorIterator build (shape after coalesce/reorder, strides, dtype/device inference, broadcast result) without leaving Python.

There is no for_each here – this surface is build-only. Use it to debug shape/dtype inference, validate custom-op contracts, or pattern-match the post-build geometry from a dispatch decision (see torch._native/ops/scatter_add/cutedsl_impl.py for an example).

C++ fluent → Python kwargs#

The C++ builder is a fluent at::TensorIteratorConfig whose setters return *this:

auto iter = at::TensorIteratorConfig()
    .add_output(out)
    .add_const_input(a)
    .add_const_input(b)
    .promote_inputs_to_common_dtype(true)
    .cast_common_dtype_to_outputs(true)
    .enforce_safe_casting_to_output(true)
    .build();

The Python equivalent passes operands and flags as keyword arguments to TensorIterator(...):

from torch._tensor_iterator import TensorIterator

it = TensorIterator(
    outputs=[out],
    const_inputs=[a, b],
    promote_inputs_to_common_dtype=True,
    cast_common_dtype_to_outputs=True,
    enforce_safe_casting_to_output=True,
)

The mapping is mechanical:

C++ setter	Python kwarg	Default
`add_output(t)`	`outputs=[t, ...]` (or `[None]`)	`[]`
`add_input(t)`	`inputs=[t, ...]`	`[]`
`add_const_input(t)`	`const_inputs=[t, ...]`	`[]`
`check_all_same_dtype(b)`	`check_all_same_dtype=b`	`True`
`check_all_same_device(b)`	`check_all_same_device=b`	`True`
`promote_inputs_to_common_dtype(b)`	`promote_inputs_to_common_dtype=b`	`False`
`promote_integer_inputs_to_float(b)`	`promote_integer_inputs_to_float=b`	`False`
`cast_common_dtype_to_outputs(b)`	`cast_common_dtype_to_outputs=b`	`False`
`enforce_safe_casting_to_output(b)`	`enforce_safe_casting_to_output=b`	`False`
`enforce_linear_iteration(b)`	`enforce_linear_iteration=b`	`False`
`resize_outputs(b)`	`resize_outputs=b`	`True`
`set_check_mem_overlap(b)`	`check_mem_overlap=b`	`True`
`allow_cpu_scalars(b)`	`allow_cpu_scalars=b`	`False`
`is_reduction(b)`	`is_reduction=b`	`False`
`declare_static_dtype(d)`	`static_dtype=d`	`None`
`declare_static_device(dev)`	`static_device=dev`	`None`
`declare_static_shape(s, squash)`	`static_shape=s, squash_dims=squash`	`None`

outputs accepts None placeholders for outputs the iterator should allocate itself; inputs and const_inputs must be defined tensors.

Factory shortcuts#

The C++ named constructors at aten/src/ATen/TensorIterator.cpp (binary_op, unary_op, comparison_op, nullary_op, reduce_op, binary_float_op, unary_float_op) have direct Python equivalents that bake in the canonical flag combinations:

from torch._tensor_iterator import (
    binary_op,
    binary_float_op,
    comparison_op,
    nullary_op,
    reduce_op,
    unary_op,
    unary_float_op,
)

it = binary_op(None, a, b)             # auto-allocate output, promote+cast
it = comparison_op(None, a, b)         # output dtype forced to bool
it = unary_float_op(None, int_tensor)  # promotes int input to float

Each factory mirrors its C++ counterpart’s flag set exactly; reach for them when you’d reach for the C++ named constructor.

Canonical-recipe caveats#

The Python surface is a canonical projection of the C++ builder, not a faithful replay of arbitrary fluent call sequences. Two consequences:

Operand ordering is fixed at outputs → inputs → const_inputs. The C++ builder distinguishes add_input(a); add_const_input(b) from add_const_input(b); add_input(a) – input(0) refers to different operands. The Python surface cannot express that distinction: every inputs[i] precedes every const_inputs[j] in the registered operand list.

Setters are applied as final state, not as a sequence of calls. Some C++ setters have order-dependent side effects – e.g. promote_inputs_to_common_dtype(true) also flips check_all_same_dtype to false. The Python surface materializes the final boolean state of each knob, so it can’t reproduce a sequence where an intermediate setter observed a since-overwritten value.

Every in-tree caller of at::TensorIteratorConfig fits the canonical-recipe shape, so the lossiness is theoretical, not practical.

Inspecting the result#

After construction, the iterator is read-only. Useful properties and methods:

it.ndim           # rank after coalesce/reorder
it.shape          # zero-copy memoryview of int64 dims
it.numel          # product of shape
it.ntensors       # total operands (outputs + inputs)
it.ninputs
it.noutputs
it.is_contiguous
it.is_trivial_1d
it.common_dtype   # inferred computation dtype, or None

it.tensor(i)              # operand at flat index i
it.input(i=0)             # input by input-index
it.output(i=0)            # output by output-index
it.dtype(i=0)             # per-operand dtype
it.device(i=0)            # per-operand device
it.strides(i)             # byte strides, zero-copy memoryview
it.element_strides(i)     # element strides (byte_stride // element_size)

shape and strides(i) return memoryview objects backed by the iterator’s own buffers. They are valid for the lifetime of the iterator; copy via tuple(it.shape) if you need a snapshot.

When to use this#

Pre-dispatch layout analysis. Build a TI on the same operands an aten kernel would, then pattern-match it.ndim / it.strides(i) to decide whether your custom kernel can handle the shape. The _scatter_add_eligibility helper in torch/_native/ops/scatter_add/ is a worked example.
Debugging dtype/promotion surprises. Construct a TI with the flags you think a kernel uses; it.common_dtype and it.dtype(i) show what the builder actually inferred.
Validating custom op contracts. If your kernel claims to handle a certain shape/dtype combination, build a TI and assert on its post-build geometry.

When not to use this#

You want to actually run a kernel. There is no for_each – use the public torch.* op or write a C++ kernel.
You need exact replay of an arbitrary TensorIteratorConfig call sequence (see caveats above).