.. _storage:

Storing large heterogeneous data
================================

TensorDict can back its data by several storage backends so that large
datasets never need to live entirely in process memory.  This page explains
how to choose a backend, declare a schema up-front, read and write data,
store non-tensor values, and combine everything with typed wrappers
(:class:`~tensordict.TensorClass` and :class:`~tensordict.TypedTensorDict`).

.. contents:: On this page
   :local:
   :depth: 2


Quick overview
--------------

.. list-table::
   :header-rows: 1
   :widths: 18 20 20 15 15

   * - Backend
     - ``storage=``
     - Result type
     - Persistence
     - Multi-process
   * - Regular tensors
     - ``None``
     - :class:`~tensordict.TensorDict`
     - No
     - No
   * - Memory-mapped
     - ``"memmap"``
     - :class:`~tensordict.TensorDict`
     - On disk
     - Yes (NFS)
   * - HDF5
     - ``"h5"``
     - :class:`~tensordict.PersistentTensorDict`
     - On disk
     - Limited
   * - Shared memory
     - ``"shared"``
     - :class:`~tensordict.TensorDict`
     - No
     - Yes (same node)
   * - Redis / Dragonfly
     - ``"redis"``
     - :class:`~tensordict.store.TensorDictStore`
     - Server
     - Yes (network)


Declaring a schema with ``from_schema``
---------------------------------------

:meth:`~tensordict.TensorDictBase.from_schema` creates a pre-allocated,
zero-filled :class:`~tensordict.TensorDictBase` from a dictionary that maps
field names to ``(element_shape, dtype)`` pairs.  The ``storage`` keyword
selects which backend is used.

.. code-block:: python

   >>> import torch
   >>> from tensordict import TensorDict

   >>> schema = {
   ...     "obs": ([84, 84, 3], torch.uint8),
   ...     "action": ([4], torch.float32),
   ...     "reward": ([], torch.float32),
   ... }

   >>> td = TensorDict.from_schema(schema, batch_size=[100_000])
   >>> td["obs"].shape
   torch.Size([100000, 84, 84, 3])

Each element shape is prepended by ``batch_size``, so a scalar reward with
``batch_size=[N]`` yields a tensor of shape ``(N,)``.

The ``storage`` keyword selects the backend:

.. code-block:: python

   >>> # Memory-mapped tensors on disk
   >>> td = TensorDict.from_schema(schema, batch_size=[100_000],
   ...                             storage="memmap", prefix="/data/replay")

   >>> # HDF5 file
   >>> td = TensorDict.from_schema(schema, batch_size=[100_000],
   ...                             storage="h5", filename="/data/replay.h5")

   >>> # Shared memory (single-node multi-process)
   >>> td = TensorDict.from_schema(schema, batch_size=[100_000],
   ...                             storage="shared")

   >>> # Redis server (multi-node)
   >>> td = TensorDict.from_schema(schema, batch_size=[100_000],
   ...                             storage="redis", host="redis-node")

Extra keyword arguments are forwarded to the backend constructor.  See each
backend section below for details.


Backend details
---------------

Memory-mapped tensors (``storage="memmap"``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory-mapped tensors live in per-key ``.memmap`` files on disk, accessed via
:class:`~tensordict.MemoryMappedTensor`.  The OS page cache keeps frequently
accessed regions in RAM while allowing the dataset to far exceed physical
memory.

**Pre-allocating from a schema** uses the expand trick internally -- each
value starts as ``torch.zeros(()).expand(shape)``, which allocates no memory,
and :meth:`~tensordict.TensorDictBase.memmap_like` then creates the on-disk
files.

.. code-block:: python

   >>> import torch, tempfile
   >>> from tensordict import TensorDict

   >>> with tempfile.TemporaryDirectory() as d:
   ...     td = TensorDict.from_schema(
   ...         {"obs": ([4], torch.float32), "reward": ([], torch.float32)},
   ...         batch_size=[1_000],
   ...         storage="memmap",
   ...         prefix=d,
   ...     )
   ...     assert td.is_memmap()
   ...     # Fill iteratively -- each write goes directly to disk
   ...     td[0] = TensorDict(obs=torch.randn(4), reward=torch.tensor(1.0), batch_size=[])
   ...     assert (td[0]["reward"] == 1.0).all()

Keyword arguments:

- ``prefix`` -- directory where ``.memmap`` files are stored.

.. note::

   Memory-mapped TensorDicts are locked after creation.  Use
   :meth:`~tensordict.TensorDictBase.set_` and
   :meth:`~tensordict.TensorDictBase.update_` for in-place writes, or index
   assignment (``td[i] = ...``) which is always in-place.

For more details on the memory-mapped API (``memmap_``, ``memmap_like``,
``load_memmap``, directory layout, ``meta.json``), see :ref:`saving`.


HDF5 (``storage="h5"``)
~~~~~~~~~~~~~~~~~~~~~~~

HDF5-backed storage is provided by :class:`~tensordict.PersistentTensorDict`.
Each tensor becomes an HDF5 dataset; nested keys become HDF5 groups.  This is
useful for datasets that must be portable and inspectable with standard tools
like ``h5py`` or ``HDFView``.

.. code-block:: python

   >>> import torch, tempfile
   >>> from tensordict import TensorDict

   >>> with tempfile.NamedTemporaryFile(suffix=".h5") as f:
   ...     td = TensorDict.from_schema(
   ...         {"obs": ([4], torch.float32), "label": ([], torch.int64)},
   ...         batch_size=[500],
   ...         storage="h5",
   ...         filename=f.name,
   ...     )
   ...     td[0] = TensorDict(obs=torch.randn(4), label=torch.tensor(0), batch_size=[])

You can also load an existing file:

.. code-block:: python

   >>> from tensordict import PersistentTensorDict
   >>> td = PersistentTensorDict.from_h5("data.h5")

Or convert an in-memory TensorDict:

.. code-block:: python

   >>> td_mem = TensorDict(obs=torch.randn(500, 4), batch_size=[500])
   >>> td_h5 = PersistentTensorDict.from_dict(td_mem, "data.h5")

Keyword arguments forwarded by ``from_schema``:

- ``filename`` (required) -- path to the HDF5 file.


Shared memory (``storage="shared"``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Shared-memory tensors allow zero-copy access across processes on the same
machine.  This is the fastest option for single-node multi-process setups
(e.g. multi-worker dataloading).

.. code-block:: python

   >>> import torch
   >>> from tensordict import TensorDict

   >>> td = TensorDict.from_schema(
   ...     {"obs": ([4], torch.float32)},
   ...     batch_size=[1000],
   ...     storage="shared",
   ... )
   >>> assert td.is_shared()

No additional keyword arguments are required.

.. note::

   Shared-memory TensorDicts are locked.  Use in-place operations for writes
   (``set_()``, ``update_()``, index assignment).


Redis / Dragonfly (``storage="redis"``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`~tensordict.store.TensorDictStore` stores tensors as raw bytes on a
Redis-compatible server (Redis, Dragonfly, KeyDB).  This enables cross-node
shared data stores and replay buffers without a shared file system.

``from_schema`` on the base class delegates to
:meth:`~tensordict.store.TensorDictStore.from_schema`, which uses
server-side ``SETRANGE`` to pre-allocate storage without sending tensor data
through Python:

.. code-block:: python

   >>> import torch
   >>> from tensordict import TensorDict

   >>> td = TensorDict.from_schema(
   ...     {"obs": ([84, 84, 3], torch.uint8),
   ...      "action": ([4], torch.float32),
   ...      "reward": ([], torch.float32)},
   ...     batch_size=[100_000],
   ...     storage="redis",
   ...     host="redis-node",
   ... )

Keyword arguments:

- ``host`` -- server hostname (default ``"localhost"``).
- ``port`` -- server port (default ``6379``).
- ``db`` -- database number (default ``0``).
- ``unix_socket_path`` -- Unix domain socket (alternative to host/port).
- ``prefix`` -- key namespace (default ``"tensordict"``).

You can also connect to an existing store from another process:

.. code-block:: python

   >>> from tensordict.store import TensorDictStore
   >>> td = TensorDictStore.from_store(td_id="<id>", host="redis-node")


Pre-allocation patterns
-----------------------

Pre-allocating large storage and filling it iteratively avoids allocating the
full dataset in process memory.  All backends support the same pattern via
``from_schema``:

.. code-block:: python

   >>> import torch
   >>> from tensordict import TensorDict

   >>> schema = {
   ...     "image": ([3, 64, 64], torch.uint8),
   ...     "label": ([], torch.int64),
   ... }
   >>> buffer = TensorDict.from_schema(
   ...     schema, batch_size=[1_000_000], storage="memmap", prefix="/data/buffer"
   ... )
   >>> for i, sample in enumerate(data_stream):  # doctest: +SKIP
   ...     buffer[i] = TensorDict(
   ...         image=sample["image"], label=sample["label"], batch_size=[]
   ...     )

The expand trick used internally ensures that no temporary allocation happens
regardless of dataset size.

For memory-mapped storage you can also pre-allocate manually using
:meth:`~tensordict.TensorDictBase.memmap_like`:

.. code-block:: python

   >>> datum = TensorDict(image=torch.zeros(3, 64, 64, dtype=torch.uint8),
   ...                    label=torch.tensor(0), batch_size=[])
   >>> buffer = datum.expand(1_000_000).memmap_like("/data/buffer")


Non-tensor data
---------------

Each backend stores non-tensor values (strings, Python objects) using its
own mechanism:

.. list-table::
   :header-rows: 1
   :widths: 25 40 35

   * - Backend
     - Serialisation
     - Access pattern
   * - memmap
     - JSON in ``meta.json``; pickle fallback (``other.pickle``)
     - :class:`~tensordict.NonTensorData` wrapper
   * - HDF5
     - HDF5 string/opaque datasets
     - :class:`~tensordict.NonTensorData` wrapper on read
   * - Redis
     - JSON string or pickle bytes in Redis ``SET``
     - Transparent via metadata hash

For **memmap**, non-tensor data is serialised via tensorclass's
:class:`~tensordict.NonTensorData`:

.. code-block:: python

   >>> from tensordict import TensorDict, NonTensorData
   >>> td = TensorDict(
   ...     obs=torch.randn(4, 3),
   ...     label=NonTensorData(data="cat", batch_size=[4]),
   ...     batch_size=[4],
   ... )
   >>> td_mm = td.memmap_("/tmp/example")
   >>> loaded = TensorDict.load_memmap("/tmp/example")  # doctest: +SKIP
   >>> loaded["label"].data  # doctest: +SKIP
   'cat'

For **HDF5**, non-tensor values are stored as HDF5 string or opaque datasets:

.. code-block:: python

   >>> from tensordict import PersistentTensorDict
   >>> td_h5 = PersistentTensorDict(filename="data.h5", mode="w", batch_size=[4])  # doctest: +SKIP
   >>> td_h5["label"] = NonTensorData(data="cat", batch_size=[4])  # doctest: +SKIP

For **Redis**, non-tensor data is transparently serialised as JSON (falling
back to pickle for non-JSON-serialisable objects):

.. code-block:: python

   >>> from tensordict.store import TensorDictStore  # doctest: +SKIP
   >>> store = TensorDictStore(batch_size=[4])  # doctest: +SKIP
   >>> store["label"] = NonTensorData(data="cat", batch_size=[4])  # doctest: +SKIP


Typed wrappers
--------------

Both :class:`~tensordict.TensorClass` and :class:`~tensordict.TypedTensorDict`
can wrap any backend via ``from_tensordict``.  Combined with ``from_schema``,
this gives you typed, pre-allocated, backend-agnostic data stores.

Using TypedTensorDict
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   >>> import torch
   >>> from tensordict import TensorDict, TypedTensorDict
   >>> from torch import Tensor

   >>> class Replay(TypedTensorDict):
   ...     obs: Tensor
   ...     action: Tensor
   ...     reward: Tensor

   >>> # Pre-allocate with any backend, then wrap
   >>> store = TensorDict.from_schema(
   ...     {"obs": ([4], torch.float32),
   ...      "action": ([2], torch.float32),
   ...      "reward": ([], torch.float32)},
   ...     batch_size=[10_000],
   ...     storage="memmap",
   ...     prefix="/data/replay",
   ... )
   >>> replay = Replay.from_tensordict(store)
   >>> replay.obs.shape
   torch.Size([10000, 4])

   >>> # Fill iteratively with full type-safety
   >>> replay[0] = Replay(
   ...     obs=torch.randn(4), action=torch.randn(2), reward=torch.tensor(1.0),
   ...     batch_size=[],
   ... )

Using TensorClass
~~~~~~~~~~~~~~~~~

.. code-block:: python

   >>> import torch
   >>> from tensordict import TensorDict, TensorClass
   >>> from torch import Tensor

   >>> class Transition(TensorClass):
   ...     obs: Tensor
   ...     action: Tensor
   ...     reward: float  # non-tensor field

   >>> store = TensorDict.from_schema(
   ...     {"obs": ([4], torch.float32),
   ...      "action": ([2], torch.float32)},
   ...     batch_size=[10_000],
   ...     storage="shared",
   ... )
   >>> tc = Transition.from_tensordict(store, non_tensordict={"reward": 0.0})  # doctest: +SKIP

Unlike :class:`~tensordict.TypedTensorDict`, :class:`~tensordict.TensorClass`
supports non-tensor fields (strings, numbers, arbitrary Python objects).
See :ref:`the compatibility page <compat-redis-prealloc>` for the full
backend-support matrix.


Choosing a backend
------------------

- **Memory-mapped** -- best for large on-disk datasets where you want
  memory-efficient random access (replay buffers, offline RL, large
  datasets that exceed RAM).  Works across processes via NFS.
- **HDF5** -- best when you need a portable, self-describing file format
  inspectable with standard tools.  Good for archival.
- **Shared memory** -- best for single-node multi-process workloads
  (multi-worker dataloading, parallel envs).  Fastest IPC but data does not
  persist.
- **Redis** -- best for multi-node shared data stores (distributed replay
  buffers, parameter servers).  Requires a running server.
- **Plain tensors** -- best for small datasets that fit in memory.  No
  overhead, full PyTorch API.