.. meta::
   :description: A guide to torch.backends.mkldnn, a PyTorch backend to run MKLDNN operations
   :keywords: optimize PyTorch, MKLDNN

.. _mkldnn_backend:

MKLDNN backend
---------------------------------------------------

MKLDNN is an open-source, cross-platform performance library of basic building
blocks for deep learning applications.

.. code:: python

    # The flag below controls whether the MKLDNN backend is enabled in PyTorch.
    torch.backends.mkldnn.enabled = True

Users can disable the MKLDNN backend by:

.. code:: python

    torch.backends.mkldnn.enabled = False

.. _bf16_on_mkldnn:

Bfloat16 (BF16) on MKLDNN backend
---------------------------------------------------

Starting in PyTorch 2.4, there is a set of APIs to control the internal computation
precision for `float32` operators.

.. code:: python

    # The flag below controls the internal computation precision for mkldnn matmul.
    # The default, "ieee", computes in full float32.
    torch.backends.mkldnn.matmul.fp32_precision = "ieee"

    # The flag below controls the internal computation precision for mkldnn conv.
    # The default, "ieee", computes in full float32.
    torch.backends.mkldnn.conv.fp32_precision = "ieee"

    # The flag below controls the internal computation precision for mkldnn rnn.
    # The default, "ieee", computes in full float32.
    torch.backends.mkldnn.rnn.fp32_precision = "ieee"

Note that besides matmuls and convolutions themselves, functions and nn modules that
internally use matmuls or convolutions are also affected. These include
:class:`torch.nn.Linear`, :class:`torch.nn._ConvNd`, :func:`torch.cdist`,
:func:`torch.tensordot`, :func:`torch.nn.functional.affine_grid` and
:func:`torch.nn.functional.grid_sample`, :class:`torch.nn.AdaptiveLogSoftmaxWithLoss`,
:class:`torch.nn.GRU` and :class:`torch.nn.LSTM`.

To get an idea of the precision and speed, see the example code and benchmark data
(on SPR) below:

.. code:: python

    torch.manual_seed(0)
    a_full = torch.randn(10240, 10240, dtype=torch.double)
    b_full = torch.randn(10240, 10240, dtype=torch.double)
    ab_full = a_full @ b_full
    mean = ab_full.abs().mean()  # 80.7451

    a = a_full.float()
    b = b_full.float()

    # Do matmul at BF16 mode.
    torch.backends.mkldnn.matmul.fp32_precision = 'bf16'
    ab_bf16 = a @ b  # expected speedup with BF16 dot-product acceleration
    error = (ab_bf16 - ab_full).abs().max()  # 1.3704
    relative_error = error / mean  # 0.0170
    print(error, relative_error)

    # Do matmul at FP32 mode.
    torch.backends.mkldnn.matmul.fp32_precision = 'ieee'
    ab_fp32 = a @ b
    error = (ab_fp32 - ab_full).abs().max()  # 0.0003
    relative_error = error / mean  # 0.00000317
    print(error, relative_error)

From the above example, we can see that with BF16 the matmul is ~7x faster on SPR, and
that the relative error compared to double precision is approximately 2 orders of
magnitude larger. If full FP32 precision is needed, users can disable BF16 by:

.. code:: python

    torch.backends.mkldnn.matmul.fp32_precision = 'ieee'
    torch.backends.mkldnn.conv.fp32_precision = 'ieee'
    torch.backends.mkldnn.rnn.fp32_precision = 'ieee'

To toggle the BF16 flags off in C++, you can do

.. code:: C++

    at::globalContext().setFloat32Precision("ieee", "mkldnn", "matmul");
    at::globalContext().setFloat32Precision("ieee", "mkldnn", "conv");
    at::globalContext().setFloat32Precision("ieee", "mkldnn", "rnn");
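Back in Python, these precision controls are plain module attributes, so they can also be
toggled temporarily around a specific region of code by saving and restoring the previous
value. The helper below is a minimal sketch (the `matmul_in_bf16` name is illustrative,
not a PyTorch API), assuming the flag can be read back as well as assigned:

.. code:: python

    import torch

    def matmul_in_bf16(a, b):
        # Illustrative helper (not a PyTorch API): run one matmul with BF16
        # internal computation, then restore the previous precision setting.
        previous = torch.backends.mkldnn.matmul.fp32_precision
        torch.backends.mkldnn.matmul.fp32_precision = "bf16"
        try:
            return a @ b
        finally:
            torch.backends.mkldnn.matmul.fp32_precision = previous

    out = matmul_in_bf16(torch.randn(1024, 1024), torch.randn(1024, 1024))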
A generic setting can override the setting for a specific operator or backend when that
operator's or backend's `fp32_precision` is left at `ieee`:

.. code:: python

    torch.backends.fp32_precision = "bf16"
    torch.backends.mkldnn.fp32_precision = "ieee"
    torch.backends.mkldnn.matmul.fp32_precision = "ieee"

In this case, both `torch.backends.mkldnn.fp32_precision` and
`torch.backends.mkldnn.matmul.fp32_precision` are overridden to bf16.
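The snippet below is a self-contained sketch of how this resolution behaves in practice
(the tensor sizes and the expectation about the relative sizes of the errors are
illustrative, not reference results): the generic setting is bf16, the mkldnn and matmul
settings are left at `ieee`, and the result is compared against a double-precision
reference and against a full FP32 run.

.. code:: python

    import torch

    torch.manual_seed(0)
    a = torch.randn(4096, 4096)
    b = torch.randn(4096, 4096)
    ref = a.double() @ b.double()  # double-precision reference

    # Generic setting is bf16; mkldnn and matmul are left at "ieee",
    # so per the resolution rule above they follow the generic setting.
    torch.backends.fp32_precision = "bf16"
    torch.backends.mkldnn.fp32_precision = "ieee"
    torch.backends.mkldnn.matmul.fp32_precision = "ieee"
    error_bf16 = (a @ b - ref).abs().max()

    # Restoring the generic setting to "ieee" gives full FP32 computation.
    torch.backends.fp32_precision = "ieee"
    error_fp32 = (a @ b - ref).abs().max()

    # error_bf16 is expected to be noticeably larger than error_fp32.
    print(error_bf16, error_fp32)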