Explicit horizontal fusion with foreach_map and torch.compile#

Author: Michael Lazos

Horizontal fusion is a key optimization in ML compilers. In eager mode, it is typically expressed with the torch._foreach* ops, which parallelize an operation across a list of tensors. However, supporting every permutation of arguments (for example, mixtures of scalars and lists) is quite difficult. Foreach_map converts any pointwise op in torch into a horizontally fused foreach variant. In this tutorial, we will demonstrate how to implement the Adam optimizer with foreach_map to generate a fully fused kernel.
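For reference, here is what eager horizontal fusion with a torch._foreach* op looks like. This is a minimal standalone sketch, not part of the recipe, and it runs on CPU:

```python
import torch

# A _foreach op applies one fused update across a whole list of tensors,
# instead of launching one op per tensor from a Python loop.
params = [torch.ones(4) * (i + 1) for i in range(3)]
torch._foreach_mul_(params, 0.5)  # scales every tensor in the list in place
print([p[0].item() for p in params])  # → [0.5, 1.0, 1.5]
```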

Note

This recipe describes a prototype feature. Prototype features are typically at an early stage for feedback and testing and are subject to change.

Prerequisites#

  • PyTorch v2.7.0 or later

Model Setup#

For this example, we’ll use a simple sequence of linear layers. We instantiate an independent copy to compare the two optimizer implementations.

import torch

# exit cleanly if we are on a device that doesn't support ``torch.compile``
if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (7, 0):
    print("Exiting because torch.compile is not supported on this device.")
    import sys
    sys.exit(0)

# Create simple model
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
model_copy = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
input = torch.rand(1024, device="cuda")

# run forward pass
output = model(input)
output_copy = model_copy(input)

# run backward to populate the grads for our optimizer below
output.sum().backward()
output_copy.sum().backward()

Helper functions for foreach_map implementation#

In this section, we’ll begin our implementation of the Adam optimizer.

from torch._higher_order_ops.foreach_map import foreach_map

# Helper function to extract optimizer states from a torch.optim.Adam instance
def get_inputs(optim):
    steps = []
    params = []
    exp_avgs = []
    exp_avg_sqs = []
    for group in optim.param_groups:
        for p in group["params"]:
            params.append(p)
            state = optim.state[p]
            exp_avgs.append(state["exp_avg"])
            exp_avg_sqs.append(state["exp_avg_sq"])
            steps.append(state["step"])

    return steps, params, exp_avgs, exp_avg_sqs
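One assumption baked into ``get_inputs`` is that the optimizer has already taken a step: torch.optim.Adam creates ``exp_avg``, ``exp_avg_sq``, and ``step`` lazily on the first call to ``step()`` (which is why we warm up the optimizers before extracting state below). A quick standalone check, not part of the recipe, illustrates this:

```python
import torch

p = torch.nn.Parameter(torch.randn(8))
opt = torch.optim.Adam([p], lr=1e-3)
assert len(opt.state) == 0  # no per-parameter state exists yet

p.grad = torch.randn_like(p)
opt.step()  # the first step initializes the state

state = opt.state[p]
assert {"exp_avg", "exp_avg_sq", "step"} <= set(state.keys())
```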


# Functions to update the different optimizer states
def update_exp_avg_sq(exp_avg_sq, grad, beta2):
    return exp_avg_sq.mul(beta2).addcmul(grad, grad, value=1 - beta2)

def update_param(param, step, exp_avg, exp_avg_sq, beta1, beta2, lr, eps):
    bias_correction1 = 1 - torch.pow(beta1, step)
    bias_correction2 = (1 - torch.pow(beta2, step)).sqrt()
    step_size = (lr / bias_correction1).neg()
    denom = (exp_avg_sq.sqrt() / (bias_correction2 * step_size)).add(eps / step_size)
    return torch.add(param, torch.div(exp_avg, denom))

# Our full Adam implementation
def foreach_map_adam(
    steps,
    params,
    exp_avgs,
    exp_avg_sqs,
    weight_decay=0,
    beta1=0.9,
    beta2=0.999,
    lr=1e-3,
    eps=1e-8,
):
    with torch.no_grad():
        grads = [param.grad for param in params]
        # update step
        updated_steps = foreach_map(lambda x: x + 1, steps)
        torch._foreach_copy_(steps, updated_steps)

        if weight_decay != 0:
            # fold L2 weight decay into the gradients: grad = grad + weight_decay * param
            grads = foreach_map(lambda g, p: g.add(p, alpha=weight_decay), grads, params)

        # Higher-order operators (HOPs) cannot return multiple outputs at the moment,
        # so we call foreach_map once per output
        exp_avgs_updated = foreach_map(torch.lerp, exp_avgs, grads, 1 - beta1)
        exp_avgs_sq_updated = foreach_map(update_exp_avg_sq, exp_avg_sqs, grads, beta2)
        params_updated = foreach_map(
            update_param,
            params,
            steps,
            exp_avgs_updated,
            exp_avgs_sq_updated,
            beta1,
            beta2,
            lr,
            eps,
        )
        # Higher-order operators (HOPs) don't support input mutation today,
        # so we manually update the states in-place
        torch._foreach_copy_(exp_avgs, exp_avgs_updated)
        torch._foreach_copy_(exp_avg_sqs, exp_avgs_sq_updated)
        torch._foreach_copy_(params, params_updated)
    return
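For reference, the update computed by ``update_exp_avg_sq`` and ``update_param`` is algebraically the standard bias-corrected Adam step. In common notation, with :math:`m_t` the ``exp_avg``, :math:`v_t` the ``exp_avg_sq``, and :math:`g_t` the gradient (a sketch of the math, with the lerp written out explicitly):

```latex
m_t = m_{t-1} + (1 - \beta_1)(g_t - m_{t-1}), \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
```

```latex
\theta_t = \theta_{t-1} - \frac{lr}{1 - \beta_1^t} \cdot
\frac{m_t}{\sqrt{v_t} / \sqrt{1 - \beta_2^t} + \epsilon}
```

``update_param`` folds the negated step size :math:`-lr/(1-\beta_1^t)` into the denominator, which is why the code adds ``exp_avg / denom`` to the parameter rather than subtracting.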

Setting up and running the compiled kernel#

In this section, we’ll run our Adam optimizer and compare the results.

Note

torch.compile is only supported on CUDA devices that have a compute capability of 7.0 or higher.

# use lr=1e-3 so the eager optimizers match the defaults of ``foreach_map_adam`` above
opt_eager = torch.optim.Adam(model.parameters(), lr=torch.tensor(1e-3))
opt_eager_copy = torch.optim.Adam(model_copy.parameters(), lr=torch.tensor(1e-3))

# warm up the optimizer state (Adam initializes its state lazily on the first step)
opt_eager.step()
opt_eager_copy.step()

inputs = get_inputs(opt_eager_copy)
compiled_adam = torch.compile(foreach_map_adam)

# optionally view the output code
torch._logging.set_logs(output_code=True)

# Warmup runs to compile the function
for _ in range(5):
    opt_eager.step()
    compiled_adam(*inputs)

for eager_p, compile_p in zip(
    opt_eager.param_groups[0]["params"], opt_eager_copy.param_groups[0]["params"]
):
    assert torch.allclose(eager_p, compile_p), "eager and compiled Adam diverged"

# Benchmark performance

# Let's define a helpful benchmarking function:
import torch.utils.benchmark as benchmark

def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
    )
    return t0.blocked_autorange().mean * 1e6

eager_runtime = benchmark_torch_function_in_microseconds(opt_eager.step)
compiled_runtime = benchmark_torch_function_in_microseconds(lambda: compiled_adam(*inputs))

print(f"eager runtime: {eager_runtime}us")
print(f"compiled runtime: {compiled_runtime}us")

assert eager_runtime > compiled_runtime
Output code (verbose log prefixes stripped; the generated Triton source is truncated, and the repeated per-parameter branches and the full kernel signature are elided):

# AOT ID: ['0_inference']
import torch
import triton
import triton.language as tl
from torch._inductor.runtime import triton_helpers, triton_heuristics
from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math
from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DeviceProperties

# kernel path: /tmp/torchinductor_ci-user/al/calrezlmzale753uatf4r4hyoxrgj2cygyga4s35ygdnlqxtbqrk.py
@triton_heuristics.foreach(
    num_warps=8,
    triton_meta={...},   # 50 input pointers/scalars and 30 output pointers, all fp32
    inductor_meta={'grid_type': 'SequentialComboKernelGrid',
                   'combo_grid_meta': {'num_kernels': 10, 'default_config': {'XBLOCK': 1024}, ...},
                   'kernel_name': 'triton_for_fused_0', ...},
)
@triton.jit
def triton_for_fused_0(in_ptr0, in_ptr1, ..., in_ptr49, out_ptr6, out_ptr7, ..., out_ptr89):
    pid = tl.program_id(0)
    XBLOCK: tl.constexpr = 1024
    num_xblocks_0 = tl.cdiv(1048576, XBLOCK)
    num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK)
    # ... num_xblocks_2 through num_xblocks_9 accumulate in the same way
    if pid < num_xblocks_0:
        pid_offset = pid
        xnumel = 1048576
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), None)
        tmp1 = tl.load(in_ptr1 + (x0), None)
        tmp8 = tl.load(in_ptr2 + (x0), None)
        tmp15 = tl.load(in_ptr3 + (x0), None)
        tmp17 = in_ptr4
        tmp2 = tmp0 - tmp1
        tmp3 = 0.10000000149011612
        tmp4 = tmp3 * tmp2
        tmp5 = tl.full([1], False, tl.int1)
        tmp6 = tl.where(tmp5, tmp0, tmp1)
        tmp7 = tmp4 + tmp6
        tmp9 = 0.999
        tmp10 = tmp8 * tmp9
        tmp11 = 0.0010000000000000009
        tmp12 = tmp0 * tmp11
        tmp13 = tmp12 * tmp0
        tmp14 = tmp10 + tmp13
        tmp16 = libdevice.sqrt(tmp14)
        tmp18 = 1.0
        tmp19 = tmp17 + tmp18
        tmp20 = libdevice.pow(tmp9, tmp19)
        tmp21 = tmp18 - tmp20
        tmp22 = libdevice.sqrt(tmp21)
        tmp23 = 0.9
        tmp24 = libdevice.pow(tmp23, tmp19)
        tmp25 = tmp18 - tmp24
        tmp26 = tl.full([1], 1, tl.int32)
        tmp27 = (tmp26 / tmp25)
        tmp28 = 0.001
        tmp29 = tmp27 * tmp28
        tmp30 = -tmp29
        tmp31 = tmp22 * tmp30
        tmp32 = (tmp16 / tmp31)
        tmp33 = (tmp26 / tmp30)
        tmp34 = 1e-08
        tmp35 = tmp33 * tmp34
        tmp36 = tmp32 + tmp35
        tmp37 = (tmp7 / tmp36)
        tmp38 = tmp15 + tmp37
        tl.store(out_ptr6 + (x0), tmp38, None)
        tl.store(out_ptr7 + (x0), tmp7, None)
        tl.store(out_ptr8 + (x0), tmp14, None)
    elif pid < num_xblocks_1:
        # the identical Adam update repeats here for each of the remaining 9 parameters
        ...
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp82 = tmp81 * tmp80
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp83 = tl.full([1], False, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp84 = tl.where(tmp83, tmp78, tmp79)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp85 = tmp82 + tmp84
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp87 = 0.999
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp88 = tmp86 * tmp87
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp89 = 0.0010000000000000009
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp90 = tmp78 * tmp89
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp91 = tmp90 * tmp78
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp92 = tmp88 + tmp91
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp94 = libdevice.sqrt(tmp92)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp96 = 1.0
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp97 = tmp95 + tmp96
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp98 = libdevice.pow(tmp87, tmp97)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp99 = tmp96 - tmp98
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp100 = libdevice.sqrt(tmp99)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp101 = 0.9
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp102 = libdevice.pow(tmp101, tmp97)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp103 = tmp96 - tmp102
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp104 = tl.full([1], 1, tl.int32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp105 = (tmp104 / tmp103)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp106 = 0.001
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp107 = tmp105 * tmp106
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp108 = -tmp107
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp109 = tmp100 * tmp108
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp110 = (tmp94 / tmp109)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp111 = (tmp104 / tmp108)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp112 = 1e-08
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp113 = tmp111 * tmp112
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp114 = tmp110 + tmp113
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp115 = (tmp85 / tmp114)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp116 = tmp93 + tmp115
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr24 + (x2), tmp116, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr25 + (x2), tmp85, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr26 + (x2), tmp92, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     elif pid < num_xblocks_3:
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         pid_offset = pid - num_xblocks_2
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xnumel = 1048576
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         r0_numel = 1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         x3 = xindex
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp117 = tl.load(in_ptr15 + (x3), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp118 = tl.load(in_ptr16 + (x3), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp125 = tl.load(in_ptr17 + (x3), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp132 = tl.load(in_ptr18 + (x3), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp134 = in_ptr19
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp119 = tmp117 - tmp118
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp120 = 0.10000000149011612
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp121 = tmp120 * tmp119
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp122 = tl.full([1], False, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp123 = tl.where(tmp122, tmp117, tmp118)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp124 = tmp121 + tmp123
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp126 = 0.999
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp127 = tmp125 * tmp126
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp128 = 0.0010000000000000009
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp129 = tmp117 * tmp128
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp130 = tmp129 * tmp117
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp131 = tmp127 + tmp130
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp133 = libdevice.sqrt(tmp131)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp135 = 1.0
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp136 = tmp134 + tmp135
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp137 = libdevice.pow(tmp126, tmp136)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp138 = tmp135 - tmp137
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp139 = libdevice.sqrt(tmp138)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp140 = 0.9
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp141 = libdevice.pow(tmp140, tmp136)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp142 = tmp135 - tmp141
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp143 = tl.full([1], 1, tl.int32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp144 = (tmp143 / tmp142)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp145 = 0.001
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp146 = tmp144 * tmp145
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp147 = -tmp146
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp148 = tmp139 * tmp147
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp149 = (tmp133 / tmp148)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp150 = (tmp143 / tmp147)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp151 = 1e-08
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp152 = tmp150 * tmp151
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp153 = tmp149 + tmp152
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp154 = (tmp124 / tmp153)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp155 = tmp132 + tmp154
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr33 + (x3), tmp155, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr34 + (x3), tmp124, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr35 + (x3), tmp131, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     elif pid < num_xblocks_4:
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         pid_offset = pid - num_xblocks_3
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xnumel = 1048576
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         r0_numel = 1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         x4 = xindex
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp156 = tl.load(in_ptr20 + (x4), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp157 = tl.load(in_ptr21 + (x4), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp164 = tl.load(in_ptr22 + (x4), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp171 = tl.load(in_ptr23 + (x4), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp173 = in_ptr24
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp158 = tmp156 - tmp157
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp159 = 0.10000000149011612
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp160 = tmp159 * tmp158
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp161 = tl.full([1], False, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp162 = tl.where(tmp161, tmp156, tmp157)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp163 = tmp160 + tmp162
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp165 = 0.999
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp166 = tmp164 * tmp165
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp167 = 0.0010000000000000009
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp168 = tmp156 * tmp167
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp169 = tmp168 * tmp156
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp170 = tmp166 + tmp169
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp172 = libdevice.sqrt(tmp170)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp174 = 1.0
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp175 = tmp173 + tmp174
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp176 = libdevice.pow(tmp165, tmp175)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp177 = tmp174 - tmp176
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp178 = libdevice.sqrt(tmp177)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp179 = 0.9
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp180 = libdevice.pow(tmp179, tmp175)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp181 = tmp174 - tmp180
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp182 = tl.full([1], 1, tl.int32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp183 = (tmp182 / tmp181)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp184 = 0.001
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp185 = tmp183 * tmp184
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp186 = -tmp185
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp187 = tmp178 * tmp186
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp188 = (tmp172 / tmp187)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp189 = (tmp182 / tmp186)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp190 = 1e-08
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp191 = tmp189 * tmp190
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp192 = tmp188 + tmp191
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp193 = (tmp163 / tmp192)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp194 = tmp171 + tmp193
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr42 + (x4), tmp194, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr43 + (x4), tmp163, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr44 + (x4), tmp170, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     elif pid < num_xblocks_5:
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         pid_offset = pid - num_xblocks_4
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xnumel = 1048576
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         r0_numel = 1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         x5 = xindex
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp195 = tl.load(in_ptr25 + (x5), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp196 = tl.load(in_ptr26 + (x5), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp203 = tl.load(in_ptr27 + (x5), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp210 = tl.load(in_ptr28 + (x5), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp212 = in_ptr29
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp197 = tmp195 - tmp196
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp198 = 0.10000000149011612
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp199 = tmp198 * tmp197
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp200 = tl.full([1], False, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp201 = tl.where(tmp200, tmp195, tmp196)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp202 = tmp199 + tmp201
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp204 = 0.999
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp205 = tmp203 * tmp204
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp206 = 0.0010000000000000009
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp207 = tmp195 * tmp206
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp208 = tmp207 * tmp195
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp209 = tmp205 + tmp208
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp211 = libdevice.sqrt(tmp209)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp213 = 1.0
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp214 = tmp212 + tmp213
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp215 = libdevice.pow(tmp204, tmp214)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp216 = tmp213 - tmp215
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp217 = libdevice.sqrt(tmp216)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp218 = 0.9
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp219 = libdevice.pow(tmp218, tmp214)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp220 = tmp213 - tmp219
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp221 = tl.full([1], 1, tl.int32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp222 = (tmp221 / tmp220)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp223 = 0.001
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp224 = tmp222 * tmp223
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp225 = -tmp224
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp226 = tmp217 * tmp225
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp227 = (tmp211 / tmp226)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp228 = (tmp221 / tmp225)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp229 = 1e-08
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp230 = tmp228 * tmp229
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp231 = tmp227 + tmp230
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp232 = (tmp202 / tmp231)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp233 = tmp210 + tmp232
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr51 + (x5), tmp233, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr52 + (x5), tmp202, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr53 + (x5), tmp209, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     elif pid < num_xblocks_6:
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         pid_offset = pid - num_xblocks_5
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xnumel = 1048576
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         r0_numel = 1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         x6 = xindex
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp234 = tl.load(in_ptr30 + (x6), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp235 = tl.load(in_ptr31 + (x6), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp242 = tl.load(in_ptr32 + (x6), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp249 = tl.load(in_ptr33 + (x6), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp251 = in_ptr34
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp236 = tmp234 - tmp235
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp237 = 0.10000000149011612
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp238 = tmp237 * tmp236
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp239 = tl.full([1], False, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp240 = tl.where(tmp239, tmp234, tmp235)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp241 = tmp238 + tmp240
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp243 = 0.999
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp244 = tmp242 * tmp243
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp245 = 0.0010000000000000009
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp246 = tmp234 * tmp245
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp247 = tmp246 * tmp234
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp248 = tmp244 + tmp247
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp250 = libdevice.sqrt(tmp248)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp252 = 1.0
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp253 = tmp251 + tmp252
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp254 = libdevice.pow(tmp243, tmp253)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp255 = tmp252 - tmp254
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp256 = libdevice.sqrt(tmp255)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp257 = 0.9
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp258 = libdevice.pow(tmp257, tmp253)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp259 = tmp252 - tmp258
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp260 = tl.full([1], 1, tl.int32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp261 = (tmp260 / tmp259)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp262 = 0.001
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp263 = tmp261 * tmp262
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp264 = -tmp263
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp265 = tmp256 * tmp264
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp266 = (tmp250 / tmp265)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp267 = (tmp260 / tmp264)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp268 = 1e-08
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp269 = tmp267 * tmp268
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp270 = tmp266 + tmp269
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp271 = (tmp241 / tmp270)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp272 = tmp249 + tmp271
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr60 + (x6), tmp272, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr61 + (x6), tmp241, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr62 + (x6), tmp248, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     elif pid < num_xblocks_7:
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         pid_offset = pid - num_xblocks_6
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xnumel = 1048576
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         r0_numel = 1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         x7 = xindex
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp273 = tl.load(in_ptr35 + (x7), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp274 = tl.load(in_ptr36 + (x7), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp281 = tl.load(in_ptr37 + (x7), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp288 = tl.load(in_ptr38 + (x7), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp290 = in_ptr39
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp275 = tmp273 - tmp274
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp276 = 0.10000000149011612
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp277 = tmp276 * tmp275
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp278 = tl.full([1], False, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp279 = tl.where(tmp278, tmp273, tmp274)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp280 = tmp277 + tmp279
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp282 = 0.999
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp283 = tmp281 * tmp282
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp284 = 0.0010000000000000009
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp285 = tmp273 * tmp284
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp286 = tmp285 * tmp273
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp287 = tmp283 + tmp286
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp289 = libdevice.sqrt(tmp287)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp291 = 1.0
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp292 = tmp290 + tmp291
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp293 = libdevice.pow(tmp282, tmp292)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp294 = tmp291 - tmp293
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp295 = libdevice.sqrt(tmp294)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp296 = 0.9
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp297 = libdevice.pow(tmp296, tmp292)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp298 = tmp291 - tmp297
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp299 = tl.full([1], 1, tl.int32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp300 = (tmp299 / tmp298)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp301 = 0.001
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp302 = tmp300 * tmp301
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp303 = -tmp302
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp304 = tmp295 * tmp303
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp305 = (tmp289 / tmp304)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp306 = (tmp299 / tmp303)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp307 = 1e-08
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp308 = tmp306 * tmp307
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp309 = tmp305 + tmp308
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp310 = (tmp280 / tmp309)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp311 = tmp288 + tmp310
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr69 + (x7), tmp311, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr70 + (x7), tmp280, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr71 + (x7), tmp287, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     elif pid < num_xblocks_8:
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         pid_offset = pid - num_xblocks_7
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xnumel = 1048576
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         r0_numel = 1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         x8 = xindex
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp312 = tl.load(in_ptr40 + (x8), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp313 = tl.load(in_ptr41 + (x8), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp320 = tl.load(in_ptr42 + (x8), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp327 = tl.load(in_ptr43 + (x8), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp329 = in_ptr44
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp314 = tmp312 - tmp313
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp315 = 0.10000000149011612
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp316 = tmp315 * tmp314
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp317 = tl.full([1], False, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp318 = tl.where(tmp317, tmp312, tmp313)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp319 = tmp316 + tmp318
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp321 = 0.999
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp322 = tmp320 * tmp321
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp323 = 0.0010000000000000009
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp324 = tmp312 * tmp323
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp325 = tmp324 * tmp312
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp326 = tmp322 + tmp325
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp328 = libdevice.sqrt(tmp326)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp330 = 1.0
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp331 = tmp329 + tmp330
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp332 = libdevice.pow(tmp321, tmp331)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp333 = tmp330 - tmp332
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp334 = libdevice.sqrt(tmp333)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp335 = 0.9
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp336 = libdevice.pow(tmp335, tmp331)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp337 = tmp330 - tmp336
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp338 = tl.full([1], 1, tl.int32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp339 = (tmp338 / tmp337)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp340 = 0.001
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp341 = tmp339 * tmp340
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp342 = -tmp341
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp343 = tmp334 * tmp342
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp344 = (tmp328 / tmp343)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp345 = (tmp338 / tmp342)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp346 = 1e-08
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp347 = tmp345 * tmp346
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp348 = tmp344 + tmp347
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp349 = (tmp319 / tmp348)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp350 = tmp327 + tmp349
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr78 + (x8), tmp350, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr79 + (x8), tmp319, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr80 + (x8), tmp326, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     elif pid < num_xblocks_9:
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         pid_offset = pid - num_xblocks_8
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xnumel = 1048576
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         r0_numel = 1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         x9 = xindex
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp351 = tl.load(in_ptr45 + (x9), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp352 = tl.load(in_ptr46 + (x9), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp359 = tl.load(in_ptr47 + (x9), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp366 = tl.load(in_ptr48 + (x9), None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp368 = in_ptr49
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp353 = tmp351 - tmp352
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp354 = 0.10000000149011612
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp355 = tmp354 * tmp353
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp356 = tl.full([1], False, tl.int1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp357 = tl.where(tmp356, tmp351, tmp352)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp358 = tmp355 + tmp357
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp360 = 0.999
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp361 = tmp359 * tmp360
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp362 = 0.0010000000000000009
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp363 = tmp351 * tmp362
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp364 = tmp363 * tmp351
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp365 = tmp361 + tmp364
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp367 = libdevice.sqrt(tmp365)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp369 = 1.0
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp370 = tmp368 + tmp369
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp371 = libdevice.pow(tmp360, tmp370)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp372 = tmp369 - tmp371
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp373 = libdevice.sqrt(tmp372)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp374 = 0.9
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp375 = libdevice.pow(tmp374, tmp370)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp376 = tmp369 - tmp375
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp377 = tl.full([1], 1, tl.int32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp378 = (tmp377 / tmp376)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp379 = 0.001
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp380 = tmp378 * tmp379
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp381 = -tmp380
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp382 = tmp373 * tmp381
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp383 = (tmp367 / tmp382)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp384 = (tmp377 / tmp381)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp385 = 1e-08
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp386 = tmp384 * tmp385
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp387 = tmp383 + tmp386
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp388 = (tmp358 / tmp387)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tmp389 = tmp366 + tmp388
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr87 + (x9), tmp389, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr88 + (x9), tmp358, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         tl.store(out_ptr89 + (x9), tmp365, None)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     else:
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         pass
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] ''', device_str='cuda')
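The Triton kernel above is a single "combo" kernel: one flat grid of program ids is partitioned into contiguous ranges (`num_xblocks_0` through `num_xblocks_9`), and the `elif pid < num_xblocks_N` chain routes each program id to the Adam update for one of the ten parameter tensors, so all ten updates run in one launch. The following pure-Python sketch is an illustrative analogue of that dispatch pattern only (not real Triton; `combo_dispatch`, `xblock`, and the stand-in pointwise op are hypothetical names for this sketch):

```python
def combo_dispatch(tensors, xblock):
    # Cumulative block counts play the role of num_xblocks_0..num_xblocks_9:
    # boundaries[i] is the first program id NOT belonging to tensor i.
    boundaries = []
    total = 0
    for t in tensors:
        total += -(-len(t) // xblock)  # ceil-div: blocks needed for this tensor
        boundaries.append(total)
    out = [list(t) for t in tensors]
    for pid in range(total):  # one iteration per "program id" in the flat grid
        # Find which tensor this pid falls into (the elif chain in the kernel).
        idx = next(i for i, b in enumerate(boundaries) if pid < b)
        base = boundaries[idx - 1] if idx > 0 else 0
        start = (pid - base) * xblock  # xoffset = pid_offset * XBLOCK
        for x in range(start, min(start + xblock, len(tensors[idx]))):
            out[idx][x] = tensors[idx][x] + 1.0  # stand-in pointwise op
    return out

# Two "tensors" of different lengths are updated by one flat grid.
result = combo_dispatch([[0.0] * 5, [0.0] * 3], xblock=2)
```

Because every branch of the dispatch is pointwise over its own range, no program id touches another tensor's data, which is what makes this horizontal fusion safe.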
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] cpp_fused__foreach_copy_1 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] #include <torch/csrc/inductor/cpp_prefix.h>
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] extern "C"  void kernel(const float* in_ptr0,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr1,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr2,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr3,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr4,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr5,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr6,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr7,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr8,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        const float* in_ptr9,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr1,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr3,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr5,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr7,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr9,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr11,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr13,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr15,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr17,
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                        float* out_ptr19)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr1[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr3[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr5[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr7[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr9[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr11[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr13[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr15[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr17[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             {
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]                 out_ptr19[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]             }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] }
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] ''')
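The C++ kernel above handles the one non-GPU piece of state: it reads each of the ten scalar `step` counters, adds `1.0`, and writes the result back, one braced block per counter. Functionally it amounts to something like the sketch below (a minimal illustration; `increment_steps` is a hypothetical name, and the real kernel operates on raw float pointers rather than a Python list):

```python
def increment_steps(steps):
    # Each scalar step counter is incremented by 1.0, mirroring the
    # per-pointer `tmp2 = tmp0 + 1.0` blocks in the generated C++ kernel.
    return [s + 1.0 for s in steps]

# Ten step counters, as in the Adam example above.
new_steps = increment_steps([0.0] * 10)
```

Inductor emits this as CPU code because the step counters are 0-dimensional CPU tensors, so launching a GPU kernel for ten scalar adds would be wasted overhead.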
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] async_compile.wait(globals())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] del async_compile
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] def call(args):
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = args
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     args.clear()
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg0_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg1_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg2_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg3_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg4_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg5_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg6_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg7_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg8_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg9_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg10_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg11_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg12_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg13_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg14_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg15_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg16_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg17_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg18_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg19_1, (), ())
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg20_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg21_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg22_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg23_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg24_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg25_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg26_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg27_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg28_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg29_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg30_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg31_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg32_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg33_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg34_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg35_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg36_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg37_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg38_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg39_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg40_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg41_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg42_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg43_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg44_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg45_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg46_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg47_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg48_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     assert_size_stride(arg49_1, (1024, 1024), (1024, 1))
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     with torch.cuda._DeviceGuard(0):
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         torch.cuda.set_device(0)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         # Unsorted Source Nodes: [], Original ATen: []
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         stream0 = get_raw_stream(0)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         triton_for_fused_0.run(arg30_1, arg20_1, arg40_1, arg0_1, arg10_1, arg31_1, arg21_1, arg41_1, arg1_1, arg11_1, arg32_1, arg22_1, arg42_1, arg2_1, arg12_1, arg33_1, arg23_1, arg43_1, arg3_1, arg13_1, arg34_1, arg24_1, arg44_1, arg4_1, arg14_1, arg35_1, arg25_1, arg45_1, arg5_1, arg15_1, arg36_1, arg26_1, arg46_1, arg6_1, arg16_1, arg37_1, arg27_1, arg47_1, arg7_1, arg17_1, arg38_1, arg28_1, arg48_1, arg8_1, arg18_1, arg39_1, arg29_1, arg49_1, arg9_1, arg19_1, arg0_1, arg20_1, arg40_1, arg1_1, arg21_1, arg41_1, arg2_1, arg22_1, arg42_1, arg3_1, arg23_1, arg43_1, arg4_1, arg24_1, arg44_1, arg5_1, arg25_1, arg45_1, arg6_1, arg26_1, arg46_1, arg7_1, arg27_1, arg47_1, arg8_1, arg28_1, arg48_1, arg9_1, arg29_1, arg49_1, stream=stream0)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg0_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg1_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg20_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg21_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg22_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg23_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg24_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg25_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg26_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg27_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg28_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg29_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg2_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg30_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg31_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg32_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg33_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg34_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg35_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg36_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg37_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg38_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg39_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg3_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg40_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg41_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg42_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg43_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg44_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg45_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg46_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg47_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg48_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg49_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg4_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg5_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg6_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg7_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg8_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]         del arg9_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     cpp_fused__foreach_copy_1(arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg10_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg11_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg12_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg13_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg14_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg15_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg16_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg17_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg18_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     del arg19_1
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     return ()
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     from torch._inductor.utils import print_performance
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg10_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg11_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg12_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg13_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg14_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg15_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg16_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg17_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg18_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg19_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg20_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg21_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg22_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg23_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg24_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg25_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg26_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg27_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg28_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg29_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1])
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code] if __name__ == "__main__":
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0813 15:24:49.532000 22304 torch/_inductor/graph.py:2345] [0/0] [__output_code]
V0813 15:24:49.577000 22304 torch/_inductor/graph.py:2356] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/ff/cffqwnpkieergngjngozdun467la5vp6eyiisxxpikirosuditrp.py
I0813 15:24:53.487000 22304 torch/_inductor/graph.py:2317] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/ff/cffqwnpkieergngjngozdun467la5vp6eyiisxxpikirosuditrp.py
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] Output code:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] # AOT ID: ['1_inference']
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from ctypes import c_void_p, c_long, c_int
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import torch
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import math
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import random
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import os
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import tempfile
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from math import inf, nan
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from cmath import nanj
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.utils import maybe_profile
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch import device, empty_strided
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import triton
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import triton.language as tl
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] aten = torch.ops.aten
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] inductor_ops = torch.ops.inductor
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] _quantized = torch.ops._quantized
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] assert_alignment = torch._C._dynamo.guards.assert_alignment
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] async_compile = AsyncCompile()
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] # kernel path: /tmp/torchinductor_ci-user/al/calrezlmzale753uatf4r4hyoxrgj2cygyga4s35ygdnlqxtbqrk.py
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] # Source node to ATen node mapping:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] triton_for_fused_0 = async_compile.triton('triton_for_fused_0', '''
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import triton
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] import triton.language as tl
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.runtime import triton_helpers, triton_heuristics
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DeviceProperties
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] @triton_heuristics.foreach(
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_warps=8,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr6': '*fp32', 'out_ptr7': '*fp32', 'out_ptr8': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr24': '*fp32', 'out_ptr25': '*fp32', 'out_ptr26': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 'out_ptr35': '*fp32', 'out_ptr42': '*fp32', 'out_ptr43': '*fp32', 'out_ptr44': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr60': '*fp32', 'out_ptr61': '*fp32', 'out_ptr62': '*fp32', 'out_ptr69': '*fp32', 'out_ptr70': '*fp32', 'out_ptr71': '*fp32', 'out_ptr78': '*fp32', 'out_ptr79': '*fp32', 'out_ptr80': '*fp32', 'out_ptr87': '*fp32', 'out_ptr88': '*fp32', 'out_ptr89': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, warp_size=32), 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): [['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], (56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], (60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], (69,): [['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]},
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_0', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr24', 'out_ptr25', 'out_ptr26', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr42', 'out_ptr43', 'out_ptr44', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr6', 'out_ptr60', 'out_ptr61', 'out_ptr62', 'out_ptr69', 'out_ptr7', 'out_ptr70', 'out_ptr71', 'out_ptr78', 'out_ptr79', 'out_ptr8', 'out_ptr80', 'out_ptr87', 'out_ptr88', 'out_ptr89'], 'backend_hash': '5521EADCB2516098F638687B39B477AA524882055648F5AE9FFB68D065B487C6', 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False},
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] )
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] @triton.jit
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] def triton_for_fused_0(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr6, out_ptr7, out_ptr8, out_ptr15, out_ptr16, out_ptr17, out_ptr24, out_ptr25, out_ptr26, out_ptr33, out_ptr34, out_ptr35, out_ptr42, out_ptr43, out_ptr44, out_ptr51, out_ptr52, out_ptr53, out_ptr60, out_ptr61, out_ptr62, out_ptr69, out_ptr70, out_ptr71, out_ptr78, out_ptr79, out_ptr80, out_ptr87, out_ptr88, out_ptr89):
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     pid = tl.program_id(0)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     XBLOCK: tl.constexpr = 1024
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_0 = tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     if pid < num_xblocks_0:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x0 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp0 = tl.load(in_ptr0 + (x0), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp1 = tl.load(in_ptr1 + (x0), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp8 = tl.load(in_ptr2 + (x0), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp15 = tl.load(in_ptr3 + (x0), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp17 = in_ptr4
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp2 = tmp0 - tmp1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp3 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp4 = tmp3 * tmp2
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp5 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp6 = tl.where(tmp5, tmp0, tmp1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp7 = tmp4 + tmp6
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp9 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp10 = tmp8 * tmp9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp11 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp12 = tmp0 * tmp11
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp13 = tmp12 * tmp0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp14 = tmp10 + tmp13
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp16 = libdevice.sqrt(tmp14)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp18 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp19 = tmp17 + tmp18
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp20 = libdevice.pow(tmp9, tmp19)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp21 = tmp18 - tmp20
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp22 = libdevice.sqrt(tmp21)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp23 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp24 = libdevice.pow(tmp23, tmp19)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp25 = tmp18 - tmp24
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp26 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp27 = (tmp26 / tmp25)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp28 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp29 = tmp27 * tmp28
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp30 = -tmp29
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp31 = tmp22 * tmp30
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp32 = (tmp16 / tmp31)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp33 = (tmp26 / tmp30)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp34 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp35 = tmp33 * tmp34
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp36 = tmp32 + tmp35
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp37 = (tmp7 / tmp36)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp38 = tmp15 + tmp37
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr6 + (x0), tmp38, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr7 + (x0), tmp7, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr8 + (x0), tmp14, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_1:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x1 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp39 = tl.load(in_ptr5 + (x1), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp40 = tl.load(in_ptr6 + (x1), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp47 = tl.load(in_ptr7 + (x1), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp54 = tl.load(in_ptr8 + (x1), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp56 = in_ptr9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp41 = tmp39 - tmp40
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp42 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp43 = tmp42 * tmp41
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp44 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp45 = tl.where(tmp44, tmp39, tmp40)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp46 = tmp43 + tmp45
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp48 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp49 = tmp47 * tmp48
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp50 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp51 = tmp39 * tmp50
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp52 = tmp51 * tmp39
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp53 = tmp49 + tmp52
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp55 = libdevice.sqrt(tmp53)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp57 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp58 = tmp56 + tmp57
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp59 = libdevice.pow(tmp48, tmp58)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp60 = tmp57 - tmp59
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp61 = libdevice.sqrt(tmp60)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp62 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp63 = libdevice.pow(tmp62, tmp58)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp64 = tmp57 - tmp63
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp65 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp66 = (tmp65 / tmp64)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp67 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp68 = tmp66 * tmp67
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp69 = -tmp68
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp70 = tmp61 * tmp69
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp71 = (tmp55 / tmp70)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp72 = (tmp65 / tmp69)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp73 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp74 = tmp72 * tmp73
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp75 = tmp71 + tmp74
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp76 = (tmp46 / tmp75)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp77 = tmp54 + tmp76
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr15 + (x1), tmp77, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr16 + (x1), tmp46, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr17 + (x1), tmp53, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_2:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x2 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp78 = tl.load(in_ptr10 + (x2), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp79 = tl.load(in_ptr11 + (x2), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp86 = tl.load(in_ptr12 + (x2), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp93 = tl.load(in_ptr13 + (x2), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp95 = in_ptr14
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp80 = tmp78 - tmp79
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp81 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp82 = tmp81 * tmp80
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp83 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp84 = tl.where(tmp83, tmp78, tmp79)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp85 = tmp82 + tmp84
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp87 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp88 = tmp86 * tmp87
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp89 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp90 = tmp78 * tmp89
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp91 = tmp90 * tmp78
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp92 = tmp88 + tmp91
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp94 = libdevice.sqrt(tmp92)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp96 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp97 = tmp95 + tmp96
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp98 = libdevice.pow(tmp87, tmp97)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp99 = tmp96 - tmp98
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp100 = libdevice.sqrt(tmp99)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp101 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp102 = libdevice.pow(tmp101, tmp97)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp103 = tmp96 - tmp102
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp104 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp105 = (tmp104 / tmp103)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp106 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp107 = tmp105 * tmp106
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp108 = -tmp107
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp109 = tmp100 * tmp108
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp110 = (tmp94 / tmp109)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp111 = (tmp104 / tmp108)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp112 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp113 = tmp111 * tmp112
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp114 = tmp110 + tmp113
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp115 = (tmp85 / tmp114)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp116 = tmp93 + tmp115
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr24 + (x2), tmp116, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr25 + (x2), tmp85, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr26 + (x2), tmp92, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_3:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_2
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x3 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp117 = tl.load(in_ptr15 + (x3), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp118 = tl.load(in_ptr16 + (x3), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp125 = tl.load(in_ptr17 + (x3), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp132 = tl.load(in_ptr18 + (x3), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp134 = in_ptr19
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp119 = tmp117 - tmp118
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp120 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp121 = tmp120 * tmp119
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp122 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp123 = tl.where(tmp122, tmp117, tmp118)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp124 = tmp121 + tmp123
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp126 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp127 = tmp125 * tmp126
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp128 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp129 = tmp117 * tmp128
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp130 = tmp129 * tmp117
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp131 = tmp127 + tmp130
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp133 = libdevice.sqrt(tmp131)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp135 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp136 = tmp134 + tmp135
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp137 = libdevice.pow(tmp126, tmp136)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp138 = tmp135 - tmp137
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp139 = libdevice.sqrt(tmp138)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp140 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp141 = libdevice.pow(tmp140, tmp136)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp142 = tmp135 - tmp141
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp143 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp144 = (tmp143 / tmp142)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp145 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp146 = tmp144 * tmp145
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp147 = -tmp146
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp148 = tmp139 * tmp147
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp149 = (tmp133 / tmp148)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp150 = (tmp143 / tmp147)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp151 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp152 = tmp150 * tmp151
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp153 = tmp149 + tmp152
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp154 = (tmp124 / tmp153)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp155 = tmp132 + tmp154
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr33 + (x3), tmp155, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr34 + (x3), tmp124, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr35 + (x3), tmp131, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_4:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_3
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x4 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp156 = tl.load(in_ptr20 + (x4), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp157 = tl.load(in_ptr21 + (x4), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp164 = tl.load(in_ptr22 + (x4), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp171 = tl.load(in_ptr23 + (x4), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp173 = in_ptr24
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp158 = tmp156 - tmp157
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp159 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp160 = tmp159 * tmp158
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp161 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp162 = tl.where(tmp161, tmp156, tmp157)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp163 = tmp160 + tmp162
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp165 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp166 = tmp164 * tmp165
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp167 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp168 = tmp156 * tmp167
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp169 = tmp168 * tmp156
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp170 = tmp166 + tmp169
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp172 = libdevice.sqrt(tmp170)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp174 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp175 = tmp173 + tmp174
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp176 = libdevice.pow(tmp165, tmp175)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp177 = tmp174 - tmp176
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp178 = libdevice.sqrt(tmp177)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp179 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp180 = libdevice.pow(tmp179, tmp175)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp181 = tmp174 - tmp180
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp182 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp183 = (tmp182 / tmp181)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp184 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp185 = tmp183 * tmp184
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp186 = -tmp185
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp187 = tmp178 * tmp186
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp188 = (tmp172 / tmp187)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp189 = (tmp182 / tmp186)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp190 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp191 = tmp189 * tmp190
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp192 = tmp188 + tmp191
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp193 = (tmp163 / tmp192)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp194 = tmp171 + tmp193
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr42 + (x4), tmp194, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr43 + (x4), tmp163, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr44 + (x4), tmp170, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_5:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_4
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x5 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp195 = tl.load(in_ptr25 + (x5), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp196 = tl.load(in_ptr26 + (x5), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp203 = tl.load(in_ptr27 + (x5), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp210 = tl.load(in_ptr28 + (x5), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp212 = in_ptr29
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp197 = tmp195 - tmp196
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp198 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp199 = tmp198 * tmp197
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp200 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp201 = tl.where(tmp200, tmp195, tmp196)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp202 = tmp199 + tmp201
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp204 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp205 = tmp203 * tmp204
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp206 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp207 = tmp195 * tmp206
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp208 = tmp207 * tmp195
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp209 = tmp205 + tmp208
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp211 = libdevice.sqrt(tmp209)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp213 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp214 = tmp212 + tmp213
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp215 = libdevice.pow(tmp204, tmp214)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp216 = tmp213 - tmp215
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp217 = libdevice.sqrt(tmp216)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp218 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp219 = libdevice.pow(tmp218, tmp214)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp220 = tmp213 - tmp219
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp221 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp222 = (tmp221 / tmp220)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp223 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp224 = tmp222 * tmp223
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp225 = -tmp224
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp226 = tmp217 * tmp225
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp227 = (tmp211 / tmp226)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp228 = (tmp221 / tmp225)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp229 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp230 = tmp228 * tmp229
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp231 = tmp227 + tmp230
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp232 = (tmp202 / tmp231)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp233 = tmp210 + tmp232
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr51 + (x5), tmp233, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr52 + (x5), tmp202, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr53 + (x5), tmp209, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_6:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_5
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x6 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp234 = tl.load(in_ptr30 + (x6), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp235 = tl.load(in_ptr31 + (x6), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp242 = tl.load(in_ptr32 + (x6), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp249 = tl.load(in_ptr33 + (x6), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp251 = in_ptr34
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp236 = tmp234 - tmp235
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp237 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp238 = tmp237 * tmp236
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp239 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp240 = tl.where(tmp239, tmp234, tmp235)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp241 = tmp238 + tmp240
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp243 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp244 = tmp242 * tmp243
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp245 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp246 = tmp234 * tmp245
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp247 = tmp246 * tmp234
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp248 = tmp244 + tmp247
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp250 = libdevice.sqrt(tmp248)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp252 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp253 = tmp251 + tmp252
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp254 = libdevice.pow(tmp243, tmp253)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp255 = tmp252 - tmp254
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp256 = libdevice.sqrt(tmp255)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp257 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp258 = libdevice.pow(tmp257, tmp253)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp259 = tmp252 - tmp258
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp260 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp261 = (tmp260 / tmp259)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp262 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp263 = tmp261 * tmp262
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp264 = -tmp263
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp265 = tmp256 * tmp264
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp266 = (tmp250 / tmp265)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp267 = (tmp260 / tmp264)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp268 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp269 = tmp267 * tmp268
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp270 = tmp266 + tmp269
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp271 = (tmp241 / tmp270)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp272 = tmp249 + tmp271
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr60 + (x6), tmp272, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr61 + (x6), tmp241, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr62 + (x6), tmp248, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_7:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_6
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x7 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp273 = tl.load(in_ptr35 + (x7), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp274 = tl.load(in_ptr36 + (x7), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp281 = tl.load(in_ptr37 + (x7), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp288 = tl.load(in_ptr38 + (x7), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp290 = in_ptr39
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp275 = tmp273 - tmp274
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp276 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp277 = tmp276 * tmp275
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp278 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp279 = tl.where(tmp278, tmp273, tmp274)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp280 = tmp277 + tmp279
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp282 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp283 = tmp281 * tmp282
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp284 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp285 = tmp273 * tmp284
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp286 = tmp285 * tmp273
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp287 = tmp283 + tmp286
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp289 = libdevice.sqrt(tmp287)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp291 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp292 = tmp290 + tmp291
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp293 = libdevice.pow(tmp282, tmp292)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp294 = tmp291 - tmp293
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp295 = libdevice.sqrt(tmp294)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp296 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp297 = libdevice.pow(tmp296, tmp292)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp298 = tmp291 - tmp297
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp299 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp300 = (tmp299 / tmp298)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp301 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp302 = tmp300 * tmp301
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp303 = -tmp302
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp304 = tmp295 * tmp303
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp305 = (tmp289 / tmp304)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp306 = (tmp299 / tmp303)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp307 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp308 = tmp306 * tmp307
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp309 = tmp305 + tmp308
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp310 = (tmp280 / tmp309)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp311 = tmp288 + tmp310
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr69 + (x7), tmp311, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr70 + (x7), tmp280, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr71 + (x7), tmp287, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_8:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_7
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x8 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp312 = tl.load(in_ptr40 + (x8), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp313 = tl.load(in_ptr41 + (x8), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp320 = tl.load(in_ptr42 + (x8), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp327 = tl.load(in_ptr43 + (x8), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp329 = in_ptr44
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp314 = tmp312 - tmp313
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp315 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp316 = tmp315 * tmp314
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp317 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp318 = tl.where(tmp317, tmp312, tmp313)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp319 = tmp316 + tmp318
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp321 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp322 = tmp320 * tmp321
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp323 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp324 = tmp312 * tmp323
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp325 = tmp324 * tmp312
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp326 = tmp322 + tmp325
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp328 = libdevice.sqrt(tmp326)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp330 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp331 = tmp329 + tmp330
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp332 = libdevice.pow(tmp321, tmp331)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp333 = tmp330 - tmp332
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp334 = libdevice.sqrt(tmp333)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp335 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp336 = libdevice.pow(tmp335, tmp331)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp337 = tmp330 - tmp336
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp338 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp339 = (tmp338 / tmp337)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp340 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp341 = tmp339 * tmp340
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp342 = -tmp341
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp343 = tmp334 * tmp342
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp344 = (tmp328 / tmp343)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp345 = (tmp338 / tmp342)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp346 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp347 = tmp345 * tmp346
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp348 = tmp344 + tmp347
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp349 = (tmp319 / tmp348)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp350 = tmp327 + tmp349
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr78 + (x8), tmp350, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr79 + (x8), tmp319, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr80 + (x8), tmp326, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     elif pid < num_xblocks_9:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pid_offset = pid - num_xblocks_8
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xnumel = 1048576
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         r0_numel = 1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xoffset = pid_offset * XBLOCK
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         xmask = tl.full([XBLOCK], True, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         x9 = xindex
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp351 = tl.load(in_ptr45 + (x9), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp352 = tl.load(in_ptr46 + (x9), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp359 = tl.load(in_ptr47 + (x9), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp366 = tl.load(in_ptr48 + (x9), None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp368 = in_ptr49
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp353 = tmp351 - tmp352
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp354 = 0.10000000149011612
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp355 = tmp354 * tmp353
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp356 = tl.full([1], False, tl.int1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp357 = tl.where(tmp356, tmp351, tmp352)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp358 = tmp355 + tmp357
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp360 = 0.999
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp361 = tmp359 * tmp360
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp362 = 0.0010000000000000009
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp363 = tmp351 * tmp362
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp364 = tmp363 * tmp351
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp365 = tmp361 + tmp364
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp367 = libdevice.sqrt(tmp365)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp369 = 1.0
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp370 = tmp368 + tmp369
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp371 = libdevice.pow(tmp360, tmp370)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp372 = tmp369 - tmp371
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp373 = libdevice.sqrt(tmp372)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp374 = 0.9
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp375 = libdevice.pow(tmp374, tmp370)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp376 = tmp369 - tmp375
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp377 = tl.full([1], 1, tl.int32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp378 = (tmp377 / tmp376)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp379 = 0.001
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp380 = tmp378 * tmp379
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp381 = -tmp380
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp382 = tmp373 * tmp381
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp383 = (tmp367 / tmp382)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp384 = (tmp377 / tmp381)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp385 = 1e-08
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp386 = tmp384 * tmp385
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp387 = tmp383 + tmp386
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp388 = (tmp358 / tmp387)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tmp389 = tmp366 + tmp388
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr87 + (x9), tmp389, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr88 + (x9), tmp358, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         tl.store(out_ptr89 + (x9), tmp365, None)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     else:
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         pass
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] ''', device_str='cuda')
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] cpp_fused__foreach_copy_1 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] #include <torch/csrc/inductor/cpp_prefix.h>
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] extern "C"  void kernel(const float* in_ptr0,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr1,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr2,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr3,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr4,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr5,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr6,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr7,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr8,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        const float* in_ptr9,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr1,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr3,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr5,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr7,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr9,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr11,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr13,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr15,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr17,
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                        float* out_ptr19)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr1[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr3[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr5[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr7[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr9[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr11[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr13[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr15[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr17[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             {
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp1 = static_cast<float>(1.0);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 auto tmp2 = float(tmp0 + tmp1);
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]                 out_ptr19[static_cast<int64_t>(0L)] = tmp2;
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]             }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] }
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] ''')
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] async_compile.wait(globals())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] del async_compile
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] def call(args):
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = args
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     args.clear()
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg0_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg1_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg2_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg3_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg4_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg5_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg6_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg7_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg8_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg9_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg10_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg11_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg12_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg13_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg14_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg15_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg16_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg17_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg18_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg19_1, (), ())
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg20_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg21_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg22_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg23_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg24_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg25_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg26_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg27_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg28_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg29_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg30_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg31_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg32_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg33_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg34_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg35_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg36_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg37_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg38_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg39_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg40_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg41_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg42_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg43_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg44_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg45_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg46_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg47_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg48_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     assert_size_stride(arg49_1, (1024, 1024), (1024, 1))
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     with torch.cuda._DeviceGuard(0):
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         torch.cuda.set_device(0)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         # Unsorted Source Nodes: [], Original ATen: []
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         stream0 = get_raw_stream(0)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         triton_for_fused_0.run(arg30_1, arg20_1, arg40_1, arg0_1, arg10_1, arg31_1, arg21_1, arg41_1, arg1_1, arg11_1, arg32_1, arg22_1, arg42_1, arg2_1, arg12_1, arg33_1, arg23_1, arg43_1, arg3_1, arg13_1, arg34_1, arg24_1, arg44_1, arg4_1, arg14_1, arg35_1, arg25_1, arg45_1, arg5_1, arg15_1, arg36_1, arg26_1, arg46_1, arg6_1, arg16_1, arg37_1, arg27_1, arg47_1, arg7_1, arg17_1, arg38_1, arg28_1, arg48_1, arg8_1, arg18_1, arg39_1, arg29_1, arg49_1, arg9_1, arg19_1, arg0_1, arg20_1, arg40_1, arg1_1, arg21_1, arg41_1, arg2_1, arg22_1, arg42_1, arg3_1, arg23_1, arg43_1, arg4_1, arg24_1, arg44_1, arg5_1, arg25_1, arg45_1, arg6_1, arg26_1, arg46_1, arg7_1, arg27_1, arg47_1, arg8_1, arg28_1, arg48_1, arg9_1, arg29_1, arg49_1, stream=stream0)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg0_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg1_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg20_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg21_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg22_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg23_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg24_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg25_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg26_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg27_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg28_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg29_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg2_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg30_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg31_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg32_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg33_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg34_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg35_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg36_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg37_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg38_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg39_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg3_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg40_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg41_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg42_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg43_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg44_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg45_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg46_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg47_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg48_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg49_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg4_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg5_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg6_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg7_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg8_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]         del arg9_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     cpp_fused__foreach_copy_1(arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg10_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg11_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg12_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg13_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg14_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg15_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg16_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg17_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg18_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     del arg19_1
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     return ()
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     from torch._dynamo.testing import rand_strided
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     from torch._inductor.utils import print_performance
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg10_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg11_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
... (arg12_1 through arg48_1 elided: eight more CPU scalar tensors and twenty-nine more (1024, 1024) CUDA tensors following the same pattern) ...
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1])
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code] if __name__ == "__main__":
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0813 15:24:56.329000 22304 torch/_inductor/graph.py:2345] [0/1] [__output_code]
V0813 15:24:56.377000 22304 torch/_inductor/graph.py:2356] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/da/cdabi6efsaqwxkw2y4xsbsvooc4l752igga6mfi4rfeqb4ikja3b.py
I0813 15:24:56.412000 22304 torch/_inductor/graph.py:2317] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/da/cdabi6efsaqwxkw2y4xsbsvooc4l752igga6mfi4rfeqb4ikja3b.py
eager runtime: 1204.5601950012497us
compiled runtime: 768.8826858184849us
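The two timings above imply a speedup of roughly 1.57x for the compiled, fused optimizer over the eager implementation. A quick sanity check of that arithmetic:

```python
# Runtimes reported by the benchmark above, in microseconds
eager_us = 1204.5601950012497
compiled_us = 768.8826858184849

# Horizontal fusion via foreach_map + torch.compile yields ~1.57x over eager
speedup = eager_us / compiled_us
print(f"speedup: {speedup:.2f}x")
```

Exact numbers will vary by GPU and PyTorch version; the point is that collapsing the per-parameter pointwise updates into one fused kernel removes per-op launch overhead that dominates the eager runtime.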

Conclusion#

In this tutorial, we implemented a fully fused Adam optimizer using foreach_map. By combining foreach_map with torch.compile, we converted the per-parameter pointwise update math into a single horizontally fused kernel, yielding a measurable speedup over the eager implementation (roughly 1205 us vs. 769 us in the run above). The same pattern applies to any pointwise computation over lists of tensors, making foreach_map a general tool for expressing horizontal fusion.

Total running time of the script: (0 minutes 15.451 seconds)