Explicit horizontal fusion with foreach_map and torch.compile
Author: Michael Lazos
Horizontal fusion is a key optimization in ML compilers. In eager mode, it is typically expressed with the torch._foreach* ops, which parallelize an operation across a list of tensors. However, supporting every possible permutation of arguments (for example, mixtures of scalars and lists) is quite difficult. foreach_map allows any pointwise op in torch to be converted into a horizontally fused foreach variant. In this tutorial, we will demonstrate how to implement the Adam optimizer with foreach_map to generate a fully fused kernel.
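To make this concrete, here is a minimal sketch (added for illustration, not part of the original recipe) of foreach_map applied to a plain pointwise op. It assumes a CUDA-capable device and mirrors the usage pattern shown later in this tutorial, where foreach_map is traced under torch.compile; the call is analogous in spirit to torch._foreach_add.
import torch
from torch._higher_order_ops.foreach_map import foreach_map

xs = [torch.randn(1024, device="cuda") for _ in range(4)]
ys = [torch.randn(1024, device="cuda") for _ in range(4)]

@torch.compile
def fused_add(xs, ys):
    # foreach_map lifts the pointwise torch.add across both lists of tensors
    return foreach_map(torch.add, xs, ys)

# returns a list of 4 tensors, expected to lower to one horizontally fused kernel
out = fused_add(xs, ys)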
Note
This recipe describes a prototype feature. Prototype features are typically at an early stage for feedback and testing and are subject to change.
Prerequisites
PyTorch v2.7.0 or later
Model Setup
For this example, we’ll use a simple sequence of linear layers. We instantiate an independent copy to compare the two optimizer implementations.
import torch

# exit cleanly if we are on a device that doesn't support ``torch.compile``
if torch.cuda.get_device_capability() < (7, 0):
    print("Exiting because torch.compile is not supported on this device.")
    import sys
    sys.exit(0)
# Create simple model
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
model_copy = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
input = torch.rand(1024, device="cuda")
# run forward pass
output = model(input)
output_copy = model_copy(input)
# run backward to populate the grads for our optimizer below
output.sum().backward()
output_copy.sum().backward()
Helper functions for foreach_map implementation
In this section, we’ll begin our implementation of the Adam optimizer.
from torch._higher_order_ops.foreach_map import foreach_map
# Helper function to extract optimizer states from a torch.optim.Adam instance
def get_inputs(optim):
    steps = []
    params = []
    grads = []
    exp_avgs = []
    exp_avg_sqs = []
    for group in optim.param_groups:
        for p in group["params"]:
            params.append(p)
            grads.append(p.grad)
            state = optim.state[p]
            exp_avgs.append(state["exp_avg"])
            exp_avg_sqs.append(state["exp_avg_sq"])
            steps.append(state["step"])
    return steps, params, exp_avgs, exp_avg_sqs
# Functions to update the different optimizer states
def update_exp_avg_sq(exp_avg_sq, grad, beta2):
    return exp_avg_sq.mul(beta2).addcmul(grad, grad, value=1 - beta2)

def update_param(param, step, exp_avg, exp_avg_sq, beta1, beta2, lr, eps):
    bias_correction1 = 1 - torch.pow(beta1, step)
    bias_correction2 = (1 - torch.pow(beta2, step)).sqrt()
    step_size = (lr / bias_correction1).neg()
    denom = (exp_avg_sq.sqrt() / (bias_correction2 * step_size)).add(eps / step_size)
    return torch.add(param, torch.div(exp_avg, denom))
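update_param folds the usual Adam bias corrections into a single expression. As a quick sanity check (a sketch added for this write-up with hypothetical placeholder values, not part of the original recipe), the rearranged form agrees with the textbook update param - lr * m_hat / (sqrt(v_hat) + eps):
# Hypothetical numerical check that update_param matches the textbook Adam step
_param, _exp_avg, _exp_avg_sq = torch.randn(8), torch.randn(8), torch.rand(8)
_step = torch.tensor(3.0)
_beta1, _beta2, _lr, _eps = 0.9, 0.999, 1e-3, 1e-8

_rearranged = update_param(_param, _step, _exp_avg, _exp_avg_sq, _beta1, _beta2, _lr, _eps)
_m_hat = _exp_avg / (1 - _beta1**_step)     # bias-corrected first moment
_v_hat = _exp_avg_sq / (1 - _beta2**_step)  # bias-corrected second moment
_textbook = _param - _lr * _m_hat / (_v_hat.sqrt() + _eps)
print(torch.allclose(_rearranged, _textbook))  # expected: True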
# Our full Adam implementation
def foreach_map_adam(
    steps,
    params,
    exp_avgs,
    exp_avg_sqs,
    weight_decay=0,
    beta1=0.9,
    beta2=0.999,
    lr=1e-3,
    eps=1e-8,
):
    with torch.no_grad():
        grads = [param.grad for param in params]

        # update step
        updated_steps = foreach_map(lambda x: x + 1, steps)
        torch._foreach_copy_(steps, updated_steps)

        if weight_decay != 0:
            foreach_map(torch.add, (grads,), alpha=weight_decay)

        # Higher-order operators (HOPs) cannot have multiple outputs at the moment
        # need to call foreach_map once for each output
        exp_avgs_updated = foreach_map(torch.lerp, exp_avgs, grads, 1 - beta1)
        exp_avgs_sq_updated = foreach_map(update_exp_avg_sq, exp_avg_sqs, grads, beta2)
        params_updated = foreach_map(
            update_param,
            params,
            steps,
            exp_avgs_updated,
            exp_avgs_sq_updated,
            beta1,
            beta2,
            lr,
            eps,
        )

        # Higher-order operators (HOPs) don't support input mutation today
        # so manually update the states in-place
        torch._foreach_copy_(exp_avgs, exp_avgs_updated)
        torch._foreach_copy_(exp_avg_sqs, exp_avgs_sq_updated)
        torch._foreach_copy_(params, params_updated)
    return
Setting up and running the compiled kernel
In this section, we'll run our Adam optimizer and compare the results against the eager implementation.
Note
torch.compile is only supported on CUDA devices that have a compute capability of 7.0 or higher.
opt_eager = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.01))
opt_eager_copy = torch.optim.Adam(model_copy.parameters(), lr=torch.tensor(0.01))
# warm up the optimizer state dict
opt_eager.step()
opt_eager_copy.step()
inputs = get_inputs(opt_eager_copy)
compiled_adam = torch.compile(foreach_map_adam)
# optionally view the output code
torch._logging.set_logs(output_code=True)
# Warmup runs to compile the function
for _ in range(5):
    opt_eager.step()
    compiled_adam(*inputs)
for eager_p, compile_p in zip(opt_eager.param_groups[0]["params"], opt_eager_copy.param_groups[0]["params"]):
    torch.allclose(eager_p, compile_p)
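Note that model and model_copy are initialized with independent random weights, so the allclose check above is only meaningful if the two copies start from identical parameters. A small optional sketch (an addition for this write-up, not part of the original recipe) that would synchronize them; it would have to run right after the two models are created, before the forward/backward passes and any optimizer steps:
# Hypothetical: give model_copy the same initial weights as model so that the
# eager and compiled optimizer results can be compared element-wise.
model_copy.load_state_dict(model.state_dict())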
# Benchmark performance
# Let's define a helpful benchmarking function:
import torch.utils.benchmark as benchmark
def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
    )
    return t0.blocked_autorange().mean * 1e6
eager_runtime = benchmark_torch_function_in_microseconds(opt_eager.step)
compiled_runtime = benchmark_torch_function_in_microseconds(lambda: compiled_adam(*inputs))
assert eager_runtime > compiled_runtime
print(f"eager runtime: {eager_runtime}us")
print(f"compiled runtime: {compiled_runtime}us")
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/pgo.py:465: UserWarning:
dynamo_pgo force disabled by torch._inductor.config.force_disable_caches
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] Output code:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] # AOT ID: ['0_inference']
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import torch
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import math
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import random
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import os
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import tempfile
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from math import inf, nan
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from cmath import nanj
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.utils import maybe_profile
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch import device, empty_strided
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton.language as tl
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] aten = torch.ops.aten
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] inductor_ops = torch.ops.inductor
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] _quantized = torch.ops._quantized
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] async_compile = AsyncCompile()
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] # kernel path: /tmp/torchinductor_ci-user/tmppsh5aeyx/ca/ccaq3yt7deumw4kqto4xmlusqkcgnugnkdlnmxsrillr7l42wecc.py
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Source node to ATen node mapping:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] triton_for_fused_0 = async_compile.triton('triton_for_fused_0', '''
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton.language as tl
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime import triton_helpers, triton_heuristics
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DeviceProperties
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] @triton_heuristics.foreach(
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_warps=8,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr6': '*fp32', 'out_ptr7': '*fp32', 'out_ptr8': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr24': '*fp32', 'out_ptr25': '*fp32', 'out_ptr26': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 'out_ptr35': '*fp32', 'out_ptr42': '*fp32', 'out_ptr43': '*fp32', 'out_ptr44': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr60': '*fp32', 'out_ptr61': '*fp32', 'out_ptr62': '*fp32', 'out_ptr69': '*fp32', 'out_ptr70': '*fp32', 'out_ptr71': '*fp32', 'out_ptr78': '*fp32', 'out_ptr79': '*fp32', 'out_ptr80': '*fp32', 'out_ptr87': '*fp32', 'out_ptr88': '*fp32', 'out_ptr89': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, warp_size=32), 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): [['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], 
(56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], (60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], (69,): [['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]},
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_0', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr24', 'out_ptr25', 'out_ptr26', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr42', 'out_ptr43', 'out_ptr44', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr6', 'out_ptr60', 'out_ptr61', 'out_ptr62', 'out_ptr69', 'out_ptr7', 'out_ptr70', 'out_ptr71', 'out_ptr78', 'out_ptr79', 'out_ptr8', 'out_ptr80', 'out_ptr87', 'out_ptr88', 'out_ptr89'], 'backend_hash': '1E2C16421D4C3DBA4AD92BFC4278A3CB24C43DEDA6EE7FF9E3FBB1DBB80802DB', 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': True, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False},
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] )
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] @triton.jit
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] def triton_for_fused_0(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr6, out_ptr7, out_ptr8, out_ptr15, out_ptr16, out_ptr17, out_ptr24, out_ptr25, out_ptr26, out_ptr33, out_ptr34, out_ptr35, out_ptr42, out_ptr43, out_ptr44, out_ptr51, out_ptr52, out_ptr53, out_ptr60, out_ptr61, out_ptr62, out_ptr69, out_ptr70, out_ptr71, out_ptr78, out_ptr79, out_ptr80, out_ptr87, out_ptr88, out_ptr89):
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid = tl.program_id(0)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] XBLOCK: tl.constexpr = 1024
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_0 = tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] if pid < num_xblocks_0:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x0 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp5 = tl.load(in_ptr0 + (x0), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp6 = tl.load(in_ptr1 + (x0), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp11 = tl.load(in_ptr2 + (x0), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp18 = tl.load(in_ptr3 + (x0), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp20 = in_ptr4
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp0 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp1 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp2 = tmp0 >= tmp1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp3 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp4 = tl.where(tmp2, tmp3, tmp0)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp7 = tmp5 - tmp6
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp8 = tmp4 * tmp7
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp9 = tl.where(tmp2, tmp5, tmp6)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp10 = tmp8 + tmp9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp12 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp13 = tmp11 * tmp12
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp14 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp15 = tmp5 * tmp14
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp16 = tmp15 * tmp5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp17 = tmp13 + tmp16
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp19 = libdevice.sqrt(tmp17)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp21 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp22 = tmp20 + tmp21
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp23 = libdevice.pow(tmp12, tmp22)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp24 = tmp21 - tmp23
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp25 = libdevice.sqrt(tmp24)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp26 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp27 = libdevice.pow(tmp26, tmp22)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp28 = tmp21 - tmp27
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp29 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp30 = (tmp29 / tmp28)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp31 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp32 = tmp30 * tmp31
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp33 = -tmp32
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp34 = tmp25 * tmp33
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp35 = (tmp19 / tmp34)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp36 = (tmp29 / tmp33)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp37 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp38 = tmp36 * tmp37
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp39 = tmp35 + tmp38
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp40 = (tmp10 / tmp39)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp41 = tmp18 + tmp40
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr6 + (x0), tmp41, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr7 + (x0), tmp10, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr8 + (x0), tmp17, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_1:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x1 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp47 = tl.load(in_ptr5 + (x1), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp48 = tl.load(in_ptr6 + (x1), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp53 = tl.load(in_ptr7 + (x1), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp60 = tl.load(in_ptr8 + (x1), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp62 = in_ptr9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp42 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp43 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp44 = tmp42 >= tmp43
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp45 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp46 = tl.where(tmp44, tmp45, tmp42)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp49 = tmp47 - tmp48
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp50 = tmp46 * tmp49
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp51 = tl.where(tmp44, tmp47, tmp48)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp52 = tmp50 + tmp51
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp54 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp55 = tmp53 * tmp54
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp56 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp57 = tmp47 * tmp56
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp58 = tmp57 * tmp47
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp59 = tmp55 + tmp58
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp61 = libdevice.sqrt(tmp59)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp63 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp64 = tmp62 + tmp63
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp65 = libdevice.pow(tmp54, tmp64)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp66 = tmp63 - tmp65
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp67 = libdevice.sqrt(tmp66)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp68 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp69 = libdevice.pow(tmp68, tmp64)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp70 = tmp63 - tmp69
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp71 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp72 = (tmp71 / tmp70)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp73 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp74 = tmp72 * tmp73
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp75 = -tmp74
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp76 = tmp67 * tmp75
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp77 = (tmp61 / tmp76)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp78 = (tmp71 / tmp75)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp79 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp80 = tmp78 * tmp79
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp81 = tmp77 + tmp80
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp82 = (tmp52 / tmp81)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp83 = tmp60 + tmp82
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr15 + (x1), tmp83, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr16 + (x1), tmp52, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr17 + (x1), tmp59, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_2:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x2 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp89 = tl.load(in_ptr10 + (x2), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp90 = tl.load(in_ptr11 + (x2), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp95 = tl.load(in_ptr12 + (x2), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp102 = tl.load(in_ptr13 + (x2), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp104 = in_ptr14
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp84 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp85 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp86 = tmp84 >= tmp85
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp87 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp88 = tl.where(tmp86, tmp87, tmp84)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp91 = tmp89 - tmp90
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp92 = tmp88 * tmp91
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp93 = tl.where(tmp86, tmp89, tmp90)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp94 = tmp92 + tmp93
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp96 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp97 = tmp95 * tmp96
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp98 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp99 = tmp89 * tmp98
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp100 = tmp99 * tmp89
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp101 = tmp97 + tmp100
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp103 = libdevice.sqrt(tmp101)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp105 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp106 = tmp104 + tmp105
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp107 = libdevice.pow(tmp96, tmp106)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp108 = tmp105 - tmp107
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp109 = libdevice.sqrt(tmp108)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp110 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp111 = libdevice.pow(tmp110, tmp106)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp112 = tmp105 - tmp111
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp113 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp114 = (tmp113 / tmp112)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp115 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp116 = tmp114 * tmp115
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp117 = -tmp116
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp118 = tmp109 * tmp117
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp119 = (tmp103 / tmp118)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp120 = (tmp113 / tmp117)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp121 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp122 = tmp120 * tmp121
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp123 = tmp119 + tmp122
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp124 = (tmp94 / tmp123)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp125 = tmp102 + tmp124
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr24 + (x2), tmp125, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr25 + (x2), tmp94, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr26 + (x2), tmp101, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_3:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_2
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x3 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp131 = tl.load(in_ptr15 + (x3), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp132 = tl.load(in_ptr16 + (x3), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp137 = tl.load(in_ptr17 + (x3), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp144 = tl.load(in_ptr18 + (x3), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp146 = in_ptr19
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp126 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp127 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp128 = tmp126 >= tmp127
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp129 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp130 = tl.where(tmp128, tmp129, tmp126)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp133 = tmp131 - tmp132
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp134 = tmp130 * tmp133
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp135 = tl.where(tmp128, tmp131, tmp132)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp136 = tmp134 + tmp135
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp138 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp139 = tmp137 * tmp138
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp140 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp141 = tmp131 * tmp140
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp142 = tmp141 * tmp131
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp143 = tmp139 + tmp142
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp145 = libdevice.sqrt(tmp143)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp147 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp148 = tmp146 + tmp147
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp149 = libdevice.pow(tmp138, tmp148)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp150 = tmp147 - tmp149
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp151 = libdevice.sqrt(tmp150)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp152 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp153 = libdevice.pow(tmp152, tmp148)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp154 = tmp147 - tmp153
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp155 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp156 = (tmp155 / tmp154)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp157 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp158 = tmp156 * tmp157
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp159 = -tmp158
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp160 = tmp151 * tmp159
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp161 = (tmp145 / tmp160)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp162 = (tmp155 / tmp159)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp163 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp164 = tmp162 * tmp163
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp165 = tmp161 + tmp164
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp166 = (tmp136 / tmp165)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp167 = tmp144 + tmp166
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr33 + (x3), tmp167, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr34 + (x3), tmp136, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr35 + (x3), tmp143, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_4:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_3
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x4 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp173 = tl.load(in_ptr20 + (x4), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp174 = tl.load(in_ptr21 + (x4), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp179 = tl.load(in_ptr22 + (x4), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp186 = tl.load(in_ptr23 + (x4), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp188 = in_ptr24
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp168 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp169 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp170 = tmp168 >= tmp169
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp171 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp172 = tl.where(tmp170, tmp171, tmp168)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp175 = tmp173 - tmp174
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp176 = tmp172 * tmp175
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp177 = tl.where(tmp170, tmp173, tmp174)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp178 = tmp176 + tmp177
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp180 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp181 = tmp179 * tmp180
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp182 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp183 = tmp173 * tmp182
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp184 = tmp183 * tmp173
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp185 = tmp181 + tmp184
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp187 = libdevice.sqrt(tmp185)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp189 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp190 = tmp188 + tmp189
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp191 = libdevice.pow(tmp180, tmp190)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp192 = tmp189 - tmp191
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp193 = libdevice.sqrt(tmp192)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp194 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp195 = libdevice.pow(tmp194, tmp190)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp196 = tmp189 - tmp195
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp197 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp198 = (tmp197 / tmp196)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp199 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp200 = tmp198 * tmp199
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp201 = -tmp200
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp202 = tmp193 * tmp201
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp203 = (tmp187 / tmp202)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp204 = (tmp197 / tmp201)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp205 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp206 = tmp204 * tmp205
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp207 = tmp203 + tmp206
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp208 = (tmp178 / tmp207)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp209 = tmp186 + tmp208
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr42 + (x4), tmp209, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr43 + (x4), tmp178, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr44 + (x4), tmp185, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_5:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_4
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x5 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp215 = tl.load(in_ptr25 + (x5), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp216 = tl.load(in_ptr26 + (x5), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp221 = tl.load(in_ptr27 + (x5), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp228 = tl.load(in_ptr28 + (x5), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp230 = in_ptr29
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp210 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp211 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp212 = tmp210 >= tmp211
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp213 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp214 = tl.where(tmp212, tmp213, tmp210)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp217 = tmp215 - tmp216
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp218 = tmp214 * tmp217
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp219 = tl.where(tmp212, tmp215, tmp216)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp220 = tmp218 + tmp219
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp222 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp223 = tmp221 * tmp222
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp224 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp225 = tmp215 * tmp224
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp226 = tmp225 * tmp215
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp227 = tmp223 + tmp226
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp229 = libdevice.sqrt(tmp227)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp231 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp232 = tmp230 + tmp231
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp233 = libdevice.pow(tmp222, tmp232)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp234 = tmp231 - tmp233
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp235 = libdevice.sqrt(tmp234)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp236 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp237 = libdevice.pow(tmp236, tmp232)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp238 = tmp231 - tmp237
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp239 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp240 = (tmp239 / tmp238)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp241 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp242 = tmp240 * tmp241
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp243 = -tmp242
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp244 = tmp235 * tmp243
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp245 = (tmp229 / tmp244)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp246 = (tmp239 / tmp243)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp247 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp248 = tmp246 * tmp247
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp249 = tmp245 + tmp248
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp250 = (tmp220 / tmp249)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp251 = tmp228 + tmp250
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr51 + (x5), tmp251, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr52 + (x5), tmp220, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr53 + (x5), tmp227, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_6:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x6 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp257 = tl.load(in_ptr30 + (x6), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp258 = tl.load(in_ptr31 + (x6), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp263 = tl.load(in_ptr32 + (x6), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp270 = tl.load(in_ptr33 + (x6), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp272 = in_ptr34
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp252 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp253 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp254 = tmp252 >= tmp253
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp255 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp256 = tl.where(tmp254, tmp255, tmp252)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp259 = tmp257 - tmp258
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp260 = tmp256 * tmp259
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp261 = tl.where(tmp254, tmp257, tmp258)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp262 = tmp260 + tmp261
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp264 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp265 = tmp263 * tmp264
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp266 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp267 = tmp257 * tmp266
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp268 = tmp267 * tmp257
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp269 = tmp265 + tmp268
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp271 = libdevice.sqrt(tmp269)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp273 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp274 = tmp272 + tmp273
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp275 = libdevice.pow(tmp264, tmp274)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp276 = tmp273 - tmp275
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp277 = libdevice.sqrt(tmp276)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp278 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp279 = libdevice.pow(tmp278, tmp274)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp280 = tmp273 - tmp279
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp281 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp282 = (tmp281 / tmp280)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp283 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp284 = tmp282 * tmp283
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp285 = -tmp284
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp286 = tmp277 * tmp285
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp287 = (tmp271 / tmp286)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp288 = (tmp281 / tmp285)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp289 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp290 = tmp288 * tmp289
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp291 = tmp287 + tmp290
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp292 = (tmp262 / tmp291)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp293 = tmp270 + tmp292
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr60 + (x6), tmp293, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr61 + (x6), tmp262, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr62 + (x6), tmp269, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_7:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_6
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x7 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp299 = tl.load(in_ptr35 + (x7), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp300 = tl.load(in_ptr36 + (x7), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp305 = tl.load(in_ptr37 + (x7), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp312 = tl.load(in_ptr38 + (x7), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp314 = in_ptr39
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp294 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp295 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp296 = tmp294 >= tmp295
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp297 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp298 = tl.where(tmp296, tmp297, tmp294)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp301 = tmp299 - tmp300
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp302 = tmp298 * tmp301
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp303 = tl.where(tmp296, tmp299, tmp300)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp304 = tmp302 + tmp303
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp306 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp307 = tmp305 * tmp306
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp308 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp309 = tmp299 * tmp308
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp310 = tmp309 * tmp299
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp311 = tmp307 + tmp310
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp313 = libdevice.sqrt(tmp311)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp315 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp316 = tmp314 + tmp315
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp317 = libdevice.pow(tmp306, tmp316)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp318 = tmp315 - tmp317
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp319 = libdevice.sqrt(tmp318)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp320 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp321 = libdevice.pow(tmp320, tmp316)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp322 = tmp315 - tmp321
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp323 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp324 = (tmp323 / tmp322)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp325 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp326 = tmp324 * tmp325
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp327 = -tmp326
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp328 = tmp319 * tmp327
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp329 = (tmp313 / tmp328)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp330 = (tmp323 / tmp327)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp331 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp332 = tmp330 * tmp331
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp333 = tmp329 + tmp332
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp334 = (tmp304 / tmp333)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp335 = tmp312 + tmp334
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr69 + (x7), tmp335, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr70 + (x7), tmp304, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr71 + (x7), tmp311, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_8:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_7
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x8 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp341 = tl.load(in_ptr40 + (x8), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp342 = tl.load(in_ptr41 + (x8), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp347 = tl.load(in_ptr42 + (x8), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp354 = tl.load(in_ptr43 + (x8), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp356 = in_ptr44
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp336 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp337 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp338 = tmp336 >= tmp337
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp339 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp340 = tl.where(tmp338, tmp339, tmp336)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp343 = tmp341 - tmp342
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp344 = tmp340 * tmp343
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp345 = tl.where(tmp338, tmp341, tmp342)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp346 = tmp344 + tmp345
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp348 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp349 = tmp347 * tmp348
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp350 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp351 = tmp341 * tmp350
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp352 = tmp351 * tmp341
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp353 = tmp349 + tmp352
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp355 = libdevice.sqrt(tmp353)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp357 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp358 = tmp356 + tmp357
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp359 = libdevice.pow(tmp348, tmp358)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp360 = tmp357 - tmp359
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp361 = libdevice.sqrt(tmp360)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp362 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp363 = libdevice.pow(tmp362, tmp358)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp364 = tmp357 - tmp363
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp365 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp366 = (tmp365 / tmp364)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp367 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp368 = tmp366 * tmp367
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp369 = -tmp368
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp370 = tmp361 * tmp369
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp371 = (tmp355 / tmp370)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp372 = (tmp365 / tmp369)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp373 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp374 = tmp372 * tmp373
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp375 = tmp371 + tmp374
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp376 = (tmp346 / tmp375)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp377 = tmp354 + tmp376
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr78 + (x8), tmp377, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr79 + (x8), tmp346, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr80 + (x8), tmp353, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_9:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_8
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] x9 = xindex
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp383 = tl.load(in_ptr45 + (x9), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp384 = tl.load(in_ptr46 + (x9), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp389 = tl.load(in_ptr47 + (x9), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp396 = tl.load(in_ptr48 + (x9), None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp398 = in_ptr49
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp378 = 0.09999999999999998
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp379 = 0.5
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp380 = tmp378 >= tmp379
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp381 = -0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp382 = tl.where(tmp380, tmp381, tmp378)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp385 = tmp383 - tmp384
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp386 = tmp382 * tmp385
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp387 = tl.where(tmp380, tmp383, tmp384)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp388 = tmp386 + tmp387
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp390 = 0.999
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp391 = tmp389 * tmp390
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp392 = 0.0010000000000000009
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp393 = tmp383 * tmp392
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp394 = tmp393 * tmp383
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp395 = tmp391 + tmp394
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp397 = libdevice.sqrt(tmp395)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp399 = 1.0
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp400 = tmp398 + tmp399
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp401 = libdevice.pow(tmp390, tmp400)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp402 = tmp399 - tmp401
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp403 = libdevice.sqrt(tmp402)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp404 = 0.9
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp405 = libdevice.pow(tmp404, tmp400)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp406 = tmp399 - tmp405
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp407 = tl.full([1], 1, tl.int32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp408 = (tmp407 / tmp406)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp409 = 0.001
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp410 = tmp408 * tmp409
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp411 = -tmp410
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp412 = tmp403 * tmp411
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp413 = (tmp397 / tmp412)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp414 = (tmp407 / tmp411)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp415 = 1e-08
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp416 = tmp414 * tmp415
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp417 = tmp413 + tmp416
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp418 = (tmp388 / tmp417)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp419 = tmp396 + tmp418
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr87 + (x9), tmp419, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr88 + (x9), tmp388, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr89 + (x9), tmp395, None)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] else:
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] pass
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] ''', device_str='cuda')
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] cpp_fused__foreach_copy_1 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] #include "/tmp/torchinductor_ci-user/tmppsh5aeyx/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] extern "C" void kernel(const float* in_ptr0,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr1,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr2,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr3,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr4,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr5,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr6,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr7,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr8,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr9,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr1,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr3,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr5,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr7,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr9,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr11,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr13,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr15,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr17,
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr19)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr1[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr3[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr5[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr7[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr9[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr11[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr13[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr15[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr17[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr19[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] ''')
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] async_compile.wait(globals())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del async_compile
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] def call(args):
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = args
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] args.clear()
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg0_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg1_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg2_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg3_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg4_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg5_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg6_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg7_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg8_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg9_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg10_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg11_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg12_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg13_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg14_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg15_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg16_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg17_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg18_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg19_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg20_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg21_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg22_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg23_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg24_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg25_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg26_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg27_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg28_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg29_1, (), ())
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg30_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg31_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg32_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg33_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg34_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg35_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg36_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg37_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg38_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg39_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg40_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg41_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg42_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg43_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg44_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg45_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg46_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg47_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg48_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg49_1, (1024, 1024), (1024, 1))
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] with torch.cuda._DeviceGuard(0):
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] torch.cuda.set_device(0)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] stream0 = get_raw_stream(0)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] triton_for_fused_0.run(arg1_1, arg30_1, arg40_1, arg0_1, arg20_1.item(), arg3_1, arg31_1, arg41_1, arg2_1, arg21_1.item(), arg5_1, arg32_1, arg42_1, arg4_1, arg22_1.item(), arg7_1, arg33_1, arg43_1, arg6_1, arg23_1.item(), arg9_1, arg34_1, arg44_1, arg8_1, arg24_1.item(), arg11_1, arg35_1, arg45_1, arg10_1, arg25_1.item(), arg13_1, arg36_1, arg46_1, arg12_1, arg26_1.item(), arg15_1, arg37_1, arg47_1, arg14_1, arg27_1.item(), arg17_1, arg38_1, arg48_1, arg16_1, arg28_1.item(), arg19_1, arg39_1, arg49_1, arg18_1, arg29_1.item(), arg0_1, arg30_1, arg40_1, arg2_1, arg31_1, arg41_1, arg4_1, arg32_1, arg42_1, arg6_1, arg33_1, arg43_1, arg8_1, arg34_1, arg44_1, arg10_1, arg35_1, arg45_1, arg12_1, arg36_1, arg46_1, arg14_1, arg37_1, arg47_1, arg16_1, arg38_1, arg48_1, arg18_1, arg39_1, arg49_1, stream=stream0)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg0_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg10_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg11_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg12_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg13_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg14_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg15_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg16_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg17_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg18_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg19_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg1_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg2_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg30_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg31_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg32_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg33_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg34_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg35_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg36_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg37_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg38_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg39_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg3_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg40_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg41_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg42_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg43_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg44_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg45_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg46_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg47_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg48_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg49_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg4_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg5_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg6_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg7_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg8_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg9_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] cpp_fused__foreach_copy_1(arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg20_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg21_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg22_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg23_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg24_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg25_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg26_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg27_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg28_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg29_1
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] return ()
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._dynamo.testing import rand_strided
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.utils import print_performance
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg10_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg11_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg12_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg13_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg14_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg15_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg16_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg17_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg18_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg19_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg20_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg21_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg22_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg23_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg24_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg25_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg26_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg27_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg28_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg29_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1])
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] return print_performance(fn, times=times, repeat=repeat)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] if __name__ == "__main__":
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.wrapper_benchmark import compiled_module_main
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code] compiled_module_main('None', benchmark_compiled_module)
V0512 16:37:14.169000 635 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0512 16:37:14.215000 635 torch/_inductor/graph.py:2115] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/tmppsh5aeyx/my/cmyjzqyayz2mizopent5g4wl5qce4wsnbiqdwodx2buiex5f7ryy.py
I0512 16:37:15.725000 635 torch/_inductor/graph.py:2149] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/tmppsh5aeyx/my/cmyjzqyayz2mizopent5g4wl5qce4wsnbiqdwodx2buiex5f7ryy.py
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] Output code:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] # AOT ID: ['1_inference']
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from ctypes import c_void_p, c_long, c_int
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import torch
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import math
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import random
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import os
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import tempfile
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from math import inf, nan
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from cmath import nanj
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.utils import maybe_profile
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch import device, empty_strided
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton.language as tl
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] aten = torch.ops.aten
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] inductor_ops = torch.ops.inductor
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] _quantized = torch.ops._quantized
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] async_compile = AsyncCompile()
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] # kernel path: /tmp/torchinductor_ci-user/tmp4ttmqsz0/ca/ccaq3yt7deumw4kqto4xmlusqkcgnugnkdlnmxsrillr7l42wecc.py
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Source node to ATen node mapping:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_for_fused_0 = async_compile.triton('triton_for_fused_0', '''
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton.language as tl
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime import triton_helpers, triton_heuristics
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DeviceProperties
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] @triton_heuristics.foreach(
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_warps=8,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr6': '*fp32', 'out_ptr7': '*fp32', 'out_ptr8': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr24': '*fp32', 'out_ptr25': '*fp32', 'out_ptr26': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 'out_ptr35': '*fp32', 'out_ptr42': '*fp32', 'out_ptr43': '*fp32', 'out_ptr44': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr60': '*fp32', 'out_ptr61': '*fp32', 'out_ptr62': '*fp32', 'out_ptr69': '*fp32', 'out_ptr70': '*fp32', 'out_ptr71': '*fp32', 'out_ptr78': '*fp32', 'out_ptr79': '*fp32', 'out_ptr80': '*fp32', 'out_ptr87': '*fp32', 'out_ptr88': '*fp32', 'out_ptr89': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, warp_size=32), 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): [['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], (56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], (60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], (69,): [['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]},
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_0', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr24', 'out_ptr25', 'out_ptr26', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr42', 'out_ptr43', 'out_ptr44', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr6', 'out_ptr60', 'out_ptr61', 'out_ptr62', 'out_ptr69', 'out_ptr7', 'out_ptr70', 'out_ptr71', 'out_ptr78', 'out_ptr79', 'out_ptr8', 'out_ptr80', 'out_ptr87', 'out_ptr88', 'out_ptr89'], 'backend_hash': '1E2C16421D4C3DBA4AD92BFC4278A3CB24C43DEDA6EE7FF9E3FBB1DBB80802DB', 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': True, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False},
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] )
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] @triton.jit
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] def triton_for_fused_0(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr6, out_ptr7, out_ptr8, out_ptr15, out_ptr16, out_ptr17, out_ptr24, out_ptr25, out_ptr26, out_ptr33, out_ptr34, out_ptr35, out_ptr42, out_ptr43, out_ptr44, out_ptr51, out_ptr52, out_ptr53, out_ptr60, out_ptr61, out_ptr62, out_ptr69, out_ptr70, out_ptr71, out_ptr78, out_ptr79, out_ptr80, out_ptr87, out_ptr88, out_ptr89):
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid = tl.program_id(0)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] XBLOCK: tl.constexpr = 1024
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_0 = tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] if pid < num_xblocks_0:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x0 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp5 = tl.load(in_ptr0 + (x0), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp6 = tl.load(in_ptr1 + (x0), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp11 = tl.load(in_ptr2 + (x0), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp18 = tl.load(in_ptr3 + (x0), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp20 = in_ptr4
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp0 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp1 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp2 = tmp0 >= tmp1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp3 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp4 = tl.where(tmp2, tmp3, tmp0)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp7 = tmp5 - tmp6
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp8 = tmp4 * tmp7
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp9 = tl.where(tmp2, tmp5, tmp6)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp10 = tmp8 + tmp9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp12 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp13 = tmp11 * tmp12
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp14 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp15 = tmp5 * tmp14
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp16 = tmp15 * tmp5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp17 = tmp13 + tmp16
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp19 = libdevice.sqrt(tmp17)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp21 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp22 = tmp20 + tmp21
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp23 = libdevice.pow(tmp12, tmp22)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp24 = tmp21 - tmp23
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp25 = libdevice.sqrt(tmp24)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp26 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp27 = libdevice.pow(tmp26, tmp22)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp28 = tmp21 - tmp27
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp29 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp30 = (tmp29 / tmp28)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp31 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp32 = tmp30 * tmp31
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp33 = -tmp32
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp34 = tmp25 * tmp33
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp35 = (tmp19 / tmp34)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp36 = (tmp29 / tmp33)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp37 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp38 = tmp36 * tmp37
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp39 = tmp35 + tmp38
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp40 = (tmp10 / tmp39)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp41 = tmp18 + tmp40
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr6 + (x0), tmp41, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr7 + (x0), tmp10, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr8 + (x0), tmp17, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_1:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x1 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp47 = tl.load(in_ptr5 + (x1), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp48 = tl.load(in_ptr6 + (x1), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp53 = tl.load(in_ptr7 + (x1), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp60 = tl.load(in_ptr8 + (x1), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp62 = in_ptr9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp42 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp43 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp44 = tmp42 >= tmp43
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp45 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp46 = tl.where(tmp44, tmp45, tmp42)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp49 = tmp47 - tmp48
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp50 = tmp46 * tmp49
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp51 = tl.where(tmp44, tmp47, tmp48)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp52 = tmp50 + tmp51
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp54 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp55 = tmp53 * tmp54
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp56 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp57 = tmp47 * tmp56
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp58 = tmp57 * tmp47
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp59 = tmp55 + tmp58
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp61 = libdevice.sqrt(tmp59)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp63 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp64 = tmp62 + tmp63
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp65 = libdevice.pow(tmp54, tmp64)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp66 = tmp63 - tmp65
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp67 = libdevice.sqrt(tmp66)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp68 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp69 = libdevice.pow(tmp68, tmp64)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp70 = tmp63 - tmp69
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp71 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp72 = (tmp71 / tmp70)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp73 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp74 = tmp72 * tmp73
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp75 = -tmp74
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp76 = tmp67 * tmp75
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp77 = (tmp61 / tmp76)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp78 = (tmp71 / tmp75)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp79 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp80 = tmp78 * tmp79
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp81 = tmp77 + tmp80
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp82 = (tmp52 / tmp81)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp83 = tmp60 + tmp82
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr15 + (x1), tmp83, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr16 + (x1), tmp52, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr17 + (x1), tmp59, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_2:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x2 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp89 = tl.load(in_ptr10 + (x2), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp90 = tl.load(in_ptr11 + (x2), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp95 = tl.load(in_ptr12 + (x2), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp102 = tl.load(in_ptr13 + (x2), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp104 = in_ptr14
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp84 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp85 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp86 = tmp84 >= tmp85
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp87 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp88 = tl.where(tmp86, tmp87, tmp84)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp91 = tmp89 - tmp90
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp92 = tmp88 * tmp91
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp93 = tl.where(tmp86, tmp89, tmp90)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp94 = tmp92 + tmp93
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp96 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp97 = tmp95 * tmp96
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp98 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp99 = tmp89 * tmp98
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp100 = tmp99 * tmp89
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp101 = tmp97 + tmp100
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp103 = libdevice.sqrt(tmp101)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp105 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp106 = tmp104 + tmp105
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp107 = libdevice.pow(tmp96, tmp106)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp108 = tmp105 - tmp107
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp109 = libdevice.sqrt(tmp108)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp110 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp111 = libdevice.pow(tmp110, tmp106)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp112 = tmp105 - tmp111
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp113 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp114 = (tmp113 / tmp112)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp115 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp116 = tmp114 * tmp115
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp117 = -tmp116
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp118 = tmp109 * tmp117
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp119 = (tmp103 / tmp118)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp120 = (tmp113 / tmp117)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp121 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp122 = tmp120 * tmp121
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp123 = tmp119 + tmp122
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp124 = (tmp94 / tmp123)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp125 = tmp102 + tmp124
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr24 + (x2), tmp125, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr25 + (x2), tmp94, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr26 + (x2), tmp101, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_3:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_2
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x3 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp131 = tl.load(in_ptr15 + (x3), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp132 = tl.load(in_ptr16 + (x3), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp137 = tl.load(in_ptr17 + (x3), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp144 = tl.load(in_ptr18 + (x3), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp146 = in_ptr19
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp126 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp127 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp128 = tmp126 >= tmp127
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp129 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp130 = tl.where(tmp128, tmp129, tmp126)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp133 = tmp131 - tmp132
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp134 = tmp130 * tmp133
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp135 = tl.where(tmp128, tmp131, tmp132)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp136 = tmp134 + tmp135
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp138 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp139 = tmp137 * tmp138
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp140 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp141 = tmp131 * tmp140
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp142 = tmp141 * tmp131
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp143 = tmp139 + tmp142
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp145 = libdevice.sqrt(tmp143)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp147 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp148 = tmp146 + tmp147
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp149 = libdevice.pow(tmp138, tmp148)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp150 = tmp147 - tmp149
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp151 = libdevice.sqrt(tmp150)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp152 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp153 = libdevice.pow(tmp152, tmp148)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp154 = tmp147 - tmp153
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp155 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp156 = (tmp155 / tmp154)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp157 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp158 = tmp156 * tmp157
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp159 = -tmp158
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp160 = tmp151 * tmp159
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp161 = (tmp145 / tmp160)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp162 = (tmp155 / tmp159)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp163 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp164 = tmp162 * tmp163
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp165 = tmp161 + tmp164
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp166 = (tmp136 / tmp165)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp167 = tmp144 + tmp166
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr33 + (x3), tmp167, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr34 + (x3), tmp136, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr35 + (x3), tmp143, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_4:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_3
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x4 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp173 = tl.load(in_ptr20 + (x4), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp174 = tl.load(in_ptr21 + (x4), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp179 = tl.load(in_ptr22 + (x4), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp186 = tl.load(in_ptr23 + (x4), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp188 = in_ptr24
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp168 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp169 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp170 = tmp168 >= tmp169
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp171 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp172 = tl.where(tmp170, tmp171, tmp168)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp175 = tmp173 - tmp174
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp176 = tmp172 * tmp175
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp177 = tl.where(tmp170, tmp173, tmp174)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp178 = tmp176 + tmp177
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp180 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp181 = tmp179 * tmp180
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp182 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp183 = tmp173 * tmp182
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp184 = tmp183 * tmp173
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp185 = tmp181 + tmp184
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp187 = libdevice.sqrt(tmp185)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp189 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp190 = tmp188 + tmp189
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp191 = libdevice.pow(tmp180, tmp190)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp192 = tmp189 - tmp191
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp193 = libdevice.sqrt(tmp192)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp194 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp195 = libdevice.pow(tmp194, tmp190)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp196 = tmp189 - tmp195
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp197 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp198 = (tmp197 / tmp196)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp199 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp200 = tmp198 * tmp199
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp201 = -tmp200
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp202 = tmp193 * tmp201
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp203 = (tmp187 / tmp202)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp204 = (tmp197 / tmp201)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp205 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp206 = tmp204 * tmp205
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp207 = tmp203 + tmp206
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp208 = (tmp178 / tmp207)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp209 = tmp186 + tmp208
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr42 + (x4), tmp209, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr43 + (x4), tmp178, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr44 + (x4), tmp185, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_5:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_4
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x5 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp215 = tl.load(in_ptr25 + (x5), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp216 = tl.load(in_ptr26 + (x5), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp221 = tl.load(in_ptr27 + (x5), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp228 = tl.load(in_ptr28 + (x5), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp230 = in_ptr29
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp210 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp211 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp212 = tmp210 >= tmp211
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp213 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp214 = tl.where(tmp212, tmp213, tmp210)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp217 = tmp215 - tmp216
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp218 = tmp214 * tmp217
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp219 = tl.where(tmp212, tmp215, tmp216)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp220 = tmp218 + tmp219
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp222 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp223 = tmp221 * tmp222
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp224 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp225 = tmp215 * tmp224
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp226 = tmp225 * tmp215
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp227 = tmp223 + tmp226
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp229 = libdevice.sqrt(tmp227)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp231 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp232 = tmp230 + tmp231
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp233 = libdevice.pow(tmp222, tmp232)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp234 = tmp231 - tmp233
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp235 = libdevice.sqrt(tmp234)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp236 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp237 = libdevice.pow(tmp236, tmp232)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp238 = tmp231 - tmp237
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp239 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp240 = (tmp239 / tmp238)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp241 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp242 = tmp240 * tmp241
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp243 = -tmp242
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp244 = tmp235 * tmp243
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp245 = (tmp229 / tmp244)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp246 = (tmp239 / tmp243)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp247 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp248 = tmp246 * tmp247
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp249 = tmp245 + tmp248
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp250 = (tmp220 / tmp249)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp251 = tmp228 + tmp250
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr51 + (x5), tmp251, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr52 + (x5), tmp220, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr53 + (x5), tmp227, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_6:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x6 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp257 = tl.load(in_ptr30 + (x6), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp258 = tl.load(in_ptr31 + (x6), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp263 = tl.load(in_ptr32 + (x6), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp270 = tl.load(in_ptr33 + (x6), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp272 = in_ptr34
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp252 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp253 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp254 = tmp252 >= tmp253
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp255 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp256 = tl.where(tmp254, tmp255, tmp252)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp259 = tmp257 - tmp258
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp260 = tmp256 * tmp259
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp261 = tl.where(tmp254, tmp257, tmp258)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp262 = tmp260 + tmp261
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp264 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp265 = tmp263 * tmp264
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp266 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp267 = tmp257 * tmp266
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp268 = tmp267 * tmp257
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp269 = tmp265 + tmp268
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp271 = libdevice.sqrt(tmp269)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp273 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp274 = tmp272 + tmp273
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp275 = libdevice.pow(tmp264, tmp274)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp276 = tmp273 - tmp275
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp277 = libdevice.sqrt(tmp276)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp278 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp279 = libdevice.pow(tmp278, tmp274)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp280 = tmp273 - tmp279
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp281 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp282 = (tmp281 / tmp280)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp283 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp284 = tmp282 * tmp283
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp285 = -tmp284
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp286 = tmp277 * tmp285
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp287 = (tmp271 / tmp286)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp288 = (tmp281 / tmp285)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp289 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp290 = tmp288 * tmp289
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp291 = tmp287 + tmp290
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp292 = (tmp262 / tmp291)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp293 = tmp270 + tmp292
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr60 + (x6), tmp293, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr61 + (x6), tmp262, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr62 + (x6), tmp269, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_7:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_6
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x7 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp299 = tl.load(in_ptr35 + (x7), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp300 = tl.load(in_ptr36 + (x7), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp305 = tl.load(in_ptr37 + (x7), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp312 = tl.load(in_ptr38 + (x7), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp314 = in_ptr39
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp294 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp295 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp296 = tmp294 >= tmp295
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp297 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp298 = tl.where(tmp296, tmp297, tmp294)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp301 = tmp299 - tmp300
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp302 = tmp298 * tmp301
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp303 = tl.where(tmp296, tmp299, tmp300)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp304 = tmp302 + tmp303
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp306 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp307 = tmp305 * tmp306
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp308 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp309 = tmp299 * tmp308
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp310 = tmp309 * tmp299
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp311 = tmp307 + tmp310
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp313 = libdevice.sqrt(tmp311)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp315 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp316 = tmp314 + tmp315
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp317 = libdevice.pow(tmp306, tmp316)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp318 = tmp315 - tmp317
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp319 = libdevice.sqrt(tmp318)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp320 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp321 = libdevice.pow(tmp320, tmp316)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp322 = tmp315 - tmp321
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp323 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp324 = (tmp323 / tmp322)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp325 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp326 = tmp324 * tmp325
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp327 = -tmp326
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp328 = tmp319 * tmp327
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp329 = (tmp313 / tmp328)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp330 = (tmp323 / tmp327)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp331 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp332 = tmp330 * tmp331
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp333 = tmp329 + tmp332
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp334 = (tmp304 / tmp333)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp335 = tmp312 + tmp334
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr69 + (x7), tmp335, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr70 + (x7), tmp304, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr71 + (x7), tmp311, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_8:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_7
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x8 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp341 = tl.load(in_ptr40 + (x8), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp342 = tl.load(in_ptr41 + (x8), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp347 = tl.load(in_ptr42 + (x8), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp354 = tl.load(in_ptr43 + (x8), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp356 = in_ptr44
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp336 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp337 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp338 = tmp336 >= tmp337
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp339 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp340 = tl.where(tmp338, tmp339, tmp336)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp343 = tmp341 - tmp342
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp344 = tmp340 * tmp343
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp345 = tl.where(tmp338, tmp341, tmp342)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp346 = tmp344 + tmp345
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp348 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp349 = tmp347 * tmp348
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp350 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp351 = tmp341 * tmp350
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp352 = tmp351 * tmp341
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp353 = tmp349 + tmp352
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp355 = libdevice.sqrt(tmp353)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp357 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp358 = tmp356 + tmp357
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp359 = libdevice.pow(tmp348, tmp358)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp360 = tmp357 - tmp359
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp361 = libdevice.sqrt(tmp360)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp362 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp363 = libdevice.pow(tmp362, tmp358)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp364 = tmp357 - tmp363
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp365 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp366 = (tmp365 / tmp364)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp367 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp368 = tmp366 * tmp367
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp369 = -tmp368
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp370 = tmp361 * tmp369
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp371 = (tmp355 / tmp370)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp372 = (tmp365 / tmp369)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp373 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp374 = tmp372 * tmp373
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp375 = tmp371 + tmp374
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp376 = (tmp346 / tmp375)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp377 = tmp354 + tmp376
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr78 + (x8), tmp377, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr79 + (x8), tmp346, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr80 + (x8), tmp353, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_9:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_8
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] x9 = xindex
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp383 = tl.load(in_ptr45 + (x9), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp384 = tl.load(in_ptr46 + (x9), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp389 = tl.load(in_ptr47 + (x9), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp396 = tl.load(in_ptr48 + (x9), None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp398 = in_ptr49
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp378 = 0.09999999999999998
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp379 = 0.5
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp380 = tmp378 >= tmp379
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp381 = -0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp382 = tl.where(tmp380, tmp381, tmp378)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp385 = tmp383 - tmp384
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp386 = tmp382 * tmp385
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp387 = tl.where(tmp380, tmp383, tmp384)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp388 = tmp386 + tmp387
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp390 = 0.999
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp391 = tmp389 * tmp390
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp392 = 0.0010000000000000009
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp393 = tmp383 * tmp392
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp394 = tmp393 * tmp383
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp395 = tmp391 + tmp394
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp397 = libdevice.sqrt(tmp395)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp399 = 1.0
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp400 = tmp398 + tmp399
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp401 = libdevice.pow(tmp390, tmp400)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp402 = tmp399 - tmp401
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp403 = libdevice.sqrt(tmp402)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp404 = 0.9
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp405 = libdevice.pow(tmp404, tmp400)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp406 = tmp399 - tmp405
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp407 = tl.full([1], 1, tl.int32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp408 = (tmp407 / tmp406)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp409 = 0.001
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp410 = tmp408 * tmp409
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp411 = -tmp410
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp412 = tmp403 * tmp411
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp413 = (tmp397 / tmp412)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp414 = (tmp407 / tmp411)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp415 = 1e-08
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp416 = tmp414 * tmp415
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp417 = tmp413 + tmp416
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp418 = (tmp388 / tmp417)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp419 = tmp396 + tmp418
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr87 + (x9), tmp419, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr88 + (x9), tmp388, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr89 + (x9), tmp395, None)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] else:
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] pass
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] ''', device_str='cuda')
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] cpp_fused__foreach_copy_1 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] #include "/tmp/torchinductor_ci-user/tmp4ttmqsz0/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] extern "C" void kernel(const float* in_ptr0,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr1,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr2,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr3,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr4,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr5,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr6,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr7,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr8,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr9,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr1,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr3,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr5,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr7,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr9,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr11,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr13,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr15,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr17,
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr19)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr1[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr3[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr5[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr7[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr9[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr11[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr13[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr15[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr17[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr19[static_cast<int64_t>(0L)] = tmp2;
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] ''')
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] async_compile.wait(globals())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del async_compile
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] def call(args):
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = args
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] args.clear()
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg0_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg1_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg2_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg3_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg4_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg5_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg6_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg7_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg8_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg9_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg10_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg11_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg12_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg13_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg14_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg15_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg16_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg17_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg18_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg19_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg20_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg21_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg22_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg23_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg24_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg25_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg26_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg27_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg28_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg29_1, (), ())
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg30_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg31_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg32_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg33_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg34_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg35_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg36_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg37_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg38_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg39_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg40_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg41_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg42_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg43_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg44_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg45_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg46_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg47_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg48_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg49_1, (1024, 1024), (1024, 1))
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] with torch.cuda._DeviceGuard(0):
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] torch.cuda.set_device(0)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] stream0 = get_raw_stream(0)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_for_fused_0.run(arg1_1, arg30_1, arg40_1, arg0_1, arg20_1.item(), arg3_1, arg31_1, arg41_1, arg2_1, arg21_1.item(), arg5_1, arg32_1, arg42_1, arg4_1, arg22_1.item(), arg7_1, arg33_1, arg43_1, arg6_1, arg23_1.item(), arg9_1, arg34_1, arg44_1, arg8_1, arg24_1.item(), arg11_1, arg35_1, arg45_1, arg10_1, arg25_1.item(), arg13_1, arg36_1, arg46_1, arg12_1, arg26_1.item(), arg15_1, arg37_1, arg47_1, arg14_1, arg27_1.item(), arg17_1, arg38_1, arg48_1, arg16_1, arg28_1.item(), arg19_1, arg39_1, arg49_1, arg18_1, arg29_1.item(), arg0_1, arg30_1, arg40_1, arg2_1, arg31_1, arg41_1, arg4_1, arg32_1, arg42_1, arg6_1, arg33_1, arg43_1, arg8_1, arg34_1, arg44_1, arg10_1, arg35_1, arg45_1, arg12_1, arg36_1, arg46_1, arg14_1, arg37_1, arg47_1, arg16_1, arg38_1, arg48_1, arg18_1, arg39_1, arg49_1, stream=stream0)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg0_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg10_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg11_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg12_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg13_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg14_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg15_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg16_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg17_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg18_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg19_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg1_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg2_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg30_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg31_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg32_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg33_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg34_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg35_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg36_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg37_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg38_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg39_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg3_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg40_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg41_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg42_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg43_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg44_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg45_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg46_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg47_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg48_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg49_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg4_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg5_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg6_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg7_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg8_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg9_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] cpp_fused__foreach_copy_1(arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg20_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg21_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg22_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg23_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg24_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg25_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg26_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg27_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg28_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg29_1
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] return ()
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._dynamo.testing import rand_strided
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.utils import print_performance
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg10_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg11_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg12_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg13_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg14_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg15_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg16_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg17_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg18_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg19_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg20_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg21_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg22_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg23_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg24_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg25_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg26_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg27_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg28_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg29_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1])
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] return print_performance(fn, times=times, repeat=repeat)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] if __name__ == "__main__":
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.wrapper_benchmark import compiled_module_main
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code] compiled_module_main('None', benchmark_compiled_module)
V0512 16:37:18.377000 635 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0512 16:37:18.426000 635 torch/_inductor/graph.py:2115] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/tmp4ttmqsz0/oh/cohehg6wp7g5q5m2edjm7dfhwwj6v7o7cblf5i3wsrrh4iketdgv.py
I0512 16:37:19.925000 635 torch/_inductor/graph.py:2149] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/tmp4ttmqsz0/oh/cohehg6wp7g5q5m2edjm7dfhwwj6v7o7cblf5i3wsrrh4iketdgv.py
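The listing above is the complete Inductor output for the fused optimizer step: a single Triton kernel (triton_for_fused_0) that updates every parameter, exp_avg, and exp_avg_sq tensor in one launch, plus a small C++ kernel (cpp_fused__foreach_copy_1) that increments the scalar step counters on the CPU. If you want to reproduce a listing like this locally, one way (an assumption about setup, not necessarily how this output was generated) is to enable output-code logging before the first compiled call; equivalently, run with the environment variable TORCH_LOGS="output_code". The two timing lines that follow report the measured step time of the eager reference optimizer and of the compiled foreach_map implementation.

import torch

# Surface Inductor's generated code for subsequent torch.compile calls.
# This is a reproduction hint, not part of the tutorial's benchmark itself.
torch._logging.set_logs(output_code=True)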
eager runtime: 1211.0748650047751us
compiled runtime: 763.9768779069409us
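On this run, the compiled foreach_map Adam step takes roughly 764 µs versus about 1211 µs for the eager reference, reflecting the fully fused kernel shown above. As an illustration only (this is a sketch, not the benchmark helper used to produce the numbers above), a wall-clock comparison like this could be measured with CUDA events along the following lines; opt_eager_step and compiled_adam_step are hypothetical callables standing in for the two optimizer steps.

import torch

def time_cuda_us(fn, warmup=5, iters=100):
    # Hypothetical timing helper -- a sketch, not the tutorial's benchmark code.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000 / iters  # microseconds per call

# opt_eager_step / compiled_adam_step are placeholders for the two optimizer steps.
# print(f"eager runtime: {time_cuda_us(opt_eager_step)}us")
# print(f"compiled runtime: {time_cuda_us(compiled_adam_step)}us")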
Conclusion¶
In this tutorial, we implemented a custom fully fused Adam optimizer using foreach_map and compiled it with torch.compile, producing the single horizontally fused Triton kernel shown in the generated output above. The same approach applies to any pointwise update rule: express the per-parameter math with foreach_map, compile it, and Inductor fuses the work across the entire parameter list. This makes foreach_map a practical way to write custom optimizers and other horizontally fused routines without hand-writing kernels.
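For completeness, a minimal end-to-end sketch of how the pieces defined earlier fit together is shown below. It assumes model, get_inputs, and foreach_map_adam from this tutorial are in scope and that gradients have already been populated by a backward pass; the torch.optim.Adam instance and the warm-up step() that materializes the exp_avg / exp_avg_sq / step state are illustrative assumptions rather than verbatim tutorial code.

import torch

# Assumes ``model``, ``get_inputs`` and ``foreach_map_adam`` from this tutorial are defined
# and that gradients have already been populated via a backward pass.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
opt.step()  # one eager step so exp_avg / exp_avg_sq / step state exists for get_inputs

steps, params, exp_avgs, exp_avg_sqs = get_inputs(opt)
compiled_adam = torch.compile(foreach_map_adam)

# Each call now updates all parameters through the single fused kernel shown above.
compiled_adam(steps, params, exp_avgs, exp_avg_sqs)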
See also:
Compiled optimizer tutorial - an introduction to the compiled optimizer.
Compiling the optimizer with PT2 - deeper technical details on the compiled optimizer.