Explicit horizontal fusion with foreach_map and torch.compile
Author: Michael Lazos
Horizontal fusion is a key optimization in ML compilers. In eager mode, it is typically expressed with the torch._foreach* ops, which parallelize an operation across a list of tensors. However, supporting every possible permutation of arguments (e.g. mixtures of scalars and lists) is quite difficult. foreach_map allows conversion of any pointwise op in torch to a horizontally fused foreach variant. In this tutorial, we will demonstrate how to implement the Adam optimizer with foreach_map to generate a fully fused kernel.
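As a rough illustration of the eager-mode approach described above, the sketch below applies the same two pointwise ops to a list of tensors, first with a Python loop and then with torch._foreach_* calls. The tensor shapes and values are arbitrary choices for illustration, and the example runs on CPU; it shows the calling pattern only, not the fused kernel that torch.compile generates.

```python
import torch

# Three small tensors standing in for a list of parameters (illustrative values).
tensors = [torch.full((4,), float(i)) for i in range(3)]

# Loop version: each op is issued once per tensor.
looped = [t * 2.0 + 1.0 for t in tensors]

# foreach version: each op is issued once for the whole list.
scaled = torch._foreach_mul(tensors, 2.0)
fused = torch._foreach_add(scaled, 1.0)

# Both versions compute the same values.
matches = all(torch.equal(a, b) for a, b in zip(looped, fused))
print(matches)
```

Each foreach call here still covers a single op; composing several ops into one kernel across the list is what foreach_map combined with torch.compile is meant to provide.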
Note
This recipe describes a prototype feature. Prototype features are typically at an early stage for feedback and testing and are subject to change.
Prerequisites
PyTorch v2.7.0 or later
Model Setup
For this example, we’ll use a simple sequence of linear layers. We instantiate an independent copy to compare the two optimizer implementations.
import torch

# exit cleanly if we are on a device that doesn't support ``torch.compile``
# (checking ``is_available`` first avoids an error on machines without CUDA)
if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (7, 0):
    print("Exiting because torch.compile is not supported on this device.")
    import sys
    sys.exit(0)

# Create simple model
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
model_copy = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
input = torch.rand(1024, device="cuda")

# run forward pass
output = model(input)
output_copy = model_copy(input)

# run backward to populate the grads for our optimizer below
output.sum().backward()
output_copy.sum().backward()
Helper functions for foreach_map implementation
In this section, we’ll begin our implementation of the Adam optimizer.
from torch._higher_order_ops.foreach_map import foreach_map

# Helper function to extract optimizer states from a torch.optim.Adam instance
# (gradients are read from the params inside foreach_map_adam, so they are not returned)
def get_inputs(optim):
    steps = []
    params = []
    exp_avgs = []
    exp_avg_sqs = []
    for group in optim.param_groups:
        for p in group["params"]:
            params.append(p)
            state = optim.state[p]
            exp_avgs.append(state["exp_avg"])
            exp_avg_sqs.append(state["exp_avg_sq"])
            steps.append(state["step"])

    return steps, params, exp_avgs, exp_avg_sqs


# Functions to update the different optimizer states
def update_exp_avg_sq(exp_avg_sq, grad, beta2):
    # exp_avg_sq * beta2 + (1 - beta2) * grad * grad
    return exp_avg_sq.mul(beta2).addcmul(grad, grad, value=1 - beta2)

def update_param(param, step, exp_avg, exp_avg_sq, beta1, beta2, lr, eps):
    bias_correction1 = 1 - torch.pow(beta1, step)
    bias_correction2 = (1 - torch.pow(beta2, step)).sqrt()
    # step_size is negative, so the final add subtracts the update
    step_size = (lr / bias_correction1).neg()
    denom = (exp_avg_sq.sqrt() / (bias_correction2 * step_size)).add(eps / step_size)
    return torch.add(param, torch.div(exp_avg, denom))
# Our full Adam implementation
def foreach_map_adam(
    steps,
    params,
    exp_avgs,
    exp_avg_sqs,
    weight_decay=0,
    beta1=0.9,
    beta2=0.999,
    lr=1e-3,
    eps=1e-8,
):
    with torch.no_grad():
        grads = [param.grad for param in params]
        # update step
        updated_steps = foreach_map(lambda x: x + 1, steps)
        torch._foreach_copy_(steps, updated_steps)

        if weight_decay != 0:
            # L2 weight decay: grad <- grad + weight_decay * param
            grads = foreach_map(torch.add, grads, params, alpha=weight_decay)

        # Higher-order operators (HOPs) cannot have multiple outputs at the moment
        # need to call foreach_map once for each output
        exp_avgs_updated = foreach_map(torch.lerp, exp_avgs, grads, 1 - beta1)
        exp_avgs_sq_updated = foreach_map(update_exp_avg_sq, exp_avg_sqs, grads, beta2)
        params_updated = foreach_map(
            update_param,
            params,
            steps,
            exp_avgs_updated,
            exp_avgs_sq_updated,
            beta1,
            beta2,
            lr,
            eps,
        )
        # Higher-order operators (HOPs) don't support input mutation today
        # so manually update the states in-place
        torch._foreach_copy_(exp_avgs, exp_avgs_updated)
        torch._foreach_copy_(exp_avg_sqs, exp_avgs_sq_updated)
        torch._foreach_copy_(params, params_updated)
    return
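The arithmetic in update_param folds the step size into the denominator, which can make it hard to recognize as the usual Adam rule. The scalar sketch below, in plain Python, compares that folded form with the textbook bias-corrected update. The function names textbook_adam_update and folded_adam_update and the sample values are hypothetical and chosen only for illustration; this is a check of the algebra, not of the tensor implementation.

```python
import math

def textbook_adam_update(param, step, exp_avg, exp_avg_sq, beta1, beta2, lr, eps):
    # Standard form: param - lr * m_hat / (sqrt(v_hat) + eps)
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

def folded_adam_update(param, step, exp_avg, exp_avg_sq, beta1, beta2, lr, eps):
    # Same arithmetic as update_param above, written for Python floats.
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = math.sqrt(1 - beta2 ** step)
    step_size = -(lr / bias_correction1)
    denom = math.sqrt(exp_avg_sq) / (bias_correction2 * step_size) + eps / step_size
    return param + exp_avg / denom

# Arbitrary sample state for one scalar parameter.
sample = dict(param=0.5, step=3, exp_avg=0.02, exp_avg_sq=0.0004,
              beta1=0.9, beta2=0.999, lr=1e-3, eps=1e-8)

reference = textbook_adam_update(**sample)
folded = folded_adam_update(**sample)
print(math.isclose(reference, folded, rel_tol=1e-9))
```

Because step_size is negative, dividing exp_avg by denom and adding the result to param is the same as subtracting the bias-corrected update.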
Setting up and running the compiled kernel
In this section, we’ll run our Adam optimizer and compare the results.
Note
torch.compile is only supported on CUDA devices that have a compute capability of 7.0 or higher.
opt_eager = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.01))
opt_eager_copy = torch.optim.Adam(model_copy.parameters(), lr=torch.tensor(0.01))

# warm up the optimizer state dict
opt_eager.step()
opt_eager_copy.step()

inputs = get_inputs(opt_eager_copy)
compiled_adam = torch.compile(foreach_map_adam)

# optionally view the output code
torch._logging.set_logs(output_code=True)

# Warmup runs to compile the function
for _ in range(5):
    opt_eager.step()
    compiled_adam(*inputs)

for eager_p, compile_p in zip(opt_eager.param_groups[0]["params"], opt_eager_copy.param_groups[0]["params"]):
    torch.allclose(eager_p, compile_p)

# Benchmark performance

# Let's define a helpful benchmarking function:
import torch.utils.benchmark as benchmark

def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
    )
    return t0.blocked_autorange().mean * 1e6

eager_runtime = benchmark_torch_function_in_microseconds(opt_eager.step)
compiled_runtime = benchmark_torch_function_in_microseconds(lambda: compiled_adam(*inputs))

assert eager_runtime > compiled_runtime

print(f"eager runtime: {eager_runtime}us")
print(f"compiled runtime: {compiled_runtime}us")
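The output code logged by this script contains a single Triton kernel, triton_for_fused_0, that covers all ten parameter tensors. It picks the tensor to work on by comparing the program id against running block counts (num_xblocks_0, num_xblocks_1, and so on). The plain-Python sketch below mimics that dispatch logic. The function name locate_block is hypothetical, and the sizes mirror the 1048576-element tensors and XBLOCK of 1024 that appear in the log.

```python
def locate_block(pid, numels, xblock=1024):
    """Map a flat program id to (tensor index, block offset within that tensor)."""
    start = 0
    for index, numel in enumerate(numels):
        num_blocks = -(-numel // xblock)  # ceiling division, like tl.cdiv
        if pid < start + num_blocks:
            return index, pid - start
        start += num_blocks
    raise IndexError("pid is outside the combined grid")

# Ten tensors of 1024 * 1024 elements each, as in the logged kernel.
numels = [1048576] * 10

print(locate_block(0, numels))
print(locate_block(1024, numels))
print(locate_block(2049, numels))
```

Each tensor contributes 1024 blocks here, so program ids 0 to 1023 handle the first tensor, 1024 to 2047 the second, and so on through the list.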
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] Output code:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] # AOT ID: ['0_inference']
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import torch
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import math
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import random
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import os
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import tempfile
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from math import inf, nan
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from cmath import nanj
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.utils import maybe_profile
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch import device, empty_strided
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton.language as tl
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] aten = torch.ops.aten
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] inductor_ops = torch.ops.inductor
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] _quantized = torch.ops._quantized
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] async_compile = AsyncCompile()
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] # kernel path: /tmp/torchinductor_ci-user/ej/cejr7t4zzqo7llcoxga7clgyc6gs3676lsm4dvilpfw64kudp2ns.py
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Source node to ATen node mapping:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] triton_for_fused_0 = async_compile.triton('triton_for_fused_0', '''
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton.language as tl
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime import triton_helpers, triton_heuristics
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DeviceProperties
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] @triton_heuristics.foreach(
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_warps=8,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr6': '*fp32', 'out_ptr7': '*fp32', 'out_ptr8': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr24': '*fp32', 'out_ptr25': '*fp32', 'out_ptr26': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 'out_ptr35': '*fp32', 'out_ptr42': '*fp32', 'out_ptr43': '*fp32', 'out_ptr44': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr60': '*fp32', 'out_ptr61': '*fp32', 'out_ptr62': '*fp32', 'out_ptr69': '*fp32', 'out_ptr70': '*fp32', 'out_ptr71': '*fp32', 'out_ptr78': '*fp32', 'out_ptr79': '*fp32', 'out_ptr80': '*fp32', 'out_ptr87': '*fp32', 'out_ptr88': '*fp32', 'out_ptr89': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, warp_size=32), 'constants': {}, 'configs': 
[{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): [['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], (56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], (60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], 
(69,): [['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]},
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_0', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr24', 'out_ptr25', 'out_ptr26', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr42', 'out_ptr43', 'out_ptr44', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr6', 'out_ptr60', 'out_ptr61', 'out_ptr62', 'out_ptr69', 'out_ptr7', 'out_ptr70', 'out_ptr71', 'out_ptr78', 'out_ptr79', 'out_ptr8', 'out_ptr80', 'out_ptr87', 'out_ptr88', 'out_ptr89'], 'backend_hash': '1E2C16421D4C3DBA4AD92BFC4278A3CB24C43DEDA6EE7FF9E3FBB1DBB80802DB', 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False},
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] )
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] @triton.jit
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] def triton_for_fused_0(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr6, out_ptr7, out_ptr8, out_ptr15, out_ptr16, out_ptr17, out_ptr24, out_ptr25, out_ptr26, out_ptr33, out_ptr34, out_ptr35, out_ptr42, out_ptr43, out_ptr44, out_ptr51, out_ptr52, out_ptr53, out_ptr60, out_ptr61, out_ptr62, out_ptr69, out_ptr70, out_ptr71, out_ptr78, out_ptr79, out_ptr80, out_ptr87, out_ptr88, out_ptr89):
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid = tl.program_id(0)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] XBLOCK: tl.constexpr = 1024
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_0 = tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] if pid < num_xblocks_0:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x0 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp5 = tl.load(in_ptr0 + (x0), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp6 = tl.load(in_ptr1 + (x0), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp11 = tl.load(in_ptr2 + (x0), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp18 = tl.load(in_ptr3 + (x0), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp20 = in_ptr4
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp0 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp1 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp2 = tmp0 >= tmp1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp3 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp4 = tl.where(tmp2, tmp3, tmp0)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp7 = tmp5 - tmp6
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp8 = tmp4 * tmp7
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp9 = tl.where(tmp2, tmp5, tmp6)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp10 = tmp8 + tmp9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp12 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp13 = tmp11 * tmp12
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp14 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp15 = tmp5 * tmp14
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp16 = tmp15 * tmp5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp17 = tmp13 + tmp16
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp19 = libdevice.sqrt(tmp17)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp21 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp22 = tmp20 + tmp21
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp23 = libdevice.pow(tmp12, tmp22)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp24 = tmp21 - tmp23
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp25 = libdevice.sqrt(tmp24)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp26 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp27 = libdevice.pow(tmp26, tmp22)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp28 = tmp21 - tmp27
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp29 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp30 = (tmp29 / tmp28)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp31 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp32 = tmp30 * tmp31
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp33 = -tmp32
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp34 = tmp25 * tmp33
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp35 = (tmp19 / tmp34)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp36 = (tmp29 / tmp33)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp37 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp38 = tmp36 * tmp37
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp39 = tmp35 + tmp38
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp40 = (tmp10 / tmp39)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp41 = tmp18 + tmp40
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr6 + (x0), tmp41, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr7 + (x0), tmp10, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr8 + (x0), tmp17, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_1:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x1 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp47 = tl.load(in_ptr5 + (x1), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp48 = tl.load(in_ptr6 + (x1), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp53 = tl.load(in_ptr7 + (x1), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp60 = tl.load(in_ptr8 + (x1), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp62 = in_ptr9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp42 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp43 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp44 = tmp42 >= tmp43
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp45 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp46 = tl.where(tmp44, tmp45, tmp42)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp49 = tmp47 - tmp48
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp50 = tmp46 * tmp49
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp51 = tl.where(tmp44, tmp47, tmp48)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp52 = tmp50 + tmp51
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp54 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp55 = tmp53 * tmp54
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp56 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp57 = tmp47 * tmp56
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp58 = tmp57 * tmp47
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp59 = tmp55 + tmp58
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp61 = libdevice.sqrt(tmp59)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp63 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp64 = tmp62 + tmp63
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp65 = libdevice.pow(tmp54, tmp64)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp66 = tmp63 - tmp65
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp67 = libdevice.sqrt(tmp66)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp68 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp69 = libdevice.pow(tmp68, tmp64)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp70 = tmp63 - tmp69
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp71 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp72 = (tmp71 / tmp70)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp73 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp74 = tmp72 * tmp73
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp75 = -tmp74
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp76 = tmp67 * tmp75
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp77 = (tmp61 / tmp76)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp78 = (tmp71 / tmp75)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp79 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp80 = tmp78 * tmp79
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp81 = tmp77 + tmp80
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp82 = (tmp52 / tmp81)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp83 = tmp60 + tmp82
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr15 + (x1), tmp83, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr16 + (x1), tmp52, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr17 + (x1), tmp59, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_2:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x2 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp89 = tl.load(in_ptr10 + (x2), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp90 = tl.load(in_ptr11 + (x2), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp95 = tl.load(in_ptr12 + (x2), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp102 = tl.load(in_ptr13 + (x2), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp104 = in_ptr14
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp84 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp85 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp86 = tmp84 >= tmp85
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp87 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp88 = tl.where(tmp86, tmp87, tmp84)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp91 = tmp89 - tmp90
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp92 = tmp88 * tmp91
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp93 = tl.where(tmp86, tmp89, tmp90)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp94 = tmp92 + tmp93
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp96 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp97 = tmp95 * tmp96
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp98 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp99 = tmp89 * tmp98
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp100 = tmp99 * tmp89
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp101 = tmp97 + tmp100
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp103 = libdevice.sqrt(tmp101)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp105 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp106 = tmp104 + tmp105
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp107 = libdevice.pow(tmp96, tmp106)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp108 = tmp105 - tmp107
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp109 = libdevice.sqrt(tmp108)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp110 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp111 = libdevice.pow(tmp110, tmp106)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp112 = tmp105 - tmp111
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp113 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp114 = (tmp113 / tmp112)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp115 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp116 = tmp114 * tmp115
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp117 = -tmp116
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp118 = tmp109 * tmp117
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp119 = (tmp103 / tmp118)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp120 = (tmp113 / tmp117)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp121 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp122 = tmp120 * tmp121
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp123 = tmp119 + tmp122
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp124 = (tmp94 / tmp123)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp125 = tmp102 + tmp124
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr24 + (x2), tmp125, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr25 + (x2), tmp94, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr26 + (x2), tmp101, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_3:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_2
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x3 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp131 = tl.load(in_ptr15 + (x3), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp132 = tl.load(in_ptr16 + (x3), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp137 = tl.load(in_ptr17 + (x3), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp144 = tl.load(in_ptr18 + (x3), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp146 = in_ptr19
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp126 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp127 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp128 = tmp126 >= tmp127
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp129 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp130 = tl.where(tmp128, tmp129, tmp126)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp133 = tmp131 - tmp132
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp134 = tmp130 * tmp133
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp135 = tl.where(tmp128, tmp131, tmp132)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp136 = tmp134 + tmp135
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp138 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp139 = tmp137 * tmp138
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp140 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp141 = tmp131 * tmp140
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp142 = tmp141 * tmp131
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp143 = tmp139 + tmp142
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp145 = libdevice.sqrt(tmp143)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp147 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp148 = tmp146 + tmp147
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp149 = libdevice.pow(tmp138, tmp148)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp150 = tmp147 - tmp149
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp151 = libdevice.sqrt(tmp150)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp152 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp153 = libdevice.pow(tmp152, tmp148)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp154 = tmp147 - tmp153
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp155 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp156 = (tmp155 / tmp154)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp157 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp158 = tmp156 * tmp157
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp159 = -tmp158
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp160 = tmp151 * tmp159
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp161 = (tmp145 / tmp160)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp162 = (tmp155 / tmp159)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp163 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp164 = tmp162 * tmp163
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp165 = tmp161 + tmp164
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp166 = (tmp136 / tmp165)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp167 = tmp144 + tmp166
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr33 + (x3), tmp167, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr34 + (x3), tmp136, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr35 + (x3), tmp143, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_4:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_3
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x4 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp173 = tl.load(in_ptr20 + (x4), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp174 = tl.load(in_ptr21 + (x4), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp179 = tl.load(in_ptr22 + (x4), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp186 = tl.load(in_ptr23 + (x4), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp188 = in_ptr24
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp168 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp169 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp170 = tmp168 >= tmp169
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp171 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp172 = tl.where(tmp170, tmp171, tmp168)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp175 = tmp173 - tmp174
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp176 = tmp172 * tmp175
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp177 = tl.where(tmp170, tmp173, tmp174)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp178 = tmp176 + tmp177
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp180 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp181 = tmp179 * tmp180
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp182 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp183 = tmp173 * tmp182
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp184 = tmp183 * tmp173
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp185 = tmp181 + tmp184
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp187 = libdevice.sqrt(tmp185)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp189 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp190 = tmp188 + tmp189
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp191 = libdevice.pow(tmp180, tmp190)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp192 = tmp189 - tmp191
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp193 = libdevice.sqrt(tmp192)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp194 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp195 = libdevice.pow(tmp194, tmp190)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp196 = tmp189 - tmp195
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp197 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp198 = (tmp197 / tmp196)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp199 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp200 = tmp198 * tmp199
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp201 = -tmp200
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp202 = tmp193 * tmp201
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp203 = (tmp187 / tmp202)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp204 = (tmp197 / tmp201)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp205 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp206 = tmp204 * tmp205
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp207 = tmp203 + tmp206
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp208 = (tmp178 / tmp207)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp209 = tmp186 + tmp208
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr42 + (x4), tmp209, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr43 + (x4), tmp178, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr44 + (x4), tmp185, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_5:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_4
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x5 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp215 = tl.load(in_ptr25 + (x5), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp216 = tl.load(in_ptr26 + (x5), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp221 = tl.load(in_ptr27 + (x5), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp228 = tl.load(in_ptr28 + (x5), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp230 = in_ptr29
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp210 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp211 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp212 = tmp210 >= tmp211
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp213 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp214 = tl.where(tmp212, tmp213, tmp210)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp217 = tmp215 - tmp216
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp218 = tmp214 * tmp217
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp219 = tl.where(tmp212, tmp215, tmp216)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp220 = tmp218 + tmp219
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp222 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp223 = tmp221 * tmp222
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp224 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp225 = tmp215 * tmp224
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp226 = tmp225 * tmp215
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp227 = tmp223 + tmp226
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp229 = libdevice.sqrt(tmp227)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp231 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp232 = tmp230 + tmp231
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp233 = libdevice.pow(tmp222, tmp232)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp234 = tmp231 - tmp233
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp235 = libdevice.sqrt(tmp234)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp236 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp237 = libdevice.pow(tmp236, tmp232)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp238 = tmp231 - tmp237
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp239 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp240 = (tmp239 / tmp238)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp241 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp242 = tmp240 * tmp241
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp243 = -tmp242
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp244 = tmp235 * tmp243
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp245 = (tmp229 / tmp244)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp246 = (tmp239 / tmp243)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp247 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp248 = tmp246 * tmp247
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp249 = tmp245 + tmp248
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp250 = (tmp220 / tmp249)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp251 = tmp228 + tmp250
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr51 + (x5), tmp251, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr52 + (x5), tmp220, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr53 + (x5), tmp227, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_6:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x6 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp257 = tl.load(in_ptr30 + (x6), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp258 = tl.load(in_ptr31 + (x6), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp263 = tl.load(in_ptr32 + (x6), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp270 = tl.load(in_ptr33 + (x6), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp272 = in_ptr34
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp252 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp253 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp254 = tmp252 >= tmp253
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp255 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp256 = tl.where(tmp254, tmp255, tmp252)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp259 = tmp257 - tmp258
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp260 = tmp256 * tmp259
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp261 = tl.where(tmp254, tmp257, tmp258)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp262 = tmp260 + tmp261
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp264 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp265 = tmp263 * tmp264
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp266 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp267 = tmp257 * tmp266
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp268 = tmp267 * tmp257
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp269 = tmp265 + tmp268
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp271 = libdevice.sqrt(tmp269)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp273 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp274 = tmp272 + tmp273
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp275 = libdevice.pow(tmp264, tmp274)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp276 = tmp273 - tmp275
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp277 = libdevice.sqrt(tmp276)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp278 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp279 = libdevice.pow(tmp278, tmp274)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp280 = tmp273 - tmp279
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp281 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp282 = (tmp281 / tmp280)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp283 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp284 = tmp282 * tmp283
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp285 = -tmp284
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp286 = tmp277 * tmp285
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp287 = (tmp271 / tmp286)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp288 = (tmp281 / tmp285)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp289 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp290 = tmp288 * tmp289
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp291 = tmp287 + tmp290
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp292 = (tmp262 / tmp291)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp293 = tmp270 + tmp292
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr60 + (x6), tmp293, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr61 + (x6), tmp262, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr62 + (x6), tmp269, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_7:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_6
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x7 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp299 = tl.load(in_ptr35 + (x7), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp300 = tl.load(in_ptr36 + (x7), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp305 = tl.load(in_ptr37 + (x7), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp312 = tl.load(in_ptr38 + (x7), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp314 = in_ptr39
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp294 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp295 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp296 = tmp294 >= tmp295
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp297 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp298 = tl.where(tmp296, tmp297, tmp294)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp301 = tmp299 - tmp300
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp302 = tmp298 * tmp301
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp303 = tl.where(tmp296, tmp299, tmp300)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp304 = tmp302 + tmp303
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp306 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp307 = tmp305 * tmp306
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp308 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp309 = tmp299 * tmp308
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp310 = tmp309 * tmp299
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp311 = tmp307 + tmp310
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp313 = libdevice.sqrt(tmp311)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp315 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp316 = tmp314 + tmp315
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp317 = libdevice.pow(tmp306, tmp316)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp318 = tmp315 - tmp317
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp319 = libdevice.sqrt(tmp318)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp320 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp321 = libdevice.pow(tmp320, tmp316)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp322 = tmp315 - tmp321
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp323 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp324 = (tmp323 / tmp322)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp325 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp326 = tmp324 * tmp325
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp327 = -tmp326
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp328 = tmp319 * tmp327
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp329 = (tmp313 / tmp328)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp330 = (tmp323 / tmp327)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp331 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp332 = tmp330 * tmp331
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp333 = tmp329 + tmp332
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp334 = (tmp304 / tmp333)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp335 = tmp312 + tmp334
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr69 + (x7), tmp335, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr70 + (x7), tmp304, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr71 + (x7), tmp311, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_8:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_7
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x8 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp341 = tl.load(in_ptr40 + (x8), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp342 = tl.load(in_ptr41 + (x8), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp347 = tl.load(in_ptr42 + (x8), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp354 = tl.load(in_ptr43 + (x8), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp356 = in_ptr44
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp336 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp337 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp338 = tmp336 >= tmp337
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp339 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp340 = tl.where(tmp338, tmp339, tmp336)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp343 = tmp341 - tmp342
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp344 = tmp340 * tmp343
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp345 = tl.where(tmp338, tmp341, tmp342)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp346 = tmp344 + tmp345
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp348 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp349 = tmp347 * tmp348
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp350 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp351 = tmp341 * tmp350
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp352 = tmp351 * tmp341
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp353 = tmp349 + tmp352
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp355 = libdevice.sqrt(tmp353)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp357 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp358 = tmp356 + tmp357
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp359 = libdevice.pow(tmp348, tmp358)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp360 = tmp357 - tmp359
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp361 = libdevice.sqrt(tmp360)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp362 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp363 = libdevice.pow(tmp362, tmp358)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp364 = tmp357 - tmp363
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp365 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp366 = (tmp365 / tmp364)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp367 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp368 = tmp366 * tmp367
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp369 = -tmp368
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp370 = tmp361 * tmp369
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp371 = (tmp355 / tmp370)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp372 = (tmp365 / tmp369)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp373 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp374 = tmp372 * tmp373
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp375 = tmp371 + tmp374
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp376 = (tmp346 / tmp375)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp377 = tmp354 + tmp376
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr78 + (x8), tmp377, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr79 + (x8), tmp346, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr80 + (x8), tmp353, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_9:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_8
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] x9 = xindex
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp383 = tl.load(in_ptr45 + (x9), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp384 = tl.load(in_ptr46 + (x9), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp389 = tl.load(in_ptr47 + (x9), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp396 = tl.load(in_ptr48 + (x9), None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp398 = in_ptr49
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp378 = 0.09999999999999998
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp379 = 0.5
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp380 = tmp378 >= tmp379
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp381 = -0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp382 = tl.where(tmp380, tmp381, tmp378)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp385 = tmp383 - tmp384
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp386 = tmp382 * tmp385
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp387 = tl.where(tmp380, tmp383, tmp384)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp388 = tmp386 + tmp387
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp390 = 0.999
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp391 = tmp389 * tmp390
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp392 = 0.0010000000000000009
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp393 = tmp383 * tmp392
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp394 = tmp393 * tmp383
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp395 = tmp391 + tmp394
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp397 = libdevice.sqrt(tmp395)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp399 = 1.0
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp400 = tmp398 + tmp399
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp401 = libdevice.pow(tmp390, tmp400)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp402 = tmp399 - tmp401
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp403 = libdevice.sqrt(tmp402)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp404 = 0.9
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp405 = libdevice.pow(tmp404, tmp400)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp406 = tmp399 - tmp405
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp407 = tl.full([1], 1, tl.int32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp408 = (tmp407 / tmp406)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp409 = 0.001
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp410 = tmp408 * tmp409
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp411 = -tmp410
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp412 = tmp403 * tmp411
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp413 = (tmp397 / tmp412)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp414 = (tmp407 / tmp411)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp415 = 1e-08
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp416 = tmp414 * tmp415
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp417 = tmp413 + tmp416
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp418 = (tmp388 / tmp417)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp419 = tmp396 + tmp418
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr87 + (x9), tmp419, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr88 + (x9), tmp388, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr89 + (x9), tmp395, None)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] else:
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] pass
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] ''', device_str='cuda')
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] cpp_fused__foreach_copy_1 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] #include "/tmp/torchinductor_ci-user/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] extern "C" void kernel(const float* in_ptr0,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr1,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr2,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr3,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr4,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr5,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr6,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr7,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr8,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr9,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr1,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr3,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr5,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr7,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr9,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr11,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr13,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr15,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr17,
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr19)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr1[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr3[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr5[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr7[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr9[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr11[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr13[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr15[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr17[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] {
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr19[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] }
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] ''')
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] async_compile.wait(globals())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del async_compile
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] def call(args):
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = args
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] args.clear()
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg0_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg1_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg2_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg3_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg4_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg5_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg6_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg7_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg8_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg9_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg10_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg11_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg12_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg13_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg14_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg15_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg16_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg17_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg18_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg19_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg20_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg21_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg22_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg23_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg24_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg25_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg26_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg27_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg28_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg29_1, (), ())
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg30_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg31_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg32_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg33_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg34_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg35_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg36_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg37_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg38_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg39_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg40_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg41_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg42_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg43_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg44_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg45_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg46_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg47_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg48_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg49_1, (1024, 1024), (1024, 1))
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] with torch.cuda._DeviceGuard(0):
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] torch.cuda.set_device(0)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] stream0 = get_raw_stream(0)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] triton_for_fused_0.run(arg1_1, arg30_1, arg40_1, arg0_1, arg20_1.item(), arg3_1, arg31_1, arg41_1, arg2_1, arg21_1.item(), arg5_1, arg32_1, arg42_1, arg4_1, arg22_1.item(), arg7_1, arg33_1, arg43_1, arg6_1, arg23_1.item(), arg9_1, arg34_1, arg44_1, arg8_1, arg24_1.item(), arg11_1, arg35_1, arg45_1, arg10_1, arg25_1.item(), arg13_1, arg36_1, arg46_1, arg12_1, arg26_1.item(), arg15_1, arg37_1, arg47_1, arg14_1, arg27_1.item(), arg17_1, arg38_1, arg48_1, arg16_1, arg28_1.item(), arg19_1, arg39_1, arg49_1, arg18_1, arg29_1.item(), arg0_1, arg30_1, arg40_1, arg2_1, arg31_1, arg41_1, arg4_1, arg32_1, arg42_1, arg6_1, arg33_1, arg43_1, arg8_1, arg34_1, arg44_1, arg10_1, arg35_1, arg45_1, arg12_1, arg36_1, arg46_1, arg14_1, arg37_1, arg47_1, arg16_1, arg38_1, arg48_1, arg18_1, arg39_1, arg49_1, stream=stream0)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg0_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg10_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg11_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg12_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg13_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg14_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg15_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg16_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg17_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg18_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg19_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg1_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg2_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg30_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg31_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg32_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg33_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg34_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg35_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg36_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg37_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg38_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg39_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg3_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg40_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg41_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg42_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg43_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg44_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg45_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg46_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg47_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg48_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg49_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg4_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg5_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg6_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg7_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg8_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg9_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] cpp_fused__foreach_copy_1(arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg20_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg21_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg22_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg23_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg24_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg25_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg26_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg27_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg28_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg29_1
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] return ()
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._dynamo.testing import rand_strided
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.utils import print_performance
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg10_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg11_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg12_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg13_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg14_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg15_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg16_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg17_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg18_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg19_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg20_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg21_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg22_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg23_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg24_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg25_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg26_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg27_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg28_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg29_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1])
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] return print_performance(fn, times=times, repeat=repeat)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] if __name__ == "__main__":
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.wrapper_benchmark import compiled_module_main
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code] compiled_module_main('None', benchmark_compiled_module)
V0701 22:32:49.037000 26085 torch/_inductor/graph.py:2104] [0/0] [__output_code]
V0701 22:32:49.086000 26085 torch/_inductor/graph.py:2115] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/bx/cbxwuspm7iljtlkypwgm5a6rrandaew4wqmdmng4lzas4ogomxpw.py
I0701 22:32:50.614000 26085 torch/_inductor/graph.py:2149] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/bx/cbxwuspm7iljtlkypwgm5a6rrandaew4wqmdmng4lzas4ogomxpw.py
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] Output code:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] # AOT ID: ['1_inference']
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from ctypes import c_void_p, c_long, c_int
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import torch
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import math
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import random
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import os
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import tempfile
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from math import inf, nan
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from cmath import nanj
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.utils import maybe_profile
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch import device, empty_strided
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton.language as tl
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] aten = torch.ops.aten
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] inductor_ops = torch.ops.inductor
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] _quantized = torch.ops._quantized
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] async_compile = AsyncCompile()
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] # kernel path: /tmp/torchinductor_ci-user/ej/cejr7t4zzqo7llcoxga7clgyc6gs3676lsm4dvilpfw64kudp2ns.py
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Source node to ATen node mapping:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_for_fused_0 = async_compile.triton('triton_for_fused_0', '''
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton.language as tl
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime import triton_helpers, triton_heuristics
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DeviceProperties
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] @triton_heuristics.foreach(
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_warps=8,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr6': '*fp32', 'out_ptr7': '*fp32', 'out_ptr8': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr24': '*fp32', 'out_ptr25': '*fp32', 'out_ptr26': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 'out_ptr35': '*fp32', 'out_ptr42': '*fp32', 'out_ptr43': '*fp32', 'out_ptr44': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr60': '*fp32', 'out_ptr61': '*fp32', 'out_ptr62': '*fp32', 'out_ptr69': '*fp32', 'out_ptr70': '*fp32', 'out_ptr71': '*fp32', 'out_ptr78': '*fp32', 'out_ptr79': '*fp32', 'out_ptr80': '*fp32', 'out_ptr87': '*fp32', 'out_ptr88': '*fp32', 'out_ptr89': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, warp_size=32), 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): [['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], (56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], (60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], (69,): [['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]},
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_0', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr24', 'out_ptr25', 'out_ptr26', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr42', 'out_ptr43', 'out_ptr44', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr6', 'out_ptr60', 'out_ptr61', 'out_ptr62', 'out_ptr69', 'out_ptr7', 'out_ptr70', 'out_ptr71', 'out_ptr78', 'out_ptr79', 'out_ptr8', 'out_ptr80', 'out_ptr87', 'out_ptr88', 'out_ptr89'], 'backend_hash': '1E2C16421D4C3DBA4AD92BFC4278A3CB24C43DEDA6EE7FF9E3FBB1DBB80802DB', 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False},
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] )
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] @triton.jit
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] def triton_for_fused_0(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr6, out_ptr7, out_ptr8, out_ptr15, out_ptr16, out_ptr17, out_ptr24, out_ptr25, out_ptr26, out_ptr33, out_ptr34, out_ptr35, out_ptr42, out_ptr43, out_ptr44, out_ptr51, out_ptr52, out_ptr53, out_ptr60, out_ptr61, out_ptr62, out_ptr69, out_ptr70, out_ptr71, out_ptr78, out_ptr79, out_ptr80, out_ptr87, out_ptr88, out_ptr89):
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid = tl.program_id(0)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] XBLOCK: tl.constexpr = 1024
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_0 = tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] if pid < num_xblocks_0:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x0 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp5 = tl.load(in_ptr0 + (x0), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp6 = tl.load(in_ptr1 + (x0), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp11 = tl.load(in_ptr2 + (x0), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp18 = tl.load(in_ptr3 + (x0), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp20 = in_ptr4
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp0 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp1 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp2 = tmp0 >= tmp1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp3 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp4 = tl.where(tmp2, tmp3, tmp0)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp7 = tmp5 - tmp6
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp8 = tmp4 * tmp7
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp9 = tl.where(tmp2, tmp5, tmp6)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp10 = tmp8 + tmp9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp12 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp13 = tmp11 * tmp12
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp14 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp15 = tmp5 * tmp14
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp16 = tmp15 * tmp5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp17 = tmp13 + tmp16
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp19 = libdevice.sqrt(tmp17)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp21 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp22 = tmp20 + tmp21
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp23 = libdevice.pow(tmp12, tmp22)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp24 = tmp21 - tmp23
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp25 = libdevice.sqrt(tmp24)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp26 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp27 = libdevice.pow(tmp26, tmp22)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp28 = tmp21 - tmp27
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp29 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp30 = (tmp29 / tmp28)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp31 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp32 = tmp30 * tmp31
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp33 = -tmp32
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp34 = tmp25 * tmp33
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp35 = (tmp19 / tmp34)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp36 = (tmp29 / tmp33)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp37 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp38 = tmp36 * tmp37
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp39 = tmp35 + tmp38
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp40 = (tmp10 / tmp39)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp41 = tmp18 + tmp40
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr6 + (x0), tmp41, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr7 + (x0), tmp10, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr8 + (x0), tmp17, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_1:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x1 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp47 = tl.load(in_ptr5 + (x1), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp48 = tl.load(in_ptr6 + (x1), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp53 = tl.load(in_ptr7 + (x1), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp60 = tl.load(in_ptr8 + (x1), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp62 = in_ptr9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp42 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp43 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp44 = tmp42 >= tmp43
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp45 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp46 = tl.where(tmp44, tmp45, tmp42)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp49 = tmp47 - tmp48
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp50 = tmp46 * tmp49
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp51 = tl.where(tmp44, tmp47, tmp48)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp52 = tmp50 + tmp51
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp54 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp55 = tmp53 * tmp54
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp56 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp57 = tmp47 * tmp56
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp58 = tmp57 * tmp47
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp59 = tmp55 + tmp58
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp61 = libdevice.sqrt(tmp59)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp63 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp64 = tmp62 + tmp63
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp65 = libdevice.pow(tmp54, tmp64)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp66 = tmp63 - tmp65
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp67 = libdevice.sqrt(tmp66)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp68 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp69 = libdevice.pow(tmp68, tmp64)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp70 = tmp63 - tmp69
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp71 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp72 = (tmp71 / tmp70)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp73 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp74 = tmp72 * tmp73
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp75 = -tmp74
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp76 = tmp67 * tmp75
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp77 = (tmp61 / tmp76)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp78 = (tmp71 / tmp75)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp79 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp80 = tmp78 * tmp79
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp81 = tmp77 + tmp80
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp82 = (tmp52 / tmp81)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp83 = tmp60 + tmp82
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr15 + (x1), tmp83, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr16 + (x1), tmp52, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr17 + (x1), tmp59, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_2:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x2 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp89 = tl.load(in_ptr10 + (x2), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp90 = tl.load(in_ptr11 + (x2), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp95 = tl.load(in_ptr12 + (x2), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp102 = tl.load(in_ptr13 + (x2), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp104 = in_ptr14
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp84 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp85 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp86 = tmp84 >= tmp85
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp87 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp88 = tl.where(tmp86, tmp87, tmp84)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp91 = tmp89 - tmp90
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp92 = tmp88 * tmp91
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp93 = tl.where(tmp86, tmp89, tmp90)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp94 = tmp92 + tmp93
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp96 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp97 = tmp95 * tmp96
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp98 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp99 = tmp89 * tmp98
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp100 = tmp99 * tmp89
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp101 = tmp97 + tmp100
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp103 = libdevice.sqrt(tmp101)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp105 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp106 = tmp104 + tmp105
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp107 = libdevice.pow(tmp96, tmp106)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp108 = tmp105 - tmp107
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp109 = libdevice.sqrt(tmp108)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp110 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp111 = libdevice.pow(tmp110, tmp106)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp112 = tmp105 - tmp111
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp113 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp114 = (tmp113 / tmp112)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp115 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp116 = tmp114 * tmp115
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp117 = -tmp116
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp118 = tmp109 * tmp117
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp119 = (tmp103 / tmp118)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp120 = (tmp113 / tmp117)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp121 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp122 = tmp120 * tmp121
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp123 = tmp119 + tmp122
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp124 = (tmp94 / tmp123)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp125 = tmp102 + tmp124
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr24 + (x2), tmp125, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr25 + (x2), tmp94, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr26 + (x2), tmp101, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_3:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_2
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x3 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp131 = tl.load(in_ptr15 + (x3), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp132 = tl.load(in_ptr16 + (x3), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp137 = tl.load(in_ptr17 + (x3), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp144 = tl.load(in_ptr18 + (x3), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp146 = in_ptr19
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp126 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp127 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp128 = tmp126 >= tmp127
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp129 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp130 = tl.where(tmp128, tmp129, tmp126)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp133 = tmp131 - tmp132
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp134 = tmp130 * tmp133
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp135 = tl.where(tmp128, tmp131, tmp132)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp136 = tmp134 + tmp135
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp138 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp139 = tmp137 * tmp138
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp140 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp141 = tmp131 * tmp140
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp142 = tmp141 * tmp131
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp143 = tmp139 + tmp142
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp145 = libdevice.sqrt(tmp143)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp147 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp148 = tmp146 + tmp147
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp149 = libdevice.pow(tmp138, tmp148)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp150 = tmp147 - tmp149
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp151 = libdevice.sqrt(tmp150)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp152 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp153 = libdevice.pow(tmp152, tmp148)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp154 = tmp147 - tmp153
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp155 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp156 = (tmp155 / tmp154)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp157 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp158 = tmp156 * tmp157
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp159 = -tmp158
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp160 = tmp151 * tmp159
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp161 = (tmp145 / tmp160)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp162 = (tmp155 / tmp159)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp163 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp164 = tmp162 * tmp163
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp165 = tmp161 + tmp164
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp166 = (tmp136 / tmp165)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp167 = tmp144 + tmp166
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr33 + (x3), tmp167, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr34 + (x3), tmp136, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr35 + (x3), tmp143, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_4:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_3
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x4 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp173 = tl.load(in_ptr20 + (x4), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp174 = tl.load(in_ptr21 + (x4), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp179 = tl.load(in_ptr22 + (x4), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp186 = tl.load(in_ptr23 + (x4), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp188 = in_ptr24
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp168 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp169 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp170 = tmp168 >= tmp169
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp171 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp172 = tl.where(tmp170, tmp171, tmp168)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp175 = tmp173 - tmp174
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp176 = tmp172 * tmp175
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp177 = tl.where(tmp170, tmp173, tmp174)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp178 = tmp176 + tmp177
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp180 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp181 = tmp179 * tmp180
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp182 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp183 = tmp173 * tmp182
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp184 = tmp183 * tmp173
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp185 = tmp181 + tmp184
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp187 = libdevice.sqrt(tmp185)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp189 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp190 = tmp188 + tmp189
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp191 = libdevice.pow(tmp180, tmp190)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp192 = tmp189 - tmp191
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp193 = libdevice.sqrt(tmp192)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp194 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp195 = libdevice.pow(tmp194, tmp190)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp196 = tmp189 - tmp195
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp197 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp198 = (tmp197 / tmp196)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp199 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp200 = tmp198 * tmp199
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp201 = -tmp200
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp202 = tmp193 * tmp201
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp203 = (tmp187 / tmp202)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp204 = (tmp197 / tmp201)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp205 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp206 = tmp204 * tmp205
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp207 = tmp203 + tmp206
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp208 = (tmp178 / tmp207)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp209 = tmp186 + tmp208
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr42 + (x4), tmp209, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr43 + (x4), tmp178, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr44 + (x4), tmp185, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_5:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_4
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x5 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp215 = tl.load(in_ptr25 + (x5), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp216 = tl.load(in_ptr26 + (x5), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp221 = tl.load(in_ptr27 + (x5), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp228 = tl.load(in_ptr28 + (x5), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp230 = in_ptr29
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp210 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp211 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp212 = tmp210 >= tmp211
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp213 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp214 = tl.where(tmp212, tmp213, tmp210)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp217 = tmp215 - tmp216
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp218 = tmp214 * tmp217
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp219 = tl.where(tmp212, tmp215, tmp216)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp220 = tmp218 + tmp219
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp222 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp223 = tmp221 * tmp222
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp224 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp225 = tmp215 * tmp224
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp226 = tmp225 * tmp215
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp227 = tmp223 + tmp226
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp229 = libdevice.sqrt(tmp227)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp231 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp232 = tmp230 + tmp231
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp233 = libdevice.pow(tmp222, tmp232)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp234 = tmp231 - tmp233
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp235 = libdevice.sqrt(tmp234)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp236 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp237 = libdevice.pow(tmp236, tmp232)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp238 = tmp231 - tmp237
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp239 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp240 = (tmp239 / tmp238)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp241 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp242 = tmp240 * tmp241
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp243 = -tmp242
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp244 = tmp235 * tmp243
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp245 = (tmp229 / tmp244)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp246 = (tmp239 / tmp243)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp247 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp248 = tmp246 * tmp247
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp249 = tmp245 + tmp248
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp250 = (tmp220 / tmp249)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp251 = tmp228 + tmp250
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr51 + (x5), tmp251, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr52 + (x5), tmp220, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr53 + (x5), tmp227, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_6:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x6 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp257 = tl.load(in_ptr30 + (x6), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp258 = tl.load(in_ptr31 + (x6), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp263 = tl.load(in_ptr32 + (x6), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp270 = tl.load(in_ptr33 + (x6), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp272 = in_ptr34
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp252 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp253 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp254 = tmp252 >= tmp253
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp255 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp256 = tl.where(tmp254, tmp255, tmp252)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp259 = tmp257 - tmp258
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp260 = tmp256 * tmp259
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp261 = tl.where(tmp254, tmp257, tmp258)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp262 = tmp260 + tmp261
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp264 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp265 = tmp263 * tmp264
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp266 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp267 = tmp257 * tmp266
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp268 = tmp267 * tmp257
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp269 = tmp265 + tmp268
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp271 = libdevice.sqrt(tmp269)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp273 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp274 = tmp272 + tmp273
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp275 = libdevice.pow(tmp264, tmp274)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp276 = tmp273 - tmp275
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp277 = libdevice.sqrt(tmp276)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp278 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp279 = libdevice.pow(tmp278, tmp274)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp280 = tmp273 - tmp279
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp281 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp282 = (tmp281 / tmp280)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp283 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp284 = tmp282 * tmp283
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp285 = -tmp284
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp286 = tmp277 * tmp285
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp287 = (tmp271 / tmp286)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp288 = (tmp281 / tmp285)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp289 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp290 = tmp288 * tmp289
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp291 = tmp287 + tmp290
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp292 = (tmp262 / tmp291)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp293 = tmp270 + tmp292
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr60 + (x6), tmp293, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr61 + (x6), tmp262, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr62 + (x6), tmp269, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_7:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_6
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x7 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp299 = tl.load(in_ptr35 + (x7), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp300 = tl.load(in_ptr36 + (x7), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp305 = tl.load(in_ptr37 + (x7), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp312 = tl.load(in_ptr38 + (x7), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp314 = in_ptr39
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp294 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp295 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp296 = tmp294 >= tmp295
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp297 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp298 = tl.where(tmp296, tmp297, tmp294)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp301 = tmp299 - tmp300
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp302 = tmp298 * tmp301
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp303 = tl.where(tmp296, tmp299, tmp300)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp304 = tmp302 + tmp303
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp306 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp307 = tmp305 * tmp306
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp308 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp309 = tmp299 * tmp308
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp310 = tmp309 * tmp299
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp311 = tmp307 + tmp310
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp313 = libdevice.sqrt(tmp311)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp315 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp316 = tmp314 + tmp315
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp317 = libdevice.pow(tmp306, tmp316)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp318 = tmp315 - tmp317
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp319 = libdevice.sqrt(tmp318)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp320 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp321 = libdevice.pow(tmp320, tmp316)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp322 = tmp315 - tmp321
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp323 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp324 = (tmp323 / tmp322)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp325 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp326 = tmp324 * tmp325
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp327 = -tmp326
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp328 = tmp319 * tmp327
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp329 = (tmp313 / tmp328)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp330 = (tmp323 / tmp327)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp331 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp332 = tmp330 * tmp331
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp333 = tmp329 + tmp332
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp334 = (tmp304 / tmp333)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp335 = tmp312 + tmp334
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr69 + (x7), tmp335, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr70 + (x7), tmp304, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr71 + (x7), tmp311, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_8:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_7
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x8 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp341 = tl.load(in_ptr40 + (x8), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp342 = tl.load(in_ptr41 + (x8), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp347 = tl.load(in_ptr42 + (x8), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp354 = tl.load(in_ptr43 + (x8), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp356 = in_ptr44
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp336 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp337 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp338 = tmp336 >= tmp337
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp339 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp340 = tl.where(tmp338, tmp339, tmp336)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp343 = tmp341 - tmp342
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp344 = tmp340 * tmp343
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp345 = tl.where(tmp338, tmp341, tmp342)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp346 = tmp344 + tmp345
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp348 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp349 = tmp347 * tmp348
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp350 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp351 = tmp341 * tmp350
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp352 = tmp351 * tmp341
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp353 = tmp349 + tmp352
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp355 = libdevice.sqrt(tmp353)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp357 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp358 = tmp356 + tmp357
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp359 = libdevice.pow(tmp348, tmp358)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp360 = tmp357 - tmp359
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp361 = libdevice.sqrt(tmp360)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp362 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp363 = libdevice.pow(tmp362, tmp358)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp364 = tmp357 - tmp363
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp365 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp366 = (tmp365 / tmp364)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp367 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp368 = tmp366 * tmp367
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp369 = -tmp368
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp370 = tmp361 * tmp369
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp371 = (tmp355 / tmp370)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp372 = (tmp365 / tmp369)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp373 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp374 = tmp372 * tmp373
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp375 = tmp371 + tmp374
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp376 = (tmp346 / tmp375)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp377 = tmp354 + tmp376
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr78 + (x8), tmp377, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr79 + (x8), tmp346, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr80 + (x8), tmp353, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_9:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_8
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] x9 = xindex
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp383 = tl.load(in_ptr45 + (x9), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp384 = tl.load(in_ptr46 + (x9), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp389 = tl.load(in_ptr47 + (x9), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp396 = tl.load(in_ptr48 + (x9), None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp398 = in_ptr49
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp378 = 0.09999999999999998
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp379 = 0.5
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp380 = tmp378 >= tmp379
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp381 = -0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp382 = tl.where(tmp380, tmp381, tmp378)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp385 = tmp383 - tmp384
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp386 = tmp382 * tmp385
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp387 = tl.where(tmp380, tmp383, tmp384)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp388 = tmp386 + tmp387
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp390 = 0.999
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp391 = tmp389 * tmp390
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp392 = 0.0010000000000000009
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp393 = tmp383 * tmp392
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp394 = tmp393 * tmp383
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp395 = tmp391 + tmp394
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp397 = libdevice.sqrt(tmp395)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp399 = 1.0
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp400 = tmp398 + tmp399
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp401 = libdevice.pow(tmp390, tmp400)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp402 = tmp399 - tmp401
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp403 = libdevice.sqrt(tmp402)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp404 = 0.9
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp405 = libdevice.pow(tmp404, tmp400)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp406 = tmp399 - tmp405
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp407 = tl.full([1], 1, tl.int32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp408 = (tmp407 / tmp406)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp409 = 0.001
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp410 = tmp408 * tmp409
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp411 = -tmp410
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp412 = tmp403 * tmp411
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp413 = (tmp397 / tmp412)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp414 = (tmp407 / tmp411)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp415 = 1e-08
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp416 = tmp414 * tmp415
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp417 = tmp413 + tmp416
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp418 = (tmp388 / tmp417)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp419 = tmp396 + tmp418
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr87 + (x9), tmp419, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr88 + (x9), tmp388, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr89 + (x9), tmp395, None)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] else:
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] pass
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] ''', device_str='cuda')
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] cpp_fused__foreach_copy_1 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] #include "/tmp/torchinductor_ci-user/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] extern "C" void kernel(const float* in_ptr0,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr1,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr2,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr3,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr4,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr5,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr6,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr7,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr8,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr9,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr1,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr3,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr5,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr7,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr9,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr11,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr13,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr15,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr17,
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr19)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr1[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr3[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr5[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr7[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr9[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr11[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr13[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr15[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr17[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr19[static_cast<int64_t>(0L)] = tmp2;
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] ''')
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] async_compile.wait(globals())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del async_compile
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] def call(args):
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = args
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] args.clear()
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg0_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg1_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg2_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg3_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg4_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg5_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg6_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg7_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg8_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg9_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg10_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg11_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg12_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg13_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg14_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg15_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg16_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg17_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg18_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg19_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg20_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg21_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg22_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg23_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg24_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg25_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg26_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg27_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg28_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg29_1, (), ())
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg30_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg31_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg32_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg33_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg34_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg35_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg36_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg37_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg38_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg39_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg40_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg41_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg42_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg43_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg44_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg45_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg46_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg47_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg48_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg49_1, (1024, 1024), (1024, 1))
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] with torch.cuda._DeviceGuard(0):
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] torch.cuda.set_device(0)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Unsorted Source Nodes: [], Original ATen: []
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] stream0 = get_raw_stream(0)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_for_fused_0.run(arg1_1, arg30_1, arg40_1, arg0_1, arg20_1.item(), arg3_1, arg31_1, arg41_1, arg2_1, arg21_1.item(), arg5_1, arg32_1, arg42_1, arg4_1, arg22_1.item(), arg7_1, arg33_1, arg43_1, arg6_1, arg23_1.item(), arg9_1, arg34_1, arg44_1, arg8_1, arg24_1.item(), arg11_1, arg35_1, arg45_1, arg10_1, arg25_1.item(), arg13_1, arg36_1, arg46_1, arg12_1, arg26_1.item(), arg15_1, arg37_1, arg47_1, arg14_1, arg27_1.item(), arg17_1, arg38_1, arg48_1, arg16_1, arg28_1.item(), arg19_1, arg39_1, arg49_1, arg18_1, arg29_1.item(), arg0_1, arg30_1, arg40_1, arg2_1, arg31_1, arg41_1, arg4_1, arg32_1, arg42_1, arg6_1, arg33_1, arg43_1, arg8_1, arg34_1, arg44_1, arg10_1, arg35_1, arg45_1, arg12_1, arg36_1, arg46_1, arg14_1, arg37_1, arg47_1, arg16_1, arg38_1, arg48_1, arg18_1, arg39_1, arg49_1, stream=stream0)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg0_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg10_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg11_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg12_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg13_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg14_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg15_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg16_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg17_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg18_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg19_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg1_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg2_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg30_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg31_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg32_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg33_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg34_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg35_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg36_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg37_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg38_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg39_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg3_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg40_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg41_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg42_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg43_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg44_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg45_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg46_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg47_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg48_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg49_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg4_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg5_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg6_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg7_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg8_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg9_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] cpp_fused__foreach_copy_1(arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg20_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg21_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg22_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg23_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg24_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg25_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg26_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg27_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg28_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg29_1
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] return ()
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._dynamo.testing import rand_strided
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.utils import print_performance
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg10_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg11_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg12_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg13_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg14_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg15_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg16_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg17_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg18_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg19_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg20_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg21_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg22_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg23_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg24_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg25_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg26_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg27_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg28_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg29_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1])
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] return print_performance(fn, times=times, repeat=repeat)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] if __name__ == "__main__":
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.wrapper_benchmark import compiled_module_main
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code] compiled_module_main('None', benchmark_compiled_module)
V0701 22:32:53.457000 26085 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0701 22:32:53.505000 26085 torch/_inductor/graph.py:2115] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/65/c655isihixkazmceuwbfqagiscwkui2zsppjfrucnr3s5l4gahqw.py
I0701 22:32:53.543000 26085 torch/_inductor/graph.py:2149] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/65/c655isihixkazmceuwbfqagiscwkui2zsppjfrucnr3s5l4gahqw.py
eager runtime: 1213.2122499997422us
compiled runtime: 754.8615094149355us
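To put the two timings in perspective, the short calculation below derives the speedup and the relative reduction from the numbers printed above. It is only arithmetic on this one run; the variable names are illustrative, and results will vary with hardware and PyTorch version.

```python
# Timings copied from the script output above, in microseconds.
eager_us = 1213.2122499997422
compiled_us = 754.8615094149355

# Ratio of the two runtimes: how many times faster the compiled run was.
speedup = eager_us / compiled_us
# Absolute time saved per measured call.
saved_us = eager_us - compiled_us
# Saved time as a percentage of the eager runtime.
reduction_pct = 100.0 * saved_us / eager_us

print(f"speedup:   {speedup:.2f}x")
print(f"saved:     {saved_us:.1f} us")
print(f"reduction: {reduction_pct:.1f}%")
```

For this run the compiled version is roughly 1.6x faster than the eager baseline.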
Conclusion¶
In this tutorial, we implemented a custom, fully fused Adam optimizer using foreach_map. After torch.compile, the generated code shown above launches a single horizontally fused Triton kernel (triton_for_fused_0) for the parameter and optimizer-state updates, plus a small CPU kernel that increments the step counters. In the run shown above, the compiled version took about 755 us versus about 1213 us for the eager baseline, roughly a 1.6x speedup. Because foreach_map can convert any pointwise op in torch into a horizontally fused foreach variant, the same approach applies to update rules for which no torch._foreach* op exists.
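For intuition about what is being fused, the semantics of a foreach-style map can be modeled in plain Python. The sketch below is not the PyTorch implementation: `foreach_map_reference` and `lerp` are hypothetical helpers written for this illustration, and the sketch assumes that non-list arguments are reused unchanged for every list position.

```python
from typing import Any, Callable, List


def foreach_map_reference(op: Callable[..., Any], *args: Any, **kwargs: Any) -> List[Any]:
    """Hypothetical pure-Python model: apply `op` once per list position."""
    list_args = [a for a in args if isinstance(a, (list, tuple))]
    if not list_args:
        raise ValueError("expected at least one list argument")
    length = len(list_args[0])
    if any(len(a) != length for a in list_args):
        raise ValueError("all list arguments must have the same length")

    results = []
    for i in range(length):
        # List arguments contribute their i-th element; scalars are reused as-is.
        call_args = [a[i] if isinstance(a, (list, tuple)) else a for a in args]
        results.append(op(*call_args, **kwargs))
    return results


def lerp(start: float, end: float, weight: float) -> float:
    """Illustrative scalar linear interpolation (a stand-in for a pointwise op)."""
    return start + weight * (end - start)


# Ten step counters incremented by one, mirroring the step-update kernel in the log.
steps = [0.0] * 10
updated_steps = foreach_map_reference(lambda x: x + 1.0, steps)

# A mixed list/scalar call: two lists plus one scalar weight.
exp_avgs = [0.0, 1.0, 2.0]
grads = [1.0, 1.0, 1.0]
new_exp_avgs = foreach_map_reference(lerp, exp_avgs, grads, 0.5)

print(updated_steps)
print(new_exp_avgs)
```

The Python loop here runs one call per list element. The point of horizontal fusion is that the compiled version replaces such a loop with a single kernel launch covering all list elements.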
See also:
Compiled optimizer tutorial - an introduction to the compiled optimizer.
Compiling the optimizer with PT2 - deeper technical details on the compiled optimizer.
Total running time of the script: ( 0 minutes 12.659 seconds)