ud_ep.c:714 Assertion `UCT_UD_PSN_COMPARE(ep->tx.acked_psn, <, ep->tx.psn)' failed: ep 0x1d767e0: flags=0x28 acked_psn=3 must be smaller than current_psn=1 #11107

@ivanallen

Describe the bug

Operation

[4/9]
============================================================
case: 002/2
description: single device down
location: 10.10.16.168
command: ifdown ens2f0np0
recovery: ifup ens2f0np0
duration: 30s
============================================================

[10:44:31] ▶ injecting fault on 10.10.16.168...
[10:44:32] exit code: 0
Device 'ens2f0np0' successfully disconnected.
⏳ waiting 30 seconds... [██████████████████████████████] done
[10:45:01] ◀ executing recovery on 10.10.16.168...
[10:45:01] exit code: 0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/93)
[10:45:01] ✓ case 002/2 completed

Assertion failure on 10.10.16.168:

[2026-01-09 02:45:02.065]
ud_ep.c:714  Assertion `UCT_UD_PSN_COMPARE(ep->tx.acked_psn, <, ep->tx.psn)' failed: ep 0x1d767e0: flags=0x28 acked_psn=3 must be smaller than current_psn=1
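
The failed check can be reproduced in isolation. The sketch below assumes that PSNs are 16-bit counters and that `UCT_UD_PSN_COMPARE` orders them through a signed 16-bit difference, so that ordering survives wrap-around. That definition is an assumption and is not stated in this report; `psn_diff` and `psn_less` are hypothetical helper names. With the values from the assertion message, the acknowledged PSN is ahead of the next PSN to send, so the `<` comparison is false.

```python
def psn_diff(a: int, b: int) -> int:
    """Signed 16-bit difference a - b (assumes 16-bit wrap-around PSNs)."""
    d = (a - b) & 0xFFFF
    return d - 0x10000 if d & 0x8000 else d


def psn_less(a: int, b: int) -> bool:
    """True if PSN a is ordered before PSN b under wrap-around arithmetic."""
    return psn_diff(a, b) < 0


# Values taken from the assertion message in this report.
acked_psn, psn = 3, 1

print(psn_less(acked_psn, psn))  # False: the asserted condition does not hold
print(psn_diff(acked_psn, psn))  # 2: acked_psn is two ahead of psn
```

Under this assumption the assertion would only hold if `tx.psn` were at least 4, which suggests the ACK carries a PSN from a sequence that this endpoint has not sent.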

coredump:

#0  0x00007f92093d0edc in __pthread_kill_implementation () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f91d8d92640 (LWP 7247))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-168.el9_6.24.x86_64 libaio-0.3.111-13.el9.x86_64 libdwarf-0.3.4-1.el9.1.x86_64 libgcc-11.5.0-5.el9_5.x86_64 libibverbs-54.0-2.el9_6.x86_64 libnl3-3.11.0-1.el9.x86_64 librdmacm-54.0-2.el9_6.x86_64 libstdc++-11.3.1-2.1.el9.x86_64 liburing-2.5-1.el9.x86_64 libzstd-1.5.5-1.el9.x86_64 lz4-libs-1.9.3-5.el9.x86_64 numactl-libs-2.0.19-1.el9.x86_64 openssl-libs-3.2.2-6.el9_5.1.x86_64 zlib-1.2.11-40.el9.x86_64
(gdb) bt
#0  0x00007f92093d0edc in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f9209383b46 in raise () from /lib64/libc.so.6
#2  0x00007f920936d833 in abort () from /lib64/libc.so.6
#3  0x00007f9209e022ae in ucs_fatal_error_message (file=file@entry=0x7f9209ec1599 "ud/base/ud_ep.c", line=line@entry=714, function=function@entry=0x7f9209ec2b10 <__func__.16> "uct_ud_ep_process_ack",
    message_buf=message_buf@entry=0x4cf1220 "Assertion `UCT_UD_PSN_COMPARE(ep->tx.acked_psn, <, ep->tx.psn)' failed: ep 0x1d767e0: flags=0x28 acked_psn=3 must be smaller than current_psn=1") at debug/assert.c:38
#4  0x00007f9209e02381 in ucs_fatal_error_format (file=file@entry=0x7f9209ec1599 "ud/base/ud_ep.c", line=line@entry=714, function=function@entry=0x7f9209ec2b10 <__func__.16> "uct_ud_ep_process_ack",
    format=format@entry=0x7f9209ec25b0 "Assertion `%s' failed: ep %p: flags=0x%x acked_psn=%u must be smaller than current_psn=%u") at debug/assert.c:53
#5  0x00007f9209dc2242 in uct_ud_ep_process_ack (is_async=0, ack_psn=<optimized out>, ep=0x1d767e0, iface=0x2004000) at ud/base/ud_ep.c:714
#6  uct_ud_ep_process_rx (iface=iface@entry=0x2004000, neth=0x7f91ffdcda28, byte_len=16, skb=0x7f91ffdcd9d4, is_async=is_async@entry=0) at ud/base/ud_ep.c:1009
#7  0x00007f9209dc7f11 in uct_ud_verbs_iface_poll_rx (is_async=0, iface=0x2004000) at ud/verbs/ud_verbs.c:422
#8  uct_ud_verbs_iface_progress (tl_iface=0x2004000) at ud/verbs/ud_verbs.c:463
#9  0x00007f9209d04c1a in ucs_callbackq_dispatch (cbq=<optimized out>) at /root/liufeng/xrpc/xmake_globaldir/.xmake/cache/packages/2511/u/ucx/1.18.0/source/src/ucs/datastruct/callbackq.h:215
#10 uct_worker_progress (worker=<optimized out>) at /root/liufeng/xrpc/xmake_globaldir/.xmake/cache/packages/2511/u/ucx/1.18.0/source/src/uct/api/uct.h:2813
#11 ucp_worker_progress (worker=0x2084000) at core/ucp_worker.c:3033
#12 0x00007f9209a0aff8 in xrpc::Channel::poll (this=this@entry=0x1d70400) at src/channel.cc:1791
#13 0x00007f920990f340 in operator() (__closure=0x4cf1f30) at examples/abyss/client/engine.cc:164

(gdb) f 5
#5  0x00007f9209dc2242 in uct_ud_ep_process_ack (is_async=0, ack_psn=<optimized out>, ep=0x1d767e0, iface=0x2004000) at ud/base/ud_ep.c:714
714     ud/base/ud_ep.c: No such file or directory.
(gdb) p ep
$1 = (uct_ud_ep_t *) 0x1d767e0
(gdb) p ep[0]
$2 = {super = {super = {iface = 0x2004000}}, ep_id = 0, dest_ep_id = 0, tx = {psn = 1, max_psn = 3, acked_psn = 3, resend_count = 0, window = {head = 0x0, ptail = 0x1d767f8}, pending = {group = {tail = 0x0, guard = 0, arbiter = 0x0},
      ops = 0, elem = {list = {prev = 0x0, next = 0x0}, next = 0x0, group = 0x0}}, send_time = 0, resend_time = 0, tick = 21000000}, rx = {acked_psn = 0, ooo_pkts = {list = {head = 0x0, ptail = 0x1d76868}, ready_list = {head = 0x0,
        ptail = 0x1d76878}, head_sn = 0, elem_count = 0, list_count = 0, max_holes = 0}}, ca = {wmax = 1025, cwnd = 2}, resend = {pos = 0x1d767f8, psn = 1, max_psn = 0}, conn_match = {list = {list = {prev = 0x0, next = 0x0}}},
  conn_sn = 0, flags = 40, rx_creq_count = 0 '\000', path_index = 0 '\000', timer = {cb = 0x7f9209dbed60 <uct_ud_ep_timer>, list = {prev = 0x0, next = 0x0}, is_active = 0}, close_time = 0}

(gdb) f 6
#6  uct_ud_ep_process_rx (iface=iface@entry=0x2004000, neth=0x7f91ffdcda28, byte_len=16, skb=0x7f91ffdcd9d4, is_async=is_async@entry=0) at ud/base/ud_ep.c:1009
1009    in ud/base/ud_ep.c
(gdb) p neth
$3 = (uct_ud_neth_t *) 0x7f91ffdcda28
(gdb) p neth[0]
$4 = {packet_type = 301989888, psn = 3, ack_psn = 3}
(gdb) info local
dest_id = 0
is_am = 0
am_id = 2
ep = 0x1d767e0
ooo_type = <optimized out>
__func__ = "uct_ud_ep_process_rx"
(gdb)
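
The `packet_type` value is easier to read in hexadecimal. The sketch below splits it using a bit layout that is an assumption (destination endpoint id in the low 24 bits, an AM flag at bit 24, an AM id in the top 5 bits). The layout is not stated in this report, but it reproduces the `dest_id`, `is_am`, and `am_id` locals printed by gdb. The last line checks that `flags = 40` in the endpoint dump is the same value as `flags=0x28` in the assertion message. `decode_packet_type` is a hypothetical helper name.

```python
def decode_packet_type(pt: int) -> dict:
    """Split a 32-bit packet_type word (bit layout is an assumption)."""
    return {
        "hex": f"0x{pt:08x}",
        "dest_id": pt & 0xFFFFFF,   # low 24 bits
        "is_am": (pt >> 24) & 1,    # bit 24
        "am_id": pt >> 27,          # top 5 bits
    }


# neth[0].packet_type as printed by gdb in frame 6.
fields = decode_packet_type(301989888)
print(fields["hex"])                                          # 0x12000000
print(fields["dest_id"], fields["is_am"], fields["am_id"])    # 0 0 2

# ep->flags: decimal in the struct dump, hex in the assertion message.
print(hex(40))  # 0x28
```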

Steps to Reproduce

  • Command line
  • UCX version: 1.18.0
  • Any UCX environment variables used
export UCX_NET_DEVICES=mlx5_0:1,ens2f0np0
export UCX_HANDLE_ERRORS=none #bt,freeze,debug,none
export UCX_LOG_LEVEL=debug
export UCX_RC_RETRY_COUNT=2
export UCX_DC_MLX5_RETRY_COUNT=2
export UCX_UD_LINGER_TIMEOUT=10s
export UCX_RDMA_CM_TIMEOUT=10s
export UCX_TLS=rc,tcp

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
Linux G02-100G-ASW016-M08 5.14.0-162.nos.4.el8.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov 24 07:51:00 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • libibverbs-54.0-2.el9_6.x86_64
      • MLNX_OFED_LINUX-5.8-4.1.5.0:
    • HW information from ibstat or ibv_devinfo -vv command
        Firmware version: 22.36.1010
        Hardware version: 0
        Node GUID: 0xa088c20300b4a46e
        System image GUID: 0xa088c20300b4a46e
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xa288c2fffeb4a46e
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4125
        Number of ports: 1
        Firmware version: 22.36.1010
        Hardware version: 0
        Node GUID: 0xa088c20300b4a46f
        System image GUID: 0xa088c20300b4a46e
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xa288c2fffeb4a46f
                Link layer: Ethernet
CA 'mlx5_2'
        CA type: MT4125
        Number of ports: 1
        Firmware version: 22.35.3502
        Hardware version: 0
        Node GUID: 0xe8ebd303003a6d44
        System image GUID: 0xe8ebd303003a6d44
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xeaebd3fffe3a6d44
                Link layer: Ethernet
CA 'mlx5_3'
        CA type: MT4125
        Number of ports: 1
        Firmware version: 22.35.3502
        Hardware version: 0
        Node GUID: 0xe8ebd303003a6d45
        System image GUID: 0xe8ebd303003a6d44
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xeaebd3fffe3a6d45
                Link layer: Ethernet

More Info

server ucx log:

[1767926702.078171] [G02-100G-ASW016-M07:606  :1]             sys.c:440  UCX  DEBUG   failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/4: Invalid argument
[1767926702.078186] [G02-100G-ASW016-M07:606  :1]       ib_device.c:1456 UCX  DIAG    failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.078202] [G02-100G-ASW016-M07:606  :1]             sys.c:440  UCX  DEBUG   failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/4: Invalid argument
[1767926702.078206] [G02-100G-ASW016-M07:606  :1]       ib_device.c:1456 UCX  DIAG    failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.078215] [G02-100G-ASW016-M07:606  :1]             sys.c:440  UCX  DEBUG   failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/4: Invalid argument
[1767926702.078218] [G02-100G-ASW016-M07:606  :1]       ib_device.c:1456 UCX  DIAG    failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.078301] [G02-100G-ASW016-M07:606  :1]          wireup.c:1214 UCX  DEBUG   ep 0x7fad251a3058: am_lane 1 wireup_msg_lane 1 cm_lane 0 keepalive_lane 2 reachable_mds 0x1
[1767926702.078304] [G02-100G-ASW016-M07:606  :1]          wireup.c:1237 UCX  DEBUG   ep 0x7fad251a3058: lane[0]: cm rdmacm
[1767926702.078308] [G02-100G-ASW016-M07:606  :1]          wireup.c:1237 UCX  DEBUG   ep 0x7fad251a3058: lane[1]:  0:rc_verbs/mlx5_0:1.0 md[0]     -> addr[0].md[0]/ib/sysdev[0] seg 4294967295 rma_bw#0 am am_bw#0 wireup
[1767926702.078312] [G02-100G-ASW016-M07:606  :1]          wireup.c:1237 UCX  DEBUG   ep 0x7fad251a3058: lane[2]:  1:ud_verbs/mlx5_0:1.0 md[0]     -> addr[1].md[0]/ib/sysdev[0] seg 4294967295 keepalive
[1767926702.078314] [G02-100G-ASW016-M07:606  :1]          wireup.c:1240 UCX  DEBUG   ep 0x7fad251a3058: err mode 1, flags 0x0
[1767926702.078318] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:405  UCX  DEBUG   created ep ep=0x4994b40 iface=0x812e000 id=0
[1767926702.078321] [G02-100G-ASW016-M07:606  :1]       wireup_ep.c:508  UCX  DEBUG   ep 0x7fad251a3058: wireup_ep 0xefc4900 created next_ep 0x4994b40 to <no debug data> using ud_verbs/mlx5_0:1
[1767926702.078327] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:689  UCX  DEBUG   mlx5_0:1/RoCE slid 0 qpn 0xa6bd epid 0 connected to ::ffff:10.10.100.168mtu 1024 pkey 0xffff  qpn 0xc45a epid 0
[1767926702.078331] [G02-100G-ASW016-M07:606  :1]        ib_iface.c:936  UCX  DEBUG   iface 0x812e000: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.10.100.168 flow_label=0xffffffff sgid_index=4 traffic_class=106
[1767926702.078385] [G02-100G-ASW016-M07:606  :1]       wireup_ep.c:440  UCX  DEBUG ep 0x7fad251a3058: destroy wireup ep 0xf34cc00
[1767926702.078388] [G02-100G-ASW016-M07:606  :1]       wireup_ep.c:440  UCX  DEBUG ep 0x7fad251a3058: destroy wireup ep 0xf34cf00
[1767926702.078390] [G02-100G-ASW016-M07:606  :1]       wireup_ep.c:440  UCX  DEBUG ep 0x7fad251a3058: destroy wireup ep 0xefc4900
[1767926702.165810] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:93   UCX  DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926702.165816] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1424 UCX  DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926702.165818] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1430 UCX  DEBUG ep(0x49945a0): resending completed
[1767926702.346146] [G02-100G-ASW016-M07:606  :1]       rdmacm_cm.c:659  UCX  DEBUG [cep 0xf358640 10.10.100.167:8124->10.10.100.168:54144 server Success] got disconnect event, status Success (0)
[1767926702.346157] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1474 UCX  DEBUG ep 0x7fad251a3058: set_ep_failed status Connection reset by remote peer on lane[0]=0xf358640
[1767926702.346169] [G02-100G-ASW016-M07:606  :1]    rdmacm_cm_ep.c:769  UCX  DEBUG [cep 0xf358640 10.10.100.167:8124->10.10.100.168:54144 server Success]: (id=0xf3a4280) disconnected from peer 10.10.100.168:54144
[1767926702.346172] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1437 UCX  DEBUG ep 0x7fad251a3058: discarding lanes
[1767926702.346175] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1445 UCX  DEBUG ep 0x7fad251a3058: discard uct_ep[0]=0xf358640
[1767926702.346188] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1445 UCX  DEBUG ep 0x7fad251a3058: discard uct_ep[1]=0x4c5e770
[1767926702.346487] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1445 UCX  DEBUG ep 0x7fad251a3058: discard uct_ep[2]=0x4994b40
[1767926702.346491] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:3505 UCX  DEBUG ep 0x7fad251a3058: calling user error callback 0x8a94d0 with arg 0xb466900 and status Connection reset by remote peer
[1767926702.346503] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1752 UCX  DEBUG ep 0x7fad251a3058 flags 0x372529a cfg_index 1: close_nbx(flags=0x1)
[1767926702.346507] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1286 UCX  DEBUG ep 0x7fad251a3058: destroy
[1767926702.346508] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1604 UCX  DEBUG ep 0x7fad251a3058: cleanup lanes
[1767926702.346511] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1615 UCX  DEBUG ep 0x7fad251a3058: pending & destroy uct_ep[0]=0xf341900
[1767926702.346514] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1615 UCX  DEBUG ep 0x7fad251a3058: pending & destroy uct_ep[1]=0xf341900
[1767926702.346516] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1615 UCX  DEBUG ep 0x7fad251a3058: pending & destroy uct_ep[2]=0xf341900
[1767926702.346539] [G02-100G-ASW016-M07:606  :a]       ib_device.c:479  UCX  DEBUG IB Async event on mlx5_0: SRQ-attached QP 0x8ffe was flushed
[1767926702.346640] [G02-100G-ASW016-M07:606  :1]           rc_ep.c:185  UCX  DEBUG destroy rc ep 0x4c5e770
[1767926702.346643] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1783 UCX  DEBUG ep 0x4994b40: disconnect
[1767926702.346647] [G02-100G-ASW016-M07:606  :1]    rdmacm_cm_ep.c:288  UCX  DEBUG cm ep destroy reserved qpn 0xa1af08 on rdmacm_id 0xf3a4280
[1767926706.262624] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:93   UCX  DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926706.262629] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1424 UCX  DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926706.262631] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1430 UCX  DEBUG ep(0x49945a0): resending completed
[1767926710.359411] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:93   UCX  DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926710.359418] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1424 UCX  DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926710.359420] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1430 UCX  DEBUG ep(0x49945a0): resending completed
[1767926714.456220] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:93   UCX  DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926714.456224] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1424 UCX  DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926714.456226] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1430 UCX  DEBUG ep(0x49945a0): resending completed
[1767926715.649628] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:149  UCX  DEBUG ud_ep 0x4994b40 is destroyed after 13.272412s with timeout 10.000000s
[1767926718.553022] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:93   UCX  DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926718.553026] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:374  UCX  DEBUG ep 0x49945a0: timeout of 33.71 sec, config::peer_timeout - 30.00 sec
[1767926718.553030] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1424 UCX  DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926718.553032] [G02-100G-ASW016-M07:606  :1]           ud_ep.c:1430 UCX  DEBUG ep(0x49945a0): resending completed
[1767926718.553038] [G02-100G-ASW016-M07:606  :1]      ucp_worker.c:546  UCX  DEBUG worker 0x7f6c000: error handler called for UCT EP 0x49945a0: Endpoint timeout
[1767926718.553045] [G02-100G-ASW016-M07:606  :1]          ucp_ep.c:1474 UCX  DEBUG ep 0x7fad251a3000: set_ep_failed status Endpoint timeout on lane[2]=0x49945a0
[1767926718.553072] [G02
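
The retransmission cadence of server-side ep 0x49945a0 can be read off the `resending rt_psn` lines above. The sketch below is a small aid for that, not part of UCX: the timestamps are copied from those log lines, and `intervals` is a hypothetical helper. `Decimal` is used so the microsecond digits are not lost to floating-point rounding.

```python
from decimal import Decimal

# Timestamps of the "resending rt_psn 3 ..." lines for ep 0x49945a0.
resend_ts = [
    "1767926702.165816",
    "1767926706.262629",
    "1767926710.359418",
    "1767926714.456224",
    "1767926718.553030",
]


def intervals(timestamps):
    """Differences between consecutive timestamps, in seconds."""
    values = [Decimal(t) for t in timestamps]
    return [later - earlier for earlier, later in zip(values, values[1:])]


for gap in intervals(resend_ts):
    print(gap)
# 4.096813
# 4.096789
# 4.096806
# 4.096806
```

The endpoint keeps resending PSN 3 roughly every 4.1 seconds without being acknowledged, until the `Endpoint timeout` error is raised.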

client ucx log:

[1767926701.066370] [G02-100G-ASW016-M08:7164 :1]          ucp_ep.c:1445 UCX  DEBUG ep 0x7f91ffd94000: discard uct_ep[0]=0x76cb500
[1767926701.066372] [G02-100G-ASW016-M08:7164 :1]    proto_common.c:824  UCX  DEBUG abort request 0x4ea6f00 proto reconfig status Input/output error
[1767926701.066379] [G02-100G-ASW016-M08:7164 :1]       wireup_ep.c:440  UCX  DEBUG ep 0x7f91ffd94000: destroy wireup ep 0x76cb500
[1767926701.066382] [G02-100G-ASW016-M08:7164 :1]          ucp_ep.c:3505 UCX  DEBUG ep 0x7f91ffd94000: calling user error callback 0x7f9209c36f10 with arg 0x44f0700 and status Input/output error
[1767926701.351570] [G02-100G-ASW016-M08:7164 :a]       ib_device.c:479  UCX  WARN  IB Async event on mlx5_0: GID table change on port 1
[1767926701.351795] [G02-100G-ASW016-M08:7164 :a]       ib_device.c:479  UCX  WARN  IB Async event on mlx5_0: GID table change on port 1
[1767926702.065736] [G02-100G-ASW016-M08:7164 :1]          ucp_ep.c:1752 UCX  DEBUG ep 0x7f91ffd94000 flags 0x205018 cfg_index 0: close_nbx(flags=0x1)
[1767926702.065742] [G02-100G-ASW016-M08:7164 :1]          ucp_ep.c:1286 UCX  DEBUG ep 0x7f91ffd94000: destroy
[1767926702.065744] [G02-100G-ASW016-M08:7164 :1]          ucp_ep.c:1604 UCX  DEBUG ep 0x7f91ffd94000: cleanup lanes
[1767926702.065748] [G02-100G-ASW016-M08:7164 :1]          ucp_ep.c:1615 UCX  DEBUG ep 0x7f91ffd94000: pending & destroy uct_ep[0]=0x76df3e0
[1767926702.066006] [G02-100G-ASW016-M08:7164 :1]          ucp_ep.c:405  UCX  DEBUG created ep 0x7f91ffd94000 to <no debug data> from api call
[1767926702.066043] [G02-100G-ASW016-M08:7164 :1]    rdmacm_cm_ep.c:806  UCX  DEBUG [cep 0x76af950 <invalid>->10.10.100.167:8124 client Success] created an endpoint on rdmacm 0x32e8000 id: 0x4e9d180
[1767926702.066045] [G02-100G-ASW016-M08:7164 :1]       wireup_ep.c:550  UCX  DEBUG ep 0x7f91ffd94000: wireup_ep 0x76cb500 set next_ep 0x76af950
[1767926702.066409] [G02-100G-ASW016-M08:7164 :a]       wireup_cm.c:607  UCX  DEBUG client created ep 0x7f91ffd94000 on device mlx5_0:1, tl_bitmap 0x3 0x0 on cm rdmacm
[1767926702.066416] [G02-100G-ASW016-M08:7164 :1]       wireup_cm.c:184  UCX  DEBUG ep 0x7f91ffd94000: init_flags 0x36
[1767926702.066552] [G02-100G-ASW016-M08:7164 :1]             sys.c:440  UCX  DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.066556] [G02-100G-ASW016-M08:7164 :1]       ib_device.c:1456 UCX  DIAG  failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.066566] [G02-100G-ASW016-M08:7164 :1]             sys.c:440  UCX  DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.066568] [G02-100G-ASW016-M08:7164 :1]       ib_device.c:1456 UCX  DIAG  failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.066577] [G02-100G-ASW016-M08:7164 :1]       wireup_cm.c:302  UCX  DEBUG CM rdmacm private data buffer is too small to pack UCP endpoint info, cm max_conn_priv 54, service data version 1, size 9, address length 69
[1767926702.066580] [G02-100G-ASW016-M08:7164 :1]       wireup_cm.c:184  UCX  DEBUG ep 0x7f91ffd94000: init_flags 0x236
[1767926702.066604] [G02-100G-ASW016-M08:7164 :1]             sys.c:440  UCX  DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.066606] [G02-100G-ASW016-M08:7164 :1]       ib_device.c:1456 UCX  DIAG  failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.066904] [G02-100G-ASW016-M08:7164 :1]        ib_iface.c:1196 UCX  DEBUG mlx5_0: iface 0x200e000 created RC QP 0xb269 on mlx5_0:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1767926702.066908] [G02-100G-ASW016-M08:7164 :1]           rc_ep.c:165  UCX  DEBUG created rc ep 0x1b17340
[1767926702.067110] [G02-100G-ASW016-M08:7164 :1]       wireup_ep.c:508  UCX  DEBUG ep 0x7f91ffd94000: wireup_ep 0x76cb200 created next_ep 0x1b17340 to <no debug data> using rc_verbs/mlx5_0:1
[1767926702.067117] [G02-100G-ASW016-M08:7164 :1]    rdmacm_cm_ep.c:263  UCX  DEBUG created reserved qpn 0xa3ac82 on rdmacm_id 0x4e9d180
[1767926702.068484] [G02-100G-ASW016-M08:7164 :a]       rdmacm_cm.c:464  UCX  DEBUG cm_id 0x4e9d180: ah_attr dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.10.100.167 flow_label=0x9bc3c sgid_index=4 traffic_class=106
[1767926702.068493] [G02-100G-ASW016-M08:7164 :a]       wireup_cm.c:778  UCX  DEBUG ep 0x7f91ffd94000 flags 0xa04011 cfg_index 1: client connected status Success
[1767926702.068502] [G02-100G-ASW016-M08:7164 :1]       wireup_cm.c:657  UCX  DEBUG ep 0x7f91ffd94000 flags 0xa04011 cfg_index 1: client connect progress
[1767926702.068538] [G02-100G-ASW016-M08:7164 :1]             sys.c:440  UCX  DEBUG     failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.068541] [G02-100G-ASW016-M08:7164 :1]       ib_device.c:1456 UCX  DIAG      failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.068553] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1214 UCX  DEBUG     ep 0x7f91ffd94000: am_lane 1 wireup_msg_lane 1 cm_lane 0 keepalive_lane 1 reachable_mds 0x1
[1767926702.068556] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1237 UCX  DEBUG     ep 0x7f91ffd94000: lane[0]: cm rdmacm
[1767926702.068559] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1237 UCX  DEBUG     ep 0x7f91ffd94000: lane[1]:  0:rc_verbs/mlx5_0:1.0 md[0]     -> addr[0].md[0]/ib/sysdev[0] seg 4294967295 rma_bw#0 am am_bw#0 keepalive wireup
[1767926702.068561] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1240 UCX  DEBUG     ep 0x7f91ffd94000: err mode 1, flags 0x4
[1767926702.068568] [G02-100G-ASW016-M08:7164 :1]        ib_iface.c:936  UCX  DEBUG     iface 0x200e000: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.10.100.167 flow_label=0xffffffff sgid_index=4 traffic_class=106
[1767926702.068945] [G02-100G-ASW016-M08:7164 :1]        rc_iface.c:923  UCX  DEBUG     connected rc qp 0xb269 on mlx5_0:1/RoCE to lid 49152(+0) sl 0 remote_qp 0x8ffe mtu 1024 timer 18x2 rnr 13x7 rd_atom 16
[1767926702.069021] [G02-100G-ASW016-M08:7164 :1]             sys.c:440  UCX  DEBUG   failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.069025] [G02-100G-ASW016-M08:7164 :1]       ib_device.c:1456 UCX  DIAG    failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.069037] [G02-100G-ASW016-M08:7164 :1]             sys.c:440  UCX  DEBUG   failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.069040] [G02-100G-ASW016-M08:7164 :1]       ib_device.c:1456 UCX  DIAG    failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.069047] [G02-100G-ASW016-M08:7164 :1]             sys.c:440  UCX  DEBUG   failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.069049] [G02-100G-ASW016-M08:7164 :1]       ib_device.c:1456 UCX  DIAG    failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.069542] [G02-100G-ASW016-M08:7164 :1]            topo.c:898  UCX  DEBUG   /sys/class/net/ens2f0np0: PF sysfs path is '/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0'
[1767926702.069714] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1214 UCX  DEBUG   ep 0x7f91ffd94000: am_lane 1 wireup_msg_lane 1 cm_lane 0 keepalive_lane 2 reachable_mds 0x3
[1767926702.069717] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1237 UCX  DEBUG   ep 0x7f91ffd94000: lane[0]: cm rdmacm
[1767926702.069720] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1237 UCX  DEBUG   ep 0x7f91ffd94000: lane[1]:  0:rc_verbs/mlx5_0:1.0 md[0]     -> addr[0].md[0]/ib/sysdev[0] seg 4294967295 rma_bw#0 am am_bw#0 wireup
[1767926702.069732] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1237 UCX  DEBUG   ep 0x7f91ffd94000: lane[2]:  1:ud_verbs/mlx5_0:1.0 md[0]     -> addr[1].md[0]/ib/sysdev[0] seg 4294967295 keepalive
[1767926702.069734] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1240 UCX  DEBUG   ep 0x7f91ffd94000: err mode 1, flags 0x0
[1767926702.069740] [G02-100G-ASW016-M08:7164 :1]           ud_ep.c:405  UCX  DEBUG   created ep ep=0x1d767e0 iface=0x2004000 id=0
[1767926702.069743] [G02-100G-ASW016-M08:7164 :1]       wireup_ep.c:508  UCX  DEBUG   ep 0x7f91ffd94000: wireup_ep 0x76caf00 created next_ep 0x1d767e0 to <no debug data> using ud_verbs/mlx5_0:1
[1767926702.069748] [G02-100G-ASW016-M08:7164 :1]          wireup.c:1791 UCX  DEBUG ep 0x7f91ffd94000: send wireup request (flags=0x4a04091)
[1767926702.069976] [G02-100G-ASW016-M08:7164 :1]           ud_ep.c:689  UCX  DEBUG   mlx5_0:1/RoCE slid 0 qpn 0xc45a epid 0 connected to ::ffff:10.10.100.167mtu 1024 pkey 0xffff  qpn 0xa6bd epid 0
[1767926702.069980] [G02-100G-ASW016-M08:7164 :1]        ib_iface.c:936  UCX  DEBUG   iface 0x2004000: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.10.100.167 flow_label=0xffffffff sgid_index=3 traffic_class=106
[1767926702.069985] [G02-100G-ASW016-M08:7164 :1]       wireup_ep.c:440  UCX  DEBUG ep 0x7f91ffd94000: destroy wireup ep 0x76cb500
[1767926702.069987] [G02-100G-ASW016-M08:7164 :1]       wireup_ep.c:440  UCX  DEBUG ep 0x7f91ffd94000: destroy wireup ep 0x76cb200
[1767926702.069989] [G02-100G-ASW016-M08:7164 :1]       wireup_ep.c:440  UCX  DEBUG ep 0x7f91ffd94000: destroy wireup ep 0x76caf00
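
To line up the UCX log lines with the assertion timestamp, the bracketed numbers can be converted to wall-clock time. The sketch below assumes they are Unix epoch seconds and that the timestamp printed next to the assertion (2026-01-09 02:45:02.065) is in UTC; `ucx_ts_to_utc` is a hypothetical helper. The fractional part is handled as a string to avoid floating-point rounding.

```python
from datetime import datetime, timezone


def ucx_ts_to_utc(ts: str) -> str:
    """Convert a 'seconds.micros' log timestamp to a UTC string."""
    secs, _, frac = ts.partition(".")
    micros = int(frac.ljust(6, "0")[:6]) if frac else 0
    dt = datetime.fromtimestamp(int(secs), tz=timezone.utc)
    return dt.replace(microsecond=micros).strftime("%Y-%m-%d %H:%M:%S.%f")


# Client-side close_nbx line for ep 0x7f91ffd94000.
print(ucx_ts_to_utc("1767926702.065736"))  # 2026-01-09 02:45:02.065736
```

Under these assumptions, the client's `close_nbx` on ep 0x7f91ffd94000 and the immediate re-creation of that ep fall within the same millisecond as the reported assertion time.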
