[4/9]
============================================================
case: 002/2
description: single device down
location: 10.10.16.168
command: ifdown ens2f0np0
recovery: ifup ens2f0np0
duration: 30s
============================================================
[10:44:31] ▶ injecting fault on 10.10.16.168...
[10:44:32] exit code: 0
Device 'ens2f0np0' successfully disconnected.
⏳ waiting 30 seconds... [██████████████████████████████] done
[10:45:01] ◀ executing recovery on 10.10.16.168...
[10:45:01] exit code: 0
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/93)
[10:45:01] ✓ case 002/2 completed
[2026-01-09 02:45:02.065]
ud_ep.c:714 Assertion `UCT_UD_PSN_COMPARE(ep->tx.acked_psn, <, ep->tx.psn)' failed: ep 0x1d767e0: flags=0x28 acked_psn=3 must be smaller than current_psn=1
#0 0x00007f92093d0edc in __pthread_kill_implementation () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f91d8d92640 (LWP 7247))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-168.el9_6.24.x86_64 libaio-0.3.111-13.el9.x86_64 libdwarf-0.3.4-1.el9.1.x86_64 libgcc-11.5.0-5.el9_5.x86_64 libibverbs-54.0-2.el9_6.x86_64 libnl3-3.11.0-1.el9.x86_64 librdmacm-54.0-2.el9_6.x86_64 libstdc++-11.3.1-2.1.el9.x86_64 liburing-2.5-1.el9.x86_64 libzstd-1.5.5-1.el9.x86_64 lz4-libs-1.9.3-5.el9.x86_64 numactl-libs-2.0.19-1.el9.x86_64 openssl-libs-3.2.2-6.el9_5.1.x86_64 zlib-1.2.11-40.el9.x86_64
(gdb) bt
#0 0x00007f92093d0edc in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007f9209383b46 in raise () from /lib64/libc.so.6
#2 0x00007f920936d833 in abort () from /lib64/libc.so.6
#3 0x00007f9209e022ae in ucs_fatal_error_message (file=file@entry=0x7f9209ec1599 "ud/base/ud_ep.c", line=line@entry=714, function=function@entry=0x7f9209ec2b10 <__func__.16> "uct_ud_ep_process_ack",
message_buf=message_buf@entry=0x4cf1220 "Assertion `UCT_UD_PSN_COMPARE(ep->tx.acked_psn, <, ep->tx.psn)' failed: ep 0x1d767e0: flags=0x28 acked_psn=3 must be smaller than current_psn=1") at debug/assert.c:38
#4 0x00007f9209e02381 in ucs_fatal_error_format (file=file@entry=0x7f9209ec1599 "ud/base/ud_ep.c", line=line@entry=714, function=function@entry=0x7f9209ec2b10 <__func__.16> "uct_ud_ep_process_ack",
format=format@entry=0x7f9209ec25b0 "Assertion `%s' failed: ep %p: flags=0x%x acked_psn=%u must be smaller than current_psn=%u") at debug/assert.c:53
#5 0x00007f9209dc2242 in uct_ud_ep_process_ack (is_async=0, ack_psn=<optimized out>, ep=0x1d767e0, iface=0x2004000) at ud/base/ud_ep.c:714
#6 uct_ud_ep_process_rx (iface=iface@entry=0x2004000, neth=0x7f91ffdcda28, byte_len=16, skb=0x7f91ffdcd9d4, is_async=is_async@entry=0) at ud/base/ud_ep.c:1009
#7 0x00007f9209dc7f11 in uct_ud_verbs_iface_poll_rx (is_async=0, iface=0x2004000) at ud/verbs/ud_verbs.c:422
#8 uct_ud_verbs_iface_progress (tl_iface=0x2004000) at ud/verbs/ud_verbs.c:463
#9 0x00007f9209d04c1a in ucs_callbackq_dispatch (cbq=<optimized out>) at /root/liufeng/xrpc/xmake_globaldir/.xmake/cache/packages/2511/u/ucx/1.18.0/source/src/ucs/datastruct/callbackq.h:215
#10 uct_worker_progress (worker=<optimized out>) at /root/liufeng/xrpc/xmake_globaldir/.xmake/cache/packages/2511/u/ucx/1.18.0/source/src/uct/api/uct.h:2813
#11 ucp_worker_progress (worker=0x2084000) at core/ucp_worker.c:3033
#12 0x00007f9209a0aff8 in xrpc::Channel::poll (this=this@entry=0x1d70400) at src/channel.cc:1791
#13 0x00007f920990f340 in operator() (__closure=0x4cf1f30) at examples/abyss/client/engine.cc:164
(gdb) f 5
#5 0x00007f9209dc2242 in uct_ud_ep_process_ack (is_async=0, ack_psn=<optimized out>, ep=0x1d767e0, iface=0x2004000) at ud/base/ud_ep.c:714
714 ud/base/ud_ep.c: No such file or directory.
(gdb) p ep
$1 = (uct_ud_ep_t *) 0x1d767e0
(gdb) p ep[0]
$2 = {super = {super = {iface = 0x2004000}}, ep_id = 0, dest_ep_id = 0, tx = {psn = 1, max_psn = 3, acked_psn = 3, resend_count = 0, window = {head = 0x0, ptail = 0x1d767f8}, pending = {group = {tail = 0x0, guard = 0, arbiter = 0x0},
ops = 0, elem = {list = {prev = 0x0, next = 0x0}, next = 0x0, group = 0x0}}, send_time = 0, resend_time = 0, tick = 21000000}, rx = {acked_psn = 0, ooo_pkts = {list = {head = 0x0, ptail = 0x1d76868}, ready_list = {head = 0x0,
ptail = 0x1d76878}, head_sn = 0, elem_count = 0, list_count = 0, max_holes = 0}}, ca = {wmax = 1025, cwnd = 2}, resend = {pos = 0x1d767f8, psn = 1, max_psn = 0}, conn_match = {list = {list = {prev = 0x0, next = 0x0}}},
conn_sn = 0, flags = 40, rx_creq_count = 0 '\000', path_index = 0 '\000', timer = {cb = 0x7f9209dbed60 <uct_ud_ep_timer>, list = {prev = 0x0, next = 0x0}, is_active = 0}, close_time = 0}
(gdb) f 6
#6 uct_ud_ep_process_rx (iface=iface@entry=0x2004000, neth=0x7f91ffdcda28, byte_len=16, skb=0x7f91ffdcd9d4, is_async=is_async@entry=0) at ud/base/ud_ep.c:1009
1009 in ud/base/ud_ep.c
(gdb) p neth
$3 = (uct_ud_neth_t *) 0x7f91ffdcda28
(gdb) p neth[0]
$4 = {packet_type = 301989888, psn = 3, ack_psn = 3}
(gdb) info local
dest_id = 0
is_am = 0
am_id = 2
ep = 0x1d767e0
ooo_type = <optimized out>
__func__ = "uct_ud_ep_process_rx"
(gdb)
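To make the failed assertion easier to read, here is a minimal sketch of the comparison it performs, using the values from the dump above (`tx.psn = 1`, `tx.acked_psn = 3`, and `ack_psn = 3` in the received header). It assumes PSNs are 16-bit counters compared circularly via a signed 16-bit difference; `psn_compare_lt` is a hypothetical helper written for illustration, not a UCX function.

```python
def psn_compare_lt(a: int, b: int) -> bool:
    """Return True if PSN a precedes PSN b on a 16-bit circular space.

    Assumption: the comparison is a signed 16-bit difference, so the
    ordering still holds when the counter wraps around.
    """
    diff = (a - b) & 0xFFFF          # wrap the difference to 16 bits
    if diff >= 0x8000:               # reinterpret as signed
        diff -= 0x10000
    return diff < 0


# Values taken from the gdb dump of the endpoint and the received header.
tx_psn = 1           # ep->tx.psn
tx_acked_psn = 3     # ep->tx.acked_psn, equal to neth->ack_psn

# The assertion requires acked_psn < psn; with these values it does not hold.
assertion_holds = psn_compare_lt(tx_acked_psn, tx_psn)
print("acked_psn < psn:", assertion_holds)
```

With these inputs the check is false, which is consistent with the abort: the endpoint was acknowledged up to PSN 3 although its next PSN to send is still 1.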
export UCX_NET_DEVICES=mlx5_0:1,ens2f0np0
export UCX_HANDLE_ERRORS=none #bt,freeze,debug,none
export UCX_LOG_LEVEL=debug
export UCX_RC_RETRY_COUNT=2
export UCX_DC_MLX5_RETRY_COUNT=2
export UCX_UD_LINGER_TIMEOUT=10s
export UCX_RDMA_CM_TIMEOUT=10s
export UCX_TLS=rc,tcp
Linux G02-100G-ASW016-M08 5.14.0-162.nos.4.el8.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov 24 07:51:00 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Firmware version: 22.36.1010
Hardware version: 0
Node GUID: 0xa088c20300b4a46e
System image GUID: 0xa088c20300b4a46e
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xa288c2fffeb4a46e
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4125
Number of ports: 1
Firmware version: 22.36.1010
Hardware version: 0
Node GUID: 0xa088c20300b4a46f
System image GUID: 0xa088c20300b4a46e
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xa288c2fffeb4a46f
Link layer: Ethernet
CA 'mlx5_2'
CA type: MT4125
Number of ports: 1
Firmware version: 22.35.3502
Hardware version: 0
Node GUID: 0xe8ebd303003a6d44
System image GUID: 0xe8ebd303003a6d44
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xeaebd3fffe3a6d44
Link layer: Ethernet
CA 'mlx5_3'
CA type: MT4125
Number of ports: 1
Firmware version: 22.35.3502
Hardware version: 0
Node GUID: 0xe8ebd303003a6d45
System image GUID: 0xe8ebd303003a6d44
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xeaebd3fffe3a6d45
Link layer: Ethernet
[1767926702.078171] [G02-100G-ASW016-M07:606 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/4: Invalid argument
[1767926702.078186] [G02-100G-ASW016-M07:606 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.078202] [G02-100G-ASW016-M07:606 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/4: Invalid argument
[1767926702.078206] [G02-100G-ASW016-M07:606 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.078215] [G02-100G-ASW016-M07:606 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/4: Invalid argument
[1767926702.078218] [G02-100G-ASW016-M07:606 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.078301] [G02-100G-ASW016-M07:606 :1] wireup.c:1214 UCX DEBUG ep 0x7fad251a3058: am_lane 1 wireup_msg_lane 1 cm_lane 0 keepalive_lane 2 reachable_mds 0x1
[1767926702.078304] [G02-100G-ASW016-M07:606 :1] wireup.c:1237 UCX DEBUG ep 0x7fad251a3058: lane[0]: cm rdmacm
[1767926702.078308] [G02-100G-ASW016-M07:606 :1] wireup.c:1237 UCX DEBUG ep 0x7fad251a3058: lane[1]: 0:rc_verbs/mlx5_0:1.0 md[0] -> addr[0].md[0]/ib/sysdev[0] seg 4294967295 rma_bw#0 am am_bw#0 wireup
[1767926702.078312] [G02-100G-ASW016-M07:606 :1] wireup.c:1237 UCX DEBUG ep 0x7fad251a3058: lane[2]: 1:ud_verbs/mlx5_0:1.0 md[0] -> addr[1].md[0]/ib/sysdev[0] seg 4294967295 keepalive
[1767926702.078314] [G02-100G-ASW016-M07:606 :1] wireup.c:1240 UCX DEBUG ep 0x7fad251a3058: err mode 1, flags 0x0
[1767926702.078318] [G02-100G-ASW016-M07:606 :1] ud_ep.c:405 UCX DEBUG created ep ep=0x4994b40 iface=0x812e000 id=0
[1767926702.078321] [G02-100G-ASW016-M07:606 :1] wireup_ep.c:508 UCX DEBUG ep 0x7fad251a3058: wireup_ep 0xefc4900 created next_ep 0x4994b40 to <no debug data> using ud_verbs/mlx5_0:1
[1767926702.078327] [G02-100G-ASW016-M07:606 :1] ud_ep.c:689 UCX DEBUG mlx5_0:1/RoCE slid 0 qpn 0xa6bd epid 0 connected to ::ffff:10.10.100.168mtu 1024 pkey 0xffff qpn 0xc45a epid 0
[1767926702.078331] [G02-100G-ASW016-M07:606 :1] ib_iface.c:936 UCX DEBUG iface 0x812e000: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.10.100.168 flow_label=0xffffffff sgid_index=4 traffic_class=106
[1767926702.078385] [G02-100G-ASW016-M07:606 :1] wireup_ep.c:440 UCX DEBUG ep 0x7fad251a3058: destroy wireup ep 0xf34cc00
[1767926702.078388] [G02-100G-ASW016-M07:606 :1] wireup_ep.c:440 UCX DEBUG ep 0x7fad251a3058: destroy wireup ep 0xf34cf00
[1767926702.078390] [G02-100G-ASW016-M07:606 :1] wireup_ep.c:440 UCX DEBUG ep 0x7fad251a3058: destroy wireup ep 0xefc4900
[1767926702.165810] [G02-100G-ASW016-M07:606 :1] ud_ep.c:93 UCX DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926702.165816] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1424 UCX DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926702.165818] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1430 UCX DEBUG ep(0x49945a0): resending completed
[1767926702.346146] [G02-100G-ASW016-M07:606 :1] rdmacm_cm.c:659 UCX DEBUG [cep 0xf358640 10.10.100.167:8124->10.10.100.168:54144 server Success] got disconnect event, status Success (0)
[1767926702.346157] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1474 UCX DEBUG ep 0x7fad251a3058: set_ep_failed status Connection reset by remote peer on lane[0]=0xf358640
[1767926702.346169] [G02-100G-ASW016-M07:606 :1] rdmacm_cm_ep.c:769 UCX DEBUG [cep 0xf358640 10.10.100.167:8124->10.10.100.168:54144 server Success]: (id=0xf3a4280) disconnected from peer 10.10.100.168:54144
[1767926702.346172] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1437 UCX DEBUG ep 0x7fad251a3058: discarding lanes
[1767926702.346175] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1445 UCX DEBUG ep 0x7fad251a3058: discard uct_ep[0]=0xf358640
[1767926702.346188] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1445 UCX DEBUG ep 0x7fad251a3058: discard uct_ep[1]=0x4c5e770
[1767926702.346487] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1445 UCX DEBUG ep 0x7fad251a3058: discard uct_ep[2]=0x4994b40
[1767926702.346491] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:3505 UCX DEBUG ep 0x7fad251a3058: calling user error callback 0x8a94d0 with arg 0xb466900 and status Connection reset by remote peer
[1767926702.346503] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1752 UCX DEBUG ep 0x7fad251a3058 flags 0x372529a cfg_index 1: close_nbx(flags=0x1)
[1767926702.346507] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1286 UCX DEBUG ep 0x7fad251a3058: destroy
[1767926702.346508] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1604 UCX DEBUG ep 0x7fad251a3058: cleanup lanes
[1767926702.346511] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1615 UCX DEBUG ep 0x7fad251a3058: pending & destroy uct_ep[0]=0xf341900
[1767926702.346514] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1615 UCX DEBUG ep 0x7fad251a3058: pending & destroy uct_ep[1]=0xf341900
[1767926702.346516] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1615 UCX DEBUG ep 0x7fad251a3058: pending & destroy uct_ep[2]=0xf341900
[1767926702.346539] [G02-100G-ASW016-M07:606 :a] ib_device.c:479 UCX DEBUG IB Async event on mlx5_0: SRQ-attached QP 0x8ffe was flushed
[1767926702.346640] [G02-100G-ASW016-M07:606 :1] rc_ep.c:185 UCX DEBUG destroy rc ep 0x4c5e770
[1767926702.346643] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1783 UCX DEBUG ep 0x4994b40: disconnect
[1767926702.346647] [G02-100G-ASW016-M07:606 :1] rdmacm_cm_ep.c:288 UCX DEBUG cm ep destroy reserved qpn 0xa1af08 on rdmacm_id 0xf3a4280
[1767926706.262624] [G02-100G-ASW016-M07:606 :1] ud_ep.c:93 UCX DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926706.262629] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1424 UCX DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926706.262631] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1430 UCX DEBUG ep(0x49945a0): resending completed
[1767926710.359411] [G02-100G-ASW016-M07:606 :1] ud_ep.c:93 UCX DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926710.359418] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1424 UCX DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926710.359420] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1430 UCX DEBUG ep(0x49945a0): resending completed
[1767926714.456220] [G02-100G-ASW016-M07:606 :1] ud_ep.c:93 UCX DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926714.456224] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1424 UCX DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926714.456226] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1430 UCX DEBUG ep(0x49945a0): resending completed
[1767926715.649628] [G02-100G-ASW016-M07:606 :1] ud_ep.c:149 UCX DEBUG ud_ep 0x4994b40 is destroyed after 13.272412s with timeout 10.000000s
[1767926718.553022] [G02-100G-ASW016-M07:606 :1] ud_ep.c:93 UCX DEBUG ep: 0x49945a0 ca drop@cwnd = 2 in flight: 1
[1767926718.553026] [G02-100G-ASW016-M07:606 :1] ud_ep.c:374 UCX DEBUG ep 0x49945a0: timeout of 33.71 sec, config::peer_timeout - 30.00 sec
[1767926718.553030] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1424 UCX DEBUG ep(0x49945a0): resending rt_psn 3 rt_max_psn 3 acked_psn 2 max_psn 4 ack_req 1
[1767926718.553032] [G02-100G-ASW016-M07:606 :1] ud_ep.c:1430 UCX DEBUG ep(0x49945a0): resending completed
[1767926718.553038] [G02-100G-ASW016-M07:606 :1] ucp_worker.c:546 UCX DEBUG worker 0x7f6c000: error handler called for UCT EP 0x49945a0: Endpoint timeout
[1767926718.553045] [G02-100G-ASW016-M07:606 :1] ucp_ep.c:1474 UCX DEBUG ep 0x7fad251a3000: set_ep_failed status Endpoint timeout on lane[2]=0x49945a0
[1767926718.553072] [G02
[1767926701.066370] [G02-100G-ASW016-M08:7164 :1] ucp_ep.c:1445 UCX DEBUG ep 0x7f91ffd94000: discard uct_ep[0]=0x76cb500
[1767926701.066372] [G02-100G-ASW016-M08:7164 :1] proto_common.c:824 UCX DEBUG abort request 0x4ea6f00 proto reconfig status Input/output error
[1767926701.066379] [G02-100G-ASW016-M08:7164 :1] wireup_ep.c:440 UCX DEBUG ep 0x7f91ffd94000: destroy wireup ep 0x76cb500
[1767926701.066382] [G02-100G-ASW016-M08:7164 :1] ucp_ep.c:3505 UCX DEBUG ep 0x7f91ffd94000: calling user error callback 0x7f9209c36f10 with arg 0x44f0700 and status Input/output error
[1767926701.351570] [G02-100G-ASW016-M08:7164 :a] ib_device.c:479 UCX WARN IB Async event on mlx5_0: GID table change on port 1
[1767926701.351795] [G02-100G-ASW016-M08:7164 :a] ib_device.c:479 UCX WARN IB Async event on mlx5_0: GID table change on port 1
[1767926702.065736] [G02-100G-ASW016-M08:7164 :1] ucp_ep.c:1752 UCX DEBUG ep 0x7f91ffd94000 flags 0x205018 cfg_index 0: close_nbx(flags=0x1)
[1767926702.065742] [G02-100G-ASW016-M08:7164 :1] ucp_ep.c:1286 UCX DEBUG ep 0x7f91ffd94000: destroy
[1767926702.065744] [G02-100G-ASW016-M08:7164 :1] ucp_ep.c:1604 UCX DEBUG ep 0x7f91ffd94000: cleanup lanes
[1767926702.065748] [G02-100G-ASW016-M08:7164 :1] ucp_ep.c:1615 UCX DEBUG ep 0x7f91ffd94000: pending & destroy uct_ep[0]=0x76df3e0
[1767926702.066006] [G02-100G-ASW016-M08:7164 :1] ucp_ep.c:405 UCX DEBUG created ep 0x7f91ffd94000 to <no debug data> from api call
[1767926702.066043] [G02-100G-ASW016-M08:7164 :1] rdmacm_cm_ep.c:806 UCX DEBUG [cep 0x76af950 <invalid>->10.10.100.167:8124 client Success] created an endpoint on rdmacm 0x32e8000 id: 0x4e9d180
[1767926702.066045] [G02-100G-ASW016-M08:7164 :1] wireup_ep.c:550 UCX DEBUG ep 0x7f91ffd94000: wireup_ep 0x76cb500 set next_ep 0x76af950
[1767926702.066409] [G02-100G-ASW016-M08:7164 :a] wireup_cm.c:607 UCX DEBUG client created ep 0x7f91ffd94000 on device mlx5_0:1, tl_bitmap 0x3 0x0 on cm rdmacm
[1767926702.066416] [G02-100G-ASW016-M08:7164 :1] wireup_cm.c:184 UCX DEBUG ep 0x7f91ffd94000: init_flags 0x36
[1767926702.066552] [G02-100G-ASW016-M08:7164 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.066556] [G02-100G-ASW016-M08:7164 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.066566] [G02-100G-ASW016-M08:7164 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.066568] [G02-100G-ASW016-M08:7164 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.066577] [G02-100G-ASW016-M08:7164 :1] wireup_cm.c:302 UCX DEBUG CM rdmacm private data buffer is too small to pack UCP endpoint info, cm max_conn_priv 54, service data version 1, size 9, address length 69
[1767926702.066580] [G02-100G-ASW016-M08:7164 :1] wireup_cm.c:184 UCX DEBUG ep 0x7f91ffd94000: init_flags 0x236
[1767926702.066604] [G02-100G-ASW016-M08:7164 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.066606] [G02-100G-ASW016-M08:7164 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.066904] [G02-100G-ASW016-M08:7164 :1] ib_iface.c:1196 UCX DEBUG mlx5_0: iface 0x200e000 created RC QP 0xb269 on mlx5_0:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1767926702.066908] [G02-100G-ASW016-M08:7164 :1] rc_ep.c:165 UCX DEBUG created rc ep 0x1b17340
[1767926702.067110] [G02-100G-ASW016-M08:7164 :1] wireup_ep.c:508 UCX DEBUG ep 0x7f91ffd94000: wireup_ep 0x76cb200 created next_ep 0x1b17340 to <no debug data> using rc_verbs/mlx5_0:1
[1767926702.067117] [G02-100G-ASW016-M08:7164 :1] rdmacm_cm_ep.c:263 UCX DEBUG created reserved qpn 0xa3ac82 on rdmacm_id 0x4e9d180
[1767926702.068484] [G02-100G-ASW016-M08:7164 :a] rdmacm_cm.c:464 UCX DEBUG cm_id 0x4e9d180: ah_attr dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.10.100.167 flow_label=0x9bc3c sgid_index=4 traffic_class=106
[1767926702.068493] [G02-100G-ASW016-M08:7164 :a] wireup_cm.c:778 UCX DEBUG ep 0x7f91ffd94000 flags 0xa04011 cfg_index 1: client connected status Success
[1767926702.068502] [G02-100G-ASW016-M08:7164 :1] wireup_cm.c:657 UCX DEBUG ep 0x7f91ffd94000 flags 0xa04011 cfg_index 1: client connect progress
[1767926702.068538] [G02-100G-ASW016-M08:7164 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.068541] [G02-100G-ASW016-M08:7164 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.068553] [G02-100G-ASW016-M08:7164 :1] wireup.c:1214 UCX DEBUG ep 0x7f91ffd94000: am_lane 1 wireup_msg_lane 1 cm_lane 0 keepalive_lane 1 reachable_mds 0x1
[1767926702.068556] [G02-100G-ASW016-M08:7164 :1] wireup.c:1237 UCX DEBUG ep 0x7f91ffd94000: lane[0]: cm rdmacm
[1767926702.068559] [G02-100G-ASW016-M08:7164 :1] wireup.c:1237 UCX DEBUG ep 0x7f91ffd94000: lane[1]: 0:rc_verbs/mlx5_0:1.0 md[0] -> addr[0].md[0]/ib/sysdev[0] seg 4294967295 rma_bw#0 am am_bw#0 keepalive wireup
[1767926702.068561] [G02-100G-ASW016-M08:7164 :1] wireup.c:1240 UCX DEBUG ep 0x7f91ffd94000: err mode 1, flags 0x4
[1767926702.068568] [G02-100G-ASW016-M08:7164 :1] ib_iface.c:936 UCX DEBUG iface 0x200e000: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.10.100.167 flow_label=0xffffffff sgid_index=4 traffic_class=106
[1767926702.068945] [G02-100G-ASW016-M08:7164 :1] rc_iface.c:923 UCX DEBUG connected rc qp 0xb269 on mlx5_0:1/RoCE to lid 49152(+0) sl 0 remote_qp 0x8ffe mtu 1024 timer 18x2 rnr 13x7 rd_atom 16
[1767926702.069021] [G02-100G-ASW016-M08:7164 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.069025] [G02-100G-ASW016-M08:7164 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.069037] [G02-100G-ASW016-M08:7164 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.069040] [G02-100G-ASW016-M08:7164 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.069047] [G02-100G-ASW016-M08:7164 :1] sys.c:440 UCX DEBUG failed to read from /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/3: Invalid argument
[1767926702.069049] [G02-100G-ASW016-M08:7164 :1] ib_device.c:1456 UCX DIAG failed to read /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/0: Invalid argument
[1767926702.069542] [G02-100G-ASW016-M08:7164 :1] topo.c:898 UCX DEBUG /sys/class/net/ens2f0np0: PF sysfs path is '/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0'
[1767926702.069714] [G02-100G-ASW016-M08:7164 :1] wireup.c:1214 UCX DEBUG ep 0x7f91ffd94000: am_lane 1 wireup_msg_lane 1 cm_lane 0 keepalive_lane 2 reachable_mds 0x3
[1767926702.069717] [G02-100G-ASW016-M08:7164 :1] wireup.c:1237 UCX DEBUG ep 0x7f91ffd94000: lane[0]: cm rdmacm
[1767926702.069720] [G02-100G-ASW016-M08:7164 :1] wireup.c:1237 UCX DEBUG ep 0x7f91ffd94000: lane[1]: 0:rc_verbs/mlx5_0:1.0 md[0] -> addr[0].md[0]/ib/sysdev[0] seg 4294967295 rma_bw#0 am am_bw#0 wireup
[1767926702.069732] [G02-100G-ASW016-M08:7164 :1] wireup.c:1237 UCX DEBUG ep 0x7f91ffd94000: lane[2]: 1:ud_verbs/mlx5_0:1.0 md[0] -> addr[1].md[0]/ib/sysdev[0] seg 4294967295 keepalive
[1767926702.069734] [G02-100G-ASW016-M08:7164 :1] wireup.c:1240 UCX DEBUG ep 0x7f91ffd94000: err mode 1, flags 0x0
[1767926702.069740] [G02-100G-ASW016-M08:7164 :1] ud_ep.c:405 UCX DEBUG created ep ep=0x1d767e0 iface=0x2004000 id=0
[1767926702.069743] [G02-100G-ASW016-M08:7164 :1] wireup_ep.c:508 UCX DEBUG ep 0x7f91ffd94000: wireup_ep 0x76caf00 created next_ep 0x1d767e0 to <no debug data> using ud_verbs/mlx5_0:1
[1767926702.069748] [G02-100G-ASW016-M08:7164 :1] wireup.c:1791 UCX DEBUG ep 0x7f91ffd94000: send wireup request (flags=0x4a04091)
[1767926702.069976] [G02-100G-ASW016-M08:7164 :1] ud_ep.c:689 UCX DEBUG mlx5_0:1/RoCE slid 0 qpn 0xc45a epid 0 connected to ::ffff:10.10.100.167mtu 1024 pkey 0xffff qpn 0xa6bd epid 0
[1767926702.069980] [G02-100G-ASW016-M08:7164 :1] ib_iface.c:936 UCX DEBUG iface 0x2004000: ah_attr dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.10.100.167 flow_label=0xffffffff sgid_index=3 traffic_class=106
[1767926702.069985] [G02-100G-ASW016-M08:7164 :1] wireup_ep.c:440 UCX DEBUG ep 0x7f91ffd94000: destroy wireup ep 0x76cb500
[1767926702.069987] [G02-100G-ASW016-M08:7164 :1] wireup_ep.c:440 UCX DEBUG ep 0x7f91ffd94000: destroy wireup ep 0x76cb200
[1767926702.069989] [G02-100G-ASW016-M08:7164 :1] wireup_ep.c:440 UCX DEBUG ep 0x7f91ffd94000: destroy wireup ep 0x76caf00
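The UCX logs use epoch timestamps while the fault-injection output uses wall-clock times, so lining them up takes a conversion. The sketch below is one way to do it; `ucx_ts_to_datetime` is a hypothetical helper, and the UTC+8 offset for the injector host is an assumption inferred from the `[10:45:01]` recovery time versus the `02:45:02` crash time.

```python
from datetime import datetime, timedelta, timezone


def ucx_ts_to_datetime(stamp: str, utc_offset_hours: int = 0) -> datetime:
    """Convert a UCX log stamp such as '[1767926702.065736]' to a datetime.

    The seconds and microseconds are parsed as integers to avoid
    floating-point rounding of the fractional part.
    """
    seconds, _, micros = stamp.strip("[]").partition(".")
    tz = timezone(timedelta(hours=utc_offset_hours))
    base = datetime.fromtimestamp(int(seconds), tz=tz)
    return base.replace(microsecond=int(micros.ljust(6, "0")[:6]))


# Client-side close_nbx line from the log above.
print(ucx_ts_to_datetime("[1767926702.065736]").isoformat())
# → 2026-01-09T02:45:02.065736+00:00

# Same instant in the assumed local zone of the fault injector (UTC+8).
print(ucx_ts_to_datetime("[1767926702.065736]", utc_offset_hours=8).isoformat())
# → 2026-01-09T10:45:02.065736+08:00
```

Under that offset assumption, the client log entries around `1767926702.06` fall about one second after the `ifup` recovery at `[10:45:01]`.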
Describe the bug
Operation
10.10.16.168
coredump:
Steps to Reproduce
Setup and versions
cat /etc/mlnx-release (the string identifies software and firmware setup)
ibstat or ibv_devinfo -vv command
More Info
server ucx log:
client ucx log