cuda.bindings benchmarks part 5 #1964

Open
danielfrg wants to merge 2 commits into main from cuda-bindings-bench-part-5

@danielfrg danielfrg commented Apr 22, 2026

Description

Follow-up to #1580.

This PR is for discussing and improving the C++ harness so that data collection is more stable.

I added a min-time flag, and the compare step now reports the min/std.
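As a rough illustration of what a min-time setting does, here is a minimal Python sketch. It is not the harness code from this PR (the harness is C++, and its actual flag name and loop are not shown here); `collect_samples`, `summarize`, `min_time_s`, and `batch` are hypothetical names. The idea is to keep timing batches until a wall-clock budget is spent, then report mean, min, and standard deviation over the batches.

```python
import statistics
import time


def collect_samples(fn, min_time_s=0.05, batch=1000):
    """Time `fn` in batches until at least `min_time_s` has elapsed.

    Returns one per-call duration (in ns) per batch. All names here are
    hypothetical; this only illustrates the min-time idea.
    """
    budget_ns = int(min_time_s * 1e9)
    samples = []
    start = time.perf_counter_ns()
    while True:
        t0 = time.perf_counter_ns()
        for _ in range(batch):
            fn()
        t1 = time.perf_counter_ns()
        samples.append((t1 - t0) / batch)
        # Stop only once the time budget is used up, so there is always
        # at least one sample and longer budgets yield more samples.
        if t1 - start >= budget_ns:
            break
    return samples


def summarize(samples):
    """Reduce per-batch timings to the statistics a compare step could show."""
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "mean": statistics.fmean(samples),
        "min": min(samples),
        "std": std,
        "n": len(samples),
    }
```

A larger `min_time_s` gives more batches per benchmark, which is the mechanism by which collecting for longer can lower the reported spread.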

This is the latest run I got, where most of the RSDs are <2%.

stream.stream_create_destroy went down from about 8% to 6% when I collected data for longer.

-----------------------------------------------------------------------------------------------------
Benchmark                                   C++ (mean)   C++ RSD   Python (mean)   Py RSD    Overhead
-----------------------------------------------------------------------------------------------------
ctx_device.ctx_get_current                        6 ns      1.6%          111 ns     1.3%     +105 ns
ctx_device.ctx_get_device                         9 ns      1.5%          116 ns     1.7%     +107 ns
ctx_device.ctx_set_current                        8 ns      1.0%          101 ns     2.8%      +93 ns
ctx_device.device_get                             6 ns      4.4%          127 ns     2.4%     +121 ns
ctx_device.device_get_attribute                   8 ns      1.6%          190 ns     1.5%     +182 ns
event.event_create_destroy                       79 ns      1.3%          307 ns     2.0%     +228 ns
event.event_query                                77 ns      1.2%          210 ns     7.1%     +133 ns
event.event_record                               88 ns      1.2%          221 ns     2.2%     +133 ns
event.event_synchronize                          98 ns      1.2%          225 ns     3.2%     +128 ns
launch.launch_16_args                          1.59 us      1.2%         3.12 us     1.6%    +1528 ns
launch.launch_16_args_pre_packed               1.57 us      0.5%         1.97 us     0.3%     +395 ns
launch.launch_2048b                            1.68 us      1.9%         2.50 us     1.6%     +820 ns
launch.launch_256_args                         2.34 us      0.9%        16.50 us     1.8%   +14158 ns
launch.launch_512_args                         3.33 us      0.8%        31.31 us     1.7%   +27973 ns
launch.launch_512_args_pre_packed              3.37 us      1.2%         3.77 us     0.8%     +392 ns
launch.launch_512_bools                        3.39 us      0.7%        57.82 us     3.2%   +54435 ns
launch.launch_512_bytes                        3.40 us      0.7%        59.53 us     2.8%   +56134 ns
launch.launch_512_doubles                      3.32 us      0.7%        86.76 us     3.6%   +83438 ns
launch.launch_512_ints                         3.18 us      0.8%        60.59 us     3.6%   +57411 ns
launch.launch_512_longlongs                    3.31 us      0.7%        65.91 us     2.4%   +62605 ns
launch.launch_empty_kernel                     1.62 us      1.9%         1.87 us     1.4%     +250 ns
launch.launch_small_kernel                     1.57 us      0.3%         2.22 us     1.2%     +651 ns
memory.mem_alloc_async_free_async               386 ns      1.1%          747 ns     1.7%     +361 ns
memory.mem_alloc_free                          1.55 us      2.2%         1.98 us     1.5%     +429 ns
memory.memcpy_dtod                             2.07 us      0.5%         2.29 us     1.4%     +219 ns
memory.memcpy_dtoh                             4.99 us      0.3%         5.42 us     0.8%     +437 ns
memory.memcpy_htod                             3.94 us      0.2%         4.00 us     1.0%      +61 ns
module.func_get_attribute                        14 ns      1.5%          212 ns     1.8%     +198 ns
module.module_get_function                       33 ns      1.7%          179 ns     1.9%     +146 ns
module.module_load_unload                      7.73 us      0.7%         8.26 us     1.4%     +528 ns
nvrtc.nvrtc_compile_program                 7194.82 us      1.2%      7294.98 us     1.1%  +100158 ns
nvrtc.nvrtc_create_program                       68 ns      1.2%          670 ns     1.8%     +603 ns
nvrtc.nvrtc_create_program_100_headers        11.08 us      3.7%        13.14 us     3.0%    +2052 ns
pointer_attributes.pointer_get_attribute         27 ns      1.6%          487 ns     2.1%     +460 ns
stream.stream_create_destroy                   3.48 us      6.1%         3.53 us     1.6%      +57 ns
stream.stream_query                              84 ns      1.5%          217 ns     2.2%     +133 ns
stream.stream_synchronize                       115 ns      1.1%          243 ns     2.8%     +128 ns
-----------------------------------------------------------------------------------------------------
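For readers unfamiliar with the columns, the sketch below shows one way the RSD and Overhead values could be derived. It assumes RSD is the sample standard deviation divided by the mean and that Overhead is the Python mean minus the C++ mean; the PR does not state which standard deviation estimator the compare script uses, and `rsd_percent` and `overhead_ns` are hypothetical helper names.

```python
import statistics


def rsd_percent(samples):
    """Relative standard deviation in percent (sample std / mean).

    Assumption: the compare script may use a different estimator.
    """
    return 100.0 * statistics.stdev(samples) / statistics.fmean(samples)


def overhead_ns(cpp_mean_ns, py_mean_ns):
    """Absolute per-call cost added on the Python side, in nanoseconds."""
    return py_mean_ns - cpp_mean_ns


# Means taken from the ctx_device.ctx_get_current row of the table.
print(f"{overhead_ns(6, 111):+d} ns")  # +105 ns
```

The Overhead column appears to be computed from unrounded means: for launch.launch_16_args the displayed means differ by 1530 ns, while the table reports +1528 ns.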

This makes me feel a bit better about those numbers, but the Python ones do seem more stable, except for event.event_query.

If we think it's worth doing process isolation for the C++ harness, let me know and I will take a look.
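To make the process isolation idea concrete, here is a minimal sketch of a driver that runs each benchmark in its own fresh process and collects the results. It assumes each benchmark can be invoked as a separate command that prints one JSON object to stdout; the C++ harness in this PR may not support that, and `run_isolated`, `run_all`, and `fake_benchmark_cmd` are hypothetical names, with the child process replaced by a stand-in Python one-liner.

```python
import json
import subprocess
import sys


def run_isolated(cmd, timeout_s=60):
    """Run one benchmark command in a fresh process and parse its JSON stdout."""
    proc = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=timeout_s
    )
    return json.loads(proc.stdout)


def run_all(commands):
    """Run every benchmark in its own process so state cannot leak between them."""
    return {name: run_isolated(cmd) for name, cmd in commands.items()}


def fake_benchmark_cmd(mean_ns):
    """Stand-in for a real benchmark binary: prints a fixed JSON result."""
    code = f"import json; print(json.dumps({{'mean_ns': {mean_ns}}}))"
    return [sys.executable, "-c", code]
```

With a real harness, each command would instead point at the benchmark executable plus whatever argument selects a single benchmark. The trade-off is extra startup cost per benchmark (process launch, context creation) in exchange for results that do not depend on what ran earlier in the same process.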

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danielfrg danielfrg requested review from mdboom and rwgk April 22, 2026 18:47
@danielfrg danielfrg self-assigned this Apr 22, 2026
@danielfrg danielfrg added the cuda.bindings and performance labels Apr 22, 2026
@danielfrg danielfrg added this to the cuda.core v1.0.0 milestone Apr 22, 2026