Create SIMD-accelerated version of compute_gru function#191
Open
Ameobea wants to merge 12 commits intoxiph:masterfrom
Open
Create SIMD-accelerated version of compute_gru function#191Ameobea wants to merge 12 commits intoxiph:masterfrom
compute_gru function#191Ameobea wants to merge 12 commits intoxiph:masterfrom
Conversation
* Add `-O3 -march=native` to the compiler flags in the autoconf/automake/autoetc. stuff * Optimize biquad filter implementation
a-rose
suggested changes
Jul 20, 2021
Author
|
@a-rose thank you very much for the review. I'm glad you were aware of the fact that FMA support doesn't always exist when AVX2 support does. |
* We already have the value in a register, so avoid spilling to stack and reading back
Update FMA check
|
Thanks for taking the time to look into it :) I have created a PR on your fork to improve SIMD detection: Ameobea#1 |
Fix SIMD flags detection
Author
|
I have also opened a pull request on the Xiph-run Gitlab instance referenced in the README: https://gitlab.xiph.org/xiph/rnnoise/-/merge_requests/2 |
JellyBrick
added a commit
to JellyBrick/noise-suppression-for-voice-simd
that referenced
this pull request
Feb 7, 2023
Hartmnt
pushed a commit
to mumble-voip/ReNameNoise
that referenced
this pull request
Apr 3, 2024
Original merge request: xiph/rnnoise#191
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I did some CPU profiling of pulseeffects / easyeffects which includes this library as a dependency for its noise reduction functionality. This led me to see that the
compute_grufunction was the hottest one in the whole application:The changes in this pull request create a new function
compute_gru_avxwhich has the same functionality ascompute_grubut uses SIMD intrinsics to accelerate the function dramatically. The main changes focus on computing the sums in the GRU function 8 at a time and using FMAs to combine multiplications and adds into a single operation, increasing accuracy as a result. This also serves to reduce the overhead of loop counter checking which my profiling let me to believe was the most expensive part of the whole function before these changes. Additionally, converting the weights (which are stored as 8-bit signed integers) is done 8 at a time using SIMD for an additional speedup.I also made some changes to the build configuration for compiler flags that use
-O3instead of-O2which yielded some benefits on my machine as well as pasing-march=nativewhich facilitates the SIMD used. If these CPU features aren't available, they will be disabled at build-time.After the optimizations, the
compute_gru_avxfunction uses only ~4.4% of the total CPU time compared to ~19.63% from before - a 4.45x speedup:Here is a compiler explorer that shows the full assembly produced by the optimized
compute_gru_avxfunction: https://c.godbolt.org/z/xzEGxj8neTesting done on my own machine using pulseeffects +
librnnoise.sobuilt with the optimized code seems to work identically to before with reduced CPU usage for the application.Let me know if you think this is something that you'd like to get merged into the project. I'm happy to make any changes necessary. There may be a better/different way you'd like to handle the CPU feature detection and and I'd love suggestions on how to handle that.