Skip to content

Speed up multiline regexes anchored by ^. #148762

@haampie

Description

@haampie

Feature or enhancement

Proposal:

Very similar to #148729. I noticed that for regexes of the form

re.compile("^foo", re.MULTILINE)

there's still a character by character loop calling SRE(match).

Replacing this with memchr scanning for \n in the single-byte character string case gives speedups of about 8x on a macbook M4.

A patch would look like this:

diff --git a/Modules/_sre/sre_lib.h b/Modules/_sre/sre_lib.h
index 4f1269988b9..11d341ea2ab 100644
--- a/Modules/_sre/sre_lib.h
+++ b/Modules/_sre/sre_lib.h
@@ -1869,12 +1869,42 @@ SRE(search)(SRE_STATE* state, SRE_CODE* pattern)
             state->start = state->ptr = ptr = end;
             return 0;
         }
-        while (status == 0 && ptr < end) {
-            ptr++;
-            RESET_CAPTURE_GROUP();
-            TRACE(("|%p|%p|SEARCH\n", pattern, ptr));
-            state->start = state->ptr = ptr;
-            status = SRE(match)(state, pattern, 0);
+        if (pattern[0] == SRE_OP_AT &&
+            pattern[1] == SRE_AT_BEGINNING_LINE)
+        {
+            /* Skip to line boundary */
+            while (status == 0 && ptr < end) {
+                ptr++;
+                if (!SRE_IS_LINEBREAK((int) ptr[-1])) {
+#if SIZEOF_SRE_CHAR == 1
+                    ptr = (SRE_CHAR *)memchr(ptr, '\n', end - ptr);
+                    if (!ptr) {
+                        break;
+                    }
+#else
+                    while (ptr < end && !SRE_IS_LINEBREAK((int) *ptr)) {
+                        ptr++;
+                    }
+                    if (ptr >= end) {
+                        break;
+                    }
+#endif
+                    ptr++;
+                }
+                RESET_CAPTURE_GROUP();
+                TRACE(("|%p|%p|SEARCH\n", pattern, ptr));
+                state->start = state->ptr = ptr;
+                status = SRE(match)(state, pattern, 0);
+            }
+        }
+        else {
+            while (status == 0 && ptr < end) {
+                ptr++;
+                RESET_CAPTURE_GROUP();
+                TRACE(("|%p|%p|SEARCH\n", pattern, ptr));
+                state->start = state->ptr = ptr;
+                status = SRE(match)(state, pattern, 0);
+            }
         }
     }
 

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions