Performance: Optimize repeated regex and string operations #90

@anthropic-code-agent

Description

Optimize repeated pattern matching and string operations by caching results, reducing redundant regex calls, and using more efficient string manipulation methods.

Current Implementation Issues

R/MappedData.R - Nested gsub() calls

Lines 153-155: Nested string replacements

  • Pattern: gsub("_x", "", gsub("_y", "", ...))
  • Creates intermediate strings unnecessarily
  • Can be combined into single regex or cached
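
As a rough sketch of the single-regex option, the two nested calls could collapse into one character class. The column names below are made up for illustration and are not taken from R/MappedData.R.

```r
# Hypothetical column names; the real inputs in R/MappedData.R may differ.
cols <- c("time_x", "conc_y", "group")

# Current style: two passes and one intermediate character vector.
nested <- gsub("_x", "", gsub("_y", "", cols))

# Single pass: the character class [xy] matches either letter after "_".
combined <- gsub("_[xy]", "", cols)

# Both forms agree on these inputs.
stopifnot(identical(nested, combined))
```

The two forms are not strictly equivalent: on an input such as "__yx", removing "_y" first creates a new "_x" that the nested version then also removes, while the single pass leaves "_x" behind. If only trailing suffixes are meant, an anchored pattern such as "_[xy]$" may be the safer choice.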

Line 464: gsub() in loop

  • gsub("y", private$direction, ...) called repeatedly in loop
  • Same pattern replacement done multiple times
  • Should cache result or move outside loop
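
One way to move the replacement outside the loop is sketched below. The variables `direction` and `aesthetics` are hypothetical stand-ins for `private$direction` and whatever values line 464 iterates over; the actual loop body is not shown in this issue.

```r
# Hypothetical stand-ins for private$direction and the looped values.
direction <- "x"
aesthetics <- c("ymin", "ymax", "y")

# The replacement does not depend on loop state, so compute it once, vectorized.
renamed <- gsub("y", direction, aesthetics)

for (i in seq_along(aesthetics)) {
  # The loop body only looks up the precomputed value.
  message(aesthetics[i], " -> ", renamed[i])
}
```

This only applies if `private$direction` stays constant for the duration of the loop.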

R/utilities-defaults.R - Repeated pattern replacements

Lines 322-327: Multiple gsub() calls

  • gsub() called multiple times with pattern "ospsuite.plots.geom"
  • Same pattern used repeatedly
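  • Base R has no user-facing way to compile a regex once; instead, store the pattern in a constant and, since it is a literal string, match it with fixed = TRUE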
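
A minimal sketch of the constant-plus-literal-matching idea follows. The constant name `GEOM_PREFIX` and the option names are assumptions for illustration; only the prefix string itself comes from this issue.

```r
# Assumed constant name; the prefix string is the one quoted in the issue.
GEOM_PREFIX <- "ospsuite.plots.geom"

# Hypothetical option names for illustration.
opts <- c(
  "ospsuite.plots.geomLineAttributes",
  "ospsuite.plots.geomPointAttributes"
)

# fixed = TRUE treats the dots literally instead of as regex wildcards.
short <- gsub(GEOM_PREFIX, "", opts, fixed = TRUE)

# As a regex, "." matches any character, so an unintended string also matches.
regex_hit <- grepl(GEOM_PREFIX, "ospsuiteXplotsXgeom")
fixed_hit <- grepl(GEOM_PREFIX, "ospsuiteXplotsXgeom", fixed = TRUE)
```

Here `short` holds the names without the prefix, `regex_hit` is TRUE, and `fixed_hit` is FALSE, which shows that literal matching is also the more precise option for this pattern.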

R/utilities_export.R - Sequential grepl() calls

Lines 166-171: Multiple pattern checks

  • Multiple grepl() calls in sequence for pattern matching
  • Each call scans the entire string
  • Could combine patterns or use single regex with alternatives
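
The alternation approach might look like the sketch below. The file names and patterns are invented for illustration and are not the ones used in R/utilities_export.R.

```r
# Hypothetical file names and patterns, for illustration only.
files <- c("plot.png", "data.csv", "figure.pdf", "notes.txt")

# Sequential style: one scan per pattern.
is_figure_seq <- grepl("\\.png$", files) | grepl("\\.pdf$", files)

# Combined style: one scan using alternation.
is_figure <- grepl("\\.(png|pdf)$", files)

# Both approaches flag the same elements.
stopifnot(identical(is_figure_seq, is_figure))
```

Combining is only appropriate when every pattern leads to the same action. If each match triggers different code, the separate checks need to stay.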

Line 327: gsub() in loop

  • Character replacement inside loop
  • Same substitution applied repeatedly

R/utilities.R - Redundant string operations

Lines 177-180: Multiple trimws() calls

  • trimws(label) called twice
  • trimws(unit) called twice
  • Should cache trimmed values
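
Caching the trimmed values could be as simple as the sketch below. `format_axis_label` is a hypothetical helper written for this example; the real code at lines 177-180 may be structured differently.

```r
# Hypothetical helper; the real code in R/utilities.R may differ.
format_axis_label <- function(label, unit) {
  # Trim each input once and reuse the result below.
  label <- trimws(label)
  unit <- trimws(unit)

  # An empty unit means the label is returned on its own.
  if (nchar(unit) == 0) {
    return(label)
  }
  paste0(label, " [", unit, "]")
}

with_unit <- format_axis_label("  Time ", " h ")
without_unit <- format_axis_label("Concentration", "  ")
```

With these inputs, `with_unit` is "Time [h]" and `without_unit` is "Concentration".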

Suggested Implementation

1. Cache String Operations

Before:

for (item in items) {
  cleaned <- gsub("pattern", "replacement", item)
  process(cleaned)
}

After:

cleaned_items <- gsub("pattern", "replacement", items)  # Vectorized
for (cleaned in cleaned_items) {
  process(cleaned)
}

2. Combine Multiple Patterns

Before:

if (grepl("pattern1", text)) { }
if (grepl("pattern2", text)) { }

After (only valid when both branches run the same code):

if (grepl("pattern1|pattern2", text)) { }

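3. Vectorize Pattern Matching

Base R does not expose a precompiled regex object, so the gain here comes from making one vectorized call instead of one call per element.

Before:

for (text in texts) {
  matched <- grepl("complex.*pattern", text)
}

After:

matched <- grepl("complex.*pattern", texts)  # Vectorized: one logical value per element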

4. Use stringi for Performance

Consider using the stringi package for complex string operations. It is often faster than base R on large inputs, but it adds a dependency, so benchmark before adopting it.
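
If stringi is adopted, one cautious pattern is to use it only when installed and fall back to base R otherwise. The wrapper name `strip_xy` is hypothetical; `stringi::stri_replace_all_regex()` is the stringi counterpart of `gsub()` used here.

```r
# Hypothetical wrapper: prefers stringi when installed, otherwise uses base R.
strip_xy <- function(x) {
  if (requireNamespace("stringi", quietly = TRUE)) {
    stringi::stri_replace_all_regex(x, "_[xy]", "")
  } else {
    gsub("_[xy]", "", x)
  }
}

stripped <- strip_xy(c("time_x", "conc_y", "group"))
```

stringi uses the ICU regex engine, so more complex patterns should be checked for identical behavior before swapping them in.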

Expected Benefits

  • Faster string processing (estimated at up to 10x for large datasets; to be confirmed by benchmarks)
  • Reduced redundant regex compilation
  • Lower CPU usage
  • Better scalability for text-heavy operations

Implementation Notes

Files to Modify

  • R/MappedData.R
  • R/utilities-defaults.R
  • R/utilities_export.R
  • R/utilities.R

Testing

  • Run existing unit tests
  • Add tests for edge cases in string operations
  • Verify all pattern replacements work correctly
  • Test with special characters and Unicode
  • Consider performance benchmarks for large datasets
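
A benchmark could start from something like the sketch below, which uses only base R. The data is synthetic and the size is an arbitrary choice; timings will vary by machine, so no expected numbers are given.

```r
# Synthetic data; size and contents are arbitrary choices for this sketch.
items <- rep(c("time_x", "conc_y", "group"), times = 20000)

# Per-element calls, as in the loops flagged above.
loop_version <- function(x) {
  out <- character(length(x))
  for (i in seq_along(x)) {
    out[i] <- gsub("_[xy]", "", x[i])
  }
  out
}

# One vectorized call over the whole vector.
vectorized_version <- function(x) gsub("_[xy]", "", x)

# Results must match before timings mean anything.
stopifnot(identical(loop_version(items), vectorized_version(items)))

# system.time() is part of base R; elapsed times depend on the machine.
print(system.time(loop_version(items)))
print(system.time(vectorized_version(items)))
```

For more reliable measurements, a dedicated benchmarking package with repeated runs would be preferable to a single `system.time()` call.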
