perf: optimize conflict resolution algorithm in AnonymizerEngine#1797
perf: optimize conflict resolution algorithm in AnonymizerEngine#1797Ray0907 wants to merge 3 commits intomicrosoft:mainfrom
Conversation
- Replace O(n) list.remove() with O(1) set operations for tracking active indices - Use pop() instead of remove() in whitespace merge since prev_result is always last - Convert list comprehension to generator expression for short-circuit evaluation - Pre-compile regex pattern to avoid repeated compilation
|
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Pull request overview
This PR aims to optimize the conflict resolution algorithm in AnonymizerEngine by replacing O(n) list operations with O(1) set operations and making other performance improvements.
Changes:
- Replace list.remove() with set-based active_indices tracking in merge and conflict resolution phases
- Add outer loop with re-sorting to REMOVE_INTERSECTIONS algorithm
- Optimize whitespace merging with pre-compiled regex and pop() instead of remove()
- Convert list comprehension to generator expression for short-circuit evaluation
| # Process until no more overlaps require re-sorting | ||
| needs_resort = True | ||
| while needs_resort: | ||
| needs_resort = False | ||
| index = 0 | ||
| elements_length = len(unique_text_metadata_elements) | ||
| while index < elements_length - 1: | ||
| current_entity = unique_text_metadata_elements[index] | ||
| next_entity = unique_text_metadata_elements[index + 1] | ||
| if current_entity.end <= next_entity.start: | ||
| index += 1 | ||
| else: | ||
| current_entity.end = next_entity.start | ||
| if current_entity.score >= next_entity.score: | ||
| next_entity.start = current_entity.end | ||
| else: | ||
| current_entity.end = next_entity.start | ||
| needs_resort = True | ||
| break | ||
| if needs_resort: | ||
| unique_text_metadata_elements.sort( | ||
| key=lambda element: element.start | ||
| ) |
There was a problem hiding this comment.
The addition of the outer loop with needs_resort flag significantly changes the algorithm behavior and reduces performance. The original algorithm correctly handles overlaps by staying at the same index after adjusting boundaries, then re-checking the pair in the next iteration. This is efficient - only the affected pair is re-checked.
The new algorithm breaks after each adjustment, re-sorts the entire list O(n log n), and restarts from index 0, re-checking all pairs. This is less efficient because:
- Multiple sorts instead of one initial sort
- Re-checking previously resolved pairs
- The original algorithm already correctly resolves all overlaps without re-sorting
This change contradicts the PR's stated goal of performance optimization. The original single-pass algorithm with local re-checking should be preserved.
| @staticmethod | ||
| def __is_result_conflicted_with_other_elements(other_elements, result): | ||
| # Use generator expression for short-circuit evaluation | ||
| return any( | ||
| [result.has_conflict(other_element) for other_element in other_elements] | ||
| result.has_conflict(other_element) for other_element in other_elements | ||
| ) |
There was a problem hiding this comment.
The method __is_result_conflicted_with_other_elements is no longer called anywhere in the code after the refactoring. The functionality has been inlined at line 187. This method should be removed to avoid maintaining dead code.
| return [] | ||
|
|
||
| # Pre-compile regex pattern for efficiency | ||
| whitespace_pattern = re.compile(r"^( )+$") |
There was a problem hiding this comment.
The regex pattern r"^( )+$" contains unnecessary parentheses around the space character. This can be simplified to r"^ +$" for better readability without changing functionality.
| whitespace_pattern = re.compile(r"^( )+$") | |
| whitespace_pattern = re.compile(r"^ +$") |
| return [] | ||
|
|
||
| # Pre-compile regex pattern for efficiency | ||
| whitespace_pattern = re.compile(r"^( )+$") |
There was a problem hiding this comment.
The regex pattern is compiled inside the method, which means it gets recompiled on every call to _merge_entities_with_whitespace_between. For better performance, consider compiling the pattern once at the class or module level (e.g., as a class constant: WHITESPACE_PATTERN = re.compile(r"^ +$")).
|
@Ray0907 are you interested to continue working on this PR? If yes, could you please address the comments? |
Sure, I'd love to. |
Change Description
Issue reference
Fixes #XX
Checklist