perf: optimize conflict resolution algorithm in AnonymizerEngine by Ray0907 · Pull Request #1797 · microsoft/presidio

Ray0907 · 2025-11-29T11:54:12Z

Change Description

Replace O(n) list.remove() with O(1) set operations for tracking active indices
Use pop() instead of remove() in whitespace merge since prev_result is always last
Convert list comprehension to generator expression for short-circuit evaluation
Pre-compile regex pattern to avoid repeated compilation
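
As an illustration of the first two points, here is a minimal sketch. It is not the PR's actual code, and `results`, `active_indices`, and `merged` are hypothetical names. Removing by value from a list scans the list, whereas discarding an index from a set or popping the last list element does not:

```python
# Hypothetical stand-ins for analyzer results; not the PR's real data.
results = ["PERSON", "LOCATION", "PHONE", "EMAIL"]

# Track which results are still active by index, so dropping one
# does not require scanning a list as list.remove() would.
active_indices = set(range(len(results)))
active_indices.discard(1)  # drop "LOCATION"; average O(1)
active_indices.discard(3)  # drop "EMAIL"

# Rebuild the surviving results in their original order.
kept = [r for i, r in enumerate(results) if i in active_indices]

# When the element to drop is known to be the last one, pop() removes
# it in O(1) instead of searching for it by value.
merged = ["PERSON", "PHONE"]
last = merged.pop()
```

The set only pays off when removals are frequent relative to list size; for a handful of results the difference is unlikely to be measurable.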

Issue reference

Fixes #XX

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

Ray0907 · 2025-11-29T11:55:47Z

@microsoft-github-policy-service agree

Copilot

Pull request overview

This PR aims to optimize the conflict resolution algorithm in AnonymizerEngine by replacing O(n) list operations with O(1) set operations and making other performance improvements.

Changes:

Replace list.remove() with set-based active_indices tracking in merge and conflict resolution phases
Add outer loop with re-sorting to REMOVE_INTERSECTIONS algorithm
Optimize whitespace merging with pre-compiled regex and pop() instead of remove()
Convert list comprehension to generator expression for short-circuit evaluation

Copilot · 2026-01-12T10:11:17Z

+            # Process until no more overlaps require re-sorting
+            needs_resort = True
+            while needs_resort:
+                needs_resort = False
+                index = 0
+                elements_length = len(unique_text_metadata_elements)
+                while index < elements_length - 1:
+                    current_entity = unique_text_metadata_elements[index]
+                    next_entity = unique_text_metadata_elements[index + 1]
+                    if current_entity.end <= next_entity.start:
+                        index += 1
                    else:
-                        current_entity.end = next_entity.start
+                        if current_entity.score >= next_entity.score:
+                            next_entity.start = current_entity.end
+                        else:
+                            current_entity.end = next_entity.start
+                        needs_resort = True
+                        break
+                if needs_resort:
                    unique_text_metadata_elements.sort(
                        key=lambda element: element.start
                    )


The addition of the outer loop with needs_resort flag significantly changes the algorithm behavior and reduces performance. The original algorithm correctly handles overlaps by staying at the same index after adjusting boundaries, then re-checking the pair in the next iteration. This is efficient - only the affected pair is re-checked.

The new algorithm breaks after each adjustment, re-sorts the entire list O(n log n), and restarts from index 0, re-checking all pairs. This is less efficient because:

Multiple sorts instead of one initial sort

Re-checking previously resolved pairs

The original algorithm already correctly resolves all overlaps without re-sorting

This change contradicts the PR's stated goal of performance optimization. The original single-pass algorithm with local re-checking should be preserved.
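
A rough sketch of the single-pass approach with local re-checking described above. This is not Presidio's code: `Span` and `trim_overlaps` are hypothetical names, and the score comparison is borrowed from the diff as an assumption. The sketch does not handle a span fully contained in its neighbour, where trimming can leave the list out of start order.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for an entity with start, end and score."""
    start: int
    end: int
    score: float


def trim_overlaps(spans):
    """Trim overlapping neighbours in one pass, re-checking a pair after each trim."""
    spans.sort(key=lambda s: s.start)  # single initial sort
    index = 0
    while index < len(spans) - 1:
        current, nxt = spans[index], spans[index + 1]
        if current.end <= nxt.start:
            index += 1  # no overlap: move on to the next pair
        elif current.score >= nxt.score:
            nxt.start = current.end  # higher score keeps its span
        else:
            current.end = nxt.start
        # After a trim, index is unchanged, so the same pair is re-checked.
    return spans


spans = trim_overlaps([Span(5, 15, 0.5), Span(0, 10, 0.9), Span(12, 20, 0.8)])
bounds = [(s.start, s.end) for s in spans]
```

With this input, each overlapping pair is trimmed once and passes the re-check on the next iteration, so no second sort is needed.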

Copilot · 2026-01-12T10:11:17Z

    @staticmethod
    def __is_result_conflicted_with_other_elements(other_elements, result):
+        # Use generator expression for short-circuit evaluation
        return any(
-            [result.has_conflict(other_element) for other_element in other_elements]
+            result.has_conflict(other_element) for other_element in other_elements
        )


The method __is_result_conflicted_with_other_elements is no longer called anywhere in the code after the refactoring. The functionality has been inlined at line 187. This method should be removed to avoid maintaining dead code.
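
For context on the generator change shown in this diff, a small sketch of the short-circuit behaviour. `has_conflict` here is a hypothetical function that records its calls, not the method on the result class:

```python
evaluated = []


def has_conflict(value):
    """Hypothetical conflict check that records each call."""
    evaluated.append(value)
    return value == 2


candidates = [1, 2, 3, 4]

# List comprehension: every element is checked before any() runs.
any([has_conflict(c) for c in candidates])
list_calls = len(evaluated)

evaluated.clear()

# Generator expression: any() stops at the first True result.
any(has_conflict(c) for c in candidates)
generator_calls = len(evaluated)
```

The saving depends on where the first conflict sits; if no element conflicts, both forms evaluate every element.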

Copilot · 2026-01-12T10:11:17Z

+            return []
+
+        # Pre-compile regex pattern for efficiency
+        whitespace_pattern = re.compile(r"^( )+$")


The regex pattern r"^( )+$" contains unnecessary parentheses around the space character. This can be simplified to r"^ +$" for better readability without changing functionality.

Suggested change

whitespace_pattern = re.compile(r"^( )+$")

whitespace_pattern = re.compile(r"^ +$")
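
A quick way to sanity-check that the two patterns agree, sketched over a few hand-picked sample strings (the samples are assumptions, not taken from the PR's tests):

```python
import re

grouped = re.compile(r"^( )+$")  # pattern from the PR
plain = re.compile(r"^ +$")      # suggested simplification

samples = ["", " ", "    ", " a ", "\t", "a"]

# Compare only whether each pattern matches, not the captured groups.
agreement = [bool(grouped.match(s)) == bool(plain.match(s)) for s in samples]
matches = [s for s in samples if plain.match(s)]
```

The only difference is that the grouped form captures the last space in group 1, which matters only if the caller reads that group.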

Copilot · 2026-01-12T10:11:18Z

+            return []
+
+        # Pre-compile regex pattern for efficiency
+        whitespace_pattern = re.compile(r"^( )+$")


The regex pattern is compiled inside the method, which means it gets recompiled on every call to _merge_entities_with_whitespace_between. For better performance, consider compiling the pattern once at the class or module level (e.g., as a class constant: WHITESPACE_PATTERN = re.compile(r"^ +$")).
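
A minimal sketch of the class-constant approach suggested here. `WhitespaceMerger` and `is_only_spaces` are hypothetical names used for illustration, not part of AnonymizerEngine:

```python
import re


class WhitespaceMerger:
    """Hypothetical class used only to show where the constant lives."""

    # Compiled once when the class body is executed, then shared by all calls.
    WHITESPACE_PATTERN = re.compile(r"^ +$")

    def is_only_spaces(self, text):
        """Return True if text consists of one or more space characters."""
        return bool(self.WHITESPACE_PATTERN.match(text))


merger = WhitespaceMerger()
checks = [merger.is_only_spaces(t) for t in ("  ", "", " x ")]
```

Python's `re` module keeps an internal cache of recently compiled patterns, so the gain from hoisting the constant may be small in practice; the main benefit is arguably clarity.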

omri374 · 2026-04-08T05:50:36Z

@Ray0907 are you interested to continue working on this PR? If yes, could you please address the comments?

Ray0907 · 2026-04-08T06:39:57Z

> @Ray0907 are you interested to continue working on this PR? If yes, could you please address the comments?

Sure, I'd love to.

github-actions Bot added the external label Nov 29, 2025

Merge branch 'main' into main

0b80309

RonShakutai requested a review from Copilot January 12, 2026 10:04

Copilot started reviewing on behalf of RonShakutai January 12, 2026 10:05

RonShakutai self-requested a review January 12, 2026 10:06

Copilot AI reviewed Jan 12, 2026

Merge branch 'main' into main

3cdd58c

SharonHart closed this Apr 15, 2026

Ray0907 wants to merge 3 commits into microsoft:main from Ray0907:main

4 participants
