Skip to content

perf: optimize conflict resolution algorithm in AnonymizerEngine#1797

Closed
Ray0907 wants to merge 3 commits intomicrosoft:mainfrom
Ray0907:main
Closed

perf: optimize conflict resolution algorithm in AnonymizerEngine#1797
Ray0907 wants to merge 3 commits intomicrosoft:mainfrom
Ray0907:main

Conversation

@Ray0907
Copy link
Copy Markdown

@Ray0907 Ray0907 commented Nov 29, 2025

Change Description

  • Replace O(n) list.remove() with O(1) set operations for tracking active indices
  • Use pop() instead of remove() in whitespace merge since prev_result is always last
  • Convert list comprehension to generator expression for short-circuit evaluation
  • Pre-compile regex pattern to avoid repeated compilation

Issue reference

Fixes #XX

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

  - Replace O(n) list.remove() with O(1) set operations for tracking active
  indices
  - Use pop() instead of remove() in whitespace merge since prev_result is
  always last
  - Convert list comprehension to generator expression for short-circuit
  evaluation
  - Pre-compile regex pattern to avoid repeated compilation
@Ray0907
Copy link
Copy Markdown
Author

Ray0907 commented Nov 29, 2025

@microsoft-github-policy-service agree

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to optimize the conflict resolution algorithm in AnonymizerEngine by replacing O(n) list operations with O(1) set operations and making other performance improvements.

Changes:

  • Replace list.remove() with set-based active_indices tracking in merge and conflict resolution phases
  • Add outer loop with re-sorting to REMOVE_INTERSECTIONS algorithm
  • Optimize whitespace merging with pre-compiled regex and pop() instead of remove()
  • Convert list comprehension to generator expression for short-circuit evaluation

Comment on lines +204 to 225
# Process until no more overlaps require re-sorting
needs_resort = True
while needs_resort:
needs_resort = False
index = 0
elements_length = len(unique_text_metadata_elements)
while index < elements_length - 1:
current_entity = unique_text_metadata_elements[index]
next_entity = unique_text_metadata_elements[index + 1]
if current_entity.end <= next_entity.start:
index += 1
else:
current_entity.end = next_entity.start
if current_entity.score >= next_entity.score:
next_entity.start = current_entity.end
else:
current_entity.end = next_entity.start
needs_resort = True
break
if needs_resort:
unique_text_metadata_elements.sort(
key=lambda element: element.start
)
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of the outer loop with needs_resort flag significantly changes the algorithm behavior and reduces performance. The original algorithm correctly handles overlaps by staying at the same index after adjusting boundaries, then re-checking the pair in the next iteration. This is efficient - only the affected pair is re-checked.

The new algorithm breaks after each adjustment, re-sorts the entire list O(n log n), and restarts from index 0, re-checking all pairs. This is less efficient because:

  1. Multiple sorts instead of one initial sort
  2. Re-checking previously resolved pairs
  3. The original algorithm already correctly resolves all overlaps without re-sorting

This change contradicts the PR's stated goal of performance optimization. The original single-pass algorithm with local re-checking should be preserved.

Copilot uses AI. Check for mistakes.
Comment on lines 262 to 267
@staticmethod
def __is_result_conflicted_with_other_elements(other_elements, result):
# Use generator expression for short-circuit evaluation
return any(
[result.has_conflict(other_element) for other_element in other_elements]
result.has_conflict(other_element) for other_element in other_elements
)
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method __is_result_conflicted_with_other_elements is no longer called anywhere in the code after the refactoring. The functionality has been inlined at line 187. This method should be removed to avoid maintaining dead code.

Copilot uses AI. Check for mistakes.
return []

# Pre-compile regex pattern for efficiency
whitespace_pattern = re.compile(r"^( )+$")
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern r"^( )+$" contains unnecessary parentheses around the space character. This can be simplified to r"^ +$" for better readability without changing functionality.

Suggested change
whitespace_pattern = re.compile(r"^( )+$")
whitespace_pattern = re.compile(r"^ +$")

Copilot uses AI. Check for mistakes.
return []

# Pre-compile regex pattern for efficiency
whitespace_pattern = re.compile(r"^( )+$")
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern is compiled inside the method, which means it gets recompiled on every call to _merge_entities_with_whitespace_between. For better performance, consider compiling the pattern once at the class or module level (e.g., as a class constant: WHITESPACE_PATTERN = re.compile(r"^ +$")).

Copilot uses AI. Check for mistakes.
@omri374
Copy link
Copy Markdown
Collaborator

omri374 commented Apr 8, 2026

@Ray0907 are you interested to continue working on this PR? If yes, could you please address the comments?

@Ray0907
Copy link
Copy Markdown
Author

Ray0907 commented Apr 8, 2026

@Ray0907 are you interested to continue working on this PR? If yes, could you please address the comments?

Sure, I'd love to.

@SharonHart SharonHart closed this Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants