Skip to content

Add DBSCAN clustering step to Sophronia city#951

Open
camacortespar wants to merge 8 commits into
next-exp:masterfrom
camacortespar:develop
Open

Add DBSCAN clustering step to Sophronia city#951
camacortespar wants to merge 8 commits into
next-exp:masterfrom
camacortespar:develop

Conversation

@camacortespar
Copy link
Copy Markdown

@camacortespar camacortespar commented Feb 24, 2026

Summary

This PR integrates a hit clustering step (using DBSCAN) into the Sophronia city workflow.

Key Changes

1. New Logic (reco/hits_functions.py)

  • Implemented cluster_tagger: A function that applies DBSCAN on an event-by-event basis.
  • It handles spatial anisotropy via scale_xy and scale_z parameters.
  • It preserves data structure and alignment perfectly.

2. Integration (cities/sophronia.py)

  • Added clustering_params to the city configuration.
  • Inserted the clustering step in the dataflow.
  • Backward Compatibility: If clustering_params is None (default), the city skips clustering and produces the exact same output structure as before (no cluster column).

3. Testing (cities/sophronia_test.py & reco/hits_functions_test.py)

  • Added unit tests for structure preservation, row alignment, and noise rejection in DBSCAN.
  • Added integration tests ensuring cluster column appears only when enabled.
  • Verified that enabling clustering does not corrupt other physical variables.

Configuration Example

To use this feature, add the following to the configuration file (these values are optimized for NEXT-100 detector geometry):

clustering_params = dict(
    min_samples  =     5,
    scale_xy     = 15.55,
    scale_z      =   4.0
)

Copy link
Copy Markdown
Collaborator

@gonzaponte gonzaponte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good. Here is the first round of comments.

Comment thread invisible_cities/cities/components.py Outdated
Comment thread invisible_cities/config/sophronia.conf Outdated
Comment thread invisible_cities/reco/hits_functions.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/conftest.py Outdated
Comment thread invisible_cities/reco/hits_functions.py Outdated
Comment thread invisible_cities/reco/hits_functions.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
Comment thread invisible_cities/reco/hits_functions_test.py Outdated
@camacortespar camacortespar force-pushed the develop branch 2 times, most recently from 7995532 to 2eadf57 Compare April 24, 2026 15:28
Copy link
Copy Markdown
Collaborator

@gonzaponte gonzaponte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

besides these two cosmetic changes, I think this is good to go. I've asked @jwaiton to take a quick look to make sure I didn't miss anything important, but this is essentially approved.

Comment thread invisible_cities/reco/hits_functions.py Outdated
Comment thread invisible_cities/reco/hits_functions.py Outdated
@jwaiton
Copy link
Copy Markdown
Member

jwaiton commented May 7, 2026

Very nice work! I only have one suggestion.

I'd suggest removing the eps parameter entirely from the process, and set it to 1/1.74 when you call DBSCAN.

The scaling applied with scale_xy & scale_z means that (if applied correctly) all neighbouring hits should be within ~1 unit distance from one another, right? If so, eps should always be 1 (or ~1.74 if you want to retain diagonal neighbours I believe).

This removes a parameter that is poorly explained even in DBSCANs own documentation (in my opinion), when it just means 'distance between two nodes'. If our distance will always be 1 or slightly less, there is no need for it 😸

@camacortespar
Copy link
Copy Markdown
Author

But do we want that? As I remember, the optimal parameters include eps = 3.
Might setting eps = 1/1.74 be too restrictive?
I'm concerning about this parameters selection 'cause from my experience with low-background data, it changes significantly the performance of the fiducial volume cut, which implies changes in the fiducial rate.

Would it be better having flexibility for the eps value?

@jwaiton
Copy link
Copy Markdown
Member

jwaiton commented May 8, 2026

But do we want that? As I remember, the optimal parameters include eps = 3.

Not sure where the optimal parameter is extracted, but with parameters of eps = 3, scale_xy = 16, scale_z = 4, you can end up with a cluster of non-neighbouring hits surviving. A simple example is as shown:

# generate XYZ events of
# (0,0,0), (31.1, 31.1, 0), (62.2, 62.2, 0), (100, 100, 100)

data = {
                'event': [ 0,  0,  0,    0],     #
                'X'    : [0., 31.1, 62.2, 100.], #
                'Y'    : [0., 31.1, 62.2, 100.], #  these are not neighbouring hits
                'Z'    : [0., 0, 0, 100.]        #
        }
df = pd.DataFrame(data)

test_params = dict(eps=3, min_samples=2, scale_xy=16, scale_z=4)
df_result   = cluster_tagger(df.copy(), **test_params)

print(df_result)

which results in:

   event      X      Y      Z  cluster
0      0    0.0    0.0    0.0        0
1      0   31.1   31.1    0.0        0
2      0   62.2   62.2    0.0        0
3      0  100.0  100.0  100.0       -1

visually with a 15.55mm pitch looks like so:
image
These should not be labelled as a cluster.

I'm concerning about this parameters selection 'cause from my experience with low-background data, it changes significantly the performance of the fiducial volume cut, which implies changes in the fiducial rate.

Surely if you alter the eps to 1.74, this would improve the fiducial volume cut rate, as more hits (like the ones shown above) would be labelled as noise, right? Apologies if I'm misunderstanding.

The value 1.74 is just $\text{eps} = \sqrt{x^2 + y^2 + z^2} = \sqrt{3} \simeq 1.73$, so 1.74 ensures that all hits within 1 maximal unit distance (diagonally in 3d) will be considered neighbours, contributing to a cluster.

The next distance that we want to avoid would be two hits that are exactly 2 pitch lengths apart (not neighbours). So for example: xyz = [(0,0,0), (0, 31.1, 0)]. Dividing by xy_scale=16 gives us 1.9, so an eps any higher than 1.9 will start allowing clusters to be made with non-neighbouring hits.

Would it be better having flexibility for the eps value?

We have flexibility in the eps value without changing it from 1, as it is directly related to your scale_xy and scale_z. We currently are setting these values such that they (more or less) normalise our grid to 1 in X, Y & Z, but this doesnt have to be the case.

Apologies if I'm explaining this poorly, I'd be happy to chat in a zoom call to try and clarify this :)

@camacortespar
Copy link
Copy Markdown
Author

Yes, I understand your point and if it is the case, I would fix eps = 1.74

But again, I'm still concerned with the clusterization process killing physical hits. For example, if a neighbour hit is not activated (for some threshold-related reason), but the next to it does it, the latter can be labelled as noise. I don't know how likely is this case.

Has the performance of this algorithm been studied using NEXT100 data? I thought it had, and the result had been the set of parameters eps = 3, min_samples = 5. Maybe I misunderstood. If we are sure that with one unit of distance is enough, we can fix eps value.

In any case, I'd like to discuss it next week in a meeting.

@jwaiton
Copy link
Copy Markdown
Member

jwaiton commented May 8, 2026

But again, I'm still concerned with the clusterization process killing physical hits. For example, if a neighbour hit is not activated (for some threshold-related reason), but the next to it does it, the latter can be labelled as noise. I don't know how likely is this case.

This is what my last comment is addressing. If this was the case, setting eps = 1.74 * 2 or scale_xy = 16*2, scale_z = 4*2 would do the same thing I believe, so there is no need for both. :)

I think the likelihood of the number of physical hits being on the order of 10 and being two pitch lengths apart for any hit is unlikely in NEXT-100, but I can check some x-rays/bremstrahlungs for this information (I believe @Ian0sborne would have this information at hand 😸 ).

Has the performance of this algorithm been studied using NEXT100 data? I thought it had, and the result had been the set of parameters eps = 3, min_samples = 5. Maybe I misunderstood. If we are sure that with one unit of distance is enough, we can fix eps value.

I studied/implemented a similar algorithm here, which I didn't provide defaults for, but used min_samples = 3 or 5, and relies on the equivalent value of $\text{eps} = \sqrt{3} \simeq 1.74$ as seen here. I believe Samuele developed this DBSCAN method initially, and the default values can be seen here, with eps = 2.3, xy_scale = 14.55, z_scale = 3.7.

For the scale in NEXT-100 (pitch = 15.55, maximum z bin = 4), this is a scaling of 1.06 in XY and 1.08 in Z, which would then be $\text{maximal unit distance} = \sqrt{1.06^2 + 1.06^2 + 1.08^2 } \simeq 1.85$. The next to next nearest neighbour (at its closest proximity) would be $\sqrt{0^2 + (1.06\cdot2)^2 + 0^2} = 2.12$, the nearest diagonal would be $\sqrt{(1.06)^2 + (1.06\cdot2)^2 + 0^2} = 2.37$. This means that with the default parameters previously given, the behaviour would be inconsistent (retain next-to-next nearest neighbours along the coordinates, but not the diagonals which I find odd).

Lets talk next week 👍

@camacortespar
Copy link
Copy Markdown
Author

As has been discussed, the eps value is fixed to 1.8. This ensures that any direct 3D neighbour (including corners) is considered as part of the cluster. Changes in this condition must be done by applying different scaling factors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants