Skip to content

Conversation

@JacksonBurns
Copy link
Contributor

Currently we use RingDecomposerLib for finding the Smallest Set of Smallest Rings and getting the Relevant Cycles. This package does not support Python 3.10+ and is thus blocking further upgrades to RMG.

@KnathanM in particular is looking to get RMG to Python 3.11 so as to add support for ChemProp v2.

I believe we can just use RDKit to do these operations instead. The original paper mentions that the functionality was being moved upstream to RDKit. With the help of AI I've taken just a first pass at reimplementing, with the special note that:

This PR will be a draft for now, as it is predicated on Python 3.9 already being available (which it nearly is in #2741)

Motivation or Problem

A clear and concise description of what what you're trying to fix or improve. Please reference any issues that this addresses.

Description of Changes

A clear and concise description of what what you've changed or added.

Testing

A clear and concise description of testing that you've done or plan to do.

Reviewer Tips

Suggestions for verifying that this PR works or other notes for the reviewer.

@JacksonBurns JacksonBurns marked this pull request as draft May 25, 2025 21:43
@JacksonBurns
Copy link
Contributor Author

Cantera 2.6 isn't available for Python 3.12, so this PR will also need to upgrade the Cantera version to 3 as mostly completed in #2751

@JacksonBurns
Copy link
Contributor Author

A note for the path forward on this PR - the get_relevant_cycles and get_smallest_set_of_smallest_rings functions will need to be moved to Molecule. Currently they are in Graph, and Graph has no way to send itself to RDKit in order to be subject to RDKit GetSymmSSSR but Molecule can be converted to RDKit via converted.to_rdkit_mol (also exposed as Molecule.to_rdkit_mol, and then RDKit can be used. This may have implications for inheritance elsewhere in the codebase, but should be OK.

Base automatically changed from feat/py39_rebase to main May 29, 2025 19:55
@jonwzheng jonwzheng self-assigned this Jun 30, 2025
@jonwzheng
Copy link
Contributor

I ended up moving all the functions in Graph that call get_relevant_cycles or get_smallest_set_of_smallest_rings to Molecule as well. Still have a couple of unit tests to update, but if this is the direction we want to take, hopefully this should be adequate

@jonwzheng
Copy link
Contributor

jonwzheng commented Jul 16, 2025

Addressed some of the failures, but still a few nontrivial challenges:

TODO list:

  • address e / electron representation if integrating to the new RDKit workflow
  • fix missing vdW bond order
  • make sure ring algorithms are implemented correctly
  • certain bond types like R need some sort of RDKit definition; or just never parse them into an RDKit mol object in the first places
  • how to represent certain surface bonds?
  • bond orders without definitions? issue from rdkit side or something else?
  • fill in remaining placeholder unit tests

The tests are just failing mainly for the aromatic compounds, indicating there's some issue with the implementation of the code. Since everything runs, the code "flow" seems clean, just need to iron out the implementation.

@jonwzheng
Copy link
Contributor

jonwzheng commented Jul 18, 2025

We are now very close to passing all of the unit tests. There are just a couple major bugs to resolve, namely:

  • how to handle the R/L cutting fragment labels in the RDKit mol object? need to replace with placeholder atom? also with surfaces?
  • some bond orders are weird now, so need to investigate where something is going wrong. Due to sanitization maybe?
  • I think we can undo the changes to AtomTypeTest
  • some surface glitches
  • can't kekulize some molecules: if this occurs, maybe back-up is to use increasingly strict flags? e.g. turn kekulization off, and finally turn all sanitize off?

@jonwzheng
Copy link
Contributor

One of the last failing tests is test_draw_hydrogen_bond_adsorbate, which actually fails due to an assertion error in the codebase:

https://github.com/ReactionMechanismGenerator/RMG-Py/blame/d0d95b3a6bb9cd5d873c8d5b552ec112d6ba9283/rmgpy/molecule/draw.py#L541

@rwest, since you're the last remaining dev who was involved in some capacity with this part of the code, do you know why we have that assert statement? I would think it's better to just plot a "dot" than to outright crash out.
I'm also not sure why this only popped up in this PR - maybe better to address any underlying problem in this code - but thought I'd ask, since this assert might cause problems in the future. There's no safeguards against passing a molecule with no backbone, unless there's something about how this is called that I'm not getting.

While we're at it, the _generate_coordinates also didn't use RDKit to draw if the charge != 0, but maybe that can be relaxed with new versions?

@rwest
Copy link
Member

rwest commented Jul 24, 2025

One of the last failing tests is test_draw_hydrogen_bond_adsorbate, which actually fails due to an assertion error in the codebase:

https://github.com/ReactionMechanismGenerator/RMG-Py/blame/d0d95b3a6bb9cd5d873c8d5b552ec112d6ba9283/rmgpy/molecule/draw.py#L541

@rwest, since you're the last remaining dev who was involved in some capacity with this part of the code, do you know why we have that assert statement?

I don't. I can see that if there is no straight chain backbone that a function to _find_straight_chain_backbone shouldn't work. It should either fail, and whatever called it deals with the exception (maybe a ValueError is better than AssertError, but either way it's an error to be handled). Or it could return None, but then the calling code would also have to be modified to handle that without crashing. Or the calling code shouldn't try calling it - but then it'd have to know there is no straight chain backbone (and perhaps the easiest way to find that out is to ask for it and fail.) So I think either way the calling code needs updating.

There's no safeguards against passing a molecule with no backbone, unless there's something about how this is called that I'm not getting.

I haven't investigated the full stack trace. But some safeguard, or exception handling, sounds appropriate.

While we're at it, the _generate_coordinates also didn't use RDKit to draw if the charge != 0, but maybe that can be relaxed with new versions?

We can try it. I expect RDKit at the time wasn't being helpful. Hopefully the commit messages offer clues? (this is why we should write helpful commit messages that explain why things are being done).

@rwest
Copy link
Member

rwest commented Jul 24, 2025

Is this related?

#2744

@jonwzheng
Copy link
Contributor

jonwzheng commented Jul 24, 2025

regarding the draw unit test

@rwest yes, thanks, #2744 looks relevant. The test that's failing is for this charged species:
image

maybe failing due to the missing X-O connectivity, and so no backbone?
I see the discussion in the thread. I'll move any further discussion about this to that issue.
Maybe a separate PR can fix that issue, merged in, and then this can be rebased onto main.

Edit: I tried modifying the connectivity s.t. there is an apparent backbone. But, I still run into this problem even after changing the adjacency list to

1  O u0 p3 c-1 {2,S} {10,H}
2  N u0 p1 c0 {1,S} {3,S} {4,S}
3  O u0 p2 c0 {2,S} {11,S}
4  O u0 p2 c0 {2,S} {7,S}
5  N u0 p1 c0 {6,S} {8,S} {9,S} {7,H}
6  O u0 p2 c0 {5,S} {10,S}
7  H u0 p0 c0 {4,S} {5,H}
8  H u0 p0 c0 {5,S}
9  H u0 p0 c0 {5,S}
10 H u0 p0 c0 {6,S} {1,H}
11 X u0 p0 c0 {3,S}

So it is more fundamentally a problem with the species , than a unit-test specific one.

Note about sanitization

For posterity: some molecules that fail the first kekulization step include:

Can't kekulize mol.  Unkekulized atoms
[H]C1=C([H])C([H])([H])C([H])([H])[c]([H])c1[c]([H])[H]
[H]C([H])=C([H])C1([H])[c]([H])c([H])[c]([H])C1([H])[H]
[H]C([H])=C1c([c]([H])[H])[c]([H])C([H])([H])C1([H])[H]
[H]C1=C([H])C([H])([H])C2([H])[c]([H])c([H])[c]([H])C2([H])C([H])=C1[H]
[H]C1~[C]([H])C2([H])[c](c([H])[c]~1[H])~C([H])~[C]([H])C2([H])C([H])([H])[H]
[H][c]1c([H])[c]([H])C([H])(C([H])([H])[H])C1([H])[H]
[H][c]([H])c1[c]([H])C([H])([H])c2c([H])c([H])c([H])c([H])c-12
[H][C]1~C([H])~[C]([H])C2([H])[C](C([H])([H])[c]([H])c2[c]([H])[H])~C~1[H]
[H][c]([H])c1[c]([H])C([H])([H])c2c([H])c([H])c([H])c([H])c-12
[H][c]1c([H])[c]([H])C([H])(C([H])([H])[H])C1([H])[H]
[H]C1~[C]([H])C2([H])[c](c([H])[c]~1[H])~C([H])~[C]([H])C2([H])[H]
[H][c]([H])c1[c]([H])C1([H])[H]
non-ring atom marked aromatic:
[H][C]~c[c]([H])C([H])([H])C1=C([H])C([H])=C([H])C([H])=C1[H]
[H]c1=c([H])-c([H])=C(C([H])([H])[c]([H])c([H])[c]([H])[H])c([H])=c-1[H]

As implemented in this PR, the RDKit sanitization step will fallback to not use kekulization.

@jonwzheng
Copy link
Contributor

jonwzheng commented Jul 24, 2025

Regarding the draw issue, I think I've figured out what's happening...

The test case is drawn as a cyclic molecule, even though it's not truly "cyclic". So it should take the find_cyclic_backbone branch rather than the find_straight_chain_backbone branch.

However, it thinks that len(self.cycles) = 0 because of the changes to ring perception, specifically molecule.get_smallest_set_of_smallest_rings: the H bonds are treated as RDKit hydrogen bonds which don't "count" toward the ring. So in RDKit world it's treated like a non-cyclic species, whereas RMG treats it as cyclic.

So, how to proceed? a few options...

  1. Prefer RDKit drawing: if ring perception is through RDKit, then drawing with RDKit is going to generally be safer. Probably the easiest option, as all we need to do is get rid of the check for if there's charged species. (I think it should work alright personally!)
  2. Change ring perception: instead of treating the H bonds as hydrogen bonds, treat them as an explicit bond?
  3. Create a separate function from SSSR that is used purely to see if there's "ring-like" structure regardless of the nature of the bonds, purely for drawing/graph structure type analysis.

@rwest
Copy link
Member

rwest commented Jul 24, 2025

I think using RDKit for drawing as much as possible is fine, if it does a good job. Probably we ran into issues in the past that may not persist with more recent versions of RDKit, so it's fine to revisit past decisions. We should continue to put reasons for things in commit messages, to make life easier for future developers (our future selves included).

@jonwzheng
Copy link
Contributor

OK. Maybe we can incorporate the fix into #2838 and then rebase onto main once it's merged in.

jonwzheng added a commit to kirkbadger18/RMG-Py that referenced this pull request Jul 25, 2025
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species
can have issues due to ring perception conflicts. The previous implementation
also by default did not use the rdkit backend for charged species, but
this was decided many years ago (~10 years!) In the meantime, RDKit
conformer generation has improved and likely this we can just use RDKit
by default, which would avoid the pesky edge-case issues for
ions/zwitterions.

In case the old behavior is desired, use_rdkit can be set to False.
jonwzheng added a commit to kirkbadger18/RMG-Py that referenced this pull request Jul 25, 2025
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species
can have issues due to ring perception conflicts. The previous implementation
also by default did not use the rdkit backend for charged species, but
this was decided many years ago (~10 years!) In the meantime, RDKit
conformer generation has improved and likely this we can just use RDKit
by default, which would avoid the pesky edge-case issues for
ions/zwitterions.

In case the old behavior is desired, use_rdkit can be set to False.
jonwzheng added a commit to kirkbadger18/RMG-Py that referenced this pull request Jul 25, 2025
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species
can have issues due to ring perception conflicts. The previous implementation
also by default did not use the rdkit backend for charged species, but
this was decided many years ago (~10 years!) In the meantime, RDKit
conformer generation has improved and likely this we can just use RDKit
by default, which would avoid the pesky edge-case issues for
ions/zwitterions.

In case the old behavior is desired, use_rdkit can be set to False.
JacksonBurns pushed a commit that referenced this pull request Aug 4, 2025
In #2744 and #2796 it was found that charge-separated bidentate species
can have issues due to ring perception conflicts. The previous implementation
also by default did not use the rdkit backend for charged species, but
this was decided many years ago (~10 years!) In the meantime, RDKit
conformer generation has improved and likely this we can just use RDKit
by default, which would avoid the pesky edge-case issues for
ions/zwitterions.

In case the old behavior is desired, use_rdkit can be set to False.
@JacksonBurns
Copy link
Contributor Author

With the merger of e92bec0 this PR's CI should now pass - let's see!

@JacksonBurns
Copy link
Contributor Author

@jonwzheng the overnight test failures look spurious - re-running them now.

rwest added 7 commits December 9, 2025 11:04
This should save complication and time. Also, don't sanitize.
All we're going to ask it for is ring information. 
For that all bonds (that we give it) count.
If we want it to ignore H-bonds, or vdW bonds, for some scenarios,
then we should figure that out. But often we use this ring detection 
code for things like drawing, which does want to use those bonds.
This reverts commit 55f8fc4.

We should, during ring detection, now be not attempting to convert bond
orders into strings, so this warning should not be triggered.
We can restore the debug level if this becomes too noisy.
We do a call to detect_cutting_label (which is a complicated function
to find them buried inside SMILES strings) passing it the label of
every atom. (Though maybe these are mostly ''?).

This commit checks whether a simpler approach would work: just
checking if the atom.label is either 'R' or 'L'.

If that's the case, we can then make the change.
As far as I can tell, cutting labels can only be "L" or "R".
I had a previous commit that checked this is equivalent, and 
by running the unit tests and regression tests could not find
a counterexample.
This check is much simpler, and about 200 times faster, than
the previous call to detect_cutting_label.
Also simplified the dictionary structure.
…rnings.

Previously, if you call to_rdkit_mol on a Fragment, it would always return
the mapping, but if you called it on a Molecule, it would by default not.

Now if you call it with return_mapping=False (or not specified) then it 
doesn't return the mapping, and if you call it with return_mapping=True,
then it does.

Internally, it calls converter.to_rdkit_mol with return_mapping=True,
but doesn't pass it back unless asked. Now the behavior of thing.to_rdkit_mol()
is the same whether thing is a Molecule or a Fragment (which is a subclass
of Molecule). 
Should reduce number of warnings, eg. when drawing fragments.

Also cleaned up the code a bit, and added some docstrings,
…_rdkit_mol

The latest change of to_rdkit_mol means that both Molecule and Fragment
return the same signature, and the calling code doesn't need to cope
with unexpected tuples.
The comment suggests it was going through Geometry in order to save
the mapping, but this can be done with a direct call to to_rdkit_mol
using return_mapping=True.
@JacksonBurns
Copy link
Contributor Author

@rwest I think we figured out the last little detail holding this up.

First and foremost, the two tests currently failing are testing the same thing: "can these molecule be correctly detected as bicyclic?" (so the second failing test is redundant).

Looking at the actual reason for the failure, it has to do with the way RDKit returns rings for ambiguous systems:

image

For these two molecules, there are three rings of size 6. The 'true' SSSR is two, so GetSSSR makes an arbitrary but reproducible choice of two of the rings, whereas GetSymmSSSR returns all three.

This is an exact case of "The SSSR Problem" described in the RDKit docs: https://www.rdkit.org/docs/GettingStartedInPython.html#the-sssr-problem

The exact solution here is not clear:

  • Previously RMG would make an arbitrary decision for the two rings (and in a non-reproducible way, leading to regression test failures, see get_deterministic_sssr is not really deterministic #2562) which we can easily do again
  • Maybe all three of these rings SHOULD be included in the functional group contributions to the overall thermo?
  • Is there a best choice for the two rings? Is there any remote way we can even hope to identify it in a programmable and general way?

I think the first option is the best one. But let me know your thoughts.

I expect that once we fix the bicyclic detection aspect of this, the numerical results of the tests will also change, but that's an easy fix.

@rwest
Copy link
Member

rwest commented Dec 9, 2025

It might be as simple as detecting bicyclics using RDKit's Chem.GetSSSR(m) which I think would return 2, in this case where len(Chem.GetSymmSSSR(m)) returns 3.

But I think it's important to know where this ring information is used. I think it is the fused ring decomposition algorithm for predicting thermo. If we don't have an exact drop-in replacement, then I fear at least all the groups will need re-training, and in worst case the entire algorithm changing.

Is there anyone still on the team who is familiar with that code / algorithm?

@JacksonBurns
Copy link
Contributor Author

I think that proposed solution would be fine. I can't think of many cases where this would be an issue, anyway.

My understanding is that this design choice will just change which rings are picked when the decision is arbitrary. The tress were presumably trained using some equally arbitrary decision, so I don't know if they need to be refit (?)

@jonwzheng do you know who might know this better?

@rwest
Copy link
Member

rwest commented Dec 9, 2025

I didn't catch until just now that they changed the API and GetSSSR now actually returns a sequence of rings, not just the count.
rdkit/rdkit#5437 I think made it into 2022.09 release - though I missed it when working on it earlier this year, because their online documentation is still inconsistent about this backwards-incompatible change.

@jonwzheng
Copy link
Contributor

I think that proposed solution would be fine. I can't think of many cases where this would be an issue, anyway.

My understanding is that this design choice will just change which rings are picked when the decision is arbitrary. The tress were presumably trained using some equally arbitrary decision, so I don't know if they need to be refit (?)

@jonwzheng do you know who might know this better?

I don't think any current members will know. @mjohnson541 did some of the automatic thermo and kinetic re-fitting, and before that @mliu49 worked on polycyclic corrections. There are some examples here like the correction for barrelene-related groups that seem to have some hints.

@JacksonBurns
Copy link
Contributor Author

JacksonBurns commented Dec 10, 2025

@rwest in that case, can we just go ahead and switch to GetSymmSSSR? GetSSSR?

@rwest
Copy link
Member

rwest commented Dec 10, 2025

Worth a try?

I'm a bit unsure how/if they solved the inconsistency problem that led them to earlier only report the number of SSSSR and not the set.

@JacksonBurns
Copy link
Contributor Author

@rwest I've changed over from GetSymmSSSR to GetSSSR in 043e78e

RDKit changed the return type in the 2022.09 release in response to this user request: rdkit/rdkit#5395

RMG-Py never got to use this because we were stuck with such an old version of RDKit for a long time, until the Python 3.9 upgrade (#2741) went in.

@JacksonBurns
Copy link
Contributor Author

JacksonBurns commented Dec 11, 2025

Commits 93bdede through ae3c879 are CI fixes unrelated to the rest of this PR:

  • GitHub has removed the macos-13 runner, so I took it out
  • Modern Python doesn't support building packages the way that we do it, so I disable the modern behavior inside our Makefile (--no-build-isolation)
  • I set a bound on the supported version of Python, so when doing conda env create --file environment.yml one does not get an invalid version of Python.

Build is now working locally on Ubuntu with Python 3.11, hopefully it works in the CI as well.

@JacksonBurns
Copy link
Contributor Author

Note to self:

  • check the failing test to see which species is causing it
  • revise the usage of symmetrized vs non within the molecule API
    • collapse all of the ring perception methods back down into one
  • remove the use of NotImplementedError as a first-class exception, it is not meant for that

 - cache the result of the call to rdkit to avoid remaking the molecule and running the decomposition
 - use GetSSSR only and deprecate other methods (and do so following conventions correctly)
  - add a test for the OPTIONAL symmetric SSSR passthrough
@JacksonBurns
Copy link
Contributor Author

Note to self: latest commit has something off with polycycle detection. RDKit does correctly detect the cycles for species like C(CCC1C2CCC1CC2)CC1CCC1 that is failing a test, but something is wrong with the way that detect that the resulting vertices are in fact polycycles...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get_deterministic_sssr is not really deterministic

5 participants