Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed index counting bug in attribution #89

Merged
merged 3 commits into from
Jul 14, 2022

Conversation

whitead
Copy link
Contributor

@whitead whitead commented Jun 8, 2022

I've revised the indexing added in 2.1.0 to follow SMILES string index exactly and SELFIES token index exactly. Previously it used the current length of the SMILES string list which could contain empty elements. In SELFIES, it incorrectly treated order of index. This PR also increments version/updates changelog.

New behavior:

SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]
SMILES C1NC(P)CC1
Attribution:
AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')])
AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')])
AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')])
AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')])
AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')])
SELFIES [C][N][C][C][Branch1][P][C][C][Ring1][=Branch1]
SMILES C1NC(P)CC1
Attribution:
AttributionMap(index=0, token='[C]', attribution=[Attribution(index=0, token='C')])
AttributionMap(index=1, token='[N]', attribution=[Attribution(index=2, token='N')])
AttributionMap(index=2, token='[C]', attribution=[Attribution(index=3, token='C')])
AttributionMap(index=5, token='[P]', attribution=[Attribution(index=5, token='P')])
AttributionMap(index=3, token='[C]', attribution=[Attribution(index=5, token='P')])
AttributionMap(index=4, token='[Branch1]', attribution=[Attribution(index=5, token='P')])
AttributionMap(index=6, token='[C]', attribution=[Attribution(index=7, token='C')])
AttributionMap(index=7, token='[C]', attribution=[Attribution(index=8, token='C')])
AttributionMap(index=8, token='[Ring1]', attribution=None)
AttributionMap(index=9, token='[=Branch1]', attribution=None)

@whitead
Copy link
Contributor Author

whitead commented Jun 9, 2022

@MarioKrenn6240 @akshat998 - It looks like all the dataset tests failed because the SMILES couldn't parse correctly. Maybe we need to pin a dependency here?

@MarioKrenn6240
Copy link
Collaborator

@MarioKrenn6240 @akshat998 - It looks like all the dataset tests failed because the SMILES couldn't parse correctly. Maybe we need to pin a dependency here?

we are on it, thanks!

# account for branch symbol because it is inserted after
for i in range(start, end):
attribution_maps[i].index += len(Q_as_symbols) + 1
derived.append(branch_symbol)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay! I believe the reason the tests are failing might be the moving of this line, since it swaps the order of SELFIES index and branch symbols.

@@ -442,7 +446,7 @@ def _derive_smiles_from_fragment(
token = atom_to_smiles(curr_atom)
derived.append(token)
attribution_maps.append(AttributionMap(
len(derived) - 1 + attribution_index,
_strlen(derived) - 1 + attribution_index,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, I am a little hesitant to make this change because it joins the entire derived list together every call. Would it be possible to implement this as some sort of running counter instead?

@whitead
Copy link
Contributor Author

whitead commented Jul 11, 2022

Hi @alstonlo - you have a sharp eye! I did not spot that line switch - thanks.

For the second comment about timing, I did try what you said and found it to add quite a bit of complexity to the code because there are possible empty tokens added that require if/else statements for a counter approach. I decided to just join them (usually quite fast) to count instead.

I've put timing information below. Some paths are slower/faster on PR so I don't think it's a meaningful impact on performance.

master 0f0061a

16.10s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
10.49s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.98s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.93s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
6.23s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]
6.11s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path13]
4.94s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path10]
3.60s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path2]
2.23s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path4]
1.98s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path5]
1.86s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path12]
1.58s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path6]
0.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path7]
0.24s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path8]
0.04s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path0]

PR e51e3dc

15.74s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
10.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
10.04s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.71s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
6.45s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]
5.91s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path13]
4.80s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path10]
3.54s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path2]
2.27s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path4]
2.03s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path5]
1.86s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path12]
1.60s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path6]
0.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path7]
0.24s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path8]
0.04s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path0]

Copy link
Collaborator

@alstonlo alstonlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@MarioKrenn6240 MarioKrenn6240 merged commit be4db25 into aspuru-guzik-group:master Jul 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants