Fixed index counting bug in attribution #89

whitead · 2022-06-08T21:35:57Z

I've revised the indexing added in 2.1.0 to follow SMILES string index exactly and SELFIES token index exactly. Previously it used the current length of the SMILES string list which could contain empty elements. In SELFIES, it incorrectly treated order of index. This PR also increments version/updates changelog.

New behavior:

SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]
SMILES C1NC(P)CC1
Attribution:
AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')])
AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')])
AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')])
AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')])
AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')])
SELFIES [C][N][C][C][Branch1][P][C][C][Ring1][=Branch1]
SMILES C1NC(P)CC1
Attribution:
AttributionMap(index=0, token='[C]', attribution=[Attribution(index=0, token='C')])
AttributionMap(index=1, token='[N]', attribution=[Attribution(index=2, token='N')])
AttributionMap(index=2, token='[C]', attribution=[Attribution(index=3, token='C')])
AttributionMap(index=5, token='[P]', attribution=[Attribution(index=5, token='P')])
AttributionMap(index=3, token='[C]', attribution=[Attribution(index=5, token='P')])
AttributionMap(index=4, token='[Branch1]', attribution=[Attribution(index=5, token='P')])
AttributionMap(index=6, token='[C]', attribution=[Attribution(index=7, token='C')])
AttributionMap(index=7, token='[C]', attribution=[Attribution(index=8, token='C')])
AttributionMap(index=8, token='[Ring1]', attribution=None)
AttributionMap(index=9, token='[=Branch1]', attribution=None)

whitead · 2022-06-09T14:30:32Z

@MarioKrenn6240 @akshat998 - It looks like all the dataset tests failed because the SMILES couldn't parse correctly. Maybe we need to pin a dependency here?

MarioKrenn6240 · 2022-06-10T14:56:11Z

@MarioKrenn6240 @akshat998 - It looks like all the dataset tests failed because the SMILES couldn't parse correctly. Maybe we need to pin a dependency here?

we are on it, thanks!

alstonlo · 2022-07-10T16:46:02Z

selfies/encoder.py

+ # account for branch symbol because it is inserted after
+ for i in range(start, end):
+ attribution_maps[i].index += len(Q_as_symbols) + 1
+ derived.append(branch_symbol)


Sorry for the delay! I believe the reason the tests are failing might be the moving of this line, since it swaps the order of SELFIES index and branch symbols.

alstonlo · 2022-07-10T16:50:53Z

selfies/utils/smiles_utils.py

@@ -442,7 +446,7 @@ def _derive_smiles_from_fragment(
 token = atom_to_smiles(curr_atom)
 derived.append(token)
 attribution_maps.append(AttributionMap(
- len(derived) - 1 + attribution_index,
+ _strlen(derived) - 1 + attribution_index,


In addition, I am a little hesitant to make this change because it joins the entire derived list together every call. Would it be possible to implement this as some sort of running counter instead?

whitead · 2022-07-11T20:15:30Z

Hi @alstonlo - you have a sharp eye! I did not spot that line switch - thanks.

For the second comment about timing, I did try what you said and found it to add quite a bit of complexity to the code because there are possible empty tokens added that require if/else statements for a counter approach. I decided to just join them (usually quite fast) to count instead.

I've put timing information below. Some paths are slower/faster on PR so I don't think it's a meaningful impact on performance.

master 0f0061a

16.10s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
10.49s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.98s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.93s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
6.23s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]
6.11s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path13]
4.94s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path10]
3.60s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path2]
2.23s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path4]
1.98s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path5]
1.86s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path12]
1.58s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path6]
0.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path7]
0.24s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path8]
0.04s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path0]

PR e51e3dc

15.74s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
10.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
10.04s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.71s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
6.45s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]
5.91s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path13]
4.80s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path10]
3.54s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path2]
2.27s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path4]
2.03s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path5]
1.86s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path12]
1.60s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path6]
0.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path7]
0.24s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path8]
0.04s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path0]

alstonlo

Thanks!

Fixed index counting bug in attribution

679f08d

whitead force-pushed the master branch from 2b042d1 to 679f08d Compare June 9, 2022 00:27

alstonlo reviewed Jul 10, 2022

View reviewed changes

whitead added 2 commits July 11, 2022 12:55

Merge branch 'aspuru-guzik-group:master' into master

756ebac

Fixed order bug

e51e3dc

alstonlo approved these changes Jul 13, 2022

View reviewed changes

MarioKrenn6240 merged commit be4db25 into aspuru-guzik-group:master Jul 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed index counting bug in attribution #89

Fixed index counting bug in attribution #89

whitead commented Jun 8, 2022

whitead commented Jun 9, 2022

MarioKrenn6240 commented Jun 10, 2022

alstonlo Jul 10, 2022

alstonlo Jul 10, 2022

whitead commented Jul 11, 2022

alstonlo left a comment

Fixed index counting bug in attribution #89

Fixed index counting bug in attribution #89

Conversation

whitead commented Jun 8, 2022

whitead commented Jun 9, 2022

MarioKrenn6240 commented Jun 10, 2022

alstonlo Jul 10, 2022

Choose a reason for hiding this comment

alstonlo Jul 10, 2022

Choose a reason for hiding this comment

whitead commented Jul 11, 2022

alstonlo left a comment

Choose a reason for hiding this comment