Sparse Molecules #169

dgasmith · 2019-11-12T16:21:13Z

Is your feature request related to a problem? Please describe.
Molecules are typically 3x as heavy as they need to be as many items like mass/real/isotope are generally needed. This causes issues on both object creation speed and serialized size at a number of layers.

Describe alternatives you've considered
Several options are possible:

Molecule.mass = None all values are defaulted to None instead of created values. We could have a create/strip functions that would fill/remove these fields on the fly. .dict() would remove all created fields in the serialization.
Molecule.mass is a lazy evaluated field unless provided. This may not be possible due to metaclass issues, but worth a try.
We have a SimpleMolecule class which contains a subset of Molecule fields. Not my favorite solution, but a quick one.

This is becomes a fairly serious issue for the Database and handling large numbers of Molecules.

@loriab @mattwelborn

The text was updated successfully, but these errors were encountered:

dgasmith · 2019-11-28T15:52:35Z

Looks like we can do this:

from typing import List
from pydantic import BaseModel

class LazyModel(BaseModel):
    symbols: List[int]
    mass_: List[float] = None

    class Config:
        fields = {'mass_': 'mass'}

    @property
    def mass(self) -> List[int]:
        mass = self.__dict__.get('mass_')
        if mass is None:
            mass = [42]  # [MASS_LOOKUP[k] for k in values['symbols']]
        return mass

Example:

m = LazyModel(symbols=[0], mass=[10.3])
print(m.mass)
#> [10.3]
print(m.dict(exclude_unset=True, by_alias=True))
#> {'symbols': [0], 'mass': [10.3]}

m = LazyModel(symbols=[0])
print(m.mass)
#> [42]
print(m.dict(exclude_unset=True, by_alias=True))
#> {'symbols': [0]}

This makes things a bit bulky but workable for this somewhat special class. We would automatically set exclude_unset=True, by_alias=True in the Molecule dict() function. We can also edit the auto docs to make this not look odd I believe.

This would skip the need for a fill function or similar. Any thoughts here?

loriab · 2019-11-28T16:10:19Z

That's looking very promising. There's complications with, say, mass coming not just from the mass field but from the label field, but that's a problem on my side, not with the pydantic integration. Yay for alias fields.

dgasmith · 2019-11-28T16:15:38Z

@property
def labels(self):
    return [LABEL_LOOKUP[x] for x in self.symbols]

@property
def mass(self):
    return [MASS_LOOKUP[x] for x in self.labels]

This should correctly chain.

dgasmith · 2019-11-28T16:39:16Z

I think the total number of lazy fields are:

    masses: Optional[Array[float]]
    real: Optional[Array[bool]]
    atom_labels: Optional[Array[str]]
    atomic_numbers: Optional[Array[np.int16]]
    mass_numbers: Optional[Array[np.int16]]
    connectivity: Optional[List[Tuple[int, int, float]]]
    fragments: Optional[List[Array[np.int32]]]
    fragment_charges: Optional[List[float]]
    fragment_multiplicities: Optional[List[int]]

Keeping the booleans around as they are lightweight. I can start in on this soon.

@loriab Do you have thoughts on how to keep from_arrays sparse? We might need to split the validators into groups.

loriab · 2019-11-28T16:57:38Z

atomic_numbers worth always having around? it and symbols are used a lot.
the trouble is the time to run through the molparse physics validation, right? not just storage of regenerated defaults? (latter's easier on the from_arrays size.)
lazy fields list looks good

dgasmith · 2019-11-28T17:14:53Z

No, nothing that we can regenerate should hang around we should go all in. It is both time and data size, I think we should cache the values however.

loriab · 2019-11-28T18:35:56Z

ok on nothing regenerable
fortunately all the validate_and_fill_* fns in from_arrays already have some separation.
reducing data size comparatively easy as we can have from_arrays only return non-regenerable fields.
reducing processing time is more involved. main targets are https://github.com/MolSSI/QCElemental/blob/master/qcelemental/molparse/from_arrays.py#L378 and https://github.com/MolSSI/QCElemental/blob/master/qcelemental/molparse/from_arrays.py#L354
main thing to confirm is that we still want nominally simple inputs like H 0 0 0 to still go through enough physics processing to get chgmult 0 2?

dgasmith · 2019-11-28T19:41:04Z

We can also strip data size down by only filling fields that were passed in:

validated_kwargs = validate(kwargs)
kwargs = {k: validated_kwargs[k] for k in kwargs.keys()

This might make it easy to let everything else be easy.

It is safe to say the 90% case is symbols + geometry and we need to fill in multiplicity + charge. The 9% case is where we need to do that for fragments, everything else likely falls in the 1%.

dgasmith · 2020-01-16T22:04:50Z

Closed by #191.

dgasmith added the Feature Adds a new feature to the project label Nov 12, 2019

mattwelborn added this to the v0.13.0 milestone Dec 2, 2019

loriab mentioned this issue Jan 12, 2020

try minimal storage mol for atomic_number #190

Merged

2 tasks

dgasmith closed this as completed Jan 16, 2020

loriab mentioned this issue Jan 25, 2020

Keeping QCSchema in sync with QCElemental MolSSI/QCSchema#68

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sparse Molecules #169

Sparse Molecules #169

dgasmith commented Nov 12, 2019

dgasmith commented Nov 28, 2019 •

edited

Loading

loriab commented Nov 28, 2019

dgasmith commented Nov 28, 2019

dgasmith commented Nov 28, 2019 •

edited

Loading

loriab commented Nov 28, 2019

dgasmith commented Nov 28, 2019

loriab commented Nov 28, 2019

dgasmith commented Nov 28, 2019

dgasmith commented Jan 16, 2020

Sparse Molecules #169

Sparse Molecules #169

Comments

dgasmith commented Nov 12, 2019

dgasmith commented Nov 28, 2019 • edited Loading

loriab commented Nov 28, 2019

dgasmith commented Nov 28, 2019

dgasmith commented Nov 28, 2019 • edited Loading

loriab commented Nov 28, 2019

dgasmith commented Nov 28, 2019

loriab commented Nov 28, 2019

dgasmith commented Nov 28, 2019

dgasmith commented Jan 16, 2020

dgasmith commented Nov 28, 2019 •

edited

Loading

dgasmith commented Nov 28, 2019 •

edited

Loading