Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse Molecules #169

Closed
dgasmith opened this issue Nov 12, 2019 · 9 comments
Closed

Sparse Molecules #169

dgasmith opened this issue Nov 12, 2019 · 9 comments
Labels
Feature Adds a new feature to the project
Milestone

Comments

@dgasmith
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
Molecules are typically 3x as heavy as they need to be as many items like mass/real/isotope are generally needed. This causes issues on both object creation speed and serialized size at a number of layers.

Describe alternatives you've considered
Several options are possible:

  • Molecule.mass = None all values are defaulted to None instead of created values. We could have a create/strip functions that would fill/remove these fields on the fly. .dict() would remove all created fields in the serialization.
  • Molecule.mass is a lazy evaluated field unless provided. This may not be possible due to metaclass issues, but worth a try.
  • We have a SimpleMolecule class which contains a subset of Molecule fields. Not my favorite solution, but a quick one.

This is becomes a fairly serious issue for the Database and handling large numbers of Molecules.

@loriab @mattwelborn

@dgasmith dgasmith added the Feature Adds a new feature to the project label Nov 12, 2019
@dgasmith
Copy link
Collaborator Author

dgasmith commented Nov 28, 2019

Looks like we can do this:

from typing import List
from pydantic import BaseModel

class LazyModel(BaseModel):
    symbols: List[int]
    mass_: List[float] = None

    class Config:
        fields = {'mass_': 'mass'}

    @property
    def mass(self) -> List[int]:
        mass = self.__dict__.get('mass_')
        if mass is None:
            mass = [42]  # [MASS_LOOKUP[k] for k in values['symbols']]
        return mass

Example:

m = LazyModel(symbols=[0], mass=[10.3])
print(m.mass)
#> [10.3]
print(m.dict(exclude_unset=True, by_alias=True))
#> {'symbols': [0], 'mass': [10.3]}

m = LazyModel(symbols=[0])
print(m.mass)
#> [42]
print(m.dict(exclude_unset=True, by_alias=True))
#> {'symbols': [0]}

This makes things a bit bulky but workable for this somewhat special class. We would automatically set exclude_unset=True, by_alias=True in the Molecule dict() function. We can also edit the auto docs to make this not look odd I believe.

This would skip the need for a fill function or similar. Any thoughts here?

@loriab
Copy link
Collaborator

loriab commented Nov 28, 2019

That's looking very promising. There's complications with, say, mass coming not just from the mass field but from the label field, but that's a problem on my side, not with the pydantic integration. Yay for alias fields.

@dgasmith
Copy link
Collaborator Author

@property
def labels(self):
    return [LABEL_LOOKUP[x] for x in self.symbols]

@property
def mass(self):
    return [MASS_LOOKUP[x] for x in self.labels]

This should correctly chain.

@dgasmith
Copy link
Collaborator Author

dgasmith commented Nov 28, 2019

I think the total number of lazy fields are:

    masses: Optional[Array[float]]
    real: Optional[Array[bool]]
    atom_labels: Optional[Array[str]]
    atomic_numbers: Optional[Array[np.int16]]
    mass_numbers: Optional[Array[np.int16]]
    connectivity: Optional[List[Tuple[int, int, float]]]
    fragments: Optional[List[Array[np.int32]]]
    fragment_charges: Optional[List[float]]
    fragment_multiplicities: Optional[List[int]] 

Keeping the booleans around as they are lightweight. I can start in on this soon.

@loriab Do you have thoughts on how to keep from_arrays sparse? We might need to split the validators into groups.

@loriab
Copy link
Collaborator

loriab commented Nov 28, 2019

  • atomic_numbers worth always having around? it and symbols are used a lot.
  • the trouble is the time to run through the molparse physics validation, right? not just storage of regenerated defaults? (latter's easier on the from_arrays size.)
  • lazy fields list looks good

@dgasmith
Copy link
Collaborator Author

No, nothing that we can regenerate should hang around we should go all in. It is both time and data size, I think we should cache the values however.

@loriab
Copy link
Collaborator

loriab commented Nov 28, 2019

@dgasmith
Copy link
Collaborator Author

We can also strip data size down by only filling fields that were passed in:

validated_kwargs = validate(kwargs)
kwargs = {k: validated_kwargs[k] for k in kwargs.keys()

This might make it easy to let everything else be easy.

It is safe to say the 90% case is symbols + geometry and we need to fill in multiplicity + charge. The 9% case is where we need to do that for fragments, everything else likely falls in the 1%.

@mattwelborn mattwelborn added this to the v0.13.0 milestone Dec 2, 2019
@dgasmith
Copy link
Collaborator Author

Closed by #191.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Adds a new feature to the project
Projects
None yet
Development

No branches or pull requests

3 participants