Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature normalize and standardize objects #287

Merged
merged 12 commits into from
Jul 21, 2022

Conversation

coltonbh
Copy link
Collaborator

@coltonbh coltonbh commented Apr 6, 2022

Description

The purpose of this PR is to standardize the interface for input and result objects. It also removes duplicate field specifications by moving common attributes to parent classes. The general interface for most end user objects remains almost the same, with a few tweaks to create the commonalities.

Closes #264

Changelog description

  • Refactored Input and Result objects to conform to a standard interface for each object type. Inspiration was drawn from existing implementations in qcelemental and extended to create consistency.
  • All input objects (AtomicInput, OptimizationInput, TorsionDriveInput) conform to the basic interface of .molecule, containing the initial molecule on which to perform/start the computation, and .specification which defines the keywords, program, extras, protocols, and other parameters to fully specify the computation. molecule + specification = complete computation. More sophisticated specifications, like OptimizationSpecification and TorsionDriveSpecification can be composed by adding specifications objects that define the collection of computations needed to execute the procedure. See OptimizationSpecification for an example that composes AtomicSpecification into it.
  • All result objects (AtomicResult, OptimizationResult, TorsionDriveResult, FailedOperation) conform to the basic interface of .input_data which is a perfect copy of the input data used to generate the computation, .success boolean to define whether the operation succeeded or failed, and the stdout and stderr for the computation. Additionally, computation-specific output fields an be added as needed.
  • All other properties found on inputs/outputs (like id, schema_name etc) were unmodified in form but refactored to parent classes to reduce duplicate code.

Status

  • [ x] Code base linted
  • [ x] Community discussion has been had!
  • [ x] With community approval for the design I'll update tests and get the code ready to go.
  • [ x] Ready to go

Interface description

This PR makes all interfaces conform to the following basic specification:

Input Objects

All input objects (AtomicInput, OptimizationInput, TorsionDriveInput, future procedures...) follow the same pattern:

  1. .specification specifies how that computation is to be performed. It contains all the keywords, extras, protocols, possible model/driver, and/or other fields required to be sent directly to the computational chemistry program being called.
  2. .molecule attribute specifies on what molecule to perform (or begin) this operation.

The Specification objects take their inspiration from the QCInputSpecification object. Combining a specification with a molecule creates a computation as per Ben's thoughts.

Increasingly sophisticated computations (like procedures) can now be generated systematically by composition of any number of required Specification objects. A simplified AtomicInput, OptimizationInput, and TorsionDriveInput can be seen below that show how this pattern plays nicely at each level of abstraction.

##### Specification Objects #####

class SpecificationBase(...):
    """Base specification directed at any computational chemistry program"""
    keywords: Dict[str, Any]
    program: str

class AtomicSpecification(SpecificationBase):
    """Specification for single point computations"""
    driver: DriverEnum
    model: Model

class OptimizationSpecification(SpecificationBase):
    """Specification for optimization procedures"""
    gradient_specification: AtomicSpecification

class TorsionDriveSpecification(SpecificationBase):
    """Specification for TorsionDrive procedures"""
    optimization_specification: OptimizationSpecification

##### Input Objects #####

class InputComputationBase(InputBase):
    """Base input directed at any computational chemistry program"""
    specification: SpecificationBase
    molecule: Molecule

class AtomicInput(InputComputationBase):
    r"""The MolSSI Quantum Chemistry Schema"""
    specification: QCInputSpecification 

class OptimizationInput(InputComputationBase):
    """Input object for an optimization computation"""
    specification: OptimizationSpecification

class TorsionDriveInput(InputComputationBase):
    """Inputs for running a torsion drive."""
    specification: TorsionDriveSpecification

Result Objects

All result objects (AtomicResult, OptimizationResult, TorsionDriveResult, FailedOperation) conform to the same interface. The previous split between inheritance (on successful results) and composition (on failed results) is resolved in favor of composition. The inheritance from input objects created difficulty with schema_name that required special validation and the mixing of the .extras dictionary that contained both input and output values. Additionally, creating new inputs from previous results was difficult because the input and output fields were mixed on the same object so things like AtomicInput(**my_old_atomic_result.dict()) raise exceptions. AtomicInput(**my_old_atomic_result.input_data.dict() would be nice. I preferred the ease of accessing the complete input data at .input_data on a FailedOperation object :) Grouping result objects via inheritance also eliminated field duplication. All result objects follow this interface:

  1. .success: bool (True on all successful results, False on FailedOperation)
  2. .input_data: The full input data object that created the computation
  3. .stdout: Optional[str] of stdout data
  4. .stderr: Optional[str] of stderr

Here is a simplified hierarchy:

class ResultBase(InputResultBase):
    """Base class for all result classes"""

    input_data: InputBase = Field(..., description="The input data supplied to generate the computation")
    success: bool
    stdout: Optional[str] = None
    stderr: Optional[str] = None


class SuccessfulResultBase(ResultBase):
    """Base object for any successfully returned result"""

    success: Literal[True]

class AtomicResult(SuccessfulResultBase):
    r"""Results from a CMS program execution."""

    input_data: AtomicInput 
    properties: AtomicResultProperties
    wavefunction: Optional[WavefunctionProperties] 
    return_result: Union[float, Array[float], Dict[str, Any]
    native_files: Dict[str, Any] = {}

class OptimizationResult(SuccessfulResultBase):
    """The result of an optimization procedure"""

    input_data: OptimizationInput
    final_molecule: Optional[Molecule]  # Should this be optional? Seems necessary!
    trajectory: List[AtomicResult]
    energies: List[float]

class TorsionDriveResult(SuccessfulResultBase):
    """Results from running a torsion drive."""

    input_data: TorsionDriveInput
    final_energies: Dict[str, float]
    final_molecules: Dict[str, Molecule]
    optimization_history: Dict[str, List[OptimizationResult]]

class FailedOperation(ResultBase):
    """Record indicating that a given operation (program, procedure, etc.) has failed and containing the reason and input data which generated the failure."""

    input_data: Union[AtomicInput, OptimizationInput, TorsionDriveInput]
    success: Literal[False]
    error: ComputeError

This plays nicely with your brain because you can interact with all inputs and outputs in the same way. Inputs always contain a molecule and a specifications. Specifications always contains program, keywords for that program, and any additional specifications required for sub-computations (like a gradient specification for an optimization routine). All result objects have a .success value, input_data value, and then any additional fields unique to their computation type. I think these changes can make qcelemental even easier to use and more consistent to program with.

I consider this PR a starting point for community conversation! So I look forward to hearing feedback, thoughts, opinions, and concerns. Hope this can move qcel forward in a positive way. Thanks to all the maintainers for your work on the package.

Simple Script

If you want to play with these objects and see how they feel just install qcelemental from this branch and then use this script (assuming you named it hacking.py).

python -i hacking.py
from qcelemental.models import (
    Molecule,
    AtomicInput,
    AtomicResult,
    OptimizationInput,
    OptimizationResult,
    TorsionDriveInput,
    TorsionDriveResult,
)

water = Molecule.from_data(
    """
    0 1
    O  -1.551007  -0.114520   0.000000
    H  -1.934259   0.762503   0.000000
    H  -0.599677   0.040712   0.000000
    """
)

water_input_spec = {"model": {"method": "energy", "basis": "6g-31"}, "driver": "energy", "program": "psi4"}
ai = AtomicInput(
    molecule=water,
    specification=water_input_spec,
)
print(ai, "\n")
ar = AtomicResult(
    input_data=ai,
    return_result=123,
    provenance={"creator": "colton"},
    properties={},
)
print(ar, "\n")

water_gradient = {**water_input_spec, **{"driver": "gradient"}}
opt_spec = {"gradient_specification": water_gradient, "program": "fake_optimizer_program"}
oi = OptimizationInput(molecule=water, specification=opt_spec)
print(oi, "\n")

optres = OptimizationResult(
    input_data=oi,
    final_molecule=water,
    trajectory=[ar],
    provenance={"creator": "colton"},
    energies=[
        123,
        456,
    ],
)

print(optres, "\n")

tdi = TorsionDriveInput(
    molecule=water,
    specification={
        "keywords": {
            "dihedrals": ((1, 2, 3, 4), (1, 2, 3, 4)),
            "grid_spacing": [6, 7],
        },
        "optimization_specification": opt_spec,
        "program": "torsiondrive",
    },
)

print(tdi, "\n")

tdr = TorsionDriveResult(
    input_data=tdi,
    final_energies={"hi": 123.2},
    final_molecules={"water": water},
    optimization_history={"angle": [optres]},
    provenance={"creator": "colton"},
)

Addendum

Because I was using pyreverse to model the various data relationships in the library I also took the liberty to delete the old models that were to be removed in v0.13.0 since this made it easier to visualize the code.

@coltonbh coltonbh force-pushed the feature-normalize-and-standardize-objects branch from e6e821d to 71cb142 Compare April 6, 2022 03:15
@coltonbh coltonbh force-pushed the feature-normalize-and-standardize-objects branch from 71cb142 to 199b324 Compare April 6, 2022 03:24
@coltonbh
Copy link
Collaborator Author

coltonbh commented Apr 6, 2022

A few wins for QCEngine that would come from this PR. Input objects are supposed to be immutable, yet due to the fact that results inherit from inputs qcengine must immediately cast these inputs as mutable dictionaries, inherit the data onto new objects (and modify the schema values where needed), and inject values into the existing .extras field. Having a compositional approach means the input values can be accessed for a computation and the unmodified input set on the result at .input_data. Feels like a cleaner separation with no need to ever modify input data for a computation.

@lgtm-com
Copy link
Contributor

lgtm-com bot commented Apr 6, 2022

This pull request introduces 13 alerts when merging 819bb97 into d04a380 - view on LGTM.com

new alerts:

  • 13 for Unused import

@coltonbh
Copy link
Collaborator Author

coltonbh commented Apr 6, 2022

Added Changelog description above to summarize PR.

@loriab
Copy link
Collaborator

loriab commented Apr 8, 2022

On the whole, looks good, thanks!

  • some of the field names can be debated (e.g., gradient_spec in opt could be more generic)
  • the "immediate" input in each class gets the name input_spec, even though it may be an atomic, an opt, or a torsiondrive. is this a characteristic you've definitely found handy?
  • most of the schema versions would need a bump.
  • good point on helping the immutability of inputs.
  • I'd like to sketch out how this would look for some layered procedures I have planned (e.g., finite difference + composite).
  • we did a molecule v1 -> v2 transition, but probably the hardest part of this will be coordinating qcel, qcng, and the downstreams that use qcsk (e.g., psi4, xtb). but that has to be faced sometime anyways.
  • after iterating some here at qcel, this probably ought to be posted at QCSchema repo (issue with link; not PR) for community comment

@coltonbh
Copy link
Collaborator Author

coltonbh commented Apr 9, 2022

Thanks for the questions! A few thoughts.

  • some of the field names can be debated (e.g., gradient_spec in opt could be more generic)

You'd want to be as specific as possible in this case. The OptimizationInput is a "leaf" on the inheritance tree--a concrete final object that gets consumed by a QC program--so its non-inherited attributes should be as specific as possible to their use case on the OptimizationInput class. The "compositional" approach means the QCInputSpecification is the reusable object following a Specification interface, and you name that attribute something specific for the data structure at hand. It's like a str class, you don't name the attribute "string" so that it's "reusable", you'd name the attribute something specific to that object and declare it's interface (i.e., class) as str. The interface ("this attribute will behave like a string") is the reusable thing, not the nomenclature. The below example may seem overly obvious, but it's the exact same principle at play; you'd want a class-specific attribute name; it may refer to any value as long as it conforms to the str interface:

class Person(BaseModel):
    string: str # probably not what you want
    name: str # probably what you want; "name" is not intended to be generic, the type is generic and you ascribe the generic type (interface) to a specific attribute of the object

I agree re: field names in general though. I stuck with some old names so it looks more familiar. I'd consider input_spec to be redundant (these are already "input" objects! It's clear everything on them is an input). I'd rename it to spec for brevity and clarity.

  • the "immediate" input in each class gets the name input_spec, even though it may be an atomic, an opt, or a torsiondrive. is this a characteristic you've definitely found handy?

100%. This is the intended design and the main benefit I'm trying to highlight. It follows from the logic of the .name attribute above. Each Person has a .name attribute guaranteed to conform to the str interface, even though the value of that attribute will vary. Similarly, each input (AtomicInput, OptimizationInput and TorsionDriveInput) has a .input_spec guaranteed to conform to the InputSpecification interface even though the value of that attribute will vary. The difference is one additional layer of abstraction--there can be multiple subclasses of InputSpecification. This would be like having special subclasses of str for .name for more specific subclasses of Person. Going a bit further with the Person example might look something like this:

class Person(BaseModel):
    name: str # analogous to .molecule
    best_friend: Person # analogous to .input_spec. This annotation means Person or any of its subclasses (YoungPerson, OldPerson) is valid

class OldPerson(Person):
    favorite_newspaper: str

class YoungPerson(Person):
    favorite_cartoon: str

# Examples omit everyone having a `.best_friend` for simplicity

colton = OldPerson(name="colton", best_friend=YoungPerson(name="greg", favorite_cartoon="Rick and Morty"),  ...)
lori = YoungPerson(name="lori", best_friend=OldPerson(name="suzy", favorite_newspaper="Something old!"), ...)

colton and lori are from different classes (like AtomicInput or OptimizationInput), yet they both conform to the Person (InputComputationBase) interface, so I know how to interact with them generally. Each will have a .name conforming to a str (Molecule) interface, and a .best_friend conforming to a Person (InputSpecificationBase) interface. This is hugely advantageous because as I proliferate the kinds of Person (Input) classes that exist, I still interact with them all the same way. And even if the values at .best_friend are different classes (YoungPerson, OldPerson), .best_friend will always conform to a Person interface so I know it will have .name and .best_friend (and other attributes specific to the exact class).

This pattern makes programming much easier because like objects are alike ("I know how to make an input, I need a .molecule and a .input_spec, now let me just grab the right Specification for my input...") rather than each object having its own interface. It makes programmatic access to heterogenous outputs straightforward ("Here's a directory of all my outputs (failed and successful!), I'm gonna iterate over them and grab all input objects corresponding to an H2O molecule. I know all my outputs will have a .input_data attribute and all those will have a .molecule attribute, so this is easy, even if I have a mixture of objects"). No type checking or special access code is required. Note that the code below, (if the results array contains some of each result type), is interacting with every object type in qcel: AtomicInput, OptimizationInput, TorsionDriveInput, QCInputSpecification, OptimizationSpecification, TorsionDriveSpecification, AtomicResult, OptimizationResult, FailedOperation, and TorsionDriveResult, plus any future Input/Result objects. Yet it's super easy to understand and will never need updating when new types are added.

for result in results: # contains any result--Atomic, FailedOp, Opt, Torsion, ...,
    if result.input_data.molecule.name == "water":
        input_data = result.input_data
        if input_data.specification.extras.get("specialkeyword"): 
            if input_data.specification.keywords.get("badkeyword"):
                print("Found my mistake! This extra + keyword ruins everything. All these computations are junk!")

With the current qcel interface, the above code would look like:

for result in results: # contains any kind of result
    if isinstance(result, FailedOperation):
        input_data = result.input_data # NOTE: input_data not typed on FailedOperation, so will be `dict`
        if input_data.get("initial_molecule"): # Is OptimizationInput or TorsionDriveInput
            molecule_dict = input_data.get("initial_molecule")
        else: # is AtomicInput
            molecule_dict = input_data.get("molecule")
        molecule = Molecule(**molecule_dict) # Recast from dict -> Molecule
    elif isinstance(result, TorsionDriveResult) or isinstance(result, OptimizationResult):
        molecule = result.initial_molecule
    else: # is AtomicResult
        molecule = result.molecule
    # TODO: Don't forget to add more cases in the future for all new Result object types!
   
    if molecule.name == 'water':
        if isinstance(result, FailedOperation):
            my_special_keyword = result.input_data["extras"].get("specialkeyword") # keyword may have been added as part of output! Not strictly input data
        else:
            my_special_keyword = result.extras.get("specialkeyword")
        if my_special_keyword:
            if isinstance(result, FailedOperation):
                keywords = result.input_data['keywords']
            else:
                keywords = result.keywords
            if keywords.get("badkeyword"):
                print("Found my mistake! This extra + keyword ruins everything. All these computations are junk!")
        # TODO: Don't forget to add more cases for future Input types!

The above code is the beauty of unified interfaces. Everything you need to do is generic, follows the same pattern, and you can quickly lock on to the patterns of a library (like qcel) and begin to write efficient code without having to look at the source every time you construct and access objects because the same patterns are followed everywhere. You can now also build nice generic tools like def find_results_containing_keyword(keyword: str, results: List[ResultBase] -> List[ResultBase] that play nicely with all object types and won't have to be refactored and updated every time new objects are added to the library. The difference between the required code for today's qcel is the key point. Even after a year of programming with qcel I had to consult the source to write that code (and it probably contains bugs). To write my own example I didn't have to look at any code--the interface is generic and can be memorized from the few rules that define it. If my code above is an unfair comparison b/c a refactor is possible please point that out; however, the fact that we'd have to think that hard about it already suggests the interface can be improved :)

  • most of the schema versions would need a bump.

Correct. Since they seem to evolve together (everything is currently on v1) I've placed this attribute as a common value here. It can be changed on any child object if they don't evolve in lock step.

  • good point on helping the immutability of inputs.

Agree! I like this piece a lot. Our "immutabiliy" is only as good as our perfect coding in qcng. I'd prefer to have the data model enforce the immutability as a natural result.

  • I'd like to sketch out how this would look for some layered procedures I have planned (e.g., finite difference + composite).

Would love to chat about these! I'd benefit from hearing more use cases to inform my perspectives :)

  • we did a molecule v1 -> v2 transition, but probably the hardest part of this will be coordinating qcel, qcng, and the downstreams that use qcsk (e.g., psi4, xtb). but that has to be faced sometime anyways.

I'd be happy to help out on migrating qcng.

  • after iterating some here at qcel, this probably ought to be posted at QCSchema repo (issue with link; not PR) for community comment

Let me know when you'd like this posted over there and I'll do it!

Anything else let me know. Hope the explanations above aren't too pedantic (pydantic?!). I'm trying to be clear and detailed about the design pattern but hope I'm not communicating (or over-communicating) that poorly. Always happy to do a call for clarification and discussion :)

@coltonbh
Copy link
Collaborator Author

One additional note on this:

  • the "immediate" input in each class gets the name input_spec, even though it may be an atomic, an opt, or a torsiondrive. is this a characteristic you've definitely found handy?

This concept is really the big idea behind object oriented programming. It's often called "polymorphism." Not building interfaces this way misses out on the truly powerful syntax that OOP offers. Polymorphism is the reason to use OOP. You may have heard about "duck typing" in python. It's an even more flexible version of the same thing. So just throwing out a few key terms here so the interested can dig deeper on google :)

@loriab
Copy link
Collaborator

loriab commented Jun 8, 2022

My docs pass #289 messed this up -- sorry about that. Please rebase at your convenience.

@coltonbh
Copy link
Collaborator Author

My docs pass #289 messed this up -- sorry about that. Please rebase at your convenience.

What do you mean by this? Not clear what got messed up. Thanks :)

@loriab
Copy link
Collaborator

loriab commented Jun 10, 2022

My docs pass #289 messed this up -- sorry about that. Please rebase at your convenience.

What do you mean by this? Not clear what got messed up. Thanks :)

Sorry I was unclear. There's conflicts in five files that need to be resolved after I renovated docs. Any __base_doc__ changes into __doc__. And I realized yesterday that those ancient deprecated classes like ResultProtocols are still imported by QCFractal, at least until the long-cooking next branch gets merged.

@coltonbh coltonbh changed the base branch from master to next July 12, 2022 02:18
@codecov
Copy link

codecov bot commented Jul 12, 2022

Codecov Report

Merging #287 (c4442de) into next (2956421) will increase coverage by 0.06%.
The diff coverage is 97.03%.

@lgtm-com
Copy link
Contributor

lgtm-com bot commented Jul 12, 2022

This pull request introduces 3 alerts when merging e923f22 into 2956421 - view on LGTM.com

new alerts:

  • 3 for Unused import

…at exported models conform to pydantic's autogenerated schema, which is not necessary to tests. Also, issues arrise with jsonschema which are external to our concerns.
@lgtm-com
Copy link
Contributor

lgtm-com bot commented Jul 12, 2022

This pull request introduces 3 alerts when merging c4442de into 2956421 - view on LGTM.com

new alerts:

  • 3 for Unused import

Copy link
Collaborator

@loriab loriab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! Let's initialize next with this and see if downstream will need any tweaks in practice. Thank you!

@loriab loriab merged commit 6e341fd into MolSSI:next Jul 21, 2022
@coltonbh
Copy link
Collaborator Author

Hell yeah! I can start to work on cleaning up the models module after ACTC next week :)

loriab pushed a commit that referenced this pull request Dec 9, 2022
* Refactored models. Moved common fields to parent classes. Created standardized input and result interface.

* Basic models implemented; tests not passing yet

* test_model_results passing

* test_molutil passing

* test_molparse_from_string_passing

* test_molecule passing

* Set test_molutil back to original state

* blacken qcel

* Skip --validate tests since they are circular in nature. They test that exported models conform to pydantic's autogenerated schema, which is not necessary to tests. Also, issues arrise with jsonschema which are external to our concerns.
loriab pushed a commit that referenced this pull request Aug 22, 2023
* Refactored models. Moved common fields to parent classes. Created standardized input and result interface.

* Basic models implemented; tests not passing yet

* test_model_results passing

* test_molutil passing

* test_molparse_from_string_passing

* test_molecule passing

* Set test_molutil back to original state

* blacken qcel

* Skip --validate tests since they are circular in nature. They test that exported models conform to pydantic's autogenerated schema, which is not necessary to tests. Also, issues arrise with jsonschema which are external to our concerns.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DISCUSSION] Reduce redundancy in model specifications
2 participants