[DISCUSSION] Reduce redundancy in model specifications #264

Open
coltonbh opened this issue May 20, 2021 · 9 comments · Fixed by #287
Comments

@coltonbh
Collaborator

Is your feature request related to a problem? Please describe.

As I've worked through compute implementations of single point calculations and the optimization procedures in qcengine, I've found some redundancies and inconsistencies in how inputs are defined. Would it be worth consolidating some of these data models into a more unified approach?

Specifically, the AtomicInput, QCInputSpecification, and OptimizationInput models handle similar (and sometimes duplicate) data differently. Perhaps a unified approach would make the API simpler and more consistent?

The way it is now

QCInputSpecification and AtomicInput contain the exact same fields, the only differences being that AtomicInput also has protocols and molecule attributes. Given that QCInputSpecification is what is used inside the various ProcedureHarness classes--like pyberny's harness--to define the QC gradient calculation, perhaps it would make more sense to use the AtomicInput specification directly to describe this task? This would consolidate where the Molecule is defined for computations across single point and procedure tasks and provide a unified way of defining single point calculations, whether requested on their own (via qcng.compute) or as part of a procedure.
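For concreteness, here's a rough, paraphrased sketch of the overlap (field lists abridged; the real definitions live in qcelemental.models):

class QCInputSpecification(ProtoModel):
    driver: DriverEnum
    model: Model
    keywords: Dict[str, Any] = {}
    extras: Dict[str, Any] = {}

class AtomicInput(ProtoModel):
    # same driver/model/keywords/extras fields as QCInputSpecification, plus:
    molecule: Molecule
    protocols: AtomicResultProtocols = AtomicResultProtocols()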

Describe the solution you'd like

I would suggest eliminating the QCInputSpecification class and updating the OptimizationInput class as follows:

class OptimizationInput(ProtoModel):
    id: Optional[str] = None
    hash_index: Optional[str] = None
    schema_name: constr(  # type: ignore
        strip_whitespace=True, regex=qcschema_optimization_input_default
    ) = qcschema_optimization_input_default
    schema_version: int = 1

    keywords: Dict[str, Any] = Field({}, description="The optimization specific keywords to be used.")
    extras: Dict[str, Any] = Field({}, description="Extra fields that are not part of the schema.")
    protocols: OptimizationProtocols = Field(OptimizationProtocols(), description=str(OptimizationProtocols.__doc__))

    # input_specification: QCInputSpecification = Field(..., description=str(QCInputSpecification.__doc__))
    input_specification: AtomicInput = Field(..., description="The starting molecule and compute specification for the geometry optimization")
    # initial_molecule: Molecule = Field(..., description="The starting molecule for the geometry optimization.")
    engine: str = Field(..., description="The compute engine to use for energy/gradient evaluations")
    provenance: Provenance = Field(Provenance(**provenance_stamp(__name__)), description=str(Provenance.__doc__))

    @property
    def initial_molecule(self) -> Molecule:
        """Maintains backwards compatibility for external callers who expect an initial_molecule attribute."""
        return self.input_specification.molecule

This solves a few problems:

  1. It removes the redundant QCInputSpecification model, which is really just an AtomicInput minus the Molecule.
  2. It consolidates how to specify a single point computation (gradient/energy) into the AtomicInput declaration, the same interface used for single point computations via qcng.compute(). I'd argue this is preferable to having a second way of declaring the computation for optimization procedures, using a QCInputSpecification plus a Molecule defined separately on the OptimizationInput model.
  3. It adds a clear definition of the program engine to use rather than .pop()-ing "program" from the OptimizationInput.keywords dictionary--which is not entirely intuitive; one has to read the source code to realize that this is how the compute engine must be specified (see the sketch after this list).
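A rough before/after of point 3, with the current keyword convention on top and the proposed explicit field below (illustrative snippets, not the actual harness code):

# Current convention: the engine is smuggled in through keywords and popped out by the harness.
program = input_data.keywords.pop("program")

# Proposed: the engine is an explicit, documented field on OptimizationInput.
program = input_data.engine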

Describe alternatives you've considered

AtomicInput could inherit from QCInputSpecification and then only define its additional .protocols and .molecule attributes. This would at least reduce the code overhead of redundant field declarations that serve the same purpose. But I think the real intent of the library is better served by using the AtomicInput object throughout to define single point calculations, even when those calculations are used inside of a procedure.

Additional context

If this proposal has support I'll gladly draft up a PR for the code and update the associated procedures (berny, geometric, optking). The only update they'd need, if we implemented the initial_molecule property noted above, is to get the engine/program from the explicitly defined OptimizationInput.engine field rather than executing an input_data.keywords.pop("program") call. I think the explicit declaration of the compute engine is better than implicitly placing it into keywords.

Thanks for your time to review and consider this!

@bennybp
Contributor

bennybp commented May 21, 2021

This is a timely discussion, as I have been thinking about this on the QCFractal side for a while.

I think conceptually what you have under the "Describe alternatives you've considered" might be a bit cleaner. The way I think of it:

  • QCInputSpecification describes how to run a computation (this could include protocols)
  • AtomicInput is all information for a calculation (so an input specification + a molecule)
  • OptimizationInput is all information for an optimization. This includes a QCInputSpecification for how to do each gradient.

An AtomicInput doesn't quite make sense in an optimization because it should describe how each gradient should be performed, and each gradient would have a different molecule.

On the QCFractal side, things will be a bit different, but we could possibly re-use some of these classes from QCElemental.

What to do with program is another question. I could go either way on that one.

@coltonbh
Collaborator Author

coltonbh commented May 21, 2021

Thanks for the thoughts, @bennybp!

A few questions to help decide between the two alternatives.

Question: Would one ever define a computation without defining the molecule on which to perform the computation?

It appears to me the answer is most likely "no," which suggests to me that QCInputSpecification is an incomplete abstraction and AtomicInput would be the more correct "base case." AtomicInput defines a computation and the associated molecule on which to perform the computation--anything more granular doesn't have much meaning (i.e., what is a computation specification without a molecule?). A supporting argument is that OptimizationInput has to compensate for QCInputSpecification's lack of a Molecule by defining its own bespoke attribute--initial_molecule--to use as the starting point. It appears more correct (to me) to say AtomicInput is the most granular unit of computation definition--it describes a computation and the molecule on which to perform it.

Assuming I'm wrong in my statement above: are there cases where we want to define how to run a computation without defining the molecule on which we want to perform it? I.e., cases where the object of computation contains a QCInputSpecification without a Molecule at all? If such cases exist, I'd say I'm wrong and QCInputSpecification is necessary.

An AtomicInput doesn't quite make sense in an optimization because it should describe how each gradient should be performed, and each gradient would have a different molecule.

If I understand you correctly here you're saying an AtomicInput should be interpreted as an explicit computation that is performed exactly as specified so it feels wrong as a "starting input" because the molecule will evolve over time as the geometry optimization progresses, correct?

I'd push back by saying the OptimizationInput.trajectory object contains a series of AtomicResult objects--all of which resulted from some AtomicInput definition. Explicitly defining the first AtomicInput calculation that kicks off that series of calculations, and attaching it to the input_specification attribute that defines the starting point for a geometry optimization, seems logical to me, but perhaps I'm missing something you have in mind from the QCFractal side of things.

That said, I think my thinking boils down to two points:

  1. Having a unified way to define single point calculations (whether as the starting point for a procedure or as the calculation of interest) using AtomicInput maintains consistency across the compute() and compute_procedure() interfaces. This is worth a lot, IMHO. And we can drop an entire redundant data model--less cognitive overhead!
  2. I am unaware of cases where we'd want to define an input specification without also declaring the molecule on which we want that calculation to be performed. If I'm wrong about this, I'd say the QCInputSpecification should definitely stand alone as a separate object; if not, I like the consistent bundling of calculation definition and molecule together on AtomicInput rather than proliferating Molecule specification as bespoke attributes on a procedure's implementation.

Re: program, I'm tending towards Python's overused but helpful axiom, "Explicit is better than implicit." If a procedure is a collection of single point calculations, an engine must always be specified. I like the idea of specifying that upfront and directly on the procedure object of interest. But I have to admit the (apparent) QC package tendency to pass the engine as a parameter to the procedure engine (as done with geomeTRIC, where you pass the program as a parameter to geomeTRIC directly) helps me understand why it was originally defined as a keyword. I'd push for the explicit definition, but I feel less strongly about it than I do about the redundant data definitions :)

How does my question about use cases for an input specification sans a Molecule land with you? Are there legitimate use cases I'm not considering? Thanks for your feedback and thoughts!

@bennybp
Contributor

bennybp commented May 21, 2021

Interesting points!

Question: Would one ever define a computation without defining the molecule on which to perform the computation?

Yes and no. As far as QCEngine is concerned, no (@loriab can correct me if I am wrong). However, in databases (like QCFractal), you can specify a specification and then a list of molecules to apply that specification to. That way you only store the specification once for many calculations. (This separation will always exist in QCFractal whether it exists in QCElemental/QCSchema or not)
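As a rough illustration (hypothetical names--the real QCFractal machinery is more involved, and molecule_library here is just an assumed list of Molecule objects), the database-style workflow looks something like this:

from qcelemental.models import AtomicInput

# One stored specification, fanned out over many molecules at submission time.
spec = {"driver": "gradient", "model": {"method": "b3lyp", "basis": "6-31g"}, "keywords": {}}
inputs = [AtomicInput(molecule=mol, **spec) for mol in molecule_library]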

A supporting argument is that OptimizationInput has to compensate for QCInputSpecification's lack of a Molecule by defining its own bespoke attribute--initial_molecule--to use as the starting point.

I wouldn't say it's compensating. Optimizations have two main molecules, so we should be explicit (better than implicit!) about which is which. Imagine someone using autocomplete in Jupyter notebooks or something. molecule is a bit ambiguous in that sense.

I'd push back by saying the OptimizationInput.trajectory object returns a series of AtomicResult objects--all of which resulted from some AtomicInput definition. Explicitly defining the first AtomicInput calculation that kicks off that series of calculations and attaching it to the input_specification attribute which defines the starting point for a geometry optimization seems logical to me

Yes, but it is up to the optimization program how to populate those AtomicInputs, given a specification. In your case, the optimizer will likely take the first AtomicInput, strip the molecule, and store that somewhere. That input (minus molecule) can be a useful entity on its own, if only conceptually.

Also, it is not necessarily correct to think that the initial_molecule must be the same molecule as the molecule passed to the first gradient calculation. The optimizer is free to take the initial molecule and do something before running the qc program. For example, it may apply constraints, rotate the molecule, add fragments, etc, which creates a new molecule which is then used in the first gradient calculation.

I tend to think of it this way:

  • QCInputSpecification - how to perform a calculation
  • AtomicInput = QCInputSpecification + molecule
  • OptimizationSpecification (doesn't exist) - how to perform an optimization (optimization-specific options + QCInputSpecification)
  • OptimizationInput = OptimizationSpecification + initial molecule

This is what I'm moving towards in QCFractal (think of it in terms of a relational database). But that doesn't necessarily mean QCElemental must follow suit.


Program is the interesting case. On one hand, you pass an input into a program, and so the program is outside the input. I.e., you could take the same AtomicInput and run it with multiple programs, and so you shouldn't specify programs in inputs. In that case, which program you use for gradient calculations is a keyword to the optimization program, since not all optimizers need an external engine (see below).

On the other hand, it really is convenient to have that in there, isn't it :)

If a procedure is a collection of single point calculations, an engine must always be specified.

We could enforce that this always be the case. But optimizers don't always use external programs for gradients (e.g., Gaussian does it all by itself). That also allows for some optimizations and tighter coupling between gradient calculations. This could be worked around by setting the qc program to be the same as the optimization program, though.

Again, interesting discussion!

@coltonbh
Collaborator Author

coltonbh commented May 24, 2021

Great context to have. Thank you!

Also, it is not necessarily correct to think that the initial_molecule must be the same molecule as the molecule passed to the first gradient calculation. The optimizer is free to take the initial molecule and do something before running the qc program. For example, it may apply constraints, rotate the molecule, add fragments, etc, which creates a new molecule which is then used in the first gradient calculation.

The implementation as it stands does indeed return the initial molecule submitted by the user as the initial_molecule. I like the idea that initial_molecule be the actual initial molecule submitted by the user--I think the user would be surprised if it weren't--but it makes total sense that the first molecule in the .trajectory list would not necessarily be the initial molecule submitted by the user.

I tend to think of it this way:

QCInputSpecification - how to perform a calculation
AtomicInput = QCInputSpecification + molecule
OptimizationSpecification (doesn't exist) - how to perform an optimization (optimization-specific options + QCInputSpecification)
OptimizationInput = OptimizationSpecification + initial molecule

Makes sense from a relational database perspective--you want to segment out values that belong in unique tables and build the correct relationships between them. Molecule is a separate object you'd want to reference from any table in your database, so it makes sense that you'd want it to be the "last" thing specified, as a FK to any computation specification. Whether the qcengine interface should be modeled after a particular RDB design is an interesting question. I don't know your designs, so no comment from me--though I think the main point stands that a unification of the specifications across .compute() and .compute_procedure() calls would be nice to have :) I'd suspect that modeling a library API after a DB design may not be optimal (I'm thinking about the separation one usually has between business logic and ORM operations--qcengine is more a "data views" experience than an ORM experience), but without knowing the specifics I have no strong opinion.

On a higher level note, I'd be curious to know your long-term plans for .compute_procedure(). I found it curious that the base class for specifying a procedure is an OptimizationInput. I'd think (perhaps misinformed!) that one could write a procedure for any type of computation that builds on multiple AtomicResults--say an AIMS simulation, just to throw out a crazy one. Yet the current qcengine code seems to suggest that compute_procedure is really designed for just compute_optimization, because there are no base class objects that specify a general interface for procedures, only for optimizations.

Are there plans for a more general specification for computing procedures? If so, perhaps a rethinking of the core compute inputs to this function would be helpful to properly define the general case for procedures. If procedures will remain as just optimizations for the foreseeable future, perhaps that's not helpful.

I'm planning to use and/or expand QCEngine quite a lot throughout my grad school experience so I'd love to know longer term plans for procedure designs--it's something I've been thinking a lot about recently. I'm up for a Zoom call to brainstorm this idea and hear about your plans for the library, if helpful :)

@coltonbh
Collaborator Author

@bennybp I've been doing some more tooling around with the models, specifically doing more large-scale batch computations. After those experiments I'm seeing the wisdom in what you suggested of being able to specify the more granular parts of the computation as separate objects. This makes it easier to define a computation--like a specific energy evaluation--and map it across a library of molecules. I might be totally wrong in my earlier comments--a QCElemental API that looks more like a normalized db schema may be spot on!

I'm liking the composition-over-inheritance approach better, i.e., I like the OptimizationInput API better than what I previously suggested of having AtomicInput inherit from QCInputSpecification. Something like the following for single point calculations (which I think is what you suggested above).

Anyways, just more random food for thought. Generally, I'm leaning toward your way of more granular definitions, especially once one starts doing bulk computations. And I'm tending to prefer a compositional approach.

I tend to think of it this way:

QCInputSpecification - how to perform a calculation
AtomicInput = QCInputSpecification + molecule
OptimizationSpecification (doesn't exist) - how to perform an optimization (optimization-specific options + QCInputSpecification)
OptimizationInput = OptimizationSpecification + initial molecule

class QCInputSpecification(ProtoModel):
    driver: DriverEnum = Field(DriverEnum.gradient, description=str(DriverEnum.__doc__))
    model: Model = Field(..., description=str(Model.__doc__))
    keywords: Dict[str, Any] = Field({}, description="The program specific keywords to be used.")

class AtomicInput(ProtoModel):
    input_specification: QCInputSpecification
    molecule: Molecule
    # Protocols here or QCInputSpecification? Seems it should live with keywords. There's an argument 
    # for keywords being here instead of QCInputSpecification as well. Could be nice to have separation
    # between things that are general across packages (like driver/model) and specific to a QC package
    # like keywords and protocols (are protocols package specific??).
    protocols: ...
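And, for completeness, a hypothetical sketch of the optimization half of that scheme (these classes/field names don't exist today; I'm just following your bullets above):

class OptimizationSpecification(ProtoModel):
    # How to perform an optimization: optimizer-specific options + how to run each gradient
    keywords: Dict[str, Any] = Field({}, description="The optimization specific keywords to be used.")
    input_specification: QCInputSpecification = Field(..., description="How to run each gradient.")

class OptimizationInput(ProtoModel):
    specification: OptimizationSpecification
    initial_molecule: Molecule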

@bennybp
Contributor

bennybp commented Nov 15, 2021

I am now getting to this part in my modifications to QCFractal. It's been a few months, and I'm wondering if you have any other thoughts?

I do have some, but I would like to know if you have discovered anything in the meantime before I write them out.

@coltonbh
Collaborator Author

Hi Ben--

Good to hear from you again on this! No new thoughts. The main thoughts I have after our discussion above and working through various implementations myself are:

  1. A consistent way to define inputs for a computation--whether that computation be single point or a procedure--would be a nice improvement.
  2. Via our conversation and some of my own experience, I think a compositional approach is better than an inheritance approach. It makes it much easier to take pieces of a computation--say a library of Molecule objects and a series of compute specifications--and mix-and-match them to create a bunch of compute operations (see the sketch below). The inheritance approach gets ugly when you try to run batch computations.
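Something like this is what I have in mind (hypothetical names--compute_specs and molecule_library are assumed collections, and this uses the compositional AtomicInput sketched earlier in the thread rather than the current API):

from itertools import product

# Every (specification, molecule) pair becomes one independent compute task.
inputs = [
    AtomicInput(input_specification=spec, molecule=mol)
    for spec, mol in product(compute_specs, molecule_library)
]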

Happy to discuss more if at all helpful, but I think these two points summarize my high level thoughts. Thanks for continuing to advance the project--I think you are doing such important work for the field :)

@coltonbh
Collaborator Author

@bennybp I do have an additional thought now :)

Another layer at which model consolidation would be great and lead to a cleaner API (IMO):

class AtomicResult(ProtoModel):
    input_data: AtomicInput
    ...

class FailedOperation(ProtoModel):
    input_data: AtomicInput
    ...

class OptimizationResult(ProtoModel):
    input_data: OptimizationInput

Currently QCElemental has two ways of dealing with input data returned on a results object:

  1. The result object inherits from the input object and adds attributes for the result (e.g., AtomicResult and OptimizationResult)
  2. A model is composed with an attribute--usually input_data--that contains the original AtomicInput submitted for the calculation (e.g., FailedOperation)

I noticed this discrepancy when writing code to handle successful/failed computations. It's super nice to be able to say if result.success: ... since both AtomicResult and FailedOperation have the .success attribute; however, I have to handle the input data that generated the result differently (access it at .input_data on FailedOperation objects, and pull the fields directly off AtomicResult). I think it would be nice if all results objects--whether failed or successful--had a .input_data attribute that completely defines the input for the calculation.
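Roughly what the handling code looks like today (illustrative, not exact library code; atomic_input is assumed to be an AtomicInput):

import qcengine as qcng

result = qcng.compute(atomic_input, "psi4", raise_error=False)

if result.success:
    # AtomicResult: the input fields live directly on the result (inheritance approach)
    original_molecule = result.molecule
else:
    # FailedOperation: the original input is tucked under .input_data (composition approach)
    original_molecule = result.input_data.molecule  # or result.input_data["molecule"], depending on what was stored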

Another win for the compositional model you were suggesting above.

@coltonbh
Collaborator Author

coltonbh commented Apr 1, 2022

One other benefit of this approach for AtomicResult objects is that, under the current inheritance approach, the .extras field from AtomicInput is shared between input and result and gets used bidirectionally--for extra input kwargs or parameters, and then for extra output vars/data. It would be nice to separate the .extras associated with the inputs from the .extras generated for the result :)
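Under the compositional shape the two would naturally separate (hypothetical attribute layout):

result.input_data.extras  # extras supplied with the input
result.extras             # extras produced by the program/harness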
