
feat(robot-server): hash protocol + labware contents to avoid duplicate entries #12193

Merged
merged 24 commits into from
Mar 17, 2023

Conversation

@shlokamin (Member) commented Feb 24, 2023

Overview

This PR adds a content hash property to protocol resources. The hash is generated when POSTing a protocol in the robot server, and on robot boot when generating protocol sources into memory.

closes RCORE-614

Test Plan

  • Used postman to POST to the /protocols endpoint multiple times with the same protocol
  • Validated that the first POST returned a 201 status, and all subsequent returned 200s
  • Validated that when hitting GET /protocols I only got one entry back
  • Restarted a robot and made sure I still got 200 statuses and that new protocol entries were not being created
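The server-side behavior this plan exercises can be sketched offline. The class and field names below are illustrative, not the actual robot-server implementation; only the hash-then-dedupe flow and the 201/200 statuses come from this PR:

```python
from hashlib import md5
from typing import Dict, List, Tuple

class ProtocolStoreSketch:
    """Toy in-memory store that deduplicates uploads by content hash."""

    def __init__(self) -> None:
        self._protocol_ids_by_content_hash: Dict[str, str] = {}
        self._next_id = 0

    def add(self, files: List[Tuple[str, bytes]]) -> Tuple[str, int]:
        """Return (protocol_id, status): 201 on first upload, 200 on a duplicate."""
        hasher = md5()
        for name, contents in sorted(files):  # order-insensitive
            hasher.update(md5(name.encode("utf-8")).digest())
            hasher.update(md5(contents).digest())
        content_hash = hasher.hexdigest()
        existing = self._protocol_ids_by_content_hash.get(content_hash)
        if existing is not None:
            return existing, 200  # same content already stored
        self._next_id += 1
        protocol_id = f"protocol-{self._next_id}"
        self._protocol_ids_by_content_hash[content_hash] = protocol_id
        return protocol_id, 201  # newly created

store = ProtocolStoreSketch()
files = [("protocol.py", b"metadata = {}")]
pid1, status1 = store.add(files)
pid2, status2 = store.add(files)
assert (status1, status2) == (201, 200)
assert pid1 == pid2
```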

Changelog

  • Hash protocol and labware contents to avoid duplicate entries

Review requests

Follow test plan above

Risk assessment

High — this touches a core API, and one that the app heavily depends on

codecov bot commented Feb 24, 2023

Codecov Report

Merging #12193 (77e3cc9) into edge (d47b68e) will increase coverage by 0.22%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##             edge   #12193      +/-   ##
==========================================
+ Coverage   74.02%   74.24%   +0.22%     
==========================================
  Files        2193     2170      -23     
  Lines       60360    59596     -764     
  Branches     6223     6236      +13     
==========================================
- Hits        44679    44245     -434     
+ Misses      14252    13920     -332     
- Partials     1429     1431       +2     
Flag Coverage Δ
notify-server 89.13% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
api/src/opentrons/protocol_reader/__init__.py 100.00% <ø> (ø)
...i/src/opentrons/protocol_reader/protocol_reader.py 100.00% <ø> (ø)
...i/src/opentrons/protocol_reader/protocol_source.py 100.00% <ø> (ø)
...obot-server/robot_server/protocols/dependencies.py 100.00% <ø> (ø)
...ot-server/robot_server/protocols/protocol_store.py 94.73% <ø> (ø)
robot-server/robot_server/protocols/router.py 100.00% <ø> (ø)

... and 70 files with indirect coverage changes

@shlokamin shlokamin force-pushed the robot_server-add-protocol-labware-hashing branch from 9a27d67 to 93f5769 Compare February 24, 2023 15:55
@sfoster1 (Member) left a comment:

I regret to announce I have unicode-based nitpicks

api/src/opentrons/protocol_reader/file_hasher.py (outdated, resolved)
robot-server/robot_server/protocols/router.py (outdated, resolved)
@@ -148,9 +157,57 @@ async def create_protocol(
        analysis_id: Unique identifier to attach to the analysis resource.
        created_at: Timestamp to attach to the new resource.
    """
    buffered_files = await file_reader_writer.read(files=files)
    content_hash = file_hasher.hash(buffered_files)
    content_hash2protocol_id: Dict[str, str] = dict(
Member:

[nit]: something about the naming of this is breaking my brain

@shlokamin (Member Author) replied:

Yeah, it's kind of weird; I used this as a reference for naming: https://bagrow.com/blog/2020/10/02/naming-things-python-dictionary-ed./

Contributor:

key2value is interesting. I don't think I've ever seen that before in the wild. I usually do values_by_key. So this would be protocol_ids_by_content_hash.
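As a small illustration of the values_by_key convention (the resource shape here is hypothetical, not the robot-server model):

```python
from typing import Dict, List

# Hypothetical protocol resources, each carrying a content hash.
protocols: List[Dict[str, str]] = [
    {"id": "abc123", "content_hash": "9f86d08"},
    {"id": "def456", "content_hash": "60303ae"},
]

# values_by_key naming: the dict key is the content hash,
# the value is the protocol id it maps to.
protocol_ids_by_content_hash: Dict[str, str] = {
    p["content_hash"]: p["id"] for p in protocols
}

assert protocol_ids_by_content_hash["9f86d08"] == "abc123"
```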

@sfoster1 sfoster1 self-requested a review February 27, 2023 18:50
@SyntaxColoring (Contributor) left a comment:

Thank you! This is looking great so far. 🥔 #️⃣

api/src/opentrons/protocol_reader/file_hasher.py (outdated, resolved)
api/src/opentrons/protocol_reader/file_hasher.py (outdated, resolved)
Comment on lines +50 to +59
def get_file_reader_writer() -> FileReaderWriter:
    """Get a FileReaderWriter to read file streams into memory and write file streams to disk."""
    return FileReaderWriter()


def get_file_hasher() -> FileHasher:
    """Get a FileHasher to hash a file and see if it already exists on the server."""
    return FileHasher()


Contributor:

Because of FastAPI 𝓶𝓪𝓰𝓲𝓬, we don't actually need to define these functions.

Instead of doing this:

file_hasher = Depends(get_file_hasher)

...the endpoint function can do this:

file_hasher = Depends(FileHasher)
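For context on the magic: FastAPI accepts any callable as a dependency, and a class is a callable that constructs an instance, so Depends(FileHasher) works just like Depends(get_file_hasher). A plain-Python sketch of that idea, with no FastAPI required (the FileHasher here is a stand-in, not the real class):

```python
import hashlib
from typing import Any, Callable

class FileHasher:
    """Stand-in for the real FileHasher dependency (illustrative only)."""

    def hash(self, data: bytes) -> str:
        return hashlib.md5(data).hexdigest()

def get_file_hasher() -> FileHasher:
    return FileHasher()

def resolve_dependency(dep: Callable[..., Any]) -> Any:
    # FastAPI-style resolution: a dependency is just "something callable".
    # A class is itself callable (calling it constructs an instance), so
    # passing the class directly behaves the same as passing a factory.
    return dep()

a = resolve_dependency(get_file_hasher)  # factory function
b = resolve_dependency(FileHasher)       # class passed directly
assert isinstance(a, FileHasher) and isinstance(b, FileHasher)
assert a.hash(b"x") == b.hash(b"x")
```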

@shlokamin (Member Author) replied:

love me some magic


robot-server/robot_server/protocols/router.py (outdated, resolved)
robot-server/robot_server/protocols/router.py (resolved)
robot-server/robot_server/protocols/router.py (outdated, resolved)
@shlokamin shlokamin force-pushed the robot_server-add-protocol-labware-hashing branch from 9227107 to 37f1aa6 Compare March 6, 2023 17:07
@shlokamin shlokamin marked this pull request as ready for review March 6, 2023 17:08
@shlokamin shlokamin requested review from a team as code owners March 6, 2023 17:08
@shlokamin shlokamin requested a review from y3rsh March 6, 2023 22:08
@sfoster1 (Member) left a comment:

I like it, it looks like it'll work great!

I'm a little worried about the sheer extent of the create_protocol route handler but given how many different values that block needs I guess it's tough to reasonably refactor.

One nit is some of the utility classes like ProtocolHasher could probably have their arguments be Sequence rather than List, or even Iterable, just to define the outer boundary of what the function accepts.

api/src/opentrons/protocol_reader/protocol_reader.py (outdated, resolved)
api/src/opentrons/protocol_reader/file_hasher.py (outdated, resolved)
@SyntaxColoring (Contributor) left a comment:

This is coming along, thanks!

The big things for me at this point are the hashing algorithm and the "delete the protocol" test additions.

robot-server/robot_server/protocols/protocol_store.py (outdated, resolved)
api/src/opentrons/protocol_reader/file_hasher.py (outdated, resolved)
robot-server/robot_server/protocols/router.py (outdated, resolved)
@shlokamin shlokamin force-pushed the robot_server-add-protocol-labware-hashing branch from e8e89ff to 708efe9 Compare March 13, 2023 18:27
        return await anyio.to_thread.run_sync(_hash_sync, files)


def _hash_sync(files: Sequence[BufferedFile]) -> str:
Contributor:

Can we have a docstring for this function as well?

@shlokamin (Member Author) commented Mar 16, 2023:

I followed the pattern @SyntaxColoring pointed out in this comment that references file_format_validator: it doesn't look like we normally have docstrings for private async methods that are not meant to be exposed as a public API; instead we document the public method that delegates to them.

For example, the validate method in file_format_validator just delegates to private async methods that actually do the work, but we put the docstring in the public method.

@jbleon95 (Contributor) left a comment:

The changes LGTM, besides my minor comments.

@SyntaxColoring (Contributor) left a comment:

Looks great. Thanks for pushing this through!

Here are a few fixups that I hope are uncontroversial and can be applied in place.

Looks good to merge with or without these changes (assuming CI passes).

@@ -148,6 +150,23 @@ def server_temp_directory() -> Iterator[str]:
    yield new_dir


@pytest.fixture()
def clean_server_state() -> Iterator[None]:
Contributor:

Suggested change
def clean_server_state() -> Iterator[None]:
def clean_server_state(run_server: object) -> Iterator[None]:

Depending on the run_server fixture ensures that Pytest will wait to tear down run_server until after this fixture runs its teardown. Without this, I think there's theoretically a hazard of this fixture running too late and running into "connection refused" errors, depending on the order in which Pytest chooses to run fixtures, which I think is undefined when there are no explicit dependency relationships.
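Pytest fixtures are generator-based, so the ordering guarantee can be illustrated with plain generators (a standalone sketch; the fixture names mirror the ones above but nothing here touches a real server):

```python
events = []

def run_server():
    # Generator fixture: setup before the yield, teardown after it.
    events.append("server up")
    yield
    events.append("server down")

def clean_server_state(run_server_fixture):
    # Declaring run_server as a parameter makes pytest finalize this
    # fixture BEFORE the server fixture, so cleanup still has a live server.
    events.append("request protocol deletion")
    yield
    events.append("state cleaned while server is up")

server = run_server()
next(server)                       # set up the "server"
cleaner = clean_server_state(server)
next(cleaner)                      # set up the dependent fixture

# Pytest tears down dependents before their dependencies:
for gen in (cleaner, server):
    try:
        next(gen)
    except StopIteration:
        pass

assert events == [
    "server up",
    "request protocol deletion",
    "state cleaned while server is up",
    "server down",
]
```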

Comment on lines +19 to +26
def _hash_sync(files: Sequence[BufferedFile]) -> str:
    sorted_files = sorted(files, key=lambda x: unicodedata.normalize("NFC", x.name))
    name_content_hasher = md5()
    for file in sorted_files:
        name_hash = md5(file.name.encode("utf-8")).digest()
        contents_hash = md5(file.contents).digest()
        name_content_hasher.update(name_hash + contents_hash)
    return name_content_hasher.hexdigest()
Contributor:

This looks good. Here are a couple of suggested changes:

  • Normalize the file name both when we're sorting, and when we're hashing. If we expect normalization to be necessary, then it only makes sense to do it for both, not just one or the other, right?
  • Leave a comment explaining why we're doing this normalization.
Suggested change
def _hash_sync(files: Sequence[BufferedFile]) -> str:
    sorted_files = sorted(files, key=lambda x: unicodedata.normalize("NFC", x.name))
    name_content_hasher = md5()
    for file in sorted_files:
        name_hash = md5(file.name.encode("utf-8")).digest()
        contents_hash = md5(file.contents).digest()
        name_content_hasher.update(name_hash + contents_hash)
    return name_content_hasher.hexdigest()
def _hash_sync(files: Sequence[BufferedFile]) -> str:
    def normalized_name(file: BufferedFile) -> str:
        # Normalize the file's name in case there would be differences in its
        # form between the robot's filesystem and the HTTP upload.
        # This is precautionary; we're not actually aware of any differences.
        return unicodedata.normalize("NFC", file.name)

    sorted_files = sorted(files, key=normalized_name)
    name_content_hasher = md5()
    for file in sorted_files:
        name_hash = md5(normalized_name(file).encode("utf-8")).digest()
        contents_hash = md5(file.contents).digest()
        name_content_hasher.update(name_hash + contents_hash)
    return name_content_hasher.hexdigest()

Comment on lines +63 to +64
files: List buffered files. Do not attempt to reuse any objects
in this list once they've been passed to the ProtocolReader.
Contributor:

Suggested change
files: List buffered files. Do not attempt to reuse any objects
in this list once they've been passed to the ProtocolReader.
files: List of buffered files

Since you've changed the interface to accept BufferedFiles instead of AbstractInputFiles, we don't need this quirky warning about file reuse anymore. This had to do with file .seek()ing, but we don't need to worry about that with BufferedFiles.
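For the curious, the .seek() hazard the old warning guarded against is easy to demonstrate with a file-like stream; a BufferedFile carries raw bytes, which can be read any number of times:

```python
import io

# A file-like stream has a cursor: a second read() without a rewind
# returns nothing, which is why reusing AbstractInputFiles was unsafe.
stream = io.BytesIO(b"metadata = {}")
first_read = stream.read()
second_read = stream.read()  # cursor is at EOF now
assert first_read == b"metadata = {}"
assert second_read == b""

stream.seek(0)  # only an explicit rewind makes the stream readable again
third_read = stream.read()
assert third_read == first_read
```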

Comment on lines +49 to +64
def delete_all_protocols(response: Response, host: str, port: str) -> None:
    """Intake the response of a GET /protocols and delete all protocols if any exist."""
    headers = {"Opentrons-Version": "*"}
    base_url = f"{host}:{port}"
    protocols = response.json()["data"]
    protocol_ids = [protocol["id"] for protocol in protocols]
    for protocol_id in protocol_ids:
        delete_response = requests.delete(
            f"{base_url}/protocols/{protocol_id}", headers=headers
        )
        print(
            f"Deleted protocol {protocol_id},"
            f" response status code = {delete_response.status_code}"
        )


Contributor:

Is this left over from development? I don't see it used anywhere?

Suggested change
def delete_all_protocols(response: Response, host: str, port: str) -> None:
    """Intake the response of a GET /protocols and delete all protocols if any exist."""
    headers = {"Opentrons-Version": "*"}
    base_url = f"{host}:{port}"
    protocols = response.json()["data"]
    protocol_ids = [protocol["id"] for protocol in protocols]
    for protocol_id in protocol_ids:
        delete_response = requests.delete(
            f"{base_url}/protocols/{protocol_id}", headers=headers
        )
        print(
            f"Deleted protocol {protocol_id},"
            f" response status code = {delete_response.status_code}"
        )

@shlokamin (Member Author):

Great! I'm going to merge this and address the last comments in a follow-up.
