feat(robot-server): hash protocol + labware contents to avoid duplicate entries #12193
Conversation
Codecov Report
@@ Coverage Diff @@
## edge #12193 +/- ##
==========================================
+ Coverage 74.02% 74.24% +0.22%
==========================================
Files 2193 2170 -23
Lines 60360 59596 -764
Branches 6223 6236 +13
==========================================
- Hits 44679 44245 -434
+ Misses 14252 13920 -332
- Partials 1429 1431 +2
Flags with carried forward coverage won't be shown.
Force-pushed from 9a27d67 to 93f5769.
I regret to announce I have unicode-based nitpicks
@@ -148,9 +157,57 @@ async def create_protocol(
        analysis_id: Unique identifier to attach to the analysis resource.
        created_at: Timestamp to attach to the new resource.
    """
    buffered_files = await file_reader_writer.read(files=files)
    content_hash = file_hasher.hash(buffered_files)
    content_hash2protocol_id: Dict[str, str] = dict(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]: something about the naming of this is breaking my brain
Yeah, it's kinda weird. I used this as a reference for naming: https://bagrow.com/blog/2020/10/02/naming-things-python-dictionary-ed./
key2value is interesting. I don't think I've ever seen that before in the wild. I usually do values_by_key. So this would be protocol_ids_by_content_hash.
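For what it's worth, here are the two conventions side by side (a toy illustration with made-up values, not code from this PR):

# key2value style, as currently in the PR:
content_hash2protocol_id = {"abc123": "protocol-1"}

# values_by_key style, as suggested:
protocol_ids_by_content_hash = {"abc123": "protocol-1"}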
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is looking great so far. 🥔 #️⃣
def get_file_reader_writer() -> FileReaderWriter:
    """Get a FileReaderWriter to read file streams into memory and write file streams to disk."""
    return FileReaderWriter()


def get_file_hasher() -> FileHasher:
    """Get a FileHasher to hash a file and see if it already exists on the server."""
    return FileHasher()
Because of FastAPI 𝓶𝓪𝓰𝓲𝓬, we don't actually need to define these functions.
Instead of doing this:
file_hasher = Depends(get_file_hasher)
...the endpoint function can do this:
file_hasher = Depends(FileHasher)
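A minimal sketch of that shorthand in an endpoint, assuming a no-argument FileHasher constructor (the stub class and route below are illustrative, not the real robot-server code). FastAPI accepts any callable, including a class, as a dependency and calls it to build the value:

from fastapi import Depends, FastAPI

app = FastAPI()


class FileHasher:
    """Stub standing in for the real FileHasher."""


@app.post("/protocols")
async def create_protocol(
    file_hasher: FileHasher = Depends(FileHasher),  # the class itself is the dependency
) -> dict:
    # file_hasher is a fresh FileHasher instance built by FastAPI
    return {"hasher": type(file_hasher).__name__}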
love me some magic
Force-pushed from 9227107 to 37f1aa6.
I like it, it looks like it'll work great!

I'm a little worried about the sheer extent of the create_protocol route handler, but given how many different values that block needs, I guess it's tough to reasonably refactor.

One nit: some of the utility classes like ProtocolHasher could probably have their arguments be Sequence rather than List, or even Iterable, just to define the outer boundary of what the function accepts.
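As a sketch of that nit (with BufferedFile stubbed out for self-containment; the real signature may differ):

from dataclasses import dataclass
from hashlib import md5
from typing import Sequence


@dataclass
class BufferedFile:
    name: str
    contents: bytes


# Sequence (or even Iterable) instead of List documents the widest input the
# function actually needs, so callers can pass tuples or other sequences too.
def hash_files(files: Sequence[BufferedFile]) -> str:
    hasher = md5()
    for file in files:
        hasher.update(file.contents)
    return hasher.hexdigest()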
This is coming along, thanks!

The big things for me at this point are the hashing algorithm and the Delete the protocol test additions.
Force-pushed from e8e89ff to 708efe9.
    return await anyio.to_thread.run_sync(_hash_sync, files)


def _hash_sync(files: Sequence[BufferedFile]) -> str:
Can we have a docstring for this function as well?
I followed the pattern @SyntaxColoring pointed out in this comment that references file_format_validator. It doesn't look like we normally have docstrings for async private methods that aren't meant to be exposed as a public API; instead, we document the public method that delegates to them. For example, the validate method in file_format_validator just delegates to private async methods that actually do the work, but we put the docstring in the public method.
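A small sketch of the pattern being described, loosely modeled on the hashing code in this PR (the class shape is assumed): the public coroutine carries the docstring, while the private helper that does the work goes undocumented.

import anyio
from hashlib import md5
from typing import Sequence


class FileHasher:
    async def hash(self, files: Sequence[bytes]) -> str:
        """Hash the given file contents so duplicate uploads can be detected."""
        return await anyio.to_thread.run_sync(_hash_sync, files)


def _hash_sync(files: Sequence[bytes]) -> str:
    # Private helper; documented via the public method that delegates to it.
    hasher = md5()
    for contents in files:
        hasher.update(contents)
    return hasher.hexdigest()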
The changes LGTM, besides my minor comments.
Looks great. Thanks for pushing this through!

Here are a few fixups that I hope are uncontroversial and can be applied in place.

Looks good to merge with or without these changes (assuming CI passes).
@@ -148,6 +150,23 @@ def server_temp_directory() -> Iterator[str]:
    yield new_dir


@pytest.fixture()
def clean_server_state() -> Iterator[None]:
Suggested change:
- def clean_server_state() -> Iterator[None]:
+ def clean_server_state(run_server: object) -> Iterator[None]:
Depending on the run_server fixture ensures that Pytest will wait to tear down run_server until after this fixture runs its teardown. Without this, I think there's theoretically a hazard of this fixture running too late and running into "connection refused" errors, depending on the order in which Pytest chooses to run fixtures, which I think is undefined when there are no explicit dependency relationships.
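Concretely, the suggestion amounts to something like this (the teardown body is a made-up placeholder; the run_server parameter is the point):

import pytest
from typing import Iterator


@pytest.fixture()
def clean_server_state(run_server: object) -> Iterator[None]:
    # Naming run_server as a parameter makes pytest tear this fixture down
    # before it tears down the server this fixture's cleanup talks to.
    yield
    # (placeholder) HTTP cleanup calls against the still-running server go here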
def _hash_sync(files: Sequence[BufferedFile]) -> str:
    sorted_files = sorted(files, key=lambda x: unicodedata.normalize("NFC", x.name))
    name_content_hasher = md5()
    for file in sorted_files:
        name_hash = md5(file.name.encode("utf-8")).digest()
        contents_hash = md5(file.contents).digest()
        name_content_hasher.update(name_hash + contents_hash)
    return name_content_hasher.hexdigest()
This looks good. Here are a couple of suggested changes:
- Normalize the file name both when we're sorting and when we're hashing. If we expect normalization to be necessary, then it only makes sense to do it for both, not just one or the other, right?
- Leave a comment explaining why we're doing this normalization.
Suggested change (replacing the function body above):

def _hash_sync(files: Sequence[BufferedFile]) -> str:
    def normalized_name(file: BufferedFile) -> str:
        # Normalize the file's name in case there would be differences in its
        # form between the robot's filesystem and the HTTP upload.
        # This is precautionary; we're not actually aware of any differences.
        return unicodedata.normalize("NFC", file.name)

    sorted_files = sorted(files, key=normalized_name)
    name_content_hasher = md5()
    for file in sorted_files:
        name_hash = md5(normalized_name(file).encode("utf-8")).digest()
        contents_hash = md5(file.contents).digest()
        name_content_hasher.update(name_hash + contents_hash)
    return name_content_hasher.hexdigest()
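For context on why the normalization matters at all, here's a quick standalone illustration (plain Python, not from this PR): the same visible file name can arrive in two different Unicode forms, and NFC collapses them.

import unicodedata

precomposed = "caf\u00e9.py"   # "café.py" with é as a single code point (NFC)
decomposed = "cafe\u0301.py"   # same visible name, é as "e" + combining accent (NFD)

assert precomposed != decomposed                                # raw strings differ
assert unicodedata.normalize("NFC", decomposed) == precomposed  # equal once normalized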
        files: List buffered files. Do not attempt to reuse any objects
            in this list once they've been passed to the ProtocolReader.
Suggested change:
- files: List buffered files. Do not attempt to reuse any objects
-     in this list once they've been passed to the ProtocolReader.
+ files: List of buffered files
Since you've changed the interface to accept BufferedFiles instead of AbstractInputFiles, we don't need this quirky warning about file reuse anymore. This had to do with file .seek()ing, but we don't need to worry about that with BufferedFiles.
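For anyone following along, the distinction roughly looks like this (a sketch from the names alone, not the real type definitions):

from dataclasses import dataclass


@dataclass
class BufferedFile:
    name: str
    contents: bytes  # fully read into memory already, so rereading is free


# An AbstractInputFile-style object wraps a live stream instead, so a second
# consumer would have needed to .seek(0) first -- hence the old warning.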
def delete_all_protocols(response: Response, host: str, port: str) -> None:
    """Intake the response of a GET /protocols and delete all protocols if any exist."""
    headers = {"Opentrons-Version": "*"}
    base_url = f"{host}:{port}"
    protocols = response.json()["data"]
    protocol_ids = [protocol["id"] for protocol in protocols]
    for protocol_id in protocol_ids:
        delete_response = requests.delete(
            f"{base_url}/protocols/{protocol_id}", headers=headers
        )
        print(
            f"Deleted protocol {protocol_id},"
            f" response status code = {delete_response.status_code}"
        )
Is this left over from development? I don't see it used anywhere?
Suggested change: delete delete_all_protocols entirely.
Great! I'm gonna merge this and address the last comments in a follow-up.
Overview

This PR adds a content hash property to protocol resources. The hash is generated when POSTing a protocol in the robot server, and on robot boot when generating protocol sources into memory.

Closes RCORE-614
Test Plan
- POST to the /protocols endpoint multiple times with the same protocol
- The first POST returned a 201 status, and all subsequent POSTs returned 200s
- On GET /protocols, I only got one entry back (sketched below)
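A rough sketch of that flow with requests (the address and file name are placeholders):

import requests

base_url = "http://localhost:31950"  # placeholder robot address
headers = {"Opentrons-Version": "*"}

with open("my_protocol.py", "rb") as f:  # placeholder protocol file
    first = requests.post(f"{base_url}/protocols", headers=headers, files={"files": f})
with open("my_protocol.py", "rb") as f:
    second = requests.post(f"{base_url}/protocols", headers=headers, files={"files": f})

assert first.status_code == 201   # first upload creates the protocol
assert second.status_code == 200  # duplicate upload returns the existing one

protocols = requests.get(f"{base_url}/protocols", headers=headers).json()["data"]
assert len(protocols) == 1  # only one entry despite two uploads

Changelog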
Review requests
Follow the test plan above.
Risk assessment
High — this touches a core API, and one that the app heavily depends on