feat(robot-server): hash protocol + labware contents to avoid duplicate entries #12193
Conversation
Codecov Report
@@ Coverage Diff @@
## edge #12193 +/- ##
==========================================
+ Coverage 74.02% 74.24% +0.22%
==========================================
Files 2193 2170 -23
Lines 60360 59596 -764
Branches 6223 6236 +13
==========================================
- Hits 44679 44245 -434
+ Misses 14252 13920 -332
- Partials 1429 1431 +2
Flags with carried forward coverage won't be shown.
Force-pushed from 9a27d67 to 93f5769.
I regret to announce I have unicode-based nitpicks
@@ -148,9 +157,57 @@ async def create_protocol(
        analysis_id: Unique identifier to attach to the analysis resource.
        created_at: Timestamp to attach to the new resource.
    """
    buffered_files = await file_reader_writer.read(files=files)
    content_hash = file_hasher.hash(buffered_files)
    content_hash2protocol_id: Dict[str, str] = dict(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]: something about the naming of this is breaking my brain
Yeah, it's kinda weird. I used this as a reference for naming: https://bagrow.com/blog/2020/10/02/naming-things-python-dictionary-ed./
key2value is interesting. I don't think I've ever seen that before in the wild. I usually do values_by_key. So this would be protocol_ids_by_content_hash.
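For what it's worth, here are the two conventions side by side (a toy illustration with made-up values, not code from this PR):

# key2value style, as currently in the PR:
content_hash2protocol_id = {"abc123": "protocol-1"}

# values_by_key style, as suggested:
protocol_ids_by_content_hash = {"abc123": "protocol-1"}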
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is looking great so far. 🥔 #️⃣
def get_file_reader_writer() -> FileReaderWriter:
    """Get a FileReaderWriter to read file streams into memory and write file streams to disk."""
    return FileReaderWriter()


def get_file_hasher() -> FileHasher:
    """Get a FileHasher to hash a file and see if it already exists on the server."""
    return FileHasher()
Because of FastAPI 𝓶𝓪𝓰𝓲𝓬, we don't actually need to define these functions.
Instead of doing this:
file_hasher = Depends(get_file_hasher)
...the endpoint function can do this:
file_hasher = Depends(FileHasher)
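A minimal sketch of that shorthand in an endpoint, assuming a no-argument FileHasher constructor (the stub class and route below are illustrative, not the real robot-server code). FastAPI accepts any callable, including a class, as a dependency and calls it to build the value:

from fastapi import Depends, FastAPI

app = FastAPI()


class FileHasher:
    """Stub standing in for the real FileHasher."""


@app.post("/protocols")
async def create_protocol(
    file_hasher: FileHasher = Depends(FileHasher),  # the class itself is the dependency
) -> dict:
    # file_hasher is a fresh FileHasher instance built by FastAPI
    return {"hasher": type(file_hasher).__name__}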
love me some magic
Force-pushed from 9227107 to 37f1aa6.
I like it, it looks like it'll work great!

I'm a little worried about the sheer extent of the create_protocol route handler, but given how many different values that block needs, I guess it's tough to reasonably refactor.

One nit: some of the utility classes like ProtocolHasher could probably have their arguments be Sequence rather than List, or even Iterable, just to define the outer boundary of what the function accepts.
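As a sketch of that nit (with BufferedFile stubbed out for self-containment; the real signature may differ):

from dataclasses import dataclass
from hashlib import md5
from typing import Sequence


@dataclass
class BufferedFile:
    name: str
    contents: bytes


# Sequence (or even Iterable) instead of List documents the widest input the
# function actually needs, so callers can pass tuples or other sequences too.
def hash_files(files: Sequence[BufferedFile]) -> str:
    hasher = md5()
    for file in files:
        hasher.update(file.contents)
    return hasher.hexdigest()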
This is coming along, thanks!

The big things for me at this point are the hashing algorithm and the Delete the protocol test additions.
Force-pushed from e8e89ff to 708efe9.
    return await anyio.to_thread.run_sync(_hash_sync, files)


def _hash_sync(files: Sequence[BufferedFile]) -> str:
Can we have a docstring for this function as well?
I followed the pattern @SyntaxColoring pointed out in this comment that references file_format_validator. It doesn't look like we normally have docstrings for async private methods that aren't meant to be exposed as a public API; instead, we document the public method that delegates to them. For example, the validate method in file_format_validator just delegates to private async methods that actually do the work, but we put the docstring in the public method.
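A small sketch of the pattern being described, loosely modeled on the hashing code in this PR (the class shape is assumed): the public coroutine carries the docstring, while the private helper that does the work goes undocumented.

import anyio
from hashlib import md5
from typing import Sequence


class FileHasher:
    async def hash(self, files: Sequence[bytes]) -> str:
        """Hash the given file contents so duplicate uploads can be detected."""
        return await anyio.to_thread.run_sync(_hash_sync, files)


def _hash_sync(files: Sequence[bytes]) -> str:
    # Private helper; documented via the public method that delegates to it.
    hasher = md5()
    for contents in files:
        hasher.update(contents)
    return hasher.hexdigest()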
The changes LGTM, besides my minor comments.
Looks great. Thanks for pushing this through!

Here are a few fixups that I hope are uncontroversial and can be applied in place.

Looks good to merge with or without these changes (assuming CI passes).
@@ -148,6 +150,23 @@ def server_temp_directory() -> Iterator[str]:
    yield new_dir


@pytest.fixture()
def clean_server_state() -> Iterator[None]:
Suggested change:
- def clean_server_state() -> Iterator[None]:
+ def clean_server_state(run_server: object) -> Iterator[None]:
Depending on the run_server fixture ensures that Pytest will wait to tear down run_server until after this fixture runs its teardown. Without this, I think there's theoretically a hazard of this fixture running too late and running into "connection refused" errors, depending on the order in which Pytest chooses to run fixtures, which I think is undefined when there are no explicit dependency relationships.
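Concretely, the suggestion amounts to something like this (the teardown body is a made-up placeholder; the run_server parameter is the point):

import pytest
from typing import Iterator


@pytest.fixture()
def clean_server_state(run_server: object) -> Iterator[None]:
    # Naming run_server as a parameter makes pytest tear this fixture down
    # before it tears down the server this fixture's cleanup talks to.
    yield
    # (placeholder) HTTP cleanup calls against the still-running server go here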
def _hash_sync(files: Sequence[BufferedFile]) -> str:
    sorted_files = sorted(files, key=lambda x: unicodedata.normalize("NFC", x.name))
    name_content_hasher = md5()
    for file in sorted_files:
        name_hash = md5(file.name.encode("utf-8")).digest()
        contents_hash = md5(file.contents).digest()
        name_content_hasher.update(name_hash + contents_hash)
    return name_content_hasher.hexdigest()
This looks good. Here are a couple of suggested changes:
- Normalize the file name both when we're sorting and when we're hashing. If we expect normalization to be necessary, then it only makes sense to do it for both, not just one or the other, right?
- Leave a comment explaining why we're doing this normalization.
Suggested change (replacing the function body above):

def _hash_sync(files: Sequence[BufferedFile]) -> str:
    def normalized_name(file: BufferedFile) -> str:
        # Normalize the file's name in case there would be differences in its
        # form between the robot's filesystem and the HTTP upload.
        # This is precautionary; we're not actually aware of any differences.
        return unicodedata.normalize("NFC", file.name)

    sorted_files = sorted(files, key=normalized_name)
    name_content_hasher = md5()
    for file in sorted_files:
        name_hash = md5(normalized_name(file).encode("utf-8")).digest()
        contents_hash = md5(file.contents).digest()
        name_content_hasher.update(name_hash + contents_hash)
    return name_content_hasher.hexdigest()
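For context on why the normalization matters at all, here's a quick standalone illustration (plain Python, not from this PR): the same visible file name can arrive in two different Unicode forms, and NFC collapses them.

import unicodedata

precomposed = "caf\u00e9.py"   # "café.py" with é as a single code point (NFC)
decomposed = "cafe\u0301.py"   # same visible name, é as "e" + combining accent (NFD)

assert precomposed != decomposed                                # raw strings differ
assert unicodedata.normalize("NFC", decomposed) == precomposed  # equal once normalized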
        files: List buffered files. Do not attempt to reuse any objects
            in this list once they've been passed to the ProtocolReader.
Suggested change:
- files: List buffered files. Do not attempt to reuse any objects
-     in this list once they've been passed to the ProtocolReader.
+ files: List of buffered files
Since you've changed the interface to accept BufferedFiles instead of AbstractInputFiles, we don't need this quirky warning about file reuse anymore. This had to do with file .seek()ing, but we don't need to worry about that with BufferedFiles.
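For anyone following along, the distinction roughly looks like this (a sketch from the names alone, not the real type definitions):

from dataclasses import dataclass


@dataclass
class BufferedFile:
    name: str
    contents: bytes  # fully read into memory already, so rereading is free


# An AbstractInputFile-style object wraps a live stream instead, so a second
# consumer would have needed to .seek(0) first -- hence the old warning.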
def delete_all_protocols(response: Response, host: str, port: str) -> None:
    """Intake the response of a GET /protocols and delete all protocols if any exist."""
    headers = {"Opentrons-Version": "*"}
    base_url = f"{host}:{port}"
    protocols = response.json()["data"]
    protocol_ids = [protocol["id"] for protocol in protocols]
    for protocol_id in protocol_ids:
        delete_response = requests.delete(
            f"{base_url}/protocols/{protocol_id}", headers=headers
        )
        print(
            f"Deleted protocol {protocol_id},"
            f" response status code = {delete_response.status_code}"
        )
Is this left over from development? I don't see it used anywhere?
Suggested change: delete delete_all_protocols entirely.
Great! I'm gonna merge this and address the last comments in a follow-up.
Overview

This PR adds a content hash property to protocol resources. The hash is generated when POSTing a protocol in the robot server, and on robot boot when generating protocol sources into memory.

Closes RCORE-614
Test Plan
- POST to the /protocols endpoint multiple times with the same protocol
- The first POST returned a 201 status, and all subsequent POSTs returned 200s
- On GET /protocols, I only got one entry back (sketched below)
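A rough sketch of that flow with requests (the address and file name are placeholders):

import requests

base_url = "http://localhost:31950"  # placeholder robot address
headers = {"Opentrons-Version": "*"}

with open("my_protocol.py", "rb") as f:  # placeholder protocol file
    first = requests.post(f"{base_url}/protocols", headers=headers, files={"files": f})
with open("my_protocol.py", "rb") as f:
    second = requests.post(f"{base_url}/protocols", headers=headers, files={"files": f})

assert first.status_code == 201   # first upload creates the protocol
assert second.status_code == 200  # duplicate upload returns the existing one

protocols = requests.get(f"{base_url}/protocols", headers=headers).json()["data"]
assert len(protocols) == 1  # only one entry despite two uploads

Changelog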
Review requests
Follow the test plan above.
Risk assessment
High — this touches a core API, and one that the app heavily depends on