Bug in DB model id generator / model hasher #142
Whatever the fix is, we will need to patch it in and then reprocess all of the above meetings, which pushes the need for a "delete all event data for event id" utility. Basically, this single issue becomes two: the hasher fix itself, and the event-deletion utility.
I already have some old code lying around for deleting an entire event and all of its attachments and files. Still need to finish it, however.
That's where I'm at at the moment as well.
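For anyone picking this up, here is a rough sketch of what a "delete all event data for an event id" helper could look like, using only the models and query patterns seen in this thread. Session is shown as the only child collection; other child collections and any stored files would need the same treatment, so this is illustrative rather than the finished utility:

```python
from cdp_backend.database import models as db_models


def delete_event_and_children(event_id: str) -> None:
    # Fetch the event document by its collection path
    event = db_models.Event.collection.get(
        f"{db_models.Event.collection_name}/{event_id}"
    )

    # Delete all sessions that reference this event
    sessions = db_models.Session.collection.filter(
        "event_ref", "==", event.key
    ).fetch()
    for session in sessions:
        db_models.Session.collection.delete(session.key)

    # Finally delete the event itself
    # (other child collections and stored files would be handled here too)
    db_models.Event.collection.delete(event.key)
```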
I have a change that gets all the regenerated hashes to match regardless of whether they start out as a reference model or an in-memory model:

In [1]: from cdp_backend.database import models as db_models
...: import fireo
...: from google.auth.credentials import AnonymousCredentials
...: from google.cloud.firestore import Client
...:
...: from cdp_backend.database.functions import generate_and_attach_doc_hash_as_id
...:
...: # Connect to the database
...: fireo.connection(client=Client(
...: project="cdp-seattle-staging-dbengvtn",
...: credentials=AnonymousCredentials()
...: ))
...:
...: def print_session_hashes(event_id):
...: event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
...: sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
...: for session in sessions:
...: print("original", session.id)
...: updated = generate_and_attach_doc_hash_as_id(session)
...: print("regenerated", updated.id)
...:
...: print_session_hashes("c847010ce7a4")
original 71d1777f22c7
regenerated e8b6fd84193f
original e8b6fd84193f
regenerated e8b6fd84193f
In [2]: def print_session_hashes(event_id):
...: event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
...: sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
...: for session in sessions:
...: session.event_ref = session.event_ref.get()
...: print("original", session.id)
...: updated = generate_and_attach_doc_hash_as_id(session)
...: print("regenerated", updated.id)
...:
...: print_session_hashes("c847010ce7a4")
original 71d1777f22c7
regenerated e8b6fd84193f
original e8b6fd84193f
regenerated e8b6fd84193f
In other words, "did the web page decide this event had 2 sessions because it got 2 different ids for the same session?"
Could it be related to this part of the hasher function, where the field is handled differently if it's a Model? But I have no idea why it'd only be affecting the Seattle staging deployment and not the other ones. In Firestore, is there anything different data-wise between the duplicate sessions besides their id?
An easy but not optimal solution is to turn autoload back on to avoid these hashing problems, although that'd come at the cost of a lot of latency and extra GCP calls. But I don't think that's the route we wanna go, given the original reasons for turning autoload off.
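For context, the autoload being discussed is fireo's ReferenceField option. A rough sketch of what turning it back on would look like on a model definition (the field names here are illustrative, not the actual cdp_backend schema):

```python
from fireo import fields
from fireo.models import Model


class Event(Model):
    # Illustrative field only
    body_name = fields.TextField()


class Session(Model):
    # auto_load=True: reading session.event_ref performs the Firestore fetch and
    # hands back the Event model; auto_load=False hands back a ReferenceDocLoader
    # that must be .get()-ed explicitly (the current, lower-latency behavior)
    event_ref = fields.ReferenceField(Event, auto_load=True)
    video_uri = fields.TextField()
```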
@JacksonMaxfield I copied your script to test it out for myself and made the following changes to the hash function:

def generate_and_attach_doc_hash_as_id(db_model: Model) -> Model:
    # Create hasher and hash primary values
    hasher = sha256()
    for pk in db_model._PRIMARY_KEYS:
        field = getattr(db_model, pk)
        # Load doc if it's a reference doc
        if isinstance(field, ReferenceDocLoader):
            field = field.get()
        # Handle reference fields by using their doc path
        if isinstance(field, Model):
            # Ensure that the underlying model has an id
            # In place update to db_model for this field
            setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))
            # Now attach the generated hash document path
            hasher.update(pickle.dumps(field.id, protocol=4))
        # Otherwise just simply add the primary key value
        else:
            # Hash by field value
            hasher.update(pickle.dumps(field, protocol=4))
    # Set the id to the first twelve characters of hexdigest
    db_model.id = hasher.hexdigest()[:12]
    return db_model

This generates the following:

print_session_hashes("c847010ce7a4")
71d1777f22c7
71d1777f22c7
e8b6fd84193f
71d1777f22c7
print_session_hashes("9409f4d1ea09")
25d3a51447d0
25d3a51447d0
30701b4a0fbf
25d3a51447d0
Does this fix the issue? I was a little confused by the ordering of the print statements described above. The main difference here is that previously the following line wasn't being executed when the field was a ReferenceDocLoader:

setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))

I just force loading the reference doc when we need it, and then we either hash by the referenced document's id or by the raw field value.
I think my comment above might make the hashing function consistent (for regeneration at least), but it doesn't identify the root cause... I think a potential problem might be here:

if isinstance(field, Model):
    # Ensure that the underlying model has an id
    # In place update to db_model for this field
    setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))
    # Now attach the generated hash document path
    hasher.update(pickle.dumps(field.id, protocol=4))

On line 50 we set the underlying reference model's id. But on line 53 when we update the hasher, we use field.id rather than the id on the model we just attached. Maybe this is why the original doesn't match the regenerated ids?
@isaacna I have similar changes here: main...bugfix/model-id-hasher. We arrive at the same conclusion, just via different routes haha. However, this still confuses me a bit because while these changes solve the hashing problem, they don't explain why only some of our events are hit by this problem. Like, the fundamental "why did this happen in the first place" is still a thing.
This I'm not sure about. Maybe the events that weren't affected just weren't reprocessed because they weren't picked up in the event gather pipeline. For the changes, what do you think about something like this:

for pk in db_model._PRIMARY_KEYS:
    field = getattr(db_model, pk)
    # Reprocessing use case
    if isinstance(field, ReferenceDocLoader):
        field = field.get()
        hasher.update(pickle.dumps(field.id, protocol=4))
    # First time generation use case
    elif isinstance(field, Model):
        field = generate_and_attach_doc_hash_as_id(field)
        setattr(db_model, pk, field)
        hasher.update(pickle.dumps(field.id, protocol=4))
    else:
        hasher.update(pickle.dumps(field, protocol=4))
# Set the id to the first twelve characters of hexdigest
db_model.id = hasher.hexdigest()[:12]
return db_model

In the above we use the ReferenceDocLoader branch for the reprocessing case and the Model branch for first-time generation, hashing by the referenced document's id in both. One thing I thought was weird is when I run the script with these changes, I get:
Whereas with your changes in the bugfix branch I get:
So for this example with my changes, the regenerated session ids come out differently than with yours. But when we load the reference and call the generate hash function recursively, the generated event id is different from the event's original id. Does this make sense or am I missing/overthinking something?
Yea... but that just doesn't make sense to me. Basically every event in the Seattle pipeline gets reprocessed I think? I would need to dig into the GitHub Action logs.
Yea I saw that behavior too, which is why I went my route of forcing the recursive model loading. It results in consistent hashes, which is always the end goal. Especially because, as I said, the order of prints is: most recent pipeline run's hash, rehash after explicit model pull, original pipeline run's hash, rehash after explicit model pull.
That is hard to follow, but basically, whatever our change is, it should try to match the third hash, as that is the hash produced when the data had never been seen before.
The part that still confuses me is that when we force the model to rehash recursively for a session, the event id it comes up with is '8a84dc698063', which is different from the hash it generated for the event the first time.
Agree. It's concerning as well. I am wondering if we are doing something wrong with our hashing in general.
I have spent an hour writing down various notes on how this processing happened and I truly cannot figure out how it resulted in a "bad" hash. I guess I will try to test this in a dev infrastructure tomorrow and add a bunch of print / log statements. This is wild. Whatever I find, I will likely make a recording of it, because the explanation will likely be some level of conspiracy theory.
Yeah I wasn't able to reproduce the duplicate session issue locally with a minimal event either. The more I look into this the more strange the issue gets. But I think the core of the confusion about why the session hashes aren't matching in the script we've been running is that the recursively generated hash is not the id that actually ends up attached to the stored document. I added log statements like so and ran your script:

# Handle reference fields by using their doc path
if isinstance(field, Model):
    # Ensure that the underlying model has an id
    # In place update to db_model for this field
    print(type(field))
    print("PRE GENERATE FIELD ID " + str(field.id))
    new_field = generate_and_attach_doc_hash_as_id(field)
    setattr(db_model, pk, new_field)
    print(type(field))
    print("GENERATED FIELD ID " + str(new_field.id))
Your print session hash script (2 valid, separate sessions exist for this one event): print_session_hashes("2d79b33f34fc")
# cb32d7e83781
# 0b766156eabf
# f1356f6d3179
# 541ac1cd31e3

Looking in my database, however, the session ids did not match the regenerated hashes.
I quoted the above, but basically everything you said makes sense to me. The thing that confuses me is how we could possibly be hashing incorrectly given Session's PKs. I have this weird feeling that maybe we were hitting some other type besides either a Model or a ReferenceDocLoader and it was hitting the final else? But your print statements don't match up with that.
Yeah I didn't include all the print statements, just the part that shows the hash not matching. I'll need to dig deeper into the hash generation to see what's actually causing the wrong hash to be produced.
Ok so I think the root cause is inconsistent datetime values being fed into the hasher.
Logs during event generation: https://pastebin.com/tdKQN3UH
Branch:
On lines 45-46 of the logs, we can see hash generation for the same session picking up two different datetime representations. The two different values pickle to different bytes, which is why the same session ends up with two different ids. As to why the datetime comes through differently between runs, I'm not sure yet.
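To make the "inconsistent datetime" point concrete, here is a minimal standalone sketch (the specific values are made up, not taken from the pastebin logs) showing that a naive and a timezone-aware datetime for the same wall-clock moment pickle to different bytes and therefore hash to different ids:

```python
import pickle
from datetime import datetime, timezone
from hashlib import sha256

# Hypothetical example values -- not from the actual logs
naive = datetime(2021, 10, 4, 9, 30, 0)
aware = datetime(2021, 10, 4, 9, 30, 0, tzinfo=timezone.utc)

# The pickled bytes differ because tzinfo is part of the serialized state
print(pickle.dumps(naive, protocol=4) == pickle.dumps(aware, protocol=4))  # False

# So the derived 12-character ids differ as well
print(sha256(pickle.dumps(naive, protocol=4)).hexdigest()[:12])
print(sha256(pickle.dumps(aware, protocol=4)).hexdigest()[:12])
```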
This is a great find. Well done. This is somewhat expected: the difference is the time zone handling, and it's a known issue. Timezone handling will always be the bane of programming. I would say we simply add another if statement to our hashing function that checks whether the value is an instance of datetime and converts it to an ISO string without the milliseconds and the timezone. Thoughts?
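A minimal sketch of what that extra branch might look like (the helper name and exact normalization choices here are assumptions, not the final fix):

```python
import pickle
from datetime import datetime, timezone


def hash_update_for_datetime(hasher, value: datetime) -> None:
    """Feed a timezone- and precision-normalized datetime into the hasher."""
    if value.tzinfo is not None:
        # Convert aware values to UTC, then drop tzinfo
        value = value.astimezone(timezone.utc).replace(tzinfo=None)
    # Drop sub-second precision and hash the ISO string
    hasher.update(pickle.dumps(value.replace(microsecond=0).isoformat(), protocol=4))


# The hasher would then gain, roughly:
#     elif isinstance(field, datetime):
#         hash_update_for_datetime(hasher, field)
```

Since Firestore's DatetimeWithNanoseconds is a datetime subclass, the isinstance check would catch both representations.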
Yep I agree with this. We could also convert it to epoch millis so we can keep millisecond granularity without having to consider timezone.
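The epoch-millis variant could be roughly the following (again just a sketch with a hypothetical helper name; note that .timestamp() on a naive datetime assumes local time, so naive values still need to be treated as UTC first):

```python
from datetime import datetime, timezone


def datetime_to_epoch_millis(value: datetime) -> int:
    # Treat naive values as UTC so the epoch value is unambiguous
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return int(value.timestamp() * 1000)
```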
I'm working on the fix described above, will put out a PR soon |
See CouncilDataProject/cdp-frontend#143 and CouncilDataProject/cdp-scrapers#50
The hasher is sometimes producing multiple sessions for a single event. I believe this happens when the pipeline is running against the same event that has previously been processed, i.e. when the Event model is a reference (ReferenceDocLoader) and not an in-memory model. I can show this behavior by running the following:

Notice that in the above case, the second and fourth hashes match for each event. This is because I am forcing it to rehash itself -- but those second and fourth hashes match neither of the original two hashes, which is bad. But since I am using the Python API to pull this data, it would seem to me that this may be happening because they are ReferenceDocLoaders... so something to explore next.

If I change this function to explicitly fetch the event object as an entire model (db_models.Event) in memory rather than a ReferenceDocLoader object, it produces:

Notice under this case that everything matches after the first hash. To my understanding this is because of how the pipeline ran (simply due to insert order, the first time the pipeline ran is the third hash in the list, the second time the pipeline ran is the first hash in the list -- i.e. the order of the printed hashes is: most recent pipeline run's hash, rehash after explicit model pull, original pipeline run's hash, rehash after explicit model pull). The first time this event was processed, it had the model in memory; the second time this event was processed (6 or so hours later), it had the Event as a reference and not in memory. If we force the model into memory, then it fixes itself.

I have looked at a ton of events and to be honest I am not sure how this doesn't affect every single event of ours. It seems to have only occurred for a couple of events in the Seattle staging instance.

I am still not sure if this is what caused the two variants of the issue as well. The scraper issue I originally labeled as "two of the same session for a single event". For the frontend issue, @tohuynh found out that they were the same sessions but one was missing a transcript.

I truly do not know how deep this issue goes. I haven't found any King County event that fails.

I have found the following seattle staging events that fail:

cc @dphoria @tohuynh @isaacna