Bug in DB model id generator / model hasher #142

Closed
evamaxfield opened this issue Dec 15, 2021 · 22 comments · Fixed by #144
Labels
bug Something isn't working

Comments

@evamaxfield
Member

evamaxfield commented Dec 15, 2021

See CouncilDataProject/cdp-frontend#143
and CouncilDataProject/cdp-scrapers#50

The hasher is sometimes producing multiple session documents for a single event. I believe this happens when the pipeline runs against an event that has previously been processed.

I.e.

  1. scraper runs once (and completes processing like normal)
  2. scraper runs again and produces a different hash from the first run because the Event model is a reference (ReferenceDocLoader) and not an in-memory model

I can show this behavior by running the following:

from cdp_backend.database import models as db_models
import fireo
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

from cdp_backend.database.functions import generate_and_attach_doc_hash_as_id

# Connect to the database
fireo.connection(client=Client(
    project="cdp-seattle-staging-dbengvtn",
    credentials=AnonymousCredentials()
))

def print_session_hashes(event_id):
    event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
    sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
    for session in sessions:
        print(session.id)
        updated = generate_and_attach_doc_hash_as_id(session)
        print(updated.id)

print_session_hashes("c847010ce7a4")
# prints
# 71d1777f22c7
# 5626e734b33d
# e8b6fd84193f
# 5626e734b33d

print_session_hashes("9409f4d1ea09")
# prints
# 25d3a51447d0
# 9679fad34929
# 30701b4a0fbf
# 9679fad34929

Notice that in the above case, the second and fourth hashes match for each event.
That is because I am forcing the sessions to rehash themselves -- but those second and fourth hashes match neither of the two original hashes, which is bad. And since I am pulling this data through the Python API, it would seem that this may be happening because the event references are ReferenceDocLoaders... so something to explore next.

If I change this function to explicitly fetch the event object as a full in-memory model (db_models.Event) rather than a ReferenceDocLoader object, it produces:

def print_session_hashes(event_id):
    event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
    sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
    for session in sessions:
        session.event_ref = session.event_ref.get()
        print(session.id)
        updated = generate_and_attach_doc_hash_as_id(session)
        print(updated.id)

print_session_hashes("c847010ce7a4")
# prints
# 71d1777f22c7
# e8b6fd84193f
# e8b6fd84193f
# e8b6fd84193f

print_session_hashes("9409f4d1ea09")
# prints
# 25d3a51447d0
# 30701b4a0fbf
# 30701b4a0fbf
# 30701b4a0fbf

Notice that in this case everything matches after the first hash. To my understanding this is because of how the pipeline ran: simply due to insert order, the third hash in the list comes from the first pipeline run and the first hash comes from the second run -- i.e. the printed order is: hash produced by the most recent pipeline run, rehash after the explicit model pull, hash produced by the original pipeline run, rehash after the explicit model pull. The first time this event was processed, the pipeline had the Event model in memory; the second time it was processed (six or so hours later), it had the Event as a reference and not in memory. If we force the model into memory, then it fixes itself.

I have looked at a ton of events and to be honest I am not sure how this doesn't affect every single one of our events. It seems to have occurred only for a couple of events in the Seattle staging instance.

I am still not sure whether this is what caused the two variants of the issue as well. The scraper issue I originally labeled as "two of the same session for a single event". For the frontend issue, @tohuynh found out that they were the same sessions but one was missing a transcript.

I truly do not know how deep this issue goes. I haven't found any King County events that fail.
I have found the following Seattle staging events that fail:

cc @dphoria @tohuynh @isaacna

evamaxfield added the bug label Dec 15, 2021
@evamaxfield
Member Author

Whatever the fix is, we will need to patch it in and then reprocess all of the above meetings, which pushes the priority of a "delete all event data for event id x" function up to high.

Basically, this single issue becomes two:

  1. fix the hashing function
  2. add a "delete_event" function
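
For (2), a rough sketch of what such a helper could look like, covering only the Session documents discussed in this thread (a real version would have to walk every collection and file that references the event):

from cdp_backend.database import models as db_models

def delete_event(event_id: str) -> None:
    # Hypothetical sketch only: remove the sessions attached to an event, then the event itself.
    # A complete version must also clean up transcripts, files, event minutes items, etc.
    event = db_models.Event.collection.get(
        f"{db_models.Event.collection_name}/{event_id}"
    )
    for session in db_models.Session.collection.filter(
        "event_ref", "==", event.key
    ).fetch():
        db_models.Session.collection.delete(session.key)
    db_models.Event.collection.delete(event.key)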

@evamaxfield
Member Author

I already have some old code lying around for deleting an entire event and all of its attachments and files. I still need to finish it, however.

@dphoria
Contributor

dphoria commented Dec 15, 2021

I have looked at a ton of events and to be honest I am not sure how this doesn't affect every single one of our events.

That's where I am at the moment as well.

@evamaxfield
Member Author

I have a change that gets all the regenerated hashes to match regardless of whether they start out as a reference model or an in-memory model:

In [1]: from cdp_backend.database import models as db_models
   ...: import fireo
   ...: from google.auth.credentials import AnonymousCredentials
   ...: from google.cloud.firestore import Client
   ...: 
   ...: from cdp_backend.database.functions import generate_and_attach_doc_hash_as_id
   ...: 
   ...: # Connect to the database
   ...: fireo.connection(client=Client(
   ...:     project="cdp-seattle-staging-dbengvtn",
   ...:     credentials=AnonymousCredentials()
   ...: ))
   ...: 
   ...: def print_session_hashes(event_id):
   ...:     event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
   ...:     sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
   ...:     for session in sessions:
   ...:         print("original", session.id)
   ...:         updated = generate_and_attach_doc_hash_as_id(session)
   ...:         print("regenerated", updated.id)
   ...: 
   ...: print_session_hashes("c847010ce7a4")
original 71d1777f22c7
regenerated e8b6fd84193f
original e8b6fd84193f
regenerated e8b6fd84193f

In [2]: def print_session_hashes(event_id):
   ...:     event = db_models.Event.collection.get(f"{db_models.Event.collection_name}/{event_id}")
   ...:     sessions = list(db_models.Session.collection.filter("event_ref", "==", event.key).fetch())
   ...:     for session in sessions:
   ...:         session.event_ref = session.event_ref.get()
   ...:         print("original", session.id)
   ...:         updated = generate_and_attach_doc_hash_as_id(session)
   ...:         print("regenerated", updated.id)
   ...: 
   ...: print_session_hashes("c847010ce7a4")
original 71d1777f22c7
regenerated e8b6fd84193f
original e8b6fd84193f
regenerated e8b6fd84193f

@dphoria
Contributor

dphoria commented Dec 16, 2021

I am still not sure whether this is what caused the two variants of the issue as well. The scraper issue I originally labeled as "two of the same session for a single event". For the frontend issue, @tohuynh found out that they were the same sessions but one was missing a transcript.

In other words, "did the web page decide this event had 2 sessions because it got 2 different session.id values for that event"?

@isaacna
Collaborator

isaacna commented Dec 16, 2021

Could it be related to this part of the hasher function, where if the field is a ReferenceDocLoader (not loaded in memory) we hash by id, but if it's already loaded we hash by the field value (instead of the id)?

But I have no idea why it'd only be affecting the Seattle staging deployment and not the other ones. In Firestore, is there anything different data-wise between the duplicate sessions besides their id?

@isaacna
Collaborator

isaacna commented Dec 16, 2021

An easy but not optimal solution is to turn autoload back on to avoid these hashing problems, although that'd come at the cost of a lot of latency and extra GCP calls. But that's probably not the route we want to go, given the original reasons for turning autoload off.
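
For context, "autoload" here refers to fireo's ReferenceField auto_load behavior. A rough sketch of the trade-off, with simplified models (not the real cdp_backend definitions):

from fireo.models import Model
from fireo.fields import ReferenceField, TextField

class Body(Model):
    name = TextField()

class Event(Model):
    # With auto_load=True, fetching an Event also fetches the referenced Body
    # (extra Firestore reads); with auto_load=False the attribute comes back as
    # a ReferenceDocLoader that has to be loaded explicitly via .get().
    body_ref = ReferenceField(Body, auto_load=False)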

@isaacna
Collaborator

isaacna commented Dec 16, 2021

@JacksonMaxfield I copied your script to test it out for myself and made the following changes to the hash function:

from hashlib import sha256
import pickle

from fireo.models import Model
# Import path per fireo internals; may differ by version
from fireo.queries.query_wrapper import ReferenceDocLoader


def generate_and_attach_doc_hash_as_id(db_model: Model) -> Model:
    # Create hasher and hash primary values
    hasher = sha256()
    for pk in db_model._PRIMARY_KEYS:
        field = getattr(db_model, pk)

        # Load doc if it's a reference doc
        if isinstance(field, ReferenceDocLoader):
            field = field.get()

        # Handle reference fields by using their doc path
        if isinstance(field, Model):
            # Ensure that the underlying model has an id
            # In place update to db_model for this field
            setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))

            # Now attach the generated hash document path
            hasher.update(pickle.dumps(field.id, protocol=4))

        # Otherwise just simply add the primary key value
        else:
            # Hash by field value
            hasher.update(pickle.dumps(field, protocol=4))

    # Set the id to the first twelve characters of hexdigest
    db_model.id = hasher.hexdigest()[:12]

    return db_model

This generates the following:

print_session_hashes("c847010ce7a4")
71d1777f22c7
71d1777f22c7
e8b6fd84193f
71d1777f22c7

print_session_hashes("9409f4d1ea09")
25d3a51447d0
25d3a51447d0
30701b4a0fbf
25d3a51447d0

the printed order is: hash produced by the most recent pipeline run, rehash after the explicit model pull, hash produced by the original pipeline run, rehash after the explicit model pull

Does this fix the issue? I was a little confused by the ordering of the print statements described above.

The main difference here is that previously the following line wasn't executed when the field was a ReferenceDocLoader:

setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))

I just force-load the reference doc when we need it, and then we hash either by the id if the field is a Model (while recursively calling the function), or by the field value if it's not a Model. Even though we're loading the model, I still think this works around the original autoload issue because we're only loading for primary keys rather than for the whole database object.

@isaacna
Collaborator

isaacna commented Dec 16, 2021

I think my comment above might make the hashing function consistent (for regeneration at least), but doesn't identify the root cause...

I think a potential problem might be here:
https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/database/functions.py#L50-L53

        if isinstance(field, Model):
            # Ensure that the underlying model has an id
            # In place update to db_model for this field
            setattr(db_model, pk, generate_and_attach_doc_hash_as_id(field))

            # Now attach the generated hash document path
            hasher.update(pickle.dumps(field.id, protocol=4))

On line 50 we set the underlying reference model's id. But on line 53, when we update the hasher, we use field.id, and field was bound before we recursively generated the id on line 50. I think this would cause a difference between generating the hash for the first time and regenerating it when the underlying models already exist.

Maybe this is why the original doesn't match the regenerated ids?

@evamaxfield
Member Author

@isaacna I have similar changes here: main...bugfix/model-id-hasher

We arrived at the same conclusion, just via different routes haha.

However, this still confuses me a bit because while these changes solve the hashing problem, they don't explain why only some of our events are hit by this problem.

Like the fundamental "why did this happen in the first place" is still a thing.

@isaacna
Collaborator

isaacna commented Dec 16, 2021

However, this still confuses me a bit because while these changes solve the hashing problem, they don't explain why only some of our events are hit by this problem.

Like the fundamental "why did this happen in the first place" is still a thing.

This I'm not sure about. Maybe the events that weren't affected simply weren't reprocessed because they weren't picked up again in the event gather pipeline.

For the changes, what do you think about something like this:

def generate_and_attach_doc_hash_as_id(db_model: Model) -> Model:
    # Create hasher and hash primary values
    hasher = sha256()
    for pk in db_model._PRIMARY_KEYS:
        field = getattr(db_model, pk)

        # Reprocessing use case: load the reference and hash by its stored id
        if isinstance(field, ReferenceDocLoader):
            field = field.get()
            hasher.update(pickle.dumps(field.id, protocol=4))
        # First-time generation use case: recursively generate the id, then hash it
        elif isinstance(field, Model):
            field = generate_and_attach_doc_hash_as_id(field)
            setattr(db_model, pk, field)
            hasher.update(pickle.dumps(field.id, protocol=4))
        # Otherwise hash the raw field value
        else:
            hasher.update(pickle.dumps(field, protocol=4))

    # Set the id to the first twelve characters of hexdigest
    db_model.id = hasher.hexdigest()[:12]

    return db_model

In the above we use elif to avoid unnecessarily doing the recursive work for the reprocessing use case. This does assume, however, that the existing data is all correct, since we're not repeating the hash generation. So if we want to favor consistency over speed/cost, I think regenerating the hash each time works, but once the database is clean this should be more efficient.

One thing I thought was weird is when I run the script with these changes, I get:

print_session_hashes("c847010ce7a4")
71d1777f22c7
5626e734b33d
e8b6fd84193f
5626e734b33d

Whereas with your changes in the bugfix branch I get:

print_session_hashes("c847010ce7a4")
71d1777f22c7
71d1777f22c7
e8b6fd84193f
71d1777f22c7

So for this example with my changes, 5626e734b33d is generated when hashing with the fetched event id from ReferenceDocLoader. I double checked that the event id fetched in the ReferenceDocLoader case was c847010ce7a4.

But when we load the reference and call the generate hash function recursively, the generated event id is 8a84dc698063. I'm not sure why these results are inconsistent unless there was something that changed in the downstream linked Event -> Body that would change the hash from the original generated value.

Does this make sense or am I missing/overthinking something?

@evamaxfield
Member Author

This I'm not sure about. Maybe the events that weren't affected simply weren't reprocessed because they weren't picked up again in the event gather pipeline.

Yea... but that just doesn't make sense to me. Basically every event in the Seattle pipeline gets reprocessed I think? I would need to dig into the GitHub Action logs.


So for this example with my changes, 5626e734b33d is generated when hashing with the fetched event id from ReferenceDocLoader. I double checked that the event id fetched in the ReferenceDocLoader case was c847010ce7a4.

But when we load the reference and call the generate hash function recursively, the generated event id is 8a84dc698063. I'm not sure why these results are inconsistent unless there was something that changed in the downstream linked Event -> Body that would change the hash from the original generated value.

Does this make sense or am I missing/overthinking something?

Yea, I saw that behavior too, which is why I went the route of forcing the recursive model loading. It results in consistent hashes, which is always the end goal. Especially because, as I said, the order of prints is:

  1. the reprocessed session hash from the pipeline
  2. the rehash of the reprocessed session
  3. the original session hash from the first time the session went through the pipeline
  4. the rehash of the original session

That is hard to follow, but basically whatever our change is should make the hash match the third one, since that is the hash produced when the data has never been seen before.

@isaacna
Collaborator

isaacna commented Dec 16, 2021

The part that still confuses me is that when we force the model to rehash recursively for a session, the event id it comes up with is '8a84dc698063', which is different from the hash it generated for the event the first time.

@evamaxfield
Member Author

The part that still confuses me is that when we force the model to rehash recursively for a session, the event id it comes up with is '8a84dc698063', which is different from the hash it generated for the event the first time.

Agree. It's concerning as well. I am wondering if we are doing something wrong with our hashing in general.

@evamaxfield
Member Author

I have spent an hour writing down notes on how this processing happened and I truly cannot figure out how it resulted in a "bad" hash.

I guess I will try to test this in a dev infrastructure tomorrow and add a bunch of print / log statements. This is wild. Whatever I find I will likely make a recording of because the explanation will likely be some level of conspiracy theory.

@isaacna
Collaborator

isaacna commented Dec 17, 2021

Yeah I wasn't able to reproduce the duplicate session issue locally with a minimal event either. The more I look into this the more strange the issue gets.

But I think the core of the confusion about why the session hashes aren't matching in the script we've been running is that the recursively generated hash is not the id that actually gets attached to the session. This seems to happen only for the Session db model.

I added log statements like so and ran your script:

        # Handle reference fields by using their doc path
        if isinstance(field, Model):
            # Ensure that the underlying model has an id
            # In place update to db_model for this field

            print(type(field))
            print("PRE GENERATE FIELD ID " + str(field.id))
            new_field = generate_and_attach_doc_hash_as_id(field)
            setattr(db_model, pk, new_field)
            print(type(field))
            print("GENERATED FIELD ID " + str(new_field.id))

Running the event gather pipeline (only relevant logs attached):

# Body hash matches the existing model's id 
<class 'cdp_backend.database.models.Body'>
PRE GENERATE FIELD ID 4346a8351006
<class 'cdp_backend.database.models.Body'>
<class 'cdp_backend.database.models.Body'>
GENERATED FIELD ID 4346a8351006

# session hash doesn't
<class 'cdp_backend.database.models.Session'>
PRE GENERATE FIELD ID f1356f6d3179
<class 'cdp_backend.database.models.Session'>
<class 'cdp_backend.database.models.Session'>
GENERATED FIELD ID 541ac1cd31e3

# Session hash doesn't
<class 'cdp_backend.database.models.Session'>
PRE GENERATE FIELD ID cb32d7e83781
<class 'cdp_backend.database.models.Session'>
<class 'cdp_backend.database.models.Session'>
GENERATED FIELD ID 0b766156eabf

Your print session hash script (2 valid, separate sessions exist for this one event):

print_session_hashes("2d79b33f34fc")
# cb32d7e83781
# 0b766156eabf

# f1356f6d3179
# 541ac1cd31e3

Looking in my database, however, the session ids were cb32d7e83781 and f1356f6d3179. So I think there's something generally wrong with how we generate hashes, at least when it comes to Session. Other db models' "pre generate field id" and "generated field id" matched. I wonder if the order in which we upload in store_event_processing_results could have some odd effects?

@evamaxfield
Copy link
Member Author

But I think the core of the confusion about why the session hashes aren't matching in the script we've been running is that the recursively generated hash is not the id that actually gets attached to the session. This seems to happen only for the Session db model.

I quoted the above because basically everything you said makes sense to me, but the thing that confuses me is how we could possibly be hashing incorrectly. Session's PKs are event_ref and video_uri; video_uri is easy, and event_ref hashes itself and uses the generated id as the hash value. To generate an id / hash for event_ref we look at its PKs, which are body_ref and event_datetime; event_datetime is easy, body_ref does the same thing, and recursively it goes down to the name PK on the body. We know that the event hash and the body hash worked perfectly fine, so how did we get a different hash for the session....
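
To make that chain concrete, a tiny illustration of the cascade with made-up values (the real function pickles each primary-key field, as shown earlier in the thread):

from hashlib import sha256
import pickle
from datetime import datetime, timezone

def _hash12(*values) -> str:
    # First twelve hex characters of a sha256 over the pickled values,
    # mirroring the [:12] truncation in generate_and_attach_doc_hash_as_id.
    h = sha256()
    for v in values:
        h.update(pickle.dumps(v, protocol=4))
    return h.hexdigest()[:12]

body_id = _hash12("Full Council")                       # Body PK: name
event_dt = datetime(2021, 8, 16, 7, 42, 10, tzinfo=timezone.utc)
event_id = _hash12(body_id, event_dt)                   # Event PKs: body_ref (by id), event_datetime
session_id = _hash12(event_id, "assets/sample-5s.mp4")  # Session PKs: event_ref (by id), video_uri
print(body_id, event_id, session_id)

Any change in how event_dt is represented changes event_id, and therefore session_id, even if body_id stays the same.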

I have this weird feeling that maybe we were hitting some other type besides either a Model or a ReferenceDocLoader and it was hitting the final else? But your print statements don't match up with that.

@isaacna
Collaborator

isaacna commented Dec 18, 2021

I have this weird feeling that maybe we were hitting some other type besides either a Model or a ReferenceDocLoader and it was hitting the final else? But your print statements don't match up with that.

Yeah I didn't include all the print statements, just the part that shows the hash not matching.
Full logs can be found here: https://pastebin.com/w9bET2YV

I'll need to dig deeper on the hash function generation to see what's actually causing the wrong hash to be produced

@isaacna
Collaborator

isaacna commented Dec 18, 2021

Ok, so I think the root cause is an inconsistent event_datetime being used deep in the recursion. The following logs show it:

Logs during event generation: https://pastebin.com/tdKQN3UH
Logs during reprocessing: https://pastebin.com/etp8Y1PU
Also for reference my GCP proj is stg-cdp-seattle-isaac if you want to reference my data.

Branch:

On lines 45-46, we see the Event getting hashed with event_datetime 2021-08-16 02:42:10.318957-05:00. This results in the hash that's stored as the Event.id in Firestore, 2d79b33f34fc.

In the first paste, hash generation for Session starts at line 59.

  1. It recursively calls the generate function for Session -> Event -> Body.
  2. The Body is loaded via reference doc and is correct and consistent (4346a8351006) in L67-68.
  3. The Event is hashed using event_datetime 2021-08-16 07:42:10.318957+00:00! This is different from the datetime used on line 45.
  4. On line 71, the output of this is 29f33b8722df, while the hash used in GCP is 2d79b33f34fc.
  5. This leads to Session being hashed with a different Event.id (29f33b8722df) during creation. During regeneration, Session gets hashed with 2d79b33f34fc instead.

The two different event_datetime values are the datetimes I have for two different sessions in my test data:

{
   "body":{
      "name":"Full Council",
      "is_active":true,
      "start_datetime":null,
      "description":null,
      "end_datetime":null,
      "external_source_id":null
   },
   "sessions":[
      {
         "session_datetime":1629099730.318957,
         "video_uri":"assets/sample-5s.mp4",
         "session_index":0,
         "caption_uri":null,
         "external_source_id":null
      },
      {
         "session_datetime":1629099930.318957,
         "video_uri":"assets/wombo_combo.mp4",
         "session_index":1,
         "caption_uri":null,
         "external_source_id":null
      }
   ],
   "event_minutes_items":null,
   "agenda_uri":null,
   "minutes_uri":null,
   "static_thumbnail_uri":null,
   "hover_thumbnail_uri":null,
   "external_source_id":null
}

As to why the event_datetime used on the Event is different, I'm not sure, because it should always default to the datetime from the first session. But I feel like the trickiest part of investigating the hash generation should be answered now?
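
This also explains why the hasher cares: it pickles each primary-key value, and two tz-aware datetimes for the same instant compare equal but pickle to different bytes. A quick standalone check:

from datetime import datetime, timedelta, timezone
from hashlib import sha256
import pickle

# Same instant, two representations, matching the logs above.
utc = datetime(2021, 8, 16, 7, 42, 10, 318957, tzinfo=timezone.utc)
cdt = utc.astimezone(timezone(timedelta(hours=-5)))

assert utc == cdt  # equal as points in time
print(sha256(pickle.dumps(utc, protocol=4)).hexdigest()[:12])
print(sha256(pickle.dumps(cdt, protocol=4)).hexdigest()[:12])
# The two digests differ, so any id built from the pickled datetime differs too.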

@evamaxfield
Member Author

This is a great find. Well done.

This is somewhat expected. The difference is the timezone handling, and it's a known issue. Timezone handling will always be the bane of programming.

I would say we simply add another if statement to our hashing function that checks whether the value is an instance of datetime (or similar) and converts it to an ISO string without the milliseconds and the timezone. Thoughts?
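
A minimal sketch of one way to read that, assuming we also normalize to UTC first so that equal instants produce equal strings (helper name and placement are illustrative, not the real implementation):

from datetime import datetime, timezone

def _normalize_datetime(value: datetime) -> str:
    # Hypothetical helper: convert to UTC, drop microseconds and tzinfo,
    # then emit a plain ISO string to feed into the hasher.
    return value.astimezone(timezone.utc).replace(microsecond=0, tzinfo=None).isoformat()

# The two representations from the logs above produce the same string:
utc = datetime.fromisoformat("2021-08-16 07:42:10.318957+00:00")
cdt = datetime.fromisoformat("2021-08-16 02:42:10.318957-05:00")
assert _normalize_datetime(utc) == _normalize_datetime(cdt) == "2021-08-16T07:42:10"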

@isaacna
Collaborator

isaacna commented Dec 19, 2021

I would say we simply add another if statement to our hashing function that checks whether the value is an instance of datetime (or similar) and converts it to an ISO string without the milliseconds and the timezone. Thoughts?

Yep, I agree with this. We could also convert it to epoch millis so we can be as granular as milliseconds but not have to consider the timezone.
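
A quick sketch of that epoch-millis option (again illustrative only; the real change would live in the hash function's primary-key loop):

from datetime import datetime, timedelta, timezone

def _datetime_to_epoch_millis(value: datetime) -> int:
    # Hypothetical helper: epoch milliseconds are timezone-independent,
    # so equal instants always hash to the same value.
    return int(value.timestamp() * 1000)

utc = datetime(2021, 8, 16, 7, 42, 10, 318957, tzinfo=timezone.utc)
cdt = utc.astimezone(timezone(timedelta(hours=-5)))
assert _datetime_to_epoch_millis(utc) == _datetime_to_epoch_millis(cdt) == 1629099730318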

@isaacna
Collaborator

isaacna commented Dec 19, 2021

I'm working on the fix described above, will put out a PR soon
