-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
drop and create datasets_view #2184
Conversation
Signed-off-by: Pawel Leszczynski <[email protected]>
Codecov Report
@@ Coverage Diff @@
## main #2184 +/- ##
=========================================
Coverage 76.54% 76.54%
Complexity 1108 1108
=========================================
Files 214 214
Lines 5181 5181
Branches 408 408
=========================================
Hits 3966 3966
Misses 760 760
Partials 455 455 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
It should be a development principle that we don't make backward-incompatible changes. Dropping/renaming columns is fine in a single instance of an app where upgrades mean taking down the application and restarting it, but in a distributed environment, upgrading should not require downtime. As is, a rolling upgrade will cause all other running hosts to begin failing requests as they will be looking for a column that no longer exists in the central database. Dropping/renaming columns should always be a two-step process
These two steps should be done in different releases so that users have an opportunity to upgrade in a controller fashion without downtime. |
@collado-mike Not clear to me how the problem you described applies for this PR. The issue is related to a logical view modification and adding new column It turned out that |
Ah, sorry. I thought the previous view change was dropping a column and that this was attempting to band-aid over that by dropping the view and recreating it. Yeah, I'm not sure why postgres doesn't detect that the previous columns are still there - order should be irrelevant to the view definition. I think adding the new column to the end of the view definition would fix the issue.
I think our assumptions of what happens in distributed systems differ 🤔. Distributed systems should be resilient to failures. So, if the view is missing and the code retries the query and succeeds, that is resilient to single failures. But that's not what Marquez does right now. Currently, if the query fails, we'll fail the whole request. If it happens to fail during an OpenLineage event processing, we'll lose the whole event. The OL clients don't have any retry built-in, so the data is simply lost. Of course, these issues can be addressed with retries on the client or persistent queues on the backend, but the simpler fix is to make sure view definition changes are backward compatible whenever possible. |
Thanks for providing an explanation on how Marquez is operating right now. It's really helpful. According to the docs:
So there should be no moment of a missing view visible to other readers. I added The root cause issue was not according to column ordering but because we were trying to replace the view with a view containing different amount of columns which is not allowed in PostgreSQL. |
Signed-off-by: Pawel Leszczynski [email protected]
Problem
Change within
datasets_view
definition breaks upgrades.Closes: #2183
Solution
Drop view each time migration is run
Checklist
CHANGELOG.md
with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary).sql
database schema migration according to Flyway's naming convention (if relevant)