API Changes to FivetranSensor for supporting backfills #58

dwreeves · 2023-09-04T22:31:23Z

Resolves #49

Changes

`FivetranOperator`

High level API:

Deprecate execution_timeout kwarg. This isn't even being used. It is pretty straightforward to remove this.
(Breaking) Make default behavior when deferrable=False to wait for Fivetran sync to complete.
- Right now there is a fundamental divergence between how the deferrable and non-deferrable modes work.
- I would argue that:
  - (1) Typically in Airflow, the deferrable param is a parameter that defines the runtime of the operator (i.e. in normal worker nodes vs a triggerer instance), not the fundamental behavior of the operator. For that reason, the deferrable=? kwarg should not impact the behavior as dramatically as it currently does.
  - (2) Adding code to make it wait synchronously is not tricky.
  - (3) The existing behavior when deferrable=False is significantly worse than waiting, as it leads to the FivetranSensor being necessary to await the FivetranOperator, whereas the FivetranSensor would be more interesting if it were a choice rather than a strict necessity. I updated the README.md to clarify what the FivetranSensor can/should be used for going forward.
Add wait_for_completion kwarg. (I copied this kwarg name from EcsRunTaskOperator for whatever that is worth.) This combined with deferrable splits what deferrable does into two separate options (waiting for Fivetran vs running on Triggerer instance) rather than having them combined into one option.

`FivetranSensor`

I start by describing the changes to FivetranOperator because understanding these changes paves the way to understanding why FivetranSensor also changes.

It appears right now that FivetranSensor is designed just to await a FivetranOperator(deferrable=False), out of necessity due to lack of functionality when non-deferrable.

However, a much more relevant use case for the FivetranSensor is when Fivetran is running on its own schedule, and we want to await a sync scheduled outside of Airflow. The main motivation in designing these API changes is to support that use case. By making FivetranOperator always wait by default even when deferrable=False, then FivetranSensor's primary purpose can focus on more interesting use cases.

High level API:

Replace xcom with completed_after_time. xcom is a misleading name. The value does not need to necessarily come from xcom. This kwarg was designed originally to be used in the case of FivetranOperator >> FivetranSensor, with the intent that the FivetranSensor was always awaiting a manually scheduled job via the completed at time passed through xcom.
- I originally wanted to name this target_completed_time or something similar to emulate the DateTimeSensor API. However, this is not a very descriptive name, and it feels especially confusing when using the old xcom_pull() pattern (target_completed_time="{{ task_instance.xcom_pull('fivetran-operator', key='return_value') }}" looks strange and could be interpreted as "targetting" the last sync value. But in that case it would be unclear why this pattern even works!) I found this name to be a lot clearer.
Add always_wait_when_syncing kwarg. So the operator never passes while Fivetran is syncing when this is true.
- Setting this as a default True won't matter for most users.
- This is at its most useful for backfills. Fivetran syncs are not atomic and it's possible for table A to be updated while table B is not, or alternatively some rows in table A are written while others are still being written. So running downstream operations while Fivetran is syncing can be a little dangerous, and this operates as an imperfect* but reasonable check to see if the coast is clear.
Added propagate_failures_forward kwarg.
- This makes it so target_completed_time > failed_at > succeeded_at raises an error. By default this is False, matching the existing API.
- A justification why you may want this to be true is because a fail > success indicates an issue that may eventually impact future syncs.

(* Imperfect because for a long running DAG, or for concurrent task runs, Fivetran can just resync again.)

`FivetranHook`

High level API

Add is_synced_after_target_time() and is_synced_after_target_time_async() to replace and get_sync_status_async() (which are deprecated).
- It was too risky to replace the entire API with something slightly new, so instead I just created a new method name.
- This name is also a little more descriptive. "get sync status" implies that the value being returned is some sort of enum representing, but in the sync hook implementation it is just a boolean. In reality, it is a mix-- get_sync_status_async returns a string but get_sync_status returns boolean. Here the name is_synced_after_target_time() implies a boolean return value, which is what you get.
**Refactor connector_details parsing logic so that it is in _determine_if_synced_from_connector_details(). This means that both the sync and async operator use the same logic, and it gets to be DRY.
Check success before fail in get last sync. I explain why this is a beneficial change here: [Feature Request] Support DateTimeSensor-like behavior for FivetranSensor #49 (comment)

Misc.

I did my best to keep the API in tact, which you can see by the fact that there are very few changes in tests/.

The main test change I made was so that "succeeded_at > failed_at > completed_after_time" returns a success rather than a failure.

dwreeves · 2023-09-07T20:16:50Z

We have run these code changes in our company's Airflow deployment like this:

FivetranSensor(
    task_id=...,
    connector_id=...,
    completed_after_time="{{ data_interval_end + macros.timedelta(minutes=5) }}"
)

And we confirmed it works as intended in these two cases:

When the completed_after_time is long in the past and Fivetran has already run much more recently (so completed_at > completed_after_time), it doesn't spend any time waiting around.
When the completed_after_time is near the present and a completed_at hasn't yet occurred greater than the completed_after_time, it does end up waiting around.

…r-fivetran-async into api-changes

dwreeves · 2023-09-30T20:51:00Z

Will anyone be able to review this? @phanikumv @pankajastro @sunank200.

I know these are big changes, but I think it's worth looking at them. Overall I think these changes really elevate the package to the next level, allowing folks to really customize their experience with Fivetran connection into Airflow. I've been using Airflow for multiple years for data engineering and am an OSS contributor to Airflow. I consider these changes to make for a more idiomatic experience, and also addresses many of the sort of real-world nuances that you want to have control over when designing data pipelines that can guarantee data integrity.

Other reasons why I think this should get reviewed:

We have been using this for the past few weeks at my company, and it works great!
- I did learn the hard way that always_wait_when_syncing should be defaulted to False. 😬 It's a very useful control flow feature in many circumstnaces, but it's not always safe. There are pros and cons to turning it on or off, but I think "off" is a better default.
It is nearly fully backwards compatible (with deprecation warnings). The only exception is there is a slight change in behavior when deferrable=False for the FivetranOperator (although, I think this change in behavior is extremely sensible).
I think all the API changes are totally defensible and make for a better experience.
It's better to get these sort of changes in earlier than later, while the package has fewer dependents.

Let me know if there is anything you need from me to assist in reviewing the changes.

Thank you for all your work on this!

phanikumv · 2023-10-01T04:30:00Z

Will anyone be able to review this? @phanikumv @pankajastro @sunank200.

I know these are big changes, but I think it's worth looking at them. Overall I think these changes really elevate the package to the next level, allowing folks to really customize their experience with Fivetran connection into Airflow. I've been using Airflow for multiple years for data engineering and am an OSS contributor to Airflow. I consider these changes to make for a more idiomatic experience, and also addresses many of the sort of real-world nuances that you want to have control over when designing data pipelines that can guarantee data integrity.

Other reasons why I think this should get reviewed:

We have been using this for the past few weeks at my company, and it works great!

I did learn the hard way that always_wait_when_syncing should be defaulted to False. 😬 It's a very useful control flow feature in many circumstnaces, but it's not always safe. There are pros and cons to turning it on or off, but I think "off" is a better default.

It is nearly fully backwards compatible (with deprecation warnings). The only exception is there is a slight change in behavior when deferrable=False for the FivetranOperator (although, I think this change in behavior is extremely sensible).

I think all the API changes are totally defensible and make for a better experience.

It's better to get these sort of changes in earlier than later, while the package has fewer dependents.

Let me know if there is anything you need from me to assist in reviewing the changes.

Thank you for all your work on this!

Sure @dwreeves we will review it.Thank you so much for your contributions

fivetran_provider_async/hooks.py

Lee-W

LGTM. Left a minor nitpick

fivetran_provider_async/operators.py

pankajastro

LGTM

fivetran_provider_async/hooks.py

Co-authored-by: Wei Lee <[email protected]>

dwreeves · 2023-10-10T19:22:09Z

Thanks! I will change tonight.

dwreeves · 2023-10-11T01:42:48Z

All requested changes made.

kaxil · 2023-10-16T18:58:54Z

Thanks @dwreeves , someone will review it tomorrow

Lee-W

left a few minor suggestions but overall it looks good to me. @dwreeves thanks for your contribution!

fivetran_provider_async/hooks.py

fivetran_provider_async/operators.py

pankajastro · 2023-10-17T07:40:06Z

fivetran_provider_async/sensors.py

+ if not self.completed_after_time:
+ self._completed_after_time_rendered = self.hook.get_last_sync(self.connector_id)
+ else:
+ self._completed_after_time_rendered = timezone.parse(self.completed_after_time)


I did a quick run on the old example but it failed here
Example: https://github.com/astronomer/airflow-provider-fivetran-async/blob/main/fivetran_provider_async/example_dags/example_fivetran_async.py#L33
Log

Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/pendulum/parsing/__init__.py", line 131, in _parse dt = parser.parse( ^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/dateutil/parser/_parser.py", line 1368, in parse return DEFAULTPARSER.parse(timestr, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/dateutil/parser/_parser.py", line 643, in parse raise ParserError("Unknown string format: %s", timestr) dateutil.parser._parser.ParserError: Unknown string format: None During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/airflow/airflow_provider_fivetran_async/fivetran_provider_async/sensors.py", line 149, in execute elif not self.poke(context): ^^^^^^^^^^^^^^^^^^ File "/usr/local/airflow/airflow_provider_fivetran_async/fivetran_provider_async/sensors.py", line 200, in poke self._completed_after_time_rendered = timezone.parse(self.completed_after_time) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/airflow/utils/timezone.py", line 207, in parse return pendulum.parse(string, tz=timezone or TIMEZONE, strict=strict) # type: ignore ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/pendulum/parser.py", line 29, in parse return _parse(text, **options) ^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/pendulum/parser.py", line 45, in _parse parsed = base_parse(text, **options) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/pendulum/parsing/__init__.py", line 74, in parse return _normalize(_parse(text, **_options), **_options) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/pendulum/parsing/__init__.py", line 135, in _parse raise ParserError("Invalid date string: {}".format(text)) pendulum.parsing.exceptions.ParserError: Invalid date string: None [2023-10-17, 07:27:15 UTC] {taskinstance.py:1398} INFO - Marking task as FAILED. dag_id=example_fivetran_async, task_id=fivetran_async_sensor, execution_date=20231017T072549, start_date=20231017T072715, end_date=20231017T072715 [2023-10-17, 07:27:15 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 414 for task fivetran_async_sensor (Invalid date string: None; 324) [2023-10-17, 07:27:15 UTC] {local_task_job_runner.py:228} INFO - Task exited with return code 1 [2023-10-17, 07:27:15 UTC] {taskinstance.py:2776} INFO - 0 downstream tasks scheduled from follow-on schedule check

Is this expected?

Sorry for delays, I'd had a very busy last few weeks.

I started a new job recently and don't have Airflow + Fivetran set up just yet (will be setting it up soon).

That said, I think I can see what happened here.

The error message shows the following:

Invalid date string: None

So it looks like what happened is the FivetranOperator returned None. Looking at the code, we see (simplified):

def execute(self, context: Context) -> None | str: ... if not self.wait_for_completion: return last_sync if not self.deferrable: self._wait_synchronously(pendulum.parse(last_sync)) # type: ignore[arg-type] return None else: ... return None

So it only returns last_sync (str) when when wait_for_completion is False. This makes sense, since in this case you'd be implementing a sort of async-await pattern with operator+sensor and the timestamp is used as the "future" so to speak.

The issue therefore is that the example DAG needs to show wait_for_completion=False.

Reading these examples, I actually found them a little confusing.

I now wrote them a little bit differently to clarify the new behaviors:

example_fivetran.py shows a sync example. It is unnecessary to use a sensor now by default, so I removed the sensor.

example_fivetran_async.py also does not need to use the sensor, since the FivetranOperator awaits the Fivetran sync (both in sync and async moe).

example_fivetran_xcom.py needed wait_for_completion=False to be set for that example to make any sense, which I didn't have.

That said, there are still some possible issues with the Bigquery and Bqml examples? Since they both uses the unnecessary FivetranSensor. (It is unnecessary because the FivetranOperator does the job of waiting). Maybe we remove the FivetranSensors from there too?

We can also potentially provide an example of when Fivetran is being externally scheduled and the {{ data_interval_end }} is being used as the completed time?

dwreeves · 2023-10-23T23:16:23Z

Sorry all, I have been on vacation and essentially offline the past week. Will get to this soon!

Co-authored-by: Wei Lee <[email protected]>

dwreeves requested review from phanikumv, pankajastro and sunank200 as code owners September 4, 2023 22:31

dwreeves force-pushed the api-changes branch 3 times, most recently from 1467486 to 3e4312e Compare September 6, 2023 12:08

api changes

22759a9

dwreeves force-pushed the api-changes branch from 3e4312e to 22759a9 Compare September 6, 2023 13:10

dwreeves changed the title ~~[Draft] API Changes to FivetranSensor for supporting backfills~~ API Changes to FivetranSensor for supporting backfills Sep 12, 2023

dwreeves added 3 commits September 30, 2023 16:37

make always_wait_when_syncing default to false

8435828

Merge branch 'main' of github.com-dwreeves:astronomer/airflow-provide…

e22732f

…r-fivetran-async into api-changes

help mypy

18ff382

Lee-W reviewed Oct 3, 2023

View reviewed changes

fivetran_provider_async/hooks.py Outdated Show resolved Hide resolved

Lee-W approved these changes Oct 3, 2023

View reviewed changes

fivetran_provider_async/operators.py Outdated Show resolved Hide resolved

pankajastro approved these changes Oct 9, 2023

View reviewed changes

sunank200 approved these changes Oct 9, 2023

View reviewed changes

fivetran_provider_async/hooks.py Outdated Show resolved Hide resolved

Update fivetran_provider_async/operators.py

3660a43

Co-authored-by: Wei Lee <[email protected]>

remove else:

bbda70c

dwreeves requested review from pankajastro, sunank200 and Lee-W October 15, 2023 00:34

Lee-W approved these changes Oct 17, 2023

View reviewed changes

fivetran_provider_async/hooks.py Outdated Show resolved Hide resolved

fivetran_provider_async/operators.py Outdated Show resolved Hide resolved

pankajastro reviewed Oct 17, 2023

View reviewed changes

dwreeves mentioned this pull request Nov 2, 2023

Release 1.2.0 #66

Closed

clarify examples

5588e1c

dwreeves force-pushed the api-changes branch from 1b00c91 to 5588e1c Compare November 2, 2023 02:18

dwreeves and others added 3 commits November 1, 2023 22:26

Merge branch 'main' into api-changes

c626478

Update fivetran_provider_async/hooks.py

5f68b3f

Co-authored-by: Wei Lee <[email protected]>

Update fivetran_provider_async/operators.py

e9e59c0

Co-authored-by: Wei Lee <[email protected]>

dwreeves requested a review from pankajastro November 2, 2023 12:50

phanikumv approved these changes Nov 6, 2023

View reviewed changes

phanikumv merged commit 9fa930d into astronomer:main Nov 6, 2023
6 checks passed

dwreeves deleted the api-changes branch November 30, 2023 01:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API Changes to FivetranSensor for supporting backfills #58

API Changes to FivetranSensor for supporting backfills #58

dwreeves commented Sep 4, 2023 •

edited

dwreeves commented Sep 7, 2023

dwreeves commented Sep 30, 2023 •

edited

phanikumv commented Oct 1, 2023

Lee-W left a comment

pankajastro left a comment

dwreeves commented Oct 10, 2023 •

edited

dwreeves commented Oct 11, 2023

kaxil commented Oct 16, 2023

Lee-W left a comment

pankajastro Oct 17, 2023

dwreeves Nov 2, 2023

dwreeves commented Oct 23, 2023 •

edited

API Changes to FivetranSensor for supporting backfills #58

API Changes to FivetranSensor for supporting backfills #58

Conversation

dwreeves commented Sep 4, 2023 • edited

Changes

FivetranOperator

High level API:

FivetranSensor

High level API:

FivetranHook

High level API

Misc.

dwreeves commented Sep 7, 2023

dwreeves commented Sep 30, 2023 • edited

phanikumv commented Oct 1, 2023

Lee-W left a comment

Choose a reason for hiding this comment

pankajastro left a comment

Choose a reason for hiding this comment

dwreeves commented Oct 10, 2023 • edited

dwreeves commented Oct 11, 2023

kaxil commented Oct 16, 2023

Lee-W left a comment

Choose a reason for hiding this comment

pankajastro Oct 17, 2023

Choose a reason for hiding this comment

dwreeves Nov 2, 2023

Choose a reason for hiding this comment

dwreeves commented Oct 23, 2023 • edited

dwreeves commented Sep 4, 2023 •

edited

`FivetranOperator`

`FivetranSensor`

`FivetranHook`

dwreeves commented Sep 30, 2023 •

edited

dwreeves commented Oct 10, 2023 •

edited

dwreeves commented Oct 23, 2023 •

edited