-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-9650] Cleanup documentation on side inputs patterns #11415
Conversation
You can read side input pcollection periodically into distinct windows. | ||
Later, when you apply side input to your main input, windows will be matched automatically 1:1. | ||
This way, you can guarantee side input consistency on the duration of the single window. | ||
You can read side input data periodically into distinct PCollection windows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Beam, we have IO.Read transforms that do read source data. Later data read can be assigned to specific window.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR :)
meaning that each window on main input side will be matched to a single | ||
version of side input data. | ||
|
||
To do this, you can utilize a combination of PeriodicSequence/PeriodicImpulse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably best to replace "To do this" with "To do ABC" - it'll help the reader figure out what "this" refers to. Otherwise, if the reader is just skimming the page and starts at this paragraph, they have to read the previous paragraph to figure out what the pronoun refers to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just a couple more small comments to clear up a few parts of the doc.
|
||
To do this, you can utilize PeriodicSequence PTransform that will generate infinite sequence | ||
of elements with some real-time period: | ||
Described approach can be implemented using combination of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about something like this: "To read side input data periodically... use a combination of PeriodicSequence..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused here since adding text you suggest is pretty much re-writing same thing that is a header and previous abstract. In next iteration I tried to remove this abstract overall since list of steps provides same information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha - the intent is to summarize the previous abstract, so that it's easier for readers to skim the doc.
To do this, you can utilize PeriodicSequence PTransform that will generate infinite sequence | ||
of elements with some real-time period: | ||
Described approach can be implemented using combination of | ||
PeriodicSequence or PeriodicImpulse PTransforms and SDF Read or ReadAll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SDF Read and ReadAll are transforms too, right? The following will help clarify what the combination consists of: "...PeriodicSequence or PeriodicImpulse PTransforms and SDF Read or ReadAll PTransforms"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something, but that's exactly the phrasing. "PTransforms" word is on next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably just missed the word. Looks good in staging.
a. MAX_TIMESTAMP can be replaced with some closer boundary if you want to stop generating elements at some point. | ||
1. Use the PeriodicImpulse or PeriodicSequence PTransform to generate an | ||
infinite sequence of elements at required processing time intervals and assign | ||
them to separate windows. | ||
|
||
1. Read data using Read operation triggered by arrival of PCollection element. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you marked the previous comment here as resolved but I don't see any changes to this line. Was that intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was not intended. FIxed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text LGTM after resolving the remaining comments, but I don't see the code sample on the staging server.
meaning that each window on the main input will be matched to a single | ||
version of side input data. | ||
|
||
Implementation of described approach can be narrowed down to: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should rephrase as "To read side input data periodically:"
The goal is to summarize the previous paragraph, so that it's clear what these steps refer to without needing to read through the previous paragraph. The previous paragraph offers additional context, but if a reader skims the page, they should be able to quickly figure out what these steps are about.
intervals | ||
* Assign them to separate windows. | ||
1. Fetch data using SDF Read or ReadAll PTransform triggered by arrival of | ||
PCollection element. | ||
1. Apply side input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing "the" -> "Apply the side input"
I think it's missing because it's not picking up the language selector. |
Staging job failing together with website publish jobs :( So staging version has old code unfortunately. |
|
||
a. MAX_TIMESTAMP can be replaced with some closer boundary if you want to stop generating elements at some point. | ||
You can read side input data periodically into distinct PCollection windows. | ||
Later, when you apply the side input to your main input, each main input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove "Later" because this is a part of the overall workflow described in the previous sentence, rather than a secondary step.
So this should be: "When you apply the side input to your main input..."
LGTM (modulo one last comment). Thanks again for opening this PR and doing an editorial review |
Co-authored-by: Mikhail Gryzykhin <[email protected]>
[Website staging preview]
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.