Moving IN-WE to a new API which has 24 hour datetime data unlike 12 hour #6095
base: master
Conversation
The older API endpoints provided data in 12-hour format without AM/PM markers, which required quite a lot of logic in the code to parse that data. The new APIs provide data in 24-hour format, which is unambiguous and simpler to process. I also changed the methods to use a pandas groupby on the datetime hour, which simplified the code even further and fixed some bugs that would arise in the earlier version when datapoints were missing from the source.
Hi and thanks for the PR!
This is very hard to review because a lot of things have changed and/or been moved around, so if possible I would suggest splitting this PR up into several PRs that can be evaluated on their own.
For example, this seems to remove the use of arrow; that should probably be a PR on its own, and so should moving to the 24-hour API, etc.
Otherwise it will take a long time to review this PR, and it introduces additional risk that we miss something that alters its behaviour or output in a way we didn't expect.
I can understand that it is difficult to review all the changes; it becomes even harder when someone tries to understand them by going through the diff alone. I think it is much easier to follow if it is clear how the code worked before and how it can be simplified once the data is available in 24-hour format. I will add details on the old and the new versions of the parser in the PR, explaining exactly what the logic was before my change and how it changed in the newer version. I am confident it will be easy to review after that explanation.
I agree; without unit tests, it will be really difficult to verify whether the output changes unexpectedly in the newer version.
EXCHANGES_MAPPING = {
    "WR-SR": "IN-SO->IN-WE",
    "WR-ER": "IN-EA->IN-WE",
I reversed this mapping because we simply need to get the records in the data that match WR-SR, WR-ER, or WR-NR.
- In the older version we joined the data with this mapping to add a new column representing the sortedZoneKey.
- In the newer version, instead of performing a join, I simply filter by region name (WR-SR, WR-ER, WR-NR).
It is the same filtering logic, just easier to read.
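As a rough sketch of the difference (the record layout and the helper name are illustrative, not the parser's actual code), the new version looks the region up once from the reversed mapping and filters on it directly:

```python
# Illustrative records and mapping; field names are assumptions, not the
# parser's actual schema. The reversed mapping goes zone key -> region name.
EXCHANGES_MAPPING = {"IN-SO->IN-WE": "WR-SR", "IN-EA->IN-WE": "WR-ER"}

records = [
    {"Region_Name": "WR-SR", "value": 100.0},
    {"Region_Name": "WR-ER", "value": 50.0},
    {"Region_Name": "WR-NR", "value": 10.0},
]

def filter_for_zone(records, zone_key):
    # Instead of joining every record against the mapping to attach a
    # sortedZoneKey column, look the region up once and filter by it.
    region = EXCHANGES_MAPPING[zone_key]
    return [r for r in records if r["Region_Name"] == region]

print(filter_for_zone(records, "IN-SO->IN-WE"))
# → [{'Region_Name': 'WR-SR', 'value': 100.0}]
```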
Which is why I am suggesting splitting up this PR: it becomes much easier to review if there is "one" change per PR, and that will allow us to either implement the changes or request changes a lot quicker.
Note that I don't really want to use arrow at all, but changing things involving dates and times has a tendency to introduce bugs, which is why I would prefer it as a separate PR.
Even unit tests might not cover all scenarios; they are good for regression testing, but big changes like this one can easily include things that are not covered by the tests right now.
def get_date_range(dt: datetime):
This method was used in the older code to generate a datetime for every hour of the entire day containing the given datetime.
For example, if dt is 2023-11-08 12:23:00, the output is
2023-11-08 00:00:00,
2023-11-08 01:00:00,
2023-11-08 02:00:00,
.
.
.
2023-11-08 23:00:00
The older code then looped over these hourly datetimes and, for each hour, filtered and aggregated the data for that hour. This is exactly what a groupby on the datetime field does automatically, without querying the data for every hour in a loop. Using groupby is simpler, more efficient, and more straightforward.
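A minimal sketch of the two approaches with toy data (column names are invented for illustration):

```python
import pandas as pd

# Toy records; column names are illustrative, not the parser's schema.
df = pd.DataFrame(
    {
        "dt": pd.to_datetime(
            ["2023-11-08 00:05", "2023-11-08 00:35", "2023-11-08 01:10"]
        ),
        "value": [10.0, 20.0, 40.0],
    }
)

# Older approach (sketch): loop over the generated hours and filter the
# frame once per hour.
per_hour_loop = {
    hour: df[df["dt"].dt.hour == hour]["value"].mean() for hour in range(2)
}

# Newer approach: one groupby on the hour-floored datetime.
per_hour_groupby = df.groupby(df["dt"].dt.floor("h"))["value"].mean()

print(per_hour_loop)              # {0: 15.0, 1: 40.0}
print(per_hour_groupby.tolist())  # [15.0, 40.0]
```

Both give the same hourly means; the groupby version touches the frame once instead of once per hour.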
resp: Response = r.post(url=KIND_MAPPING[kind]["url"], json=payload)
In the new code I am simply passing the URL as a param instead of a kind and then reading it from the KIND_MAPPING.
datetime_col = KIND_MAPPING[kind]["datetime_column"]
for item in data:
This logic rounds the datetime field to the nearest minute, which is not needed in the newer version, where we aggregate by datetime hour.
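For illustration (timestamps invented), minute-level rounding becomes redundant once the aggregation key is the hour floor:

```python
import pandas as pd

# Two readings a few seconds apart; values invented for illustration.
ts = pd.Series(pd.to_datetime(["2023-11-08 10:14:59", "2023-11-08 10:15:01"]))

rounded_to_minute = ts.dt.round("min")  # the old per-record cleanup step
floored_to_hour = ts.dt.floor("h")      # the new aggregation key

# Both readings end up in the 10:00 bucket regardless of minute rounding,
# so the rounding step adds nothing once we group by hour.
print(floored_to_hour.nunique())  # → 1
```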
def format_exchanges_data(
This method, format_exchanges_data, was called for every hour generated by get_date_range(dt: datetime) and would give the net_flow for that hour. In the newer version this is not needed; we simply group on the datetime field.
assert len(data) > 0
assert kind != ""
dt_12_hour = arrow.get(target_datetime.strftime("%Y-%m-%d %I:%M")).datetime
The filter_raw_data method was called for every hour, and it would filter the data for that target_datetime hour.
dt_12_hour = arrow.get(target_datetime.strftime("%Y-%m-%d %I:%M")).datetime
datetime_col = KIND_MAPPING[kind]["datetime_column"]
filtered_data = pd.DataFrame(
    [item for item in data if item[datetime_col].hour == dt_12_hour.hour]
There is definitely a bug here. In the older version, since the data had a datetime field in 12-hour format with no AM/PM value, this filter aggregates morning and evening data into the same bucket. For example, if the data has records for 2023-11-08 11:00 AM and 2023-11-08 11:00 PM, they will all be considered to be in the same hour.
The older API provided data in a format that can never be correctly interpreted; improving correctness is the primary motivation for this change.
Because of this bug, say the parser is run at 2023-11-08 18:23: the source will only provide data up to 18:23, yet the older version of the parser would also generate data in the future, up to the 23:00 hour, because it mistakenly copies the data from the 11:00 hour to the 23:00 hour.
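A small illustration of the ambiguity (timestamps invented; plain datetime stands in for arrow here): strftime's "%I" maps the target's 23:00 down to 11 on a 12-hour clock, so the hour-equality filter merges the morning reading into the evening bucket:

```python
from datetime import datetime

# In the old feed both 11:00 AM and 11:00 PM arrive as "2023-11-08 11:00"
# with no AM/PM marker; the readings here are invented for illustration.
raw = ["2023-11-08 11:00", "2023-11-08 11:00"]  # physically AM and PM readings
data = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in raw]

# Old filter: the target 23:00 hour collapses to 11 via "%I", so records
# from 11 AM satisfy the hour-equality check for the 23:00 bucket.
target_datetime = datetime(2023, 11, 8, 23, 0)
dt_12_hour = datetime.strptime(
    target_datetime.strftime("%Y-%m-%d %I:%M"), "%Y-%m-%d %H:%M"
)
bucket = [dt for dt in data if dt.hour == dt_12_hour.hour]

print(len(bucket))  # → 2: both readings land in one bucket
```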
if target_datetime.hour >= 12:
This logic was there to compensate for the 12-hour data format; we do not need it in the newer version.
df = pd.DataFrame(data, columns=["Region_Name", datetime_column, value_column])
df = df[df["Region_Name"] == EXCHANGES_MAPPING[zone_key]]
df[datetime_column] = (
    pd.to_datetime(df[datetime_column], format="%Y-%m-%d %H:%M:%S")
    .dt.tz_localize(IN_TZ)
    .dt.floor("h")
)
df = df.groupby(["Region_Name", datetime_column]).mean().round(3)
df[value_column] = -df[value_column]
This is the core logic: it floors the datetime to the hour and then takes the mean within every hourly bucket. The logic is the same as in the older version; the difference is that I am not iterating over the hours and querying the dataframe with a separate query for each one.
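To show that bucketing on toy data (the readings and the dt/flow column names are invented; the real code takes the datetime and value columns as parameters):

```python
import pandas as pd

IN_TZ = "Asia/Kolkata"

# Invented readings: two in the 11:00 hour, one in the 23:00 hour.
data = [
    ("WR-SR", "2023-11-08 11:10:00", 100.0),
    ("WR-SR", "2023-11-08 11:40:00", 200.0),
    ("WR-SR", "2023-11-08 23:05:00", 50.0),
]

df = pd.DataFrame(data, columns=["Region_Name", "dt", "flow"])
df["dt"] = (
    pd.to_datetime(df["dt"], format="%Y-%m-%d %H:%M:%S")
    .dt.tz_localize(IN_TZ)
    .dt.floor("h")          # trim each timestamp to its hour bucket
)
hourly = df.groupby(["Region_Name", "dt"]).mean().round(3)
hourly["flow"] = -hourly["flow"]  # flip the sign for the exchange convention

print(hourly["flow"].tolist())  # → [-150.0, -50.0]
```

The 11:00 bucket averages 100 and 200 to 150 and the 23:00 bucket keeps 50, then both are negated; missing hours simply produce no row instead of a spurious one.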
I feel this is a PR for one change, moving to the new API with the 24-hour date format; just moving to the new API makes a lot of the old code obsolete. I am finding it difficult to see how to split it into multiple PRs, but I will put more thought into it.
This is a smaller PR with the bare minimum changes to parse the 24-hour data using the same logic as the older parser: #6115
Description
The older API endpoints provided data in 12-hour format without AM/PM markers, which required quite a lot of logic in the code to parse that data.
The new APIs from the same data source provide data in 24-hour format, which is unambiguous and simpler to process.
I also changed the methods to use a pandas groupby on the datetime hour, which simplified the code even further and fixed some bugs that would arise in the earlier version when datapoints were missing from the source.
Double check
poetry run test_parser "zone_key"
poetry run test_parser IN-WE consumption
poetry run test_parser "IN-NO->IN-WE" exchange
pnpx prettier --write . and poetry run format to format my changes.