
Moving IN-WE to a new API which has 24 hour datetime data unlike 12 hour #6095

Open
wants to merge 2 commits into master
Conversation

saurabhchatterjee23
Contributor

@saurabhchatterjee23 saurabhchatterjee23 commented Nov 3, 2023

Description

The older API endpoints provided data in a 12-hour format without AM/PM markers, which required quite a lot of parsing logic in the code.

The new APIs from the same data source provide data in 24-hour format, which is unambiguous and simpler to process.

I also changed the methods to use a pandas groupby on the datetime hour, which simplified the code even further and fixed some bugs the earlier version had when datapoints were missing from the source.
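As a minimal sketch of why the 24-hour feed is simpler to parse (the sample rows, the `lastUpdate` column name, and the values here are hypothetical, not the parser's actual fields), a single `%H` format string handles the new timestamps unambiguously:

```python
import pandas as pd

# Hypothetical rows shaped like the new API's 24-hour output.
rows = [
    {"Region_Name": "WR-SR", "lastUpdate": "2023-11-08 11:00:00", "value": 120.5},
    {"Region_Name": "WR-SR", "lastUpdate": "2023-11-08 23:00:00", "value": 98.2},
]

df = pd.DataFrame(rows)
# %H covers hours 00-23, so morning and evening readings stay distinct.
df["lastUpdate"] = pd.to_datetime(df["lastUpdate"], format="%Y-%m-%d %H:%M:%S")
print(df["lastUpdate"].dt.hour.tolist())  # [11, 23]
```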

Double check

  • I have tested my parser changes locally with poetry run test_parser "zone_key"
  • poetry run test_parser IN-WE consumption
  • poetry run test_parser "IN-NO->IN-WE" exchange
  • I have run pnpx prettier --write . and poetry run format to format my changes.

@github-actions github-actions bot added the parser label Nov 3, 2023
@saurabhchatterjee23 saurabhchatterjee23 changed the title Moving IN-WE to a new data source which has 24 hour datetime data unlike 12 hour Moving IN-WE to a new API which has 24 hour datetime data unlike 12 hour Nov 3, 2023
@VIKTORVAV99 VIKTORVAV99 self-requested a review November 3, 2023 07:28
@madsnedergaard madsnedergaard requested a review from a user November 7, 2023 19:56
Member

@VIKTORVAV99 VIKTORVAV99 left a comment


Hi and thanks for the PR!

This is very hard to review as a lot of things have changed and/or been moved around. So if possible I would suggest splitting this PR into several PRs that can be evaluated on their own.

For example, this seems to remove the use of arrow; that should probably be a PR on its own, and so should moving to the 24-hour API, etc.

Otherwise it will take a long time to review this PR, and it introduces additional risk that we miss something that alters its behaviour or output in a way we didn't expect.

@saurabhchatterjee23
Contributor Author

This is very hard to review as a lot of things have changed and/or been moved around. So if possible I would suggest splitting this PR into several PRs that can be evaluated on their own.

I can understand that it is difficult to review all the changes, and it becomes even more difficult if someone tries to understand them purely from the diff. I feel it is much easier to follow the changes once it is clear how the code worked earlier and how it can be simplified when the data is available in 24-hour format. I will add details on the old and new versions of the parser in the PR, explaining exactly what the logic was before my change and how it changed in the newer version. I am confident it will be easy to review after that explanation.

For example, this seems to remove the use of arrow; that should probably be a PR on its own, and so should moving to the 24-hour API, etc.

I have not used arrow extensively before; I will read up on how converting a pandas datetime with arrow differs from using to_pydatetime().

Otherwise it will take a long time to review this PR, and it introduces additional risk that we miss something that alters its behaviour or output in a way we didn't expect.

I agree; without unit tests it will be really difficult to spot unexpected changes in the output of the newer version.


EXCHANGES_MAPPING = {
"WR-SR": "IN-SO->IN-WE",
"WR-ER": "IN-EA->IN-WE",
Contributor Author

@saurabhchatterjee23 saurabhchatterjee23 Nov 8, 2023


I reversed this mapping because we simply need the records in the data that match WR-SR, WR-ER, or WR-NR:

  • In the older version we joined the data with this mapping to add a new column representing the sortedZoneKey
  • In the newer version, instead of performing a join, I simply filter by region name (WR-SR, WR-ER, WR-NR)

It is the same filtering logic, just easier to read.
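A toy sketch of the filter-versus-join point, assuming a reversed mapping from sorted zone key to the source's region name (the real mapping has more entries and the column names may differ):

```python
import pandas as pd

# Hypothetical reversed mapping: sorted zone key -> region name in the raw data.
EXCHANGES_MAPPING = {
    "IN-SO->IN-WE": "WR-SR",
    "IN-EA->IN-WE": "WR-ER",
}

data = pd.DataFrame(
    {
        "Region_Name": ["WR-SR", "WR-ER", "WR-NR"],
        "value": [10.0, 20.0, 30.0],
    }
)

# Instead of joining the data against a mapping table to attach a
# sortedZoneKey column, simply keep the rows for the region we want.
zone_key = "IN-SO->IN-WE"
filtered = data[data["Region_Name"] == EXCHANGES_MAPPING[zone_key]]
print(filtered["value"].tolist())  # [10.0]
```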

@VIKTORVAV99
Member

This is very hard to review as a lot of things have changed and/or been moved around. So if possible I would suggest splitting this PR into several PRs that can be evaluated on their own.

I can understand that it is difficult to review all the changes, and it becomes even more difficult if someone tries to understand them purely from the diff. I feel it is much easier to follow the changes once it is clear how the code worked earlier and how it can be simplified when the data is available in 24-hour format. I will add details on the old and new versions of the parser in the PR, explaining exactly what the logic was before my change and how it changed in the newer version. I am confident it will be easy to review after that explanation.

Which is why I am suggesting splitting up this PR: it becomes much easier to review if there is "one" change per PR, and it will allow us to either implement the changes or request changes a lot quicker.

For example, this seems to remove the use of arrow; that should probably be a PR on its own, and so should moving to the 24-hour API, etc.

I have not used arrow extensively before; I will read up on how converting a pandas datetime with arrow differs from using to_pydatetime().

Note that I don't really want to use arrow at all, but changes to dates and times have a tendency to introduce bugs, which is why I would prefer it as a separate PR.

Otherwise it will take a long time to review this PR, and it introduces additional risk that we miss something that alters its behaviour or output in a way we didn't expect.

I agree; without unit tests it will be really difficult to spot unexpected changes in the output of the newer version.

Even unit tests might not cover all scenarios; they are good for regression testing, but big changes like this one can easily include things that the current tests don't cover.


def get_date_range(dt: datetime):
Contributor Author


This method was used in the older code to generate a datetime for every hour of the entire day containing the given datetime. For example, if dt is 2023-11-08 12:23:00 the output is

2023-11-08 00:00:00, 
2023-11-08 01:00:00, 
2023-11-08 02:00:00, 
2023-11-08 03:00:00, 
2023-11-08 04:00:00,
.
.
.
2023-11-08 23:00:00

The older code would then, for every generated hour, filter and aggregate the data for that hour. This is exactly what a groupby on the datetime field does automatically, without querying the data for every hour in a loop. Using groupby is simpler, more efficient, and straightforward.
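The difference can be illustrated with toy minute-level readings (hypothetical column names and values); the per-hour loop and a single groupby on the floored datetime yield the same hourly means:

```python
import pandas as pd

# Hypothetical minute-level readings spanning two hours.
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2023-11-08 10:05:00", "2023-11-08 10:35:00", "2023-11-08 11:05:00"]
        ),
        "value": [100.0, 200.0, 300.0],
    }
)

# Older style: generate each hour, then filter and aggregate in a loop.
old_means = {}
for hour_dt in pd.date_range("2023-11-08 10:00", "2023-11-08 11:00", freq="h"):
    chunk = df[df["datetime"].dt.floor("h") == hour_dt]
    if not chunk.empty:
        old_means[hour_dt] = chunk["value"].mean()

# Newer style: one groupby does the same aggregation in a single pass.
new_means = df.groupby(df["datetime"].dt.floor("h"))["value"].mean()
print(new_means.tolist())  # [150.0, 300.0]
```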


resp: Response = r.post(url=KIND_MAPPING[kind]["url"], json=payload)
Contributor Author


In the new code I am simply passing the URL as a parameter instead of a kind and then reading it from the KIND_MAPPING.

)

datetime_col = KIND_MAPPING[kind]["datetime_column"]
for item in data:
Contributor Author


This logic rounds the datetime field to the nearest minute, which is not needed in the newer version where we aggregate by datetime hour.
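For illustration (toy timestamps), rounding to the nearest minute versus flooring to the hour; once everything is floored to the hour for the groupby, the minute-level rounding becomes irrelevant:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2023-11-08 10:34:51", "2023-11-08 10:05:29"]))

# Old approach: snap each timestamp to the nearest minute before matching.
rounded = ts.dt.round("min")
print(rounded.dt.minute.tolist())  # [35, 5]

# New approach: flooring to the hour groups both readings under 10:00.
floored = ts.dt.floor("h")
print(floored.nunique())  # 1
```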


def format_exchanges_data(
Contributor Author


The format_exchanges_data method was called for every hour generated by get_date_range(dt: datetime) and returned the net_flow for that hour. In the newer version this is not needed, because we simply group by the datetime field.

assert len(data) > 0
assert kind != ""

dt_12_hour = arrow.get(target_datetime.strftime("%Y-%m-%d %I:%M")).datetime
Contributor Author


The filter_raw_data method was called for every hour and filtered the data for that target_datetime hour.

dt_12_hour = arrow.get(target_datetime.strftime("%Y-%m-%d %I:%M")).datetime
datetime_col = KIND_MAPPING[kind]["datetime_column"]
filtered_data = pd.DataFrame(
[item for item in data if item[datetime_col].hour == dt_12_hour.hour]
Contributor Author


There is definitely a bug here. In the older version, since the data had a 12-hour datetime field with no AM/PM value, this filter aggregates morning and evening data into the same bucket. For example, if the data has records for 2023-11-08 11:00 AM and 2023-11-08 11:00 PM, they are all treated as belonging to the same hour.

The older API provided data in a format that can never be interpreted correctly; improving correctness is the primary motivation for this change.

Because of this bug, if the parser runs at 2023-11-08 18:23, the source only provides data up to 18:23, yet the older version of the parser would also generate data into the future, up to hour 23:00, because it mistakenly copies the data from hour 11 into hour 23.
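The collision is easy to reproduce in isolation: parsing an AM/PM-less 12-hour string with `%I` cannot tell 11:00 AM from 11:00 PM, so both readings land in the same hour bucket:

```python
from datetime import datetime

FMT = "%Y-%m-%d %I:%M"  # 12-hour format with no %p (AM/PM) field

morning = datetime.strptime("2023-11-08 11:00", FMT)  # meant as 11:00 AM
evening = datetime.strptime("2023-11-08 11:00", FMT)  # meant as 11:00 PM,
                                                      # but the feed cannot say so
print(morning.hour == evening.hour)  # True: both parse to hour 11
```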


if target_datetime.hour >= 12:
Contributor Author


This logic compensated for the 12-hour data format; we do not need it in the newer version.
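A hypothetical reconstruction of that compensation step, not the parser's exact code: when the target hour is in the afternoon, the ambiguous 12-hour reading gets shifted forward by 12 hours (the helper name `to_24_hour` is invented for this sketch):

```python
from datetime import datetime, timedelta

def to_24_hour(parsed: datetime, target: datetime) -> datetime:
    """Shift an AM/PM-less reading into the afternoon when the target is."""
    if target.hour >= 12 and parsed.hour < 12:
        return parsed + timedelta(hours=12)
    return parsed

reading = datetime(2023, 11, 8, 11, 0)   # ambiguous "11:00" from the old feed
target = datetime(2023, 11, 8, 23, 30)
print(to_24_hour(reading, target))  # 2023-11-08 23:00:00
```

With a 24-hour feed this guesswork disappears entirely, which is the point of the change.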

Comment on lines +96 to +104
df = pd.DataFrame(data, columns=["Region_Name", datetime_column, value_column])
df = df[df["Region_Name"] == EXCHANGES_MAPPING[zone_key]]
df[datetime_column] = (
pd.to_datetime(df[datetime_column], format="%Y-%m-%d %H:%M:%S")
.dt.tz_localize(IN_TZ)
.dt.floor("h")
)
df = df.groupby(["Region_Name", datetime_column]).mean().round(3)
df[value_column] = -df[value_column]
Contributor Author

@saurabhchatterjee23 saurabhchatterjee23 Nov 8, 2023


This is the core logic: it floors the datetime to the hour and then takes the mean within every hour. The logic is the same as in the older version; the difference is that I am not iterating over the hours and querying the dataframe separately for each one, unlike the older version.

@saurabhchatterjee23
Contributor Author

Which is why I am suggesting splitting up this PR: it becomes much easier to review if there is "one" change per PR, and it will allow us to either implement the changes or request changes a lot quicker.

I feel this PR is for one change, moving to the new API with the 24-hour date format; just moving to the new API makes a lot of old code obsolete. I am finding it difficult to see how to split it into multiple PRs.

I will put more thought into it

@saurabhchatterjee23
Contributor Author

This is a smaller PR with the bare minimum changes to parse the 24-hour data using the same logic as in the older parser: #6115
Let me know if you find it difficult to understand.
