Moving IN-WE to a new API which has 24 hour datetime data unlike 12 hour #6095
base: master
Conversation
The older API endpoints provided data in 12-hour format without AM/PM markers, which required quite a lot of logic in the code to parse that data. The new APIs provide data in 24-hour format, which is unambiguous and simpler to process. I also changed the methods to use a pandas groupby on the datetime hour, which simplified the code even further and fixed some bugs that would arise in the earlier version when datapoints were missing from the source.
Hi and thanks for the PR!
This is very hard to review because a lot of things have changed and/or been moved around, so if possible I would suggest splitting this PR up into several PRs that can be evaluated on their own.
For example, this seems to remove the use of arrow; that should probably be a PR on its own, and so should moving to the 24-hour API, etc.
Otherwise it will take a long time to review this PR, and it introduces additional risk that we miss something that alters its behaviour or output in a way we didn't expect.
I can understand that it is difficult to review all the changes; it becomes even harder when someone tries to understand them by going through the diff alone. I think it is much easier to follow if it is clear how the code worked before and how it can be simplified once the data is available in 24-hour format. I will add details on the old and the new versions of the parser in the PR, explaining exactly what the logic was before my change and how it changed in the newer version. I am confident it will be easy to review after that explanation.
I agree; without unit tests, it will be really difficult to verify whether the output changes unexpectedly in the newer version.
EXCHANGES_MAPPING = {
    "WR-SR": "IN-SO->IN-WE",
    "WR-ER": "IN-EA->IN-WE",
I reversed this mapping because we simply need to get the records in the data that match WR-SR, WR-ER, or WR-NR.
- In the older version we joined the data with this mapping to add a new column representing the sortedZoneKey.
- In the newer version, instead of performing a join, I simply filter by region name (WR-SR, WR-ER, WR-NR).
It is the same filtering logic, just easier to read.
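As a rough sketch of the difference (the record layout and the helper name are illustrative, not the parser's actual code), the new version looks the region up once from the reversed mapping and filters on it directly:

```python
# Illustrative records and mapping; field names are assumptions, not the
# parser's actual schema. The reversed mapping goes zone key -> region name.
EXCHANGES_MAPPING = {"IN-SO->IN-WE": "WR-SR", "IN-EA->IN-WE": "WR-ER"}

records = [
    {"Region_Name": "WR-SR", "value": 100.0},
    {"Region_Name": "WR-ER", "value": 50.0},
    {"Region_Name": "WR-NR", "value": 10.0},
]

def filter_for_zone(records, zone_key):
    # Instead of joining every record against the mapping to attach a
    # sortedZoneKey column, look the region up once and filter by it.
    region = EXCHANGES_MAPPING[zone_key]
    return [r for r in records if r["Region_Name"] == region]

print(filter_for_zone(records, "IN-SO->IN-WE"))
# → [{'Region_Name': 'WR-SR', 'value': 100.0}]
```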
Which is why I am suggesting splitting up this PR: it becomes much easier to review if there is "one" change per PR, and that will allow us to either implement the changes or request changes a lot quicker.
Note that I don't really want to use arrow at all, but changing things involving dates and times has a tendency to introduce bugs, which is why I would prefer it as a separate PR.
Even unit tests might not cover all scenarios; they are good for regression testing, but big changes like this one can easily include things that are not covered by the tests right now.
def get_date_range(dt: datetime):
This method was used in the older code to generate a datetime for every hour of the entire day containing the given datetime.
For example, if dt is 2023-11-08 12:23:00, the output is
2023-11-08 00:00:00,
2023-11-08 01:00:00,
2023-11-08 02:00:00,
.
.
.
2023-11-08 23:00:00
The older code then looped over these hourly datetimes and, for each hour, filtered and aggregated the data for that hour. This is exactly what a groupby on the datetime field does automatically, without querying the data for every hour in a loop. Using groupby is simpler, more efficient, and more straightforward.
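A minimal sketch of the two approaches with toy data (column names are invented for illustration):

```python
import pandas as pd

# Toy records; column names are illustrative, not the parser's schema.
df = pd.DataFrame(
    {
        "dt": pd.to_datetime(
            ["2023-11-08 00:05", "2023-11-08 00:35", "2023-11-08 01:10"]
        ),
        "value": [10.0, 20.0, 40.0],
    }
)

# Older approach (sketch): loop over the generated hours and filter the
# frame once per hour.
per_hour_loop = {
    hour: df[df["dt"].dt.hour == hour]["value"].mean() for hour in range(2)
}

# Newer approach: one groupby on the hour-floored datetime.
per_hour_groupby = df.groupby(df["dt"].dt.floor("h"))["value"].mean()

print(per_hour_loop)              # {0: 15.0, 1: 40.0}
print(per_hour_groupby.tolist())  # [15.0, 40.0]
```

Both give the same hourly means; the groupby version touches the frame once instead of once per hour.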
resp: Response = r.post(url=KIND_MAPPING[kind]["url"], json=payload)
In the new code I am simply passing the URL as a param instead of a kind and then reading it from the KIND_MAPPING.
datetime_col = KIND_MAPPING[kind]["datetime_column"]
for item in data:
This logic rounds the datetime field to the nearest minute, which is not needed in the newer version, where we aggregate by datetime hour.
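For illustration (timestamps invented), minute-level rounding becomes redundant once the aggregation key is the hour floor:

```python
import pandas as pd

# Two readings a few seconds apart; values invented for illustration.
ts = pd.Series(pd.to_datetime(["2023-11-08 10:14:59", "2023-11-08 10:15:01"]))

rounded_to_minute = ts.dt.round("min")  # the old per-record cleanup step
floored_to_hour = ts.dt.floor("h")      # the new aggregation key

# Both readings end up in the 10:00 bucket regardless of minute rounding,
# so the rounding step adds nothing once we group by hour.
print(floored_to_hour.nunique())  # → 1
```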
def format_exchanges_data(
This method, format_exchanges_data, was called for every hour generated by get_date_range(dt: datetime) and would give the net_flow for that hour. In the newer version this is not needed; we simply group on the datetime field.
assert len(data) > 0
assert kind != ""
dt_12_hour = arrow.get(target_datetime.strftime("%Y-%m-%d %I:%M")).datetime
The filter_raw_data method was called for every hour, and it would filter the data for that target_datetime hour.
dt_12_hour = arrow.get(target_datetime.strftime("%Y-%m-%d %I:%M")).datetime
datetime_col = KIND_MAPPING[kind]["datetime_column"]
filtered_data = pd.DataFrame(
    [item for item in data if item[datetime_col].hour == dt_12_hour.hour]
There is definitely a bug here. In the older version, since the data had a datetime field in 12-hour format with no AM/PM value, this filter aggregates morning and evening data into the same bucket. For example, if the data has records for 2023-11-08 11:00 AM and 2023-11-08 11:00 PM, they will all be considered to be in the same hour.
The older API provided data in a format that can never be correctly interpreted; improving correctness is the primary motivation for this change.
Because of this bug, say the parser is run at 2023-11-08 18:23: the source will only provide data up to 18:23, yet the older version of the parser would also generate data in the future, up to the 23:00 hour, because it mistakenly copies the data from the 11:00 hour to the 23:00 hour.
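A small illustration of the ambiguity (timestamps invented; plain datetime stands in for arrow here): strftime's "%I" maps the target's 23:00 down to 11 on a 12-hour clock, so the hour-equality filter merges the morning reading into the evening bucket:

```python
from datetime import datetime

# In the old feed both 11:00 AM and 11:00 PM arrive as "2023-11-08 11:00"
# with no AM/PM marker; the readings here are invented for illustration.
raw = ["2023-11-08 11:00", "2023-11-08 11:00"]  # physically AM and PM readings
data = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in raw]

# Old filter: the target 23:00 hour collapses to 11 via "%I", so records
# from 11 AM satisfy the hour-equality check for the 23:00 bucket.
target_datetime = datetime(2023, 11, 8, 23, 0)
dt_12_hour = datetime.strptime(
    target_datetime.strftime("%Y-%m-%d %I:%M"), "%Y-%m-%d %H:%M"
)
bucket = [dt for dt in data if dt.hour == dt_12_hour.hour]

print(len(bucket))  # → 2: both readings land in one bucket
```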
if target_datetime.hour >= 12:
This logic was there to compensate for the 12-hour data format; we do not need it in the newer version.
df = pd.DataFrame(data, columns=["Region_Name", datetime_column, value_column])
df = df[df["Region_Name"] == EXCHANGES_MAPPING[zone_key]]
df[datetime_column] = (
    pd.to_datetime(df[datetime_column], format="%Y-%m-%d %H:%M:%S")
    .dt.tz_localize(IN_TZ)
    .dt.floor("h")
)
df = df.groupby(["Region_Name", datetime_column]).mean().round(3)
df[value_column] = -df[value_column]
This is the core logic: it floors the datetime to the hour and then takes the mean within every hourly bucket. The logic is the same as in the older version; the difference is that I am not iterating over the hours and querying the dataframe with a separate query for each one.
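To show that bucketing on toy data (the readings and the dt/flow column names are invented; the real code takes the datetime and value columns as parameters):

```python
import pandas as pd

IN_TZ = "Asia/Kolkata"

# Invented readings: two in the 11:00 hour, one in the 23:00 hour.
data = [
    ("WR-SR", "2023-11-08 11:10:00", 100.0),
    ("WR-SR", "2023-11-08 11:40:00", 200.0),
    ("WR-SR", "2023-11-08 23:05:00", 50.0),
]

df = pd.DataFrame(data, columns=["Region_Name", "dt", "flow"])
df["dt"] = (
    pd.to_datetime(df["dt"], format="%Y-%m-%d %H:%M:%S")
    .dt.tz_localize(IN_TZ)
    .dt.floor("h")          # trim each timestamp to its hour bucket
)
hourly = df.groupby(["Region_Name", "dt"]).mean().round(3)
hourly["flow"] = -hourly["flow"]  # flip the sign for the exchange convention

print(hourly["flow"].tolist())  # → [-150.0, -50.0]
```

The 11:00 bucket averages 100 and 200 to 150 and the 23:00 bucket keeps 50, then both are negated; missing hours simply produce no row instead of a spurious one.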
I feel this is a PR for one change, moving to the new API with the 24-hour date format; just moving to the new API makes a lot of the old code obsolete. I am finding it difficult to see how to split it into multiple PRs, but I will put more thought into it.
This is a smaller PR with the bare minimum changes to parse the 24-hour data using the same logic as the older parser: #6115
Description
The older API endpoints provided data in 12-hour format without AM/PM markers, which required quite a lot of logic in the code to parse that data.
The new APIs from the same data source provide data in 24-hour format, which is unambiguous and simpler to process.
I also changed the methods to use a pandas groupby on the datetime hour, which simplified the code even further and fixed some bugs that would arise in the earlier version when datapoints were missing from the source.
Double check
poetry run test_parser "zone_key"
poetry run test_parser IN-WE consumption
poetry run test_parser "IN-NO->IN-WE" exchange
pnpx prettier --write . and poetry run format to format my changes.