NOTICE: Updates and (non-structural) Changes are Coming #2

Closed
Lucas-Czarnecki opened this issue Apr 17, 2020 · 1 comment
Labels
documentation Improvements or additions to documentation

Comments

Lucas-Czarnecki commented Apr 17, 2020

Edit: These changes are now in effect.

Starting next week (~April 20th) I will be introducing some changes to the daily report CSVs and cleaned data (i.e., CSSE_DailyReports). Most of these changes are intended to address frequently mentioned issues pertaining to CSSEGISandData's COVID-19 data. The changes WILL NOT affect variable names and SHOULD NOT break anyone's code. The guiding philosophy here is to provide an update that addresses obvious issues while ensuring a minimal amount of change to data structure. Incoming changes are documented below as a heads up.

Daily Reports (CSVs):

  • Active cases will be recalculated (i.e., Active = Confirmed - Deaths - Recoveries) to correct errors and to replace missing values in older daily reports. A sanity check will also ensure that active cases are never below zero; where JHU reports negative active cases, the value will be recorded as missing.
  • A consistent naming scheme will be enforced for values in Country_Region and Province_State so that each location has a unique name. For example, "Korea, South" and "Republic of Korea" will become "South Korea" across all CSVs.
  • Data cleaning will also address various inconsistencies found in Province_State, such as values that mix provinces and states with cities and counties (e.g., "Los Angeles, CA"). For US data these values will be split into Admin2 (e.g., "Los Angeles") and Province_State (e.g., "California").
  • An updated Combined_Key will be provided that addresses various inconsistencies (e.g., "France" and ",,France").
  • Data from JHU's Lookup Table will be matched to daily reports. This task has the following intended effects:
    • Values for Latitude and Longitude will be matched to regions, replacing missing values for older daily reports and ensuring that coordinates are consistent for each region (addressing known issues with countries having conflicting coordinates).
    • FIPS codes in JHU's Lookup Table will be fixed (to address known issues pertaining to leading zeros) and then mapped to daily reports.
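
As a rough illustration of the daily-report steps listed above, here is a minimal Python sketch (standard library only). It is not the code used in this repository, which is R-based. The column names `Confirmed`, `Deaths`, `Recovered`, and `Country_Region` are assumed to match the daily reports, the alias table holds only the two examples named above, and treating a missing input as a missing `Active` value is an assumption of this sketch.

```python
# Alias table: only the two aliases named in the notice; a real table would be longer.
COUNTRY_ALIASES = {
    "Korea, South": "South Korea",
    "Republic of Korea": "South Korea",
}


def to_int(value):
    """Parse a count cell; blank or missing cells become None."""
    if value is None or str(value).strip() == "":
        return None
    return int(float(value))


def recalc_active(confirmed, deaths, recovered):
    """Active = Confirmed - Deaths - Recovered.

    Returns None (missing) if any input is missing or the result is negative.
    """
    if None in (confirmed, deaths, recovered):
        return None
    active = confirmed - deaths - recovered
    return active if active >= 0 else None


def pad_fips(fips, width=5):
    """Restore leading zeros lost when FIPS was read as a number.

    width=5 assumes a county-level code; state-level codes use width=2.
    """
    if fips is None or str(fips).strip() == "":
        return None
    return str(int(float(fips))).zfill(width)


def clean_row(row):
    """Return a copy of one daily-report row with a unified country name
    and a recalculated Active value."""
    cleaned = dict(row)
    country = row.get("Country_Region", "")
    cleaned["Country_Region"] = COUNTRY_ALIASES.get(country, country)
    cleaned["Active"] = recalc_active(
        to_int(row.get("Confirmed")),
        to_int(row.get("Deaths")),
        to_int(row.get("Recovered")),
    )
    return cleaned
```

Returning `None` rather than clamping negative results to zero mirrors the stated intent of reporting such cases as missing values.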

Cleaned Data:

  • CSSE_DailyReports will concatenate the updated CSV files.
  • Geographic codes from JHU's Lookup Table will be matched to this file, including UID, ISO alpha-2, ISO alpha-3, ISO 3-digit, and FIPS codes where applicable.
  • Population statistics will be added to each region based on JHU's Lookup table.
  • In addition to .Rdata files, CSSE_DailyReports data will also include a regularly updated CSV file for non-R users.
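
For readers who want to reproduce the lookup matching on their own copy of the data, the following is a hedged sketch in plain Python. It assumes that both the report rows and the lookup rows carry a `Combined_Key` column to join on, and that the lookup columns are named `UID`, `iso2`, `iso3`, `code3`, `FIPS`, and `Population`; check these against the actual files before relying on them. The function names are hypothetical.

```python
# Assumed lookup column names; verify against the Lookup Table you are using.
LOOKUP_FIELDS = ("UID", "iso2", "iso3", "code3", "FIPS", "Population")


def build_lookup_index(lookup_rows, key="Combined_Key"):
    """Index lookup rows (dicts) by their join key."""
    return {row[key]: row for row in lookup_rows}


def attach_lookup(report_rows, lookup_index, key="Combined_Key"):
    """Left-join lookup fields onto report rows.

    Rows without a match keep their existing value for a field,
    or get None if they had no such field.
    """
    merged = []
    for row in report_rows:
        match = lookup_index.get(row.get(key), {})
        out = dict(row)
        for field in LOOKUP_FIELDS:
            out[field] = match.get(field, row.get(field))
        merged.append(out)
    return merged
```

Rows can be read from the CSV files with `csv.DictReader` and passed in as lists of dicts. A left join keeps every report row, so unmatched regions stay visible instead of being dropped.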

Lucas-Czarnecki added the documentation label Apr 17, 2020
Lucas-Czarnecki pinned this issue Apr 17, 2020
Lucas-Czarnecki added a commit that referenced this issue Apr 22, 2020
The update applies multiple data cleaning operations to daily reports and time-series data. See #2

@Lucas-Czarnecki (Owner, Author) commented:

I have addressed inconsistencies in JHU's older daily reports that contained both states and counties in Province_State (e.g., "Province_State: Los Angeles, CA"). The cleaned data splits these values into Admin2 and Province_State (e.g., "Admin2: Los Angeles" and "Province_State: California"). These changes effectively mean that older daily reports are now consistent with JHU's most recent uploads :)
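
To make the split concrete, here is a small Python sketch of the idea. It is an illustration only, not the repository's implementation. The abbreviation table is deliberately partial and hypothetical, and leaving unrecognized values untouched is an assumption of the sketch.

```python
# Partial, illustrative table; a full cleaner would need every US state and territory.
STATE_ABBREVIATIONS = {
    "CA": "California",
    "IL": "Illinois",
    "MA": "Massachusetts",
    "WA": "Washington",
}


def split_province_state(value):
    """Return (Admin2, Province_State) for a value like 'Los Angeles, CA'.

    Values without a comma, or with an unknown suffix, are returned
    unchanged as (None, value).
    """
    if "," not in value:
        return None, value.strip()
    place, _, abbrev = value.rpartition(",")
    state = STATE_ABBREVIATIONS.get(abbrev.strip())
    if state is None:
        return None, value.strip()
    return place.strip(), state
```

Under these assumptions, `split_province_state("Los Angeles, CA")` yields `("Los Angeles", "California")`, while a plain `"California"` passes through as `(None, "California")`.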

However, JHU used to report on various municipalities before committing to reporting according to FIPS. As a result, some of the older daily reports still refer to municipalities (e.g., Boston, Seattle, Chicago) instead of their counties, and those rows have no FIPS code or other values such as Latitude and Longitude. While some of these cases seem to have an easy fix, I will not make such changes until I am certain that they will not cause any unintended consequences. Keeping the data in its present form may also help find and address serious gaps in JHU's reporting (e.g., see "Suffolk" versus "Suffolk County").

Note that other countries present similar problems. With Canada, for example, JHU used to report data on municipalities/provinces (e.g., Calgary, AB and Edmonton, AB) before committing to provinces (e.g., Alberta). As with US data, I am keeping the data in a format that records JHU's original intentions. Note that if you want to aggregate data on a provincial level you must combine daily cases from cities like Calgary and Edmonton.
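
If you do need provincial totals, one possible approach is sketched below in Python. The city-to-province mapping contains only the two cities mentioned above, and the `Date` column name and the function name are hypothetical; adapt them to whatever date field your copy of the data uses.

```python
from collections import defaultdict

# Hypothetical mapping from older city-level values to their province.
CITY_TO_PROVINCE = {
    "Calgary, AB": "Alberta",
    "Edmonton, AB": "Alberta",
}


def provincial_totals(rows, field="Confirmed", country="Canada"):
    """Sum `field` per (Date, province), folding mapped cities into their province.

    Blank counts are treated as zero; rows for other countries are skipped.
    """
    totals = defaultdict(int)
    for row in rows:
        if row.get("Country_Region") != country:
            continue
        raw = row["Province_State"]
        province = CITY_TO_PROVINCE.get(raw, raw)
        totals[(row["Date"], province)] += int(row[field] or 0)
    return dict(totals)
```

Grouping by date as well as province keeps each daily report separate, so city rows from an older report are never added to province rows from a later one.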
