Add a script to automatically merge multiple .csv files and deal with duplicates #65

Open
ggael opened this issue Sep 20, 2022 · 1 comment

ggael commented Sep 20, 2022

We need a dedicated tool to merge multiple .csv files while detecting and merging duplicates.

I've started to implement it through a new static method of DeviceCarbonFootprint:

    @staticmethod
    def merge(device1: 'DeviceCarbonFootprint', device2: 'DeviceCarbonFootprint',
              conflict: Literal['keep2nd', 'interactive'] = 'keep2nd', verbose: bool = False) -> 'DeviceCarbonFootprint':

and a merge_csv.py file1 file2 standalone script written on top of the above merge function.

By default, priority is given to device2/file2.

Conflicts are detected only for attributes that are provided for both devices and whose values are clearly different. If the values are close enough, merge only prints a warning in verbose mode.
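
To make the detection rule above concrete, here is a minimal sketch. It is not the actual code of merge: the helper name compare_attribute, the status labels and the 1% relative tolerance are all assumptions chosen for illustration, since the issue does not say how "close enough" is measured.

```python
import math
from typing import Literal, Union

Value = Union[int, float, str, None]
Status = Literal['missing', 'same', 'close', 'conflict']


def compare_attribute(v1: Value, v2: Value, rel_tol: float = 0.01) -> Status:
    """Classify a pair of attribute values (hypothetical helper).

    rel_tol is an assumed threshold; the real tool may use another rule.
    """
    # An attribute missing on either side can never be a conflict.
    if v1 is None or v2 is None:
        return 'missing'
    if v1 == v2:
        return 'same'
    # Numeric values that differ only slightly yield a warning, not a conflict.
    if isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
        if math.isclose(v1, v2, rel_tol=rel_tol):
            return 'close'
    return 'conflict'
```

With this sketch, only the 'conflict' status would trigger one of the resolution modes, while 'close' would map to the verbose-mode warning.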

Then, there are two modes to resolve the conflicts:

  1. Simply keep device2 (and print the differences in verbose mode)
  2. Ask the user which version should be kept.
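
The two modes could be sketched as follows. This is only an illustration under assumptions: resolve_conflict and its ask parameter are hypothetical names, and the prompt text is invented; only the mode names 'keep2nd' and 'interactive' come from the merge signature above.

```python
from typing import Any, Callable, Literal


def resolve_conflict(name: str, v1: Any, v2: Any,
                     conflict: Literal['keep2nd', 'interactive'] = 'keep2nd',
                     verbose: bool = False,
                     ask: Callable[[str], str] = input) -> Any:
    """Pick the value to keep for one conflicting attribute (hypothetical helper)."""
    if conflict == 'keep2nd':
        # Mode 1: device2 wins; the difference is only reported in verbose mode.
        if verbose:
            print(f"{name}: {v1!r} != {v2!r}, keeping {v2!r}")
        return v2
    if conflict == 'interactive':
        # Mode 2: ask the user; anything other than '1' falls back to device2.
        answer = ask(f"{name}: keep [1] {v1!r} or [2] {v2!r}? ").strip()
        return v1 if answer == '1' else v2
    raise ValueError(f"unknown conflict mode: {conflict!r}")
```

Passing the prompt function as a parameter keeps the interactive mode testable without a terminal, e.g. resolve_conflict('gwp_total', 300, 320, conflict='interactive', ask=lambda _: '1').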

TODO:

  1. Add a non-regression mode only testing that device2 is consistent with device1 and that device1 does not contain more information.
  2. Clean up and unify some entries prior to fusion to avoid false negatives (e.g., CN versus China, issue Unify location names #64)
  3. Find a way to deal with PCF files reporting the same model name although they are not the same device (in ecodiag I also extract the model name from the main HTML files)
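
For TODO item 2, a pre-merge cleanup step might look like the sketch below. The alias table and the function name normalize_location are purely illustrative assumptions; the actual naming scheme is the subject of issue #64.

```python
from typing import Optional

# Illustrative alias table (assumption): lowercase spelling -> canonical name.
LOCATION_ALIASES = {
    'cn': 'China',
    'china': 'China',
    'us': 'United States',
    'usa': 'United States',
}


def normalize_location(raw: Optional[str]) -> Optional[str]:
    """Map known spellings of a location to one canonical name."""
    if raw is None:
        return None
    cleaned = raw.strip()
    if not cleaned:
        return None
    # Unknown locations are returned unchanged (apart from trimming).
    return LOCATION_ALIASES.get(cleaned.lower(), cleaned)
```

Applying such a step to both files before comparing attributes would make 'CN' and 'China' compare as equal instead of being reported as different.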

ggael commented Sep 21, 2022

Some updates: merge_csv.py now also prints a summary report like this:

PYTHONPATH=. python tools/merge_csv.py boavizta-data-us.csv dell.csv  -o /dev/null

------------------------------------------------------------
| Summary report                                           |
------------------------------------------------------------
Number of singletons: 1235, 26
Number of self duplicates: 174, 2
Number of clean fusions: 455
Number of mixed fusions: 42
Number of attributes gathered from the oldest data: 122
------------------------------------------------------------

which is handy for quickly spotting issues. For instance, this report means that 1235 items of boavizta-data-us.csv are not present in dell.csv, and 26 items are present in dell.csv but not in the current db. The current db contains 174 items having one (or more) duplicates (*). Among the items that are in both files, 455 are fully covered by dell.csv, but for 42 items we found attributes in boavizta-data-us.csv that are not present in dell.csv.

(*) So far duplicates are detected solely based on the model name. This implies some false positives.
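
As a rough illustration of how the first two report lines relate to model names, the sketch below counts singletons and self duplicates from two lists of model names. It is an assumption-laden approximation, not the logic of merge_csv.py: the function name summarize is hypothetical, names are matched exactly, and the real script may count items differently (for example before or after deduplication).

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(models1: Iterable[str], models2: Iterable[str]) -> Dict[str, object]:
    """Count singletons, self duplicates and shared models by exact model name."""
    c1, c2 = Counter(models1), Counter(models2)
    names1, names2 = set(c1), set(c2)
    return {
        # Model names found in only one of the two files.
        'singletons': (len(names1 - names2), len(names2 - names1)),
        # Model names appearing more than once within the same file.
        'self_duplicates': (sum(1 for n in c1.values() if n > 1),
                            sum(1 for n in c2.values() if n > 1)),
        # Model names present in both files, i.e. candidates for fusion.
        'shared': len(names1 & names2),
    }
```

Because the key is the model name alone, two distinct devices sharing a name are counted as duplicates here too, which is the false-positive case mentioned above.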
