Add a script to automatically merge multiple .csv files and deal with duplicates #65

ggael · 2022-09-20T12:51:58Z

We need a dedicated tool to merge multiple .csv files while detecting and merging duplicates.

I've started to implement it through a new static method of DeviceCarbonFootprint:

@staticmethod
def merge(device1: 'DeviceCarbonFootprint', device2: 'DeviceCarbonFootprint',
          conflict: Literal['keep2nd', 'interactive'] = 'keep2nd', verbose: bool = False) -> 'DeviceCarbonFootprint':

and a merge_csv.py file1 file2 standalone script written on top of the above merge function.

By default, priority is given to device2/file2.

Conflicts are detected only for attributes that are provided for both devices and whose values are clearly different. If the values are close enough, then merge only prints a warning in verbose mode.

Then, there are two modes to resolve the conflicts:

Simply keep device2 (and print the differences in verbose mode)
Ask the user which version should be kept.
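The detection rule and the two resolution modes described above can be sketched on plain dictionaries. This is only an illustration, not the actual DeviceCarbonFootprint.merge code: the helper names (values_conflict, merge_records), the 5% relative tolerance used for "close enough", and the attribute names in the usage example are all assumptions.

```python
import math
from typing import Any, Callable, Dict, Literal


def values_conflict(a: Any, b: Any, rel_tol: float = 0.05) -> bool:
    """Return True when two provided values are clearly different.

    The 5% tolerance is an assumed threshold, chosen only for illustration.
    """
    if isinstance(a, bool) or isinstance(b, bool):
        return a != b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return not math.isclose(a, b, rel_tol=rel_tol)
    # Non-numeric values: compare as case-insensitive, trimmed strings.
    return str(a).strip().lower() != str(b).strip().lower()


def merge_records(
    rec1: Dict[str, Any],
    rec2: Dict[str, Any],
    conflict: Literal["keep2nd", "interactive"] = "keep2nd",
    verbose: bool = False,
    ask: Callable[[str], str] = input,
) -> Dict[str, Any]:
    """Merge two attribute dictionaries, giving priority to rec2 by default."""
    merged: Dict[str, Any] = {}
    for key in list(rec1) + [k for k in rec2 if k not in rec1]:
        v1, v2 = rec1.get(key), rec2.get(key)
        if v1 is None or v2 is None:
            # Attribute provided by at most one record: no conflict possible.
            merged[key] = v2 if v1 is None else v1
        elif v1 == v2:
            merged[key] = v2
        elif not values_conflict(v1, v2):
            # Close enough: keep the second value, warn only in verbose mode.
            if verbose:
                print(f"warning: {key}: {v1!r} ~ {v2!r}")
            merged[key] = v2
        elif conflict == "interactive":
            answer = ask(f"{key}: keep 1) {v1!r} or 2) {v2!r}? ")
            merged[key] = v1 if answer.strip() == "1" else v2
        else:
            # 'keep2nd': keep the second value, report it in verbose mode.
            if verbose:
                print(f"conflict: {key}: {v1!r} replaced by {v2!r}")
            merged[key] = v2
    return merged


# Hypothetical records; the attribute names are made up for this example.
old = {"name": "X", "gwp_total": 100.0, "country": "China"}
new = {"name": "X", "gwp_total": 101.0, "country": "USA", "weight": 2.0}
result = merge_records(old, new)
```

Passing the prompt function as a parameter (ask) keeps the interactive mode testable without a terminal.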

TODO:

Add a non-regression mode only testing that device2 is consistent with device1 and that device1 does not contain more information.
Clean up and unify some entries prior to fusion to avoid false negatives (e.g., CN versus China; see issue #64, "Unify location names")
Find a way to deal with PCF files that report the same model name for devices that are not actually the same (in ecodiag I also extract the model name from the main html files)
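For the location clean-up item, one possible approach is to map every known spelling to a canonical name before comparing, so that CN and China no longer compare as different. This is a minimal sketch: the alias table and the normalize_location helper are hypothetical, and the real list of aliases would have to come from the data (see issue #64).

```python
from typing import Dict, Optional

# Hypothetical alias table: lower-cased spelling -> canonical name.
_LOCATION_ALIASES: Dict[str, str] = {
    "cn": "China",
    "china": "China",
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
}


def normalize_location(raw: Optional[str]) -> Optional[str]:
    """Return a canonical location name, or None for a missing/empty value."""
    if raw is None:
        return None
    key = raw.strip().lower()
    if not key:
        return None
    # Unknown spellings are kept (trimmed) rather than guessed at.
    return _LOCATION_ALIASES.get(key, raw.strip())
```

Applying such a function to both inputs before the merge would leave the conflict detection itself unchanged.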


ggael · 2022-09-21T09:32:01Z

Some updates: merge_csv.py now also prints a summary report like this:

PYTHONPATH=. python tools/merge_csv.py boavizta-data-us.csv dell.csv  -o /dev/null

------------------------------------------------------------
| Summary report                                           |
------------------------------------------------------------
Number of singletons: 1235, 26
Number of self duplicates: 174, 2
Number of clean fusions: 455
Number of mixed fusions: 42
Number of attributes gathered from the oldest data: 122
------------------------------------------------------------

which is handy to quickly see if there are any issues. For instance, this report means that 1235 items of boavizta-data-us.csv are not present in dell.csv, and 26 items are present in dell.csv but not in the current db. The current db contains 174 items having one (or more) duplicates (*). Among the items that are in both files, 455 are fully covered by dell.csv, but for 42 items we found attributes in boavizta-data-us.csv that are not present in dell.csv.

(*) So far duplicates are detected solely based on the model name. This implies some false positives.
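To make the first two rows of the report concrete, here is a rough sketch of how such counts could be derived when items are matched on the model name alone. It is not the code of merge_csv.py: the summarize function is hypothetical, and counting distinct model names (rather than rows) is an assumption. The clean/mixed fusion counts are left out because they depend on the per-attribute comparison.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize(names1: Iterable[str], names2: Iterable[str]) -> Dict[str, object]:
    """Count singletons, self duplicates, and shared model names."""
    c1, c2 = Counter(names1), Counter(names2)
    singletons: Tuple[int, int] = (
        len(c1.keys() - c2.keys()),  # names only in the first file
        len(c2.keys() - c1.keys()),  # names only in the second file
    )
    self_duplicates: Tuple[int, int] = (
        sum(1 for n in c1.values() if n > 1),  # names repeated within file 1
        sum(1 for n in c2.values() if n > 1),  # names repeated within file 2
    )
    return {
        "singletons": singletons,
        "self_duplicates": self_duplicates,
        "in_both": len(c1.keys() & c2.keys()),
    }


report = summarize(["A", "A", "B", "C"], ["C", "D"])
# report["singletons"] is (2, 1): A and B are only in the first list, D only in the second.
```

Because the key is the model name only, two different devices sharing a name are counted as duplicates, which is the source of the false positives mentioned above.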

ggael pushed a commit that referenced this issue Sep 20, 2022

Issue #65: add initial versions of a merge function and merge_csv script.

d7b4d66
