Implementation Ideas for #23 Data Imputations #62
Replies: 3 comments 1 reply
-
The first thing that comes to mind for the uniformly distributed features is this: we create a Set of the observed values of a feature and randomly pick one of them for each missing entry.
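A minimal sketch of that idea (class and method names here are hypothetical, not the project's actual API): collect the non-missing values of a column and draw one at random for every missing cell.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomImputer {
    // Replaces each null entry in the column with a randomly chosen observed value.
    static List<Double> impute(List<Double> column, Random random) {
        List<Double> observed = new ArrayList<>();
        for (Double v : column) {
            if (v != null) observed.add(v);
        }
        List<Double> result = new ArrayList<>();
        for (Double v : column) {
            result.add(v != null ? v : observed.get(random.nextInt(observed.size())));
        }
        return result;
    }
}
```

Passing in the `Random` keeps the strategy testable with a fixed seed.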
-
I don't know how deep you want to dive into data analysis in this project, but I think we should set the imputation strategy for every column separately. One column might contain values like 1, 2, 6, 9, 4, ... where the average works well, while another numerical column contains values like 1, 2, 1, 3, 3000, 4, 3, 2. No, the 3000 is not a typo; for that second column we cannot use the average and should filter out instead (this could be another implementation where you filter out not only missing values but values based on some criterion). So having separate imputation strategies just for numerical versus categorical data is not enough if we want to do this precisely. If we don't, then I don't think it really matters if we filter out categorical values and average numerical values in one handler.
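To make the 3000 example concrete (plain Java, no project API involved): the outlier drags the mean far away from every typical value, while filtering it out first gives a usable fill value. The threshold of 100 below is an arbitrary illustration, not a proposed default.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class OutlierDemo {
    static double mean(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        List<Integer> column = Arrays.asList(1, 2, 1, 3, 3000, 4, 3, 2);
        // Naive mean is dominated by the outlier: 3016 / 8 = 377.0
        System.out.println(mean(column));
        // Filtering on some criterion first yields a sensible mean: 16 / 7 ≈ 2.29
        List<Integer> filtered = column.stream()
                .filter(v -> v < 100)
                .collect(Collectors.toList());
        System.out.println(mean(filtered));
    }
}
```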
-
Hi,

```java
var dataset = seabornDataProcessor.loadDataSetFromCSV(csvFile, ',', SHUFFLE);
seabornDataProcessor
    .performListwiseDeletion("2")
    .imputation("Subspecies", Imputation.MODE)
    .imputation("1", Imputation.AVERAGE)
    .normalize();
```

loadDataSetFromCSV returns a list of the actual objects (a List<T>) because of the mapToDataRecord call inside loadDataSetFromCSV. To alter the objects in that list we would need to know the actual getters of the features (but as far as I understand, those are set up outside of our API). So I propose something like this:

```java
List<String[]> rawDataset = seabornDataProcessor.loadDataSetFromCSV(csvFile, ',');
List<T> dataset = rawDataset
    .performListwiseDeletion("2")
    .imputation("Subspecies", Imputation.MODE)
    .imputation("1", Imputation.AVERAGE)
    .mapToDataRecord()
    .shuffle()
    .normalize();
```

We load the raw dataset -> perform the deletions and imputations on the raw dataset -> map the raw dataset to data records -> shuffle and normalize.
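One way that raw-dataset stage could be sketched (all class and method names below are hypothetical, not the project's actual API, and columns are addressed by index for simplicity): a thin fluent wrapper over `List<String[]>` that cleans the data before it is mapped to typed records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical fluent wrapper over the raw CSV rows; not the project's real API.
public class RawDataset {
    private final List<String[]> rows;

    public RawDataset(List<String[]> rows) {
        this.rows = new ArrayList<>(rows);
    }

    // Drops every row with a missing (null or empty) value in the given column.
    public RawDataset performListwiseDeletion(int columnIndex) {
        rows.removeIf(r -> r[columnIndex] == null || r[columnIndex].isEmpty());
        return this;
    }

    // Fills missing cells in the column with a value computed by the strategy
    // (e.g. the column's mode or average) from the current rows.
    public RawDataset imputation(int columnIndex, Function<List<String[]>, String> strategy) {
        String fill = strategy.apply(rows);
        for (String[] r : rows) {
            if (r[columnIndex] == null || r[columnIndex].isEmpty()) r[columnIndex] = fill;
        }
        return this;
    }

    // Maps the cleaned raw rows to typed records; mapping happens after cleaning.
    public <T> List<T> mapToDataRecord(Function<String[], T> mapper) {
        List<T> out = new ArrayList<>();
        for (String[] r : rows) out.add(mapper.apply(r));
        return out;
    }
}
```

Because deletion and imputation operate on `String[]` rows, no feature getters are needed until `mapToDataRecord` runs at the end, which is the point of the proposal.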
-
We introduce a new enum, Imputation.
Example Dataset
Strategy
Average Strategy
Can be applied only to numerical values.
Numerical Imputation
Imputation in column Age:
(3 + 4 + 5 + 2 + 3 + 4 + 6) / 7 = 3.86 (rounded: 4)
Imputation in column Weight:
(220 + 210 + 230 + 240 + 200 + 245 + 223) / 7 = 224
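The average strategy above, as a minimal sketch (helper name hypothetical): compute the mean over the present values only, then use it as the fill value for the missing cells.

```java
import java.util.Arrays;
import java.util.List;

public class AverageImputation {
    // Mean over the non-null entries only; the result becomes the fill value.
    static double averageOfPresent(List<Double> column) {
        return column.stream()
                .filter(v -> v != null)
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // The Age column from the example, with one missing entry.
        List<Double> age = Arrays.asList(3.0, 4.0, 5.0, null, 2.0, 3.0, 4.0, 6.0);
        System.out.printf("%.2f%n", averageOfPresent(age)); // 27 / 7 ≈ 3.86
    }
}
```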
Filterout Imputation
Can be applied to both numerical and categorical values. All rows with at least one missing value need to be dropped.
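A sketch of the filter-out strategy over raw rows (method name hypothetical): keep only the rows where every cell is present, regardless of whether the columns are numerical or categorical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilteroutImputation {
    // Keeps only the rows where every cell is present (non-null, non-empty).
    static List<String[]> dropIncompleteRows(List<String[]> rows) {
        return rows.stream()
                .filter(r -> Arrays.stream(r).allMatch(c -> c != null && !c.isEmpty()))
                .collect(Collectors.toList());
    }
}
```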
Mode Strategy
Can be applied to both numerical and categorical values.
We choose 3 for Age because 3 is the most common value.
We choose Healthy because it is the most common value.
How do we handle Weight here? Every Weight value in the example occurs only once, so there is no single most common value.
How do we handle Subspecies here?
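Both questions come down to ties: when the maximum count is shared (or every value is unique, as in the Weight column), the mode is ambiguous. One possible answer, sketched with a hypothetical helper, is to return an empty Optional in that case and let the caller fall back to another strategy:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class ModeImputation {
    // Returns the most common value, or empty when the maximum count is shared
    // by more than one value (no unambiguous mode).
    static <T> Optional<T> mode(List<T> values) {
        Map<T, Integer> counts = new HashMap<>();
        for (T v : values) counts.merge(v, 1, Integer::sum);
        int max = counts.values().stream().max(Integer::compare).orElse(0);
        List<T> winners = counts.entrySet().stream()
                .filter(e -> e.getValue() == max)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        return winners.size() == 1 ? Optional.of(winners.get(0)) : Optional.empty();
    }
}
```

Being generic over `T`, the same helper covers numerical and categorical columns alike.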
API Questions
How about optionally passing up to two enums, one for numerical and one for categorical values?
API Proposal
We extend the abstract class DataPostProcessor with an imputation method.
We implement the new imputation method in DataProcessor.
Usage
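No snippet survived under this heading in the export, so here is a hedged sketch of how the two-enum idea could dispatch per column kind. Everything below (class, enum values, helper) is hypothetical and only illustrates the proposal, not the real DataProcessor code.

```java
// Hypothetical sketch of the proposed API surface; real project names may differ.
public class ImputationUsage {
    enum Imputation { AVERAGE, MODE, FILTEROUT }

    // The "up to two enums" idea: numerical columns get the first strategy,
    // categorical columns the second.
    static Imputation strategyFor(boolean numericalColumn,
                                  Imputation numerical, Imputation categorical) {
        return numericalColumn ? numerical : categorical;
    }

    public static void main(String[] args) {
        Imputation forAge = strategyFor(true, Imputation.AVERAGE, Imputation.MODE);
        Imputation forSubspecies = strategyFor(false, Imputation.AVERAGE, Imputation.MODE);
        System.out.println(forAge + " " + forSubspecies); // AVERAGE MODE
    }
}
```

Per-column configuration from the earlier comment would instead look like chained calls such as `imputation("Subspecies", Imputation.MODE)`, as shown in the thread above.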