Implementation Ideas for #23 Data Imputations #62
Replies: 3 comments 1 reply
-
The first thing that comes to mind for the uniformly distributed features is this: we create a Set of the observed values of a feature and randomly pick one of them for each missing entry.
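A minimal sketch of that idea (class and method names here are hypothetical, not the project's actual API): collect the non-missing values of a column and draw one at random for every missing cell.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomImputer {
    // Replaces each null entry in the column with a randomly chosen observed value.
    static List<Double> impute(List<Double> column, Random random) {
        List<Double> observed = new ArrayList<>();
        for (Double v : column) {
            if (v != null) observed.add(v);
        }
        List<Double> result = new ArrayList<>();
        for (Double v : column) {
            result.add(v != null ? v : observed.get(random.nextInt(observed.size())));
        }
        return result;
    }
}
```

Passing in the `Random` keeps the strategy testable with a fixed seed.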
-
I don't know how deep you want to dive into data analysis in this project, but I think we should set the imputation strategy for every column separately. One column might contain values like 1, 2, 6, 9, 4, ... where the average works well, while another numerical column contains values like 1, 2, 1, 3, 3000, 4, 3, 2. No, the 3000 is not a typo; for that second column we cannot use the average and should filter out instead (this could be another implementation where you filter out not only missing values but values based on some criterion). So having separate imputation strategies just for numerical versus categorical data is not enough if we want to do this precisely. If we don't, then I don't think it really matters if we filter out categorical values and average numerical values in one handler.
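To make the 3000 example concrete (plain Java, no project API involved): the outlier drags the mean far away from every typical value, while filtering it out first gives a usable fill value. The threshold of 100 below is an arbitrary illustration, not a proposed default.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class OutlierDemo {
    static double mean(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        List<Integer> column = Arrays.asList(1, 2, 1, 3, 3000, 4, 3, 2);
        // Naive mean is dominated by the outlier: 3016 / 8 = 377.0
        System.out.println(mean(column));
        // Filtering on some criterion first yields a sensible mean: 16 / 7 ≈ 2.29
        List<Integer> filtered = column.stream()
                .filter(v -> v < 100)
                .collect(Collectors.toList());
        System.out.println(mean(filtered));
    }
}
```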
-
Hi,

```java
var dataset = seabornDataProcessor.loadDataSetFromCSV(csvFile, ',', SHUFFLE);
seabornDataProcessor
    .performListwiseDeletion("2")
    .imputation("Subspecies", Imputation.MODE)
    .imputation("1", Imputation.AVERAGE)
    .normalize();
```

loadDataSetFromCSV returns a list of the actual objects (a List<T>) because of the mapToDataRecord call inside loadDataSetFromCSV. To alter the objects in that list we would need to know the actual getters of the features (but as far as I understand, those are set up outside of our API). So I propose something like this:

```java
List<String[]> rawDataset = seabornDataProcessor.loadDataSetFromCSV(csvFile, ',');
List<T> dataset = rawDataset
    .performListwiseDeletion("2")
    .imputation("Subspecies", Imputation.MODE)
    .imputation("1", Imputation.AVERAGE)
    .mapToDataRecord()
    .shuffle()
    .normalize();
```

We load the raw dataset -> perform the deletions and imputations on the raw dataset -> map the raw dataset to data records -> shuffle and normalize.
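One way that raw-dataset stage could be sketched (all class and method names below are hypothetical, not the project's actual API, and columns are addressed by index for simplicity): a thin fluent wrapper over `List<String[]>` that cleans the data before it is mapped to typed records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical fluent wrapper over the raw CSV rows; not the project's real API.
public class RawDataset {
    private final List<String[]> rows;

    public RawDataset(List<String[]> rows) {
        this.rows = new ArrayList<>(rows);
    }

    // Drops every row with a missing (null or empty) value in the given column.
    public RawDataset performListwiseDeletion(int columnIndex) {
        rows.removeIf(r -> r[columnIndex] == null || r[columnIndex].isEmpty());
        return this;
    }

    // Fills missing cells in the column with a value computed by the strategy
    // (e.g. the column's mode or average) from the current rows.
    public RawDataset imputation(int columnIndex, Function<List<String[]>, String> strategy) {
        String fill = strategy.apply(rows);
        for (String[] r : rows) {
            if (r[columnIndex] == null || r[columnIndex].isEmpty()) r[columnIndex] = fill;
        }
        return this;
    }

    // Maps the cleaned raw rows to typed records; mapping happens after cleaning.
    public <T> List<T> mapToDataRecord(Function<String[], T> mapper) {
        List<T> out = new ArrayList<>();
        for (String[] r : rows) out.add(mapper.apply(r));
        return out;
    }
}
```

Because deletion and imputation operate on `String[]` rows, no feature getters are needed until `mapToDataRecord` runs at the end, which is the point of the proposal.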
-
We introduce a new enum, Imputation.
Example Dataset
Strategy
Average Strategy
Can be applied only to numerical values.
Numerical Imputation
Imputation in column Age:
(3 + 4 + 5 + 2 + 3 + 4 + 6) / 7 = 3.86 (rounded: 4)
Imputation in column Weight:
(220 + 210 + 230 + 240 + 200 + 245 + 223) / 7 = 224
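The average strategy above, as a minimal sketch (helper name hypothetical): compute the mean over the present values only, then use it as the fill value for the missing cells.

```java
import java.util.Arrays;
import java.util.List;

public class AverageImputation {
    // Mean over the non-null entries only; the result becomes the fill value.
    static double averageOfPresent(List<Double> column) {
        return column.stream()
                .filter(v -> v != null)
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // The Age column from the example, with one missing entry.
        List<Double> age = Arrays.asList(3.0, 4.0, 5.0, null, 2.0, 3.0, 4.0, 6.0);
        System.out.printf("%.2f%n", averageOfPresent(age)); // 27 / 7 ≈ 3.86
    }
}
```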
Filterout Imputation
Can be applied to both numerical and categorical values. All rows with at least one missing value need to be dropped.
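A sketch of the filter-out strategy over raw rows (method name hypothetical): keep only the rows where every cell is present, regardless of whether the columns are numerical or categorical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilteroutImputation {
    // Keeps only the rows where every cell is present (non-null, non-empty).
    static List<String[]> dropIncompleteRows(List<String[]> rows) {
        return rows.stream()
                .filter(r -> Arrays.stream(r).allMatch(c -> c != null && !c.isEmpty()))
                .collect(Collectors.toList());
    }
}
```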
Mode Strategy
Can be applied to both numerical and categorical values.
We choose 3 for Age because 3 is the most common value.
We choose Healthy because it is the most common value.
How do we handle Weight here? Every Weight value in the example occurs only once, so there is no single most common value.
How do we handle Subspecies here?
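Both questions come down to ties: when the maximum count is shared (or every value is unique, as in the Weight column), the mode is ambiguous. One possible answer, sketched with a hypothetical helper, is to return an empty Optional in that case and let the caller fall back to another strategy:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

public class ModeImputation {
    // Returns the most common value, or empty when the maximum count is shared
    // by more than one value (no unambiguous mode).
    static <T> Optional<T> mode(List<T> values) {
        Map<T, Integer> counts = new HashMap<>();
        for (T v : values) counts.merge(v, 1, Integer::sum);
        int max = counts.values().stream().max(Integer::compare).orElse(0);
        List<T> winners = counts.entrySet().stream()
                .filter(e -> e.getValue() == max)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        return winners.size() == 1 ? Optional.of(winners.get(0)) : Optional.empty();
    }
}
```

Being generic over `T`, the same helper covers numerical and categorical columns alike.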
API Questions
How about optionally passing up to two enums, one for numerical and one for categorical values?
API Proposal
We extend the abstract class DataPostProcessor with an imputation method.
We implement the new imputation method in DataProcessor.
Usage
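No snippet survived under this heading in the export, so here is a hedged sketch of how the two-enum idea could dispatch per column kind. Everything below (class, enum values, helper) is hypothetical and only illustrates the proposal, not the real DataProcessor code.

```java
// Hypothetical sketch of the proposed API surface; real project names may differ.
public class ImputationUsage {
    enum Imputation { AVERAGE, MODE, FILTEROUT }

    // The "up to two enums" idea: numerical columns get the first strategy,
    // categorical columns the second.
    static Imputation strategyFor(boolean numericalColumn,
                                  Imputation numerical, Imputation categorical) {
        return numericalColumn ? numerical : categorical;
    }

    public static void main(String[] args) {
        Imputation forAge = strategyFor(true, Imputation.AVERAGE, Imputation.MODE);
        Imputation forSubspecies = strategyFor(false, Imputation.AVERAGE, Imputation.MODE);
        System.out.println(forAge + " " + forSubspecies); // AVERAGE MODE
    }
}
```

Per-column configuration from the earlier comment would instead look like chained calls such as `imputation("Subspecies", Imputation.MODE)`, as shown in the thread above.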