Soup #7
multi_imbalance/resampling/SOUP.py
Outdated
"""
Similarity Oversampling and Undersampling Preprocessing (SOUP) is an algorithm that equalizes number of samples
in each class. It also takes care of the similarity between classes, which means that it removes samples from
majority class, that are close to samples from the other class and duplicate samples from th minority classes,
You have a typo in this line: "th" should be "the".
done
@@ -0,0 +1,163 @@
{
Examples in ipynb format are a good idea! 👍
thanks! ;)
multi_imbalance/utils/plot.py
Outdated
def plot_multi_dimensional_data(X, y, ax=None):
    """
    This function reduce quantity of dimensions to 2 principal components and prepare pretty scatter plot for your data
Should be "reduces" and "prepares".
done
multi_imbalance/utils/plot.py
Outdated
y = pd.DataFrame({'y': y})

X_df = pd.DataFrame(data=X, columns=['x1', 'x2'])
df = pd.concat([X_df, y], axis=1)
Why not extract the data preparation to a separate method?
I fully agree that manipulating data inside a plotting function was a bad idea. I changed this function so that it only prepares the data, and moved the rest to the notebook.
multi_imbalance/resampling/SOUP.py
Outdated
for sample_id in indices_in_class:
    neighbours_indices = self.neigh_clf.kneighbors([list(X[sample_id])], return_distance=False)
    neighbours_classes = y[neighbours_indices[0]]
    neighbours_quantities = Counter(neighbours_classes)
The body of the loop could possibly be extracted as a method
I agree. I extracted it, but decided not to write unit tests for a function that only calls the sklearn k-NN and the built-in Counter: nothing to test ;)
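For readers following along, the extraction discussed here could look roughly like the sketch below. The helper name `neighbour_class_counts` and the `FakeNeighbours` stub are hypothetical and not taken from the PR; the stub only mimics the `kneighbors(..., return_distance=False)` call from the diff so the snippet runs without scikit-learn.

```python
from collections import Counter


def neighbour_class_counts(neigh_clf, X, y, sample_id):
    """Count the classes among the nearest neighbours of X[sample_id].

    Hypothetical helper name; the body mirrors the loop body in the diff.
    """
    # kneighbors expects a 2-D query, hence the one-element list.
    neighbours_indices = neigh_clf.kneighbors([list(X[sample_id])], return_distance=False)
    # Element-wise indexing works for plain lists as well as NumPy arrays.
    neighbours_classes = [y[i] for i in neighbours_indices[0]]
    return Counter(neighbours_classes)


class FakeNeighbours:
    """Hypothetical stand-in for a fitted k-NN model; always returns the same indices."""

    def __init__(self, indices):
        self.indices = indices

    def kneighbors(self, X, return_distance=True):
        # One row of neighbour indices per query sample.
        return [self.indices for _ in X]


X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
y = ["a", "a", "b", "b"]
counts = neighbour_class_counts(FakeNeighbours([0, 1, 2]), X, y, sample_id=0)
print(counts)
```

Passing the neighbour model in as an argument is one possible design; in the PR the model lives on `self.neigh_clf`, so the extracted method would more likely read it from the instance.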
@@ -0,0 +1,140 @@
from collections import Counter, defaultdict
Please test with invalid data.
done
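The thread does not show which invalid inputs the added tests cover. As an illustration only, a check of this kind might look like the sketch below. `check_inputs` and the three failure cases are assumptions, not the validation actually used in SOUP.py.

```python
def check_inputs(X, y):
    """Hypothetical validation helper; not taken from the PR."""
    if len(X) == 0:
        raise ValueError("X must contain at least one sample")
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    if len(set(y)) < 2:
        raise ValueError("y must contain at least two classes")


def raises_value_error(X, y):
    """Return True if check_inputs rejects the given data."""
    try:
        check_inputs(X, y)
    except ValueError:
        return True
    return False


# Each entry is an (X, y) pair that the hypothetical check should reject.
invalid_cases = {
    "empty": ([], []),
    "length mismatch": ([[0, 0], [1, 1]], [0]),
    "single class": ([[0, 0], [1, 1]], [0, 0]),
}
results = {name: raises_value_error(X, y) for name, (X, y) in invalid_cases.items()}

# A well-formed two-class dataset should pass the check.
valid_ok = not raises_value_error([[0, 0], [1, 1]], [0, 1])
print(results, valid_ok)
```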
multi_imbalance/resampling/SOUP.py
Outdated
undersampled_X, undersampled_y = list(), list()
for idx, _ in safe_levels_list:
    undersampled_X.append(X[idx])
    undersampled_y.append(y[idx])
Use a comprehension here.
done
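For reference, the loop in the diff above can be collapsed into two list comprehensions. This is a minimal sketch with toy data. The variable names come from the diff, but the contents of `safe_levels_list` (pairs of sample index and safe level) are an assumption.

```python
# Toy data standing in for the variables in the diff.
X = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
y = [0, 0, 1, 1]
# Assumed shape: (sample index, safe level) pairs; only the index is used.
safe_levels_list = [(2, 0.9), (0, 0.4)]

# One comprehension per output list replaces the append loop.
undersampled_X = [X[idx] for idx, _ in safe_levels_list]
undersampled_y = [y[idx] for idx, _ in safe_levels_list]
print(undersampled_X, undersampled_y)
```

If `X` and `y` are NumPy arrays, indexing with a list of indices (`X[indices]`) would be another option, at the cost of returning arrays instead of lists.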
setup.py
Outdated
@@ -21,5 +21,8 @@
install_requires=[
    "numpy>=1.17.0",
    "scikit-learn>=0.21.3",
    "pandas",
    "seaborn",
Add version constraints.
done
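The versions that were actually added are not shown in this thread. The sketch below shows what a fully constrained list could look like, plus a small check that flags entries without a version specifier. The lower bounds for pandas and seaborn are placeholders chosen for illustration, and `unpinned` is a hypothetical helper.

```python
import re

install_requires = [
    "numpy>=1.17.0",
    "scikit-learn>=0.21.3",
    "pandas>=0.25.0",   # placeholder bound, not the one used in the PR
    "seaborn>=0.9.0",   # placeholder bound, not the one used in the PR
]

# Package name followed by a comparison operator and a version that starts with a digit.
VERSION_SPEC = re.compile(r"^[A-Za-z0-9_.\-]+\s*(==|>=|<=|~=|!=|<|>)\s*[0-9]")


def unpinned(requirements):
    """Return the requirements that carry no version specifier."""
    return [req for req in requirements if not VERSION_SPEC.match(req)]


print(unpinned(install_requires))
print(unpinned(["pandas", "seaborn>=0.9.0"]))
```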
examples/resampling/SOUP.ipynb
Outdated
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline\n",
"rc = {'text.color':'white','axes.labelcolor':'white', 'xtick.color':'white','ytick.color':'white'}\n",
Remove it
done
…ng in soup example and added tests for invalid data
#6