Improve plots and README
phiyodr committed Jan 3, 2023
1 parent 81a8589 commit 0d11c5b
Showing 7 changed files with 83 additions and 74 deletions.
149 changes: 78 additions & 71 deletions README.md
@@ -1,4 +1,4 @@
# Multilabel Oversampling
# Multilabel Oversampling :sunflower:

**Many algorithms for imbalanced data support binary and multiclass classification only.**
**This approach is made for multi-label classification (aka multi-target classification).**
@@ -10,7 +10,7 @@
* Multilabel dataset (as `pandas.DataFrame`) with imbalanced data
* Calculate counts per class and then calculate the standard deviation (std) of the count values
* Repeat the following `number_of_adds` times:
* Randomly draw a sample from your data and calculate new std
* Randomly draw a sample from your data and calculate new std
* If new std reduces, add sample to your dataset
* If not, draw another sample (do this up to `number_of_tries` times)
* A new df is returned.
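
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the package's implementation: the names `label_std` and `oversample` are hypothetical, rows are plain lists of 0/1 labels instead of a `pandas.DataFrame`, candidates are drawn from the original rows, the population standard deviation is used, and the loop stops after the first iteration that finds no improvement. All of these are assumptions.

```python
import random
import statistics


def label_std(rows):
    """Standard deviation of the per-label positive counts (population std, an assumption)."""
    counts = [sum(column) for column in zip(*rows)]
    return statistics.pstdev(counts)


def oversample(rows, number_of_adds=100, number_of_tries=100, seed=0):
    """Greedily duplicate rows as long as the std of the label counts decreases."""
    rng = random.Random(seed)
    result = list(rows)
    current_std = label_std(result)
    for _ in range(number_of_adds):
        for _ in range(number_of_tries):
            candidate = rng.choice(rows)  # assumption: draw from the original rows
            new_std = label_std(result + [candidate])
            if new_std < current_std:
                # The candidate makes the label counts more balanced: keep it.
                result.append(candidate)
                current_std = new_std
                break
        else:
            # No draw reduced the std within number_of_tries: stop early (assumption).
            break
    return result


rows = [[1, 0], [1, 0], [1, 0], [0, 1]]
new_rows = oversample(rows)
print(len(rows), "->", len(new_rows))
```

Using the std of the label counts as the acceptance criterion means a row is only duplicated when it moves the counts closer together, so rows carrying rare labels are favoured.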
@@ -19,15 +19,24 @@
## :arrow_right: Usage

```python
from multilabel_oversampling import multilabel_oversampling as mo
import multilabel_oversampling as mo

df = mo.create_fake_data(size=1, seed=3)
mo.seed_everything(20)
df = mo.create_fake_data(size=1) # difficult fake dataset with very high dependency of y1 and y2
ml_oversampler = mo.MultilabelOversampler(number_of_adds=100, number_of_tries=100)
df_new, plot_at = ml_oversampler.fit(df)
#> Iteration: 20%|██████ | 20/100 [00:00<00:00, 111.68it/s]
#> No improvement after 100 tries in iter 20.
df_new = ml_oversampler.fit(df)
#>Start the upsampling process.
#>Iteration: 11%|████████████████ | 11/100 [00:00<00:01, 48.43it/s]
#>Iter 11: No improvement after 100 tries.
#>Sampling done.
#>
#>Dataset size original: 20; Upsampled dataset size: 31
#>Original target distribution: {'y1': 16, 'y2': 12, 'y3': 4, 'y4': 4}
#>Upsampled target distribution: {'y1': 19, 'y2': 12, 'y3': 15, 'y4': 15}

ml_oversampler.plot_all_tries()
```
![Plot from df_new = ml_oversampler.fit(df)](assets/plot.png)
![Plot from ml_oversampler.plot_all_tries()](assets/plot_all_tries.png)
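
As a quick check of the printed target distributions, the std of the counts can be recomputed by hand. This sketch assumes the population standard deviation (`statistics.pstdev`); whether the package uses the population or the sample formula is not stated here, but the std drops in both cases.

```python
import statistics

# Target distributions as printed by `fit` in the usage example.
original = {"y1": 16, "y2": 12, "y3": 4, "y4": 4}
upsampled = {"y1": 19, "y2": 12, "y3": 15, "y4": 15}

# Population std (assumption) and sample std of the label counts.
std_before = statistics.pstdev(list(original.values()))
std_after = statistics.pstdev(list(upsampled.values()))
sample_before = statistics.stdev(list(original.values()))
sample_after = statistics.stdev(list(upsampled.values()))

print(f"population std: {std_before:.2f} -> {std_after:.2f}")
print(f"sample std:     {sample_before:.2f} -> {sample_after:.2f}")
```

With the population formula the std falls from roughly 5.20 to roughly 2.49.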

```python
ml_oversampler.plot_results()
@@ -41,72 +50,63 @@ ml_oversampler.plot_results()

# Original DataFrame
print(df)
#> y1 y2 y3 y4 x
#> 0 1 1 0 0 img_0.jpg
#> 1 1 1 1 0 img_1.jpg
#> 2 1 1 1 0 img_2.jpg
#> 3 1 1 0 0 img_3.jpg
#> 4 1 1 0 0 img_4.jpg
#> 5 1 1 0 1 img_5.jpg
#> 6 1 1 0 0 img_6.jpg
#> 7 1 1 0 0 img_7.jpg
#> 8 1 1 0 0 img_8.jpg
#> 9 1 1 0 0 img_9.jpg
#> 10 1 1 0 0 img_10.jpg
#> 11 1 1 0 0 img_11.jpg
#> 12 1 0 0 1 img_12.jpg
#> 13 1 0 0 1 img_13.jpg
#> 14 1 0 0 0 img_14.jpg
#> 15 1 0 0 0 img_15.jpg
#> 16 0 0 1 0 img_16.jpg
#> 17 0 0 0 1 img_17.jpg
#> 18 0 0 1 0 img_18.jpg
#> 19 0 0 0 0 img_19.jpg
#> y1 y2 y3 y4 x
#>0 1 1 0 0 img_0.jpg
#>1 1 1 0 0 img_1.jpg
#>2 1 1 0 1 img_2.jpg
#>3 1 1 0 0 img_3.jpg
#>4 1 1 1 0 img_4.jpg
#>5 1 1 0 0 img_5.jpg
#>6 1 1 0 0 img_6.jpg
#>7 1 1 0 0 img_7.jpg
#>8 1 1 0 1 img_8.jpg
#>9 1 1 0 0 img_9.jpg
#>10 1 1 0 0 img_10.jpg
#>11 1 1 0 0 img_11.jpg
#>12 1 0 1 0 img_12.jpg
#>13 1 0 1 1 img_13.jpg
#>14 1 0 0 0 img_14.jpg
#>15 1 0 0 0 img_15.jpg
#>16 0 0 0 0 img_16.jpg
#>17 0 0 0 0 img_17.jpg
#>18 0 0 0 0 img_18.jpg
#>19 0 0 1 1 img_19.jpg


# New DataFrame after upsampling
print(df_new)
#> y1 y2 y3 y4 x
#> 0 1 1 0 0 img_0.jpg
#> 1 1 1 1 0 img_1.jpg
#> 2 1 1 1 0 img_2.jpg
#> 3 1 1 0 0 img_3.jpg
#> 4 1 1 0 0 img_4.jpg
#> 5 1 1 0 1 img_5.jpg
#> 6 1 1 0 0 img_6.jpg
#> 7 1 1 0 0 img_7.jpg
#> 8 1 1 0 0 img_8.jpg
#> 9 1 1 0 0 img_9.jpg
#> 10 1 1 0 0 img_10.jpg
#> 11 1 1 0 0 img_11.jpg
#> 12 1 0 0 1 img_12.jpg
#> 13 1 0 0 1 img_13.jpg
#> 14 1 0 0 0 img_14.jpg
#> 15 1 0 0 0 img_15.jpg
#> 16 0 0 1 0 img_16.jpg
#> 17 0 0 0 1 img_17.jpg
#> 18 0 0 1 0 img_18.jpg
#> 19 0 0 0 0 img_19.jpg
#> 17 0 0 0 1 img_17.jpg
#> 16 0 0 1 0 img_16.jpg
#> 16 0 0 1 0 img_16.jpg
#> 16 0 0 1 0 img_16.jpg
#> 16 0 0 1 0 img_16.jpg
#> 16 0 0 1 0 img_16.jpg
#> 16 0 0 1 0 img_16.jpg
#> 18 0 0 1 0 img_18.jpg
#> 13 1 0 0 1 img_13.jpg
#> 18 0 0 1 0 img_18.jpg
#> 17 0 0 0 1 img_17.jpg
#> 17 0 0 0 1 img_17.jpg
#> 17 0 0 0 1 img_17.jpg
#> 16 0 0 1 0 img_16.jpg
#> 17 0 0 0 1 img_17.jpg
#> 17 0 0 0 1 img_17.jpg
#> 17 0 0 0 1 img_17.jpg
#> 16 0 0 1 0 img_16.jpg
#> 17 0 0 0 1 img_17.jpg
#> 17 0 0 0 1 img_17.jpg

#> y1 y2 y3 y4 x
#>0 1 1 0 0 img_0.jpg
#>1 1 1 0 0 img_1.jpg
#>2 1 1 0 1 img_2.jpg
#>3 1 1 0 0 img_3.jpg
#>4 1 1 1 0 img_4.jpg
#>5 1 1 0 0 img_5.jpg
#>6 1 1 0 0 img_6.jpg
#>7 1 1 0 0 img_7.jpg
#>8 1 1 0 1 img_8.jpg
#>9 1 1 0 0 img_9.jpg
#>10 1 1 0 0 img_10.jpg
#>11 1 1 0 0 img_11.jpg
#>12 1 0 1 0 img_12.jpg
#>13 1 0 1 1 img_13.jpg
#>14 1 0 0 0 img_14.jpg
#>15 1 0 0 0 img_15.jpg
#>16 0 0 0 0 img_16.jpg
#>17 0 0 0 0 img_17.jpg
#>18 0 0 0 0 img_18.jpg
#>19 0 0 1 1 img_19.jpg
#>19 0 0 1 1 img_19.jpg
#>19 0 0 1 1 img_19.jpg
#>13 1 0 1 1 img_13.jpg
#>13 1 0 1 1 img_13.jpg
#>13 1 0 1 1 img_13.jpg
#>19 0 0 1 1 img_19.jpg
#>19 0 0 1 1 img_19.jpg
#>19 0 0 1 1 img_19.jpg
#>19 0 0 1 1 img_19.jpg
#>19 0 0 1 1 img_19.jpg
#>19 0 0 1 1 img_19.jpg
```


@@ -118,6 +118,13 @@ print(df_new)
pip install git+https://github.com/phiyodr/multilabel-oversampling
```

* Install from PyPI

```bash
pip install multilabel-oversampling
```



## :construction_worker: Future work

Binary file removed assets/plot.png
Binary file not shown.
Binary file added assets/plot_all_tries.png
Binary file modified assets/plot_results.png
2 changes: 1 addition & 1 deletion multilabel_oversampling/__init__.py
@@ -1,3 +1,3 @@
__version__ = "0.1.2"
__version__ = "0.1.3"

from .multilabel_oversampling import *
5 changes: 3 additions & 2 deletions multilabel_oversampling/multilabel_oversampling.py
@@ -131,6 +131,7 @@ def plot_all_tries(self):
plt.scatter(i + idx*0.01, s)
plt.xlabel('Iters')#, fontsize=18)
plt.ylabel('Std')#, fontsize=16)
plt.title("All standard deviations per iteration")
return plt

def plot_results(self):
@@ -172,8 +173,8 @@ def plot_index_counts(self, df_new=None):
df_new = self.df_new
x = list(collections.Counter(list(df_new.index)).values())
plt.hist(x, bins=max(x)+1, rwidth=.9)
plt.title("Frequency of indexes in df")
plt.xlabel('Frequency in dataset')
plt.title("Frequency of\nindexes in df")
plt.xlabel('Occurrences of indexes in dataset')
plt.ylabel('Counts')
return plt
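
To see what `plot_index_counts` feeds into the histogram, the counting step can be reproduced without plotting. A minimal sketch, assuming the index of `df_new` is the one printed in the README example above (the original rows 0 to 19 followed by the eleven duplicated rows):

```python
import collections

# Index of df_new as printed in the README: 20 original rows plus 11 duplicates.
index = list(range(20)) + [19, 19, 13, 13, 13, 19, 19, 19, 19, 19, 19]

# Same expression as in plot_index_counts: how often each index occurs.
x = list(collections.Counter(index).values())

# Indexes that were duplicated at least once, with their total occurrences.
repeated = {i: n for i, n in collections.Counter(index).items() if n > 1}
print(repeated)
```

In this example only rows 13 and 19 are ever duplicated, so the histogram has one tall bar at 1 and single entries at 4 and 9.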

1 change: 1 addition & 0 deletions setup.py
@@ -18,6 +18,7 @@
long_description_content_type="text/markdown",
url="https://github.com/phiyodr/multilabel-oversampling",
classifiers=[
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Programming Language :: Python :: 3",
],
packages=setuptools.find_packages(),