dsbasic

An open-source library for basic data science tasks.

It implements, in part:

sklearn API-compatible transformers that act on pandas DataFrames.
a Visualizer object with convenient visualization functions and API.
helpers for writing clean and understandable data pipelines.

Imputation

The dsbasic.frame.preprocessing.impute module implements the fImputer transformer, which imputes missing values in selected columns of a DataFrame.

fImputer(strategy='mean', copy=True, na_sentinel=-1, columns=None)

columns - list of column names to impute.
strategy - one of 'mean', 'median', 'most_frequent', or 'na_sentinel'; selects the imputation method.
copy - whether the returned frame should be a copy.
na_sentinel - the fill value used when strategy='na_sentinel'; every missing value in the selected columns is replaced with it.

fImputer accepts both pandas.DataFrame and pandas.Series objects.

Example

import pandas
from dsbasic.frame.preprocessing.impute import fImputer
from sklearn.pipeline import make_pipeline

numeric = ['n1', 'n2', 'n3']
categorical = ['c1', 'c2', 'c3']

# impute numeric columns with the median, then categorical columns with the mode
imputer = make_pipeline(
    fImputer(strategy='median', columns=numeric, copy=True),
    fImputer(strategy='most_frequent', columns=categorical, copy=True)
)

X = pandas.read_csv(...)
Y = pandas.read_csv(...)
X_imputed = imputer.fit_transform(X)  # fit the imputers on X, then impute X
Y_imputed = imputer.transform(Y)      # impute Y with the imputers fitted on X
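
The fit/transform split in the example matters because fill values are learned once and then reused on new data. As a rough illustration only, and not dsbasic's actual implementation, the same idea can be sketched in plain pandas. The frames train and test and their values are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
train = pd.DataFrame({"n1": [1.0, np.nan, 3.0, 8.0], "c1": ["a", "b", None, "a"]})
test = pd.DataFrame({"n1": [np.nan, 10.0], "c1": [None, "b"]})

# "fit": learn one fill value per column from the training frame only
median_n1 = train["n1"].median()      # median of 1.0, 3.0, 8.0
mode_c1 = train["c1"].mode().iloc[0]  # most frequent non-missing label

# "transform": apply the learned values to any frame, including unseen data
fill = {"n1": median_n1, "c1": mode_c1}
train_imputed = train.fillna(fill)
test_imputed = test.fillna(fill)

print(test_imputed)
```

Note that the missing value in test is filled with the median of train, not of test; this is the behaviour the sklearn fit/transform convention is meant to give.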

Categorical Variable Encoding

The dsbasic.frame.preprocessing.categorical module implements transformers for categorical features: fOrdinalEncoder, fOneHotEncoder, and fLabelEncoder.

fLabelEncoder(dtype=np.uint8, nan_handle='soft')

Assigns a natural number to each unique label of a pandas Series.

dtype - dtype of the encoded output.
nan_handle - one of 'soft', 'hard', or 'ignore':

soft - NaNs are encoded in transform only if NaNs were present during fit.
hard - NaNs are assigned a label in transform even if none were present during fit.
ignore - NaNs are ignored altogether and remain NaN in the output.

Note: if nan_handle is set to 'ignore', the dtype argument is ignored and the output dtype is float32.

fLabelEncoder accepts only a pandas.Series object. To encode several columns, see fOrdinalEncoder.

Example

import numpy
import pandas
from dsbasic.frame.preprocessing.categorical import fLabelEncoder

labels = pandas.Series(['a', 'b', 'a', 'c', numpy.nan, 'a'])

y1 = fLabelEncoder(nan_handle='ignore').fit_transform(labels)
y2 = fLabelEncoder(nan_handle='soft').fit_transform(labels)
y3 = fLabelEncoder(nan_handle='hard').fit_transform(labels)

# print the three encodings in order: ignore, soft, hard
print('{}\n\n{}\n\n{}'.format(y1, y2, y3))

Output

0 0.0 
1 1.0 
2 0.0 
3 2.0 
4 NaN 
5 0.0 
dtype: float32 

0 0 
1 1 
2 0 
3 2 
4 3 
5 0 
dtype: uint8 

0 0 
1 1 
2 0 
3 2 
4 3 
5 0 
dtype: uint8
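
To make the output above easier to read, here is a minimal sketch in plain pandas that reproduces the same numbers for this one input. It is not how dsbasic is implemented; label_encode_sketch and its encode_nan flag are hypothetical names introduced for illustration. It also covers only the fit_transform case, so it does not show how 'soft' and 'hard' differ on data whose NaNs first appear at transform time.

```python
import numpy as np
import pandas as pd


def label_encode_sketch(series, encode_nan):
    """Hypothetical helper: mimics the fit_transform outputs shown above."""
    # factorize numbers labels in order of first appearance; NaN gets code -1
    codes, uniques = pd.factorize(series)
    missing = codes == -1
    if encode_nan:
        # as in the 'soft' and 'hard' outputs: NaN takes the next free label
        codes = np.where(missing, len(uniques), codes)
        return pd.Series(codes, index=series.index, dtype=np.uint8)
    # as in the 'ignore' output: NaN stays NaN, which requires a float dtype
    values = codes.astype(np.float32)
    values[missing] = np.nan
    return pd.Series(values, index=series.index)


labels = pd.Series(['a', 'b', 'a', 'c', np.nan, 'a'])
ignored = label_encode_sketch(labels, encode_nan=False)
encoded = label_encode_sketch(labels, encode_nan=True)
print('{}\n\n{}'.format(ignored, encoded))
```

The float32 result in the first case illustrates why the dtype argument cannot be honoured when NaNs are kept: an unsigned integer column has no way to represent a missing value.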

fOrdinalEncoder(dtype=np.uint8, nan_handle='soft', columns=None, copy=True)

Label-encodes each column listed in columns using fLabelEncoder.

dtype - dtype of the ordinal-encoded columns.
nan_handle - one of 'soft', 'hard', or 'ignore':

soft - NaNs are encoded in transform only if NaNs were present during fit.
hard - NaNs are assigned a label in transform even if none were present during fit.
ignore - NaNs are ignored altogether.

columns - list of column names to encode.
copy - whether the returned frame should be a copy.
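
The per-column behaviour can be pictured with a short plain-pandas sketch. This is an approximation under stated assumptions, not the library's code: ordinal_encode_sketch is a hypothetical name, the data is made up, and the sketch assumes the selected columns contain no missing values.

```python
import numpy as np
import pandas as pd


def ordinal_encode_sketch(frame, columns, copy=True):
    """Hypothetical helper: integer-codes each listed column independently."""
    out = frame.copy() if copy else frame
    for col in columns:
        # each column gets its own numbering, in order of first appearance;
        # assumes no NaN in the column (factorize would mark it with -1)
        codes, _ = pd.factorize(out[col])
        out[col] = codes.astype(np.uint8)
    return out


# Made-up data for illustration only
frame = pd.DataFrame({
    'c1': ['a', 'b', 'a'],
    'c2': ['x', 'x', 'y'],
    'n1': [1.5, 2.5, 3.5],
})
encoded = ordinal_encode_sketch(frame, columns=['c1', 'c2'])
print(encoded)
```

Columns not listed in columns pass through unchanged, and with copy=True the input frame is left untouched.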

fOneHotEncoder(sep='_', dummy_na=False, columns=None)

One-hot encodes selected columns of a DataFrame and discards the original columns (pandas get_dummies style).

sep - separator used in the new column names, which are built as column_name + sep + label_name.
dummy_na - whether to add a one-hot column for missing values.
columns - list of column names to encode.
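
Since the behaviour is described as pandas get_dummies style, the closest plain-pandas call gives a feel for the result. This is only an analogy: it assumes sep plays the role of prefix_sep in pandas.get_dummies, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
frame = pd.DataFrame({
    'c1': ['a', 'b', np.nan, 'a'],
    'n1': [1, 2, 3, 4],
})

# the original 'c1' column is dropped and replaced by one column per label
encoded = pd.get_dummies(frame, columns=['c1'], prefix_sep='_', dummy_na=False)

# dummy_na=True adds one extra indicator column for the missing values
with_na = pd.get_dummies(frame, columns=['c1'], prefix_sep='_', dummy_na=True)

print(list(encoded.columns))
print(list(with_na.columns))
```

With dummy_na=False, a row whose value is missing gets zeros in every indicator column.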
