DOC restructure data transformation user guide
jnothman committed Sep 11, 2014
1 parent 5fc0e8f commit 9af14b3
Showing 9 changed files with 214 additions and 138 deletions.
23 changes: 22 additions & 1 deletion doc/data_transforms.rst
@@ -5,10 +5,31 @@
Dataset transformations
-----------------------

scikit-learn provides a library of transformers, which may clean (see
:ref:`preprocessing`), reduce (see :ref:`data_reduction`), expand (see
:ref:`kernel_approximation`) or generate (see :ref:`feature_extraction`)
feature representations.

Like other estimators, these are represented by classes with a ``fit`` method,
which learns model parameters (e.g. mean and standard deviation for
normalization) from a training set, and a ``transform`` method which applies
this transformation model to unseen data. ``fit_transform`` may be more
convenient and efficient for modelling and transforming the training data
simultaneously.
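
A minimal sketch of this interface, using :class:`StandardScaler
<sklearn.preprocessing.StandardScaler>` on toy data::

   >>> import numpy as np
   >>> from sklearn.preprocessing import StandardScaler
   >>> X_train = np.array([[0., 0.], [2., 2.]])
   >>> scaler = StandardScaler().fit(X_train)   # learn mean and standard deviation
   >>> scaler.transform(X_train)                # apply the learnt transformation
   array([[-1., -1.],
          [ 1.,  1.]])
   >>> scaler.fit_transform(X_train)            # fit and transform in one call
   array([[-1., -1.],
          [ 1.,  1.]])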

Combining such transformers, either in parallel or in series, is covered in
:ref:`combining_estimators`. :ref:`metrics` covers transforming feature
spaces into affinity matrices, while :ref:`preprocessing_targets` considers
transformations of the target space (e.g. categorical labels) for use in
scikit-learn.

.. toctree::

modules/pipeline
modules/feature_extraction
modules/preprocessing
modules/unsupervised_reduction
modules/random_projection
modules/kernel_approximation
modules/metrics
modules/preprocessing_targets
1 change: 0 additions & 1 deletion doc/model_selection.rst
@@ -9,7 +9,6 @@ Model selection and evaluation

modules/cross_validation
modules/grid_search
modules/pipeline
modules/model_evaluation
modules/model_persistence
modules/learning_curve
37 changes: 33 additions & 4 deletions doc/modules/cross_validation.rst
@@ -129,8 +129,9 @@ In the case of the Iris dataset, the samples are balanced across target
classes, hence the accuracy and the F1-score are almost equal.

When the ``cv`` argument is an integer, :func:`cross_val_score` uses the
:class:`KFold` or :class:`StratifiedKFold` strategies by default, the latter
being used if the estimator derives from :class:`ClassifierMixin
<sklearn.base.ClassifierMixin>`.

It is also possible to use other cross validation strategies by passing a cross
validation iterator instead, for instance::
@@ -143,7 +144,36 @@ validation iterator instead, for instance::
... # doctest: +ELLIPSIS
array([ 0.97..., 0.97..., 1. ])

The available cross validation iterators are introduced in the following
section.

.. topic:: Data transformation with held out data

Just as it is important to test a predictor on data held out from
training, preprocessing (such as standardization, feature selection, etc.)
and other :ref:`data transformations <data-transforms>` should similarly
be learnt from a training set and applied to held-out data for prediction::

>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(
... iris.data, iris.target, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test) # doctest: +ELLIPSIS
0.9333...

A :class:`Pipeline <sklearn.pipeline.Pipeline>` makes it easier to compose
estimators, providing this behavior under cross-validation::

>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=cv)
... # doctest: +ELLIPSIS
array([ 0.97..., 0.93..., 0.95...])

See :ref:`combining_estimators`.


.. topic:: Examples
@@ -153,7 +183,6 @@ The available cross validation iterators are introduced in the following.
* :ref:`example_model_selection_grid_search_digits.py`,
* :ref:`example_model_selection_grid_search_text_feature_extraction.py`,


Cross validation iterators
==========================

17 changes: 13 additions & 4 deletions doc/modules/grid_search.rst
@@ -35,6 +35,11 @@ all parameter combinations, while :class:`RandomizedSearchCV` can sample a
given number of candidates from a parameter space with a specified
distribution.

.. seealso::

   :ref:`pipeline` describes building composite estimators whose
   parameter space can be searched with these tools.
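
A minimal sketch of such a search over a pipeline's parameters
(parameter values illustrative)::

   >>> from sklearn.grid_search import GridSearchCV
   >>> from sklearn.pipeline import make_pipeline
   >>> from sklearn.preprocessing import StandardScaler
   >>> from sklearn.svm import SVC
   >>> pipe = make_pipeline(StandardScaler(), SVC())
   >>> search = GridSearchCV(pipe, param_grid={'svc__C': [0.1, 1, 10]})
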
Exhaustive Grid Search
======================

@@ -179,12 +184,16 @@ Here is the list of such models:
:toctree: generated/
:template: class.rst

linear_model.LarsCV
linear_model.LassoCV
linear_model.ElasticNetCV
linear_model.LassoLarsCV
linear_model.LogisticRegressionCV
linear_model.MultiTaskElasticNetCV
linear_model.MultiTaskLassoCV
linear_model.OrthogonalMatchingPursuitCV
linear_model.RidgeCV
linear_model.RidgeClassifierCV
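
For instance, a minimal sketch with :class:`RidgeCV
<sklearn.linear_model.RidgeCV>`, which selects its regularization
strength during ``fit`` (toy data)::

   >>> import numpy as np
   >>> from sklearn.linear_model import RidgeCV
   >>> X = np.array([[0.], [1.], [2.], [3.]])
   >>> y = np.array([0., 1., 2., 3.])
   >>> reg = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
   >>> reg.alpha_ in (0.1, 1.0, 10.0)   # chosen by internal cross-validation
   True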


Information Criterion
3 changes: 3 additions & 0 deletions doc/modules/kernel_approximation.rst
@@ -25,6 +25,9 @@ In particular, the combination of kernel map approximations with
Since there has not been much empirical work using approximate embeddings, it
is advisable to compare results against exact kernel methods when possible.
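
A minimal sketch of this pattern, pairing :class:`RBFSampler
<sklearn.kernel_approximation.RBFSampler>` with a linear classifier
(toy data)::

   >>> from sklearn.kernel_approximation import RBFSampler
   >>> from sklearn.linear_model import SGDClassifier
   >>> X = [[0, 0], [1, 1], [1, 0], [0, 1]]
   >>> y = [0, 0, 1, 1]
   >>> rbf_feature = RBFSampler(gamma=1, random_state=1)
   >>> X_features = rbf_feature.fit_transform(X)    # approximate RBF feature map
   >>> clf = SGDClassifier().fit(X_features, y)     # linear model in mapped space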

.. seealso::

   :ref:`polynomial_regression` for an exact polynomial transformation.

.. currentmodule:: sklearn.kernel_approximation

24 changes: 16 additions & 8 deletions doc/modules/pipeline.rst
@@ -1,8 +1,13 @@
.. _combining_estimators:

===============================================
Pipeline and FeatureUnion: combining estimators
===============================================

.. _pipeline:

Pipeline: chaining estimators
=============================

.. currentmodule:: sklearn.pipeline

@@ -23,7 +28,7 @@ The last estimator may be any type (transformer, classifier, etc.).


Usage
-----

The :class:`Pipeline` is built using a list of ``(key, value)`` pairs, where
the ``key`` is a string containing the name you want to give this step and ``value``
@@ -91,9 +96,13 @@ This is particularly important for doing grid searches::
* :ref:`example_plot_kernel_approximation.py`
* :ref:`example_svm_plot_svm_anova.py`

.. topic:: See also:

* :ref:`grid_search`


Notes
-----

Calling ``fit`` on the pipeline is the same as calling ``fit`` on each
estimator in turn, transforming the input and passing it on to the next step.
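
A minimal sketch of this behaviour (step names illustrative)::

   >>> from sklearn.pipeline import Pipeline
   >>> from sklearn.preprocessing import StandardScaler
   >>> from sklearn.svm import SVC
   >>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
   >>> # pipe.fit(X, y) standardizes X, then fits the SVM on the result
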
@@ -105,9 +114,8 @@ pipeline.

.. _feature_union:

FeatureUnion: composite feature spaces
======================================

.. currentmodule:: sklearn.pipeline

@@ -131,7 +139,7 @@ responsibility.)


Usage
-----

A :class:`FeatureUnion` is built using a list of ``(key, value)`` pairs,
where the ``key`` is the name you want to give to a given transformation
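
A minimal sketch of such a union (transformer choices illustrative)::

   >>> from sklearn.pipeline import FeatureUnion
   >>> from sklearn.decomposition import PCA, KernelPCA
   >>> union = FeatureUnion([('linear_pca', PCA(n_components=2)),
   ...                       ('kernel_pca', KernelPCA(n_components=2))])
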
122 changes: 2 additions & 120 deletions doc/modules/preprocessing.rst
@@ -358,66 +358,7 @@ numbers the continent and the last four the web browser.
See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as integers.


Label preprocessing
===================

Label binarization
------------------

:class:`LabelBinarizer` is a utility class to help create a label indicator
matrix from a list of multi-class labels::

>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
[0, 0, 0, 1]])

For multiple labels per instance, use :class:`MultiLabelBinarizer`::

>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
[0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])

Label encoding
--------------

:class:`LabelEncoder` is a utility class to help normalize labels such that
they contain only values between 0 and n_classes-1. This is sometimes useful
for writing efficient Cython routines. :class:`LabelEncoder` can be used as
follows::

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are
hashable and comparable) to numerical labels::

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

.. _imputation:

Imputation of missing values
============================
@@ -469,63 +410,4 @@ in the matrix. This format is thus suitable when there are many more missing
values than observed values.

:class:`Imputer` can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See :ref:`example_missing_values.py`.
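
A minimal sketch of such a composite (final estimator illustrative)::

   >>> from sklearn.pipeline import make_pipeline
   >>> from sklearn.preprocessing import Imputer
   >>> from sklearn.tree import DecisionTreeRegressor
   >>> clf = make_pipeline(Imputer(strategy='mean'), DecisionTreeRegressor())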

.. _data_reduction:

Unsupervised data reduction
============================

If your number of features is high, it may be useful to reduce it with an
unsupervised step prior to supervised steps. Many of the
:ref:`unsupervised-learning` methods implement a ``transform`` method that
can be used to reduce the dimensionality. Below we discuss three specific
examples of this pattern that are heavily used.

.. topic:: **Pipelining**

The unsupervised data reduction and the supervised estimator can be
chained in one step. See :ref:`pipeline`.

.. currentmodule:: sklearn

PCA: principal component analysis
----------------------------------

:class:`decomposition.PCA` looks for a combination of features that
captures the variance of the original features well.
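
A minimal sketch (random toy data)::

   >>> import numpy as np
   >>> from sklearn.decomposition import PCA
   >>> X = np.random.RandomState(0).rand(10, 5)
   >>> pca = PCA(n_components=2).fit(X)   # learn the principal components
   >>> pca.transform(X).shape             # project onto the top two
   (10, 2)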

.. topic:: **Examples**

* :ref:`example_applications_face_recognition.py`

Random projections
-------------------

The module :mod:`random_projection` provides several tools for data
reduction by random projections. See the relevant section of the
documentation: :ref:`random_projection`.
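
A minimal sketch (dimensions illustrative)::

   >>> import numpy as np
   >>> from sklearn.random_projection import GaussianRandomProjection
   >>> X = np.random.RandomState(0).rand(10, 3000)
   >>> transformer = GaussianRandomProjection(n_components=100, random_state=0)
   >>> transformer.fit_transform(X).shape
   (10, 100)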

.. topic:: **Examples**

* :ref:`example_plot_johnson_lindenstrauss_bound.py`

Feature agglomeration
------------------------

:class:`cluster.FeatureAgglomeration` applies
:ref:`hierarchical_clustering` to group together features that behave
similarly.
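
A minimal sketch (random toy data)::

   >>> import numpy as np
   >>> from sklearn.cluster import FeatureAgglomeration
   >>> X = np.random.RandomState(0).rand(10, 6)
   >>> agglo = FeatureAgglomeration(n_clusters=3)   # merge 6 features into 3 clusters
   >>> agglo.fit_transform(X).shape
   (10, 3)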

.. topic:: **Examples**

* :ref:`example_cluster_plot_feature_agglomeration_vs_univariate_selection.py`
* :ref:`example_cluster_plot_digits_agglomeration.py`

.. topic:: **Feature scaling**

Note that if features have very different scaling or statistical
properties, :class:`cluster.FeatureAgglomeration` may not be able to
capture the links between related features. Using a
:class:`preprocessing.StandardScaler` can be useful in these settings.

