DOC restructure data transformation user guide
jnothman committed Sep 11, 2014
1 parent 5fc0e8f commit 9af14b3
Showing 9 changed files with 214 additions and 138 deletions.
23 changes: 22 additions & 1 deletion doc/data_transforms.rst
@@ -5,10 +5,31 @@
Dataset transformations
-----------------------

scikit-learn provides a library of transformers, which may clean (see
:ref:`preprocessing`), reduce (see :ref:`data_reduction`), expand (see
:ref:`kernel_approximation`) or generate (see :ref:`feature_extraction`)
feature representations.

Like other estimators, these are represented by classes with a ``fit`` method,
which learns model parameters (e.g. mean and standard deviation for
normalization) from a training set, and a ``transform`` method which applies
this transformation model to unseen data. ``fit_transform`` may be more
convenient and efficient for modelling and transforming the training data
simultaneously.
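
A minimal sketch of this interface, using :class:`StandardScaler
<sklearn.preprocessing.StandardScaler>` on toy data::

   >>> import numpy as np
   >>> from sklearn.preprocessing import StandardScaler
   >>> X_train = np.array([[0., 0.], [2., 2.]])
   >>> scaler = StandardScaler().fit(X_train)   # learn mean and standard deviation
   >>> scaler.transform(X_train)                # apply the learnt transformation
   array([[-1., -1.],
          [ 1.,  1.]])
   >>> scaler.fit_transform(X_train)            # fit and transform in one call
   array([[-1., -1.],
          [ 1.,  1.]])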

Combining such transformers, either in parallel or in series, is covered in
:ref:`combining_estimators`. :ref:`metrics` covers transforming feature
spaces into affinity matrices, while :ref:`preprocessing_targets` considers
transformations of the target space (e.g. categorical labels) for use in
scikit-learn.

.. toctree::

modules/pipeline
modules/feature_extraction
modules/preprocessing
modules/unsupervised_reduction
modules/random_projection
modules/kernel_approximation
modules/metrics
modules/preprocessing_targets
1 change: 0 additions & 1 deletion doc/model_selection.rst
@@ -9,7 +9,6 @@ Model selection and evaluation

modules/cross_validation
modules/grid_search
modules/pipeline
modules/model_evaluation
modules/model_persistence
modules/learning_curve
37 changes: 33 additions & 4 deletions doc/modules/cross_validation.rst
@@ -129,8 +129,9 @@ In the case of the Iris dataset, the samples are balanced across target
classes, hence the accuracy and the F1-score are almost equal.

When the ``cv`` argument is an integer, :func:`cross_val_score` uses the
:class:`KFold` or :class:`StratifiedKFold` strategies by default, the latter
being used if the estimator derives from :class:`ClassifierMixin
<sklearn.base.ClassifierMixin>`.

It is also possible to use other cross validation strategies by passing a cross
validation iterator instead, for instance::
@@ -143,7 +144,36 @@ validation iterator instead, for instance::
... # doctest: +ELLIPSIS
array([ 0.97..., 0.97..., 1. ])

The available cross validation iterators are introduced in the following
section.

.. topic:: Data transformation with held out data

Just as it is important to test a predictor on data held out from
training, preprocessing (such as standardization, feature selection, etc.)
and other :ref:`data transformations <data-transforms>` should similarly
be learnt from a training set and applied to held-out data for prediction::

>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(
... iris.data, iris.target, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test) # doctest: +ELLIPSIS
0.9333...

A :class:`Pipeline <sklearn.pipeline.Pipeline>` makes it easier to compose
estimators, providing this behavior under cross-validation::

>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=cv)
... # doctest: +ELLIPSIS
array([ 0.97..., 0.93..., 0.95...])

See :ref:`combining_estimators`.


.. topic:: Examples
@@ -153,7 +183,6 @@ The available cross validation iterators are introduced in the following.
* :ref:`example_model_selection_grid_search_digits.py`,
* :ref:`example_model_selection_grid_search_text_feature_extraction.py`,


Cross validation iterators
==========================

17 changes: 13 additions & 4 deletions doc/modules/grid_search.rst
@@ -35,6 +35,11 @@ all parameter combinations, while :class:`RandomizedSearchCV` can sample a
given number of candidates from a parameter space with a specified
distribution.

.. seealso::

   :ref:`pipeline` describes building composite estimators whose
   parameter space can be searched with these tools.
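
A minimal sketch of such a search over a pipeline's parameters
(parameter values illustrative)::

   >>> from sklearn.grid_search import GridSearchCV
   >>> from sklearn.pipeline import make_pipeline
   >>> from sklearn.preprocessing import StandardScaler
   >>> from sklearn.svm import SVC
   >>> pipe = make_pipeline(StandardScaler(), SVC())
   >>> search = GridSearchCV(pipe, param_grid={'svc__C': [0.1, 1, 10]})
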
Exhaustive Grid Search
======================

@@ -179,12 +184,16 @@ Here is the list of such models:
:toctree: generated/
:template: class.rst

linear_model.LarsCV
linear_model.LassoCV
linear_model.ElasticNetCV
linear_model.LassoLarsCV
linear_model.LogisticRegressionCV
linear_model.MultiTaskElasticNetCV
linear_model.MultiTaskLassoCV
linear_model.OrthogonalMatchingPursuitCV
linear_model.RidgeCV
linear_model.RidgeClassifierCV
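
For instance, a minimal sketch with :class:`RidgeCV
<sklearn.linear_model.RidgeCV>`, which selects its regularization
strength during ``fit`` (toy data)::

   >>> import numpy as np
   >>> from sklearn.linear_model import RidgeCV
   >>> X = np.array([[0.], [1.], [2.], [3.]])
   >>> y = np.array([0., 1., 2., 3.])
   >>> reg = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
   >>> reg.alpha_ in (0.1, 1.0, 10.0)   # chosen by internal cross-validation
   True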


Information Criterion
3 changes: 3 additions & 0 deletions doc/modules/kernel_approximation.rst
@@ -25,6 +25,9 @@ In particular, the combination of kernel map approximations with
Since there has not been much empirical work using approximate embeddings, it
is advisable to compare results against exact kernel methods when possible.
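
A minimal sketch of this pattern, pairing :class:`RBFSampler
<sklearn.kernel_approximation.RBFSampler>` with a linear classifier
(toy data)::

   >>> from sklearn.kernel_approximation import RBFSampler
   >>> from sklearn.linear_model import SGDClassifier
   >>> X = [[0, 0], [1, 1], [1, 0], [0, 1]]
   >>> y = [0, 0, 1, 1]
   >>> rbf_feature = RBFSampler(gamma=1, random_state=1)
   >>> X_features = rbf_feature.fit_transform(X)    # approximate RBF feature map
   >>> clf = SGDClassifier().fit(X_features, y)     # linear model in mapped space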

.. seealso::

   :ref:`polynomial_regression` for an exact polynomial transformation.

.. currentmodule:: sklearn.kernel_approximation

24 changes: 16 additions & 8 deletions doc/modules/pipeline.rst
@@ -1,8 +1,13 @@
.. _combining_estimators:

===============================================
Pipeline and FeatureUnion: combining estimators
===============================================

.. _pipeline:

Pipeline: chaining estimators
=============================

.. currentmodule:: sklearn.pipeline

@@ -23,7 +28,7 @@ The last estimator may be any type (transformer, classifier, etc.).


Usage
-----

The :class:`Pipeline` is built using a list of ``(key, value)`` pairs, where
the ``key`` is a string containing the name you want to give this step and ``value``
@@ -91,9 +96,13 @@ This is particularly important for doing grid searches::
* :ref:`example_plot_kernel_approximation.py`
* :ref:`example_svm_plot_svm_anova.py`

.. topic:: See also:

* :ref:`grid_search`


Notes
-----

Calling ``fit`` on the pipeline is the same as calling ``fit`` on each
estimator in turn, transforming the input and passing it on to the next step.
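
A minimal sketch of this behaviour (step names illustrative)::

   >>> from sklearn.pipeline import Pipeline
   >>> from sklearn.preprocessing import StandardScaler
   >>> from sklearn.svm import SVC
   >>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
   >>> # pipe.fit(X, y) standardizes X, then fits the SVM on the result
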
@@ -105,9 +114,8 @@ pipeline.

.. _feature_union:

FeatureUnion: composite feature spaces
======================================

.. currentmodule:: sklearn.pipeline

@@ -131,7 +139,7 @@ responsibility.)


Usage
-----

A :class:`FeatureUnion` is built using a list of ``(key, value)`` pairs,
where the ``key`` is the name you want to give to a given transformation
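
A minimal sketch of such a union (transformer choices illustrative)::

   >>> from sklearn.pipeline import FeatureUnion
   >>> from sklearn.decomposition import PCA, KernelPCA
   >>> union = FeatureUnion([('linear_pca', PCA(n_components=2)),
   ...                       ('kernel_pca', KernelPCA(n_components=2))])
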
122 changes: 2 additions & 120 deletions doc/modules/preprocessing.rst
@@ -358,66 +358,7 @@ numbers the continent and the last four the web browser.
See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as integers.


Label preprocessing
===================

Label binarization
------------------

:class:`LabelBinarizer` is a utility class to help create a label indicator
matrix from a list of multi-class labels::

>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
[0, 0, 0, 1]])

For multiple labels per instance, use :class:`MultiLabelBinarizer`::

>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
[0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])

Label encoding
--------------

:class:`LabelEncoder` is a utility class to help normalize labels such that
they contain only values between 0 and n_classes-1. This is sometimes useful
for writing efficient Cython routines. :class:`LabelEncoder` can be used as
follows::

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are
hashable and comparable) to numerical labels::

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

.. _imputation:

Imputation of missing values
============================
@@ -469,63 +410,4 @@ in the matrix. This format is thus suitable when there are many more missing
values than observed values.

:class:`Imputer` can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See :ref:`example_missing_values.py`.
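
A minimal sketch of such a composite (final estimator illustrative)::

   >>> from sklearn.pipeline import make_pipeline
   >>> from sklearn.preprocessing import Imputer
   >>> from sklearn.tree import DecisionTreeRegressor
   >>> clf = make_pipeline(Imputer(strategy='mean'), DecisionTreeRegressor())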

.. _data_reduction:

Unsupervised data reduction
============================

If your number of features is high, it may be useful to reduce it with an
unsupervised step prior to supervised steps. Many of the
:ref:`unsupervised-learning` methods implement a ``transform`` method that
can be used to reduce the dimensionality. Below we discuss three specific
examples of this pattern that are heavily used.

.. topic:: **Pipelining**

The unsupervised data reduction and the supervised estimator can be
chained in one step. See :ref:`pipeline`.

.. currentmodule:: sklearn

PCA: principal component analysis
----------------------------------

:class:`decomposition.PCA` looks for a combination of features that
captures the variance of the original features well.
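
A minimal sketch (random toy data)::

   >>> import numpy as np
   >>> from sklearn.decomposition import PCA
   >>> X = np.random.RandomState(0).rand(10, 5)
   >>> pca = PCA(n_components=2).fit(X)   # learn the principal components
   >>> pca.transform(X).shape             # project onto the top two
   (10, 2)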

.. topic:: **Examples**

* :ref:`example_applications_face_recognition.py`

Random projections
-------------------

The module :mod:`random_projection` provides several tools for data
reduction by random projections. See the relevant section of the
documentation: :ref:`random_projection`.
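
A minimal sketch (dimensions illustrative)::

   >>> import numpy as np
   >>> from sklearn.random_projection import GaussianRandomProjection
   >>> X = np.random.RandomState(0).rand(10, 3000)
   >>> transformer = GaussianRandomProjection(n_components=100, random_state=0)
   >>> transformer.fit_transform(X).shape
   (10, 100)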

.. topic:: **Examples**

* :ref:`example_plot_johnson_lindenstrauss_bound.py`

Feature agglomeration
------------------------

:class:`cluster.FeatureAgglomeration` applies
:ref:`hierarchical_clustering` to group together features that behave
similarly.
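
A minimal sketch (random toy data)::

   >>> import numpy as np
   >>> from sklearn.cluster import FeatureAgglomeration
   >>> X = np.random.RandomState(0).rand(10, 6)
   >>> agglo = FeatureAgglomeration(n_clusters=3)   # merge 6 features into 3 clusters
   >>> agglo.fit_transform(X).shape
   (10, 3)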

.. topic:: **Examples**

* :ref:`example_cluster_plot_feature_agglomeration_vs_univariate_selection.py`
* :ref:`example_cluster_plot_digits_agglomeration.py`

.. topic:: **Feature scaling**

Note that if features have very different scaling or statistical
properties, :class:`cluster.FeatureAgglomeration` may not be able to
capture the links between related features. Using a
:class:`preprocessing.StandardScaler` can be useful in these settings.

