Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically add type annotations to all API functions #192

Merged
merged 1 commit into from
Jul 13, 2018

Conversation

flying-sheep
Copy link
Member

@flying-sheep flying-sheep commented Jul 9, 2018

Fixes #58.

readthedocs

For RTD I switched from numpydoc to napoleon, with the same customizations as in anndata:

  • Numpydoc-style HTML rendering of param docs with custom CSS
  • A custom class template that supersedes numpy’s autodoc hack and makes attributes appear above methods in class docs

docstrings

The docstring part is implemented via obj.getdoc(), a method invoked by IPython if available, which means that it leaves __doc__ alone.

As an example, it converts scanpy.api.Neighbors.compute_neighbors’ docs like this, leaving alone explicit type/default info and adding info from the signature.

A huge advantage here is that by removing explicit info, we gain always-up-to-date defaults and types. Whenever we change something, we can’t forget to change everything else anymore.

Compute distances and connectivities of neighbors.

Parameters
----------
n_neighbors
     Use this number of nearest neighbors.
knn
     Restrict result to `n_neighbors` nearest neighbors.
{n_pcs}
{use_rep}

Returns
-------
Writes sparse graph attributes `.distances` and `.connectivities`.
Also writes `.knn_indices` and `.knn_distances` if
`write_knn_indices==True`.

↓↓↓

Compute distances and connectivities of neighbors.
Parameters
----------
n_neighbors : int, optional (default: 30)
     Use this number of nearest neighbors.
knn : bool, optional (default: True)
     Restrict result to `n_neighbors` nearest neighbors.
n_pcs : `int` or `None`, optional (default: `None`)
    Use this many PCs. If `n_pcs==0` use `.X` if `use_rep is None`.
use_rep : {`None`, 'X'} or any key for `.obsm`, optional (default: `None`)
    Use the indicated representation. If `None`, the representation is chosen
    automatically: for `.n_vars` < 50, `.X` is used, otherwise 'X_pca' is used.
    If 'X_pca' is not present, it's computed with default parameters.
Returns
-------
Writes sparse graph attributes `.distances` and `.connectivities`.
Also writes `.knn_indices` and `.knn_distances` if
`write_knn_indices==True`.

@falexwolf
Copy link
Member

Hey Phil!

Nice, I played around with it! 😄

I think it's really cool and I only have some minor stylistic remarks.

  • Why do we want the **? The layout and the indenting highlights the variable names enough already. Also, Jupyter notebooks don't even interpret them.
  • Why don't we stick with the underlined sections? :Parameters: is a lot less pretty than the underlined counterpart.
  • Why do we indent? Jupyter's typical help box is very narrow and the output really gets more squashed. Also, there seem to be a lot of unnecessary newlines. Pasting tl.tsne here looks somewhat acceptable (though not nice). But invoking it in a Jupyter notebook doesn't look nice...
Signature: sc.tl.tsne(adata, n_pcs=None, use_rep=None, perplexity=30, early_exaggeration=12, learning_rate=1000, random_state=0, use_fast_tsne=True, n_jobs=None, copy=False)
Docstring:
t-SNE [Maaten08]_ [Amir13]_ [Pedregosa11]_.

t-distributed stochastic neighborhood embedding (tSNE) [Maaten08]_ has been
proposed for visualizating single-cell data by [Amir13]_. Here, by default,
we use the implementation of *scikit-learn* [Pedregosa11]_. You can achieve
a huge speedup and better convergence if you install `Multicore-tSNE
<https://github.com/DmitryUlyanov/Multicore-TSNE>`__ by [Ulyanov16]_, which
will be automatically detected by Scanpy.

:Parameters:

    **adata** : :class:`~anndata.AnnData`

        Annotated data matrix.

    **n_pcs** : `int` or `None`, optional (default: `None`)

        Use this many PCs. If `n_pcs==0` use `.X` if `use_rep is None`.

    **use_rep** : \{`None`, 'X'\} or any key for `.obsm`, optional (default: `None`)

        Use the indicated representation. If `None`, the representation is chosen
        automatically: for `.n_vars` < 50, `.X` is used, otherwise 'X_pca' is used.
        If 'X_pca' is not present, it's computed with default parameters.

    **perplexity** : `float`, optional (default: 30)

        The perplexity is related to the number of nearest neighbors that
        is used in other manifold learning algorithms. Larger datasets
        usually require a larger perplexity. Consider selecting a value
        between 5 and 50. The choice is not extremely critical since t-SNE
        is quite insensitive to this parameter.

    **early_exaggeration** : `float`, optional (default: 12.0)

        Controls how tight natural clusters in the original space are in the
        embedded space and how much space will be between them. For larger
        values, the space between natural clusters will be larger in the
        embedded space. Again, the choice of this parameter is not very
        critical. If the cost function increases during initial optimization,
        the early exaggeration factor or the learning rate might be too high.

    **learning_rate** : `float`, optional (default: 1000)

        Note that the R-package "Rtsne" uses a default of 200.
        The learning rate can be a critical parameter. It should be
        between 100 and 1000. If the cost function increases during initial
        optimization, the early exaggeration factor or the learning rate
        might be too high. If the cost function gets stuck in a bad local
        minimum increasing the learning rate helps sometimes.

    **random_state** : `int` or `None`, optional (default: 0)

        Change this to use different intial states for the optimization. If `None`,
        the initial state is not reproducible.

    **use_fast_tsne** : `bool`, optional (default: `True`)

        Use the MulticoreTSNE package by D. Ulyanov if it is installed.

    **n_jobs** : `int` or `None` (default: `sc.settings.n_jobs`)

        Number of jobs.

    **copy** : `bool` (default: `False`)

        Return a copy instead of writing to adata.

:Returns:

    Depending on `copy`, returns or updates `adata` with the following fields.

        

    **X_tsne** : `np.ndarray` (`adata.obs`, dtype `float`)

Now let's look at pp.neighbors where you're reading the type annotations from the signature.

  • Obviously, the signature itself now is a mess for humans to read. But ok, that's fine if the docstring is easy to read.
  • There is an error <class 'inspect._empty'>
  • The rest looks good to me, except for the superficial stylistic remarks above.
Signature: sc.pp.neighbors(adata:anndata.base.AnnData, n_neighbors:int=15, n_pcs:Union[int, NoneType]=None, use_rep:Union[str, NoneType]=None, knn:bool=True, random_state:Union[int, mtrand.RandomState, NoneType]=0, method:str='umap', metric:Union[str, Callable[[numpy.ndarray, numpy.ndarray], float]]='euclidean', metric_kwds:Mapping[str, Any]={}, copy:bool=False) -> Union[anndata.base.AnnData, NoneType]
Docstring:
Compute a neighborhood graph of observations [McInnes18]_.

The neighbor search efficiency of this heavily relies on UMAP [McInnes18]_,
which also provides a method for estimating connectivities of data points -
the connectivity of the manifold (`method=='umap'`). If `method=='diffmap'`,
connectivities are computed according to [Coifman05]_, in the adaption of
[Haghverdi16]_.

:Parameters:

    **adata** : AnnData, optional (default: <class 'inspect._empty'>)

        Annotated data matrix.

    **n_neighbors** : int, optional (default: 15)

        The size of local neighborhood (in terms of number of neighboring data
        points) used for manifold approximation. Larger values result in more
        global views of the manifold, while smaller values result in more local
        data being preserved. In general values should be in the range 2 to 100.
        If `knn` is `True`, number of nearest neighbors to be searched. If `knn`
        is `False`, a Gaussian kernel width is set to the distance of the
        `n_neighbors` neighbor.

    **n_pcs** : `int` or `None`, optional (default: `None`)

        Use this many PCs. If `n_pcs==0` use `.X` if `use_rep is None`.

    **use_rep** : \{`None`, 'X'\} or any key for `.obsm`, optional (default: `None`)

        Use the indicated representation. If `None`, the representation is chosen
        automatically: for `.n_vars` < 50, `.X` is used, otherwise 'X_pca' is used.
        If 'X_pca' is not present, it's computed with default parameters.

    **knn** : bool, optional (default: True)

        If `True`, use a hard threshold to restrict the number of neighbors to
        `n_neighbors`, that is, consider a knn graph. Otherwise, use a Gaussian
        Kernel to assign low weights to neighbors more distant than the
        `n_neighbors` nearest neighbor.

    **random_state** : typing.Union[int, mtrand.RandomState, NoneType]

        A numpy random seed.

    **method** : {'umap', 'gauss', `None`}  (default: `'umap'`)

        Use 'umap' [McInnes18]_ or 'gauss' (Gauss kernel following [Coifman05]_
        with adaptive width [Haghverdi16]_) for computing connectivities.

    **metric** : typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], float]], optional (default: 'euclidean')

        A known metric’s name or a callable that returns a distance.

    **metric_kwds** : Mapping

        Options for the metric.

    **copy** : bool

        Return a copy instead of writing to adata.

:Returns:

    Depending on `copy`, updates or returns `adata` with the following:

        

    **connectivities** : sparse matrix (`.uns['neighbors']`, dtype `float32`)

        Weighted adjacency matrix of the neighborhood graph of data
        points. Weights should be interpreted as connectivities.

    **distances** : sparse matrix (`.uns['neighbors']`, dtype `float32`)

        Instead of decaying weights, this stores distances for each pair of
        neighbors.
File:      ~/_hholtz/01_projects/1512_scanpy/scanpy/scanpy/neighbors/__init__.py
Type:      function

PS:

  • Already the docs show that Neighbors.compute_neighbors has invalid numpydoc... this was the case in several instances and I'm slowly fixing all of them... It's just a matter of adding \ at the line breaks.
  • I completely agree that the redundency between signature and docstring information lead to a a very small number of errors in the docstrings. However, in several instances, I'm setting the default value in the signature to None. But in the docstring, I'm giving the value to which this None evaluates in the default case (depending on what is passed)... There is quite a number of such cases. Clearly, one could replace all of them with 'auto' parameters, which is probably the better way of doing this. As the whole thing is backwards compat, this is not an immediate problem

@flying-sheep
Copy link
Member Author

about your first questions: i’m using the numpydoc extension to parse the strings into a structure that I can modify. the normal way to render this structure back to a string is doing str(numpydoc_object).

This however doesn’t give us the original format, but instead the reStructuredText outpt of numpydoc.

What do you mean with auto parameters? using None is the pythonic way to do things like this. i.e. if a simple string or int or so isn’t sufficient to explain the default case, use None and do something complex, which you then describe in the docs.

@falexwolf
Copy link
Member

OK, got the formatting issues. Let's discuss at the office.

Let's stick with None, this just requires to rewrite a very small number of strings... will not be a problem.

@flying-sheep
Copy link
Member Author

flying-sheep commented Jul 12, 2018

So I switched everything to Napoleon after all. Far simpler and prettier.

I no longer parse the docstrings, and opted for a different approach: If a line contains only the name of a parameter, it will be amended with type info. Since no fancy parsing is involved, that could theoretically go wrong.

On the other hand, the docstrings now stay in the same format and the probability for a line to consist of just the parameter name (and e.g. no . behind it) is relatively low and can be fixed when discovered.

@flying-sheep flying-sheep force-pushed the type-annotation branch 2 times, most recently from 7f3366f to a6c9c69 Compare July 12, 2018 13:15
@flying-sheep flying-sheep force-pushed the type-annotation branch 2 times, most recently from 5d2bfab to 8ddb075 Compare July 12, 2018 13:26
@flying-sheep
Copy link
Member Author

@falexwolf what do you say?

@falexwolf
Copy link
Member

Looks very good to me! Also the plain text docstrings look good.

A few small things remain, for instance that the AnnData type annotation does not resolve to :class:~anndata.AnnData and is hence no longer linked... It's not a dramatic thing, though.

@falexwolf falexwolf merged commit f9097fb into master Jul 13, 2018
@flying-sheep flying-sheep deleted the type-annotation branch July 13, 2018 14:49
@flying-sheep
Copy link
Member Author

Yeah, that’s sphinx-doc/sphinx#4826

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants