Documentation for data types.
I finally got around to writing a documentation page for the different types of
data processed by the library. Next on my TODO list, the page on
regularization.
lucasmaystre committed Nov 13, 2017
1 parent 2df30c2 commit e0a8bf4
Showing 3 changed files with 109 additions and 14 deletions.
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -64,9 +64,9 @@
# built documents.
#
# The short X.Y version.
-version = '0.1'
+version = '0.2'
# The full version, including alpha/beta/rc tags.
-release = '0.1.0'
+release = '0.2.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
115 changes: 103 additions & 12 deletions docs/data.rst
@@ -1,40 +1,131 @@
Types of Data
=============

In order to simplify the code and speed up the implementation of algorithms,
``choix`` assumes that items are identified by consecutive integers ranging
from ``0`` to ``n_items - 1``.
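
Real-world data rarely comes with integer identifiers, so a small re-encoding step is usually needed first. The following is a minimal sketch of one way to do it; the names ``raw_comparisons``, ``index_of`` and the sample labels are hypothetical and not part of ``choix``.

```python
# Hypothetical raw data: each pair is (winner, loser), identified by name.
raw_comparisons = [("alice", "bob"), ("carol", "alice"), ("bob", "carol")]

# Collect the distinct labels and sort them so the mapping is deterministic.
labels = sorted({name for pair in raw_comparisons for name in pair})

# Assign each label an integer in the range 0 .. n_items - 1.
index_of = {name: idx for idx, name in enumerate(labels)}
n_items = len(labels)

# Re-encode every comparison with integer identifiers, keeping the order.
data = [[index_of[w], index_of[l]] for w, l in raw_comparisons]
```

Keeping ``labels`` around makes it possible to translate integer identifiers back into the original names later (``labels[idx]``).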

Data processed by the inference algorithms in the library consist of outcomes
of comparisons between subsets of items. Specifically, four types of
observations are supported.


.. _data-pairwise:

Pairwise comparisons
--------------------

In the simplest (and perhaps the most widely used) case, the data consist of
outcomes of comparisons between *two* items. Mathematically, we represent the
event "item :math:`i` wins over item :math:`j`" as

.. math::

  i \succ j.

In Python, we simply represent this event using a list with two integers:

.. code-block:: python

  [i, j]

By convention, the first element of the list represents the item which *wins*,
and the second element the item which *loses*.

The statistical model that ``choix`` postulates for pairwise-comparison
data is usually known as the *Bradley–Terry model*. Given parameters
:math:`\theta_1, \ldots, \theta_n`, and two items :math:`i` and :math:`j`, the
probability of the outcome :math:`i \succ j` is

.. math::

  p(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}.

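
To make the formula concrete, here is a minimal sketch that evaluates the Bradley–Terry probability directly. The function name ``pairwise_probability`` and the parameter values are assumptions made for illustration; this is not the library's implementation.

```python
import math


def pairwise_probability(theta, i, j):
    """Probability that item i wins over item j, given parameters theta."""
    return math.exp(theta[i]) / (math.exp(theta[i]) + math.exp(theta[j]))


# Hypothetical parameters for three items, indexed 0, 1, 2.
theta = [0.0, 1.0, -0.5]

# One observation in the [winner, loser] format described above.
observation = [1, 0]
p = pairwise_probability(theta, *observation)
```

The two possible outcomes of a comparison are complementary: ``pairwise_probability(theta, i, j)`` and ``pairwise_probability(theta, j, i)`` sum to one.
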
.. _data-top1:

Top-1 lists
-----------

Another case arises when the data consist of choices of *one* item over
*several* other items. We call this type of data a *top-1 list*. Compared to
pairwise comparisons, this type of data is no longer restricted to comparing
only two items: comparisons can involve sets of alternatives of any size
between 2 and ``n_items``. We denote the outcome "item :math:`i` is chosen over
items :math:`j, \ldots, k`" as

.. math::

  i \succ \{j, \ldots, k\}.

In Python, we represent this event using a list with two elements:

.. code-block:: python

  [i, {j, ..., k}]

The first element of the list is an integer that represents the *chosen* item,
whereas the second element is a set containing the "losing" items. Note that
this set does *not* include the winning item.

The statistical model that ``choix`` uses for these data is a straightforward
extension of the Bradley–Terry model, presented by R. Duncan Luce (1959).
Given parameters :math:`\theta_1, \ldots, \theta_n`, a chosen item :math:`i`
and losing alternatives :math:`j, \ldots, k`, the probability of the
corresponding outcome is

.. math::

  p(i \succ \{j, \ldots, k\}) = \frac{e^{\theta_i}}{
      e^{\theta_i} + e^{\theta_j} + \cdots + e^{\theta_k}}.

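
As an illustration, the choice probability can be computed in a few lines. This is only a sketch: ``top1_probability`` and the parameter values are hypothetical names and numbers, not part of the ``choix`` API.

```python
import math


def top1_probability(theta, chosen, losers):
    """Probability that `chosen` is picked over the items in `losers`."""
    # Exponentiated parameter of every alternative, winner included.
    weights = {k: math.exp(theta[k]) for k in {chosen} | set(losers)}
    return weights[chosen] / sum(weights.values())


# Hypothetical parameters for four items, indexed 0 to 3.
theta = [0.0, 1.0, -0.5, 0.3]

# One observation in the [chosen, {losers}] format described above.
observation = [1, {0, 3}]
p = top1_probability(theta, *observation)
```

With a single losing item, the expression reduces to the pairwise-comparison probability, which is why the model is described as an extension of Bradley–Terry.
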
.. _data-rankings:

Rankings
--------

Instead of observing a single choice, we might have observations that consist
of a (partial) *ranking* over a set of alternatives. This leads to a third type
of data. We denote the event "item :math:`i` wins over item :math:`j` ... wins
over item :math:`k`" as

.. math::

  i \succ j \succ \ldots \succ k.

In Python, we represent this as a list:

.. code-block:: python

  [i, j, ..., k]

The list contains the subset of items in decreasing order of preference. For
example, the list ``[2, 0, 4]`` corresponds to a ranking where ``2`` is first,
``0`` is second, and ``4`` is third.

In this case, the statistical model that ``choix`` uses is usually referred to
as the *Plackett–Luce* model. Given parameters :math:`\theta_1, \ldots,
\theta_n` and items :math:`i, j, \ldots, k`, the probability of a given ranking
is

.. math::

  p(i \succ j \succ \ldots \succ k) =
      \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j} + \cdots + e^{\theta_k}}
      \cdot \frac{e^{\theta_j}}{e^{\theta_j} + \cdots + e^{\theta_k}}
      \cdots.

The attentive reader will notice that this probability is the product of the
probabilities of a sequence of independent top-1 choices, each made among the
alternatives that remain once the previously ranked items have been removed.
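
The product structure can be sketched as a loop over the positions of the ranking. The function ``ranking_probability`` and the parameter values below are assumptions chosen for illustration; they are not taken from the library.

```python
import itertools
import math


def ranking_probability(theta, ranking):
    """Probability of observing `ranking` (best item first)."""
    prob = 1.0
    # The last position is a choice among one item and contributes a factor 1.
    for pos in range(len(ranking) - 1):
        remaining = ranking[pos:]  # items not yet ranked, current winner first
        denom = sum(math.exp(theta[k]) for k in remaining)
        prob *= math.exp(theta[ranking[pos]]) / denom
    return prob


# Hypothetical parameters for five items, indexed 0 to 4.
theta = [0.2, -1.0, 0.7, 0.0, -0.3]

# The partial ranking used in the text: 2 first, 0 second, 4 third.
p = ranking_probability(theta, [2, 0, 4])

# The probabilities of all orderings of the same subset sum to one.
total = sum(ranking_probability(theta, list(perm))
            for perm in itertools.permutations([2, 0, 4]))
```

Each factor in the loop is exactly the probability of a top-1 observation in which the item at position ``pos`` is chosen over the items ranked after it.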


.. _data-network:

Choices in a network
--------------------

The fourth type of data is slightly more involved. It enables the processing of
choices on networks based on marginal observations at the nodes of the network.
We defer to [MG17]_ for a thorough presentation of the observed data and of the
statistical model.
4 changes: 4 additions & 0 deletions docs/index.rst
@@ -26,6 +26,8 @@ data, using a variety of algorithms:
* Rank Centrality
* Approximate Bayesian inference with expectation propagation

An easy way to get started is by exploring the notebooks_!


Contents
--------
@@ -46,3 +48,5 @@ Indices and tables
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

.. _notebooks: https://github.com/lucasmaystre/choix/tree/master/notebooks
