I finally got around to writing a documentation page for the different types of data processed by the library. Next on my TODO list, the page on regularization.
1 parent 2df30c2, commit e0a8bf4
Showing 3 changed files with 109 additions and 14 deletions.
@@ -1,40 +1,131 @@
Types of Data
=============

In order to simplify the code and speed up the implementation of algorithms,
``choix`` assumes that items are identified by consecutive integers ranging
from ``0`` to ``n_items - 1``.

Data processed by the inference algorithms in the library consist of outcomes
of comparisons between subsets of items. Specifically, four types of
observations are supported.

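Real datasets rarely identify items by consecutive integers. As a minimal
sketch, assuming the raw observations are ``(winner, loser)`` pairs of string
labels (the names ``raw_comparisons`` and ``label_to_index`` are hypothetical
and not part of the library), one way to obtain identifiers in the expected
range is:

```python
# Hypothetical raw data: each pair is (winner, loser), identified by name.
raw_comparisons = [
    ("alice", "bob"),
    ("carol", "alice"),
    ("bob", "carol"),
    ("alice", "carol"),
]

# Assign each distinct label an integer, in order of first appearance.
label_to_index = {}
for winner, loser in raw_comparisons:
    for label in (winner, loser):
        label_to_index.setdefault(label, len(label_to_index))

n_items = len(label_to_index)

# Re-encode the observations with integer identifiers, winner first.
data = [[label_to_index[w], label_to_index[l]] for w, l in raw_comparisons]

print(n_items)  # 3
print(data)     # [[0, 1], [2, 0], [1, 2], [0, 2]]
```

Keeping ``label_to_index`` around makes it possible to translate inferred
parameters back to the original labels.
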
.. _data-pairwise:

Pairwise comparisons
--------------------

In the simplest (and perhaps the most widely used) case, the data consist of
outcomes of comparisons between *two* items. Mathematically, we represent the
event "item :math:`i` wins over item :math:`j`" as

.. math::

    i \succ j.

In Python, we simply represent this event using a list with two integers:

.. code-block:: python

    [i, j]

By convention, the first element of the list represents the item that *wins*,
and the second element the item that *loses*.

The statistical model that ``choix`` postulates for pairwise-comparison
data is usually known as the *Bradley–Terry model*. Given parameters
:math:`\theta_1, \ldots, \theta_n`, and two items :math:`i` and :math:`j`, the
probability of the outcome :math:`i \succ j` is

.. math::

    p(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}.

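The formula can be evaluated directly. The following is a small sketch, not
the library's implementation: the function name
``bradley_terry_probability`` and the parameter values are made up for
illustration, and the parameters are stored in a list indexed by item.

```python
import math


def bradley_terry_probability(theta, i, j):
    """Probability that item i wins over item j under the Bradley-Terry model."""
    return math.exp(theta[i]) / (math.exp(theta[i]) + math.exp(theta[j]))


theta = [0.0, 1.0, -0.5]  # hypothetical parameters for three items

p01 = bradley_terry_probability(theta, 0, 1)
p10 = bradley_terry_probability(theta, 1, 0)

print(round(p01, 4))  # 0.2689
print(round(p10, 4))  # 0.7311
```

The two outcomes of a comparison are complementary, so ``p01 + p10`` equals
one, and two items with equal parameters each win with probability one half.
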
.. _data-top1:

Top-1 lists
-----------

Another case arises when the data consist of choices of *one* item over
*several* other items. We call this type of data a *top-1 list*. Unlike
pairwise comparisons, this type of data is not restricted to comparing
only two items: comparisons can involve sets of alternatives of any size
between 2 and ``n_items``. We denote the outcome "item :math:`i` is chosen over
items :math:`j, \ldots, k`" as

.. math::

    i \succ \{j, \ldots, k\}.

In Python, we represent this event using a list with two elements:

.. code-block:: python

    [i, {j, ..., k}]

The first element of the list is an integer that represents the *chosen* item,
whereas the second element is a set containing the "losing" items. Note that
this set does *not* include the winning item.

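Raw choice data often record the full set of alternatives together with the
chosen item. As a sketch, a hypothetical helper ``make_top1`` (not part of the
library) could build an observation in the format above and enforce that the
winner is excluded from the set of losers:

```python
def make_top1(chosen, alternatives):
    """Build a top-1 observation from the chosen item and the full choice set."""
    alternatives = set(alternatives)
    if chosen not in alternatives:
        raise ValueError("the chosen item must be one of the alternatives")
    # The set of losers is the choice set without the winning item.
    losers = alternatives - {chosen}
    if not losers:
        raise ValueError("at least one losing alternative is required")
    return [chosen, losers]


# Item 2 is chosen among the alternatives 0, 2, 3 and 5.
observation = make_top1(2, [0, 2, 3, 5])
print(observation)
```
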
The statistical model that ``choix`` uses for these data is a straightforward
extension of the Bradley–Terry model that was presented by R. Duncan Luce
(1959). Given parameters :math:`\theta_1, \ldots, \theta_n`, item :math:`i` and
losing alternatives :math:`j, k, \ell, \ldots`, the probability of the
corresponding outcome is

.. math::

    p(i \succ \{j, \ldots, k\}) = \frac{e^{\theta_i}}{
        e^{\theta_i} + e^{\theta_j} + \cdots + e^{\theta_k}}.

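To make the formula concrete, here is a minimal sketch that evaluates it for a
top-1 observation. The function name ``top1_probability`` and the parameter
values are assumptions made for this example; the function is not the
library's API.

```python
import math


def top1_probability(theta, chosen, losers):
    """Probability that `chosen` is selected over the items in `losers`."""
    # The denominator sums over the whole choice set: winner and losers.
    denominator = math.exp(theta[chosen]) + sum(math.exp(theta[x]) for x in losers)
    return math.exp(theta[chosen]) / denominator


theta = [0.0, 1.0, -0.5, 2.0]  # hypothetical parameters for four items

# Probabilities of each possible winner within the choice set {0, 1, 2}.
choice_set = {0, 1, 2}
probs = {w: top1_probability(theta, w, choice_set - {w}) for w in choice_set}
total = sum(probs.values())

# With a single losing item, the model reduces to Bradley-Terry.
pairwise = top1_probability(theta, 0, {1})
```

The probabilities of the possible winners within a given choice set sum to
one, and with a single loser the expression coincides with the pairwise
formula.
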
.. _data-rankings:

Rankings
--------

Instead of observing a single choice, we might have observations that consist
of a (partial) *ranking* over a set of alternatives. This leads to a third type
of data. We denote the event "item :math:`i` wins over item :math:`j` ... wins
over item :math:`k`" as

.. math::

    i \succ j \succ \ldots \succ k.

In Python, we represent this as a list:

.. code-block:: python

    [i, j, ..., k]

The list contains the subset of items in decreasing order of preference. For
example, the list ``[2, 0, 4]`` corresponds to a ranking where ``2`` is first,
``0`` is second, and ``4`` is third.

In this case, the statistical model that ``choix`` uses is usually referred to
as the *Plackett–Luce* model. Given parameters :math:`\theta_1, \ldots,
\theta_n` and items :math:`i, j, \ldots, k`, the probability of a given ranking
is

.. math::

    p(i \succ j \succ \ldots \succ k) =
        \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j} + \cdots + e^{\theta_k}}
        \cdot \frac{e^{\theta_j}}{e^{\theta_j} + \cdots + e^{\theta_k}}
        \cdots.

The attentive reader will notice that this probability corresponds to that of
a sequence of independent top-1 lists over the remaining alternatives.
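
This decomposition can be sketched in a few lines. The helpers
``ranking_to_top1`` and ``plackett_luce_probability`` and the parameter values
below are hypothetical names chosen for illustration, not functions provided
by the library.

```python
import itertools
import math


def ranking_to_top1(ranking):
    """Split a ranking into the successive top-1 observations it implies."""
    # The last item is chosen from a set of size one, which carries no
    # information, so it does not produce an observation.
    return [[ranking[pos], set(ranking[pos + 1:])]
            for pos in range(len(ranking) - 1)]


def plackett_luce_probability(theta, ranking):
    """Probability of a ranking as a product of top-1 choice probabilities."""
    prob = 1.0
    for chosen, losers in ranking_to_top1(ranking):
        denominator = math.exp(theta[chosen]) + sum(
            math.exp(theta[x]) for x in losers)
        prob *= math.exp(theta[chosen]) / denominator
    return prob


theta = [0.5, 0.0, 1.0, 0.0, -1.0]  # hypothetical parameters for five items

steps = ranking_to_top1([2, 0, 4])
p = plackett_luce_probability(theta, [2, 0, 4])

# Summing over all orderings of the same three items.
total = sum(plackett_luce_probability(theta, list(perm))
            for perm in itertools.permutations([2, 0, 4]))
```

Here ``steps`` contains two observations: item ``2`` chosen over ``{0, 4}``,
then item ``0`` chosen over ``{4}``. The probabilities of all orderings of a
fixed set of items sum to one.
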

.. _data-network:

Choices in a network
--------------------

The fourth type of data is slightly more involved. It enables the processing of
choices on networks based on marginal observations at the nodes of the network.
We defer to [MG17]_ for a thorough presentation of the observed data and of the
statistical model.