adding dirichlet distribution #318

Open · wants to merge 7 commits into base: develop

Conversation


@mrityunjay-tripathi mrityunjay-tripathi commented Feb 27, 2020

I am adding the Dirichlet distribution in include/boost/math/distributions/. The Dirichlet distribution is multivariate, so vector-type inputs are needed, but the tests in test/compile_test/test_compile_result.hpp are for int, double, float, long double, unsigned. How can I add tests for the Dirichlet distribution?
TODO:

  • Add tests.
  • Add google benchmark.
  • Decision on VectorType and its implementation.
  • Cross-check the cdf function. (I could not find it, so I tried to derive it.)
  • Implementation of the skewness function.
  • Implementation of the kurtosis function.

Also, none of the distributions provide a way to generate random samples of the corresponding distribution; that is, no sampling function is implemented. In PyTorch, we can do something like,

>>> import torch
>>> import torch.distributions as dist
>>> d = dist.dirichlet.Dirichlet(torch.tensor([0.3, 0.3, 0.3]))
>>> d.sample()
tensor([0.7736, 0.1908, 0.0356])
>>> d.rsample(torch.tensor([5]))
tensor([[2.3072e-02, 9.4317e-01, 3.3756e-02],
        [1.8339e-01, 1.6803e-03, 8.1493e-01],
        [7.3136e-01, 1.9131e-01, 7.7323e-02],
        [2.1086e-01, 9.5684e-03, 7.7958e-01],
        [9.6672e-06, 4.6919e-02, 9.5307e-01]])
>>> 

Can we add such a feature, or is it already implemented somewhere?

@mrityunjay-tripathi (Author)

Also, none of the distributions provide a way to generate random samples of the corresponding distribution; that is, no sampling function is implemented. In PyTorch, we can do something like,

OK, got it. It is implemented in the boostorg/random repository.
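
For reference, a minimal sketch of drawing a Dirichlet sample with Boost.Random by normalizing independent gamma variates (the concentration values below are illustrative only, not taken from this PR):

#include <iostream>
#include <vector>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/gamma_distribution.hpp>

int main()
{
  std::vector<double> alpha = {0.3, 0.3, 0.3}; // illustrative concentration parameters
  boost::random::mt19937 gen;

  // Draw one sample: x_i = g_i / sum(g), where g_i ~ Gamma(alpha_i, 1).
  std::vector<double> x(alpha.size());
  double sum = 0;
  for (std::size_t i = 0; i < alpha.size(); ++i)
  {
    boost::random::gamma_distribution<double> g(alpha[i], 1.0);
    x[i] = g(gen);
    sum += x[i];
  }
  for (auto& xi : x)
  {
    xi /= sum; // components now sum to 1
  }
  for (double xi : x)
  {
    std::cout << xi << " ";
  }
  std::cout << "\n";
}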

@pabristow (Contributor)

Looks a really promising start.

It would be really nice if the Real defaults could be vector and all other parameters double.
This would make it appear to work quite like C, Python, R etc, out of the box.

For testing, \boost\libs\math\test\test_beta_dist.cpp is a slightly useful template. However, it is probably over-complicated, at least to get started on testing. I found it easier to start with testing just double values and then refactor later with other types and higher precision. This will at least provide some sanity checks.

(Although the usefulness of high precision for this distribution is unclear, our policy is to aim for tests with at least 128-bit precision (~36 decimal digits).)

There is also the important question of test values. Python could be used to generate some test values for a start. https://cran.r-project.org/web/packages/DirichletReg/DirichletReg.pdf might also be slightly useful but is ominously marked as 'orphaned' - but is quite new. Our favorite Wolfram doesn't seem to be able to help.

The beta distribution tests should be applicable where the two distributions are identical, so this might be the first thing to try?

We also like to have some example(s). Being able to show how to combine with Boost.Random to replicate the Wikipedia/Python example(s) would be good.

There are lots of tests that check for 'dodgy' input like negatives, NaN or inf. These check throwing (if the policy is the default). (And don't forget that any examples should use try-and-catch blocks so that the error messages are displayed.) You can probably leave me to tidy these up later.

@NAThompson (Collaborator)

Looks like a great start. I'll go through your other items on your checklist later.

@mrityunjay-tripathi (Author)

Hi @pabristow @NAThompson, can we use something like validate_args to let the user decide whether to check argument validity? Currently we validate the arguments whenever an instance of the object is created, which is inefficient.

@pabristow (Contributor)

I have a vague recollection that we had a complaint similar to this many years ago. In the end we decided to keep the apparently redundant check because there were circumstances in which dropping it would allow a 'bad' value to sneak in. Checking seems quite cheap, as is construction generally?

@mrityunjay-tripathi (Author)

Right. If some 'invalid' value comes in for reasons unknown to the user, it would be a disaster.

@NAThompson (Collaborator)

@mrityunjay-tripathi Could you write a google benchmark and show how fast it is with and without the argument validation? This is a really hard question to answer without hard data. An example is here.
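
A sketch of what such a benchmark could look like (it assumes the dirichlet_distribution API from this PR; a second variant with the checks compiled out would be needed for the actual comparison):

#include <vector>
#include <benchmark/benchmark.h>
#include <boost/math/distributions/dirichlet.hpp>

static void BM_dirichlet_pdf(benchmark::State& state)
{
  std::vector<double> alpha = {0.4, 5.0, 15.0};
  std::vector<double> x = {0.2, 0.2, 0.6};
  boost::math::dirichlet_distribution<std::vector<double>> d(std::move(alpha));
  for (auto _ : state)
  {
    // Time a single pdf evaluation per iteration.
    benchmark::DoNotOptimize(boost::math::pdf(d, x));
  }
}
BENCHMARK(BM_dirichlet_pdf);
BENCHMARK_MAIN();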

@NAThompson (Collaborator)

@jzmaddock, @pabristow: For scalar distributions, using free functions to compute kurtosis, range, and so on is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these go from free functions acting on the class to member functions?

@NAThompson (Collaborator)

@mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".
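
A test following that pattern might look roughly like this (a sketch only; it assumes the dirichlet_distribution interface from this PR, and the expected values are just the textbook mean of Dirichlet(2, 3)):

#include "math_unit_test.hpp"
#include <vector>
#include <boost/math/distributions/dirichlet.hpp>

template <typename Real>
void test_mean()
{
  // The mean of Dirichlet(2, 3) is (2/5, 3/5).
  std::vector<Real> alpha{Real(2), Real(3)};
  boost::math::dirichlet_distribution<std::vector<Real>> d(std::move(alpha));
  auto m = boost::math::mean(d);
  CHECK_ULP_CLOSE(Real(2) / Real(5), m[0], 3);
  CHECK_ULP_CLOSE(Real(3) / Real(5), m[1], 3);
}

int main()
{
  test_mean<double>();
  test_mean<long double>();
  return boost::math::test::report_errors();
}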

@mrityunjay-tripathi (Author)

@mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".

This is clear now. I will soon add the tests.

@mrityunjay-tripathi (Author)

@mrityunjay-tripathi Could you write a google benchmark and show how fast it is with and without the argument validation? This is a really hard question to answer without hard data. An example is here.

I am not familiar with 'google benchmark', but ready to delve into it. As soon as I add the tests, I will try to write the benchmark as well.

@pabristow (Contributor)

@jzmaddock, @pabristow: For scalar distributions, using free functions to compute kurtosis, range, and so on is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these go from free functions acting on the class to member functions?

@jzmaddock devised this scheme - with good reasons - so he is best qualified to comment on this change to multi-thingys.

@pabristow (Contributor)

pabristow commented Feb 29, 2020 via email

@jzmaddock (Collaborator)

For scalar distributions, using free functions to compute kurtosis, range, and so on is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these go from free functions acting on the class to member functions?

For better or worse I'd prefer to keep to the current pattern if possible. Implementation- and usage-wise, I think it makes very little practical difference.

I confess to a certain amount of hesitation in adding something that's not entirely mainstream as our first multivariate distribution. I had always assumed that the multivariate normal would come first, but no matter I guess.

With regard to the CDF, I note that the multivariate normal has two different possible definitions, and I assume that is the case here also. I think we should be careful about treading on shaky ground here. Likewise, I assume there is no real notion of a quantile, not least because there is unlikely to be a unique vector of x-values corresponding to any given probability.

@NAThompson (Collaborator)

I am not familiar with 'google benchmark', but ready to delve into it. As soon as I add the tests, I will try to write the benchmark as well.

You will definitely enjoy adding google benchmark to your toolkit. It's amazing.

@NAThompson (Collaborator)

NAThompson commented Feb 29, 2020

I confess to a certain amount of hesitation in adding something that's not entirely mainstream as our first multivariate distribution. I had always assumed that the multivariate normal would come first, but no matter I guess.

I think the problem is that the multivariate normal factors into products so that there is little demand for such a thing. Writing down a useful definition of the cdf that translates into an ergonomic API is going to be a problem though...

@mrityunjay-tripathi (Author)

Likewise, I assume there is no real notion of a quantile, not least because there is unlikely to be a unique vector of x-values corresponding to any given probability.

@jzmaddock I thought of using some random values that satisfy the condition, but it seems there is no practical use for this. I was including it only for the sake of completeness of the implementation.

@mrityunjay-tripathi (Author)

Writing down a useful definition of the cdf that translates into an ergonomic API is going to be a problem though...

@NAThompson: I've used this deduction to calculate the CDF.

[two images: derivation of the CDF]

Talking about efficiency, this distribution is obviously computationally expensive.

@NAThompson (Collaborator)

NAThompson commented Mar 1, 2020

./include/boost/math/distributions/dirichlet.hpp:53:20: error: need 'typename' before 'RandomAccessContainer::value_type' because 'RandomAccessContainer' is a dependent scope
   53 |   using RealType = RandomAccessContainer::value_type;
      |                    ^~~~~~~~~~~~~~~~~~~~~
      |                    typename

Also:

./include/boost/math/distributions/dirichlet.hpp:231:12: error: 'm_alpha' was not declared in this scope; did you mean 'alpha'?
  231 |   decltype(m_alpha.size()) &order() const { return m_alpha.size(); }
      |            ^~~~~~~
      |            alpha
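
The two diagnostics suggest fixes along these lines (a sketch only):

// Dependent type names need 'typename':
using RealType = typename RandomAccessContainer::value_type;

// order() presumably belongs inside the class, where m_alpha is visible;
// a deduced by-value return type avoids naming the container's size_type:
auto order() const { return m_alpha.size(); }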

@mrityunjay-tripathi (Author)

mrityunjay-tripathi commented Mar 2, 2020

I was trying this sample program against SciPy, but the values I am getting are not in sync with those of SciPy.
Not even the 'mean' is correct! What might be the reason?

#include <iostream>
#include <vector>
#include <boost/math/distributions/dirichlet.hpp>
using namespace std;

// Print the elements of a vector on one line.
void print(const std::vector<long double>& v)
{
  for (size_t i = 0; i < v.size(); ++i)
  {
    cout << v[i] << " ";
  }
  cout << endl;
}

int main()
{
  std::vector<long double> alpha = {0.4, 5.0, 15.0};
  std::vector<long double> x = {0.2, 0.2, 0.6};

  using VectorType = std::vector<long double>;

  boost::math::dirichlet_distribution<VectorType> diri(std::move(alpha));
  long double p = pdf(diri, x);
  long double c = cdf(diri, x);
  std::vector<long double> m = mean(diri);
  std::vector<long double> v = variance(diri);
  std::vector<long double> sd = standard_deviation(diri);
  long double en = entropy(diri);
  cout << "pdf: " << p << endl;
  cout << "cdf: " << c << endl;
  cout << "entropy: " << en << endl;

  cout << "mean: ";
  print(m);
  cout << "variance: ";
  print(v);
  cout << "standard deviation: ";
  print(sd);
}

The output:

pdf: 0.0203131
cdf: 6.90545e-05
entropy: -nan
mean: 0.02 0.25 0.75 
variance: 0.000933333 0.00892857 0.00892857 
standard deviation: 0.0305505 0.0944911 0.0944911 

@pabristow (Contributor)

I also wonder if we can dispense with passing a zero of the right type (required in the Dark Days of C++98) test_spots(std::vector<float>(0.0F)); and instead pass nothing, and be explicit
by calling test_spots<std::vector<float> >(); ... (This works OK)

(Since you have already upped the C++ standard version to 14, if not 17!)

Incidentally, is using c++14/17 really helpful - I'd much prefer to limit it to c++11 if possible?

@jzmaddock Boost.Math policy decision? Your important views?

@jzmaddock (Collaborator)

Incidentally, is using c++14/17 really helpful - I'd much prefer to limit it to c++11 if possible?

It depends.... I think we need to mark C++03 as deprecated in the next release - that way we can start removing it from old code next year sometime (!).

For new code, I'm reasonably relaxed about the compiler requirements, although it would be nice not to require stuff gratuitously. I guess recent GCCs default to C++14, as does MSVC, so that seems like a reasonable target unless there's something in C++17 that's essential?

@jzmaddock (Collaborator)

Should we use BOOST_MATH_TEST_VALUE ?

It's probably overkill unless you have extended precision values which you're certain are correct.

@pabristow (Contributor)

pabristow commented Mar 6, 2020 via email

@pabristow (Contributor)

Sadly, the Dirichlet distribution entry in Wolfram says "Contribute this entry." :-(

@pabristow (Contributor)

Correction of my bad - max_digits10 should be 21 for 80-bit long double, not 17 as I said before (17 applies only to MSVC, which can't be bothered to implement 80-bit long double). But I have yet to discover how to make SciPy output all those digits. Still, digits10 = 18 would be better than only 6 decimal digits of output!
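
On the C++ side, at least, full round-trip precision for an 80-bit long double is just a matter of printing with max_digits10 (21), e.g. (the value here is the Beta(2, 3) CDF at 0.3 from the notes above):

#include <iostream>
#include <iomanip>
#include <limits>

int main()
{
  long double x = 0.3483L; // value to print in full precision
  std::cout << std::setprecision(std::numeric_limits<long double>::max_digits10)
            << x << "\n";
}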

@mrityunjay-tripathi (Author)

mrityunjay-tripathi commented Mar 6, 2020

but there are two tests that don't throw when expected.
alpha[0] = 1.26;
alpha[1] = 2.99;
x[0] = 0.5;
x[1] = 0.75; // sum(x) > 1.0
BOOST_MATH_CHECK_THROW(boost::math::pdf(dirichlet_distribution(std::move(alpha)), x), std::domain_error);
and
alpha[0] = 1.56;
alpha[1] = 4.00;
x[0] = 0.31;
x[1] = 0.71; // sum(x) > 1.
BOOST_MATH_CHECK_THROW(boost::math::cdf(dirichlet_distribution(std::move(alpha)), x), std::domain_error);

@pabristow: I see why this is the case. The check on the sum of x just returns false if the condition is not satisfied, but doesn't raise the error:
return accumulate(x.begin(), x.end(), 0.0) <= 1.0; (line 95)
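
For reference, the usual Boost.Math pattern is for such a check to report the failure through the policy framework (which throws, returns NaN, and so on, depending on the policy) rather than silently returning false. A rough sketch, with names loosely modelled on this PR rather than copied from it:

#include <numeric>
#include <boost/math/policies/error_handling.hpp>

template <class RandomAccessContainer, class Policy>
bool check_x_sum(const char* function,
                 const RandomAccessContainer& x,
                 typename RandomAccessContainer::value_type* result,
                 const Policy& pol)
{
  using RealType = typename RandomAccessContainer::value_type;
  RealType sum = std::accumulate(x.begin(), x.end(), RealType(0));
  if (sum > RealType(1))
  {
    // Route the error through the policy framework; with the default policy
    // this throws std::domain_error.
    *result = boost::math::policies::raise_domain_error<RealType>(
      function,
      "The quantiles must sum to at most 1, but their sum is %1%.",
      sum, pol);
    return false; // the caller then returns *result
  }
  return true;
}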

Boost has to cope with systems without ‘throwability’ (eg embedded systems, real-time, IOT…).

Never thought about that!
And thank you for your constant support :)

@mrityunjay-tripathi (Author)

Why do these tests pass even if there are several errors? Am I missing something?

@pabristow (Contributor)

I don't think that the jamfile.v2 includes dirichlet tests or examples yet.

I will do this and check it out locally using the b2 build system with a variety of gcc, clang and msvc versions and C++ standard variants too. Job for next week.

(I think that it is premature to be using the CI suite yet; I'm only testing locally. When pushing, including the string [CI SKIP] in the commit message will avoid triggering the long-suffering and very tired and overworked CI systems.)

@pabristow (Contributor)

I'm also trying to use beta distribution (which works with Boost.Multiprecision) for some really high precision tests.

https://github.com/bertcarnell/dirichlet/blob/7954c574e2bb1359883e8d4b810d266acc6c5a35/tests/testthat/test-dirichlet.r

says
it("A two parameter Dirichlet is equivalent to a beta", {
expect_equal(ddirichlet(c(0.3, 0.7), c(2, 3)), dbeta(0.3, 2, 3))

and this seems to work for PDF, but so far I get a different value for CDF.

I've been playing with

//   dirichlet outputs             cdf: 0.061739999999999982  ??????????????
// https://www.wolframalpha.com/input/?i=betaregularized%280.3%2C+0.3%2C+0.7%29
// I0.3(0.3,0.7) = 0.6122328397117941119545947029213687501316859461148673
// alpha 0.3, beta 0.7
// (beta(0.3, alpha, beta) gamma(alpha + beta))/(gamma(alpha) gamma(beta))
// betadistribution(0.3,0.7) 
// mean  0.3, mode = 0.7, sd = 0.324037, variance = 0.105, skew = 0.822951 

// N[PDF[betadistribution(0.3,0.7), 0.3], 55] =  0.6657228773303705085295919960559028800560346898849140226
// N[PDF[betadistribution(2, 3), 0.3], 55] 1.764000000000000000000000000000000000000000000000000000
// N[CDF[betadistribution(2, 3), 0.3], 55] 0.3483000000000000000000000000000000000000000000000000000

// N[mode[betadistribution(2, 3), 0.3], 55] shows 
// mean 0.4, mode = 0.3333, sd = 1/5 = 0.2
// variance = 1/25 = 0.04, skewness = 2/7 = 0.285714
// PDF is as expected, but not CDF whose plot shows about 0.35

I'm not sure if this is feasible and I suspect my ignorance and inexpertise. Views from mathy experts?

I also would hope that one can do the equivalent of this for the Dirichlet:

// Check beta distribution round-tripping:
double cq = quantile(b, cdf(b, 0.3)); // 
show_value(cq, "quantile(b, cdf(b, 0.3) = ", "\n"); // quantile(b, cdf(b, 0.3) = 0.29999999999999999
BOOST_ASSERT_MSG(0.3000000000000000 == cq, "round_trip failed!");


double qc = cdf(b, quantile(b, 0.3)); // 
show_value(qc, "cdf(b, quantile(b, 0.3)) = ", "\n"); // cdf(b, quantile(b, 0.3)) = 0.30000000000000004
BOOST_ASSERT_MSG(0.3000000000000000 == qc, "round_trip failed!");

How would one do this for a multivariate distribution (in hand-waving mode ;-))?
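
For the k = 2 case at least, a direct comparison against the beta distribution is easy to write; a sketch, assuming the dirichlet_distribution API from this PR, where Dirichlet(2, 3) at (0.3, 0.7) should match Beta(2, 3) at 0.3, i.e. 1.764:

#include <iostream>
#include <vector>
#include <boost/math/distributions/beta.hpp>
#include <boost/math/distributions/dirichlet.hpp>

int main()
{
  std::vector<double> alpha = {2.0, 3.0};
  std::vector<double> x = {0.3, 0.7};

  boost::math::dirichlet_distribution<std::vector<double>> d(std::move(alpha));
  boost::math::beta_distribution<double> b(2.0, 3.0);

  std::cout << "dirichlet pdf: " << boost::math::pdf(d, x) << "\n";   // expect 1.764
  std::cout << "beta pdf:      " << boost::math::pdf(b, 0.3) << "\n"; // expect 1.764
}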

@mrityunjay-tripathi (Author)

mrityunjay-tripathi commented Mar 7, 2020

and this seems to work for PDF, but so far I get a different value for CDF.

My derivation of the CDF is definitely wrong, and it now seems really complicated to derive the CDF of the Dirichlet distribution, though I will try to work on it. SciPy also doesn't have a CDF implementation; now I see why.

@mrityunjay-tripathi (Author)

I got this answer on ResearchGate about the CDF of the Dirichlet distribution. Since the application for which I am using the Dirichlet distribution is mostly concerned with generating random samples, can we omit the CDF for now and wait to see if something (an approximation or an efficient derivation) comes up later?

@NAThompson (Collaborator)

@mrityunjay-tripathi : Why don't you try Cross-Validated and see if you get any better answers?

https://stats.stackexchange.com/

@pabristow (Contributor)

For the record, here is the helpful, if negative, reply.

"3rd Mar, 2020
Joachim Domsta
The State University of Applied Sciences in Elbląg,
Dear Mrityunjay Tripathi
do you mean the multivariate continuous Dirichlet distribution (extended beta)?
If yes, then do not be wondering, since then for the cpdf the formula is a very complicated combination of gamma multidimensional integrals. Moreover this is a pd concentrated on the simplex formed by points with positive coordinates summing up to 1In the case of 3D, this is concentrated on the triangle
x+y+z=1, where x,y,z >0.
What would you do with such function F(a,b,c) which is zero for all triples a,b,c, which sum up to value less than 1?
"

Sounds tricky, as my rusty maths quickly concluded.

typename RandomAccessContainer::value_type entropy, // entropy
typename RandomAccessContainer::value_type pdf, // pdf
typename RandomAccessContainer::value_type cdf, // cdf
typename RandomAccessContainer::value_type tol) // Test tolerance.
{
@mrityunjay-tripathi (Author)

I removed mode, skewness, kurtosis, and cdf because I don't have predefined values for them. I thought of adding tests first for those for which we have predefined values.

Comment on lines +191 to +200
alpha[0] = static_cast<RealType>(5.310948003052013);
alpha[1] = static_cast<RealType>(8.003963132298916);
x[0] = static_cast<RealType>(0.35042614416132284);
x[1] = static_cast<RealType>(0.64957385583867716);
mean[0] = static_cast<RealType>(0.398872207937724);
mean[1] = static_cast<RealType>(0.601127792062276);
var[0] = static_cast<RealType>(0.016749888798155716);
var[1] = static_cast<RealType>(0.016749888798155716);
pdf = static_cast<RealType>(2.870121181949622);
entropy = static_cast<RealType>(-0.6347509574442718);
@mrityunjay-tripathi (Author) commented Mar 9, 2020

I couldn't get higher precision than this (numpy.float64), because the gamma function in SciPy doesn't support numpy.float128/numpy.longdouble.

@NAThompson (Collaborator) left a comment

Can’t you test kurtosis/variance with a scaling relation?

Var(kX) = k^2Var(X)?
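
For reference, the standard closed-form moments that such tests could also check against (with alpha_0 = sum of the alpha_i) are:

E[X_i] = alpha_i / alpha_0
Var(X_i) = alpha_i * (alpha_0 - alpha_i) / (alpha_0^2 * (alpha_0 + 1))

These agree with the hard-coded mean and variance values quoted in the review comment above.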

@pabristow (Contributor)

I could generate some cpp_bin_float_50 values for this, but that is somewhat circular in that it assumes that your code is correct ;-)
These values would at least be a check on regressions from ill-advised edits in future.

Cunning-er tests like Nick's are a Good Idea. (Past experience is that it is all too easy to code these equations wrong :-( so cross checks are Really Useful.)
We should also be able to compare the beta_distribution results ('Known Good' from comparison with Wolfram) with Dirichlet when k = 2?

I've asked again about the Dirichlet CDF
https://stats.stackexchange.com/questions/57262/implementation-of-dirichlet-cdf/453526#453526

@pabristow (Contributor)

We also like to have some worked examples of using functions and distributions (often lifted from somewhere else so we can compare results).

Have you any examples yet? Perhaps using Boost.Random?

I am starting on documentation.

@mrityunjay-tripathi (Author)

Sorry, I couldn't get much done today because of the all-day celebration of 'Holi'.

I could generate some cpp_bin_float_50 values for this, but that is somewhat circular in that it assumes that your code is correct ;-)

@pabristow: The pdf, mean, variance, mode, and entropy are producing correct results now. I have to work on the cdf and recheck skewness and kurtosis. :)

I've asked again about the Dirichlet CDF

I got the equation for CDF but it's toooo big. I am trying to implement it as soon as possible.

We also like to have some worked examples of using functions and distributions (often lifted from somewhere else so we can compare results).
Have you any examples yet? Perhaps using Boost.Random?

Yes, I have, but I have yet to get familiar with Boost.Random.

I am starting on documentation.

That's nice.

@mrityunjay-tripathi (Author)

mrityunjay-tripathi commented Mar 10, 2020

We should also be able to compare the beta_distribution results ('Known Good' from comparison with Wolfram) with Dirichlet when k = 2?

It seems unlikely that we will get the same results, because the equation for the cdf of the Dirichlet is much more complex and has to be dealt with using an approximation (as far as I can tell). Maybe we could hard-code it for k <= 2 and use the other equation for k > 2?

@pabristow (Contributor)

https://www.researchgate.net/publication/220461721_On_numerical_calculation_of_probabilities_according_to_Dirichlet_distribution

looks too intimidating to be left as an exercise for the student ;-)

@mrityunjay-tripathi (Author)

Sorry for the delay in the implementation of the CDF function; I've almost completed it but need to change the API accordingly. I got busy with the project proposal for GSoC. The approximation for the Lauricella function mentioned in this paper is elaborated here.
Sorry that it might take another few days :(

@pabristow (Contributor)

Ah - that sounds extremely interesting and promising. Approximations like these are really useful.
I have been trying to get my head around the Gouda and Szantai paper, but I have yet to get access to the Butler and Wood paper. Comparing with the beta distribution will give some good sanity checks on your code.

@mrityunjay-tripathi (Author)

I've removed std::invalid_argument almost everywhere; I will see if it is still used somewhere.

Also, my institute is closed and I am on my way home due to this whole Corona thing. I too can't access the papers right now, though I downloaded some of them (if that's what you want) and can send them to you if you like.

@pabristow (Contributor)

I've searched harder and found the article, but have not even partially digested it.

I'm still struggling to re-understand how the error_handling works - don't worry about it for now. Getting the cdf to work would be excellent.

I've download lauricella.zip from https://faculty.smu.edu/rbutler
Ronald Butler C.F. Frensley Professor of Mathematical Sciences Department of Statistical Science
Southern Methodist University Dallas

contains a pdf for Laplace Approximation of Lauricella Functions FA, FD, and Φ2

lauricella_NM2.pdf contains a section 3.1.1 explaining how there is an error in the Exton paper (equation 10).

We may be able to get help from Andrew Wood at Nottingham, UK. This is above my 'pay grade' ;-)

https://www.smu.edu/statistics/faculty/butler.html is 404, but the download above has some .m programs that might be useful?

HTH

Good luck avoiding the virus - at least you are not in the Death Age Zone!

@evanmiller (Contributor)

For test values, Selected Tables in Mathematical Statistics, Volume IV (1977) has 200 pages of numerical tables dedicated to the Dirichlet distribution's CDF, inverse, mean, and variance. The book also includes a lengthy exposition on how to compute them.

@pabristow (Contributor)

Thanks for this. Covid-19 and other things have pushed the Dirichlet onto the back burner, but this looks like a very useful reference. Library loans are not functioning at present, but I hope to return to this in due course.
