forked from chrispiech/probabilityForComputerScientists
searchIndex.json
1 lines (1 loc) · 417 KB
[{"id": "intro", "title": "Introduction", "url": "intro/intro", "text": "\n \nIntroduction\n\nProbability is the math of the future. Your ability to program can illuminate the complexities of probability. But more than that, the intersection of coding and probability has created a beautiful field of its own. \n\n"}, {"id": "notation", "title": "Notation Reference", "url": "intro/notation", "text": "\n \nNotation Reference\n\nCore Probability\n\n\n\nNotation\nMeaning\n\n\n\n$E \\text{ or } F$\nCapital letters can denote events\n\n\n$A \\text{ or } B$\nSometimes they denote sets\n\n\n$|E| $\nSize of an event or set\n\n\n$E^C$\nComplement of an event or set\n\n\n$EF$\nAnd of events (aka intersection)\n\n\n$ E \\and F$\nAnd of events (aka intersection)\n\n\n$ E \\cap F$\nAnd of events (aka intersection)\n\n\n$ E \\or F$\nOr of events (aka union)\n\n\n $ E \\cup F$\nOr of events (aka union)\n\n\n$\\p(E)$\nThe probability of an event $E$\n\n\n$\\p(E|F)$\nThe conditional probability of an event $E$ given $F$\n\n\n$\\p(E,F)$\nThe probability of event $E$ and $F$\n\n\n$\\p(E|F,G)$\nThe conditional probability of an event $E$ given both $F$ and $G$\n\n\n$n!$\n$n$ factorial\n\n\n${n \\choose k}$\nBinomial coefficient\n\n\n${n \\choose {r_1,r_2,r_3} }$\nMultinomial coefficient\n\n\nRandom Variables\n\n\n\nNotation\nMeaning\n\n\n\n$x \\text{ or } y \\text{ or } i$\nLower case letters denote regular variables\n\n\n$X \\text{ or } Y$\nCapital letters are used to denote random variables\n\n\n$K$\nCapital $K$ is reserved for constants\n\n\n$\\E[X]$\nExpectation of $X$\n\n\n$\\Var(X)$\nVariance of $X$\n\n\n$\\p(X=x)$\nProbability mass function (PMF) of $X$, evaluated at $x$\n\n\n$\\p(x)$\nProbability mass function (PMF) of $X$, evaluated at $x$\n\n\n$f(X=x)$\nProbability density function (PDF) of $X$, evaluated at $x$\n\n\n$f(x)$\nProbability density function (PDF) of $X$, evaluated at $x$\n\n\n$f(X=x,Y=y)$\nJoint probability density\n\n\n$f(X=x|Y=y)$\nConditional probability 
density\n\n\n$F_X(x)$ or $F(x)$\nCumulative distribution function (CDF) of $X$\n\n\nIID\nIndependent and Identically Distributed\n\n\nParametric Distributions\n\n\n\nNotation\nMeaning\n\n\n\n$X \\sim \\Ber(p)$\n$X$ is a Bernoulli random variable\n\n\n$X \\sim \\Bin(n,p)$\n$X$ is a Binomial random variable\n\n\n$X \\sim \\Poi(p)$\n$X$ is a Poisson random variable\n\n\n$X \\sim \\Geo(p)$\n$X$ is a Geometric random variable\n\n\n$X \\sim \\NegBin(r, p)$\n$X$ is a Negative Binomial random variable\n\n\n$X \\sim \\Uni(a,b)$\n$X$ is a Uniform random variable\n\n\n$X \\sim \\Exp(\\lambda)$\n$X$ is an Exponential random variable\n\n\n$X \\sim \\Beta(a,b)$\n$X$ is a Beta random variable\n\n\n\n"}, {"id": "all_distributions", "title": "Random Variable Reference", "url": "intro/all_distributions", "text": "\n \nRandom Variable Reference\n\nDiscrete Random Variables\n\n<%\n include('templates/rvCards/bernoulli.html')\n\n\n<%\n include('templates/rvCards/binomial.html')\n\n\n<%\n include('templates/rvCards/poisson.html')\n\n\n<%\n include('templates/rvCards/geometric.html')\n\n\n<%\n include('templates/rvCards/negBinomial.html')\n\nContinuous Random Variables\n\n<%\n include('templates/rvCards/uniform.html')\n\n\n<%\n include('templates/rvCards/exponential.html')\n\n\n<%\n include('templates/rvCards/normal.html')\n\n\n<%\n include('templates/rvCards/beta.html')\n\n"}, {"id": "calculators", "title": "Calculators", "url": "intro/calculators", "text": "\n \nCalculators\n\n\n<%\ninclude('chapters/intro/calculators/factorial.html')\n\n<%\ninclude('chapters/intro/calculators/choose.html')\n\n<%\ninclude('chapters/intro/calculators/phi.html')\n\n<%\ninclude('chapters/intro/calculators/invPhi.html')\n\n<%\ninclude('chapters/intro/calculators/normCdf.html')\n\n<%\ninclude('chapters/intro/calculators/betaCdf.html')\n"}, {"id": "calculators", "title": "Calculators", "url": "intro/calculators", "text": "\nBeta CDF Calculator\n\n\nx\n\n\n\na\n\n\n\nb\n\n\n\n\nbeta.cdf(x, a, 
b)\n\n\u00a0\n\n\n"}, {"id": "calculators", "title": "Calculators", "url": "intro/calculators", "text": "\nPhi Calculator, $\\Phi(x)$\n\n\n\nx\n\n\n\n\nphi(x)\n\n\u00a0\n\n\n"}, {"id": "calculators", "title": "Calculators", "url": "intro/calculators", "text": "\nNorm CDF Calculator\n\n\nx\n\n\n\nmu\n\n\n\nstd\n\n\n\n\nnorm.cdf(x, mu, std)\n\n\u00a0\n\n\n"}, {"id": "calculators", "title": "Calculators", "url": "intro/calculators", "text": "\nCombination Calculator ${n \\choose k}$\n\n\n\nn\n\n\n\nk\n\n\n\n\ncombination(n,k)\n\n\u00a0\n\n\n"}, {"id": "calculators", "title": "Calculators", "url": "intro/calculators", "text": "\nInverse Phi Calculator, $\\Phi^{-1}(y)$\n\n\n\ny\n\n\n\n\ninverse_phi(y)\n\n\u00a0\n\n\n"}, {"id": "calculators", "title": "Calculators", "url": "intro/calculators", "text": "\nFactorial Calculator ${n!}$\n\n\n\nn\n\n\n\n\nfactorial(n)\n\n\u00a0\n\n\n"}, {"id": "counting", "title": "Counting", "url": "part1/counting", "text": "\n \nCounting\n\nAlthough you may have thought you had a pretty good grasp on the notion of counting at the age of\nthree, it turns out that you had to wait until now to learn how to really count. Aren\u2019t you glad you\ntook this class now?! But seriously, counting is like\nthe foundation of a house (where the house is all the great things we will do later in this book, such\nas machine learning). Houses are awesome. Foundations, on the other hand, are pretty much just\nconcrete in a hole. But don\u2019t make a house without a foundation. 
It won\u2019t turn out well.\nCounting with Steps\n\n\nDefinition: Step Rule of Counting (aka Product Rule of Counting)\nIf an experiment has two parts, where the first part can result in one of $m$ outcomes and the second part\ncan result in one of $n$ outcomes regardless of the outcome of the first part, then the total number of\noutcomes for the experiment is $m \\cdot n$.\n\n\nRewritten using set notation, the Step Rule of Counting states that if an experiment with two parts has an outcome\nfrom set $A$ in the first part, where $|A| = m$, and an outcome from set $B$ in the second part (where the number of outcomes in $B$ is the same regardless of the\noutcome of the first part), where $|B| = n$, then the total number of outcomes of the experiment is $|A||B| = m \\cdot n $.\n\n\nSimple Example: \n\t\tConsider a hash table with 100 buckets. Two arbitrary strings are independently hashed and added to the\ntable. How many possible ways are there for the strings to be stored in the table?\nEach string can be hashed to one of 100 buckets. Since the results of hashing the first string do not impact the\nhash of the second, there are 100 * 100 = 10,000 ways that the two strings may be stored in the hash table.\n\n\n\nPeter Norvig, the author of the canonical textbook \"Artificial Intelligence\", made the following compelling point on why computer scientists need to know how to count. To start, let\u2019s set a baseline for a really big number: The number of atoms in the observable universe, often estimated to be around 10 to the 80th power ($10^{80}$). There\ncertainly are a lot of atoms in the universe. As a leading expert said,\n\n\t\u201cSpace is big. Really big. You just won\u2019t believe how vastly, hugely, mind-bogglingly big it is.\nI mean, you may think it\u2019s a long way down the road to the chemist, but that\u2019s just peanuts to\nspace.\u201d - Douglas Adams\nThis number is often used to demonstrate tasks that computers will never be able to solve. 
Problems can\nquickly grow to an absurd size, and we can understand why using the Step Rule of Counting.\nThere is an art project to display every possible picture.\nSurely that would take a long time, because there must be many possible pictures. But how many? We will\nassume the color model known as True Color, in which each pixel can be one of $2^{24}$ \u2248 17 million distinct\ncolors.\nHow many distinct pictures can you generate from (a) a smartphone camera with 12 million pixels, (b) a\ngrid with 300 pixels, and (c) a grid with just 12 pixels?\n\n\n\n\n\n\n\tAnswer: We can use the Step Rule of Counting. An image can be created one pixel at a time, step by step. Each time we choose a pixel, we can select its color out of 17 million choices. An array of $n$ pixels produces (17 million)$^n$ different pictures. (17 million)$^{12}$ \u2248 $10^{86}$, so the tiny\n12-pixel grid produces a million times more pictures than the number of atoms in the universe! How about\nthe 300 pixel array? It can produce $10^{2167}$ pictures. You may think the number of atoms in the universe is big,\nbut that\u2019s just peanuts to the number of pictures in a 300-pixel array. And 12M pixels? $10^{86696638}$ pictures.\n\t\n\n\nExample: Unique states of Go\nA Go board has 19 \u00d7 19 points where a user can place a stone. Each of the points can be empty or occupied\nby a black or white stone. By the Step Rule of Counting, we can compute the number of unique board\nconfigurations.\nIn Go there are 19x19 points. Each point can have a black stone, white stone, or no stone at all.\n Here we are going to construct the board one point at a time, step by step. Each time we add a point we have a unique choice where we can decide to make the point one of three options:\n {Black, White, No Stone}. Using this construction we can apply the Step Rule of Counting. If there was only one point, there would be three unique board configurations. 
If there were four points you would have $3 \\cdot 3 \\cdot 3 \\cdot 3 = 81$ unique combinations. In Go there are $3^{(19\u00d719)} \u2248 10^{172}$ possible board positions. The way we constructed our board didn't take into account which ones were illegal by the rules of Go. It turns out that \"only\" about\n$10^{170}$ of those positions are legal. That is about the square of the number of atoms in the universe. In other words: if there was another universe of atoms for every single atom, only then would there be as many atoms\nin the universe as there are unique configurations of a Go board. \n\n\tAs a computer scientist this sort of result can be very important. While computers are powerful, an algorithm which needed to store each configuration of the board would not be a reasonable approach. No computer can store more information than atoms in the universe squared!\n\t\n\n\nThe above argument might leave you feeling like some problems are incredibly hard as a result of the product\nrule of counting. Let\u2019s take a moment to talk about how the product rule of counting can help! Most logarithmic\ntime algorithms leverage this principle. \n\n\tImagine you are building a machine learning system that needs to learn from data and you want to synthetically generate 10 million unique data points for it. How many steps would you need to encode to get to 10 million? Assuming that at each step you have a binary choice, the number of unique data points you produce will be $2^n$ by the Step Rule of Counting. If we choose $n$ such that $\\log_{2} 10,000,000 < n$, then $2^n$ exceeds 10 million, so you would only need to encode $n=24$ binary decisions.\n\t\n\n\nExample: Rolling two dice. Two 6-sided dice, with faces numbered 1 through 6, are rolled. How many possible\noutcomes of the roll are there?\n\nSolution: \n\t\tNote that we are not concerned with the total value of the two dice (\"die\" is the singular form of \"dice\"), but rather the set of\nall explicit outcomes of the rolls. 
Since the first die can come up with 6 possible values and the\nsecond die similarly can have 6 possible values (regardless of what appeared on the first die), the\ntotal number of potential outcomes is 36 (= 6 \u00d7 6). These possible outcomes are explicitly listed\nbelow as a series of pairs, denoting the values rolled on the pair of dice:\n\n\n\n\n\n\n\nCounting with or\n\n\tIf you want to consider the total number of unique outcomes, when outcomes can come from source $A$ or source $B$, then the equation you use depends on whether or not there are outcomes which are both in $A$ and $B$. If not, you can use the simpler \"Mutually Exclusive Counting\" rule. Otherwise you need to use the slightly more involved Inclusion Exclusion rule.\n\n\n\nDefinition: Mutually Exclusive Counting\nIf the outcome of an experiment can either be drawn from set $A$ or set $B$, where none of\nthe outcomes in set $A$ are the same as any of the outcomes in set $B$ (called mutual exclusion),\nthen there are $|A \\or B| = |A|+|B|$ possible outcomes of the experiment.\n\n\n\n\nExample: Sum of Routes. A route finding algorithm needs to find routes from Nairobi to Dar Es Salaam. It finds routes that either pass through Mt Kilimanjaro or Mombasa. There are 20 routes that pass through Mt Kilimanjaro, 15 routes that pass through Mombasa and 0 routes which pass through both Mt Kilimanjaro and Mombasa. How many routes are there total?\n\nSolution: \n\t\tRoutes can come from either Mt Kilimanjaro or Mombasa. The two sets of routes are mutually exclusive as there are zero routes which are in both groups. As such the total number of routes is addition: 20 + 15 = 35.\n\t\n\n\n\tIf you can show that two groups are mutually exclusive counting becomes simple addition. Of course not all sets are mutually exclusive. In the example above, imagine there had been a single route which went through both Mt Kilimanjaro and Mombasa. 
We would have double counted that route because it would be included in both the sets. If sets are not mutually exclusive, counting the or is still addition; we simply need to take into account any double counting.\n\t\n\n\nDefinition: Inclusion Exclusion Counting\nIf the outcome of an experiment can either be drawn from set $A$ or set $B$, and sets $A$ and $B$ may potentially\noverlap (i.e., it is not the case that $A$ and $B$ are mutually exclusive), then the number of outcomes of the experiment is\n$|A \\or B| = |A|+|B| \u2212|A \\and B|$.\n\n\nNote that the Inclusion-Exclusion Principle generalizes Mutually Exclusive Counting (also called the Sum Rule of Counting) for arbitrary sets $A$ and\n$B$. In the case where $A \\and B = \u2205$, the Inclusion-Exclusion Principle gives the same result as the Sum Rule of\nCounting since $|A \\and B| = 0$.\n\n\nExample: An 8-bit string (one byte) is sent over a network. The valid set of strings recognized by\nthe receiver must either start with \"01\" or end with \"10\". How many such strings are there?\n\nSolution: \n\t\tThe potential bit strings that match the receiver\u2019s criteria can either be the 64 strings that\nstart with \"01\" (since the last 6 bits are left unspecified, allowing for $2^6 = 64$ possibilities) or the 64\nstrings that end with \"10\" (since the first 6 bits are unspecified). Of course, these two sets overlap,\nsince strings that start with \"01\" and end with \"10\" are in both sets. There are $2^4$ = 16 such strings\n(since the middle 4 bits can be arbitrary). Casting this description into corresponding set notation,\nwe have: $|A|$ = 64, $|B|$ = 64, and $|A \\and B|$ = 16, so by the Inclusion-Exclusion Principle, there are\n64 + 64 \u2212 16 = 112 strings that match the specified receiver\u2019s criteria.\n\t\n\nOvercounting and Correcting\nOne strategy for counting is sometimes to overcount a solution and then correct for any duplicates. 
This is especially common when it is easier to generate all outcomes under some relaxed assumptions, or someone introduces constraints. If you can argue that you have over-counted each element the same multiple number of times, you can simply correct by using division. If you can count exactly how many elements were over-counted you can correct using subtraction.\nAs a simple example to demonstrate the point, let's revisit the problem of generating all images, but this time let's just have 4 pixels (2x2) and each pixel can only be blue or white. How many unique images are there? Generating any image is a four step process where you choose each pixel one at a time. Since each pixel has two choices there are $2^4 = 16$ unique images (they are not exactly Picasso \u2014 but hey it's 4 pixels):\n\nNow let's say we add in a new \"constraint\" that we only want to accept pictures which have an odd number of pixels turned blue. There are two ways of getting to the answer. You could start out with the original 16 and work out that you need to subtract off 8 images that have either 0, 2 or 4 blue pixels (which is easier to work out after the next chapter). Or you could have counted up using Mutually Exclusive Counting: there are 4 ways of making an image with 1 blue pixel and 4 ways of making an image with 3. Both approaches lead to the same answer, 8.\n\nNext let's add a much harder constraint: mirror indistinction. If you can flip any image horizontally to create another, they are no longer considered unique. For example these two both show up in our set of 8 odd-blue pixel images, but they are now considered to be the same (they are indistinct after a horizontal flip): \n\t\nHow many images have an odd number of blue pixels taking into account mirror indistinction? The answer is that for each unique image with an odd number of blue pixels, under this new constraint, you have counted it twice: itself and its horizontal flip. 
To convince yourself that each image has been counted exactly twice you can look at all of the examples in the set of 8 images with an odd number of blue pixels. Each image is next to one which is indistinct after a horizontal flip. Since each image was counted exactly twice in the set of 8, we can divide by two to get the updated count. If we list them out we can confirm that there are 8/2=4 images left after this last constraint:\n\nApplying any math (counting included) to novel contexts can be as much an art as it is a science. In the next chapter we will build a useful toolset from the basic first principles of counting by steps, and counting by \"or\".\n\n\n\n\n\n"}, {"id": "combinatorics", "title": "Combinatorics", "url": "part1/combinatorics", "text": "\n \nCombinatorics\n\nCounting problems can be approached from the basic building blocks described in the first section: Counting. However some counting problems are so ubiquitous in the world of probability that it is worth knowing a few\nhigher level counting abstractions. When solving problems, if you can find the analogy from these canonical\nexamples you can build off of the corresponding combinatorics formulas:\n\nPermutations of Distinct Objects\nPermutations with Indistinct Objects\nCombinations with Distinct Objects\nBucketing with Distinct Objects\nBucketing with Indistinct Objects\nBucketing into Fixed Sized Containers\n\nWhile these are by no means the only common counting paradigms, it is a helpful set.\n\nPermutations of Distinct Objects\n\n\nDefinition: Permutation Rule\nA permutation is an ordered arrangement of $n$ distinct objects. Those $n$ objects can\nbe permuted in $n \\cdot (n \u2013 1) \\cdot (n \u2013 2) \\cdots 2 \\cdot 1 = n!$ ways.\n\n\nThis changes slightly if you are permuting a subset of distinct objects, or if some of your objects\nare indistinct. We will handle those cases shortly! 
Note that unique is a synonym for distinct.\n\n\nExample: How many unique orderings of characters are possible for the string \"BAYES\"?\nSolution: Since the order of characters is important, we are considering all permutations of the 5 distinct\ncharacters B, A, Y, E, and S: $5! = 120$. Here is the full list:\nBAYES, BAYSE, BAEYS, BAESY, BASYE, BASEY, BYAES, BYASE, BYEAS, BYESA, BYSAE, BYSEA, BEAYS, BEASY, BEYAS, BEYSA, BESAY, BESYA, BSAYE, BSAEY, BSYAE, BSYEA, BSEAY, BSEYA, ABYES, ABYSE, ABEYS, ABESY, ABSYE, ABSEY, AYBES, AYBSE, AYEBS, AYESB, AYSBE, AYSEB, AEBYS, AEBSY, AEYBS, AEYSB, AESBY, AESYB, ASBYE, ASBEY, ASYBE, ASYEB, ASEBY, ASEYB, YBAES, YBASE, YBEAS, YBESA, YBSAE, YBSEA, YABES, YABSE, YAEBS, YAESB, YASBE, YASEB, YEBAS, YEBSA, YEABS, YEASB, YESBA, YESAB, YSBAE, YSBEA, YSABE, YSAEB, YSEBA, YSEAB, EBAYS, EBASY, EBYAS, EBYSA, EBSAY, EBSYA, EABYS, EABSY, EAYBS, EAYSB, EASBY, EASYB, EYBAS, EYBSA, EYABS, EYASB, EYSBA, EYSAB, ESBAY, ESBYA, ESABY, ESAYB, ESYBA, ESYAB, SBAYE, SBAEY, SBYAE, SBYEA, SBEAY, SBEYA, SABYE, SABEY, SAYBE, SAYEB, SAEBY, SAEYB, SYBAE, SYBEA, SYABE, SYAEB, SYEBA, SYEAB, SEBAY, SEBYA, SEABY, SEAYB, SEYBA, SEYAB\n\n\n\n\nExample: a smart-phone has a 4-digit passcode. Suppose there are 4 smudges over 4 digits on\nthe screen. How many distinct passcodes are possible?\nSolution: Since the order of digits in the code is important, we should use permutations. And since\nthere are exactly four smudges we know that each number in the passcode is distinct. Thus, we can plug in the\npermutation formula: 4! 
= 24.\n\t\n\nPermutations of Indistinct Objects\n\n\nDefinition: Permutations of Indistinct Objects\nGenerally when there are $n$ objects and:\n\n$n_1$ are the same (indistinguishable) and\n$n_2$ are the same and\n...\n$n_r$ are the same, then the number of distinct permutations is:\n\n\n$$\n\\text{Number of unique orderings} = \\frac{n!}{n_1!n_2!\\cdots n_r!}\n$$\n\n\n\n\nExample: How many distinct bit strings can be formed from three 0\u2019s and two 1\u2019s?\nSolution: 5 total digits would give 5! permutations. But that is assuming the 0\u2019s and 1\u2019s are\ndistinguishable (to make that explicit, let\u2019s give each one a subscript). Here are the $3! \\cdot 2! = 12$ different ways that we could have arrived at the identical string \"01100\" if we thought of each 0 and 1 as unique.\n\n\n\n\n\nSince identical digits are indistinguishable, all the listed permutations are the same. For any given\npermutation, there are 3! ways of rearranging the 0\u2019s and 2! ways of rearranging the 1\u2019s (resulting in\nindistinguishable strings). We have over-counted. Using the formula for permutations of indistinct\nobjects, we can correct for the over-counting:\n\n\n$$\n\\text{Total} = \\frac{5!}{3! \\cdot 2!} = \\frac{120}{6 \\cdot 2}\n= 10\n$$\n\n\n\n\n\nExample: How many distinct orderings of characters are possible for the string \"MISSISSIPPI\"?\n\n\n\nIn the case of the string \"MISSISSIPPI\", we should separate the characters into four distinct groups of indistinct characters: one \"M\", four \"I\"s, four \"S\"s, and two \"P\"s. The number of distinct orderings is: $$\\frac{11!}{1!4!4!2!} = 34,650$$\n\t\n\n\n\nExample: Consider the 4-digit passcode smart-phone from before. How many distinct passcodes are possible if there are 3 smudges over 3 digits on the screen?\n\nSolution: One of 3 digits is repeated, but we don't know which one. We can solve this by making three cases, one for each digit that could be repeated (each with the same number of permutations). 
Let $A, B, C$ represent the 3 digits, with $C$ repeated twice. We can initially pretend the two $C$'s are distinct $[A,B,C_1,C_2]$. Then each case will have 4! permutations: \nHowever, then we need to eliminate the double-counting of the permutations of the identical digits (one $A$, one $B$, and two $C$'s): $$\\frac{4!}{2!\\cdot 1!\\cdot 1!}$$\nAdding up the three cases for the different repeated digits gives\n$$3 \\cdot \\frac{4!}{2!\\cdot 1!\\cdot 1!} = 3 \\cdot 12 = 36$$\n\n\nPart B: What if there are 2 smudges over 2 digits on the screen?\n\n\nSolution: There are two possibilities: 2 digits used twice each, or 1 digit used 3 times, and other digit used once.\n$$\n\\frac{4!}{2!\\cdot 2!} + 2 \\cdot \\frac{4!}{3!\\cdot 1!} = 6 + (2 \\cdot 4) = 6 + 8 = 14\n$$\n\n\n\n\tYou can use the power of computers to enumerate all permutations. Here is sample python code which uses the built in itertools library:\n\t\n>>> import itertools\n\n# get all 4! = 24 permutations of 1,2,3,4 as a list:\n>>> list(itertools.permutations([1,2,3,4]))\n[(1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 2, 4), (1, 3, 4, 2), (1, 4, 2, 3), (1, 4, 3, 2), (2, 1, 3, 4), (2, 1, 4, 3), (2, 3, 1, 4), (2, 3, 4, 1), (2, 4, 1, 3), (2, 4, 3, 1), (3, 1, 2, 4), (3, 1, 4, 2), (3, 2, 1, 4), (3, 2, 4, 1), (3, 4, 1, 2), (3, 4, 2, 1), (4, 1, 2, 3), (4, 1, 3, 2), (4, 2, 1, 3), (4, 2, 3, 1), (4, 3, 1, 2), (4, 3, 2, 1)]\n\n# get all 3!/2! = 3 unique permutations of 1,1,2 as a set:\n>>> set(itertools.permutations([1,1,2]))\n{(1, 2, 1), (2, 1, 1), (1, 1, 2)}\nCombinations of Distinct Objects\n\n\nDefinition: Combinations\nA combination is an unordered selection of r objects from a set of n objects. 
If all objects\nare distinct, and objects are not \"replaced\" once selected, then the number of ways of making the selection is:\n\n\t$$\n\t\\text{Number of unique selections} = \\frac{n!}{r!(n-r)!} = {n \\choose r}\n\t$$\n\n\n\n\n\tHere are all the $10 = {5 \\choose 3}$ ways of choosing three items from a list of 5 unique numbers:\n\t# Get all ways of choosing three numbers from [1,2,3,4,5]\n>>> list(itertools.combinations([1,2,3,4,5], 3))\n[(1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 3, 4), (1, 3, 5), (1, 4, 5), (2, 3, 4), (2, 3, 5), (2, 4, 5), (3, 4, 5)]\n\nNotice how order doesn't matter. Since (1, 2, 3) is in the set of combinations, we don't also include (3, 2, 1) as this is considered to be the same selection. Note that this formula does not work if some of the objects are indistinct from one another.\nHow did we get the formula $\\frac{n!}{r!(n-r)!}$? Consider this general way to select $r$ unordered objects from a set of $n$\nobjects, e.g., \u201c7 choose 3\u201d:\n\nFirst consider permutations of all $n$ objects. There are $n!$ ways to do that.\nThen select the first $r$ in the permutation. There is one way to do that.\nNote that the order of $r$ selected objects is irrelevant. There are $r!$ ways to permute them. The\nselection remains unchanged.\n Note that the order of $(n \u2212 r)$ unselected objects is irrelevant. There are $(n \u2212 r)!$ ways to\npermute them. The selection remains unchanged.\n\t\n\n\t$$\n\t\\text{Total} = \\frac{n!}{r! \\cdot (n-r)!} = {n \\choose r}\n\t$$\n\n\n\nExample: In the Hunger Games, how many ways are there of choosing 2 villagers from district 12, which has a population of 8,000?\n\nSolution: This is a straightforward combinations problem. ${8000 \\choose 2} = 31,996,000$.\n\n\n\n\nPart A: How many ways are there to select 3 books from a set of 6?\nSolution: If each of the books is distinct, then this is another straightforward combination problem. 
There are $\\binom{6}{3} = \\frac{6!}{3!3!} = 20$ ways.\n\n\nPart B: How many ways are there to select 3 books if there are two books that should not both be chosen together? For example, if you are choosing 3 out of 6 probability books, don't choose both the 8th and 9th edition of the Ross textbook.\nSolution: This problem is easier to solve if we split it up into cases. Consider the following three different cases:\n\n\nCase 1: Select the 8th Ed. and 2 other non-9th Ed.: There are $\\binom{4}{2}$ ways of doing so.\n\nCase 2: Select the 9th Ed. and 2 other non-8th Ed.: There are $\\binom{4}{2}$ ways of doing so.\n\nCase 3: Select 3 from the books that are neither the 8th nor the 9th edition:\tThere are $\\binom{4}{3}$ ways of doing so.\n\nUsing our old friend the Sum Rule of Counting, we can add the cases:\n$$\n\\text{Total} = 2 \\cdot \\binom{4}{2} + \\binom{4}{3} = 16\n$$\n\nAlternatively, we could have calculated all the ways of selecting 3 books from 6, and then subtracted the \"forbidden\" ones (i.e., the selections that break the constraint). \n\n\nForbidden Case: Select 8th edition and 9th edition and 1 other book. There are $\\binom{4}{1}$ ways of doing so (which equals 4).\n\nTotal = All possibilities - forbidden = 20 - 4 = 16\n\n\nTwo different ways to get the same right answer!\n\n\nBucketing with Distinct Objects\n\nIn this section we are going to be counting the many different ways that we can think of stuffing elements into containers. (It turns out that Jacob Bernoulli was into voting and ancient Rome. And in ancient Rome they used urns for ballot boxes. For this reason many books introduce this through counting ways to put balls in urns.) This \"bucketing\" or \"group assignment\" process is a useful metaphor for many counting problems.\nThe most common case that we will want to consider is when all of the items you are putting into buckets are distinct. 
In that case you can think of bucketing as a series of steps, and employ the step rule of counting. The first step? You put the first distinct item into a bucket (there are number-of-buckets ways to do this). Second step? You put the second distinct item into a bucket (again, there are number-of-buckets ways to do this). \n\nBucketing Distinct Items:\nSuppose you want to place $n$ distinguishable items into $r$ containers. The number of ways of doing so is:\n\t\n\t$$r^n$$\n\n\tYou have $n$ steps (place each item) and for each item you have $r$ choices\n\n\n\n\nProblem: Say you want to put 10 distinguishable balls into 5 urns (No! Wait! Don't say that! Not urns!). Okay, fine. No urns. Say we are going to put 10 different strings into 5 buckets of a hash table. How many possible ways are there of doing this?\n\t\t\nSolution: You can think of this as 10 independent experiments each with 5 outcomes. Using our rule for bucketing with distinct items, this comes out to $5^{10}$.\n\n\n\nBucketing with Indistinct Objects\n\n\n\tWhile the previous example allowed us to put $n$ distinguishable objects into $r$ distinct groups, the more interesting problem is to work with $n$ indistinguishable objects. \n\n\nDivider Method:\nSuppose you want to place $n$ indistinguishable items into $r$ containers. The divider method works by imagining that you are going to solve this problem by sorting two types of objects, your $n$ original elements and $(r - 1)$ dividers. Thus, you are permuting $n + r - 1$ objects, $n$ of which are the same (your elements) and $r - 1$ of which are the same (the dividers). Thus the total number of outcomes is:\n\n$${}\\frac{(n+r-1)!}{n!(r-1)!} = \\binom{n+r-1}{n} = \\binom{n+r-1}{r-1}$$\n\nThe divider method can be derived via the \"Stars and Bars\" method. This is a creative construction where we consider permutations of indistinguishable items, represented by stars *, and dividers between our containers, represented by bars |. 
Any distinct permutation of these stars and bars represents a unique assignment of our items to containers.\n Imagine we want to separate 5 indistinguishable objects into 3 containers. We can think of the problem as finding the number of ways to order 5 stars and 2 bars *****||. Any permutation of these symbols represents a unique assignment. Here are a few examples:\n\n\n**|*|** represents 2 items in the first bucket, 1 item in the second and 2 items in the third.\n\n\n****||* represents 4 items in the first bucket, 0 items in the second and 1 item in the third.\n\n\n||***** represents 0 items in the first bucket, 0 items in the second and 5 items in the third.\n\n\n\nWhy are there only 2 dividers when there are 3 buckets? This is an example of a \"fence-post problem\". With 2 dividers you have created three containers. We already have a method for counting permutations with some indistinct items. For the example above where we have seven elements in our permutation ($n = 5$ stars and $r-1 = 2$ bars):\n\n\n$$\n\\text{Number of unique orderings} = \\frac{n!}{n_1! n_2!} = \\frac{(n+r-1)!}{n! (r-1)!} =\\frac{7!}{5!2!} = 21\n$$\n\n\n\n\nPart A: Say you are a startup incubator and you have \\$10 million to invest in 4 companies (in \\$1 million increments). How many ways can you allocate this money?\nSolution: This is just like putting 10 balls into 4 urns. Using the Divider Method we get:\n\n$$\n\\text{Total ways}= \\binom{10+4-1}{10} = \\binom{13}{10} = 286\n$$\n\nThis problem is analogous to solving the integer equation $x_1 + x_2 + x_3 + x_4 = 10$, where $x_i$ represents the investment in company $i$ such that $x_i \\geq 0$ for all $i = 1, 2, 3, 4$.\n\nPart B: What if you know you want to invest at least \\$3 million in Company 1?\nSolution: There is one way to give \\$3 million to Company 1. 
The number of ways of investing the remaining money is the same as putting 7 balls into 4 urns.\n\n$$\n\\text{Total Ways} = \\binom{7+4-1}{7} = \\binom{10}{7} = 120\n$$\n\nThis problem is analogous to solving the integer equation $x_1 + x_2 + x_3 + x_4 = 10$, where $x_1 \\geq 3$ and $x_2, x_3, x_4 \\geq 0$. To translate this problem into the integer solution equation that we can solve via the divider method, we need to adjust the bounds on $x_1$ such that the problem becomes $x_1 + x_2 + x_3 + x_4 = 7$, where $x_i$ is defined as in Part A.\n\n\nPart C: What if you don't have to invest all \\$10 M? (The economy is tight, say, and you might want to save your money.)\nSolution: Imagine that you have an extra company: yourself. Now you are investing \\$10 million in 5 companies. Thus, the answer is the same as putting 10 balls into 5 urns.\n\n$$\n\\text{Total}= \\binom{10+5-1}{10} = \\binom{14}{10} = 1001\n$$\n\nThis problem is analogous to solving the integer equation $x_1 + x_2 + x_3 + x_4 + x_5 = 10$, such that $x_i \\geq 0$ for all $i = 1, 2, 3, 4, 5$.\n\n\n\nBucketing into Fixed Sized Containers\n\n\nBucketing into Fixed Sized Containers: If $n$ objects are distinct, then the number of ways of putting them into $r$ groups of objects, such that group $i$ has size $n_i$, and $\\sum_{i=1}^{r} n_i = n$, is:\n$$\\frac{n!}{n_1!n_2!\\cdots n_r!} = \\binom{n}{n_1, n_2, \\dots, n_r}$$\n\nwhere $\\binom{n}{n_1, n_2, \\dots, n_r}$ is special notation called the multinomial coefficient.\n\n\n\nYou may have noticed that this is the exact same formula as \"Permutations With Indistinct Objects\". There is a deep parallel. One way to imagine assigning objects into their groups would be to imagine the groups themselves as objects. You have one object per \"slot\" in a group. So if there were two slots in group 1, three slots in group 2, and one slot in group 3 you could have six objects (1, 1, 2, 2, 2, 3). Each unique permutation can be used to make a unique assignment. 
\n\n\n\nProblem: Company Camazon has 13 distinct new servers that they would like to assign to 3 datacenters, where Datacenter A, B, and C have 6, 4, and 3 empty server racks, respectively. How many different divisions of the servers are possible?\n\n\n\nSolution: This is a straightforward application of our multinomial coefficient representation. Setting $n_1 = 6, n_2 = 4, n_3 = 3$, $\\binom{13}{6,4,3} = 60,060$.\n\nAnother way to do this problem would be from first principles of combinations as a multipart experiment. We first select the $6$ servers to be assigned to Datacenter A, in $\\binom{13}{6}$ ways. Now out of the $7$ servers remaining, we select the $4$ servers to be assigned to Datacenter B, in $\\binom{7}{4}$ ways. Finally, we select the $3$ servers out of the remaining $3$ servers, in $\\binom{3}{3}$ ways. By the Product Rule of Counting, the total number of ways to assign all servers would be $\\binom{13}{6} \\binom{7}{4} \\binom{3}{3} = \\frac{13!}{6!4!3!} = 60,060$.\n\n\n\n"}, {"id": "probability", "title": "Definition of Probability", "url": "part1/probability", "text": " \nDefinition of Probability\n\nWhat does it mean when someone makes a claim like \"the probability that you find a pearl in an oyster is 1 in 5,000\" or \"the probability that it will rain tomorrow is 52%\"?\nEvents and Experiments\nWhen we speak about probabilities, there is always an implied context, which we formally call the \"experiment\". For example: flipping two coins is something that probability folks would call an experiment. In order to precisely speak about probability, we must first define two sets: the set of all possible outcomes of an experiment, and the subset that we consider to be our event (what is a set?).\n\n\nDefinition: Sample Space, $S$ A Sample Space is the set of all possible outcomes of an experiment. 
For example:\n\n Coin flip: $S$ = {Heads, Tails}\n Flipping two coins: $S$ = {(H, H), (H, T), (T, H), (T, T)}\n Roll of 6-sided die: $S$ = {1, 2, 3, 4, 5, 6}\n The number of emails you receive in a day: $S = \\{x|x \u2208 \u2124, x \u2265 0\\}$ (non-neg. ints)\n YouTube hours in a day: $S = \\{x|x \u2208 \u211d,0 \u2264 x \u2264 24\\}$\n\n\n\n\n\nDefinition: Event, $E$ An Event is some subset of $S$ that we ascribe meaning to. In set notation, $E \u2286 S$. For example:\n\n Coin flip is heads: $E$ = {Heads}\n At least 1 head on 2 coin flips: $E$ = {(H, H), (H, T), (T, H)}\n Roll of die is 3 or less: $E$ = {1, 2, 3}\n You receive fewer than 20 emails in a day: $E = \\{x|x \u2208 \u2124,0 \u2264 x < 20\\}$ (non-neg. ints)\n Wasted day (\u2265 5 YouTube hours): $E = \\{x|x \u2208 \u211d, 5 \u2264 x \u2264 24\\}$\n\nEvents can be represented as capital letters such as $E$ or $F$.\n\n\nIn the world of probability, events are binary: they either happen or they don't.\nDefinition of Probability\nIt wasn't until the 20th century that humans figured out a way to precisely define what the word probability means:\n$$ \\p(\\text{Event}) \n = \\lim_{n \\rightarrow \\infty}\n \\frac\n {\\text{count}(\\text{Event})}\n {n} \n $$\nIn English this reads: let's say you perform $n$ trials of an \"experiment\" which could result in a particular \"Event\" occurring. The probability of the event occurring, $\\p(\\text{Event})$,\nis the ratio of trials that result in the event, written as $\\text{count}(\\text{Event})$, to the number of trials performed, $n$. In the limit, as your number of trials\napproaches infinity, the ratio will converge to the true probability. People also apply other semantics to the concept of a probability. One\ncommon meaning ascribed is that $\\p(E)$ is a measure of the chance of event $E$ occurring. \n\n\nMeasure of uncertainty: It is tempting to think of probability as representing some natural randomness in the world. That might be the case. 
But perhaps the world isn't random. I propose a deeper way of thinking about probability. There is so much that we as humans don't know, and probability is our robust language for expressing our belief that an event will happen given our limited knowledge.\nThis interpretation acknowledges your own uncertainty about an event. Perhaps if you knew the position of every water molecule, you could perfectly predict tomorrow's weather. But we don't have such knowledge and as such we use probability to talk about the chance of rain tomorrow given the information that we have access to.\nOrigins of probabilities: The different interpretations of probability are reflected in the many origins of probabilities that you will encounter in the wild (and not so wild) world. Some probabilities are calculated analytically using mathematical\nproofs. Some probabilities are calculated from data, experiments or simulations. Some probabilities are just\nmade up to represent a belief. Most probabilities are generated from a combination of the above. For example, someone will make up a prior belief, and that\nbelief will then be mathematically updated using data and evidence. Here is an example of calculating a probability from data: \n\n\n\nProbabilities and simulations: Another way to compute probabilities is via simulation. For some complex problems where the probabilities are too hard to compute analytically you can run\nsimulations using your computer.\nIf your simulations generate believable trials from the sample space, then the probability of an event $E$ is\napproximately equal to the fraction of simulations that produced an outcome from $E$. Again, by the definition\nof probability, as your number of simulations approaches infinity, the estimate becomes more accurate.\n\n\nProbabilities and percentages: You might hear people refer to a probability as a percent, for example that the probability of rain tomorrow is 32%. The proper way to state this would be to say that 0.32 is the probability of rain. 
Percentages are simply probabilities multiplied by 100. \"Percent\" comes from the Latin for \"out of one hundred\". \n\n\nProblem: Use the definition of probability to approximate the answer to the question: \"What is the probability a new-born elephant child is male?\" Contrary to what you might think, the gender outcomes of a newborn elephant are not equally likely between male and female. You have data from a report in Animal Reproductive Science which states that 3,070 elephants were born in Myanmar of which 2,180 were male [1]. Humans also don't have a 50/50 sex ratio at birth [2].\nAnswer:\n\tThe Experiment is: A single elephant birth in Myanmar. The sample space is the set of possible sexes assigned at birth, {Male, Female, Intersex}. $E$ is the event that a new-born elephant child is male, which in set notation is the subset {Male} of the sample space. The outcomes are not equally likely.\n By the definition of probability, the ratio of trials that result in the event to the number of trials will tend to our desired probability:\n$$ \\begin{aligned} \\p(\\text{Born Male}) &= \\p(E) \\\\\n\t\t&= \\lim_{n \\rightarrow \\infty}\\frac{\\text{count}(E)}{n} \\\\\n\t &\\approx \\frac{2,180}{3,070} \\\\\n\t &\\approx 0.710\\end{aligned}$$\nSince 3,000 is quite a bit less than infinity, this is an approximation. It turns out, however, to be a rather good one. A few important notes: there is no guarantee that our estimate applies to elephants outside Myanmar. 
Later in the class we will develop language for \"how confident we can be in a number like 0.71 after 3,000 trials?\" Using tools from later in class we can say that we have 98% confidence that the true probability is within 0.02 of 0.710.\n\nAxioms of Probability\nHere are some basic truths about probabilities that we accept as axioms:\n\n\n\n\nAxiom 1: $0 \u2264 \\p(E) \u2264 1$\nAll probabilities are numbers between 0 and 1.\n\n\nAxiom 2: $\\p(S) = 1$\nAll outcomes must be from the Sample Space.\n\n\nAxiom 3: If $E$ and $F$ are mutually exclusive, then $\\p(E \\text{ or } F) = \\p(E) + \\p(F)$\n The probability of \"or\" for mutually exclusive events\n\n\n\n\nThese three axioms are formally called the Kolmogorov axioms and they are considered to be the foundation of probability theory. They are also useful identities!\nYou can convince yourself of the first axiom by thinking about the math definition of probability. As you\nperform trials of an experiment it is not possible to get more events than trials (thus probabilities are at most\n1) and it's not possible to get fewer than 0 occurrences of the event (thus probabilities are at least 0).\nThe second axiom makes sense too. If your event is the sample space, then each trial must produce\nthe event. This is sort of like saying: the probability of you eating cake (event) if you eat cake (sample\nspace that is the same as the event) is 1.\n\nThe third axiom is more complex and in this textbook we dedicate an entire chapter to understanding it: Probability of or. It applies to events that have a special property called \"mutual exclusion\": the events do not share any outcomes.\n\n\nThese axioms have great historical significance. In the early 1900s it was not clear if probability was somehow different than other fields of math -- perhaps the set of techniques and systems of proofs from other fields of mathematics couldn't apply. 
Kolmogorov's great success was to show to the world that the tools of mathematics did in fact apply to probability. From the foundation provided by this set of axioms mathematicians built the edifice of probability theory.\nProvable Identities\nWe often refer to these as corollaries that are directly provable from the three\naxioms given above.\n\n\n\n\nIdentity 1: $\\p(E\\c) = 1 - \\p(E)$\nThe probability of event $E$ not happening\n\n\nIdentity 2: If $E \u2286 F$, then $\\p(E) \u2264 \\p(F)$\nEvents which are subsets\n\n\n\n\nThis first identity is especially useful. For any event, you can calculate the probability of the event not occurring, which we write in probability notation as $E\\c$, if you know the probability of it occurring -- and vice versa. We can also use this identity to show you what it looks like to prove a theorem in probability.\n\nProof: $\\p(E\\c) = 1 - \\p(E)$\n\n$$\n\\begin{align}\n\\p(S) &= \\p(E \\or E\\c) && \\text{$E$ or $E\\c$ covers every outcome in the sample space} \\\\\n\\p(S) &= \\p(E) + \\p(E\\c) && \\text{Events $E$ and $E\\c$ are mutually exclusive} \\\\\n1 &= \\p(E) + \\p(E\\c) && \\text{Axiom 2 of probability} \\\\\n\\p(E\\c) &= 1 - \\p(E) && \\text{By re-arranging}\n\\end{align}\n$$\n\n"}, {"id": "probability", "title": "Definition of Probability", "url": "part1/probability", "text": "\nExample: Probability in the limit\nHere we use the definition of probability to calculate the probability of event $E$, rolling a \"5\" or a \"6\" on a fair six-sided die. Hit the \"Run trials\" button to start running trials of the experiment \"roll dice\". 
Notice how $\\p(E)$ converges to $2/6$ or 0.33 repeating.\nEvent $E$: Rolling a 5 or 6 on a six-sided die.\n\n\n Run trials\nDice outcome: \n\n\n\n\n\n\n\n$n= $ 0\n$\\text{count}(E) = $ 0\n\n\n\t\t\t$ \\p(E) \n \\approx \n \\frac\n {\\text{count}(E)}\n {n} \n =\n $\n\t\t\t\n\n\n\n\n\n\n\n\n\n\n\n"}, {"id": "equally_likely", "title": "Equally Likely Outcomes", "url": "part1/equally_likely", "text": "\n \nEqually Likely Outcomes\n\nSome sample spaces have equally likely outcomes. We like those sample spaces, because there is a way\nto calculate probability questions about those sample spaces simply by counting. Here are a few examples\nwhere there are equally likely outcomes:\n\n\nCoin flip: S = {Heads, Tails}\n\tFlipping two coins: S = {(H, H), (H, T), (T, H), (T, T)}\n\tRoll of 6-sided die: S = {1, 2, 3, 4, 5, 6}\n\t\n\nBecause every outcome is equally likely, and the probability of the sample space must be 1, we can prove\nthat each outcome must have probability:\n$$\n\\p(\\text{an outcome}) = \\frac{1}{|S|}\n$$\nwhere $|S|$ is the size of the sample space, or, put in other words, the total number of outcomes of the experiment. Of course this is only true in the special case where every outcome has the same likelihood. \n\n\nDefinition: Probability of Equally Likely Outcomes \n\n\t\tIf $S$ is a sample space with equally likely outcomes, for an\nevent $E$ that is a subset of the outcomes in $S$:\n$$\n\\begin{align}\n\\p(E) &= \\frac{\\text{number of outcomes in $E$}}{\\text{number of outcomes in $S$}} \n= \\frac{|E|}{|S|}\n\\end{align}\n$$\n\n\t\n\nThere is an art to setting up a problem to calculate a probability based on the equally likely outcome\nrule. (1) The first step is to explicitly define your sample space and to argue that all outcomes in your sample\nspace are equally likely. (2) Next, you need to count the number of elements in the sample space and (3)\nfinally you need to count the size of the event space. 
Every element of the event space must be an element of the sample\nspace that you defined in step (1). The first step leaves you with a lot of choice! For example you can decide\nto make indistinguishable objects distinct, as long as your calculation of the size of the event space makes the\nexact same assumptions.\n\n\nExample: What is the probability that the sum of two dice is equal to 7?\nBuggy Solution: You could define your sample space to be all the possible sum values of two dice (2 through 12).\nHowever this sample space fails the \u201cequally likely\u201d test. You are not equally likely to have a sum of 2 as you\nare to have a sum of 7.\nSolution: Consider the sample space from the previous chapter where we thought of the dice as distinct and\nenumerated all of the outcomes in the sample space. The first number is the roll on die 1 and the second\nnumber is the roll on die 2. Note that (1, 2) is distinct from (2, 1). Since each outcome is equally likely, and the sample space has exactly 36 outcomes, the likelihood of any one outcome is $\\frac{1}{36}$. Here is a visualization of all outcomes:\n\n\n\n(1,1)(1,2)(1,3)(1,4)(1,5)(1,6)\n(2,1)(2,2)(2,3)(2,4)(2,5)(2,6)\n(3,1)(3,2)(3,3)(3,4)(3,5)(3,6)\n(4,1)(4,2)(4,3)(4,4)(4,5)(4,6)\n(5,1)(5,2)(5,3)(5,4)(5,5)(5,6)\n(6,1)(6,2)(6,3)(6,4)(6,5)(6,6)\n\n\nThe event (sum of dice is 7) is the subset of the sample space where the sum of the two dice is 7. Each outcome in the event is highlighted in blue. There are 6 such outcomes: (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1). Notice that (1, 6) is a different outcome than (6, 1). To make the outcomes equally likely we had to make the dice distinct.\n\t$$\n\t\\begin{align}\n\t\\p(\\text{Sum of two dice is 7}) \n\t&= \\frac{|E|}{|S|}\n\t&& \\text{Since outcomes are equally likely} \\\\\n\t&= \\frac{6}{36} = \\frac{1}{6}\n\t&& \\text{There are 6 outcomes in the event}\n\t\\end{align}\n\t$$\n\n\n\n\nInterestingly, this idea also applies to continuous sample spaces. 
Consider the sample space of all the outcomes of the computer function \"random\" which produces a real valued number between 0 and 1, where all\nreal valued numbers are equally likely. Now consider the event $E$ that the number generated is in the range\n[0.3 to 0.7]. Since the sample space is equally likely, $\\p(E)$ is the ratio of the size of $E$ to the size of $S$. In this\ncase $\\p(E) = \\frac{0.4}{1} = 0.4$.\n"}, {"id": "prob_or", "title": "Probability of <b>or</b>", "url": "part1/prob_or", "text": " \nProbability of or\n\nThe equation for calculating the probability of either event E or event F happening, written $\\p(E \\or F)$ or equivalently as $\\p(E \u222a F)$, is deeply analogous\nto counting the size of two sets. As in counting, the equation that you can use depends on whether or not the events are \"mutually exclusive\". If events are mutually exclusive, it is very straightforward to calculate the probability of either event happening. Otherwise, you need the more complex \"inclusion exclusion\" formula.\nMutually exclusive events\nTwo events: $E$, $F$ are considered to be mutually exclusive (in set notation $E \u2229 F = \u2205$) if there are no outcomes that are\nin both events (recall that an event is a set of outcomes which is a subset of the sample\nspace). In English, mutually exclusive means that two events can't both happen.\nMutual exclusion can be visualized. Consider the following visual sample space where each outcome is a\nhexagon. The set of all the fifty hexagons is the full sample space:\n\n\n\nExample of two events: $E$, $F$, which are mutually exclusive.\n\n\nBoth events $E$ and $F$ are subsets of the same sample space. Visually, we can note that\nthe two sets do not overlap. 
They are mutually exclusive: there is no outcome that is in both sets.\nProb of or for mutually exclusive events\n\n\nDefinition: Probability of or for mutually exclusive events\nIf two events: $E$, $F$ are mutually exclusive then the probability of $E$ or $F$ occurring is:\n\t\t\t$$\n\t\t\t\\p(E \\or F) = \\p(E) + \\p(F)\n\t\t\t$$\n\t\t\nThis property applies regardless of how you calculate the probability of $E$ or $F$.\nMoreover, the idea extends to more than two events. Let's say you have $n$ events $E_1, E_2, \\dots E_n$ where each\nevent is mutually exclusive of one another (in other words, no outcome is in more than one event). Then:\n$$\n\t\t\t\\p(E_1 \\or E_2 \\or \\dots \\or E_n) = \\p(E_1) + \\p(E_2) + \\dots + \\p(E_n) = \\sum_{i=1}^n \\p(E_i)\n\t\t\t$$\n\n\nYou may have noticed that this is one of the axioms of probability. Though it might seem intuitive, it is one of three rules that we accept without proof.\n\nCaution: Mutual exclusion only makes it easier to calculate the probability of $E \\or F$, not other ways of combining events, such as $E \\and F$. \n\nAt this point we know how to compute the probability of the \"or\" of events if and only if they have the mutual exclusion property. What if they don't?\nProb of or for non-mutually exclusive events\nUnfortunately, not all events are mutually exclusive. If you want to calculate $\\p(E \\or F)$ where the events $E$\nand $F$ are not mutually exclusive you cannot simply add the probabilities. As a simple sanity check, consider the event $E$: getting heads on a coin flip, where $\\p(E) = 0.5$. Now imagine the sample space $S$, getting either a heads or a tails on a coin flip. These events are not mutually exclusive (the outcome heads is in both). 
If you incorrectly assumed they were mutually exclusive and tried to calculate $\\p(E \\or S)$ you would get $0.5 + 1 = 1.5$, which is not a valid probability. The same mistake appears in this buggy derivation:\n\n\nBuggy derivation: Incorrectly assuming mutual exclusion\nCalculate the probability of $E$, getting an even number on a die roll (2, 4 or 6), or $F$, getting three or less (1, 2, 3) on the same die roll.\n\t$$\n\t\\begin{align}\n\t\\p(E \\or F) &= \\p(E) + \\p(F) && \\text{Incorrectly assumes mutual exclusion} \\\\\n\t&= 0.5 + 0.5 && \\text{substitute the probabilities of $E$ and $F$} \\\\\n\t&= 1.0 && \\text{uh oh!}\n\t\\end{align}\n\t$$\n\tThe probability can't be one since the outcome 5 is neither three or less nor even. The problem is that we double counted the probability of getting a 2, and the fix is to subtract out the probability of that doubly counted case.\n\nWhat went wrong? If two events are not mutually exclusive, simply adding their probabilities double counts the probability of any outcome which is in both events. There is a formula for calculating or of two non-mutually exclusive events: it is called the \"inclusion exclusion\" principle.\n\n\n\nDefinition: Inclusion Exclusion principle\nFor any two events: $E$, $F$:\n$$\n\\p(E \\or F) = \\p(E) + \\p(F) - \\p(E \\and F)\n$$\nThis formula does have a version for more than two events, but it gets rather complex. For three events, $E$, $F$, and $G$ the formula is:\n\t$$\n\t\\begin{align}\n\t\\p(E \\or F \\or G) =& \\text{ }\\p(E) + \\p(F) + \\p(G) \\\\\n& - \\p(E \\and F) - \\p(E \\and G) - \\p(F \\and G) \\\\\n& +\\p(E \\and F \\and G)\n\t\\end{align}\n\t$$\n\nFor $n$ events, $E_1, E_2, \\dots E_n$: build a running sum. Add the probabilities of all the events on their own. Then subtract the probability of the and of each pair of events. Then add the probability of the and of each subset of 3 events. Then subtract the probability of the and of each subset of 4 events. Continue this process up to subsets of size $n$, adding when the subset size is odd and subtracting when it is even. 
The alternating addition and subtraction is where the name inclusion exclusion comes from. This is a complex process and you should first check if there is an easier way to calculate your probability.\n\n\nNote that the inclusion exclusion principle also applies for mutually exclusive events. If two events are mutually exclusive $\\p(E \\and F) = 0$ since it's not possible for both $E$ and $F$ to occur. As such the formula $\\p(E) + \\p(F) - \\p(E \\and F)$ reduces to $\\p(E) + \\p(F)$.\n\n\tThe formulas for calculating the or of events that are not mutually exclusive often require calculating the probability of the and of events. Learn more in the next section:\n\t\n\n"}, {"id": "cond_prob", "title": "Conditional Probability", "url": "part1/cond_prob", "text": "\n \n\nConditional Probability\n\nIn English, a conditional probability states \"what is the chance of an event $E$ happening given that I have\nalready observed some other event $F$\". It is a critical idea in machine learning and probability because it\nallows us to update our probabilities in the face of new evidence.\nWhen you condition on an event happening you are entering the universe where that event has taken place.\nFormally, once you condition on $F$ the only outcomes that are now possible are the ones which are consistent with $F$. In other words your sample space will now be reduced to $F$. As an aside, in the universe where $F$ has\ntaken place, all rules of probability still hold!\n\n\nDefinition: Conditional Probability.\nThe probability of $E$ given that (aka conditioned on) event $F$ already happened:\n$$\n\\p(E |F) = \\frac{\\p(E \\and F)}{\\p(F)}\n$$\n\n\n\n\n\t Let's use a visualization to get an intuition for why the conditional probability formula is true. 
Again consider events $E$ and $F$\t\nwhich have outcomes that are subsets of a sample space with 50 equally likely outcomes, each one drawn as\na hexagon:\n\n\n\n\n\n\n\n\tConditioning on $F$ means that we have entered the world where $F$ has happened (and $F$, which has 14\nequally likely outcomes, has become our new sample space). Given that event $F$ has occurred, the conditional\nprobability that event $E$ occurs is determined by the subset of the outcomes of $E$ that are consistent with $F$. In this case we\ncan visually see that those are the three outcomes in $E \\and F$. Thus we have:\n$$\n\\p(E |F) = \\frac{\\p(E \\and F)}{\\p(F)} = \\frac{3/50}{14/50} = \\frac{3}{14} \\approx 0.21\n$$\n\nEven though the visual example (with equally likely outcome spaces) is useful for gaining intuition, conditional probability applies regardless of whether the sample space has equally likely outcomes!\n\nConditional Probability Example\nLet's use a real world example to better understand conditional probability: movie recommendation. Imagine a streaming service like Netflix wants to figure out the probability that a user will watch a movie $E$ (for example, Life is Beautiful), based on knowing that they watched a different movie $F$ (say Am\u00e9lie). To start, let's answer the simpler question: what is the probability that a user watches the movie Life is Beautiful, $E$? We can solve this problem using the definition of probability and a dataset of movie watching [1]:\n\t$$\n\t\\begin{align}\n\t\\p(E) &= \\lim_{n \\rightarrow \\infty} \\frac{\\text{count}(E)}{n} \\approx \\frac{\\text{# people who watched movie $E$}}{\\text{# people on Netflix}} \\\\\n\t&= \\frac{1,234,231}{50,923,123} \n\t\\approx 0.02\n\t\\end{align}\n\t$$\nIn fact we can do this for many movies $E$:\n\n\n\n$\\p(E) = 0.02$\n$\\p(E) = 0.01$\n$\\p(E) = 0.05$\n$\\p(E) = 0.09$\n$\\p(E) = 0.03$\n\n\nNow for a more interesting question. 
What is the probability that a user will watch the movie\nLife is Beautiful ($E$), given they watched Am\u00e9lie ($F$)? We can use the definition of conditional probability.\n$$\n\\begin{align}\n\\p(E|F) &= \\frac{\\p(E \\and F)}{\\p(F)} && \\text{Def of Cond Prob}\\\\\n&\\approx \\frac{\n\t(\\text{# who watched $E \\and F$}) / (\\text{# of people on Netflix})\n}{\n\t(\\text{# who watched movie $F$}) / (\\text{# people on Netflix})\n} && \\text{Def of Prob}\n \\\\\n&\\approx \\frac{\\text{# of people who watched both $E \\and F$}}{\\text{# of people who watched movie $F$}}\n&& \\text{Simplifying}\n\\end{align} \n$$\nIf we let $F$ be the event that someone watches the movie Am\u00e9lie, we can now calculate $\\p(E|F)$, the conditional probability that someone watches movie $E$:\n\n\n\n$\\p(E|F) = 0.09$\n$\\p(E|F) = 0.03$\n$\\p(E|F) = 0.05$\n$\\p(E|F) = 0.02$\n$\\p(E|F)$ = 1.00\n\n\nWhy do some probabilities go up, some probabilities go down, and some probabilities are unchanged after we observe that the person has watched Am\u00e9lie ($F$)? If you know someone watched Am\u00e9lie, they are more likely to watch Life is Beautiful, and less likely to watch Star Wars. We have new information on the person!\n\n\nThe Conditional Paradigm\nWhen you condition on an event you enter the universe where that event has taken place. In that new universe\nall the laws of probability still hold. Thus, as long as you condition consistently on the same event, every one of\nthe tools we have learned still applies. 
Let\u2019s look at a few of our old friends when we condition consistently on\nan event (in this case $G$):\n\n\n\n\nName of Rule\nOriginal Rule\nRule Conditioned on $G$\n\n\n\n\nAxiom of probability 1\n$0 \u2264 \\p(E) \u2264 1$\n$0 \u2264 \\p(E|G) \u2264 1$\n\n\nAxiom of probability 2\n$\\p(S) = 1$\n$\\p(S | G) = 1$\n\n\nAxiom of probability 3\n$\\p(E \\or F) = \\p(E) + \\p(F)$ for mutually exclusive events\n$\\p(E \\or F | G) = \\p(E | G) + \\p(F | G)$ for mutually exclusive events\n\n\nIdentity 1\n$\\p(E\\c) = 1 - \\p(E)$\n$\\p(E\\c | G) = 1 - \\p(E |G)$\n\n\n\n\nConditioning on Multiple Events\nThe conditional paradigm also applies to the definition of conditional probability! Again if we consistently condition on some event $G$ occurring, the rule still holds:\n\t\t$$\n\\p(E |F, G) = \\frac{\\p(E \\and F | G)}{\\p(F | G)}\n\t\t$$\nThe term $\\p(E | F, G)$ is new notation for conditioning on multiple events. You should read that term as \"The probability of $E$ occurring, given that both $F$ and $G$ have occurred\". This equation states that the definition for conditional probability of $E | F$ still applies in the universe where $G$ has occurred. Do you think that $\\p(E |F, G)$ should be equal to $\\p(E |F)$? The answer is: sometimes yes and sometimes no. \n\n\t\n\n\n"}, {"id": "independence", "title": "Independence", "url": "part1/independence", "text": "\n \nIndependence\n\nSo far we have talked about mutual exclusion as an important \"property\" that two or more events can have. In this chapter we will introduce you to a second property: independence. Independence is perhaps one of the most important properties to consider! 
As with mutual exclusion, if you can establish that this property applies (either by logic, or by declaring it as an assumption) it will make analytic probability calculations much easier!\n\n\n\n\nDefinition: Independence\nTwo events are said to be independent if knowing the outcome of one event does not change your belief about whether or not the other event will occur. For example, you might say that two separate dice rolls are independent of one another: the outcome of the first die gives you no information about the outcome of the second -- and vice versa. \n\n\t$$\n\t\\p(E | F) = \\p(E)\n\t$$\n\n\tThis definition is symmetric. If $E$ is independent of $F$, then $F$ is independent of $E$:\n\t$$\n\t\\p(F | E) = \\p(F)\n$$\n\n\nHow to establish independence\n\n\tHow can you show that two or more events are independent? The default option is to show it mathematically. If you can show that $\\p(E | F) = \\p(E)$ then you have proven that the two events are independent. When working with probabilities that come from data,\nvery few things will exactly match the mathematical definition of independence. That can happen for two reasons:\nfirst, probabilities that are calculated from data or simulation are not perfectly precise and it can be impossible to know if a discrepancy between $\\p(E)$ and $\\p(E |F)$ is due to inaccuracy in estimating probabilities, or dependence of events. Second, in our complex\nworld many things actually influence each other, even if just a tiny amount. Despite that we often make the\nwrong, but useful, independence assumption. Since independence makes it so much easier for humans and machines to calculate composite probabilities, you may declare the events to be independent. 
It could mean your resulting calculation is slightly incorrect -- but this \"modelling assumption\" might make it feasible to come up with a result.\n\nIndependence is a property which is often \"assumed\" if you think it is reasonable that one event is unlikely to influence your belief that the other will occur (or if the influence is negligible). Let's work through a few examples to better understand:\n\n\nConditional Independence\nWe saw earlier that the laws of probability still held if you consistently conditioned on an event. While the rules stay the same, the independence property might change. Events that were dependent can become independent when conditioning on an event. Events that were independent can become dependent.\n\n"}, {"id": "prob_and", "title": "Probability of <b>and</b>", "url": "part1/prob_and", "text": "\n \n\nProbability of and\n\nThe probability of the and of two events, say $E$ and $F$, written $\\p(E \\and F)$, is the probability of both events happening. You might see equivalent notations $\\p(EF)$, $\\p(E \u2229 F)$ and $\\p(E,F)$ to mean the probability of and. How you calculate the probability of event $E$ and event $F$ happening\ndepends on whether or not the events are \"independent\". In the same way that mutual exclusion makes it easy to calculate the probability of the or of events, independence is a property that makes it easy to calculate the and of events.\nIndependent Events\nIf events are independent then calculating the probability of and becomes simple multiplication:\n\n\n\nDefinition: Probability of and for independent events.\nIf two events: $E$, $F$ are independent then the probability of $E$ and $F$ occurring is:\n\t\t\t$$\n\t\t\t\\p(E \\and F) = \\p(E) \\cdot \\p(F)\n\t\t\t$$\n\t\t\nThis property applies regardless of how the probabilities of $E$ and $F$ were calculated and\nwhether or not the events are mutually exclusive. \n The independence principle extends to more than two\nevents. 
For $n$ events $E_1, E_2, \\dots E_n$ that are mutually independent of one another (meaning the independence equation also holds for all subsets of the events):\n$$\n\\p(E_1 \\and E_2 \\and \\dots \\and E_n) = \\prod_{i=1}^n \\p(E_i)\n$$\n\n\n\n\n\n\tWe can prove this equation by combining the definition of conditional probability and the definition of independence.\n\t\nProof: If $E$ is independent of $F$ then $\\p(E \\and F) = \\p(E) \\cdot \\p(F)$\n\t$$\n\t\\begin{align}\n\t\\p(E|F) &= \\frac{\\p(E \\and F)}{\\p(F)} && \\text{Definition of }\n\t\\href{ {{pathToLang}}part1/cond_prob/}{\\text{conditional probability}} \n\t\\\\\n\t\\p(E) &= \\frac{\\p(E \\and F)}{\\p(F)} && \\text{Definition of }\n\t\\href{ {{pathToLang}}part1/independence/}{\\text{independence}} \\\\\n\t\\p(E \\and F) &= \\p(E) \\cdot \\p(F) && \\text{Rearranging terms}\n\t\\end{align}\n\t$$\n\n\nSee the chapter on independence to learn about when you can assume that two events are independent.\nDependent Events\nEvents which are not independent are called dependent events. How can you calculate the and of dependent events? If your events are mutually exclusive you might be able to use a technique called DeMorgan's law, which we cover in a later chapter. For the probability of and in dependent events there is a direct formula called the chain rule which can be directly derived from the definition of conditional probability:\n\n\t\n\nDefinition: The chain rule.\nThe formula in the definition of conditional probability can be re-arranged to derive a general way of calculating the probability of the and of any two events:\n$$\n\\p(E \\and F) = \\p(E | F) \\cdot \\p(F)\n$$\n\nOf course there is nothing special about $E$ that says it should go first. 
Equivalently:\n\t$$\n\t\\p(E \\and F) = \\p(F \\and E) = \\p(F | E) \\cdot \\p(E)\n\t$$\n\nWe call this formula the \"chain rule.\" Intuitively it states that the probability of observing events $E$ and $F$ is the\nprobability of observing $F$, multiplied by the probability of observing $E$, given that you have observed $F$.\nIt generalizes to more than two events:\n$$\n\\begin{align}\n\\p(E_1 \\and E_2 \\and \\dots \\and E_n) = &\\p(E_1) \\cdot \\p(E_2|E_1) \\cdot \\p(E_3 |E_1 \\and E_2) \\dots \\\\ &\\p(E_n|E_1 \\dots E_{n\u22121})\n\\end{align}\n$$\n\t\n\n\n"}, {"id": "law_total", "title": "Law of Total Probability", "url": "part1/law_total", "text": "\n \nLaw of Total Probability\n\nAn astute person once observed that when looking at a picture, like the one we saw for conditional probability:\n\n\n\n\n\n\nthat event $E$ can be\nthought of as having two parts, the part that is in $F$, $(E \\and F)$, and the part that isn\u2019t, $(E \\and F\\c)$.\nThis is true\nbecause $F$ and $F\\c$ are (a) mutually exclusive sets of outcomes which (b) together cover the entire sample space.\nAfter further investigation this proved to be mathematically true, and there was much rejoicing:\n\n$$\\p(E) = \\p(E \\and F) + \\p(E \\and F\\c)$$\nThis observation proved to be particularly useful when it was combined with the chain rule and gave rise to a\ntool so useful, it was given the big name, law of total probability.\n\n\nThe Law of Total Probability\nIf we combine our above observation with the chain rule, we get a very useful formula:\n$$\n\\p(E) = \\p(E | F) \\p(F) + \\p(E | F\\c) \\p(F\\c)\n$$\n\nThere is a more general version of the rule. 
If you can divide your sample space into any number of\nmutually exclusive events: $B_1, B_2, \\dots B_n$ such that every outcome in the sample space falls into exactly one of those\nevents, then:\n$$\n\\begin{align}\n\\p(E) \n&= \\sum_{i=1}^n \\p(E \\and B_i) && \\text{Extension of our observation}\\\\\n&= \\sum_{i=1}^n \\p(E | B_i) \\p(B_i) && \\text{Using chain rule on each term}\n\\end{align}\n$$\n\n\n\n\tWe can build intuition for the general version of the law of total probability in a similar way. If we can divide a sample space into several mutually exclusive sets (where the $\\or$ of all the sets covers the entire sample space) then the probability of any event can be found by considering the probability of the event and each of the mutually exclusive sets.\n\n\n\n\n\n\n\n\tIn the image above, you could compute $\\p(E)$ to be equal to $\\p\\Big[(E \\and B_1) \\text{ }\\or \\text{ }(E \\and B_2) \\dots\\Big]$. This is worth mentioning because there are many real world cases where the sample space can be discretized into several mutually exclusive events. As an example, if you were thinking about the probability of the location of an object on earth, you could discretize the area over which you are tracking into a grid. \n\t\n\n"}, {"id": "bayes_theorem", "title": "Bayes' Theorem", "url": "part1/bayes_theorem", "text": "\n \n Bayes' Theorem\n\nBayes' Theorem is one of the most ubiquitous results in probability for computer scientists. In a nutshell, Bayes' theorem provides a way to convert a conditional probability from one direction, say $\\p(E|F)$, to the other direction, $\\p(F|E)$.\nBayes' theorem is a mathematical identity which we can\nderive ourselves. 
Start with the definition of conditional probability and then expand the $\\and$ term using the chain rule:\n\n\n$$\n\\begin{align}\n\\p(F|E) \n&= \\frac{\\p(F \\and E)}{\\p(E)} && \\text{Def of }\n\\href{ {{pathToLang}}part1/cond_prob/}{\\text{conditional probability}}\n\n \\\\\n&= \\frac{\\p(E | F) \\cdot \\p(F)}{\\p(E)} && \\text{Substitute the }\n\\href{ {{pathToLang}}part1/cond_prob/#chain_rule}{\\text{chain rule}} \\text{ for $\\p(F \\and E)$}\n\\end{align}\n$$\n\n\nThis theorem makes no assumptions about $E$ or $F$ so it will apply for any two events. Bayes' theorem is exceptionally useful because it turns out to be the ubiquitous way to answer the question: \"how can I update a belief about something, which is not directly observable, given evidence.\" This is for good reason. For many \"noisy\" measurements it is straightforward to estimate the probability of the noisy observation given the true state of the world. However, what you would really like to know is the conditional probability the other way around: what is the probability of the true state of the world given evidence. 
There are countless real world situations that fit this pattern:\n\n\n\nExample 1: Medical tests\nWhat you want to know: Probability of a disease given a test result\nWhat is easy to know: Probability of a test result given the true state of disease\nCausality: We believe that diseases influence test results\n\n\nExample 2: Student ability\nWhat you want to know: Student knowledge of a subject given their answers\nWhat is easy to know: Likelihood of answers given a student's knowledge of a subject\nCausality: We believe that ability influences answers \n\n\nExample 3: Cell phone location\nWhat you want to know: Where is a cell phone, given noisy measure of distance to tower\nWhat is easy to know: Error in noisy measure, given the true distance to tower\nCausality: We believe that cell phone location influences distance measure\n\n\n\n\tThere is a pattern here: in each example we care about knowing some unobservable -- or hard to observe -- state of the world. This state of the world \"causes\" some easy-to-observe evidence. For example: having the flu (something we would like to know) causes a fever (something we can easily observe), not the other way around. We often call the unobservable state the \"belief\" and the observable state the \"evidence\". For that reason let's rename the events! Let's call the unobservable thing we want to know $B$ for belief. Let's call the thing we have evidence of $E$ for evidence. This makes it clear that Bayes' theorem allows us to calculate an updated belief given evidence: $\\p(B | E)$\n\n\n\n\nDefinition: Bayes' Theorem \n The most common form of Bayes' Theorem is \n\t\tBayes' Theorem Classic:\n\t\t$$\n\t\t\\p(B|E) = \\frac{\\p(E | B) \\cdot \\p(B)}{\\p(E)} \n\t\t$$\n\t\nThere are names for the different terms in the Bayes' Rule formula. The term $\\p(B|E)$ is often called the\n\"posterior\": it is your updated belief of $B$ after you take into account evidence $E$. 
The term $\\p(B)$ is often called the \"prior\": it was your belief before seeing any evidence. The term $\\p(E|B)$ is called the update and $\\p(E)$ is\noften called the normalization constant.\nThere are several techniques for handling the case where the denominator is not known. One technique is to use the law of total probability to expand out the term, resulting in another formula, called Bayes' Theorem with Law of Total Probability:\n$$\n\\p(B|E) = \\frac{\\p(E | B) \\cdot \\p(B)}{\\p(E|B)\\cdot \\p(B) + \\p(E|B\\c) \\cdot \\p(B\\c)} \n$$\n\nRecall the law of total probability which is responsible for our new denominator:\n$$\n\\begin{align}\n\\p(E) = \\p(E|B)\\cdot \\p(B) + \\p(E|B\\c) \\cdot \\p(B\\c)\n\\end{align}\n$$\n\n\nA common scenario for applying the Bayes' Rule formula is when you want to know the probability of\nsomething \u201cunobservable\u201d given an \u201cobserved\u201d event. For example, you want to know the probability that a\nstudent understands a concept, given that you observed them solving a particular problem. It turns out it is\nmuch easier to first estimate the probability that a student can solve a problem given that they understand the\nconcept and then to apply Bayes' Theorem. Intuitively, you can think about this as updating a belief given\nevidence.\nBayes' Theorem Applied\nSometimes the (correct) results from Bayes' Theorem can be counterintuitive. Here we work through a classic result: Bayes' applied to medical tests. We show a dynamic solution and present a visualization for understanding what is happening.\n\n\nBayes with the General Law of Total Probability\nA classic challenge when applying Bayes' theorem is to calculate the probability of the normalization constant $\\p(E)$ in the denominator of Bayes' Theorem. One common strategy for calculating this probability is to use the law of total probability. 
Our expanded version of Bayes' Theorem uses the simple version of the law of total probability: $\\p(E) = \\p(E|F)\\p(F) + \\p(E|F^c)\n\t\\p(F^c)$. Sometimes you will want the more expanded version of the law of total probability: $\\p(E) = \\sum_i\\p(E|B_i)\\p(B_i)$. Recall that this only works if the events $B_i$ are mutually exclusive and cover the sample space.\n\t\nFor example say we are trying to track a phone which could be in any one of $n$ discrete\nlocations and we have prior beliefs $\\p(B_1) \\dots \\p(B_n)$ as to whether the phone is in location $B_i$. Now we gain\nsome evidence (such as a particular signal strength from a particular cell tower) that we call $E$ and we need\nto update all of our probabilities to be $\\p(B_i\n|E)$. We should use Bayes' Theorem!\nThe probability of the observation, assuming that the phone is in location $B_i$, $\\p(E|B_i)$, is something that\ncan be given to you by an expert. In this case the probability of getting a particular signal strength given a\nlocation $B_i$ will be determined by the distance between the cell tower and location $B_i$\n.\nSince we are assuming that the phone must be in exactly one of the locations, we can find the probability of\nany event $B_i$ given $E$ by first applying Bayes' Theorem and then applying the general version of the law of\ntotal probability. The sum in the denominator uses a separate index $j$ because it runs over all locations, not only location $B_i$:\n$$\n\\begin{align}\n\\p(B_i | E) &= \\frac{\\p(E|B_i) \\cdot \\p(B_i)}{\\p(E)}\n&& \\text{Bayes Theorem. What to do about $\\p(E)$?} \\\\\n &= \\frac{\\p(E|B_i) \\cdot \\p(B_i)}{\\sum_{j=1}^n \\p(E|B_j) \\cdot \\p(B_j)}\n&& \\text{Use General Law of Total Probability for $\\p(E)$} \\\\\n\\end{align}\n$$\n\n\nUnknown Normalization Constant, $\\p(E)$\nThere are times when we would like to use Bayes' Theorem to update a belief, but we don't know the probability of $E$, $\\p(E)$. All hope is not lost. This term is called the \"normalization constant\" because it is the same regardless of whether or not the event $B$ happens. 
The most traditional solution is to use the law of total probability: $\\p(E) = \\p(E |B) \\p(B) + \\p(E|B\\c)\\p(B\\c)$. Here are some other useful \"tricks\" for dealing with $\\p(E)$. \n\n\tWe can make the normalization cancel out by calculating the ratio of $\\frac{\\p(B|E)}{\\p(B\\c|E)}$. This fraction tells you how many times more likely it is that $B$ will happen given $E$ than not $B$:\n$$\n\\begin{align}\n\\frac{\\p(B|E)}{\\p(B\\c|E)} \n&= \\frac{\n\t\\frac{\\p(E|B)\\p(B)}{\\p(E)}\n}{\n\t\\frac{\\p(E|B\\c)\\p(B\\c)}{\\p(E)}\n}\n&& \\text{Apply Bayes' Theorem to both terms} \\\\\n&= \\frac{\n\t\\p(E|B)\\p(B)\n}{\n\t\\p(E|B\\c)\\p(B\\c)\n}\n&& \\text{The term $\\p(E)$ cancels}\n\\end{align}\n$$\n\t\n\nWe can always use the fact that either $B$ will happen or it won't when consistently conditioned on $E$: $\\p(B |E) + \\p(B\\c|E) =1$ to compute $\\p(E)$. Note that this is simply the first identity of probability, consistently conditioned on $E$:\n$$\n\\begin{align}\n1 &= \\p(B|E) + \\p(B\\c|E) \n&& \\text{Either $B$ occurs or it doesn't} \\\\\n\n1 &= \\frac{\\p(E|B)\\p(B)}{\\p(E)}\n + \n\\frac{\\p(E|B\\c)\\p(B\\c)}{\\p(E)}\n&& \\text{Apply Bayes' Theorem to both terms}\n\\\\\n\n1 &= \\frac{1}{\\p(E)} \\cdot \\big[\\p(E|B)\\p(B) + \\p(E|B\\c)\\p(B\\c)\\big]\n&& \\text{Factor out $1/ \\p(E)$} \\\\\n\n\\p(E) &= \\p(E|B)\\p(B) + \\p(E|B\\c)\\p(B\\c)\n&& \\text{Rearrange terms}\n\\end{align}\n$$\nIf you look closely at the last line, you will notice that we have simply found a new way to derive the law of total probability for $E$. The law of total probability is truly a great way of dealing with $\\p(E)$.\n\t\n\n\n"}, {"id": "bayes_theorem", "title": "Bayes' Theorem", "url": "part1/bayes_theorem", "text": "\n\nExample: Probability of a disease given a noisy test\nIn this problem we are going to calculate the probability that a patient has an illness given a test result for the illness. A positive test result means the test thinks the patient has the illness. 
You know the following information, which is typical for medical tests:\n\nNatural % of population with illness:\n\nProbability of a positive result given the patient has illness\n\nProbability of a positive result given the patient does not have illness\n\n\nThe numbers in this example are from the mammogram test for breast cancer. The seriousness of cancer underscores the potential for Bayesian probability to be applied to important contexts. The natural occurrence of breast cancer is 8%. The mammogram test returns a positive result 95% of the time for patients who have breast cancer. The test returns a positive result 7% of the time for people who do not have breast cancer. In this demo you can enter different input numbers and it will recalculate.\n\nAnswer\n\n\n\n\t\t\t\t \tThe probability that the patient has the illness given a positive test result is: \n\n\n\n\nTerms:\n\t\t\t\t\t \t\t\t\tLet $I$ be the event that the patient has the illness\n\t\t\t\t\t \t\t\t\tLet $E$ be the event that the test result is positive\n\t\t\t\t\t \t\t\t\t$\\p(I|E)$ = probability of the illness given a positive test. This is the number we want to calculate.\n\t\t\t\t\t \t\t\t\t$\\p(E|I)$ = probability of a positive result given illness = \n\t\t\t\t\t \t\t\t\t$\\p(E|I\\c)$ = probability of a positive result given no illness = \n\t\t\t\t\t \t\t\t\t$\\p(I)$ = natural probability of the illness = \n\nBayes Theorem:\n\t\t\t\t\t \t\n\t\t\t\t\t \t\tIn this problem we know $\\p(E|I)$ and $\\p(E|I\\c)$ but we want to know $\\p(I|E)$. We can apply Bayes Theorem to turn our knowledge of one conditional into knowledge of the reverse.\n\t\t\t\t\t \t\n\n\t\t\t\t\t \t\t$$\\begin{align}\\p(I|E) &= \\frac{\\p(E|I)\\p(I)}{\\p(E|I)\\p(I) + \\p(E|I\\c)\\p(I\\c)} && \\text{Bayes' Theorem with Total Prob.}\\end{align}$$\n\t\t\t\t\t \t\n\t\t\t\t\t \t\tNow all we need to do is plug values into this formula. The only value we don't explicitly have is $\\p(I\\c)$. 
But we can simply calculate it since $\\p(I\\c) = 1 - \\p(I)$. Thus:\n\t\t\t\t\t\t\n\n\n\n\n\nNatural Frequency Intuition\n\n\t\t\t\t\t\t\tOne way to build intuition for Bayes Theorem is to think about \"natural frequencies\". Let's take another approach to answering the probability question in the above example on belief of illness given a test. In this take, we are going to imagine we have a population of 1000 people. Let's think about how many of those have the illness and test positive and how many don't have the illness and test positive. This visualization is based off the numbers in the fields above. Feel free to change them!\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\tThere are many possibilities for how many people have the illness, but one very plausible number is 1000, the number of people in our population, multiplied by the probability of the disease.\n\n\t\t\t\t\t\t\t$1000 \\times \\p(\\text{Illness})$ people have the illness\n\t\t\t\t\t\t\t$1000 \\times (1- \\p(\\text{Illness}))$ people do not have the illness.\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\tWe are going to color people who have the illness in blue and those without the illness in pink (those colors do not imply gender!). \n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\tA certain number of people with the illness will test positive (which we will draw in Dark Blue) and a certain number of people without the illness will test positive (which we will draw in Dark Pink):\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t$1000 \\times \\p(\\text{Illness}) \\times \\p(\\text{Positive}|\\text{Illness})$ people have the illness and test positive\n\t\t\t\t\t\t\t$1000 \\times \\p(\\text{Illness}\\c) \\times \\p(\\text{Positive}|\\text{Illness}\\c)$ people do not have the illness and test positive.\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\tHere is the whole population of 1000 people:\n\t\t\t\t\t\t\n\n\n\n\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tThe number of people who test positive and have the illness is ?. 
\n\t\t\t\t\t\t\tThe number of people who test positive and don't have the illness is ?. \n\t\t\t\t\t\t\tThe total number of people who test positive is ?. \n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\tOut of the subset of people who test positive, the fraction that have the illness is ?/? = ? which is a close approximation of the answer. If instead of using 1000 imaginary people, we had used more, the approximation would have been even closer to the actual answer (which we calculated using Bayes Theorem).\n\t\t\t\t\t\t\n\n\n"}, {"id": "log_probabilities", "title": "Log Probabilities", "url": "part1/log_probabilities", "text": "\n \nLog Probabilities\n\nA log probability $\\log \\p(E)$ is simply the log function applied to a probability. For example if $\\p(E) = 0.00001$ then $\\log \\p(E) = \\log(0.00001) \\approx -11.51$. Note that in this book, the default base is the natural base $e$. There are many reasons why log probabilities are an essential tool for digital probability: (a) computers can be rather limited when representing very small numbers and (b) logs have the wonderful ability to turn multiplication into addition, and computers are much faster at addition. \nYou may have noticed that the log in the above example produced a negative number. Recall that $\\log b = c$, with the implied natural base $e$ is the same as the statement $e ^ c = b$. It says that $c$ is the exponent of $e$ that produces $b$. If $b$ is a number between 0 and 1, what power should you raise $e$ to in order to produce $b$? If you raise $e^0$ it produces 1. To produce a number less than 1, you must raise $e$ to a power less than 0. 
That is a long way of saying: if you take the log of a probability, the result will be a negative number (or exactly 0 when the probability is 1).\n\n\t$$\n\t\\begin{align}\n\t0 &\\leq \\p(E) \\leq 1 && \\text{Axiom 1 of probability} \\\\\n\t-\\infty &\\leq \\log \\p(E) \\leq 0 && \\text{Rule for log probabilities}\n\t\\end{align}\n$$\nProducts become Addition\nThe product of probabilities $\\p(E)$ and $\\p(F)$ becomes addition in logarithmic space:\n$$\n\\log (\\p(E) \\cdot \\p(F) ) = \\log \\p(E) + \\log \\p(F)\n$$\nThis is especially convenient because computers are much more efficient when adding than when multiplying. It can also make derivations easier to write. This is especially true when you need to multiply many probabilities together:\n\n$$\n\\log \\prod_i \\p(E_i) = \\sum_i \\log \\p(E_i)\n$$\n\n\nRepresenting Very Small Probabilities\nComputers have the power to process many events and consider the probability of very unlikely situations. While computers are capable of doing all the computation, the floating point representation means that computers cannot represent decimals to perfect precision. In fact, the smallest positive normalized floating point number that Python can represent is about 2.225e-308; probabilities smaller than that lose precision and eventually underflow to 0. On the other hand the natural log of that same number, approximately -708.4, is very easy for a computer to store.\nWhy would you care? Often in the digital world, computers are asked to reason about the probability of data, or a whole dataset. For example, perhaps your data is words and you want to reason about the probability that a given author would write these specific words. While this probability is very small (we are talking about an exact document) it might be larger than the probability that a different author would write a specific document with specific words. 
For these sorts of small probabilities, if you use computers, you would need to use log probabilities.\n\n\n\n\n"}, {"id": "many_flips", "title": "Many Coin Flips", "url": "part1/many_flips", "text": "\n \nMany Coin Flips\n\nIn this section we are going to consider the number of heads on $n$ coin flips. This thought experiment is going to be a basis for much probability theory! It goes far beyond coin flips.\nSay a coin comes up heads with probability $p$. Most coins are fair and as such come up heads with probability $p=0.5$. There are many events for which coin flips are a great analogy that have different values of $p$ so let's leave $p$ as a variable. You can try simulating coins here. Note that H is short for Heads and T is short for Tails. We think of each coin as distinct:\n\n\n\n\tLet's explore a few probability questions in this domain.\n\nWarmups\n\nWhat is the probability that all $n$ flips are heads?\n\n\n\t\tLet's say $n=10$. This question is asking what is the probability of getting:\n\t\tH, H, H, H, H, H, H, H, H, H\n\t\t\n\t\tEach coin flip is independent so we can use the rule for probability of and with independent events. As such, the probability of $n$ heads is $p$ multiplied by itself $n$ times: $p^n$. If $n=10$ and $p=0.6$ then the probability of $n$ heads is around 0.006.\n\t\n\n\nWhat is the probability that all $n$ flips are tails?\n\n\n\t\tLet's say $n=10$. This question is asking what is the probability of getting:\n\t\tT, T, T, T, T, T, T, T, T, T\n\t\t\n\t\tEach coin flip is independent. The probability of tails on any coin flip is $1-p$. Again, since the coin flips are independent, the probability of tails $n$ times on $n$ flips is $(1-p)$ multiplied by itself $n$ times: $(1-p)^n$. 
If $n=10$ and $p=0.6$ then the probability of $n$ tails is around 0.0001.\n\t\n\n\nFirst $k$ heads then $n-k$ tails\n\n\n\t\tLet's say $n=10$ and $k=4$. This question is asking what is the probability of getting:\n\t\tH, H, H, H, T, T, T, T, T, T\n\t\t\n\t\tThe coins are still independent! The first $k$ heads occur with probability $p^k$ and the run of $n-k$ tails occurs with probability $(1-p)^{n-k}$. The probability of $k$ heads then $n-k$ tails is the product of those two terms: $p^k \\cdot (1-p)^{n-k}$\n\t\n\nExactly $k$ heads\nNext let's try to figure out the probability of exactly $k$ heads in the $n$ flips. Importantly we don't care where in the $n$ flips we get the heads, as long as there are $k$ of them. Note that this question is different from the question of first $k$ heads and then $n-k$ tails, which requires that the $k$ heads come first! That particular result does generate exactly $k$ heads, but there are others.\nThere are many others! In fact any permutation of $k$ heads and $n-k$ tails will satisfy this event. Let's ask the computer to list them all for exactly $k=4$ heads within $n=10$ coin flips. 
The output region is scrollable:\n\n(H, H, H, H, T, T, T, T, T, T)\n(H, H, H, T, H, T, T, T, T, T)\n(H, H, H, T, T, H, T, T, T, T)\n(H, H, H, T, T, T, H, T, T, T)\n(H, H, H, T, T, T, T, H, T, T)\n(H, H, H, T, T, T, T, T, H, T)\n(H, H, H, T, T, T, T, T, T, H)\n(H, H, T, H, H, T, T, T, T, T)\n(H, H, T, H, T, H, T, T, T, T)\n(H, H, T, H, T, T, H, T, T, T)\n(H, H, T, H, T, T, T, H, T, T)\n(H, H, T, H, T, T, T, T, H, T)\n(H, H, T, H, T, T, T, T, T, H)\n(H, H, T, T, H, H, T, T, T, T)\n(H, H, T, T, H, T, H, T, T, T)\n(H, H, T, T, H, T, T, H, T, T)\n(H, H, T, T, H, T, T, T, H, T)\n(H, H, T, T, H, T, T, T, T, H)\n(H, H, T, T, T, H, H, T, T, T)\n(H, H, T, T, T, H, T, H, T, T)\n(H, H, T, T, T, H, T, T, H, T)\n(H, H, T, T, T, H, T, T, T, H)\n(H, H, T, T, T, T, H, H, T, T)\n(H, H, T, T, T, T, H, T, H, T)\n(H, H, T, T, T, T, H, T, T, H)\n(H, H, T, T, T, T, T, H, H, T)\n(H, H, T, T, T, T, T, H, T, H)\n(H, H, T, T, T, T, T, T, H, H)\n(H, T, H, H, H, T, T, T, T, T)\n(H, T, H, H, T, H, T, T, T, T)\n(H, T, H, H, T, T, H, T, T, T)\n(H, T, H, H, T, T, T, H, T, T)\n(H, T, H, H, T, T, T, T, H, T)\n(H, T, H, H, T, T, T, T, T, H)\n(H, T, H, T, H, H, T, T, T, T)\n(H, T, H, T, H, T, H, T, T, T)\n(H, T, H, T, H, T, T, H, T, T)\n(H, T, H, T, H, T, T, T, H, T)\n(H, T, H, T, H, T, T, T, T, H)\n(H, T, H, T, T, H, H, T, T, T)\n(H, T, H, T, T, H, T, H, T, T)\n(H, T, H, T, T, H, T, T, H, T)\n(H, T, H, T, T, H, T, T, T, H)\n(H, T, H, T, T, T, H, H, T, T)\n(H, T, H, T, T, T, H, T, H, T)\n(H, T, H, T, T, T, H, T, T, H)\n(H, T, H, T, T, T, T, H, H, T)\n(H, T, H, T, T, T, T, H, T, H)\n(H, T, H, T, T, T, T, T, H, H)\n(H, T, T, H, H, H, T, T, T, T)\n(H, T, T, H, H, T, H, T, T, T)\n(H, T, T, H, H, T, T, H, T, T)\n(H, T, T, H, H, T, T, T, H, T)\n(H, T, T, H, H, T, T, T, T, H)\n(H, T, T, H, T, H, H, T, T, T)\n(H, T, T, H, T, H, T, H, T, T)\n(H, T, T, H, T, H, T, T, H, T)\n(H, T, T, H, T, H, T, T, T, H)\n(H, T, T, H, T, T, H, H, T, T)\n(H, T, T, H, T, T, H, T, H, T)\n(H, T, T, H, T, T, H, T, T, H)\n(H, T, T, 
H, T, T, T, H, H, T)\n(H, T, T, H, T, T, T, H, T, H)\n(H, T, T, H, T, T, T, T, H, H)\n(H, T, T, T, H, H, H, T, T, T)\n(H, T, T, T, H, H, T, H, T, T)\n(H, T, T, T, H, H, T, T, H, T)\n(H, T, T, T, H, H, T, T, T, H)\n(H, T, T, T, H, T, H, H, T, T)\n(H, T, T, T, H, T, H, T, H, T)\n(H, T, T, T, H, T, H, T, T, H)\n(H, T, T, T, H, T, T, H, H, T)\n(H, T, T, T, H, T, T, H, T, H)\n(H, T, T, T, H, T, T, T, H, H)\n(H, T, T, T, T, H, H, H, T, T)\n(H, T, T, T, T, H, H, T, H, T)\n(H, T, T, T, T, H, H, T, T, H)\n(H, T, T, T, T, H, T, H, H, T)\n(H, T, T, T, T, H, T, H, T, H)\n(H, T, T, T, T, H, T, T, H, H)\n(H, T, T, T, T, T, H, H, H, T)\n(H, T, T, T, T, T, H, H, T, H)\n(H, T, T, T, T, T, H, T, H, H)\n(H, T, T, T, T, T, T, H, H, H)\n(T, H, H, H, H, T, T, T, T, T)\n(T, H, H, H, T, H, T, T, T, T)\n(T, H, H, H, T, T, H, T, T, T)\n(T, H, H, H, T, T, T, H, T, T)\n(T, H, H, H, T, T, T, T, H, T)\n(T, H, H, H, T, T, T, T, T, H)\n(T, H, H, T, H, H, T, T, T, T)\n(T, H, H, T, H, T, H, T, T, T)\n(T, H, H, T, H, T, T, H, T, T)\n(T, H, H, T, H, T, T, T, H, T)\n(T, H, H, T, H, T, T, T, T, H)\n(T, H, H, T, T, H, H, T, T, T)\n(T, H, H, T, T, H, T, H, T, T)\n(T, H, H, T, T, H, T, T, H, T)\n(T, H, H, T, T, H, T, T, T, H)\n(T, H, H, T, T, T, H, H, T, T)\n(T, H, H, T, T, T, H, T, H, T)\n(T, H, H, T, T, T, H, T, T, H)\n(T, H, H, T, T, T, T, H, H, T)\n(T, H, H, T, T, T, T, H, T, H)\n(T, H, H, T, T, T, T, T, H, H)\n(T, H, T, H, H, H, T, T, T, T)\n(T, H, T, H, H, T, H, T, T, T)\n(T, H, T, H, H, T, T, H, T, T)\n(T, H, T, H, H, T, T, T, H, T)\n(T, H, T, H, H, T, T, T, T, H)\n(T, H, T, H, T, H, H, T, T, T)\n(T, H, T, H, T, H, T, H, T, T)\n(T, H, T, H, T, H, T, T, H, T)\n(T, H, T, H, T, H, T, T, T, H)\n(T, H, T, H, T, T, H, H, T, T)\n(T, H, T, H, T, T, H, T, H, T)\n(T, H, T, H, T, T, H, T, T, H)\n(T, H, T, H, T, T, T, H, H, T)\n(T, H, T, H, T, T, T, H, T, H)\n(T, H, T, H, T, T, T, T, H, H)\n(T, H, T, T, H, H, H, T, T, T)\n(T, H, T, T, H, H, T, H, T, T)\n(T, H, T, T, H, H, T, T, H, T)\n(T, H, T, T, H, H, T, T, 
T, H)\n(T, H, T, T, H, T, H, H, T, T)\n(T, H, T, T, H, T, H, T, H, T)\n(T, H, T, T, H, T, H, T, T, H)\n(T, H, T, T, H, T, T, H, H, T)\n(T, H, T, T, H, T, T, H, T, H)\n(T, H, T, T, H, T, T, T, H, H)\n(T, H, T, T, T, H, H, H, T, T)\n(T, H, T, T, T, H, H, T, H, T)\n(T, H, T, T, T, H, H, T, T, H)\n(T, H, T, T, T, H, T, H, H, T)\n(T, H, T, T, T, H, T, H, T, H)\n(T, H, T, T, T, H, T, T, H, H)\n(T, H, T, T, T, T, H, H, H, T)\n(T, H, T, T, T, T, H, H, T, H)\n(T, H, T, T, T, T, H, T, H, H)\n(T, H, T, T, T, T, T, H, H, H)\n(T, T, H, H, H, H, T, T, T, T)\n(T, T, H, H, H, T, H, T, T, T)\n(T, T, H, H, H, T, T, H, T, T)\n(T, T, H, H, H, T, T, T, H, T)\n(T, T, H, H, H, T, T, T, T, H)\n(T, T, H, H, T, H, H, T, T, T)\n(T, T, H, H, T, H, T, H, T, T)\n(T, T, H, H, T, H, T, T, H, T)\n(T, T, H, H, T, H, T, T, T, H)\n(T, T, H, H, T, T, H, H, T, T)\n(T, T, H, H, T, T, H, T, H, T)\n(T, T, H, H, T, T, H, T, T, H)\n(T, T, H, H, T, T, T, H, H, T)\n(T, T, H, H, T, T, T, H, T, H)\n(T, T, H, H, T, T, T, T, H, H)\n(T, T, H, T, H, H, H, T, T, T)\n(T, T, H, T, H, H, T, H, T, T)\n(T, T, H, T, H, H, T, T, H, T)\n(T, T, H, T, H, H, T, T, T, H)\n(T, T, H, T, H, T, H, H, T, T)\n(T, T, H, T, H, T, H, T, H, T)\n(T, T, H, T, H, T, H, T, T, H)\n(T, T, H, T, H, T, T, H, H, T)\n(T, T, H, T, H, T, T, H, T, H)\n(T, T, H, T, H, T, T, T, H, H)\n(T, T, H, T, T, H, H, H, T, T)\n(T, T, H, T, T, H, H, T, H, T)\n(T, T, H, T, T, H, H, T, T, H)\n(T, T, H, T, T, H, T, H, H, T)\n(T, T, H, T, T, H, T, H, T, H)\n(T, T, H, T, T, H, T, T, H, H)\n(T, T, H, T, T, T, H, H, H, T)\n(T, T, H, T, T, T, H, H, T, H)\n(T, T, H, T, T, T, H, T, H, H)\n(T, T, H, T, T, T, T, H, H, H)\n(T, T, T, H, H, H, H, T, T, T)\n(T, T, T, H, H, H, T, H, T, T)\n(T, T, T, H, H, H, T, T, H, T)\n(T, T, T, H, H, H, T, T, T, H)\n(T, T, T, H, H, T, H, H, T, T)\n(T, T, T, H, H, T, H, T, H, T)\n(T, T, T, H, H, T, H, T, T, H)\n(T, T, T, H, H, T, T, H, H, T)\n(T, T, T, H, H, T, T, H, T, H)\n(T, T, T, H, H, T, T, T, H, H)\n(T, T, T, H, T, H, H, H, T, T)\n(T, T, 
T, H, T, H, H, T, H, T)\n(T, T, T, H, T, H, H, T, T, H)\n(T, T, T, H, T, H, T, H, H, T)\n(T, T, T, H, T, H, T, H, T, H)\n(T, T, T, H, T, H, T, T, H, H)\n(T, T, T, H, T, T, H, H, H, T)\n(T, T, T, H, T, T, H, H, T, H)\n(T, T, T, H, T, T, H, T, H, H)\n(T, T, T, H, T, T, T, H, H, H)\n(T, T, T, T, H, H, H, H, T, T)\n(T, T, T, T, H, H, H, T, H, T)\n(T, T, T, T, H, H, H, T, T, H)\n(T, T, T, T, H, H, T, H, H, T)\n(T, T, T, T, H, H, T, H, T, H)\n(T, T, T, T, H, H, T, T, H, H)\n(T, T, T, T, H, T, H, H, H, T)\n(T, T, T, T, H, T, H, H, T, H)\n(T, T, T, T, H, T, H, T, H, H)\n(T, T, T, T, H, T, T, H, H, H)\n(T, T, T, T, T, H, H, H, H, T)\n(T, T, T, T, T, H, H, H, T, H)\n(T, T, T, T, T, H, H, T, H, H)\n(T, T, T, T, T, H, T, H, H, H)\n(T, T, T, T, T, T, H, H, H, H)\nExactly how many outcomes are there with $k=4$ heads in $n=10$ flips? 210. The answer can be calculated using permutations of indistinct objects: $$N = \\frac{n!}{k! (n-k)!} = {n \\choose k}$$\n\nThe probability of exactly $k=4$ heads is the probability of the or of each of these outcomes. Because we consider each coin to be unique, these outcomes are \"mutually exclusive\" and as such if $E_i$ is the outcome from the $i$th row, $$\\p(\\text{exactly $k$ heads}) = \\sum_{i=1}^N \\p(E_i)$$\nThe next question is, what is the probability of each of these outcomes?\nHere is an arbitrarily chosen outcome which satisfies the event of exactly $k=4$ heads in $n=10$ coin flips. In fact it is the one on row 128 in the list above:\n\n\tT, H, T, T, H, T, T, H, H, T\nWhat is the probability of getting the exact sequence of heads and tails in the example above? Each coin flip is still independent, so we multiply $p$ for each heads and $1-p$ for each tails. 
Let $E_{128}$ be the event of this exact outcome:\n\t$$\\p(E_{128}) = (1-p) \\cdot p \\cdot (1-p) \\cdot (1-p) \\cdot p \\cdot (1-p) \\cdot (1-p) \\cdot p \\cdot p \\cdot (1-p)$$\nIf you rearrange these multiplication terms you get:\n\t$$\n\t\\begin{align}\n\t\\p(E_{128}) &= p \\cdot p \\cdot p \\cdot p \\cdot (1-p) \\cdot (1-p) \\cdot (1-p) \\cdot (1-p) \\cdot (1-p) \\cdot (1-p)\\\\\n\t&= p^4 \\cdot (1-p)^{6}\n\t\\end{align}\n\t$$\nThere is nothing too special about row 128. If you chose any row, you would get $k$ independent heads and $n-k$ independent tails. For any row $i$, $\\p(E_i) = p^k \\cdot (1-p)^{n-k}$. Now we are ready to calculate the probability of exactly $k$ heads:\n\n$$\n\\begin{align}\n\\p(\\text{exactly $k$ heads}) \n\t&= \\sum_{i=1}^N \\p(E_i) && \\text{Mutual Exclusion}\\\\\n\t&= \\sum_{i=1}^N p^k \\cdot (1-p)^{n-k} && \\text{Sub in }\\p(E_i) \\\\\n\t&= N \\cdot p^k \\cdot (1-p)^{n-k} && \\text{Sum $N$ times} \\\\\n\t&= {n \\choose k} \\cdot p^k \\cdot (1-p)^{n-k} && \\text{Perm of indistinct objects} \n\\end{align}$$\n\nMore than $k$ heads\n\n\tYou can use the formula for exactly $k$ heads to compute other probabilities. For example the probability of more than $k$ heads is:\n\t$$\n\\begin{align}\n\\p(\\text{more than $k$ heads}) \n\t&= \\sum_{i=k+1}^n \\p(\\text{exactly $i$ heads}) && \\text{Mutual Exclusion}\\\\\n\t&= \\sum_{i=k+1}^n {n \\choose i} \\cdot p^i \\cdot (1-p)^{n-i} && \\text{Substitution}\\\\\n\\end{align}$$\n\t\n\n"}, {"id": "many_flips", "title": "Many Coin Flips", "url": "part1/many_flips", "text": "\nCoin Flip Simulator\n\n\nNumber of flips $n$: \n\n\nProbability of heads $p$: \n\n\nNew simulation\n\n\nSimulator results:\n\nTotal number of heads: \n\n\n"}, {"id": "enigma", "title": "Enigma Machine", "url": "examples/enigma", "text": "\n \nEnigma Machine\n\nOne of the very first computers was built to break the Nazi \u201cenigma\u201d codes in WW2. 
It was a hard problem\nbecause the \u201cenigma\u201d machine, used to make secret codes, had so many unique configurations. Every day the Nazis would choose a new configuration, and if the Allies could figure out the daily configuration, they could read all enemy messages. One solution was to try all configurations until one produced legible German. This raises the question: How many configurations are there?\n\n\n\nThe WW2 machine built to search different enigma configurations.\n\n\nThe enigma machine has three rotors. Each rotor can be set to one of 26 different positions. How many\nunique configurations are there of the three rotors?\n\n\n\t\tUsing the steps rule of counting: $26 \\cdot 26 \\cdot 26 = 26^3 = 17,576$.\n\t\n\nWhat\u2019s more! The machine has a plug board which could swap the electrical signal for letters. On the plug\nboard, wires can connect any pair of letters to produce a new configuration. A wire can\u2019t connect a letter to itself. Wires are indistinct. A wire from \u2018K\u2019 to \u2018L\u2019\nis not considered distinct from a wire from \u2018L\u2019 to \u2018K\u2019. We are going to work up to considering any number of wires.\n\n\n\nThe enigma plugboard. For electrical reasons each letter has two jacks and each plug has two prongs. Semantically this is equivalent to one plug location per letter.\n\n\nOne wire: How many ways are there to place exactly one wire that connects two letters? \n\n\n\t\tChoosing 2 letters from 26 is a combination. Using the combination formula: ${26 \\choose 2} = 325$.\n\t\n\n Two wires: How many ways are there to place exactly two wires? Recall that wires are not\nconsidered distinct. Each letter can have at most one wire connected to it, thus you couldn\u2019t have a\nwire connect \u2018K\u2019 to \u2018L\u2019 and another one connect \u2018L\u2019 to \u2018X\u2019.\n\n\n\t\t\nThere are ${26 \\choose 2}$\nways to place the first wire and\n${24 \\choose 2}$\nways to place the second wire. 
However,\nsince the wires are indistinct, we have double counted every possibility. Because every possibility is counted twice we should divide by 2:\n$$\n\\text{Total} = \\frac{ {26 \\choose 2} \\cdot {24 \\choose 2} }{2} = 44,850\n$$ \n\t\n\n Three wires: How many ways are there to place exactly three wires? \n\n\n\t\t\nThere are ${26 \\choose 2}$\nways to place the first wire and\n${24 \\choose 2}$\nways to place the second wire. There are now ${22 \\choose 2}$ ways to place the third. However,\nsince the wires are indistinct, and our step counting implicitly treats them as distinct, we have overcounted each possibility. How many times is each placement of three wires overcounted? It\u2019s the number of permutations of three distinct objects: 3!\n$$\n\\text{Total} = \\frac{ {26 \\choose 2} \\cdot {24 \\choose 2} \\cdot {22 \\choose 2}}{3!} = 3,453,450\n$$ \n\nThere is another way to arrive at the same answer. First we are going to choose the letters to be paired, then we are going to pair them off. There are ${26 \\choose 6}$\nways to select the letters that are\nbeing wired up. We then need to pair off those letters. One way to think about pairing the\nletters off is to first permute them (6! ways) and then pair up the first two letters, the next two,\nthe next two, and so on. For example, if our letters were {A,B,C,D,E,F} and our permutation\nwas BADCEF, then this would correspond to wiring B to A and D to C and E to F. We are \novercounting by a lot. First, we are overcounting by a factor of 3! since the ordering of the pairs\ndoesn\u2019t matter. Second, we are overcounting by a factor of $2 ^ 3$\nsince the ordering of the letters\nwithin each pair doesn\u2019t matter.\n$$\n\\text{Total} = {26 \\choose 6} \\frac{6!}{3! \\cdot 2^3} = 3,453,450\n$$\n\t\n\n\nArbitrary wires: How many ways are there to place $k$ wires, thus connecting $2 \\cdot k$ letters? During WW2 the Germans always used a fixed number of wires. 
But one fear was that if they discovered the Enigma machine was cracked, they could simply use an arbitrary number of wires.\n\n\t\n\n\t\tThe set of ways to use exactly $i$ wires is mutually exclusive from the set of ways to use exactly $j$ wires if $i \\neq j$ (since no configuration can use both exactly $i$ and exactly $j$ wires). As such\n$\n\\text{Total} = \\sum_{k=0}^{13} \\text{Total}_k \n$\nWhere Total$_k$ is the number of ways to use exactly $k$ wires. Continuing our logic for ways to use an exact number of wires:\n$$\n\\text{Total}_k = \\frac{\\prod_{i=1}^{k} {28 - 2i \\choose 2} }{k!} \n$$\nBringing it all together:\n$$\n\\begin{align}\n\\text{Total} &= \\sum_{k=0}^{13} \\text{Total}_k \\\\\n&= \\sum_{k=0}^{13} \\frac{\\prod_{i=1}^{k} {28 - 2i \\choose 2} }{k!} \\\\\n&= 532,985,208,200,576 \n\\end{align}\n$$\n\t\n\nThe actual Enigma used in WW2 had exactly 10 wires connecting 20 letters, allowing for 150,738,274,937,250 unique configurations. The enigma machine also chose the three rotors, in order, from a set of five, adding another factor of $5 \\cdot 4 \\cdot 3 = 60$.\n\nWhen you combine the number of ways of setting the rotors with the number of ways you could set the plug board, you get the total number of configurations of an enigma machine. Thinking of this as a sequence of steps, we can multiply the numbers we calculated earlier: 17,576 \u00b7 150,738,274,937,250 \u00b7 60 $\\approx 159 \\cdot 10^{18}$ unique settings. So Alan Turing and his team at Bletchley Park set out to build a machine which could help test many configurations -- a predecessor to the first computers. \n\n"}, {"id": "serendipity", "title": "Serendipity", "url": "examples/serendipity", "text": "\n \nSerendipity\n\n\n\n\n\n\n\n\n\n\nThe word serendipity comes from the Persian fairy tale of the Three Princes of Serendip\n\n\n\n\nProblem\n\n\t\t\t\t\tWhat is the probability of a serendipitous encounter with a friend? Imagine you live in an area with a large general population (e.g. Stanford with 17,000 students). 
A small subset of the population are your friends. What are the chances that you run into at least one friend if you see a handful of people from the population? Assume that seeing each person from the population is equally likely.\n\t\t\t\t\n\n\n\n\n\nTotal Population\n\n\n\nFriends\n\n\n\nPeople that you see\n\n\nCalculate\n\nAnswer\n\n\t\t\t\t \tThe probability that you see at least one friend is: \n\n\nTerms\n\n\n\t\t\t\t\t \t\tFirst let's define some useful terms:\n\t\t\t\t\t \t\t$p$ = total population = \n\t\t\t\t\t \t\t$s$ = people seen = \n\t\t\t\t\t \t\t$f$ = num friends = \n\n\nApproach\n\n\t\t\t\t\t \t\tSince each way of seeing $s$ people is equally likely, we can use the \"Equally Likely Events\" probability calculation: \n\t\t\t\t\t \t\n\n\t\t\t\t\t \t\t$P(E) = \\frac{|E|}{|S|}$\n\t\t\t\t\t \t\n\t\t\t\t\t \t\tWhere $S$ is the sample set (all the ways of seeing $s$ people) and $E$ is the event set (all the ways of seeing $s$ people where at least one is a friend).\n\t\t\t\t\t \t\n\n\t\t\t\t\t\t\tOne way to approach this problem is to directly count all ways you see 1 or more friends. That is hard. You would have to count the ways you could see exactly one friend, then exactly two friends and so on. It is much easier to calculate the ways that you see zero friends. If we can calculate the probability of seeing zero friends our answer is just one minus that number.\n\t\t\t\t\t\t\n\nProb that you don't see friends\n\n\t\t\t\t\t\t\tLet the sample space ($S$) be the set of ways that you could see $s$ people. The size of the sample space is: the total population choose the number of people seen. \n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\tThe event space ($E$) is the set of ways that you could see no friends. 
The size of the event space is: the number of non-friends (aka population - friends) choose the number of people seen.\n\t\t\t\t\t\t\nThus the probability of not seeing a friend is:\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t$\\text{prob(not seen)} = \\frac{ \\left( {\\begin{array}{*{20}c} p - f \\\\ s \\\\ \\end{array}} \\right) } { \\left( {\\begin{array}{*{20}c} p \\\\ s \\\\ \\end{array}} \\right) } $\n\t\t\t\t\t\t\n\nProb that you see friends\n\n\t\t\t\t\t\t\tNow the probability that you see at least one friend is 1 minus the probability that you see no friends.\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t$\\text{prob(seen)} = 1 - \\frac{ \\left( {\\begin{array}{*{20}c} p - f \\\\ s \\\\ \\end{array}} \\right) } { \\left( {\\begin{array}{*{20}c} p \\\\ s \\\\ \\end{array}} \\right) } $\n\t\t\t\t\t\t\nThat is equal to \n\n\nIsn't that surprising?\n\n\n\n\n\n"}, {"id": "bacteria_evolution", "title": "Bacteria Evolution", "url": "examples/bacteria_evolution", "text": "\n \nBacteria Evolution\n\nA wonderful property of modern life is that we have anti-biotics to kill bacterial infections. However, we only have a fixed number of anti-biotic medicines, and bacteria are evolving to become resistant to our anti-biotics. In this example we are going to use probability to understand the evolution of anti-biotic resistance in bacteria.\nImagine you have a population of 1 million infectious bacteria in your gut, 10% of which have a mutation that makes them\nslightly more resistant to anti-biotics. You take a course of anti-biotics. The probability that a bacterium with the\nmutation survives is 20%. The probability that a bacterium without the mutation survives is 1%.\nWhat is the probability that a randomly chosen bacterium\nsurvives the anti-biotics?\n\n\n\tLet $E$ be the event that our bacterium survives. Let $M$ be the event that a bacterium has the mutation. 
By the Law of Total Probability (LOTP):\n\t$$\n\t\\begin{align}\n\t\\p(E) \n\t\t&= \\p(E \\and M) + \\p(E \\and M\\c)\n\t\t\t&& \\text{LOTP} \\\\ \n\t\t&= \\p(E | M)\\p(M) + \\p(E | M\\c)\\p(M\\c)\n\t\t\t&& \\text{Chain Rule} \\\\ \t\n\t\t&= 0.20 \\cdot 0.10 + 0.01 \\cdot 0.90 \n\t\t\t&& \\text{Substituting} \\\\ \n\t\t&= 0.029\n\t\\end{align}\n\t$$\n\n\nWhat is the probability that a surviving bacterium has the mutation?\n\n\n\tUsing the same events as in the last section, this question is asking for $\\p(M | E)$. We aren't given the conditional probability in that direction, instead we know $\\p(E|M)$. Such situations call for Bayes' Theorem:\n\t$$\n\t\\begin{align}\n\t\\p(M | E) \n\t\t&= \\frac{\\p(E|M)\\p(M)}{\\p(E)} \n\t\t\t&& \\text{Bayes} \\\\\n\t\t&= \\frac{0.20 \\cdot 0.10}{\\p(E)} \n\t\t\t&& \\text{Given} \\\\\n\t\t&= \\frac{0.20 \\cdot 0.10}{0.029} \n\t\t\t&& \\text{Calculated} \\\\\n\t\t&\\approx 0.69\n\t\\end{align}\n\t$$\n\n\nAfter the course of anti-biotics, about 69% of the surviving bacteria have the mutation, up from 10% before. If this population is allowed to reproduce you will have a much more resistant set of bacteria!\n"}, {"id": "rvs", "title": "Random Variables", "url": "part2/rvs", "text": "\n \nRandom Variables\n\n\nA Random Variable (RV) is a variable that probabilistically takes on a value. Random variables are one of the most important constructs in all of probability theory. You can think of an RV as\nbeing like a variable in a programming language, and in fact random variables are just as important to probability theory as variables are to programming. Random Variables take on values, have types and have domains over\nwhich they are applicable.\nRandom variables work with all of the foundational theory we have built up to this point. We can define events that occur if the random variable takes on values that satisfy\na numerical test (e.g. does the variable equal 5, is the variable less than 8).\n\nLet's look at a first example of a random variable. 
Say we flip three fair coins. We can define a random variable Y to be the total number\nof \u201cheads\u201d on the three coins. We can ask about the probability of Y taking on different values using the\nfollowing notation:\n\n\n\n\n\nLet $Y$ be the number of heads on three coin flips\n\n\n$\\p(Y = 0)$ = 1/8 (T, T, T)\n\n\n$\\p(Y = 1)$ = 3/8 (H, T, T), (T, H, T), (T, T, H)\n\n\n$\\p(Y = 2)$ = 3/8 (H, H, T), (H, T, H), (T, H, H)\n\n\n$\\p(Y = 3)$ = 1/8 (H, H, H)\n\n\n$\\p(Y \u2265 4)$ = 0\n\n\n\n\nEven though we use the same notation for random variables and for events (both use capital letters) they\nare distinct concepts. An event is a scenario, a random variable is an object. The scenario where a random\nvariable takes on a particular value (or range of values) is an event. When possible, I will try to use letters\nE,F,G for events and X,Y,Z for random variables.\nUsing random variables is a convenient notation technique that assists in decomposing problems. There are\nmany different types of random variables (indicator, binary, choice, Bernoulli, etc). The two main families of\nrandom variable types are discrete and continuous. Discrete random variables can only take on integer values. Continuous random variables can take on decimal values. We are going to develop our intuitions using discrete random variables and then introduce continuous ones.\nProperties of random variables\nThere are many properties of any random variable, some of which we will dive into extensively. Here is a brief summary. 
Each random variable has:\n\n\n\n\n\nProperty\nNotation Example\nDescription\n\n\n\nMeaning\n\nA semantic description of the random variable\n\n\nSymbol\n$X$\nA letter used to denote the random variable\n\n\nSupport or Range\n$\\{0, 1, \\dots, 3\\}$\nThe values the random variable can take on\n\n\nDistribution Function (PMF or PDF)\n$\\P(X=x)$\nA function which maps values the RV can take on to likelihood.\n\n\nExpectation\n$\\E[X]$\nA weighted average\n\n\nVariance\n$\\var(X)$\nA measure of spread\n\n\nStandard Deviation\n$\\std(X)$\nThe square root of variance\n\n\nMode\n\nThe most likely value of the random variable\n\n\n\nYou should set a goal of deeply understanding what each of these properties means. There are many more properties than the ones in the table above: properties like entropy, median, skew, kurtosis.\n\n\nRandom variables vs Events \n\nRandom variables and events are two different concepts. An event is an outcome, or a set of outcomes, of an experiment. A random variable is more like an experiment -- it will take on an outcome eventually. Probabilities are over events, so if you want to talk about probability in the context of a random variable, you must construct an event. You can make events by using any of the Relational Operators: \n<, \u2264, >, \u2265, =, or \u2260 (not equal to). This is analogous to coding where you can use relational operators to create boolean expressions from numbers. \n\n\tLet's continue our example of the random variable $Y$ which represents the number of heads on three coin flips. 
Here are some events using the variable $Y$:\n\t\n\t\n\n\nEvent\nMeaning\nProbability Statement\n\n\n\n\n$Y= 1$\n$Y$ takes on the value 1 (there was one heads)\n$\\p(Y=1)$\n\n\n$Y< 2$\n$Y$ takes on 0 or 1 (note that $Y$ can't be negative)\n$\\p(Y<2)$\n\n\n$X > Y$\n$X$ takes on a value greater than the value $Y$ takes on.\n$\\p(X>Y)$\n\n\n$Y= y$\n$Y$ takes on a value represented by non-random variable $y$\n$\\p(Y = y)$\n\n\n\n\nYou will see many examples like this last one, $\\p(Y=y)$, in this textbook as well as in scientific and math research papers. It allows us to talk about the likelihood of $Y$ taking on a value, in general. For example, later in this book we will derive that for three coin flips where $Y$ is the number of heads, the probability of getting exactly $y$ heads is:\n$$\n\\begin{align}\n\\P(Y = y) = \\frac{0.75}{y!(3-y)!} && \\text{If } 0 \\leq y \\leq 3\n\\end{align}\n$$\n\tThe statement above is a function which takes in a parameter $y$ as input and returns the numeric probability $\\P(Y=y)$ as output. This particular expression allows us to talk about the probability that the number of heads is 0, 1, 2 or 3 all in one expression. You can plug in any one of those values for $y$ to get the corresponding probability. It is customary to use lower-case symbols for non-random values. The use of an equals sign in the \"event\" can be confusing. For example, what does the expression $\\P(Y = 1) = 0.375$ say? It says that the probability that \"$Y$ takes on the value 1\" is 0.375. For discrete random variables this function is called the \"probability mass function\" and it is the topic of our next chapter.\n\n"}, {"id": "pmf", "title": "Probability Mass Functions", "url": "part2/pmf", "text": "\n\nProbability Mass Functions\n\n\nFor a random variable, the most important thing to know is: how likely is each outcome? For a discrete random variable, this information is called the \"probability mass function\". The probability mass function (PMF) provides the \"mass\" (i.e. 
amount) of \"probability\" for each possible assignment of the random variable. \nFormally, the probability mass function is a mapping between the values that the random variable could take on and the probability of the random variable taking on said value. In mathematics, we call these associations functions. There are many different ways of representing functions: you can write an equation, you can make a graph, you can even store many samples in a list. Let's start by looking at PMFs as graphs where the $x$-axis is the values that the random variable could take on and the $y$-axis is the probability of the random variable taking on said value.\nIn the following example, on the left we show a PMF, as a graph, for the random variable: $X$ = the value of a six-sided die roll. On the right we show a contrasting example of a PMF for the random variable $Y$ = the value of the sum of two dice rolls:\n\n\n\n\n\n\n\n\n\nLeft: the PMF of a single six-sided die roll. Right: the PMF of the sum of two dice rolls.\n\n\n\tThe sum of two dice was an example in the equally likely probability section. Again, the information that is provided in these graphs is the likelihood of a random variable taking on different values. In the graph on the right, the value \"$6$\" on the $x$-axis is associated with the probability $\\frac{5}{36}$ on the $y$-axis. This $x$-axis value refers to the event \"the sum of two dice is 6\" or $Y=6$. The $y$-axis tells us that the probability of that event is $\\frac{5}{36}$. In full: $\\p(Y=6) = \\frac{5}{36}$. The value \"$2$\" is associated with \"$\\frac{1}{36}$\" which tells us that $\\p(Y=2) = \\frac{1}{36}$: the probability that two dice sum to 2 is $\\frac{1}{36}$. There is no value associated with \"$1$\" because the sum of two dice cannot be 1. 
If you find this notation confusing, revisit the random variables section.\n\t\n\n\t\tHere is the exact same information in equation form:\n\t\t\n\n\t\t\t\t$$\n\t\t\t\t\\begin{align}\n\t\t\t\t\\p(X=x) = \\frac{1}{6} && \\text{ if } 1 \\leq x \\leq 6\n\t\t\t\t\\end{align}\n\t\t\t\t$$\n\t\t\t\n\n\t\t\t\t$$\n\t\t\t\t\\p(Y=y) = \\begin{cases}\n\t\t\t\t\t\\frac{(y-1)}{36} && \\text{ if } 2 \\leq y \\leq 7\\\\\n\t\t\t\t\t\\frac{(13-y)}{36} && \\text{ if } 8 \\leq y \\leq 12\n\t\t\t\t\\end{cases}\n\t\t\t\t$$\n\t\t\t\n\n\n\n\t\tAs a final example, here is the PMF for $Y$, the sum of two dice, in Python code:\ndef pmf_sum_two_dice(y):\n    # Returns the probability that the sum of two dice is y\n    if y < 2 or y > 12:\n        return 0\n    if y <= 7:\n        return (y-1) / 36\n    else:\n        return (13-y) / 36\nNotation\nYou may feel that $\\p(Y=y)$ is redundant notation. In probability research papers and higher-level work, mathematicians often use the shorthand $\\p(y)$ to mean $\\p(Y=y)$. This shorthand assumes that the lowercase value (e.g. $y$) has a capital letter counterpart (e.g. $Y$) that represents a random variable even though it's not written explicitly. In this book, we will often use the full form of the event $\\P(Y=y)$, but we will occasionally use the shorthand $\\p(y)$.\nProbabilities Must Sum to 1\nFor a variable (call it $X$) to be a proper random variable it must be the case that if you sum up the values of $\\p(X = k)$ for all possible values $k$ that $X$ can take on, the result is 1:\n$$\n\\sum_{k} \\p(X = k) = 1\n$$\nFor further understanding, let's derive why this is the case. A random variable taking on a value is an event (for example $X = 2$). Each of those events is mutually exclusive because a random variable will take on exactly one value. Those mutually exclusive cases define an entire sample space. Why? Because $X$ must take on some value. The probabilities of mutually exclusive events that together cover the sample space must sum to 1. 
\n\n\n\nData to Histograms to Probability Mass Functions\nOne surprising way to store a likelihood function (recall that a PMF is the name of the likelihood function for discrete random variables) is simply a list of data. We simulated summing two dice 10,000 times to make this example dataset:\n\n\nNote that this data, on its own, represents an approximation for the probability mass function. If you wanted to approximate $\\p(Y=5)$ you could simply count the number of times that \"5\" occurs in your data and divide by the total number of data points. This is an approximation based on the definition of probability. Here is the full histogram of the data, a count of times each value occurs:\n\n\n\n\n\nA normalized histogram (where each value is divided by the length of your data list) is an approximation of the PMF. For a dataset of discrete numbers, a histogram shows the count of each value (in this case $y$). By the definition of probability, if you divide this count by the number of experiments run, you arrive at an approximation of the probability of the event $\\p(Y=y)$. In our example, we have 10,000 elements in our dataset. The count of times that 3 occurs is 552. Note that:\n\t$$\\begin{align}\n\t\\frac{\\text{count}(Y=3)}{n} &= \\frac{552}{10000} = 0.0552 \\\\\n\t\\p(Y=3) &= \\frac{2}{36} \\approx 0.0556\n\t\\end{align}\n\t$$\n\nIn this case, since we ran 10,000 trials, the histogram is a very good approximation of the PMF. We use the sum of dice as an example because it is easy to understand. Datasets in the real world often represent more exciting events. 
\n\n"}, {"id": "pmf", "title": "Probability Mass Functions", "url": "part2/pmf", "text": "[8, 4, 9, 7, 7, 7, 7, 5, 6, 8, 11, 5, 7, 7, 7, 6, 7, 8, 8, 9, 9, 4, 6, 7, 10, 12, 6, 7, 8, 9, 3, 7, 4, 9, 2, 8, 5, 8, 9, 6, 8, 7, 10, 7, 6, 7, 7, 5, 4, 6, 9, 5, 7, 4, 2, 11, 10, 11, 8, 4, 11, 9, 7, 10, 12, 4, 8, 5, 11, 5, 3, 9, 7, 5, 5, 5, 3, 8, 6, 11, 11, 2, 7, 7, 6, 5, 4, 6, 3, 8, 5, 8, 7, 6, 9, 4, 3, 7, 6, 6, 6, 5, 6, 10, 5, 9, 9, 8, 8, 7, 4, 8, 4, 9, 8, 5, 10, 10, 9, 7, 9, 7, 7, 10, 4, 7, 8, 4, 7, 8, 9, 11, 7, 9, 10, 10, 2, 7, 9, 4, 8, 8, 12, 9, 5, 11, 10, 7, 6, 4, 8, 9, 9, 6, 5, 6, 5, 6, 11, 7, 3, 10, 7, 3, 7, 7, 10, 3, 6, 8, 6, 8, 5, 10, 2, 7, 4, 8, 11, 9, 3, 4, 2, 8, 8, 6, 6, 12, 11, 10, 10, 10, 8, 4, 9, 4, 4, 6, 6, 7, 8, 2, 5, 7, 6, 9, 5, 5, 8, 4, 7, 7, 7, 6, 5, 6, 8, 6, 5, 7, 8, 4, 9, 8, 8, 9, 7, 2, 8, 3, 5, 5, 10, 7, 9, 12, 6, 4, 5, 7, 6, 4, 7, 6, 10, 3, 8, 5, 7, 7, 3, 6, 7, 7, 6, 6, 9, 12, 9, 10, 7, 10, 8, 10, 3, 9, 9, 4, 7, 8, 6, 8, 12, 5, 6, 2, 4, 4, 5, 5, 8, 7, 9, 10, 6, 7, 10, 7, 6, 8, 9, 8, 10, 3, 7, 8, 8, 8, 4, 7, 7, 8, 3, 8, 5, 9, 2, 8, 6, 11, 7, 8, 7, 6, 8, 5, 5, 3, 6, 7, 9, 7, 11, 5, 8, 2, 11, 9, 9, 7, 12, 8, 6, 9, 7, 7, 5, 7, 6, 9, 2, 5, 4, 11, 10, 4, 7, 11, 9, 8, 3, 9, 5, 5, 2, 6, 7, 8, 10, 5, 9, 4, 4, 4, 7, 6, 3, 5, 6, 4, 3, 12, 7, 7, 6, 7, 7, 4, 5, 7, 9, 3, 5, 7, 6, 11, 5, 6, 6, 8, 6, 4, 5, 7, 4, 6, 10, 3, 6, 5, 7, 6, 8, 10, 7, 7, 9, 3, 10, 9, 6, 8, 5, 9, 6, 6, 5, 3, 4, 10, 8, 10, 6, 8, 4, 7, 7, 7, 12, 5, 5, 7, 9, 5, 5, 7, 8, 7, 10, 5, 10, 7, 7, 7, 4, 7, 4, 9, 11, 8, 7, 6, 9, 7, 2, 4, 8, 4, 7, 3, 7, 8, 7, 11, 7, 5, 11, 10, 6, 7, 11, 4, 7, 5, 9, 6, 11, 6, 10, 10, 6, 10, 8, 10, 5, 5, 10, 6, 3, 7, 5, 6, 7, 12, 10, 4, 2, 6, 9, 9, 8, 7, 5, 5, 10, 7, 5, 10, 10, 6, 8, 8, 4, 6, 7, 8, 10, 4, 6, 11, 5, 3, 7, 5, 7, 12, 4, 8, 6, 11, 6, 8, 7, 11, 12, 11, 9, 8, 9, 6, 6, 2, 3, 4, 9, 3, 5, 8, 9, 7, 6, 8, 6, 4, 8, 7, 4, 6, 5, 3, 8, 4, 9, 12, 6, 9, 7, 9, 3, 8, 2, 8, 10, 5, 8, 5, 2, 3, 6, 6, 6, 9, 5, 4, 6, 5, 6, 5, 8, 10, 8, 7, 10, 9, 9, 5, 2, 11, 3, 10, 4, 9, 7, 11, 9, 
2, 10, 4, 11, 5, 12, 11, 9, 8, 3, 8, 9, 5, 10, 8, 4, 6, 8, 8, 7, 6, 4, 6, 4, 9, 8, 8, 4, 7, 10, 8, 5, 6, 10, 7, 7, 3, 11, 8, 10, 9, 9, 4, 7, 10, 5, 7, 3, 4, 12, 5, 10, 12, 8, 6, 8, 4, 8, 6, 12, 5, 5, 7, 3, 2, 8, 4, 4, 10, 4, 8, 5, 5, 9, 12, 8, 5, 8, 11, 12, 5, 6, 8, 3, 7, 11, 11, 8, 4, 4, 3, 8, 11, 8, 5, 6, 11, 10, 7, 6, 6, 12, 5, 7, 5, 6, 8, 6, 8, 8, 7, 7, 5, 8, 8, 5, 10, 8, 7, 8, 7, 7, 5, 4, 7, 9, 11, 9, 8, 10, 7, 5, 6, 5, 11, 11, 9, 4, 9, 6, 6, 5, 6, 9, 7, 9, 6, 3, 6, 6, 4, 10, 12, 9, 7, 6, 7, 7, 3, 12, 7, 6, 11, 9, 7, 6, 5, 9, 8, 10, 6, 8, 6, 6, 8, 11, 7, 5, 9, 11, 3, 8, 8, 8, 9, 10, 7, 8, 7, 5, 5, 9, 8, 10, 11, 3, 6, 5, 10, 6, 7, 8, 4, 5, 4, 6, 8, 12, 10, 7, 9, 8, 7, 9, 4, 8, 7, 2, 8, 4, 3, 7, 11, 5, 4, 7, 8, 7, 8, 7, 7, 6, 4, 6, 12, 12, 7, 6, 11, 12, 10, 5, 6, 6, 5, 5, 9, 9, 8, 6, 9, 3, 11, 8, 6, 4, 8, 9, 6, 7, 9, 8, 9, 6, 6, 4, 2, 9, 8, 7, 9, 9, 3, 4, 8, 3, 8, 3, 10, 7, 7, 6, 8, 4, 5, 7, 10, 9, 8, 8, 3, 7, 6, 4, 2, 4, 3, 9, 12, 4, 9, 12, 8, 2, 9, 4, 11, 2, 6, 5, 3, 11, 4, 5, 5, 11, 4, 4, 12, 7, 4, 10, 10, 3, 8, 6, 4, 5, 6, 7, 7, 9, 3, 6, 10, 8, 3, 4, 6, 2, 6, 7, 8, 5, 8, 11, 9, 4, 8, 7, 5, 6, 11, 4, 8, 6, 6, 6, 10, 9, 4, 8, 8, 6, 9, 5, 6, 7, 7, 6, 8, 5, 10, 8, 7, 9, 6, 5, 8, 7, 8, 4, 5, 9, 10, 6, 5, 8, 8, 7, 7, 8, 5, 10, 7, 8, 5, 3, 8, 3, 9, 8, 7, 12, 7, 9, 11, 4, 4, 4, 6, 6, 8, 9, 9, 7, 7, 5, 11, 8, 8, 10, 11, 5, 7, 8, 8, 8, 7, 4, 6, 11, 11, 7, 12, 7, 11, 6, 9, 11, 6, 2, 2, 9, 7, 7, 7, 11, 9, 6, 7, 6, 7, 7, 7, 9, 8, 10, 10, 4, 7, 6, 9, 5, 7, 9, 5, 9, 5, 6, 8, 9, 10, 7, 4, 8, 5, 7, 5, 5, 9, 9, 4, 6, 4, 5, 3, 7, 8, 10, 3, 5, 11, 9, 12, 5, 8, 4, 7, 4, 7, 5, 5, 8, 4, 4, 9, 8, 7, 4, 5, 10, 9, 7, 7, 5, 8, 6, 12, 9, 7, 6, 10, 4, 7, 5, 5, 8, 5, 6, 7, 8, 9, 9, 7, 10, 8, 6, 8, 7, 5, 7, 6, 6, 5, 4, 8, 5, 3, 9, 3, 2, 9, 3, 6, 7, 7, 6, 6, 9, 6, 4, 6, 8, 5, 6, 6, 4, 8, 9, 9, 10, 7, 6, 4, 7, 4, 8, 11, 6, 7, 9, 6, 10, 6, 7, 5, 9, 6, 9, 5, 7, 5, 10, 5, 6, 7, 5, 10, 6, 8, 6, 5, 8, 4, 7, 9, 9, 7, 7, 5, 11, 7, 3, 4, 8, 6, 10, 6, 6, 6, 4, 11, 8, 5, 10, 11, 9, 8, 11, 10, 7, 9, 
3, 9, 8, 8, 2, 3, 6, 11, 8, 7, 9, 12, 9, 7, 8, 5, 8, 10, 8, 9, 5, 4, 3, 8, 11, 5, 7, 5, 3, 9, 9, 7, 7, 8, 11, 7, 9, 6, 6, 10, 3, 9, 6, 7, 3, 6, 10, 5, 7, 6, 7, 6, 10, 8, 7, 4, 3, 9, 10, 5, 9, 7, 10, 2, 9, 8, 3, 7, 6, 7, 6, 5, 6, 8, 8, 11, 6, 8, 6, 6, 7, 6, 5, 6, 9, 4, 6, 6, 7, 5, 7, 4, 11, 4, 12, 6, 7, 12, 7, 5, 9, 7, 6, 5, 8, 7, 7, 6, 9, 2, 6, 4, 8, 5, 10, 7, 5, 10, 5, 6, 7, 5, 8, 10, 9, 2, 7, 4, 9, 8, 9, 7, 8, 7, 8, 5, 7, 7, 6, 9, 10, 5, 5, 6, 8, 8, 3, 4, 3, 6, 4, 9, 6, 6, 6, 8, 10, 3, 11, 5, 4, 5, 4, 8, 6, 4, 10, 8, 7, 7, 8, 8, 8, 12, 7, 6, 5, 6, 9, 6, 10, 5, 3, 6, 2, 8, 10, 10, 7, 9, 7, 7, 7, 8, 3, 6, 7, 6, 10, 8, 11, 8, 6, 5, 9, 6, 11, 9, 3, 10, 7, 7, 7, 6, 7, 9, 9, 9, 9, 10, 8, 11, 4, 6, 8, 7, 7, 10, 11, 7, 7, 7, 11, 7, 5, 10, 2, 7, 4, 3, 7, 8, 8, 4, 12, 2, 7, 5, 8, 9, 4, 9, 7, 3, 6, 8, 9, 6, 5, 4, 10, 7, 10, 9, 8, 2, 9, 7, 12, 7, 5, 6, 12, 7, 7, 5, 9, 9, 8, 8, 9, 5, 3, 8, 11, 9, 8, 8, 9, 8, 7, 6, 8, 7, 12, 7, 7, 4, 9, 10, 2, 6, 3, 5, 8, 4, 8, 7, 10, 11, 3, 6, 2, 9, 7, 12, 7, 9, 7, 9, 7, 8, 7, 3, 9, 6, 5, 10, 7, 6, 2, 9, 4, 7, 4, 7, 5, 8, 2, 5, 7, 9, 8, 10, 2, 7, 8, 9, 9, 6, 4, 10, 8, 3, 4, 5, 11, 7, 6, 8, 6, 12, 8, 8, 12, 9, 8, 7, 7, 7, 8, 3, 6, 5, 7, 7, 6, 10, 2, 8, 4, 5, 7, 8, 7, 8, 10, 12, 9, 8, 7, 9, 8, 10, 9, 9, 3, 10, 10, 8, 9, 8, 7, 11, 7, 8, 5, 7, 4, 5, 10, 7, 7, 8, 8, 5, 5, 7, 11, 11, 5, 6, 8, 8, 7, 7, 9, 5, 7, 7, 7, 9, 2, 4, 7, 9, 4, 8, 10, 7, 12, 12, 5, 2, 5, 8, 11, 7, 9, 7, 8, 7, 7, 6, 7, 10, 9, 6, 7, 2, 7, 4, 9, 8, 10, 7, 6, 4, 7, 4, 5, 7, 8, 8, 3, 8, 6, 7, 4, 11, 10, 8, 4, 12, 11, 8, 8, 3, 8, 11, 5, 4, 9, 6, 5, 8, 7, 10, 8, 12, 9, 8, 4, 7, 8, 8, 5, 9, 3, 5, 11, 6, 8, 6, 7, 11, 4, 7, 7, 7, 6, 7, 12, 9, 8, 11, 7, 7, 8, 9, 9, 6, 6, 2, 9, 6, 10, 12, 3, 4, 3, 9, 5, 7, 7, 6, 3, 6, 7, 7, 10, 11, 9, 10, 9, 10, 11, 4, 5, 12, 3, 6, 11, 3, 4, 8, 6, 9, 7, 9, 12, 9, 7, 6, 7, 10, 7, 4, 6, 5, 5, 11, 9, 8, 6, 6, 12, 4, 4, 8, 10, 12, 10, 5, 6, 4, 12, 7, 7, 7, 7, 7, 7, 8, 3, 6, 9, 3, 7, 6, 9, 7, 5, 7, 2, 12, 6, 7, 11, 9, 9, 4, 7, 11, 11, 6, 12, 3, 4, 11, 8, 8, 
10, 3, 4, 4, 9, 8, 9, 7, 3, 5, 6, 9, 8, 4, 8, 3, 11, 4, 7, 7, 4, 9, 10, 8, 9, 11, 9, 5, 5, 12, 5, 9, 4, 5, 5, 10, 5, 9, 10, 8, 10, 11, 4, 4, 5, 9, 6, 7, 9, 7, 7, 8, 3, 6, 12, 7, 5, 6, 11, 5, 6, 7, 5, 10, 6, 9, 8, 7, 6, 6, 6, 2, 11, 10, 6, 10, 7, 5, 9, 7, 6, 6, 11, 9, 3, 2, 6, 6, 5, 11, 7, 7, 8, 2, 2, 9, 5, 6, 8, 2, 10, 9, 3, 7, 6, 4, 7, 8, 10, 6, 9, 2, 4, 7, 6, 9, 6, 10, 10, 8, 6, 7, 9, 7, 6, 8, 10, 10, 12, 6, 5, 4, 7, 7, 7, 11, 9, 3, 5, 11, 4, 10, 8, 7, 5, 8, 7, 3, 7, 5, 8, 7, 11, 11, 9, 5, 5, 6, 10, 9, 7, 7, 7, 10, 11, 4, 2, 7, 10, 10, 12, 5, 8, 6, 10, 9, 10, 6, 10, 10, 11, 8, 9, 6, 9, 8, 4, 8, 7, 7, 5, 6, 6, 5, 6, 2, 7, 3, 8, 7, 11, 10, 6, 7, 7, 6, 6, 3, 7, 10, 9, 6, 5, 8, 7, 2, 12, 5, 10, 8, 6, 5, 2, 4, 4, 4, 2, 6, 7, 6, 7, 7, 7, 7, 2, 7, 11, 3, 5, 5, 5, 10, 11, 7, 3, 9, 9, 3, 11, 7, 7, 7, 7, 8, 11, 10, 7, 7, 7, 7, 7, 9, 8, 4, 2, 5, 12, 8, 8, 3, 10, 10, 5, 5, 3, 9, 9, 8, 6, 6, 9, 5, 7, 11, 8, 7, 4, 9, 4, 9, 6, 5, 8, 5, 7, 9, 4, 10, 7, 5, 12, 5, 8, 7, 6, 7, 10, 5, 9, 7, 4, 6, 5, 6, 7, 7, 7, 7, 9, 10, 10, 6, 4, 5, 5, 6, 12, 12, 12, 10, 6, 8, 11, 8, 6, 8, 8, 10, 12, 8, 4, 8, 6, 9, 3, 5, 6, 8, 10, 8, 10, 8, 4, 8, 6, 9, 9, 6, 10, 10, 6, 9, 10, 7, 5, 6, 9, 2, 10, 9, 8, 9, 9, 12, 10, 7, 6, 8, 5, 5, 3, 7, 10, 6, 7, 8, 7, 7, 11, 8, 4, 8, 9, 11, 4, 7, 4, 4, 3, 5, 5, 10, 3, 10, 7, 8, 4, 5, 7, 4, 2, 11, 4, 9, 8, 3, 10, 8, 8, 5, 7, 10, 8, 6, 6, 8, 4, 6, 6, 7, 5, 6, 8, 5, 6, 6, 10, 10, 5, 7, 5, 6, 12, 2, 7, 6, 4, 8, 9, 8, 7, 6, 3, 11, 9, 2, 10, 5, 8, 8, 8, 2, 8, 5, 6, 4, 10, 3, 9, 6, 9, 9, 9, 8, 7, 4, 7, 8, 8, 12, 10, 4, 9, 7, 3, 5, 6, 3, 6, 5, 5, 4, 4, 7, 8, 10, 8, 10, 11, 6, 6, 11, 5, 7, 11, 4, 6, 2, 9, 10, 10, 11, 10, 7, 8, 5, 6, 11, 7, 10, 3, 6, 3, 2, 6, 8, 7, 3, 8, 7, 11, 11, 7, 6, 7, 11, 7, 5, 7, 6, 3, 11, 5, 6, 7, 9, 7, 8, 7, 8, 10, 7, 9, 3, 10, 7, 10, 12, 4, 6, 7, 3, 8, 7, 7, 10, 5, 6, 3, 6, 7, 7, 12, 6, 4, 7, 9, 4, 5, 6, 8, 7, 7, 10, 9, 4, 6, 7, 8, 4, 9, 9, 5, 10, 8, 2, 5, 3, 7, 5, 9, 4, 5, 9, 5, 10, 4, 5, 6, 7, 6, 8, 8, 10, 9, 10, 7, 7, 7, 5, 5, 3, 6, 8, 6, 5, 8, 
6, 4, 7, 3, 9, 9, 8, 7, 8, 8, 7, 2, 3, 6, 12, 3, 7, 6, 8, 9, 7, 7, 6, 12, 5, 10, 10, 9, 7, 6, 4, 12, 5, 4, 10, 5, 9, 7, 5, 7, 6, 5, 7, 7, 6, 5, 10, 11, 6, 4, 10, 6, 7, 7, 5, 2, 7, 7, 6, 8, 8, 11, 6, 5, 6, 8, 9, 7, 8, 6, 7, 4, 6, 4, 7, 4, 7, 11, 5, 6, 9, 7, 3, 11, 9, 7, 7, 2, 6, 9, 8, 8, 6, 2, 8, 5, 5, 9, 10, 7, 8, 7, 7, 8, 11, 3, 7, 8, 7, 3, 3, 6, 9, 5, 7, 10, 4, 5, 9, 6, 6, 6, 11, 9, 9, 7, 10, 4, 9, 6, 11, 6, 3, 5, 4, 3, 11, 10, 7, 7, 4, 3, 7, 10, 11, 7, 3, 8, 5, 7, 9, 7, 4, 9, 4, 5, 5, 4, 8, 5, 8, 8, 7, 6, 7, 9, 7, 9, 7, 9, 2, 9, 11, 5, 3, 9, 9, 6, 5, 10, 8, 5, 5, 7, 4, 6, 9, 7, 5, 8, 8, 7, 5, 5, 7, 9, 3, 5, 5, 7, 10, 8, 6, 6, 10, 9, 5, 4, 12, 2, 10, 5, 7, 7, 6, 10, 5, 8, 4, 3, 8, 7, 7, 11, 7, 10, 10, 10, 6, 6, 3, 10, 8, 9, 7, 6, 12, 7, 6, 7, 5, 9, 6, 5, 9, 9, 5, 9, 9, 6, 5, 11, 8, 9, 6, 4, 5, 3, 3, 11, 5, 9, 7, 3, 10, 9, 12, 9, 2, 7, 10, 4, 10, 3, 3, 11, 2, 4, 8, 6, 6, 8, 12, 8, 12, 9, 6, 4, 8, 3, 3, 6, 6, 11, 9, 10, 8, 7, 11, 11, 9, 7, 7, 8, 9, 9, 6, 9, 7, 7, 11, 9, 4, 11, 9, 8, 8, 6, 6, 7, 10, 9, 2, 9, 6, 7, 11, 7, 2, 9, 5, 11, 5, 8, 8, 6, 8, 9, 8, 6, 6, 10, 4, 9, 3, 8, 6, 8, 12, 8, 7, 8, 7, 2, 8, 6, 6, 6, 5, 9, 5, 5, 10, 6, 8, 9, 4, 5, 7, 8, 9, 7, 5, 4, 5, 7, 5, 8, 11, 4, 7, 5, 5, 8, 5, 12, 9, 4, 3, 2, 7, 11, 8, 4, 9, 6, 5, 7, 7, 9, 7, 10, 4, 7, 12, 6, 3, 11, 5, 8, 9, 9, 6, 7, 3, 9, 9, 9, 5, 5, 7, 3, 8, 7, 2, 7, 7, 3, 7, 6, 5, 6, 8, 3, 6, 6, 4, 11, 9, 4, 6, 8, 5, 7, 7, 7, 4, 10, 11, 6, 7, 6, 7, 6, 8, 4, 5, 5, 7, 4, 6, 9, 6, 4, 7, 9, 10, 11, 3, 5, 5, 9, 2, 7, 11, 6, 7, 7, 9, 8, 9, 8, 4, 4, 4, 10, 6, 9, 8, 5, 5, 7, 8, 5, 6, 10, 5, 5, 4, 9, 8, 3, 9, 5, 12, 10, 4, 9, 12, 7, 10, 11, 3, 11, 10, 7, 5, 7, 7, 11, 6, 8, 6, 5, 12, 6, 8, 10, 4, 5, 4, 5, 5, 7, 7, 2, 5, 8, 5, 4, 8, 5, 6, 11, 10, 9, 6, 6, 4, 10, 9, 9, 7, 5, 10, 6, 8, 9, 3, 3, 7, 4, 8, 10, 9, 7, 6, 7, 5, 11, 6, 2, 7, 5, 9, 12, 5, 5, 2, 8, 5, 12, 4, 8, 4, 6, 7, 8, 6, 8, 9, 8, 6, 7, 6, 11, 4, 10, 2, 4, 4, 8, 3, 8, 2, 8, 5, 4, 4, 7, 6, 8, 10, 6, 4, 6, 5, 9, 5, 6, 10, 7, 6, 8, 7, 9, 7, 7, 4, 11, 9, 3, 8, 11, 4, 
7, 6, 5, 8, 4, 5, 7, 11, 6, 7, 5, 8, 7, 7, 9, 5, 9, 4, 2, 9, 12, 6, 12, 7, 11, 6, 6, 7, 9, 4, 3, 3, 7, 6, 8, 6, 6, 9, 3, 6, 12, 5, 4, 3, 10, 8, 8, 7, 6, 12, 6, 10, 11, 3, 7, 9, 10, 11, 7, 11, 3, 8, 7, 9, 5, 9, 6, 7, 9, 3, 6, 3, 6, 7, 5, 8, 5, 8, 5, 9, 10, 3, 7, 8, 4, 8, 3, 7, 5, 7, 10, 5, 7, 11, 6, 2, 7, 7, 8, 7, 7, 7, 8, 4, 7, 4, 3, 7, 7, 9, 6, 8, 11, 3, 8, 6, 7, 4, 7, 10, 7, 7, 5, 8, 4, 7, 10, 6, 8, 12, 9, 6, 8, 6, 9, 4, 11, 7, 5, 5, 6, 6, 7, 7, 6, 12, 12, 5, 10, 6, 3, 6, 9, 7, 11, 6, 8, 5, 7, 9, 7, 3, 5, 6, 8, 9, 9, 8, 7, 6, 10, 7, 11, 8, 7, 7, 10, 6, 8, 9, 4, 7, 5, 7, 9, 8, 7, 11, 12, 3, 10, 12, 9, 7, 6, 8, 8, 5, 6, 9, 9, 6, 3, 8, 6, 4, 5, 9, 7, 8, 7, 7, 5, 9, 8, 8, 4, 8, 5, 7, 5, 11, 7, 10, 5, 4, 5, 6, 3, 3, 8, 10, 3, 8, 9, 7, 7, 7, 5, 11, 9, 7, 7, 7, 6, 4, 10, 5, 9, 2, 7, 9, 6, 6, 11, 10, 9, 4, 6, 5, 10, 9, 8, 6, 8, 9, 8, 10, 5, 5, 8, 9, 9, 7, 5, 7, 10, 11, 7, 6, 9, 6, 7, 10, 8, 7, 4, 11, 7, 8, 8, 7, 8, 6, 11, 12, 9, 9, 5, 7, 7, 7, 7, 2, 6, 7, 5, 9, 3, 4, 6, 6, 2, 5, 6, 4, 8, 11, 2, 8, 5, 7, 7, 6, 7, 6, 4, 7, 6, 3, 9, 9, 11, 7, 11, 7, 10, 8, 4, 5, 2, 7, 9, 8, 3, 7, 3, 5, 6, 8, 12, 11, 5, 2, 2, 5, 7, 12, 7, 6, 5, 5, 5, 8, 11, 12, 3, 6, 6, 6, 2, 7, 6, 3, 7, 4, 5, 8, 7, 4, 11, 7, 4, 11, 5, 9, 6, 10, 6, 6, 6, 8, 6, 12, 10, 5, 12, 4, 8, 5, 5, 8, 11, 9, 12, 4, 5, 4, 3, 7, 7, 6, 10, 4, 4, 7, 7, 6, 7, 8, 10, 8, 6, 9, 9, 4, 7, 8, 6, 8, 7, 6, 6, 7, 10, 11, 7, 5, 5, 3, 4, 11, 5, 4, 9, 9, 4, 6, 5, 5, 12, 9, 8, 10, 9, 12, 4, 4, 5, 10, 3, 6, 7, 5, 6, 9, 8, 9, 6, 6, 4, 7, 6, 6, 6, 6, 8, 7, 10, 8, 10, 12, 3, 11, 12, 7, 8, 7, 6, 2, 12, 7, 6, 10, 6, 8, 8, 8, 5, 10, 8, 9, 7, 7, 3, 8, 6, 10, 9, 4, 7, 7, 7, 9, 3, 7, 3, 4, 8, 7, 6, 12, 12, 12, 7, 2, 8, 4, 5, 3, 4, 11, 10, 6, 4, 11, 5, 6, 11, 2, 7, 9, 7, 7, 12, 6, 4, 7, 9, 7, 8, 7, 6, 7, 10, 5, 7, 9, 4, 7, 12, 5, 7, 8, 9, 9, 9, 10, 4, 10, 10, 6, 9, 9, 4, 6, 9, 9, 8, 7, 8, 7, 10, 8, 6, 5, 6, 6, 4, 10, 8, 10, 6, 11, 9, 7, 8, 11, 8, 5, 4, 10, 4, 6, 7, 9, 3, 6, 7, 7, 6, 10, 8, 7, 7, 5, 2, 4, 7, 7, 7, 8, 8, 10, 7, 7, 4, 7, 9, 10, 9, 6, 
7, 8, 3, 4, 7, 8, 11, 9, 9, 9, 7, 11, 11, 10, 3, 9, 5, 10, 10, 7, 2, 7, 8, 8, 8, 8, 12, 10, 8, 9, 4, 7, 6, 6, 8, 2, 6, 10, 9, 7, 7, 7, 7, 12, 6, 2, 8, 7, 5, 5, 10, 11, 10, 6, 11, 9, 4, 8, 5, 12, 11, 4, 8, 6, 5, 8, 7, 6, 9, 11, 4, 8, 6, 8, 11, 8, 8, 6, 6, 3, 5, 3, 3, 7, 12, 7, 5, 7, 3, 6, 5, 8, 8, 5, 2, 7, 2, 11, 6, 2, 5, 9, 7, 9, 8, 10, 5, 5, 12, 6, 9, 4, 8, 9, 8, 6, 7, 4, 2, 5, 4, 4, 8, 6, 8, 10, 6, 9, 7, 9, 8, 8, 7, 2, 10, 8, 8, 9, 4, 3, 5, 11, 4, 7, 8, 4, 4, 11, 2, 3, 8, 7, 9, 9, 4, 8, 6, 6, 10, 8, 4, 7, 7, 9, 7, 4, 7, 8, 5, 4, 3, 9, 9, 3, 3, 10, 4, 6, 6, 9, 4, 9, 3, 10, 10, 7, 8, 10, 9, 10, 8, 5, 7, 10, 5, 11, 5, 8, 5, 3, 7, 6, 9, 8, 7, 8, 5, 7, 8, 7, 8, 4, 11, 6, 7, 6, 5, 3, 2, 6, 7, 2, 7, 6, 9, 8, 9, 6, 11, 10, 10, 8, 3, 6, 5, 6, 5, 4, 10, 5, 7, 7, 7, 3, 6, 8, 9, 4, 11, 6, 10, 11, 6, 6, 6, 4, 8, 8, 8, 6, 8, 7, 5, 7, 6, 8, 12, 5, 9, 9, 10, 11, 6, 9, 4, 8, 9, 8, 2, 2, 12, 8, 9, 9, 6, 5, 9, 8, 4, 11, 9, 8, 6, 11, 8, 5, 7, 7, 8, 4, 5, 7, 5, 9, 6, 6, 3, 7, 10, 7, 6, 7, 12, 4, 7, 6, 3, 5, 9, 9, 5, 6, 3, 5, 7, 6, 5, 8, 6, 10, 6, 7, 7, 2, 7, 7, 6, 8, 10, 9, 4, 9, 7, 9, 8, 8, 12, 4, 4, 4, 8, 7, 7, 5, 4, 8, 7, 6, 7, 5, 9, 10, 8, 8, 8, 8, 6, 8, 11, 7, 9, 5, 10, 3, 6, 5, 2, 6, 7, 7, 5, 10, 5, 8, 7, 8, 7, 11, 7, 7, 7, 6, 5, 7, 11, 9, 9, 4, 7, 2, 4, 9, 5, 4, 6, 11, 4, 6, 11, 7, 7, 6, 8, 6, 9, 9, 3, 11, 8, 11, 6, 2, 8, 6, 7, 5, 7, 10, 3, 4, 7, 4, 7, 6, 8, 4, 9, 5, 10, 11, 8, 7, 7, 7, 3, 10, 5, 2, 6, 9, 8, 6, 5, 5, 9, 7, 2, 2, 8, 8, 4, 8, 11, 8, 8, 7, 4, 9, 10, 3, 9, 12, 7, 9, 9, 7, 8, 7, 10, 5, 3, 7, 6, 8, 5, 10, 7, 11, 5, 10, 6, 10, 8, 7, 2, 8, 10, 4, 8, 9, 10, 4, 7, 8, 6, 3, 6, 7, 3, 4, 5, 9, 5, 8, 4, 10, 6, 4, 2, 6, 7, 3, 6, 5, 10, 6, 12, 3, 5, 6, 10, 9, 11, 3, 7, 5, 4, 8, 2, 10, 5, 6, 5, 10, 7, 9, 11, 7, 8, 10, 4, 5, 8, 7, 3, 5, 7, 7, 8, 8, 7, 10, 9, 11, 6, 8, 7, 5, 8, 5, 6, 8, 8, 4, 4, 3, 4, 8, 6, 2, 7, 3, 9, 4, 5, 7, 11, 10, 3, 3, 9, 6, 10, 7, 9, 4, 3, 7, 6, 6, 4, 3, 6, 8, 5, 11, 9, 7, 6, 10, 5, 8, 3, 9, 12, 11, 8, 7, 11, 2, 10, 5, 8, 7, 2, 10, 7, 7, 8, 9, 8, 3, 3, 4, 
8, 9, 6, 5, 9, 9, 12, 9, 9, 8, 10, 7, 5, 8, 11, 11, 9, 5, 11, 4, 7, 9, 5, 7, 5, 6, 11, 4, 10, 9, 7, 11, 10, 5, 2, 6, 11, 12, 5, 7, 8, 6, 9, 5, 6, 4, 9, 9, 9, 6, 9, 7, 3, 7, 8, 5, 4, 3, 6, 12, 7, 10, 4, 6, 4, 4, 4, 8, 7, 9, 10, 3, 3, 7, 6, 11, 8, 8, 8, 5, 6, 5, 5, 4, 8, 7, 7, 7, 7, 6, 7, 6, 3, 7, 7, 9, 8, 12, 5, 7, 8, 3, 8, 5, 12, 10, 2, 10, 10, 5, 7, 8, 7, 9, 9, 7, 8, 8, 4, 2, 3, 10, 5, 9, 7, 7, 8, 9, 9, 8, 5, 6, 6, 4, 6, 7, 5, 7, 4, 6, 7, 8, 8, 11, 4, 10, 6, 6, 10, 7, 4, 10, 5, 5, 3, 2, 7, 5, 2, 4, 7, 8, 7, 6, 8, 10, 8, 8, 8, 8, 10, 5, 6, 4, 6, 4, 4, 3, 6, 9, 6, 8, 10, 5, 5, 12, 4, 8, 3, 2, 3, 12, 5, 4, 7, 9, 10, 4, 11, 11, 7, 8, 3, 10, 9, 11, 6, 9, 10, 7, 7, 5, 7, 9, 9, 9, 10, 5, 8, 9, 11, 8, 5, 7, 11, 9, 4, 6, 10, 11, 8, 6, 2, 9, 5, 8, 7, 6, 3, 7, 4, 10, 2, 8, 8, 5, 8, 11, 8, 3, 6, 6, 7, 9, 6, 8, 4, 4, 9, 12, 5, 6, 9, 6, 6, 6, 7, 8, 3, 6, 3, 12, 6, 5, 5, 8, 8, 6, 9, 6, 8, 9, 9, 7, 4, 8, 10, 8, 7, 4, 4, 9, 11, 7, 5, 6, 7, 8, 7, 7, 6, 7, 8, 9, 7, 7, 6, 6, 3, 11, 6, 5, 9, 6, 7, 4, 5, 6, 8, 11, 7, 9, 2, 7, 12, 7, 5, 5, 4, 5, 9, 9, 4, 5, 8, 5, 3, 4, 8, 7, 5, 9, 6, 2, 2, 9, 2, 6, 6, 6, 11, 11, 9, 11, 9, 7, 9, 4, 4, 7, 7, 5, 8, 4, 8, 10, 7, 9, 8, 4, 5, 7, 8, 3, 7, 7, 7, 4, 9, 3, 4, 7, 5, 6, 6, 12, 7, 7, 7, 6, 7, 9, 10, 8, 2, 6, 7, 11, 7, 6, 3, 8, 6, 8, 9, 7, 4, 8, 9, 11, 9, 7, 12, 7, 4, 7, 9, 11, 7, 6, 6, 8, 7, 9, 8, 5, 11, 7, 4, 7, 7, 9, 4, 4, 4, 8, 4, 6, 5, 5, 7, 10, 5, 2, 8, 10, 9, 4, 10, 6, 6, 10, 10, 5, 5, 4, 10, 2, 9, 4, 8, 5, 3, 3, 4, 8, 6, 8, 9, 8, 7, 4, 7, 12, 6, 6, 4, 8, 8, 3, 11, 10, 12, 11, 8, 7, 10, 7, 8, 7, 5, 6, 7, 11, 7, 8, 7, 7, 2, 11, 7, 8, 8, 11, 5, 8, 4, 7, 8, 8, 5, 3, 9, 4, 7, 4, 7, 9, 6, 5, 6, 3, 5, 5, 7, 6, 8, 10, 9, 6, 8, 8, 11, 9, 11, 11, 3, 3, 9, 8, 9, 9, 6, 5, 5, 12, 7, 6, 6, 9, 10, 7, 11, 7, 7, 10, 9, 5, 7, 7, 11, 9, 8, 3, 8, 6, 8, 8, 7, 12, 5, 10, 7, 11, 6, 7, 8, 8, 6, 8, 8, 2, 7, 5, 9, 5, 9, 9, 8, 12, 8, 8, 6, 2, 4, 5, 12, 9, 7, 9, 4, 6, 4, 3, 7, 5, 8, 9, 6, 5, 10, 10, 10, 7, 11, 4, 6, 7, 9, 10, 6, 11, 6, 5, 12, 7, 3, 11, 6, 4, 7, 8, 2, 7, 
8, 6, 6, 8, 3, 2, 8, 6, 9, 5, 11, 8, 6, 9, 7, 10, 10, 10, 6, 5, 9, 4, 5, 8, 8, 6, 6, 6, 10, 4, 7, 7, 5, 7, 9, 12, 6, 7, 5, 5, 10, 7, 5, 4, 7, 6, 6, 5, 5, 8, 9, 7, 7, 7, 9, 9, 8, 9, 11, 11, 10, 5, 3, 8, 10, 9, 7, 11, 6, 12, 6, 3, 8, 6, 3, 11, 11, 9, 6, 5, 7, 9, 7, 9, 6, 8, 9, 3, 7, 9, 10, 8, 9, 9, 7, 6, 9, 7, 5, 5, 5, 3, 8, 10, 6, 10, 8, 10, 8, 4, 11, 4, 12, 6, 7, 3, 9, 5, 11, 5, 7, 4, 7, 8, 12, 9, 8, 10, 4, 4, 5, 6, 4, 5, 6, 7, 3, 3, 11, 8, 9, 2, 8, 4, 8, 7, 8, 9, 10, 5, 10, 7, 9, 8, 8, 6, 7, 5, 6, 11, 2, 5, 3, 8, 4, 7, 7, 4, 7, 2, 7, 10, 10, 7, 9, 3, 5, 8, 6, 4, 8, 7, 7, 6, 8, 6, 11, 7, 3, 6, 6, 6, 9, 11, 6, 5, 7, 3, 12, 7, 10, 4, 6, 7, 4, 11, 3, 3, 6, 6, 12, 11, 12, 10, 11, 7, 9, 7, 5, 12, 6, 3, 6, 4, 5, 10, 6, 11, 11, 7, 6, 8, 11, 5, 12, 4, 7, 9, 9, 9, 10, 7, 9, 7, 4, 4, 6, 8, 6, 3, 4, 9, 7, 11, 8, 6, 11, 5, 7, 11, 7, 7, 6, 4, 9, 12, 9, 8, 8, 8, 9, 6, 8, 5, 11, 6, 8, 6, 5, 8, 5, 8, 6, 11, 5, 8, 3, 7, 8, 8, 10, 9, 8, 9, 8, 4, 7, 9, 5, 8, 8, 9, 7, 3, 9, 3, 4, 6, 9, 9, 5, 6, 4, 8, 9, 7, 5, 10, 5, 8, 5, 5, 5, 8, 9, 3, 9, 10, 10, 6, 4, 6, 2, 6, 2, 8, 7, 4, 6, 6, 7, 9, 4, 6, 8, 5, 7, 7, 7, 9, 2, 6, 7, 3, 10, 10, 7, 3, 5, 3, 6, 6, 7, 12, 9, 9, 11, 9, 4, 4, 10, 8, 8, 9, 8, 4, 4, 6, 2, 7, 5, 7, 7, 10, 4, 11, 5, 7, 8, 8, 2, 8, 6, 9, 8, 7, 8, 8, 10, 4, 7, 10, 10, 10, 4, 6, 12, 11, 4, 9, 12, 2, 3, 5, 3, 3, 11, 7, 8, 8, 5, 10, 8, 9, 4, 7, 7, 2, 5, 10, 7, 10, 9, 9, 4, 7, 8, 9, 8, 7, 7, 6, 12, 2, 7, 11, 10, 8, 7, 9, 11, 7, 9, 6, 8, 9, 10, 7, 3, 8, 10, 6, 6, 4, 2, 7, 11, 5, 6, 5, 4, 3, 2, 8, 6, 7, 6, 8, 6, 11, 8, 6, 10, 6, 5, 11, 4, 9, 5, 11, 10, 4, 7, 10, 7, 3, 9, 7, 8, 5, 4, 12, 9, 7, 5, 6, 6, 10, 6, 7, 5, 4, 12, 9, 7, 7, 9, 8, 3, 6, 6, 8, 10, 10, 5, 4, 7, 6, 2, 5, 12, 8, 4, 4, 7, 6, 5, 3, 8, 5, 5, 7, 12, 9, 7, 10, 9, 9, 8, 6, 8, 6, 6, 6, 8, 5, 12, 7, 5, 7, 8, 4, 5, 2, 6, 4, 5, 10, 7, 5, 6, 5, 4, 3, 2, 12, 8, 6, 8, 9, 9, 12, 6, 8, 9, 8, 5, 3, 6, 6, 10, 9, 11, 6, 7, 3, 7, 3, 8, 9, 10, 6, 4, 7, 5, 9, 11, 7, 9, 3, 8, 6, 8, 9, 10, 5, 3, 9, 5, 4, 11, 7, 6, 11, 2, 5, 7, 4, 6, 7, 6, 
6, 7, 3, 8, 7, 3, 7, 5, 8, 10, 9, 8, 6, 5, 5, 8, 6, 6, 5, 5, 5, 5, 7, 2, 7, 6, 8, 9, 3, 4, 2, 6, 9, 8, 8, 11, 11, 7, 9, 7, 8, 5, 10, 5, 5, 7, 5, 9, 7, 6, 4, 6, 4, 6, 7, 9, 6, 6, 7, 11, 9, 4, 5, 8, 7, 5, 11, 8, 5, 6, 7, 4, 7, 9, 12, 5, 5, 4, 5, 6, 5, 5, 4, 10, 6, 4, 6, 7, 5, 7, 10, 10, 6, 8, 10, 6, 9, 5, 8, 6, 10, 5, 3, 2, 5, 8, 4, 8, 6, 6, 7, 7, 8, 8, 4, 5, 5, 5, 8, 6, 11, 3, 8, 9, 3, 8, 9, 7, 11, 12, 10, 5, 4, 10, 8, 7, 4, 5, 10, 10, 5, 5, 8, 5, 8, 12, 3, 3, 8, 7, 8, 10, 4, 7, 9, 3, 5, 5, 11, 11, 10, 5, 8, 10, 5, 3, 4, 7, 12, 8, 6, 4, 10, 10, 3, 2, 11, 5, 6, 9, 3, 2, 5, 10, 7, 9, 11, 6, 5, 10, 5, 8, 10, 12, 9, 10, 6, 6, 10, 6, 7, 5, 9, 6, 4, 5, 11, 6, 6, 6, 8, 6, 9, 8, 7, 8, 9, 8, 9, 7, 6, 8, 5, 4, 5, 6, 11, 5, 7, 4, 8, 3, 4, 8, 10, 6, 11, 5, 8, 10, 3, 7, 6, 10, 6, 8, 5, 4, 5, 6, 6, 6, 7, 9, 6, 7, 8, 12, 6, 5, 2, 9, 8, 4, 8, 7, 5, 4, 6, 9, 5, 10, 6, 11, 2, 3, 10, 11, 11, 6, 7, 11, 10, 5, 8, 7, 7, 4, 6, 4, 9, 7, 7, 8, 6, 5, 6, 4, 3, 9, 11, 6, 6, 3, 5, 6, 7, 8, 8, 10, 10, 10, 8, 7, 12, 7, 11, 5, 3, 6, 5, 9, 10, 5, 8, 6, 8, 7, 6, 10, 7, 12, 5, 8, 7, 11, 10, 8, 8, 4, 3, 6, 6, 2, 8, 9, 9, 5, 9, 6, 8, 11, 5, 5, 12, 12, 10, 6, 10, 8, 8, 12, 12, 6, 7, 5, 6, 8, 5, 3, 8, 9, 7, 10, 9, 9, 9, 5, 2, 9, 7, 11, 6, 7, 7, 6, 9, 6, 7, 9, 6, 7, 10, 7, 11, 7, 6, 6, 6, 4, 5, 8, 7, 7, 9, 7, 5, 3, 11, 7, 6, 10, 4, 4, 2, 8, 9, 6, 10, 6, 8, 5, 12, 11, 7, 11, 10, 6, 2, 3, 9, 8, 6, 6, 11, 10, 5, 6, 5, 5, 9, 2, 9, 2, 7, 10, 8, 6, 2, 7, 4, 7, 9, 2, 3, 6, 5, 9, 7, 7, 8, 12, 10, 2, 10, 8, 7, 2, 7, 7, 8, 5, 4, 5, 7, 7, 10, 5, 7, 12, 8, 6, 7, 10, 5, 7, 8, 6, 5, 8, 7, 9, 6, 12, 7, 8, 5, 5, 9, 5, 4, 8, 11, 7, 6, 10, 3, 4, 5, 7, 11, 8, 8, 7, 7, 5, 6, 8, 7, 8, 6, 2, 6, 6, 6, 7, 6, 7, 6, 7, 6, 4, 5, 7, 8, 4, 7, 5, 6, 7, 10, 5, 4, 5, 2, 6, 2, 6, 7, 10, 7, 7, 11, 3, 6, 7, 8, 6, 7, 10, 8, 6, 5, 2, 6, 9, 4, 6, 8, 12, 7, 4, 4, 10, 9, 9, 7, 9, 7, 10, 10, 6, 3, 10, 7, 8, 7, 3, 5, 7, 9, 8, 6, 4, 7, 6, 4, 7, 10, 2, 10, 5, 6, 9, 10, 6, 5, 12, 4, 3, 11, 10, 6, 7, 6, 8, 5, 6, 9, 5, 5, 10, 3, 3, 7, 8, 9, 4, 8, 4, 
9, 10, 7, 11, 9, 8, 2, 8, 5, 9, 8, 6, 7, 4, 7, 8, 7, 8, 9, 7, 12, 6, 7, 4, 10, 6, 5, 5, 8, 7, 11, 8, 10, 7, 6, 7, 11, 11, 10, 12, 8, 5, 5, 3, 7, 6, 9, 6, 12, 12, 10, 7, 9, 4, 8, 6, 8, 9, 9, 10, 4, 5, 9, 3, 6, 9, 7, 8, 8, 6, 10, 10, 5, 11, 9, 4, 4, 9, 10, 9, 7, 4, 7, 8, 5, 7, 4, 4, 6, 8, 5, 8, 8, 5, 10, 7, 11, 8, 2, 5, 7, 8, 10, 6, 9, 6, 7, 4, 9, 3, 5, 9, 6, 3, 11, 12, 10, 3, 8, 5, 10, 8, 5, 4, 8, 9, 8, 11, 6, 9, 9, 6, 11, 5, 4, 9, 10, 8, 3, 8, 6, 8, 11, 10, 5, 5, 7, 9, 3, 11, 6, 9, 8, 9, 7, 2, 8, 4, 7, 5, 10, 9, 7, 5, 9, 5, 9, 5, 8, 5, 6, 5, 8, 8, 7, 7, 3, 6, 3, 8, 7, 11, 6, 3, 11, 11, 9, 8, 6, 4, 9, 6, 10, 5, 4, 4, 3, 6, 6, 5, 8, 8, 2, 10, 6, 8, 7, 9, 4, 8, 8, 8, 5, 5, 2, 8, 9, 10, 8, 7, 5, 5, 5, 12, 4, 7, 8, 11, 9, 8, 11, 11, 10, 9, 5, 7, 10, 4, 6, 5, 10, 8, 12, 7, 6, 3, 7, 5, 8, 9, 6, 9, 11, 9, 11, 12, 6, 5, 8, 9, 8, 9, 4, 5, 4, 2, 6, 3, 10, 3, 7, 8, 10, 7, 11, 10, 8, 7, 6, 9, 5, 6, 5, 11, 7, 9, 2, 7, 6, 3, 6, 3, 8, 5, 3, 4, 2, 11, 4, 9, 9, 7, 5, 6, 7, 7, 8, 7, 9, 4, 9, 6, 4, 8, 10, 9, 8, 3, 4, 8, 5, 9, 5, 7, 11, 4, 6, 2, 9, 4, 9, 10, 8, 7, 8, 9, 5, 6, 9, 6, 10, 8, 8, 4, 6, 4, 4, 5, 9, 7, 4, 9, 7, 4, 7, 9, 6, 2, 5, 10, 5, 6, 3, 11, 9, 10, 8, 5, 6, 7, 2, 6, 9, 6, 7, 8, 6, 6, 6, 6, 10, 9, 8, 9, 7, 9, 7, 6, 9, 7, 3, 9, 8, 10, 4, 9, 4, 8, 3, 8, 7, 10, 8, 5, 10, 4, 10, 5, 6, 6, 7, 7, 8, 9, 7, 6, 3, 8, 3, 10, 10, 6, 11, 6, 11, 9, 9, 8, 7, 7, 10, 4, 9, 5, 12, 12, 4, 10, 11, 7, 8, 7, 3, 9, 9, 5, 6, 6, 3, 7, 4, 10, 7, 6, 4, 7, 7, 3, 4, 5, 4, 10, 6, 5, 7, 10, 12, 7, 12, 8, 8, 6, 10, 3, 5, 12, 7, 5, 5, 10, 7, 5, 11, 6, 5, 4, 4, 5, 11, 6, 2, 12, 6, 7, 6, 8, 7, 6, 11, 8, 8, 9, 6, 5, 7, 2, 9, 9, 7, 9, 7, 3, 8, 7, 7, 5, 4, 6, 7, 8, 8, 8, 4, 3, 6, 7, 11, 7, 5, 4, 6, 11, 8, 5, 4, 2, 6, 10, 12, 3, 6, 6, 4, 4, 9, 5, 9, 9, 5, 4, 5, 8, 7, 4, 10, 5, 6, 7, 10, 9, 11, 11, 9, 10, 10, 11, 8, 6, 2, 11, 9, 3, 5, 6, 6, 7, 11, 7, 3, 9, 12, 8, 5, 6, 6, 5, 10, 9, 7, 4, 7, 6, 7, 9, 5, 9, 9, 7, 7, 2, 11, 10, 10, 4, 12, 4, 5, 12, 10, 10, 6, 11, 5, 7, 6, 4, 8, 7, 8, 7, 10, 2, 8, 6, 4, 5, 12, 4, 6, 
5, 6, 11, 2, 10, 7, 11, 5, 5, 9, 8, 8, 8, 9, 2, 11, 6, 10, 8, 7, 2, 8, 9, 8, 9, 4, 12, 7, 8, 5, 4, 9, 6, 3, 5, 8, 9, 11, 9, 11, 4, 5, 4, 7, 9, 6, 10, 8, 11, 3, 5, 4, 6, 4, 8, 10, 8, 11, 3, 11, 6, 4, 10, 6, 5, 4, 4, 8, 6, 8, 6, 8, 6, 5, 2, 8, 4, 8, 3, 5, 9, 7, 10, 4, 5, 10, 4, 11, 4, 7, 9, 7, 11, 4, 8, 3, 7, 9, 2, 10, 3, 7, 10, 3, 11, 4, 7, 6, 8, 2, 6, 3, 8, 2, 10, 5, 6, 6, 8, 5, 7, 7, 7, 5, 5, 11, 10, 6, 2, 5, 8, 4, 7, 6, 7, 8, 7, 9, 12, 9, 7, 2, 8, 8, 9, 8, 8, 7, 8, 4, 9, 5, 7, 7, 11, 6, 10, 6, 4, 10, 6, 5, 7, 6, 7, 6, 10, 9, 8, 5, 6, 8, 8, 8, 6, 5, 7, 5, 7, 9, 4, 10, 6, 8, 7, 9, 5, 8, 6, 5, 8, 6, 5, 9, 6, 3, 4, 10, 8, 10, 9, 6, 7, 5, 11, 11, 8, 3, 6, 6, 5, 12, 6, 8, 3, 3, 4, 7, 7, 5, 7, 8, 6, 10, 9, 3, 6, 8, 4, 6, 11, 6, 6, 11, 7, 9, 8, 8, 8, 8, 4, 6, 10, 5, 7, 9, 6, 10, 4, 11, 7, 4, 10, 6, 3, 10, 4, 7, 5, 7, 4, 9, 9, 7, 8, 6, 8, 10, 5, 8, 2, 6, 3, 3, 8, 12, 6, 9, 10, 11, 10, 6, 5, 7, 10, 8, 6, 10, 7, 8, 7, 6, 8, 12, 5, 7, 5, 8, 3, 7, 7, 12, 10, 12, 10, 11, 6, 3, 5, 8, 3, 4, 6, 11, 8, 4, 9, 6, 9, 7, 3, 8, 7, 5, 6, 12, 4, 12, 7, 6, 6, 5, 6, 9, 10, 7, 8, 3, 5, 3, 6, 8, 7, 7, 5, 2, 7, 11, 7, 6, 9, 8, 6, 7, 8, 4, 8, 8, 5, 9, 9, 10, 6, 7, 11, 9, 9, 5, 12, 2, 5, 7, 7, 3, 9, 7, 8, 6, 8, 9, 4, 5, 6, 5, 11, 11, 6, 6, 10, 9, 4, 4, 7, 9, 11, 6, 4, 8, 9, 4, 8, 12, 4, 6, 9, 2, 4, 7, 3, 4, 5, 8, 7, 5, 7, 6, 3, 6, 10, 9, 12, 10, 9, 9, 11, 6, 10, 4, 6, 6, 2, 9, 7, 8, 5, 8, 5, 6, 10, 5, 4, 12, 2, 7, 4, 7, 6, 5, 6, 4, 8, 8, 9, 7, 9, 10, 5, 2, 6, 8, 5, 8, 8, 9, 10, 5, 5, 10, 4, 3, 5, 10, 5, 5, 4, 7, 10, 11, 2, 6, 3, 7, 4, 8, 2, 8, 6, 8, 10, 9, 8, 7, 7, 7, 5, 5, 8, 10, 8, 3, 7, 6, 6, 5, 3, 7, 5, 9, 6, 11, 6, 3, 5, 9, 3, 7, 5, 10, 9, 9, 9, 6, 3, 7, 5, 9, 7, 7, 5, 7, 7, 5, 7, 7, 12, 6, 2, 5, 10, 10, 11, 12, 7, 4, 9, 8, 6, 5, 7, 7, 8, 7, 6, 4, 4, 9, 9, 9, 10, 8, 9, 7, 5, 7, 7, 4, 6, 9, 8, 7, 7, 6, 3, 7, 7, 6, 3, 10, 9, 6, 6, 3, 11, 9, 8, 9, 12, 12, 3, 6, 10, 4, 7, 6, 4, 6, 10, 7, 10, 7, 12, 6, 9, 5, 8, 7, 5, 4, 12, 7, 7, 8, 10, 8, 7, 3, 11, 9, 9, 7, 7, 10, 6, 3, 2, 8, 8, 6, 7, 7, 10, 
8, 3, 7, 6, 5, 5, 4, 10, 9, 6, 8, 3, 3, 6, 11, 10, 8, 7, 3, 9, 6, 8, 9, 7, 8, 9, 12, 10, 8, 8, 7, 6, 3, 4, 9, 6, 2, 8, 6, 10, 10, 7, 4, 7, 2, 8, 7, 4, 6, 9, 12, 12, 4, 8, 12, 9, 10, 8, 4, 6, 8, 10, 5, 11, 5, 6, 5, 8, 11, 3, 3, 6, 12, 5, 10, 5, 5, 5, 8, 8, 6, 5, 8, 10, 11, 8, 9, 7, 10, 9, 8, 10, 9, 12, 8, 9, 5, 11, 6, 6, 10, 5, 7, 11, 8, 7, 7, 9, 5, 9, 4, 5, 9, 11, 9, 4, 5, 11, 7, 6, 7, 6, 6, 8, 7, 9, 6, 7, 5, 9, 2, 4, 8, 9, 12, 5, 3, 6, 7, 8, 7, 3, 12, 7, 6, 10, 3, 5, 6, 9, 8, 2, 6, 10, 9, 12, 5, 5, 9, 4, 10, 4, 8, 7, 12, 8, 5, 9, 6, 7, 3, 6, 7, 7, 4, 5, 8, 8, 7, 5, 4, 8, 6, 5, 9, 7, 11, 9, 7, 7, 8, 11, 7, 7, 10, 7, 6, 6, 11, 6, 7, 11, 10, 9, 8, 7, 7, 10, 8, 9, 9, 12, 7, 6, 7, 4, 9, 7, 8, 8, 7, 5, 8, 11, 5, 8, 6, 8, 11, 4, 8, 9, 6, 10, 7, 6, 7, 11, 3, 9, 4, 5, 6, 9, 7, 7, 8, 6, 4, 10, 6, 5, 10, 10, 11, 8, 9, 6, 10, 11, 10, 10, 10, 6, 7, 3, 6, 9, 7, 8, 5, 7, 10, 3, 8, 9, 8, 5, 7, 9, 5, 6, 7, 7, 10, 10, 7, 6, 8, 10, 3, 7, 11, 8, 5, 9, 8, 9, 8, 4, 8, 7, 5, 9, 10, 8, 10, 7, 9, 4, 4, 12, 9, 11, 2, 6, 6, 5, 7, 5, 6, 5, 6, 7, 11, 3, 9, 9, 3, 6, 8, 6, 8, 10, 6, 8, 7, 2, 9, 3, 5, 7, 7, 5, 9, 5, 8, 5, 7, 7, 7, 8, 5, 8, 8, 6, 10, 9, 4, 6, 4, 12, 6, 7, 6, 7, 7, 9, 9, 7, 9, 4, 8, 3, 10, 10, 5, 10, 5, 7, 9, 11, 8, 7, 6, 12, 11, 8, 6, 5, 9, 3, 9, 8, 9, 7, 8, 7, 9, 8, 6, 3, 7, 8, 4, 3, 7, 6, 11, 7, 7, 9, 8, 9, 10, 3, 6, 9, 5, 8, 8, 8, 9, 8, 5, 5, 5, 7, 11, 5, 9, 9, 6, 11, 7, 11, 9, 10, 7, 6, 7, 8, 10, 4, 3, 8, 5, 7, 7, 7, 7, 9, 10, 6, 9, 4, 11, 10, 8, 8, 5, 4, 4, 6, 7, 2, 3, 4, 7, 8, 8, 7, 11, 7, 8, 7, 7, 3, 7, 7, 4, 8, 10, 8, 4, 10, 8, 11, 5, 9, 9, 7, 7, 8, 5, 4, 7, 2, 3, 7, 5, 6, 7, 8, 10, 4, 7, 8, 8, 9, 7, 7, 7, 5, 3, 9, 5, 9, 9, 8, 4, 10, 11, 6, 7, 8, 10, 5, 7, 8, 6, 4, 2, 9, 8, 7, 6, 3, 7, 12, 9, 6, 7, 12, 11, 6, 10, 3, 7, 8, 6, 6, 5, 11, 6, 11, 7, 11, 6, 10, 5, 9, 6, 8, 9, 3, 4, 8, 2, 5, 4, 11, 5, 11, 8, 5, 8, 7, 11, 3, 6, 8, 11, 8, 10, 5, 6, 9, 4, 8, 5, 7, 7, 6, 5, 9, 3, 7, 7, 6, 7, 9, 8, 4, 7, 7, 9, 7, 9, 5, 7, 6, 7, 7, 12, 4, 8, 9, 6, 7, 5, 6, 5, 3, 8, 9, 5, 10, 4, 6, 5, 
8, 11, 9, 12, 4, 10, 5, 8, 7, 6, 11, 5, 8, 5, 3, 10, 5, 9, 12, 12, 10, 8, 6, 7, 11, 3, 5, 5, 8, 2, 8, 7, 5, 6, 5, 5, 9, 8, 7, 7, 7, 8, 8, 7, 10, 5, 9, 5, 4, 7, 4, 3, 5, 8, 11, 6, 7, 4, 7, 5, 5, 7, 8, 10, 10, 4, 5, 5, 6, 11, 6, 7, 6, 6, 12, 7, 8, 6, 5, 5, 2, 9, 6, 7, 8, 11, 8, 9, 12, 6, 5, 11, 11, 10, 10, 8, 7, 8, 5, 7, 9, 2, 3, 10, 8, 8, 10, 8, 4, 11, 9, 6, 7, 10, 9, 8, 9, 7, 9, 9, 6, 9, 5, 7, 11, 3, 8, 4, 12, 11, 9, 5, 7, 10, 7, 7, 9, 3, 6, 11, 9, 7, 3, 7, 6, 8, 7, 9, 4, 9, 7, 10, 8, 8, 9, 4, 5, 3, 10, 9, 5, 10, 9, 3, 9, 9, 5, 8, 10, 6, 5, 11, 7, 6, 9, 9, 3, 6, 5, 8, 7, 5, 12, 8, 6, 7, 4, 4, 7, 3, 9, 3, 10, 7, 7, 3, 5, 7, 5, 5, 8, 6, 4, 8, 2, 6, 7, 4, 11, 7, 7, 5, 6, 5, 4, 6, 7, 3, 7, 5, 4, 7, 6, 9, 4, 6, 7, 2, 6, 6, 9, 5, 9, 8, 7, 3, 11, 10, 5, 8, 7, 6, 7, 5, 11, 8, 10, 7, 8, 7, 8, 5, 8, 7, 7, 10, 4, 9, 11, 6, 7, 11, 6, 11, 10, 10, 8, 11, 8, 8, 10, 5, 5, 6, 8, 11, 7, 11, 3, 10, 7, 9, 8, 5, 5, 8, 8, 6, 10, 3, 8, 7, 4, 6, 8, 8, 10, 7, 7, 6, 4, 3, 7, 9, 6, 3, 12, 11, 12, 7, 10, 5, 7, 7, 7, 6, 6, 2, 8, 11, 7, 8, 7, 7, 7, 11, 9, 10, 6, 10, 6, 8, 9, 8, 9, 8, 3, 4, 9, 8, 7, 10, 3, 2, 12, 3, 5, 8, 7, 4, 7, 7, 10, 7, 7, 8, 5, 8, 7, 6, 7, 5, 11, 6, 2, 6, 8, 6, 5, 7, 9, 5, 9, 11, 6, 5, 5, 12, 9, 8, 5, 8, 4, 5, 11, 6, 9, 7, 9, 6, 2, 5, 10, 5, 6, 7, 5, 11, 7, 6, 5, 7, 6, 6, 7, 7, 3, 4, 8, 8, 8, 7, 9, 6, 5, 5, 3, 7, 2, 7, 6, 7, 7, 5, 7, 7, 4, 7, 9, 5, 7, 6, 7, 8, 10, 4, 10, 5, 4, 5, 7, 6, 7, 10, 4, 3, 6, 8, 8, 8, 2, 8, 10, 6, 7, 6, 8, 5, 4, 10, 3, 3, 10, 3, 8, 8, 5, 3, 7, 7, 7, 5, 7, 5, 2, 5, 10, 5, 3, 6, 9, 7, 8, 4, 5, 10, 5, 6, 10, 6, 8, 6, 7, 5, 12, 11, 3, 8, 10, 4, 6, 3, 5, 6, 6, 7, 8, 7, 6, 10, 6, 6, 9, 5, 7, 4, 8, 3, 3, 6, 9, 12, 8, 5, 10, 8, 10, 6, 9, 3, 7, 11, 3, 5, 12, 3, 8, 9, 5, 5, 7, 10, 7, 9, 6, 5, 6, 5, 4, 7, 8, 4, 10, 6, 4, 5, 8, 9, 7, 7, 7, 4, 11, 7, 10, 4, 6, 10, 6, 11, 4, 7, 3, 7, 4, 12, 10, 8, 3, 9, 9, 7, 11, 7, 12, 7, 3, 4, 3, 10, 6, 12, 5, 8, 6, 6, 2, 10, 7, 7, 7, 4, 2, 7, 10, 8, 3, 3, 7, 7, 6, 7, 10, 11, 5, 7, 8, 12, 7, 12, 3, 6, 7, 11, 8, 9, 5, 4, 7, 8, 
10, 7, 7, 11, 7, 8, 8, 4, 3, 5, 6, 8, 2, 5, 4, 5, 9, 7, 3, 8, 8, 10, 9, 9, 3, 7, 4, 8, 6, 3, 5, 10, 12, 4, 5, 12, 11, 11, 5, 12, 7, 5, 8, 6, 7, 8, 2, 9, 3, 8, 7, 4, 8, 5, 10, 12, 5, 7, 4, 9, 7, 4, 8, 5, 2, 5, 12, 5, 7, 6, 5, 4, 2, 6, 8, 6, 7, 8, 6, 10, 8, 7, 9, 6, 7, 11, 9, 11, 6, 7, 4, 10, 11, 11, 9, 8, 7, 6, 6, 5, 8, 2, 3, 4, 12, 7, 11, 7, 6, 10, 7, 9, 8, 9, 7, 7, 6, 7, 11, 5, 2, 6, 5, 8, 11, 9, 10, 10, 8, 7, 11, 7, 5, 11, 8, 6, 5, 7, 5, 7, 5, 7, 10, 2, 10, 8, 7, 12, 11, 8, 11, 10, 8, 6, 11, 10, 3, 5, 7, 8, 9, 5, 10, 7, 8, 10, 7, 7, 12, 5, 11, 10, 6, 5, 8, 5, 11, 7, 8, 4, 7, 8, 6, 6, 11, 7, 5, 9, 6, 6, 5, 6, 8, 8, 10, 7, 6, 7, 7, 6, 10, 2, 12, 3, 6, 3, 7, 3, 2, 3, 3, 8, 10, 10, 7, 10, 5, 3, 9, 11, 8, 7, 5, 7, 6, 3, 8, 10, 3, 10, 6, 7, 3, 5, 3, 9, 5, 2, 5, 5, 9, 11, 9, 7, 3, 10, 10, 5, 10, 7, 9, 10, 5, 12, 9, 9, 7, 8, 10, 10, 7, 4, 6, 9, 8, 11, 11, 7, 7, 5, 8, 7, 10, 10, 8, 9, 7, 8, 3, 11, 9, 5, 11, 6, 9, 2, 2, 8, 8, 10, 9, 7, 12, 8, 6, 12, 5, 6, 2, 9, 9, 9, 7, 5, 7, 7, 7, 9, 8, 9, 11, 4, 4, 7, 8, 2, 4, 11, 11, 9, 9, 4, 4, 6, 2, 7, 10, 6, 4, 6, 10, 8, 8, 7, 7, 7, 4, 9, 6, 9, 7, 9, 6, 9, 7, 6, 8, 7, 5, 4, 6, 3, 9, 11, 5, 5, 6, 6, 8, 8, 4, 12, 4, 7, 7, 11, 8, 7, 3, 5, 8, 8, 2, 7, 8, 3, 6, 8, 6, 3, 8, 5, 3, 8, 5, 7, 7, 8, 6, 9, 7, 7, 9, 7, 10, 6, 11, 9, 7, 3, 4, 6, 6, 7, 4, 9, 9, 11, 2, 9, 3, 5, 9, 11, 9, 7, 7, 11, 9, 7, 6, 5, 10, 2, 6, 8, 7, 12, 5, 7, 6, 5, 5, 8, 8, 7, 8, 2, 8, 3, 10, 9, 7, 4, 4, 9, 6, 8, 7, 5, 9, 9, 7, 3, 7, 4, 8, 7, 9, 7, 4, 9, 7, 3, 10, 12, 9, 9, 5, 4, 6, 4, 7, 11, 6, 6, 8, 4, 11, 12, 7, 6, 6, 8, 12, 6, 7, 11, 5, 10, 8, 3, 5, 7, 5, 6, 3, 6, 10, 6, 6, 6, 11, 8, 7, 8, 9, 5, 5, 7, 6, 4, 10, 6, 7, 5, 9, 10, 4, 6, 5, 7, 8, 3, 8, 6, 3, 4, 4, 7, 11, 5, 8, 6, 7, 8, 11, 3, 8, 9, 4, 9, 9, 9, 9, 7, 6, 5, 6, 2, 5, 7, 10, 5, 6, 7, 9, 7, 8, 9, 8, 4, 11, 7, 7, 3, 4, 8, 7, 7, 7, 3, 9, 8, 5, 6, 6, 6, 7, 7, 10, 5, 8, 5, 10, 9, 8, 9, 5, 8, 9, 8, 7, 10, 5, 7, 4, 3, 8, 6, 8, 4, 5, 10, 5, 5, 8, 7, 12, 6, 4, 5, 4, 4, 5, 4, 6, 9, 8, 5, 4, 11, 8, 7, 5, 7, 7, 5, 5, 10, 6, 
8, 4, 4, 4, 6, 7, 7, 4, 3, 7, 6, 10, 5, 6, 8, 5, 6, 10, 7, 9, 6, 9, 6, 8, 8, 7, 8, 7, 6, 4, 4, 9, 4, 8, 6, 10, 5, 8, 4, 11, 6, 7, 8, 7, 4, 7, 6, 9, 8, 3, 6, 9, 5, 6, 6, 11, 8, 8, 7, 7, 2, 12, 7, 4, 2, 2, 6, 10, 9, 4, 3, 6, 11, 6, 10, 7, 4, 7, 7, 8, 11, 3, 5, 4, 9, 4, 5, 8, 7, 5, 10, 6, 6, 9, 8, 8, 12, 8, 8, 9, 9, 6, 4, 7, 6, 10, 8, 6, 8, 6, 9, 6, 8, 5, 4, 2, 5, 11, 7, 7, 7, 3, 7, 5, 4, 4, 9, 5, 7, 10, 4, 8, 5, 4, 7, 10, 10, 5, 5, 11, 6, 9, 4, 6, 8, 2, 4, 4, 9, 5, 10, 2, 12, 11, 5, 10, 7, 7, 2, 11, 6, 7, 6, 11, 5, 5, 7, 7, 7, 4, 10, 5, 11, 7, 9, 4, 4, 5, 10, 7, 9, 7, 9, 4, 8, 8, 7, 7, 8, 6, 7, 9, 8, 4, 3, 7, 2, 10, 9, 7, 11, 6, 8, 11, 8, 4, 11, 9, 3, 10, 8, 7, 11, 5, 7, 4, 2, 9, 6, 3, 12, 4, 11, 9, 8, 6, 9, 8, 11, 10, 7, 7, 3, 4, 7, 6, 7, 5, 6, 6, 8, 4, 6, 9, 5, 3, 6, 3, 2, 2, 3, 10, 5, 9, 4, 4, 9, 5, 4, 7, 9, 7, 12, 3, 8, 4, 8, 8, 9, 5, 9, 8, 11, 6, 6, 8, 10, 6, 8, 8, 4, 8, 6, 7, 7, 8, 6, 8, 2, 11, 3, 7, 3, 6, 11, 7, 5, 9, 10, 10, 3, 6, 8, 4, 10, 9, 8, 8, 10, 10, 4, 10, 8, 3, 9, 10, 7, 9, 7, 2, 5, 9, 8, 9, 10, 9, 9, 11, 6, 6, 7, 10, 3, 7, 6, 8, 10, 5, 6, 3, 11, 6, 4, 7, 5, 4, 5, 8, 9, 8, 7, 8, 3, 6, 3, 7, 10, 10, 5, 9, 8, 9, 5, 10, 10, 12, 7, 4, 4, 7, 9, 9, 11, 6, 7, 7, 5, 7, 8, 5, 9, 11, 9, 9, 8, 9, 9, 3, 8, 8, 7, 10, 9, 3, 8, 7, 5, 5, 6, 6, 8, 8, 7, 9, 8, 2, 4, 3, 12, 9, 4, 5, 5, 5, 8, 8, 5, 2, 7, 6, 5, 9, 5, 9, 9, 3, 7, 9, 8, 12, 9, 4, 8, 11, 6, 4, 11, 7, 9, 4, 6, 6, 11, 9, 5, 9, 4, 7, 10, 9, 6, 6, 5, 9, 6, 6, 3, 4, 7, 8, 8, 4, 9, 8, 8, 7, 10, 8, 9, 6, 11, 6, 6, 8, 9, 10, 6, 4, 4, 5, 6, 7, 9, 12, 6, 7, 8, 7, 7, 10, 9, 4, 7, 10, 9, 5, 2, 8, 6, 7, 5, 7, 8, 7, 8, 8, 6, 7, 9, 7, 7, 3, 7, 6, 7, 8, 12, 10, 7, 9, 6, 5, 8, 7, 7, 10, 6, 4, 10, 3, 8, 9, 6, 6, 8, 3, 4, 9, 5, 3, 11, 10, 12, 5, 5, 10, 6, 6, 10, 10, 4, 6, 11, 9, 10]\n\n"}, {"id": "expectation", "title": "Expectation", "url": "part2/expectation", "text": "\n \nExpectation\n\nA random variable is fully prepresented by its probability mass function (PMF), which represents each of the values the random variable 
can take on, and the corresponding probabilities. A PMF can be a lot of information. Sometimes it is useful to summarize the random variable! The most common, and arguably the most useful, summary of a random variable is its \"Expectation\". \n\n\n\nDefinition: Expectation\nThe expectation of a random variable $X$, written $\\E[X]$, is the average of all the values the random variable can take on, each weighted by the probability that the random variable will take on that value.\n\n\t$$\n\t\\E[X] = \\sum_x x \\cdot \\p(X=x)\n\t$$\n\n\nExpectation goes by many other names: Mean, Weighted Average, Center of Mass, 1st Moment. All of these are calculated using the same formula.\nRecall that $\\p(X=x)$, also written as $\\p(x)$, is the probability mass function of the random variable $X$. Here is code that calculates the expectation of the sum of two dice, based on the probability mass function:\ndef expectation_sum_two_dice():\n    exp_sum_two_dice = 0\n    # sum of dice can take on the values 2 through 12\n    for x in range(2, 13):\n        pr_x = pmf_sum_two_dice(x) # pmf gives Pr(x)\n        exp_sum_two_dice += x * pr_x\n    return exp_sum_two_dice\nIf we worked it out manually we would get that if $X$ is the sum of two dice, $\\E[X] = 7$:\n\t$$\n\\E[X] = \\sum_x x \\cdot \\p(X=x) = 2 \\cdot \\frac{1}{36} + 3 \\cdot \\frac{2}{36} + \\dots + 12 \\cdot \\frac{1}{36} = 7\n\t$$\n\n7 is the \"average\" number you expect to get if you took the sum of two dice a very large number of times. 
In this case it also happens to be the same as the mode, the most likely value of the sum of two dice, but this is not always the case!\n\nProperties of expectation\n\n\nProperty: Linearity of Expectation\n\t\t$$\\E[aX + b] = a\\E[X]+b$$\n\t\t\n\t\n\n\n\nProperty: Expectation of the Sum of Random Variables\n\t\t$$\\E[X+Y] = \\E[X] + \\E[Y]$$\n\t\t\n\t\n\n\n\nProperty: Law of the Unconscious Statistician\n\t\t$$\\E[g(X)] = \\sum_x g(x)\\p(X=x)$$\n\t\tOne can also calculate the expected value of a function $g(X)$ of a random variable $X$ when one knows the probability distribution of $X$ but does not explicitly know the distribution of $g(X)$. This theorem has the humorous name of \"the Law of the Unconscious Statistician\" (LOTUS), because it is so useful that you should be able to employ it unconsciously.\n\n\n\n\nProperty: Expectation of a Constant\n\t\t$$\\E[a] = a$$\n\t\tSometimes in proofs, you will end up with the expectation of a constant (rather than a random variable). For example, what does $\\E[5]$ mean? Since 5 is not a random variable, it does not change and will always be 5, so $\\E[5] = 5$.\n\t\n\n\n\n"}, {"id": "variance", "title": "Variance", "url": "part2/variance", "text": "\n \nVariance\n\n\n\nDefinition: Variance of a Random Variable\nThe variance is a measure of the \"spread\" of a random variable around the mean. Variance for a random variable, $X$, with expected value $\\E[X] = \\mu$ is:\n\t\t\t$$\n\\var(X) = \\E[(X - \\mu)^2]\n$$\nSemantically, this is the average squared distance of a sample from the distribution to the mean. When computing the variance, we often use a different (equivalent) form of the variance equation:\n$$\n\\begin{align}\n\\var(X) &= \\E[X^2] - \\E[X]^2\n\\end{align}\n$$\n\t\n\n\nIn the last section we showed that Expectation was a useful summary of a random variable (it calculates the \"weighted average\" of the random variable). 
One of the next most important properties of random variables to understand is variance: the measure of spread.\n\nTo start, let's consider probability mass functions for three sets of graders. When each of them grades an assignment, meant to receive a 70/100, they each have a probability distribution of grades that they could give. \n\n\n\n\nDistributions of three types of peer graders. Data is from a massive online course.\n\n\n\n\tThe distribution for graders in group $C$ has a different expectation. The average grade that they give when grading an assignment worth 70 is a 55/100. That is clearly not great! But what is the difference between graders $A$ and $B$? Both of them have the same expected value (which is equal to the correct grade). The graders in group $A$ have a higher \"spread\". When grading an assignment worth 70, they have a reasonable chance of giving it a 100, or of giving it a 40. Graders in group $B$ have much less spread. Most of the probability mass is close to 70. You want graders like those in group $B$: in expectation they give the correct grade, and they have low spread. As an aside: scores in group $B$ came from a probabilistic algorithm over peer grades.\n\t\n\n\tTheorists wanted a number to describe spread. They invented variance to be the average of the distance between values that the random variable could take on and the mean of the random variable. There are many reasonable choices for the distance function; probability theorists chose squared deviation from the mean:\n\n\t$$\n\\var(X) = \\E[(X - \\mu)^2]\n$$\n\n\n\nProof: $\\var(X) = \\E[X^2] - \\E[X]^2$\nIt is much easier to compute variance using $\\E[X^2] - \\E[X]^2$. 
You certainly don't need to know why it's an equivalent expression, but in case you were wondering, here is the proof.\n\t\t\n\t\t\t$$\n\\begin{align}\n\\var(X) \n&= \\E[(X - \\mu)^2] && \\text{Note: } \\mu = \\E[X]\\\\\n&= \\sum_x(x-\\mu)^2 \\P(x) && \\text{Definition of Expectation}\\\\\n&= \\sum_x (x^2 -2\\mu x + \\mu^2) \\P(x) \n&& \\text{Expanding the square}\\\\\n&= \\sum_x x^2\\P(x)- 2\\mu \\sum_x x \\P(x) + \\mu^2 \\sum_x \\P(x)\n&& \\text{Propagating the sum}\\\\\n&= \\sum_x x^2\\P(x)- 2\\mu \\E[X] + \\mu^2 \\sum_x \\P(x) && \\text{Substitute def of expectation}\\\\\n&= \\E[X^2]- 2\\mu \\E[X] + \\mu^2 \\sum_x \\P(x)\n&& \\text{LOTUS } g(x) = x^2 \\\\\n&= \\E[X^2]- 2\\mu \\E[X] + \\mu^2\n&& \\text{Since }\\sum_x \\P(x) = 1\\\\\n&= \\E[X^2]- 2\\E[X]^2 + \\E[X]^2\n&& \\text{Since }\\mu = \\E[X]\\\\\n&= \\E[X^2]- \\E[X]^2\n&& \\text{Cancellation}\\\\\n\\end{align}\n$$\n\t\n\nStandard Deviation\nVariance is especially useful for comparing the \"spread\" of two distributions and it has the useful property that it is easy to calculate. In general a larger variance means that\nthere is more deviation around the mean, that is, more spread. However, if you look at the leading example, the units of variance\nare the square of points. This makes it hard to interpret the numerical value. What does it mean that the\nspread is 52 points$^2$? A more interpretable measure of spread is the square root of variance, which we call\nthe Standard Deviation $\\std(X) = \\sqrt{\\var(X)}$. The standard deviation of our grader is 7.2 points. In this example folks find it easier to think of spread in points rather than points$^2$. 
As an aside, the standard deviation is the root-mean-square distance of a sample (from the distribution) to the mean, that is, a typical distance measured with the Euclidean distance function.\n\n"}, {"id": "bernoulli", "title": "Bernoulli Distribution", "url": "part2/bernoulli", "text": "\nBernoulli Distribution\n\nParametric Random Variables\nThere are many classic and commonly-seen random variable abstractions that show up in the world of probability. At\n this point in the class, you will learn about several of the most significant parametric discrete distributions.\n When solving problems, if you can recognize that a random variable fits one of these formats, then you can\n use its pre-derived probability mass function (PMF), expectation, variance, and other properties. Random variables\n of this sort are called parametric random variables. If you can argue that a random variable falls under one\n of the studied parametric types, you simply need to provide parameters. A good analogy is a class in\n programming. Creating a parametric random variable is very similar to calling a constructor with input parameters.\n\nBernoulli Random Variables\n\n A Bernoulli random variable (also called a boolean or indicator random variable) is the simplest kind\n of parametric random variable. It can take on two values, 1 and 0. It takes on a 1 if an experiment with success probability\n $p$ resulted in success and a 0 otherwise. Some example uses include a coin flip, a random binary digit, whether a\n disk drive crashed, and whether someone likes a Netflix movie. Here $p$ is the parameter, but different instances of\n Bernoulli random variables might have different values of $p$.\n\n\n Here is a full description of the key properties of a Bernoulli random variable. 
If $X$ is declared to be a\n Bernoulli random variable with parameter $p$, denoted $X \\sim \\Ber(p)$:\n\n <%\n include('templates/rvCards/bernoulli.html')\n\nBecause Bernoulli distributed random variables are parametric, as soon as you declare a random variable to be of type\n Bernoulli you automatically know all of these pre-derived properties! Some of these properties are straightforward to\n prove for a Bernoulli. For example, you could have solved for expectation:\n\n\n\nProof: Expectation of a Bernoulli. If $X$ is a Bernoulli with parameter $p$, $X \\sim \\Ber(p)$:\n\n\n $$\n \\begin{align}\n \\E[X] &= \\sum_x x \\cdot \\p(X=x) && \\text{Definition of expectation} \\\\\n &= 1 \\cdot p + 0 \\cdot (1-p) &&\n X \\text{ can take on values 0 and 1} \\\\\n &= p && \\text{Remove the 0 term}\n \\end{align}\n $$\n\n\n\n\nProof: Variance of a Bernoulli. If $X$ is a Bernoulli with parameter $p$, $X \\sim \\Ber(p)$:\n To compute variance, first compute $\\E[X^2]$:\n $$\n \\begin{align}\n \\E[X^2]\n &= \\sum_x x^2 \\cdot \\p(X=x) &&\\text{LOTUS}\\\\\n &= 0^2 \\cdot (1-p) + 1^2 \\cdot p\\\\\n &= p\n \\end{align}\n $$\n $$\n \\begin{align}\n \\var(X)\n &= \\E[X^2] - \\E[X]^2&& \\text{Def of variance} \\\\\n &= p - p^2 && \\text{Substitute }\\E[X^2]=p, \\E[X] = p \\\\\n &= p (1-p) && \\text{Factor out }p\n \\end{align}$$\n\n\nIndicator\nBernoulli random variables and indicator variables are two aspects of the same concept. A random variable $I$ is an\n indicator variable for an event $A$ if $I = 1$ when $A$ occurs and $I = 0$ if $A$ does not occur. $\\P(I=1)=\\P(A)$\n and $\\E[I]=\\P(A)$. Indicator random variables are Bernoulli random variables, with $p=\\P(A)$.\n\n"}, {"id": "binomial", "title": "Binomial Distribution", "url": "part2/binomial", "text": "\nBinomial Distribution\n\nIn this section, we will discuss the binomial distribution. To start, imagine the following example. Consider $n$\n independent trials of an experiment where each trial is a \"success\" with probability $p$. 
Let $X$ be the\n number of successes in $n$ trials. This situation is truly common in the natural world, and as such, there has been\n a lot of research into such phenomena. Random variables like $X$ are called binomial random variables. If you\n can identify that a process fits this description, you can inherit many already proved properties such as the PMF\n formula, expectation, and variance!\n\n\n\n\n\n\nHere are a few examples of binomial random variables:\n\n# of heads in $n$ coin flips\n # of 1\u2019s in a randomly generated length $n$ bit string\n # of disk drives crashed in a 1000-computer cluster, assuming disks crash independently\n\n\n\n<%\ninclude('templates/rvCards/binomial.html')\n\n\nOne way to think of the binomial is as the sum of $n$ Bernoulli\n variables. Say that $Y_i \\sim \\Ber(p)$ is an indicator Bernoulli random variable which is 1 if experiment $i$ is a\n success. Then if $X$ is the total number of successes in $n$ experiments, $X \\sim \\Bin(n, p)$:\n $$\n X = \\sum_{i=1}^n Y_i\n $$\nRecall that the outcome of $Y_i$ will be 1 or 0, so one way to think of $X$ is as the sum of those 1s and 0s.\nBinomial PMF\nThe most important property to know about a binomial is its PMF:\n $$\n \\p(X = k) = {n \\choose k} p^k (1-p)^{n-k}\n $$\nRecall, we derived this formula in Part 1. There is a complete example on the probability of $k$ heads in $n$ coin\n flips, where each flip is heads with probability $0.5$: Many Coin Flips. To briefly review, if you think of each\n experiment as\n being distinct, then there are ${n \\choose k}$ ways of choosing which $k$ of the $n$ experiments are successes. 
For any of the\n mutually exclusive permutations, the probability of that permutation is $p^k \\cdot (1-p)^{n-k}$.\n\nThe name binomial comes from the term ${n \\choose k}$ which is formally called the binomial coefficient.\nExpectation of Binomial\nThere is an easy way to calculate the expectation of a binomial and a hard way.\n\n The easy way is to leverage the fact that a binomial is the sum of Bernoulli random variables. $X = \\sum_{i=1}^{n}\n Y_i$ where $Y_i \\sim \\Ber(p)$. Since the expectation of the sum of\n random variables is the sum of expectations, we can add the expectation, $\\E[Y_i] = p$, of each of the Bernoullis:\n $$\n \\begin{align}\n \\E[X] &= \\E\\Big[\\sum_{i=1}^{n} Y_i\\Big] && \\text{Since }X = \\sum_{i=1}^{n} Y_i \\\\\n &= \\sum_{i=1}^{n}\\E[ Y_i] && \\text{Expectation of sum} \\\\\n &= \\sum_{i=1}^{n}p && \\text{Expectation of Bernoulli} \\\\\n &= n \\cdot p && \\text{Sum $n$ times}\n \\end{align}\n $$\n\n The hard way is to use the definition of expectation:\n $$\n \\begin{align}\n \\E[X] &= \\sum_{i=0}^n i \\cdot \\p(X = i) && \\text{Def of expectation} \\\\\n &= \\sum_{i=0}^n i \\cdot {n \\choose i} p^i(1-p)^{n-i} && \\text{Sub in PMF} \\\\\n & \\cdots && \\text{Many steps later} \\\\\n &= n \\cdot p\n \\end{align}\n $$\n\nBinomial Distribution in Python\nAs you might expect, you can use binomial distributions in code. The standard library for binomials is scipy.stats.binom. \nOne of the most helpful methods that this package provides is a way to calculate the PMF. For example, say $X \\sim \\text{Bin}(n=5,p=0.6)$ and you want to find $\\P(X=2)$ you could use the following code:\nfrom scipy import stats\n\n# define variables for x, n, and p\nn = 5\np = 0.6\nx = 2\n\n# use scipy to compute the pmf\np_x = stats.binom.pmf(x, n, p)\n\n# use the probability for future work\nprint(f'P(X = {x}) = {p_x}')\n\nConsole:P(X = 2) = 0.2304\nAnother particularly helpful function is the ability to generate a random sample from a binomial.
For example, say $X \\sim \\Bin(n=10, p = 0.3)$ represents the number of requests to a website. We can draw 100 samples from this distribution using the following code:\nfrom scipy import stats\n\n# define variables for n and p\nn = 10\np = 0.3\n\n# use scipy to draw 100 random samples\nsamples = stats.binom.rvs(n, p, size=100)\n\n# print the samples\nprint(samples)\n\nConsole:[4 5 3 1 4 5 3 1 4 6 5 6 1 2 1 1 2 3 2 5 2 2 2 4 4 2 2 3 6 3 1 1 4 2 6 2 4\n 2 3 3 4 2 4 2 4 5 0 1 4 3 4 3 3 1 3 1 1 2 2 2 2 3 5 3 3 3 2 1 3 2 1 2 3 3\n 4 5 1 3 7 1 4 1 3 3 4 4 1 2 4 4 0 2 4 3 2 3 3 1 1 4]\nYou might be wondering what a random sample is! A random sample is a randomly chosen assignment to our random variable. Above we have 100 such assignments. The probability that value $x$ is chosen is given by the PMF: $\\p(X=x)$. You will notice that even though 8 is a possible assignment to the binomial above, in 100 samples we never saw the value 8. Why? Because $P(X=8) \\approx 0.0014$. You would need to draw roughly 700 samples before you would expect to see an 8.\n\n There are also functions for getting the mean, the variance, and more. You can read the scipy.stats.binom documentation, especially the list of methods.\n\n"}, {"id": "poisson", "title": "Poisson Distribution", "url": "part2/poisson", "text": "\n \nPoisson Distribution\n\nA Poisson random variable gives the probability of a given number of events in a fixed interval of time (or space). It makes the Poisson assumption that events occur with a known constant mean rate and independently of the time since the last event.\n\t\n\n<%\n include('templates/rvCards/poisson.html')\n\nPoisson Intuition\nIn this section we show the intuition behind the Poisson derivation. It is both a great way to deeply understand the Poisson, as well as good practice with Binomial distributions.\nLet's work on the problem of predicting the chance of a given number of events occurring in a fixed time interval \u2014 the next minute.
For example, imagine you are working on a ride sharing application and you care about the probability of how many requests you get from a particular area. From historical data, you know that the average number of requests per minute is $\\lambda = 5$. What is the probability of getting 1, 2, 3, etc requests in a minute?\nWe could approximate a solution to this problem by using a binomial distribution! Let's say we split our minute into 60 seconds, and make each second an indicator Bernoulli variable \u2014 you either get a request or you don't. If you get a request in a second, the indicator is 1. Otherwise it is 0. Here is a visualization of our 60 binary-indicators. In this example imagine we have requests at 2.75 and 7.12 seconds. The corresponding indicator variables are blue filled in boxes:\n1 minute\n\n\n\n\tThe total number of requests received over the minute can be approximated as the sum of the sixty indicator variables, which conveniently matches the description of a binomial \u2014 a sum of Bernoullis. Specifically define $X$ to be the number of requests in a minute. $X$ is a binomial with $n=60$ trials. What is the probability, $p$, of a success on a single trial? To make the expectation of $X$ equal the observed historical average $\\lambda =5$ we should choose $p$ so that $\\lambda = \\E[X]$. \n\t$$\n\t\\begin{align}\n\t\\lambda &= \\E[X] && \\text{Expectation matches historical average} \\\\\n\t\\lambda &= n \\cdot p && \\text{Expectation of a Binomial is } n \\cdot p \\\\\n\tp &= \\frac{\\lambda}{n} && \\text{Solving for $p$}\n\t\\end{align}\n\t$$\n\tIn this case since $\\lambda=5$ and $n=60$, we should choose $p=5/60$ and state that $X \\sim \\Bin(n=60, p=5/60)$.
Now that we have a form for $X$ we can answer probability questions about the number of requests by using the Binomial PMF:\n\n\t$$\\p(X = x) = {n \\choose x} p^x (1-p)^{n-x}$$\n\tSo for example:\n\t$$\\p(X=1) = {60 \\choose 1} (5/60)^1 (55/60)^{60-1} \\approx 0.0295$$\n\t$$\\p(X=2) = {60 \\choose 2} (5/60)^2 (55/60)^{60-2} \\approx 0.0790$$\n\t$$\\p(X=3) = {60 \\choose 3} (5/60)^3 (55/60)^{60-3} \\approx 0.1389$$\n\t\n\n\n\tGreat! But don't forget that this was an approximation. We didn't account for the fact that there can be more than one event in a single second. One way to assuage this issue is to divide our minute into more fine-grained intervals (the choice to split it into 60 seconds was rather arbitrary). Instead let's divide our minute into 600 deciseconds, again with requests at 2.75 and 7.12 seconds:\n\t1 minute\n\n\n\nNow $n=600$, $p=5/600$ and $X \\sim \\Bin(n=600, p=5/600)$. We can repeat our example calculations using this better approximation:\n$$\\p(X=1) = {600 \\choose 1} (5/600)^1 (595/600)^{600-1} \\approx 0.0333$$\n\t$$\\p(X=2) = {600 \\choose 2} (5/600)^2 (595/600)^{600-2} \\approx 0.0837$$\n\t$$\\p(X=3) = {600 \\choose 3} (5/600)^3 (595/600)^{600-3} \\approx 0.1402$$\n\n\nChoose any value of $n$, the number of buckets to divide our minute into: \n\n\n\n\n\n\nThe larger $n$ is, the more accurate the approximation. So what happens when $n$ is infinity? It becomes a Poisson!\nPoisson, a Binomial in the limit\nOr if we really cared about making sure that we don't get two events in the same bucket, we can divide our minute into infinitely small buckets:\n\t1 minute\n\n\n\n\nProof: Derivation of the Poisson\n\n\tWhat does the PMF of $X$ look like now that we have infinite divisions of our minute? We can write the equation and think about it as $n$ goes to infinity.
Recall that $p$ still equals $\\lambda/n$:\n\t\n\n\t$$\n\t\\P(X=x) = \\lim_{n \\rightarrow \\infty} {n \\choose x} (\\lambda / n)^x(1-\\lambda/n)^{n-x}\n\t$$\n\n\tWhile it may look intimidating, this expression simplifies nicely. This proof uses a few special limit rules that we haven't introduced in this book:\n\n\n$$\n\\begin{align}\n\t\\P(X=x) \n\t&= \\lim_{n \\rightarrow \\infty} {n \\choose x} (\\lambda / n)^x(1-\\lambda/n)^{n-x}\n\t\t&& \\text{Start: binomial in the limit}\\\\\n\t&= \\lim_{n \\rightarrow \\infty}\n\t\t{n \\choose x} \\cdot\n\t\t\\frac{\\lambda^x}{n^x} \\cdot\n\t\t\\frac{(1-\\lambda/n)^{n}}{(1-\\lambda/n)^{x}} \n\t\t&& \\text{Expanding the power terms} \\\\\n\t&= \\lim_{n \\rightarrow \\infty}\n\t\t\\frac{n!}{(n-x)!x!} \\cdot\n\t\t\\frac{\\lambda^x}{n^x} \\cdot\n\t\t\\frac{(1-\\lambda/n)^{n}}{(1-\\lambda/n)^{x}} \n\t\t&& \\text{Expanding the binomial term} \\\\\n\t&= \\lim_{n \\rightarrow \\infty}\n\t\t\\frac{n!}{(n-x)!x!} \\cdot\n\t\t\\frac{\\lambda^x}{n^x} \\cdot\n\t\t\\frac{e^{-\\lambda}}{(1-\\lambda/n)^{x}} \n\t\t&& \\href{https://www.sosmath.com/calculus/sequence/specialim/specialim.html}{\\text{Rule }} \\lim_{n \\rightarrow \\infty}(1-\\lambda/n)^{n} = e^{-\\lambda}\\\\\n\t&= \\lim_{n \\rightarrow \\infty}\n\t\t\\frac{n!}{(n-x)!x!} \\cdot\n\t\t\\frac{\\lambda^x}{n^x} \\cdot\n\t\t\\frac{e^{-\\lambda}}{1} \n\t\t&& \\href{https://www.youtube.com/watch?v=x1WBTBtfvjM}{\\text{Rule }} \\lim_{n \\rightarrow \\infty}\\lambda/n= 0\\\\\n\t&= \\lim_{n \\rightarrow \\infty}\n\t\t\\frac{n!}{(n-x)!} \\cdot\n\t\t\\frac{1}{x!} \\cdot\n\t\t\\frac{\\lambda^x}{n^x} \\cdot\n\t\t\\frac{e^{-\\lambda}}{1} \n\t\t&& \\text{Splitting first term}\\\\\n\t&= \\lim_{n \\rightarrow \\infty}\n\t\t\\frac{n^x}{1} \\cdot\n\t\t\\frac{1}{x!} \\cdot\n\t\t\\frac{\\lambda^x}{n^x} \\cdot\n\t\t\\frac{e^{-\\lambda}}{1} \n\t\t&& \\lim_{n \\rightarrow \\infty }\\frac{n!}{(n-x)!} = n^x\\\\\n\t&= \\lim_{n \\rightarrow \\infty}\n\t\t\\frac{\\lambda^x}{x!} 
\\cdot\n\t\t\\frac{e^{-\\lambda}}{1} \n\t\t&& \\text{Cancel }n^x\\\\\n\t&= \n\t\t\\frac{\\lambda^x \\cdot e^{-\\lambda}}{x!} \n\t\t&& \\text{Simplify}\\\\\n\\end{align}\n\t$$\n\n\nThat is a beautiful expression! Now we can calculate the real probability of number of requests in a minute, if the historical average is $\\lambda=5$:\n\n$$\\p(X=1) = \\frac{5^1 \\cdot e^{-5}}{1!} = 0.03369$$\n\t$$\\p(X=2) = \\frac{5^2 \\cdot e^{-5}}{2!}= 0.08422$$\n\t$$\\p(X=3) = \\frac{5^3 \\cdot e^{-5}}{3!} = 0.14037$$\n\nThis is both more accurate and much easier to compute!\nChanging time frames\nSay you are given a rate over one unit of time, but you want to know the rate in another unit of time. For example, you may be given the rate of hits to a website per minute, but you want to know the probability over a 20 minute period. You would just need to multiply this rate by 20 in order to go from the \"per 1 minute of time\" rate to obtain the \"per 20 minutes of time\" rate.\n\n\n\n\n\n"}, {"id": "continuous", "title": "Continuous Distribution", "url": "part2/continuous", "text": "\n \nContinuous Distribution\n\nSo far, all random variables we have seen have been discrete. In all the cases we have seen in CS109 this meant that our RVs could only take on integer values. Now it's time for continuous random variables which can take on values in the real number domain ($\\R$). Continuous random variables can be used to represent measurements with arbitrary precision (eg height, weight, time).\nFrom Discrete to Continuous\nTo make our transition from thinking about discrete random variable, to thinking about continuous random variables, lets start with a thought experiment: Imagine you are running to catch the bus. 
You know that you will arrive at 2:15pm but you don't know exactly when the bus will arrive, and want to think of the arrival time in minutes past 2pm as a random variable $T$ so that you can calculate the probability that you will have to wait less than five minutes, $P(15 < T < 20)$.\nWe immediately face a problem. For discrete distributions we would describe the probability that a random variable takes on exact values. This doesn't make sense for continuous values, like the time the bus arrives. As an example, what is the probability that the bus arrives at exactly 2:17pm and 12.12333911102389234 seconds? Similarly, if I were to ask you: what is the probability of a child being born with weight exactly equal to 3.523112342234 kilos, you might recognize that question as ridiculous. No child will have precisely that weight. Real values can have infinite precision and as such it is a bit mind boggling to think about the probability that a random variable takes on a specific value.\nInstead, let's start by discretizing time, our continuous variable, by breaking it into 5 minute chunks. We can now think about something like, the probability that the bus arrives between 2:00pm and 2:05pm as an event with some probability (see figure below on the left). Five minute chunks seem a bit coarse. You could imagine that instead, we could have discretized time into 2.5 minute chunks (figure in the center). In this case the probability that the bus shows up between 15 mins and 20 mins after 2pm is the sum of two chunks, shown in orange. Why stop there? In the limit we could keep breaking time down into smaller and smaller pieces.
Eventually we will be left with a derivative of probability at each moment of time, where the probability that $P(15 < T < 20)$ is the integral of that derivative between 15 and 20 (figure on the right).\n\nProbability Density Functions\nIn the world of discrete random variables, the most important property of a random variable was its probability mass function (PMF) that would tell you the probability of the random variable taking on any value. When we move to the world of continuous random variables, we are going to need to rethink this basic concept. In the continuous world, every random variable instead has a Probability Density Function (PDF) which defines the relative likelihood that a random variable takes on a particular value. We traditionally use the symbol $f$ for the probability density function and write it in one of two ways:\n$$\nf(X=x) \\quad \\or \\quad f(x)\n$$\nWhere the notation on the right hand side is shorthand where the lowercase $x$ implies that we are talking about the relative likelihood of a continuous random variable which is the upper case $X$.\nLike in the bus example, the PDF is the derivative of probability at all points of the random variable. This means that the PDF has the important property that you can integrate over it to find the probability that the random variable takes on values within a range $(a, b)$.\n\n\nDefinition: Continuous Random Variable\n$X$ is a Continuous Random Variable if there is a Probability Density Function (PDF) $f(x)$ that takes in real valued numbers $x$ such that:\n$$\n\\begin{align*}\n &\\P(a \\leq X \\leq b) = \\int_a^b f(x) \\d x \n\\end{align*}\n$$\nThe following properties must also hold. These preserve the axiom that $P(a \\leq X \\leq b)$ is a probability:\n$$\n\\begin{align*}\n &0 \\leq \\P(a \\leq X \\leq b) \\leq 1 \\\\\n &\\P(-\\infty < X < \\infty) = 1\n\\end{align*}\n$$\n\t\n\n\n\t\tA common misconception is to think of $f(x)$ as a probability. 
It is instead what we call a probability density. It represents probability/unit of $X$. Generally this is only meaningful when we either take an integral over the PDF or we compare probability densities. As we mentioned when motivating probability densities, the probability that a continuous random variable takes on a specific value (to infinite precision) is 0.\n\t\n\t$$\n\t \\P(X = a) = \\int_a^a f(x) \\d x = 0\n\t$$\n\n\tThat is pretty different from the discrete world where we often talked about the probability of a random variable taking on a particular value.\nCumulative Distribution Function\nHaving a probability density is great, but it means we are going to have to solve an integral every single time we want to calculate a probability. To avoid this unfortunate fate, we are going to use a standard called a cumulative distribution function (CDF). The CDF is a function which takes in a number and returns the probability that a random variable takes on a value less than that number. It has the pleasant property that, if we have a CDF for a random variable, we don't need to integrate to answer probability questions!\n\n\nFor a continuous random variable $X$ the Cumulative Distribution Function, written $F(x)$ is:\n$$\n\\begin{align*}\n &F(x) = P(X \\leq x) = \\int_{-\\infty}^{x} f(y)\\d y\n\\end{align*}\n$$\n\n\n\nWhy is the CDF the probability that a random variable takes on a value less than the input value as opposed to greater than? It is a matter of convention. But it is a useful convention. Most probability questions can be solved simply by knowing the CDF (and taking advantage of the fact that the integral over the range $-\\infty$ to $\\infty$ is 1).
Here are a few examples of how you can answer probability questions by just using a CDF:\n\\begin{align*}\n &\\text{Probability Query} && \\text{Solution} && \\text{Explanation} \\\\\n &\\P(X < a) && F(a) && \\text{That is the definition of the CDF}\\\\\n &\\P(X \\leq a) && F(a) && \\text{Trick question. }\\P(X = a) = 0\\\\\n &\\P(X > a) && 1 - F(a) && \\P(X < a) + \\P(X > a) = 1 \\\\\n &\\P(a < X < b) && F(b) - F(a) && F(a) + \\P(a < X < b) = F(b)\\\\\n\\end{align*}\n\n\nThe cumulative distribution function also exists for discrete random variables, but there is less utility to a CDF in the discrete world as none of our discrete random variables had \"closed form\" (eg without any summation) functions for the CDF:\n\\begin{align*}\n F_X(a) = \\sum_{i \\leq a} P(X = i) \n\\end{align*}\n\nSolving for Constants\n\n\nLet $X$ be a continuous random variable with PDF:\n\\begin{align*}\n f(x) = \\begin{cases} C(4x - 2x^2) &\\text{when } 0 < x < 2 \\\\ \n0 & \\text{otherwise} \\end{cases} \n\\end{align*}\nIn this function, $C$ is a constant. What value is $C$?
Since we know that the PDF must integrate to 1:\n\\begin{align*}\n &\\int_0^2 C(4x - 2x^2) \\d x = 1 \\\\\n &C\\left(2x^2 - \\frac{2x^3}{3}\\right)\\bigg|_0^2 = 1 \\\\\n &C\\left(\\left(8 - \\frac{16}{3}\\right) - 0 \\right) = 1 \\\\\n &C = 3/8\n\\end{align*}\n\n\nNow that we know $C$, what is $\\P(X > 1)$?\n\\begin{align*}\n\\P(X > 1) \n &=\\int_1^{\\infty}f(x) \\d x \\\\\n &= \\int_1^2 \\frac{3}{8}(4x - 2x^2) \\d x \\\\\n &= \\frac{3}{8}\\left(2x^2 - \\frac{2x^3}{3}\\right)\\bigg|_1^2 \\\\\n &= \\frac{3}{8}\\left[\\left(8 - \\frac{16}{3}\\right) - \\left(2 - \\frac{2}{3}\\right)\\right] = \\frac{1}{2}\n\\end{align*}\n\n\nExpectation and Variance of Continuous Variables\nFor continuous RV $X$:\n\\begin{align*}\n &E[X] = \\int_{-\\infty}^{\\infty} x f(x) dx \\\\\n &E[g(X)] = \\int_{-\\infty}^{\\infty} g(x) f(x) dx \\\\\n &E[X^n] = \\int_{-\\infty}^{\\infty} x^n f(x) dx \n\\end{align*}\nFor both continuous and discrete RVs:\n\\begin{align*}\n &E[aX + b] = aE[X] + b\\\\ \n &\\text{Var}(X) = E[(X - \\mu)^2] = E[X^2] - (E[X])^2 \\\\\n &\\text{Var}(aX + b) = a^2 \\text{Var}(X)\n\\end{align*}\n\n\n"}, {"id": "uniform", "title": "Uniform Distribution", "url": "part2/uniform", "text": "\n \nUniform Distribution\n\nThe most basic of all the continuous random variables is the uniform random variable, which is equally likely to take on any value in its range ($\\alpha, \\beta$). \n\n$X$ is a uniform random variable ($X \\sim \\Uni(\\alpha, \\beta)$) if it has PDF:\n\\begin{align*}\n f(x) = \\begin{cases} \\frac{1}{\\beta - \\alpha} &\\text{when } \\alpha \\leq x \\leq \\beta \\\\ \n0 & \\text{otherwise} \\end{cases} \n\\end{align*}\n\nNotice how the density $1/(\\beta - \\alpha)$ is exactly the same regardless of the value for $x$. That makes the density uniform. So why is the PDF $1/(\\beta - \\alpha)$ and not 1?
That is the constant that makes it such that the integral over all possible inputs evaluates to 1.\n\n<%\n include('templates/rvCards/uniform.html')\n\n\nExample: You are running to the bus stop. You don\u2019t know exactly when the bus arrives. You believe all times between 2:00pm and 2:30pm are equally likely. You show up at 2:15pm. What is P(wait < 5 minutes)?\n\n\nLet $T$ be the time, in minutes after 2pm, that the bus arrives. Because we think that all times are equally likely in this range, $T \\sim \\Uni(\\alpha = 0, \\beta = 30)$. The probability that you wait under 5 minutes is equal to the probability that the bus shows up between 2:15 and 2:20. In other words $\\p(15 < T < 20)$:\n\\begin{align*}\n \\p(\\text{Wait under 5 mins}) &= \\p(15 < T < 20) \\\\\n &= \\int_{15}^{20} f_T(x) \\d x \\\\\n &= \\int_{15}^{20} \\frac{1}{\\beta - \\alpha} \\d x \\\\ \\\\\n &= \\int_{15}^{20} \\frac{1}{30} \\d x \\\\ \\\\\n &= \\frac{x}{30}\\bigg\\rvert_{15}^{20} \\\\\n &= \\frac{20}{30} - \\frac{15}{30} = \\frac{5}{30}\n\\end{align*}\n\n\n\n\n\nWe can come up with a closed form for the probability that a uniform random variable $X$ is in the range $a$ to $b$, assuming that $\\alpha \\leq a \\leq b \\leq \\beta$:\n\\begin{align*}\n \\P(a \\leq X \\leq b) &= \\int_a^b f(x) \\d x \\\\\n &= \\int_a^b \\frac{1}{\\beta - \\alpha} \\d x \\\\\n &= \\frac{ b - a }{ \\beta - \\alpha } \n\\end{align*}\n\n\n"}, {"id": "exponential", "title": "Exponential Distribution", "url": "part2/exponential", "text": "\n \nExponential Distribution\n\nAn exponential distribution measures the amount of time until the next event occurs. It assumes that the events occur via a Poisson process.
Note that this is different from the Poisson Random Variable which measures number of events in a fixed amount of time.\n\n<%\n include('templates/rvCards/exponential.html')\n\nAn exponential distribution is a great example of a continuous distribution where the cumulative distribution function (CDF) is much easier to work with as it allows you to answer probability questions without using integrals.\n\n\nExample: Based on historical data from the USGS, earthquakes of magnitude 8.0+ happen in a certain location at a rate of 0.002 per year. Earthquakes are known to occur via a Poisson process. What is the probability of a major earthquake in the next 4 years?\n\n\nLet $Y$ be the years until the next major earthquake. Because $Y$ measures time until the next event it fits the description of an exponential random variable: $Y \\sim \\Exp(\\lambda = 0.002)$. The question is asking, what is $\\p(Y < 4)$?\n\\begin{align*}\n \\p(Y < 4) \n &= F_Y(4) && \\text{The CDF measures $\\p(Y < y)$} \\\\\n &= 1 - e^{- \\lambda \\cdot y} && \\text{The CDF of an Exp} \\\\\n &= 1 - e^{- 0.002 \\cdot 4} && \\text{Substitute $\\lambda = 0.002$, $y = 4$} \\\\\n &\\approx 0.008\n\\end{align*}\nNote that it is possible to answer this question using the PDF, but it will require solving an integral.\n\n\n\nExponential is Memoryless\nOne way to gain intuition for what is meant by the \"Poisson process\" is through the proof that the exponential distribution is \"memoryless\". That means that the occurrence (or lack of occurrence) of events in the past does not change our belief as to how long until the next occurrence. This can be stated formally.
If $X \\sim \\Exp(\\lambda)$ then for an elapsed interval of time $s$ and a subsequent query interval of time $t$:\n\n$$\\p(X > s + t | X > s) = \\p(X > t) $$\n\nWhich is something we can prove:\n\\begin{align*}\n\\p(X > s + t | X > s)\n&= \\frac{\\p(X > s + t \\and X > s)}{\\p(X > s)}\n&& \\text{Def of conditional prob.} \\\\\n\n&= \\frac{\\p(X > s + t )}{\\p(X > s)}\n&& \\text{Because $X>s+t$ implies $X>s$} \\\\\n\n&= \\frac{1 - F_X(s + t )}{1 - F_X(s)}\n&& \\text{Def of CDF} \\\\\n\n&= \\frac{e^{-\\lambda (s + t)}}{e^{-\\lambda s}}\n&& \\text{By CDF of Exp} \\\\\n\n&= e^{-\\lambda t}\n&& \\text{Simplify} \\\\\n\n&= 1 - F_X(t)\n&& \\text{By CDF of Exp} \\\\\n\n&= \\p(X > t)\n&& \\text{Def of CDF} \\\\\n\n\\end{align*}\n"}, {"id": "normal", "title": "Normal Distribution", "url": "part2/normal", "text": "\n \nNormal Distribution\n\nThe single most important random variable type is the Normal (aka Gaussian) random variable, parametrized by a mean ($\\mu$) and variance ($\\sigma^2$), or sometimes equivalently written as mean and standard deviation ($\\sigma$). If $X$ is a normal variable we write $X \\sim N(\\mu, \\sigma^2)$. The normal is important for many reasons: it is generated from the summation of independent random variables and as a result it occurs often in nature. Many things in the world are not distributed normally but data scientists and computer scientists model them as Normal distributions anyway. Why? Because it is the most entropic (conservative) modelling decision that we can make for a random variable while still matching a particular expectation (average value) and variance (spread). \n\nThe Probability Density Function (PDF) for a Normal $X \\sim N(\\mu, \\sigma^2)$ is:\n\\begin{align*}\n f_X(x) = \\frac{1}{\\sigma \\sqrt{2\\pi} } e ^{\\frac{-(x-\\mu)^2}{2\\sigma^2}}\n\\end{align*}\nNotice the $x$ in the exponent of the PDF.
When $x$ is equal to the mean ($\\mu$) then $e$ is raised to the power of $0$ and the PDF is maximized.\n\nBy definition a Normal has $\\E[X] = \\mu$ and $\\var(X) = \\sigma^2$. \n\nThere is no closed form for the integral of the Normal PDF, and as such there is no closed form CDF. However we can use a transformation of any normal to a normal with a precomputed CDF. The result of this mathematical gymnastics is that the CDF for a Normal $X \\sim N(\\mu, \\sigma^2)$ is:\n\\begin{align*}\n F_X(x) = \\Phi\\left(\\frac{x - \\mu}{\\sigma}\\right)\n\\end{align*}\n\nWhere $\\Phi$ is a precomputed function that represents the CDF of the Standard Normal.\n\n\n<%\n include('templates/rvCards/normal.html')\n\n\nLinear Transform\nIf $X$ is a Normal such that $X \\sim N(\\mu, \\sigma^2)$ and $Y$ is a linear transform of $X$ such that $Y = aX + b$ then $Y$ is also a Normal where:\n\\begin{align*}\n Y \\sim N(a\\mu + b, a^2\\sigma^2)\n\\end{align*}\n\nProjection to Standard Normal\nFor any Normal $X$ we can find a linear transform from $X$ to the standard normal $Z \\sim N(0, 1)$. Note that $Z$ is the typical notation choice for the standard normal. For any normal, if you subtract the mean ($\\mu$) of the normal and divide by the standard deviation ($\\sigma$) the result is always the standard normal. We can prove this mathematically.
Let $W = \\frac{X - \\mu}{\\sigma}$:\n\\begin{align*}\nW &= \\frac{X -\\mu}{\\sigma} && \\text{ Transform X: subtract $\\mu$ and divide by $\\sigma$} \\\\\n & = \\frac{1}{\\sigma}X - \\frac{\\mu}{\\sigma} && \\text{ Use algebra to rewrite the equation}\\\\\n & = aX + b && \\text{ Linear transform where $a = \\frac{1}{\\sigma}$, $b = - \\frac{\\mu}{\\sigma}$ }\\\\\n &\\sim N(a\\mu + b, a^2\\sigma^2) && \\text{ The linear transform of a Normal is another Normal}\\\\\n &\\sim N(\\frac{\\mu}{\\sigma} - \\frac{\\mu}{\\sigma}, \\frac{\\sigma^2}{\\sigma^2}) && \\text{ Substituting values in for $a$ and $b$}\\\\\n &\\sim N(0, 1) && \\text{ The standard normal}\n\\end{align*}\n\nUsing this transform we can express $F_X(x)$, the CDF of $X$, in terms of the known CDF of $Z$, $F_Z(x)$. Since the CDF of $Z$ is so common it gets its own Greek symbol: $\\Phi(x)$\n\\begin{align*}\nF_X(x) &= P(X \\leq x) \\\\\n&= P\\left(\\frac{X - \\mu}{\\sigma} \\leq \\frac{x - \\mu}{\\sigma}\\right) \\\\\n&= P\\left(Z \\leq \\frac{x - \\mu}{\\sigma}\\right)\\\\\n&= \\Phi\\left(\\frac{x - \\mu}{\\sigma}\\right)\n\\end{align*}\nThe values of $\\Phi(x)$ can be looked up in a table. Every modern programming language also has the ability to calculate the CDF of a normal random variable!\n\n\nExample: Let $X\\sim \\mathcal{N}(3, 16)$, what is $P(X > 0)$?\n\\begin{align*}\nP(X > 0) &= P\\left(\\frac{X-3}{4} > \\frac{0-3}{4}\\right) = P\\left(Z > -\\frac{3}{4}\\right) = 1 - P\\left(Z \\leq -\\frac{3}{4}\\right)\\\\\n&= 1 - \\Phi(-\\frac{3}{4}) = 1 - (1- \\Phi(\\frac{3}{4})) = \\Phi(\\frac{3}{4}) = 0.7734\n\\end{align*}\nWhat is $P(2 < X < 5)$?\n\\begin{align*}\nP(2 < X < 5) &= P\\left(\\frac{2 - 3}{4} < \\frac{X-3}{4} < \\frac{5-3}{4}\\right) = P\\left(-\\frac{1}{4} < Z < \\frac{2}{4}\\right)\\\\\n&= \\Phi(\\frac{2}{4})-\\Phi(-\\frac{1}{4}) = \\Phi(\\frac{1}{2})-(1 - \\Phi(\\frac{1}{4})) = 0.2902\n\\end{align*}\n\n\nExample: You send voltage of 2 or -2 on a wire to denote 1 or 0.
Let $X$ = voltage sent and let $R$ = voltage received. $R = X + Y$, where $Y \\sim \\mathcal{N}(0, 1)$ is noise. When decoding, if $R \\geq 0.5$ we interpret the voltage as 1, else 0. What is $P(\\text{error after decoding}|\\text{original bit} = 1)$?\n\n\\begin{align*}\nP(X + Y < 0.5) &= P(2 + Y < 0.5) \\\\\n\t&= P(Y < -1.5) \\\\\n\t&= \\Phi(-1.5) \\\\\n\t&\\approx 0.0668\n\\end{align*}\n\n\n\nExample: The 68% rule of a normal within one standard deviation. What is the probability that a normal variable $X \\sim \\N(\\mu, \\sigma^2)$ has a value within one standard deviation of its mean?\n\t\t$$\n\t\t\\begin{align}\n\t\t\\p(\\text{Within one }\\sigma \\text{ of } \\mu) \n\t\t\t&= \\p(\\mu - \\sigma < X < \\mu + \\sigma)\n\t\t\t\t \\\\\n\t\t\t&= \\p(X < \\mu + \\sigma) - \\p(X < \\mu - \\sigma)\n\t\t\t\t&& \\text{Prob of a range} \\\\\n\t\t\t&= \\Phi\\Big(\\frac{(\\mu + \\sigma)-\\mu}{\\sigma}\\Big) -\n\t\t\t\t\\Phi\\Big(\\frac{(\\mu - \\sigma)-\\mu}{\\sigma}\\Big)\n\t\t\t\t&& \\text{CDF of Normal} \\\\\n\t\t\t&= \\Phi\\Big(\\frac{\\sigma}{\\sigma}\\Big) -\n\t\t\t\t\\Phi\\Big(\\frac{- \\sigma}{\\sigma}\\Big)\n\t\t\t\t&& \\text{Cancel $\\mu$s} \\\\\n\t\t\t&= \\Phi(1) -\n\t\t\t\t\\Phi(-1)\n\t\t\t\t&& \\text{Cancel $\\sigma$s} \\\\\n\t\t\t&\\approx 0.8413 - 0.1587 \\approx 0.683\n\t\t\t\t&& \\text{Plug into $\\Phi$}\n\n\t\t\\end{align}\n\t\t\n\t\t$$\nWe made no assumption about the value of $\\mu$ or the value of $\\sigma$ so this will apply to every single normal random variable. Since it uses the Normal CDF this doesn't apply to other types of random variables.\n\t\n\n"}, {"id": "binomial_approx", "title": "Binomial Approximation", "url": "part2/binomial_approx", "text": "\n \nBinomial Approximation\n\nThere are times when it is exceptionally hard to numerically calculate probabilities for a binomial distribution, especially when $n$ is large. For example, say $X \\sim \\text{Bin}(n = 10000, p = 0.5)$ and you want to calculate $\\p(X > 5500)$.
The exact formula is:\n\\begin{align}\n\t\\p(X > 5500) &= \\sum_{i = 5501}^{10000} \\p(X=i) \\\\\n\t&= \\sum_{i = 5501}^{10000}{10000 \\choose i}p^i(1-p)^{10000-i}\n\\end{align}\n\n\n\tThat is a difficult value to calculate. Luckily there is an easier way. For deep reasons which we will cover in our section on \"uncertainty theory\" it turns out that a binomial distribution can be very well approximated by both Normal distributions and Poisson distributions if $n$ is large enough. \n\nUse the Poisson approximation when $n$ is large (>20) and $p$ is small (<0.05). A slight dependence between results of each experiment is ok.\nUse the Normal approximation when $n$ is large (>20), $p$ mid-ranged. Specifically it is considered an accurate approximation when the variance is greater than 10, in other words: $np(1-p)>10$. There are situations where either a Poisson or a Normal can be used to approximate a Binomial. In that situation go with the Normal!\n\nPoisson Approximation\nWhen defining the Poisson we proved that a Binomial in the limit as $n \\rightarrow \\infty$ and $p = \\lambda/n$ is a Poisson. That same logic can be used to show that a Poisson is a great approximation for a Binomial when the Binomial has extreme values of $n$ and $p$. A Poisson random variable approximates a Binomial where $n$ is large, $p$ is small, and $\\lambda = np$ is \u201cmoderate\u201d. Interestingly, to calculate the things we care about (PMF, expectation, variance) we no longer need to know $n$ and $p$. We only need to provide $\\lambda$ which we call the rate. When approximating a Binomial with a Poisson always choose $\\lambda = n \\cdot p$.\n\nThere are different interpretations of \"moderate\". The accepted ranges are $n > 20$ and $p < 0.05$ or $n > 100$ and $p < 0.1$.\n\nLet's say you want to send a bit string of length $n = 10^4$ where each bit is independently corrupted with $p = 10^{-6}$. What is the probability that the message will arrive uncorrupted?
You can solve this using a Poisson with $\\lambda = np = 10^4 \\cdot 10^{-6} = 0.01$. Let $X \\sim Poi(0.01)$ be the number of corrupted bits. Using the PMF for Poisson:\n\\begin{align*}\n P(X = 0) &= \\frac{\\lambda^0}{0!}e^{-\\lambda}\\\\\n &= \\frac{0.01^0}{0!}e^{-0.01}\\\\\n &\\approx 0.9900498\n\\end{align*}\nWe could have also modelled X as a binomial such that $X \\sim Bin(10^4, 10^{-6})$. That would have been more cumbersome to calculate but would have resulted in essentially the same number (they agree up to the millionth decimal place). \n\nNormal Approximation\nFor a Binomial where $n$ is large and $p$ is mid-ranged, a Normal can be used to approximate the Binomial. Let's take a side by side view of a normal and a binomial:\n\n\nLet's say our binomial is a random variable $X \\sim \\text{Bin}(100, 0.5)$ and we want to calculate $P(X \\geq 55)$. We could cheat by using the closest fit normal (in this case $Y \\sim N(50, 25)$). How did we choose that particular Normal? Simply select one with a mean and variance that matches the Binomial expectation and variance. The binomial expectation is $np = 100 \\cdot 0.5 = 50$. The Binomial variance is $np(1-p) = 100 \\cdot 0.5 \\cdot 0.5 = 25$.\n\n\nYou can use a Normal distribution to approximate a Binomial $X\\sim \\Bin(n,p)$. To do so define a normal $Y \\sim N(E[X], Var(X))$. Using the Binomial formulas for expectation and variance, $Y \\sim N(np, np(1-p))$. This approximation holds for large $n$ and moderate $p$. That gets you very close. However since a Normal is continuous and Binomial is discrete we have to use a continuity correction to discretize the Normal.\n\n\n\n\n\\begin{align*}\nP(X = k) \\approx P\\left(k - \\frac{1}{2} < Y < k + \\frac{1}{2}\\right) = \n\\Phi\\left(\n\t\\frac{k - np + 0.5}{\\sqrt{np(1-p)}}\n\\right)\n-\n\\Phi\\left(\n\t\\frac{k - np - 0.5}{\\sqrt{np(1-p)}}\n\\right)\n\\end{align*}\n\nYou should get comfortable deciding what continuity correction to use.
Here are a few examples of discrete probability questions and the continuity correction:\n\\begin{align*}\n &\\text{Discrete (Binomial) probability question} && \\text{Equivalent continuous probability question}\\\\\n &P(X=6) && P(5.5 < Y < 6.5) \\\\\n &P(X\\geq6) && P(Y > 5.5) \\\\\n &P(X > 6) && P(Y > 6.5) \\\\\n &P(X < 6) && P(Y < 5.5) \\\\\n &P(X \\leq 6) && P(Y < 6.5) \\\\\n\\end{align*}\n\n\nExample: 100 visitors to your website are given a new design. Let $X$ = \\# of people who were given the new design and spend more time on your website. Your CEO will endorse the new design if $X \\geq 65$. What is $P(\\text{CEO endorses change}|\\text{it has no effect})$?\n\nIf the design has no effect, $X \\sim \\Bin(100, 0.5)$. $E[X] = np = 50$. $\\Var(X) = np(1-p) = 25$. $\\sigma = \\sqrt{\\Var(X)} = 5$. We can thus use a Normal approximation: $Y \\sim \\mathcal{N}(\\mu = 50, \\sigma^2 = 25)$.\n\\begin{align*}\n P(X \\geq 65) &\\approx P(Y > 64.5) = P\\left(\\frac{Y - 50}{5} > \\frac{64.5 - 50}{5}\\right) = 1 - \\Phi(2.9) = 0.0019\n\\end{align*}\n\n\n\nExample: Stanford accepts 2480 students and each student has a 68\\% chance of attending. Let $X$ = \\# students who will attend. $X\\sim \\Bin(2480, 0.68)$. What is $P(X > 1745)$?\n\n$E[X] = np = 1686.4$. $\\Var(X) = np(1-p) = 539.7$. $\\sigma = \\sqrt{\\Var(X)} = 23.23$. We can thus use a Normal approximation: $Y \\sim \\mathcal{N}(\\mu = 1686.4, \\sigma^2 = 539.7)$.\n\\begin{align*}\n P(X > 1745) &\\approx P(Y > 1745.5) \\\\\n &\\approx P\\left(\\frac{Y - 1686.4}{23.23} > \\frac{1745.5 - 1686.4}{23.23}\\right) \\\\\n &\\approx 1 - \\Phi(2.54) = 0.0055\n\\end{align*}\n\n\n"}, {"id": "100_binomial_problems", "title": "100 Binomial Problems", "url": "examples/100_binomial_problems", "text": "\n \n\n100 Binomial Problems\n\n\nJust for fun (and to give you a lot of practice) I wrote a generative probabilistic program which could sample binomial distribution problems. 
Here are 100 binomial questions:\n\nQuestions\n\nQuestion 1: \nLaura is running a server cluster with 50 computers. The probability of a crash on a given server is 0.5. What is the standard deviation of crashes? \n\n\nAnswer 1:\nLet $X$ be the number of crashes.\n$X \\sim \\Bin(n = 50, p = 0.5)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{50 \\cdot 0.5 \\cdot (1 - 0.5)}\\\\ &= 3.54\n \\end{align*}\n\n\n\nQuestion 2: \nYou are showing an online-ad to 30 people. The probability of an ad ignore on each ad shown is 2/3. What is the expected number of ad clicks? \n\n\nAnswer 2:\nLet $X$ be the number of ad clicks.\n$X \\sim \\Bin(n = 30, p = 1/3)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 30 \\cdot 1/3 \\\\ &= 10\n \\end{align*}\n\n\n\nQuestion 3: \nA machine learning algorithm makes binary predictions. The machine learning algorithm makes 50 guesses where the probability of a incorrect prediction on a given guess is 19/25. What is the probability that the number of correct predictions is greater than 0? \n\n\nAnswer 3:\nLet $X$ be the number of correct predictions.\n$X \\sim \\Bin(n = 50, p = 6/25)$\n\\begin{align*}\n\\P(X > 0) &= 1 - \\P(0 <= X <= 0)\\\\\n &= 1 - {n \\choose 0} p^0 (1 - p)^{n - 0}\n \\end{align*}\n\n\n\nQuestion 4: \nWind blows independently across 50 locations. The probability of no wind at a given location is 0.5. What is the expected number of locations that have wind? \n\n\nAnswer 4:\nLet $X$ be the number of locations that have wind.\n$X \\sim \\Bin(n = 50, p = 0.5)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 50 \\cdot 0.5 \\\\ &= 25.0\n \\end{align*}\n\n\n\nQuestion 5: \nWind blows independently across 30 locations. What is the standard deviation of locations that have wind? the probability of wind at each location is 0.6. 
\n\n\nAnswer 5:\nLet $X$ be the number of locations that have wind.\n$X \\sim \\Bin(n = 30, p = 0.6)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{30 \\cdot 0.6 \\cdot (1 - 0.6)}\\\\ &= 2.68\n \\end{align*}\n\n\n\nQuestion 6: \nYou are trying to mine bitcoins. There are 50 independent attempts where the probability of a mining a bitcoin on a given attempt is 0.6. What is the expectation of bitcoins mined? \n\n\nAnswer 6:\nLet $X$ be the number of bitcoins mined.\n$X \\sim \\Bin(n = 50, p = 0.6)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 50 \\cdot 0.6 \\\\ &= 30.0\n \\end{align*}\n\n\n\nQuestion 7: \nYou are testing a new medicine on 40 patients. What is P(X is exactly 38)? The number of cured patients can be represented by a random variable X. X ~ Bin(40, 3/10). \n\n\nAnswer 7:\nLet $X$ be the number of cured patients.\n$X \\sim \\Bin(n = 40, p = 3/10)$\n\\begin{align*}\n\\P(X = 38) &= \n {n \\choose 38 } p^{ 38 } (1 - p)^{n - 38 } \\\\\n &= { 40 \\choose 38 } 3/10^{ 38 } (1 - 3/10)^{40 - 38 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 8: \nYou are manufacturing chips and are testing for defects. There are 50 independent tests and 0.5 is the probability of a defect on each test. What is the standard deviation of defects? \n\n\nAnswer 8:\nLet $X$ be the number of defects.\n$X \\sim \\Bin(n = 50, p = 0.5)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{50 \\cdot 0.5 \\cdot (1 - 0.5)}\\\\ &= 3.54\n \\end{align*}\n\n\n\nQuestion 9: \nLaura is flipping a coin 12 times. The probability of a tail on a given coin-flip is 5/12. What is the probability that the number of tails is greater than or equal to 2? \n\n\nAnswer 9:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 12, p = 5/12)$\n\\begin{align*}\n\\P(X >= 2) &= 1 - \\P(0 <= X <= 1)\\\\\n &= 1 - \\sum_{i = 0}^{1} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 10: \nYou are asking a survey question where responses are \"like\" or \"dislike\". 
There are 30 responses. You can assume each response is independent where the probability of a dislike on a given response is 1/6. What is the probability that the number of likes is greater than 28? \n\n\nAnswer 10:\nLet $X$ be the number of likes.\n$X \\sim \\Bin(n = 30, p = 5/6)$\n\\begin{align*}\n\\P(X > 28) &= P(29 <= X <= 30)\\\\ \n &= \\sum_{i = 29}^{30} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 11: \nA ball hits a series of 50 pins where it can bounce either right or left. The probability of a left on a given pin hit is 0.4. What is the standard deviation of rights? \n\n\nAnswer 11:\nLet $X$ be the number of rights.\n$X \\sim \\Bin(n = 50, p = 3/5)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{50 \\cdot 3/5 \\cdot (1 - 3/5)}\\\\ &= 3.46\n \\end{align*}\n\n\n\nQuestion 12: \nYou are sending a stream of 30 bits to space. The probability of a no corruption on a given bit is 1/3. What is the probability that the number of corruptions is 10? \n\n\nAnswer 12:\nLet $X$ be the number of corruptions.\n$X \\sim \\Bin(n = 30, p = 2/3)$\n\\begin{align*}\n\\P(X = 10) &= \n {n \\choose 10 } p^{ 10 } (1 - p)^{n - 10 } \\\\\n &= { 30 \\choose 10 } 2/3^{ 10 } (1 - 2/3)^{30 - 10 } \\\\\n &= 0.00015\n \n \\end{align*}\n\n\n\nQuestion 13: \nWind blows independently across locations. The probability of wind at a given location is 0.9. The number of independent locations is 20. What is the probability that the number of locations that have wind is not less than 19? \n\n\nAnswer 13:\nLet $X$ be the number of locations that have wind.\n$X \\sim \\Bin(n = 20, p = 0.9)$\n\\begin{align*}\n\\P(X >= 19) &= P(19 <= X <= 20)\\\\ \n &= \\sum_{i = 19}^{20} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 14: \nYou are sending a stream of bits to space. There are 30 independent bits where 5/6 is the probability of a no corruption on each bit. What is the probability that the number of corruptions is 21? 
\n\n\nAnswer 14:\nLet $X$ be the number of corruptions.\n$X \\sim \\Bin(n = 30, p = 1/6)$\n\\begin{align*}\n\\P(X = 21) &= \n {n \\choose 21 } p^{ 21 } (1 - p)^{n - 21 } \\\\\n &= { 30 \\choose 21 } 1/6^{ 21 } (1 - 1/6)^{30 - 21 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 15: \nCody generates random bit strings. There are 20 independent bits. Each bit has a 1/4 probability of resulting in a 1. What is the probability that the number of 1s is 11? \n\n\nAnswer 15:\nLet $X$ be the number of 1s.\n$X \\sim \\Bin(n = 20, p = 1/4)$\n\\begin{align*}\n\\P(X = 11) &= \n {n \\choose 11 } p^{ 11 } (1 - p)^{n - 11 } \\\\\n &= { 20 \\choose 11 } 1/4^{ 11 } (1 - 1/4)^{20 - 11 } \\\\\n &= 0.00301\n \n \\end{align*}\n\n\n\nQuestion 16: \nIn a restaurant some customers ask for a water with their meal. A random sample of 40 customers is selected where the probability of a water requested by a given customer is 9/20. What is the probability that the number of waters requested is 16? \n\n\nAnswer 16:\nLet $X$ be the number of waters requested.\n$X \\sim \\Bin(n = 40, p = 9/20)$\n\\begin{align*}\n\\P(X = 16) &= \n {n \\choose 16 } p^{ 16 } (1 - p)^{n - 16 } \\\\\n &= { 40 \\choose 16 } 9/20^{ 16 } (1 - 9/20)^{40 - 16 } \\\\\n &= 0.10433\n \n \\end{align*}\n\n\n\nQuestion 17: \nA student is guessing randomly on an exam with 12 questions. What is the expected number of correct answers? the probability of a correct answer on a given question is 5/12. \n\n\nAnswer 17:\nLet $X$ be the number of correct answers.\n$X \\sim \\Bin(n = 12, p = 5/12)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 12 \\cdot 5/12 \\\\ &= 5\n \\end{align*}\n\n\n\nQuestion 18: \nLaura is trying to mine bitcoins. The number of bitcoins mined can be represented by a random variable X. X ~ Bin(n = 100, p = 1/2). What is P(X is equal to 53)? 
\n\n\nAnswer 18:\nLet $X$ be the number of bitcoins mined.\n$X \\sim \\Bin(n = 100, p = 1/2)$\n\\begin{align*}\n\\P(X = 53) &= \n {n \\choose 53 } p^{ 53 } (1 - p)^{n - 53 } \\\\\n &= { 100 \\choose 53 } 1/2^{ 53 } (1 - 1/2)^{100 - 53 } \\\\\n &= 0.06659\n \n \\end{align*}\n\n\n\nQuestion 19: \nYou are showing an online-ad to customers. The add is shown to 100 people. The probability of an ad ignore on a given ad shown is 1/2. What is the standard deviation of ad clicks? \n\n\nAnswer 19:\nLet $X$ be the number of ad clicks.\n$X \\sim \\Bin(n = 100, p = 0.5)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{100 \\cdot 0.5 \\cdot (1 - 0.5)}\\\\ &= 5.00\n \\end{align*}\n\n\n\nQuestion 20: \nYou are running a server cluster with 40 computers. 5/8 is the probability of a computer continuing to work on each server. What is the expected number of crashes? \n\n\nAnswer 20:\nLet $X$ be the number of crashes.\n$X \\sim \\Bin(n = 40, p = 3/8)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 40 \\cdot 3/8 \\\\ &= 15\n \\end{align*}\n\n\n\nQuestion 21: \nYou are hashing 100 strings into a hashtable. The probability of a hash to the first bucket on a given string hash is 3/20. What is the probability that the number of hashes to the first bucket is greater than or equal to 97? \n\n\nAnswer 21:\nLet $X$ be the number of hashes to the first bucket.\n$X \\sim \\Bin(n = 100, p = 3/20)$\n\\begin{align*}\n\\P(X >= 97) &= P(97 <= X <= 100)\\\\ \n &= \\sum_{i = 97}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 22: \nYou are running in an election with 50 voters. 6/25 is the probability of a vote for you on each vote. What is the probability that the number of votes for you is less than 2? 
\n\n\nAnswer 22:\nLet $X$ be the number of votes for you.\n$X \\sim \\Bin(n = 50, p = 6/25)$\n\\begin{align*}\n\\P(X < 2) &= P(0 <= X <= 1)\\\\ \n &= \\sum_{i = 0}^{1} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 23: \nIrina is sending a stream of 40 bits to space. The probability of a corruption on each bit is 3/4. What is the probability that the number of corruptions is 22? \n\n\nAnswer 23:\nLet $X$ be the number of corruptions.\n$X \\sim \\Bin(n = 40, p = 3/4)$\n\\begin{align*}\n\\P(X = 22) &= \n {n \\choose 22 } p^{ 22 } (1 - p)^{n - 22 } \\\\\n &= { 40 \\choose 22 } 3/4^{ 22 } (1 - 3/4)^{40 - 22 } \\\\\n &= 0.00294\n \n \\end{align*}\n\n\n\nQuestion 24: \nYou are hashing 100 strings into a hashtable. The probability of a hash to the first bucket on a given string hash is 9/50. What is the probability that the number of hashes to the first bucket is greater than 97? \n\n\nAnswer 24:\nLet $X$ be the number of hashes to the first bucket.\n$X \\sim \\Bin(n = 100, p = 9/50)$\n\\begin{align*}\n\\P(X > 97) &= P(98 <= X <= 100)\\\\ \n &= \\sum_{i = 98}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 25: \nYou generate random bit strings. There are 100 independent bits. The probability of a 1 at a given bit is 3/25. What is the probability that the number of 1s is less than 97? \n\n\nAnswer 25:\nLet $X$ be the number of 1s.\n$X \\sim \\Bin(n = 100, p = 3/25)$\n\\begin{align*}\n\\P(X < 97) &= 1 - \\P(97 <= X <= 100)\\\\\n &= 1 - \\sum_{i = 97}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 26: \nYou are manufacturing toys and are testing for defects. What is the probability that the number of defects is greater than 1? the probability of a non-defect on a given test is 16/25 and you test 50 objects. 
\n\n\nAnswer 26:\nLet $X$ be the number of defects.\n$X \\sim \\Bin(n = 50, p = 9/25)$\n\\begin{align*}\n\\P(X > 1) &= 1 - \\P(0 <= X <= 1)\\\\\n &= 1 - \\sum_{i = 0}^{1} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 27: \nLaura is sending a stream of 40 bits to space. The number of corruptions can be represented by a random variable X. X is a Binomial with n = 40 and p = 3/4. What is P(X = 25)? \n\n\nAnswer 27:\nLet $X$ be the number of corruptions.\n$X \\sim \\Bin(n = 40, p = 3/4)$\n\\begin{align*}\n\\P(X = 25) &= \n {n \\choose 25 } p^{ 25 } (1 - p)^{n - 25 } \\\\\n &= { 40 \\choose 25 } (3/4)^{ 25 } (1 - 3/4)^{40 - 25 } \\\\\n &= 0.02819\n \n \\end{align*}\n\n\n\nQuestion 28: \n100 trials are run. What is the probability that the number of successes is 78? 1/2 is the probability of a success on each trial. \n\n\nAnswer 28:\nLet $X$ be the number of successes.\n$X \\sim \\Bin(n = 100, p = 1/2)$\n\\begin{align*}\n\\P(X = 78) &= \n {n \\choose 78 } p^{ 78 } (1 - p)^{n - 78 } \\\\\n &= { 100 \\choose 78 } (1/2)^{ 78 } (1 - 1/2)^{100 - 78 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 29: \nYou are flipping a coin. You flip the coin 20 times. The probability of a tail on a given coin-flip is 1/10. What is the standard deviation of heads? \n\n\nAnswer 29:\nLet $X$ be the number of heads.\n$X \\sim \\Bin(n = 20, p = 0.9)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{20 \\cdot 0.9 \\cdot (1 - 0.9)}\\\\ &= 1.34\n \\end{align*}\n\n\n\nQuestion 30: \nIrina is showing an online-ad to 12 people. 5/12 is the probability of an ad click on each ad shown. What is the probability that the number of ad clicks is less than or equal to 11? \n\n\nAnswer 30:\nLet $X$ be the number of ad clicks.\n$X \\sim \\Bin(n = 12, p = 5/12)$\n\\begin{align*}\n\\P(X <= 11) &= 1 - \\P(12 <= X <= 12)\\\\\n &= 1 - {n \\choose 12} p^{12} (1 - p)^{n - 12}\n \\end{align*}\n\n\n\nQuestion 31: \nYou are flipping a coin 50 times. 
19/25 is the probability of a head on each coin-flip. What is the standard deviation of tails? \n\n\nAnswer 31:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 50, p = 6/25)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{50 \\cdot 6/25 \\cdot (1 - 6/25)}\\\\ &= 3.02\n \\end{align*}\n\n\n\nQuestion 32: \nYou are running in an election with 100 voters. The probability of a vote for you on each vote is 1/4. What is the probability that the number of votes for you is less than or equal to 97? \n\n\nAnswer 32:\nLet $X$ be the number of votes for you.\n$X \\sim \\Bin(n = 100, p = 1/4)$\n\\begin{align*}\n\\P(X <= 97) &= 1 - \\P(98 <= X <= 100)\\\\\n &= 1 - \\sum_{i = 98}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 33: \nYou are running a server cluster with 40 computers. What is the probability that the number of crashes is less than or equal to 39? 3/4 is the probability of a computer continuing to work on each server. \n\n\nAnswer 33:\nLet $X$ be the number of crashes.\n$X \\sim \\Bin(n = 40, p = 1/4)$\n\\begin{align*}\n\\P(X <= 39) &= 1 - \\P(40 <= X <= 40)\\\\\n &= 1 - {n \\choose 40} p^{40} (1 - p)^{n - 40}\n \\end{align*}\n\n\n\nQuestion 34: \nWaddie is sending a stream of bits to space. Waddie sends 100 bits. The probability of a corruption on each bit is 1/2. What is the standard deviation of corruptions? \n\n\nAnswer 34:\nLet $X$ be the number of corruptions.\n$X \\sim \\Bin(n = 100, p = 1/2)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{100 \\cdot 1/2 \\cdot (1 - 1/2)}\\\\ &= 5.00\n \\end{align*}\n\n\n\nQuestion 35: \nA student is guessing randomly on an exam with 100 questions. Each question has a 0.5 probability of resulting in an incorrect answer. What is the probability that the number of correct answers is greater than 97? 
\n\n\nAnswer 35:\nLet $X$ be the number of correct answers.\n$X \\sim \\Bin(n = 100, p = 1/2)$\n\\begin{align*}\n\\P(X > 97) &= P(98 <= X <= 100)\\\\ \n &= \\sum_{i = 98}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 36: \nYou are testing a new medicine on patients. 0.5 is the probability of a cured patient on each trial. There are 10 independent trials. What is the expected number of cured patients? \n\n\nAnswer 36:\nLet $X$ be the number of cured patients.\n$X \\sim \\Bin(n = 10, p = 0.5)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 10 \\cdot 0.5 \\\\ &= 5.0\n \\end{align*}\n\n\n\nQuestion 37: \nA ball hits a series of pins where it can either go right or left. The number of independent pin hits is 100. The probability of a right on each pin hit is 0.5. What is the standard deviation of rights? \n\n\nAnswer 37:\nLet $X$ be the number of rights.\n$X \\sim \\Bin(n = 100, p = 0.5)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{100 \\cdot 0.5 \\cdot (1 - 0.5)}\\\\ &= 5.00\n \\end{align*}\n\n\n\nQuestion 38: \nYou are flipping a coin 40 times. The probability of a head on a given coin-flip is 1/2. What is the probability that the number of heads is 38? \n\n\nAnswer 38:\nLet $X$ be the number of heads.\n$X \\sim \\Bin(n = 40, p = 1/2)$\n\\begin{align*}\n\\P(X = 38) &= \n {n \\choose 38 } p^{ 38 } (1 - p)^{n - 38 } \\\\\n &= { 40 \\choose 38 } 1/2^{ 38 } (1 - 1/2)^{40 - 38 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 39: \n100 trials are run and the probability of a success on a given trial is 1/2. What is the standard deviation of successes? \n\n\nAnswer 39:\nLet $X$ be the number of successes.\n$X \\sim \\Bin(n = 100, p = 1/2)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{100 \\cdot 1/2 \\cdot (1 - 1/2)}\\\\ &= 5.00\n \\end{align*}\n\n\n\nQuestion 40: \nYou are trying to mine bitcoins. There are 40 independent attempts. The probability of a mining a bitcoin on each attempt is 3/10. 
What is the probability that the number of bitcoins mined is 19? \n\n\nAnswer 40:\nLet $X$ be the number of bitcoins mined.\n$X \\sim \\Bin(n = 40, p = 3/10)$\n\\begin{align*}\n\\P(X = 19) &= \n {n \\choose 19 } p^{ 19 } (1 - p)^{n - 19 } \\\\\n &= { 40 \\choose 19 } 3/10^{ 19 } (1 - 3/10)^{40 - 19 } \\\\\n &= 0.00852\n \n \\end{align*}\n\n\n\nQuestion 41: \n20 trials are run. 0.5 is the probability of a failure on each trial. What is the probability that the number of successes is 6? \n\n\nAnswer 41:\nLet $X$ be the number of successes.\n$X \\sim \\Bin(n = 20, p = 0.5)$\n\\begin{align*}\n\\P(X = 6) &= \n {n \\choose 6 } p^{ 6 } (1 - p)^{n - 6 } \\\\\n &= { 20 \\choose 6 } 0.5^{ 6 } (1 - 0.5)^{20 - 6 } \\\\\n &= 0.03696\n \n \\end{align*}\n\n\n\nQuestion 42: \nYou are flipping a coin. What is the probability that the number of tails is 0? there are 30 independent coin-flips where the probability of a head on a given coin-flip is 5/6. \n\n\nAnswer 42:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 30, p = 1/6)$\n\\begin{align*}\n\\P(X = 0) &= \n {n \\choose 0 } p^{ 0 } (1 - p)^{n - 0 } \\\\\n &= { 30 \\choose 0 } 1/6^{ 0 } (1 - 1/6)^{30 - 0 } \\\\\n &= 0.00421\n \n \\end{align*}\n\n\n\nQuestion 43: \nIn a restaurant some customers ask for a water with their meal. A random sample of 20 customers is selected and each customer has a 1/4 probability of resulting in a water not requested. What is the probability that the number of waters requested is 14? \n\n\nAnswer 43:\nLet $X$ be the number of waters requested.\n$X \\sim \\Bin(n = 20, p = 3/4)$\n\\begin{align*}\n\\P(X = 14) &= \n {n \\choose 14 } p^{ 14 } (1 - p)^{n - 14 } \\\\\n &= { 20 \\choose 14 } 3/4^{ 14 } (1 - 3/4)^{20 - 14 } \\\\\n &= 0.16861\n \n \\end{align*}\n\n\n\nQuestion 44: \nA student is guessing randomly on an exam. 3/8 is the probability of a incorrect answer on each question. The number of independent questions is 40. 
What is the probability that the number of correct answers is less than or equal to 37? \n\n\nAnswer 44:\nLet $X$ be the number of correct answers.\n$X \\sim \\Bin(n = 40, p = 5/8)$\n\\begin{align*}\n\\P(X <= 37) &= 1 - \\P(38 <= X <= 40)\\\\\n &= 1 - \\sum_{i = 38}^{40} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 45: \nYou are running in an election with 30 voters. 3/5 is the probability of a vote for you on each vote. What is the standard deviation of votes for you? \n\n\nAnswer 45:\nLet $X$ be the number of votes for you.\n$X \\sim \\Bin(n = 30, p = 3/5)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{30 \\cdot 3/5 \\cdot (1 - 3/5)}\\\\ &= 2.68\n \\end{align*}\n\n\n\nQuestion 46: \nCharlotte is flipping a coin 100 times. The probability of a tail on each coin-flip is 0.5. What is the probability that the number of tails is greater than 2? \n\n\nAnswer 46:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 100, p = 0.5)$\n\\begin{align*}\n\\P(X > 2) &= 1 - \\P(0 <= X <= 2)\\\\\n &= 1 - \\sum_{i = 0}^{2} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 47: \nYou are trying to mine bitcoins. You try 50 times. 3/5 is the probability of a not mining a bitcoin on each attempt. What is the probability that the number of bitcoins mined is 14? \n\n\nAnswer 47:\nLet $X$ be the number of bitcoins mined.\n$X \\sim \\Bin(n = 50, p = 2/5)$\n\\begin{align*}\n\\P(X = 14) &= \n {n \\choose 14 } p^{ 14 } (1 - p)^{n - 14 } \\\\\n &= { 50 \\choose 14 } 2/5^{ 14 } (1 - 2/5)^{50 - 14 } \\\\\n &= 0.02597\n \n \\end{align*}\n\n\n\nQuestion 48: \nYou are testing a new medicine on 100 patients. The probability of a cured patient on a given trial is 3/25. What is the probability that the number of cured patients is not less than 97? 
\n\n\nAnswer 48:\nLet $X$ be the number of cured patients.\n$X \\sim \\Bin(n = 100, p = 3/25)$\n\\begin{align*}\n\\P(X >= 97) &= P(97 <= X <= 100)\\\\ \n &= \\sum_{i = 97}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 49: \nWind blows independently across 40 locations. What is the probability that the number of locations that have wind is 40? 11/20 is the probability of no wind at each location. \n\n\nAnswer 49:\nLet $X$ be the number of locations that have wind.\n$X \\sim \\Bin(n = 40, p = 9/20)$\n\\begin{align*}\n\\P(X = 40) &= \n {n \\choose 40 } p^{ 40 } (1 - p)^{n - 40 } \\\\\n &= { 40 \\choose 40 } 9/20^{ 40 } (1 - 9/20)^{40 - 40 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 50: \nYou are showing an online-ad to 30 people. 1/6 is the probability of an ad click on each ad shown. What is the probability that the number of ad clicks is less than or equal to 28? \n\n\nAnswer 50:\nLet $X$ be the number of ad clicks.\n$X \\sim \\Bin(n = 30, p = 1/6)$\n\\begin{align*}\n\\P(X <= 28) &= 1 - \\P(29 <= X <= 30)\\\\\n &= 1 - \\sum_{i = 29}^{30} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 51: \nYou are flipping a coin. You flip the coin 40 times and 7/8 is the probability of a head on each coin-flip. What is the standard deviation of tails? \n\n\nAnswer 51:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 40, p = 1/8)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{40 \\cdot 1/8 \\cdot (1 - 1/8)}\\\\ &= 2.09\n \\end{align*}\n\n\n\nQuestion 52: \nCody is sending a stream of bits to space. 2/5 is the probability of a no corruption on each bit and there are 20 independent bits. What is the expectation of corruptions? \n\n\nAnswer 52:\nLet $X$ be the number of corruptions.\n$X \\sim \\Bin(n = 20, p = 3/5)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 20 \\cdot 3/5 \\\\ &= 12\n \\end{align*}\n\n\n\nQuestion 53: \nYou are running in an election. 
There are 12 independent votes and 5/6 is the probability of a vote for you on each vote. What is the probability that the number of votes for you is greater than or equal to 9? \n\n\nAnswer 53:\nLet $X$ be the number of votes for you.\n$X \\sim \\Bin(n = 12, p = 5/6)$\n\\begin{align*}\n\\P(X >= 9) &= P(9 <= X <= 12)\\\\ \n &= \\sum_{i = 9}^{12} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 54: \nYou are flipping a coin. The number of tails can be represented by a random variable X. X is a Bin(n = 30, p = 5/6). What is the probability that X = 1? \n\n\nAnswer 54:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 30, p = 5/6)$\n\\begin{align*}\n\\P(X = 1) &= \n {n \\choose 1 } p^{ 1 } (1 - p)^{n - 1 } \\\\\n &= { 30 \\choose 1 } 5/6^{ 1 } (1 - 5/6)^{30 - 1 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 55: \nIn a restaurant some customers ask for a water with their meal. A random sample of 100 customers is selected where 0.3 is the probability of a water requested by each customer. What is the expected number of waters requested? \n\n\nAnswer 55:\nLet $X$ be the number of waters requested.\n$X \\sim \\Bin(n = 100, p = 0.3)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 100 \\cdot 0.3 \\\\ &= 30.0\n \\end{align*}\n\n\n\nQuestion 56: \nYou are hashing strings into a hashtable. 30 strings are hashed. The probability of a hash to the first bucket on each string hash is 1/6. What is the expected number of hashes to the first bucket? \n\n\nAnswer 56:\nLet $X$ be the number of hashes to the first bucket.\n$X \\sim \\Bin(n = 30, p = 1/6)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 30 \\cdot 1/6 \\\\ &= 5\n \\end{align*}\n\n\n\nQuestion 57: \nYou are flipping a coin 100 times. What is the probability that the number of tails is greater than or equal to 98? 19/20 is the probability of a head on each coin-flip. 
\n\n\nAnswer 57:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 100, p = 1/20)$\n\\begin{align*}\n\\P(X >= 98) &= P(98 <= X <= 100)\\\\ \n &= \\sum_{i = 98}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 58: \nIrina is running a server cluster. What is the probability that the number of crashes is less than 99? the server has 100 computers which crash independently and the probability of a computer continuing to work on a given server is 22/25. \n\n\nAnswer 58:\nLet $X$ be the number of crashes.\n$X \\sim \\Bin(n = 100, p = 3/25)$\n\\begin{align*}\n\\P(X < 99) &= 1 - \\P(99 <= X <= 100)\\\\\n &= 1 - \\sum_{i = 99}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 59: \nYou are manufacturing chairs and are testing for defects. You test 100 objects. 1/2 is the probability of a non-defect on each test. What is the probability that the number of defects is not greater than 97? \n\n\nAnswer 59:\nLet $X$ be the number of defects.\n$X \\sim \\Bin(n = 100, p = 1/2)$\n\\begin{align*}\n\\P(X <= 97) &= 1 - \\P(98 <= X <= 100)\\\\\n &= 1 - \\sum_{i = 98}^{100} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 60: \nIn a restaurant some customers ask for a water with their meal. There are 50 customers. You can assume each customer is independent. 0.2 is the probability of a water requested by each customer. What is the expected number of waters requested? \n\n\nAnswer 60:\nLet $X$ be the number of waters requested.\n$X \\sim \\Bin(n = 50, p = 0.2)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 50 \\cdot 0.2 \\\\ &= 10.0\n \\end{align*}\n\n\n\nQuestion 61: \nYou are showing an online-ad to 40 people. 1/4 is the probability of an ad ignore on each ad shown. What is the probability that the number of ad clicks is 9? 
\n\n\nAnswer 61:\nLet $X$ be the number of ad clicks.\n$X \\sim \\Bin(n = 40, p = 3/4)$\n\\begin{align*}\n\\P(X = 9) &= \n {n \\choose 9 } p^{ 9 } (1 - p)^{n - 9 } \\\\\n &= { 40 \\choose 9 } 3/4^{ 9 } (1 - 3/4)^{40 - 9 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 62: \n100 trials are run. Each trial has a 22/25 probability of resulting in a failure. What is the standard deviation of successes? \n\n\nAnswer 62:\nLet $X$ be the number of successes.\n$X \\sim \\Bin(n = 100, p = 3/25)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{100 \\cdot 3/25 \\cdot (1 - 3/25)}\\\\ &= 3.25\n \\end{align*}\n\n\n\nQuestion 63: \nA machine learning algorithm makes binary predictions. There are 12 independent guesses where the probability of a incorrect prediction on a given guess is 1/6. What is the expected number of correct predictions? \n\n\nAnswer 63:\nLet $X$ be the number of correct predictions.\n$X \\sim \\Bin(n = 12, p = 5/6)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 12 \\cdot 5/6 \\\\ &= 10\n \\end{align*}\n\n\n\nQuestion 64: \nWaddie is showing an online-ad to customers. 1/2 is the probability of an ad click on each ad shown. The add is shown to 100 people. What is the average number of ad clicks? \n\n\nAnswer 64:\nLet $X$ be the number of ad clicks.\n$X \\sim \\Bin(n = 100, p = 1/2)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 100 \\cdot 1/2 \\\\ &= 50\n \\end{align*}\n\n\n\nQuestion 65: \nCharlotte is testing a new medicine on 50 patients. The probability of a cured patient on a given trial is 1/5. What is the probability that the number of cured patients is 12? \n\n\nAnswer 65:\nLet $X$ be the number of cured patients.\n$X \\sim \\Bin(n = 50, p = 1/5)$\n\\begin{align*}\n\\P(X = 12) &= \n {n \\choose 12 } p^{ 12 } (1 - p)^{n - 12 } \\\\\n &= { 50 \\choose 12 } 1/5^{ 12 } (1 - 1/5)^{50 - 12 } \\\\\n &= 0.10328\n \n \\end{align*}\n\n\n\nQuestion 66: \nYou are running in an election. The number of votes for you can be represented by a random variable X. 
X is a Bin(n = 50, p = 0.4). What is P(X is exactly 8)? \n\n\nAnswer 66:\nLet $X$ be the number of votes for you.\n$X \\sim \\Bin(n = 50, p = 0.4)$\n\\begin{align*}\n\\P(X = 8) &= \n {n \\choose 8 } p^{ 8 } (1 - p)^{n - 8 } \\\\\n &= { 50 \\choose 8 } 0.4^{ 8 } (1 - 0.4)^{50 - 8 } \\\\\n &= 0.00017\n \n \\end{align*}\n\n\n\nQuestion 67: \nIrina is flipping a coin 100 times. The probability of a head on a given coin-flip is 1/2. What is the probability that the number of tails is less than or equal to 99? \n\n\nAnswer 67:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 100, p = 0.5)$\n\\begin{align*}\n\\P(X <= 99) &= 1 - \\P(100 <= X <= 100)\\\\\n &= 1 - {n \\choose 100} p^{100} (1 - p)^{n - 100}\n \\end{align*}\n\n\n\nQuestion 68: \nYou are manufacturing airplanes and are testing for defects. You test 30 objects and the probability of a defect on a given test is 5/6. What is the probability that the number of defects is 14? \n\n\nAnswer 68:\nLet $X$ be the number of defects.\n$X \\sim \\Bin(n = 30, p = 5/6)$\n\\begin{align*}\n\\P(X = 14) &= \n {n \\choose 14 } p^{ 14 } (1 - p)^{n - 14 } \\\\\n &= { 30 \\choose 14 } (5/6)^{ 14 } (1 - 5/6)^{30 - 14 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 69: \nYou are flipping a coin 20 times. The number of heads can be represented by a random variable X. X is a Binomial with 20 trials. Each trial is a success, independently, with probability 1/4. What is the standard deviation of X? \n\n\nAnswer 69:\nLet $X$ be the number of heads.\n$X \\sim \\Bin(n = 20, p = 1/4)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{20 \\cdot 1/4 \\cdot (1 - 1/4)}\\\\ &= 1.94\n \\end{align*}\n\n\n\nQuestion 70: \nYou are giving a survey question where responses are \"like\" or \"dislike\" to 100 people. What is the probability that X is equal to 4? The number of likes can be represented by a random variable X. X is a Bin(100, 0.5). 
\n\n\nAnswer 70:\nLet $X$ be the number of likes.\n$X \\sim \\Bin(n = 100, p = 0.5)$\n\\begin{align*}\n\\P(X = 4) &= \n {n \\choose 4 } p^{ 4 } (1 - p)^{n - 4 } \\\\\n &= { 100 \\choose 4 } 0.5^{ 4 } (1 - 0.5)^{100 - 4 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 71: \nYou are flipping a coin. There are 20 independent coin-flips where the probability of a tail on a given coin-flip is 0.9. What is the standard deviation of tails? \n\n\nAnswer 71:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 20, p = 0.9)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{20 \\cdot 0.9 \\cdot (1 - 0.9)}\\\\ &= 1.34\n \\end{align*}\n\n\n\nQuestion 72: \nYou are flipping a coin. There are 50 independent coin-flips. The probability of a tail on a given coin-flip is 4/5. What is the expectation of heads? \n\n\nAnswer 72:\nLet $X$ be the number of heads.\n$X \\sim \\Bin(n = 50, p = 1/5)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 50 \\cdot 1/5 \\\\ &= 10\n \\end{align*}\n\n\n\nQuestion 73: \nYou are giving a survey question where responses are \"like\" or \"dislike\" to 100 people. What is the standard deviation of likes? the probability of a dislike on each response is 41/50. \n\n\nAnswer 73:\nLet $X$ be the number of likes.\n$X \\sim \\Bin(n = 100, p = 9/50)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{100 \\cdot 9/50 \\cdot (1 - 9/50)}\\\\ &= 3.84\n \\end{align*}\n\n\n\nQuestion 74: \nIn a restaurant some customers ask for a water with their meal. 0.6 is the probability of a water requested by each customer and there are 30 independent customers. What is the expected number of waters requested? \n\n\nAnswer 74:\nLet $X$ be the number of waters requested.\n$X \\sim \\Bin(n = 30, p = 0.6)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 30 \\cdot 0.6 \\\\ &= 18.0\n \\end{align*}\n\n\n\nQuestion 75: \nThere are 40 independent trials and 0.5 is the probability of a failure on each trial. What is the expectation of successes? 
\n\n\nAnswer 75:\nLet $X$ be the number of successes.\n$X \\sim \\Bin(n = 40, p = 1/2)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 40 \\cdot 1/2 \\\\ &= 20\n \\end{align*}\n\n\n\nQuestion 76: \nImran is showing an online-ad to 30 people. 5/6 is the probability of an ad click on each ad shown. What is the standard deviation of ad clicks? \n\n\nAnswer 76:\nLet $X$ be the number of ad clicks.\n$X \\sim \\Bin(n = 30, p = 5/6)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{30 \\cdot 5/6 \\cdot (1 - 5/6)}\\\\ &= 2.04\n \\end{align*}\n\n\n\nQuestion 77: \nYou are running a server cluster. What is the probability that the number of crashes is 1? The cluster has 30 computers which crash independently, and each computer has a 1/3 probability of crashing. \n\n\nAnswer 77:\nLet $X$ be the number of crashes.\n$X \\sim \\Bin(n = 30, p = 1/3)$\n\\begin{align*}\n\\P(X = 1) &= \n {n \\choose 1 } p^{ 1 } (1 - p)^{n - 1 } \\\\\n &= { 30 \\choose 1 } (1/3)^{ 1 } (1 - 1/3)^{30 - 1 } \\\\\n &= 0.00008\n \n \\end{align*}\n\n\n\nQuestion 78: \nCody is running a server cluster with 40 computers. What is P(X <= 39)? The number of crashes can be represented by a random variable X. X is a Bin(n = 40, p = 3/4). \n\n\nAnswer 78:\nLet $X$ be the number of crashes.\n$X \\sim \\Bin(n = 40, p = 3/4)$\n\\begin{align*}\n\\P(X <= 39) &= 1 - \\P(40 <= X <= 40)\\\\\n &= 1 - {n \\choose 40} p^{40} (1 - p)^{n - 40}\n \\end{align*}\n\n\n\nQuestion 79: \nYou are hashing strings into a hashtable. 5/6 is the probability of a hash to the first bucket on each string hash. There are 30 independent string hashes. What is the probability that the number of hashes to the first bucket is greater than or equal to 29? \n\n\nAnswer 79:\nLet $X$ be the number of hashes to the first bucket.\n$X \\sim \\Bin(n = 30, p = 5/6)$\n\\begin{align*}\n\\P(X >= 29) &= P(29 <= X <= 30)\\\\ \n &= \\sum_{i = 29}^{30} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 80: \nIrina is flipping a coin. 
Irina flips the coin 30 times and the probability of a head on each coin-flip is 0.4. What is the probability that the number of tails is 19? \n\n\nAnswer 80:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 30, p = 0.6)$\n\\begin{align*}\n\\P(X = 19) &= \n {n \\choose 19 } p^{ 19 } (1 - p)^{n - 19 } \\\\\n &= { 30 \\choose 19 } 0.6^{ 19 } (1 - 0.6)^{30 - 19 } \\\\\n &= 0.13962\n \n \\end{align*}\n\n\n\nQuestion 81: \nYou are asking a survey question where responses are \"like\" or \"dislike\". The probability of a like on a given response is 1/2. You give the survey to 100 people. What is the probability that the number of likes is not less than 2? \n\n\nAnswer 81:\nLet $X$ be the number of likes.\n$X \\sim \\Bin(n = 100, p = 1/2)$\n\\begin{align*}\n\\P(X >= 2) &= 1 - \\P(0 <= X <= 1)\\\\\n &= 1 - \\sum_{i = 0}^{1} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 82: \nWind blows independently across locations. The number of independent locations is 100. The probability of wind at a given location is 3/20. What is the probability that the number of locations that have wind is 93? \n\n\nAnswer 82:\nLet $X$ be the number of locations that have wind.\n$X \\sim \\Bin(n = 100, p = 3/20)$\n\\begin{align*}\n\\P(X = 93) &= \n {n \\choose 93 } p^{ 93 } (1 - p)^{n - 93 } \\\\\n &= { 100 \\choose 93 } (3/20)^{ 93 } (1 - 3/20)^{100 - 93 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 83: \nYou are flipping a coin. 0.9 is the probability of a tail on each coin-flip. You flip the coin 50 times. What is the expected number of heads? \n\n\nAnswer 83:\nLet $X$ be the number of heads.\n$X \\sim \\Bin(n = 50, p = 0.1)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 50 \\cdot 0.1 \\\\ &= 5.0\n \\end{align*}\n\n\n\nQuestion 84: \nA machine learning algorithm makes binary predictions. What is the probability that the number of correct predictions is less than or equal to 0? The probability of an incorrect prediction on a given guess is 1/4. 
The number of independent guesses is 40. \n\n\nAnswer 84:\nLet $X$ be the number of correct predictions.\n$X \\sim \\Bin(n = 40, p = 3/4)$\n\\begin{align*}\n\\P(X <= 0) &= P(0 <= X <= 0)\\\\ \n &= {n \\choose 0} p^0 (1 - p)^{n - 0}\n \\end{align*}\n\n\n\nQuestion 85: \nWind blows independently across 20 locations. 1/2 is the probability of wind at each location. What is the standard deviation of locations that have wind? \n\n\nAnswer 85:\nLet $X$ be the number of locations that have wind.\n$X \\sim \\Bin(n = 20, p = 1/2)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{20 \\cdot 1/2 \\cdot (1 - 1/2)}\\\\ &= 2.24\n \\end{align*}\n\n\n\nQuestion 86: \n7/10 is the probability of a failure on each trial and the number of independent trials is 100. What is the probability that the number of successes is 7? \n\n\nAnswer 86:\nLet $X$ be the number of successes.\n$X \\sim \\Bin(n = 100, p = 0.3)$\n\\begin{align*}\n\\P(X = 7) &= \n {n \\choose 7 } p^{ 7 } (1 - p)^{n - 7 } \\\\\n &= { 100 \\choose 7 } 0.3^{ 7 } (1 - 0.3)^{100 - 7 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 87: \nYou generate random bit strings. What is the expectation of 1s? There are 100 independent bits and 0.1 is the probability of a 1 at each bit. \n\n\nAnswer 87:\nLet $X$ be the number of 1s.\n$X \\sim \\Bin(n = 100, p = 0.1)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 100 \\cdot 0.1 \\\\ &= 10.0\n \\end{align*}\n\n\n\nQuestion 88: \nYou are testing a new medicine on patients. 3/5 is the probability of a cured patient on each trial. There are 30 independent trials. What is the probability that the number of cured patients is greater than or equal to 1? \n\n\nAnswer 88:\nLet $X$ be the number of cured patients.\n$X \\sim \\Bin(n = 30, p = 3/5)$\n\\begin{align*}\n\\P(X >= 1) &= 1 - \\P(0 <= X <= 0)\\\\\n &= 1 - {n \\choose 0} p^0 (1 - p)^{n - 0}\n \\end{align*}\n\n\n\nQuestion 89: \nA student is guessing randomly on an exam. 
0.9 is the probability of a correct answer on each question and the test has 20 questions. What is the standard deviation of correct answers? \n\n\nAnswer 89:\nLet $X$ be the number of correct answers.\n$X \\sim \\Bin(n = 20, p = 0.9)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{20 \\cdot 0.9 \\cdot (1 - 0.9)}\\\\ &= 1.34\n \\end{align*}\n\n\n\nQuestion 90: \nA student is guessing randomly on an exam with 40 questions. What is the probability that the number of correct answers is 32? 0.5 is the probability of a correct answer on each question. \n\n\nAnswer 90:\nLet $X$ be the number of correct answers.\n$X \\sim \\Bin(n = 40, p = 0.5)$\n\\begin{align*}\n\\P(X = 32) &= \n {n \\choose 32 } p^{ 32 } (1 - p)^{n - 32 } \\\\\n &= { 40 \\choose 32 } 0.5^{ 32 } (1 - 0.5)^{40 - 32 } \\\\\n &= 0.00007\n \n \\end{align*}\n\n\n\nQuestion 91: \nIn a restaurant some customers ask for a water with their meal. A random sample of 40 customers is selected where the probability of a water not requested by a given customer is 1/4. What is the standard deviation of waters requested? \n\n\nAnswer 91:\nLet $X$ be the number of waters requested.\n$X \\sim \\Bin(n = 40, p = 3/4)$\n\\begin{align*}\n\\Std(X) &= \\sqrt{np(1-p)} \\\\ &= \\sqrt{40 \\cdot 3/4 \\cdot (1 - 3/4)}\\\\ &= 2.74\n \\end{align*}\n\n\n\nQuestion 92: \nA machine learning algorithm makes binary predictions. The number of correct predictions can be represented by a random variable X. X is a Bin(n = 30, p = 2/5). What is P(X < 27)? \n\n\nAnswer 92:\nLet $X$ be the number of correct predictions.\n$X \\sim \\Bin(n = 30, p = 2/5)$\n\\begin{align*}\n\\P(X < 27) &= 1 - \\P(27 <= X <= 30)\\\\\n &= 1 - \\sum_{i = 27}^{30} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 93: \nIrina is flipping a coin. The probability of a tail on each coin-flip is 3/4. The number of independent coin-flips is 40. What is the probability that the number of tails is greater than 0? 
\n\n\nAnswer 93:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 40, p = 3/4)$\n\\begin{align*}\n\\P(X > 0) &= 1 - \\P(0 <= X <= 0)\\\\\n &= 1 - {n \\choose 0} p^0 (1 - p)^{n - 0}\n \\end{align*}\n\n\n\nQuestion 94: \nWaddie is sending a stream of 50 bits to space. The probability of no corruption on a given bit is 1/2. What is the expectation of corruptions? \n\n\nAnswer 94:\nLet $X$ be the number of corruptions.\n$X \\sim \\Bin(n = 50, p = 0.5)$\n\\begin{align*}\n\\E[X] &= np \\\\ &= 50 \\cdot 0.5 \\\\ &= 25.0\n \\end{align*}\n\n\n\nQuestion 95: \nYou are hashing strings into a hashtable. There are 30 independent string hashes where the probability of a hash to the first bucket on each string hash is 5/6. What is the probability that the number of hashes to the first bucket is 24? \n\n\nAnswer 95:\nLet $X$ be the number of hashes to the first bucket.\n$X \\sim \\Bin(n = 30, p = 5/6)$\n\\begin{align*}\n\\P(X = 24) &= \n {n \\choose 24 } p^{ 24 } (1 - p)^{n - 24 } \\\\\n &= { 30 \\choose 24 } (5/6)^{ 24 } (1 - 5/6)^{30 - 24 } \\\\\n &= 0.16009\n \n \\end{align*}\n\n\n\nQuestion 96: \nCharlotte is hashing strings into a hashtable. 100 strings are hashed and the probability of a hash to the first bucket on a given string hash is 1/5. What is the probability that the number of hashes to the first bucket is greater than or equal to 1? \n\n\nAnswer 96:\nLet $X$ be the number of hashes to the first bucket.\n$X \\sim \\Bin(n = 100, p = 1/5)$\n\\begin{align*}\n\\P(X >= 1) &= 1 - \\P(0 <= X <= 0)\\\\\n &= 1 - {n \\choose 0} p^0 (1 - p)^{n - 0}\n \\end{align*}\n\n\n\nQuestion 97: \nYou are flipping a coin. Each coin-flip has a 3/10 probability of resulting in a head and there are 100 coin-flips. You can assume each coin-flip is independent. What is the probability that the number of heads is 0? 
\n\n\nAnswer 97:\nLet $X$ be the number of heads.\n$X \\sim \\Bin(n = 100, p = 3/10)$\n\\begin{align*}\n\\P(X = 0) &= \n {n \\choose 0 } p^{ 0 } (1 - p)^{n - 0 } \\\\\n &= { 100 \\choose 0 } (3/10)^{ 0 } (1 - 3/10)^{100 - 0 } \\\\\n &< 0.00001\n \n \\end{align*}\n\n\n\nQuestion 98: \nChris is sending a stream of 50 bits to space. 16/25 is the probability of no corruption on each bit. What is the probability that the number of corruptions is greater than or equal to 47? \n\n\nAnswer 98:\nLet $X$ be the number of corruptions.\n$X \\sim \\Bin(n = 50, p = 9/25)$\n\\begin{align*}\n\\P(X >= 47) &= P(47 <= X <= 50)\\\\ \n &= \\sum_{i = 47}^{50} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 99: \nYou are flipping a coin 30 times. What is the probability that the number of tails is less than 29? The probability of a tail on a given coin-flip is 2/3. \n\n\nAnswer 99:\nLet $X$ be the number of tails.\n$X \\sim \\Bin(n = 30, p = 2/3)$\n\\begin{align*}\n\\P(X < 29) &= 1 - \\P(29 <= X <= 30)\\\\\n &= 1 - \\sum_{i = 29}^{30} {n \\choose i} p^i (1 - p)^{n - i}\n \\end{align*}\n\n\n\nQuestion 100: \nYou are manufacturing chips and are testing for defects. There are 40 independent tests. The probability of a non-defect on a given test is 5/8. What is the probability that the number of defects is 10? \n\n\nAnswer 100:\nLet $X$ be the number of defects.\n$X \\sim \\Bin(n = 40, p = 3/8)$\n\\begin{align*}\n\\P(X = 10) &= \n {n \\choose 10 } p^{ 10 } (1 - p)^{n - 10 } \\\\\n &= { 40 \\choose 10 } (3/8)^{ 10 } (1 - 3/8)^{40 - 10 } \\\\\n &= 0.03507\n \n \\end{align*}\n\n\n\n"}, {"id": "jury", "title": "Jury Selection", "url": "examples/jury", "text": "\n \nJury Selection\n\n\n\t\t\t\t\tIn the Supreme Court case: Berghuis v. 
Smith, the Supreme Court (of the US) discussed the question: \"If a group is underrepresented in a jury pool, how do you tell?\"\n\n\n\t\n\tJustice Breyer [Stanford Alum] opened the questioning by invoking the binomial theorem.\u00a0 He hypothesized a scenario involving \u201can urn with a thousand balls, and sixty are red, and nine hundred forty are green, and then you select them at random\u2026 twelve at a time.\u201d\u00a0 According to Justice Breyer and the binomial theorem, if the red balls were black jurors then \u201cyou would expect\u2026 something like a third to a half of juries would have at least one black person\u201d on them.\u00a0\n\n\t\t\t\t\n\n\t\t\t\t\tNote: What is missing in this conversation is the power of diverse backgrounds when making difficult decisions.\n\t\t\t\t\n\n\nSimulate\nSimulation:\n\n\nExplanation:\n\n\t\t\t\t\tTechnically, since jurors are selected without replacement, you should represent the number of underrepresented jurors as a Hypergeometric random variable (a random variable we don't look at explicitly in CS109) such that \n\t\t\t\t\n\n\t\t\t\t\t\\begin{align*}X \\sim \\text{HypGeo}(n=12, N = 1000, m = 60)\\end{align*}\n\t\t\t\t\n\n\t\t\t\t\t\\begin{align*} \n\t\t\t\t\t\tP(X \\geq 1) &= 1 - P(X = 0) \\\\\n\t\t\t\t\t\t\t\t\t&= 1 - \\frac{ {60 \\choose 0}{940 \\choose 12} }{1000 \\choose 12} \\\\\n\t\t\t\t\t\t\t\t\t&\\approx 0.5261\n\t\t\t\t\t\\end{align*} \n\n\t\t\t\t\nHowever, Justice Breyer made his case by citing a Binomial distribution. This isn't a perfect use of the binomial, because the binomial assumes that each experiment has an equal likelihood ($p$) of success. Because the jurors are selected without replacement, the probability of getting a minority juror changes slightly after each selection (and depending on what the selection was). 
However, as we will see, because the probabilities don't change too much the binomial distribution is not too far off.\n\t\t\t\t\n\n\t\t\t\t\t\\begin{align*}\n\t\t\t\t\tX \\sim \\text{Binomial}(n=12, p = 60/1000)\n\t\t\t\t\t\\end{align*}\n\t\t\t\t\n\n\t\t\t\t\t\\begin{align*} \n\t\t\t\t\t\tP(X \\geq 1) &= 1 - P(X = 0) \\\\\n\t\t\t\t\t\t\t\t\t&= 1 - {12 \\choose 0} 0.06^{0} (1- 0.06)^{12} \\\\\n\t\t\t\t\t\t\t\t\t&\\approx 0.5241\n\t\t\t\t\t\\end{align*} \n\n\t\t\t\t\n\nAcknowledgements: Problem posed and solved by Mehran Sahami\n\t\t\n\t\t\n\t\t\t\t\n\t\t\n"}, {"id": "grading_eye_inflammation", "title": "Grading Eye Inflammation", "url": "examples/grading_eye_inflammation", "text": "\n \nGrading Eye Inflammation\n\nWhen a patient has eye inflammation, eye doctors \"grade\" the inflammation. When \"grading\" inflammation they randomly look at a single 1 millimeter by 1 millimeter square in the patient's eye and count how many \"cells\" they see. \n\nThere is uncertainty in these counts. If the true average number of cells for a given patient's eye is 6, the doctor could get a different count (say 4, or 5, or 7) just by chance. As of 2021, modern eye medicine does not have a sense of uncertainty for their inflammation grades! In this problem we are going to change that. At the same time we are going to learn about Poisson distributions over space.\n\n\n\n\nWhy is the number of cells observed in a 1x1 square governed by a Poisson process?\n\n We can approximate a distribution for the count by discretizing the square into a fixed number of equal sized buckets. Each bucket either has a cell or not. Therefore, the count of cells in the 1x1 square is a sum of Bernoulli random variables with equal $p$, and as such can be modeled as a binomial random variable. This is an approximation because it doesn't allow for two cells in one bucket. 
Just like with time, if we make the size of each bucket infinitely small, this limitation goes away and we converge on the true distribution of counts. The binomial in the limit, i.e. a binomial as $n \\rightarrow \\infty$, is truly represented by a Poisson random variable. In this context, $\\lambda$ represents the average number of cells per 1$\\times$1 sample. See Figure 2.\n \n \n \n\n\n For a given patient the true average rate of cells is 5 cells per 1x1 sample. What is the probability that in a single 1x1 sample the doctor counts 4 cells?\n\n\n Let $X$ denote the number of cells in the 1x1 sample. We note that $X \\sim \\Poi(5)$. We want to find $P(X=4)$.\n \\[P(X=4) = \\frac{5^4 e^{-5}}{4!} \\approx 0.175\\]\n\n \n\n\nMultiple Observations\nHeads up! This section uses concepts from Part 3. Specifically {{!independent_vars}}\n For a given patient the true average rate of cells is 5 cells per 1mm by 1mm sample. In an attempt to be more precise, the doctor counts cells in two different, larger 2mm by 2mm samples. Assume that the occurrences of cells in one 2mm by 2mm sample are independent of the occurrences in any other 2mm by 2mm sample. What is the probability that she counts 20 cells in the first sample and 20 cells in the second?\n\n\nLet $Y_1$ and $Y_2$ denote the number of cells in each of the 2x2 samples. Since there are on average 5 cells in a 1x1 sample and a 2x2 sample has four times the area, there are on average 20 cells in a 2x2 sample, so we have that $Y_1 \\sim \\Poi(20)$ and $Y_2 \\sim \\Poi(20)$. We want to find $P(Y_1 = 20 \\land Y_2 = 20)$. Since the numbers of cells in the two samples are independent, this is equivalent to finding $\\P(Y_1 = 20) \\P(Y_2=20)$.\n\nEstimating Lambda\nHeads up! This section uses concepts from Part 5. 
Specifically {{!map}}\n\nInflammation prior: Based on millions of historical patients, doctors have learned that the prior probability density function of true rate of cells is:\n\\begin{align*}\n f(\\lambda) = K \\cdot \\lambda \\cdot e ^{-\\frac{\\lambda}{2}}\n\\end{align*}\nWhere $K$ is a normalization constant and $\\lambda$ must be greater than 0.\n\n A doctor takes a single sample and counts 4 cells. Give an equation for the updated probability density of $\\lambda$. Use the \"Inflammation prior\" as the prior probability density over values of $\\lambda$. Your probability density may have a constant term.\n\n\n \n Let $\\theta$ be the random variable for the true rate. Let $X$ be the random variable for the count.\n \\begin{align*}\n f(\\theta = \\lambda | X = 4) \n &= \\frac{P(X=4|\\theta = \\lambda) f(\\theta = \\lambda)}{P(X = 4)} \\\\\n &= \\frac{\\frac{\\lambda^{4} e^{-\\lambda}}{4!} \\cdot K \\cdot \\lambda \\cdot e^{-\\lambda / 2}}{P(X = 4)} \\\\\n &= \\frac{K \\cdot \\lambda^5 e^{-\\frac{3}{2}\\lambda}}{4! P(X=4)}\n \\end{align*}\n \nA doctor takes a single sample and counts 4 cells. What is the Maximum A Posteriori estimate of $\\lambda$?\n\n \n Maximize the \"posterior\" of the parameter calculated in the previous section:\n \\begin{align*}\n \\argmax_\\limits\\lambda \\frac{K \\cdot \\lambda^5 e^{-\\frac{3}{2}\\lambda}}{4! 
P(X=4)}\n &= \\argmax_\\limits\\lambda \\lambda^5 e^{-\\frac{3}{2}\\lambda} \\\\\n \\end{align*}\n \n Take the logarithm (it preserves the argmax and makes the derivative easier):\n \\begin{align*}\n &= \\argmax_\\limits\\lambda \\log \\left(\\lambda^5 e^{-\\frac{3}{2}\\lambda} \\right) \\\\\n &= \\argmax_\\limits\\lambda \\left(5 \\log \\lambda -\\frac{3}{2}\\lambda\\right)\n \\end{align*}\n\n Calculate the derivative with respect to the parameter, and set it equal to 0:\n \\begin{align*}\n 0 &= \\frac{\\partial}{\\partial \\lambda} \\left(5 \\log \\lambda -\\frac{3}{2}\\lambda\\right) \\\\\n 0 &= \\frac{5}{\\lambda} -\\frac{3}{2} \\\\\n \\lambda &= \\frac{10}{3}\n \\end{align*}\n \nExplain, in words, the difference between the two estimates of lambda in the two previous parts. \n\n\tThe estimate in the first part is a \"distribution\" (also called a soft estimate) whereas the estimate in the second part is a single value (also called a point estimate). The former contains information about confidence.\n\n\nWhat is the MLE estimate of $\\lambda$?\n\n\n The MLE estimate doesn't use the prior belief. The MLE estimate for a Poisson is simply the average of the observations. In this case the average of our single observation is 4. MLE is not a great tool for estimating our parameter from just one datapoint.\n\nA patient comes on two separate days. The first day the doctor counts 5 cells, the second day the doctor counts 4 cells. Based only on this observation, and treating the true rates on the two days as independent, what is the probability that the patient's inflammation has gotten better (in other words, that their $\\lambda$ has decreased)? 
\n\n\tLet $\\theta_1$ be the random variable for lambda on the first day and $\\theta_2$ be the random variable for lambda on the second day.\n\n\\begin{align*}\n f(\\theta_1 = \\lambda | X = 5) &= K_1 \\cdot \\lambda^6 e^{-\\frac{3}{2}\\lambda} \\\\\n f(\\theta_2 = \\lambda | X = 4) &= K_2 \\cdot \\lambda^5 e^{-\\frac{3}{2}\\lambda} \n\\end{align*}\n\nThe question is asking what is $P(\\theta_1 > \\theta_2)$? There are a few ways to calculate this exactly:\n\\begin{align*}\n& \\int_{\\lambda_1=0}^\\infty \\int_{\\lambda_2=0}^{\\lambda_1} f(\\theta_1 = \\lambda_1, \\theta_2 = \\lambda_2) \\, d\\lambda_2 \\, d\\lambda_1 \\\\\n&= \\int_{\\lambda_1=0}^\\infty \\int_{\\lambda_2=0}^{\\lambda_1} f(\\theta_1 = \\lambda_1) \\cdot f(\\theta_2 = \\lambda_2) \\, d\\lambda_2 \\, d\\lambda_1 \\\\\n&= \\int_{\\lambda_1=0}^\\infty f(\\theta_1 = \\lambda_1) \\int_{\\lambda_2=0}^{\\lambda_1} f(\\theta_2 = \\lambda_2) \\, d\\lambda_2 \\, d\\lambda_1 \\\\\n&= \\int_{\\lambda_1=0}^\\infty K_1 \\cdot \\lambda_1^6 e^{-\\frac{3}{2}\\lambda_1} \\int_{\\lambda_2=0}^{\\lambda_1} K_2 \\cdot \\lambda_2^5 e^{-\\frac{3}{2}\\lambda_2} \\, d\\lambda_2 \\, d\\lambda_1\n\\end{align*}\n\n"}, {"id": "norm_cdf_calculator", "title": "Gaussian CDF Calculator", "url": "examples/norm_cdf_calculator", "text": "\n\n\t\t\t\t\t\tGaussian CDF Calculator\n\t\t\t\t\t\n\n\n\t\t\t\t\tTo calculate the Cumulative Density Function (CDF) for a normal (aka Gaussian) random variable at a value $x$, also written as $F(x)$, you can transform your distribution to the \"standard normal\" and look up the corresponding value in the standard normal CDF. However, most programming libraries will provide a normal cdf function. This tool replicates said functionality.\n\t\t\t\t\n Calculator\n\n\nx: \u00a0\u00a0\n\n\n\n\n\n\n\nmu: \n\n\n\n\n\n\n\nstd: \n\n\n\n\n\n\nnorm.cdf(x, mu, std)\n\n\n\n\nExplanation\nThis function calculates the cumulative density function of a Normal random variable. It is very important in CS109 to understand the difference between a probability density function (PDF), and a cumulative density function (CDF). 
The CDF of a random variable at point little $x$ is equal to the probability that the random variable takes on a value less than or equal to $x$. If the random variable is called big $X$, the CDF can be written as $P(X \\leq x)$ or as $F_X(x)$.\n\t\t\t\t\t The CDF function of a Normal is calculated by translating the random variable to the Standard Normal, and then looking up a value from the precalculated \"Phi\" function ($\\Phi$), which is the cumulative density function of the standard normal. The Standard Normal, often written $Z$, is a Normal with mean 0 and variance 1. Thus, $Z \\sim N(\\mu = 0, \\sigma^2 = 1)$.\n\t\t\t\t\t\n\n\nTry different calculations to see different translations to the standard normal!\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\n\n\n"}, {"id": "grades_not_normal", "title": "Grades are Not Normal", "url": "examples/grades_not_normal", "text": "\n \nGrades are Not Normal\n\nSometimes you just feel like squashing normals:\n\nLogit Normal\nThe logit normal is the continuous distribution that results from applying a special \"squashing\" function to a Normally distributed random variable. The squashing function maps all values the normal could take on onto the range 0 to 1. If $X \\sim \\text{LogitNormal}(\\mu, \\sigma^2)$ it has:\n\\begin{align*}\n \\text{PDF:}&& &f_X(x) = \\begin{cases}\n \\frac{1}{\\sigma(\\sqrt{2\\pi})x(1 - x)}e^{-\\frac{(\\text{logit}(x) - \\mu)^2}{2\\sigma^2}} & \\text{if } 0 < x < 1\\\\\n 0 & \\text{otherwise}\n \\end{cases} \\\\\n \\text{CDF:} && &F_X(x) = \\Phi\\Big(\\frac{\\text{logit}(x) - \\mu}{\\sigma}\\Big)\\\\\n \\text{Where:} && &\\text{logit}(x) = \\text{log}\\Big(\\frac{x}{1-x}\\Big)\n\\end{align*}\n\nA new theory shows that the Logit Normal better fits exam score distributions than the traditionally used Normal. Let's test it out! 
We have some set of exam scores for a test with min possible score 0 and max possible score 1, and we are trying to decide between two hypotheses:\n\n$H_1$: our grade scores are distributed according to $X\\sim \\text{Normal}(\\mu = 0.7, \\sigma^2 = 0.2^2)$. \n$H_2$: our grade scores are distributed according to $X\\sim \\text{LogitNormal}(\\mu = 1.0, \\sigma^2 = 0.9^2)$.\n\n\n\n\n\nUnder the normal assumption, $H_1$, what is $P(0.9 < X < 1.0)$? Provide a numerical answer to two decimal places.\n\n\t$$P(0.9 < X < 1.0) = \\Phi\\left(\\frac{1.0 - 0.7}{0.2}\\right) - \\Phi\\left(\\frac{0.9 - 0.7}{0.2}\\right) = \\Phi(1.5) - \\Phi(1.0) = 0.9332 - 0.8413 = 0.09$$\n\nUnder the logit-normal assumption, $H_2$, what is $P(0.9 < X < 1.0)$?\n\n\n\t\t$$F_X(1.0) - F_X(0.9) = \\Phi\\Big(\\frac{\\text{logit}(1.0) - 1.0}{0.9}\\Big) - \\Phi\\Big(\\frac{\\text{logit}(0.9) - 1.0}{0.9}\\Big)$$\n\nWhich we can solve for numerically. Since $\\text{logit}(1.0) = \\infty$, the first term is 1:\n$$\\Phi\\Big(\\frac{\\text{logit}(1.0) - 1.0}{0.9}\\Big) - \\Phi\\Big(\\frac{\\text{logit}(0.9) - 1.0}{0.9}\\Big) = 1 - \\Phi(1.33) \n\\approx 0.09$$\n\t\nUnder the normal assumption, $H_1$, what is the maximum value that $X$ can take on?\n\n\t\n\t$$\\infty$$\n\nBefore observing any test scores, you assume that (a) one of your two hypotheses is correct and (b) that initially, each hypothesis is equally likely to be correct, $P(H_1)=P(H_2)=\\frac{1}{2}$. You then observe a single test score, $X = 0.9$. What is your updated probability that the Logit-Normal hypothesis is correct? 
\n\t\n\t\\begin{align*}\nP(H_2|X = 0.9) &= \\frac{f(X = 0.9|H_2)P(H_2)}{f(X = 0.9|H_2)P(H_2) + f(X = 0.9|H_1)P(H_1)}\\\\\n&= \\frac{f(X = 0.9|H_2)}{f(X = 0.9|H_2) + f(X = 0.9|H_1)}\\\\\n&= \\frac{\\frac{1}{0.9\\sqrt{2\\pi} \\cdot 0.9 \\cdot (1 - 0.9)}e^{-\\frac{(\\text{logit}(0.9) - 1.0)^2}{2 \\cdot 0.9^2}}}{\\frac{1}{0.9\\sqrt{2\\pi} \\cdot 0.9 \\cdot (1 - 0.9)}e^{-\\frac{(\\text{logit}(0.9) - 1.0)^2}{2 \\cdot 0.9^2}} + \\frac{1}{0.2\\sqrt{2\\pi}}e^{-\\frac{(0.9 - 0.7)^2}{2 \\cdot 0.2^2}}}\n\\end{align*}\n\n\n"}, {"id": "curse_of_dimensionality", "title": "Curse of Dimensionality", "url": "examples/curse_of_dimensionality", "text": "\n \nCurse of Dimensionality\n\nMachine learning, like many fields of computer science, often involves high dimensional points, and high dimensional spaces have some surprising probabilistic properties. \n\nA random value $X_i$ is a Uni(0, 1).\nA random point of dimension $d$ is a list of $d$ random values: $[X_1 \\dots X_d]$.\n\n\n\n\n\n A random value $X_i$ is close to an edge if $X_i$ is less than 0.01 or $X_i$ is greater than 0.99. What is the probability that a random value is close to an edge?\nLet $E$ be the event that a random value is close to an edge.\n $P(E) = P(X_i < 0.01) + P(X_i > 0.99) = 0.02$\n\nA random point $[X_1, X_2, X_3]$ of dimension $3$ is close to an edge if any of its values are close to an edge. What is the probability that a $3$ dimensional point is close to an edge?\nThe event is the complement of the event that none of the point's values is close to an edge. Assuming the values are independent, its probability is: \n $1 - (1 - P(E))^3 = 1 - 0.98^3 \\approx 0.059$\n\nA random point $[X_1, \\dots X_{100}]$ of dimension $100$ is close to an edge if any of its values are close to an edge. 
What is the probability that a 100 dimensional point is close to an edge?\nSimilarly, it is: \n $1 - (1 - P(E))^{100} = 1 - 0.98^{100} \\approx 0.867$\n\n\nHigh dimensional points exhibit many other phenomena. For example, the Euclidean distances between points start to converge.\n"}, {"id": "prob_baby_delivery", "title": "Probability of Baby Delivery", "url": "examples/prob_baby_delivery", "text": "\n\n \nProbability and Babies\n\n\n\tThis demo used to be live. We now know that the delivery happened on Jan 23rd. Let's go back in time to Jan 1st and see what the probability looked like at that point.\n\nWhat is the probability that Laura gives birth today (given that she hasn't given birth up until today)?\n\n\nToday's Date\n\n\n\n\n\nDue Date\n\n\n\n\n\nProbability of delivery today: \nProbability of delivery in next 7 days: \nCurrent days past due date: days\nUnconditioned probability mass before today: \n\n\nHow likely is delivery, in humans, relative to the due date? There have been millions of births which gives us a relatively good picture [1]. The length of human pregnancy varies quite a lot! Have you heard that it is 9 months? That is a rough point estimate. The mean duration of pregnancy is 278.6 days, and pregnancy length has a standard deviation (SD) of 12.5 days. This distribution is not normal, but roughly matches a \"skewed normal\". This is a general probability mass function for the first pregnancy collected from hundreds of thousands of women (this PMF is very similar across demographics, but changes based on whether the woman has given birth before):\n\n\n\n\nOf course, we have more information. Specifically, we know that Laura hasn't given birth up until today (we will update this example when that changes). We also know that babies which are over 14 days late are \"induced\" on day 14. How likely is delivery given that we haven't delivered up until today? 
Note that the y-axis is scaled differently:\n\n\n\n\n\n\tImplementation notes: this calculation was performed by storing the PMF as a list of (day, probability) points. These values are sometimes called weighted samples, or \"particles\" and are the key component to a \"particle filtering\" approach. After we observe no-delivery, we set the probability of every point which has a day before today to be 0, and then re-normalize the remaining points (aka we \"filter\" the \"particles\"). This is convenient because the \"posterior\" belief doesn't follow a simple equation -- using particles means we never have to write that equation down in our code.\n\t\n\n\nThree friends have the exact same due date (Really! This isn't a hypothetical). What is the probability that all three couples deliver on the exact same day?\n\n\n\t\t\n\nProbability of three couples on the same day: \n\nHow did we get that number? Let $p_i$ be the probability that one baby is delivered on day $i$ -- this number can be read off the probability mass function. Let $D_i$ be the event that all three babies are delivered on day $i$. Note that the event $D_i$ is mutually exclusive with the event that all three babies are born on another day (so, for example, $D_1$ is mutually exclusive with $D_2$, $D_3$ etc). Let $N=3$ be the event that all babies are born on the same day:\n\n\t$$\n\\begin{align}\n\\p(N=3) &= \\sum_i \\p(D_i) && \\text{Since days are mutually exclusive} \\\\\n&= \\sum_i p_i^3 && \\text{Since the three couples are independent} \n\\end{align}\n\t$$\n\n\n\n[1] Predicting delivery date by ultrasound and last menstrual period in early gestation\n\nAcknowledgements: This problem was first posed to me by Chris Gregg.\n\n\n\n"}, {"id": "joint", "title": "Joint Probability", "url": "part3/joint", "text": "\n \nJoint Probability\n\n\n\n\tMany interesting problems involve not one random variable, but rather several interacting with one another. 
In order to create interesting probabilistic models and to reason in real world situations, we are going to need to learn how to consider several random variables jointly.\n\nIn this section we are going to use disease prediction as a working example to introduce you to the concepts involved in probabilistic models. The general question is: a person has a set of observed symptoms. Given the symptoms, what is the probability of each possible disease? \nWe have already considered events that co-occur and covered concepts such as independence and conditional probability. What is new about this section is (1) we are going to cover how to handle random variables which co-occur and (2) we are going to talk about how computers can reason under large probabilistic models.\n\nJoint Probability Functions\nFor single random variables, the most important information was the PMF or, if the variable was continuous, the PDF. When dealing with two or more variables, the equivalent function is called the Joint function. For discrete random variables, it is a function which takes in a value for each variable and returns the probability (or probability density for continuous variables) that each variable takes on its value. For example if you had two discrete variables the Joint function is:\n $$\n \\begin{align}\n \\p(X=x,Y=y) && \\text{Joint function for $X$ and $Y$}\n \\end{align}\n $$\n You should read the comma as an \"and\" and as such this is saying the probability that $X=x$ and $Y=y$. Again like for single variables, as shorthand, we often write just the values and it implies that we are talking about the probability of the random variables taking on those values. This notation is convenient because it is shorter, and it makes it explicit that the function is operating over two parameters. 
It requires you to recall that the event is a random variable taking on the given value.\n $$\n \\begin{align}\n \\p(x,y) && \\text{Shorthand for }\\p(X=x,Y=y)\n \\end{align}\n $$\n If any of the variables are continuous we use different notation to make it clear that we need a probability density function, something we can integrate over to get a probability. We will cover this in detail:\n $$\n \\begin{align}\n f(X=x,Y=y) && \\text{Joint density function if $X$ or $Y$ are continuous}\n \\end{align}\n $$\n The same idea extends to as many variables as you have in your model. For example if you had three discrete random variables $X$, $Y$, and $Z$, the joint probability function would state the likelihood of an assignment to all three: $\\p(X=x,Y=y,Z=z)$.\n\n \n\n\tJoint Probability Tables\n\n\n\nDefinition: Joint Probability Table\n\t\tA joint probability table is a way of specifying the \"joint\" distribution between multiple random variables. It does so by keeping a multi-dimensional lookup table (one dimension per variable) so that the probability mass of any assignment, e.g. $\\p(X=x,Y=y, \\dots)$, can be directly looked up.\n\t\n\nLet us start with an example. In 2020 the Covid-19 pandemic disrupted lives around the world. Many people were unable to get tested and had to determine whether or not they were sick based on home diagnosis. Let's build a very simple probabilistic model to enable us to make a tool which can predict the probability of having the illness given observed symptoms. To make it clear that this is a pedagogical example, let's consider a made up illness called Determinitis. 
The two main symptoms are fever and loss of smell.\n\n\nVariable\nSymbol\nType\n\n\nHas Determinitis\n$D$\nBernoulli (1 indicates has Determinitis)\n\n\nFever\n$F$\nCategorical (none, low, high)\n\n\nCan Smell\n$S$\nBernoulli (1 indicates can smell)\n\n\nA joint probability table is a brute force way to store the probability mass of a particular assignment of values to our variables. Here is a probabilistic model for our three random variables (aside: the values in this joint are realistic and based on research, but are primarily for teaching. Consult a doctor before making medical decisions).\n\n\n\n\n$D=0$\n\t<%\n\tinclude('templates/jointGrid.html', data = [\n\t\t\t[0.024, 0.783],\n\t\t\t[0.003,0.092],\n\t\t\t[0.001,0.046]\n\t\t],\n\t\tkey ='covid',\n\t\trows = ['$F = \\\\text{none}$', '$F=\\\\text{low}$', '$F=\\\\text{high}$'],\n\t\tcols = ['$S=0$', '$S=1$'],\n\t)\n\t\n\n$D=1$\n\t<%\n\tinclude('templates/jointGrid.html', data = [\n\t\t\t[0.006, 0.014],\n\t\t\t[0.005,0.011],\n\t\t\t[0.004,0.011]\n\t\t],\n\t\tkey ='noCovid',\n\t\trows = ['$F = \\\\text{none}$', '$F=\\\\text{low}$', '$F=\\\\text{high}$'],\n\t\tcols = ['$S=0$', '$S=1$'],\n\t)\n\n\n\n\tA few key observations:\n\t\nEach cell in this table represents the probability of one assignment of variables. For example the probability that someone can't smell, $S=0$, has a low fever, $F=\\text{low}$, and has the illness, $D=1$, can be directly read off the table: $P(D=1,S=0,F=\\text{low}) = 0.005$. \n\t\tThese are joint probabilities not conditional probabilities. The value 0.005 is the probability of illness, no smell and low fever all occurring together. It is not the probability of no smell and low fever given illness. A table which stored conditional probabilities would be called a conditional probability table, this is a joint probability table.\n\t\tIf you sum over all cells, the total will be 1. 
Each cell is a mutually exclusive combination of events and the cells are meant to span the entire space of possible outcomes.\n\t\t\tThis table is large! We can count the number of cells using the step rule of counting. If $n_i$ is the number of different values that random variable $i$ can take on, the number of cells in the joint table is $\\prod_i n_i$.\n\t\n\nProperties of Joint Distributions\nThere are many properties of a joint distribution, some of which we will dive into extensively. Here is a brief summary. Each joint distribution has:\n\n\n\n\n\nProperty\nNotation Example\nDescription\n\n\n\nDistribution Function (PMF or PDF)\n$\\P(X=x,Y=y,\\dots)$ or $f(X=x,Y=y,\\dots)$\nA function which maps values the RV can take on to likelihood.\n\n\nCumulative Distribution Function (CDF)\n$F(X < x,Y < y, \\dots)$\nProbability that each variable is less than its corresponding parameter\n\n\nCovariance\n$\\sigma_{X,Y}$\nA measure of how much two random variables vary together.\n\n\nCorrelation\n$\\rho_{X,Y}$\nNormalized co-variance\n\n\n\n\n"}, {"id": "joint", "title": "Joint Probability", "url": "part3/joint", "text": "\nLet us start with a humble example: imagine you care about the sort of music that a person likes. Spotify, a popular music application, breaks all songs into a set of features. There are 9 different features but for now let's focus on three: Acousticness, Danceability and Popularity. All features are scored in a range 0 to 1, so for example if a song has a danceability rating of 1, it is a straight up bop.\nIf you want to understand a person's music taste you could look at the probability distribution over any of those features on their own. 
For example here is one person's estimated distribution of Danceability after they listened to 10,000 songs (recall each song has a Danceability score):\n\n<%\ninclude('templates/barGraph.html', params={\n\t'key':'singleSong',\n\t'data':[\n\t\t[0.1, 0.15],\n\t\t[0.2, 0.05],\n\t\t[0.3, 0.05],\n\t\t[0.4, 0.07],\n\t\t[0.5, 0.10],\n\t\t[0.6, 0.50],\n\t\t[0.7, 0.70],\n\t\t[0.8, 0.80],\n\t\t[0.9, 0.40],\n\t\t[1.0, 0.20]\n\t],\n\t'normalize':'True',\n\t'height':400,\n\t'xLabel':'Danceability Score'\n})\nIn the PMF chapter we covered how you can create a distribution like this from data. This graph of a single variable relating to a person's music taste does not tell the whole picture. \n\n<%\ninclude('templates/jointGrid.html', data = [\n\t\t[0.1, 0.15,0.99, 0.1,0.1],\n\t\t[0.5,0.5,0.5, 0.1,0.1],\n\t\t[0.1,0.9,0.8, 0.1,0.1],\n\t\t[0.5,0.5,0.5, 0.1,0.1],\n\t\t[0.1,0.9,0.8, 0.1,0.1]\n\t],\n\tkey ='songJoint',\n\trows = ['1', '2', '3', '4', '5'],\n\tcols = ['1', '2', '3', '4', '5'],\n\tnormalize = True\n)\n"}, {"id": "multinomial", "title": "Multinomial", "url": "part3/multinomial", "text": "\n \nMultinomial\n\nThe multinomial is an example of a parametric distribution for multiple random variables.\nSay you perform $n$ independent trials of an experiment where each trial results in one of $m$ outcomes, with respective probabilities: $p_1, p_2, \\dots , p_m$ (constrained so that $\\sum_i p_i = 1$). Define $X_i$ to be the number of trials with outcome $i$. A multinomial distribution is a closed form function that answers the question: What is the probability that there are $c_i$ trials with outcome $i$. 
Mathematically:\n\\begin{align*}\nP(X_1=c_1,X_2 = c_2, \\dots , X_m = c_m) = { {n} \\choose {c_1,c_2,\\dots , c_m} }\\cdot p_1^{c_1} \\cdot p_2^{c_2}\\dots p_m^{c_m}\n\\end{align*}\nOften people will use the product notation to write the exact same equation:\n\\begin{align*}\nP(X_1=c_1,X_2 = c_2, \\dots , X_m = c_m) = { {n} \\choose {c_1,c_2,\\dots , c_m} }\\cdot \\prod_i p_i^{c_i} \n\\end{align*}\n\nExample: A 6-sided die is rolled 7 times. What is the probability that you roll: 1 one, 1 two, 0 threes, 2 fours, 0 fives, 3 sixes (disregarding order)?\n\\begin{align*}\nP(X_1=1,X_2 = 1&, X_3 = 0,X_4 = 2,X_5 = 0,X_6 = 3) \\\\&= \\frac{7!}{2!3!}\\left(\\frac{1}{6}\\right)^1\\left(\\frac{1}{6}\\right)^1\\left(\\frac{1}{6}\\right)^0\\left(\\frac{1}{6}\\right)^2\\left(\\frac{1}{6}\\right)^0\\left(\\frac{1}{6}\\right)^3\\\\\n&=420\\left(\\frac{1}{6}\\right)^7\n\\end{align*}\n\nThe multinomial is especially popular because of its use as a model of language. For a full example see the Federalist Paper Authorship example.\n\n"}, {"id": "continuous_joint", "title": "Continuous Joint", "url": "part3/continuous_joint", "text": "\n \nContinuous Joint\n\nRandom variables $X$ and $Y$ are Jointly Continuous if there exists a joint Probability Density Function (PDF) $f$ such that:\n\\begin{align*}\n P(a_1 < X \\leq a_2,b_1 < Y \\leq b_2) = \\int_{a_1}^{a_2} \\int_{b_1}^{b_2} f(X=x,Y=y)\\d y \\text{ } \\d x\n\\end{align*}\nUsing the PDF we can compute marginal probability densities:\n\\begin{align*}\n f(X=a) &= \\int_{-\\infty}^{\\infty}f(X=a,Y=y)\\d y \\\\\n f(Y=b) &= \\int_{-\\infty}^{\\infty}f(X=x,Y=b)\\d x \n\\end{align*}\n\nLet $F(x,y)$ be the Cumulative Distribution Function (CDF):\n\\begin{align*}\n P(a_1 < X \\leq a_2,b_1 < Y \\leq b_2) = F(a_2,b_2) - F(a_1,b_2) + F(a_1,b_1) - F(a_2,b_1)\n\\end{align*}\n\nFrom Discrete Joint to Continuous Joint\nThinking about multiple continuous random variables jointly can be unintuitive at first blush. 
But we can turn to the helpful trick that we use to understand continuous random variables: start with a discrete approximation. Consider the example of creating the CS109 seal. It was generated by throwing half a million darts at an image of the Stanford logo (keeping all the pixels that get hit by at least one dart). The darts could hit any continuous location on the logo, and the locations are not equally likely. Instead, the location a dart hits is governed by a joint continuous distribution. In this case there are only two simultaneous random variables, the x location of the dart and the y location of the dart. Each random variable is continuous (it takes on real numbers). Thinking about the joint probability density function is easier by first considering a discretization. I am going to break the dart landing area into 25 discrete buckets:\n\n\n\n\n\n\n\n\n\n\nOn the left is a visualization of the probability mass of this joint distribution, and on the right is a visualization of how we could answer the question: what is the probability that a dart hits within a certain distance of the center. For each bucket there is a single number, the probability that a dart will fall into that particular bucket (the buckets are mutually exclusive, so these probabilities sum to 1).\nOf course this discretization only approximates the joint probability distribution. In order to get a better approximation we could create more fine-grained discretizations. In the limit we can make our buckets infinitely small, and the value associated with each bucket becomes a second derivative of probability.\n\n\n\n\n\n\n\n\n\n\nTo represent the 2D probability density in a graph, we use the darkness of a value to represent the density (darker means more density). Another way to visualize this distribution is from an angle. This makes it easier to realize that this is a function with two inputs and one output. 
Below is a different visualization of the exact same density function:\n\nJust like in the single random variable case, we are now representing our belief in the continuous random variables as densities rather than probabilities. Recall that a density represents a relative belief. If the density $f(X = 1.1, Y = 0.9)$ is twice as high as the density $f(X = 1.1, Y = 1.1)$, the function is expressing that values near $X = 1.1$ and $Y=0.9$ are twice as likely as values near $X = 1.1$ and $Y = 1.1$.\n\nMultivariate Gaussian\nThe density that is depicted in this example happens to be a particular kind of joint continuous distribution called the Multivariate Gaussian. In fact it is a special case where all of the constituent variables are independent.\n\nDef: Independent Multivariate Gaussian\nAn Independent Multivariate Gaussian can model a collection of continuous joint random variables $\\vec{X} = (X_1 \\dots X_n)$ as being a composition of independent normals with means $\\vec{\\mu} = (\\mu_1 \\dots \\mu_n)$ and standard deviations $\\vec{\\sigma} = (\\sigma_1 \\dots \\sigma_n)$. Notice how we now have variables in vectors (similar to a list in python).\n\nThe notation for the multivariate uses vector notation:\n\\begin{align*}\n\\vec{X} \\sim \\vec{\\N}(\\vec{\\mu}, \\vec{\\sigma})\n\\end{align*}\n\n The joint PDF is:\n\\begin{align*}\nf(\\vec{x}) \n&= \\prod_{i=1}^n f(x_i) \\\\\n&= \\prod_{i=1}^n \\frac{1}{\\sigma_i \\sqrt{2\\pi} } e ^{\\frac{-(x_i-\\mu_i)^2}{2\\sigma_i^2}}\n\\end{align*}\n\nAnd the joint CDF is\n\\begin{align*}\nF(\\vec{x}) \n&= \\prod_{i=1}^n F(x_i) \\\\\n&= \\prod_{i=1}^n \\Phi(\\frac{x_i-\\mu_i}{\\sigma_i})\n\\end{align*}\n\n\nExample: Gaussian Blur\nIn the same way that many single random variables are assumed to be gaussian, many joint random variables may be assumed to be Multivariate Gaussian. Consider this example of Gaussian Blur:\n\n\n In image processing, a Gaussian blur is the result of blurring an image by a Gaussian function. 
It is a widely used effect in graphics software, typically to reduce image noise. A Gaussian blur works by convolving an image with a 2D independent multivariate gaussian (with means of 0 and equal valued standard deviations). \n\n\n\n\n\nIn order to use a Gaussian blur, you need to be able to compute the probability mass of that 2D gaussian in the space of pixels. Each pixel is given a weight equal to the probability that X and Y are both within the pixel bounds. The center pixel covers the area where \n $-0.5 \u2264 x \u2264 0.5$ and $-0.5 \u2264 y \u2264 0.5$. Let's do one step in computing the Gaussian function discretized over image space. What is the weight of the center pixel for gaussian blur with a multivariate gaussian which has means of 0 and standard deviation of 3?\n\nLet $\\vec{B}$ be the multivariate gaussian, $\\vec{B} \\sim \\N(\\vec{\\mu} = [0,0], \\vec{\\sigma} = [3,3])$. Let's compute the CDF of this multivariate gaussian $F(x_1,x_2)$:\n\\begin{align*}\n F(x_1,x_2) \n&= \\prod_{i=1}^n \\Phi(\\frac{x_i-\\mu_i}{\\sigma_i}) \\\\\n&= \\Phi(\\frac{x_1-\\mu_1}{\\sigma_1}) \\cdot \\Phi(\\frac{x_2-\\mu_2}{\\sigma_2}) \\\\\n&= \\Phi(\\frac{x_1}{3}) \\cdot \\Phi(\\frac{x_2}{3})\n\\end{align*}\n\n\nNow we are ready to calculate the weight of the center pixel:\n \\begin{align*}\n \\P&(-0.5 < X_1 \\leq 0.5,-0.5 < X_2 \\leq 0.5) \\\\\n &= F(0.5,0.5) - F(-0.5,0.5) + F(-0.5,-0.5) - F(0.5,-0.5) \\\\\n &=\\Phi(\\frac{0.5}{3}) \\cdot \\Phi(\\frac{0.5}{3}) \n - \\Phi(\\frac{-0.5}{3}) \\cdot \\Phi(\\frac{0.5}{3}) \n + \\Phi(\\frac{-0.5}{3}) \\cdot \\Phi(\\frac{-0.5}{3})\n - \\Phi(\\frac{0.5}{3}) \\cdot \\Phi(\\frac{-0.5}{3})\\\\\n &\\approx 0.0175\n \\end{align*}\n\n How can this 2D gaussian blur the image? 
Wikipedia explains: \"Since the Fourier transform of a Gaussian is another Gaussian, applying a Gaussian blur has the effect of reducing the image's high-frequency components; a Gaussian blur is a low pass filter\" [2].\n\n"}, {"id": "inference", "title": "Inference", "url": "part3/inference", "text": "\n \nInference\n\nSo far we have set the foundation for how we can represent probabilistic models with multiple random variables. These models are especially useful because they let us perform a task called \"inference\" where we update our belief about one random variable in the model, conditioned on new information about another. Inference in general is hard! In fact, it has been proven that in the worst case the inference task is NP-hard in $n$, the number of random variables [1]. \n\nFirst we are going to practice it with two random variables (in this section). Then, later in this unit we are going to talk about inference in the general case, with many random variables. \n\n Earlier we looked at conditional probabilities for events. The first task in inference is to understand how to combine conditional probabilities and random variables. The equations for both the discrete and continuous case are intuitive extensions of our understanding of conditional probability:\n\nThe Discrete Conditional\nThe discrete case, where every random variable in your model is discrete, is a straightforward combination of what you know about conditional probability (which you learned in the context of events). Recall that every relational operator applied to a random variable defines an event. 
As such the rules for conditional probability directly apply: \nThe conditional probability mass function (PMF) for the discrete case:\n\n\n Let $X$ and $Y$ be discrete random variables.\n Def: Conditional definition with discrete random variables.\n\n\n\n\\begin{align*}\n \\P(X=x|Y=y)=\\frac{P(X=x,Y=y)}{P(Y=y)}\n\\end{align*}\n\nDef: Bayes' theorem with discrete random variables.\n\n\\begin{align*}\n \\P(X=x|Y=y)=\\frac{P(Y=y|X=x)P(X=x)}{P(Y=y)}\n\\end{align*}\n\nIn the presence of multiple random variables, it becomes increasingly useful to use shorthand! The above definition is identical to this notation where a lowercase symbol such as $x$ is shorthand for the event $X=x$:\n\\begin{align*}\n \\P(x|y)=\\frac{P(x,y)}{P(y)}\n\\end{align*}\nThe conditional definition works for any event and as such we can also write conditionals using cumulative distribution functions (CDFs) for the discrete case:\n\\begin{align*}\n \\P(X \\leq a | Y=y) \n &= \\frac{\\P(X \\leq a, Y=y)}{\\p(Y=y)} \\\\\n &= \\frac{\\sum_{x\\leq a} \\P(X=x,Y=y)}{\\P(Y=y)} \n\\end{align*}\n\nHere is a neat result: this last term can be rewritten, by a clever manipulation. We can make the sum extend over the whole fraction:\n\n\\begin{align*}\n \\P(X \\leq a | Y=y) \n &= \\frac{\\sum_{x\\leq a} \\P(X=x,Y=y)}{\\P(Y=y)} \\\\\n &= \\sum_{x\\leq a} \\frac{\\P(X=x,Y=y)}{\\P(Y=y)} \\\\\n &= \\sum_{x\\leq a} \\P(X=x|Y=y) \n\\end{align*}\n\nIn fact it becomes straightforward to translate the rules of probability (such as Bayes' theorem, law of total probability, etc) to the language of discrete random variables: we simply need to recall that every relational operator applied to a random variable defines an event.\n\n\nMixing Discrete and Continuous\nWhat happens when we want to reason about continuous random variables using our rules of probability (such as Bayes' theorem, law of total probability, chain rule, etc)? 
There is a simple practical answer: the rules still apply, but we have to replace probabilities of continuous random variables with probability densities. As a concrete example let's look at Bayes' Theorem with one continuous random variable.\n\nDef: Bayes' Theorem with mixed discrete and continuous.\n\nLet $X$ be a continuous random variable and let $N$ be a discrete random variable. The conditional probabilities of $X$ given $N$ and $N$ given $X$ respectively are:\n\\begin{align*}\n f(X=x|N=n) = \\frac{\\P(N=n|X=x)f(X=x)}{\\p(N=n)} &&\n\\end{align*}\n\\begin{align*}\n \\p(N=n|X=x) = \\frac{f(X=x|N=n)\\p(N=n)}{f(X=x)} \n\\end{align*}\n\nThese equations might seem complicated since they mix probability densities and probabilities. Why should we believe that they are correct? First, observe that anytime the random variable on the left hand side of the conditional is continuous, we use a density, and whenever it is discrete, we use a probability. This result can be derived by making the observation:\n $$\n \\P(X = x) = f(X=x) \\cdot \\epsilon_x\n $$\nin the limit as $\\epsilon_x \\rightarrow 0$. The way to obtain a probability from a density function is to integrate under the function. If you wanted to approximate the probability that $X = x$ you could consider the area created by a rectangle which has height $f(X=x)$ and some very small width. As that width gets smaller, your answer becomes more accurate:\n\n \nA value of $\\epsilon_x$ is problematic if it is left in a formula. However, if we can get them to cancel, we can arrive at a working equation. This is the key insight used to derive the rules of probability in the context of one or more continuous random variables. 
Again, let $X$ be a continuous random variable and let $N$ be a discrete random variable:\n\n\n\\begin{align*}\n \\p(N=n|X=x) &= \\frac{P(X=x|N=n)\\p(N=n)}{P(X=x)} &&\\text{Bayes' Theorem}\\\\\n &= \\frac{f(X=x|N=n) \\cdot \\epsilon_x \\cdot \\p(N=n)}{f(X=x) \\cdot \\epsilon_x} &&\\P(X = x) = f(X=x) \\cdot \\epsilon_x \\\\\n &= \\frac{f(X=x|N=n) \\cdot \\p(N=n)}{f(X=x)} &&\\text{Cancel } \\epsilon_x \\\\\n\\end{align*}\n\nThis strategy applies beyond Bayes' Theorem. For example here is a version of the Law of Total Probability when $X$ is continuous and $N$ is discrete:\n\n\\begin{align*}\n f(X=x) &= \\sum_{n \\in N} f(X=x | N = n) \\p(N = n)\n\\end{align*}\nProbability Rules with Continuous Random Variables\nThe strategy used in the above section can be used to derive the rules of probability in the presence of continuous random variables. The strategy also works when there are multiple continuous random variables. For example here is Bayes' Theorem with two continuous random variables.\n\nDef: Bayes' Theorem with continuous random variables.\n\n\nLet $X$ and $Y$ be continuous random variables.\n\\begin{align*}\n f(X=x|Y=y) = \\frac{f(X=x,Y=y)}{f(Y=y)} = \\frac{f(Y=y|X=x)f(X=x)}{f(Y=y)}\n\\end{align*}\n\nExample: Inference with a Continuous Variable\nConsider the following question:\n\nQuestion: At birth, girl elephant weights are distributed as a Gaussian with mean 160kg, and standard deviation 7kg. At birth, boy elephant weights are distributed as a Gaussian with mean 165kg, and standard deviation of 3kg. All you know about a newborn elephant is that it is 163kg. What is the probability that it is a girl? \n\n\nAnswer: Let $G$ be an indicator that the elephant is a girl. $G$ is Bern(p = 0.5)\nLet $X$ be the weight of the elephant. 
\n$X | G = 1$ is $N(\u03bc = 160, \u03c3^2 = 7^2)$\n$X | G = 0$ is $N(\u03bc = 165, \u03c3^2 = 3^2)$\n\n\\begin{align*}\n\\p(G = 1 | X = 163) &= \\frac{f(X = 163 | G = 1) \\P(G = 1)}{f(X = 163)} && \\text{Bayes} \n\\end{align*}\n\nIf we can solve this equation we will have our answer. What is $f(X = 163 | G = 1)$? It is the probability density function of a gaussian $X$ which has $\\mu=160, \\sigma^2 = 7^2$, evaluated at the point $x = 163$:\n\n\\begin{align*}\nf(X = 163 | G = 1) &= \\frac{1}{\\sigma \\sqrt{2 \\pi}} e^{-\\frac{1}{2}\\Big(\\frac{x-\\mu}{\\sigma}\\Big)^2} && \\text{PDF Gauss} \\\\\n&= \\frac{1}{7 \\sqrt{2 \\pi}} e^{-\\frac{1}{2}\\Big(\\frac{163-160}{7}\\Big)^2} && \\text{PDF } X \\text{ at } 163\n\\end{align*}\n\n\nNext we note that $\\P(G = 0) = \\P(G = 1) = \\frac{1}{2}$. Putting this all together, and using the law of total probability to compute the denominator we get:\n\n\\begin{align*}\n\\p&(G = 1 | X = 163) \\\\\n&= \\frac{f(X = 163 | G = 1) \\P(G = 1)}{f(X = 163)} \\\\\n\n&= \\frac{f(X = 163 | G = 1) \\P(G = 1)}{f(X = 163 | G = 1) \\P(G = 1) + f(X = 163 | G = 0) \\P(G = 0)}\\\\\n\n&= \\frac{\\frac{1}{7 \\sqrt{2 \\pi}} e^{-\\frac{1}{2}\\Big(\\frac{163-160}{7}\\Big)^2} \\cdot \\frac{1}{2}}{\\frac{1}{7 \\sqrt{2 \\pi}} e^{-\\frac{1}{2}\\Big(\\frac{163-160}{7}\\Big)^2} \\cdot \\frac{1}{2} + \\frac{1}{3 \\sqrt{2 \\pi}} e^{-\\frac{1}{2}\\Big(\\frac{163-165}{3}\\Big)^2} \\cdot \\frac{1}{2}} \\\\\n\n&= \\frac\n{\\frac{1}{7} e^{-\\frac{1}{2}\\Big(\\frac{9}{49}\\Big)} }\n{\\frac{1}{7} e^{-\\frac{1}{2}\\Big(\\frac{9}{49}\\Big)} + \\frac{1}{3} e^{-\\frac{1}{2}\\Big(\\frac{4}{9}\\Big)} }\\\\\n&\\approx 0.328\n\\end{align*}\n\n\n"}, {"id": "bayesian_networks", "title": "Bayesian Networks", "url": "part3/bayesian_networks", "text": "\n \nBayesian Networks\n\nAt this point in the reader we have developed tools for analytically solving for probabilities. 
We can calculate the likelihood of random variables taking on values, even if they are interacting with other random variables (which we have called multi-variate models, or we say the random variables are jointly distributed). We have also started to study samples and sampling. \n\nConsider the WebMd Symptom Checker. WebMD have built a probabilistic model with random variables which roughly fall under three categories: symptoms, risk factors and diseases. For any combination of observed symptoms and risk factors, they can calculate the probability of any disease. For example, they can calculate the probability that I have influenza given that I am a 21-year-old female who has a fever and who is tired: $P(I = 1 | A = 21, G = 1, T = 1, F = 1)$. Or they could calculate the probability that I have a cold given that I am a 30-year-old with a runny nose: $P(C = 1 | A = 30, R = 1)$. At first blush this might not seem difficult. But as we dig deeper we will realize just how hard it is. There are two challenges: (1) Modelling: sufficiently specifying the probabilistic model and (2) Inference: calculating any desired probability.\n\nBayesian Networks\nBefore we jump into how to solve probability (aka inference) questions, let's take a moment to go over how an expert doctor could specify the relationship between so many random variables. Ideally we could have our expert sit down and specify the entire \"joint distribution\" (see the first lecture on multi-variable models). She could do so either by writing a single equation that relates all the variables (which is as impossible as it sounds), or she could come up with a joint distribution table where she specifies the probability of any possible combination of assignments to variables. It turns out that is not feasible either. Why? Imagine there are $N = 100$ binary random variables in our WebMD model. 
Our expert doctor would have to specify a probability for each of the $2^N > 10^{30}$ combinations of assignments to those variables, which is far more than anyone could ever write down. Thankfully, there is a better way. We can simplify our task if we know the \"generative\" process that creates a joint assignment. Based on the generative process we can make a data structure known as a Bayesian Network. Here are two networks of random variables for diseases:\n\n\nFor diseases the flow of influence is directed. The states of \"demographic\" random variables influence whether someone has particular \"conditions\", which influence whether someone shows particular \"symptoms\". On the right is a simple model with only four random variables. Though this is a less interesting model it is easier to understand when first learning Bayesian Networks. Being in university (binary) influences whether or not someone has influenza (binary). Having influenza influences whether or not someone has a fever (binary) and the states of university and influenza together influence whether or not someone feels tired (also binary).\n\nIn a Bayesian Network an arrow from random variable $X$ to random variable $Y$ articulates our assumption that $X$ directly influences the likelihood of $Y$. We say that $X$ is a parent of $Y$. To fully define the Bayesian network we must provide a way to compute the probability of each random variable ($X_i$) conditioned on knowing the value of all their parents: $P(X_i = k | \\text{Parents of }X_i \\text{ take on specified values})$. Here is a concrete example of what needs to be defined for the simple disease model. 
Recall that each of the random variables is binary:\n\\begin{align*}\n& P(\\text{Uni} = 1) = 0.8 \\\\\n& P(\\text{Influenza} = 1 | \\text{Uni} = 1) = 0.2 \n&& P(\\text{Fever} = 1 | \\text{Influenza} = 1) = 0.9 \\\\\n& P(\\text{Influenza} = 1 | \\text{Uni} = 0) = 0.1\n&& P(\\text{Fever} = 1 | \\text{Influenza} = 0) = 0.05 \\\\\n& P(\\text{Tired} = 1 | \\text{Uni} = 0, \\text{Influenza} = 0) = 0.1 \n&& P(\\text{Tired} = 1 | \\text{Uni} = 0, \\text{Influenza} = 1) = 0.9 \\\\\n& P(\\text{Tired} = 1 | \\text{Uni} = 1, \\text{Influenza} = 0) = 0.8\n&& P(\\text{Tired} = 1 | \\text{Uni} = 1, \\text{Influenza} = 1) = 1.0 \n\\end{align*}\n\nLet's put this in programming terms. All that we need to do in order to code up a Bayesian network is to define a function:\ngetProbXi(i, k, parents) which returns the probability that $X_i$ (the random var with index i) takes on the value k given a value for each of the parents of $X_i$ encoded by the list parents: $P(X_i = x_i | \\text{Values of parents of }X_i)$\n\n\nDeeper understanding: The reason that a Bayes Net is so useful is that the \"joint\" probability can be expressed in exponentially less space as the product of the probabilities of each random variable conditioned on its parents! Without loss of generality, let $X_i$ refer to the $i$th random variable (such that if $X_i$ is a parent of $X_j$ then $i < j$):\n\\begin{align*}\n P&(\\text{Joint}) = P(X_1 = x_1, \\dots, X_n = x_n) = \\prod_i P(X_i = x_i | \\text{Values of parents of }X_i )\n\\end{align*} \n\nWhat assumptions are implicit in a Bayes' Net? Using the chain rule we can decompose the exact joint probability for $n$ random variables. 
To make the following math easier to digest I am going to use $x_i$ as shorthand for the event that $X_i = x_i$:\n\\begin{align*}\nP(x_1, \\dots, x_n) \n&= \\prod_i P(x_i | x_{i-1}, \\dots, x_1)\\\\\n\\end{align*}\n\nBy looking at the difference in the two equations, we can see that a Bayes' Net is assuming that \n$$P(x_i | x_{i-1}, \\dots, x_1) = P(x_i | \\text{Values of parents of }X_i)$$\nThis is a conditional independence statement. It is saying that once you know the value of the parents of a variable in your network, $X_i$, any further information about non-descendants will not change your belief in $X_i$. Formally we say that $X_i$ is conditionally independent of its non-descendants, given its parents. What is a non-descendant again? In a graph, a descendant of $X_i$ is anything which is in the subtree that starts at $X_i$. Everything else is a non-descendant. Non-descendants include the \"ancestor\" nodes of $X_i$ as well as nodes which are totally unconnected to $X_i$. When designing Bayes' Nets you don't have to think about this assumption directly. It turns out to be a naturally good assumption if the arrows between your nodes follow a causal path.\n\nDesigning a Bayes Net\nThere are several steps to designing a Bayes Net. \n\nChoose your random variables, and make them nodes. \n\tAdd edges, often based off your assumptions about which nodes directly cause which others.\n\t\tDefine $P(X_i = x_i | \\text{Values of parents of }X_i )$ for all nodes.\n\nAs you might have guessed, we can do step (2) and (3) by hand, or, we can have computers try and perform those tasks based on data. The first task is called \"structure learning\" and the second is an instance of \"machine learning.\" There are fully autonomous solutions to structure learning -- but they only work well if you have a massive amount of data. 
Alternatively people will often compute a statistic called correlation between all pairs of random variables to help in the art form of designing a Bayes' net. \n\nIn the next part of the reader we are going to talk about how we could learn $P(X_i = x_i | \\text{Values of parents of }X_i )$ from data. For now let's start with the (reasonable) assumption that an expert can write down these functions as equations or as python: getProbXi. \n\nNext Steps\nGreat! We have a feasible way to define a large network of random variables. First challenge complete. We haven't talked about continuous or multinomial random variables in Bayes Nets. None of the theory changes: the expert will just have to define getProbXi to handle more values of k than 0 or 1.\nA Bayesian network is not very interesting to us unless we can use it to solve different conditional probability questions. How can we perform \"inference\" on a network as complex as a Bayesian network? \n"}, {"id": "independent_vars", "title": "Independence in Variables", "url": "part3/independent_vars", "text": "\n \nIndependence in Variables\n\nDiscrete\nTwo discrete random variables $X$ and $Y$ are called independent if:\n\\begin{align*}\n\\P(X=x,Y=y) = \\P(X=x)\\P(Y=y) \\text{ for all } x,y\n\\end{align*}\nIntuitively: knowing the value of $X$ tells us nothing about the distribution of $Y$. If two variables are not independent, they are called dependent. This is conceptually similar to independent events, but we are dealing with multiple variables. 
Make sure to keep your events and variables distinct.\n\nContinuous\nTwo continuous random variables $X$ and $Y$ are called independent if:\n\\begin{align*}\n\\P(X\\leq a, Y \\leq b) = \\P(X \\leq a)\\P(Y \\leq b) \\text{ for all } a,b\n\\end{align*}\nThis can be stated equivalently using either the CDF or the PDF:\n\\begin{align*}\nF_{X,Y}(a,b) &= F_{X}(a)F_{Y}(b) \\text{ for all } a,b \\\\\nf(X=x,Y=y) &= f(X=x)f(Y=y) \\text{ for all } x,y \n\\end{align*}\nMore generally, if you can factor the joint density function (or the joint probability function for discrete random variables) then your random variables are independent:\n\\begin{align*}\n\n&f(X=x,Y=y) = h(x)g(y) \\\\\n&\\P(X=x,Y=y) = h(x)g(y)\n\\end{align*}\n\n\nExample: Showing Independence\n\nLet $N$ be the number of requests to a web server per day, and suppose that $N \\sim \\Poi(\\lambda)$. Each request comes from a human with probability $p$ or from a \"bot\" with probability $(1 - p)$. Define $X$ to be the number of requests from humans per day and $Y$ to be the number of requests from bots per day. Show that the number of requests from humans, $X$, is independent of the number of requests from bots, $Y$.\n\nSince requests come in independently, the probability of $X$ conditioned on knowing the number of requests is a Binomial. Specifically:\n\\begin{align*}\n(X|N) &\\sim \\Bin(N,p)\\\\\n(Y|N) &\\sim \\Bin(N, 1-p)\n\\end{align*}\nTo get started we need to first write an expression for the joint probability of $X$ and $Y$. To do so, we use the chain rule:\n\\begin{align*}\n \\P(X=x,Y=y) = \\P(X = x, Y=y|N = x+y)\\P(N = x+y)\n\\end{align*}\nWe can calculate each term in this expression. 
The first term is the PMF of the binomial $X|N$ having $x$ \"successes.\" The second term is the probability that the Poisson $N$ takes on the value $x+y$ :\n \\begin{align*}\n &\\P(X = x, Y=y|N = x+y) = { {x + y} \\choose x}p^x(1-p)^y \\\\\n &\\P(N = x + y) = e^{-\\lambda}\\frac{\\lambda^{x+y}}{(x+y)!} \n\\end{align*}\nNow we can put those together to get an expression for the joint:\n\\begin{align*}\n &\\P(X = x, Y=y) = { {x + y} \\choose x}p^x(1-p)^y e^{-\\lambda}\\frac{\\lambda^{x+y}}{(x+y)!}\n\\end{align*}\nAt this point we have derived the joint distribution over $X$ and $Y$. In order to show that these two are independent, we need to be able to factor the joint:\n\n\\begin{align*}\n \\P&(X = x, Y=y) \\\\\n &= { {x + y} \\choose x}p^x(1-p)^y e^{-\\lambda}\\frac{\\lambda^{x+y}}{(x+y)!} \\\\\n &= \\frac{(x+y)!}{x! \\cdot y!} p^x(1-p)^y e^{-\\lambda}\\frac{\\lambda^{x+y}}{(x+y)!} \\\\\n &= \\frac{1}{x! \\cdot y!} p^x(1-p)^y e^{-\\lambda}\\lambda^{x+y} && \\text{Cancel (x+y)!} \\\\\n &= \\frac{p^x \\cdot \\lambda^x}{x!} \\cdot \\frac{(1-p)^y \\cdot \\lambda ^{y}}{y!} \\cdot e^{-\\lambda} && \\text{Rearrange} \\\\\n\\end{align*}\n\nBecause the joint can be factored into a term that only has $x$ and a term that only has $y$ (the constant $e^{-\\lambda}$ can be absorbed into either term), the random variables are independent.\n\nSymmetry of Independence\nIndependence is symmetric. That means that if random variables $X$ and $Y$ are independent, $X$ is independent of $Y$ and $Y$ is independent of $X$. This claim may seem meaningless but it can be very useful. Imagine a sequence of random variables $X_1,X_2, \\dots$. Let $A_i$ be the event that $X_i$ is a \"record value\" (e.g. it is larger than all previous values). Is $A_{n+1}$ independent of $A_n$? It is easier to answer that $A_n$ is independent of $A_{n+1}$. 
By symmetry of independence both claims must be true.\n\nExpectation of Products\n\nLemma: Product of Expectation for Independent Random Variables:\n\tIf two random variables $X$ and $Y$ are independent, the expectation of their product is the product of the individual expectations.\n\\begin{align*}\n &E[X \\cdot Y] = E[X] \\cdot E[Y] && \\text{ if $X$ and $Y$ are independent}\\\\\n &E[g(X)h(Y)] = E[g(X)]E[h(Y)] && \\text{ where $g$ and $h$ are functions} \n\\end{align*}\nNote that this assumes that $X$ and $Y$ are independent. Contrast this with the sum version of this rule (the expectation of a sum of random variables is the sum of the individual expectations), which does not require the random variables to be independent.\n\t\n"}, {"id": "correlation", "title": "Correlation", "url": "part3/correlation", "text": "\n \nCorrelation\n\nCovariance\nCovariance is a quantitative measure of the extent to which the deviation of one variable from its mean matches the deviation of the other from its mean. It is a mathematical relationship that is defined as:\n\\begin{align*}\n \\text{Cov}(X,Y) = E[(X-E[X])(Y-E[Y])]\n\\end{align*}\nThat is a little hard to wrap your mind around (but worth pushing on a bit). The outer expectation will be a weighted sum of the inner function evaluated at a particular $(x,y)$ weighted by the probability of $(x,y)$. If $x$ and $y$ are both above their respective means, or if $x$ and $y$ are both below their respective means, that term will be positive. If one is above its mean and the other is below, the term is negative. If the weighted sum of terms is positive, the two random variables will have a positive correlation. 
We can rewrite the above equation to get an equivalent equation:\n\\begin{align*}\n \\text{Cov}(X,Y) = E[XY] - E[Y]E[X]\n\\end{align*}\n\n\n\n\t\nLemma: Correlation of Independent Random Variables:\nIf two random variables $X$ and $Y$ are independent, then their covariance must be 0.\n\n\\begin{align*}\n \\text{Cov}(X,Y) &= E[XY] - E[Y]E[X] && \\text{ Def of Cov} \\\\\n &= E[X]E[Y] - E[Y]E[X] && \\text{ Lemma Product of Expectation} \\\\\n &= 0\n\\end{align*}\nNote that the reverse claim is not true. Covariance of 0 does not prove independence.\n\t\nUsing this equation (and the product lemma) it is easy to see that if two random variables are independent their covariance is 0. The reverse is not true in general.\n\n\n\nProperties of Covariance\nSay that $X$ and $Y$ are arbitrary random variables:\n\\begin{align*}\n &\\text{Cov}(X,Y) = \\text{Cov}(Y,X) \\\\\n &\\text{Cov}(X,X) = E[X^2] - E[X]E[X] = \\text{Var}(X) \\\\\n &\\text{Cov}(aX +b,Y) = a\\text{Cov}(X,Y)\n\\end{align*}\nLet $X = X_1 + X_2 + \\dots + X_n$ and let $Y = Y_1 + Y_2 + \\dots + Y_m$. The covariance of $X$ and $Y$ is:\n\\begin{align*}\n &\\text{Cov}(X,Y) = \\sum_{i=1}^n \\sum_{j=1}^m\\text{Cov}(X_i,Y_j) \\\\\n &\\text{Cov}(X,X) = \\text{Var}(X) = \\sum_{i=1}^n \\sum_{j=1}^n\\text{Cov}(X_i,X_j) \n\\end{align*}\nThat last property gives us a third way to calculate variance. We can use it to, again, show how to get the variance of a Binomial.\n\n\n\n\n\nCorrelation\nWe left off last class talking about covariance. Covariance was interesting because it was a quantitative measurement of the relationship between two variables. Today we are going to extend that concept to correlation. Correlation between two random variables, $\\rho(X, Y)$, is the covariance of the two variables normalized by the standard deviation of each variable. 
This normalization cancels the units out:\n\\begin{align*}\n \\rho(X,Y) = \\frac{\\text{Cov}(X,Y)}{\\sqrt{\\text{Var}(X)\\text{Var}(Y)}}\n\\end{align*}\nCorrelation measures linearity between $X$ and $Y$. \n\\begin{align*}\n &\\rho(X,Y) = 1 && Y = aX + b \\text{ where } a = \\sigma_y / \\sigma_x \\\\\n &\\rho(X,Y) = -1 && Y = aX + b \\text{ where } a = -\\sigma_y / \\sigma_x \\\\\n &\\rho(X,Y) = 0 && \\text{ absence of linear relationship}\\\\\n\\end{align*}\nIf $\\rho(X, Y) = 0$ we say that $X$ and $Y$ are \"uncorrelated.\"\n\nWhen people use the term correlation, they are actually referring to a specific type of correlation called \"Pearson\" correlation. It measures the degree to which there is a linear relationship between the two variables. An alternative measure is \"Spearman\" correlation, which has a formula almost identical to your regular correlation score, with the exception that the underlying random variables are first transformed into their rank. \"Spearman\" correlation is outside the scope of CS109. \n"}, {"id": "computational_inference", "title": "General Inference", "url": "part3/computational_inference", "text": "\n \nComputational Inference\n\nA Bayesian Network gives us a reasonable way to specify the joint probability of a network of many random variables. Before we celebrate, realize that we still don't know how to use such a network to answer probability questions. There are many techniques for doing so. I am going to introduce you to one of the great ideas in probability for computer science: we can use sampling to solve inference questions on Bayesian networks. Sampling is frequently used in practice because it is relatively easy to understand and easy to implement.\n\nRejection Sampling\nAs a warmup consider what it would take to sample an assignment to each of the random variables in our Bayes net. Such a sample is often called a \"joint sample\" or a \"particle\" (as in a particle of sand). 
To sample a particle, simply sample a value for each random variable one at a time based on the value of the random variable's parents. This means that if $X_i$ is a parent of $X_j$, you will have to sample a value for $X_i$ before you sample a value for $X_j$.\n\nLet's work through an example of sampling a \"particle\" for the Simple Disease Model in the Bayes Net section:\n1. Sample from $P(\\text{Uni} = 1)$: Bern$(0.8)$. Sampled value for Uni is 1.\n2. Sample from $P(\\text{Influenza} = 1 | \\text{Uni} = 1)$: Bern$(0.2)$. Sampled value for Influenza is 0.\n3. Sample from $P(\\text{Fever} = 1 | \\text{Influenza} = 0)$: Bern$(0.05)$. Sampled value for Fever is 0.\n4. Sample from $P(\\text{Tired} = 1 | \\text{Uni} = 1, \\text{Influenza} = 0)$: Bern$(0.8)$. Sampled value for Tired is 0.\nThus the sampled particle is: [Uni = 1, Influenza = 0, Fever = 0, Tired = 0]. If we were to run the process again we would get a new particle (with likelihood determined by the joint probability).\n\nNow our strategy is simple: we are going to generate $N$ samples where $N$ is in the hundreds of thousands (if not millions). Then we can compute probability queries by counting. Let $N(\\textbf{X} = \\textbf{k})$ be notation for the number of particles where random variables \\textbf{X} take on values \\textbf{k}. Recall that the bold notation \\textbf{X} means that \\textbf{X} is a vector with one or more elements. By the \"frequentist\" definition of probability: \n\\begin{align*}\n P(\\textbf{X} = \\textbf{k}) = \\frac{N(\\textbf{X} = \\textbf{k})}{N}\n\\end{align*}\nCounting for the win! But what about conditional probabilities? 
Well, using the definition of conditional probabilities, we can see it's still some pretty straightforward counting:\n\\begin{align*}\n P(\\textbf{X} = \\textbf{a} | \\textbf{Y} = \\textbf{b})\n = \\frac{P(\\textbf{X} = \\textbf{a},\\textbf{Y} = \\textbf{b}) }{P(\\textbf{Y} = \\textbf{b})}\n = \\frac\n {\\frac{N(\\textbf{X} = \\textbf{a},\\textbf{Y} = \\textbf{b})}{N}}\n {\\frac{N(\\textbf{Y} = \\textbf{b})}{N}}\n = \\frac\n {N(\\textbf{X} = \\textbf{a},\\textbf{Y} = \\textbf{b})}\n {N(\\textbf{Y} = \\textbf{b})}\n\\end{align*}\nLet's take a moment to recognize that this is straight-up fantastic. General inference based on analytic probability (math without samples) is hard even given a Bayesian network (if you don't believe me, try to calculate the probability of flu conditioning on one demographic and one symptom in the Full Disease Model). However, if we generate enough samples we can calculate any conditional probability question by reducing our samples to the ones that are consistent with the condition ($\\textbf{Y} = \\textbf{b}$) and then counting how many of those are also consistent with the query ($\\textbf{X} = \\textbf{a}$). Here is the algorithm in pseudocode:\n\n\nN = 10000\n# \"query\" is the assignment to variables we want probabilities for\n# \"condition\" is the assignment to variables we will condition on\ndef getAnyProbability(query, condition):\n    particles = generateManyJointSamples(N)\n    condParticles = rejectNonConsistentSamples(particles, condition)\n    K = countConsistentSamples(condParticles, query)\n    return K / len(condParticles)\n\nThis algorithm is sometimes called \"Rejection Sampling\" because it works by generating many particles from the joint distribution and rejecting the ones that are not consistent with the set of assignments we are conditioning on. Of course this algorithm is an approximation, though with enough samples it often works out to be a very good approximation. 
However, in cases where the event we're conditioning on is rare enough that it doesn't occur after millions of samples are generated, our algorithm will not work. The last line of our code will result in a divide by 0 error. See the next section for solutions!\n\nGeneral Inference when Conditioning on Rare Events\nJoint Sampling is a powerful technique that takes advantage of computational power. But it doesn't always work. In fact it doesn't work any time that the event we are conditioning on is rare enough that we are unlikely to ever produce samples that exactly match the event. The simplest example is with continuous random variables. Consider the Simple Disease Model. Let's change Fever from being a binary variable to being a continuous variable. To do so the only thing we need to do is re-specify the likelihood of fever given assignments to its parents (influenza). Let's say that the likelihoods come from the normal PDF:\n\\begin{align*}\n &\\text{if Influenza = 0, then Fever} \\sim N(\\mu = 98.3, \\sigma^2 = 0.7) && \\therefore f(\\text{Fever} = x) = \\frac{1}{\\sqrt{2 \\pi \\cdot 0.7}} e ^{-\\frac{(x - 98.3)^2}{2 \\cdot 0.7}} \\\\\n &\\text{if Influenza = 1, then Fever} \\sim N(\\mu = 100.0, \\sigma^2 = 1.8) && \\therefore f(\\text{Fever} = x) = \\frac{1}{\\sqrt{2 \\pi \\cdot 1.8}} e ^{-\\frac{(x - 100.0)^2}{2 \\cdot 1.8}}\n\\end{align*}\nDrawing samples (aka particles) is still straightforward. We apply the same process until we get to the step where we sample a value for the Fever random variable (in the example from the previous section that was step 3). If we had sampled a 0 for influenza we draw a value for fever from the normal for healthy adults (which has $\\mu = 98.3$). If we had sampled a 1 for influenza we draw a value for fever from the normal for adults with the flu (which has $\\mu = 100.0$). 
The problem comes in the \"rejection\" stage of joint sampling.\n\nWhen we sample values for fever we get numbers with infinite precision (e.g. 100.819238). If we condition on someone having a fever equal to 101 we would reject every single particle. Why? No particle will have exactly a fever of 101.\n\nThere are several ways to deal with this problem. One especially easy solution is to be less strict when rejecting particles. We could round all fevers to whole numbers. \n\nThere is an algorithm called \"Likelihood Weighting\" which sometimes helps, but which we don't cover in CS109. Instead, in class we talked about a new algorithm called Markov Chain Monte Carlo (MCMC) that allowed us to sample from the \"posterior\" probability: the distribution of random variables after (post) we fix the variables in the conditioned event. The version of MCMC we talked about is called Gibbs Sampling. While I don't require that students in CS109 know how to implement Gibbs Sampling, I wanted everyone to know that it exists and that it isn't beyond your capabilities. If you need to use it, you can learn it given the knowledge you have now. \n\nMCMC does require more math than Joint Sampling. For every random variable you will need to specify how to calculate the likelihood of assignments given the variable's parents, its children, and the parents of its children (a set of variables cozily called a \"blanket\"). Want to learn more? Take CS221 or CS228!\n\nThoughts\nWhile there are slightly-more-powerful \"general inference algorithms\" that you will get to learn in the future, it is worth recognizing that at this point we have reached an important milestone in CS109. You can take very complicated probability models (encoded as Bayesian networks) and can answer general inference queries on them. To get there we worked through the concrete example of predicting disease. 
While the WebMD website is great for home users, similar probability models are being used in thousands of hospitals around the world. As you are reading this, general inference is being used to improve health care (and sometimes even save lives) for real human beings. That's some probability for computer scientists that is worth learning. What if we don't have an expert? Could we learn those probabilities from data? Jump to part 5 to answer that question.\n"}, {"id": "dart_logo", "title": "CS109 Logo", "url": "examples/dart_logo", "text": " \n\n\n\t\t\t\t\t\tCS109 Logo\n\t\t\t\t\t\n\n\n\t\t\t\t\tTo generate the CS109 logo, we are going to throw half a million darts at a picture of the Stanford seal. We only keep the pixels that are hit by at least one dart. Each dart has its x-pixel and y-pixel chosen at random from Gaussian distributions. Let $X$ be a random variable which represents the x-pixel, $Y$ be a random variable which represents the y-pixel and $S$ be a constant that equals the size of the logo (its width is equal to its height). $X \\sim \\mathcal{N}\\left(\\frac{S}{2}, \\frac{S}{2}\\right)$ and $Y \\sim \\mathcal{N}\\left(\\frac{S}{3},\\frac{S}{5}\\right)$\n\t\t\t\t\n\n\nDarts thrown: 0\n\n\n\n\nDart Results\n\n\n\nDart Probability Density\n\n\n\n\n\nX Distribution\n\n\n\nY Distribution\n\n\n\n\n"}, {"id": "fairness", "title": "Fairness in AI", "url": "examples/fairness", "text": "\n \nFairness in Artificial Intelligence\n\nArtificial Intelligence often gives the impression that it is objective and \"fair\". However, algorithms are made by humans and trained on data which may be biased. There are several examples of deployed AI algorithms that have been shown to make decisions that were biased based on gender, race or other protected demographics -- even when there is no intention for it.\nThese examples have also led to necessary research in the growing field of algorithmic fairness. 
How can we demonstrate, or prove, that an algorithm is behaving in a way that we think is appropriate? What is fair? Clearly these are complex questions that deserve a complete conversation. This example is simple for the purpose of giving an introduction to the topic.\n\n\n\nML stands for Machine Learning. Solon Barocas and Moritz Hardt, \"Fairness in Machine Learning\", NeurIPS 2017\n\nWhat is Fairness?\nAn artificial intelligence algorithm is going to be used to make a binary prediction ($G$ for guess) for whether a person will repay a loan. The question has come up: is the algorithm \"fair\" with respect to a binary protected demographic ($D$ for demographic)? To answer this question we are going to analyze predictions the algorithm made on historical data. We are then going to compare the predictions to the true outcome ($T$ for truth). Consider the following joint probability table from the history of the algorithm\u2019s predictions:\n\n$D=0$\n\t$G=0$\t$G=1$\n$T=0$\t0.21\t0.32\n$T=1$\t0.07\t0.28\n\n$D=1$\n\t$G=0$\t$G=1$\n$T=0$\t0.01\t0.01\n$T=1$\t0.02\t0.08\n\nRecall that cell $D=i,G=j,T=k$ contains the probability $\\P(D=i,G=j,T=k)$. A joint probability table gives the probability of all combinations of events. Recall that since the cells are mutually exclusive, $\\sum_i \\sum_j \\sum_k \\P(D=i,G=j,T=k) = 1$. Note that this assumption of mutual exclusion could be problematic for demographic variables (some people are mixed ethnicity, etc) which gives you a hint that we are just scratching the surface in our conversation about fairness. 
Let's use this joint probability to learn about some of the common definitions of fairness. \nPractice with joint marginalization\nWhat is $\\p(D=0)$? What is $\\p(D=1)$? \nProbabilities with assignments to a subset of the random variables in the joint distribution can be calculated by a process called marginalization: sum the probability from all cells where that assignment is true.\n$$\n\\begin{align}\n\\p(D=1) &= \\sum_{j \\in \\{0, 1\\}} \\sum_{k \\in \\{0, 1\\}} \\p(D= 1, G= j, T = k)\\\\\n&= 0.01 + 0.01 + 0.02 + 0.08 = 0.12\n\\end{align}\n$$\n\n$$\n\\begin{align}\n\\p(D=0) &= \\sum_{j \\in \\{0, 1\\}} \\sum_{k \\in \\{0, 1\\}} \\p(D= 0, G= j, T = k)\\\\\n&= 0.21 + 0.32 + 0.07 + 0.28 = 0.88\n\\end{align}\n$$\n\nNote that $\\p(D=0) + \\p(D=1) = 1$. That implies that the demographics are mutually exclusive.\nFairness definition #1: Parity\nAn algorithm satisfies \u201cparity\u201d if the probability that the algorithm makes a positive prediction ($G$ = 1) is the same regardless of which value of the demographic variable we condition on.\nDoes this algorithm satisfy parity?\n$$\n\\begin{align}\nP(G=1|D=1) \n &= \\frac{P(G = 1, D = 1)}{P(D=1)}\n \t&& \\text{Cond. Prob.}\\\\\n &= \\frac{P(G = 1, D = 1, T= 0) + P(G=1, D = 1, T=1)}{P(D=1)}\n \t&& \\text{Prob or}\\\\\n &= \\frac{0.01 + 0.08}{0.12} = 0.75\n \t&& \\text{From joint}\n\\end{align}\n$$\n\n$$\n\\begin{align}\nP(G=1|D=0) \n &= \\frac{P(G = 1, D = 0)}{P(D=0)}\n \t&& \\text{Cond. Prob.}\\\\\n &= \\frac{P(G = 1, D = 0, T= 0) + P(G=1, D = 0, T=1)}{P(D=0)}\n \t&& \\text{Prob or}\\\\\n &= \\frac{0.32 + 0.28}{0.88} \\approx 0.68\n \t&& \\text{From joint}\n\\end{align}\n$$\nNo. Since $P(G=1|D=1) \\neq P(G=1|D=0)$ this algorithm does not satisfy parity. It is more likely to guess 1 when the demographic indicator is 1.\n\n\nFairness definition #2: Calibration\n\tAn algorithm satisfies \u201ccalibration\u201d if the probability that the algorithm is correct ($G=T$) is the same regardless of demographics. 
\n\nDoes this algorithm satisfy calibration?\nThe algorithm satisfies calibration if $P(G = T | D = 0) = P(G = T | D = 1)$ \n$$\n\\begin{align}\n P(G = T | D = 0)\n &= P(G = 1, T = 1 | D = 0) + P(G = 0, T = 0 | D = 0)\\\\\n &= \\frac{0.28 + 0.21}{0.88} \\approx 0.56 \\\\\n P(G = T | D = 1)\n &= P(G = 1, T = 1 | D = 1) + P(G = 0, T = 0 | D = 1)\\\\\n &= \\frac{0.08 + 0.01}{0.12} = 0.75\n\\end{align}\n$$\nNo: $P(G = T | D = 0) \\neq P(G = T | D = 1)$\n\n\n\nFairness definition #3: Equality of Odds\nAn algorithm satisfies \"equality of odds\" if the probability that the algorithm predicts a positive outcome ($G=1$) is the same regardless of demographics given that the outcome will occur ($T=1$).\n Does this algorithm satisfy equality of odds?\n\nThe algorithm satisfies equality of odds if $P(G = 1 | D = 0, T = 1) = P(G = 1 | D = 1, T = 1)$ \n$$\n\\begin{align*}\n P(G = 1 | D = 1, T = 1)\n &= \\frac{P(G = 1 , D = 1, T = 1)}{P(D = 1, T = 1)}\\\\\n &= \\frac{0.08}{0.08 + 0.02} = 0.8 \\\\\n P(G = 1 | D = 0, T = 1)\n &= \\frac{P(G = 1 , D = 0, T = 1)}{P(D = 0, T = 1)}\\\\\n &= \\frac{0.28}{0.28 + 0.07} = 0.8\n\\end{align*}\n$$\nYes: $P(G = 1 | D = 0, T = 1) = P(G = 1 | D = 1, T = 1)$\n\nWhich of these definitions seems right to you? One can prove that you can't satisfy all of the above simultaneously. For a deeper treatment of the subject, here is a useful summary of the latest research: Pessach et al., Algorithmic Fairness.\n\n\tGender Shades\n\n\t\tIn 2018, Joy Buolamwini and Timnit Gebru had a breakthrough result called \"gender shades\" published in the first conference on Fairness, Accountability and Transparency in ML [1]. They showed that commercial facial recognition algorithms, from Face++, IBM and Microsoft, were substantially better at making predictions (in this case classifying gender) when looking at lighter skinned men than darker skinned women. 
Their work exposed several shortcomings in production AI: biased datasets, optimizing for average accuracy (which means that the majority demographic gets most weight), lack of awareness of intersectionality, and more. Let's take a look at some of their results.\n\t\n\n\n\nFigure by Joy Buolamwini and Timnit Gebru. Facial recognition algorithms perform very differently depending on who they are looking at. [1]\n\t\n\nTimnit and Joy looked at three classifiers trained to predict gender, and computed several statistics. Let's take a look at one statistic, accuracy, for one of the facial recognition classifiers, IBM's:\n\n\tWomen\tMen\tDarker\tLighter\nAccuracy\t79.7\t94.4\t77.6\t96.8\n\nUsing the language of fairness, accuracy measures $\\p(G=T)$. The cell in the table above under \"Women\" gives the accuracy when looking at photos of women, $\\p(G=T|D = \\text{Women})$. It is easy to show that these production level systems are terribly \"uncalibrated\":\n\t\t$$\\p(G=T|D = \\text{Woman}) \\neq \\p(G=T|D = \\text{Man})$$\n\t\t$$\\p(G=T|D = \\text{Lighter}) \\neq \\p(G=T|D = \\text{Darker})$$\n\n\t\tWhy should we care about calibration and not the other definitions of fairness? In this case the classifier is predicting gender, where a positive prediction (say, predicting woman) doesn't have a directly associated reward, unlike our example above, where we were predicting if someone should receive a loan. As such the most salient question is: is the algorithm just as accurate for different genders (calibration)?\nThe lack of calibration between men/women and lighter/darker skinned photos is an issue. 
What Joy and Timnit showed next was that the problem becomes even worse when you look at intersectional demographics.\n\n\tDarker Men\tDarker Women\tLighter Men\tLighter Women\nAccuracy\t88.0\t65.3\t99.7\t92.9\n\n\t\tIf the algorithms were \"fair\" according to calibration you would expect the accuracy to be the same regardless of demographics. Instead there is a 34.4 percentage point difference! $\\p(G=T|D = \\text{Darker Woman}) = 65.3$ compared to $\\p(G=T|D = \\text{Lighter Man}) = 99.7$\n\t\n[1] Buolamwini, Gebru. Gender Shades. 2018\nWays Forward?\nWadsworth et al. Achieving Fairness through Adversarial Learning\n"}, {"id": "federalist", "title": "Federalist Paper Authorship", "url": "examples/federalist", "text": "\n \nFederalist Paper Authorship\n\n\nLet's write a program to decide whether James Madison or Alexander Hamilton wrote Federalist Paper 49. Both men claimed to have written it, and hence the authorship is in dispute. First we used historical essays to estimate $p_i$, the probability that Hamilton generates the word $i$ (independent of all previous and future choices of words). Similarly we estimated $q_i$, the probability that Madison generates the word $i$. For each word $i$ we observe the number of times that word occurs in Federalist Paper 49 (we call that count $c_i$). We assume that, given no evidence, the paper is equally likely to be written by Madison or Hamilton.\n\n\nDefine three events: $H$ is the event that Hamilton wrote the paper, $M$ is the event that Madison wrote the paper, and $D$ is the event that a paper has the collection of words observed in Federalist Paper 49. We would like to know whether $P(H|D)$ is larger than $P(M|D)$. This is equivalent to trying to decide if $P(H|D)/P(M|D)$ is larger than 1.\n\n\nThe event $D|H$ is a multinomial parameterized by the values $p$. 
The event $D|M$ is also a multinomial, this time parameterized by the values $q$.\n\n\nUsing Bayes Rule we can simplify the desired probability.\n\\begin{align*}\n\\frac{P(H|D)}{P(M|D)} \n &= \\frac{ \\frac{P(D|H)P(H)}{P(D)} }{ \\frac{P(D|M)P(M)}{P(D)}}\n = \\frac{ P(D|H)P(H) }{ P(D|M)P(M)} \n = \\frac{ P(D|H) }{ P(D|M)} \\\\\n &= \\frac{ { {n} \\choose {c_1,c_2,\\dots , c_m}} \\prod_i p_i^{c_i} }{ { {n} \\choose {c_1,c_2,\\dots , c_m}}\\prod_i q_i^{c_i}} \n = \\frac{ \\prod_i p_i^{c_i} }{ \\prod_i q_i^{c_i}} \n\\end{align*}\n\n\nThis seems great! We have our desired probability statement expressed in terms of a product of values we have already estimated. However, when we plug this into a computer, both the numerator and denominator come out to be zero. The product of many small probabilities is too close to zero for a computer's floating point numbers to represent (this is called underflow). To fix this problem, we use a standard trick in computational probability: we apply a log to both sides and apply some basic rules of logs.\n\\begin{align*}\n\\text{log}\\Big(\\frac{P(H|D)}{P(M|D)}\\Big) \n &= \\text{log}\\Big(\\frac{ \\prod_i p_i^{c_i} }{ \\prod_i q_i^{c_i}} \\Big) \\\\\n &= \\text{log}(\\prod_i p_i^{c_i}) - \\text{log}( \\prod_i q_i^{c_i}) \\\\\n &= \\sum_i \\text{log}(p_i^{c_i}) - \\sum_i \\text{log}(q_i^{c_i}) \\\\\n &= \\sum_i c_i\\text{log}(p_i) - \\sum_i c_i \\text{log}(q_i) \n\\end{align*}\nThis expression is \"numerically stable\" and my computer returned that the answer was a negative number. We can use exponentiation to solve for $P(H|D)/P(M|D)$. Since the exponential of a negative number is a number smaller than 1, this implies that $P(H|D)/P(M|D)$ is smaller than 1. As a result, we conclude that Madison was more likely to have written Federalist Paper 49. 
That is the standing assumption currently made by historians!\n\n"}, {"id": "name2age", "title": "Name to Age", "url": "examples/name2age", "text": "\n \nName to Age\n\n\nBecause of shifting patterns in name popularity, a person's name is a hint as to their age. The United States publishes a dataset which contains counts of how many US residents were born with a given name in a given year, based on Social Security applications. We can use inference to compute the reverse probability distribution: an updated belief in a person's age, given their name. As a reminder, if I know the year someone was born, I can calculate their age within one year.\n\n\n The US Social Security applications data provides you with a function: count(year, name) which returns the number of US citizens born in a given year with a given name. You also have access to a list names which has each name ever given in the US and years which has all the years. This function is implicitly giving us the joint probability over names and birth year. The probability of a joint assignment to name and birth year can be estimated as the count of people with that name, born in that year, over the total number of people in the dataset. Let $B$ be the year someone is born, and let $N$ be their name:\n\t$$\n\tP(B = b,N = n) \\approx \n\t\\frac{\\text{count}(b, n)}\n\t{\\sum\\limits_{i \\in \\text{names}} \\sum\\limits_{j \\in \\text{years}} \\text{count}(j,i)}\n\t$$\n\nThe question we would really like to answer is: what is your belief that a resident was born in 1950, given that their name is Gary? 
We can get started by applying the definition of conditional probability for random variables:\n\n\t\t\\begin{align*}\n\t\t\\P( B = 1950 | N = \\text{Gary} ) = \\frac{\\P(N = \\text{Gary} , B = 1950)}{\\P(N = \\text{Gary})}\n\t\t\\end{align*}\nBut this leaves one term to compute: $\\P(N = \\text{Gary})$ which we can compute using marginalization:\n\n\\begin{align*}\n\\P( N = \\text{Gary}) &= \\sum_{y \\in \\text{years}} P(B = y,N = \\text{Gary})\\\\\n&\\approx \\frac{ \\sum\\limits_{y \\in \\text{years}} \\text{count}(y, \\text{Gary})}{\\sum\\limits_{i \\in \\text{names}} \\sum\\limits_{j \\in \\text{years}} \\text{count}(j,i)}\n\\end{align*}\n\n\n\nPutting this all together we have:\n\n\\begin{align*}\n\\P( B = 1950 | N = \\text{Gary} ) \n&= \\frac{\\P(N = \\text{Gary} , B = 1950)}{\\P(N = \\text{Gary})} \\\\\n&\\approx \n\\frac\n{\n\t\\Big(\n\t\\frac{\\text{count}(1950, \\text{Gary})}\n\t{\\sum\\limits_{i \\in \\text{names}} \\sum\\limits_{j \\in \\text{years}} \\text{count}(j,i)}\n\t\\Big)\n}\n{\n\t\\Big(\n\t\\frac{ \\sum\\limits_{y \\in \\text{years}} \\text{count}(y, \\text{Gary})}{\\sum\\limits_{i \\in \\text{names}} \\sum\\limits_{j \\in \\text{years}} \\text{count}(j,i)}\n\t\\Big)\n} \\\\\n&\\approx\n\\frac\n{\n\t\\text{count}(1950, \\text{Gary})\n}\n{\n\t\\sum\\limits_{y \\in \\text{years}} \\text{count}(y, \\text{Gary})\n}\n\\end{align*}\n\n\nMore generally, for any name, we can compute the conditional probability mass function over birth year $B$:\n\n\t\\begin{align*}\n\\P( B = b | N = n ) \n&\\approx \n\\frac\n{\n\t\\text{count}(b, n)\n}\n{\n\t\\sum\\limits_{y \\in \\text{years}} \\text{count}(y, n)\n}\n\\end{align*}\n\nFrom Birth Year to Age\nOf course, if $B$ is the birth year of a person, their age, $A$, is approximately the current year minus $B$. This could be off by one if someone has a birthday later in the year, but we will ignore this small deviation for now. 
So for example, if we think that a person was born in 1988, then their age is approximately the current year minus 1988.\n\nAssumptions\nThis problem makes many assumptions which are worth highlighting. In fact, any time we make generalizations (especially about demographics) based on sparse information we should tread lightly. Here are the assumptions that I can think of:\n\n\n\nThis data is only accurate for names of people in the US. The probability of age given names could be very different in other countries.\nThe US census is not perfect. It does not capture all people who are resident in the US, and there are demographics which are underrepresented. This will also skew our results.\n\nDemo\n\nThis demo is based on real data from US Social Security applications between 1914 and 2014. Thank you to https://www.kaggle.com/kaggle/us-baby-names for compiling the data. Download Data.\nNames that Give Away Your Age\nSome names have certain years where they were exceptionally popular. These names provide quite a lot of information about birth year. Let's look at some of the names with the highest max probability.\nMedium Popularity (>10,000 people with the name)\n\nName\tAge with max prob\tProb of most likely age\nKatina\t49\t0.245\nMarquita\t38\t0.233\nAshanti\t19\t0.250\nMiley\t13\t0.250\nAria\t7\t0.247\n\nHigh Popularity (>100,000 people with the name)\n\nName\tAge with max prob\tProb of most likely age\nDebbie\t62\t0.104\nWhitney\t35\t0.098\nChelsea\t29\t0.103\nAidan\t18\t0.098\nAddison\t14\t0.112\n\nA search for \"Katina 1972\" brought up an interesting article about a baby named Katina in a 1972 CBS Soap Opera. Marquita's popularity was likely from a 1983 toothpaste ad. Ashanti Douglas and Miley Cyrus were popular singers in 2002 and 2008 respectively. \n\nFurther Reading\nSome names don't seem to have enough data to make good probability estimates. Can we quantify our uncertainty in such probability estimates? 
For example, if a name has only 10,000 entries in the database, of which only 100 were born in the year 1950, how confident are we that the true probability for 1950 is $\\frac{100}{10000} = 0.01$? One way to express our uncertainty would be through a Beta distribution. In this scenario we could represent our belief in the probability for 1950 as $X \\sim \\Beta(a=101, b=9901)$ reflecting that we have seen 100 people born in 1950 and 9900 people who were not. We can plot that belief, zoomed into the range [0, 0.03]:\n\n\t<%\ninclude('templates/functions/betaPdf.html', a=101, b=9901,id='nameBeta', min=0,max=0.03)\n\nWe can now ask questions such as, what is the probability that $X$ is within 0.002 of 0.01?\n\\begin{align*}\nP(&0.008 < X < 0.012) \\\\\n&= P(X < 0.012) - P(X < 0.008) \\\\\n&= F_X(0.012) - F_X(0.008) \\\\\n&= 0.966 - 0.013 \\\\\n&= 0.953\n\\end{align*}\n\nSemantically this leads to the claim that, after observing 100 births with a name in 1950, out of 10,000 births with that name over the whole dataset, there is a 95% chance that the probability of someone being born in 1950 is 0.010 $\\pm$ 0.002.\n\n\n"}, {"id": "bridge_distribution", "title": "Bridge Distribution", "url": "examples/bridge_distribution", "text": "\n \nBridge Card Game\n\n\n\nStub: This section is not complete. Parts might not be fully written\n Bridge is one of the most popular collaborative card games. It is played with four players in two teams. A few interesting probability problems come up in this game. You do not need to know the rules of bridge to follow this example.\n\n\t\tDistribution of Suit Splits\n\n\tWhen playing the game there are many times when one player will know exactly how many cards there are of a certain suit between their two opponents' hands (call the opponents A and B). 
However, the player won't know the \"split\": how many of that particular suit are in opponent A's hand and how many cards of that suit are in opponent B's hand.\nBoth opponents have equal-sized hands with $k$ cards left. Across the two hands there is a known number of cards of a particular suit (e.g. spades) $n$, and you want to know how many are in one hand and how many are in the other. A split is represented as a tuple. For example $(0, 5)$ would mean 0 cards of the suit in opponent A's hand and 5 in opponent B's. Feel free to choose specific values for $k$ and $n$:\n\n\n\n$k$, the number of cards in each player's hand: \n\n\n$n$, the number of cards of a particular suit among the two hands: \n\nA few notes: If there are $k$ cards in each of the 2 hands there are $2 k$ cards total. At the start of a game of bridge $k=13$. It must be the case that $n \\leq 2 k$ because you can't have more cards of the suit left than the number of cards! If there are $n$ of a suit, then there are $2k -n$ of other suits. This problem assumes that the cards are properly shuffled.\n\n\nProbability of different splits of the suit:\nLet $Y$ be a random variable representing the number of the suit in opponent A's hand. We can calculate the probability that $Y$ equals different values $i$ by counting equally likely outcomes.\n$$\\p(Y = i) = \\frac\n\t{ {n \\choose i} \\cdot {2\\cdot k-n \\choose k-i} }\n\t{ {2\\cdot k \\choose k}}$$\n\tCan you figure out how we came up with that formula? It uses equally likely outcomes where each element in the sample space is a chosen set of $k$ cards to be dealt to one player (out of $2k$ cards which go to both). For $k = 13$ and $n = 5$ here is the PMF over splits:\n\n\n\tIf we want to think about the probability of a given split, it is sufficient to choose one hand (call it \"hand one\") and think about the probability of the number of the given suit in that hand. 
Though there are two hands, if I tell you how many of a suit are in one hand, you can automatically figure out how many of the suit are in the other hand: recall that the number of the suit sums to $n$. \n\t\n\nProbability that either hand has at least $j$ cards of suit\nLet $X$ be a random variable representing the highest number of cards of the suit in either hand. We can calculate the probability using the complement: $X < j$ exactly when $n - j < Y < j$.\n$$\\p(X \\geq j) = 1 - \\sum_{i=n-j+1}^{j - 1}\\p(Y= i)$$\n\n\n\n\n\nDistribution of Hand Strength\nThe way folks play bridge is that they make a calculation about their \"hand strength\" and then make decisions based off that number. The strength of your hand is a number which is equal to 4 times the number of \"aces\", 3 times the number of \"kings\", 2 times the number of \"queens\" and 1 times the number of \"jacks\" in your hand. No other cards contribute to your hand strength. Let's consider your hand strength to be a random variable and compute its distribution. It seems complex to compute by hand -- but perhaps we could run a simulation? Here we simulate a million deals of bridge hands, calculate the hand strengths, and use that to approximate the distribution of hand strengths:\n\n\nYou might notice that at first blush this looks a lot like a Poisson with rate $\\lambda = 10$. First, let's consider why the rate might be 10. Let $X_i$ be the points of a given card $i$. Since each card rank is equally likely, $\\p(X_i=x) = \\frac{1}{13}$ for each of the point values $x \\in \\{1, 2, 3, 4\\}$. The expectation of points for each card is $\\E[X_i] = \\sum_x x \\cdot \\p(X_i=x) = (1+2+3+4)\\frac{1}{13}$. Let $H$ be the value of a hand. The value of a hand is the sum of the value of each card: \n\t$$\n\t\\begin{align}\\E[H] &= \\sum_{i\\in \\{1 \\dots 13\\}} \\E[X_i] \\\\&= 13 \\cdot \\E[X_i] \\\\&= 13 \\cdot (1+2+3+4)\\frac{1}{13} = 10\\end{align}$$\nSaying that $H$ is approximately $\\sim \\Poi(\\lambda=10)$ is an interesting claim. 
It suggests that points in a hand come at a constant rate, and that the next point in your hand is independent of when you got your last point. Of course this second part of the assumption is mildly violated. There is a fixed set of cards so getting one card changes the probabilities of others. For this reason the Poisson is a close, but not perfect, approximation.\n\nJoint distribution of hand strength among two hands\nIn most card games it doesn't just matter how strong your hand is, but the relative strength of your hand and another hand. In bridge, you play with a partner. We know that the two hands are not independent of each other. If I tell you that your partner has a strong hand, that means there are fewer \"high value\" cards that can be in your hand, and as such my belief in your strength has changed. If you think about each player's hand strength as a random variable, we care about the joint distribution of hand strength.\n\n\n\nFinally let's consider the conditional distribution of your partner's points given your points:\n\nYour points: \n\n\n\n\n"}, {"id": "tracking_in_2D", "title": "Tracking in 2D", "url": "examples/tracking_in_2D", "text": "\n \nTracking in 2D\n\n\n\nIn this example we are going to explore the problem of tracking an object in 2D space. The object exists at some $(x, y)$ location, however we are not sure exactly where! Thus we are going to use random variables $X$ and $Y$ to represent location. \n\nWe have a prior belief about where the object is. In this example our prior models both $X$ and $Y$ as normals which are independently distributed with mean 3 and variance 4. 
First let's write the prior belief as a joint probability density function:\n\\begin{align*}\n f(X = x, Y = y) &= f(X = x) \\cdot f(Y = y) && \\text{In the prior X and Y are independent} \\\\\n &= \n \\frac{1}{\\sqrt{2 \\cdot 4 \\cdot \\pi}}\\cdot e ^{-\\frac{(x-3)^2}{2 \\cdot 4}} \\cdot\n \\frac{1}{\\sqrt{2 \\cdot 4 \\cdot \\pi}}\\cdot e ^{-\\frac{(y-3)^2}{2 \\cdot 4}} && \\text{Using the PDF equation for normals} \\\\\n &= K_1 \\cdot\n e ^{-\\frac{(x-3)^2 + (y-3)^2}{8}} && \\text{All constants are put into } K_1\n\\end{align*}\nThis combination of normals is called a bivariate normal distribution. Here is a visualization of the PDF of our prior.\n\n\n\n\nThe interesting part about tracking an object is the process of updating your belief about its location based on an observation. Let's say that we get an instrument reading from a sonar that is sitting on the origin. The instrument reports that the object is 4 units away. Our instrument is not perfect: if the true distance was $t$ units away, then the instrument will give a reading which is normally distributed with mean $t$ and variance 1. Let's visualize the observation:\n\n\n\n\nBased on this information about the noisiness of our instrument, we can compute the conditional probability density of seeing a particular distance reading $D$, given the true location of the object $X$, $Y$. If we knew the object was at location $(x, y)$, we could calculate the true distance to the origin $\\sqrt{x^2 + y^2}$ which would give us the mean for the instrument Gaussian: \n\\begin{align*}\n f(D = d | X = x, Y = y) \n &= \n \\frac{1}{\\sqrt{2 \\cdot 1 \\cdot \\pi}}\\cdot e ^{-\\frac{\\big(d-\\sqrt{x^2 + y^2}\\big)^2}{2 \\cdot 1}} && \\text{Normal PDF where } \\mu = \\sqrt{x^2 + y^2} \\\\\n &=\n K_2\\cdot e ^{-\\frac{\\big(d-\\sqrt{x^2 + y^2}\\big)^2}{2 \\cdot 1}} && \\text{All constants are put into } K_2\n\\end{align*}\nLet's try this out on actual numbers. 
How much more likely is an instrument reading of 1 compared to 2, given that the location of the object is at (1, 1)?\n\\begin{align*}\n \\frac{f(D = 1 | X = 1, Y = 1)}{f(D = 2 | X = 1, Y = 1) }\n &= \n \\frac\n {K_2\\cdot e ^{-\\frac{\\big(1-\\sqrt{1^2 + 1^2}\\big)^2}{2 \\cdot 1}}}\n {K_2\\cdot e ^{-\\frac{\\big(2-\\sqrt{1^2 + 1^2}\\big)^2}{2 \\cdot 1}}} \n && \\text{Substituting into the conditional PDF of D}\\\\\n &\\approx\n \\frac\n {e ^{-0.086}}\n {e^{-0.172}}\n \\approx 1.09\n && \\text{Notice how the $K_2$ cancel out}\n\\end{align*}\nAt this point we have a prior belief and we have an observation. We would like to compute an updated belief, given that observation. This is a classic Bayes' formula scenario. We are using joint continuous variables, but that doesn't change the math much, it just means we will be dealing with densities instead of probabilities:\n\\begin{align*}\nf&(X =x, Y =y | D =4) \\\\\n &= \n \\frac{f(D = 4| X = x, Y =y) \\cdot f(X = x, Y = y)}{f(D =4)} \n && \\text{Bayes using densities}\\\\\n &= \n \\frac{K_2 \\cdot e^{-\\frac{(4 - \\sqrt{x^2 + y^2})^2}{2}} \\cdot K_1 \\cdot e^{-\\frac{(x - 3)^2 + (y - 3)^2}{8}}}{f(D = 4)} \n && \\text{Substitute}\\\\\n &= \n \\frac{K_1 \\cdot K_2}{ { f(D =4)}} \\cdot e ^{-\\big[\\frac{(4 - \\sqrt{x^2 + y^2})^2}{2} + \\frac{(x - 3)^2 + (y - 3)^2}{8} \\big]} \n && \\text{$f(D=4)$ is a constant w.r.t. $(x,y)$}\\\\\n &= \n K_3 \\cdot e ^{-\\big[\\frac{(4 - \\sqrt{x^2 + y^2})^2}{2} + \\frac{(x - 3)^2 + (y - 3)^2}{8} \\big]}\n && \\text{$K_3$ is a new constant}\n\\end{align*}\nWow! That looks like a pretty interesting function! You have successfully computed the updated belief. Let's see what it looks like. Here is a figure with our prior on the left and the posterior on the right:\n\n\n\nHow beautiful is that! It's like a 2D normal distribution merged with a circle. But wait, what about that constant! 
We do not know the value of $K_3$ and that is not a problem for two reasons: the first reason is that if we ever want to calculate a relative probability of two locations, $K_3$ will cancel out. The second reason is that if we really wanted to know what $K_3$ was, we could solve for it, since the density must integrate to 1.\n\nThis math is used every day in millions of applications. If there are multiple observations the equations can get truly complex (even worse than this one). To represent these complex functions we often use an algorithm called particle filtering.\n\n"}, {"id": "beta", "title": "Beta Distribution", "url": "part4/beta", "text": "\n\n\nBeta Distribution\n\n\n The Beta distribution is the distribution most often used as the distribution of probabilities. In this section we are going to have a very meta discussion about how we represent probabilities. Until now probabilities have just been numbers in the range 0 to 1. However, if we have uncertainty about our probability, it would make sense to represent our probabilities as random variables (and thus articulate the relative likelihood of our belief).\n \n\n<%\ninclude('templates/rvCards/beta.html')\n\n \n\nWhat is your Belief in $p$ After 9 Heads in 10 Flips?\nImagine we have a coin and we would like to know its true probability of coming up heads, $p$. We flip the coin 10 times and observe 9 heads and 1 tail. What is your belief in $p$ based off this evidence? Using the definition of probability we could guess that $p\\approx\\frac{9}{10}$. That number is a very rough estimate, especially since it is only based off 10 coin flips. Moreover the \"point-value\" $\\frac{9}{10}$ does not have the ability to articulate how uncertain it is. \n\n\n\n\nCould we instead have a random variable for the true probability? Formally, let $X$ represent the true probability of the coin coming up heads. We don't use the symbol $P$ for random variables, so $X$ will have to do. If $X = 0.7$ then the probability of heads is 0.7. 
$X$ must be a continuous random variable with support $[0, 1]$ since probabilities are continuous values which must be between 0 and 1.\n\nBefore flipping the coin, we could say that our belief about the coin's heads probability is uniform: $X \\sim \\Uni(0, 1)$. Let $H$ be a random variable for the number of heads and let $T$ be a random variable for the number of tails observed. What is $\\P(X =x | H = 9, T = 1)$? \n\nThat probability is hard to think about! However it is much easier to reason about the probability with the condition reversed: $P(H = 9, T = 1 | X= x)$. This term asks the question: what is the probability of seeing 9 heads and 1 tail in 10 coin flips, given that the true probability of heads is $x$? Convince yourself that this probability is just a binomial probability mass function with $n=10$ experiments, and $p=x$ evaluated at $k=9$ heads:\n $$\n P(H = 9, T = 1 | X= x) = {10 \\choose 9}x^{9} (1-x)^{1}\n $$\n\n\nWe are presented with a perfect context for Bayes' theorem with random variables. We know a conditional probability in one direction and we would like to know it in the other:\n \\begin{align*}\n f(&X = x|H =9, T=1) \\\\\n &= \\frac{P(H =9, T=1|X=x) \\cdot f(X=x)}{P(H =9, T=1)} && \\text{Bayes Theorem}\\\\\n &= \\frac{ {10 \\choose 9} x^{9} (1-x)^{1} \\cdot f(X=x)}{P(H =9, T=1)} && \\text{Binomial PMF}\\\\\n &= \\frac{ {10 \\choose 9} x^{9} (1-x)^{1} \\cdot 1}{P(H =9, T=1)} && \\text{Uniform PDF}\\\\\n &= \\frac{ {10 \\choose 9} }{P(H =9, T=1)} x^{9} (1-x)^{1} && \\text{Constants to front}\\\\\n &= K \\cdot x^{9} (1-x)^{1} && \\text{Rename constant}\\\\\n\\end{align*} \n\nLet's take a look at that function. For now we can let $K = 110$, the value that makes the function integrate to 1. Regardless of $K$ we will get the same shape, just scaled:\n\n<%\ninclude('templates/functions/betaPdf.html', a=10, b=2,id='9heads')\nWhat a beautiful image. It tells us the relative likelihood over the probability that is governing our coin flips. 
Here are a few observations from this chart:\n \nEven after only 10 coin flips we are very confident that the true probability is greater than 0.5.\n It is almost 10 times as likely that $X=0.9$ as it is that $X=0.6$.\n $f(X=1) = 0$, which makes sense. How could we have flipped that one tail if the probability of heads was 1?\n \n\nWait but why? \nIn the derivation above for $f(X = x|H =9, T=1)$ we made the claim that $P(H =9, T=1)$ is a constant. A lot of folks find that hard to believe. Why is that the case?\n\n It may be helpful to juxtapose $P(H =9, T=1)$ with $P(H =9, T=1 | X= x)$. The latter says \"what is the probability of 9 heads, given the true probability is $x$\". The former says \"what is the probability of 9 heads, under all possible assignments of $x$\". If you wanted to calculate $P(H =9, T=1)$ you could use the law of total probability:\n \\begin{align*}\n P(&H =9, T=1) \\\\\n &= \\int_{y=0}^{1} P(H =9, T=1 | X= y) f(X=y) dy\n \\end{align*}\n That is a hard number to calculate, but it is in fact a constant with respect to $x$.\n\nBeta Derivation\nLet's repeat the derivation from the previous section, using non-random variables for the number of observed heads, $h$, and the number of observed tails, $t$.\n\nLet $H =h $ be the event that we saw $h$ heads, and let $T=t$ be the event that we saw $t$ tails in $h+t$ coin flips. We want to calculate the probability density function $f(X=x|H=h,T=t)$. 
We can use the exact same series of steps, starting with Bayes' Theorem:\n\\begin{align*}\n f(&X = x|H =h, T=t) \\\\\n&= \\frac{P(H =h, T=t|X=x)f(X=x)}{P(H =h, T=t)} && \\text{Bayes Theorem}\\\\\n&= \\frac{ { {h+t} \\choose h} x^h(1-x)^t}{P(H =h, T=t)} && \\text{Binomial PMF, Uniform PDF}\\\\\n&= \\frac{ { {h+t} \\choose h}}{P(H =h, T=t)}x^h(1-x)^t && \\text{Moving terms around}\\\\\n&= \\frac{1}{c} \\cdot x^h(1-x)^t && \\text{where } c = \\int_0^1 x^h(1-x)^t dx\n\\end{align*}\n\n\nThe equation that we arrived at when using a Bayesian approach to estimating our probability defines a probability density function and thus a random variable. A random variable with this density is said to have a Beta distribution, and it is defined as follows:\n\nThe Probability Density Function (PDF) for $X \\sim \\Beta(a,b)$ is:\n\\begin{align*}\n f(X=x) = \n \\begin{cases} \n \\frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1} &\\mbox{if } 0 < x < 1 \\\\ \n 0 & \\mbox{otherwise} \n \\end{cases} \n &&\\mbox{where } B(a,b) = \\int_0^1x^{a-1}(1-x)^{b-1}dx\n\\end{align*}\nA Beta distribution has $E[X] = \\frac{a}{a + b}$ and $Var(X) = \\frac{ab}{(a+b)^2(a+b+1)}$. All modern programming languages have a package for calculating Beta CDFs. You will not be expected to compute the CDF by hand in CS109.\n\nTo model our estimate of the probability of a coin coming up heads: set $a = h + 1$ and $b = t + 1$. Beta is used as a random variable to represent a belief distribution of probabilities in contexts beyond estimating coin flips. For example perhaps a drug has been given to 6 patients, 4 of whom have been cured. We could express our belief in the probability that the drug can cure patients as $X \\sim \\Beta(a=5,b=3)$:\n\n\nNotice how the most likely belief for the probability of curing a patient is $4/6$, the fraction of patients cured. This distribution shows that we hold a non-zero belief that the probability could be something other than $4/6$. 
It is unlikely that the probability is 0.01 or 0.09, but reasonably likely that it could be 0.5.\n\n\n\n\nBeta as a Prior\nYou can set $X \\sim \\Beta(a, b)$ as a prior to reflect how biased you think the coin is prior to flipping it. This is a subjective judgment that represents $a+b- 2$ \"imaginary\" trials with $a-1$ heads and $b-1$ tails. If you then observe $h + t$ real trials with $h$ heads you can update your belief. Your new belief would be $X \\sim \\Beta(a+h, b+t)$. Using the prior $\\Beta(1,1) = \\Uni(0, 1)$ is the same as saying we haven't seen any \"imaginary\" trials, so a priori we know nothing about the coin. Here is the proof for the distribution of $X$ when the prior was a Beta too:\n\nIf our prior belief is $X \\sim \\Beta(a, b)$, then our posterior is $\\Beta(a+h, b+t)$:\n \\begin{align*}\n f(&X = x|H =h, T=t) \\\\\n&= \\frac{P(H =h, T=t|X=x)f(X=x)}{P(H =h, T=t)} && \\text{Bayes Theorem}\\\\\n&= \\frac{ { {h+t} \\choose h} x^h(1-x)^t \\cdot \\frac{1}{c} \\cdot x^{a-1}(1-x)^{b-1} } {P(H =h, T=t)} && \\text{Binomial PMF, Beta PDF}\\\\\n&= K \\cdot x^h(1-x)^t \\cdot x^{a-1}(1-x)^{b-1} && \\text{Combine Constants}\\\\\n&= K \\cdot x^{a+h-1}(1-x)^{b+t-1} && \\text{Combine Like Bases}\\\\\n\\end{align*}\nWhich is the PDF of $\\Beta(a+h, b+t)$.\n \nIt is pretty convenient that if we have a Beta prior belief, then our posterior belief is also Beta. This makes Betas especially convenient to work with, in code and in proof, if there are many updates that you will make to your belief over time. A prior with this property, where the type of distribution is the same before and after an observation, is called a conjugate prior.\n\n\nQuick question: Are you allowed to just make up priors and imaginary trials? Some folks think that is fine (they are called Bayesians) and some folks think that you shouldn't make up prior beliefs (they are called frequentists). 
In general, for small data it can make you much better at making predictions if you are able to come up with a good prior belief.\nObservation: There is a deep connection between the beta-prior and the uniform-prior (which we used initially). It turns out that $\\Beta(1,1) = \\Uni(0,1)$. Recall that $\\Beta(1,1)$ means 0 imaginary heads and 0 imaginary tails.\n\n"}, {"id": "summation_vars", "title": "Adding Random Variables", "url": "part4/summation_vars", "text": "\n \nAdding Random Variables\n\nIn this section on uncertainty theory we are going to explore some of the great results in probability theory. As a gentle introduction we are going to start with convolution.\nConvolution is a very fancy way of saying \"adding\" two different random variables together. The name comes from the fact that adding two random variables requires you to \"convolve\" their distribution functions. It is interesting to study in detail because (1) many natural processes can be modelled as the sum of random variables, and (2) because mathematicians have made great progress on proving convolution theorems. For some particular random variables computing convolution has closed form equations. Importantly convolution is the sum of the random variables themselves, not the addition of the probability density functions (PDFs) that correspond to the random variables.\n\nAdding Two Random Variables\nSum of Independent Poissons\nSum of Independent Binomials\nSum of Independent Normals\nSum of Independent Uniforms\n\nAdding Two Random Variables\nDeriving an expression for the likelihood for the sum of two random variables requires an interesting insight. If your random variables are discrete then the probability that $X + Y = n$ is the sum of mutually exclusive cases where $X$ takes on a value in the range $[0, n]$ and $Y$ takes on a value that allows the two to sum to $n$. Here are a few examples: $X = 0 \\and Y = n$, $X = 1 \\and Y = n - 1$ etc. 
In fact all of the mutually exclusive cases can be enumerated in a sum:\n\nDef: General Rule for the Convolution of Discrete Variables \n\t$$\\p(X + Y = n) = \\sum_{i=-\\infty}^{\\infty} \\p(X = i, Y = n- i)$$\n\nIf the random variables are independent you can further decompose the term $\\p(X = i, Y = n- i)$ into the product $\\p(X = i) \\cdot \\p(Y = n- i)$. Let's expand on some of the mutually exclusive cases where $X+Y=n$:\n\n\n\n$i$\n$X$\n$Y$\n\n\n\n0\n0\n$n$\n$\\P(X=0,Y=n)$\n\n\n1\n1\n$n-1$\n$\\P(X=1,Y=n-1)$\n\n\n2\n2\n$n-2$\n$\\P(X=2,Y=n-2)$\n\n\n\n...\n\n\n$n$\n$n$\n0\n$\\P(X=n,Y=0)$\n\n\nConsider the sum of two independent dice. Let $X$ and $Y$ be the outcome of each die. Here is the probability mass function for the sum $X + Y$:\n\n\n\n\nLet's use this context to practice deriving the sum of two variables, in this case $\\p(X + Y = n)$, starting with the General Rule for the Convolution of Discrete Random Variables. We start by considering values of $n$ between 2 and 7. In this range $\\p(X = i, Y = n- i) = \\frac{1}{36}$ for all values of $i$ between 1 and $n-1$. There is exactly one outcome of the two dice where $X = i$ and $Y= n-i$. For values of $i$ outside this range either $i$ or $n- i$ is not a valid die outcome and $\\p(X = i, Y = n- i) = 0$:\n\\begin{align*}\n\t\\p&(X + Y = n) \\\\\n\t&= \\sum_{i=-\\infty}^{\\infty} \\p(X = i, Y = n- i) \\\\\n\t&= \\sum_{i=1}^{n-1} \\p(X = i, Y = n- i) \\\\\n\t&= \\sum_{i=1}^{n-1} \\frac{1}{36}\\\\\n\t&= \\frac{n-1}{36}\n\\end{align*}\n\nFor values of $n$ greater than 7 we could use the same approach, though different values of $i$ would make $\\p(X = i, Y = n- i)$ non-zero.\n\n\tThis derivation for a general rule has a continuous equivalent:\n\\begin{align*}\n&f(X+Y = n) = \\int_{i=-\\infty}^{\\infty} f(X = n-i, Y=i) \\d i\n\\end{align*}\n\n\nSum of Independent Poissons\nFor any two independent Poisson random variables: $X ~ \\sim \\Poi(\\lambda_1)$ and $Y ~ \\sim \\Poi(\\lambda_2)$ the sum of those two random variables is another Poisson: $X +Y ~ \\sim \\Poi(\\lambda_1 + \\lambda_2)$. 
This holds even when $\\lambda_1$ is not the same as $\\lambda_2$.\nHow could we prove the above claim?\n\n\nExample derivation:\nLet's go about proving that the sum of two independent Poisson random variables is also Poisson. Let $X\\sim\\Poi(\\lambda_1)$ and $Y\\sim\\Poi(\\lambda_2)$ be two independent random variables, and $Z = X + Y$. What is $P(Z = n)$?\n\n\n\\begin{align*}\nP(Z = n) \n&= P(X + Y = n) \\\\\n&= \\sum_{k=-\\infty}^{\\infty} \\p(X = k, Y = n- k) & \\text{(Convolution)}\\\\\n&= \\sum_{k=-\\infty}^{\\infty} P(X = k) P(Y = n - k) & \\text{(Independence)}\\\\\n&= \\sum_{k=0}^n P(X = k) P(Y = n - k) &\\text{(Range of }X\\text{ and }Y\\text{)}\\\\\n&= \\sum_{k=0}^n e^{-{\\lambda_1}} \\frac{\\lambda_1^k}{k!} e^{-{\\lambda_2}} \\frac{\\lambda_2^{n-k}}{(n-k)!} & \\text{(Poisson PMF)} \\\\\n&= e^{-(\\lambda_1 + \\lambda_2)} \\sum_{k=0}^n \\frac{\\lambda_1^k \\lambda_2^{n-k}}{k!(n-k)!} \\\\\n&= \\frac{ e^{-(\\lambda_1 + \\lambda_2)}}{n!} \\sum_{k=0}^n \\frac{n!}{k!(n-k)!} \\lambda_1^k \\lambda_2^{n-k} \\\\\n&= \\frac{ e^{-(\\lambda_1 + \\lambda_2)}}{n!} (\\lambda_1 + \\lambda_2)^n & \\text{(Binomial theorem)}\n\\end{align*}\nThe final expression is exactly the PMF of $\\Poi(\\lambda_1 + \\lambda_2)$ evaluated at $n$.\n\n\nNote that the Binomial Theorem (which we did not cover in this class, but is often used in contexts like expanding polynomials) says that for two numbers $a$ and $b$ and positive integer $n$, $(a+b)^n = \\sum_{k=0}^n \\binom{n}{k} a^k b^{n-k}$.\n\nSum of Independent Binomials with equal $p$\nFor any two independent Binomial random variables with the same \"success\" probability $p$: $X ~ \\sim \\Bin(n_1,p)$ and $Y ~ \\sim \\Bin(n_2,p)$ the sum of those two random variables is another binomial: $X +Y ~ \\sim \\Bin(n_1 + n_2,p)$.\n This result hopefully makes sense. The convolution is the number of successes across $X$ and $Y$. Since each trial has the same probability of success, and there are now $n_1 + n_2$ trials, which are all independent, the convolution is simply a new Binomial. 
This rule does not hold when the two Binomial random variables have different parameters $p$. \nSum of Independent Normals\nFor any two independent normal random variables $X ~ \\sim \\mathcal{N}(\\mu_1,\\sigma_1^2)$ and $Y ~ \\sim \\mathcal{N}(\\mu_2,\\sigma_2^2)$ the sum of those two random variables is another normal: $X +Y ~ \\sim \\mathcal{N}(\\mu_1 + \\mu_2,\\sigma_1^2 + \\sigma_2^2)$.\nAgain this only holds when the two normals are independent.\nSum of Independent Uniforms\n\nIf $X$ and $Y$ are independent uniform random variables where $X \\sim \\Uni(0,1)$ and $Y \\sim \\Uni(0,1)$: \n\t\\begin{align*}\n f(X+Y=n) = \\begin{cases} n &\\mbox{if } 0 < n \\leq 1 \\\\ \n2-n & \\mbox{if } 1 < n \\leq 2 \\\\\n0 & \\mbox{else} \\end{cases} \n\\end{align*}\n\n\nExample derivation: \nWhat is the PDF of $X + Y$ for independent uniform random variables $X \\sim \\Uni(0,1)$ and $Y \\sim \\Uni(0,1)$? First plug in the equation for general convolution of independent random variables:\n\\begin{align*}\n f(X+Y=n) \n &= \\int_{i=0}^{1} f(X=n-i, Y=i)di\\\\\n &= \\int_{i=0}^{1} f(X=n-i)f(Y=i)di && \\text{Independence}\\\\\n &= \\int_{i=0}^{1} f(X=n-i)di && \\text{Because } f(Y=i) = 1\n\\end{align*}\nIt turns out that is not the easiest thing to integrate. By trying a few different values of $n$ in the range $[0,2]$ we can observe that the PDF we are trying to calculate has a kink at the point $n=1$ and thus will be easier to think about as two cases: $n < 1$ and $n > 1$. If we calculate $f(X+Y=n)$ for both cases and correctly constrain the bounds of the integral we get simple closed forms for each case:\n\\begin{align*}\n f(X+Y=n) = \\begin{cases} n &\\mbox{if } 0 < n \\leq 1 \\\\ \n2-n & \\mbox{if } 1 < n \\leq 2 \\\\\n0 & \\mbox{else} \\end{cases} \n\\end{align*}\n\n"}, {"id": "clt", "title": "Central Limit Theorem", "url": "part4/clt", "text": "\n \nCentral Limit Theorem\n\nThere are two ways that you could state the central limit theorem. 
Either the sum of IID random variables is approximately normally distributed, or the average of IID random variables is approximately normally distributed.\n\n\nThe Central Limit Theorem (Sum Version)\n\n\n\nLet $X_1, X_2 \\dots X_n$ be independent and identically distributed random variables. The sum of these random variables approaches a normal as $n \\rightarrow \\infty$:\n\n\\begin{align*}\n\\sum_{i=1}^{n}X_i \\sim N(n \\cdot \\mu, n \\cdot \\sigma^2)\n\\end{align*} \n\nWhere $\\mu = \\E[X_i]$ and $\\sigma^2 = \\Var(X_i)$. Note that since each $X_i$ is identically distributed they share the same expectation and variance.\n \nAt this point you probably think that the central limit theorem is awesome. But it gets even better. With some algebraic manipulation we can show that if the sum of IID random variables is normal, it follows that the average (sample mean) of those IID random variables must also be normal:\n\n \nThe Central Limit Theorem (Average Version)\n\n\n\nLet $X_1, X_2 \\dots X_n$ be independent and identically distributed random variables. The average of these random variables approaches a normal as $n \\rightarrow \\infty$:\n\n\\begin{align*}\n\\frac{1}{n}\\sum_{i=1}^{n}X_i \\sim N(\\mu, \\frac{\\sigma^2}{n})\n\\end{align*} \n\nWhere $\\mu = \\E[X_i]$ and $\\sigma^2 = \\Var(X_i)$. \n \nCentral Limit Theorem Intuition\nIn the previous section we explored what happens when you add two random variables. What happens when you add more than two random variables? For example, what if I wanted to add up 100 different uniform random variables:\n\nfrom random import random\n\ndef add_100_uniforms():\n    total = 0\n    for i in range(100):\n        # returns a sample from uniform(0, 1)\n        x_i = random()\n        total += x_i\n    return total\nThe value, total, returned by this function will be a random variable. Hit the button below to run the function and observe the resulting value of total:\n\n \n\nWhat does total look like as a distribution? 
Let's calculate total many times and visualize the histogram of values it produces.\n\n \n\nThat is interesting! total, which is the sum of 100 independent uniforms, looks normal. Is that a special property of uniforms? No! It turns out to work for almost any type of distribution (as long as the thing you are adding has finite mean and finite variance, which is true of everything we have covered in this reader).\n\n\nSum of 40 $X_i$ where $X_i \\sim \\Beta(a = 5, b = 4)$? Normal.\n Sum of 90 $X_i$ where $X_i \\sim \\Poi(\\lambda = 4)$? Normal.\n Sum of 50 dice-rolls? Normal.\n Average of 10000 $X_i$ where $X_i \\sim \\Exp(\\lambda = 8)$? Normal.\n\nFor any such distribution, the sum, or average, of $n$ independent equally-weighted samples from that distribution will be approximately normal for large $n$.\n\nContinuity Correction\nNow we can see that the Binomial Approximation using a Normal actually derives from the central limit theorem. Recall that, when computing probabilities for a normal approximation, we had to use a continuity correction. This was because we were approximating a discrete random variable (a binomial) with a continuous one (a normal). You should use a continuity correction any time your normal is approximating a discrete random variable. The rules for a general continuity correction are the same as the rules for the binomial-approximation continuity correction.\n\nIn the motivating example above, where we added 100 uniforms, a continuity correction isn't needed because the sum of uniforms is continuous. In the dice sum example below, a continuity correction is needed because die outcomes are discrete.\n\nExamples\n\nExample:\nYou will roll a 6-sided die 10 times. Let $X$ be the total value of all 10 rolls = $X_1 + X_2 + \\dots + X_{10}$. You win the game if $X \\leq 25$ or $X \\geq 45$. Use the central limit theorem to calculate the probability that you win. Recall that $E[X_i] = 3.5$ and $\\text{Var}(X_i) = \\frac{35}{12}$.\n\nLet $Y$ be the approximating normal. 
By the Central Limit Theorem $Y \\sim N(10 \\cdot E[X_i], 10 \\cdot \\Var(X_i))$. Substituting in the known values for expectation and variance: $Y \\sim N(35, 29.2)$\n\n\\begin{align*}\n\\P(&X \\leq 25 \\text{ or } X \\geq 45) \\\\\n&= \\P(X \\leq 25) + \\P(X \\geq 45) \\\\\n&\\approx \\P(Y < 25.5) + \\P(Y > 44.5) &&\\text{Continuity Correction} \\\\\n&\\approx \\P(Y < 25.5) + [1 -\\P(Y < 44.5)]\\\\\n&\\approx \\Phi(\\frac{25.5 - 35}{\\sqrt{29.2} }) + \\Big[1 -\\Phi(\\frac{44.5- 35}{\\sqrt{29.2} })\\Big] &&\\text{Normal CDF}\\\\\n&\\approx \\Phi(-1.76) + [1 - \\Phi(1.76)]\\\\\n&\\approx 0.039 + (1-0.961) \\approx 0.078\n\n\\end{align*}\n\n\nExample:\nSay you have a new algorithm and you want to test its running time. You have an idea of the variance of the algorithm's run time: $\\sigma^2 = 4\\text{sec}^2$ but you want to estimate the mean: $\\mu = t$sec. You can run the algorithm repeatedly (IID trials). How many trials do you have to run so that your estimated runtime = $t \\pm 0.5$ with 95\\% certainty? 
Let $X_i$ be the run time of the $i$-th run (for $1 \\leq i \\leq n$).\n\\begin{align*}\n0.95 = P(-0.5 \\leq \\frac{\\sum_{i=1}^nX_i}{n} - t \\leq 0.5)\n\\end{align*}\nBy the central limit theorem, the standardized sum $Z$ is approximately a standard normal:\n\\begin{align*}\nZ &= \\frac{\\left(\\sum_{i=1}^n X_i\\right) - n\\mu}{\\sigma \\sqrt{n}} \\\\\n&= \\frac{\\left(\\sum_{i=1}^n X_i\\right) - nt}{2 \\sqrt{n}}\n\\end{align*}\nNow we rewrite our probability inequality so that the central term is $Z$, by multiplying all three parts of the inequality by $\\frac{\\sqrt{n}}{2}$:\n\\begin{align*}\n0.95&= P(-0.5 \\leq \\frac{\\sum_{i=1}^nX_i}{n} - t \\leq 0.5)\\\\\n&=P(\\frac{-0.5\\sqrt{n}}{2} \\leq \\frac{\\sqrt{n}}{2}\\frac{\\sum_{i=1}^nX_i}{n} - \\frac{\\sqrt{n}}{2}t \\leq \\frac{0.5\\sqrt{n}}{2})\\\\\n&=P(\\frac{-0.5\\sqrt{n}}{2} \\leq \\frac{\\sum_{i=1}^nX_i}{2\\sqrt{n}} - \\frac{\\sqrt{n}}{\\sqrt{n}}\\frac{\\sqrt{n}t}{2} \\leq \\frac{0.5\\sqrt{n}}{2})\\\\\n&=P(\\frac{-0.5\\sqrt{n}}{2} \\leq \\frac{\\sum_{i=1}^nX_i - nt}{2\\sqrt{n}} \\leq \\frac{0.5\\sqrt{n}}{2})\\\\\n&=P(\\frac{-0.5\\sqrt{n}}{2} \\leq Z \\leq \\frac{0.5\\sqrt{n}}{2})\n\\end{align*}\nAnd now we can find the value of $n$ that makes this equation hold.\n\\begin{align*}\n0.95&= \\Phi(\\frac{\\sqrt{n}}{4}) - \\Phi(-\\frac{\\sqrt{n}}{4}) = \\Phi(\\frac{\\sqrt{n}}{4}) - (1- \\Phi(\\frac{\\sqrt{n}}{4})) \\\\\n &= 2\\Phi(\\frac{\\sqrt{n}}{4}) - 1\\\\\n0.975 &= \\Phi(\\frac{\\sqrt{n}}{4})\\\\\n\\Phi^{-1}(0.975) &= \\frac{\\sqrt{n}}{4} \\\\\n1.96 &= \\frac{\\sqrt{n}}{4}\\\\\nn &\\approx 61.5\n\\end{align*}\nSince $n$ must be a whole number, it takes 62 runs. 
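The calculation above can be checked with a few lines of Python. This is only a minimal sketch: it assumes the standard library's statistics.NormalDist for the inverse normal CDF, and the function name trials_needed and its parameter names are illustrative choices, not something defined in this reader.

```python
import math
from statistics import NormalDist

def trials_needed(sigma, half_width, confidence):
    # two-sided interval: leave (1 - confidence) / 2 probability in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # solve z = half_width * sqrt(n) / sigma for n
    n_exact = (z * sigma / half_width) ** 2
    # n has to be a whole number of runs, so round up
    return n_exact, math.ceil(n_exact)

# sigma = 2 because the variance is 4 sec^2; half_width = 0.5 sec
n_exact, n_runs = trials_needed(sigma=2, half_width=0.5, confidence=0.95)
print(n_exact, n_runs)
```

Rounding up rather than to the nearest integer matters here: any smaller whole number of runs would leave the confidence just below 95 percent.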
If you are interested in how this extends to cases where the variance is unknown, look into variations of Student's t-test.\n\n\n"}, {"id": "clt", "title": "Central Limit Theorem", "url": "part4/clt", "text": "\n\n10,000 more runs\n\n"}, {"id": "clt", "title": "Central Limit Theorem", "url": "part4/clt", "text": "\nadd_100_uniforms()\ntotal: \n\n"}, {"id": "samples", "title": "Sampling", "url": "part4/samples", "text": "\n \nSampling\n\nIn this section we are going to talk about statistics calculated on samples from a population. We are then going to talk about probability claims that we can make with respect to the original population -- a central requirement for most scientific disciplines.\n\nLet's say you are the king of Bhutan and you want to know the average happiness of the people in your country. You can't ask every single person, but you could ask a random subsample. In this next section we will consider principled claims that you can make based on a subsample. Assume we randomly sample 200 Bhutanese and ask them about their happiness. Our data looks like this: ${72, 85, \\dots, 71}$. You can also think of it as a collection of $n$ = 200 I.I.D. (independent, identically distributed) random variables ${X_1, X_2, \\dots, X_n}$.\n\nUnderstanding Samples\nThe idea behind sampling is simple, but the details and the mathematical notation can be complicated. Here is a picture to show you all of the ideas involved:\n\n\nThe theory is that there is some large population (for example the 774,000 people who live in Bhutan). We collect a sample of $n$ people at random, where each person in the population is equally likely to be in our sample. From each person we record one number (for example their reported happiness). We are going to call the number from the $i$th person we sampled $X_i$. One way to visualize your samples ${X_1, X_2, \\dots, X_n}$ is to make a histogram of their values.\n\nWe make the assumption that all of our $X_i$s are identically distributed. 
That means that we are assuming there is a single underlying distribution $F$ that we drew our samples from. Recall that a distribution for discrete random variables should define a probability mass function.\n\n\n\nEstimating Mean and Variance from Samples\nWe assume that the data we look at are IID from the same underlying distribution ($F$) with a true mean ($\\mu$) and a true variance ($\\sigma^2$). Since we can't talk to everyone in Bhutan we have to rely on our sample to estimate the mean and variance. From our sample we can calculate a sample mean ($\\bar{X}$) and a sample variance ($S^2$). These are the best guesses that we can make about the true mean and true variance. \n\\begin{align*}\n &\\bar{X} = \\sum_{i=1}^n \\frac{X_i}{n} && S^2 = \\sum_{i=1}^n \\frac{(X_i - \\bar{X})^2}{n-1}\n\\end{align*}\nThe first question to ask is, are those unbiased estimates? Yes. Unbiased, means that if we were to repeat this sampling process many times, the expected value of our estimates should be equal to the true values we are trying to estimate. We will prove that that is the case for $\\bar{X}$. The proof for $S^2$ is in lecture slides.\n\\begin{align*}\n E[\\bar{X}] &= E[\\sum_{i=1}^n \\frac{X_i}{n}] = \\frac{1}{n}E\\left[\\sum_{i=1}^n X_i\\right] \\\\\n &= \\frac{1}{n}\\sum_{i=1}^n E[X_i] = \\frac{1}{n}\\sum_{i=1}^n \\mu = \\frac{1}{n}n\\mu = \\mu\n\\end{align*}\n\nThe equation for sample mean seems related to our understanding of expectation. The same could be said about sample variance except for the surprising $(n-1)$ in the denominator of the equation. Why $(n-1)$? That denominator is necessary to make sure that the $E[S^2] = \\sigma^2$.\n\nThe intuition behind the proof is that sample variance calculates the distance of each sample to the sample mean, \\emph{not} the true mean. The sample mean itself varies, and we can show that its variance is also related to the true variance. 
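The two estimators above translate directly into code. The following is a small sketch in plain Python; the function names and the list of scores are made up for illustration and are not real survey data.

```python
def sample_mean(xs):
    # X-bar: the sum of the samples divided by n
    return sum(xs) / len(xs)

def sample_variance(xs):
    x_bar = sample_mean(xs)
    # divide by n - 1 (not n) so that the estimate is unbiased
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

# hypothetical happiness scores, for illustration only
scores = [72, 85, 79, 90, 71]
print(sample_mean(scores), sample_variance(scores))
```

Dividing by n instead of n - 1 in sample_variance would give a slightly smaller number on every sample, which is exactly the bias that the $(n-1)$ denominator corrects.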
\n\nStandard Error\nOk, you convinced me that our estimates for mean and variance are not biased. But now I want to know how much my sample mean might vary relative to the true mean.\n\\begin{align*}\n \\text{Var}(\\bar{X}) &= \\text{Var}(\\sum_{i=1}^n \\frac{X_i}{n}) = \\left(\\frac{1}{n}\\right)^2 \\text{Var}\\left(\\sum_{i=1}^n X_i\\right) \\\\\n &= \\left(\\frac{1}{n}\\right)^2 \\sum_{i=1}^n\\text{Var}( X_i) =\\left(\\frac{1}{n}\\right)^2 \\sum_{i=1}^n \\sigma^2 = \\left(\\frac{1}{n}\\right)^2 n \\sigma^2 = \\frac{\\sigma^2}{n}\\\\\n&\\approx \\frac{S^2}{n} \\\\\n \\text{Std}(\\bar{X}) &\\approx \\sqrt{\\frac{S^2}{n}} \n\\end{align*}\nThat term, Std($\\bar{X}$), has a special name. It is called the standard error and it's how you report uncertainty of estimates of means in scientific papers (and how you get error bars).\nGreat! Now we can compute all these wonderful statistics for the Bhutanese people. But wait! You never told me how to calculate the Std($S^2$). That is hard because the central limit theorem doesn't apply to the computation of $S^2$. Instead we will need a more general technique. See the next chapter: Bootstrapping.\nLet's say our sample of happiness has $n$ = 200 people. The sample mean is $\\bar{X} = 83$ (measured in happiness-score points) and the sample variance is $S^2 = 450$. We can now calculate the standard error of our estimate of the mean to be $\\sqrt{450/200} = 1.5$. When we report our results we will say that our estimate of the average happiness score in Bhutan is 83 $\\pm$ 1.5. Our estimate of the variance of happiness is 450 $\\pm$ ?.\n"}, {"id": "bootstrapping", "title": "Bootstrapping", "url": "part4/bootstrapping", "text": "\n \nBootstrapping\n\nThe bootstrap is a relatively recent statistical technique for both understanding distributions of statistics and for calculating $p$-values (a $p$-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually measured). 
It was invented here at Stanford in 1979 when mathematicians were just starting to understand how computers, and computer simulations, could be used to better understand probabilities.\n\nThe first key insight is that: if we had access to the underlying distribution ($F$) then answering almost any question we might have as to how accurate our statistics are becomes straightforward. For example, in the previous section we gave a formula for how you could calculate the sample variance from a sample of size $n$. We know that in expectation our sample variance is equal to the true variance. But what if we want to know the probability that the true variance is within a certain range of the number we calculated? That question might sound dry, but it is critical to evaluating scientific claims! If you knew the underlying distribution, $F$, you could simply repeat the experiment of drawing a sample of size $n$ from $F$ many times, calculate the sample variance of each new sample, and check what portion of those values fell within a certain range. \n\nThe next insight behind bootstrapping is that the best estimate that we can get for $F$ is from our sample itself! The simplest way to estimate $F$ (and the one we will use in this class) is to assume that the $P(X=k)$ is simply the fraction of times that $k$ showed up in the sample. Note that this defines the probability mass function of our estimate $\\hat{F}$ of $F$.\n\n\n\nimport random\n\ndef bootstrap(sample, calculate_stat, n_resamples=10000):\n    # calculate_stat is any function that maps a list of values to a number\n    n = len(sample)\n    stats = []\n    for _ in range(n_resamples):\n        # drawing with replacement from the sample is the same as\n        # drawing n new values from the pmf estimated from the sample\n        resample = random.choices(sample, k=n)\n        stats.append(calculate_stat(resample))\n    # stats can now be used to estimate the distribution of the stat\n    return stats\nBootstrapping is a reasonable thing to do because the sample you have is the best and only information you have about what the underlying population distribution actually looks like. 
Moreover most samples will, if they're randomly chosen, look quite like the population they came from. \n\nTo calculate $\\text{Var}(S^2)$ we could calculate $S_i^2$ for each resample $i$ and after 10,000 iterations, we could calculate the sample variance of all the $S_i^2$s. You might be wondering why the resample is the same size as the original sample ($n$). The answer is that the variation of the stat that you are calculating could depend on the size of the sample (or the resample). To accurately estimate the distribution of the stat we must use resamples of the same size.\n\nThe bootstrap has strong theoretical guarantees, and is accepted by the scientific community. It breaks down when the underlying distribution has a \"long tail\" or if the samples are not I.I.D.\n\n\n\nExample of p-value calculation\nWe are trying to figure out if people are happier in Bhutan or in Nepal. We sample $n_1 = 200$ individuals in Bhutan and $n_2 = 300$ individuals in Nepal and ask them to rate their happiness on a scale from 1 to 10. We measure the sample means for the two samples and observe that people in Nepal are slightly happier--the difference between the Nepal sample mean and the Bhutan sample mean is 0.5 points on the happiness scale. \n\nIf you want to make this claim scientific you should calculate a $p$-value. A p-value is the probability that, when the null hypothesis is true, the statistic measured would be equal to, or more extreme than, the value you are reporting. The null hypothesis is the hypothesis that there is no relationship between two measured phenomena or no difference between two groups.\n\nIn the case of comparing Nepal to Bhutan, the null hypothesis is that there is no difference between the distribution of happiness in Bhutan and Nepal. The null hypothesis argument is: there is no difference in the distribution of happiness between Nepal and Bhutan. 
When you drew samples, Nepal had a mean that was 0.5 points larger than Bhutan by chance.\n\nWe can use bootstrapping to calculate the p-value. First, we estimate the underlying distribution under the null hypothesis by making a probability mass function from all of our samples from Nepal and all of our samples from Bhutan.\n\n\nimport random\n\ndef mean(values):\n    return sum(values) / len(values)\n\ndef pvalue_bootstrap(bhutan_sample, nepal_sample, n_resamples=10000):\n    n = len(bhutan_sample)\n    m = len(nepal_sample)\n    # under the null hypothesis both samples come from one distribution,\n    # so pool them to estimate that distribution\n    universal_sample = list(bhutan_sample) + list(nepal_sample)\n    observed_difference = abs(mean(nepal_sample) - mean(bhutan_sample))\n    count = 0\n    for _ in range(n_resamples):\n        # drawing with replacement samples from the estimated universal pmf\n        bhutan_resample = random.choices(universal_sample, k=n)\n        nepal_resample = random.choices(universal_sample, k=m)\n        mean_difference = abs(mean(nepal_resample) - mean(bhutan_resample))\n        # count resamples at least as extreme as what we observed\n        if mean_difference >= observed_difference:\n            count += 1\n    return count / n_resamples\nThis is particularly nice because nowhere did we have to make an assumption about a parametric distribution that our samples came from (i.e., we never had to claim that happiness is gaussian). You might have heard of a t-test. That is another way of calculating p-values, but it makes the assumption that both samples are gaussian and that they both have the same variance. In the modern context where we have reasonable computer power, bootstrapping is a more versatile tool that makes fewer assumptions.\n"}, {"id": "algorithmic_analysis", "title": "Algorithmic Analysis", "url": "part4/algorithmic_analysis", "text": "\n \nAlgorithmic Analysis\n\nIn this section we are going to use probability to analyze code. Specifically we are going to be calculating expectations on code: expected run time, expected resulting values etc. The reason that we are going to focus on expectation is that it has several nice properties. 
One of the most useful properties that we have seen so far is that the expectation of a sum is the sum of expectations, regardless of whether the random variables are independent of one another. In this section we will see a few more helpful properties, including the Law of Total Expectation, which is also helpful in analyzing code:\n\nLaw of Total Expectation\n\tThe law of total expectation gives you a way to calculate $\\E[X]$ in the scenario where it is easier to compute $\\E[X|Y=y]$ where $Y$ is some other random variable:\n\t\\begin{align*}\n\t\\E[X] \n\t&= \\E\\big[\\E[X|Y]\\big] \\\\\n\t&= \\sum_y \\E[X|Y=y] \\p(Y=y)\n\t\\end{align*}\n\n"}, {"id": "thompson", "title": "Thompson Sampling", "url": "examples/thompson", "text": "\n \nThompson Sampling\n\n\n\n\n\nWarning: This chapter is a stub. Come back later and hopefully someone has had a chance to finish it. \nLet me present you with a seemingly simple problem that has a surprisingly complex solution. Imagine that you have two brand new drugs for a serious illness. You don't know how effective each drug is. You want to know which drug is the most effective, but at the same time, there are costs to exploration \u2014 there are high stakes.\n"}, {"id": "parameter_estimation", "title": "Parameter Estimation", "url": "part5/parameter_estimation", "text": "\n \nParameter Estimation\n\n\nWe have learned many different distributions for random variables and all of those distributions had parameters: the numbers that you provide as input when you define a random variable. So far when we were working with random variables, we either were explicitly told the values of the parameters, or, we could divine the values by understanding the process that was generating the random variables.\n\nWhat if we don't know the values of the parameters and we can't estimate them from our own expert knowledge? 
What if instead of knowing the random variables, we have a lot of examples of data generated with the same underlying distribution? In this chapter we are going to learn formal ways of estimating parameters from data. \n\n\nThese ideas are critical for artificial intelligence. Almost all modern machine learning algorithms work like this: (1) specify a probabilistic model that has parameters. (2) Learn the value of those parameters from data.\nParameters\n\nBefore we dive into parameter estimation, first let's revisit the concept of parameters. Given a model, the parameters are the numbers that yield the actual distribution. In the case of a Bernoulli random variable, the single parameter was the value $p$. In the case of a Uniform random variable, the parameters are the $a$ and $b$ values that define the min and max value. Here is a list of random variables and the corresponding parameters. From now on, we are going to use the notation $\\theta$ to be a vector of all the parameters:\n\n\n\n\nDistribution\nParameters\n\n\n\n\nBernoulli($p$)\n$\\theta = p$\n\n\nPoisson($\\lambda$)\n$\\theta = \\lambda$\n\n\nUniform($a, b$)\n$\\theta = [a, b]$\n\n\nNormal($\\mu, \\sigma^2$)\n$\\theta = [\\mu, \\sigma^2]$\n\n\n\n\n\nIn the real world often you don't know the \"true\" parameters, but you get to observe data. Next up, we will explore how we can use data to estimate the model parameters. \n\nIt turns out there isn't just one way to estimate the value of parameters. There are two main schools of thought: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP). Both of these schools of thought assume that your data are independent and identically distributed (IID) samples: $X_1, X_2, \\dots X_n$.\n"}, {"id": "mle", "title": "Maximum Likelihood Estimation", "url": "part5/mle", "text": "\n \nMaximum Likelihood Estimation\n\nOur first algorithm for estimating parameters is called Maximum Likelihood Estimation (MLE). 
The central idea behind MLE is to select the parameters ($\\theta$) that make the observed data the most likely. \n\nThe data that we are going to use to estimate the parameters are going to be $n$ independent and identically distributed (IID) samples: $X_1, X_2, \\dots X_n$.\n\nLikelihood\nWe made the assumption that our data are identically distributed. This means that they must have either the same probability mass function (if the data are discrete) or the same probability density function (if the data are continuous). To simplify our conversation about parameter estimation we are going to use the notation $f(X|\\theta)$ to refer to this shared PMF or PDF. Our new notation is interesting in two ways. First, we have now included a conditional on $\\theta$ which is our way of indicating that the likelihood of different values of $X$ depends on the values of our parameters. Second, we are going to use the same symbol $f$ for both discrete and continuous distributions.\n\n\nWhat does likelihood mean and how is \"likelihood\" different than \"probability\"? In the case of discrete distributions, likelihood is a synonym for the joint probability of your data. In the case of continuous distributions, likelihood refers to the joint probability density of your data.\n\nSince we assumed that each data point is independent, the likelihood of all of our data is the product of the likelihood of each data point. Mathematically, the likelihood of our data given parameters $\\theta$ is:\n\\begin{align*}\n L(\\theta) = \\prod_{i=1}^n f(X_i|\\theta)\n\\end{align*}\n\nFor different values of parameters, the likelihood of our data will be different. If we have correct parameters our data will be much more probable than if we have incorrect parameters. 
For that reason we write likelihood as a function of our parameters ($\\theta$).\n\nMaximization\nIn maximum likelihood estimation (MLE) our goal is to choose values of our parameters ($\\theta$) that maximize the likelihood function from the previous section. We are going to use the notation $\\hat{\\theta}$ to represent the best choice of values for our parameters. Formally, MLE selects: \n\\begin{align*}\n \\hat{\\theta} = \\underset{\\theta}{\\operatorname{argmax }} \\text{ }L(\\theta)\n\\end{align*}\nArgmax is short for Arguments of the Maxima. The argmax of a function is the value of the domain at which the function is maximized. It applies for domains of any dimension. \n\nA cool property of argmax is that since log is a monotonically increasing function, the argmax of a function is the same as the argmax of the log of the function! That's nice because logs make the math simpler. If we find the argmax of the log of likelihood it will be equal to the argmax of the likelihood. Thus for MLE we first write the Log Likelihood function ($LL$)\n\\begin{align*}\n LL(\\theta) = \\log L(\\theta) = \\log \\prod_{i=1}^n f(X_i|\\theta) = \\sum_{i=1}^n \\log f(X_i|\\theta)\n\\end{align*}\n\nTo use a maximum likelihood estimator, first write the log likelihood of the data given your parameters. Then choose the value of parameters that maximize the log likelihood function. Argmax can be computed in many ways. All of the methods that we cover in this class require computing the first derivative of the function.\n\n\nBernoulli MLE Estimation\nFor our first example, we are going to use MLE to estimate the $p$ parameter of a Bernoulli distribution. We are going to make our estimate based on $n$ data points which we will refer to as IID random variables $X_1, X_2, \\dots X_n$. Every one of these random variables is assumed to be a sample from the same Bernoulli, with the same $p$, $X_i \\sim \\text{Ber}(p)$. We want to find out what that $p$ is. 
\n\nStep one of MLE is to write the likelihood of a Bernoulli as a function that we can maximize. Since a Bernoulli is a discrete distribution, the likelihood is the probability mass function.\n\nThe probability mass function of a Bernoulli $X$ can be written as $f(x) = p^{x}(1-p)^{1-x}$. Wow! What's up with that? It's an equation that allows us to say that the probability that $X = 1$ is $p$ and the probability that $X = 0$ is $1- p$. Convince yourself that when $X_i=0$ and $X_i=1$ the PMF returns the right probabilities. We write the PMF this way because it is differentiable.\n\nNow let's do some MLE estimation:\n\\begin{align*}\nL(\\theta) &= \\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} && \\text{First write the likelihood function} \\\\\nLL(\\theta) &= \\sum_{i=1}^n \\log p^{x_i}(1-p)^{1-x_i} && \\text{Then write the log likelihood function}\\\\\n &= \\sum_{i=1}^n x_i (\\log p) + (1 - x_i) \\log(1-p) \\\\\n &= Y \\log p + (n - Y) \\log(1-p) && \\text{where } Y = \\sum_{i=1}^n x_i\n\\end{align*}\nGreat Scott! We have the log likelihood equation. Now we simply need to choose the value of $p$ that maximizes our log-likelihood. As your calculus teacher probably taught you, one way to find the value which maximizes a function is to find the first derivative of the function and set it equal to 0.\n\\begin{align*}\n\\frac{\\partial LL(p)}{\\partial p} &= Y \\frac{1}{p} + (n - Y) \\frac{-1}{1-p} = 0 \\\\\n \\hat{p} &= \\frac{Y}{n} = \\frac{\\sum_{i=1}^n x_i}{n}\n\\end{align*}\nAll that work, only to find out that the MLE estimate is simply the sample mean...\n\nNormal MLE Estimation\nPractice is key. Next up we are going to try and estimate the best parameter values for a normal distribution. All we have access to are $n$ samples from our normal which we refer to as IID random variables $X_1, X_2, \\dots X_n$. We assume that for all $i$, $X_i \\sim N(\\mu = \\theta_0, \\sigma^2 = \\theta_1)$. This example seems trickier since a normal has $\\textbf{two}$ parameters that we have to estimate. 
In this case $\\theta$ is a vector with two values: the first is the mean ($\\mu$) parameter and the second is the variance ($\\sigma^2$) parameter.\n\\begin{align*}\nL(\\theta) &= \\prod_{i=1}^n f(X_i|\\theta) \\\\ \n &=\\prod_{i=1}^n \\frac{1}{\\sqrt{2\\pi\\theta_1}} e^{-\\frac{(x_i - \\theta_0)^2}{2\\theta_1}} && \\text{Likelihood for a continuous variable is the PDF}\\\\\n LL(\\theta) &= \\sum_{i=1}^n \\log \\frac{1}{\\sqrt{2\\pi\\theta_1}} e^{-\\frac{(x_i - \\theta_0)^2}{2\\theta_1}} && \\text{We want to calculate log likelihood} \\\\\n &= \\sum_{i=1}^n\\left[ \n - \\log(\\sqrt{2\\pi\\theta_1}) - \\frac{1}{2\\theta_1}(x_i - \\theta_0)^2 \\right]\n\\end{align*}\nAgain, the last step of MLE is to choose values of $\\theta$ that maximize the log likelihood function. In this case we can calculate the partial derivative of the $LL$ function with respect to both $\\theta_0$ and $\\theta_1$, set both equations equal to 0 and then solve for the values of $\\theta$. Doing so results in the equations for the values $\\hat{\\mu} = \\hat{\\theta_0}$ and $\\hat{\\sigma^2} = \\hat{\\theta_1}$ that maximize likelihood. The result is: $\\hat{\\mu} = \\frac{1}{n}\\sum_{i=1}^n x_i$ and $\\hat{\\sigma^2} = \\frac{1}{n}\\sum_{i=1}^n (x_i - \\hat{\\mu})^2$.\n\nLinear Transform Plus Noise\nMLE is an algorithm that can be used for any probability model with a differentiable likelihood function. As an example let's estimate the parameter $\\theta$ in a model where there is a random variable $Y$ such that $Y = \\theta X + Z$, $Z \\sim N(0, \\sigma^2)$ and $X$ has an unknown distribution. \n\nIn the case where you are told the value of $X$, $\\theta X$ is a number and $\\theta X + Z$ is the sum of a gaussian and a number. This implies that $Y|X \\sim N(\\theta X, \\sigma^2)$. Our goal is to choose a value of $\\theta$ that maximizes the likelihood of the IID data: $(X_1, Y_1), (X_2, Y_2), \\dots (X_n, Y_n)$. 
\n\nWe approach this problem by first finding a function for the log likelihood of the data given $\\theta$. Then we find the value of $\\theta$ that maximizes the log likelihood function. To start, use the PDF of a Normal to express the probability of $Y|X,\\theta$:\n\\begin{align*}\n f(Y_i | X_i , \\theta) = \\frac{1}{\\sqrt{2\\pi} \\sigma} e^{-\\frac{(Y_i - \\theta X_i)^2}{2\\sigma^2}}\n\\end{align*}\n\nNow we are ready to write the likelihood function, then take its log to get the log likelihood function:\n\\begin{align*}\nL(\\theta) &= \\prod_{i=1}^n f(Y_i, X_i | \\theta) && \\text{Let's break up this joint}\\\\\n&= \\prod_{i=1}^n f(Y_i | X_i, \\theta)f(X_i ) && f(X_i) \\text{ is independent of } \\theta\\\\\n &= \\prod_{i=1}^n\\frac{1}{\\sqrt{2\\pi} \\sigma} e^{-\\frac{(Y_i - \\theta X_i)^2}{2\\sigma^2}}f(X_i) &&\\text{Substitute in the definition of }f(Y_i | X_i)\n\\end{align*}\n\\begin{align*}\n LL(\\theta) &= \\log L(\\theta) \\\\\n &= \\log \\prod_{i=1}^n \\frac{1}{\\sqrt{2\\pi} \\sigma} e^{-\\frac{(Y_i - \\theta X_i)^2}{2\\sigma^2}} f(X_i) &&\\text{Substitute in }L(\\theta)\\\\\n &= \\sum_{i=1}^n \\log \\frac{1}{\\sqrt{2\\pi} \\sigma} e^{-\\frac{(Y_i - \\theta X_i)^2}{2\\sigma^2}} + \\sum_{i=1}^n \\log f(X_i) &&\\text{Log of a product is the sum of logs}\\\\\n &=n \\log \\frac{1}{\\sqrt{2\\pi} \\sigma} - \\frac{1}{2\\sigma^2} \\sum_{i=1}^n (Y_i - \\theta X_i)^2 + \\sum_{i=1}^n \\log f(X_i )\n\\end{align*}\n\nRemove positive constant multipliers and terms that don't include $\\theta$. We are left with trying to find a value of $\\theta$ that maximizes:\n\\begin{align*}\n \\hat{\\theta} &= \\underset{\\theta}{\\operatorname{argmax}} - \\sum_{i=1}^n (Y_i - \\theta X_i)^2\\\\\n&= \\underset{\\theta}{\\operatorname{argmin}} \\sum_{i=1}^n (Y_i - \\theta X_i)^2\n\\end{align*}\nThis result says that the value of $\\theta$ that makes the data most likely is one that minimizes the squared error of predictions of $Y$. 
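For this one-parameter model the argmin has a closed form: setting the derivative of the squared error with respect to $\\theta$ to zero gives $\\hat{\\theta} = \\sum_i X_i Y_i / \\sum_i X_i^2$. The sketch below checks that formula on simulated data. The function name estimate_theta, the value true_theta, the noise level, and the choice of a uniform distribution for $X$ are all assumptions made purely for illustration.

```python
import random

def estimate_theta(xs, ys):
    # minimizer of sum (y - theta * x)^2, found by setting the
    # derivative with respect to theta equal to zero
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

random.seed(0)
true_theta = 3.0  # hypothetical value used only to simulate data
xs = [random.uniform(0, 10) for _ in range(1000)]
# Y = theta * X + Z, where Z is gaussian noise with standard deviation 1
ys = [true_theta * x + random.gauss(0, 1) for x in xs]
theta_hat = estimate_theta(xs, ys)
print(theta_hat)
```

With 1,000 simulated pairs the estimate should land close to the value used to generate the data, and it gets closer as more pairs are added.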
We will see in a few days that this is the basis for linear regression.\n"}, {"id": "map", "title": "Maximum A Posteriori", "url": "part5/map", "text": "\n \nMaximum A Posteriori\n\nMLE is great, but it is not the only way to estimate parameters! This section introduces an alternate algorithm, Maximum A Posteriori (MAP).The paradigm of MAP is that we should chose the value for our parameters that is the most likely given the data. At first blush this might seem the same as MLE, however notice that MLE chooses the value of parameters that makes the \\emph{data} most likely. Formally, for IID random variables $X_1, \\dots, X_n$:\n\\begin{align*}\n\\theta_{\\text{MAP}} =& \\underset{\\theta}{\\operatorname{argmax }} \\text{ } f(\\theta | X_1, X_2, \\dots X_n) \n\\end{align*}\nIn the equation above we trying to calculate the conditional probability of unobserved random variables given observed random variables. When that is the case, think Bayes Theorem! Expand the function $f$ using the continuous version of Bayes Theorem.\n\\begin{align*}\n\\theta_{\\text{MAP}} =& \\underset{\\theta}{\\operatorname{argmax }} \\text{ } f(\\theta | X_1, X_2, \\dots X_n) && \\text{Now apply Bayes Theorem}\\\\\n=& \\underset{\\theta}{\\operatorname{argmax }} \\text{ } \\frac{f(X_1, X_2, \\dots, X_n | \\theta) g(\\theta)}{h(X_1, X_2, \\dots X_n)} && \\text{Ahh much better}\n\\end{align*}\nNote that $f, g$ and $h$ are all probability densities. I used different symbols to make it explicit that they may have different functions. Now we are going to leverage two observations. First, the data is assumed to be IID so we can decompose the density of the data given $\\theta$. Second, the denominator is a constant with respect to $\\theta$. As such its value does not affect the argmax and we can drop that term. 
Mathematically:\n\\begin{align*}\n\\theta_{\\text{MAP}} =& \\underset{\\theta}{\\operatorname{argmax }} \\text{ } \\frac{\\prod_{i=1}^n f(X_i | \\theta) g(\\theta)}{h(X_1, X_2, \\dots X_n)} && \\text{Since the samples are IID}\\\\\n=& \\underset{\\theta}{\\operatorname{argmax }} \\text{ } \\prod_{i=1}^n f(X_i | \\theta) g(\\theta) && \\text{Since $h$ is constant with respect to $\\theta$}\n\\end{align*}\nAs before, it will be more convenient to find the argmax of the log of the MAP function, which gives us the final form for MAP estimation of parameters.\n\\begin{align*}\n\\theta_{\\text{MAP}} =& \\underset{\\theta}{\\operatorname{argmax }} \\text{ } \\left( \\log (g(\\theta)) + \\sum_{i=1}^n \\log(f(X_i | \\theta)) \\right)\n\\end{align*}\nUsing Bayesian terminology, the MAP estimate is the mode of the \"posterior\" distribution for $\\theta$. If you look at this equation side by side with the MLE equation you will notice that MAP is the argmax of the exact same function \\emph{plus} a term for the log of the prior.\n\nParameter Priors\nIn order to get ready for the world of MAP estimation, we are going to need to brush up on our distributions. We will need reasonable distributions for each of our different parameters. For example, if you are predicting a Poisson distribution, what is the right random variable type for the prior of $\\lambda$? \n\nA desiderata for prior distributions is that the resulting posterior distribution has the same functional form. We call these \"conjugate\" priors. 
In the case where you are updating your belief many times, conjugate priors makes programming in the math equations much easier.\n\nHere is a list of different parameters and the distributions most often used for their priors:\n\\begin{align*}\n&\\text{Parameter }&& \\text{Distribution}\\\\\n&\\text{Bernoulli } p && \\text{Beta}\\\\\n&\\text{Binomial } p && \\text{Beta}\\\\\n&\\text{Poisson } \\lambda && \\text{Gamma}\\\\\n&\\text{Exponential } \\lambda && \\text{Gamma}\\\\\n&\\text{Multinomial } p_i && \\text{Dirichlet}\\\\\n&\\text{Normal } \\mu && \\text{Normal}\\\\\n&\\text{Normal } \\sigma^2 && \\text{Inverse Gamma}\n\\end{align*}\nYou are only expected to know the new distributions on a high level. You do not need to know Inverse Gamma. I included it for completeness.\n\nThe distributions used to represent your \"prior\" belief about a random variable will often have their own parameters. For example, a Beta distribution is defined using two parameters $(a, b)$. Do we have to use parameter estimation to evaluate $a$ and $b$ too? No. Those parameters are called \"hyperparameters\". That is a term we reserve for parameters in our model that we fix before running parameter estimate. Before you run MAP you decide on the values of $(a, b)$. \n\nDirichlet\nThe Dirichlet distribution generalizes Beta in same way Multinomial generalizes Bernoulli. A random variable $X$ that is Dirichlet is parametrized as $X \\sim \\text{Dirichlet}(a_1, a_2, \\dots, a_m)$. The PDF of the distribution is:\n\\begin{align*}\nf(X_1 = x_1, X_2 = x_2, \\dots, X_m = x_m) = K \\prod_{i=1}^m x_i^{a_i - 1}\n\\end{align*}\nWhere $K$ is a normalizing constant.\n\nYou can intuitively understand the hyperparameters of a Dirichlet distribution: imagine you have seen $\\sum_{i=1}^m a_i - m$ imaginary trials. In those trials you had $(a_i - 1)$ outcomes of value $i$. 
As an example consider estimating the probability of getting different numbers on a six-sided skewed die (where each side is a different shape). We will estimate the probabilities of rolling each side of this die by rolling it $n$ times. This will produce $n$ IID samples. For the MAP paradigm, we are going to need a prior on our belief of each of the parameters $p_1 \\dots p_6$. We want to express that we lightly believe that each roll is equally likely. \n\nBefore you roll, let's imagine you had rolled the die six times and had gotten one of each possible value. Thus the \"prior\" distribution would be Dirichlet($2, 2, 2, 2, 2, 2$). After observing $n_1 + n_2 + \\dots + n_6$ new trials with $n_i$ results of outcome $i$, the \"posterior\" distribution is Dirichlet($2 + n_1, \\dots, 2 + n_6$). Using a prior which represents one imagined observation of each outcome is called \"Laplace smoothing\" and it guarantees that none of your probabilities are 0 or 1.\n\nGamma\nThe Gamma($k, \\theta$) distribution is the conjugate prior for the $\\lambda$ parameter of the Poisson distribution (it is also the conjugate prior for the Exponential, but we won't delve into that). \n\nThe hyperparameters can be interpreted as: you saw $k$ total imaginary events during $\\theta$ imaginary time periods. After observing $n$ events during the next $t$ time periods the posterior distribution is Gamma($k + n, \\theta + t$).\n\nFor example Gamma(10, 5) would represent having seen 10 imaginary events in 5 time periods. It is like imagining a rate of 2 with some degree of confidence. If we start with that Gamma as a prior and then see 11 events in the next 2 time periods, our posterior is Gamma(21, 7), which is equivalent to an updated rate of 3. 
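\n\nAs a quick check of that update, here is the arithmetic written out. This is only a sketch, and it assumes (as in the interpretation above) that the second hyperparameter counts time periods, so the implied rate is events divided by periods:\n\\begin{align*}\n\\text{prior rate} &= \\frac{k}{\\theta} = \\frac{10}{5} = 2 \\\\\n\\text{posterior} &= \\text{Gamma}(k + n, \\theta + t) = \\text{Gamma}(10 + 11, 5 + 2) = \\text{Gamma}(21, 7) \\\\\n\\text{posterior rate} &= \\frac{k + n}{\\theta + t} = \\frac{21}{7} = 3\n\\end{align*}\nThe 11 observed events in 2 periods (an observed rate of 5.5) pull the estimate up from 2, but the 5 imaginary periods in the prior keep it well below 5.5.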
\n"}, {"id": "machine_learning", "title": "Machine Learning", "url": "part5/machine_learning", "text": "\n \nMachine Learning\n\nMachine Learning is the subfield of computer science that gives computers the ability to perform tasks without being explicitly programmed. There are several different tasks that fall under the domain of machine learning and several different algorithms for \"learning\". In this chapter, we are going to focus on Classification and two classic Classification algorithms: Naive Bayes and Logistic Regression.\n\nClassification\nIn classification tasks, your job is to use training data with feature/label pairs $(\\mathbf{x}, y)$ in order to estimate a function $\\hat{y} = g(\\mathbf{x})$. This function can then be used to make a prediction. In classification the value of $y$ takes on one of a discrete number of values. As such we often choose $g(\\mathbf{x}) = \\underset{y}{\\operatorname{argmax }}\\text{ }\\hat{P}(Y=y|\\mathbf{X})$.\n\n In the classification task you are given $N$ training pairs:\n $(\\mathbf{x}^{(1)},y^{(1)}), (\\mathbf{x}^{(2)},y^{(2)}), \\dots , (\\mathbf{x}^{(N)},y^{(N)})$\n Where $\\mathbf{x}^{(i)}$ is a vector of $m$ discrete features for the $i$th training example and $y^{(i)}$ is the discrete label for the $i$th training example.\n \n In our introduction to machine learning, we are going to assume that all values in our training data-set are binary. While this is not a necessary assumption (both naive Bayes and logistic regression can work for non-binary data), it makes it much easier to learn the core concepts. Specifically we assume that all labels are binary $y^{(i)} \\in \\{0, 1\\} \\text{ }\\forall i$ and all features are binary $x^{(i)}_j \\in \\{0, 1\\} \\text{ }\\forall i, j$.\n"}, {"id": "naive_bayes", "title": "Na\u00efve Bayes", "url": "part5/naive_bayes", "text": "\n \nNa\u00efve Bayes\n\nNaive Bayes is a Machine Learning algorithm for the \"classification task\". 
It makes the substantial assumption (called the Naive Bayes assumption) that all features are independent of one another, given the classification label. This assumption is wrong, but allows for a fast algorithm that is often useful. In order to implement Naive Bayes you will need to learn how to train your model and how to use it to make predictions, once trained.\n \n\n \nTraining (aka Parameter Estimation)\nThe objective in training is to estimate the probabilities $P(Y)$ and $P(X_i | Y)$ for all $0 < i \\leq m$ features. We use the symbol $\\hat{p}$ to make it clear that the probability is an estimate.\n \n Using an MLE estimate:\n \\begin{align*}\n \\hat{p}(X_i = x_i | Y = y) = \\frac{ (\\text{# training examples where $X_i = x_i$ and $Y = y$})}{(\\text{# training examples where $Y = y$})}\n \\end{align*}\n \n Using a Laplace MAP estimate:\n \\begin{align*}\n \\hat{p}(X_i = x_i | Y = y) = \\frac{ (\\text{# training examples where $X_i = x_i$ and $Y = y$}) + 1 }{(\\text{# training examples where $Y = y$}) + 2}\n \\end{align*}\n \n The prior probability of $Y$ trained using an MLE estimate:\n \\begin{align*}\n \\hat{p}(Y = y) = \\frac{ (\\text{# training examples where $Y = y$})}{(\\text{# training examples})}\n \\end{align*}\n \nPrediction\nFor an example with $\\mathbf{x} = [x_1, x_2, \\dots , x_m]$, estimate the value of $y$ as:\n \\begin{align*}\n \\hat{y} &= \n \\argmax_{y = \\{0, 1\\}} \\text{ }\n \\log \\hat{p}(Y = y) +\n \\sum_{i=1}^m \\log \\hat{p}(X_i = x_i | Y = y) \n \\end{align*}\n Note that when the number of features is small you may not need to use the log version of the argmax.\n \nTheory\nIn the world of classification when we make a prediction we want to choose the value of $y$ that maximizes $P(Y=y|\\mathbf{X})$.\n\\begin{align*}\n \\hat{y} \n &= \\argmax_{y = \\{0, 1\\}} P(Y = y|\\mathbf{X} = \\mathbf{x}) \n && \\text{Our objective}\\\\\n &= \\argmax_{y = \\{0, 1\\}} \\frac{P(Y=y)P(\\mathbf{X} =\\mathbf{x}| Y = y)}{P(\\mathbf{X} =\\mathbf{x})} 
\n && \\text{By Bayes' theorem}\\\\\n &= \\argmax_{y = \\{0, 1\\}} P(Y=y)P(\\mathbf{X} =\\mathbf{x}| Y = y) \n && \\text{Since $P(\\mathbf{X} =\\mathbf{x})$ is constant with respect to $Y$}\n \n\\end{align*}\nUsing our training data we could interpret the joint distribution of $\\mathbf{X}$ and $Y$ as one giant multinomial with a different parameter for every combination of $\\mathbf{X}=\\mathbf{x}$ and $Y=y$. If, for example, the input vectors are only length one (in other words $|\\mathbf{x}| = 1$) and the number of values that $x$ and $y$ can take on is small, say binary, this is a totally reasonable approach. We could estimate the multinomial using MLE or MAP estimators and then calculate argmax over a few lookups in our table.\n\nThe bad times hit when the number of features becomes large. Recall that our multinomial needs to estimate a parameter for every unique combination of assignments to the vector $\\mathbf{x}$ and the value $y$. If there are $|\\mathbf{x}| = n$ binary features then this strategy is going to take order $\\mathcal{O}(2^n)$ space and there will likely be many parameters that are estimated without any training data that matches the corresponding assignment.\n\n\n\nNaive Bayes Assumption\n The Na\u00efve Bayes Assumption is that each feature of $\\mathbf{x}$ is independent of one another given $y$. \n \n\n \nThe Na\u00efve Bayes Assumption is wrong, but useful. This assumption allows us to make predictions using space and data which are linear with respect to the number of features: $\\mathcal{O}(n)$ if $|\\mathbf{x}| = n$. That allows us to train and make predictions for huge feature spaces such as one which has an indicator for every word on the internet. 
Using this assumption the prediction algorithm can be simplified.\n\\begin{align*}\n \\hat{y} \n &= \\argmax\\limits_{y = \\{0, 1\\}} \\text{ }P(Y = y)P(\\mathbf{X} = \\mathbf{x}| Y = y) \n && \\text{As we last left off}\\\\\n &= \\argmax\\limits_{y = \\{0, 1\\}} \\text{ } P(Y= y)\\prod_i P(X_i = x_i| Y = y) \n &&\\text{Na\u00efve Bayes assumption}\\\\\n &= \\argmax\\limits_{y = \\{0, 1\\}} \\text{ } \\log P(Y = y) + \\sum_i \\log P(X_i = x_i| Y = y) \n && \\text{For numerical stability}\\\\\n\\end{align*}\nIn the last step we leverage the fact that the argmax of a function is equal to the argmax of the log of a function. This algorithm is fast and stable, both when training and when making predictions. \n\n\n\n"}, {"id": "log_regression", "title": "Logistic Regression", "url": "part5/log_regression", "text": "\n \nLogistic Regression\n\nLogistic Regression is a classification algorithm (I know, terrible name. Perhaps Logistic Classification would have been better) that works by trying to learn a function that approximates $\\P(y|x)$. It makes the central assumption that $\\P(y|x)$ can be approximated as a sigmoid function applied to a linear combination of input features. 
It is particularly important to learn because logistic regression is the basic building block of artificial neural networks.\n\nMathematically, for a single training datapoint $(\\mathbf{x}, y)$ Logistic Regression assumes:\n\\begin{align*}\nP(Y=1|\\mathbf{X}=\\mathbf{x}) &= \\sigma(z) \\text{ where } z = \\theta_0 + \\sum_{i=1}^m \\theta_i x_i\n\\end{align*}\nThis assumption is often written in the equivalent forms:\n\\begin{align*}\n P(Y=1|\\mathbf{X}=\\mathbf{x}) &=\\sigma(\\mathbf{\\theta}^T\\mathbf{x}) &&\\text{ where we always set $x_0$ to be 1}\\\\\n P(Y=0|\\mathbf{X}=\\mathbf{x}) &=1-\\sigma(\\mathbf{\\theta}^T\\mathbf{x}) &&\\text{ by the law of total probability}\n\\end{align*}\nUsing these equations for probability of $Y|X$ we can create an algorithm that selects values of theta that maximize that probability for all data. I am first going to state the log probability function and partial derivatives with respect to theta. Then later we will (a) show an algorithm that can choose optimal values of theta and (b) show how the equations were derived.\n\nAn important thing to realize is that, given the best values for the parameters ($\\theta$), logistic regression can often do a great job of estimating the probability of different class labels. However, given bad, or even random, values of $\\theta$ it does a poor job. The amount of \"intelligence\" that your logistic regression machine learning algorithm has depends on having good values of $\\theta$.\n\nNotation\nBefore we get started I want to make sure that we are all on the same page with respect to notation. In logistic regression, $\\theta$ is a vector of parameters of length $m$ and we are going to learn the values of those parameters based off of $n$ training examples. The number of parameters should be equal to the number of features of each datapoint. 
\n\nTwo pieces of notation that we use often in logistic regression that you may not be familiar with are:\n\\begin{align*}\n \\mathbf{\\theta}^T\\mathbf{x} &= \\sum_{i=1}^m \\theta_i x_i = \\theta_1 x_1 + \\theta_2 x_2 + \\dots + \\theta_m x_m && \\text{dot product, aka weighted sum}\\\\\n \\sigma(z) &= \\frac{1}{1+ e^{-z}} && \\text{sigmoid function}\n\\end{align*}\n\nLog Likelihood\nIn order to choose values for the parameters of logistic regression we use Maximum Likelihood Estimation (MLE). As such we are going to have two steps: (1) write the log-likelihood function and (2) find the values of $\\theta$ that maximize the log-likelihood function.\n\nThe labels that we are predicting are binary, and the output of our logistic regression function is supposed to be the probability that the label is one. This means that we can (and should) interpret each label as a Bernoulli random variable: $Y \\sim \\text{Bern}(p)$ where $p = \\sigma(\\theta^T \\textbf{x})$.\n\nTo start, here is a super slick way of writing the probability of one datapoint (recall this is the equation form of the probability mass function of a Bernoulli):\n\\begin{align*}\n P(Y=y | X = \\mathbf{x}) = \\sigma({\\mathbf{\\theta}^T\\mathbf{x}})^y \\cdot \\left[1 - \\sigma({\\mathbf{\\theta}^T\\mathbf{x}})\\right]^{(1-y)}\n \\end{align*}\n \nNow that we know the probability mass function, we can write the likelihood of all the data: \n \\begin{align*}\nL(\\theta) =& \\prod_{i=1}^n P(Y=y^{(i)} | X = \\mathbf{x}^{(i)}) && \\text{The likelihood of independent training labels}\\\\\n=& \\prod_{i=1}^n \\sigma({\\mathbf{\\theta}^T\\mathbf{x}^{(i)}})^{y^{(i)}} \\cdot \\left[1 - \\sigma({\\mathbf{\\theta}^T\\mathbf{x}^{(i)}})\\right]^{(1-y^{(i)})} && \\text{Substituting the likelihood of a Bernoulli}\n \\end{align*}\n And if you take the log of this function, you get the reported Log Likelihood for Logistic Regression. 
The log likelihood equation is:\n\\begin{align*}\n LL(\\theta) = \\sum_{i=1}^n y^{(i)} \\log \\sigma(\\mathbf{\\theta}^T\\mathbf{x}^{(i)}) + (1-y^{(i)}) \\log [1 - \\sigma(\\mathbf{\\theta}^T\\mathbf{x}^{(i)})]\n\\end{align*}\n\nRecall that in MLE the only remaining step is to choose parameters ($\\theta$) that maximize log likelihood.\n\nGradient of Log Likelihood\nNow that we have a function for log-likelihood, we simply need to choose the values of theta that maximize it. We can find the best values of theta by using an optimization algorithm. However, in order to use an optimization algorithm, we first need to know the partial derivative of log likelihood with respect to each parameter. First I am going to give you the partial derivative (so you can see how it is used). Then I am going to show you how to derive it:\n\\begin{align*}\n \\frac{\\partial LL(\\theta)}{\\partial \\theta_j} = \\sum_{i=1}^n \\left[\n y^{(i)} - \\sigma(\\mathbf{\\theta}^T\\mathbf{x}^{(i)})\n \\right] x_j^{(i)}\n\\end{align*}\n\nGradient Descent Optimization\nOur goal is to choose parameters ($\\theta$) that maximize likelihood, and we know the partial derivative of log likelihood with respect to each parameter. We are ready for our optimization algorithm. \n\nIn the case of logistic regression we can't solve for $\\theta$ mathematically. Instead we use a computer to choose $\\theta$. To do so we employ an algorithm called gradient descent (a classic in optimization theory). The idea behind gradient descent is that if you continuously take small steps downhill (in the direction of your negative gradient), you will eventually make it to a local minimum. In our case we want to maximize our likelihood. 
As you can imagine, minimizing the negative of our likelihood will be equivalent to maximizing our likelihood.\n\nThe update to our parameters that results in each small step can be calculated as:\n\\begin{align*}\n \\theta_j^{\\text{ new}} &= \\theta_j^{\\text{ old}} + \\eta \\cdot \\frac{\\partial LL(\\theta^{\\text{ old}})}{\\partial \\theta_j^{\\text{ old}}} \\\\\n&= \\theta_j^{\\text{ old}} + \\eta \\cdot \\sum_{i=1}^n \\left[\n y^{(i)} - \\sigma(\\mathbf{\\theta}^T\\mathbf{x}^{(i)})\n \\right] x_j^{(i)}\n\\end{align*}\nWhere $\\eta$ is the magnitude of the step size that we take. Because we are maximizing, each step moves in the direction of the gradient (hence the plus sign), which is why this variant is often called gradient ascent. If you keep updating $\\theta$ using the equation above, with a suitably small $\\eta$, you will converge on the best values of $\\theta$. You now have an intelligent model. Here is the gradient ascent algorithm for logistic regression in pseudo-code:\n\n\n\n\nPro-tip: Don't forget that in order to learn the value of $\\theta_0$ you can simply define $\\textbf{x}_0$ to always be 1.\n\nDerivations\nIn this section we provide the mathematical derivations for the gradient of log-likelihood. The derivations are worth knowing because these ideas are heavily used in Artificial Neural Networks. \n \nOur goal is to calculate the derivative of the log likelihood with respect to each theta. To start, here is the definition for the derivative of a sigmoid function with respect to its inputs:\n \\begin{align*}\n \\frac{\\partial}{\\partial z} \\sigma(z) = \\sigma(z)[1 - \\sigma(z)] && \\text{to get the derivative with respect to $\\theta$, use the chain rule}\n \\end{align*}\nTake a moment and appreciate the beauty of the derivative of the sigmoid function. The reason that sigmoid has such a simple derivative stems from the natural exponent in the sigmoid denominator.\n \nSince the log likelihood function is a sum over all of the data, and in calculus the derivative of a sum is the sum of derivatives, we can focus on computing the derivative of one example. 
The gradient of theta is simply the sum of this term for each training datapoint.\n \nFirst I am going to show you how to compute the derivative the hard way. Then we are going to look at an easier method. The derivative of log likelihood for one datapoint $(\\mathbf{x}, y)$:\n \\begin{align*}\n \\frac{\\partial LL(\\theta)}{\\partial \\theta_j} &= \\frac{\\partial }{\\partial \\theta_j} y \\log \\sigma(\\mathbf{\\theta}^T\\mathbf{x}) + \\frac{\\partial }{\\partial \\theta_j} (1-y) \\log [1 - \\sigma(\\mathbf{\\theta}^T\\mathbf{x})] && \\text{derivative of sum of terms}\\\\\n&=\\left[\\frac{y}{\\sigma(\\theta^T\\mathbf{x})} - \\frac{1-y}{1-\\sigma(\\theta^T\\mathbf{x})} \\right] \\frac{\\partial}{\\partial \\theta_j} \\sigma(\\theta^T \\mathbf{x}) &&\\text{derivative of log $f(x)$}\\\\\n&=\\left[\\frac{y}{\\sigma(\\theta^T\\mathbf{x})} - \\frac{1-y}{1-\\sigma(\\theta^T\\mathbf{x})} \\right] \\sigma(\\theta^T \\mathbf{x}) [1 - \\sigma(\\theta^T \\mathbf{x})]\\mathbf{x}_j && \\text{chain rule + derivative of sigma}\\\\\n&=\\left[\n\\frac{y - \\sigma(\\theta^T\\mathbf{x})}{\\sigma(\\theta^T \\mathbf{x}) [1 - \\sigma(\\theta^T \\mathbf{x})]}\n\\right] \\sigma(\\theta^T \\mathbf{x}) [1 - \\sigma(\\theta^T \\mathbf{x})]\\mathbf{x}_j && \\text{algebraic manipulation}\\\\\n&= \\left[y - \\sigma(\\theta^T\\mathbf{x}) \\right] \\mathbf{x}_j && \\text{cancelling terms}\n \\end{align*}\n \nDerivatives Without Tears\nThat was the hard way. Logistic regression is the building block of Artificial Neural Networks. If we want to scale up, we are going to have to get used to an easier way of calculating derivatives. For that we are going to have to welcome back our old friend the chain rule. 
By the chain rule:\n\\begin{align*}\n\\frac{\\partial LL(\\theta)}{\\partial \\theta_j} &= \n \\frac{\\partial LL(\\theta)}{\\partial p} \n \\cdot \\frac{\\partial p}{\\partial \\theta_j}\n && \\text{Where } p = \\sigma(\\theta^T\\textbf{x})\\\\\n&= \n \\frac{\\partial LL(\\theta)}{\\partial p} \n \\cdot \\frac{\\partial p}{\\partial z} \n \\cdot \\frac{\\partial z}{\\partial \\theta_j}\n && \\text{Where } z = \\theta^T\\textbf{x}\n \\end{align*}\nChain rule is the decomposition mechanism of calculus. It allows us to calculate a complicated partial derivative ($\\frac{\\partial LL(\\theta)}{\\partial \\theta_j}$) by breaking it down into smaller pieces.\n\\begin{align*}\nLL(\\theta) &= y \\log p + (1-y) \\log (1 - p) && \\text{Where } p = \\sigma(\\theta^T\\textbf{x}) \\\\\n \\frac{\\partial LL(\\theta)}{\\partial p} &= \\frac{y}{p} - \\frac{1-y}{1-p} && \\text{By taking the derivative}\n\\end{align*}\n\n\\begin{align*}\np &= \\sigma(z) && \\text{Where }z = \\theta^T\\textbf{x}\\\\\n\\frac{\\partial p}{\\partial z} &= \\sigma(z)[1- \\sigma(z)] && \\text{By taking the derivative of the sigmoid}\n\\end{align*}\n\n\\begin{align*}\n z &= \\theta^T\\textbf{x} && \\text{As previously defined}\\\\\n \\frac{\\partial z}{\\partial \\theta_j} &= \\textbf{x}_j && \\text{ Only $\\textbf{x}_j$ interacts with $\\theta_j$}\n \\end{align*}\nEach of those derivatives was much easier to calculate. 
Now we simply multiply them together.\n\\begin{align*}\n\\frac{\\partial LL(\\theta)}{\\partial \\theta_j} &=\n \\frac{\\partial LL(\\theta)}{\\partial p} \n \\cdot \\frac{\\partial p}{\\partial z} \n \\cdot \\frac{\\partial z}{\\partial \\theta_j} \\\\\n &=\n\\Big[\\frac{y}{p} - \\frac{1-y}{1-p}\\Big]\n \\cdot \\sigma(z)[1- \\sigma(z)]\n \\cdot \\textbf{x}_j && \\text{By substituting in for each term} \\\\\n&=\n\\Big[\\frac{y}{p} - \\frac{1-y}{1-p}\\Big]\n \\cdot p[1- p]\n \\cdot \\textbf{x}_j && \\text{Since }p = \\sigma(z)\\\\\n &=\n[y(1-p) - p(1-y)]\n \\cdot \\textbf{x}_j && \\text{Multiplying in} \\\\\n &= [y - p]\\textbf{x}_j && \\text{Expanding} \\\\\n&= [y - \\sigma(\\theta^T\\textbf{x})]\\textbf{x}_j && \\text{Since } p = \\sigma(\\theta^T\\textbf{x})\n \\end{align*}\n"}, {"id": "mle_demo", "title": "MLE Normal Demo", "url": "examples/mle_demo", "text": "\n \nMLE Normal Demo\n\nLet's manually perform maximum likelihood estimation. Your job is to choose parameter values that make the data look as likely as possible. Here are the 20 data points, which we assume come from a Normal distribution.\n\nData = [6.3, 5.5, 5.4, 7.1, 4.6,\n 6.7, 5.3, 4.8, 5.6, 3.4,\n 5.4, 3.4, 4.8, 7.9, 4.6,\n 7.0, 2.9, 6.4, 6.0, 4.3]\n\nChoose your parameter estimates\n\n\nParameter $\\mu$: \n\n\nParameter $\\sigma$: \n\n\nLikelihood of the data given your params\nLikelihood: \n\t\tLog Likelihood: \n\t\tBest Seen: \n\nYour Gaussian\n\n\n"}, {"id": "mixture_models", "title": "MLE Mixture Model", "url": "examples/mixture_models", "text": "\n \nGaussian Mixtures\n\nErrata: This example was first written at 1:00p on Nov 10th. During class we noticed a mistake in the derivative (incorrectly assumed log of sum was sum of log). It was updated soon after class on Nov 10th. 
\n\n\nData = [6.47, 5.82, 8.7, 4.76, 7.62, 6.95, 7.44, 6.73, 3.38, 5.89, 7.81, 6.93, 7.23, 6.25, 5.31, 7.71, 7.42, 5.81, 4.03, 7.09, 7.1, 7.62, 7.74, 6.19, 7.3, 7.37, 6.99, 2.97, 3.3, 7.08, 6.23, 3.67, 3.05, 6.67, 6.5, 6.08, 3.7, 6.76, 6.56, 3.61, 7.25, 7.34, 6.27, 6.54, 5.83, 6.44, 5.34, 7.7, 4.19, 7.34]\n\n\n\nParameter $t$: \n\n\nParameter $\\mu_a$: \n\n\nParameter $\\sigma_a$: \n\n\nParameter $\\mu_b$: \n\n\nParameter $\\sigma_b$: \n\n\n\nLikelihood: \n Log Likelihood: \n Best Seen: \nWhat is a Gaussian Mixture?\nA Gaussian Mixture describes a random variable whose PDF could come from one of two Gaussians (or more, but we will just use two in this demo). There is a certain probability the sample will come from the first Gaussian, otherwise it comes from the second. It has five parameters: four to describe the two Gaussians and one to describe the relative weighting of the two Gaussians.\n\nGenerative Code\nfrom scipy import stats\n\ndef sample():\n    # choose group membership\n    membership = stats.bernoulli.rvs(0.2)\n    if membership == 1:\n        # sample from gaussian 1\n        return stats.norm.rvs(3.5, 0.7)\n    else:\n        # sample from gaussian 2\n        return stats.norm.rvs(6.8, 0.7)\nProbability Density Function\n\n\\begin{align*}\nf(X=x) = t \\cdot f(A=x) + (1-t) \\cdot f(B=x)\n\\end{align*}\n$$\\text{st}$$\n$$A \\sim N(\\mu_a, \\sigma_a^2)$$\n$$B \\sim N(\\mu_b, \\sigma_b^2)$$\nPutting it all together, the PDF of a Gaussian Mixture is:\n\\begin{align*}\nf(x) = t \\cdot \n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_a} e^{-\\frac{1}{2}(\\frac{x-\\mu_a}{\\sigma_a})^2}\\Big)\n\n+ (1-t) \\cdot\n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_b} e^{-\\frac{1}{2}(\\frac{x-\\mu_b}{\\sigma_b})^2}\\Big)\n\\end{align*}\n\n\nMLE for Gaussian Mixture\nSpecial note: even though the generative story has a Bernoulli (group membership), it is never observed. MLE maximizes the likelihood of the observed data.\n\nLet $\\vec{\\theta} = [t, \\mu_a,\\mu_b,\\sigma_a, \\sigma_b]$ be the parameters. 
Because the math will get long I will use $\\theta$ as notation in place of $\\vec{\\theta}$. Just keep in mind that it is a vector.\n\nThe MLE idea is to choose values of $\\theta$ which maximize log likelihood. Gradient-based optimization methods require us to calculate the partial derivatives of the thing we want to optimize (log likelihood) with respect to the values we can change (our parameters).\n\n\n Likelihood function\n\n\\begin{align*}\nL(\\theta) &= \\prod_i^n f(x_i | \\theta) \\\\\n&= \\prod_i^n \\Big[t \\cdot \n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_a} e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2}\\Big)\n+ (1-t) \\cdot\n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_b} e^{-\\frac{1}{2}(\\frac{x_i-\\mu_b}{\\sigma_b})^2}\\Big)\n\\Big]\n\\end{align*}\n\n Log Likelihood function\n\n\\begin{align*}\nLL(\\theta) \n&= \\log L(\\theta) \\\\\n&= \\log \\prod_i^n f(x_i | \\theta) \\\\\n\n&= \\sum_i^n \\log f(x_i | \\theta) \\\\\n\n\\end{align*}\nThat is sufficient for now, but if you wanted to expand out the term you would get:\n \\begin{align*}\nLL(\\theta) &= \\sum_i^n \\log \\Big[t \\cdot \n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_a} e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2}\\Big)\n+ (1-t) \\cdot\n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_b} e^{-\\frac{1}{2}(\\frac{x_i-\\mu_b}{\\sigma_b})^2}\\Big)\n\\Big] \n\\end{align*}\n\n\n Derivative of LL with respect to $\\theta$\nHere is an example of calculating a partial derivative with respect to one of the parameters, $\\mu_a$. You would need a derivative like this for all parameters. \n\n Caution: When I first wrote this demo I thought it would be a simple derivative. It is not so simple because the log has a sum in it. As such the log term doesn't reduce. The log still serves to make the outer $\\prod$ into a $\\sum$. As such the $LL$ partial derivatives are solvable, but the proof uses quite a lot of chain rule. 
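\nTo see why the log does not reduce, it can help to compare with a single Gaussian. This is only a sketch using the same symbols as above; $f_A$ and $f_B$ are shorthand introduced here for the two Gaussian densities evaluated at $x_i$:\n\\begin{align*}\n\\log f_A &= -\\log(\\sqrt{2\\pi}\\sigma_a) - \\frac{1}{2}\\Big(\\frac{x_i-\\mu_a}{\\sigma_a}\\Big)^2 && \\text{single Gaussian: the log cancels the exponential}\\\\\n\\log \\big[t \\cdot f_A + (1-t) \\cdot f_B\\big] &\\neq \\log(t \\cdot f_A) + \\log((1-t) \\cdot f_B) && \\text{mixture: in general the log of a sum is not a sum of logs}\n\\end{align*}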
\nTakeaway: The main takeaway from this section (in case you want to skip the derivative proof) is that the resulting derivative is complex enough that we will want a way to compute argmax without having to set that derivative equal to zero and solve for $\\mu_a$. Enter gradient descent!\nA good first step when doing a huge derivative of a log likelihood function is to think of the derivative for the log of likelihood of a single datapoint. This is the term inside the sum in the log likelihood expression:\n $$\n \\frac{\\d}{\\d \\mu_a} \\log f(x_i|\\theta) \n $$\n\n Before we start: notice that $\\mu_a$ does not show up in this term from $f(x_i|\\theta) $:\n $$(1-t) \\cdot\n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_b} e^{-\\frac{1}{2}(\\frac{x_i-\\mu_b}{\\sigma_b})^2}\\Big) =K $$\nIn the proof, when we encounter this term, we are going to think of it as a constant which we call $K$. Ok, let's go for it!\n\n\\begin{align*}\n\\frac{\\d}{\\d \\mu_a} & \\log f(x_i|\\theta) \\\\\n\n&= \\frac{1}{f(x_i|\\theta)} \\frac{\\d}{\\d \\mu_a} f(x_i|\\theta) &&\\text{chain rule on }\\log \\\\\n\n&= \\frac{1}{f(x_i|\\theta)} \\frac{\\d}{\\d \\mu_a} \\Big[t \\cdot \n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_a} e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2}\\Big)\n+ K\\Big] &&\\text{substitute in }f(x_i|\\theta)\\\\\n\n&= \\frac{1}{f(x_i|\\theta)} \\frac{\\d}{\\d \\mu_a} \\Big[t \\cdot \n\\Big(\\frac{1}{\\sqrt{2\\pi}\\sigma_a} e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2}\\Big)\\Big] &&\\frac{\\d}{\\d \\mu_a} K = 0\\\\\n\n&= \\frac{t}{f(x_i|\\theta) \\sqrt{2\\pi}\\sigma_a} \\cdot \\frac{\\d}{\\d \\mu_a} e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2} &&\\text{pull out const}\\\\\n\n&= \\frac{t}{f(x_i|\\theta) \\sqrt{2\\pi}\\sigma_a} \\cdot e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2} \\cdot \\frac{\\d}{\\d \\mu_a} -\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2&&\\text{chain on }e^x \\\\\n\n&= \\frac{t}{f(x_i|\\theta) \\sqrt{2\\pi}\\sigma_a} \\cdot 
e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2} \\cdot \\Big[-(\\frac{x_i-\\mu_a}{\\sigma_a}) \\frac{\\d}{\\d \\mu_a} (\\frac{x_i-\\mu_a}{\\sigma_a})\\Big]&&\\text{chain on }x^2 \\\\\n\n&= \\frac{t}{f(x_i|\\theta) \\sqrt{2\\pi}\\sigma_a} \\cdot e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2} \\cdot \\Big[ -(\\frac{x_i-\\mu_a}{\\sigma_a}) \\cdot \\frac{-1}{\\sigma_a}\\Big] &&\\text{final derivative} \\\\\n\n&= \\frac{t}{f(x_i|\\theta) \\sqrt{2\\pi}\\sigma_a^3} \\cdot e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2} \\cdot (x_i-\\mu_a) &&\\text{simplify} \\\\\n\\end{align*}\n\nThat was for a single data-point. For the full dataset:\n\\begin{align*}\n\\frac{\\d LL(\\theta)}{\\d \\mu_a} \n&= \\sum_i^n \n\\frac{\\d}{\\d \\mu_a} \\log f(x_i|\\theta)\\\\\n&= \\sum_i^n \n\\frac{t}{f(x_i|\\theta) \\sqrt{2\\pi}\\sigma_a^3} \\cdot e^{-\\frac{1}{2}(\\frac{x_i-\\mu_a}{\\sigma_a})^2} \\cdot (x_i-\\mu_a)\n\\end{align*}\n\nThis process should be repeated for all five parameters! Now, how should we find a value of $\\mu_a$, which, in the presence of the other settings to parameters, and the data, makes this derivative zero? Setting the derivative = 0 and solving for $\\mu_a$ is not going to work.\n\nUse an Optimizer to Estimate Params\nOnce we have a $LL$ function and the derivative of $LL$ with respect to each parameter we are ready to compute argmax using an optimizer. In this case the best choice would probably be gradient ascent (or gradient descent with negative log likelihood).\n\n\\begin{align*}\n\\nabla_{\\theta} LL(\\theta)\n&=\\begin{bmatrix}\n\\frac{\\d LL(\\theta)}{\\d t}\\\\\n\\frac{\\d LL(\\theta)}{\\d \\mu_a}\\\\\n\\frac{\\d LL(\\theta)}{\\d \\mu_b}\\\\\n\\frac{\\d LL(\\theta)}{\\d \\sigma_a}\\\\\n\\frac{\\d LL(\\theta)}{\\d \\sigma_b}\\\\\n\\end{bmatrix}\n\\end{align*}\n\n\n\n\n\n"}]