{"id":184,"date":"2016-05-19T13:14:25","date_gmt":"2016-05-19T13:14:25","guid":{"rendered":"http:\/\/deberker.com\/archy\/?p=184"},"modified":"2021-11-06T16:05:52","modified_gmt":"2021-11-06T16:05:52","slug":"deep-learning-textbook-chapter-2-probability","status":"publish","type":"post","link":"https:\/\/deberker.com\/archy\/deep-learning-textbook-chapter-2-probability\/","title":{"rendered":"Deep Learning Textbook Chapter 2: Probability"},"content":{"rendered":"<p><em> As part of the Max Planck Centre&#8217;s Deep Learning reading group, I&#8217;m working my way through the soon-to-be released <a href=\"http:\/\/www.deeplearningbook.org\/\">Deep Learning<\/a> textbook. These posts are my informal notes for each chapter. <a href=\"http:\/\/deberker.com\/archy\/?p=156\">Find my notes from chapter 1 here.<\/a>\u00a0Thanks again to Zeb &amp; Toby for edits.<\/em><\/p>\n<p>This week we were back on slightly firmer ground, dealing with the chapter on probability. Most of this chapter will be fairly familiar to those who have done some statistics and\/or computational modelling. My notes are correspondingly sparse.<\/p>\n<p><strong>Flavours of uncertainty in\u00a0modelling<\/strong><\/p>\n<ol>\n<li>Inherent stochasticity, e.g. in the outcome of a coin flip (equivalent to irreducible uncertainty, which we wrote about <a href=\"http:\/\/www.nature.com\/ncomms\/2016\/160329\/ncomms10996\/full\/ncomms10996.html\">here<\/a>)<\/li>\n<li>Incomplete observability &#8211; we don&#8217;t have enough information to fully specify what&#8217;s going to happen<\/li>\n<li>Incomplete modelling &#8211;\u00a0all models are wrong, some are useful (George Box). All models sacrifice some precision in order to make useful simplifications of the subject under study; these simplifications bring with them some uncertainty.<\/li>\n<\/ol>\n<p><strong>Frequentist vs Bayesian probability<\/strong><\/p>\n<p>The textbook\u00a0has\u00a0a nice delineation of two ways of thinking about probability.<\/p>\n<p>The frequentist tradition maintains that probability is the percentage of times that something has happened divided by all the times it might have happened, and, in the strongest interpretation, that these are the only kind of probabilities it makes sense to talk about. \u00a0We can therefore derive probabilities for rain on a day in June, but we can&#8217;t derive probabilities for the likelihood of the sun exploding, or for the probability of my having cancer.<\/p>\n<p>A Bayesian treatment, however, allows us to use probabilities to describe various degrees of <em>certainty<\/em>, or degrees of <em>belief<\/em>. This is far more useful from a neuroscientific perspective, and allows us to do all sorts of neat stuff, like <a href=\"http:\/\/www.nature.com\/nature\/journal\/v415\/n6870\/abs\/415429a.html\">integrate sensory information optimally <\/a>or <a href=\"http:\/\/people.hss.caltech.edu\/~pbs\/expfinance\/Readings\/BehrensRushworthNN2007.pdf\">learn at appropriate rates.<\/a><\/p>\n<p>Bayesian statistics also allows us to integrate priors &#8211; information about how likely something is,\u00a0<em>a priori<\/em> (that is, without observing any new information). 
This can be pretty useful:

![frequentists_vs_bayesians](https://deberker.com/archy/wp-content/uploads/2016/05/frequentists_vs_bayesians.png)

**Probability Mass Functions**

A Probability Mass Function (PMF) describes the probability of a randomly drawn sample from x taking on a certain value *x* (note the italics): P(x = *x*).

PMFs are used when variables are discrete. They generate nice blocky histograms, with the height of each bar telling us how likely we are to observe that value of x. For instance, this is the PMF for a (fair) die:

![pmf](https://deberker.com/archy/wp-content/uploads/2016/05/pmf.png)

**Probability Density Functions**

Probability Density Functions (PDFs) deal with P(x = *x*) again, but for **continuous** variables. This has one huge implication: **the height of the function no longer tells us the probability of observing that particular value of x, because the probability of x taking any single exact value is ~0.** The function is continuous, so it has infinitely many possible values, and the probability of any one of them is effectively 1/infinity = 0.

Intuitively, this is no different from the material meaning of density: knowing the density at a point tells us nothing about mass until we specify a volume, because mass = density × volume, and an infinitely small volume contains no mass. We need to integrate the density over a volume to recover a mass.

Instead, we integrate the density over some range to tell us how likely we are to **observe *x* within that range**. So we can ask: 'how likely am I to observe *x* between a and b?':

![PDF](https://deberker.com/archy/wp-content/uploads/2016/05/PDF.png)
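To make the PMF/PDF distinction concrete, here's a quick Python sketch (my own, not from the book). It builds the PMF of a fair die, and then gets the probability of a continuous variable falling in a range by integrating its PDF (via the cumulative distribution function in scipy):

```python
# A minimal sketch: a PMF for a fair die, and an interval probability for a PDF.
import numpy as np
from scipy import stats

# PMF of a fair six-sided die: each face has probability 1/6
pmf = np.full(6, 1 / 6)
print(pmf)            # [0.1667 ... 0.1667]
print(pmf.sum())      # 1.0 -- the probabilities sum to one

# For a continuous variable, P(x == a) is effectively 0; we integrate instead.
# P(a < x < b) for a standard normal, via the CDF (the integral of the PDF):
a, b = -1.0, 1.0
print(stats.norm.cdf(b) - stats.norm.cdf(a))   # ~0.683
```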
**Marginal probabilities**

We use marginal probabilities to isolate the probabilities of a subset of variables. Say we have a bunch of people, and we split them up by gender and by height. We now have them labelled as short/tall and male/female, and we count them as:

30 tall men
15 short men
20 tall women
35 short women

Now if we want to know the probability that a random draw will produce a man, we can **marginalise** over the height variable to find p(male):

$$P(\mathrm{x} = x) = \sum_y P(\mathrm{x} = x, \mathrm{y} = y)$$

Where x is gender (male/female) and y is height (tall/short). In order to isolate the probability distribution of gender, we need to 'sum out' the probability that the person is tall or short.

Practically: p(male) = p(male|tall)p(tall) + p(male|short)p(short) = (0.3/0.5)×0.5 + (0.15/0.5)×0.5 = 0.45 (which is bang on – we have 45 men in the sample).

Conversely, p(short) = (0.15/0.45)×0.45 + (0.35/0.55)×0.55 = 0.5.

Note that in the two-variable case we can cancel the denominator of each conditional against the probability we multiply it by (in both cases we're dividing by p(y) and then immediately multiplying by it again), so this amounts to just summing the joint probabilities.

In the case of continuous variables, the sum becomes an integral.
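Here's the same marginalisation done numerically (my own sketch, using the made-up counts above):

```python
# Marginalising a 2x2 joint distribution over height to get p(gender).
import numpy as np

#                  men  women
counts = np.array([[30,  20],    # tall
                   [15,  35]])   # short
joint = counts / counts.sum()    # joint distribution p(height, gender)

p_gender = joint.sum(axis=0)     # sum out height
p_height = joint.sum(axis=1)     # sum out gender
print(p_gender)                  # [0.45 0.55] -> p(male) = 0.45
print(p_height)                  # [0.5  0.5 ] -> p(tall) = p(short) = 0.5
```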
in both cases, we&#8217;re dividing by p(y) and then immediately multiplying by it again).<\/p>\n<p>In the case of continuous variables, this sum &#8211;&gt; an integration.<\/p>\n<p><strong>Conditional probability<\/strong><\/p>\n<p>Use a vertical line to represent this: p(y|x) means p(y=<em>y<\/em>) given that x=<em>x<\/em>.<\/p>\n<p>So now we have <em>marginal probabilities<\/em> (<strong>p(x)<\/strong>), <em>conditional probabilities<\/em> (<strong>p(x|y)<\/strong>), and <em>joint probabilities<\/em> (<strong>p(x,y)<\/strong>).<\/p>\n<p><strong>The chain rule of conditional probability<\/strong><\/p>\n<p>If we want to decompose a joint probability distribution into components, we can repeatedly crack open our joint distribution using conditional probabilities:<\/p>\n<p style=\"padding-left: 30px;\">P(a,b) = P(a|b)P(b)<\/p>\n<p>So the probability of observing a AND b, is equal to the probability of observing a given that b has occured, multiplied by the probability of observing b in the first place.<\/p>\n<p>We can do this recursively:<\/p>\n<p style=\"padding-left: 30px;\">P(a,b,c) = P(a|b,c)P(b,c)<\/p>\n<p style=\"padding-left: 30px;\">P(a,b,c) = P(a|b,c)P(b|c)P(c)<\/p>\n<p>Disclaimer: I&#8217;m not clear yet why this is particularly useful.<\/p>\n<p><strong>Independence<\/strong><\/p>\n<p>If two variables are independent, their joint probability is merely<\/p>\n<p style=\"padding-left: 30px;\">p(x=<em>x<\/em>,y=<em>y<\/em>) = p(x=<em>x<\/em>) * p(y=<em>y<\/em>)<\/p>\n<p>This leads on to the notion of\u00a0<strong>conditional independence<\/strong>.\u00a0There may be situations where we can&#8217;t predict A from B, but knowing the value of a third variable, C, gives us some predictive power. Here&#8217;s an (abridged, emphasised) example from<a href=\"http:\/\/math.stackexchange.com\/questions\/23093\/could-someone-explain-conditional-independence\"> stack exchange<\/a>:<\/p>\n<p style=\"padding-left: 30px;\">&#8220;Say you roll a blue die and a red die. The two results are <strong>independent<\/strong> of each other. Now you tell me that the blue result isn&#8217;t a\u00a0<span id=\"MathJax-Element-19-Frame\" class=\"MathJax\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mn&gt;6&lt;\/mn&gt;&lt;\/math&gt;\" tabindex=\"0\"><span id=\"MathJax-Span-55\" class=\"math\"><span id=\"MathJax-Span-56\" class=\"mrow\"><span id=\"MathJax-Span-57\" class=\"mn\">6<\/span><\/span><\/span><\/span>\u00a0and the red result isn&#8217;t a\u00a0<span id=\"MathJax-Element-20-Frame\" class=\"MathJax\" data-mathml=\"&lt;math xmlns=&quot;http:\/\/www.w3.org\/1998\/Math\/MathML&quot;&gt;&lt;mn&gt;1&lt;\/mn&gt;&lt;\/math&gt;\" tabindex=\"0\"><span id=\"MathJax-Span-58\" class=\"math\"><span id=\"MathJax-Span-59\" class=\"mrow\"><span id=\"MathJax-Span-60\" class=\"mn\">1<\/span><\/span><\/span><\/span>.<strong> You&#8217;ve given me new information, but that hasn&#8217;t affected the independence of the results<\/strong>. 
**Independence**

If two variables are independent, their joint probability is simply the product of their marginals:

p(x = *x*, y = *y*) = p(x = *x*) × p(y = *y*)

This leads on to the notion of **conditional independence**. There may be situations where we can't predict A from B on its own, but knowing the value of a third variable, C, gives us some predictive power. Here's an (abridged, emphasised) example from [stack exchange](http://math.stackexchange.com/questions/23093/could-someone-explain-conditional-independence):

"Say you roll a blue die and a red die. The two results are **independent** of each other. Now you tell me that the blue result isn't a 6 and the red result isn't a 1. **You've given me new information, but that hasn't affected the independence of the results.** By taking a look at the blue die, I can't gain any knowledge about the red die; after I look at the blue die I will still have a probability of 1/5 for each number on the red die except 1. So the probabilities for the results are **conditionally independent** given the information you've given me. But if instead **you tell me that the sum of the two results is even, this allows me to learn a lot about the red die by looking at the blue die.** For instance, if I see a 3 on the blue die, the red die can only be 1, 3 or 5. So in this case the probabilities for the results are not conditionally independent given this other information that you've given me. This also underscores that conditional independence is always relative to the given condition – the results of the dice rolls are conditionally independent with respect to the event "the blue result is not 6 and the red result is not 1", but they're not conditionally independent with respect to the event "the sum of the results is even"."
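A quick simulation of the dice example (my own sketch) shows the same thing:

```python
# The two dice are independent, but conditioning on "the sum is even"
# makes them dependent.
import numpy as np

rng = np.random.default_rng(1)
blue = rng.integers(1, 7, size=1_000_000)
red = rng.integers(1, 7, size=1_000_000)

# Unconditionally, knowing blue tells us nothing about red:
print(np.mean(red[blue == 3] == 1))           # ~1/6
# Condition on the sum being even: blue now constrains red's parity.
even = (blue + red) % 2 == 0
print(np.mean(red[even & (blue == 3)] == 1))  # ~1/3 (red must be 1, 3 or 5)
```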
**Expectation, variance**

The expectation is basically the mean of a probability distribution – the long-run average value of f(x) we'd obtain if we kept drawing random values of x. We denote it with a fancy E: $\mathbb{E}$.

For discrete variables it's simply:

$$\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x)\, f(x)$$

Meaning the expectation of f(x), with x drawn from the distribution P, is the sum over the possible values of x of f(x) multiplied by its probability. Pretty intuitive.

For continuous variables, we use... you guessed it, an integral version of the same:

$$\mathbb{E}_{x \sim p}[f(x)] = \int p(x)\, f(x)\, dx$$
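A minimal sketch (mine, not the book's) of the discrete case, using the fair die again – the probability-weighted sum matches the long-run average of samples:

```python
# Expectation as a probability-weighted sum, plus a Monte Carlo check.
import numpy as np

faces = np.arange(1, 7)
pmf = np.full(6, 1 / 6)
expectation = np.sum(pmf * faces)      # sum of P(x) * f(x)
print(expectation)                     # 3.5

rng = np.random.default_rng(2)
samples = rng.choice(faces, size=100_000, p=pmf)
print(samples.mean())                  # ~3.5 -- the long-run average
```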
If the expectation is like the mean of a distribution, the **variance** is equivalent to... well... the variance. It means exactly the same thing in this context: how much things vary around the expectation, which is to say, the 'peakiness' of the probability distribution.

$$\mathrm{Var}(f(x)) = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$$

This form is intuitively similar to the definition of the standard deviation: we are taking an average squared difference between each data point and the mean (the expectation, in this case).

In the figure below (a Cauchy distribution), $x_0$ sets the location, playing the role of the expectation, and $\gamma$ sets the scale, playing the role of the variance (strictly speaking the Cauchy has no defined mean or variance, but the intuition about location and spread carries over):

![360px-Cauchy_pdf](https://deberker.com/archy/wp-content/uploads/2016/05/360px-Cauchy_pdf.png)

(from [Wikipedia](https://en.wikipedia.org/wiki/Cauchy_distribution#/media/File:Cauchy_pdf.svg))

Note that since the probability distribution has to integrate to 1, *changing the variance also changes the height of the distribution*.

This seems like a good time to introduce the **Dirac delta function**, a weird function that exemplifies the above consideration: it is 0 everywhere apart from x = 0, has zero variance, and yet still integrates to 1:

$$\delta(x) = 0 \;\text{ for }\; x \neq 0, \qquad \int_{-\infty}^{\infty} \delta(x)\, dx = 1$$

It's useful for denoting very short events in time – we can use it to denote a spike (action potential) in a neuron, for instance.
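Here's a small sketch (my own) of that normalisation constraint: as we shrink a Gaussian's standard deviation towards zero, the peak climbs while the total area stays at 1 – the Dirac delta is the limit of this process:

```python
# Shrinking the variance of a Gaussian pushes the peak up; the area stays ~1.
import numpy as np
from scipy import stats

x = np.linspace(-5, 5, 100_001)
dx = x[1] - x[0]
for sigma in [1.0, 0.1, 0.01]:
    pdf = stats.norm.pdf(x, loc=0.0, scale=sigma)
    area = (pdf * dx).sum()                        # numerical integral
    print(sigma, round(pdf.max(), 2), round(area, 3))
```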
**Covariance**

An important concept: how much two variables vary together (correlation is the covariance corrected for the scale of each variable).

This is similar to calculating the variance (see above), but instead of squaring the deviation of one variable, we multiply together the deviations of two variables:

$$\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])\big]$$

Covariance will be positive if two things tend to vary in the same direction (e.g. height and weight) and negative if two variables tend to vary in opposite directions (e.g. height and the number of books you have to stand on to reach the top shelf). Note that covariance is a *linear* construct: two variables can be related in a funky non-linear way and still have zero covariance.
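A quick numerical illustration (my own data) of positive, negative, and zero covariance – note that the last pair is strongly but non-linearly related, yet the covariance is still roughly zero:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)

print(np.cov(x, 2 * x + rng.normal(size=x.size))[0, 1])   # positive
print(np.cov(x, -2 * x + rng.normal(size=x.size))[0, 1])  # negative
print(np.cov(x, x ** 2)[0, 1])                             # ~0, despite dependence
```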
**A bunch of distributions**

***Bernoulli***

A discrete probability distribution describing a single binary variable, the most classic example of which would be heads or tails on a coin. It has a single parameter, $\phi$, which says how likely the variable is to take state 1 (for an unbiased coin, $\phi$ = 0.5).

The ***multinoulli*** distribution does the same thing over k states.

***Gaussian (= normal)***

The old favourite. Sums of many independent random variables tend towards the Gaussian distribution thanks to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem). Despite it being exceedingly common, you may never have actually seen the formula written down:

$$N(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\, \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$

Phew, what a mouthful. $\mu$ is the mean and $\sigma$ is the standard deviation ($\sigma^2$ = variance).

They have a nice figure in the book (3.1):

![fig3.1](https://deberker.com/archy/wp-content/uploads/2016/05/fig3.1.png)

We can feed many more variables into the Gaussian to produce a ***multivariate normal*** distribution. This involves replacing our scalar mean with a vector of means (μ → **μ**) and our variance with a covariance matrix, **Σ**.

An example plotted for 2 dimensions:

![MultivariateNormal](https://deberker.com/archy/wp-content/uploads/2016/05/MultivariateNormal.png)

(from [Wikipedia](https://en.wikipedia.org/wiki/Multivariate_normal_distribution))
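A minimal sampling sketch (my own parameters): draw from a 2D multivariate normal with a mean vector and a covariance matrix, and check that the sample statistics match:

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.0, 2.0])
sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])   # must be symmetric positive semi-definite

samples = rng.multivariate_normal(mu, sigma, size=100_000)
print(samples.mean(axis=0))      # ~[0, 2]
print(np.cov(samples.T))         # ~sigma
```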
***Exponential & Laplace***

Used when we want nice spiky distributions. For the exponential, the spike is at 0, whereas the Laplace distribution lets us shift it to a point μ. Both have a parameter controlling how sharply the density falls off (λ for the exponential, γ for the Laplace – shown as b in the Laplace graph below):

Exponential:

![exp](https://deberker.com/archy/wp-content/uploads/2016/05/exp.png)

NB: the exponential is a special case of the [gamma distributions](https://en.wikipedia.org/wiki/Gamma_distribution) (which look a bit like a cross between an exponential and a normal distribution – they have a peak and a long tail).

Laplace:

![lapl](https://deberker.com/archy/wp-content/uploads/2016/05/lapl.png)

(from [Wikipedia](https://en.wikipedia.org/wiki/Laplace_distribution#/media/File:Laplace_pdf_mod.svg))
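For concreteness, here's a quick look at the two densities (my own parameter choices, using scipy's standard parameterisations):

```python
import numpy as np
from scipy import stats

x = np.linspace(-4, 4, 9)
print(stats.expon.pdf(x, scale=1.0))             # zero for x < 0, spike at 0
print(stats.laplace.pdf(x, loc=1.0, scale=0.5))  # spike shifted to mu = 1
```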
**A mixture of distributions**

We can combine a **multinoulli** distribution with some set of the above to generate a mixture distribution. The multinoulli specifies which component distribution each sample comes from:

$$P(x) = \sum_i P(c = i)\, P(x \mid c = i)$$

Where P(c) is the probability distribution over component distributions. This tells us that the probability of obtaining a certain value x is the sum, over components, of the probability of being in that component multiplied by the probability of x given that component.

A common use of this is the **Gaussian mixture model**. We use these in neuroscience when we think that a population might be composed of multiple groups. For instance, the distribution of heights across the population is better described by two Gaussians, one for men and one for women. In this case gender is a **latent, or hidden, variable** – an extra piece of information not present in the data itself, but which allows us to understand our data better.

![heigh](https://deberker.com/archy/wp-content/uploads/2016/05/heigh.jpg)

(from [Disentangling Gaussians](http://cacm.acm.org/magazines/2012/2/145412-disentangling-gaussians/fulltext))
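Here's a minimal sketch (invented parameters, echoing the heights example) of sampling from a two-component Gaussian mixture – first draw the latent component from the multinoulli, then draw x from that component's Gaussian:

```python
import numpy as np

rng = np.random.default_rng(5)
weights = np.array([0.5, 0.5])        # multinoulli P(c = i)
means = np.array([178.0, 165.0])      # e.g. men / women, in cm (made up)
stds = np.array([7.0, 6.0])

c = rng.choice(2, size=100_000, p=weights)   # draw the latent component
heights = rng.normal(means[c], stds[c])      # then draw x | c
print(heights.mean(), heights.std())         # a bimodal-ish mixture
```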
**Useful functions**

When working with neural networks, we use functions that curtail the firing rates of our units to positive values (because neurons can't fire at a negative rate).

Two useful ones: the **sigmoid** (this is what everybody used to use) and the **rectifier** & **softplus** (I believe these are more in fashion now).

**Sigmoid**

![sig](https://deberker.com/archy/wp-content/uploads/2016/05/sig.png)

**Rectifier & softplus**

![rect](https://deberker.com/archy/wp-content/uploads/2016/05/rect.png)

(from [Wikipedia](https://en.wikipedia.org/wiki/File:Rectifier_and_softplus_functions.svg))

**Transforming distributions**

Sometimes we want to push one distribution through a function. One common example from model fitting: we want to fit a parameter in an unbounded space (−infinity to +infinity), but when we use the parameter we need it constrained, for instance to between 0 and 1 (if we are fitting a learning rate, say). In this case, we'd fit the parameter and then **pass it through a sigmoid.**

We can derive the distribution of the transformed variable by rescaling the original density by the derivative of the inverse transformation, so that probability mass is conserved. If y = g(x):

$$p_y(y) = p_x(x)\, \left|\frac{dx}{dy}\right|$$

**Bayes rule**

A favourite of ours, it's:

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

A derivation:

1. p(x,y) = p(x|y)p(y)
2. p(x|y) = p(x,y)/p(y)
3. Doing the same in reverse: p(y,x) = p(y|x)p(x)

Then substitute p(y|x)p(x) for p(x,y) in eq. 2, and we get:

p(x|y) = p(y|x)p(x) / p(y)

Woohoo!

**A little bit of information theory**

Information is defined as the amount of uncertainty we reduce when we learn something. If you know something already, being told it again has an information content of zero. If something very unlikely turns out to have occurred, you gain a lot of information; if something very likely turns out to have occurred, you gain only a little.

Formally:

$$I(x) = -\log P(x)$$

So information is negatively related to probability: the more probable something is, the less we learn when we find out the outcome. Depending on the base of the logarithm, information comes in different units: **nats** (for the natural log), or **bits** or **shannons** (log base 2).

We quantify the amount of uncertainty in a whole probability distribution (rather than merely that associated with a single event, x) using the **Shannon entropy**:

$$H(\mathrm{x}) = \mathbb{E}_{x \sim P}[I(x)]$$

Which is basically the expectation (i.e. the average) of the information.

**Kullback-Leibler (KL) divergence**

A phrase which many have heard and, I suspect, few have understood. The KL divergence quantifies the **difference between two distributions**. More precisely, it quantifies **how much information we lose if we try to approximate one distribution with another distribution**. Interestingly, it's not symmetric – the amount of information we lose when we use A to approximate B ≠ the amount lost when we use B to approximate A.

It has quite a simple formula:

$$D_{KL}(P \parallel Q) = \mathbb{E}_{x \sim P}[\log P(x) - \log Q(x)]$$

Basically the average difference between the log-probabilities the two distributions assign to each sample (averaged under P)!

Note that this means that the KL captures both **differences in the means and variances of distributions.**

A related concept is the **mutual information** (the formula below is sometimes loosely called a cross-entropy, but strictly the cross-entropy is H(P, Q) = H(P) + D_KL(P‖Q)):

$$I(\mathrm{x}; \mathrm{y}) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

If the two variables are completely independent, p(x,y) = p(x)p(y) [that's just the definition of independence from above], each log term is log(1) = 0, and the whole sum is 0. If p(x,y) > p(x)p(y), then the joint probability is higher than the product of the marginals, implying that the variables are related – and yielding a positive contribution.
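A quick sketch (my own toy distributions) of entropy and the KL divergence for discrete distributions, which also shows the asymmetry:

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])

entropy_p = -np.sum(p * np.log(p))            # Shannon entropy, in nats
kl_pq = np.sum(p * (np.log(p) - np.log(q)))   # D_KL(P || Q)
kl_qp = np.sum(q * (np.log(q) - np.log(p)))   # D_KL(Q || P)
print(entropy_p, kl_pq, kl_qp)                # kl_pq != kl_qp
```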
**Graphical models**

Basically a way of drawing out probabilistic relationships. They consist of **nodes** (variables, with their distributions or functions) and **edges** (links between them).

They can be **directed**, in which case we are visualising conditional relationships, or **undirected**, which allows us to mix in non-probabilistic functions.

In undirected models, sets of nodes that are all linked to one another by edges are referred to as **cliques**.

A directed graph:

![Graphical_model_example](https://deberker.com/archy/wp-content/uploads/2016/05/Graphical_model_example.png)

([Wikipedia](https://upload.wikimedia.org/wikipedia/commons/e/e2/Graphical_model_example.png))

One common usage is to describe transitions in a Markov model:

![gmm](https://deberker.com/archy/wp-content/uploads/2016/05/gmm.png)

([Wikipedia](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/HiddenMarkovModel.svg/2000px-HiddenMarkovModel.svg.png))

The advantage of having a graphical model is that it allows us to depict dependencies. If we learn these dependencies, then we save ourselves the effort of learning the whole covariance structure across all of the variables. This is probably the kind of trick the brain uses to make things easier…
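As a tiny illustration of a directed model in action, here's a sketch (my own, with an invented transition matrix) of sampling a sequence from a Markov chain, where each state depends only on the previous one:

```python
import numpy as np

rng = np.random.default_rng(6)
# transitions[i, j] = P(next state = j | current state = i)
transitions = np.array([[0.9, 0.1, 0.0],
                        [0.2, 0.5, 0.3],
                        [0.0, 0.3, 0.7]])

state, sequence = 0, []
for _ in range(20):
    state = int(rng.choice(3, p=transitions[state]))
    sequence.append(state)
print(sequence)   # each state conditioned only on its predecessor
```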