ML Bits #1: Cook's Amazon Reseller Problem
Published 2018-03-16, updated 2018-11-03
<p>A friend of mine recently showed me John Cook's <a href="https://www.johndcook.com/blog/2011/09/27/bayesian-amazon/">A Bayesian view of Amazon Resellers</a> as a cute example of Bayesian analysis on a problem where the amount of data differs greatly between parameters. I really like this mini-problem, and so wanted to share a solution that softens the most painful step. </p>
<h1 id="problem">Problem</h1>
<p>Suppose we are debating which Amazon reseller to buy a used book from, concerned about each candidate's reliability as a seller. </p>
<ul>
<li>Seller A has 90 positive reviews, and 10 negative reviews. </li>
<li>Seller B has 2 positive reviews, and 0 negative reviews. </li>
</ul>
<p>Which seller should we buy from? </p>
<h1 id="analysis">Analysis</h1>
<p>Since we hardly have a wealth of data points, especially for Seller B, a Bayesian approach seems suitable. Letting $$\theta_{A}$$ and $$\theta_{B}$$ be the probabilities of a satisfactory purchase from each reseller respectively, we'll treat every value of these probabilities as equally plausible a priori and opt for a uniform $$\text{Beta}(1, 1)$$ prior on each. Adding each seller's positive and negative review counts to the prior parameters, our posteriors become $$\text{Beta}(91, 11)$$ and $$\text{Beta}(3, 1)$$ respectively. </p>
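<p>The update from prior to posterior is just addition of counts. A minimal sketch, assuming a Beta prior with a binomial likelihood; the helper names <code>beta_posterior</code> and <code>beta_mean</code> are my own and do not come from Cook's article:</p>

```python
def beta_posterior(positives, negatives, prior_a=1, prior_b=1):
    """Return (a, b) of the Beta posterior after observing the review counts.

    Assumes a Beta(prior_a, prior_b) prior and a binomial likelihood.
    """
    return prior_a + positives, prior_b + negatives


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


post_a = beta_posterior(90, 10)  # Seller A: 90 positive, 10 negative
post_b = beta_posterior(2, 0)    # Seller B: 2 positive, 0 negative
print(post_a, beta_mean(*post_a))
print(post_b, beta_mean(*post_b))
```

<p>The posterior means are 91/102 (about 0.892) for Seller A and 3/4 for Seller B, but the means alone do not tell us how much the two distributions overlap. </p>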
<p>Now here is where it gets a little bit tricky. Given our knowledge of the posteriors, we want to find $$<br>
p(\theta_{A} > \theta_{B} | \mathcal{D}),<br>
$$ where we have conditioned on our data. How do we find this quantity though? </p>
<p><img alt="Silvrback blog image " src="https://silvrback.s3.amazonaws.com/uploads/bf7711cd-2db0-4af9-a743-f47589e02fa7/beta_plots.png" /></p>
<p>Cook omits the details in his article, but links to a paper on <a href="https://www.johndcook.com/UTMDABTR-005-05.pdf">Exact Calculation of Beta Inequalities</a>. In short, it is possible to solve for this probability either analytically or using some explicit numerical integration, but these approaches involve writing down a little (or a lot) of actual math. </p>
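<p>For these particular posteriors the exact answer happens to be short. The following is a sketch of a shortcut of my own, not the general method from the linked paper. Because Seller B's posterior is $$\text{Beta}(3, 1)$$, its CDF is $$x^3$$, so the target probability equals the third raw moment of $$\text{Beta}(91, 11)$$. The function name is hypothetical, and it only covers the case where B's posterior is $$\text{Beta}(m, 1)$$ with a positive integer $$m$$:</p>

```python
from fractions import Fraction


def prob_a_beats_beta_m1(a, b, m):
    """P(theta_A > theta_B) for independent theta_A ~ Beta(a, b), theta_B ~ Beta(m, 1).

    Beta(m, 1) has CDF x**m, so the answer is E[theta_A**m], the m-th raw
    moment of Beta(a, b): the product of (a + k) / (a + b + k) for k = 0..m-1.
    """
    result = Fraction(1)
    for k in range(m):
        result *= Fraction(a + k, a + b + k)
    return result


exact = prob_a_beats_beta_m1(91, 11, 3)
print(exact, float(exact))
```

<p>This evaluates to roughly 0.7126. For general Beta posteriors no such shortcut applies, which is where the machinery in the paper comes in. </p>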
<h1 id="keeping-it-simple">Keeping it simple</h1>
<p>Since we only need this value to make the decision at hand, not to know it precisely, why not just estimate it with Monte Carlo integration? This is trivially easy to do, and we barely have to think about the math. Beta distributions are easy to sample from, and the two posteriors are independent of each other. So let's just take $$10^7$$ samples from each posterior and count the fraction of sample pairs where $$\theta_{A} > \theta_{B}$$. </p>
<p>Doing so in Python:</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="n">post_a_samples</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">beta</span><span class="p">(</span><span class="mi">91</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">10</span><span class="o">**</span><span class="mi">7</span><span class="p">)</span>
<span class="n">post_b_samples</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">beta</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="o">**</span><span class="mi">7</span><span class="p">)</span>
<span class="n">post_a_larger</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">post_a_samples</span> <span class="o">></span> <span class="n">post_b_samples</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">'p(theta_a > theta_b | D) is approx. {post_a_larger}'</span><span class="p">)</span>
</pre></div>
<p>This gives me a result of 0.7124. The analytic result that Cook provides is 0.713, so we seem awfully close! This suggests that, even though Seller B has no negative reviews yet, we ought to buy from Seller A, counter to our intuition. </p>
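<p>To judge how close "awfully close" is, we can attach a standard error to the Monte Carlo estimate. This is a sketch under my own assumptions: it uses NumPy's <code>default_rng</code> generator with an arbitrary seed and a smaller sample size than above. Since the estimate is a proportion of independent 0/1 outcomes, its standard error is approximately $$\sqrt{\hat{p}(1 - \hat{p})/n}$$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
n = 10**6                       # fewer samples than in the post, to keep the run quick

theta_a = rng.beta(91, 11, size=n)
theta_b = rng.beta(3, 1, size=n)

# Fraction of sample pairs in which Seller A's draw exceeds Seller B's.
estimate = np.mean(theta_a > theta_b)
# Binomial standard error of a sample proportion.
std_error = np.sqrt(estimate * (1 - estimate) / n)
print(f"estimate = {estimate:.4f} +/- {std_error:.4f}")
```

<p>With $$10^7$$ samples the same formula gives a standard error of about 0.00014, so agreement to roughly three decimal places is what we should expect. </p>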
<p>The choice of prior is perhaps not ideal for this problem, but in any case I thought it was a neat example and the solution a good use case for handy Monte Carlo sampling. </p>
Mitchell Snaith