I took a class in Bayesian statistics in year two of graduate school. I wouldn't call myself a Bayesian -- as an industry practitioner, I take the "whatever works" approach. It's done me a lot of good to have a broad toolset, pulling out one tool or another based on what I think is best for the business problem at hand. This tends to be the attitude in ML research, but statistics can still be pretty clan-ny with various "Bayesian" societies and affiliations.
Even within the Bayesian community there are different attitudes on how to do Bayesian statistics. For example, one camp advocates for non-informative priors, which aim to not incorporate any "subjective judgements" on the parameters. Supporters call this "objective Bayes".
I remember coming across a puzzling example of the use of non-informative priors which made me doubt whether a prior could really ever be "objective". I'll describe this example below. Interestingly, it also connects to a well-known probability paradox called the "exchange paradox", which I discuss later.
Suppose we have a set of \(n\) i.i.d. exponential observations with rate \(\lambda\) and sample mean \(\bar{x}\), so the likelihood takes the form
\[ p(x_1,\ldots,x_n \mid \lambda) = \lambda^n e^{-\lambda n \bar{x}}. \]
We put an improper prior over \(\lambda\), \(p(\lambda)\propto 1 \). This basically says before we observe the data, we believe the parameter is equally likely to be any positive real number. It’s not a proper prior, but it results in a valid posterior so let’s go with it.
The posterior distribution for \(\lambda\) is easily calculated as
\[ p(\lambda \mid \bar{x}) \propto \lambda^n e^{-\lambda n \bar{x}}. \]
By inspection one recognizes this as a Gamma distribution with shape parameter \(n+1\) and scale parameter \(\frac{1}{n\bar{x}}\). That means the posterior mean is
\[ \mathbb{E}_{post} [\lambda] = \frac{n+1}{n} \cdot \frac{1}{\bar{x}}. \]
The strange thing about this is that no matter what the sample is, the posterior mean is always larger, by a factor of \(\frac{n+1}{n}\), than the natural sample estimate \(\frac{1}{\bar{x}}\) (which is also the maximum likelihood estimate)!
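As a sanity check on the posterior mean above, here is a minimal sketch that integrates the flat-prior posterior numerically and compares it with the closed form. The function name `posterior_mean_flat_prior`, the true rate of 2.0, the sample size of 20, and the grid settings are all my own choices for illustration, not anything fixed by the argument.

```python
import math
import random


def posterior_mean_flat_prior(xs, grid_size=20000):
    """Posterior mean of lam when p(lam | xs) is proportional to
    lam**n * exp(-lam * sum(xs)), computed with a midpoint rule."""
    n = len(xs)
    total = sum(xs)                      # equals n * x_bar
    mode = n / total                     # posterior mode, used to rescale weights
    log_ref = n * math.log(mode) - mode * total
    # Integrate out to 10 posterior standard deviations past the posterior mean.
    upper = (n + 1 + 10 * math.sqrt(n + 1)) / total
    step = upper / grid_size
    num = den = 0.0
    for i in range(grid_size):
        lam = (i + 0.5) * step           # midpoint of the i-th grid cell
        weight = math.exp(n * math.log(lam) - lam * total - log_ref)
        num += lam * weight
        den += weight
    return num / den


random.seed(0)
true_rate = 2.0                          # hypothetical value, for illustration only
xs = [random.expovariate(true_rate) for _ in range(20)]
n, x_bar = len(xs), sum(xs) / len(xs)

print("sample estimate 1/x_bar :", 1 / x_bar)
print("closed-form post. mean  :", (n + 1) / (n * x_bar))
print("numerical post. mean    :", posterior_mean_flat_prior(xs))
```

Whatever seed or rate you pick, the last two numbers should agree closely and both should sit above the first.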
A more tangible example of this problem is something called the exchange paradox. You are told to choose between two envelopes filled with money; one contains twice as much as the other, but the amounts are unknown. I get the other one. You choose one envelope and it contains $100.
Next, you reason that the other envelope is equally likely to contain $50 (if you’re holding the larger envelope) or $200 (if you’re holding the smaller one). Thus, you expect a payoff of \(0.5 \cdot \$50 + 0.5 \cdot \$200 = \$125\) if you trade with me. I make the same calculation with the amount in my envelope and come to the same conclusion; we both expect a larger payoff from exchanging envelopes. How can this be?
There’s been endless discussion on this classic paradox, but I want to think through this problem like a Bayesian. Let’s call the quantity in the smaller envelope \(S\) and the amount you found in your envelope \(x\); \(S\) is either \(x\) (you hold the smaller envelope) or \(x/2\) (you hold the larger one). Now suppose you have a prior on the value of the smaller envelope, call it \(p(s)\), and that you were equally likely to pick either envelope. Treating \(p\) as the prior weight on each of the two candidate values, an application of Bayes’ rule gives
\[ P(S = x \mid x) = \frac{p(x)}{p(x/2)+p(x)}.\]
Similarly, \( P(S = x/2 \mid x) = \frac{p(x/2)}{p(x/2)+p(x)}\), since the probabilities of the two cases must sum to one. We can now calculate the expected payout from switching:
\[ E = \frac{x}{2} \cdot \frac{p(x/2)}{p(x)+p(x/2)} + 2x \cdot \frac{p(x)}{p(x)+p(x/2)} \]
Now, suppose that \(p(x)\) is the prior we considered above, uniform over the positive reals. The expected value becomes
\[ E = \frac{x}{4} + x = 1.25 x\]
We come to the same conclusion as the argument I made above, that it is always better to switch envelopes!
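However, other choices of prior yield less “paradoxical” results. For example, the prior \(p(x) \propto \frac{1}{x}\) gives an expected payoff from switching of \(x\), exactly what you already hold, so there is no incentive to trade. (The same prior on \(\lambda\) in the exponential example, \(p(\lambda) \propto \frac{1}{\lambda}\), gives a posterior mean of exactly \(\frac{1}{\bar{x}}\).)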
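The expected-payout formula is easy to evaluate for any prior you like. The sketch below does that for the flat prior and the \(\frac{1}{x}\) prior; the names `expected_switch_payoff`, `flat`, and `reciprocal` are mine, and the priors are passed as unnormalized weights because only their ratio at \(x\) and \(x/2\) matters.

```python
def expected_switch_payoff(x, prior):
    """Expected amount in the other envelope, given that yours holds x.

    `prior` maps a candidate amount to an unnormalized prior weight,
    evaluated at x/2 and x exactly as in the expected-payout formula.
    """
    w_half = prior(x / 2)     # weight on "the other envelope holds x/2"
    w_double = prior(x)       # weight on "the other envelope holds 2x"
    total = w_half + w_double
    return (x / 2) * w_half / total + (2 * x) * w_double / total


def flat(s):
    """Improper flat prior: the same weight everywhere."""
    return 1.0


def reciprocal(s):
    """Improper prior with weight proportional to 1/s."""
    return 1.0 / s


for amount in (0.2, 100.0, 2e13):
    print(amount,
          expected_switch_payoff(amount, flat),
          expected_switch_payoff(amount, reciprocal))
```

With the flat prior the payoff is 1.25 times whatever you hold, at every scale; with the reciprocal prior the weights become \(\frac{2}{3}\) and \(\frac{1}{3}\), and the payoff is \(\frac{x}{2} \cdot \frac{2}{3} + 2x \cdot \frac{1}{3} = x\).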
My take-away is that non-informative priors still have strong consequences for inference in relation to the observed data, despite having “no information”. If you use the non-informative prior for the exchange game, you are saying that if your envelope had $20 trillion in it, you would be no more inclined to believe you hold the larger envelope than if it had 20 cents. If the conclusions of the exchange paradox don’t sit right with you, it’s because you don’t really believe in the prior you chose – maybe the non-informative prior wasn’t what you really wanted!
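To make the $20 trillion versus 20 cents comparison concrete, here is a small sketch of the probability that you hold the larger envelope, using the same two prior weights as in the expected-payout calculation. The exponentially decaying prior with a scale of 100 is purely my assumption, standing in for "a prior someone might actually hold", and priors are passed as log-weights only to keep the arithmetic stable for huge amounts.

```python
import math


def prob_holding_larger(x, log_prior):
    """P(you hold the larger envelope | your envelope holds x).

    Equals p(x/2) / (p(x) + p(x/2)), computed from log prior weights.
    """
    log_ratio = log_prior(x) - log_prior(x / 2)   # log of p(x) / p(x/2)
    return 1.0 / (1.0 + math.exp(log_ratio))


def log_flat(s):
    """Flat (non-informative) prior: constant log-weight."""
    return 0.0


def log_exponential(s, scale=100.0):
    """Hypothetical informative prior with weight exp(-s / scale)."""
    return -s / scale


for amount in (0.2, 100.0, 2e13):
    print(amount,
          prob_holding_larger(amount, log_flat),
          prob_holding_larger(amount, log_exponential))
```

The flat prior answers one half at every amount, while the decaying prior becomes nearly certain that a $20 trillion envelope is the larger one.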