LChrisman/Bayes rule for MetaLog

There isn't yet a well-defined way to update a subjective MetaLog prior with observations to compute a MetaLog posterior in the standard Bayesian updating way.

Keelin and Howard did derive a Bayesian update for the case where you are modeling a variational distribution, and new observations update that distribution. For example, the variational distribution might be the distribution of fish lengths in a given river. There isn't a single fish length; each update refines your estimate of the total actual variation.

The more standard subjective Bayesian update would be where there is an unknown scalar quantity, x, and you don't know its actual value. The prior denotes your prior distribution for x. As you get additional observations, [math]\displaystyle{ data }[/math], you use the likelihood function, [math]\displaystyle{ L(x|data) = p(data|x) }[/math] to update your subjective belief. This is a straight application of Bayes' rule: [math]\displaystyle{ p(x|data) \propto p(x) p(data|x) = p(x) L(x|data) }[/math].

I have a brainstorming idea for a way to compute a MetaLog posterior distribution from a MetaLog prior and an arbitrary (but known) likelihood function. I'm doing a brain dump of the idea here, but it needs further analysis. If you stumbled on this, do not use it blindly; it is too early to rule out that it is fundamentally flawed.

BTW, it is worth noting that there is no conjugate update for a Logistic prior distribution. Since the MetaLog generalizes the Logistic, it is safe to assume that no closed-form conjugate update rule exists for it either.

Incorporating data point weights

A building block is weighted-data fitting.

Let [math]\displaystyle{ [ x_1, ... x_n ] }[/math] be a set of data points, sorted in increasing order, with weights [math]\displaystyle{ [w_1,...,w_n] }[/math] where [math]\displaystyle{ \sum_i w_i = 1 }[/math].

To fit this data, use the points [math]\displaystyle{ \{ (x_1, y_1), ..., (x_n, y_n) \} }[/math] where [math]\displaystyle{ y_i = {1 \over 2} w_i + \sum_{j=1}^{i-1} w_j }[/math].

For equally weighted points this becomes [math]\displaystyle{ y_i = (i-0.5)/n }[/math].
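In code, this mapping from weights to cumulative probability positions might look like the following minimal NumPy sketch (the function name weighted_y_positions is my own illustrative choice, not from any library):

<syntaxhighlight lang="python">
import numpy as np

def weighted_y_positions(w):
    """Map weights w (summing to 1) to cumulative probability positions
    y_i = w_i/2 + sum_{j<i} w_j, as described above."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalize defensively
    return np.cumsum(w) - 0.5 * w        # y_i = (sum_{j<=i} w_j) - w_i/2

# Example: equal weights reproduce (i - 0.5)/n
print(weighted_y_positions([0.25, 0.25, 0.25, 0.25]))   # [0.125 0.375 0.625 0.875]
</syntaxhighlight>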

Find the "best weighted fit" MetaLog by solving:

[math]\displaystyle{ argmin_a \sum_i w_i (x_i - M(y_i;a))^2 }[/math]
s.t. [math]\displaystyle{ M'(y;a)\ge 0 }[/math] for all [math]\displaystyle{ y\in (0,1) }[/math]

Note that this is a constrained weighted regression.

Denote the solution as [math]\displaystyle{ a^*( x, w ) }[/math].

I'll use weighted-fitting as a sub-routine.
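Here is a rough sketch of that subroutine in Python, using the standard MetaLog basis terms (Keelin 2016) and plain weighted least squares. It reuses weighted_y_positions from the sketch above; the names metalog_basis, fit_metalog, and metalog_quantile are illustrative, not from any particular library. Note that this sketch ignores the feasibility constraint [math]\displaystyle{ M'(y)\ge 0 }[/math]; a complete implementation would check or enforce it.

<syntaxhighlight lang="python">
import numpy as np

def metalog_basis(y, k):
    """Design matrix for a k-term metalog quantile function (Keelin 2016):
    columns 1, ln(y/(1-y)), (y-0.5)ln(y/(1-y)), (y-0.5), (y-0.5)^2,
    (y-0.5)^2 ln(y/(1-y)), (y-0.5)^3, ..."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    logit = np.log(y / (1.0 - y))
    c = y - 0.5
    cols = []
    for j in range(1, k + 1):
        if j == 1:
            cols.append(np.ones_like(y))
        elif j == 2:
            cols.append(logit)
        elif j == 3:
            cols.append(c * logit)
        elif j == 4:
            cols.append(c)
        elif j % 2 == 1:                   # j = 5, 7, 9, ...
            cols.append(c ** ((j - 1) // 2))
        else:                              # j = 6, 8, 10, ...
            cols.append(c ** (j // 2 - 1) * logit)
    return np.column_stack(cols)

def fit_metalog(x, w, k=5):
    """Weighted least-squares fit: argmin_a sum_i w_i (x_i - M(y_i; a))^2.
    NOTE: the feasibility constraint M'(y) >= 0 is not enforced here."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)                  # y positions assume x sorted ascending
    x, w = x[order], w[order]
    y = weighted_y_positions(w)            # from the sketch above
    B = metalog_basis(y, k)
    sw = np.sqrt(w)
    a, *_ = np.linalg.lstsq(sw[:, None] * B, sw * x, rcond=None)
    return a

def metalog_quantile(y, a):
    """Evaluate M(y; a) at probability level(s) y."""
    out = metalog_basis(y, len(a)) @ a
    return out[0] if np.isscalar(y) else out
</syntaxhighlight>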

Computing the posterior

Given:

  • [math]\displaystyle{ M(y ; a_{prior}) }[/math] = the quantile function of the prior distribution.
  • [math]\displaystyle{ L(x | data) }[/math] = a likelihood function.

We compute the posterior by:

  1. Sample [math]\displaystyle{ \hat{x}_i = M( u_i ; a_{prior} ), i=1..m }[/math], where each [math]\displaystyle{ u_i }[/math] is drawn independently from Uniform(0,1), for some large [math]\displaystyle{ m }[/math].
  2. Set [math]\displaystyle{ \hat{w}_i = {{L(\hat{x}_i | data)}\over{\sum_j L(\hat{x}_j | data)}} }[/math]
  3. Compute [math]\displaystyle{ a_{posterior} = a^*( \hat x, \hat w ) }[/math]
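A minimal sketch of these three steps, reusing the hypothetical helpers fit_metalog and metalog_quantile from the sketch above (the commented-out Gaussian likelihood is just a placeholder to show how it might be called):

<syntaxhighlight lang="python">
import numpy as np

def metalog_posterior(a_prior, likelihood, m=5000, k=5, seed=0):
    """Approximate MetaLog posterior from a MetaLog prior (coefficients
    a_prior) and a vectorized likelihood function x -> L(x | data)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(1e-4, 1.0 - 1e-4, size=m)    # stay away from the endpoints 0 and 1
    x_hat = metalog_quantile(u, a_prior)         # step 1: sample from the prior
    w_hat = likelihood(x_hat)
    w_hat = w_hat / w_hat.sum()                  # step 2: normalized likelihood weights
    return fit_metalog(x_hat, w_hat, k=k)        # step 3: weighted MetaLog refit

# Hypothetical usage with a placeholder Gaussian likelihood for observations `data`:
# data = np.array([2.1, 1.8, 2.4])
# lik = lambda x: np.exp(-0.5 * ((data[:, None] - x) / 0.5) ** 2).prod(axis=0)
# a_post = metalog_posterior(a_prior, lik)
</syntaxhighlight>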

Why does this work?

The posterior distribution has the form [math]\displaystyle{ p(x|data) \propto p(x) L(x|data) }[/math], so this is essentially importance sampling in which the sampling (proposal) distribution is the prior [math]\displaystyle{ p(x) }[/math] and the weights are [math]\displaystyle{ w_i \propto L(x_i|data) }[/math].

The result is only an approximation to the posterior, but because the MetaLog has nearly unlimited shape flexibility, the refit can continue to match the shape of the true posterior.

Convergence

Importance sampling works well when the sampling distribution is close to the target distribution, which in this case means the posterior doesn't change much relative to the prior (i.e., [math]\displaystyle{ L(x|data) }[/math] is large over the bulk of the prior). We'd expect a poorer fit when [math]\displaystyle{ L(x|data) }[/math] is very small over most of the prior's support, so that only a few samples carry nearly all of the weight, as often happens in Bayesian inference problems.

We might be able to iterate on this to get a better fit. After the weighted fit, sample a new [math]\displaystyle{ [ x_1,...,x_m] }[/math] from this intermediate posterior, which now serves as the sampling distribution. Since the sampling distribution is no longer the prior, the weights must be adjusted to [math]\displaystyle{ w_i \propto p_{prior}(x_i)\, L(x_i|data) / q(x_i) }[/math], where [math]\displaystyle{ q }[/math] is the density of the intermediate fit. The new samples are distributed closer to the posterior, so the refit should give a more solid posterior fit. A sketch of one such pass appears below.
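Here is an untested sketch of one refinement pass, again reusing the hypothetical helpers above. The density of a fitted MetaLog is obtained by numerically inverting the quantile function and using [math]\displaystyle{ f(x) = 1 / M'(y) }[/math]; this assumes the fit is feasible (strictly increasing), which the unconstrained fit does not guarantee.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import brentq

def metalog_pdf(x, a, eps=1e-7):
    """Density of a metalog at a scalar x: solve y = M^{-1}(x) numerically,
    then f(x) = 1 / M'(y). Assumes the fitted metalog is strictly increasing."""
    g = lambda yy: float(metalog_quantile(yy, a)) - x
    y = brentq(g, 1e-6, 1.0 - 1e-6)              # fails if x lies outside this quantile range
    dM = (float(metalog_quantile(y + eps, a)) -
          float(metalog_quantile(y - eps, a))) / (2.0 * eps)
    return 1.0 / dM

def refine_posterior(a_prior, a_interm, likelihood, m=5000, k=5, seed=1):
    """One refinement pass: sample from the intermediate posterior fit and
    importance-reweight by prior density * likelihood / sampling density."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(1e-3, 1.0 - 1e-3, size=m)
    x = metalog_quantile(u, a_interm)                        # new sampling distribution
    q = np.array([metalog_pdf(xi, a_interm) for xi in x])    # sampling density at each point
    p = np.array([metalog_pdf(xi, a_prior) for xi in x])     # prior density at each point
    w = p * likelihood(x) / q
    return fit_metalog(x, w / w.sum(), k=k)
</syntaxhighlight>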
