User:Lchrisman/KeelinGradient

This is a private page of Lonnie. The goal here is to derive the gradient of the log likelihood for the Keelin distribution. This is my own scratch work, so it is likely highly prone to mistakes. This is actually the gradient of the log likelihood for any linear quantile parameterized function, Keelin being a special case.

Notation

[math]\displaystyle{ x_i }[/math] = data points, [math]\displaystyle{ i=1..m }[/math]
[math]\displaystyle{ \vec{a} }[/math] = quantile coefficients, [math]\displaystyle{ [a_1, ..., a_k, ..., a_N] }[/math]
[math]\displaystyle{ \vec{B}(y) }[/math] = Basis at scalar y, where [math]\displaystyle{ 0\lt y\lt 1 }[/math] is a probability
[math]\displaystyle{ = [B_1(y), B_2(y), ..., B_k(y),...B_N(y)] }[/math]
[math]\displaystyle{ M(y) = \vec{a} \cdot \vec{B}(y) }[/math] = y-th quantile
[math]\displaystyle{ m^{-1}(y) = {{\partial M(y)}\over {\partial y}} = \vec{a} \cdot \vec{b}(y) }[/math] = inverse density
[math]\displaystyle{ \vec{b}(y) }[/math] = the derivative basis, [math]\displaystyle{ [b_1(y), b_2(y), ..., b_k(y),...b_N(y)] }[/math]
[math]\displaystyle{ m(y) }[/math] = probability density at the yth quantile
[math]\displaystyle{ y(x_i) }[/math] = the cumulative probability at [math]\displaystyle{ x_i }[/math]
[math]\displaystyle{ \nabla_a f = \langle {{\partial f}\over{\partial a_1}},..,{{\partial f}\over{\partial a_k}},..,{{\partial f}\over{\partial a_N}} \rangle }[/math] = the gradient wrt [math]\displaystyle{ \vec{a} }[/math]
[math]\displaystyle{ LL }[/math] = The log likelihood of the data -- its a function of [math]\displaystyle{ \vec{a} }[/math] and [math]\displaystyle{ [x_1,...,x_m] }[/math].
= [math]\displaystyle{ \ln \left( \prod_i p( x_i | \vec{a} ) \right) = \sum_i \ln p(x_i|\vec{a}) }[/math]
[math]\displaystyle{ p( x_i | \vec{a} ) = m( y(x_i) ) }[/math] = The probability density of the point [math]\displaystyle{ x_i }[/math]

The gradient derivation

The gradient of the log-likelihood

[math]\displaystyle{ \begin{split} \nabla_a LL &= \nabla_a \sum_i \ln p(x_i|\vec{a}) \\ &= \sum_i \nabla_a \ln m(y(x_i)) \\ &= \sum_i m^{-1}(y(x_i)) \nabla_a m(y(x_i)) \end{split} }[/math]

[math]\displaystyle{ {{\partial m(y(x_i))}\over{\partial a_k}} = b_k(y(x_i)) {{\partial y(x_i)}\over{\partial a_k}} }[/math] since [math]\displaystyle{ b_k }[/math] does not depend on [math]\displaystyle{ \vec{a} }[/math].

So we can write this as

[math]\displaystyle{ \nabla_a m(y(x_i)) = \vec{b}(y(x_i)) \nabla_a y(x_i) }[/math]

The gradient [math]\displaystyle{ \nabla_a y(x_i) }[/math] is the interesting part. [math]\displaystyle{ y(x_i) }[/math] is not an each function -- there is no analytical form for it. In Analytica 5, the CumKeelin and DensKeelin functions compute [math]\displaystyle{ y(x_i) }[/math] using the Newton-Raphson method, which is fast, but not closed form by any means.

However, assume we have computed [math]\displaystyle{ \tilde{y} = y(x_i) }[/math], for example by using Newton-Raphson's method. Now, suppose we change [math]\displaystyle{ a_k }[/math] by an epsilon amount, [math]\displaystyle{ {\partial a_k} }[/math], so that [math]\displaystyle{ \tilde{y} }[/math] is a pretty close approximation to the new [math]\displaystyle{ y(x_i) }[/math]. Since our change was small, I'll suppose we can obtain [math]\displaystyle{ y(x_i) }[/math] by executing one iteration of Newton-Raphson's method starting at [math]\displaystyle{ \tilde{y_i} }[/math].

[math]\displaystyle{ y(x_i) = \tilde{y} - {{(M(\tilde{y}) - x_i)}\over{m^{-1}(\tilde{y})}} }[/math]

The denominator follows from the fact that [math]\displaystyle{ {{\partial M(y)}\over{\partial y}} = m^{-1}(y) }[/math]. Using this,

[math]\displaystyle{ \begin{split} {{\partial y(x_i)}\over{\partial a_k}} &= - { { m^{-1}(\tilde{y_i}) {{\partial M(\tilde{y})}\over{\partial a_k}} - \left(M(\tilde{y_i})-x_i\right) {{\partial m^{-1}(\tilde{y_i})}\over{\partial a_k}}} \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ &= - { { m^{-1}(\tilde{y_i}) B_k(\tilde{y}) - \left(M(\tilde{y_i})-x_i\right) b_k(\tilde{y_i}) } \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ \nabla_a y(x_i) &= { { \left(M(\tilde{y_i})-x_i\right) \vec{b}(\tilde{y_i}) - m^{-1}(\tilde{y_i}) \vec{B}(\tilde{y_i}) } \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } } \end{split} }[/math]

Gradient descent algorithm

Given data [math]\displaystyle{ x_i }[/math], obtain a starting [math]\displaystyle{ \vec{a} }[/math] using ordinary least squares in quantile space.
Use Newton-Raphson to compute [math]\displaystyle{ y_i = y(x_i) }[/math] for each datum, given the current [math]\displaystyle{ \vec{a} }[/math].
Compute [math]\displaystyle{ \nabla_a LL }[/math] as
- Compute [math]\displaystyle{ \nabla_a y(x_i) }[/math] using last equation in previous section.
- Compute [math]\displaystyle{ \nabla_a m(y(x_i)) = \vec{b}(y(x_i)) \nabla_a y(x_i) }[/math]
- Compute [math]\displaystyle{ \nabla_a LL = \sum_i \vec{a} \cdot \vec{b}(y(x_i)) \nabla_a m(y(x_i)) }[/math]
Take a step in the direction of the gradient, e.g.,
[math]\displaystyle{ \vec{a}_{new} = \vec{a} + \epsilon \nabla_a LL }[/math]

where [math]\displaystyle{ \epsilon }[/math] is the learning rate
If converged, stop
Go to Step 2, but use the values of [math]\displaystyle{ y_i }[/math] as the starting point for the next Newton-Raphson iteration.