User:Lchrisman/KeelinGradient

This is a private page of Lonnie. The goal here is to derive the gradient of the log likelihood for the Keelin distribution. This is my own scratch work, so it is likely highly prone to mistakes. This is actually the gradient of the log likelihood for any linear quantile parameterized function, Keelin being a special case, assuming I haven't make a mistake.

Notation

[math]\displaystyle{ x_i }[/math] = data points, [math]\displaystyle{ i=1..m }[/math]
[math]\displaystyle{ \vec{a} }[/math] = quantile coefficients, [math]\displaystyle{ [a_1, ..., a_k, ..., a_N] }[/math]
[math]\displaystyle{ \vec{B}(y) }[/math] = Basis at scalar y, where [math]\displaystyle{ 0\lt y\lt 1 }[/math] is a probability
[math]\displaystyle{ = [B_1(y), B_2(y), ..., B_k(y),...B_N(y)] }[/math]
[math]\displaystyle{ M(y) = \vec{a} \cdot \vec{B}(y) }[/math] = y-th quantile
[math]\displaystyle{ \vec{b}(y) }[/math] = the derivative basis, [math]\displaystyle{ [b_1(y), b_2(y), ..., b_k(y),...b_N(y)] }[/math]
[math]\displaystyle{ b_k(y) = {{\partial B_k(y)}\over{\partial y}} }[/math]
[math]\displaystyle{ m^{-1}(y) = {{\partial M(y)}\over {\partial y}} = \vec{a} \cdot \vec{b}(y) }[/math] = inverse density
[math]\displaystyle{ m(y) }[/math] = probability density at the yth quantile
[math]\displaystyle{ y(x_i) }[/math] = the cumulative probability at [math]\displaystyle{ x_i }[/math], given an implied [math]\displaystyle{ \vec{a} }[/math], not shown explicitly.
[math]\displaystyle{ \nabla_a f = \langle {{\partial f}\over{\partial a_1}},..,{{\partial f}\over{\partial a_k}},..,{{\partial f}\over{\partial a_N}} \rangle }[/math] = the gradient of a function [math]\displaystyle{ f }[/math] wrt [math]\displaystyle{ \vec{a} }[/math]
[math]\displaystyle{ LL }[/math] = The log likelihood of the data -- its a function of [math]\displaystyle{ \vec{a} }[/math] and the data [math]\displaystyle{ [x_1,...,x_m] }[/math].
= [math]\displaystyle{ \ln \left( \prod_i p( x_i | \vec{a} ) \right) = \sum_i \ln p(x_i|\vec{a}) }[/math]
[math]\displaystyle{ p( x_i | \vec{a} ) = m( y(x_i) ) }[/math] = The probability density of the point [math]\displaystyle{ x_i }[/math]

All vectors with an explicit arrow-hat, e.g., [math]\displaystyle{ \vec{a} }[/math], are indexed by k. All gradients with respect to a are also indexed by k, since a is. Dot product is shown explicitly, e.g., [math]\displaystyle{ \vec{a} \cdot \vec{b}(y) }[/math], in which the result is not indexed by k. Component-wise multiplication is shown without a dot, e.g., [math]\displaystyle{ \vec{a} \vec{b}(y) }[/math], in which the result is indexed by k.

The Keelin basis

The derivation and algorithm that follows does not rely on any specific form of the basis. I'll just mention that for the Keelin distribution, the basis is

[math]\displaystyle{ B_k(y) = \left\{ \begin{array}{ll} (y-0.5)^{\lfloor{k/2}\rfloor} & k=0, 3, \mbox{even } k\ge 4 \\ (y-0.5)^{\lfloor{k/2}\rfloor} logit(y) & k=1, 2, \mbox{odd } k\ge 5 \\ \end{array}\right. }[/math]

The gradient derivation

The gradient of the log-likelihood

[math]\displaystyle{ \begin{split} \nabla_a LL &= \nabla_a \sum_i \ln p(x_i|\vec{a}) \\ &= \sum_i \nabla_a \ln m(y(x_i)) \\ &= \sum_i m^{-1}(y(x_i)) \nabla_a m(y(x_i)) \\ &= \sum_i (\vec{a} \cdot \vec{b}(y)) \nabla_a m(y(x_i)) \end{split} }[/math]

[math]\displaystyle{ \begin{split}{{\partial m(y(x_i))}\over{\partial a_k}} &= {{\partial \left(m^{-1}(y(x_i))\right)^{-1}}\over{\partial a_k}} \\ &= - \left( m^{-1}(y(x_i)) \right)^{-2} {{\partial m^{-1}(y(x_i))}\over{\partial a_k}} \\ &= - m^2(y(x_i)) {{\partial \vec{a}\cdot \vec{b}(y(x_i))}\over{\partial a_k}} \\ &= - m^2(y(x_i)) \left( b_k(y(x_i)) + \vec{a} \cdot \vec{b'}(y(x_i)) \right) {{\partial y(x_i)}\over{\partial a_k}} \end{split} }[/math]

So we can write this as

[math]\displaystyle{ \nabla_a m(y(x_i)) = - m^2(y(x_i)) \left( \vec{b}(y(x_i)) + \vec{a} \cdot \vec{b'}(y(x_i)) \right) \nabla_a y(x_i) }[/math]

Note: both [math]\displaystyle{ \vec{b}(y(x_i)) }[/math] and [math]\displaystyle{ \nabla_a y(x_i) }[/math] are indexed by k. This is a component-wise product (not a dot-product), with the result also indexed by k.

The gradient [math]\displaystyle{ \nabla_a y(x_i) }[/math] is the interesting part. [math]\displaystyle{ y(x_i) }[/math] is not an easy function -- there is no analytical form for it. In Analytica 5, the CumKeelin and DensKeelin functions compute [math]\displaystyle{ y(x_i) }[/math] using the Newton-Raphson method, which is fast, but not closed form by any means.

However, assume we have computed [math]\displaystyle{ \tilde{y} = y(x_i) }[/math], for example by using Newton-Raphson's method. Recall that the Newton-Raphson method uses the update rule

[math]\displaystyle{ z_{t+1} = z_t - {{f(z) - v}\over{f'(z)}} }[/math]

such that with each iteration it moves closer to the z value that makes [math]\displaystyle{ f(z)=v }[/math].

So, suppose we change [math]\displaystyle{ a_k }[/math] by an epsilon amount, [math]\displaystyle{ {\partial a_k} }[/math], so that [math]\displaystyle{ \tilde{y} }[/math] is a pretty close approximation to the new [math]\displaystyle{ y(x_i) }[/math]. Since our change was small, I'll suppose we can obtain [math]\displaystyle{ y(x_i) }[/math] by executing one iteration of Newton-Raphson's method starting at [math]\displaystyle{ \tilde{y_i} }[/math].

[math]\displaystyle{ y(x_i) = \tilde{y_i} - {{(M(\tilde{y_i}) - x_i)}\over{m^{-1}(\tilde{y_i})}} }[/math]

The denominator follows from the fact that [math]\displaystyle{ {{\partial M(y)}\over{\partial y}} = m^{-1}(y) }[/math]. Using this,

[math]\displaystyle{ \begin{split} {{\partial y(x_i)}\over{\partial a_k}} &= - { { m^{-1}(\tilde{y_i}) {{\partial M(\tilde{y})}\over{\partial a_k}} - \left(M(\tilde{y_i})-x_i\right) {{\partial m^{-1}(\tilde{y_i})}\over{\partial a_k}}} \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ &= - { { m^{-1}(\tilde{y_i}) B_k(\tilde{y}) - \left(M(\tilde{y_i})-x_i\right) b_k(\tilde{y_i}) } \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ \nabla_a y(x_i) &= { { \left(M(\tilde{y_i})-x_i\right) \vec{b}(\tilde{y_i}) - m^{-1}(\tilde{y_i}) \vec{B}(\tilde{y_i}) } \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ &= { { \left(M(\tilde{y_i})-x_i\right) \vec{b}(\tilde{y_i}) - m^{-1}(\tilde{y_i}) \vec{B}(\tilde{y_i}) } \over { \left[ \vec{a} \cdot \vec{b}(\tilde{y_i}) \right]^2} } \end{split} }[/math]

Gradient descent algorithm

Given data [math]\displaystyle{ x_i }[/math], obtain a starting [math]\displaystyle{ \vec{a} }[/math] using ordinary least squares in quantile space.
Use Newton-Raphson to compute [math]\displaystyle{ y_i = y(x_i) }[/math] for each datum, given the current [math]\displaystyle{ \vec{a} }[/math].
Compute [math]\displaystyle{ \nabla_a LL }[/math] as
- Compute [math]\displaystyle{ \nabla_a y(x_i) }[/math] using last equation in previous section.
- Compute [math]\displaystyle{ \nabla_a m(y(x_i)) = \vec{b}(y(x_i)) \nabla_a y(x_i) }[/math]
- Compute [math]\displaystyle{ \nabla_a LL = \sum_i \vec{a} \cdot \vec{b}(y(x_i)) \nabla_a m(y(x_i)) }[/math]
Take a step in the direction of the gradient, e.g.,
[math]\displaystyle{ \vec{a}_{new} = \vec{a} + \epsilon \nabla_a LL }[/math]

where [math]\displaystyle{ \epsilon }[/math] is the learning rate
If converged, stop
Go to Step 2, but use the values of [math]\displaystyle{ y_i }[/math] as the starting point for the next Newton-Raphson iteration.

The above version does not enforce any feasibility constraints of the fitted quantile function. These would need to be added in using Lagrange multipliers.