User:Lchrisman/KeelinGradient

This is a work-in-progress page of Lonnie and Tom Keelin. The goal here is to explore the gradients of the Keelin distribution and fast algorithms for maximum likelihood fitting of the Keelin distribution. As a work in progress, the results on this page may very well have mistakes. What follows is actually the gradient of the log likelihood for any linear quantile-parameterized function, the Keelin distribution being a special case, assuming I haven't made a mistake.

Notation

  • [math]\displaystyle{ x_i }[/math] = data points, [math]\displaystyle{ i=1..m }[/math]
  • [math]\displaystyle{ \vec{a} }[/math] = quantile coefficients, [math]\displaystyle{ [a_1, ..., a_k, ..., a_N] }[/math]
  • [math]\displaystyle{ \vec{B}(y) }[/math] = Basis at scalar y, where [math]\displaystyle{ 0\lt y\lt 1 }[/math] is a probability
    [math]\displaystyle{ = [B_1(y), B_2(y), ..., B_k(y),...B_N(y)] }[/math]
  • [math]\displaystyle{ \vec{b}(y) }[/math] = the derivative basis, [math]\displaystyle{ [b_1(y), b_2(y), ..., b_k(y),...b_N(y)] }[/math]
    [math]\displaystyle{ b_k(y) = {{\partial B_k(y)}\over{\partial y}} }[/math]
  • [math]\displaystyle{ \vec{b'}(y) }[/math] = the second derivative basis, [math]\displaystyle{ [b'_1(y), b'_2(y), ..., b'_k(y),...b'_N(y)] }[/math]
    [math]\displaystyle{ b'_k(y) = {{\partial b_k(y)}\over{\partial y}} }[/math]
  • [math]\displaystyle{ M(y) = \vec{a} \cdot \vec{B}(y) }[/math] = y-th quantile
  • [math]\displaystyle{ m^{-1}(y) = {{\partial M(y)}\over {\partial y}} = \vec{a} \cdot \vec{b}(y) }[/math] = inverse density
  • [math]\displaystyle{ m(y) }[/math] = probability density at the y-th quantile
  • [math]\displaystyle{ s(y) = \vec{a} \cdot \vec{b'}(y) }[/math]
  • [math]\displaystyle{ s'(y) = \vec{a} \cdot\vec{b''}(y) }[/math]
  • [math]\displaystyle{ y(x_i) }[/math] = the cumulative probability at [math]\displaystyle{ x_i }[/math], given an implied [math]\displaystyle{ \vec{a} }[/math], not shown explicitly.
  • [math]\displaystyle{ \nabla_a f = \langle {{\partial f}\over{\partial a_1}},..,{{\partial f}\over{\partial a_k}},..,{{\partial f}\over{\partial a_N}} \rangle }[/math] = the gradient of a function [math]\displaystyle{ f }[/math] wrt [math]\displaystyle{ \vec{a} }[/math]
  • [math]\displaystyle{ LL }[/math] = The log likelihood of the data -- it's a function of [math]\displaystyle{ \vec{a} }[/math] and the data [math]\displaystyle{ [x_1,...,x_m] }[/math].
    = [math]\displaystyle{ \ln \left( \prod_i p( x_i | \vec{a} ) \right) = \sum_i \ln p(x_i|\vec{a}) }[/math]
  • [math]\displaystyle{ p( x_i | \vec{a} ) = m( y(x_i) ) }[/math] = The probability density of the point [math]\displaystyle{ x_i }[/math]


All vectors written with an explicit arrow, e.g., [math]\displaystyle{ \vec{a} }[/math], are indexed by k, as are all gradients with respect to [math]\displaystyle{ \vec{a} }[/math]. Dot products are written explicitly, e.g., [math]\displaystyle{ \vec{a} \cdot \vec{b}(y) }[/math]; the result is not indexed by k. Component-wise multiplication is written without a dot, e.g., [math]\displaystyle{ \vec{a} \vec{b}(y) }[/math]; the result is indexed by k.

The Keelin basis

The derivation and algorithm that follow do not rely on any specific form of the basis. I'll just mention that for the Keelin distribution, the basis is

[math]\displaystyle{ B_k(y) = \left\{ \begin{array}{ll} (y-0.5)^{\lfloor{k/2}\rfloor} & k=0, 3, \mbox{even } k\ge 4 \\ (y-0.5)^{\lfloor{k/2}\rfloor} logit(y) & k=1, 2, \mbox{odd } k\ge 5 \end{array}\right. }[/math]
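
Concretely, reading off this case split, the first six basis terms are

[math]\displaystyle{ B_0(y)=1,\quad B_1(y)=logit(y),\quad B_2(y)=(y-0.5)\, logit(y),\quad B_3(y)=y-0.5,\quad B_4(y)=(y-0.5)^2,\quad B_5(y)=(y-0.5)^2\, logit(y) }[/math]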

The derivative basis is

[math]\displaystyle{ b_k(y) = \left\{ \begin{array}{ll} {\lfloor{k/2}\rfloor}\left(y-\dfrac{1}{2}\right)^{{\lfloor{k/2}\rfloor}-1} & k=0, 3, \mbox{even } k\ge 4 \\ \dfrac{\left(y-\frac{1}{2}\right)^{\lfloor{k/2}\rfloor}\left(\left(2{\lfloor{k/2}\rfloor}y^2-2{\lfloor{k/2}\rfloor}y\right) logit(y)-2y+1\right)}{\left(y-1\right)y\left(2y-1\right)} & k=1, 2, \mbox{odd } k\ge 5 \end{array}\right. }[/math]

The second derivative basis is

[math]\displaystyle{ b'_k(y) = \left\{ \begin{array}{ll} \left({\lfloor{k/2}\rfloor}-1\right) {\lfloor{k/2}\rfloor} \left(y-\dfrac{1}{2}\right)^{ {\lfloor{k/2}\rfloor} -2} & k=0, 3, \mbox{even } k\ge 4 \\ \dfrac{\left(y-\frac{1}{2}\right)^{\lfloor{k/2}\rfloor} \left[ g_1(y,{\lfloor{k/2}\rfloor}) logit(y)+g_2(y,{\lfloor{k/2}\rfloor})\right]}{\left(y-1\right)^2y^2\left(2y-1\right)^2} & k=1, 2, \mbox{odd } k\ge 5 \end{array}\right. }[/math]

The third derivative basis (which appears in the Hessian) is

[math]\displaystyle{ b''_k(y) = \left\{ \begin{array}{ll} \left({\lfloor{k/2}\rfloor}-2\right) \left({\lfloor{k/2}\rfloor}-1\right) {\lfloor{k/2}\rfloor} \left(y-\dfrac{1}{2}\right)^{ {\lfloor{k/2}\rfloor} -3} & k=0, 3, \mbox{even } k\ge 4 \\ {{2\left(y-\dfrac{1}{2}\right)^{\lfloor{k/2}\rfloor} \left[ g_3\left(y,{\lfloor{k/2}\rfloor}\right) logit(y) + g_4(y,{\lfloor{k/2}\rfloor}) \right] } \over {(y-1)^3 y^3 (2y-1)^3}} & k=1, 2, \mbox{odd } k\ge 5 \end{array}\right. }[/math]

where

[math]\displaystyle{ g_1(y,c) = \left(4 c^2 - 4 c \right)y^4+\left(8 c - 8 c^2\right)y^3+\left(4 c^2-4 c\right)y^2 }[/math]
[math]\displaystyle{ g_2(y,c) = \left(8-8 c\right)y^3+\left(12 c-12\right)y^2+\left(6-4 c\right)y-1 }[/math]
[math]\displaystyle{ g_3(y,c) = (4 c^3 - 12 c^2 + 8 c)(y^6 - 3 y^5 + 3 y^4 - y^3) }[/math]
[math]\displaystyle{ g_4(y,c) = (-12 c^2 + 36 c - 24)y^5 + (30c^2 - 90c + 60)y^4 + (-24c^2 + 78 c - 62) y^3 + (6 c^2 - 27c + 33) y^2 + (3c-9) y + 1 }[/math]
[math]\displaystyle{ \lfloor{k/2}\rfloor }[/math] is the floor of [math]\displaystyle{ k/2 }[/math] -- the largest integer that is less than or equal to [math]\displaystyle{ k/2 }[/math].
[math]\displaystyle{ logit(y) = \ln\left( {y\over{1-y}} \right) }[/math]
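
Here's a minimal NumPy sketch of these basis functions and their derivatives (the name keelin_bases and the zero-based indexing k = 0..N-1 are just illustrative choices, not part of any library). Rather than the expanded rational forms above, it builds the derivatives from the product (Leibniz) rule applied to [math]\displaystyle{ (y-0.5)^{\lfloor k/2\rfloor} }[/math] and [math]\displaystyle{ logit(y) }[/math], which is algebraically equivalent.

<syntaxhighlight lang="python">
import numpy as np
from math import comb

def keelin_bases(y, N, order=2):
    """Return a list [B, b, b', ...]: the basis and its first `order` (<= 3)
    derivatives w.r.t. y, each an array of shape (len(y), N).
    Power-only terms: k = 0, 3 and even k >= 4; logit terms: k = 1, 2 and odd k >= 5."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    u = y * (1.0 - y)
    # logit(y) and its first three derivatives w.r.t. y
    L = [np.log(y / (1.0 - y)), 1.0 / u, (2*y - 1) / u**2, (2*u + 2*(2*y - 1)**2) / u**3]
    out = [np.empty((y.size, N)) for _ in range(order + 1)]
    for k in range(N):
        c = k // 2                                   # floor(k/2)
        # derivatives of the power part (y - 1/2)^c
        P, coef = [], 1.0
        for d in range(order + 1):
            e = c - d
            P.append(coef * (y - 0.5) ** e if e >= 0 else np.zeros_like(y))
            coef *= (c - d)
        power_only = k in (0, 3) or (k >= 4 and k % 2 == 0)
        for d in range(order + 1):
            if power_only:
                out[d][:, k] = P[d]
            else:
                # Leibniz rule for the d-th derivative of (y-1/2)^c * logit(y)
                out[d][:, k] = sum(comb(d, r) * P[r] * L[d - r] for r in range(d + 1))
    return out
</syntaxhighlight>

For example, keelin_bases(0.3, 10, order=2) returns [math]\displaystyle{ \vec{B} }[/math], [math]\displaystyle{ \vec{b} }[/math], and [math]\displaystyle{ \vec{b'} }[/math] evaluated at y = 0.3 for a 10-term basis.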

The gradient derivation

The gradient of the log-likelihood

[math]\displaystyle{ \begin{split} \nabla_a LL &= \nabla_a \sum_i \ln p(x_i|\vec{a}) \\ &= \sum_i \nabla_a \ln m(y(x_i)) \\ &= \sum_i m^{-1}(y(x_i)) \nabla_a m(y(x_i)) \\ \end{split} }[/math]
[math]\displaystyle{ \begin{split} \nabla_a m(y(x_i)) &= \nabla_a \left(m^{-1}(y(x_i))\right)^{-1} \\ &= - \left( m^{-1}(y(x_i)) \right)^{-2} \nabla_a m^{-1}(y(x_i)) \\ &= - m^2(y(x_i)) \nabla_a \vec{a}\cdot \vec{b}(y(x_i)) \\ &= - m^2(y(x_i)) \left( \vec{b}(y(x_i)) + \left[\vec{a} \cdot \vec{b'}(y(x_i))\right] \nabla_a y(x_i) \right) \end{split} }[/math]
[math]\displaystyle{ \begin{split} \nabla_a LL &= -\sum_i m(y(x_i)) \left( \vec{b}(y(x_i)) + s(y(x_i)) \nabla_a y(x_i) \right) \end{split} }[/math]

where [math]\displaystyle{ s(y) = \vec{a} \cdot \vec{b'}(y) }[/math].

To obtain [math]\displaystyle{ \nabla_a y(x_i) }[/math], use

[math]\displaystyle{ \begin{split} x_i &= M(y_i) = \vec{a} \cdot \vec{B}(y_i) \\ 0 &= {\partial\over{\partial a_k}} \vec{a} \cdot \vec{B}(y_i) \\ &= B_k(y_i) + [\vec{a} \cdot \vec{b}(y_i)] {{\partial y_i}\over{\partial a_k}} \\ &= B_k(y_i) + m^{-1}(y_i) {{\partial y_i}\over{\partial a_k}} \\ {{\partial y_i}\over{\partial a_k}} &= - m(y_i) B_k(y_i) \\ \nabla_a y_i &= -m(y_i) \vec{B}(y_i) \end{split} }[/math]

So

[math]\displaystyle{ \nabla_a LL = -\sum_i m(y_i) \left( \vec{b}(y_i) - s(y_i) m(y_i) \vec{B}(y_i) \right) }[/math]

where [math]\displaystyle{ y_i = y(x_i) }[/math] is found by solving [math]\displaystyle{ x_i = M(y_i) }[/math] by Newton-Raphson.
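
Here's a minimal NumPy sketch of this result, assuming the keelin_bases sketch above. The names solve_y and grad_LL are illustrative, and the fixed iteration count and clipping in the Newton-Raphson solve are simplifications.

<syntaxhighlight lang="python">
def solve_y(a, x, y0=None, iters=20):
    """Solve x_i = M(y_i) for y_i by Newton-Raphson, vectorized over the data.
    y0 is an optional warm start (e.g. the y_i from a previous outer iteration)."""
    x = np.asarray(x, dtype=float)
    y = np.full_like(x, 0.5) if y0 is None else np.array(y0, dtype=float)
    for _ in range(iters):
        B, b = keelin_bases(y, len(a), order=1)
        y = np.clip(y - (B @ a - x) / (b @ a), 1e-9, 1 - 1e-9)   # Newton step on M(y) - x
    return y

def grad_LL(a, y):
    """grad_a LL = -sum_i m(y_i) * ( b(y_i) - s(y_i) m(y_i) B(y_i) ),
    with the y_i = y(x_i) already solved."""
    B, b, bp = keelin_bases(y, len(a), order=2)
    m = 1.0 / (b @ a)            # density m(y_i) = 1 / (a . b(y_i))
    s = bp @ a                   # s(y_i) = a . b'(y_i)
    return -(m[:, None] * (b - (s * m)[:, None] * B)).sum(axis=0)
</syntaxhighlight>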

Old derivation of [math]\displaystyle{ \nabla_a y(x_i) }[/math]

(I'm keeping this around for a while for reference. Tom pointed out that the first term is zero, which led to a simpler derivation and a simpler form.)

The gradient [math]\displaystyle{ \nabla_a y(x_i) }[/math] is the interesting part. [math]\displaystyle{ y(x_i) }[/math] is not an easy function -- there is no analytical form for it. In Analytica 5, the CumKeelin and DensKeelin functions compute [math]\displaystyle{ y(x_i) }[/math] using the Newton-Raphson method, which is fast, but not closed form by any means.

However, assume we have computed [math]\displaystyle{ \tilde{y_i} = y(x_i) }[/math], for example by the Newton-Raphson method. Recall that Newton-Raphson uses the update rule

[math]\displaystyle{ z_{t+1} = z_t - {{f(z_t) - v}\over{f'(z_t)}} }[/math]

such that with each iteration it moves closer to the z value that makes [math]\displaystyle{ f(z)=v }[/math].

So, suppose we change [math]\displaystyle{ a_k }[/math] by an epsilon amount, [math]\displaystyle{ {\partial a_k} }[/math], so that [math]\displaystyle{ \tilde{y_i} }[/math] remains a close approximation to the new [math]\displaystyle{ y(x_i) }[/math]. Since the change is small, I'll suppose we can obtain [math]\displaystyle{ y(x_i) }[/math] by executing one iteration of Newton-Raphson starting at [math]\displaystyle{ \tilde{y_i} }[/math].

[math]\displaystyle{ y(x_i) = \tilde{y_i} - {{(M(\tilde{y_i}) - x_i)}\over{m^{-1}(\tilde{y_i})}} }[/math]

The denominator follows from the fact that [math]\displaystyle{ {{\partial M(y)}\over{\partial y}} = m^{-1}(y) }[/math]. Using this,

[math]\displaystyle{ \begin{split} {{\partial y(x_i)}\over{\partial a_k}} &= - { { m^{-1}(\tilde{y_i}) {{\partial M(\tilde{y_i})}\over{\partial a_k}} - \left(M(\tilde{y_i})-x_i\right) {{\partial m^{-1}(\tilde{y_i})}\over{\partial a_k}}} \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ &= - { { m^{-1}(\tilde{y_i}) B_k(\tilde{y_i}) - \left(M(\tilde{y_i})-x_i\right) b_k(\tilde{y_i}) } \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ \nabla_a y(x_i) &= { { \left(M(\tilde{y_i})-x_i\right) \vec{b}(\tilde{y_i}) - m^{-1}(\tilde{y_i}) \vec{B}(\tilde{y_i}) } \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ &= { { \left(M(\tilde{y_i})-x_i\right) \vec{b}(\tilde{y_i}) - m^{-1}(\tilde{y_i}) \vec{B}(\tilde{y_i}) } \over { \left[ \vec{a} \cdot \vec{b}(\tilde{y_i}) \right]^2} } \end{split} }[/math]

Status: The gradient equations above now agree with finite differencing on [math]\displaystyle{ a_k }[/math], so I'm pretty confident they are now right.
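
The kind of check described here can be sketched as a central finite difference on the log likelihood, reusing the solve_y and grad_LL sketches above (log_LL, fd_check, and the step size h are illustrative):

<syntaxhighlight lang="python">
def log_LL(a, x):
    """LL(a) = sum_i ln m(y(x_i)) = -sum_i ln( a . b(y(x_i)) )."""
    y = solve_y(a, x)
    _, b = keelin_bases(y, len(a), order=1)
    return -np.sum(np.log(b @ a))

def fd_check(a, x, k, h=1e-6):
    """Compare gradient component k to a central finite difference on a_k."""
    e = np.zeros(len(a))
    e[k] = h
    fd = (log_LL(a + e, x) - log_LL(a - e, x)) / (2 * h)
    return fd, grad_LL(a, solve_y(a, x))[k]
</syntaxhighlight>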

Gradient descent algorithm

  1. Given data [math]\displaystyle{ x_i }[/math], obtain a starting [math]\displaystyle{ \vec{a} }[/math] using ordinary least squares in quantile space.
  2. Use Newton-Raphson to compute [math]\displaystyle{ y_i = y(x_i) }[/math] for each datum, given the current [math]\displaystyle{ \vec{a} }[/math].
  3. Compute [math]\displaystyle{ \nabla_a LL }[/math]
  4. Take a step in the direction of the gradient, e.g.,
    [math]\displaystyle{ \vec{a}_{new} = \vec{a} + \epsilon \nabla_a LL }[/math]
    where [math]\displaystyle{ \epsilon }[/math] is the learning rate
  5. If converged, stop
  6. Go to Step 2, but use the values of [math]\displaystyle{ y_i }[/math] as the starting point for the next Newton-Raphson iteration.
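
Here's a minimal end-to-end sketch of this loop, again assuming the helpers sketched above. The learning rate, the convergence test on the gradient norm, and the empirical probabilities (i - 0.5)/m used for the OLS start are illustrative choices; the '+' in Step 4 makes this an ascent on LL.

<syntaxhighlight lang="python">
def fit_keelin_mle(x, N, lr=1e-3, max_iter=1000, tol=1e-8):
    """Sketch of the algorithm above: OLS start in quantile space, then alternate
    Newton-Raphson solves for y_i with gradient steps on the log likelihood."""
    x = np.sort(np.asarray(x, dtype=float))
    # Step 1: OLS start -- regress x on B(y) at empirical probabilities.
    y = (np.arange(1, x.size + 1) - 0.5) / x.size
    a, *_ = np.linalg.lstsq(keelin_bases(y, N, order=0)[0], x, rcond=None)
    for _ in range(max_iter):
        y = solve_y(a, x, y0=y)          # Steps 2 and 6: warm-started Newton-Raphson
        g = grad_LL(a, y)                # Step 3
        a = a + lr * g                   # Step 4: step in the gradient direction
        if np.linalg.norm(g) < tol:      # Step 5: convergence test
            break
    return a
</syntaxhighlight>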

Hessian Derivation

(Derived by Tom)

With the Hessian, it should be possible to use multidimensional Newton-Raphson (i.e., find the zero of the gradient) to compute [math]\displaystyle{ \vec{a} }[/math] quickly.

[math]\displaystyle{ \nabla^2_a LL = \left[ {{\partial^2 LL} \over {\partial a_j \partial a_k}} \right] }[/math]

The next line follows from differentiating [math]\displaystyle{ m = \left(\vec{a}\cdot\vec{b}\right)^{-1} }[/math] with respect to [math]\displaystyle{ a_k }[/math], using [math]\displaystyle{ \nabla_a y = -m(y) \vec{B}(y) }[/math] from above

[math]\displaystyle{ {{\partial m}\over{\partial a_k}} = -m(y)^2 (- s(y) m(y) B_k + b_k) }[/math]

For conciseness, I'll write [math]\displaystyle{ y }[/math] for [math]\displaystyle{ y_i = y(x_i) }[/math]. Let [math]\displaystyle{ t_i }[/math] denote the [math]\displaystyle{ i^{th} }[/math] data point's term in the [math]\displaystyle{ j^{th} }[/math] component of [math]\displaystyle{ \nabla_a LL }[/math], with the leading minus sign factored out

[math]\displaystyle{ t_i = m(y) b_j(y) - \left( \vec{a}\cdot \vec{b'}(y) \right) m^2(y) B_j(y) = m(y) b_j(y) - s(y) m^2(y) B_j(y) }[/math]

So

[math]\displaystyle{ {{\partial^2 LL} \over {\partial a_j \partial a_k}} = -{\partial \over {\partial a_k}} \sum_i t_i = -\sum_i {{\partial t_i}\over{\partial a_k}} }[/math]

The derivation is:

[math]\displaystyle{ {{\partial t_i}\over{\partial a_k}} = {{\partial m}\over{\partial a_k}} b_j + m(y) b'_j(y) {{\partial y}\over{\partial a_k}} - s(y) m(y)^2 b_j(y) {{\partial y}\over{\partial a_k}} - 2 s(y) m(y) {{\partial m}\over{\partial a_k}} B_j(y) - {{\partial s(y)}\over{\partial a_k}} m(y)^2 B_j(y) }[/math]

A previous derivation is as follows (Note: After correcting a mistake in sign, I believe this is identical to the preceding one):

[math]\displaystyle{ \begin{split} {{\partial t_i}\over{\partial a_k}} &= m(y) b'_j(y) {{\partial y}\over{\partial a_k}} &+ m'(y) b_j(y) {{\partial y}\over{\partial a_k}} \\ &&- \left( \vec{a}\cdot \vec{b'}(y) \right) m^2(y) b_j(y) {{\partial y}\over{\partial a_k}} \\ &&- \left( \vec{a}\cdot \vec{b'}(y) \right) 2 m(y) m'(y) {{\partial y}\over{\partial a_k}} B_j(y) \\ &&- \left( \vec{a}\cdot \vec{b''}(y) {{\partial y}\over{\partial a_k}} + b'_k(y) \right) m^2(y) B_j(y) \end{split} }[/math]
[math]\displaystyle{ \begin{split} {{\partial t_i}\over{\partial a_k}} &= {{\partial y}\over{\partial a_k}} \left[ m(y) b'_j(y)\right. &+ m'(y) b_j(y) \\ &&- \left( \vec{a}\cdot \vec{b'}(y) \right) m^2(y) b_j(y) \\ &&- \left( \vec{a}\cdot \vec{b'}(y) \right) 2 m(y) m'(y) B_j(y) \\ &&- \left. \left( \vec{a}\cdot \vec{b''}(y) \right) m^2(y) B_j(y) \right] - b'_k(y) m^2(y) B_j(y) \end{split} }[/math]

Using [math]\displaystyle{ {{\partial y}\over{\partial a_k}} = -m(y) B_k(y) }[/math], we have

[math]\displaystyle{ \begin{split} {{\partial^2 LL} \over {\partial a_j \partial a_k}} &= \sum_i \left( -m(y) B_k(y) \left[ m(y) b'_j(y)\right.\right. &+ m'(y) b_j(y) \\ &&- s(y) m^2(y) b_j(y) \\ &&- 2 s(y) m(y) m'(y) B_j(y) \\ &&- \left.\left. s'(y) m^2(y) B_j(y) \right] - m^2(y) b'_k(y) B_j(y) \right) \end{split} }[/math]

where [math]\displaystyle{ s'(y) = \vec{a}\cdot \vec{b''}(y) }[/math]
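
Here's a minimal NumPy sketch of the Hessian computation, following the expression for [math]\displaystyle{ {{\partial t_i}\over{\partial a_k}} }[/math] in the current derivation above, together with [math]\displaystyle{ {{\partial y}\over{\partial a_k}} = -m(y) B_k(y) }[/math], [math]\displaystyle{ {{\partial m}\over{\partial a_k}} }[/math] from earlier, and [math]\displaystyle{ {{\partial s}\over{\partial a_k}} = b'_k(y) + s'(y) {{\partial y}\over{\partial a_k}} }[/math]. It assumes the keelin_bases sketch from the basis section; hessian_LL is an illustrative name.

<syntaxhighlight lang="python">
def hessian_LL(a, y):
    """Hessian of LL w.r.t. a (rows j, columns k), with y_i = y(x_i) already solved.
    Accumulates -dt_i/da_k over the data points per the derivation above."""
    N = len(a)
    B, b, bp, bpp = keelin_bases(y, N, order=3)
    m  = 1.0 / (b @ a)                   # density m(y_i)
    s  = bp @ a                          # s(y_i)  = a . b'(y_i)
    sp = bpp @ a                         # s'(y_i) = a . b''(y_i)
    H = np.zeros((N, N))
    for i in range(y.size):              # loop over data points (clarity over speed)
        Bi, bi, bpi = B[i], b[i], bp[i]
        mi, si, spi = m[i], s[i], sp[i]
        dy = -mi * Bi                            # dy/da_k
        dm = -mi**2 * (bi - si * mi * Bi)        # dm/da_k
        ds = bpi + spi * dy                      # ds/da_k
        dt = (np.outer(bi, dm)                   # (dm/da_k) b_j
              + mi * np.outer(bpi, dy)           # m b'_j dy/da_k
              - si * mi**2 * np.outer(bi, dy)    # -s m^2 b_j dy/da_k
              - 2.0 * si * mi * np.outer(Bi, dm) # -2 s m (dm/da_k) B_j
              - mi**2 * np.outer(Bi, ds))        # -(ds/da_k) m^2 B_j
        H -= dt
    return H
</syntaxhighlight>

A multidimensional Newton step on [math]\displaystyle{ \vec{a} }[/math] would then be [math]\displaystyle{ \vec{a} \leftarrow \vec{a} - \left(\nabla^2_a LL\right)^{-1} \nabla_a LL }[/math], as suggested above and under Future work.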

Future work

  • The Hessian makes it possible to use Newton-Raphson for the whole MaxLL fit (faster convergence than simple gradient descent).
  • Can we solve this as a constrained optimization problem (LP) using both a and y as decision variables, with the extra constraint x = M(y)?
  • Add in feasibility constraints.