LChrisman/Boundary Method for Feasible MetaLog
Brainstorm page: This page is here to aid the development of an idea that might lead to an algorithm for reliably and efficiently finding the best fit feasible MetaLog distribution.
Notation
- [math]\displaystyle{ k }[/math] = number of MetaLog terms
- [math]\displaystyle{ x_i\in \Re, y_i\in [0,1] }[/math] = data points, [math]\displaystyle{ i = 1, \ldots, m }[/math]. Denote [math]\displaystyle{ x=[x_i], y=[y_i] }[/math].
- [math]\displaystyle{ {a} }[/math] = quantile coefficients, [math]\displaystyle{ [a_1, ..., a_j, ..., a_k] }[/math]
- [math]\displaystyle{ B(y):[0,1]\rarr \Re^k }[/math] = Basis at scalar y, where [math]\displaystyle{ 0\lt y\lt 1 }[/math] is a probability
- [math]\displaystyle{ = [B_1(y), B_2(y), ..., B_j(y),...B_k(y)] }[/math]
- [math]\displaystyle{ B_j(y) = (y-0.5)^{\lfloor \frac{j-1}{2} \rfloor} (\mathbf{1}_{j \in \mu} + \ell(y) \mathbf{1}_{j \in s} ) }[/math], where [math]\displaystyle{ \ell(y) = \ln\left({y\over{1-y}}\right) }[/math] is the logit and [math]\displaystyle{ \mu }[/math], [math]\displaystyle{ s }[/math] are the index sets of the terms without and with the logit factor, respectively
- [math]\displaystyle{ \hat{B} = \left[ \begin{array}{c} B(y_1) \\ B(y_2) \\ \vdots \\ B(y_m) \end{array} \right] }[/math] = The basis [math]\displaystyle{ m \times k }[/math] matrix.
- [math]\displaystyle{ {b}(y) : [0,1]\rarr \Re^k }[/math] = the derivative basis, [math]\displaystyle{ [b_1(y), b_2(y), ..., b_j(y),...b_k(y)] }[/math]
- [math]\displaystyle{ b_j(y) = {{\partial B_j(y)}\over{\partial y}} = \left\lfloor \tfrac{j-1}{2} \right\rfloor (y-0.5)^{\lfloor \frac{j-1}{2} \rfloor - 1} \left(\mathbf{1}_{j \in \mu} + \ell(y) \mathbf{1}_{j \in s}\right) + {{(y-0.5)^{\lfloor \frac{j-1}{2} \rfloor}}\over{y(1-y)}} \mathbf{1}_{j \in s} }[/math] (see the code sketch after this list)
- [math]\displaystyle{ M(y):[0,1]\rarr \Re = {a} \cdot {B}(y) }[/math] = y-th quantile
- [math]\displaystyle{ M'(y):[0,1]\rarr\Re = {a} \cdot {b}(y) }[/math] = inverse density at [math]\displaystyle{ y^{th} }[/math] quantile
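As a concrete reference, here is a minimal numerical sketch (Python/NumPy) of the basis and derivative basis. It assumes [math]\displaystyle{ \ell(y) }[/math] is the logit and that the index sets [math]\displaystyle{ \mu }[/math] and [math]\displaystyle{ s }[/math] follow the standard MetaLog term ordering; the helper names (basis_row, dbasis_row) are illustrative only, not part of any existing library.
<syntaxhighlight lang="python">
import numpy as np

def logit(y):
    """l(y) = ln(y / (1 - y)); its derivative is 1 / (y (1 - y))."""
    return np.log(y / (1.0 - y))

def basis_row(y, k):
    """B(y): the k-term basis at a scalar probability 0 < y < 1.
    Assumes the standard MetaLog term ordering:
    mu = {1, 4, 5, 7, 9, ...}  (pure powers of (y - 0.5)),
    s  = {2, 3, 6, 8, 10, ...} (powers of (y - 0.5) times logit(y))."""
    B = np.empty(k)
    for j in range(1, k + 1):
        p = (j - 1) // 2                                # floor((j-1)/2)
        in_s = j in (2, 3) or (j >= 6 and j % 2 == 0)   # term carries a logit factor
        B[j - 1] = (y - 0.5) ** p * (logit(y) if in_s else 1.0)
    return B

def dbasis_row(y, k):
    """b(y) = dB(y)/dy, by the product rule applied to basis_row."""
    b = np.empty(k)
    for j in range(1, k + 1):
        p = (j - 1) // 2
        in_s = j in (2, 3) or (j >= 6 and j % 2 == 0)
        d_power = p * (y - 0.5) ** (p - 1) if p > 0 else 0.0
        if in_s:
            # derivative of (y-0.5)^p * l(y)
            b[j - 1] = d_power * logit(y) + (y - 0.5) ** p / (y * (1.0 - y))
        else:
            b[j - 1] = d_power
    return b
</syntaxhighlight>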
The feasible cone
At each value of [math]\displaystyle{ y }[/math] there is a half-space in [math]\displaystyle{ k }[/math] dimensions defined by [math]\displaystyle{ M'(y)\ge 0 }[/math].
The cone of feasible solutions is given by the intersection of [math]\displaystyle{ {M'}(y) \ge 0 }[/math] for all [math]\displaystyle{ 0\le y \le 1 }[/math]. Each solution is a point in k-dimensional space.
- The feasible set is convex
- The feasible set is closed (i.e., solutions on its boundary are feasible).
- There is at least one [math]\displaystyle{ y }[/math] that corresponds to each point on the boundary.
- A [math]\displaystyle{ y }[/math] that corresponds to the boundary is the y-value where the density is just touching zero.
- There may be more than one value of [math]\displaystyle{ y }[/math] associated with the same boundary point. This would mean that the density touches zero at two (or more) points.
- Q: Are there multiple boundary points for a single y (e.g., a linear edge)? Trivially yes: when k=2, the only constraint is [math]\displaystyle{ a_2\ge 0 }[/math]. Every [math]\displaystyle{ 0\lt y\lt 1 }[/math] gives this same constraint. But the point is that the constraint is active for every [math]\displaystyle{ a_1 }[/math], so the boundary is a line, not a single point.
Conjecture: The same cone results using [math]\displaystyle{ 0 \lt y \lt 1 }[/math] as with [math]\displaystyle{ 0 \le y \le 1 }[/math].
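To make the cone concrete: feasibility of a coefficient vector [math]\displaystyle{ a }[/math] amounts to [math]\displaystyle{ a \cdot b(y) \ge 0 }[/math] for every [math]\displaystyle{ y }[/math] in the interval. A crude grid-based check (an approximation only, not the exact root-solver test discussed later on this page) can be sketched using the dbasis_row helper above; the grid size is an arbitrary illustrative choice.
<syntaxhighlight lang="python">
import numpy as np

def approx_feasible(a, n_grid=1000):
    """Crude feasibility check: sample M'(y) = a . b(y) on a grid over (0, 1).
    This can miss narrow violations; the exact Feasible(a) test described
    later on this page (via a polynomial root solver) is the reliable version."""
    a = np.asarray(a, dtype=float)
    ys = (np.arange(n_grid) + 0.5) / n_grid    # midpoints of a uniform grid in (0, 1)
    return all(a @ dbasis_row(yv, len(a)) >= 0.0 for yv in ys)
</syntaxhighlight>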
Loss
The quality of a candidate solution [math]\displaystyle{ a }[/math] is measured by a squared-L2 loss function:
- [math]\displaystyle{ \mathcal{L}(a) = \mathcal{L}(a ; x,y) = ||\hat{B} a - x ||_2^2 = \sum_{i=1}^m \left( B(y_i)\cdot a - x_i \right)^2 = (\hat{B} a - x )^T (\hat{B} a - x) }[/math]
Note that x,y are constant for the problem of finding the best fit.
Denote the unconstrained optimal solution as
- [math]\displaystyle{ a^* = \arg\min_a \mathcal{L}(a) }[/math]
This is the solution found using ordinary least squares regression.
The loss function, [math]\displaystyle{ \mathcal{L}(a) }[/math], is a paraboloid (a convex quadratic) in [math]\displaystyle{ k }[/math]-dimensional space, and therefore can also be represented as
- [math]\displaystyle{ \mathcal{L}(a ; a^*, \Sigma) = (a - a^*)^T \Sigma^{-1} (a - a^*) + \mathcal{L}(a^*) }[/math]
Since [math]\displaystyle{ \mathcal{L}(a^*) }[/math] is an additive constant, and [math]\displaystyle{ \Sigma^{-1} }[/math] differs from [math]\displaystyle{ \hat{B}^T \hat{B} }[/math] only by the constant positive factor [math]\displaystyle{ 1/\sigma^2 }[/math], neither changes the location of the minimum, so both can be ignored during optimization.
You can find the parameters of the paraboloid as follows (a numerical sketch follows this list).
- [math]\displaystyle{ a^* = \left( \hat{B}^T \hat{B}\right)^{-1} \hat{B}^T x }[/math]
- [math]\displaystyle{ \Sigma = \sigma^2 \left( \hat{B}^T \hat{B}\right)^{-1} }[/math]
- [math]\displaystyle{ \sigma^2 = {1\over{m-k}} \mathcal{L}(a^*) }[/math]
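These formulas translate directly into code. The following sketch (reusing basis_row from the earlier sketch; the function name ols_fit is illustrative) computes [math]\displaystyle{ a^* }[/math], [math]\displaystyle{ \Sigma }[/math], and [math]\displaystyle{ \sigma^2 }[/math] from the data.
<syntaxhighlight lang="python">
import numpy as np

def ols_fit(x, y, k):
    """Unconstrained least-squares fit a*, plus the Sigma and sigma^2
    parameters of the quadratic loss, from m data points (x_i, y_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(x)
    B_hat = np.vstack([basis_row(yi, k) for yi in y])   # m x k basis matrix
    BtB = B_hat.T @ B_hat
    a_star = np.linalg.solve(BtB, B_hat.T @ x)          # a* = (B^T B)^{-1} B^T x
    resid = B_hat @ a_star - x
    loss_star = resid @ resid                           # L(a*)
    sigma2 = loss_star / (m - k)                        # sigma^2 = L(a*) / (m - k)
    Sigma = sigma2 * np.linalg.inv(BtB)                 # Sigma = sigma^2 (B^T B)^{-1}
    return a_star, Sigma, sigma2
</syntaxhighlight>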
Interior-point methods, boundary-traversal methods, and exterior-point methods
Many algorithms for minimizing loss subject to the feasibility constraint fall into one of three categories:
- Interior-point methods produce a sequence of guesses within the interior of the feasible region. Karmarkar's algorithm is perhaps the most famous; it was the first polynomial-time interior-point method for LP that was also efficient in practice, and interior-point methods have since been extended to convex QP. Our problem is a convex QP, but these methods cannot be applied directly to our case because we have an infinite number of constraints; perhaps an adaptation is possible.
- Boundary-traversal methods: These traverse a sequence of guesses on the boundary of the feasible region. The most famous such algorithm is the Simplex algorithm for LPs; active-set methods play the analogous role for QPs.
- Exterior-point methods: These produce a sequence of guesses starting from the exterior of the feasible region, possibly returning to the exterior during the search. You could start at [math]\displaystyle{ a^* }[/math], find an infeasible [math]\displaystyle{ y }[/math], repair it, and continue until you've repaired all infeasible points. These are more challenging to develop since they can't take advantage of the convexity of the feasible cone.
This page focuses on an idea for a boundary-traversal method.
Moving from interior point to boundary
Interestingly, we have a method for moving from any interior point, [math]\displaystyle{ a }[/math], to the boundary. You can simply perform a binary search on the line segment from [math]\displaystyle{ a }[/math] to [math]\displaystyle{ a^* }[/math], using a subroutine that determines whether the point is feasible:
- [math]\displaystyle{ Feasible(a) : \Re^k \rarr \{ True, False \} }[/math]
The implementation for [math]\displaystyle{ Feasible(a) }[/math] is extremely non-obvious, but has already been solved by Manel. It involves using a polynomial root solver. With this routine, binary search is always an option for getting us from any interior point to the boundary point in the direction of [math]\displaystyle{ a^* }[/math].
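Assuming Feasible(a) is available as a black-box predicate (its implementation is not reproduced here), the binary search can be sketched as follows; the tolerance is an arbitrary illustrative choice.
<syntaxhighlight lang="python">
import numpy as np

def to_boundary(a_interior, a_star, feasible, tol=1e-10):
    """Binary search along the segment from a feasible interior point toward
    the unconstrained optimum a_star, returning (approximately) the point
    where the segment crosses the boundary of the feasible cone.
    `feasible` is the Feasible(a) predicate described above, passed in
    as a black box."""
    a_interior = np.asarray(a_interior, dtype=float)
    a_star = np.asarray(a_star, dtype=float)
    if feasible(a_star):
        return a_star                # the unconstrained optimum is already feasible
    lo, hi = 0.0, 1.0                # fraction of the way from a_interior to a_star
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(a_interior + mid * (a_star - a_interior)):
            lo = mid                 # still feasible: move toward a_star
        else:
            hi = mid                 # infeasible: pull back toward a_interior
    return a_interior + lo * (a_star - a_interior)
</syntaxhighlight>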
Boundary point to optimal or tangent hyperplane
From a boundary point, you can jump to the point on the tangent hyperplane at that boundary point that minimizes loss. This new point will be an exterior point in general (or a boundary point in a few degenerate cases). The formula for the loss-minimizing point is:
- [math]\displaystyle{ a_{next} = a - H^{-1} \nabla_{proj} \mathcal L(a ; a^*,\Sigma) }[/math]
where [math]\displaystyle{ \nabla_{proj} \mathcal L }[/math] is the projection of the loss gradient onto the tangent hyperplane, given by
- [math]\displaystyle{ \nabla_{proj} \mathcal L = \nabla \mathcal L(a) - \left( {{a\cdot \nabla \mathcal{L}(a)}\over{a\cdot a}}\right) a }[/math]
and [math]\displaystyle{ H }[/math] is the loss-function Hessian at [math]\displaystyle{ a }[/math]:
- [math]\displaystyle{ H_{ij} = {{\partial^2 \mathcal{L}(a; a^*, \Sigma)}\over{\partial a_i \partial a_j}} }[/math], i.e., [math]\displaystyle{ H = 2\Sigma^{-1} }[/math]
The resulting point is an exterior point. It would be nice to have a way to project this back to the boundary so we could iterate. This is not easy. If you project in any single linear direction, you might miss the feasible region entirely, or you might end up farther away than the previous point.
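For reference, here is a literal transcription of the jump formula into code, using [math]\displaystyle{ \Sigma }[/math] from the earlier sketch and taking [math]\displaystyle{ H = 2\Sigma^{-1} }[/math] as in the Hessian line above; the projection direction is the one given in the formula above, and the function name is illustrative.
<syntaxhighlight lang="python">
import numpy as np

def tangent_jump(a, a_star, Sigma):
    """One jump a_next = a - H^{-1} grad_proj L(a; a*, Sigma), implementing
    the formulas above: the gradient of (a - a*)^T Sigma^{-1} (a - a*) is
    projected onto the hyperplane with normal a (as written above), and
    H = 2 Sigma^{-1}."""
    a = np.asarray(a, dtype=float)
    a_star = np.asarray(a_star, dtype=float)
    Sigma_inv = np.linalg.inv(Sigma)
    grad = 2.0 * Sigma_inv @ (a - a_star)            # gradient of the quadratic loss
    grad_proj = grad - ((a @ grad) / (a @ a)) * a    # remove the component along a
    H = 2.0 * Sigma_inv                              # Hessian of the quadratic loss
    return a - np.linalg.solve(H, grad_proj)
</syntaxhighlight>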
Boundary-traversal idea
A boundary-traversal method needs to start with a point on the boundary. It is easy to find a decent initial guess on the boundary. In fact, any time you reach an interior point, [math]\displaystyle{ a }[/math], you can get to a boundary point by conducting the binary search described above.
From a boundary point [math]\displaystyle{ a }[/math], you could consider moving within the tangent hyperplane, against the gradient of [math]\displaystyle{ \mathcal{L} }[/math]. But this would always move you into the exterior of the feasible region. Although it is easy to get from an interior point to a boundary point, moving from an exterior point to the boundary is much less obvious.
To try to stay on the boundary, we could consider moving along a parabolic surface determined by some measure of curvature at [math]\displaystyle{ a }[/math]. This would allow the use of a Newton-method update:
- [math]\displaystyle{ a_{next} = a - H^{-1} \nabla_{proj} \mathcal L(a ; a^*,\Sigma) }[/math]
where [math]\displaystyle{ \nabla_{proj} \mathcal L }[/math] and [math]\displaystyle{ H }[/math] are the projected gradient and Hessian defined in the previous section.