LChrisman/Boundary Method for Feasible MetaLog
Brainstorm page: This page is here to aid the development of an idea that might lead to an algorithm for reliably and efficiently finding the best fit feasible MetaLog distribution.
Notation
- [math]\displaystyle{ k }[/math] = number of MetaLog terms
- [math]\displaystyle{ x_i\in \Re, y_i\in [0,1] }[/math] = data points. Denote [math]\displaystyle{ x=[x_i], y=[y_i] }[/math].
- [math]\displaystyle{ {a} }[/math] = quantile coefficients, [math]\displaystyle{ [a_1, ..., a_j, ..., a_k] }[/math]
- [math]\displaystyle{ B(y):[0,1]\rarr \Re^k }[/math] = Basis at scalar y, where [math]\displaystyle{ 0\lt y\lt 1 }[/math] is a probability
- [math]\displaystyle{ = [B_1(y), B_2(y), ..., B_j(y),...B_k(y)] }[/math]
- [math]\displaystyle{ B_j(y) = (y-0.5)^{\lfloor \frac{j-1}{2} \rfloor} (\mathbf{1}_{j \in \mu} + \ell(y) \mathbf{1}_{j \in s} ) }[/math], where [math]\displaystyle{ \ell(y) = \ln{y\over{1-y}} }[/math] is the logit and [math]\displaystyle{ \mu, s }[/math] index the terms without and with the logit factor, respectively.
- [math]\displaystyle{ \hat{B} = \left[ \begin{array}{c} B(y_1) \\ B(y_2) \\ \vdots \\ B(y_m) \end{array} \right] }[/math] = The basis [math]\displaystyle{ m \times k }[/math] matrix.
- [math]\displaystyle{ {b}(y) : [0,1]\rarr \Re^k }[/math] = the derivative basis, [math]\displaystyle{ [b_1(y), b_2(y), ..., b_j(y),...b_k(y)] }[/math]
- [math]\displaystyle{ b_j(y) = {{\partial B_j(y)}\over{\partial y}} = \left\lfloor \tfrac{j-1}{2} \right\rfloor (y-0.5)^{\lfloor \frac{j-1}{2} \rfloor - 1} (\mathbf{1}_{j \in \mu} + \ell(y) \mathbf{1}_{j \in s}) + {{(y-0.5)^{\lfloor \frac{j-1}{2} \rfloor}}\over{y(1-y)}} \mathbf{1}_{j \in s} }[/math]
- [math]\displaystyle{ M(y):[0,1]\rarr \Re = {a} \cdot {B}(y) }[/math] = y-th quantile
- [math]\displaystyle{ M'(y):[0,1]\rarr\Re = {a} \cdot {b}(y) }[/math] = inverse density at [math]\displaystyle{ y^{th} }[/math] quantile
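As a concrete reference, the notation above can be sketched in Python. This assumes the standard MetaLog term ordering (the sets [math]\displaystyle{ \mu }[/math] and [math]\displaystyle{ s }[/math] aren't spelled out above, so the ordering is an assumption here), and all function names are illustrative:

```python
import numpy as np

def logit(y):
    """The logit l(y) = ln(y / (1 - y))."""
    return np.log(y / (1.0 - y))

def basis_row(y, k):
    """B(y): the k MetaLog basis functions at a scalar probability y in (0,1).

    Term ordering follows the standard MetaLog convention (an assumption):
    1, l(y), (y-.5)*l(y), (y-.5), (y-.5)^2, (y-.5)^2*l(y), (y-.5)^3, ...
    """
    B = np.empty(k)
    for j in range(1, k + 1):
        p = (j - 1) // 2                                  # exponent floor((j-1)/2)
        has_logit = j in (2, 3) or (j >= 6 and j % 2 == 0)  # j in s
        B[j - 1] = (y - 0.5) ** p * (logit(y) if has_logit else 1.0)
    return B

def basis_matrix(ys, k):
    """B-hat: the m x k basis matrix stacking B(y_i) for each data point."""
    return np.vstack([basis_row(y, k) for y in ys])
```

With these, [math]\displaystyle{ M(y) = a\cdot B(y) }[/math] is just `a @ basis_row(y, k)`.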
The feasible cone
At each value of [math]\displaystyle{ y }[/math] there is a half-space in [math]\displaystyle{ k }[/math] dimensions defined by [math]\displaystyle{ M'(y)\ge 0 }[/math].
The cone of feasible solutions is the intersection of [math]\displaystyle{ {M'}(y) \ge 0 }[/math] over all [math]\displaystyle{ 0\le y \le 1 }[/math]. Each solution is a point in k-dimensional space.
- The feasible set is convex
- The feasible set is closed (i.e., solutions on its boundary are feasible).
- There is at least one [math]\displaystyle{ y }[/math] that corresponds to each point on the boundary.
- A [math]\displaystyle{ y }[/math] that corresponds to the boundary is the y-value where the density is just touching zero.
- There may be more than one value of [math]\displaystyle{ y }[/math] associated with the same boundary point. This would mean that the density touches zero at two (or more) points.
- Q: Are there multiple boundary points for a single y (e.g., a linear edge)? Trivial yes: When k=2, there is only [math]\displaystyle{ a_2\ge 0 }[/math]. Every [math]\displaystyle{ 0\lt y\lt 1 }[/math] gives this same constraint. But the point is that the constraint is active for all [math]\displaystyle{ a_1 }[/math], so it is a boundary line, not a boundary point.
Conjecture: The same feasible region results using only [math]\displaystyle{ 0 \lt y \lt 1 }[/math].
From discussion with Manel: Without including y=0 and y=1, one closed boundary edge is omitted from the feasible set, which would make it an open set along that one edge. You can't simply add [math]\displaystyle{ M'(0)\ge 0 }[/math] because [math]\displaystyle{ b(0) }[/math] and [math]\displaystyle{ b(1) }[/math] include INF values. Adding the edge constraints comprised of the various [math]\displaystyle{ s(0)\gt 0 }[/math], [math]\displaystyle{ s'(0)\gt 0 }[/math], and [math]\displaystyle{ s''(0)\gt 0 }[/math] constraints from Proposition 2 might be a pragmatic way to encode the [math]\displaystyle{ M'(0)\ge 0 }[/math] constraint. One other conjecture we didn't explore is whether a plain old [math]\displaystyle{ M'(0)\ge 0 }[/math] works using SANE arithmetic. This would basically amount to implicit constraints on the signs of some [math]\displaystyle{ a_j }[/math] coefficients.
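For intuition, feasibility can be illustrated by checking [math]\displaystyle{ M'(y) = a\cdot b(y) \ge 0 }[/math] on a dense grid of interior y values. The real routine (per the text below) uses a polynomial root solver and is exact; a grid check can miss narrow violations, so this sketch is only a naive stand-in:

```python
import numpy as np

def is_feasible_grid(a, b_of_y, n=10001):
    """Naive feasibility check: M'(y) = a . b(y) >= 0 on an interior grid.

    b_of_y(y) returns the derivative-basis vector b(y).  Illustrative only;
    a grid check can miss points where the density dips below zero between
    grid values.
    """
    ys = np.linspace(1e-6, 1.0 - 1e-6, n)   # stay strictly inside (0,1)
    return all(a @ b_of_y(y) >= 0.0 for y in ys)
```

For the k=2 example in the text, [math]\displaystyle{ M'(y) = a_2 / (y(1-y)) }[/math], so feasibility reduces to the sign of [math]\displaystyle{ a_2 }[/math] regardless of [math]\displaystyle{ a_1 }[/math].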
Loss
The quality of a function is measured by an L2 loss function:
- [math]\displaystyle{ \mathcal{L}(a) = \mathcal{L}(a ; x,y) = ||\hat{B} a - x ||_2^2 = \sum_{i=1}^m \left( B(y_i)\cdot a - x_i \right)^2 = (\hat{B} a - x )^T (\hat{B} a - x) }[/math]
Note that [math]\displaystyle{ x,y }[/math] are constant for the problem of finding the best fit.
Denote the unconstrained optimal solution as
- [math]\displaystyle{ a^* = \arg\min_a \mathcal{L}(a) }[/math]
This is the solution found using ordinary least squares regression.
The loss function, [math]\displaystyle{ \mathcal{L}(a) }[/math], is a paraboloid in [math]\displaystyle{ k }[/math]-dimensional space, and therefore can also be represented as
- [math]\displaystyle{ \mathcal{L}(a ; a^*, \Sigma) = (a - a^*)^T \Sigma^{-1} (a - a^*) + \mathcal{L}(a^*) }[/math]
(with [math]\displaystyle{ \Sigma^{-1} }[/math] as below, this reproduces [math]\displaystyle{ \mathcal{L}(a) }[/math] up to the constant factor [math]\displaystyle{ \sigma^2 }[/math], which doesn't affect the minimizer). Since [math]\displaystyle{ \mathcal{L}(a^*) }[/math] is constant, it can be omitted during the optimization.
You can find the parameters of the paraboloid as follows.
- [math]\displaystyle{ a^* = \left( \hat{B}^T \hat{B}\right)^{-1} \hat{B}^T x }[/math]
- [math]\displaystyle{ \Sigma = \sigma^2 \left( \hat{B}^T \hat{B}\right)^{-1} }[/math]
- [math]\displaystyle{ \Sigma^{-1} = {1\over{\sigma^2}} \hat{B}^T \hat{B} }[/math]
- [math]\displaystyle{ \sigma^2 = {1\over{m-k}} \mathcal{L}(a^*) }[/math]
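The formulas above collect into a small least-squares routine. This is a sketch (the function name is illustrative, and a generic design matrix stands in for [math]\displaystyle{ \hat B }[/math], which in practice comes from the basis functions):

```python
import numpy as np

def fit_unconstrained(B_hat, x):
    """Unconstrained best fit a* via ordinary least squares.

    B_hat: m x k basis matrix; x: m data values.
    Returns a*, Sigma^{-1}, and the loss L(a*), per the formulas above.
    """
    m, k = B_hat.shape
    BtB = B_hat.T @ B_hat
    a_star = np.linalg.solve(BtB, B_hat.T @ x)   # a* = (B^T B)^{-1} B^T x
    resid = B_hat @ a_star - x
    loss = resid @ resid                         # L(a*)
    sigma2 = loss / (m - k)                      # sigma^2 = L(a*) / (m - k)
    Sigma_inv = BtB / sigma2                     # Sigma^{-1} = B^T B / sigma^2
    return a_star, Sigma_inv, loss
```

Note that `np.linalg.solve` avoids forming the explicit inverse of [math]\displaystyle{ \hat{B}^T \hat{B} }[/math], which is numerically preferable.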
Interior-point methods, boundary-traversal methods, and exterior-point methods
Many algorithms for minimizing loss subject to the feasibility constraint fall into one of three categories:
- Interior-point methods produce a sequence of guesses within the interior of the feasible region. Karmarkar's algorithm is perhaps the most famous; it was the first practical polynomial-time algorithm for LPs (the ellipsoid method established polynomial time earlier) and extends to convex QPs. Our problem is a convex QP, but Karmarkar's method cannot be applied directly because we have an infinite number of constraints, though an adaptation might be possible.
- Boundary-traversal methods: These traverse a sequence of guesses on the boundary of the feasible region. The most famous such algorithm is the Simplex algorithm for LPs (active-set methods play the analogous role for QPs).
- Exterior-point methods: These produce a sequence of guesses starting from the exterior of the feasible region, and returning to the exterior during the search. You could start at [math]\displaystyle{ a^* }[/math], find an infeasible [math]\displaystyle{ y }[/math], repair it, and continue until you've repaired all infeasible points. These are more challenging to develop since they can't take advantage of the convexity of the feasible cone.
This page focuses on a boundary-traversal idea.
Moving from interior point to boundary
Interestingly, we have a method for moving from any interior point [math]\displaystyle{ a }[/math] to the boundary. Simply perform a binary search on the line segment from [math]\displaystyle{ a }[/math] to [math]\displaystyle{ a^* }[/math], using a subroutine that determines whether a point is feasible:
- [math]\displaystyle{ IsFeasible(a) : \Re^k \rarr \{ True, False \} }[/math]
The implementation for [math]\displaystyle{ IsFeasible(a) }[/math] is extremely non-obvious, but has already been solved by Manel. It involves using a polynomial root solver. With this routine, binary search is always an option for getting us from any interior point to the boundary point in the direction of [math]\displaystyle{ a^* }[/math].
There is also an analogous procedure to move from an exterior point to the boundary, which is simply to conduct a binary search on the line segment between the exterior point and any known interior point. An important difference, however, is that when moving from an interior point to the boundary, we are moving in the direction that minimizes loss. Unless you have a (non-obvious) method for selecting the interior reference point, the boundary point you reach might not be close to your start.
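The binary search can be sketched as follows. The `is_feasible` predicate is assumed given (the text credits Manel's root-solver routine); any predicate that is True exactly on a convex set containing the starting point works, and all names here are illustrative:

```python
import numpy as np

def boundary_toward(a, a_star, is_feasible, tol=1e-10, max_iter=200):
    """Binary search from interior point a toward a_star, stopping at the
    boundary of the feasible set.  Convexity guarantees the segment crosses
    the boundary exactly once."""
    if is_feasible(a_star):
        return a_star                    # unconstrained optimum is feasible
    lo, hi = 0.0, 1.0                    # t=0 -> a (feasible), t=1 -> a_star
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        if is_feasible(a + mid * (a_star - a)):
            lo = mid
        else:
            hi = mid
    return a + lo * (a_star - a)         # last feasible point found
```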
Boundary point to the optimum within the tangent hyperplane
From a boundary point, you can jump to the point on the tangent hyperplane that minimizes loss. This new point will in general be an exterior point (or a boundary point in a few degenerate cases). The formula for the loss-minimizing point is:
- [math]\displaystyle{ a_{next} = a - \Sigma^{-1} \nabla_{proj} \mathcal L(a ; a^*,\Sigma) }[/math]
where [math]\displaystyle{ \nabla_{proj} \mathcal L }[/math] is the projection of the loss gradient onto the tangent hyperplane. With [math]\displaystyle{ n = b(y^*) }[/math] the normal of the active constraint hyperplane (the boundary point satisfies [math]\displaystyle{ a \cdot b(y^*) = 0 }[/math]), this is
- [math]\displaystyle{ \nabla_{proj} \mathcal L = \nabla \mathcal L(a) - \left( {{n\cdot \nabla \mathcal{L}(a)}\over{n\cdot n}}\right) n }[/math]
and [math]\displaystyle{ \Sigma }[/math] here denotes the loss-function Hessian at [math]\displaystyle{ a }[/math] (note the notational overload: this Hessian is, up to a constant factor, the inverse of the [math]\displaystyle{ \Sigma }[/math] defined in the Loss section):
- [math]\displaystyle{ \Sigma_{ij} = {{\partial^2 \mathcal{L}(a)}\over{\partial a_i \partial a_j}} }[/math]
The resulting point is generally an exterior point. It would be nice to have a way to project this back to the boundary so we could iterate. This is not easy: if you project in any linear direction, you might miss the feasible region entirely, or you might end up farther away than the previous point.
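The jump above transcribes directly into code. This is a sketch: `H` is the loss Hessian (the [math]\displaystyle{ \Sigma }[/math] of this section), `grad` the loss gradient at the boundary point, and `n` the normal of the active constraint hyperplane; all names are illustrative:

```python
import numpy as np

def tangent_plane_step(a, grad, H, n):
    """One tangent-plane jump: Newton-style step along the gradient
    projected onto the hyperplane with normal n."""
    grad_proj = grad - (n @ grad) / (n @ n) * n   # project gradient onto plane
    return a - np.linalg.solve(H, grad_proj)      # a_next = a - H^{-1} grad_proj
```

For a quadratic loss with identity Hessian, this lands exactly on the minimizer of the loss restricted to the hyperplane through `a`.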
Boundary-traversal idea
A boundary-traversal method needs to start with a point on the boundary. We already covered how to find the first boundary point.
Assume we have a way to place a continuous coordinate system on the boundary manifold, such that each coordinate uniquely identifies a boundary point and this mapping from coordinate to boundary point is surjective (aka onto), continuous and allows derivatives and second derivatives wrt the coordinate. Such a coordinate would presumably have [math]\displaystyle{ k-1 }[/math] dimensions, although the idea might work with more dimensions. I don't think the map needs to be bijective (i.e., it seems okay if multiple coordinates map to the same boundary point), just surjective (all boundary points need at least one coordinate). Since each boundary point is on one or more active constraints, the y values of the active constraints might be part of the coordinate system, but it needs to be designed so that the transition from active to nonactive is continuous.
Denote a boundary coordinate as [math]\displaystyle{ \alpha }[/math], and the map from coordinate to boundary point as [math]\displaystyle{ f(\alpha) : \Re^{k-1} \rarr \Re^k }[/math]. So given a coordinate, we can identify the boundary point.
- [math]\displaystyle{ a = f(\alpha) }[/math]
At the current boundary point, we approximate the curvature using a quadratic.
- [math]\displaystyle{ \tilde{f}(\alpha) \approx f(\alpha_0) + J(\alpha_0)(\alpha - \alpha_0) + {1\over 2}(\alpha - \alpha_0)^T H (\alpha - \alpha_0) }[/math]
where [math]\displaystyle{ J }[/math] and [math]\displaystyle{ H }[/math] are the Jacobian and Hessian of [math]\displaystyle{ f }[/math].
The approach is to find the point on this quadratic approximation that minimizes [math]\displaystyle{ \mathcal{L} }[/math], call it [math]\displaystyle{ \alpha^* }[/math], which we then use to find the next boundary point. So the next step is to solve the quadratic approximation case.
- [math]\displaystyle{ \mathcal{L}(f(\alpha)) \approx \mathcal{L}(f(\alpha_0)) + \nabla \mathcal{L}(f(\alpha_0))^T J(\alpha_0)(\alpha-\alpha_0) + {1\over 2} (\alpha - \alpha_0)^T J(\alpha_0)^T \Sigma J(\alpha_0) (\alpha-\alpha_0) + {1\over 2} \nabla \mathcal{L}(f(\alpha_0))^T H (\alpha-\alpha_0)^2 }[/math]
Incorporate the first-order optimality condition:
- [math]\displaystyle{ {{d \mathcal{L}}\over{d \alpha}} = \nabla \mathcal{L}(f(\alpha_0))^T J(\alpha_0) + J(\alpha_0)^T \Sigma J(\alpha_0) (\alpha - \alpha_0) + \nabla \mathcal{L}(f(\alpha_0))^T H (\alpha - \alpha_0) = 0 }[/math]
Solve for [math]\displaystyle{ \alpha^* }[/math] to get a closed form solution:
- [math]\displaystyle{ \alpha^* = \alpha_0 - \left( J(\alpha_0)^T \Sigma J(\alpha_0) + \nabla \mathcal{L}(f(\alpha_0))^T H \right)^{-1} J(\alpha_0)^T \nabla \mathcal{L}(f(\alpha_0)) }[/math]
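The closed-form update can be sketched in numpy. Since [math]\displaystyle{ f }[/math] is vector-valued, its Hessian is treated here as a [math]\displaystyle{ k \times (k-1) \times (k-1) }[/math] tensor so the [math]\displaystyle{ \nabla \mathcal{L}^T H }[/math] contraction is well-defined (one interpretation of the notation above; all names are illustrative):

```python
import numpy as np

def alpha_step(alpha0, J, H, grad, Sigma):
    """One Newton step in boundary coordinates, transcribing the alpha* formula.

    J:     k x (k-1) Jacobian of f at alpha0
    H:     k x (k-1) x (k-1) Hessian tensor of f at alpha0
    grad:  loss gradient at f(alpha0), length k
    Sigma: loss Hessian, k x k
    """
    A = J.T @ Sigma @ J + np.einsum('i,ijk->jk', grad, H)  # curvature term
    return alpha0 - np.linalg.solve(A, J.T @ grad)
```

For a one-dimensional coordinate (k=2) this reduces to an ordinary Newton step on [math]\displaystyle{ \mathcal{L}(f(\alpha)) }[/math].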
This solution for [math]\displaystyle{ \alpha^* }[/math] predicts the next constraining [math]\displaystyle{ y }[/math]-value, and hence, the next constraining hyperplane, [math]\displaystyle{ b(y_{next}) }[/math], as well as the coordinate of the boundary point within that hyperplane. However, this is only a guess based on an approximation of the boundary shape, and the next point won't be exactly on the boundary. It could be either an interior point or an exterior point. When it is an interior point, we use the procedure above to adjust it to the boundary.
When it is an exterior point, I need a new creative idea... I don't yet have one. Possible directions:
- Is there any way to find the point on the [math]\displaystyle{ b(y_{next}) }[/math] hyperplane at which that hyperplane becomes constraining?
- Any way to find an interior point in the vicinity?
- Amplify the curvature of the Hessian and use the update formula again. That might curve it into the interior.
- Is there any upper bound on amplification that could guarantee an interior point? If so, then we could even do binary search on the curvature until we find the boundary.
- There might be a clever representation for [math]\displaystyle{ \alpha }[/math] that always corresponds to a well-defined boundary point. For example, it might be something like the vector of y values that touch zero. Ideally, after an update, you'd always end up at a boundary point.
Curvature amplification
If the next point lands in the exterior, then one approach might be to try again with an amplified curvature, hoping that by amplifying the curvature we end up in the interior.
Let [math]\displaystyle{ \lambda \ge 0 }[/math] be the curvature amplification factor, where [math]\displaystyle{ \lambda=0 }[/math] means no amplification. The modified Hessian would be
- [math]\displaystyle{ \tilde{H} = H + \lambda J(\alpha_0) J(\alpha_0)^T }[/math]
Barrier approach
Barrier algorithms are a class of interior point methods that add in a barrier function that goes to infinity near the boundary, which pushes a gradient descent search away from the boundaries. The challenge for us is that we have an infinite number of boundary surfaces, so we can't simply add up the barrier potentials for all hyperplanes.
Since we have a procedure that finds the boundary between an interior point and [math]\displaystyle{ a^* }[/math], one idea is to adopt a barrier potential that is only a function of that point (or that constraining hyperplane). I.e., to compute the barrier potential at prospect [math]\displaystyle{ a }[/math], find the boundary point in the direction of [math]\displaystyle{ a^* }[/math], call it [math]\displaystyle{ a' }[/math], and compute the barrier potential [math]\displaystyle{ b( a, a' ) }[/math].
A problem with barrier approaches is that the rate at which you reduce the barrier penalty must be balanced against the rate of progress toward the solution. If you decrease it too fast, the Newton steps within the interior slow down so much that they don't reach the optimum; if you decrease it too slowly, it gates convergence speed. The schedules that have been proven to work in polynomial time are a function of the (finite) number of inequality constraints; since we have an infinite number of constraints, they would decrease the penalty infinitely fast.
Solving the dual problem
Sometimes tricks arise by solving the dual problem. So we should at least identify what the dual problem is.
The Lagrangian is
[math]\displaystyle{ \Lambda(a,\lambda) = {1\over 2}a^T \Sigma^{-1} a - \int_0^1 \lambda(y)\, (a + a^*) \cdot b(y)\, dy }[/math] where I'm using the residual a here to reduce notation (i.e., [math]\displaystyle{ a }[/math] stands for the earlier [math]\displaystyle{ (a - a^*) }[/math], so the feasibility constraint in residual coordinates reads [math]\displaystyle{ (a + a^*)\cdot b(y) \ge 0 }[/math]).
The integral here comes from integrating over [math]\displaystyle{ 0\lt y\lt 1 }[/math], which parameterizes all the constraints: multiply each constraint LHS by its Lagrange multiplier, [math]\displaystyle{ \lambda(y) }[/math], and sum (integrate) them up.
The dual function is [math]\displaystyle{ g(\lambda) = \inf_a \Lambda(a,\lambda) }[/math].
Set [math]\displaystyle{ \nabla_a \Lambda(a,\lambda)=0 }[/math]:
- [math]\displaystyle{ a = \Sigma \int_0^1 \lambda(y) b(y) dy }[/math]
Substitute this into the problem of maximizing [math]\displaystyle{ g(\lambda) }[/math] subject to [math]\displaystyle{ \lambda(y)\ge 0 }[/math], to get the dual problem:
- [math]\displaystyle{ \max_{\lambda\ge 0}\; -{1\over 2} \left( \int_0^1 \lambda(y) b(y) dy \right)^T \Sigma \left( \int_0^1 \lambda(y) b(y) dy \right) - a^* \cdot \int_0^1 \lambda(y) b(y) dy }[/math]
Notes: [math]\displaystyle{ \lambda(y) }[/math] is zero almost everywhere. The few values where it is nonzero are the points where the probability density touches zero. The value of [math]\displaystyle{ \lambda(y) }[/math] on the active constraints is the marginal cost of the constraint being exactly zero there (i.e., how much the objective would change if it became slightly positive there).
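One way to experiment with the dual is to discretize y on a grid, so that [math]\displaystyle{ \lambda }[/math] becomes a nonnegative vector and the integral a sum, and then run projected gradient ascent. This is an illustrative sketch, not the proposed algorithm; it uses the residual-coordinate form of the constraint, [math]\displaystyle{ (a + a^*)\cdot b(y) \ge 0 }[/math], which contributes a linear [math]\displaystyle{ a^* }[/math] term to the dual objective:

```python
import numpy as np

def solve_dual_discretized(b_rows, Sigma, a_star, lr=0.1, iters=2000):
    """Projected gradient ascent on the discretized dual.

    b_rows[i] approximates b(y_i) on a grid of y values, so
    v = sum_i lambda_i b(y_i) and the dual objective is
    g(lambda) = -0.5 v^T Sigma v - a* . v, maximized over lambda >= 0.
    """
    lam = np.zeros(len(b_rows))
    for _ in range(iters):
        v = b_rows.T @ lam
        grad = -b_rows @ (Sigma @ v + a_star)    # d g / d lambda
        lam = np.maximum(0.0, lam + lr * grad)   # project onto lambda >= 0
    return lam
```

The primal residual is then recovered as [math]\displaystyle{ a = \Sigma \sum_i \lambda_i b(y_i) }[/math], and the full coefficient vector as [math]\displaystyle{ a^* + a }[/math].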