LChrisman/Dual method for feasible MetaLog
This page documents an idea for computing the best-fit guaranteed-feasible Keelin (MetaLog) distribution. This is in the "idea" stage so may contain errors and may or may not turn out to be a viable approach.
The algorithm solves the problem in the dual space.
Primal problem
- Minimize[math]\displaystyle{ _a }[/math] [math]\displaystyle{ Loss(a) }[/math]
- s.t. [math]\displaystyle{ M'( y ; a ) \ge 0 }[/math] for all [math]\displaystyle{ y∈(0,1) }[/math]
- [math]\displaystyle{ s(0;a),s(1;a) \ge \epsilon \gt 0 }[/math]
where
- [math]\displaystyle{ Loss(a) = || M(\hat y;a) - \hat x ||_2^2 }[/math], where [math]\displaystyle{ (\hat x, \hat y) }[/math] are the data points. (The squared norm is used, consistent with the gradient expression below.)
- [math]\displaystyle{ M(y;a) = a \cdot B(y) }[/math], where [math]\displaystyle{ B(y) }[/math] is the Keelin basis function.
- [math]\displaystyle{ a }[/math] and [math]\displaystyle{ B(y) }[/math] are [math]\displaystyle{ k }[/math]-vectors.
This is a semi-infinite program (SIP), with a finite number of variables and an infinite number of linear constraints. The objective is quadratic and convex.
The [math]\displaystyle{ s(0;a)\gt 0 }[/math] and [math]\displaystyle{ s(1;a)\gt 0 }[/math] constraints ensure that the distribution is unbounded, with the tails pointing in the correct directions.
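As a concrete starting point, here is a minimal NumPy sketch of the (unconstrained) primal objective, assuming the first four standard MetaLog basis terms; the function name and example data are illustrative, not part of any existing library:

```python
import numpy as np

def metalog_basis(y, k=4):
    """First four Keelin (MetaLog) basis terms; L = ln(y/(1-y)) is the logit."""
    L = np.log(y / (1.0 - y))
    cols = [np.ones_like(y), L, (y - 0.5) * L, (y - 0.5)]
    return np.column_stack(cols[:k])

# Unconstrained best fit: minimize ||B(y_hat) a - x_hat||_2^2 over a
y_hat = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
x_hat = np.array([-1.2816, -0.6745, 0.0, 0.6745, 1.2816])  # standard normal quantiles
a0, *_ = np.linalg.lstsq(metalog_basis(y_hat), x_hat, rcond=None)
```

This unconstrained fit is exactly the [math]\displaystyle{ a^{(0)} }[/math] used later to seed the carrier set; the feasibility constraints are what the rest of the page adds on top of it.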
We can rewrite the primal problem as
- Minimize[math]\displaystyle{ _a }[/math] [math]\displaystyle{ Loss(a) }[/math]
- s.t. [math]\displaystyle{ G( y ; a ) \ge \epsilon(y) }[/math] for all [math]\displaystyle{ y∈[0,1] }[/math]
where
- [math]\displaystyle{ G(y;a) = \left\{ \begin{array}{l l} s(y;a) & y\in\{0,1\} \\ M'(y;a) & 0\lt y\lt 1 \end{array}\right. }[/math]
[math]\displaystyle{ s(y;a) }[/math] can also be written in the form [math]\displaystyle{ s(y;a) = a \cdot B_s(y) }[/math], where
- [math]\displaystyle{ B_s(y) = [ 0, 1, y-0.5, 0, 0, (y-0.5)^2, (y-0.5)^3, 0, 0, (y-0.5)^4, (y-0.5)^5, 0, ...] }[/math].
As a convention to simplify notation, when we write [math]\displaystyle{ B'(y) }[/math] from here on, we'll actually assume:
- [math]\displaystyle{ Bp(y) = \left\{ \begin{array}{ll} B'(y) & 0\lt y\lt 1 \\ B_s(y) & y \in \{0,1\} \end{array} \right. }[/math]
and we'll use the primal constraint: [math]\displaystyle{ G(y) = a \cdot Bp(y) \ge \epsilon(y) }[/math], where
- [math]\displaystyle{ \epsilon(y) = \left\{ \begin{array}{ll} \epsilon & y\in\{0,1\} \\ 0 & 0\lt y\lt 1 \end{array}\right. }[/math]
Assumptions
We will need only a few assumptions about the basis function itself, namely:
- That we have an algorithm to find the zeros of [math]\displaystyle{ M'(y;a) }[/math] and [math]\displaystyle{ M''(y;a) }[/math]
- LICQ: The vectors [math]\displaystyle{ \{ B'(y_j) \} }[/math] resulting from a finite set of points [math]\displaystyle{ \{ y_j \} }[/math] are linearly independent. This is similar to Proposition 4 in the paper, but for the derivative basis. Some weaker conditions may also suffice like CRCQ. Used to prove convergence of active set method.
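The first assumption, an algorithm for the zeros of [math]\displaystyle{ M'(y;a) }[/math], can be sketched as grid bracketing plus bisection. This is one illustrative approach, not the only one; `metalog_dbasis` hard-codes the derivative of the same four-term basis assumed in the earlier sketch, and only strict sign changes are detected (tangential zeros are missed):

```python
import numpy as np

def metalog_dbasis(y, k=4):
    """Derivative B'(y) of the first four MetaLog basis terms w.r.t. y."""
    L = np.log(y / (1.0 - y))
    w = 1.0 / (y * (1.0 - y))          # d/dy of the logit
    cols = [np.zeros_like(y), w, L + (y - 0.5) * w, np.ones_like(y)]
    return np.column_stack(cols[:k])

def mprime_zeros(a, n_grid=2001, tol=1e-12):
    """Bracket sign changes of M'(y;a) on a grid, then bisect each bracket."""
    ys = np.linspace(1e-6, 1 - 1e-6, n_grid)
    vals = metalog_dbasis(ys) @ a
    zeros = []
    for i in np.flatnonzero(np.sign(vals[:-1]) * np.sign(vals[1:]) < 0):
        lo, hi = ys[i], ys[i + 1]
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if (metalog_dbasis(np.array([mid])) @ a)[0] * vals[i] > 0:
                lo = mid               # same sign as left endpoint: move left edge
            else:
                hi = mid
        zeros.append(0.5 * (lo + hi))
    return np.array(zeros)
```

For example, [math]\displaystyle{ a = (0, 1, 0, -10) }[/math] gives [math]\displaystyle{ M'(y) = 1/(y(1-y)) - 10 }[/math], which dips negative in the middle and has two zeros.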
Dual problem
- Maximize[math]\displaystyle{ _\lambda }[/math] [math]\displaystyle{ \inf_a L(a,\lambda) }[/math]
- s.t.
- [math]\displaystyle{ \lambda(y)\ge 0 }[/math] for all [math]\displaystyle{ y\in[0,1] }[/math]
where the Lagrangian function is
- [math]\displaystyle{ L(a,\lambda) = Loss(a) - \int_0^1 \lambda(y) (G(y;a) - \epsilon(y) ) dy }[/math]
Complementary slackness: At the optimum,
- [math]\displaystyle{ \lambda(y) ( G(y;a^*) - \epsilon(y) ) = 0 }[/math] for all [math]\displaystyle{ y\in [0,1] }[/math]
Since [math]\displaystyle{ \lambda(y)\ne 0 }[/math] only when [math]\displaystyle{ G(y;a^*)=\epsilon(y) }[/math], and since [math]\displaystyle{ M'(y;a) }[/math] can have at most [math]\displaystyle{ m = 2\lfloor (k-1)/2 \rfloor }[/math] roots, where [math]\displaystyle{ k }[/math] is the number of MetaLog terms, we can rewrite the dual problem using a finite number of Lagrange multipliers, [math]\displaystyle{ \Lambda = [\lambda_0, \lambda_1, ..., \lambda_m, \lambda_{m+1}] }[/math], one for each root of [math]\displaystyle{ M'(y;a^*) }[/math] plus two for the tails, [math]\displaystyle{ \lambda_0 }[/math] and [math]\displaystyle{ \lambda_{m+1} }[/math].
Discrete dual problem:
- Maximize[math]\displaystyle{ _\Lambda }[/math] [math]\displaystyle{ \inf_a L(a,\Lambda) }[/math]
- s.t. [math]\displaystyle{ \Lambda \ge 0 }[/math]
where
- [math]\displaystyle{ L(a,\Lambda) = Loss(a) - \sum_{j=0}^{m+1} \lambda_j \left( G(y_j ; a) - \epsilon(y_j) \right) }[/math]
and [math]\displaystyle{ y_j }[/math] are the roots of [math]\displaystyle{ M'(y;a) }[/math] unioned with [math]\displaystyle{ \{0,1\} }[/math].
- [math]\displaystyle{ dualityGap(a, \Lambda) = \sum_{j=0}^{m+1} \lambda_j \left( G(y_j ; a) - \epsilon(y_j) \right) }[/math]
When we solve this, we will compute both [math]\displaystyle{ \Lambda }[/math] and [math]\displaystyle{ a }[/math]. Hence, this discrete dual problem has [math]\displaystyle{ m+2+k }[/math] variables (the [math]\displaystyle{ m+2 }[/math] entries of [math]\displaystyle{ \Lambda }[/math] plus the [math]\displaystyle{ k }[/math] entries of [math]\displaystyle{ a }[/math]) and [math]\displaystyle{ m+2 }[/math] constraints.
Lagrangian is quadratic
The [math]\displaystyle{ Loss(a) }[/math] part of [math]\displaystyle{ L(a,\Lambda) }[/math] is obviously a convex parabola in [math]\displaystyle{ a }[/math], and the duality gap term adds bilinear [math]\displaystyle{ \lambda_j a_i }[/math] terms, which are also quadratic (but not jointly convex).
This opens the door to a matrix algebra solution to the dual problem (given the location of the zeros of [math]\displaystyle{ M' }[/math]).
Closed-form solution to dual
Given a finite-sized carrier set, [math]\displaystyle{ \{y_j\} }[/math], this section derives a closed-form solution to solve the dual for [math]\displaystyle{ a }[/math] and [math]\displaystyle{ \Lambda }[/math]. The solution is the correct, optimal solution in the event that the carrier set contains the actual zeros of [math]\displaystyle{ M'(y;a^*) }[/math].
- [math]\displaystyle{ \nabla_a L(a,\Lambda) = 2\sum_i \left( a^T B(\hat y_i) - \hat x_i\right) B(\hat y_i) - \sum_j \lambda_j Bp(y_j) = 0 }[/math]
where [math]\displaystyle{ \hat y_i }[/math] are known data points, and not-hat [math]\displaystyle{ y_j }[/math] are carrier indices.
Let
- [math]\displaystyle{ Q = 2\sum_i B(\hat y_i) B(\hat y_i)^T }[/math], a [math]\displaystyle{ k \times k }[/math] positive semi-definite matrix.
- [math]\displaystyle{ c = 2 \sum_i \hat x_i B(\hat y_i) }[/math], a [math]\displaystyle{ k }[/math]-vector.
- [math]\displaystyle{ D = [Bp(y_j)]_j = \left[B_s(0), B'(y_1),...,B'(y_m), B_s(1) \right] }[/math], a [math]\displaystyle{ k \times (m+2) }[/math] matrix.
- [math]\displaystyle{ \vec \epsilon = [\epsilon,0,\ldots,0,\epsilon] }[/math], an [math]\displaystyle{ (m+2) }[/math]-vector with [math]\displaystyle{ \epsilon }[/math] in the two tail positions and 0 in the interior positions.
- [math]\displaystyle{ \Lambda = [ \lambda_L, \lambda_1, \lambda_2, ..., \lambda_m, \lambda_R ] }[/math], an [math]\displaystyle{ (m+2) }[/math] length vector.
The stationarity condition becomes:
- [math]\displaystyle{ Q a - c - D \Lambda = 0 }[/math]
so
- [math]\displaystyle{ a = Q^{-1}( c + D \Lambda) }[/math]
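A quick numerical check of the stationarity algebra, with random synthetic data standing in for the basis evaluations: when no constraints are active (Lambda = 0), solving [math]\displaystyle{ Qa = c }[/math] must reproduce the unconstrained least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
Bhat = rng.standard_normal((20, 4))   # rows stand in for B(y_hat_i) at 20 data points
xhat = rng.standard_normal(20)

Q = 2.0 * Bhat.T @ Bhat               # k x k, positive semi-definite
c = 2.0 * Bhat.T @ xhat               # k-vector

# With Lambda = 0 the stationarity condition gives a = Q^{-1} c,
# which equals the unconstrained least-squares fit of xhat on Bhat.
a = np.linalg.solve(Q, c)
```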
When the carrier indices are roots of [math]\displaystyle{ M' }[/math], then all constraints are active. But to be a bit more general, let's first partition the constraints into the active set, [math]\displaystyle{ \Lambda_A }[/math] and [math]\displaystyle{ D_A }[/math], and the inactive set, [math]\displaystyle{ \Lambda_I }[/math] and [math]\displaystyle{ D_I }[/math], where [math]\displaystyle{ A \cup I = \{0,1,\ldots,m+1\} }[/math]. The constraint numbers in [math]\displaystyle{ \{1,...,m\} }[/math] are interior constraints; [math]\displaystyle{ \{0,m+1\} }[/math] are the tail constraints.
The KKT system for the active set is:
- [math]\displaystyle{ \left[\begin{array}{c c} Q & -D_A \\ D^T_A & 0 \end{array} \right] \left[\begin{array}{c} a\\ \Lambda_A \end{array}\right] = \left[\begin{array}{c} c\\ \epsilon_A \end{array}\right] }[/math]
This should be solved using a numerically stable solver, such as one based on LU decomposition with partial pivoting, LDL^T decomposition, or QR decomposition.
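For example, here is a sketch of the KKT solve in NumPy, whose `numpy.linalg.solve` uses LAPACK's LU factorization with partial pivoting. The system follows the stationarity convention [math]\displaystyle{ Qa - D_A\Lambda_A = c }[/math] with the active constraints held at equality; the random [math]\displaystyle{ Q }[/math] and [math]\displaystyle{ D_A }[/math] are synthetic stand-ins for real problem data:

```python
import numpy as np

rng = np.random.default_rng(1)
k, nA = 4, 2
R = rng.standard_normal((10, k))
Q = R.T @ R + 1e-9 * np.eye(k)        # symmetric positive definite Hessian
c = rng.standard_normal(k)
D_A = rng.standard_normal((k, nA))    # columns stand in for Bp(y_j), j in active set
eps_A = np.array([1e-3, 0.0])

# KKT system: stationarity Q a - D_A Lambda_A = c, plus D_A^T a = eps_A.
KKT = np.block([[Q, -D_A], [D_A.T, np.zeros((nA, nA))]])
sol = np.linalg.solve(KKT, np.concatenate([c, eps_A]))  # LU with partial pivoting
a, lam = sol[:k], sol[k:]
```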
Complementary slackness (i.e., [math]\displaystyle{ \lambda_j ( a^T Bp(y_j) - \epsilon(y_j) ) = 0 }[/math]) implies [math]\displaystyle{ \Lambda_I=\vec 0 }[/math].
Now we have
- [math]\displaystyle{ a = Q^{-1}( c + D_A \Lambda_A ) }[/math]
- [math]\displaystyle{ \left( Q^{-1}( c + D_A \Lambda_A) \right)^T Bp(y_j) = \epsilon(y_j) }[/math], for [math]\displaystyle{ j\in A }[/math], since the active constraints hold with equality.
Rewrite this using
- [math]\displaystyle{ \Omega = D_A^T Q^{-1} D_A }[/math], a [math]\displaystyle{ |A|\times|A| }[/math] matrix.
- [math]\displaystyle{ s = D_A^T Q^{-1} c }[/math], an [math]\displaystyle{ |A| }[/math]-vector.
- [math]\displaystyle{ \epsilon_A = \left[ \epsilon(y_j) : j \in A \right] }[/math], an [math]\displaystyle{ |A| }[/math]-vector.
which becomes
- [math]\displaystyle{ s + \Omega \Lambda_A = \epsilon_A }[/math]
so that we have the solution via matrix algebra:
- [math]\displaystyle{ \Lambda_A = \Omega^{-1} \left( \epsilon_A - s \right) }[/math]
- [math]\displaystyle{ a = Q^{-1}( c + D_A \Lambda_A) }[/math]
This closed-form solution is contingent on a correct guess regarding which constraints are active. Also, the closed-form solution does not enforce the constraint of the dual problem that [math]\displaystyle{ \Lambda_A\ge 0 }[/math].
[math]\displaystyle{ \Omega }[/math] can be singular, so an SVD pseudo-inverse should be used.
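A sketch of this closed-form solve using NumPy's SVD-based pseudo-inverse (`numpy.linalg.pinv`), again with synthetic stand-ins for [math]\displaystyle{ Q }[/math] and [math]\displaystyle{ D_A }[/math]: the multipliers are recovered from the tightness of the active constraints, and the resulting [math]\displaystyle{ a }[/math] satisfies them to machine precision.

```python
import numpy as np

rng = np.random.default_rng(2)
k, nA = 4, 2
R = rng.standard_normal((10, k))
Q = R.T @ R + 1e-9 * np.eye(k)
c = rng.standard_normal(k)
D_A = rng.standard_normal((k, nA))
eps_A = np.array([1e-3, 0.0])

Qinv = np.linalg.inv(Q)
Omega = D_A.T @ Qinv @ D_A                  # |A| x |A|
s = D_A.T @ Qinv @ c                        # |A|-vector
lam = np.linalg.pinv(Omega) @ (eps_A - s)   # SVD pseudo-inverse tolerates singular Omega
a = Qinv @ (c + D_A @ lam)
```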
Iterate to find active set
- Step 1
- Initialize the active set [math]\displaystyle{ A }[/math] with an initial guess, perhaps the current carrier set.
- Step 2
- Solve [math]\displaystyle{ \Lambda_A = \Omega^{-1}( \epsilon_A - s ) }[/math]
- Step 3
- If [math]\displaystyle{ \Lambda_A\ge 0 }[/math], proceed to Step 4.
- If any [math]\displaystyle{ \lambda_j\lt 0 }[/math], move [math]\displaystyle{ j }[/math] from [math]\displaystyle{ A }[/math] to [math]\displaystyle{ I }[/math] and go to Step 2.
- Step 4
- Compute [math]\displaystyle{ a = Q^{-1}( c + D_A \Lambda_A) }[/math]
- Step 5
- Check the primal feasibility for all inactive constraints.
- If [math]\displaystyle{ M'(y_j;a)\lt 0 }[/math] for any [math]\displaystyle{ j\in I }[/math], move [math]\displaystyle{ j }[/math] from [math]\displaystyle{ I }[/math] to [math]\displaystyle{ A }[/math] and return to Step 2.
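Steps 1-5 can be sketched as a small generic active-set QP solver. This is illustrative, not a production implementation: in Step 3 it drops the most negative multiplier (the steps above allow dropping any), and the tolerances and function name are assumptions. Note that minimizing [math]\displaystyle{ Loss(a) }[/math] with [math]\displaystyle{ Q = 2\sum_i B(\hat y_i)B(\hat y_i)^T }[/math] and [math]\displaystyle{ c = 2\sum_i \hat x_i B(\hat y_i) }[/math] is equivalent to the QP below.

```python
import numpy as np

def active_set_qp(Q, c, D, eps, max_iter=100):
    """Steps 1-5: minimize 0.5 a^T Q a - c^T a  subject to  D^T a >= eps.
    Returns the solution a and the final active set of constraint indices."""
    n_con = D.shape[1]
    A = set(range(n_con))                        # Step 1: guess all constraints active
    Qinv = np.linalg.inv(Q)
    for _ in range(max_iter):
        idx = sorted(A)
        if idx:
            DA = D[:, idx]
            Omega = DA.T @ Qinv @ DA
            s = DA.T @ Qinv @ c
            lam = np.linalg.pinv(Omega) @ (eps[idx] - s)   # Step 2
            if lam.min() < -1e-12:                         # Step 3: drop a constraint
                A.discard(idx[int(np.argmin(lam))])        # (most negative multiplier)
                continue
            a = Qinv @ (c + DA @ lam)                      # Step 4
        else:
            a = Qinv @ c
        violated = [j for j in range(n_con)
                    if j not in A and D[:, j] @ a < eps[j] - 1e-12]
        if not violated:                                   # Step 5: primal feasible
            return a, A
        A.add(violated[0])
    raise RuntimeError("active-set iteration did not terminate")
```

For example, minimizing [math]\displaystyle{ \|a - (1,1)\|^2 }[/math] subject to [math]\displaystyle{ a_1 \ge 2 }[/math] yields [math]\displaystyle{ a = (2,1) }[/math] with that constraint active, while with the looser bound [math]\displaystyle{ a_1 \ge 0.5 }[/math] the constraint is dropped and the unconstrained optimum [math]\displaystyle{ (1,1) }[/math] is returned.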
Conjectured proposition: This iterative algorithm will always terminate in a finite number of steps.
Proof sketch: The convergence of the active set method is well-studied in convex optimization. The following prerequisites are sufficient for guaranteed convergence:
- ✓ Constraint functions are continuous.
- ✓ Carrier set is compact. (Yes: it is finite.)
- ✓ Slater's condition holds.
- ✓ Problem is convex. Functions are continuously differentiable.
- ? LICQ: Linear Independence Constraint Qualification: At the solution point, the gradients of the active constraints are linearly independent. Since the gradients of the active constraints are
- [math]\displaystyle{ \nabla_a M'(y_j;a) = B'(y_j) }[/math]
- this is equivalent to establishing that [math]\displaystyle{ \{ B'(y_j) : j\in A \} }[/math] has full rank. Proposition 4 established that [math]\displaystyle{ B }[/math] (not [math]\displaystyle{ B' }[/math]) has full rank for the MetaLog basis when [math]\displaystyle{ m \ge k }[/math], whether [math]\displaystyle{ k }[/math] is even or odd.
There is also a Constant Rank Constraint Qualification (CRCQ) that might be an alternative to LICQ.
The LICQ is used to hedge against degenerate cases in which there are equally good solutions, such that the iteration might cycle between them. Another way to prove convergence is to impose an ordering on [math]\displaystyle{ j }[/math] so that, in the case of ties, the algorithm visits them only in a prescribed order, avoiding cycles. The number of possible active sets is itself finite, i.e., [math]\displaystyle{ 2^{m+2} }[/math]. Thus, a terminating version is clearly possible.
Solution algorithm
The previous section finds the optimal solution (using a nearly closed-form solution) given a "guess" about the location of the zeros of [math]\displaystyle{ M' }[/math]. With a correct guess, or a superset, it finds the solution (the best-fit feasible MetaLog). Now we need to specify the algorithm to converge to the correct guess.
- Step 1
- Initialize the carrier set with the negative local minima of [math]\displaystyle{ M'(y;a^{(0)}) }[/math] (found by computing the inflection points) together with the zeros of [math]\displaystyle{ M'(y;a^{(0)}) }[/math], where [math]\displaystyle{ a^{(0)} }[/math] is the solution to the unconstrained MetaLog fit.
- Step 2
- Find [math]\displaystyle{ \Lambda }[/math] and [math]\displaystyle{ a }[/math] using the active set method of the previous section, given this carrier set.
- Step 3
- Compute the zeros of [math]\displaystyle{ M'(y;a) }[/math] with this new [math]\displaystyle{ a }[/math].
- Step 4
- Refine the carrier set using this new set of zeros.
- First, shrink the carrier set by removing points where [math]\displaystyle{ \lambda_j=0 }[/math] and [math]\displaystyle{ M'(y_j;a)\gt 0 }[/math].
- Then add the new zeros to the carrier set.
- Step 5
- Iterate to convergence.
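The outer loop above can be sketched as a skeleton that is agnostic to the inner solver; `solve_dual` and `find_zeros` are hypothetical callables standing in for the active-set method and the root-finder described earlier, and the shrink step is simplified to dropping carrier points whose multiplier is (numerically) zero:

```python
import numpy as np

def fit_feasible(solve_dual, find_zeros, a0, max_iter=20, lam_tol=1e-10):
    """Outer loop, Steps 1-5. `solve_dual(carrier) -> (a, lam)` performs the
    active-set solve for a given carrier set; `find_zeros(a)` returns zeros of M'."""
    carrier = sorted(find_zeros(a0))              # Step 1: zeros at unconstrained fit
    a = a0
    for _ in range(max_iter):
        a, lam = solve_dual(carrier)              # Step 2
        zeros = find_zeros(a)                     # Step 3
        # Step 4: shrink (drop carrier points with zero multiplier), then add zeros
        keep = [y for y, l in zip(carrier, lam) if l > lam_tol]
        new_carrier = sorted(set(keep) | set(np.round(zeros, 12)))
        if new_carrier == carrier:                # Step 5: carrier set has stabilized
            return a
        carrier = new_carrier
    return a
```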
Conjectures (that I think are provable):
- Efficiency: The carrier set size never exceeds [math]\displaystyle{ 2 m }[/math]
- Every iteration reduces the duality gap.
- Converges to the global optimum.
Note on proof: There is a variant of Step 4 that does not shrink the carrier set before adding the new zeros. It appears to be more direct to prove convergence with that variant, possibly using existing theorems from LSIP literature. The shrinking step doesn't change the solution since we are removing carrier indices that didn't need to be there in the previous iteration.
Tail constraints
When there is a left tail reversal, there will generally not be an inflection point to the left of the leftmost zero. Hence, a point in that leftmost interval won't be added to the carrier set by the above algorithm. Instead, we need to enforce the tail constraint. Similarly for right-tail reversal.
Write [math]\displaystyle{ s(y) = a \cdot S(y) }[/math], [math]\displaystyle{ \mu(y) = a\cdot U(y) }[/math], where S and U are polynomial basis functions (with zeros at positions that don't apply). We have six possible tail constraints:
- [math]\displaystyle{ s(0)\gt 0 }[/math]
- [math]\displaystyle{ -s'(0)\gt 0 }[/math]
- [math]\displaystyle{ \mu'(0)\ge 0 }[/math]
- [math]\displaystyle{ s(1)\gt 0 }[/math]
- [math]\displaystyle{ s'(1)\gt 0 }[/math]
- [math]\displaystyle{ \mu'(1)\ge 0 }[/math]
During the active set search stage, #2 in the active set requires #1 to also be there, and #3 in the active set requires #2 to be there. Hence, of the first three of these, the only active set combinations are [math]\displaystyle{ \{ \}, \{\#1\}, \{\#1, \#2\}, \{\#1, \#2, \#3\} }[/math]. Similarly, for the last three constraints, only these combinations can appear in the active set: [math]\displaystyle{ \{ \}, \{\#4\}, \{\#4, \#5\}, \{\#4, \#5, \#6\} }[/math].
The active set algorithm is thus modified so that if the active set includes [math]\displaystyle{ \{\#1,\#2\} }[/math], the algorithm would check whether [math]\displaystyle{ \mu'(0)\ge 0 }[/math]. If not, then it would add #3 to the active set. If the Lagrangian multiplier for #2 is negative, then it would remove #2 from the active set.
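The nesting rule can be encoded as a small validity check (constraint numbering #1 through #6 as above; an illustrative sketch, with the left-tail and right-tail chains handled independently):

```python
# Allowed (nested) combinations for the left-tail constraints #1..#3
# and the right-tail constraints #4..#6: 4 x 4 = 16 of the 2^6 = 64 subsets.
LEFT_OK = [set(), {1}, {1, 2}, {1, 2, 3}]
RIGHT_OK = [set(), {4}, {4, 5}, {4, 5, 6}]

def tail_set_allowed(active):
    """True iff the tail constraints in `active` respect the nesting rule
    (#2 requires #1, #3 requires #2; likewise #5 requires #4, #6 requires #5)."""
    left = {j for j in active if j in (1, 2, 3)}
    right = {j for j in active if j in (4, 5, 6)}
    return left in LEFT_OK and right in RIGHT_OK
```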
We need to adjust the closed form equation slightly to include the active tail constraints. Let
- [math]\displaystyle{ C = \left[ \begin{array}{c} S(0) \\ -S'(0) \\ U'(0) \\ S(1) \\ S'(1) \\ U'(1) \end{array}\right] }[/math]
and let [math]\displaystyle{ C_A }[/math] be the matrix with the rows that are in the active set as just discussed.
We expand the number of constraints by 6 (although they won't all be active), so now
- [math]\displaystyle{ \Lambda }[/math] is a [math]\displaystyle{ (m+6) }[/math] vector, and
- [math]\displaystyle{ D = [B'(y_1),...,B'(y_m), S(0), -S'(0), U'(0), S(1), S'(1), U'(1)] }[/math] is a [math]\displaystyle{ k \times (m+6) }[/math] matrix.
With this enhancement, the stationarity condition is still written as
- [math]\displaystyle{ Q a - c - D \Lambda = 0 }[/math]
The former closed-form equations still hold to solve for [math]\displaystyle{ \Lambda_A }[/math] and [math]\displaystyle{ a }[/math]. The main difference is that only 16 of the 64 possible combinations of the last 6 constraints are allowed in the active set.