LChrisman/Constrained regression
I consider here the solution to a regression problem with linear equality constraints:
- [math]\displaystyle{ \min_a || B a - y ||^2 }[/math]
- subject to [math]\displaystyle{ C a = d }[/math]
where [math]\displaystyle{ B }[/math] is the [math]\displaystyle{ n \times k }[/math] basis matrix, [math]\displaystyle{ C }[/math] is an [math]\displaystyle{ m \times k }[/math] matrix where [math]\displaystyle{ m }[/math] is the number of equality constraints and [math]\displaystyle{ k }[/math] the number of terms, and [math]\displaystyle{ d }[/math] is an m-vector. The target data [math]\displaystyle{ y }[/math] is an n-vector (column), and the solution [math]\displaystyle{ a }[/math] is a k-vector (column).
When [math]\displaystyle{ B }[/math] and [math]\displaystyle{ C }[/math] are of full rank with [math]\displaystyle{ k - m \ge n }[/math], a solution should pass through all points. The presence of the constraints means that, even in the full-rank case, it may require more terms than points to reach a perfect fit.
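For concreteness, here is a small synthetic instance in this notation; it is only an illustrative sketch, and the quadratic basis and the pass-through-a-point constraint are my own assumptions, not anything specified above.

```python
import numpy as np

# Illustrative setup (assumed example): fit a quadratic through n = 5 points,
# constrained so that the fitted curve passes through (0, 1).
x = np.linspace(-1.0, 1.0, 5)
y = np.array([1.2, 0.8, 1.0, 1.4, 2.1])   # target n-vector
B = np.vander(x, 3, increasing=True)      # n x k basis matrix (columns 1, x, x^2), k = 3
C = np.array([[1.0, 0.0, 0.0]])           # m x k constraint matrix, m = 1
d = np.array([1.0])                       # constraint C a = d, i.e. a_0 = 1
```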
The Theil and van der Panne (1960) solution to this constrained regression is
- [math]\displaystyle{ \lambda = \left( C (B^T B)^{-1} C^T \right)^{-1} (C a_{ols} - d) }[/math]
- [math]\displaystyle{ a^* = a_{ols} - (B^T B)^{-1} C^T \lambda }[/math]
where [math]\displaystyle{ a_{ols} }[/math] is the unconstrained (ordinary least squares) solution. When [math]\displaystyle{ k\gt n }[/math], [math]\displaystyle{ B^T B }[/math] is singular. It does not work to simply replace the inverse with a Moore-Penrose inverse in this case, because that singles out one particular unconstrained solution, which in general is not the optimum under the constraints.
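As a concrete reference point, here is a minimal numpy sketch of the Theil and van der Panne update for the well-behaved case ([math]\displaystyle{ B^T B }[/math] invertible, i.e. [math]\displaystyle{ k \le n }[/math] with full rank); the function name is my own.

```python
import numpy as np

def constrained_ols_tvdp(B, C, d, y):
    """Theil & van der Panne (1960) update; assumes B^T B is invertible."""
    BtB_inv = np.linalg.inv(B.T @ B)
    a_ols = BtB_inv @ B.T @ y                    # unconstrained OLS solution
    lam = np.linalg.solve(C @ BtB_inv @ C.T,     # Lagrange multipliers
                          C @ a_ols - d)
    return a_ols - BtB_inv @ C.T @ lam           # constrained solution a*
```

Applied to the small example above, this returns a quadratic whose value at [math]\displaystyle{ x = 0 }[/math] is exactly 1.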
My intention here is to document the general solution that continues to work when [math]\displaystyle{ k\gt n }[/math], and hopefully even when [math]\displaystyle{ k+m \gt n }[/math].
The null space
Let [math]\displaystyle{ r = rank(B) \le n }[/math] be the rank of [math]\displaystyle{ B }[/math]. Then there is a subspace of dimension [math]\displaystyle{ k-r }[/math], the null space of [math]\displaystyle{ B }[/math], with the property that you can add any vector from this space to a solution for [math]\displaystyle{ a }[/math] and still have a solution that minimizes the loss. Denote the null-space basis as
- [math]\displaystyle{ \{ \upsilon_1,...,\upsilon_{k-r} \} }[/math]
where each [math]\displaystyle{ \upsilon_i }[/math] is a k-vector. Given any particular solution [math]\displaystyle{ a_{part} }[/math] that minimizes the loss, any choice of coefficients [math]\displaystyle{ \{\alpha_j\} }[/math] gives another minimizer [math]\displaystyle{ a }[/math] via
- [math]\displaystyle{ a = a_{part} + \sum_{j=1}^{k-r} \alpha_j \upsilon_j }[/math].
Denote B's SVD as
- [math]\displaystyle{ B = U \Sigma V^T }[/math]
so that
- [math]\displaystyle{ B^T B = (V \Sigma^T U^T) ( U \Sigma V^T) = V \Sigma^T \Sigma V^T }[/math]
with [math]\displaystyle{ \Sigma^T \Sigma }[/math] containing [math]\displaystyle{ \sigma_1^2,...,\sigma_r^2 }[/math] on the diagonal and zeros elsewhere. [math]\displaystyle{ V }[/math] is orthonormal [math]\displaystyle{ (V^T V = I) }[/math], so we write
- [math]\displaystyle{ a = V \beta }[/math] for some [math]\displaystyle{ \beta \in \Re^k }[/math]
- [math]\displaystyle{ B a = U \Sigma V^T ( V \beta) = U \Sigma \beta }[/math]
and the loss is
- [math]\displaystyle{ || B a - y ||^2 = || \Sigma \beta - U^T y ||^2 }[/math]
Let [math]\displaystyle{ u := U^T y }[/math] denote [math]\displaystyle{ y }[/math] rotated into this basis, and partition [math]\displaystyle{ \beta }[/math] so that the first [math]\displaystyle{ r }[/math] components correspond to the nonzero singular values of B and the last [math]\displaystyle{ k-r }[/math] components correspond to the zero part of [math]\displaystyle{ \Sigma }[/math]. Denote this partition as
- [math]\displaystyle{ \beta = \left( \begin{array}{c} \beta_1 \\ \beta_2 \end{array}\right) }[/math]
where
- [math]\displaystyle{ \beta_1\in \Re^r }[/math] corresponds to the nonzero singular values [math]\displaystyle{ \sigma_1,...,\sigma_r }[/math].
- [math]\displaystyle{ \beta_2\in \Re^{k-r} }[/math] corresponds to the zero part of [math]\displaystyle{ \Sigma }[/math].
then
- [math]\displaystyle{ \Sigma \beta = \left(\begin{array}{c} D \beta_1 \\ 0\end{array}\right) , u = \left(\begin{array}{c} u_1 \\ u_2\end{array}\right) , \Sigma = \left(\begin{array}{cc} D & 0 \\ 0 & 0\end{array}\right) }[/math]
Minimizing wrt [math]\displaystyle{ \beta_1 }[/math] yields
- [math]\displaystyle{ \beta_1^{mn} = D^{-1} u_1 }[/math]
which, together with [math]\displaystyle{ \beta_2 = 0 }[/math], is the minimum norm solution matching the Moore-Penrose inverse [math]\displaystyle{ B^+ y }[/math]. The set of unconstrained minimizers is
- [math]\displaystyle{ \beta = \left(\begin{array}{c} \beta_1^{mn} \\ 0\end{array}\right) + \left(\begin{array}{c}0 \\ z\end{array}\right), z \in \Re^{k-r} }[/math]
Equivalently in terms of [math]\displaystyle{ a }[/math]
- [math]\displaystyle{ a = V \beta^{mn} + V_2 z }[/math]
where [math]\displaystyle{ V_2 }[/math] denotes the last [math]\displaystyle{ (k-r) }[/math] columns of [math]\displaystyle{ V }[/math], which span [math]\displaystyle{ Null(B) }[/math].
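Here is a hedged numpy sketch of this decomposition (the rank tolerance and the function name are assumptions of the sketch, not part of the derivation).

```python
import numpy as np

def unconstrained_minimizers(B, y, tol=1e-12):
    """Return the minimum-norm minimizer of ||B a - y||^2 and a basis V2 of Null(B)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=True)  # B = U Sigma V^T
    r = int(np.sum(s > tol * s[0]))                  # numerical rank of B
    V = Vt.T
    beta1_mn = (U.T @ y)[:r] / s[:r]                 # beta_1^mn = D^{-1} u_1
    a_mn = V[:, :r] @ beta1_mn                       # minimum-norm solution (= B^+ y)
    V2 = V[:, r:]                                    # columns spanning Null(B)
    return a_mn, V2
```

Every [math]\displaystyle{ a = a_{mn} + V_2 z }[/math] attains the same loss, which is exactly the family written above.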
Adding in the constraint
We'll solve
- [math]\displaystyle{ C \left( V_1 \beta_1^{mn} + V_2 \beta_2 \right) = d }[/math]
for [math]\displaystyle{ \beta_2 }[/math]. The solution is
- [math]\displaystyle{ a^* = V_1 \beta_1^* + V_2 \beta_2 }[/math]
with
- [math]\displaystyle{ \beta_1^* = D^{-1} \left( U^T y \right)_{\{1..r\}} }[/math]
- [math]\displaystyle{ C V_2 \beta_2 = d - C V_1 \beta_1^* }[/math]
The Moore-Penrose inverse can be used to obtain [math]\displaystyle{ \beta_2 }[/math]
- [math]\displaystyle{ \beta_2 = ( C V_2 )^+ (d - C V_1 \beta_1^* ) }[/math]
although adding any vector from [math]\displaystyle{ Null(C V_2) }[/math] to [math]\displaystyle{ \beta_2 }[/math] gives another solution.
Once you've settled on a solution, the Lagrange multipliers are
- [math]\displaystyle{ \lambda^* = \left( C^T \right)^+ \left[ B^T ( B a^* - y ) \right] }[/math]
Summary of algorithm
- Use SVD: [math]\displaystyle{ B = U \Sigma V^T }[/math]
- Solve [math]\displaystyle{ C (V \beta^*) = d }[/math] for [math]\displaystyle{ \beta_2 }[/math] as above, keeping [math]\displaystyle{ \beta_1^* = D^{-1} \left( U^T y \right)_{\{1..r\}} }[/math] fixed.
- [math]\displaystyle{ a^* = V \beta^* }[/math]
- Recover [math]\displaystyle{ \lambda^* }[/math] via [math]\displaystyle{ C^T \lambda^* = B^T ( B a^* - y) }[/math]
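Putting the steps together, a minimal sketch of this summarized procedure (again assuming a rank tolerance and numpy's pinv for [math]\displaystyle{ (C V_2)^+ }[/math] and [math]\displaystyle{ (C^T)^+ }[/math]; the function name is my own):

```python
import numpy as np

def constrained_lstsq_svd(B, C, d, y, tol=1e-12):
    """SVD-based constrained least squares: min ||B a - y||^2 subject to C a = d."""
    U, s, Vt = np.linalg.svd(B, full_matrices=True)
    r = int(np.sum(s > tol * s[0]))                        # numerical rank of B
    V1, V2 = Vt.T[:, :r], Vt.T[:, r:]
    beta1 = (U.T @ y)[:r] / s[:r]                          # beta_1^* = D^{-1} (U^T y)_{1..r}
    beta2 = np.linalg.pinv(C @ V2) @ (d - C @ V1 @ beta1)  # beta_2 via (C V_2)^+
    a_star = V1 @ beta1 + V2 @ beta2                       # a^* = V_1 beta_1^* + V_2 beta_2
    lam = np.linalg.pinv(C.T) @ (B.T @ (B @ a_star - y))   # C^T lambda^* = B^T (B a^* - y)
    return a_star, lam
```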
Note: I feel like there is an error in the above algorithm: [math]\displaystyle{ \beta_1^* }[/math] should not be able to depend only on [math]\displaystyle{ B }[/math] and [math]\displaystyle{ y }[/math], yet here it does. Perhaps the resolution is that when [math]\displaystyle{ C V_2 }[/math] has full row rank the constraint can be absorbed entirely by [math]\displaystyle{ \beta_2 }[/math], which does not affect the loss, so [math]\displaystyle{ \beta_1^* }[/math] never needs to move; otherwise [math]\displaystyle{ \beta_1 }[/math] would have to change as well.
Block matrix approach
This is a different approach. Again, start with the SVD of B:
- [math]\displaystyle{ B = U \Sigma V^T }[/math], with rank [math]\displaystyle{ r }[/math].
Write out the KKT block system:
- [math]\displaystyle{ \left( \begin{array}{cc} \Sigma^T \Sigma & -(C V)^T \\ C V & 0 \end{array}\right) \left( \begin{array}{c} \beta^* \\ \lambda \end{array}\right) = \left( \begin{array}{c} \Sigma^T U^T y \\ d \end{array}\right) }[/math]
A Moore-Penrose inverse should work here; then recover [math]\displaystyle{ a^* = V \beta^* }[/math].
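A sketch of this variant in numpy, under the same assumptions as the earlier sketches (full SVD, pinv for the block system):

```python
import numpy as np

def constrained_lstsq_kkt_rotated(B, C, d, y):
    """Solve the rotated KKT block system with a pseudo-inverse, then map back via a^* = V beta^*."""
    n, k = B.shape
    m = C.shape[0]
    U, s, Vt = np.linalg.svd(B, full_matrices=True)
    V = Vt.T
    Sigma = np.zeros((n, k))
    Sigma[:len(s), :len(s)] = np.diag(s)                 # B = U Sigma V^T
    CV = C @ V
    KKT = np.block([[Sigma.T @ Sigma, -CV.T],
                    [CV,              np.zeros((m, m))]])
    rhs = np.concatenate([Sigma.T @ (U.T @ y), d])
    sol = np.linalg.pinv(KKT) @ rhs
    beta, lam = sol[:k], sol[k:]
    return V @ beta, lam                                 # (a^*, lambda)
```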
Derivation
Use [math]\displaystyle{ \mathcal{L}(a,\lambda) = {1\over 2}||B a- y||^2 - \lambda^T (C a - d) }[/math]
- Stationarity:
- [math]\displaystyle{ \nabla_a \mathcal{L}(a,\lambda) = B^T (B a - y) - C^T \lambda = 0 }[/math]
- [math]\displaystyle{ B^T B a - C^T \lambda = B^T y }[/math]
- [math]\displaystyle{ V \Sigma^T \Sigma V^T a - C^T \lambda = V \Sigma^T U^T y }[/math]
- [math]\displaystyle{ \Sigma^T \Sigma (V^T a) - V^T C^T \lambda = \Sigma^T U^T y }[/math]
- [math]\displaystyle{ \Sigma^T \Sigma \beta - (C V)^T \lambda = \Sigma^T U^T y }[/math]
- Feasibility: [math]\displaystyle{ C a = d }[/math]
- [math]\displaystyle{ C V V^T a = d }[/math]
- [math]\displaystyle{ C V \beta = d }[/math]
More direct block matrix approach
With the block matrix approach, I guess we don't actually need to solve it in the rotated space. Why not just:
- [math]\displaystyle{ \left( \begin{array}{cc} B^T B & -C^T \\ C & 0 \end{array}\right) \left( \begin{array}{c} a^* \\ \lambda \end{array}\right) = \left( \begin{array}{c} B^T y \\ d \end{array}\right) }[/math]
This can be solved with the pseudo-inverse, i.e.
- [math]\displaystyle{ \left( \begin{array}{c} a^* \\ \lambda \end{array}\right) = \left( \begin{array}{cc} B^T B & -C^T \\ C & 0 \end{array}\right)^+ \left( \begin{array}{c} B^T y \\ d \end{array}\right) }[/math]
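The same idea as a short sketch (same pseudo-inverse caveats as above; the function name is my own):

```python
import numpy as np

def constrained_lstsq_kkt(B, C, d, y):
    """Solve the unrotated KKT block system with a pseudo-inverse."""
    k, m = B.shape[1], C.shape[0]
    KKT = np.block([[B.T @ B, -C.T],
                    [C,       np.zeros((m, m))]])
    rhs = np.concatenate([B.T @ y, d])
    sol = np.linalg.pinv(KKT) @ rhs
    return sol[:k], sol[k:]   # (a^*, lambda)
```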
What happens when we solve this using a Schur complement approach, so as to take advantage of the block structure and potentially reuse [math]\displaystyle{ (B^T B)^{-1} }[/math]? We get:
- [math]\displaystyle{ \lambda = \left( C (B^T B)^{-1} C^T \right)^+ ( d - C (B^T B)^{-1} B^T y ) }[/math]
- [math]\displaystyle{ a = (B^T B)^{-1} B^T y + (B^T B)^{-1} C^T \lambda = a_{ols} + (B^T B)^{-1} C^T \lambda }[/math]
These are the Theil and van der Panne equations again (up to the sign of [math]\displaystyle{ \lambda }[/math], which flips with the sign convention chosen for the constraint term in the Lagrangian), and I've already determined that they don't work when you replace [math]\displaystyle{ (B^T B)^{-1} }[/math] with a pseudo-inverse.