LChrisman/Constrained regression

I consider here the solution to a regression problem with linear equality constraints:

[math]\displaystyle{ \min_a || B a - y ||^2 }[/math]
subject to [math]\displaystyle{ C a = d }[/math]

where [math]\displaystyle{ B }[/math] is the [math]\displaystyle{ n \times k }[/math] basis matrix, [math]\displaystyle{ C }[/math] is an [math]\displaystyle{ m \times k }[/math] matrix, [math]\displaystyle{ m }[/math] is the number of equality constraints, [math]\displaystyle{ k }[/math] is the number of terms, and [math]\displaystyle{ d }[/math] is an m-vector. The target data, [math]\displaystyle{ y }[/math], is an n-vector (column), and the solution, [math]\displaystyle{ a }[/math], is a k-vector (column).

When [math]\displaystyle{ B }[/math] and [math]\displaystyle{ C }[/math] are of full rank with [math]\displaystyle{ k \ge n + m }[/math], a solution should pass through all points. The presence of the constraints means that even in the full-rank case, it may require more terms than points to reach a perfect fit.

The Theil and van der Panne (1960) solution to this constrained regression is

[math]\displaystyle{ \lambda = \left( C (B^T B)^{-1} C^T \right)^{-1} (C a_{ols} - d) }[/math]
[math]\displaystyle{ a^* = a_{ols} - (B^T B)^{-1} C^T \lambda }[/math]

where [math]\displaystyle{ a_{ols} }[/math] is the unconstrained (ordinary least squares) solution. When [math]\displaystyle{ k\gt n }[/math], [math]\displaystyle{ B^T B }[/math] is singular. It does not work to simply replace the inverse with a Moore-Penrose inverse in that case, because doing so selects a single solution which in general is not the optimal solution under the constraints.
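
For reference, a minimal numpy sketch of this full-rank case, on made-up example data (the variable names are only for illustration):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Made-up full-rank example: n points, k terms, m equality constraints.
n, k, m = 20, 5, 2
B = rng.standard_normal((n, k))   # basis matrix
y = rng.standard_normal(n)        # target data
C = rng.standard_normal((m, k))   # constraint matrix
d = rng.standard_normal(m)        # constraint right-hand side

# Unconstrained (ordinary least squares) solution.
a_ols, *_ = np.linalg.lstsq(B, y, rcond=None)

# Theil & van der Panne correction; needs B^T B and C (B^T B)^{-1} C^T invertible.
BtB_inv = np.linalg.inv(B.T @ B)
lam = np.linalg.solve(C @ BtB_inv @ C.T, C @ a_ols - d)
a_star = a_ols - BtB_inv @ C.T @ lam

print(np.allclose(C @ a_star, d))   # the constraints hold exactly
</syntaxhighlight>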

My intention here is to document the general solution that continues to work when [math]\displaystyle{ k\gt n }[/math], and hopefully even when [math]\displaystyle{ k+m \gt n }[/math].

The null space

Let [math]\displaystyle{ r = rank(B) \le n }[/math] be the rank of [math]\displaystyle{ B }[/math]. Then there is a subspace of [math]\displaystyle{ \Re^k }[/math], called the null space of [math]\displaystyle{ B }[/math], with dimension [math]\displaystyle{ k-r }[/math], with the property that you can add any vector from this space to a solution for [math]\displaystyle{ a }[/math] and still have a solution that minimizes the loss. Denote the null-space basis as

[math]\displaystyle{ \{ \upsilon_1,...,\upsilon_{k-r} \} }[/math]

where each [math]\displaystyle{ \upsilon_i }[/math] is a k-vector. Given any solution [math]\displaystyle{ a_{part} }[/math] that minimizes the loss, any choice of coefficients [math]\displaystyle{ \{\alpha_j\} }[/math] gives another [math]\displaystyle{ a }[/math] that also minimizes the loss, using

[math]\displaystyle{ a = a_{part} + \sum_{j=1}^{k-r} \alpha_j \upsilon_j }[/math].
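
A quick numerical check of this property, on made-up data, using scipy's null_space only to obtain a null-space basis:

<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)

# Underdetermined example: more terms than points, so Null(B) is nontrivial.
n, k = 4, 7
B = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# One particular loss-minimizing solution (the minimum-norm one, via lstsq).
a_part, *_ = np.linalg.lstsq(B, y, rcond=None)

# Orthonormal basis {v_1, ..., v_{k-r}} for Null(B), as columns.
V_null = null_space(B)                       # shape (k, k - r)
alpha = rng.standard_normal(V_null.shape[1])

# Adding any null-space combination leaves B a, and hence the loss, unchanged.
a_other = a_part + V_null @ alpha
print(np.allclose(B @ a_part, B @ a_other))  # True
</syntaxhighlight>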

Denote B's SVD as

[math]\displaystyle{ B = U \Sigma V^T }[/math]

so that

[math]\displaystyle{ B^T B = (V \Sigma^T U^T) ( U \Sigma V^T) = V \Sigma^T \Sigma V^T }[/math]

with [math]\displaystyle{ \Sigma^T \Sigma }[/math] containing [math]\displaystyle{ \sigma_1^2,...,\sigma_r^2 }[/math] on the diagonal and zeros elsewhere. [math]\displaystyle{ V }[/math] is orthonormal [math]\displaystyle{ (V^T V = I) }[/math], so we write

[math]\displaystyle{ a = V \beta }[/math] for some [math]\displaystyle{ \beta \in \Re^k }[/math]
[math]\displaystyle{ B a = U \Sigma V^T ( V \beta) = U \Sigma \beta }[/math]

and the loss is

[math]\displaystyle{ || B a - y ||^2 = || \Sigma \beta - U^T y ||^2 }[/math]

Let [math]\displaystyle{ u := U^T y }[/math] denote [math]\displaystyle{ y }[/math] rotated into the same basis, and partition [math]\displaystyle{ \beta }[/math] so that the first [math]\displaystyle{ r }[/math] rows correspond to the nonzero singular values of B, and the last [math]\displaystyle{ k-r }[/math] rows correspond to the zero part of [math]\displaystyle{ \Sigma }[/math]. Denote this partition as

[math]\displaystyle{ \beta = \left( \begin{array}{c} \beta_1 \\ \beta_2 \end{array}\right) }[/math]

where

  • [math]\displaystyle{ \beta_1\in \Re^r }[/math] corresponds to the nonzero singular values [math]\displaystyle{ \sigma_1,...,\sigma_r }[/math].
  • [math]\displaystyle{ \beta_2\in \Re^{k-r} }[/math] corresponds to the zero part of [math]\displaystyle{ \Sigma }[/math].

then

[math]\displaystyle{ \Sigma \beta = \left(\begin{array}{c} D \beta_1 \\ 0\end{array}\right) , u = \left(\begin{array}{c} u_1 \\ u_2\end{array}\right) , \Sigma = \left(\begin{array}{cc} D & 0 \\ 0 & 0\end{array}\right) }[/math]

Minimizing wrt [math]\displaystyle{ \beta_1 }[/math] yields

[math]\displaystyle{ \beta_1^{mn} = D^{-1} u_1 }[/math]

which, together with [math]\displaystyle{ \beta_2 = 0 }[/math], gives the minimum-norm solution, matching the Moore-Penrose inverse [math]\displaystyle{ B^+ y }[/math]. The set of unconstrained minimizers is

[math]\displaystyle{ \beta = \left(\begin{array}{c} \beta_1^{mn} \\ 0\end{array}\right) + \left(\begin{array}{c}0 \\ z\end{array}\right), z \in \Re^{k-r} }[/math]

Equivalently in terms of [math]\displaystyle{ a }[/math]

[math]\displaystyle{ a = V \beta^{mn} + V_2 z }[/math]

where [math]\displaystyle{ V_2 }[/math] denotes the last [math]\displaystyle{ (k-r) }[/math] columns of [math]\displaystyle{ V }[/math], which span [math]\displaystyle{ Null(B) }[/math].
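
These pieces can be read directly off the SVD; a numpy sketch on made-up data with more terms than points:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
n, k = 4, 7                           # more terms than points: B^T B is singular
B = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# Full SVD of B and its numerical rank r.
U, s, Vt = np.linalg.svd(B, full_matrices=True)
r = int(np.sum(s > 1e-12 * s[0]))
V = Vt.T
V1, V2 = V[:, :r], V[:, r:]           # V1 spans the row space, V2 spans Null(B)
D = s[:r]                             # nonzero singular values
U1 = U[:, :r]

# beta1_mn = D^{-1} u_1, and a_mn = V1 beta1_mn is the minimum-norm solution B^+ y.
beta1_mn = (U1.T @ y) / D
a_mn = V1 @ beta1_mn
print(np.allclose(a_mn, np.linalg.pinv(B) @ y))   # True
</syntaxhighlight>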

Adding in the constraint

We'll solve

[math]\displaystyle{ C \left( V_1 \beta_1^{mn} + V_2 \beta_2 \right) = d }[/math]

for [math]\displaystyle{ \beta_2 }[/math]. The solution is

[math]\displaystyle{ a^* = V_1 \beta_1^* + V_2 \beta_2 }[/math]

with

[math]\displaystyle{ \beta_1^* = D^{-1} \left( U^T y \right)_{\{1..r\}} }[/math]
[math]\displaystyle{ C V_2 \beta_2 = d - C V_1 \beta_1^* }[/math]

The Moore-Penrose inverse can be used to obtain [math]\displaystyle{ \beta_2 }[/math]

[math]\displaystyle{ \beta_2 = ( C V_2 )^+ (d - C V_1 \beta_1^* ) }[/math]

although adding any vector from [math]\displaystyle{ Null(C V_2) }[/math] to [math]\displaystyle{ \beta_2 }[/math] gives another solution.

Once you've settled on a solution, the Lagrange multipliers are

[math]\displaystyle{ \lambda^* = \left( C^T \right)^+ \left[ B^T ( B a^* - y ) \right] }[/math]
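
Putting the steps together, a sketch on made-up data: it pulls the pieces from the SVD, obtains [math]\displaystyle{ \beta_2 }[/math] via the pseudoinverse of [math]\displaystyle{ C V_2 }[/math], assembles [math]\displaystyle{ a^* }[/math], and recovers the multipliers:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 4, 7, 2                      # more terms than points, plus 2 constraints
B = rng.standard_normal((n, k))
y = rng.standard_normal(n)
C = rng.standard_normal((m, k))
d = rng.standard_normal(m)

# SVD pieces, as in the previous sketch.
U, s, Vt = np.linalg.svd(B, full_matrices=True)
r = int(np.sum(s > 1e-12 * s[0]))
V1, V2, D, U1 = Vt.T[:, :r], Vt.T[:, r:], s[:r], U[:, :r]

# beta1* = D^{-1} (U^T y)_{1..r}; beta2 from the pseudoinverse of C V2.
beta1_star = (U1.T @ y) / D
beta2 = np.linalg.pinv(C @ V2) @ (d - C @ V1 @ beta1_star)
a_star = V1 @ beta1_star + V2 @ beta2

# Lagrange multipliers via C^T lambda = B^T (B a* - y).
lam = np.linalg.pinv(C.T) @ (B.T @ (B @ a_star - y))

print(np.allclose(C @ a_star, d))      # holds when C V2 has full row rank
</syntaxhighlight>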

Summary of algorithm

  • Use SVD: [math]\displaystyle{ B = U \Sigma V^T }[/math]
  • Solve [math]\displaystyle{ C (V \beta^*) = d }[/math] in the row-space.
  • [math]\displaystyle{ a^* = V \beta^* }[/math]
  • Recover [math]\displaystyle{ \lambda^* }[/math] via [math]\displaystyle{ C^T \lambda^* = B^T ( B a^* - y) }[/math]

Note: I suspect there is an error in the above algorithm: [math]\displaystyle{ \beta_1^* }[/math] should not be able to depend only on [math]\displaystyle{ B }[/math] and [math]\displaystyle{ y }[/math], yet here it does.

Block matrix approach

This is a different approach. Again, start with the SVD of B:

[math]\displaystyle{ B = U \Sigma V^T }[/math], with rank [math]\displaystyle{ r }[/math].

Write out the KKT block system:

[math]\displaystyle{ \left( \begin{array}{cc} \Sigma^2 & -(C V)^T \\ C V & 0 \end{array}\right) \left( \begin{array}{c} \beta^* \\ \lambda \end{array}\right) = \left( \begin{array}{c} \Sigma U^T y \\ d \end{array}\right) }[/math]

A Moore-Penrose inverse should work here; then recover [math]\displaystyle{ a^* = V \beta^* }[/math].
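
A sketch of solving this rotated system with a pseudoinverse, on made-up data (and assuming the more-terms-than-points case when padding [math]\displaystyle{ \Sigma^T \Sigma }[/math] and [math]\displaystyle{ \Sigma^T U^T y }[/math] with zeros):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 4, 7, 2
B = rng.standard_normal((n, k))
y = rng.standard_normal(n)
C = rng.standard_normal((m, k))
d = rng.standard_normal(m)

# Rotated quantities from the full SVD of B.
U, s, Vt = np.linalg.svd(B, full_matrices=True)
V = Vt.T
Sigma2 = np.diag(np.concatenate([s**2, np.zeros(k - len(s))]))    # Sigma^T Sigma (k x k)
Sig_Ut_y = np.concatenate([s * (U.T @ y), np.zeros(k - len(s))])  # Sigma^T U^T y (k-vector)
CV = C @ V

# KKT block system in the rotated coordinates, solved with a pseudoinverse.
KKT = np.block([[Sigma2, -CV.T],
                [CV, np.zeros((m, m))]])
sol = np.linalg.pinv(KKT) @ np.concatenate([Sig_Ut_y, d])
beta_star, lam = sol[:k], sol[k:]
a_star = V @ beta_star

print(np.allclose(C @ a_star, d))   # feasible
</syntaxhighlight>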

Derivation

Use [math]\displaystyle{ \mathcal{L}(a,\lambda) = {1\over 2}||B a- y||^2 - \lambda^T (C a - d) }[/math]

  1. Stationarity:
    [math]\displaystyle{ \nabla_a \mathcal{L}(a,\lambda) = B^T (B a - y) - C^T \lambda = 0 }[/math]
    [math]\displaystyle{ B^T B a - C^T \lambda = B^T y }[/math]
    [math]\displaystyle{ V \Sigma^T \Sigma V^T a - C^T \lambda = V \Sigma^T U^T y }[/math]
    [math]\displaystyle{ \Sigma^2 (V^T a) - V^T C^T \lambda = \Sigma U^T y }[/math]
    [math]\displaystyle{ \Sigma^2 \beta - (C V)^T \lambda = \Sigma U^T y }[/math]
  2. Feasibility: [math]\displaystyle{ C a = d }[/math]
    [math]\displaystyle{ C V V^T a = d }[/math]
    [math]\displaystyle{ C V \beta = d }[/math]

More direct block matrix approach

With the block matrix approach, I guess we don't actually need to solve it in the rotated space. Why not just:

[math]\displaystyle{ \left( \begin{array}{cc} B^T B & -C^T \\ C & 0 \end{array}\right) \left( \begin{array}{c} a^* \\ \lambda \end{array}\right) = \left( \begin{array}{c} B^T y \\ d \end{array}\right) }[/math]

This can be solved with the pseudo-inverse, i.e.

[math]\displaystyle{ \left( \begin{array}{c} a^* \\ \lambda \end{array}\right) = \left( \begin{array}{cc} B^T B & -C^T \\ C & 0 \end{array}\right)^+ \left( \begin{array}{c} B^T y \\ d \end{array}\right) }[/math]
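
A sketch of the direct system on made-up data, solved with numpy's pinv; the final lines check feasibility and stationarity:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
n, k, m = 4, 7, 2                    # again more terms than points
B = rng.standard_normal((n, k))
y = rng.standard_normal(n)
C = rng.standard_normal((m, k))
d = rng.standard_normal(m)

# Direct KKT system in the original coordinates, solved with a pseudoinverse.
KKT = np.block([[B.T @ B, -C.T],
                [C, np.zeros((m, m))]])
sol = np.linalg.pinv(KKT) @ np.concatenate([B.T @ y, d])
a_star, lam = sol[:k], sol[k:]

print(np.allclose(C @ a_star, d))                        # feasibility
print(np.allclose(B.T @ (B @ a_star - y), C.T @ lam))    # stationarity
</syntaxhighlight>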

When B is full rank, you can solve this for [math]\displaystyle{ \lambda }[/math], then for [math]\displaystyle{ a }[/math] as a function of [math]\displaystyle{ \lambda }[/math] using [math]\displaystyle{ (B^T B)^{-1} }[/math]. Doing so ends up with the Theil and van der Panne equations again. That route is more efficient because it inverts smaller matrices, but it loses the optimal solution once we pass full rank (e.g., more terms than points).
