LChrisman/Constrained regression

 

I consider here the solution to a regression problem with linear equality constraints:

[math]\displaystyle{ \min_a || B a - y ||^2 }[/math]
subject to [math]\displaystyle{ C a = d }[/math]

where [math]\displaystyle{ B }[/math] is the [math]\displaystyle{ n \times k }[/math] basis matrix, [math]\displaystyle{ C }[/math] is an [math]\displaystyle{ m \times k }[/math] matrix with [math]\displaystyle{ m }[/math] the number of equality constraints and [math]\displaystyle{ k }[/math] the number of terms, and [math]\displaystyle{ d }[/math] is an m-vector. The target data [math]\displaystyle{ y }[/math] is an n-vector (column), and the solution [math]\displaystyle{ a }[/math] is a k-vector (column).

When [math]\displaystyle{ B }[/math] and [math]\displaystyle{ C }[/math] are of full rank with [math]\displaystyle{ k \ge n + m }[/math], a solution should pass through all points. The presence of the constraints means that, even in the full-rank case, it may take more terms than points to reach a perfect fit.

The Theil and van der Panne (1960) solution to this constrained regression is

[math]\displaystyle{ \lambda = \left( C (B^T B)^{-1} C^T \right)^{-1} (C a_{ols} - d) }[/math]
[math]\displaystyle{ a^* = a_{ols} - (B^T B)^{-1} C^T \lambda }[/math]

where [math]\displaystyle{ a_{ols} }[/math] is the unconstrained (ordinary least squares) solution. When [math]\displaystyle{ k\gt n }[/math], [math]\displaystyle{ B^T B }[/math] is singular. It does not work to simply replace the inverse with a Moore-Penrose inverse in this case, because that picks out one particular unconstrained solution, which in general is not the constrained optimum.
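
For the full-rank case, this is easy to check numerically. Below is a minimal numpy sketch of the Theil and van der Panne update; the function name and the example data are my own, chosen only for illustration.

<syntaxhighlight lang="python">
import numpy as np

def theil_van_der_panne(B, y, C, d):
    """Equality-constrained least squares, full-rank case: assumes B^T B and
    C (B^T B)^{-1} C^T are both invertible."""
    BtB_inv = np.linalg.inv(B.T @ B)
    a_ols = BtB_inv @ (B.T @ y)                # unconstrained OLS solution
    S = C @ BtB_inv @ C.T                      # m x m
    lam = np.linalg.solve(S, C @ a_ols - d)    # lambda = (C (B^T B)^{-1} C^T)^{-1} (C a_ols - d)
    a_star = a_ols - BtB_inv @ (C.T @ lam)     # constrained solution
    return a_star, lam

# toy example: 3 coefficients, 10 points, constrain the coefficients to sum to 1
rng = np.random.default_rng(0)
B = rng.normal(size=(10, 3))
y = rng.normal(size=10)
C = np.ones((1, 3))
d = np.array([1.0])
a_star, lam = theil_van_der_panne(B, y, C, d)
print(C @ a_star)   # ~[1.0], the constraint holds
</syntaxhighlight>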

My intention here is to document the general solution that continues to work when [math]\displaystyle{ k\gt n }[/math], and hopefully even when [math]\displaystyle{ k+m \gt n }[/math].

The null space

Let [math]\displaystyle{ r = rank(B) \le n }[/math]. Then there is a subspace of [math]\displaystyle{ \Re^k }[/math], called the null space of [math]\displaystyle{ B }[/math], with dimension [math]\displaystyle{ k-r }[/math], with the property that adding any vector from this space to a solution for [math]\displaystyle{ a }[/math] yields another solution that minimizes the loss. Denote the null-space basis as

[math]\displaystyle{ \{ \upsilon_1,...,\upsilon_{k-r} \} }[/math]

where each [math]\displaystyle{ \upsilon_i }[/math] is a k-vector. Given any solution [math]\displaystyle{ a_{part} }[/math] that minimizes the loss, any choice of coefficients [math]\displaystyle{ \{\alpha_j\} }[/math] gives another [math]\displaystyle{ a }[/math] that also minimizes the loss via

[math]\displaystyle{ a = a_{part} + \sum_{j=1}^{k-r} \alpha_j \upsilon_j }[/math].
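
This property is easy to verify numerically. The sketch below (numpy, with an arbitrary wide B of my own making) gets a null-space basis from the SVD and confirms that shifting a minimizer along it leaves the loss unchanged.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 8                        # more terms than points, so Null(B) is nontrivial
B = rng.normal(size=(n, k))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(B)        # full SVD: U is n x n, Vt is k x k
r = int(np.sum(s > 1e-12))         # numerical rank of B
N = Vt[r:].T                       # columns form a basis of Null(B), shape (k, k - r)

a_part = np.linalg.pinv(B) @ y     # one particular loss-minimizing solution
alpha = rng.normal(size=k - r)     # arbitrary coefficients alpha_j
a_other = a_part + N @ alpha       # shifted along the null space

loss = lambda a: np.linalg.norm(B @ a - y) ** 2
print(np.isclose(loss(a_part), loss(a_other)))   # True: the loss is unchanged
</syntaxhighlight>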

Denote B's SVD as

[math]\displaystyle{ B = U \Sigma V^T }[/math]

so that

[math]\displaystyle{ B^T B = (V \Sigma^T U^T) ( U \Sigma V^T) = V \Sigma^T \Sigma V^T }[/math]

with [math]\displaystyle{ \Sigma^T \Sigma }[/math] containing [math]\displaystyle{ \sigma_1^2,...,\sigma_r^2 }[/math] on the diagonal and zeros elsewhere. [math]\displaystyle{ V }[/math] is orthonormal [math]\displaystyle{ (V^T V = I) }[/math], so we write

[math]\displaystyle{ a = V \beta }[/math] for some [math]\displaystyle{ \beta \in \Re^k }[/math]
[math]\displaystyle{ B a = U \Sigma V^T ( V \beta) = U \Sigma \beta }[/math]

and the loss is

[math]\displaystyle{ || B a - y ||^2 = || \Sigma \beta - U^T y ||^2 }[/math]
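
A quick numerical sanity check of this identity (numpy, arbitrary data; with the full SVD both U and V are square and orthogonal, so the norm is preserved):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 8
B = rng.normal(size=(n, k)); y = rng.normal(size=n); a = rng.normal(size=k)

U, s, Vt = np.linalg.svd(B)                            # full SVD
Sigma = np.zeros((n, k)); Sigma[:len(s), :len(s)] = np.diag(s)
beta = Vt @ a                                          # a = V beta  =>  beta = V^T a

lhs = np.linalg.norm(B @ a - y) ** 2
rhs = np.linalg.norm(Sigma @ beta - U.T @ y) ** 2
print(np.isclose(lhs, rhs))                            # True
</syntaxhighlight>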

Use [math]\displaystyle{ u := U^T y }[/math] to rotate [math]\displaystyle{ y }[/math] into the same coordinates, and partition [math]\displaystyle{ \beta }[/math] so that the first [math]\displaystyle{ r }[/math] rows correspond to the nonzero singular values of B and the last [math]\displaystyle{ k-r }[/math] rows correspond to the zero part of [math]\displaystyle{ \Sigma }[/math]. Denote this partition as

[math]\displaystyle{ \beta = \left( \begin{array}{c} \beta_1 \\ \beta_2 \end{array}\right) }[/math]

where

  • [math]\displaystyle{ \beta_1\in \Re^r }[/math] corresponds to the nonzero singular values [math]\displaystyle{ \sigma_1,...,\sigma_r }[/math].
  • [math]\displaystyle{ \beta_2\in \Re^{k-r} }[/math] corresponds to the zero part of [math]\displaystyle{ \Sigma }[/math].

then

[math]\displaystyle{ \Sigma \beta = \left(\begin{array}{c} D \beta_1 \\ 0\end{array}\right) , u = \left(\begin{array}{c} u_1 \\ u_2\end{array}\right) , \Sigma = \left(\begin{array}{cc} D & 0 \\ 0 & 0\end{array}\right) }[/math]

Minimizing wrt [math]\displaystyle{ \beta_1 }[/math] yields

[math]\displaystyle{ \beta_1^{mn} = D^{-1} u_1 }[/math]

which is the minimum norm solution matching the Moore-Penrose inverse [math]\displaystyle{ B^+ y }[/math]. The set of unconstrained minimizers is

[math]\displaystyle{ \beta = \left(\begin{array}{c} \beta_1^{mn} \\ 0\end{array}\right) + \left(\begin{array}{c}0 \\ z\end{array}\right), z \in \Re^{k-r} }[/math]

Equivalently in terms of [math]\displaystyle{ a }[/math]

[math]\displaystyle{ a = V \beta^{mn} + V_2 z }[/math]

where [math]\displaystyle{ V_2 }[/math] denotes the last [math]\displaystyle{ (k-r) }[/math] columns of [math]\displaystyle{ V }[/math], which span [math]\displaystyle{ Null(B) }[/math].
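
The following numpy sketch (illustrative data only) computes [math]\displaystyle{ \beta_1^{mn} }[/math] from the SVD, confirms that [math]\displaystyle{ V \beta^{mn} }[/math] matches the pseudo-inverse solution [math]\displaystyle{ B^+ y }[/math], and checks that adding [math]\displaystyle{ V_2 z }[/math] does not change the loss.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
n, k = 5, 8
B = rng.normal(size=(n, k)); y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(B)
r = int(np.sum(s > 1e-12))
V1, V2 = Vt[:r].T, Vt[r:].T            # V1: row-space directions, V2: basis of Null(B)
u1 = (U.T @ y)[:r]                     # first r rows of u = U^T y
beta1_mn = u1 / s[:r]                  # beta_1^{mn} = D^{-1} u_1

a_mn = V1 @ beta1_mn                   # minimum-norm unconstrained minimizer
print(np.allclose(a_mn, np.linalg.pinv(B) @ y))        # True: matches B^+ y

z = rng.normal(size=k - r)             # any z in R^{k-r}
a = a_mn + V2 @ z                      # another unconstrained minimizer
print(np.isclose(np.linalg.norm(B @ a_mn - y)**2,
                 np.linalg.norm(B @ a - y)**2))        # True: same loss
</syntaxhighlight>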

Adding in the constraint

We'll solve

[math]\displaystyle{ C \left( V_1 \beta_1^{mn} + V_2 \beta_2 \right) = d }[/math]

for [math]\displaystyle{ \beta_2 }[/math]. The solution is

[math]\displaystyle{ a^* = V_1 \beta_1^* + V_2 \beta_2 }[/math]

with

[math]\displaystyle{ \beta_1^* = D^{-1} \left( U^T y \right)_{\{1..r\}} }[/math]
[math]\displaystyle{ C V_2 \beta_2 = d - C V_1 \beta_1^* }[/math]

The Moore-Penrose inverse can be used to obtain [math]\displaystyle{ \beta_2 }[/math]

[math]\displaystyle{ \beta_2 = ( C V_2 )^+ (d - C V_1 \beta_1^* ) }[/math]

although adding any vector in [math]\displaystyle{ Null(C V_2) }[/math] to [math]\displaystyle{ \beta_2 }[/math] gives another solution.

Once you've settled on a solution, the Lagrange multipliers are

[math]\displaystyle{ \lambda^* = \left( C^T \right)^+ \left[ B^T ( B a^* - y ) \right] }[/math]
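
Collecting this section into a single numpy sketch (the helper name and test data are mine). It follows the formulas above literally, so it only reaches the constrained optimum when [math]\displaystyle{ d - C V_1 \beta_1^* }[/math] is reachable through [math]\displaystyle{ C V_2 }[/math]; see the note after the summary below.

<syntaxhighlight lang="python">
import numpy as np

def constrained_ls_via_nullspace(B, y, C, d, tol=1e-12):
    """Sketch of the construction above: fix beta_1 at the unconstrained optimum
    and use the null-space coordinates beta_2 to satisfy C a = d.
    Only valid when d - C V1 beta1* lies in the range of C V2."""
    U, s, Vt = np.linalg.svd(B)
    r = int(np.sum(s > tol))
    V1, V2 = Vt[:r].T, Vt[r:].T
    beta1 = (U.T @ y)[:r] / s[:r]                          # beta_1^* = D^{-1} (U^T y)_{1..r}
    beta2 = np.linalg.pinv(C @ V2) @ (d - C @ (V1 @ beta1))
    a_star = V1 @ beta1 + V2 @ beta2
    lam = np.linalg.pinv(C.T) @ (B.T @ (B @ a_star - y))   # Lagrange multipliers, formula above
    return a_star, lam

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 8)); y = rng.normal(size=5)        # k = 8 > n = 5
C = rng.normal(size=(2, 8)); d = rng.normal(size=2)        # m = 2 constraints
a_star, lam = constrained_ls_via_nullspace(B, y, C, d)
print(np.allclose(C @ a_star, d))    # True here, since C V2 has full row rank
</syntaxhighlight>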

Summary of algorithm

  • Use SVD: [math]\displaystyle{ B = U \Sigma V^T }[/math]
  • Solve [math]\displaystyle{ C (V \beta^*) = d }[/math] in the row-space.
  • [math]\displaystyle{ a^* = V \beta^* }[/math]
  • Recover [math]\displaystyle{ \lambda^* }[/math] via [math]\displaystyle{ C^T \lambda^* = B^T ( B a^* - y) }[/math]

Note: I feel like there is an error in the above algorithm: [math]\displaystyle{ \beta_1^* }[/math] should not be able to depend only on [math]\displaystyle{ B }[/math] and [math]\displaystyle{ y }[/math], yet here it does. (This construction satisfies the constraint, and hence attains the constrained optimum, only when [math]\displaystyle{ d - C V_1 \beta_1^* }[/math] lies in the range of [math]\displaystyle{ C V_2 }[/math], for example when [math]\displaystyle{ C V_2 }[/math] has full row rank; otherwise the constraint must pull [math]\displaystyle{ \beta_1 }[/math] away from [math]\displaystyle{ D^{-1} u_1 }[/math].)

Block matrix approach

This is a different approach. Again, start with the SVD of B:

[math]\displaystyle{ B = U \Sigma V^T }[/math], with rank [math]\displaystyle{ r }[/math].

Write out the KKT block system:

[math]\displaystyle{ \left( \begin{array}{cc} \Sigma^2 & -(C V)^T \\ C V & 0 \end{array}\right) \left( \begin{array}{c} \beta^* \\ \lambda \end{array}\right) = \left( \begin{array}{c} \Sigma U^T y \\ d \end{array}\right) }[/math]

A Moore-Penrose inverse should work here, then recover [math]\displaystyle{ a^* = V \beta^* }[/math].
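
A numpy sketch of this block solve (data of my own); for a rectangular [math]\displaystyle{ \Sigma }[/math] I read [math]\displaystyle{ \Sigma^2 }[/math] as [math]\displaystyle{ \Sigma^T \Sigma }[/math] and the right-hand side as [math]\displaystyle{ \Sigma^T U^T y }[/math] so that the dimensions match.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
n, k, m = 5, 8, 2
B = rng.normal(size=(n, k)); y = rng.normal(size=n)
C = rng.normal(size=(m, k)); d = rng.normal(size=m)        # constraint assumed feasible

U, s, Vt = np.linalg.svd(B)
Sigma = np.zeros((n, k)); Sigma[:len(s), :len(s)] = np.diag(s)
V = Vt.T
CV = C @ V

# KKT system in the rotated coordinates beta = V^T a
K = np.block([[Sigma.T @ Sigma, -CV.T],
              [CV,              np.zeros((m, m))]])
rhs = np.concatenate([Sigma.T @ (U.T @ y), d])

sol = np.linalg.pinv(K) @ rhs          # Moore-Penrose solve of the (generally singular) KKT matrix
beta_star, lam = sol[:k], sol[k:]
a_star = V @ beta_star

print(np.allclose(C @ a_star, d))      # feasibility holds
</syntaxhighlight>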

Derivation

Use [math]\displaystyle{ \mathcal{L}(a,\lambda) = {1\over 2}||B a- y||^2 - \lambda^T (C a - d) }[/math]

  1. Stationarity:
    [math]\displaystyle{ \nabla_a \mathcal{L}(a,\lambda) = B^T (B a - y) - C^T \lambda = 0 }[/math]
    [math]\displaystyle{ B^T B a - C^T \lambda = B^T y }[/math]
    [math]\displaystyle{ V \Sigma^T \Sigma V^T a - C^T \lambda = V \Sigma^T U^T y }[/math]
    [math]\displaystyle{ \Sigma^2 (V^T a) - V^T C^T \lambda = \Sigma U^T y }[/math]
    [math]\displaystyle{ \Sigma^2 \beta - (C V)^T \lambda = \Sigma U^T y }[/math]
  2. Feasibility: [math]\displaystyle{ C a = d }[/math]
    [math]\displaystyle{ C V V^T a = d }[/math]
    [math]\displaystyle{ C V \beta = d }[/math]

More direct block matrix approach

With the block matrix approach, I guess we don't actually need to solve it in the rotated space. Why not just:

[math]\displaystyle{ \left( \begin{array}{cc} B^T B & -C^T \\ C & 0 \end{array}\right) \left( \begin{array}{c} a^* \\ \lambda \end{array}\right) = \left( \begin{array}{c} B^T y \\ d \end{array}\right) }[/math]

This can be solved with the pseudo-inverse, i.e.

[math]\displaystyle{ \left( \begin{array}{c} a^* \\ \lambda \end{array}\right) = \left( \begin{array}{cc} B^T B & -C^T \\ C & 0 \end{array}\right)^+ \left( \begin{array}{c} B^T y \\ d \end{array}\right) }[/math]
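
The corresponding numpy sketch (again with made-up data); the two checks at the end confirm feasibility and stationarity of the recovered solution.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(6)
n, k, m = 5, 8, 2
B = rng.normal(size=(n, k)); y = rng.normal(size=n)
C = rng.normal(size=(m, k)); d = rng.normal(size=m)        # constraint assumed feasible

K = np.block([[B.T @ B, -C.T],
              [C,       np.zeros((m, m))]])
rhs = np.concatenate([B.T @ y, d])

sol = np.linalg.pinv(K) @ rhs          # pseudo-inverse of the KKT matrix (singular here, since k > n)
a_star, lam = sol[:k], sol[k:]

print(np.allclose(C @ a_star, d))                          # feasibility: C a* = d
print(np.allclose(B.T @ (B @ a_star - y), C.T @ lam))      # stationarity: B^T(B a* - y) = C^T lambda
</syntaxhighlight>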

What happens when we solve this using a Schur complement approach, so as to take advantage of the block structure and the potential reuse of [math]\displaystyle{ (B^T B)^{-1} }[/math]? We get:

[math]\displaystyle{ \lambda = \left( C (B^T B)^{-1} C^T \right)^+ ( d - C (B^T B)^{-1} B^T y ) }[/math]
[math]\displaystyle{ a = (B^T B)^{-1} B^T y + (B^T B)^{-1} C^T \lambda = a_{ols} + (B^T B)^{-1} C^T \lambda }[/math]

These are the Theil and van der Panne equations again (with [math]\displaystyle{ \lambda }[/math] carrying the opposite sign, because the Lagrangian here subtracts [math]\displaystyle{ \lambda^T (C a - d) }[/math]), which I've already determined don't work when you replace [math]\displaystyle{ (B^T B)^{-1} }[/math] with a pseudo-inverse.
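
As a check on the sign convention, the numpy sketch below (full-rank case only, so [math]\displaystyle{ (B^T B)^{-1} }[/math] exists; data are illustrative) compares the Schur-complement formulas with the Theil and van der Panne form.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(7)
n, k, m = 10, 3, 1                    # full-rank case: k <= n
B = rng.normal(size=(n, k)); y = rng.normal(size=n)
C = rng.normal(size=(m, k)); d = rng.normal(size=m)

A_inv = np.linalg.inv(B.T @ B)
a_ols = A_inv @ (B.T @ y)

# Schur complement of the block system above
lam = np.linalg.pinv(C @ A_inv @ C.T) @ (d - C @ a_ols)
a = a_ols + A_inv @ (C.T @ lam)

# Theil & van der Panne form: same a, with lambda of the opposite sign
lam_tvp = np.linalg.solve(C @ A_inv @ C.T, C @ a_ols - d)
a_tvp = a_ols - A_inv @ (C.T @ lam_tvp)

print(np.allclose(a, a_tvp), np.allclose(lam, -lam_tvp))   # True True
</syntaxhighlight>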
