LChrisman/Constrained regression
I consider here the solution to a regression problem with linear equality constraints:
- [math]\displaystyle{ \min_a || B a - y ||^2 }[/math]
- subject to [math]\displaystyle{ C a = d }[/math]
where [math]\displaystyle{ B }[/math] is the [math]\displaystyle{ n \times k }[/math] basis matrix, [math]\displaystyle{ C }[/math] is an [math]\displaystyle{ m \times k }[/math] matrix, with [math]\displaystyle{ m }[/math] the number of equality constraints and [math]\displaystyle{ k }[/math] the number of terms, and [math]\displaystyle{ d }[/math] is an m-vector. The target data [math]\displaystyle{ y }[/math] is an n-vector (column), and the solution [math]\displaystyle{ a }[/math] is a k-vector (column).
When [math]\displaystyle{ B }[/math] and [math]\displaystyle{ C }[/math] are of full rank with [math]\displaystyle{ k \ge n + m }[/math], a solution should pass through all the data points. Because the constraints consume [math]\displaystyle{ m }[/math] degrees of freedom, even in the full-rank case it may require more terms than points to reach a perfect fit.
The Theil and van der Panne (1960) solution to this constrained regression is
- [math]\displaystyle{ \lambda = \left( C (B^T B)^{-1} C^T \right)^{-1} (C a_{ols} - d) }[/math]
- [math]\displaystyle{ a^* = a_{ols} - (B^T B)^{-1} C^T \lambda }[/math]
where [math]\displaystyle{ a_{ols} }[/math] is the unconstrained (ordinary least squares) solution. When [math]\displaystyle{ k\gt n }[/math], [math]\displaystyle{ B^T B }[/math] is singular. It does not work to simply replace the inverse with a Moore-Penrose inverse in that case, because doing so selects a single (minimum-norm) unconstrained solution, which in general is not the optimal solution once the constraints are imposed.
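As a reference point, here is a minimal numpy sketch of these two equations (the function name is mine; it assumes [math]\displaystyle{ B }[/math] has full column rank so that [math]\displaystyle{ B^T B }[/math] is invertible, which is exactly the assumption that breaks down when [math]\displaystyle{ k\gt n }[/math]):

<syntaxhighlight lang="python">
import numpy as np

def theil_van_der_panne(B, y, C, d):
    """Equality-constrained least squares via Theil and van der Panne (1960).
    A sketch that assumes B has full column rank, so B^T B is invertible."""
    BtB_inv = np.linalg.inv(B.T @ B)
    a_ols = BtB_inv @ B.T @ y                    # unconstrained OLS solution
    S = C @ BtB_inv @ C.T                        # m x m matrix C (B^T B)^{-1} C^T
    lam = np.linalg.solve(S, C @ a_ols - d)      # Lagrange multipliers
    return a_ols - BtB_inv @ C.T @ lam           # constrained solution a^*
</syntaxhighlight>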
My intention here is to document the general solution that continues to work when [math]\displaystyle{ k\gt n }[/math], and hopefully even in the perfect-fit regime where [math]\displaystyle{ k \ge n+m }[/math].
The null space
Let [math]\displaystyle{ r = rank(B) \le n }[/math] be the rank of [math]\displaystyle{ B }[/math]. Then there is a subspace, called the null space of [math]\displaystyle{ B }[/math], with dimension [math]\displaystyle{ k-r }[/math], with the property that adding any vector from this space to a solution for [math]\displaystyle{ a }[/math] gives another solution that still minimizes the loss. Denote the null-space basis as
- [math]\displaystyle{ \{ \upsilon_1,...,\upsilon_{k-r} \} }[/math]
each [math]\displaystyle{ \upsilon_i }[/math] is a k-vector. Given any solution [math]\displaystyle{ a_{part} }[/math] that minimizes the loss, any choice of coefficients [math]\displaystyle{ \{\alpha_j\} }[/math] gives another [math]\displaystyle{ a }[/math] that also minimizes the loss via
- [math]\displaystyle{ a = a_{part} + \sum_{j=1}^{k-r} \alpha_j \upsilon_j }[/math].
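A tiny numeric illustration of this property (the matrix below is my own toy example, not part of the regression problem above):

<syntaxhighlight lang="python">
import numpy as np

# B has rank 1, so its null space has dimension k - r = 3 - 1 = 2.
B = np.array([[1., 2., 3.],
              [2., 4., 6.]])
_, _, Vt = np.linalg.svd(B)
nullspace = Vt[1:].T                  # columns spanning Null(B)

a_part = np.array([1., 0., 0.])       # some particular candidate solution
a_alt = a_part + nullspace @ np.array([0.3, -1.7])
print(np.allclose(B @ a_part, B @ a_alt))   # True: the loss is unchanged
</syntaxhighlight>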
Denote B's SVD as
- [math]\displaystyle{ B = U \Sigma V^T }[/math]
so that
- [math]\displaystyle{ B^T B = (V \Sigma^T U^T) ( U \Sigma V^T) = V \Sigma^T \Sigma V^T }[/math]
with [math]\displaystyle{ \Sigma^T \Sigma }[/math] containing [math]\displaystyle{ \sigma_1^2,...,\sigma_r^2 }[/math] on the diagonal and zeros elsewhere. [math]\displaystyle{ V }[/math] is orthonormal [math]\displaystyle{ (V^T V = I) }[/math], so we write
- [math]\displaystyle{ a = V \beta }[/math] for some [math]\displaystyle{ \beta \in \Re^k }[/math]
- [math]\displaystyle{ B a = U \Sigma V^T ( V \beta) = U \Sigma \beta }[/math]
and the loss is
- [math]\displaystyle{ || B a - y ||^2 = || \Sigma \beta - U^T y ||^2 }[/math]
Define [math]\displaystyle{ u := U^T y }[/math] to rotate [math]\displaystyle{ y }[/math] into the same basis, and partition [math]\displaystyle{ \beta }[/math] so that the first [math]\displaystyle{ r }[/math] entries correspond to the nonzero singular values of B and the last [math]\displaystyle{ k-r }[/math] entries correspond to the zero block of [math]\displaystyle{ \Sigma }[/math]. Denote this partition as
- [math]\displaystyle{ \beta = \left( \begin{array}{c} \beta_1 \\ \beta_2 \end{array}\right) }[/math]
where
- [math]\displaystyle{ \beta_1\in \Re^r }[/math] corresponds to the nonzero singular values [math]\displaystyle{ \sigma_1,...,\sigma_r }[/math].
- [math]\displaystyle{ \beta_2\in \Re^{k-r} }[/math] corresponds to the zero part of [math]\displaystyle{ \Sigma }[/math].
then
- [math]\displaystyle{ \Sigma \beta = \left(\begin{array}{c} D \beta_1 \\ 0\end{array}\right) , u = \left(\begin{array}{c} u_1 \\ u_2\end{array}\right) , \Sigma = \left(\begin{array}{cc} D & 0 \\ 0 & 0\end{array}\right) }[/math]
Minimizing with respect to [math]\displaystyle{ \beta_1 }[/math] yields
- [math]\displaystyle{ \beta_1^{mn} = D^{-1} u_1 }[/math]
which is the minimum norm solution matching the Moore-Penrose inverse [math]\displaystyle{ B^+ y }[/math]. The set of unconstrained minimizers is
- [math]\displaystyle{ \beta = \left(\begin{array}{c} \beta_1^{mn} \\ 0\end{array}\right) + \left(\begin{array}{c}0 \\ z\end{array}\right), z \in \Re^{k-r} }[/math]
Equivalently in terms of [math]\displaystyle{ a }[/math]
- [math]\displaystyle{ a = V \beta^{mn} + V_2 z }[/math]
where [math]\displaystyle{ V_2 }[/math] denotes the last [math]\displaystyle{ (k-r) }[/math] columns of [math]\displaystyle{ V }[/math], which span [math]\displaystyle{ Null(B) }[/math].
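In numpy terms, the pieces used above can be pulled out of the SVD roughly like this (a sketch with my own function names; the rank tolerance is an assumption):

<syntaxhighlight lang="python">
import numpy as np

def svd_parts(B, tol=1e-12):
    """Return U, the nonzero singular values D, V1 (spanning the row space of B),
    and V2 (spanning Null(B)), following the partition used above."""
    U, s, Vt = np.linalg.svd(B, full_matrices=True)   # B = U Sigma V^T
    r = int(np.sum(s > tol * s.max()))                # numerical rank
    V = Vt.T
    return U, s[:r], V[:, :r], V[:, r:]

def min_norm_solution(B, y):
    """a = V beta^{mn}, i.e. the Moore-Penrose solution B^+ y."""
    U, D, V1, V2 = svd_parts(B)
    beta1 = (U.T @ y)[:len(D)] / D                    # beta_1^{mn} = D^{-1} u_1
    return V1 @ beta1
</syntaxhighlight>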
Adding in the constraint
We'll solve
- [math]\displaystyle{ C \left( V_1 \beta_1^{mn} + V_2 \beta_2 \right) = d }[/math]
for [math]\displaystyle{ \beta_2 }[/math]. The solution is
- [math]\displaystyle{ a^* = V_1 \beta_1^* + V_2 \beta_2 }[/math]
with
- [math]\displaystyle{ \beta_1^* = D^{-1} \left( U^T y \right)_{\{1..r\}} }[/math]
- [math]\displaystyle{ C V_2 \beta_2 = d - C V_1 \beta_1^* }[/math]
The Moore-Penrose inverse can be used to obtain [math]\displaystyle{ \beta_2 }[/math]
- [math]\displaystyle{ \beta_2 = ( C V_2 )^+ (d - C V_1 \beta_1^* ) }[/math]
although adding any vector from [math]\displaystyle{ Null(C V_2) }[/math] to [math]\displaystyle{ \beta_2 }[/math] gives another solution.
Once you've settled on a solution, the Lagrange multipliers are
- [math]\displaystyle{ \lambda^* = \left( C^T \right)^+ \left[ B^T ( B a^* - y ) \right] }[/math]
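Putting this section together, here is a minimal numpy sketch of the construction as written (function name and rank tolerance are mine; note the caveat in the summary note below about whether fixing [math]\displaystyle{ \beta_1 }[/math] this way is actually optimal):

<syntaxhighlight lang="python">
import numpy as np

def constrained_fit_via_nullspace(B, y, C, d, tol=1e-12):
    """Fix beta_1 at the minimum-norm value D^{-1} u_1, then choose the
    Null(B) component beta_2 so that C a = d, as described above."""
    U, s, Vt = np.linalg.svd(B, full_matrices=True)
    r = int(np.sum(s > tol * s.max()))
    V = Vt.T
    V1, V2 = V[:, :r], V[:, r:]
    beta1 = (U.T @ y)[:r] / s[:r]                      # beta_1^* = D^{-1} (U^T y)_{1..r}
    beta2 = np.linalg.pinv(C @ V2) @ (d - C @ V1 @ beta1)
    a_star = V1 @ beta1 + V2 @ beta2
    lam = np.linalg.pinv(C.T) @ (B.T @ (B @ a_star - y))   # Lagrange multipliers
    return a_star, lam
</syntaxhighlight>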
Summary of algorithm
- Use SVD: [math]\displaystyle{ B = U \Sigma V^T }[/math]
- Solve [math]\displaystyle{ C (V \beta^*) = d }[/math] in the row-space.
- [math]\displaystyle{ a^* = V \beta^* }[/math]
- Recover [math]\displaystyle{ \lambda^* }[/math] via [math]\displaystyle{ C^T \lambda^* = B^T ( B a^* - y) }[/math]
Note: I suspect there is an error in the above algorithm. [math]\displaystyle{ \beta_1^* }[/math] should not depend only on [math]\displaystyle{ B }[/math] and [math]\displaystyle{ y }[/math], yet in these formulas it does.
Block matrix approach
This is a different approach. Again, start with the SVD of B:
- [math]\displaystyle{ B = U \Sigma V^T }[/math], with rank [math]\displaystyle{ r }[/math].
Write out the KKT block system:
- [math]\displaystyle{ \left( \begin{array}{cc} \Sigma^T \Sigma & -(C V)^T \\ C V & 0 \end{array}\right) \left( \begin{array}{c} \beta^* \\ \lambda \end{array}\right) = \left( \begin{array}{c} \Sigma^T U^T y \\ d \end{array}\right) }[/math]
A Moore-Penrose inverse of the block matrix should work here; then recover [math]\displaystyle{ a^* = V \beta^* }[/math].
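A sketch of this rotated KKT solve in numpy (my own function name; it builds the full [math]\displaystyle{ \Sigma }[/math] including its zero block and applies a pseudo-inverse, as suggested above):

<syntaxhighlight lang="python">
import numpy as np

def constrained_fit_kkt_rotated(B, y, C, d):
    """Solve the KKT system in the SVD-rotated coordinates beta = V^T a."""
    n, k = B.shape
    m = C.shape[0]
    U, s, Vt = np.linalg.svd(B, full_matrices=True)
    V = Vt.T
    Sigma = np.zeros((n, k))
    Sigma[:len(s), :len(s)] = np.diag(s)               # Sigma = [[D, 0], [0, 0]]
    CV = C @ V
    KKT = np.block([[Sigma.T @ Sigma, -CV.T],
                    [CV,              np.zeros((m, m))]])
    rhs = np.concatenate([Sigma.T @ U.T @ y, d])
    sol = np.linalg.pinv(KKT) @ rhs
    beta, lam = sol[:k], sol[k:]
    return V @ beta, lam                               # a^* = V beta^*
</syntaxhighlight>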
Derivation
Use [math]\displaystyle{ \mathcal{L}(a,\lambda) = {1\over 2}||B a- y||^2 - \lambda^T (C a - d) }[/math]
- Stationarity:
- [math]\displaystyle{ \nabla_a \mathcal{L}(a,\lambda) = B^T (B a - y) - C^T \lambda = 0 }[/math]
- [math]\displaystyle{ B^T B a - C^T \lambda = B^T y }[/math]
- [math]\displaystyle{ V \Sigma^T \Sigma V^T a - C^T \lambda = V \Sigma^T U^T y }[/math]
- [math]\displaystyle{ \Sigma^T \Sigma (V^T a) - V^T C^T \lambda = \Sigma^T U^T y }[/math]
- [math]\displaystyle{ \Sigma^T \Sigma \beta - (C V)^T \lambda = \Sigma^T U^T y }[/math]
- Feasibility: [math]\displaystyle{ C a = d }[/math]
- [math]\displaystyle{ C V V^T a = d }[/math]
- [math]\displaystyle{ C V \beta = d }[/math]
More direct block matrix approach
With the block matrix approach, it seems we don't actually need to solve in the rotated space. Why not just solve directly:
- [math]\displaystyle{ \left( \begin{array}{cc} B^T B & -C^T \\ C & 0 \end{array}\right) \left( \begin{array}{c} a^* \\ \lambda \end{array}\right) = \left( \begin{array}{c} B^T y \\ d \end{array}\right) }[/math]
This can be solved with the pseudo-inverse, i.e.
- [math]\displaystyle{ \left( \begin{array}{c} a^* \\ \lambda \end{array}\right) = \left( \begin{array}{cc} B^T B & -C^T \\ C & 0 \end{array}\right)^+ \left( \begin{array}{c} B^T y \\ d \end{array}\right) }[/math]
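A sketch of this direct solve (again my own naming; it just assembles the block system and applies numpy's pseudo-inverse):

<syntaxhighlight lang="python">
import numpy as np

def constrained_fit_kkt_direct(B, y, C, d):
    """Solve the KKT block system in the original coordinates with a pseudo-inverse."""
    k = B.shape[1]
    m = C.shape[0]
    KKT = np.block([[B.T @ B, -C.T],
                    [C,        np.zeros((m, m))]])
    rhs = np.concatenate([B.T @ y, d])
    sol = np.linalg.pinv(KKT) @ rhs
    return sol[:k], sol[k:]                            # a^*, lambda
</syntaxhighlight>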
What happens when we solve this using a Schur complement approach, so as to take advantage of the block structure and the potential reuse of [math]\displaystyle{ (B^T B)^{-1} }[/math]? We get:
- [math]\displaystyle{ \lambda = \left( C (B^T B)^{-1} C^T \right)^+ ( d - C (B^T B)^{-1} B^T y ) }[/math]
- [math]\displaystyle{ a = (B^T B)^{-1} B^T y + (B^T B)^{-1} C^T \lambda = a_{ols} + (B^T B)^{-1} C^T \lambda }[/math]
These are the Theil and van der Panne equations again (with the opposite sign convention for [math]\displaystyle{ \lambda }[/math]), which I've already determined don't work when you replace [math]\displaystyle{ (B^T B)^{-1} }[/math] with a pseudo-inverse.
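For completeness, a hedged test-harness sketch (random problem sizes and seed are arbitrary) that compares the constraint residual and achieved loss for the solver sketches defined earlier on this page; it makes no claim about which approach wins:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 5, 8, 2                        # k > n, so B^T B is singular
B = rng.standard_normal((n, k))
y = rng.standard_normal(n)
C = rng.standard_normal((m, k))
d = rng.standard_normal(m)

for solver in (constrained_fit_via_nullspace,
               constrained_fit_kkt_rotated,
               constrained_fit_kkt_direct):
    a, _ = solver(B, y, C, d)
    print(f"{solver.__name__}: "
          f"constraint residual = {np.linalg.norm(C @ a - d):.2e}, "
          f"loss = {np.linalg.norm(B @ a - y)**2:.4f}")
</syntaxhighlight>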