[[category:Data Analysis Functions]]
 
= Regression( Y,B,I,K ) =
Generalized Linear Regression.
(see user guide)
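For example, a minimal straight-line fit might be set up as follows. The index and variable names here are hypothetical, chosen only to illustrate the call; the basis ''B'' must be indexed by both ''I'' and ''K''.

Index I := 1..10                          { ten data points }
Index K := ['b', 'm']                     { two basis terms: intercept and slope }
Variable X := I                           { illustrative x-values }
Variable Y := 3 + 2*X + Normal(0, 0.5)    { noisy observations of a line }
Variable B := If K = 'b' Then 1 Else X    { basis array, indexed by I and K }
Variable C := Regression(Y, B, I, K)      { fitted coefficients, indexed by K }

With data like this, ''C[K='m']'' should come out near 2 and ''C[K='b']'' near 3.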
= Details =
(too detailed for user guide)
== Underconstrained Problems ==
When you perform a regression fit, the number of data points, Size(I), should be greater than the number of basis terms, Size(K).  When the number of data points is less than the number of basis terms, the problem is under-constrained.  Provided that no two data points have the same basis values but different Y values, the fit curve in an under-constrained problem passes exactly through every data point; however, the coefficients in that case are not unique.  In the under-constrained case, Analytica issues a warning, since this most likely indicates that the I and K index parameters were inadvertently swapped.  If you ignore the warning, embed the call within an [[IgnoreWarnings]] function call, or have the "Show Result Warnings" preference disabled, a set of coefficients that passes through the existing data points is chosen arbitrarily and returned.  The algorithm used is computationally inefficient in the under-constrained case where Size(I) << Size(K), that is, when the number of basis terms is much larger than the number of data points.  If you know your problem is highly under-constrained, you probably do not intend to use a regression.
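Continuing the hypothetical example above, a simple guard against an under-constrained (or accidentally swapped) call might look like the following; returning Null here is only illustrative.

Variable C :=
  If Size(I) > Size(K)
  Then Regression(Y, B, I, K)
  Else Null    { under-constrained: fewer data points than basis terms; check whether I and K were swapped }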
== Secondary Statistics ==
The Regression function computes the coefficients of the best-fit curve, but it does not compute secondary statistics such as parameter covariances, the correlation (R-value), or goodness-of-fit.
In a generalized linear regression, the goodness of fit is often characterized using a Chi-squared statistic.  The statistic is defined slightly differently depending on whether or not the measurement error of Y is known.  If it is known, with a variance of var_y, then the chi-squared goodness-of-fit statistic is computed as:
Sum( (Y - Sum(C*B, K))^2 / var_y, I )
If the measurement error of Y is not known, the statistic is computed as:
Sum( (Y - Sum(C*B, K))^2, I )
where in both cases C is the vector of coefficients returned by the Regression(Y,B,I,K) function call.  Denoting the above statistic as chi2, the probability that a fit this poor would occur by chance is given by:
GammaI( size(I)/2 - 1, chi2 / 2 )
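Putting these pieces together, the goodness-of-fit calculation might be set up as in the following sketch.  The variable names and ''Var_y'' are hypothetical; use the second form of the statistic when the measurement variance is unknown.

Variable C := Regression(Y, B, I, K)
Variable Chi2 := Sum((Y - Sum(C*B, K))^2 / Var_y, I)    { with known measurement variance Var_y }
Variable FitProb := GammaI(Size(I)/2 - 1, Chi2/2)       { probability that a fit this poor arises by chance }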
Another set of secondary statistics is the covariances of the fitted parameters.  Again, let ''C := Regression(Y,B,I,K)''.  The covariance is an estimate of the amount of uncertainty in the parameter estimates given the available data.  As the number of data points increases (for a given basis), the variances and covariances tend to decrease.  To compute the covariances, a copy of index K is required (since the covariance matrix is square in K); hence, you need to create a new index node defined as:
Index K2 := CopyIndex(K)
The covariances are then computed as:

Invert(MatrixMultiply(B, K, I, B[K = K2], I, K2), K, K2)
The diagonal elements of this matrix give the variance in each parameter.  Because only a finite number of data points is available, the parameter estimates may be off somewhat due to random chance, even if the linear-model assumption is correct; these variances indicate how much error to expect from random chance at the given data set size.
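For example, using the hypothetical names from above, the covariance matrix and the per-parameter variances along its diagonal might be computed as:

Index K2 := CopyIndex(K)
Variable CovMatrix := Invert(MatrixMultiply(B, K, I, B[K = K2], I, K2), K, K2)
Variable ParamVariance := CovMatrix[K2 = K]    { diagonal elements: variance of each fitted coefficient }

The subscript ''CovMatrix[K2 = K]'' selects the diagonal, yielding an array indexed only by K.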
= See Also =
* [[Logistic_Regression]]
* [[Probit_Regression]]
