Regression(Y, B, I, K)

Generalized Linear Regression.

  • Y : The dependent variable (output value) for each data point. Indexed by I.
  • B : Basis values (independent variables). Indexed by I and K.
  • I : Index for data points. Each element of I corresponds to a different data point.
  • K : Basis index. The result has one coefficient for each element of K.

Finds a set of coefficients, C, fitting the data so that Y at each data point is estimated by

Sum(C*B,K)

(see user guide for basic usage)
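
As an illustration, here is a minimal sketch of a simple linear fit, Y = a + b*X, set up for Regression. The data (X, Y), the index elements, and all node names here are hypothetical, chosen only to show the indexing pattern: the basis has a constant term for the intercept and X itself for the slope.

Index I := 1..20                                 { hypothetical data-point index }
Index K := ['intercept', 'slope']                { basis index with two terms }
Variable X := I                                  { hypothetical input data, indexed by I }
Variable Y := 3 + 2*X                            { hypothetical output data: intercept 3, slope 2 }
Variable B := If K = 'intercept' Then 1 Else X   { basis values, indexed by I and K }
Variable C := Regression(Y, B, I, K)             { fitted coefficients, indexed by K: 3 and 2 }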

Details

(too detailed for user guide)

Underconstrained Problems

When you do a regression fit, the number of data points, size(I), should be greater than the number of basis terms, size(K). When the number of data points is less than the number of basis terms, the problem is underconstrained. Provided that no two data points have the same basis values but different Y values, the fitted curve in an underconstrained problem passes exactly through every data point; however, the coefficients in that case are not unique. In the underconstrained case, Analytica issues a warning, since this most likely indicates that the I and K index parameters were inadvertently swapped. If you ignore the warning, embed the call within an IgnoreWarnings function call, or have the "Show Result Warnings" preference disabled, a set of coefficients that passes through the existing data points is chosen arbitrarily and returned. The algorithm used is computationally inefficient in the underconstrained case where size(I) << size(K), i.e., where the number of basis terms is much larger than the number of data points. If you know your problem is highly underconstrained, then you probably do not intend to use a regression.
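
For example, the following hypothetical setup is underconstrained: it fits three polynomial basis terms to only two data points, so Analytica would issue the warning and the returned coefficients would be only one of many sets that pass through the data.

Index I := 1..2                        { only two data points }
Index K := 0..2                        { three polynomial basis terms }
Variable B := I^K                      { basis: 1, I, I^2 at each data point }
Variable Y := 5 + 2*I
Variable C := Regression(Y, B, I, K)   { Size(I) < Size(K): warns, coefficients not unique }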

Secondary Statistics

The Regression function computes the coefficients for the best-fit curve, but it does not compute secondary statistics such as parameter covariances, R-value correlation, or goodness-of-fit.

In what follows, we'll assume that Variable C holds the computed regression coefficients, e.g.

Variable C := Regression(Y,B,I,K)

For each data point, the predicted expected value (from the regression) is given by

Sum( C*B, K )

However, this prediction provides only the expected value. The RegressionDist function may be used to obtain a distribution over C, and hence a probabilistic estimate of Y.

If you know the measurement noise in advance, then S is given and may (optionally) be indexed by I if the measurement noise varies by data point. If you do not know S in advance, then S can be obtained from the RegressionNoise function as:

RegressionNoise(Y, B, I, K, C)

Alternatively, S may be estimated as

var y2 := Sum(C*B, K);
Sqrt( Sum( (Y - y2)^2, I ) / (Size(I) - Size(K)) )

Estimating S in either of these ways assumes that the noise level is the same for each data point.

In a generalized linear regression, the goodness of fit, or merit, is often characterized using a Chi-squared statistic, computed as:

Sum( (Y-Sum(C*B,K))^2 / S^2, I )

Denoting the above as chi2, the probability that a fit this poor would occur by chance is given by:

GammaI( Size(I)/2 - 1, chi2 / 2 )

This metric can be conveniently obtained using the RegressionFitProb function.
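
Putting the above pieces together, a sketch of computing the fit probability directly from C (the variable names S, Chi2, and FitProb are illustrative, and the expressions are the same ones given above):

Variable S := RegressionNoise(Y, B, I, K, C)
Variable Chi2 := Sum( (Y - Sum(C*B, K))^2 / S^2, I )
Variable FitProb := GammaI( Size(I)/2 - 1, Chi2/2 )   { probability of a fit this poor arising by chance }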

Another set of secondary statistics is the covariances of the fitted parameters. The covariance is an estimate of the amount of uncertainty in a parameter estimate given the available data. As the number of data points increases (for a given basis), the variances and covariances tend to decrease. To compute the covariances, a copy of Index K is required (since the covariance matrix is square in K); hence, you need to create a new index node defined as:

Index K2 := CopyIndex(K)

The covariances are then computed as:

Invert( Sum(B * B[K=K2] / S^2, I ), K, K2 )

The diagonal elements of this matrix give the variance in each parameter. Since there is only a finite number of samples, the parameter estimate may be off a bit due to random chance, even if the linear model assumption is correct; this variance indicates how much error exists from random chance at the given data set size.
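
As a sketch (assuming S and Index K2 as defined above, and assuming that subscripting the covariance matrix with K2 = K extracts its diagonal; the names CV_C, Var_C, and SD_C are illustrative), the parameter variances and standard errors can be pulled out like this:

Variable CV_C := Invert( Sum(B * B[K=K2] / S^2, I), K, K2 )   { covariance matrix of the coefficients }
Variable Var_C := CV_C[K2 = K]                                { diagonal: variance of each coefficient }
Variable SD_C := Sqrt(Var_C)                                  { standard error of each coefficient }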

With S and CV_C (the covariance of the parameters C, as computed above), a distribution on the predicted value of Y can be obtained for a given input X (indexed by K), using:

 Variable Coef := Gaussian( C, CV_C, K, K2 )
 Variable Y_predicted := Sum(Coef * X, K ) + Normal(0,S)

The RegressionDist function returns the uncertain coefficients directly, and is the more convenient function to use when you wish to estimate the uncertainty in your predicted value.
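
For example, assuming RegressionDist takes the same Y, B, I, K parameters as Regression (and with S and X as above), a sketch of a probabilistic prediction would be:

Variable C_unc := RegressionDist(Y, B, I, K)               { uncertain coefficients, indexed by K }
Variable Y_predicted := Sum(C_unc * X, K) + Normal(0, S)   { predictive distribution for Y }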

Numeric Limitations

The Regression function encounters numeric imprecision, and therefore returns poor results, when the numbers in the basis are very large. If your numbers routinely exceed 1M for any particular basis term, it is best to scale those values (e.g., by dividing by 1M) before applying Regression; then, if you need to, you can rescale the resulting coefficients afterwards.

Numeric imprecision from large numbers has been reported in models where the values were in the range of 100M. To play it safe, we recommend 1M as the point at which you should consider scaling.
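
For instance, if the basis values are very large, a sketch of the scaling approach (using Analytica's 1M number suffix; the names B_scaled, C_scaled, and C_rescaled are illustrative) is:

Variable B_scaled := B / 1M                          { rescale large basis values before fitting }
Variable C_scaled := Regression(Y, B_scaled, I, K)
Variable C_rescaled := C_scaled / 1M                 { coefficients in terms of the original, unscaled basis }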

See Also
