Logistic Regression

[[Category:Doc Status D]] <!-- For Lumina use, do not change -->
[[Category: Generalized Regression library functions]]
[[Category:Data Analysis Functions]]
 
''Note: The [[Logistic_Regression]] function is obsolete. As of [[Analytica 4.5]], it has been superseded by the built-in function [[LogisticRegression]]; please see [[LogisticRegression]].''
  
Logistic regression is a technique for predicting a [[Bernoulli]] (i.e., 0,1-valued) random variable from a set of continuous independent variables.  See the [http://en.wikipedia.org/wiki/Logistic_regression Wikipedia article on Logistic regression] for a simple description. Another generalized linear model that can be used for this purpose is the [[Probit_Regression]] model. The two differ in functional form: logistic regression uses a logit function to link the linear predictor to the predicted probability, while the probit model uses a cumulative normal distribution for the same purpose.

The old [[Logistic_Regression]] function (with the underscore) is implemented as a [[User-Defined Function]] in the [[media:Generalized Regression.ana|Generalized Regression library]]. It requires the Analytica [[Optimizer]] edition to use, and it still exists to support legacy models.
  
The newer [[LogisticRegression]] function is available in all editions of Analytica, in release [[Analytica 4.5]] and up.

To convert a legacy model to the newer version, simply remove the underscore from the function name -- the parameter order is the same.
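
For example, a legacy call such as this one (using the variable names from the example below):

:<code>Logistic_Regression(Treatment_effective, Test_results, Patient_ID, Lab_test)</code>

becomes simply:

:<code>LogisticRegression(Treatment_effective, Test_results, Patient_ID, Lab_test)</code>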
  
== Logistic_Regression(Y, B, I, K) ==

(''Requires Analytica Optimizer'')

The [[Logistic_Regression]] function returns the best-fit coefficients, ''c'', for a model of the form
:<math>\operatorname{logit}(p_i) = \ln\left( {{p_i}\over{1-p_i}} \right) = \sum_k c_k B_{i,k}</math>

given a data set basis <code>B</code> and classifications of 0 or 1 in <code>Y</code>.  <code>B</code> is indexed by <code>I</code> and <code>K</code>, while <code>Y</code> is indexed by <code>I</code>.  The fitted model predicts a classification of 1 with probability <math>p_i</math> and 0 with probability <math>1-p_i</math> for each instance <math>i</math>.
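
Equivalently, inverting the logit gives the predicted probability directly, which is what the <code>InvLogit</code> function used in the example below computes:

:<math>p_i = {1 \over {1 + e^{-\sum_k c_k B_{i,k}}}}</math>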

The syntax is the same as for the [[Regression]] function.  The basis may be of a generalized linear form; that is, each term in the basis may be an arbitrary non-linear function of your data, but the logit of the prediction is a linear combination of these basis terms.
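
For example, to fit a model whose logit is quadratic in a single input variable ''X'' (the names <code>X</code>, <code>K2</code>, and <code>B2</code> here are purely illustrative, not part of the library), you could construct the basis as:

:<code>Index K2 := ['const', 'linear', 'squared']</code>

:<code>Variable B2 := If K2 = 'const' then 1 else if K2 = 'linear' then X else X^2</code>

The fitted logit is then linear in the three coefficients even though it is quadratic in ''X''.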

Once you have used [[Logistic_Regression]] to compute the coefficients for your model, the resulting predictive model returns the probability that a given data point is classified as 1.
 
== Library ==
 
[[media:Generalized Regression.ana|Generalized Regression.ana]]
 
== Example ==

Suppose you want to predict the probability that a particular treatment for diabetes is effective given several lab test results.  Data is collected for patients who have undergone the treatment, as follows, where the variable ''Test_results'' contains lab test data and ''Treatment_effective'' is set to 1 if the treatment was effective for that patient and 0 if it was not:
 
[[image:DiabetesData.jpg]]
 
[[Image:DiabetesOutcome.jpg]]
 
Using the data directly as the regression basis, the logistic regression coefficients are computed using:

:<code>Variable c := Logistic_Regression(Treatment_effective, Test_results, Patient_ID, Lab_test)</code>

We can obtain the predicted probability for each patient in the training set using:

:<code>Variable Prob_Effective := InvLogit(Sum(c*Test_results, Lab_Test))</code>
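
If you need a hard 0/1 classification rather than a probability, one simple rule (an illustrative convention, not something the library prescribes) is to threshold the predicted probability at 0.5:

:<code>Variable Predicted_class := Prob_Effective > 0.5</code>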

If we have lab tests for a new patient, say <code>New_patient_tests</code>, in the form of a vector indexed by <code>Lab_test</code>, we can predict the probability that treatment will be effective for that patient using:

:<code>InvLogit(Sum(c*New_patient_tests, Lab_test))</code>

It is often possible to improve the predictions dramatically by including a constant (y-offset) term in the linear basis.  Using the test data directly as the regression basis forces the linear combination to pass through the origin.  To incorporate the y-offset term, we add a column to the basis that has the constant value 1 across all ''Patient_ID''s:
 
:<code>Index K := Concat([1], Lab_Test)</code>
 
:<code>Variable B := if K = 1 then 1 else Test_results[Lab_test = K]</code>
 
:<code>Variable C2 := Logistic_Regression(Treatment_effective, B, Patient_ID, K)</code>
 
:<code>Variable Prob_Effective2 := InvLogit(Sum(C2*B, K))</code>

To get a rough idea of the improvement gained by adding the extra y-intercept term to the basis, you can compare the log-likelihood of the training data under each model, e.g.
 
:<code>Sum(Ln(If Treatment_effective then Prob_Effective else 1 - Prob_Effective), Patient_ID)</code>
 
vs.
 
:<code>Sum(Ln(If Treatment_effective then Prob_Effective2 else 1 - Prob_Effective2), Patient_ID)</code>

You generally need to use the log-likelihood, rather than the likelihood itself, to avoid numeric underflow: a product of hundreds of probabilities, each less than 1, can easily fall below the smallest representable floating-point number, whereas the corresponding sum of logarithms stays comfortably in range.
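
For instance, even a model that assigns probability 0.5 to every observed outcome gives a raw likelihood for 145 patients of

:<math>0.5^{145} \approx 2.2 \times 10^{-44}</math>

while the corresponding log-likelihood, <math>145 \ln(0.5) \approx -100.5</math>, is easy to represent; with larger data sets or smaller per-record probabilities, the raw product soon underflows to zero.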

In the example data set that the screenshots are taken from, with 145 patients, the basis without the y-intercept led to a log-likelihood of -29.7, while the basis with the constant 1 produced a log-likelihood of 0 (to the numeric precision of the computer).  In the second case the logistic model predicted a probability of 0.0000 or 1.0000 for every patient in the training set, perfectly predicting the treatment effectiveness in every case.  On closer inspection I found, surprisingly, that the data was linearly separable, and that the logistic rise had become a near step-function (all coefficients were very large).

Although adding the y-offset to the basis in this case led to a substantially better fit to the training data, the result is obviously far less satisfying -- for a new patient, the model will now predict a probability of 0.000 or 1.000 for treatment effectiveness, which is clearly a very poor probability estimate.  The phenomenon that this example demonstrates is a very common problem in data analysis and machine learning, generally referred to as ''over-fitting''.  This example is an extreme case, but it makes very clear that any degree of over-fitting effectively leads to over-confidence in the predictions from a logistic regression model.

If you are using logistic regression (or any other data-fitting procedure, for that matter) in a sensitive data analysis task, you should become familiar with the problem of over-fitting and with techniques such as cross-validation and bootstrapping to avoid some of these problems; a minimal sketch of out-of-sample evaluation follows below.
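
For example, assuming a hypothetical 0/1 flag <code>In_training</code>, indexed by <code>Patient_ID</code>, that marks the patients used when re-fitting the coefficients (the flag and this formulation are illustrative, not part of the library), the log-likelihood of the held-out patients alone would be:

:<code>Sum(If In_training then 0 else Ln(If Treatment_effective then Prob_Effective2 else 1 - Prob_Effective2), Patient_ID)</code>

A fit that looks excellent on the training patients but scores poorly on the held-out patients is a strong sign of over-fitting.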
 
== History ==

In [[Analytica 4.5]], this library function [[Logistic_Regression]]() was superseded by the built-in [[LogisticRegression]] function, which does not require the Optimizer edition.

== See Also ==
 
* [[LogisticRegression]]
* [[Probit_Regression]]
* [[Regression]], [[RegressionDist]]: when <code>Y</code> is continuous, with normally distributed error
* [[Poisson_Regression]]: when <code>Y</code> models a count (the number of events that occur)
 
