Logistic Regression
Requires Analytica Optimizer
Logistic regression is a technique for predicting a Bernoulli (i.e., 0,1-valued) random variable from a set of continuous independent variables. See the Wikipedia article on Logistic regression for a simple description. Another generalized linear model that can be used for this purpose is the Probit_Regression model. The two differ in functional form: logistic regression uses a logit function to link the linear predictor to the predicted probability, while the probit model uses the cumulative normal distribution for the same purpose.
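Concretely, the two link functions are (standard definitions, where η denotes the linear predictor):

- [math]\displaystyle{ \textrm{logit:}\ \ln\left( {p \over {1-p}} \right) = \eta, \qquad \textrm{probit:}\ \Phi^{-1}(p) = \eta }[/math]

where Φ is the cumulative distribution function of the standard normal.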
Logistic_Regression(Y, B, I, K)
The Logistic_regression function returns the best-fit coefficients, c, for a model of the form
- [math]\displaystyle{ logit(p_i) = ln\left( {{p_i}\over{1-p_i}} \right) = \sum_k c_k B_{i,k} }[/math]
given a data set basis «B» and classifications of 0 or 1 in «Y». «B» is indexed by «I» and «K», while «Y» is indexed by «I». The fitted model predicts a classification of 1 with probability [math]\displaystyle{ p_i }[/math] and 0 with probability [math]\displaystyle{ 1-p_i }[/math] for any instance.
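Solving the logit equation for the probability gives the equivalent sigmoid form of the fitted model:

- [math]\displaystyle{ p_i = {1 \over {1 + e^{-\sum_k c_k B_{i,k}}}} }[/math]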
The syntax is the same as for the Regression function. The basis may be of a generalized linear form, that is, each term in the basis may be an arbitrary non-linear function of your data; however, the logit of the prediction is a linear combination of these.
Once you have used the Logistic_Regression function to compute the coefficients for your model, the predictive model that results returns the probability that a given data point is classified as 1.
Library
Generalized Regression (Generalized Regression.ana)
- Use File → Add Library... to add this library
Example
Suppose you want to predict the probability that a particular treatment for diabetes is effective given several lab test results. Data is collected for patients who have undergone the treatment, as follows, where the variable Test_results contains lab test data and Treatment_effective is set to 0 or 1 depending on whether the treatment was effective for that patient:

(Screenshots: DiabetesData.jpg, DiabetesOutcome.jpg)
Using the data directly as the regression basis, the logistic regression coefficients are computed using:
Variable c := Logistic_regression(Treatment_effective, Test_results, Patient_ID, Lab_test)
We can obtain the predicted probability for each patient in this testing set using:
Variable Prob_Effective := InvLogit(Sum( c*Test_results, Lab_Test))
If we have lab tests for a new patient, say New_Patient_Tests, in the form of a vector indexed by Lab_Test, we can predict the probability that treatment will be effective using:
InvLogit(Sum( c*New_patient_tests, Lab_test))
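InvLogit is simply the inverse of the logit function; if it helps to see it spelled out, the same prediction can be written directly with Exp (a sketch, using the same New_patient_tests vector as above):

1/(1 + Exp(-Sum(c*New_patient_tests, Lab_test)))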
It is often possible to improve the predictions dramatically by including a y-offset term in the linear basis. Using the test data directly as the regression basis forces the linear predictor to pass through the origin. To incorporate the y-offset term, we add a column to the basis having the constant value 1 across all Patient_IDs:
Index K := Concat([1], Lab_Test)
Variable B := if K = 1 then 1 else Test_results[Lab_test = K]
Variable C2 := Logistic_Regression(Treatment_effective, B, Patient_ID, K)
Variable Prob_Effective2 := InvLogit(Sum(C2*B, K))
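To score a new patient against this augmented basis, the constant column must be built the same way. A minimal sketch, assuming the New_patient_tests vector from above (B_new and Prob_New are illustrative names):

Index K := Concat([1], Lab_Test)
Variable B_new := if K = 1 then 1 else New_patient_tests[Lab_test = K]  { constant-1 column plus the new patient's lab tests }
Variable Prob_New := InvLogit(Sum(C2*B_new, K))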
To get a rough idea of the improvement gained by adding the extra y-intercept term to the basis, you can compare the log-likelihood of the training data, e.g.
Sum(Ln(If Treatment_effective then Prob_Effective else 1-Prob_Effective), Patient_ID)
vs.
Sum(Ln(If Treatment_effective then Prob_Effective2 else 1-Prob_Effective2), Patient_ID)
You generally need to use log-likelihood, rather than likelihood, to avoid numeric underflow.
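For instance, the raw likelihood is a product of per-patient probabilities, which can underflow to zero as the number of patients grows, while the sum of logs stays well-scaled. A sketch of the direct, underflow-prone form for comparison:

Product(If Treatment_effective then Prob_Effective else 1-Prob_Effective, Patient_ID)  { likelihood; shrinks toward 0 as patients are added }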
In the example data set that the screenshots are taken from, with 145 patients, the basis without the y-intercept led to a log-likelihood of -29.7, while the basis with the constant 1 produced a log-likelihood of 0 (to the numeric precision of the computer). In the second case the logistic model predicted a probability of 0.0000 or 1.0000 for every patient in the training set, perfectly predicting the treatment effectiveness in every case. On closer inspection I found, surprisingly, that the data was linearly separable, and that the logistic rise had become a near step-function (all coefficients were very large).

Although adding the y-offset to the basis in this case led to a substantially better fit to the training data, the result is far less satisfying -- for a new patient, the model will now predict a probability of 0.000 or 1.000 for treatment effectiveness, which is clearly a very poor probability estimate. The phenomenon that this example demonstrates is a very common problem in data analysis and machine learning, generally referred to as over-fitting. This example is an extreme case, but it makes clear that any degree of over-fitting effectively leads to over-confidence in the predictions of a logistic regression model.

If you are using logistic regression (or any other data-fitting procedure, for that matter) in a sensitive data analysis task, you should become familiar with the problem of over-fitting and with techniques such as cross-validation and bootstrapping to mitigate it.
History
In Analytica 4.5, this library function Logistic_Regression() has been superseded by the built-in LogisticRegression function that does not require the Optimizer edition.
See Also
- LogisticRegression
- Probit_Regression
- Regression: When «Y» is continuous, with normally-distributed error
- RegressionDist: When «Y» is continuous, with normally-distributed error; like Regression, but returns the coefficients as a distribution reflecting estimation uncertainty
- Poisson_Regression: When «Y» models a count (number of events that occur)
- Generalized Regression
- Generalized Regression.ana