Logistic regression functions
You can use the functions in this section to estimate the probability (or probability distribution) of a binary or categorical dependent (output) variable as a function of known values for independent (input) variables. This is similar to linear regression, which predicts the value of a dependent variable as a function of known values for independent variables. Logistic regression is the best known example of generalized regression, so even though the term logistic regression technically refers to one specific form of generalized regression (with probit and Poisson regression being other instances), it is also not uncommon to hear the term logistic regression functions used synonymously with generalized linear regression, as we have done with the title of this section.
The functions LogisticRegression() and ProbitRegression() predict the probability of a Bernoulli (i.e., 0,1-valued) random variable from a set of continuous independent variables. Both functions apply to the same scenarios and accept identical parameters; the final models differ slightly in their functional form. The function PoissonRegression predicts the probability distribution for the number of events that occur, where the dependent (output) variable is a non-negative integer.
All three functions accept the same parameters as the Regression function. As with that function, you construct a basis from your independent variables, and will usually want to include the constant term (a 1 in the basis). In addition to those parameters, these functions also have two parameters, «priorType» and «priorDev», which allow you to specify a Bayesian prior.
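For instance, here is a minimal sketch of building such a basis. The identifiers X1, X2, I and J are hypothetical and not part of the example models below; X1 and X2 are assumed to be independent variables indexed by the data index I.

Index J := ['const', 'x1', 'x2']
Variable B := Array(J, [1, X1, X2])  { the leading 1 is the constant term }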
Bayesian priors
The regression methods in this section are highly susceptible to overfitting. The problem is particularly bad when there are a small number of data points or a large number of basis terms. When your model has been overfit, it will produce probability estimates that are too close to zero or one; in other words, its predictions are overconfident. To avoid overfitting, you will usually want to employ a Bayesian prior by specifying the «priorType» parameter, which recognizes these options:
- 0 = Maximum likelihood (default)
- 1 = Exponential L1 prior
- 2 = Normal L2 prior
Maximum likelihood corresponds to having no prior. The L1 and L2 priors penalize larger coefficient weights. The prior on each coefficient is statistically independent of the others, with the shape of a decaying exponential function in the case of an L1 prior or of a half-normal distribution in the case of an L2 prior.
You can also optionally specify the strength of the prior using the «priorDev» parameter, which specifies the standard deviation of the marginal prior distribution on each coefficient. Cross-validation techniques vary this parameter to find the optimal prior strength for a given problem, which is demonstrated in the Logistic Regression prior selection.ana example model included with Analytica in the Data Analysis example models folder. If you omit the «priorDev» parameter, the function makes a reasonable guess, which will usually be superior to maximum likelihood.
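As a minimal sketch (reusing the hypothetical B, I and J from the basis sketch above, plus a hypothetical 0/1 dependent variable Y indexed by I), both prior parameters are passed like any other optional parameter; the second call omits «priorDev» and relies on the built-in guess:

Variable Coef_L2 := LogisticRegression(Y, B, I, J, priorType: 2, priorDev: 1)
Variable Coef_L1 := LogisticRegression(Y, B, I, J, priorType: 1)  { «priorDev» omitted; a default strength is guessed }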
LogisticRegression(y, b, i, k, priorType, priorDev)
Logistic regression is a technique for predicting a Bernoulli (i.e., 0,1-valued) random variable from a set of continuous independent variables. See the Wikipedia article on logistic regression for a simple description. The LogisticRegression() function finds the parameters «c_k» that fit a model of the form
Logit(p(x)) = \sum_k c_k b_k(x)
where p(x) is the probability that the dependent variable y equals 1 for a data point with independent values x, and b_k(x) is the basis vector for that data point, indexed by «k». To understand how to put together a basis from your independent variables, read the section on the Regression function; the process is exactly the same here. Notice that the righthand side of the Logit equation above is the same as in the standard Regression equation, but the lefthand side applies the Logit function to the probability. The inverse of the Logit function is the Sigmoid function, so once you have obtained the result from LogisticRegression(), you can use it to predict the probability for a new data point using
Sigmoid(Sum(c*B(x), k))

where B(x) is a user-defined function that returns the basis vector for the data point.
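For intuition (a small numerical check, not part of the original text), Logit and Sigmoid undo each other, so a linear score of 0 corresponds to a probability of 0.5:

Sigmoid(0)           { → 0.5 }
Logit(0.8)           { → ln(0.8/0.2) ≈ 1.386 }
Sigmoid(Logit(0.8))  { → 0.8 }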
Example: Suppose you have Height, Weight and Gender information for a set of people, with these three variables indexed by Person. A logistic regression model might estimate the probability that a given person is male based on height and weight, encoded as follows:
Index K := ['b', 'height', 'weight']
Function PersonBasis(height, weight) :=
Array(K, [1, height, weight])
Variable coef :=
LogisticRegression(Gender = 'M', PersonBasis(Height, Weight),
Person, K, priorType: 2)
With these coefficients, the probability that an 85 kg, 170 cm tall person is male is

Sigmoid(Sum(coef*PersonBasis(170, 85), K))
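As a further sketch (not part of the original example), the same coefficients give a predicted probability for every person in the original data, which is useful for checking how well the fitted model is calibrated:

Variable Prob_male := Sigmoid(Sum(coef*PersonBasis(Height, Weight), K))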
ProbitRegression(y, b, i, k, priorType, priorDev)
A probit model relates a continuous vector of independent measurements to the probability of a Bernoulli (i.e., 0, 1-valued) outcome. In econometrics, this model is sometimes called the Harvard model. The ProbitRegression function finds the parameters «c_k» that fit a model of the form
CumNormalInv(p(x)) = \sum_k c_k b_k(x)
where p(x) is the probability that the dependent variable y equals 1 for a data point with independent values x, and b_k(x) is the basis vector for that data point, indexed by «k». To understand how to put together a basis from your independent variables, read the section on the Regression function; the process is exactly the same here. Notice that the righthand side of the ProbitRegression equation is the same as in the standard Regression equation, but the lefthand side applies CumNormalInv (the inverse cumulative normal, or probit, function) to the probability. Its inverse is CumNormal, so once you have obtained the result from ProbitRegression(), you can use it to predict the probability for a new data point using
CumNormal(Sum(c*B(x), k))
where B(x) is a user-defined function that returns the basis vector for the data point.
Example: Suppose you want to predict the probability that a particular treatment for diabetes is effective given several lab test results. Data is collected for patients who have undergone the treatment, where the variable Test_results consists of the lab test data and Treatment_effective is set to 0 or 1 depending on whether the treatment was effective for that patient.
Using the data directly as the regression basis, the probit regression coefficients are computed as follows:
Variable c := ProbitRegression(Treatment_effective,
Test_results, Patient_ID, Lab_test, priorType: 2)
We can obtain the predicted probability for each patient in this data set as follows:
Variable Prob_Effective :=
CumNormal(Sum(c*Test_results, Lab_Test))
If we have lab tests for a new patient, say New_Patient_Tests, in the form of a vector indexed by Lab_Test, we can predict the probability that treatment will be effective as follows:
CumNormal(Sum(c*New_patient_tests, Lab_test))
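A small, hedged extension of this example: because each Prob_Effective value is a Bernoulli probability, summing over patients gives the expected number of patients in this data set for whom the treatment is effective.

Variable Expected_num_effective := Sum(Prob_Effective, Patient_ID)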
PoissonRegression(y, b, i, k, priorType, priorDev)
A Poisson regression model is used to predict the number of events that occur, «y», from a vector of independent data, «b», indexed by «k». The PoissonRegression() function computes the coefficients, «c», from a set of data points («b», «y»), both indexed by «i», such that the expected number of events is predicted by this formula:
E(y) = exp(\sum_k c_k b_k)
The random component in the prediction is assumed to be Poisson-distributed, so that given a new data point «b», the distribution for that point is

Poisson(Exp(Sum(c*b, K)))
Example: You have data collected from surveys on how many times TV viewers were exposed to your ads in a given week, and on how many times you ran ads in each time slot on those weeks. You want to fit a model to this data so that you can predict the distribution of exposures that you can expect in the future for a given allocation of ads to each time slot.
Each data point used for training is one survey response (from one person) taken at the end of one particular week (Training_exposures indexed by Survey_response). The basis includes a constant term plus the number of times ads were run in each time slot that week (Training_basis indexed by Time_slot_k and Survey_response).
Index Time_Slot_K := [1, 'Prime time', 'Late night', 'Day time']
Variable exposure_coefs :=
PoissonRegression(Training_exposures, Training_basis,
Survey_response, Time_slot_K)
To estimate the distribution for how many times a viewer will be exposed to your ads next week if you run 30 ads in prime time, 20 in late night and 50 during the day, use
Decision AdAllocation := Table(Time_slot_K)(1, 30, 20, 50)
Chance ViewersExposed :=
Poisson(Exp(Sum(Exposure_coefs*AdAllocation, Time_slot_K)))
This example can be found in the Example Models / Data Analysis folder in the model file Poisson regression ad exposures.ana.
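As a hedged follow-up (not part of the example model), the mean of this Poisson distribution is the expected number of exposures, and the Poisson probability of zero events, exp(-mean), gives the chance that a viewer sees no ad at all:

Variable Expected_exposures := Exp(Sum(Exposure_coefs*AdAllocation, Time_slot_K))
Variable P_no_exposure := Exp(-Expected_exposures)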
See Also
- Regression analysis
- LogisticRegression
- ProbitRegression
- PoissonRegression
- Uncertainty in regression results