Difference between revisions of "Selecting the Sample Size"
Jhernandez3 (talk | contribs) m |
Revision as of 18:59, 8 January 2016
Each probabilistic value is simulated by computing a random sample of values from the actual probability distribution.
You can control the sampling method and sample size by using Uncertainty Setup. This Appendix briefly discusses how to select a sample size.
Choosing an Appropriate Sample Size
There is a clear trade-off in using a larger sample size when calculating an uncertain variable. When you set the sample size to a large value, the result is less noisy, but the distribution takes longer to compute. For an initial probabilistic calculation, a sample size of 20 to 50 is usually adequate.
How should you choose the sample size «m»? It depends on both the cost of each model run and what you want the results for. An advantage of the Monte Carlo method is that you can apply many standard statistical techniques to estimate the precision of estimates of the output distribution. This is because the generated sample of values for each output variable is a random sample from the true probability distribution for that variable.
Uncertainty about the Mean
First, suppose you are primarily interested in the precision of the mean of your output variable «y». Assume you have a random sample of «m» output values generated by Monte Carlo simulation:
- [math]\displaystyle{ (y_1, y_2, y_3, \ldots, y_m) }[/math] (1)
You can estimate the mean and variance of «y» (and hence its standard deviation «s») using the following equations:
- [math]\displaystyle{ \bar y=\sum_{i=1}^m \frac { y_i }m }[/math] (2)
- [math]\displaystyle{ s^2=\sum_{i=1}^m \frac { (y_i - \bar y)^2} {m - 1} }[/math] (3)
This leads to the following confidence interval with confidence «α», where «c» is the deviation for the unit normal enclosing probability «α»:
- [math]\displaystyle{ \left (\bar y-c\frac {s} {\sqrt m}, \bar y + c\frac {s} {\sqrt m}\right ) }[/math] (4)
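As an illustration, equations (2) through (4) can be sketched in Python. This is a minimal sketch, not part of Analytica: the function name `mean_confidence_interval` is hypothetical, and the default `c = 2.0` assumes the approximate 95% unit-normal deviation used in the examples on this page.

```python
import math
import statistics


def mean_confidence_interval(sample, c=2.0):
    """Return (low, high), the confidence interval for the mean per equation (4).

    `c` is the unit-normal deviation enclosing the desired probability
    (about 2 for 95%). The function name is hypothetical.
    """
    m = len(sample)
    y_bar = statistics.fmean(sample)   # equation (2)
    s = statistics.stdev(sample)       # square root of equation (3), m - 1 denominator
    half_width = c * s / math.sqrt(m)
    return y_bar - half_width, y_bar + half_width
```

The interval width is `2 * c * s / sqrt(m)`, so quadrupling the sample size halves the width.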
Suppose you wish to obtain an estimate of the mean of «y» with an «α» confidence interval smaller than «w» units wide. What sample size do you need? You need to make sure that:
- [math]\displaystyle{ w > 2c \frac {s}{\sqrt m} }[/math] (5)
Or, rearranging the inequality:
- [math]\displaystyle{ m > \left ( \frac {2cs}{w}\right ) ^2 }[/math] (6)
To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance of «y», that is, «s»². You can then use equation (6) to estimate how many samples are needed to reduce the confidence interval to the requisite width «w».
For example, suppose you wish to obtain a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives s = 40. The deviation «c» enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), you get:
- [math]\displaystyle{ m > \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64 }[/math] (7)
So, to get the required precision for the mean, you should set the sample size to about 64.
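The calculation in equation (6) can be sketched as a small Python helper. The name `sample_size_bound_for_mean` is hypothetical, and the helper assumes you already have a pilot estimate of «s» from a small initial run.

```python
def sample_size_bound_for_mean(s, w, c=2.0):
    """Return the bound (2*c*s / w)**2 from equation (6).

    s: standard deviation estimated from a small pilot run
    w: desired maximum width of the confidence interval
    c: unit-normal deviation for the chosen confidence (about 2 for 95%)
    The sample size m should exceed the returned value.
    """
    return (2.0 * c * s / w) ** 2


# Worked example from the text: s = 40, w = 20, c = 2
print(sample_size_bound_for_mean(s=40, w=20))  # 64.0, matching equation (7)
```

Because «s» comes from a small pilot sample, the bound is itself only an estimate, so it may be worth rounding up generously.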
Estimating Confidence Intervals for Fractiles
Another criterion for selecting sample size is the precision of the estimate of the median and other fractiles, or more generally, the precision of the estimated cumulative distribution. Assume that the «m» sample values of «y» are relabeled so that they are in increasing order:
- [math]\displaystyle{ y_1 \le y_2 \le \cdots \le y_m }[/math]
Let «c» be the deviation enclosing probability «α» of the unit normal. Then the following pair of sample values constitutes the «α» confidence interval for the «p»th fractile:
- [math]\displaystyle{ (y_i, y_k) }[/math]
where:
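- [math]\displaystyle{ i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor }[/math] (8)
- [math]\displaystyle{ k = \left \lceil mp + c \sqrt {mp (1-p)} \right \rceil }[/math] (9)
Note: The brackets in equations (8) and (9) mean round down ([math]\displaystyle{ \lfloor\ \rfloor }[/math]) and round up ([math]\displaystyle{ \lceil\ \rceil }[/math]), since they are computing numbers that need to be integers.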
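The index calculation can be sketched in Python, taking the lower index as the floor of mp − c√(mp(1−p)) and the upper index as the ceiling of mp + c√(mp(1−p)). The function names are hypothetical, and clamping the indices to the range 1..m is an added assumption for small samples, not something stated in the text.

```python
import math


def fractile_interval_indices(m, p, c=2.0):
    """Return 1-based indices (i, k) of the order statistics bounding the
    confidence interval for the p-th fractile."""
    spread = c * math.sqrt(m * p * (1.0 - p))
    i = math.floor(m * p - spread)   # lower index, rounded down
    k = math.ceil(m * p + spread)    # upper index, rounded up
    # Assumption: clamp to valid positions when the sample is small.
    return max(i, 1), min(k, m)


def fractile_confidence_interval(sample, p, c=2.0):
    """Return (y_i, y_k) from the sorted sample."""
    ordered = sorted(sample)
    i, k = fractile_interval_indices(len(ordered), p, c)
    return ordered[i - 1], ordered[k - 1]   # convert 1-based to 0-based
```

For instance, with m = 100, p = 0.5 and c = 2, the interval for the median runs from the 40th to the 60th sorted sample value.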


Suppose you want to achieve sufficient precision such that the «α» confidence interval for the pth fractile «Yp» is given by (yi, yk), where «yi» is an estimate of [math]\displaystyle{ Y_{p-\Delta p} }[/math], and «yk» is an estimate of [math]\displaystyle{ Y_{p+\Delta p} }[/math]. In other words, you want «α» confidence of «Yp» being between the sample values used as estimates of the [math]\displaystyle{ (p-\Delta p) }[/math] and [math]\displaystyle{ (p+\Delta p) }[/math] fractiles. What sample size do you need? Ignoring the rounding, you have approximately:
- [math]\displaystyle{ i = m(p-{\Delta}p), k = m(p+{\Delta} p) }[/math](10)
Thus:
- [math]\displaystyle{ k - i = 2m{\Delta}p }[/math](11)
From equations (8) and (9) above, you have:
- [math]\displaystyle{ k - i = 2c\sqrt {mp(1-p)} }[/math](12)
Equating the two expressions for k - i, you obtain:
- [math]\displaystyle{ 2m\Delta p = 2c \sqrt {mp(1-p)} }[/math] (13)
Squaring both sides and solving for «m» gives:
- [math]\displaystyle{ m=p(1- p) \left (\frac {c}{\Delta p} \right )^2 }[/math] (14)
For example, suppose you want to be 95% confident that the estimated fractile Y.90 is between the estimated fractiles Y.85 and Y.95. So you have [math]\displaystyle{ \Delta p }[/math] = 0.05, and c ≈ 2. Substituting the numbers into equation (14), you get:
- [math]\displaystyle{ m = 0.90 \times (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144 }[/math](15)
On the other hand, suppose you want the least precise estimated percentile (the 50th percentile) to have a 95% confidence interval of plus or minus one estimated percentile, that is, [math]\displaystyle{ \Delta p }[/math] = 0.01. Then:
- [math]\displaystyle{ m = 0.5 \times (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10,000 }[/math](16)
These results are completely independent of the shape of the distribution. If you find this an appropriate way to state your requirements for the precision of the estimated distribution, you can determine the sample size before doing any runs to see what sort of distribution it might be.
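Equation (14) is simple enough to sketch as a one-line Python helper. The name `sample_size_for_fractile` is hypothetical, and the default `c = 2.0` again assumes the approximate 95% deviation used in the examples above.

```python
def sample_size_for_fractile(p, delta_p, c=2.0):
    """Return m = p * (1 - p) * (c / delta_p)**2, per equation (14).

    p: the fractile of interest (0 < p < 1)
    delta_p: half-width of the interval, expressed in fractile units
    c: unit-normal deviation for the chosen confidence (about 2 for 95%)
    """
    return p * (1.0 - p) * (c / delta_p) ** 2
```

Calling it with `p=0.90, delta_p=0.05` reproduces the value in equation (15) up to floating-point rounding, and `p=0.5, delta_p=0.01` reproduces equation (16). No pilot run is needed, because the formula does not involve the sample values.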