Difference between revisions of "Selecting the Sample Size"

Revision as of 18:59, 8 January 2016


Each probabilistic value is simulated by computing a random sample of values from the actual probability distribution.

You can control the sampling method and sample size by using Uncertainty Setup. This Appendix briefly discusses how to select a sample size.

Choosing an Appropriate Sample Size

There is a clear trade-off in using a larger sample size when calculating an uncertain variable. A larger sample size gives a less noisy result, but the distribution takes longer to compute. For an initial probabilistic calculation, a sample size of 20 to 50 is usually adequate.

How should you choose the sample size «m»? It depends both on the cost of each model run, and what you want the results for. An advantage of the Monte Carlo method is that you can apply many standard statistical techniques to estimate the precision of estimates of the output distribution. This is because the generated sample of values for each output variable is a random sample from the true probability distribution for that variable.

Uncertainty about the Mean

First, suppose you are primarily interested in the precision of the mean of your output variable «y». Assume you have a random sample of «m» output values generated by Monte Carlo simulation:

[math]\displaystyle{ (y_1, y_2, y_3, \ldots, y_m) }[/math]

(1)

You can estimate the mean and standard deviation of «y» using the following equations:

[math]\displaystyle{ \bar y=\sum_{i=1}^m \frac { y_i }m }[/math]

(2)

[math]\displaystyle{ s^2=\sum_{i=1}^m \frac { (y_i - \bar y)^2} {m - 1} }[/math]

(3)

This leads to the following confidence interval with confidence «α», where «c» is the deviation for the unit normal enclosing probability «α»:

[math]\displaystyle{ \left (\bar y-c\frac {s} {\sqrt m}, \bar y + c\frac {s} {\sqrt m}\right ) }[/math]

(4)
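As a minimal sketch of equations (2) through (4), written in Python rather than Analytica, the interval can be computed directly from a sample. The function name `mean_confidence_interval` and the default `c = 2.0` (the rough 95% value used later in this appendix) are illustrative assumptions, not part of Analytica.

```python
import math

def mean_confidence_interval(sample, c=2.0):
    """Return (lower, upper) bounds for the mean, following equations (2)-(4).

    c is the unit-normal deviation enclosing the desired probability;
    the default 2.0 is the approximate 95% value (an assumption of this sketch).
    """
    m = len(sample)
    if m < 2:
        raise ValueError("need at least two sample values")
    y_bar = sum(sample) / m                                # equation (2)
    s2 = sum((y - y_bar) ** 2 for y in sample) / (m - 1)   # equation (3)
    half_width = c * math.sqrt(s2) / math.sqrt(m)
    return (y_bar - half_width, y_bar + half_width)        # equation (4)
```

The interval is symmetric about the sample mean, and its width shrinks in proportion to the square root of «m».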

Suppose you wish to obtain an estimate of the mean of «y» with an «α» confidence interval smaller than «w» units wide. What sample size do you need? You need to make sure that:

[math]\displaystyle{ w\gt 2c \frac {s}{\sqrt m} }[/math]

(5)

Or, rearranging the inequality:

[math]\displaystyle{ m\gt \left ( \frac {2cs}{w}\right ) ^2 }[/math]

(6)

To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance of «y», that is, «s²». You can then use equation (6) to estimate how many samples are needed to reduce the confidence interval to the requisite width «w».

For example, suppose you wish to obtain a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives s = 40. The deviation «c» enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), you get:

[math]\displaystyle{ m\gt \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64 }[/math]

(7)

So, to get the required precision for the mean, you should set the sample size to about 64.
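The calculation in equation (6) can be sketched in Python as follows. The function name `sample_size_bound_for_mean` is hypothetical, and the default `c = 2.0` is the approximate 95% deviation used in the example above.

```python
def sample_size_bound_for_mean(s, w, c=2.0):
    """Return the bound (2*c*s/w)**2 from equation (6).

    s: standard deviation estimated from a small pilot run
    w: desired full width of the confidence interval
    c: unit-normal deviation for the chosen confidence (assumed 2.0 for ~95%)
    The sample size m should exceed the returned value.
    """
    if w <= 0:
        raise ValueError("w must be positive")
    return (2 * c * s / w) ** 2


# Worked example from the text: s = 40, w = 20, c = 2
bound = sample_size_bound_for_mean(40, 20)
```

Because «s» comes from a small pilot run, the result is itself only an estimate; it is reasonable to treat it as a rough target rather than an exact requirement.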

Estimating Confidence Intervals for Fractiles

Another criterion for selecting sample size is the precision of the estimate of the median and other fractiles, or more generally, the precision of the estimated cumulative distribution. Assume that the «m» sample values of «y» are relabeled so that they are in increasing order:

[math]\displaystyle{ y_1 \le y_2 \le \ldots \le y_m }[/math]

Suppose you want a confidence interval with confidence «α» for the pth fractile «Y_p», where «c» is the deviation enclosing probability «α» of the unit normal. Then the following pair of sample values constitutes the confidence interval:

[math]\displaystyle{ (y_i, y_k) }[/math]

where:

[math]\displaystyle{ i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor }[/math]

(8)

[math]\displaystyle{ k = \left \lceil mp + c \sqrt {mp (1-p)} \right \rceil }[/math]

(9)

Note

The brackets in equations (8) and (9) above mean round down (⌊ ⌋) and round up (⌈ ⌉), respectively, since they are computing numbers that need to be integers.
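The index calculation can be sketched in Python as shown here. This sketch takes the lower index «i» with a minus sign and the upper index «k» with a plus sign, which is what equation (12) requires. The function names and the clamping of the indices to the range 1 to «m» are assumptions of this sketch, added so that the indices always refer to actual sample values.

```python
import math

def fractile_interval_indices(m, p, c=2.0):
    """Return 1-based indices (i, k) of the order statistics bounding
    the confidence interval for the p-th fractile."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    spread = c * math.sqrt(m * p * (1 - p))
    i = math.floor(m * p - spread)   # lower index, rounded down
    k = math.ceil(m * p + spread)    # upper index, rounded up
    # Clamp to valid positions in the sample (an assumption of this sketch).
    return max(i, 1), min(k, m)

def fractile_confidence_interval(sample, p, c=2.0):
    """Return (y_i, y_k) from the sorted sample."""
    ordered = sorted(sample)
    i, k = fractile_interval_indices(len(ordered), p, c)
    return ordered[i - 1], ordered[k - 1]   # convert 1-based to 0-based
```

For a sample of 100 values and the median (p = 0.5) with c = 2, this gives the 40th and 60th ordered values as the interval endpoints.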

Suppose you want to achieve sufficient precision such that the «α» confidence interval for the pth fractile «Y_p» is given by (y_i, y_k), where «y_i» is an estimate of [math]\displaystyle{ Y_{p-\Delta p} }[/math], and «y_k» is an estimate of [math]\displaystyle{ Y_{p+\Delta p} }[/math]. In other words, you want «α» confidence of «Y_p» being between the sample values used as estimates of the [math]\displaystyle{ (p-\Delta p) }[/math] and [math]\displaystyle{ (p+\Delta p) }[/math] fractiles. What sample size do you need? Ignoring the rounding, you have approximately:

[math]\displaystyle{ i = m(p-{\Delta}p), k = m(p+{\Delta} p) }[/math]

(10)

Thus:

[math]\displaystyle{ k - i = 2m{\Delta}p }[/math]

(11)

From equations (8) and (9) above, you have:

[math]\displaystyle{ k - i = 2c\sqrt {mp(1-p)} }[/math]

(12)

Equating the two expressions for k - i, you obtain:

[math]\displaystyle{ 2m\Delta p = 2c \sqrt {mp(1-p)} }[/math]

(13)

[math]\displaystyle{ m=p(1- p) \left (\frac {c}{\Delta p} \right )^2 }[/math]

(14)

For example, suppose you want to be 95% confident that the estimated fractile Y_.90 is between the estimated fractiles Y_.85 and Y_.95. So you have [math]\displaystyle{ \Delta p }[/math] = 0.05, and c ≈ 2. Substituting the numbers into equation (14), you get:

[math]\displaystyle{ m = 0.90 \times (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144 }[/math]

(15)

On the other hand, suppose you want the 95% confidence interval for the least precisely estimated percentile (the 50th percentile) to be plus or minus one estimated percentile, so that [math]\displaystyle{ \Delta p }[/math] = 0.01. Then:

[math]\displaystyle{ m = 0.5 \times (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10,000 }[/math]

(16)
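Equation (14) and the two worked examples can be checked with a short Python sketch. The function name `sample_size_for_fractile` is hypothetical, and the default `c = 2.0` is the approximate 95% deviation used in the text.

```python
def sample_size_for_fractile(p, delta_p, c=2.0):
    """Return m = p(1 - p)(c / delta_p)**2 from equation (14).

    p: the fractile of interest (0 < p < 1)
    delta_p: half-width of the interval, expressed in probability units
    c: unit-normal deviation for the chosen confidence (assumed 2.0 for ~95%)
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if delta_p <= 0:
        raise ValueError("delta_p must be positive")
    return p * (1 - p) * (c / delta_p) ** 2


m_90th = sample_size_for_fractile(0.90, 0.05)    # example of equation (15)
m_median = sample_size_for_fractile(0.50, 0.01)  # example of equation (16)
```

Up to floating-point rounding, the two values are 144 and 10,000. Note that the function needs no sample values as input, only «p», Δp, and «c».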

These results are completely independent of the shape of the distribution. If you find this an appropriate way to state your requirements for the precision of the estimated distribution, you can determine the sample size before doing any runs to see what sort of distribution it might be.

@@ Line 4: / Line 4: @@
 Each probabilistic value is simulated by computing a random sample of values from the actual probability distribution.
-You can control the sampling method and sample size by using '''Uncertainty Setup'''. This appendix briefly discusses how to select a sample size.
+You can control the sampling method and sample size by using '''Uncertainty Setup'''. This Appendix briefly discusses how to select a sample size.
 ==Choosing an Appropriate Sample Size==
@@ Line 10: / Line 10: @@
 There is a clear trade-off for using a larger sample size in calculating an uncertainty variable. When you set the sample size to a large value, the result is less noisy, but it takes a longer time to compute the distribution. For an initial probabilistic calculation, a sample size of 20 to 50 is usually adequate.
-How should you choose the sample size <code>m</code>? It depends both on the cost of each model run, and what you want the results for. An advantage of the Monte Carlo method is that you can apply many standard statistical techniques to estimate the precision of estimates of the output distribution. This is because the generated sample of values for each output variable is a random sample from the true probability distribution for that variable.
+How should you choose the sample size «m»? It depends both on the cost of each model run, and what you want the results for. An advantage of the Monte Carlo method is that you can apply many standard statistical techniques to estimate the precision of estimates of the output distribution. This is because the generated sample of values for each output variable is a random sample from the true probability distribution for that variable.
 ==Uncertainty about the Mean==
-First, suppose you are primarily interested in the precision of the mean of your output variable <code>y</code>. Assume you have a random sample of <code>m</code> output values generated by Monte Carlo simulation:
+First, suppose you are primarily interested in the precision of the mean of your output variable «y». Assume you have a random sample of «m» output values generated by Monte Carlo simulation:
-:(y<sub>1</sub>, y<sub>2</sub>, y<sub>3</sub>, …y<sub>m</sub>) <div class="floatright">(1)</div>
+:<math>(y_1, y_2, y_3, …y_m) </math><div class="floatright">(1)</div>
-You can estimate the mean and standard deviation of y using the following equations:
+You can estimate the mean and standard deviation of «y» using the following equations:
-:<big><math>\vec y=\sum_{i=1}^m \frac { y_i }m</math></big><div class="floatright">(2)</div>
+:<math>\vec y=\sum_{i=1}^m \frac { y_i }m</math><div class="floatright">(2)</div>
-:<big><math>s^2=\sum_{i=1}^m \frac { (y_i - \vec y)^2} {m - 1}</math></big><div class="floatright">(3)</div>
+:<math>s^2=\sum_{i=1}^m \frac { (y_i - \vec y)^2} {m - 1}</math><div class="floatright">(3)</div>
-This leads to the following confidence interval with confidence <code>α</code>, where <code>c</code> is the deviation for the unit normally enclosing probability <code>α</code>:
+This leads to the following confidence interval with confidence «α», where «c» is the deviation for the unit normally enclosing probability «α»:
-:<big><math>\left (\vec y-c\frac {s} {\sqrt m}, \vec y + c\frac {s} {\sqrt m}\right ) </math></big><div class="floatright">(4)</div>
+:<math>\left (\vec y-c\frac {s} {\sqrt m}, \vec y + c\frac {s} {\sqrt m}\right ) </math><div class="floatright">(4)</div>
-Suppose you wish to obtain an estimate of the mean of <code>y</code> with an α confidence interval smaller than <code>w</code> units wide. What sample size do you need? You need to make sure that:
+Suppose you wish to obtain an estimate of the mean of «y» with an «α» confidence interval smaller than «w» units wide. What sample size do you need? You need to make sure that:
-:<big><math>w>2c \frac {s}{\sqrt m}</math></big><div class="floatright">(5)</div>
+:<math>w>2c \frac {s}{\sqrt m}</math><div class="floatright">(5)</div>
 Or, rearranging the inequality:
-:<big><math>m> \left ( \frac {2cs}{w}\right ) ^2</math></big><div class="floatright">(6)</div>
+:<math>m> \left ( \frac {2cs}{w}\right ) ^2</math><div class="floatright">(6)</div>
-To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance of y — that is, s2. You can then use equation (6) to estimate how many samples reduce the confidence interval to the requisite width w.
+To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance of «y» — that is, «s2». You can then use equation (6) to estimate how many samples reduce the confidence interval to the requisite width «w».
-For example, suppose you wish to obtain a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives s = 40. The deviation c enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), you get:
+For example, suppose you wish to obtain a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives ''s = 40''. The deviation «c» enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), you get:
-:<big><math>m> \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64</math></big><div class="floatright">(7)</div><br />
+:<math>m> \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64</math><div class="floatright">(7)</div><br />
 So, to get the required precision for the mean, you should set the sample size to about 64.
@@ Line 46: / Line 46: @@
 == Estimating Confidence Intervals for Fractiles ==
-Another criterion for selecting sample size is the precision of the estimate of the median and other fractiles, or more generally, the precision of the estimated cumulative distribution. Assume that the sample m values of y are relabeled so that they are in increasing order:
+Another criterion for selecting sample size is the precision of the estimate of the median and other fractiles, or more generally, the precision of the estimated cumulative distribution. Assume that the sample «m» values of «y» are relabeled so that they are in increasing order:
-:<big><math> y_1 \le, y_2 \le, ...y_m</math></big>
+:<math> y_1 \le, y_2 \le, ...y_m</math>
-<code>c</code> is the deviation enclosing probability <code>α</code> of the unit normal. Then the following pair of sample values constitutes the confidence interval:
+«c» is the deviation enclosing probability «α» of the unit normal. Then the following pair of sample values constitutes the confidence interval:
-:(<big>y<sub>i</sub>, y<sub>k</sub></big>)
+:<math>(y_i, y_k)</math>
 where:
-:<big><math>i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor</math></big><div class="floatright">(8)</div><br />
+:<math>i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor</math><div class="floatright">(8)</div><br />
-:<big><math>i = \left \lceil mp - c \sqrt {mp (1-p)} \right \rceil</math></big><div class="floatright">(9)</div>
+:<math>i = \left \lceil mp - c \sqrt {mp (1-p)} \right \rceil</math><div class="floatright">(9)</div>
 <Tip Title="Note"> The brackets in equations (8) and (9) above mean round up [[File:lceil.png]] and round down[[File:rfloor.png]], since they are computing numbers that need to be integers.</Tip>
-Suppose you want to achieve sufficient precision such that the <code>a</code> confidence interval for the pth fractile Y<sub>p></sub> is given by (y<sub>1</sub>, y<sub>2</sub>), where y<sub>i</sub> is an estimate of '''<math>{Y_p-{_\Delta}{_p}}</math>''', and y<sub>k</sub> is an estimate of '''<math>{Y_p+{_\Delta}{_p}}</math>'''. In other words, you want <math>\alpha</math> confidence of Y<code>p</code> being between the sample values used as estimates of the (<math>p-{_\Delta}{_p})</math>) and (<math>p+{_\Delta}{_p})</math>) fractiles. What sample size do you need? Ignoring the rounding, you have approximately:
+Suppose you want to achieve sufficient precision such that the «a» confidence interval for the pth fractile «Y<sub>p</sub>» is given by (y<sub>1</sub>, y<sub>2</sub>), where «y<sub>i</sub>» is an estimate of '''<math>{Y_p-{_\Delta}{_p}}</math>''', and «y<sub>k</sub>» is an estimate of '''<math>{Y_p+{_\Delta}{_p}}</math>'''. In other words, you want «<math>\alpha</math>» confidence of «Y<sub>p</sub>» being between the sample values used as estimates of the (<math>p-{_\Delta}{_p})</math>) and (<math>p+{_\Delta}{_p})</math>) fractiles. What sample size do you need? Ignoring the rounding, you have approximately:
-:<big><math>i = m(p-{\Delta}p), k = m(p+{\Delta} p)</math></big><div class="floatright">(10)</div><br />
+:<math>i = m(p-{\Delta}p), k = m(p+{\Delta} p)</math><div class="floatright">(10)</div><br />
 Thus:
-:<big><math>k - i = 2m{\Delta}p</math></big><div class="floatright">(11)</div><br />
+:<math>k - i = 2m{\Delta}p</math><div class="floatright">(11)</div><br />
 From equations (8) and (9) above, you have:
-:<big><math>k - i = 2c\sqrt {mp(1-p)}</math></big><div class="floatright">(12)</div><br />
+:<math>k - i = 2c\sqrt {mp(1-p)}</math><div class="floatright">(12)</div><br />
-Equating the two expressions for k-1 , you obtain:
+Equating the two expressions for ''k - 1'' , you obtain:
-:<big><math>2mp\Delta p = 2c \sqrt {mp(1-p)}</math></big><div class="floatright">(13)</div><br />
+:<math>2mp\Delta p = 2c \sqrt {mp(1-p)}</math><div class="floatright">(13)</div><br />
-:<big><math>m=p(1- p) \left (\frac {c}{\Delta p} \right )^2</math></big><div class="floatright">(14)</div><br />
+:<math>m=p(1- p) \left (\frac {c}{\Delta p} \right )^2</math><div class="floatright">(14)</div><br />
-For example, suppose you want to be 95% confident that the estimated fractile Y<sub>.90</sub> is between the estimated fractiles Y<sub>.85</sub> and Y<sub>.95</sub>. So you have <math>\Delta p</math>=0.05, and c≈2. Substituting the numbers into equation (14), you get:
+For example, suppose you want to be 95% confident that the estimated fractile Y<sub>.90</sub> is between the estimated fractiles Y<sub>.85</sub> and Y<sub>.95</sub>. So you have <math>\Delta p</math> = 0.05, and ''c ≈ 2''. Substituting the numbers into equation (14), you get:
-:<big><math>m = 0.90 \times  (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144</math></big><div class="floatright">(15)</div><br />
+:<math>m = 0.90 \times  (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144</math><div class="floatright">(15)</div><br />
 On the other hand, suppose you want the credible interval for the least precise estimated percentile (the 50th percentile) to have a 95% confidence interval of plus or minus one estimated percentile. Then:
-:<big><math>m = 0.5 \times  (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10,000</math></big><div class="floatright">(16)</div><br />
+:<math>m = 0.5 \times  (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10,000</math><div class="floatright">(16)</div><br />
-These results are completely independent of the shape of the distribution. If you find this an appropriate way to state your requirements for the precision of the estimated distribution, you can determine the sample size before doing any runs to see what sort of distribution it might be.
+These results are completely independent of the shape of the distribution. If you find this an appropriate way to state your requirements for the precision of the estimated distribution, you can determine the sample size before doing ''any'' runs to see what sort of distribution it might be.
 <footer>Appendices / {{PAGENAME}} / Analytica Specifications</footer>