Selecting the Sample Size

<breadcrumbs> Analytica User Guide > Expressing Uncertainty > {{PAGENAME}}</breadcrumbs><br />
Analytica represents the probability distribution on each uncertain value ([[prob value]]) as a random sample from the distribution. The [[Uncertainty view of a result|uncertainty view]] lets you view the sample directly, or as a probability density, cumulative probability distribution, and other views. The accuracy of the representation depends on the sample size: a larger sample size gives a more accurate distribution with less random noise, but takes longer to compute and uses more memory. It's good to start with a small sample size, say the default setting of 100, as you build and test a model, and then increase the sample size when you need more reliable results. Here's a guide to selecting a sample size to meet your needs.
  
=== To set the sample size ===
The default sample size is (usually) 100. You can modify it in [[Uncertainty Setup dialog]], available from the [[Result menu]] or control+U on the keyboard.  [[SampleSize]] is a system variable, which you can use in Definitions.  
  
====Choosing an Appropriate Sample Size====
  
How should you choose the sample size «m»? It depends both on the cost of each model run and on what you want the results for. An advantage of the [[Monte Carlo]] method is that you can apply standard statistical techniques to estimate the precision of estimates of the output distribution, because the generated sample of values for each output variable is a random sample from the true probability distribution for that variable. Although the other [[Sampling method|sampling methods]], [[Latin Hypercube|random Latin hypercube]] and [[Latin Hypercube|median Latin hypercube]] sampling, are not quite random, these statistical techniques still apply fairly well. In fact, since those methods reduce the randomness, you may be able to get the accuracy you want with a slightly smaller sample size than the statistics suggest.
  
====Uncertainty about the Mean====
 
 
First, suppose you are primarily interested in the precision of the [[mean]] of your output variable «y». Assume you have a random sample of «m» output values generated by [[Monte Carlo]] simulation:
 
:<math>(y_1, y_2, y_3, \ldots, y_m)</math><div class="floatright">(1)</div>
 
  
Here's how to estimate the mean and standard deviation of «y»:
  
 
:<math>\bar y=\sum_{i=1}^m \frac { y_i }m</math><div class="floatright">(2)</div>
 
 
:<math>s^2=\sum_{i=1}^m \frac { (y_i - \bar y)^2} {m - 1}</math><div class="floatright">(3)</div>
 
  
This gives us a confidence interval with confidence «α», where «c» is the deviation for the unit normal enclosing probability «α»:
  
 
:<math>\left (\bar y-c\frac {s} {\sqrt m}, \bar y + c\frac {s} {\sqrt m}\right ) </math><div class="floatright">(4)</div>
 
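For instance, here is a minimal sketch of equation (4) in Python (not Analytica code; the sample values are made up purely for illustration), computing a 95% confidence interval for the mean from a Monte Carlo sample:

<pre>
import math
import statistics

# Hypothetical Monte Carlo sample of an output variable y (illustrative values only)
y = [12.1, 15.4, 9.8, 14.2, 11.7, 13.9, 10.5, 16.0, 12.8, 13.3]

m = len(y)
y_bar = statistics.mean(y)   # equation (2): sample mean
s = statistics.stdev(y)      # equation (3): sample standard deviation, (m - 1) divisor

c = 1.96                     # deviation enclosing 95% probability for a unit normal
half_width = c * s / math.sqrt(m)

# Equation (4): the 95% confidence interval for the mean of y
print(f"({y_bar - half_width:.2f}, {y_bar + half_width:.2f})")
</pre>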
  
Suppose you want to estimate the mean of «y» with an «α» confidence interval smaller than «w» units wide. What sample size do you need? You need to make sure that:
  
 
:<math>w>2c \frac {s}{\sqrt m}</math><div class="floatright">(5)</div>
 
  
Rearranging the inequality:
  
 
:<math>m> \left ( \frac {2cs}{w}\right ) ^2</math><div class="floatright">(6)</div>
 
 
To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance of «y» — that is, «s»<sup>2</sup>. You can then use equation (6) to estimate how many samples you need to reduce the confidence interval to the requisite width «w».
 
  
For example, suppose you want a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives ''s = 40''. The deviation «c» enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), we get:
  
 
:<math>m> \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64</math><div class="floatright">(7)</div><br />
 
  
So, to get the desired precision for the mean, you should set the sample size to about 64.
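
Here is the same pilot-run procedure as a short Python sketch (illustrative only, not Analytica code; the pilot values are invented, so the resulting «s» and «m» will differ from the worked example above):

<pre>
import math
import statistics

# Hypothetical pilot Monte Carlo run of 10 values (invented for illustration)
pilot = [205.0, 162.0, 241.0, 188.0, 199.0, 143.0, 230.0, 178.0, 215.0, 251.0]

s = statistics.stdev(pilot)  # initial estimate of the standard deviation of y
c = 2.0                      # deviation enclosing about 95% probability for a unit normal
w = 20.0                     # desired width of the 95% confidence interval for the mean

m_required = math.ceil((2 * c * s / w) ** 2)   # equation (6)
print(f"s = {s:.1f}; set the sample size to at least {m_required}")
</pre>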
  
==== Estimating Confidence Intervals for Fractiles ====
  
 
Another criterion for selecting sample size is the precision of the estimate of the median and other fractiles, or more generally, the precision of the estimated cumulative distribution. Assume that the sample «m» values of «y» are relabeled so that they are in increasing order:
 

:<math>y_1 \le y_2 \le \ldots \le y_m</math>

Let «c» be the deviation enclosing probability «α» of the unit normal. Then the following pair of sample values constitutes a confidence interval with confidence «α» for the «p»th fractile:

:<math>(y_i, y_k)</math>

where:

:<math>i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor</math><div class="floatright">(8)</div>

:<math>k = \left \lceil mp + c \sqrt {mp (1-p)} \right \rceil</math><div class="floatright">(9)</div>
'''Note:''' The brackets in equations (8) and (9) mean round down <math>\lfloor \cdot \rfloor</math> and round up <math>\lceil \cdot \rceil</math>, since they are computing numbers that need to be integers.
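
As a rough illustration, here is a Python sketch (not Analytica code) of equations (8) and (9): given a sorted Monte Carlo sample, it returns the pair of sample values that bracket the «p»th fractile with roughly «α» confidence:

<pre>
import math

def fractile_confidence_interval(y_sorted, p, c=1.96):
    """Confidence interval for the p-th fractile from a sorted sample,
    per equations (8) and (9). c is the unit-normal deviation for the
    desired confidence (about 1.96 for 95%)."""
    m = len(y_sorted)
    half = c * math.sqrt(m * p * (1 - p))
    i = math.floor(m * p - half)   # equation (8), rounded down
    k = math.ceil(m * p + half)    # equation (9), rounded up
    i = max(i, 1)                  # keep the 1-based indices within the sample
    k = min(k, m)
    return y_sorted[i - 1], y_sorted[k - 1]

# Hypothetical usage: a 95% confidence interval for the median of a sample
# lo, hi = fractile_confidence_interval(sorted(sample), p=0.5)
</pre>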

Suppose you want to achieve sufficient precision such that the «α» confidence interval for the «p»th fractile <math>Y_p</math> is given by <math>(y_i, y_k)</math>, where <math>y_i</math> is an estimate of <math>Y_{p-\Delta p}</math>, and <math>y_k</math> is an estimate of <math>Y_{p+\Delta p}</math>. In other words, you want «α» confidence of <math>Y_p</math> being between the sample values used as estimates of the <math>(p-\Delta p)</math> and <math>(p+\Delta p)</math> fractiles. What sample size do you need? Ignoring the rounding, you have approximately:

:<math>i = m(p-{\Delta}p), k = m(p+{\Delta} p)</math><div class="floatright">(10)</div>

Thus:

:<math>k - i = 2m{\Delta}p</math><div class="floatright">(11)</div>

From equations (8) and (9) above, you have:

:<math>k - i = 2c\sqrt {mp(1-p)}</math><div class="floatright">(12)</div>

Equating the two expressions for <math>k - i</math>, you obtain:

:<math>2m\Delta p = 2c \sqrt {mp(1-p)}</math><div class="floatright">(13)</div>

:<math>m=p(1- p) \left (\frac {c}{\Delta p} \right )^2</math><div class="floatright">(14)</div>

For example, suppose you want to be 95% confident that the estimated fractile <math>Y_{0.90}</math> is between the estimated fractiles <math>Y_{0.85}</math> and <math>Y_{0.95}</math>. So you have <math>\Delta p = 0.05</math>, and ''c'' ≈ 2. Substituting these numbers into equation (14), you get:

:<math>m = 0.90 \times (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144</math><div class="floatright">(15)</div>

On the other hand, suppose you want the 95% confidence interval for the least precise estimated percentile (the 50th percentile) to be no wider than plus or minus one percentile — that is, <math>\Delta p = 0.01</math>. Then:

:<math>m = 0.5 \times (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10{,}000</math><div class="floatright">(16)</div>

These results are completely independent of the shape of the distribution. If this is an appropriate way to state your precision requirements for the estimated distribution, you can determine the sample size before doing any model runs, without knowing anything about the shape of the output distribution.
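
A minimal Python sketch of equation (14) (illustrative only, not Analytica code) that reproduces the two examples above:

<pre>
def sample_size_for_fractile(p, delta_p, c=2.0):
    """Sample size needed so that the estimated p-th fractile falls between the
    estimated (p - delta_p) and (p + delta_p) fractiles, with the confidence
    implied by c (the unit-normal deviation; c is about 2 for 95%). Equation (14)."""
    return p * (1 - p) * (c / delta_p) ** 2

print(sample_size_for_fractile(0.90, 0.05))  # equation (15): 144
print(sample_size_for_fractile(0.50, 0.01))  # equation (16): 10,000
</pre>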
