<breadcrumbs> Analytica User Guide > Expressing Uncertainty > {{PAGENAME}}</breadcrumbs><br />
 
Analytica represents the uncertain value ([[prob value]]) of an uncertain quantity as a random sample from its probability distribution -- an array of sample values indexed by [[Run]]. You can use the [[Uncertainty view of a result|uncertainty view]] to show each uncertain quantity as a probability density function (PDF), cumulative distribution function (CDF), selected statistics, fractiles (percentiles), or even the underlying random sample. By default, it uses a sample size of 1000. A larger sample size reduces random noise in the estimated distribution, increasing the accuracy of estimates of the mean, median, and other statistics. But the computation time and memory used go up roughly linearly with the sample size.
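For example, here is a minimal sketch of defining an uncertain quantity and inspecting its sample (the variable <code>X</code> is hypothetical, not from any particular example model):

<pre>
Variable X := Normal(10, 2)   { an uncertain quantity }
Mean(X)      { a statistic estimated from the random sample of X }
Sample(X)    { the underlying sample: an array of SampleSize values indexed by Run }
</pre>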
  
== Changing the sample size ==
  
You can change the sample size from its default of 1000 in the [[Uncertainty Setup dialog]], available from the [[Result menu]] or by pressing control+U ("U" for uncertainty):
 
 
[[File:Uncertainty setup.png|centre|frameless|388x388px]]
 
== Sample Size and Smooth Probability Distributions ==
  
When you are first building a model, it's usually best to start with a moderate sample size, such as the default 1000 -- or even smaller if you have a large model. That way, you can test it out as you build without having to wait for long calculation times. When you are happy with your model and ready to generate results for a client or report, you might then increase the sample size to provide greater accuracy.  
  
How many samples do you need? It depends on what you want the results for. If you just want a rough idea of the range of key results, say the 10th to 90th percentile, a sample of 100 to 1000 may be plenty. You can visualize distributions by selecting one of the [[Uncertainty views]] in the Result window. The cumulative probability distribution shows less noise -- i.e. roughness due to random sampling that does not reflect anything real about the actual distribution -- than the probability density function.
  
If you find that the density view shows too much noise, you can reduce it -- without using a larger sample size -- with the Smoothing option in the Probability density tab of the [[Uncertainty Setup dialog]]:
  
 
[[File:Uncertainty Setup KDE.png|frameless|378x378px]]
 
We generally recommend using the default smoothing factor.  Higher levels of smoothing may give misleading results, e.g. with inappropriately wide tails.
== Function library for choosing a sample size ==
This Analytica library includes functions to estimate the sample size you need to estimate the mean or a fractile (percentile) of a probability distribution with a specified confidence interval. It also contains a function to create a meta sample -- that is, to rerun a Monte Carlo simulation multiple times -- if you want to estimate the variability of a statistic across runs for a given sample size.
Download the library [[Media:Choose sample size.ANA|Choose sample size.ANA]] to help select a sample size that meets your needs.
:[[File:Diagram for model Choose sample size.ANA.png|880px]]
The next sections explain the rationale and statistics underlying the methods in this library.
  
 
=== Convergence and statistics ===
Even experienced risk analysts sometimes resort to "convergence" testing to decide on sample size: They run simulations with increasing sample sizes to see when the results seem to converge to a consistent value. Or they compare multiple runs to see how well they agree. This kind of empirical exploration of sample size can be very time consuming. But it is usually unnecessary.
  
 
The key point is that the results generated by [[Monte Carlo]] simulation are a random sample from the "true" distribution, assuming all the input distributions are well-chosen. This means that you can use simple statistics to select the sample size you need, provided you can specify well-defined goals -- for example, that the true mean of the distribution has a 95% probability of being within 1% of the estimated mean, or that the estimated median (50th percentile) has a 95% probability of lying between the estimated 49th and 51st percentiles. Below we show how to estimate the sample sizes needed to obtain results with the specified accuracy. The Analytica library above includes functions to help you do these calculations.
  
Note that these results assume you are using simple [[Monte Carlo]] simulation. Analytica also offers median and random [[Latin Hypercube|Latin hypercube]] sampling as other options in the [[Uncertainty Setup dialog]]. These often converge a little faster than Monte Carlo -- i.e. they provide more accurate results for a given sample size. But you can use the same statistical methods to estimate the needed sample size, and be confident that it will be adequate whichever sampling method you use.
  
===Estimating sample size for a confidence interval on the Mean===
  
First, suppose you are interested in the precision of the [[mean]] of a result variable «y» that your client cares about. Suppose you have a random sample of «m» values from «y» generated by [[Monte Carlo]] simulation:
  
 
:<math>(y_1, y_2, y_3, …y_m) </math><div class="floatright">(1)</div>
 
Here's how to estimate the mean and standard deviation of «y»:

:<math>\vec y=\sum_{i=1}^m \frac { y_i }m</math><div class="floatright">(2)</div>
 
:<math>s^2=\sum_{i=1}^m \frac { (y_i - \vec y)^2} {m - 1}</math><div class="floatright">(3)</div>
 
  
The Central Limit Theorem says that the sampling distribution of the mean tends to normal for large «m», for any distribution of «y» with finite variance. Given «c», the deviation for the unit normal enclosing probability «α», this gives the confidence interval on the true mean with probability «α»:
  
 
:<math>\left (\vec y-c\frac {s} {\sqrt m}, \vec y + c\frac {s} {\sqrt m}\right ) </math><div class="floatright">(4)</div>
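You can compute this interval directly in Analytica from the current sample -- a minimal sketch, where <code>Y</code> stands for your own uncertain result and <code>alpha</code> for the confidence level (both names are illustrative):

<pre>
{ Approximate alpha confidence interval on the true mean of Y,
  computed from its Monte Carlo sample -- equation (4). }
Local alpha := 0.95;
Local c := -CumNormalInv((1 - alpha)/2);       { about 1.96 for alpha = 0.95 }
Local halfWidth := c * SDeviation(Y) / Sqrt(SampleSize);
[Mean(Y) - halfWidth, Mean(Y) + halfWidth]
</pre>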
 
  
What sample size do you need to estimate the mean of «y» with an «α» confidence interval smaller than «w» units wide? You need to make sure that:
  
 
:<math>w>2c \frac {s}{\sqrt m}</math><div class="floatright">(5)</div>
 
Rearranging the inequality:
 
:<math>m> \left ( \frac {2cs}{w}\right ) ^2</math><div class="floatright">(6)</div>
 
  
To use this, first make a small Monte Carlo run with, say, 10 values to get an initial estimate of the variance of «y» — that is, «s<sup>2</sup>». You can then use equation (6) to estimate how many samples you need to reduce the confidence interval to the desired width «w».
  
 
For example, suppose you want a 95% confidence interval for the mean that is less than 20 units wide. Suppose your initial sample of 10 gives ''s = 40''. The deviation «c» enclosing a probability of 95% for a unit normal is about 2. Substituting these numbers into equation (6), we get:
 
:<math>m> \left ( \frac {2 \times 2 \times 40}{20} \right )^2 = 8^2 = 64</math><div class="floatright">(7)</div>
 
So, to get the desired precision for the mean, you should set the sample size to about 64.
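Here is a minimal Analytica sketch of that calculation, with <code>Y</code>, <code>alpha</code>, and <code>w</code> as illustrative names for your result, confidence level, and desired interval width (the Choose sample size library above packages this kind of calculation as functions):

<pre>
{ Sample size needed so that the alpha confidence interval on the
  mean of Y is narrower than w units -- equation (6). }
Local alpha := 0.95;
Local w := 20;                            { desired width of the confidence interval }
Local c := -CumNormalInv((1 - alpha)/2);
Local s := SDeviation(Y);                 { standard deviation of Y from a small pilot run }
Ceil((2*c*s/w)^2)                         { required sample size m }
</pre>

With ''s'' = 40 this returns about 62, close to the 64 above, which rounds «c» up to 2.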
 
  
=== Estimating Confidence Intervals for Fractiles ===
  
Another way to select a sample size is to obtain a confidence interval for the median or another fractile (percentile) of the probability distribution of an uncertain result of interest. Suppose that we relabel the «m» sample values of «y» so that they are in increasing order:
  
:<math> y_1 \le y_2 \le ...y_m</math>
  
Define «c» as the deviation enclosing probability «α» of the unit normal (<code>c := -[[CumNormalInv]]((1-alpha)/2)</code>). Then these two sample values enclose the confidence interval on the pth percentile «Y<sub>p</sub>»:
  
 
:<math>(y_i, y_k)</math>
 
where:
 
:<math>i = \left \lfloor mp - c \sqrt {mp (1-p)} \right \rfloor</math><div class="floatright">(8)</div><br />
 
  
:<math>k = \left \lceil mp + c \sqrt {mp (1-p)} \right \rceil</math><div class="floatright">(9)</div>
  
 
<Tip Title="Note"> The brackets in equations (8) and (9) above mean round down [[File:rfloor.png]] and round up [[File:lceil.png]], respectively, since they are computing numbers that need to be integers.</Tip>
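As a concrete sketch, you can compute the pair (y<sub>i</sub>, y<sub>k</sub>) directly from the sample; here <code>Y</code>, <code>p</code>, and <code>alpha</code> are illustrative names, not identifiers from the library above:

<pre>
{ Confidence interval on the p-th fractile of Y, using the order
  statistics of equations (8) and (9). }
Local p := 0.5;                                  { fractile of interest, e.g. the median }
Local alpha := 0.95;
Local c := -CumNormalInv((1 - alpha)/2);
Local m := SampleSize;
Local i := Floor(m*p - c*Sqrt(m*p*(1 - p)));     { equation (8) }
Local k := Ceil(m*p + c*Sqrt(m*p*(1 - p)));      { equation (9) }
Local ys := Sample(Y);                           { the Monte Carlo sample, indexed by Run }
Local ySorted := ys[Run = SortIndex(ys, Run)];   { sample values in increasing order }
[ySorted[@Run = i], ySorted[@Run = k]]           { the interval (y_i, y_k) }
</pre>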
 
  
Suppose you want to achieve sufficient precision such that the «<math>\alpha</math>» confidence interval for the pth fractile «Y<sub>p</sub>» is given by (y<sub>i</sub>, y<sub>k</sub>), where «y<sub>i</sub>» is an estimate of the <math>(p-\Delta p)</math> fractile and «y<sub>k</sub>» is an estimate of the <math>(p+\Delta p)</math> fractile. In other words, you want «<math>\alpha</math>» confidence that «Y<sub>p</sub>» lies between the sample values used as estimates of the <math>(p-\Delta p)</math> and <math>(p+\Delta p)</math> fractiles. What sample size do you need? Ignoring the rounding, you have approximately:
  
 
:<math>i = m(p-{\Delta}p), k = m(p+{\Delta} p)</math><div class="floatright">(10)</div><br />
 

Thus:

:<math>k - i = 2m{\Delta}p</math><div class="floatright">(11)</div>

From equations (8) and (9) above, you have:

:<math>k - i = 2c\sqrt {mp(1-p)}</math><div class="floatright">(12)</div>

Equating the two expressions for <math>k - i</math>, you obtain:

:<math>2m{\Delta}p = 2c \sqrt {mp(1-p)}</math><div class="floatright">(13)</div>

:<math>m=p(1- p) \left (\frac {c}{\Delta p} \right )^2</math><div class="floatright">(14)</div>

For example, suppose you want to be 95% confident that the estimated fractile «Y<sub>.90</sub>» is between the estimated fractiles «Y<sub>.85</sub>» and «Y<sub>.95</sub>». So you have <math>\Delta p = 0.05</math> and ''c'' ≈ 2. Substituting the numbers into equation (14), you get:

:<math>m = 0.90 \times (1 - 0.90) \times \left (\frac {2}{0.05} \right )^2 = 144</math><div class="floatright">(15)</div>

On the other hand, suppose you want the least precise estimated percentile (the 50th percentile) to have a 95% confidence interval of plus or minus one percentile. Then:

:<math>m = 0.5 \times (1 - 0.5) \times \left (\frac {2}{0.01} \right )^2 = 10,000</math><div class="floatright">(16)</div>

These results are completely independent of the shape of the distribution. If you find this an appropriate way to state your requirements for the precision of the estimated distribution, you can determine the sample size before doing any runs to see what sort of distribution it might be.
