The Calculate Rates tool calculates a variety of rates. You can use the tool to calculate percentages, ratios, incident rates, and smoothed rates. Smoothed rates can be calculated using the global empirical Bayes, local empirical Bayes, locally weighted average, or locally weighted median methods.
The crude rate method can be used for calculating percentages, ratios, and incident rates. However, if features have a small count or population, a smoothing method is more appropriate. The smoothing methods use information from a feature’s spatial neighbors or from a reference rate to adjust the crude rate of each feature. The tool includes the following smoothing methods:
- Global Empirical Bayes—Adjusts each feature’s crude rate estimate toward a global reference rate. The degree of the adjustment is impacted by the size of the feature’s population. Use this option if you believe a constant underlying risk exists across all the features.
- Local Empirical Bayes—Adjusts each feature’s crude rate estimate toward a local reference rate. Use this option if you believe there is spatial variability in the risk.
- Locally Weighted Average—Determines each feature’s rate using the weighted average rate of its neighborhood.
- Locally Weighted Median—Determines each feature’s rate using the weighted median rate of its neighborhood.
Potential applications
Rates are calculated in the following situations:
- Calculate simple percentages. For example, the percentage of people in the labor force that are unemployed.
- Calculate ratios. For example, the ratio of females to males in each county.
- Calculate incident rates. For example, the rates of esophagus cancer for women. This rate is an estimate of the probability of observing an event per individual in the population during a certain period. It represents the probability that the event occurs over that period, for a randomly selected individual from that population. In this scenario, the rate is a number between 0 and 1 and the counts are a subset of the individuals in the population.
- Measure the intensity of an event's occurrence relative to a reference unit. For example, the intensity of tweets posted per individual during 2020. In this case, the rate can exceed one because the counts are not necessarily a subset of the population.
Background concepts
The Calculate Rates tool calculates a rate using one of the following methods: Crude Rate, Global Empirical Bayes, Local Empirical Bayes, Locally Weighted Average, or Locally Weighted Median. The crude rate is the simplest of the methods and calculates the ratio between the counts of an event and the population over a specific period.
For example, to understand infant mortality rates, you may begin your analysis using the crude rate method to calculate a simple ratio. The chart below depicts infant mortality rate across 728 spatial features by dividing the number of infant deaths in 2020 by the total number of children born in 2020. The data includes a few large cities interspersed with numerous small towns. There is considerable variance in the size of the population, and the number of children born, across spatial features.
The chart is characterized by significant variability in the rates when the number of children born is small and relatively lower variability when the number of children born is large. For areas with fewer than 100 births in a year, rates range from 0 (the lowest possible value) to 0.20. An estimated infant mortality rate of 0.20 or 2 out of every 10 children born is unseen even in impoverished regions of the world. In contrast, there are no rates below 0.02 or above 0.08 in areas with at least 1,000 births. This may suggest that high infant mortality rates are more likely in areas with smaller populations. However, the primary cause is the larger variance of the rates in small areas, resulting in rates that are less reliable than those calculated for more populous regions.
The larger variability is due to the extreme sensitivity of the rates to the size of population, rather than actual differences in death probabilities in the areas. This issue is related to what has been called the small numbers problem. The small numbers problem occurs whenever you compute percentages, ratios, proportions, or rates for geographic areas where the population is sparse or where the event is rare. In these scenarios, small random fluctuations in the event count can cause large fluctuations in the resulting rate values. In the previous example, 15 areas had fewer than 30 births and experienced no deaths. If a single death were to occur in these areas, the rates would surge from 0 to a value between 0.05 and 0.42 (or a 42 percent chance of mortality).
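To make this sensitivity concrete, the following minimal Python sketch compares how a single additional death changes the crude rate in a small area and in a large area. The function names and the birth and death counts are made-up illustration values, not data from the chart.

```python
def crude_rate(count, population):
    """Crude rate: number of events divided by the population."""
    return count / population


def rate_jump(count, population, extra=1):
    """Change in the crude rate if `extra` more events had been observed."""
    return crude_rate(count + extra, population) - crude_rate(count, population)


# Hypothetical areas given as (deaths, births)
small_area = (0, 25)
large_area = (50, 1000)

print(rate_jump(*small_area))  # effect of one extra death among 25 births
print(rate_jump(*large_area))  # effect of one extra death among 1,000 births
```

With these assumed values, one additional death moves the small area's rate by about 0.04 but moves the large area's rate by only about 0.001.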
When calculating rates, you are interested in understanding how the event probability, or in this example, infant mortality, varies across space. However, part of the variation of the crude rates is caused by fluctuations that are not associated with the underlying probability of the event. This variation is larger for features with smaller populations, making their rates less reliable than rates calculated for features with large populations. To overcome this limitation of the crude rate method, you can use one of the other rate calculation methods available in the Calculate Rates tool.
The local empirical Bayes and global empirical Bayes methods both address the variability in the crude rates of features with small populations by adjusting each feature's crude rate toward a reference rate. The extent of the adjustment depends on the size of the population: larger populations experience minimal change between their crude rate estimate and their empirical Bayes estimate, and smaller populations undergo more noticeable adjustments.
Learn more about global empirical Bayes
Learn more about local empirical Bayes
The locally weighted average, locally weighted median, and local empirical Bayes methods apply spatial smoothing to calculate rates. The rate of each feature is calculated using the rate of its neighborhood. Once the neighbors and neighbor weights of each feature are identified, the feature rates are calculated using the Rate Method parameter. The rate of each feature is one of the following:
- Locally Weighted Average—The weighted average rate of its neighborhood
- Locally Weighted Median—The weighted median rate of its neighborhood
Tool inputs
The tool includes several parameters to define and configure the rates.
Rate Fields
The Rate Fields parameter specifies the fields that are used to calculate the rates. The parameter includes a Count Field value, which specifies the field in the input layer with the event counts, and a Population Field value, which specifies the field in the input with population data that corresponds to the selected count field.
You can calculate a single rate or multiple rates. To calculate multiple rates, provide multiple Count Field and Population Field values. For example, if the feature class contains fields for cancer death counts in 2014, 2020, and 2024 and population fields for those same years, you could calculate three cancer death rates. If you calculate multiple rates, the specified Rate Method and Rate Multiplier parameter values will be applied to calculate each rate.
Rate Method
The Rate Method parameter specifies the method used to calculate the rates.
Crude rate
The crude rate estimate is calculated as follows:
$$r_i = \frac{Y_i}{n_i}$$
where $r_i$ is the crude rate, $n_i$ is the population, and $Y_i$ is the count in the $i$th spatial feature. The crude rate is calculated for each feature; however, features with a count less than zero or a population less than or equal to zero will receive a null rate. You can evaluate the reliability of each crude rate estimate using the Confidence interval- upper 95%, Confidence interval- lower 95%, and Reliable fields that are included in the output feature class or table. If many features have wide confidence intervals or large Reliable values, consider using a different rate method.
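The crude rate and its null rule can be sketched in a few lines of Python. This is only an illustration of the rule described above; the function name is hypothetical and `None` stands in for a null value.

```python
from typing import Optional


def crude_rate(count: Optional[float], population: Optional[float]) -> Optional[float]:
    """Return count / population, or None (null) when the inputs are invalid."""
    if count is None or population is None:
        return None
    # A negative count or a population of zero or less yields a null rate
    if count < 0 or population <= 0:
        return None
    return count / population


print(crude_rate(5, 100))   # valid inputs
print(crude_rate(-1, 100))  # negative count -> None
print(crude_rate(3, 0))     # zero population -> None
```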
Global empirical Bayes
The global empirical Bayes rate method estimates the rates by taking a weighted average of the crude rate and a reference rate. The estimate is calculated as follows:
$$\hat{r}_i^{GEB} = C_i\, r_i + (1 - C_i)\, \bar{r}$$
where $i$ is the spatial feature, $\hat{r}_i^{GEB}$ is the global empirical Bayes estimate, $C_i$ is a weight with a value between 0 and 1, $r_i$ is the crude rate estimate of feature $i$, and $\bar{r}$ is the reference rate.
The reference rate is the average rate of all the features. The reference rate is calculated by dividing the sum of all the feature counts by the sum of all the feature populations, as follows:
$$\bar{r} = \frac{\sum_i Y_i}{\sum_i n_i}$$
where $\bar{r}$ is the reference rate, $Y_i$ is the count of the $i$th feature, and $n_i$ is its population. The weight, $C_i$, varies between features and is impacted by the size of the feature’s population. If a feature has a large population, the weight becomes very close to 1 and a feature’s global empirical Bayes rate estimate is almost identical to its crude rate estimate. If the population is small, the crude rate shrinks toward the reference rate because the weight, $C_i$, will be less than 1 and the global empirical Bayes rate estimate will be a weighted average of the crude rate and the reference rate.
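The shrinkage behavior can be illustrated with a short Python sketch. The text above does not give the formula for the weight, so this sketch assumes the method-of-moments weight from Marshall (1991), listed under Additional resources, with a Poisson model; the tool's actual weight calculation may differ. The function and variable names are hypothetical, and the inputs are assumed to be valid (nonnegative counts, positive populations).

```python
def global_empirical_bayes(counts, populations):
    """Shrink crude rates toward the global reference rate.

    Assumption: the weight follows Marshall's (1991) moment estimator.
    """
    total_pop = sum(populations)
    reference = sum(counts) / total_pop  # sum of counts / sum of populations
    crude = [y / n for y, n in zip(counts, populations)]
    mean_pop = total_pop / len(populations)

    # Population-weighted variance of the crude rates around the reference rate
    s2 = sum(n * (r - reference) ** 2 for r, n in zip(crude, populations)) / total_pop
    # Estimated variance of the underlying rates, truncated at zero
    prior_var = max(s2 - reference / mean_pop, 0.0)

    estimates = []
    for r, n in zip(crude, populations):
        sampling_var = reference / n  # larger populations -> smaller sampling variance
        denom = prior_var + sampling_var
        weight = prior_var / denom if denom > 0 else 0.0  # weight between 0 and 1
        estimates.append(weight * r + (1 - weight) * reference)
    return reference, estimates


# Made-up example: two small features and two large features
reference, estimates = global_empirical_bayes([1, 0, 30, 200], [10, 20, 1000, 2000])
print(reference)
print(estimates)
```

In this made-up example, the first and last features share a crude rate of 0.1, but the feature with a population of 10 is pulled much closer to the reference rate than the feature with a population of 2,000.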
If the Rate Method parameter is set to Global Empirical Bayes or Local Empirical Bayes, you must also specify a Probability Distribution parameter value. The probability distribution is the distribution that is assumed to model the observed count values. The Probability Distribution parameter includes two options: Poisson and Binomial. The default option is Poisson, a widely used distribution for modeling rates. This option can be used when estimating the intensity or the probability of an event's occurrence. The binomial probability distribution model assumes the following:
- The event counts (numerator) are a subset of the population (denominator).
- Each event is independent of the other events.
- The probability that an event occurs is the same for every event.
If any of these assumptions are not met, the binomial distribution is not a suitable model. It is recommended that you select the binomial distribution only when these assumptions are satisfied and the event is not rare.
Local empirical Bayes
The local empirical Bayes rate estimate of a feature is the weighted average of the focal feature's crude rate and the weighted average rate of its neighborhood. The local empirical Bayes rate is calculated as follows:
$$\hat{r}_i^{LEB} = C_i\, r_i + (1 - C_i)\, \bar{r}_i$$
where $i$ is the feature of interest, $\hat{r}_i^{LEB}$ is the local empirical Bayes rate estimate, $C_i$ is the weight, $\bar{r}_i$ is the weighted average rate of feature $i$ and its neighbors, and $r_i$ is the crude rate of the focal feature.
The average rate of a feature's neighborhood, $\bar{r}_i$, is determined by the Neighborhood Type and Local Weighting Scheme parameter values. The Neighborhood Type parameter specifies the method that will be used to identify each feature’s neighbors. Each neighbor is assigned a weight based on either the Neighborhood Type or Local Weighting Scheme parameter value. The tool then calculates the locally weighted average rate of each neighborhood as follows:
$$\bar{r}_i = \frac{\sum_j w_{ij}\, r_j}{\sum_j w_{ij}}$$
where $i$ is the feature of interest, $\bar{r}_i$ is the locally weighted average rate at $i$, $j$ is the neighbor, $w_{ij}$ is the weight of neighbor $j$, and $r_j$ is the crude rate estimate of neighbor $j$.
Locally weighted average
The locally weighted average rate method estimates a feature’s rate by calculating the weighted average rate of its neighborhood. To estimate the locally weighted average rates, the tool first applies the Neighborhood Type parameter value to identify each feature’s neighbors. Each neighbor is then assigned a weight based on the Neighborhood Type or Local Weighting Scheme parameter value. The locally weighted average rate of each feature is then calculated as follows:
$$\bar{r}_i = \frac{\sum_j w_{ij}\, r_j}{\sum_j w_{ij}}$$
where $i$ is the feature of interest, $\bar{r}_i$ is the locally weighted average rate at $i$, $j$ is the neighbor, $w_{ij}$ is the weight of neighbor $j$, and $r_j$ is the crude rate estimate of neighbor $j$.
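A minimal Python sketch of the locally weighted average follows. It assumes the neighbors and their weights have already been identified, and that neighbors with a null rate (`None`) are left out of the average, in line with the Number of Non-Null Neighbors output field. The function name and the example values are hypothetical.

```python
def locally_weighted_average(rates, weights):
    """Weighted average of the neighborhood's crude rates, skipping null rates."""
    pairs = [(r, w) for r, w in zip(rates, weights) if r is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None  # no usable neighbors
    return sum(w * r for r, w in pairs) / total_weight


# Focal feature (weight 1.0) plus two neighbors, one of which has a null rate
neighborhood_rates = [0.02, 0.04, None]
neighborhood_weights = [1.0, 0.5, 0.8]
print(locally_weighted_average(neighborhood_rates, neighborhood_weights))
```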
Locally weighted median
The locally weighted median rate method estimates a feature's rate by calculating the weighted median rate of its neighborhood.
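The weighted median can be sketched as follows. This uses one common definition (the smallest rate at which the cumulative weight reaches half of the total weight); how the tool resolves ties or handles null neighbors is not described here, so treat those details as assumptions. The function name is hypothetical.

```python
def locally_weighted_median(rates, weights):
    """Weighted median of the neighborhood's rates (lower weighted median)."""
    # Keep only neighbors with a rate and a positive weight, sorted by rate
    pairs = sorted((r, w) for r, w in zip(rates, weights) if r is not None and w > 0)
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None  # no usable neighbors
    cumulative = 0.0
    for rate, weight in pairs:
        cumulative += weight
        if cumulative >= total_weight / 2:
            return rate


# One extreme neighbor (0.30) has little effect when the weights are equal
print(locally_weighted_median([0.01, 0.02, 0.30], [1.0, 1.0, 1.0]))
```

With equal weights, the extreme value of 0.30 does not change the result, which is why a median is less sensitive to outlying neighbors than an average.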
Local methods
The local methods use a feature’s neighbors to estimate its rate. Neighbors are identified using the specified Neighborhood Type parameter value and then each neighbor is assigned a weight.
Learn more about neighborhood types
Neighbor weights may be unweighted or calculated using a geographical weighting (kernel) function. The Local Weighting Scheme parameter supports the following neighbor weighting options: Unweighted, Gaussian, and Bisquare. Use the weighting scheme that best reflects the influence neighbor event counts have on a focal feature’s event counts. If all the neighbors influence the focal feature, regardless of distance, use the Unweighted option.
If a neighbor’s influence depends on distance, neighbors that are farther from the focal feature should be given less weight and have less influence on the focal feature’s estimated rate. Neighbors that are closer to the focal feature should be given higher weight and have greater influence on the estimated rate. In this case, use the Gaussian or Bisquare option. These options calculate the weights using a kernel, which is a function that determines how quickly weights decrease as distances increase. Both the Gaussian and bisquare kernel functions assign a weight of one to the focal feature and gradually decrease the weight as the distance from the focal feature increases. When comparing a bisquare weighting scheme to a Gaussian weighting scheme with the same neighborhood specifications, weights will decrease more quickly with bisquare.
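The difference between the two kernels can be seen in a small Python sketch. The kernel forms below are commonly used definitions and are an assumption; the tool's exact kernel functions and its interpretation of the bandwidth are not stated in this topic. The function names are hypothetical.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Gaussian kernel: weight 1 at distance 0, decaying smoothly, never reaching 0."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def bisquare_weight(distance, bandwidth):
    """Bisquare kernel: weight 1 at distance 0, dropping to 0 at the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    return (1 - (distance / bandwidth) ** 2) ** 2


# Compare the two kernels at increasing distances with the same bandwidth
for d in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(d, gaussian_weight(d, 1.0), bisquare_weight(d, 1.0))
```

Under these assumed forms, both kernels give the focal feature a weight of one, and the bisquare weight falls below the Gaussian weight at every positive distance.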
If the Gaussian or Bisquare option is specified, you must also set the Kernel Bandwidth parameter. Set an appropriate kernel bandwidth based on your data. If you do not provide a value, a default is estimated.
Rate Multiplier
When the counts are a subset of the population, each rate is a value between 0 and 1. If the population size is large or the event of interest is rare, the resulting rates will be small. The rates will include many leading zeros, which may make them difficult to interpret. The Rate Multiplier parameter is an integer value that scales the rates so they are more meaningful and easier to interpret. Setting the rate multiplier to 100 computes a percentage. A good rule of thumb is to use the smallest rate value to determine the rate multiplier. For example, if the smallest rate has three leading zeros, the Rate Multiplier value should be 10,000 or greater. The smallest rate that is not 0 will then be greater than 1.
When you set the Rate Multiplier value, the rates will be expressed as the expected count per the rate multiplier units. For example, if you calculate pancreatic cancer death rates and set the Rate Multiplier value to 100,000, the resulting rates will be the expected count per 100,000 people. A feature with a rate of 144 would mean that 144 deaths are expected due to pancreatic cancer per year for each group of 100,000 people.
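The rule of thumb for choosing a multiplier can be sketched in Python. This helper is hypothetical and is not part of the tool; it simply returns the smallest power of ten that brings the smallest nonzero rate to at least 1.

```python
import math


def suggest_rate_multiplier(rates):
    """Smallest power of ten that scales the smallest nonzero rate to at least 1."""
    nonzero = [r for r in rates if r is not None and r > 0]
    if not nonzero:
        return 1
    smallest = min(nonzero)
    if smallest >= 1:
        return 1
    return 10 ** math.ceil(-math.log10(smallest))


rates = [0.0, 0.0005, 0.0123]           # made-up unscaled rates
multiplier = suggest_rate_multiplier(rates)
print(multiplier)
print([r * multiplier for r in rates])   # rates expressed per `multiplier` units
```

For the made-up rates above, the smallest nonzero rate is 0.0005, which has three leading zeros after the decimal point, so the suggested multiplier is 10,000.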
Tool outputs
The tool produces an output table or feature class, output group layer, and geoprocessing messages.
Output features or table
The output feature class or table includes several fields.
Excess rate
The Excess Rate field compares a feature’s rate to the average rate of all the features. Excess rate is calculated by dividing a feature’s observed rate by the average rate of all the features. The excess rate can be any value greater than or equal to zero. Excess rate values near one indicate that the estimated rate is similar to the average rate. If the excess rate is less than one, the feature’s estimated rate is less than the average rate. If the excess rate is greater than one, the feature’s estimated rate is greater than the average rate. For example, if a feature has an excess rate of 1.25, its rate is 25 percent larger than the average rate. Conversely, if a feature has an excess rate equal to 0.75, its rate is 25 percent smaller than the average rate.
Standardized rate
The Standardized Rate field shows how much a feature's rate deviates from the mean rate. The standardized rate is calculated as follows:
$$z = \frac{r - \mu}{\sigma}$$
where $z$ is the standardized rate, $r$ is the rate estimate, $\mu$ is the mean rate, and $\sigma$ is the standard deviation. Features with negative standardized rates have rates that are smaller than the mean rate. Features with positive standardized rates have rates that are greater than the mean rate. The more negative a standardized rate is, the further it deviates below the mean. The greater the positive standardized rate is, the further it deviates above the mean. Features with standardized rates larger than 3 or smaller than –3 are considered outliers.
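A minimal Python sketch of the standardization follows. It assumes the population standard deviation; whether the tool uses the population or the sample standard deviation is not stated here. The function name is hypothetical and `None` stands in for a null rate.

```python
import statistics


def standardized_rates(rates):
    """Z-score of each rate relative to the mean and standard deviation of all rates."""
    valid = [r for r in rates if r is not None]
    if not valid:
        return [None for _ in rates]
    mean = statistics.mean(valid)
    sd = statistics.pstdev(valid)  # assumption: population standard deviation
    if sd == 0:
        return [None for _ in rates]
    return [None if r is None else (r - mean) / sd for r in rates]


z_scores = standardized_rates([0.02, 0.04, 0.06, None])
print(z_scores)
# Flag features more than 3 standard deviations from the mean as outliers
print([z is not None and abs(z) > 3 for z in z_scores])
```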
Confidence intervals
If the Rate Method parameter value is set to Crude Rate, the output table or feature class will include the Confidence Interval – Upper 95% and Confidence Interval- Lower 95% fields. The 95 percent confidence interval is calculated using the methodology proposed by the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention (CDC). If a feature’s count is greater than or equal to 100, a Gaussian approximation is appropriate and, as a result, the 95 percent confidence interval for the crude rate is calculated as follows:
$$r_i \pm 1.96\, \frac{r_i}{\sqrt{Y_i}}$$
where $r_i$ is the crude rate and $Y_i$ is the count.
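The large-count interval can be sketched in Python as follows. This assumes the NCHS Gaussian approximation, in which the half-width of the interval is 1.96 times the crude rate divided by the square root of the count. The function name is hypothetical, and the sketch deliberately rejects counts below 100 because a different method applies to them.

```python
import math


def crude_rate_ci_large_count(count, population, z=1.96):
    """95 percent confidence interval for a crude rate when the count is at least 100."""
    if count < 100:
        raise ValueError("Gaussian approximation is intended for counts of 100 or more")
    rate = count / population
    half_width = z * rate / math.sqrt(count)
    return rate - half_width, rate + half_width


# Made-up example: 400 events in a population of 100,000
lower, upper = crude_rate_ci_large_count(400, 100_000)
print(lower, upper)
```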
If the number of counts is less than 100, the 95 percent confidence interval is calculated using a method proposed by K. Ulm in "A simple method to calculate the confidence interval of a standardized mortality ratio (SMR)". In this case, the Gaussian approximation of the Poisson is not appropriate and an identity between cumulative Poisson probabilities and the chi-squared distribution is used. Let qgamma(p,x) represent the quantile associated with the probability, p, of a gamma distribution with shape parameter x and rate parameter 1. Then the 95 percent confidence interval is calculated as follows:
$$\left[\frac{\mathrm{qgamma}(0.025,\ Y_i)}{n_i},\ \frac{\mathrm{qgamma}(0.975,\ Y_i + 1)}{n_i}\right]$$
Reliable
The values in the Reliable field reflect the reliability of the rate estimate. This field is included in the output features or table when the Rate Method parameter value is set to Crude Rate. The calculation follows the method described by the CDC in their reference manual. When the Reliable value is large, the crude rate estimate is imprecise and the crude rate is considered unreliable. Since 1989, the CDC's National Center for Health Statistics has considered any crude rate that is based on fewer than 20 counts to be statistically unreliable. This is equivalent to a Reliable value greater than or equal to 22.94.
Reliability is measured through the relative standard error (RSE), also known as the coefficient of variation. The RSE is the ratio between the standard error of the rate and the rate estimate, multiplied by 100. The rate variance is calculated as follows:
$$\mathrm{Var}(r_i) = \frac{Y_i}{n_i^2}$$
and, assuming a nonzero count, the RSE is calculated as follows:
$$\mathrm{RSE}_i = 100 \times \frac{\sqrt{\mathrm{Var}(r_i)}}{r_i} = \frac{100}{\sqrt{Y_i}}$$
RSE only depends on the counts (Yi). Although the RSE formula does not depend on the population size directly, large populations tend to have a greater number of counts so there is an indirect effect.
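The link between the count and the reliability threshold can be checked with a short Python sketch. It assumes a Poisson variance for the count, under which the RSE reduces to 100 divided by the square root of the count. The function name is hypothetical.

```python
import math


def relative_standard_error(count):
    """RSE (in percent) of a crude rate; undefined (None) for a zero or missing count."""
    if count is None or count <= 0:
        return None
    return 100 / math.sqrt(count)


# The RSE shrinks as the count grows, regardless of the population size
for count in (5, 19, 20, 100, 400):
    print(count, round(relative_standard_error(count), 2))
```

A count of 19, the largest count that is fewer than 20, gives an RSE of about 22.94, which matches the threshold quoted above.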
Number of nonnull neighbors
The Number of Non-Null Neighbors field lists the number of neighbors, including the focal feature, that do not have a null rate. Features with a negative or null value in the Population Field or Count Field parameter values have a null rate. The spatial smoothing methods use the neighborhood of a feature to determine that feature’s rate. The Number of Non-Null Neighbors field reveals the number of neighbors that were used to smooth the rate of the focal feature. This field is included in the output table or feature class if the Rate Method parameter value is Locally Weighted Average, Locally Weighted Median, or Local Empirical Bayes.
Fill missing value
The Fill Missing Value field is a Boolean field that indicates whether a rate was imputed for the feature. Features with a negative or null value in the Count Field parameter value or a negative or null value in the Population Field parameter value will have a null rate. However, if the Rate Method value is Locally Weighted Average or Locally Weighted Median, a rate may be imputed for a feature with a null rate. If the feature has nonnull neighbors, the null rate will be replaced by the locally weighted average or locally weighted median estimate of its neighborhood.
Group layer and symbology
The tool adds a group layer to the Contents pane and a sublayer for each rate. If more than 10 rates are calculated, only the first 10 rates will be added as sublayers.
Each sublayer is a standard deviation map. The rates are split into bins based on their standard deviation. Each bin is labeled with the standard deviation interval and, in parentheses, the corresponding rate interval. The color ramp includes three colors: green, white in the middle, and brown. The color ramp is centered around the mean rate. Features shaded green have rates that are below the mean rate. Features shaded brown have rates that are above the mean rate. Features in the strongest shades of brown (+3 standard deviations) and green (-3 standard deviations) are outliers.
Geoprocessing messages
The geoprocessing messages provide a summary of the features and the rates. The messages include a drop-down section for each rate that was calculated. Each section includes a Summary of Rates table. If the Rate Method parameter value is Locally Weighted Average, Locally Weighted Median, or Local Empirical Bayes, each section will also include a Summary of Neighborhood Counts table.
Summary of rates
If the Rate Method parameter value is not Crude Rate, the Summary of Rates table will include a column summarizing the selected rate method and an additional column summarizing the crude rates. Use these columns to compare the results from the selected rate method to the results from the crude rate method. The Summary of Rates table includes the Minimum, Maximum, Median, Mean, and Standard Deviation values of the rates. If the Rate Method parameter value is Locally Weighted Average or Locally Weighted Median, the table will include a Features with Null Rate Value and Features with Filled Values row. The Features with Null Rate Value row lists the number of features with a null rate. The Features with Filled Values row lists the number of features with an imputed rate. These features initially had a null rate; however, their neighborhood included nonnull rate values, so their rate was imputed.
Summary of neighborhood counts
If the Rate Method parameter value is Locally Weighted Average, Locally Weighted Median, or Local Empirical Bayes, each section will also include a Summary of Neighborhood Counts table summarizing all the neighborhoods. The table includes the Minimum, Maximum, Median, and Mean neighborhood counts and the Features Without Neighbors value.
Additional resources
See the following additional resources:
- Anselin, L., N. Lozano, and J. Koschinsky. 2006. "Rate Transformations and Smoothing"
- Brillinger, D. R. 1986. "A biometrics invited paper with discussion: the natural variability of vital rates and associated statistics." Biometrics, 693-734. https://pubmed.ncbi.nlm.nih.gov/3814721/
- Carlin, B.P. and T.A. Louis. 1997. "Bayes and empirical Bayes methods for data analysis." Statistics and Computing, 153-154. https://doi.org/10.1023/A:1018577817064
- Marshall, R.J. 1991. "Mapping disease and mortality rates using empirical Bayes estimators." Journal of the Royal Statistical Society: Series C (Applied Statistics), 283-294. https://doi.org/10.2307/2347593
- Martuzzi, M. and P. Elliott. 1996. "Empirical Bayes estimation of small prevalence of non-rare conditions." Statistics in Medicine, 15(17-18) 1867-1873. https://doi.org/10.1002/(SICI)1097-0258(19960915)15:17<1867::AID-SIM398>3.0.CO;2-2
- National Center for Health Statistics. 2019. Technical appendix from vital statistics of United States 1999 mortality
- Ulm, K. 1990. "Simple method to calculate the confidence interval of a standardized mortality ratio (SMR)." American Journal of Epidemiology, 131(2) 373-375. https://doi.org/10.1093/oxfordjournals.aje.a11507