How Fill Missing Values works

The Fill Missing Values tool replaces missing values (nulls) with estimated values to minimize the impact of those nulls on subsequent analysis. Data may be missing for a variety of reasons. For example, a sensor might be temporarily broken, a sampling site might be inaccessible, or values might be intentionally suppressed to protect confidentiality. When one or more values are missing for a feature, most statistical methods default to dropping that feature from the analysis. Dropping features in this way can introduce bias or affect the appropriateness of the results because the analysis is run on an incomplete dataset. Rather than throwing out valuable data, which can affect your analyses or result in "holes" in your map, you can fill in the missing values using other information from the same dataset or from other datasets (for example, a dataset of larger aggregate units). For spatial data, you can use the values of neighboring features in space to estimate the missing values. For spatiotemporal data, you can also use neighbors in time. Estimating and filling missing values preserves all existing values and replaces only the nulls, based on the method chosen. Once the missing values have been filled, the dataset can be analyzed as a complete dataset.

For example, consider a dataset of the United States in which each of the 50 states has 100 years' worth of data on relative per capita income. Now, imagine that California is missing 1 year of data (a null value). If you tried to create a space-time cube, all of California's data would be dropped from the analysis because of that single null value. The other 99 values for California would be left out because a time series must be complete to be included. The Fill Missing Values tool fills the null with a good approximation of the missing value, which ensures that California is included in subsequent analyses.
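The effect described above can be sketched in a few lines of Python. This is only an illustration and not the tool's implementation: the data, the function names (complete_only, fill_from_time_neighbors), and the one-step time window are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical toy data: one value per year, None marks a null.
series = {
    "California": [50.0, 52.0, None, 56.0, 58.0],
    "Oregon": [40.0, 41.0, 42.0, 43.0, 44.0],
}

def complete_only(data):
    """Keep only locations whose time series contains no nulls."""
    return {name: vals for name, vals in data.items() if None not in vals}

def fill_from_time_neighbors(values, window=1):
    """Fill each null with the mean of the original values within
    `window` time steps. Estimates are never built from other estimates,
    because neighbors are always read from the original list."""
    filled = list(values)
    for i, value in enumerate(values):
        if value is None:
            lo, hi = max(0, i - window), min(len(values), i + window + 1)
            neighbors = [x for x in values[lo:hi] if x is not None]
            if neighbors:
                filled[i] = mean(neighbors)
    return filled

# Without filling, the single null removes California entirely.
kept_before = complete_only(series)

# After filling, every location has a complete time series.
filled_series = {name: fill_from_time_neighbors(vals) for name, vals in series.items()}
kept_after = complete_only(filled_series)
```

In this toy case, California is excluded before filling and retained afterward, with the null replaced by the mean of the two adjacent years.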

Interpreting results

The tool will output a new field that contains the complete set of existing and imputed values as well as a field indicating which values were estimated. The tool also outputs messages that provide information about the percentage of total records for which values were imputed, the distribution of the data before and after filling missing values, as well as the total number and percentage of values filled.

Best practices

  • Make sure you know what values are missing. The placeholder indicating a missing data value can vary from dataset to dataset. In a geodatabase feature class, missing values are stored as <Null> and are thus clearly recognizable. However, shapefiles cannot store null values. Tools or other procedures that create shapefiles may store or interpret null values as zero. In some cases, nulls in a shapefile are indicated by a very large positive or negative number. A simple way to learn about the missing data values is to sort the field of interest from largest to smallest and then from smallest to largest. Seeing null values, many zero values, or extremely large or small values may provide clues as to what placeholder was used to indicate a missing value. The metadata will sometimes indicate the placeholder for missing data.
  • Determine how many values are missing. You don't want to fill too many values. While there is no absolute cutoff for the number of missing data values you should attempt to fill, a common rule of thumb is to fill no more than 5 percent of the values in the dataset.
  • Determine where the missing values are. Map the attribute with missing data and explore the spatial patterns. Determine if the missing data is clustered or located on the periphery or in the core of your study area. Also, see if the missing values appear to be in areas of primarily high or low values. Any of these situations suggests that there is a pattern to the location or values of missing data; this is an indicator that data is not missing at random. Filling in missing values works best when data is missing at random.
  • Check the number and percentage of values filled in to determine if any values are still missing. If they are, try changing the method used to fill the values, for example, increase the number of neighbors or size of the neighborhood. Be sure not to fill in missing values with values you already filled in. This is bad practice because you are essentially estimating values from estimates.
  • Examine the distribution of the data before and after filling in missing values by comparing the descriptive statistics, such as the mean and standard deviation, and examining the histogram to check for skewing and elevating or flattening of the curve. The ideal solution would yield distributions that are similar in shape.
  • Look for local or regional applicability of the method used to fill in the values. You may find that the method worked better in some areas than in others. For example, if you are filling using the average of neighboring values and the range of the reported standard deviations is wide, you might try a different type of neighborhood or a different fill method. Ideally, the standard deviation would be about the same for all filled values, indicating that they all vary similarly from the neighbors used to fill them.
  • Think about how the data will be used once the values have been filled in. When the data will simply be mapped to create an aesthetically pleasing visualization without holes, minor variations in the filled values may be masked by the mapping method. For example, choropleth mapping typically classifies data into several classes, so variations within the classes will not be visibly apparent. If the data will be used to generate official statistics, the impact of filling in missing values must be carefully examined and clearly understood.
  • Finally, communicate to your audience that you have filled missing values. If you are writing a report, describe the method you used to fill the missing values and state any assumptions you made when choosing it (for example, ensuring that the filled-in values were not over- or underestimated). If you are making a map, consider identifying the features for which the values have been filled in, for example, on a separate map. Cartographers have also identified polygonal features using a hatched or stipple pattern or a unique feature outline. Be careful when using these methods, as they can obscure the polygon fill or change the way the color of the fill is seen.
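Several of the checks above (identifying placeholders, counting missing values, and comparing distributions before and after filling) can be scripted. The following Python sketch is illustrative only: the placeholder code -9999, the sample values, and the filled value of 14.0 are assumptions, and the 5 percent threshold is the rule of thumb mentioned above rather than a hard limit.

```python
from statistics import mean, stdev

# Hypothetical placeholder code; check your metadata for the real one.
PLACEHOLDERS = {-9999}

def to_nulls(values, placeholders=PLACEHOLDERS):
    """Convert placeholder codes to None so they are not treated as data."""
    return [None if v in placeholders else v for v in values]

def percent_missing(values):
    """Percentage of records that are null."""
    return 100.0 * sum(v is None for v in values) / len(values)

def describe(values):
    """Descriptive statistics of the non-null values."""
    present = [v for v in values if v is not None]
    return {"n": len(present), "mean": mean(present), "stdev": stdev(present)}

# Hypothetical attribute values with one placeholder.
raw = [12.0, 15.0, -9999, 14.0, 13.0, 16.0, 11.0, 15.0, 14.0, 12.0,
       13.0, 15.0, 14.0, 16.0, 12.0, 13.0, 14.0, 15.0, 13.0, 14.0]

values = to_nulls(raw)
missing_pct = percent_missing(values)
within_rule_of_thumb = missing_pct <= 5.0

# Assume the null was filled with 14.0 by some fill method.
filled = [14.0 if v is None else v for v in values]

before = describe(values)
after = describe(filled)
mean_shift = abs(after["mean"] - before["mean"])
```

A small mean_shift and similar standard deviations before and after suggest that filling did not noticeably distort the distribution. A histogram comparison is still worthwhile.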

Choosing how to fill in missing values

When filling in missing values, you must decide on a fill method, such as using the average, minimum, maximum, or median of the neighboring values. Use the minimum when you do not want to overestimate the filled-in values, for example, if you are trying to fill in the number of students who receive free lunches. Similarly, use the maximum if you do not want to underestimate the missing values, for example, when filling in the number of people who have higher educational degrees. Use the median if you suspect the presence of outlier high or low values locally, such as housing values. Use the average if values tend to be similar to their neighbors.
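To see how the choice of fill method changes the estimate, consider the following minimal Python sketch. The estimate function, the FILL_METHODS lookup, and the housing values are hypothetical and serve only to illustrate the four statistics. They are not the tool's interface.

```python
from statistics import mean, median

# Map each fill method name to the statistic it applies to the neighbors.
FILL_METHODS = {
    "average": mean,
    "minimum": min,
    "maximum": max,
    "median": median,
}

def estimate(neighbor_values, method="average"):
    """Estimate a missing value from its neighbors, ignoring null neighbors.
    Returns None when no neighbor has a value."""
    present = [v for v in neighbor_values if v is not None]
    if not present:
        return None
    return FILL_METHODS[method](present)

# Hypothetical neighboring housing values (in thousands) with one
# high outlier and one null neighbor.
neighbors = [200.0, 210.0, 220.0, 950.0, None]

estimates = {name: estimate(neighbors, name) for name in FILL_METHODS}
```

With these values, the average is pulled toward the outlier (395.0), while the median (215.0) stays close to the typical neighbor. This is why the median is suggested when local outliers are suspected.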

You also must decide how to define the set of neighbors that will be used to calculate the missing values. Neighbors can be defined based on a variety of spatial relationships, such as a fixed number of neighbors, all neighbors within a fixed distance, or neighbors that are contiguous (that is, they share a border or have corners that touch).
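The neighborhood definitions above can be sketched in Python as follows. This is a simplified illustration that assumes point locations in planar coordinates. The names k_nearest and within_distance, the sample coordinates, and the contiguity lookup are hypothetical. Real contiguity would be derived from polygon geometry, which is not shown here.

```python
import math

def k_nearest(target, points, k):
    """Names of the k features closest to the target (fixed number of neighbors)."""
    others = [name for name in points if name != target]
    others.sort(key=lambda name: math.dist(points[target], points[name]))
    return others[:k]

def within_distance(target, points, radius):
    """Names of all features within a fixed distance of the target."""
    return [
        name for name in points
        if name != target and math.dist(points[target], points[name]) <= radius
    ]

# Hypothetical feature locations.
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0), "D": (5.0, 5.0)}

# Hypothetical contiguity table: features that share a border or corner.
contiguity = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}

nearest_two = k_nearest("A", points, 2)
nearby = within_distance("A", points, 1.5)
touching = contiguity["A"]
```

Note that a fixed number of neighbors always returns k features, however far away they are, whereas a fixed distance can return few or no neighbors in sparse areas.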

Which fill method and which neighbors to use depends on how the filled data will ultimately be used. For example, a cartographer may want to fill polygons containing missing data to create an aesthetically pleasing map without holes. In this case, calculating the average of many spatial neighbors would be effective. A real estate analyst filling in missing data for the value of a house will use neighbors within a fixed distance and calculate their median value to avoid the influence of outliers.

When choosing the combination of type of neighborhood and fill method, think carefully about which surrounding features would legitimately influence the features with the missing values and which fill method is least likely to bias the results of the analysis. For example, consider a local public health analyst who has childhood lead poisoning data at the census block group level, but a few of the block groups have missing data. The analyst might consider using neighboring block groups that share a border with the block group with missing data and use the maximum of the surrounding values to fill the missing data. Using contiguous block groups can be justified because they likely will contain houses of similar age, and housing age is a known risk factor for lead exposure. While using the maximum value of the surrounding block groups to fill missing values might overestimate the true level of lead poisoning, in this example, where children's health is concerned, it is better to overestimate rather than underestimate the risk.
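The lead poisoning example above combines contiguous neighbors with the maximum fill method. A minimal Python sketch of that combination follows. The block group names, rates, adjacency table, and the fill_from_contiguous function are all hypothetical, and the second return value mimics the idea of a field that indicates which values were estimated.

```python
def fill_from_contiguous(values, adjacency, method=max):
    """Fill nulls from the original values of contiguous neighbors.

    Returns two dictionaries: the filled values and a flag marking
    which values were estimated. Neighbors are always read from the
    original values, so estimates are never built from other estimates.
    """
    filled, estimated = {}, {}
    for name, value in values.items():
        if value is not None:
            filled[name], estimated[name] = value, False
            continue
        neighbor_values = [
            values[n] for n in adjacency.get(name, [])
            if values.get(n) is not None
        ]
        filled[name] = method(neighbor_values) if neighbor_values else None
        estimated[name] = bool(neighbor_values)
    return filled, estimated

# Hypothetical lead poisoning rates per block group; None marks a null.
rates = {"BG1": 2.0, "BG2": None, "BG3": 5.0, "BG4": 3.5}

# Hypothetical contiguity: BG2 shares a border with BG1 and BG3 only.
adjacency = {"BG2": ["BG1", "BG3"]}

filled, estimated = fill_from_contiguous(rates, adjacency, method=max)
percent_filled = 100.0 * sum(estimated.values()) / len(rates)
```

Here BG2 receives the higher of its two neighbors' rates, which errs on the side of overestimating risk, and the estimated flags make it straightforward to report or map which values were filled.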

Additional resources contains an up-to-date list of all of the resources available for using the Space Time Pattern Mining and Spatial Statistics tools, including the following:

  • Tutorials
  • Videos
  • Free web seminars
  • Books, articles, and white papers
  • Sample scripts and case studies