The Fill Missing Values tool replaces missing values (nulls) with estimated values to minimize the impact of those nulls on subsequent analysis. There are many reasons why data may be missing. For example, a sensor may be temporarily broken, a sampling site may be inaccessible, or the data values may be intentionally suppressed to protect confidentiality. When one or more values are missing for a feature, most statistical methods default to dropping that feature from analysis. Dropping features in this way can introduce bias or affect the appropriateness of the results, since the analysis is run on an incomplete dataset.
Rather than discarding valuable data, which can affect your analyses or leave gaps in your map, you can fill in the missing values using other information from the dataset or from other datasets (for example, a dataset of larger aggregate units). For spatial data, you can use the values of neighboring features in space to estimate the missing values. For spatiotemporal data, you can also use neighbors in time. For nonspatial data, you can use global statistics of the field that contains the missing values. Estimating and filling missing values preserves existing values and replaces only the nulls, based on the selected method. Once the missing values have been filled, the dataset can be analyzed as a complete dataset.
For example, consider a dataset of the United States in which each of the 50 states has 100 years' worth of data on relative per capita income, and California is missing 1 year of data (a null value). If you try to create a space-time cube, all of California's data is dropped from the analysis because of that single null value: the other 99 values for California are left out because a time series must be complete to be included. The Fill Missing Values tool fills in the null value with an approximation of the missing value so that California is included in subsequent analyses.
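The California scenario can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the helper names `complete_series` and `fill_from_temporal_neighbors`, the sample values, and the choice of averaging the adjacent time steps are all assumptions made for this example.

```python
from statistics import mean


def complete_series(series_by_location):
    """Keep only locations whose time series contains no nulls (None)."""
    return {loc: s for loc, s in series_by_location.items() if None not in s}


def fill_from_temporal_neighbors(series, window=1):
    """Fill each None with the mean of the known values within `window` steps.

    Only the original values are read, so a filled value is never used
    to estimate another missing value.
    """
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            start = max(0, i - window)
            neighbors = [x for x in series[start:i + window + 1] if x is not None]
            if neighbors:
                filled[i] = mean(neighbors)
    return filled


# Hypothetical relative per capita income values (five time steps, not 100).
data = {
    "California": [1.10, 1.12, None, 1.16, 1.18],
    "Oregon": [0.95, 0.96, 0.97, 0.98, 0.99],
}

kept_before = complete_series(data)  # California is dropped here
filled = {loc: fill_from_temporal_neighbors(s) for loc, s in data.items()}
kept_after = complete_series(filled)  # California is retained here
```

Before filling, only Oregon survives the completeness check. After filling, both locations remain, and California's existing values are unchanged.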
Interpret results
The tool outputs a new field that contains the complete set of existing and imputed values, as well as a field indicating which values were estimated. The tool also outputs messages that report the percentage of total records for which values were imputed, the distribution of the data before and after filling missing values, and the total number and percentage of values filled.
Best practices
When deciding whether this tool is appropriate for your data and which parameters you should choose, several things should be taken into account.
- Be sure you know what values are missing. The placeholder indicating a missing data value can vary from dataset to dataset. In a geodatabase feature class, missing values are stored as <Null> and are thus clearly recognizable. However, shapefiles cannot store null values. Tools or other procedures that create shapefiles may store or interpret null values as zero. Or in some cases, nulls in a shapefile are indicated by a very large positive or negative number. A tip for learning about the missing data values is to sort the field of interest from largest to smallest values and subsequently from smallest to largest values. Seeing null values, many zero values, or extremely large or small values may provide clues as to what placeholder was used to indicate a missing value. The metadata will sometimes indicate the placeholder for missing data.
- Determine how many values are missing. You don't want to fill too many values. While there is no absolute cutoff for the number of missing data values you should attempt to fill, a common guideline is to fill no more than 5 percent of the values in the dataset.
- Determine where the missing values are. Map the attribute with missing data and explore the spatial patterns. Determine whether the missing data is clustered or located on the periphery or in the core of your study area. Also, see if the missing values appear to be in areas of primarily high or low values. Any of these situations suggests that there is a pattern to the location or values of missing data; this is an indicator that data is not missing at random. Filling in missing values works best when data is missing at random.
- Check the number and percentage of values filled in to determine whether any values are still missing. If they are, try changing the method used to fill the values; for example, increase the number of neighbors or size of the neighborhood. Be sure not to fill in missing values with values you already filled in. This is bad practice because you are essentially estimating values from estimates.
- Examine the distribution of the data before and after filling in missing values by comparing the descriptive statistics, such as the mean and standard deviation, and examining the histogram to check for skewing and elevating or flattening of the curve. The best solution will yield distributions that are similar in shape.
- Look for local or regional applicability of the method used to fill in the values. You may find that the method worked better in some areas than in others. For example, if you are filling using the average of neighboring values and the range of the reported standard deviations is wide, try a different type of neighborhood or a different fill method. Ideally, the standard deviation would be about the same for all filled values, indicating that they all vary similarly from the neighbors used to fill in the values.
- Think about how the data will be used once the values have been filled in. When the data will simply be mapped to create an aesthetically pleasing visualization without holes, minor variations in the filled values may be masked by the mapping method. For example, choropleth mapping typically classifies data into several classes, so variations within the classes will not be visibly apparent. If the data will be used to generate official statistics, the impact of filling in missing values must be carefully examined and clearly understood.
- Communicate to your audience that you have filled missing values. If you are writing a report, describe the method you used to fill the missing values and state any assumptions you made when choosing it (for example, ensuring that the filled-in values were not over- or underestimated). If you are making a map, consider identifying the features for which the values have been filled in, for example, on a separate map. Cartographers have also identified polygonal features using a hatched or stipple pattern or a unique feature outline. Be careful when using these methods, as they can obscure the polygon fill or change the way the color of the fill is seen.
- For the temporal trend fill method, the location with null values being filled must have at least two time periods with values at the beginning and at least two time periods with values at the end of the time series. However, having values for the first two and last two time periods is not always enough. If there is a long sequence of missing values in the middle of the time series, the interpolated values may not be reliable for further analysis, such as with the tools in the Time Series Forecasting toolset.
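Several of the checks above can be scripted before and after running the tool. The following is a minimal sketch, assuming the field has been read into a plain Python list and that -9999 is the placeholder used for missing data. The placeholder, the sample values, and the helper names are hypothetical, and the global mean is used as the fill only to keep the example short.

```python
from statistics import mean, stdev


def normalize_placeholders(values, placeholders=(-9999,)):
    """Convert known missing-data placeholders to None."""
    return [None if v is None or v in placeholders else v for v in values]


def percent_missing(values):
    """Percentage of records that are null."""
    return 100.0 * sum(v is None for v in values) / len(values)


def summary(values):
    """Descriptive statistics of the non-null values."""
    known = [v for v in values if v is not None]
    return {"n": len(known), "mean": mean(known), "stdev": stdev(known)}


# Hypothetical field values; -9999 stands in for a missing measurement.
raw = [12.0, 15.0, -9999, 14.0, 13.0, 16.0, 15.0, 14.0, 13.0, 12.0,
       14.0, 15.0, 13.0, 16.0, 12.0, 14.0, 15.0, 13.0, 14.0, 16.0]

values = normalize_placeholders(raw)
pct = percent_missing(values)
within_guideline = pct <= 5.0  # the 5 percent rule of thumb

before = summary(values)
global_mean = before["mean"]
filled = [global_mean if v is None else v for v in values]
after = summary(filled)

print(f"missing: {pct:.1f}% (within guideline: {within_guideline})")
print(f"mean  before/after: {before['mean']:.3f} / {after['mean']:.3f}")
print(f"stdev before/after: {before['stdev']:.3f} / {after['stdev']:.3f}")
```

In this sample, filling with the global mean leaves the mean unchanged but slightly reduces the standard deviation, which is the kind of flattening the before-and-after comparison is meant to reveal.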
Choose a fill method
When filling in missing values, you must decide on a fill method, such as using the average, minimum, maximum, or median of the neighboring values. Use the minimum when you do not want to overestimate the filled-in values, for example, if you are filling in the number of students who receive free lunches. Similarly, use the maximum when you do not want to underestimate the missing values, for example, when filling in the number of people who have higher educational degrees. Use the median if you suspect the presence of locally high or low outliers, such as in housing values. Use the average if values tend to be similar to their neighbors.
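To see how the choice of statistic changes the estimate, consider the following sketch. The method labels in `FILL_STATS`, the `fill_value` helper, and the housing values are assumptions made for illustration and are not the tool's parameter names.

```python
from statistics import mean, median

# Hypothetical labels mapped to the statistic applied to the neighbor values.
FILL_STATS = {
    "AVERAGE": mean,
    "MINIMUM": min,
    "MAXIMUM": max,
    "MEDIAN": median,
}


def fill_value(neighbor_values, method):
    """Estimate a missing value from its neighbors, ignoring null neighbors."""
    known = [v for v in neighbor_values if v is not None]
    if not known:
        return None  # no usable neighbors, so the value stays missing
    return FILL_STATS[method](known)


# Neighboring house values with one local outlier.
neighbors = [210_000, 225_000, 230_000, 240_000, 1_500_000]

estimates = {method: fill_value(neighbors, method) for method in FILL_STATS}
```

With these values, the average is pulled up to 481,000 by the single outlier, while the median stays at 230,000, which is why the median is suggested when local outliers are suspected.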
You also must decide how to define the set of neighbors that will be used to calculate the missing values. Neighbors can be defined based on a variety of spatial relationships, such as a fixed number of neighbors, all neighbors within a fixed distance, or neighbors that are contiguous (that is, they share a border or have corners that touch).
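Two of these neighborhood definitions can be sketched with point coordinates and straight-line distance. The helper names, the sample points, and the use of Euclidean distance are assumptions for this example. Contiguity requires polygon geometry, so it is not shown here.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_nearest(target, points, k):
    """Names of the k features closest to the target feature."""
    others = [name for name in points if name != target]
    return sorted(others, key=lambda n: dist(points[target], points[n]))[:k]


def within_distance(target, points, radius):
    """Names of all features within a fixed distance of the target feature."""
    return [
        name for name in points
        if name != target and dist(points[target], points[name]) <= radius
    ]


# Hypothetical feature centroids in projected (planar) coordinates.
points = {"A": (0, 0), "B": (1, 0), "C": (0, 2), "D": (3, 4), "E": (6, 8)}

nearest_two = k_nearest("A", points, k=2)
in_range = within_distance("A", points, radius=5)
```

A fixed number of neighbors guarantees that every feature has values to draw from, while a fixed distance keeps the scale of analysis constant but can leave isolated features with no neighbors at all.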
Which fill method and which neighbors to use depend on how the filled data will ultimately be used. For example, a cartographer may want to fill polygons containing missing data to create an aesthetically pleasing map without holes. In this case, calculating the average of many spatial neighbors would be effective. A real estate analyst filling in missing data for the value of a house will use neighbors within a fixed distance and calculate their median value to avoid the influence of outliers.
When choosing the combination of type of neighborhood and fill method, consider which surrounding features would legitimately influence the features with the missing values and which fill method is least likely to bias the results of the analysis. For example, consider a local public health analyst who has childhood lead poisoning data at the census block group level, but a few of the block groups have missing data. The analyst might consider using neighboring block groups that share a border with the block group with missing data and use the maximum of the surrounding values to fill the missing data. Using contiguous block groups can be justified because they likely will contain houses of similar age, and housing age is a known risk factor for lead exposure. While using the maximum value of the surrounding block groups to fill missing values might overestimate the true level of lead poisoning, in this example, where children's health is concerned, it is better to overestimate rather than underestimate the risk.
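The public health scenario can be sketched as follows, assuming the contiguity relationships are already available as an adjacency dictionary. The block group IDs, rates, adjacency lists, and the `fill_with_contiguous_max` helper are hypothetical. The sketch returns both the filled values and a flag marking which values were estimated, and it reads only original values so that no estimate is derived from another estimate.

```python
def fill_with_contiguous_max(values, adjacency):
    """Fill nulls with the maximum original value among contiguous neighbors.

    Returns (filled, estimated): the completed values and a flag per unit
    indicating whether its value was estimated.
    """
    filled, estimated = {}, {}
    for unit, value in values.items():
        if value is not None:
            filled[unit] = value
            estimated[unit] = False
            continue
        # Read from the original `values`, never from `filled`.
        known = [
            values[n] for n in adjacency.get(unit, ())
            if values.get(n) is not None
        ]
        filled[unit] = max(known) if known else None
        estimated[unit] = bool(known)
    return filled, estimated


# Hypothetical lead poisoning rates by census block group.
rates = {"BG1": 4.0, "BG2": None, "BG3": 7.5, "BG4": None, "BG5": 2.0}
adjacency = {
    "BG1": ["BG2"],
    "BG2": ["BG1", "BG3", "BG4"],
    "BG3": ["BG2"],
    "BG4": ["BG2", "BG5"],
    "BG5": ["BG4"],
}

filled, estimated = fill_with_contiguous_max(rates, adjacency)
```

Here BG2 is filled with 7.5, the higher of its two reported neighbors. BG4 is filled with 2.0 from BG5 alone, because its other neighbor, BG2, was itself missing in the original data.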
Additional resources
The Spatial Statistics Resources page contains a variety of resources to help you use the Spatial Statistics and Space Time Pattern Mining tools, including the following:
- Hands-on tutorials
- Workshop videos and presentations
- Training and web seminars
- Links to books, articles, and technical papers
- Sample scripts and case studies