The Curve Fit Forecast tool uses simple curve fitting to model a time series and forecast future values at every location in a space-time cube. For example, using a space-time cube with yearly population, this tool can predict the populations in upcoming years. The primary output is a map of the final forecasted time step as well as informative messages and pop-up charts. You can also create a new space-time cube containing the data from the original cube along with the forecasted values appended.
The tool fits a curve to each location in the Input Space Time Cube and forecasts the time series by extrapolating this curve to future time steps. The curves can be linear, parabolic, S-shaped (Gompertz), or exponential. You can use the same curve type at each location of the space-time cube or allow the tool to set which curve type best fits each location. You can also choose to detect outliers in each time series to identify locations and times that significantly deviate from the fitted curve.
Curve types and potential applications
This tool supports four curve types that can be specified in the Curve Type parameter. The following image shows a typical example of each of the four curve types:
- Linear—Each time series is modeled using a straight line.
- Equation: $X_t = a + bt$, where $X_t$ is the value of the time series at time $t$, and $a$ and $b$ are estimated from the data using least-squares estimation.
- Potential application: The linear curve type is useful for data that increases or decreases steadily with time. For example, this tool can be used to forecast the populations of communities during the stage of development in which the population growth is approximately linear.
- Parabolic—Each time series is modeled using a parabola, also called a quadratic curve.
- Equation: $X_t = a + bt + ct^2$, where $X_t$ is the value of the time series at time $t$, and $a$, $b$, and $c$ are estimated from the data using least-squares estimation.
- Potential application: The parabolic curve type is useful for data that changes direction over time, either increasing then decreasing or vice versa. All other curve types assume that the values continuously increase or decrease over time.
- Exponential—Each time series is modeled using an exponential curve, also called a geometric curve.
- Equation: $X_t = a e^{bt} + k$, where $X_t$ is the value of the time series at time $t$, and $a$, $b$, and $k$ are estimated from the data using least-squares estimation. The value $k$ allows the exponential curve to shift to better fit the time series.
- Potential application: The exponential curve type is useful for data that rapidly increases or decreases with time. For example, periods of rapid growth in population in developing regions can be modeled with an exponential curve.
- S-shaped (Gompertz)—Each time series is modeled using a Gompertz curve. These curves take the shape of an S and come with lower and upper bounds on the curve.
- Equation: $X_t = a e^{-b e^{-ct}} + k$, where $X_t$ is the value of the time series at time $t$, and $a$, $b$, $c$, and $k$ are estimated from the data using least-squares estimation. The values $a$ and $k$ must be nonnegative. The value $k$ allows the Gompertz curve to shift to better fit the time series and never exceeds ten times the largest value of the time series.
- Potential application: The Gompertz curve type is useful for modeling growth with capacity constraints. Populations often start by growing slowly before increasing rapidly once the population density is sufficient to support industry. The population growth then slows again as the population density approaches the limit that the region can sustain.
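As an illustration of the least-squares fitting and extrapolation described for the curve types above, the following minimal sketch fits the linear curve type to one location's time series and forecasts two future time steps. It assumes the common form X_t = a + b*t with time steps indexed 1, 2, ..., T; the function names and the sample values are hypothetical and are not part of the tool.

```python
def fit_linear(values):
    """Least-squares fit of X_t = a + b*t for t = 1..T. Returns (a, b)."""
    n = len(values)
    times = range(1, n + 1)  # assumed time index: 1, 2, ..., T
    t_mean = sum(times) / n
    x_mean = sum(values) / n
    # Closed-form ordinary least-squares estimates for a straight line.
    s_tt = sum((t - t_mean) ** 2 for t in times)
    s_tx = sum((t - t_mean) * (x - x_mean) for t, x in zip(times, values))
    b = s_tx / s_tt
    a = x_mean - b * t_mean
    return a, b


def forecast_linear(a, b, n_observed, n_future):
    """Extrapolate the fitted line to the next n_future time steps."""
    start = n_observed + 1
    return [a + b * t for t in range(start, start + n_future)]


# Hypothetical yearly population at one location.
population = [100.0, 110.0, 120.0, 130.0, 140.0]
a, b = fit_linear(population)
forecasts = forecast_linear(a, b, len(population), 2)
print(a, b, forecasts)
```

The other curve types follow the same pattern (estimate the parameters, then evaluate the curve at future time steps), although the exponential and Gompertz curves generally require an iterative nonlinear least-squares solver rather than a closed-form solution.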
By default, the Curve Type parameter uses the Auto-detect option that fits all four curve types and identifies the one that provides the best forecast for the time series at each location. If this option is chosen, different locations in the space-time cube may use different curve types. The curve type with the smallest Validation root mean square error (RMSE) is used at each location; however, if no time steps are withheld for validation, the Forecast RMSE is used instead. Both of these statistics are saved as fields in Output Features and are discussed further in the next section.
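The selection rule described above can be sketched as follows. This is only an illustration of the rule as stated, not the tool's internal code; the function name, the dictionary layout, and the RMSE numbers are hypothetical.

```python
def select_curve_type(results):
    """Pick the curve type with the smallest Validation RMSE.

    `results` maps a curve type name to a dict with the keys
    "forecast_rmse" and "validation_rmse". When validation was not
    performed (validation_rmse is None), fall back to the Forecast RMSE.
    """
    validated = all(r["validation_rmse"] is not None for r in results.values())
    key = "validation_rmse" if validated else "forecast_rmse"
    return min(results, key=lambda name: results[name][key])


# Hypothetical RMSE values for one location.
with_validation = {
    "Linear": {"forecast_rmse": 4.1, "validation_rmse": 9.8},
    "Parabolic": {"forecast_rmse": 3.2, "validation_rmse": 12.5},
    "Exponential": {"forecast_rmse": 3.9, "validation_rmse": 7.4},
    "Gompertz": {"forecast_rmse": 3.5, "validation_rmse": 8.0},
}
without_validation = {
    name: {"forecast_rmse": r["forecast_rmse"], "validation_rmse": None}
    for name, r in with_validation.items()
}

print(select_curve_type(with_validation))
print(select_curve_type(without_validation))
```

In this made-up example, the parabolic curve fits the observed values most closely but forecasts the withheld values worst, which is why the Validation RMSE is preferred whenever it is available.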
Forecasting and validation
The tool builds two models while forecasting each time series. The first is the forecast model, which is used to forecast the values of future time steps. The second is the validation model, which is used to validate the forecasted values.
The forecast model is constructed by fitting the chosen curve type to the time series values at each location of the space-time cube. This curve is then extrapolated into the future to predict the values of future time slices. The fit of the curve to each time series is measured by the Forecast RMSE, which is equal to the square root of the average squared difference between the curve and the values of the time series.
$\text{Forecast RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}(c_t - r_t)^2}$, where $T$ is the number of time steps, $c_t$ is the value of the curve, and $r_t$ is the raw value of the time series at time $t$.
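The Forecast RMSE definition above (the square root of the average squared difference between the curve and the raw values) can be computed with a short sketch like the following. The function name and the sample values are hypothetical.

```python
import math


def forecast_rmse(curve_values, raw_values):
    """Square root of the mean squared difference between curve and raw values."""
    if len(curve_values) != len(raw_values):
        raise ValueError("curve and raw series must have the same length")
    n_steps = len(raw_values)
    squared_error = sum((c - r) ** 2 for c, r in zip(curve_values, raw_values))
    return math.sqrt(squared_error / n_steps)


# Hypothetical raw values and fitted curve values at four time steps.
raw = [2.0, 4.0, 6.0, 8.0]
curve = [3.0, 3.0, 7.0, 7.0]
rmse = forecast_rmse(curve, raw)
print(rmse)
```

Every fitted value in this example is off by exactly one unit, so the result is in the same units as the data and equals 1.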
The following image shows the raw values of a time series along with a Gompertz curve fitted to the time series. The Forecast RMSE measures how much these two time series differ from each other.
The Forecast RMSE only measures how well the curve fits the raw time series values. It does not measure how well the forecast model actually forecasts future values. It is common for a curve to fit a time series closely but not provide accurate forecasts when extrapolated. This problem is addressed by the validation model.
The validation model is used to determine how well the forecast model can forecast future values of each time series. It is constructed by excluding some of the final time steps of each time series and fitting the curve to the data that was not excluded. This curve is then used to forecast the values of the data that were withheld, and the forecasted values are compared to the raw values that were hidden. By default, 10 percent of the time steps are withheld for validation, but this number can be changed using the Number of Time Steps to Exclude for Validation parameter. The number of time steps excluded cannot exceed 25 percent of the number of time steps, and no validation is performed if 0 is specified. The accuracy of the forecasts is measured by calculating a Validation RMSE statistic, which is equal to the square root of the average squared difference between the forecasted and raw values of the excluded time steps.
$\text{Validation RMSE} = \sqrt{\frac{1}{m}\sum_{t=T-m+1}^{T}(c_t - r_t)^2}$, where $T$ is the number of time steps, $m$ is the number of time steps withheld for validation, $c_t$ is the value forecasted from the first $T-m$ time steps, and $r_t$ is the raw value of the time series withheld for validation at time $t$.
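The validation procedure above can be sketched as follows: withhold the final m time steps, fit the curve to the first T-m, forecast the withheld steps, and compare. For brevity, this sketch assumes a linear curve of the form X_t = a + b*t with time steps indexed from 1; the function names and data are hypothetical.

```python
import math


def fit_linear(values):
    """Least-squares fit of X_t = a + b*t for t = 1..len(values)."""
    n = len(values)
    times = range(1, n + 1)
    t_mean = sum(times) / n
    x_mean = sum(values) / n
    s_tt = sum((t - t_mean) ** 2 for t in times)
    s_tx = sum((t - t_mean) * (x - x_mean) for t, x in zip(times, values))
    b = s_tx / s_tt
    return x_mean - b * t_mean, b


def validation_rmse(values, n_withheld):
    """Fit on the first T-m steps, forecast the last m, and return the RMSE."""
    n_steps = len(values)
    if not 0 < n_withheld < n_steps - 1:
        raise ValueError("need at least two time steps left for fitting")
    n_train = n_steps - n_withheld
    a, b = fit_linear(values[:n_train])
    # Forecast the withheld time steps T-m+1, ..., T from the training fit.
    forecasts = [a + b * t for t in range(n_train + 1, n_steps + 1)]
    withheld = values[n_train:]
    squared_error = sum((c - r) ** 2 for c, r in zip(forecasts, withheld))
    return math.sqrt(squared_error / n_withheld)


# Hypothetical series: linear for eight steps, then growth accelerates.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 12.0]
v_rmse = validation_rmse(series, 2)
print(v_rmse)
```

The line fits the first eight values exactly, yet its forecasts of 9 and 10 miss the withheld values of 10 and 12, which illustrates how a curve can fit the observed data closely and still extrapolate poorly.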
The following image shows a Gompertz curve fitted to the first half of a time series and extrapolated to forecast the second half of the time series. The Validation RMSE measures how much the forecasted values differ from the raw values at the withheld time steps.
The validation model is important because it can directly compare forecasted values to raw values to measure how well the curve can forecast. While it is not actually used to forecast, it is used to justify the forecast model.
Validation in time series forecasting is similar but not identical to a common technique called cross validation. The difference is that forecasting validation always excludes the final time steps for validation, and cross validation either excludes a random subset of the data or excludes each value sequentially.
There are several considerations when interpreting the Forecast RMSE and Validation RMSE values.
- The RMSE values are not directly comparable to each other because they measure different things. The Forecast RMSE measures the fit of the curve to the raw time series values, and the Validation RMSE measures how well the curve can forecast future values. Because the Forecast RMSE uses more data and does not extrapolate, it is usually smaller than the Validation RMSE.
- Both RMSE values are in the units of the data. For example, if your data is temperature measurements in degrees Celsius, a Validation RMSE of 50 is very high because it means that the forecasted values differed from the true values by approximately 50 degrees on average. However, if your data is daily revenue in U.S. dollars of a large retail store, the same Validation RMSE of 50 is very small because it means that the forecasted daily revenue only differed from the true values by $50 per day on average.
Identifying time series outliers
Outliers in time series data are values that significantly differ from the patterns and trends of the other values in the time series. For example, large numbers of online purchases around holidays or high numbers of traffic accidents during heavy rainstorms would likely be detected as outliers in their time series. Simple data entry errors, such as omitting the decimal of a number, are another common source of outliers. Identifying outliers in time series forecasting is important because outliers influence the forecast model that is used to forecast future values, and even a small number of outliers in the time series of a location can significantly reduce the accuracy and reliability of the forecasts. Locations with outliers, particularly outliers toward the beginning or end of the time series, may produce misleading forecasts, and identifying these locations helps you determine how confident you should be in the forecasted values at each location.
Outliers are not determined simply by their raw values but instead by how much their values differ from the fitted values of the forecast model. This means that whether or not a value is determined to be an outlier is contextual and depends both on its place and time. The forecast model defines what the value is expected to be based on the entire time series, and outliers are the values that deviate significantly from this baseline. For example, consider a time series of annual mean temperature. Because average temperatures have increased over the last several decades, the fitted forecast model of temperature will also increase over time to reflect this increase. This means that a temperature value that would be considered typical and not an outlier in 1950 would likely be considered an outlier if the same temperature occurred in 2020. In other words, a typical temperature from 1950 would be considered very low by the standards of 2020.
You can choose to detect time series outliers at each location using the Outlier Option parameter. If specified, the Generalized Extreme Studentized Deviate (ESD) test is performed for each location to test for time series outliers. The confidence level of the test can be specified with the Level of Confidence parameter, and 90 percent confidence is used by default. The Generalized ESD test iteratively tests for a single outlier, two outliers, three outliers, and so on, at each location up to the value of the Maximum Number of Outliers parameter (by default, 5 percent of the number of time steps, rounded down), and the largest statistically significant number of outliers is returned. The number of outliers at each location can be seen in the attribute table of the output features, and individual outliers can be seen in the time series pop-up charts that are discussed in the next section.
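For reference, the standard formulation of the Generalized ESD test (Rosner, 1983) is summarized below. It is assumed here, based on the description above, that the test is applied to the residuals $e_j$ between the raw values and the fitted curve; the exact quantities the tool uses internally are not stated in this document.

```latex
% Step i = 1, ..., r, where r is the maximum number of outliers and n is
% the number of time steps. Before step i, the i-1 most extreme values
% found in earlier steps have been removed; \bar{e} and s are the mean
% and standard deviation of the remaining values.
R_i = \frac{\max_j \lvert e_j - \bar{e} \rvert}{s},
\qquad
\lambda_i = \frac{(n-i)\, t_{p,\,n-i-1}}
                 {\sqrt{\left(n-i-1+t_{p,\,n-i-1}^{2}\right)(n-i+1)}},
\qquad
p = 1 - \frac{\alpha}{2(n-i+1)}
```

Here $t_{p,\nu}$ is the $p$-th quantile of the t distribution with $\nu$ degrees of freedom, and $\alpha$ is one minus the confidence level (0.10 for the default 90 percent). The number of outliers is the largest $i$ for which $R_i > \lambda_i$.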
The primary output of this tool is a 2D feature class showing each location in the Input Space Time Cube symbolized by the final forecasted time step with the forecasts for all other time steps stored as fields. Although each location is independently forecasted and spatial relationships are not taken into account, the map may display spatial patterns for areas with similar time series.
Clicking any feature on the map using the Explore navigation tool displays a chart in the Pop-up pane showing the values of the space-time cube along with the fitted curve and the forecasted values. The values of the space-time cube are displayed in blue and are connected by a blue line. The fitted values are displayed in orange and are connected by a dashed orange line representing the curve. The forecasted values are displayed in orange and are connected by a solid orange line representing the extrapolation and forecasting of the curve. You can hover over any point in the chart to see the date and value of the point. Additionally, if you chose to detect outliers in time series, any outliers are displayed as large purple dots.
Pop-up charts are not created when the output features are saved as a shapefile (.shp).
The tool provides a number of messages with information about the tool execution. The messages have three main sections, followed by two additional sections that appear only when the corresponding options are chosen.
The Input Space Time Cube Details section displays properties of the input space-time cube along with information about the time step interval, number of time steps, number of locations, and number of space-time bins. The properties displayed in this first section depend on how the cube was originally created, so the information varies from cube to cube.
The Analysis Details section displays properties of the forecast results, including the number of forecasted time steps, the number of time steps excluded for validation, and information about the forecasted time steps.
The Summary of Accuracy across Locations section displays summary statistics for the Forecast RMSE and Validation RMSE among all of the locations. For each value, the minimum, maximum, mean, median, and standard deviation are displayed.
The Summary of Selected Curve Types section appears if Auto-detect is chosen for the Curve Type parameter. This section displays the number of locations and percent of locations that were chosen for each of the four curve types.
The Summary of Time Series Outliers section appears if you choose to detect time series outliers using the Outlier Option parameter. This section displays information including the number and percent of locations containing outliers, the time step containing the most outliers, and summary statistics for the number of outliers by location and by time step.
Geoprocessing messages appear at the bottom of the Geoprocessing pane during tool execution. You can access the messages by hovering over the progress bar, clicking the pop-out button, or expanding the messages section in the Geoprocessing pane. You can also access the messages for a previously run tool using geoprocessing history.
Fields of the output features
In addition to Object ID, geometry fields, and the field containing the pop-up charts, the Output Features will have the following fields:
- Location ID (LOCATION)—The Location ID of the corresponding location of the space-time cube.
- Forecast for (Analysis Variable) in (Time Step) (FCAST_1, FCAST_2, and so on)—The forecasted value of each future time step. The field alias displays the name of the Analysis Variable and the date of the forecast. A field of this type is created for each forecasted time step.
- Forecast Root Mean Square Error (F_RMSE)—The Forecast RMSE.
- Validation Root Mean Square Error (V_RMSE)—The Validation RMSE. If no time steps were excluded for validation, this field is not created.
- Forecast Method (METHOD)—The curve type that was used at the location. This field can be used to identify the curve type of the location when you use the Auto-detect option.
- Forecast Equation (EQUATION)—A text field displaying the equation of the forecast curve at the location. This field is not created when you use the Auto-detect option.
- Number of Model Fit Outliers (N_OUTLIERS)—The number of outliers detected in the time series of the location. This field is only created if you chose to detect outliers with the Outlier Option parameter.
Output space-time cube
If an Output Space Time Cube is specified, the output cube contains all of the original values from the input space-time cube with the forecasted values appended. This new space-time cube can be displayed using the Visualize Space Time Cube in 2D or Visualize Space Time Cube in 3D tools and can be used as input to the tools in the Space Time Pattern Mining toolbox, such as Emerging Hot Spot Analysis and Time Series Clustering.
Multiple forecasted space-time cubes can be compared and merged using the Evaluate Forecasts by Location tool. This allows you to create multiple forecast cubes using different forecasting tools and parameters, and the tool identifies the best forecast for each location using either Forecast or Validation RMSE.
Best practices and limitations
When deciding whether this tool is appropriate for your data and which parameters you should choose, several things should be taken into account.
- Compared to other forecasting tools in the Time Series Forecasting toolset, this tool is the simplest, and it is most appropriate for time series that follow a predictable trend that does not display strong seasonality. If your data follows a complex trend or displays strong seasonal cycles, it is recommended that you use other forecasting tools.
- Deciding how many time steps to exclude for validation is important. The more time steps are excluded, the fewer time steps remain to estimate the validation model. However, if too few time steps are excluded, the Validation RMSE is estimated using a small amount of data and may be misleading. It is recommended that you exclude as many time steps as possible while still maintaining sufficient time steps to estimate the validation model. It is also recommended that you withhold at least as many time steps for validation as the number of time steps you intend to forecast, if your space-time cube has enough time steps to allow this.
- This tool does not produce confidence intervals for the forecasted values.
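The guidance above on choosing how many time steps to exclude for validation can be turned into a small helper. This is only one possible reading of the recommendations: start from the 10 percent default, raise it to the number of time steps you intend to forecast, and cap it at the 25 percent limit. The function name is hypothetical, and rounding the percentages down is an assumption, since the rounding rule is not stated here.

```python
def suggest_validation_steps(n_time_steps, n_forecast_steps):
    """Suggest how many final time steps to withhold for validation."""
    default_steps = n_time_steps * 10 // 100  # assumed: 10 percent, rounded down
    max_steps = n_time_steps * 25 // 100      # assumed: 25 percent, rounded down
    # Withhold at least as many steps as will be forecast, within the cap.
    return min(max(default_steps, n_forecast_steps), max_steps)


print(suggest_validation_steps(40, 6))
print(suggest_validation_steps(20, 8))
```

With 40 time steps and 6 steps to forecast, the helper suggests withholding 6. With only 20 time steps and 8 steps to forecast, the 25 percent cap limits the suggestion to 5, so the cube does not have enough time steps to validate the full forecast horizon.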
For more information about forecasting using simple curve fitting, see the following textbook:
- Klosterman, R. E., Brooks, K., Drucker, J., Feser, E., & Renski, H. (2018). Planning support methods: Urban and regional analysis and projection. Rowman & Littlefield. ISBN: 1442220309