There is a general process or workflow associated with a machine learning (ML) project. A typical ML workflow begins with identifying the business problem and formulating the problem statement or question. This is followed by a series of steps, including: data preparation (or preprocessing), feature engineering, selecting a suitable algorithm and model training, hyperparameter tuning, and model evaluation. This is an iterative process and the optimal model is often only reached after multiple iterations and experiments.
Identifying the model that best fits the data takes significant time, effort, and expertise across the entire ML process. The Train Using AutoML tool automates this workflow and identifies the best algorithm, with the best set of hyperparameters, for the data. The implementation is built on the open source library mljar-supervised. The sections below describe each step in the ML process in more detail.
Train Using AutoML workflow
The Train Using AutoML tool automates the following:
- Data preprocessing—Successful ML projects require collection of high-quality input data that addresses a specific problem. This data may come from disparate data sources and may need to be combined. Once the data has been collected and synthesized, it needs to be cleaned and de-noised to ensure that ML algorithms can effectively be trained on it and learn from it. This step is typically time-consuming and tedious, and may require detailed, domain-specific knowledge and experience. Data cleaning can involve identifying and handling missing values, detecting outliers, correcting mislabeled data, and so on, all of which may require a substantial level of time and effort from the ML practitioner. The following are the preprocessing steps:
- Remove outliers—Outliers are data points that are dissimilar to the rest of the data points. Reasons for these outliers include data entry errors, data measurement errors, or a legitimate dissimilarity. Regardless of the cause, it is important to remove outliers from the data, as they have a tendency to confuse the model during the training phase. Common methods of addressing outliers are correcting the entries manually or deleting the entries from the dataset.
- Impute missing values—Some of the columns in the dataset may have missing values. Machine learning models cannot be trained if there are missing entries in the training data. To ensure that there are no missing entries, fill them with valid data using a process called data imputation. Strategies to fill these missing entries include using the most common value in the column or adding a new value that flags the entry as missing. For numerical data, strategies also include using the mean or median of the column. These imputations can be performed with the Fill Missing Values tool, which can impute not only with a global statistic from the column, but also with spatial strategies such as local neighbors and space-time neighbors, or temporal strategies such as time-series values. A minimal sketch of this and the other preprocessing steps follows this list.
- Scale and normalize the data—One of the core assumptions when training a linear machine learning model, such as ordinary least squares (OLS), is that the residuals are normally distributed; that is, they follow a bell curve.
Before using a linear model, ensure that each of the columns in the dataset is distributed normally. If any of the columns are not normally distributed, transformations are commonly applied to the data to convert it to a normal distribution.
Also ensure that all the columns in the dataset are on the same scale, so the model does not give undue importance to columns whose values fall on a larger scale. This is done by applying scaling techniques to the dataset.
While the above may be true for linear models, other algorithms such as Logistic regression, and especially the tree-based models such as Decision Tree, Extra Tree, Random Forest, XGBoost, and LightGBM, don't assume any normality and are more robust to differences in scale, skewed distributions, and so on.
- Convert categorical data to numerical—Most machine learning models require the training data to be in the form of numbers; they cannot work with other data types. Convert nonnumeric columns, such as State, Country, City, Land cover category, Construction type, or Dominant political party, to numbers. Techniques used to convert categorical data to numbers include label encoding and one-hot encoding.
- Feature engineering and feature selection—The columns that are used during the model training process are called features. Not all features are equally useful while the model learns. Some features may not be useful at all, and in such cases the model improves when those features are removed from the dataset.
Approaches such as recursive feature elimination and random feature techniques help determine the usefulness of features in the dataset, and features found to be unimportant by these approaches are generally removed.
In some instances, combining multiple features into a single feature improves the model. This process is called feature engineering.
Apart from the new features obtained by combining multiple input features, the tool also creates spatial features named zone3_id through zone7_id when the Advanced option is used. These features are generated by assigning the locations of the input training features to multiple (up to five) spatial grids of varying sizes and using the grid IDs as categorical independent variables. This provides relevant spatial information to the models and helps them learn more from the available data.
- Model training and model selection—In the model training step, the ML practitioner chooses the appropriate ML algorithm based on the problem and the characteristics of the data. They then begin the iterative process of training models to fit the data, which often includes experimenting with several different ML algorithms. Each of these algorithms may have many different hyperparameters, which are values specified manually by the ML practitioner that control how the model learns. These hyperparameters are then tuned (in other words, adjusted) in an effort to improve the performance of the algorithm and achieve better results. This is an iterative process that requires the time and expertise of the ML practitioner. The algorithms include statistical models such as Linear regression and Logistic regression, other machine learning models such as Decision Trees and Random Forests, and the more recent boosting models such as LightGBM and XGBoost. Although LightGBM and XGBoost tend to outperform most other models on almost all datasets, it is difficult to predict which models will work well on a given dataset, so you must try all of the models and compare their performance before deciding which one best fits the data. For regression problems, the best-fitting models have low MSE, RMSE, MAE, and MAPE values and high R2, Spearman, and Pearson scores; for classification problems, they have low log loss and high AUC and F1 scores.
In the majority of cases, combining multiple models into one and taking the output from this combined model outperforms the result from a single model. This step is called model ensembling.
- Hyperparameter tuning—Although most of the previous steps were iterative, the step that is often the most difficult while training machine learning models is hyperparameter tuning.
Hyperparameters can be considered as levers that come with each model. The hyperparameters used to train a Random Forest model are different from those used to train a LightGBM model. Learning about these hyperparameters will help you understand the model. A minimal sketch of a random hyperparameter search follows this list.
- Model evaluation—The final step in the ML workflow is model evaluation, where you validate that the trained and tuned ML algorithm will generalize well to data that it was not fitted on. This unseen data is often referred to as the validation or test set and is kept separate from the remainder of the data that is used to train the model. The goal of this final step is to ensure that the ML algorithm produces acceptable predictive accuracy on new data.
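As a concrete illustration of the imputation, scaling, and encoding steps described above, the following is a minimal sketch using pandas and scikit-learn. This is not the tool's internal code, and the table and column names are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training table with missing values and a categorical column.
df = pd.DataFrame({
    "sqft_living": [1180.0, 2570.0, None, 1960.0],
    "bathrooms": [1.0, 2.25, 1.0, None],
    "construction_type": ["wood", "brick", "wood", "concrete"],
})

preprocess = ColumnTransformer([
    # Fill missing numeric entries with the column median, then standardize
    # so no column dominates merely because its values are on a larger scale.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]),
     ["sqft_living", "bathrooms"]),
    # One-hot encode the categorical column into numeric indicator columns.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["construction_type"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows; 2 scaled numeric columns + 3 one-hot columns
```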
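Similarly, the random hyperparameter search mentioned in the hyperparameter tuning step can be illustrated with scikit-learn's RandomizedSearchCV. This is a sketch of the general technique on synthetic data, not the tool's implementation:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# A few of the levers that come with a Random Forest model.
param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 16),
    "min_samples_leaf": randint(1, 20),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=25,  # number of random hyperparameter combinations to try
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```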
It’s clear by looking at the entire ML workflow that there are varying degrees of human input, decision making, and choice occurring at every step.
- Was the appropriate data collected to address the problem, and is there enough of it?
- What signifies an outlier in the context of the data?
- If missing values are found, what should replace them?
- Which features should be included in the ML model?
- Which ML algorithm should be used?
- What is an acceptable level of performance for the model?
- What is the best combination of hyperparameters for a given model?
This last decision can potentially involve hundreds or even thousands of combinations of hyperparameters that can be iterated over. Add a few different feature engineering scenarios and the training and tuning of several different ML algorithms, and the entire process starts to become unmanageable and unproductive. Additionally, several of the steps in the ML workflow require expert technical understanding of data science techniques, statistics, and machine learning algorithms. As such, designing and executing ML projects can be time consuming, labor intensive, costly, and often highly dependent on trained ML practitioners and data scientists.
In the past decade, machine learning has experienced rapid growth in both the range of applications it is applied to and the amount of new research produced on it. Some of the largest driving forces behind this growth are the maturity of the ML algorithms and methods themselves, the generation and proliferation of massive volumes of data for the algorithms to learn from, the abundance of inexpensive compute to run the algorithms, and the increasing awareness among businesses that ML algorithms can address complex data structures and problems.
Many organizations want to use ML to take advantage of their data and derive actionable new insights from it, but there is an imbalance between the number of potential ML applications and the number of trained, expert ML practitioners to address them. As a result, there is an increasing demand to democratize ML across organizations by creating tools that make ML widely accessible throughout the organization and can be used off the shelf by non-ML experts and domain experts.
Recently, Automated Machine Learning (AutoML) has emerged as a way to address the massive demand for ML within organizations across all experience and skill levels. AutoML aims to create a single system to automate (in other words, remove human input from) as much of the ML workflow as possible, including data preparation, feature engineering, model selection, hyperparameter tuning, and model evaluation. In doing so, it can benefit nonexperts by lowering their barrier of entry into ML, as well as trained ML practitioners by eliminating some of the most tedious and time-consuming steps in the ML workflow.
AutoML for the non-ML expert (GIS analysts, business analysts, or data analysts who are domain experts)—The main advantage of using AutoML is that it eliminates some of the steps in the ML workflow that require the most technical expertise and understanding. Analysts who are domain experts can define their business problem and collect the appropriate data, and then let the computer learn to do the rest. They don’t need a deep understanding of data science techniques for data cleaning and feature engineering, they don’t need to know what all the ML algorithms do, and they don’t need to spend time exploring different algorithms and hyperparameter configurations. Instead, these analysts can focus on applying their domain expertise to a specific business problem or domain application, rather than on the ML workflow itself. Additionally, they can be less dependent on trained data scientists and ML engineers in their organization because they can build and use advanced models on their own, often without any coding experience required.
AutoML for the ML expert (data scientists or ML engineers)—AutoML can also be highly beneficial to ML experts, although the reasons may be less obvious. For one, ML experts do not have to spend as much time supporting the domain experts in their organization and can therefore focus on their own, more advanced ML work. For their own ML projects, AutoML can be a significant time saver and productivity booster. Many of the time-consuming steps in the ML workflow, such as data cleaning, feature engineering, model selection, and hyperparameter tuning, can be automated. The time saved by automating many of these repetitive, exploratory steps can be shifted to more advanced technical tasks or to tasks that require more human input (for example, collaborating with domain experts, understanding the business problem, or interpreting the ML results).
In addition to its time-saving aspects, AutoML can also help boost the productivity of ML practitioners because it eliminates some of the subjective choice and experimentation involved in the ML workflow. For example, an ML practitioner approaching a new project may have the training and expertise to guide them on which new features to construct, which ML algorithm might be best for a particular problem, and which hyperparameters could be most optimal. However, they may overlook the construction of certain new features or fail to try all the possible combinations of hyperparameters while performing the ML workflow. Additionally, the ML practitioner may bias the feature selection process or choice of algorithm because they prefer a particular ML algorithm based on their previous work or its success in other ML applications they’ve seen. In reality, no single ML algorithm performs best on all datasets; some ML algorithms are more sensitive than others to the selection of hyperparameters, and many business problems have varying degrees of complexity and requirements for interpretability from the ML algorithms used to solve them. AutoML can help reduce some of this human bias by applying many different ML algorithms to the same dataset and determining which one performs best.
For the ML practitioner, AutoML can also serve as an initial starting point or benchmark in an ML project. They can use it to automatically develop a baseline model for a dataset, which can give them a set of preliminary insights into a particular problem. From here, they may decide to add or remove specific features from the input dataset, or focus on a specific ML algorithm and fine-tune its hyperparameters. In this sense, AutoML can be viewed as a means of narrowing down the set of initial choices for a trained ML practitioner, so they can focus on improving the performance of the ML system overall. This is a very commonly used workflow in practice, where ML experts will develop a data-driven benchmark using AutoML and then build on this benchmark by incorporating their expertise to refine the results.
In the end, democratizing ML via AutoML in an organization allows domain experts to focus their attention on the business problem and obtain actionable results, allows more analysts to build better models, and can reduce the number of ML experts that the organization needs to hire. It can also help boost the productivity of trained ML practitioners and data scientists, allowing them to focus their expertise on the multitude of other tasks where it is needed most.
In summary, the Train Using AutoML tool performs the following steps to identify the best model (a sketch using the underlying library follows this list):
- Simple algorithms, such as Decision Tree (a simple tree with a maximum depth of 4) and linear models, are attempted. This provides a quick look at the data and the results to expect.
- The algorithms selected when AutoML was initiated are trained with their default hyperparameters. In this step, the data is run through a set of more complex algorithms, with one model fit per available algorithm. The available algorithms are Linear, Random Forest, XGBoost, LightGBM, Decision Tree, and Extra Trees.
- A random search is conducted over the hyperparameter space of each algorithm to find the optimal set of hyperparameters.
- New features are constructed using Golden Features. The new features that have predictive power are added to the original dataset. The best set of hyperparameters identified in the previous step is used in this step.
- Feature selection is performed. The optimal set of hyperparameters determined during the random search is used to train one best-performing model for each algorithm, and the least important features are then identified and removed.
- The results from the best models trained so far are ensembled.
- The models are stacked and the results of the best models (including stacked) are combined.
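Because the tool is built on mljar-supervised, a roughly equivalent run of these steps with the underlying open source library looks like the following sketch. The file and column names are hypothetical, and argument names may vary across library versions:

```python
import pandas as pd
from supervised.automl import AutoML  # mljar-supervised

# Hypothetical table: predictors in X, the variable to predict in y.
df = pd.read_csv("houses.csv")
X, y = df.drop(columns=["price"]), df["price"]

# "Compete" mode runs the full pipeline outlined above: simple baselines,
# default models, random hyperparameter search, Golden Features, feature
# selection, ensembling, and stacking.
automl = AutoML(
    mode="Compete",
    algorithms=["Linear", "Decision Tree", "Extra Trees",
                "Random Forest", "Xgboost", "LightGBM"],
    eval_metric="rmse",
    total_time_limit=3600,  # seconds
)
automl.fit(X, y)
predictions = automl.predict(X)
```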
Model ensemble—An ensemble is a collection of models whose predictions are combined by weighted averaging or voting.
The most common strategies used for ensembling are bagging (Random Forest is a classic example of bagging) and boosting (XGBoost is an example of boosting), which combine the outputs of models belonging to the same algorithm. Recent techniques, such as the one described in Ensemble Selection from Libraries of Models by Caruana et al., combine diverse models. Use the following steps to create an ensemble (a minimal sketch follows this list):
- Start with the empty ensemble.
- Add the model from the library that most improves the ensemble’s performance on the error metric, evaluated on a validation set.
- Repeat the previous step for a fixed number of iterations or until all the models have been used.
- Return the ensemble from the nested set of ensembles that has maximum performance on the validation set.
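A minimal sketch of this greedy procedure, with hypothetical validation predictions and RMSE as the error metric to minimize:

```python
import numpy as np

def ensemble_selection(predictions, y_val, metric, iterations=50):
    """Greedy ensemble selection (Caruana et al., 2004).

    predictions: dict mapping model name -> predictions on the validation set.
    metric: error function to minimize, such as RMSE.
    """
    ensemble = []  # models chosen so far (a model can be added repeatedly)
    best_ensemble, best_score = [], float("inf")
    for _ in range(iterations):
        # Try adding each model; keep the one that most improves the ensemble.
        scores = {
            name: metric(y_val, np.mean([predictions[m]
                                         for m in ensemble + [name]], axis=0))
            for name in predictions
        }
        choice = min(scores, key=scores.get)
        ensemble.append(choice)
        # Track the best ensemble across the nested set of ensembles.
        if scores[choice] < best_score:
            best_score, best_ensemble = scores[choice], list(ensemble)
    return best_ensemble, best_score

rmse = lambda y, p: float(np.sqrt(np.mean((y - p) ** 2)))
y_val = np.array([1.0, 2.0, 3.0])
preds = {"lightgbm": np.array([1.1, 2.2, 2.9]),
         "xgboost": np.array([0.8, 1.9, 3.2]),
         "linear": np.array([1.4, 2.5, 3.5])}
print(ensemble_selection(preds, y_val, rmse, iterations=10))
```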
Model stacking—Model stacking also combines the results of multiple models to derive the final result.
While ensemble approaches combine the results of various models by giving different weights to the outputs, stacking uses the outputs of each of the base models as features and adds them to a higher-level model, a meta model. The output from the higher-level meta model is used as the final output.
Ensemble stack—To improve performance, model stacks are created and their outputs are combined to form an ensemble stack.
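For illustration, a minimal stacking sketch with scikit-learn on synthetic data (not the tool's internal code):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# Base models whose out-of-fold predictions become features for the meta model.
base_models = [
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
]

# The higher-level meta model (here a ridge regression) learns how to combine
# the base model outputs into the final prediction.
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge())
stack.fit(X, y)
print(stack.predict(X[:3]))
```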
Interpret the output reports
The Train Using AutoML tool can generate an HTML report as an output.
The main page of the report shows the leaderboard. The same information is also available in the tool output window.
The leaderboard shows the evaluated models and their metric values. For this regression problem, the model with the lowest RMSE is considered the best model (in this case, the ensemble model). For classification problems, a different metric (for example, log loss) is used.
The AutoML performance boxplot compares the evaluation metric across the different models, with the evaluation metric (RMSE) on the y-axis and the models on the x-axis, as shown below. The boxplot shows that the best model is the ensemble model, which has the lowest RMSE.
The AutoML performance chart shows how the evaluation metric for the best-performing model, which in this case is the ensemble model, varies across different iterations. The iteration chart helps you understand how consistent the model is across different runs.
Next, the Spearman correlation is generated for all the models that were evaluated, and models that are more closely related are shown in darker shades of blue. For example, the outputs from LightGBM and XGBoost are the most closely related (this is expected, as both are boosting algorithms). The output from LightGBM is more similar to the output from Random Forest than to the output from Extra Trees.
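A correlation of this kind can be reproduced outside the report by comparing the models' predictions directly. A minimal sketch with SciPy and hypothetical predictions:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical validation-set predictions from three models.
preds = {
    "LightGBM": np.array([1.1, 2.2, 2.9, 4.3]),
    "XGBoost": np.array([1.0, 2.3, 3.0, 4.1]),
    "Extra Trees": np.array([1.5, 1.9, 3.6, 3.8]),
}

# Pairwise Spearman rank correlation between model outputs.
for a in preds:
    for b in preds:
        rho, _ = spearmanr(preds[a], preds[b])
        print(f"{a} vs {b}: {rho:.2f}")
```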
From the leaderboard, clicking the link for any of the models in the Name column takes you to a page that shows hyperparameters that were finalized for the model training after completing the hyperparameter tuning. In this example, the decision tree was trained with the max_depth value of 3.
The same page also shows other metrics apart from the one that was used for evaluation. In the example below, which was a regression task, you can see the MAE, MSE, R2, and MAPE metrics and the RMSE metric, which was used for model evaluation.
You can also see the scatterplot of actual and predicted outputs (here for a sample of 5,000 data points).
You can use this chart to assess the performance of the model. The figures above compare the scatterplots of two models obtained from the report. The second model performs better than the first, in which the predicted and true values diverge more.
Finally, to make models more explainable, the importance of each variable in the final model is also included in the report (similar to feature importance in sklearn). Unlike sklearn's feature importance, this plot can be generated for nontree models as well. This model explainability for non-tree-based models is provided by SHapley Additive exPlanations (SHAP).
SHAP is a game theoretic approach to explaining the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. More details about SHAP and its implementation are available in the SHAP project documentation. SHAP output is available only with the Basic option.
In the plot below, you can visualize the global impact of each variable in the housing dataset on the trained model. It shows that the sqft_living feature influences the model the most and is the most important feature, followed by the number of bathrooms and bedrooms, when predicting house prices.
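A minimal sketch of producing a comparable SHAP summary plot with the shap package, using a public housing dataset as a stand-in for the data in the report:

```python
import shap
import xgboost
from sklearn.datasets import fetch_california_housing

# Stand-in housing data; the report's dataset and features are different.
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target

model = xgboost.XGBRegressor(n_estimators=200).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree models;
# shap.Explainer chooses a suitable algorithm for other model types.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global summary of each variable's impact on the predictions.
shap.summary_plot(shap_values, X)
```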
With the Basic option, you also have the option to see the Automatic Exploratory Data Analysis Report (EDA report), which prints out summary statistics for the target variable and the predictors used in model training. The following is a sample EDA report for a variable named altclip.tif:
Caruana, Rich, et al. "Ensemble Selection from Libraries of Models." Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. http://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf.