Train Deep Learning Model (Image Analyst)—ArcGIS Pro

Available with Image Analyst license.

Summary

Trains a deep learning model using the output from the Export Training Data For Deep Learning tool.

Usage

This tool trains a deep learning model using deep learning frameworks.
To set up your machine to use deep learning frameworks in ArcGIS Pro, see Install deep learning frameworks for ArcGIS.
If you will be training models in a disconnected environment, see Additional Installation for Disconnected Environment for more information.
This tool can also be used to fine-tune an existing trained model. For example, an existing model that has been trained for cars can be fine-tuned to train a model that identifies trucks.
To run this tool using GPU, set the Processor Type environment to GPU. If you have more than one GPU, specify the GPU ID environment instead.
By default, the tool uses all available GPUs when the Model Type parameter is set to one of the following options:
- ConnectNet
- Feature classifier
- MaskRCNN
- Multi Task Road Extractor
- Single Shot Detector
- U-Net
To use a specific GPU, use the GPU ID environment.
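The GPU settings described above can be sketched in Python. This is a minimal illustration, not the tool's own code: the helper names gpu_environment and apply_environment are hypothetical, and it assumes that processorType and gpuId are the arcpy.env names for the Processor Type and GPU ID environments. The settings are applied only when arcpy can be imported.

```python
def gpu_environment(gpu_id=None):
    """Return the environment settings for GPU training as a plain dict.

    gpu_id=None leaves the GPU ID unset so the tool can use its default.
    """
    settings = {"processorType": "GPU"}
    if gpu_id is not None:
        if gpu_id < 0:
            raise ValueError("gpu_id must be zero or a positive integer")
        settings["gpuId"] = gpu_id
    return settings


def apply_environment(settings):
    """Apply the settings to arcpy.env; return False when arcpy is unavailable."""
    try:
        import arcpy  # present only in an ArcGIS Pro Python environment
    except ImportError:
        return False
    for name, value in settings.items():
        # Assumption: these attribute names match the environments named above.
        setattr(arcpy.env, name, value)
    return True


# Request GPU processing on the device with ID 1 (a hypothetical choice).
settings = gpu_environment(gpu_id=1)
applied = apply_environment(settings)
```

Keeping the settings in a plain dictionary makes them easy to inspect or log before they are applied.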
The input training data for this tool must include the images and labels folders that are generated from the Export Training Data For Deep Learning tool.
The exception to this is when the training data uses the Pascal Visual Object Classes or the KITTI rectangles metadata formats. For these two formats, the training data can come from other sources, but the image chips must be in the image folder, and the corresponding labels must be in the labels folder.
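Before running the tool, it can help to confirm that a training folder has the expected layout. The following is a small sketch using only the Python standard library. The function name missing_subfolders is hypothetical, and the folder names images and labels are taken from the description above; other files that the export tool writes are not checked.

```python
from pathlib import Path
import tempfile

# Folder names taken from the usage note above.
REQUIRED_SUBFOLDERS = ("images", "labels")


def missing_subfolders(training_folder):
    """Return the required subfolder names that are absent from training_folder."""
    root = Path(training_folder)
    return [name for name in REQUIRED_SUBFOLDERS if not (root / name).is_dir()]


# Demonstration on a temporary folder that has images but no labels.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "images").mkdir()
    demo_missing = missing_subfolders(tmp)
```

An empty list means both folders are present.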

Specify fastai transforms for data augmentation of training and validation datasets using the transforms.json file, which is in the same folder as the training data. The following is an example of a transforms.json file:

Custom augmentation parameters


{
    "Training": {
        "rotate": {
            "degrees": 30,
            "p": 0.5
        },
        "crop": {
            "size": 224,
            "p": 1,
            "row_pct": "0, 1",
            "col_pct": "0, 1"
        },
        "brightness": {
            "change": "0.4, 0.6"
        },
        "contrast": {
            "scale": "1.0, 1.5"
        },
        "rand_zoom": {
            "scale": "1, 1.2"
        }
    },
    "Validation": {
        "crop": {
            "size": 224,
            "p": 1.0,
            "row_pct": 0.5,
            "col_pct": 0.5
        }
    }
}
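A file like the one above can also be generated from Python. The sketch below writes a transforms.json file into a training folder with the standard json module. The function name write_transforms is hypothetical, and the check that only the Training and Validation sections appear is an assumption based on the example, not a documented rule.

```python
import json
import tempfile
from pathlib import Path

# Section names as they appear in the example file above.
ALLOWED_SECTIONS = {"Training", "Validation"}


def write_transforms(training_folder, transforms):
    """Check the section names, then write transforms.json into training_folder."""
    unknown = set(transforms) - ALLOWED_SECTIONS
    if unknown:
        raise ValueError(f"Unexpected sections: {sorted(unknown)}")
    target = Path(training_folder) / "transforms.json"
    target.write_text(json.dumps(transforms, indent=4), encoding="utf-8")
    return target


# A shortened version of the example configuration.
transforms = {
    "Training": {
        "rotate": {"degrees": 30, "p": 0.5},
        "brightness": {"change": "0.4, 0.6"},
    },
    "Validation": {
        "crop": {"size": 224, "p": 1.0, "row_pct": 0.5, "col_pct": 0.5},
    },
}

# Write to a temporary folder and read the file back to confirm it round-trips.
with tempfile.TemporaryDirectory() as tmp:
    path = write_transforms(tmp, transforms)
    round_trip = json.loads(path.read_text(encoding="utf-8"))
```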

For information about requirements for running this tool and issues you may encounter, see Deep Learning frequently asked questions.
For more information about deep learning, see Deep learning using the ArcGIS Image Analyst extension.

Parameters

Label	Explanation	Data Type
Input Training Data	The folders containing the image chips, labels, and statistics required to train the model. This is the output from the Export Training Data For Deep Learning tool. Multiple input folders are supported when the following conditions are met: The metadata format type must be classified tiles, labeled tiles, multilabeled tiles, Pascal Visual Object Classes, or RCNN masks. All training data must have the same metadata format. All training data must have the same number of bands.	Folder
Output Folder	The output folder location where the trained model will be stored.	Folder
Max Epochs (Optional)	The maximum number of epochs for which the model will be trained. A maximum epoch of 1 means the dataset will be passed forward and backward through the neural network one time. The default value is 20.	Long
Model Type (Optional)	Specifies the model type that will be used to train the deep learning model.
- BDCN Edge Detector (Pixel classification): The Bi-Directional Cascade Network (BDCN) architecture will be used to train the model. BDCN Edge Detector is used for pixel classification. This approach is useful to improve edge detection for objects at different scales.
- Change detector (Pixel classification): The Change detector architecture will be used to train the model. Change detector is used for pixel classification. This approach creates a model object that uses two spatial-temporal images to create a classified raster of the change. The input training data for this model type uses the Classified Tiles metadata format.
- ClimaX (Pixel classification): The ClimaX architecture will be used to train the model. This model is primarily used for weather and climate analysis. ClimaX is used for pixel classification. The preliminary data used for this method is multidimensional data.
- ConnectNet (Pixel classification): The ConnectNet architecture will be used to train the model. ConnectNet is used for pixel classification. This approach is useful for road network extraction from satellite imagery.
- CycleGAN (Image translation): The CycleGAN architecture will be used to train the model. CycleGAN is used for image-to-image translation. This approach creates a model object that translates images from one type to another. This approach is unique in that the images to be trained do not need to overlap. The input training data for this model type uses the CycleGAN metadata format.
- DeepLabV3 (Pixel classification): The DeepLabV3 architecture will be used to train the model. DeepLab is used for pixel classification.
- Deep Sort (Object tracker): The Deep Sort architecture will be used to train the model. Deep Sort is used for object detection in videos. The model is trained using frames of the video and detects the classes and bounding boxes of the objects in each frame. The input training data for this model type uses the Imagenet metadata format. While Siam Mask is useful for tracking an object, Deep Sort is useful in training a model to track multiple objects.
- DETReg (Object detection): The DETReg architecture will be used to train the model. DETReg is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format. This model type is GPU intensive; it requires a dedicated GPU with at least 16 GB of memory to run properly.
- FasterRCNN (Object detection): The FasterRCNN architecture will be used to train the model. FasterRCNN is used for object detection.
- Feature classifier (Object classification): The Feature classifier architecture will be used to train the model. Feature Classifier is used for object or image classification.
- HED Edge Detector (Pixel classification): The Holistically-Nested Edge Detection (HED) architecture will be used to train the model. HED Edge Detector is used for pixel classification. This approach is useful for edge and object boundary detection.
- Image captioner (Image translation): The Image captioner architecture will be used to train the model. Image captioner is used for image-to-text translation. This approach creates a model that generates text captions for an image.
- MaskRCNN (Object detection): The MaskRCNN architecture will be used to train the model. MaskRCNN is used for object detection. This approach is used for instance segmentation, which is precise delineation of objects in an image. This model type can be used to detect building footprints. It uses the MaskRCNN metadata format for training data as input. Class values for input training data must start at 1. This model type can only be trained using a CUDA-enabled GPU.
- MaX-DeepLab (Panoptic segmentation): The MaX-DeepLab architecture will be used to train the model. MaX-DeepLab is used for panoptic segmentation. This approach creates a model object that generates images and features. The input training data for this model type uses the Panoptic segmentation metadata format.
- MMDetection (Object detection): The MMDetection architecture will be used to train the model. MMDetection is used for object detection. The supported metadata formats are Pascal Visual Object Class rectangles and KITTI rectangles.
- MMSegmentation (Pixel classification): The MMSegmentation architecture will be used to train the model. MMSegmentation is used for pixel classification. The supported metadata format is Classified Tiles.
- Multi Task Road Extractor (Pixel classification): The Multi Task Road Extractor architecture will be used to train the model. Multi Task Road Extractor is used for pixel classification. This approach is useful for road network extraction from satellite imagery.
- Pix2Pix (Image translation): The Pix2Pix architecture will be used to train the model. Pix2Pix is used for image-to-image translation. This approach creates a model object that translates images from one type to another. The input training data for this model type uses the Export Tiles metadata format.
- Pix2PixHD (Image translation): The Pix2PixHD architecture will be used to train the model. Pix2PixHD is used for image-to-image translation. This approach creates a model object that translates images from one type to another. The input training data for this model type uses the Export Tiles metadata format.
- PSETAE (Pixel classification): The Pixel-Set Encoders and Temporal Self-Attention (PSETAE) architecture will be used to train the model for time series classification. PSETAE is used for pixel classification. The preliminary data used for this method is multidimensional data.
- Pyramid Scene Parsing Network (Pixel classification): The Pyramid Scene Parsing Network (PSPNET) architecture will be used to train the model. PSPNET is used for pixel classification.
- RetinaNet (Object detection): The RetinaNet architecture will be used to train the model. RetinaNet is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format.
- SAMLoRA (Pixel classification): The Segment Anything Model (SAM) with Low-Rank Adaptation (LoRA) will be used to train the model. This model type uses SAM as a foundation model and fine-tunes it for a specific task with relatively low computing requirements and a smaller dataset.
- Siam Mask (Object tracker): The Siam Mask architecture will be used to train the model. Siam Mask is used for object detection in videos. The model is trained using frames of the video and detects the classes and bounding boxes of the objects in each frame. The input training data for this model type uses the MaskRCNN metadata format.
- Single Shot Detector (Object detection): The Single Shot Detector (SSD) architecture will be used to train the model. SSD is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format.
- Super-resolution (Image translation): The Super-resolution architecture will be used to train the model. Super-resolution is used for image-to-image translation. This approach creates a model object that increases the resolution and improves the quality of images. The input training data for this model type uses the Export Tiles metadata format.
- U-Net (Pixel classification): The U-Net architecture will be used to train the model. U-Net is used for pixel classification.
- YOLOv3 (Object detection): The YOLOv3 architecture will be used to train the model. YOLOv3 is used for object detection.	String
Batch Size (Optional)	The number of training samples that will be processed for training at one time. Increasing the batch size can improve tool performance; however, as the batch size increases, more memory is used. When not enough GPU memory is available for the batch size set, the tool tries to estimate and use an optimum batch size. If an out of memory error occurs, use a smaller batch size.	Long
Model Arguments (Optional)	The information from the Model Type parameter will be used to populate this parameter. These arguments vary, depending on the model architecture. The supported model arguments for models trained in ArcGIS are described below. ArcGIS pretrained models and custom deep learning models may have additional arguments that the tool supports. For more information about which arguments are available for each model type, see Deep learning arguments.	Value Table
Learning Rate (Optional)	The rate at which existing information will be overwritten with newly acquired information throughout the training process. If no value is specified, the optimal learning rate will be extracted from the learning curve during the training process.	Double
Backbone Model (Optional)	Specifies the preconfigured neural network that will be used as the architecture for training the new model. This method is known as Transfer Learning. Additionally, supported convolutional neural networks from the PyTorch Image Models (timm) can be specified using timm as a prefix, for example, timm:resnet31, timm:inception_v4, timm:efficientnet_b3, and so on.
- 1.40625 degrees: This backbone was trained on imagery in which the resolution of each grid cell covers an area of 1.40625 degrees by 1.40625 degrees. This is used for weather and climate predictions. This is a higher-resolution setting that allows for more precise outputs but requires more computational power.
- 5.625 degrees: This backbone was trained on imagery in which the resolution of each grid cell covers an area of 5.625 degrees by 5.625 degrees. This is used for weather and climate predictions. This is considered a low-resolution setting but requires less computational power.
- DenseNet-121: The preconfigured model will be a dense network trained on the ImageNet dataset that contains more than 1 million images and is 121 layers deep. Unlike ResNet, which combines layers using summation, DenseNet combines layers using concatenation.
- DenseNet-161: The preconfigured model will be a dense network trained on the ImageNet dataset that contains more than 1 million images and is 161 layers deep. Unlike ResNet, which combines layers using summation, DenseNet combines layers using concatenation.
- DenseNet-169: The preconfigured model will be a dense network trained on the ImageNet dataset that contains more than 1 million images and is 169 layers deep. Unlike ResNet, which combines layers using summation, DenseNet combines layers using concatenation.
- DenseNet-201: The preconfigured model will be a dense network trained on the ImageNet dataset that contains more than 1 million images and is 201 layers deep. Unlike ResNet, which combines layers using summation, DenseNet combines layers using concatenation.
- MobileNet version 2: The preconfigured model will be trained on the ImageNet database, is 54 layers deep, and is intended for edge device computing because it uses less memory.
- ResNet-18: The preconfigured model will be a residual network trained on the ImageNet dataset that contains more than 1 million images and is 18 layers deep.
- ResNet-34: The preconfigured model will be a residual network trained on the ImageNet dataset that contains more than 1 million images and is 34 layers deep. This is the default.
- ResNet-50: The preconfigured model will be a residual network trained on the ImageNet dataset that contains more than 1 million images and is 50 layers deep.
- ResNet-101: The preconfigured model will be a residual network trained on the ImageNet dataset that contains more than 1 million images and is 101 layers deep.
- ResNet-152: The preconfigured model will be a residual network trained on the ImageNet dataset that contains more than 1 million images and is 152 layers deep.
- VGG-11: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset that contains more than 1 million images to classify images into 1,000 object categories and is 11 layers deep.
- VGG-11 with batch normalization: The preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the ImageNet dataset and has 11 layers.
- VGG-13: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset that contains more than 1 million images to classify images into 1,000 object categories and is 13 layers deep.
- VGG-13 with batch normalization: The preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the ImageNet dataset and has 13 layers.
- VGG-16: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset that contains more than 1 million images to classify images into 1,000 object categories and is 16 layers deep.
- VGG-16 with batch normalization: The preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the ImageNet dataset and has 16 layers.
- VGG-19: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset that contains more than 1 million images to classify images into 1,000 object categories and is 19 layers deep.
- VGG-19 with batch normalization: The preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the ImageNet dataset and has 19 layers.
- DarkNet-53: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset that contains more than 1 million images and is 53 layers deep.
- Reid_v1: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset that is used for object tracking.
- Reid_v2: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset that is used for object tracking.
- ResNeXt-50: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset and is 50 layers deep. It is a homogeneous neural network, which reduces the number of hyperparameters required by conventional ResNet.
- Wide ResNet-50: The preconfigured model will be a convolutional neural network trained on the ImageNet dataset and is 50 layers deep. It has the same architecture as ResNet but with more channels.
- SR3: The preconfigured model will use the Super Resolution via Repeated Refinement (SR3) model. SR3 adapts denoising diffusion probabilistic models to conditional image generation and performs super-resolution through a stochastic denoising process. For more information, see Image Super-Resolution via Iterative Refinement on the arXiv site.
- SR3 U-ViT: This backbone model refers to a specific implementation of Vision Transformer (ViT)-based architecture designed for diffusion models within image generation and SR3 tasks.
- ViT-B: The preconfigured Segment Anything Model (SAM) will be used with a base neural network size. This is the smallest size. For more information, see Segment Anything on the arXiv site.
- ViT-L: The preconfigured Segment Anything Model (SAM) will be used with a large neural network size. For more information, see Segment Anything on the arXiv site.
- ViT-H: The preconfigured Segment Anything Model (SAM) will be used with a huge neural network size. This is the largest size. For more information, see Segment Anything on the arXiv site.	String
Pre-trained Model (Optional)	A pretrained model that will be used to fine-tune the new model. The input is an Esri model definition file (.emd) or a deep learning package file (.dlpk). A pretrained model with similar classes can be fine-tuned to fit the new model. The pretrained model must have been trained with the same model type and backbone model that will be used to train the new model.	File
Validation % (Optional)	The percentage of training samples that will be used for validating the model. The default value is 10.	Double
Stop when model stops improving (Optional)	Specifies whether early stopping will be implemented. Checked—Early stopping will be implemented, and the model training will stop when the model is no longer improving, regardless of the Max Epochs parameter value specified. This is the default. Unchecked—Early stopping will not be implemented, and the model training will continue until the Max Epochs parameter value is reached.	Boolean
Freeze Model (Optional)	Specifies whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. Checked—The backbone layers will be frozen, and the predefined weights and biases will not be altered in the Backbone Model parameter. This is the default. Unchecked—The backbone layers will not be frozen, and the weights and biases of the Backbone Model parameter can be altered to fit the training samples. This takes more time to process but typically produces better results.	Boolean
Data Augmentation (Optional)	Specifies the type of data augmentation that will be used. Data augmentation is a technique of artificially increasing the training set by creating modified copies of a dataset using existing data.
- Default: The default data augmentation methods and values will be used. The default data augmentation methods are crop, dihedral_affine, brightness, contrast, and zoom. These default values typically work well for satellite imagery.
- None: No data augmentation will be used.
- Custom: Data augmentation values will be specified using the Augmentation Parameters parameter.
- File: Fastai transforms for data augmentation of training and validation datasets will be specified using the transforms.json file, which is in the same folder as the training data. For more information about the various transformations, see vision transforms on the fastai website.	String
Augmentation Parameters (Optional)	Specifies the value for each transform in the augmentation parameter.
- rotate: The image will be randomly rotated (in degrees) with a probability (p). If degrees is a range (a,b), a value will be uniformly assigned from a to b. The default value is 30.0; 0.5.
- brightness: The brightness of the image will be randomly adjusted depending on the value of change, with a probability (p). A change of 0 will transform the image to darkest, and a change of 1 will transform the image to lightest. A change of 0.5 will not adjust the brightness. If change is a range (a,b), the augmentation will uniformly assign a value from a to b. The default value is (0.4,0.6); 1.0.
- contrast: The contrast of the image will be randomly adjusted depending on the value of scale, with a probability (p). A scale of 0 will transform the image to gray scale, and a scale greater than 1 will transform the image to super contrast. A scale of 1 doesn't adjust the contrast. If scale is a range (a,b), the augmentation will uniformly assign a value from a to b. The default value is (0.75, 1.5); 1.0.
- zoom: The image will be randomly zoomed in depending on the value of scale. The zoom value is in the form scale(a,b); p, in which p is the probability. The default value is (1.0, 1.2); 1.0. Only a scale of greater than 1.0 will zoom in on the image. If scale is a range (a,b), it will uniformly assign a value from a to b.
- crop: The image will be randomly cropped. The crop value is in the form size;p;row_pct;col_pct in which p is probability. The position is given by (col_pct, row_pct), with col_pct and row_pct being normalized between 0 and 1. If col_pct or row_pct is a range (a,b), it will uniformly assign a value from a to b. The default value is chip_size;1.0; (0, 1); (0, 1) in which 224 is the default chip size.	Value Table
Chip Size (Optional)	The size of the image that will be used to train the model. Images will be cropped to the specified chip size. The default chip size will be the same as the tile size of the training data. If the x- and y- tile size are different, the smaller value will be used as the default chip size. The chip size should be less than the smallest x- or y- tile size of all images in the input folders.	Long
Resize To (Optional)	Resizes the image chips. Once a chip is resized, pixel blocks of chip size will be cropped and used for training. This parameter applies to object detection (PASCAL VOC), object classification (labeled tiles), and super-resolution data only. The resize value is often half the chip size value. If the resize value is less than the chip size value, the resize value is used to create the pixel blocks for training.	String
Weight Initialization Scheme (Optional)	Specifies the scheme in which the weights will be initialized for the layer. To train a model with multispectral data, the model must accommodate the various types of bands available. This is done by reinitializing the first layer in the model. This parameter is only applicable when multispectral imagery is used in the model. Random—Random weights will be initialized for non-RGB bands, while pretrained weights will be preserved for RGB bands. This is the default. Red band—Weights corresponding to the red band from the pretrained model's layer will be cloned for non-RGB bands, while pretrained weights will be preserved for RGB bands. All random— Random weights will be initialized for RGB bands as well as non-RGB bands. This option applies only to multispectral imagery.	String
Monitor Metric (Optional)	Specifies the metric that will be monitored while checkpointing and early stopping.
- Validation loss: The validation loss will be monitored. When the validation loss no longer changes significantly, training will stop. This is the default.
- Average precision: The weighted mean of precision at each threshold will be monitored. When this value no longer changes significantly, training will stop.
- Accuracy: The ratio of the number of correct predictions to the total number of predictions will be monitored. When this value no longer changes significantly, training will stop.
- F1-Score: The combination of the precision and recall scores of the model will be monitored. When this value no longer changes significantly, training will stop.
- MIoU: The average of the intersection over union (IoU) of the segmented objects over all the images of the test dataset will be monitored. When this value no longer changes significantly, training will stop.
- Dice: Model performance will be monitored using the Dice metric. When this value no longer changes significantly, training will stop. This value can range from 0 to 1. The value 1 corresponds to a pixel-perfect match between the validation data and training data.
- Precision: The precision, which measures the model's accuracy in classifying a sample as positive, will be monitored. When this value no longer changes significantly, training will stop. The precision is the ratio of the number of positive samples correctly classified to the total number of samples classified as positive (either correctly or incorrectly).
- Recall: The recall, which measures the model's ability to detect positive samples, will be monitored. When this value no longer changes significantly, training will stop. The higher the recall, the more positive samples are detected. The recall value is the ratio of the number of positive samples correctly classified as positive to the total number of positive samples.
- Corpus bleu: The Corpus BLEU score will be monitored. When this value no longer changes significantly, training will stop. This score is used to calculate accuracy for multiple sentences, such as a paragraph or a document.
- Multi label F-beta: The weighted harmonic mean of precision and recall will be monitored. When this value no longer changes significantly, training will stop. This is often referred to as the F-beta score.	String
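Several of the monitor metrics in the table above are ratios of the same four counts. The following sketch computes them for a single binary class using the standard textbook definitions. The function name classification_metrics and the sample counts are hypothetical, and the way the tool aggregates these values across classes, images, or thresholds is not shown on this page.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from true/false positive/negative counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero when a count is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)  # correct positives / all predicted positives
    recall = ratio(tp, tp + fn)     # correct positives / all actual positives
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "iou": ratio(tp, tp + fp + fn),           # intersection over union
        "dice": ratio(2 * tp, 2 * tp + fp + fn),  # ranges from 0 to 1
    }


# Hypothetical counts: 8 true positives, 2 false positives,
# 2 false negatives, and 88 true negatives.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=88)
```

With these counts, precision and recall are both 8/10, so F1 and Dice agree with them, while IoU is lower at 8/12 because it penalizes both kinds of error in a single ratio.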

Derived Output

Label	Explanation	Data Type
Output Model	The output trained model file.	File

TrainDeepLearningModel(in_folder, out_folder, {max_epochs}, {model_type}, {batch_size}, {arguments}, {learning_rate}, {backbone_model}, {pretrained_model}, {validation_percentage}, {stop_training}, {freeze}, {augmentation}, {augmentation_parameters}, {chip_size}, {resize_to}, {weight_init_scheme}, {monitor})
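The signature above could be called from a script along the following lines. This is a sketch under stated assumptions: the helper names build_training_arguments and train_if_available are hypothetical, the folder paths are placeholders, and the call assumes the tool is exposed as arcpy.ia.TrainDeepLearningModel and accepts keyword arguments named as in the signature. The argument check runs anywhere; the tool itself runs only where arcpy and an Image Analyst license are available.

```python
# Optional parameter names copied from the signature above.
OPTIONAL_PARAMETERS = {
    "max_epochs", "model_type", "batch_size", "arguments", "learning_rate",
    "backbone_model", "pretrained_model", "validation_percentage",
    "stop_training", "freeze", "augmentation", "augmentation_parameters",
    "chip_size", "resize_to", "weight_init_scheme", "monitor",
}


def build_training_arguments(in_folder, out_folder, **optional):
    """Assemble keyword arguments, rejecting names the signature does not list."""
    unknown = set(optional) - OPTIONAL_PARAMETERS
    if unknown:
        raise TypeError(f"Unknown parameters: {sorted(unknown)}")
    return {"in_folder": in_folder, "out_folder": out_folder, **optional}


def train_if_available(kwargs):
    """Run the tool when arcpy can be imported; otherwise return None."""
    try:
        import arcpy  # present only in an ArcGIS Pro Python environment
    except ImportError:
        return None
    arcpy.CheckOutExtension("ImageAnalyst")
    return arcpy.ia.TrainDeepLearningModel(**kwargs)


# Placeholder paths; replace them with real folders before training.
kwargs = build_training_arguments(
    r"C:\data\training_chips",
    r"C:\data\trained_model",
    max_epochs=20,
    model_type="UNET",
    validation_percentage=10,
)
```

Calling train_if_available(kwargs) would then start training. It is left out of the sketch so that the example does not launch a long-running job by accident.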

Name	Explanation	Data Type
in_folder [in_folder,...]	The folders containing the image chips, labels, and statistics required to train the model. This is the output from the Export Training Data For Deep Learning tool. Multiple input folders are supported when the following conditions are met: The metadata format type must be classified tiles, labeled tiles, multilabeled tiles, Pascal Visual Object Classes, or RCNN masks. All training data must have the same metadata format. All training data must have the same number of bands.	Folder
out_folder	The output folder location where the trained model will be stored.	Folder
max_epochs (Optional)	The maximum number of epochs for which the model will be trained. A maximum epoch of 1 means the dataset will be passed forward and backward through the neural network one time. The default value is 20.	Long
model_type (Optional)	Specifies the model type that will be used to train the deep learning model. BDCN_EDGEDETECTOR— The Bi-Directional Cascade Network (BDCN) architecture will be used to train the model. BDCN Edge Detector is used for pixel classification. This approach is useful to improve edge detection for objects at different scales. CHANGEDETECTOR—The Change detector architecture will be used to train the model. Change detector is used for pixel classification. This approach creates a model object that uses two spatial-temporal images to create a classified raster of the change. The input training data for this model type uses the Classified Tiles metadata format. CLIMAX—ClimaX is architecture will be used to train the model. This model is primarily used for weather and climate analysis. ClimaX is used for pixel classification. The preliminary data used for this method is multidimensional data. CONNECTNET—The ConnectNet architecture will be used to train the model. ConnectNet is used for pixel classification. This approach is useful for road network extraction from satellite imagery. CYCLEGAN—The CycleGAN architecture will be used to train the model. CycleGAN is used for image-to-image translation. This approach creates a model object that generates images of one type to another. This approach is unique in that the images to be trained do not need to overlap. The input training data for this model type uses the CycleGAN metadata format. DEEPLAB—The DeepLabV3 architecture will be used to train the model. DeepLab is used for pixel classification. DEEPSORT—The Deep Sort architecture will be used to train the model. Deep Sort is used for object detection in videos. The model is trained using frames of the video and detects the classes and bounding boxes of the objects in each frame. The input training data for this model type uses the Imagenet metadata format. 
While Siam Mask is useful for tracking an object, Deep Sort is useful in training a model to track multiple objects. DETREG—The DETReg architecture will be used to train the model. DETReg is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes. This model type is GPU intensive; it requires a dedicated GPU with at least 16 GB of memory to run properly. FASTERRCNN—The FasterRCNN architecture will be used to train the model. FasterRCNN is used for object detection. FEATURE_CLASSIFIER—The Feature classifier architecture will be used to train the model. Feature Classifier is used for object or image classification. HED_EDGEDETECTOR— The Holistically-Nested Edge Detection (HED) architecture will be used to train the model. HED Edge Detector is used for pixel classification. This approach is useful for edge and object boundary detection. IMAGECAPTIONER—The Image captioner architecture will be used to train the model. Image captioner is used for image-to-text translation. This approach creates a model that generates text captions for an image. MASKRCNN—The MaskRCNN architecture will be used to train the model. MaskRCNN is used for object detection. This approach is used for instance segmentation, which is precise delineation of objects in an image. This model type can be used to detect building footprints. It uses the MaskRCNN metadata format for training data as input. Class values for input training data must start at 1. This model type can only be trained using a CUDA-enabled GPU. MAXDEEPLAB—The MaX-DeepLab architecture will be used to train the model. MaX-DeepLab is used for panoptic segmentation. This approach creates a model object that generates images and features. The input training data for this model type uses the Panoptic segmentation metadata format. MMDETECTION—The MMDetection architecture will be used to train the model. MMDetection is used for object detection. 
The supported metadata formats are Pascal Visual Object Class rectangles and KITTI rectangles. MMSEGMENTATION—The MMSegmentation architecture will be used to train the model. MMSegmentation is used for pixel classification. The supported metadata format is Classified Tiles. MULTITASK_ROADEXTRACTOR—The Multi Task Road Extractor architecture will be used to train the model. Multi Task Road Extractor is used for pixel classification. This approach is useful for road network extraction from satellite imagery. PIX2PIX—The Pix2Pix architecture will be used to train the model. Pix2Pix is used for image-to-image translation. This approach creates a model object that generates images of one type to another. The input training data for this model type uses the Export Tiles metadata format. PIX2PIXHD—The Pix2PixHD architecture will be used to train the model. Pix2PixHD is used for image-to-image translation. This approach creates a model object that generates images of one type to another. The input training data for this model type uses the Export Tiles metadata format. PSETAE—The Pixel-Set Encoders and Temporal Self-Attention (PSETAE) architecture will be used to train the model for time series classification. PSETAE is used for pixel classification. The preliminary data used for this method is multidimensional data. PSPNET—The Pyramid Scene Parsing Network (PSPNET) architecture will be used to train the model. PSPNET is used for pixel classification. RETINANET—The RetinaNet architecture will be used to train the model. RetinaNet is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format. SAMLORA—The Segment Anything Model (SAM) with Low-Rank Adaptation (LoRA) will be used to train the model. This model type uses SAM as a foundation model and fine-tunes it for a specific task with relatively low computing requirements and a smaller dataset. 
SIAMMASK—The Siam Mask architecture will be used to train the model. Siam Mask is used for object detection in videos. The model is trained using frames of the video and detects the classes and bounding boxes of the objects in each frame. The input training data for this model type uses the MaskRCNN metadata format. SSD—The Single Shot Detector (SSD) architecture will be used to train the model. SSD is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format. SUPERRESOLUTION—The Super-resolution architecture will be used to train the model. Super-resolution is used for image-to-image translation. This approach creates a model object that increases the resolution and improves the quality of images. The input training data for this model type uses the Export Tiles metadata format. UNET—The U-Net architecture will be used to train the model. U-Net is used for pixel classification. YOLOV3—The YOLOv3 architecture will be used to train the model. YOLOv3 is used for object detection.	String
batch_size (Optional)	The number of training samples that will be processed for training at one time. Increasing the batch size can improve tool performance; however, as the batch size increases, more memory is used. When not enough GPU memory is available for the batch size set, the tool tries to estimate and use an optimum batch size. If an out of memory error occurs, use a smaller batch size.	Long
arguments [arguments,...] (Optional)	The information from the model_type parameter will be used to set the default values for this parameter. These arguments vary, depending on the model architecture. The supported model arguments for models trained in ArcGIS are described below. ArcGIS pretrained models and custom deep learning models may have additional arguments that the tool supports. For more information about which arguments are available for each model type, see Deep learning arguments.	Value Table
learning_rate (Optional)	The rate at which existing information will be overwritten with newly acquired information throughout the training process. If no value is specified, the optimal learning rate will be extracted from the learning curve during the training process.	Double
backbone_model (Optional)	Specifies the preconfigured neural network that will be used as the architecture for training the new model. This method is known as Transfer Learning. 1.40625deg—This backbone was trained on imagery in which the resolution of each grid cell covers an area of 1.40625 degrees by 1.40625 degrees. This is used for weather and climate predictions. This is a higher resolution setting, allowing for more precise outputs but requires more computational power. 5.625deg—This backbone was trained on imagery in which the resolution of each grid cell covers an area of 5.625 degrees by 5.625 degrees. This is used for weather and climate predictions. This is considered a low-resolution setting but requires less computational power. DENSENET121—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 121 layers deep. Unlike ResNET, which combines the layer using summation, DenseNet combines the layers using concatenation. DENSENET161—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 161 layers deep. Unlike ResNET, which combines the layer using summation, DenseNet combines the layers using concatenation. DENSENET169—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 169 layers deep. Unlike ResNET, which combines the layer using summation, DenseNet combines the layers using concatenation. DENSENET201—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 201 layers deep. Unlike ResNET, which combines the layer using summation, DenseNet combines the layers using concatenation. MOBILENET_V2—The preconfigured model will be trained on the Imagenet Database and is 54 layers deep and intended for Edge device computing, since it uses less memory. 
RESNET18—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 18 layers deep. RESNET34—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 34 layers deep. This is the default. RESNET50—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 50 layers deep. RESNET101—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 101 layers deep. RESNET152—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 152 layers deep. VGG11—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 11 layers deep. VGG11_BN—The preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It trained on the Imagenet dataset and has 11 layers. VGG13—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 13 layers deep. VGG13_BN—The preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It trained on the Imagenet dataset and has 13 layers. VGG16—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 16 layers deep. VGG16_BN—The preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. 
It trained on the Imagenet dataset and has 16 layers. VGG19—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 19 layers deep. VGG19_BN—The preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It trained on the Imagenet dataset and has 19 layers. DARKNET53—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset that contains more than 1 million images and is 53 layers deep. REID_V1—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset that is used for object tracking. REID_V2—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset that is used for object tracking. RESNEXT50—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset and is 50 layers deep. It is a homogeneous neural network, which reduces the number of hyperparameters required by conventional ResNet. WIDE_RESNET50—The preconfigured model will be a convolution neural network trained on the Imagenet Dataset and is 50 layers deep. It has the same architecture as ResNET but with more channels. SR3—The preconfigured model will use the Super Resolution via Repeated Refinement (SR3) model. SR3 adapts denoising diffusion probabilistic models to conditional image generation and performs super-resolution through a stochastic denoising process. For more information, see Image Super-Resolution via Iterative Refinement on the arXiv site. SR3_UVIT—This backbone model refers to a specific implementation of Vision Transformer (ViT)-based architecture designed for diffusion models within image generation and SR3 tasks. VIT_B—The preconfigured Segment Anything Model (SAM) will be used with a base neural network size. This is the smallest size. 
For more information, see Segment Anything on the arXiv site. VIT_L—The preconfigured Segment Anything Model (SAM) will be used with a large neural network size. For more information, see Segment Anything on the arXiv site. VIT_H—The preconfigured Segment Anything Model (SAM) will be used with a huge neural network size. This is the largest size. For more information, see Segment Anything on the arXiv site. Additionally, supported convolution neural networks from the PyTorch Image Models (timm) can be specified using timm as a prefix, for example, timm:resnet31, timm:inception_v4, timm:efficientnet_b3, and so on.	String
pretrained_model (Optional)	A pretrained model that will be used to fine-tune the new model. The input is an Esri model definition file (.emd) or a deep learning package file (.dlpk). A pretrained model with similar classes can be fine-tuned to fit the new model. The pretrained model must have been trained with the same model type and backbone model that will be used to train the new model.	File
validation_percentage (Optional)	The percentage of training samples that will be used for validating the model. The default value is 10.	Double
stop_training (Optional)	Specifies whether early stopping will be implemented. STOP_TRAINING—Early stopping will be implemented, and the model training will stop when the model is no longer improving, regardless of the max_epochs parameter value specified. This is the default. CONTINUE_TRAINING—Early stopping will not be implemented, and the model training will continue until the max_epochs parameter value is reached.	Boolean
freeze (Optional)	Specifies whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. FREEZE_MODEL—The backbone layers will be frozen, and the predefined weights and biases will not be altered in the backbone_model parameter. This is the default. UNFREEZE_MODEL—The backbone layers will not be frozen, and the weights and biases of the backbone_model parameter can be altered to fit the training samples. This takes more time to process but typically produces better results.	Boolean
augmentation (Optional)	Specifies the type of data augmentation that will be used. Data augmentation is a technique of artificially increasing the training set by creating modified copies of a dataset using existing data. DEFAULT—The default data augmentation methods and values will be used. The default data augmentation methods are crop, dihedral_affine, brightness, contrast, and zoom. These default values typically work well for satellite imagery. NONE—No data augmentation will be used. CUSTOM—Data augmentation values will be specified using the augmentation_parameters parameter. FILE—Fastai transforms for data augmentation of training and validation datasets will be specified using the transforms.json file, which is in the same folder as the training data. For more information about the various transformations, see vision transforms on the fastai website.	String
augmentation_parameters [augmentation_parameters,...] (Optional)	Specifies the value for each transform in the augmentation parameter. rotate—The image will be randomly rotated (in degrees) by a probability (p). If degrees is a range (a,b), a value will be uniformly assigned from a to b. The default value is 30.0; 0.5. brightness—The brightness of the image will be randomly adjusted depending on the value of change, with a probability (p). A change of 0 will transform the image to darkest, and a change of 1 will transform the image to lightest. A change of 0.5 will not adjust the brightness. If change is a range (a,b), the augmentation will uniformly assign a value from a to b. The default value is (0.4,0.6); 1.0. contrast—The contrast of the image will be randomly adjusted depending on the value of scale, with a probability (p). A scale of 0 will transform the image to gray scale, and a scale greater than 1 will transform the image to super contrast. A scale of 1 doesn't adjust the contrast. If scale is a range (a,b), the augmentation will uniformly assign a value from a to b. The default value is (0.75, 1.5); 1.0. zoom—The image will be randomly zoomed in depending on the value of scale. The zoom value is in the form scale(a,b); p. The default value is (1.0, 1.2); 1.0 in which p is the probability. Only a scale of greater than 1.0 will zoom in on the image. If scale is a range (a,b), it will uniformly assign a value from a to b. crop—The image will be randomly cropped. The crop value is in the form size;p;row_pct;col_pct in which p is probability. The position is given by (col_pct, row_pct), with col_pct and row_pct being normalized between 0 and 1. If col_pct or row_pct is a range (a,b), it will uniformly assign a value from a to b. The default value is chip_size;1.0; (0, 1); (0, 1) in which 224 is the default chip size.	Value Table
chip_size (Optional)	The size of the image that will be used to train the model. Images will be cropped to the specified chip size. The default chip size will be the same as the tile size of the training data. If the x- and y- tile size are different, the smaller value will be used as the default chip size. The chip size should be less than the smallest x- or y- tile size of all images in the input folders.	Long
resize_to (Optional)	Resizes the image chips. Once a chip is resized, pixel blocks of chip size will be cropped and used for training. This parameter applies to object detection (PASCAL VOC), object classification (labeled tiles), and super-resolution data only. The resize value is often half the chip size value. If the resize value is less than the chip size value, the resize value is used to create the pixel blocks for training.	String
weight_init_scheme (Optional)	Specifies the scheme in which the weights will be initialized for the layer. To train a model with multispectral data, the model must accommodate the various types of bands available. This is done by reinitializing the first layer in the model. RANDOM—Random weights will be initialized for non-RGB bands, while pretrained weights will be preserved for RGB bands. This is the default. RED_BAND—Weights corresponding to the red band from the pretrained model's layer will be cloned for non-RGB bands, while pretrained weights will be preserved for RGB bands. ALL_RANDOM—Random weights will be initialized for RGB bands as well as non-RGB bands. This option applies only to multispectral imagery. This parameter is only applicable when multispectral imagery is used in the model.	String
monitor (Optional)	Specifies the metric that will be monitored while checkpointing and early stopping. VALID_LOSS—The validation loss will be monitored. When the validation loss no longer changes significantly, the model will stop. This is the default. AVERAGE_PRECISION—The weighted mean of precision at each threshold will be monitored. When this value no longer changes significantly, the model will stop. ACCURACY—The ratio of the number of correct predictions to the total number of predictions will be monitored. When this value no longer changes significantly, the model will stop. F1_SCORE—The combination of the precision and recall scores of the model will be monitored. When this value no longer changes significantly, the model will stop. MIOU—The average between the intersection over union (IoU) of the segmented objects over all the images of the test dataset will be monitored. When this value no longer changes significantly, the model will stop. DICE—Model performance will be monitored using the Dice metric. When this value no longer changes significantly, the model will stop. This value can range from 0 to 1. The value 1 corresponds to a pixel perfect match between the validation data and training data. PRECISION—The precision, which measures the model's accuracy in classifying a sample as positive, will be monitored. When this value no longer changes significantly, the model will stop. The precision is the ratio between the number of positive samples correctly classified and the total number of samples classified as positive (either correctly or incorrectly). RECALL—The recall, which measures the model's ability to detect positive samples, will be monitored. When this value no longer changes significantly, the model will stop. The higher the recall, the more positive samples are detected. The recall value is the ratio between the number of positive samples correctly classified as positive and the total number of positive samples. 
CORPUS_BLEU—The Corpus BLEU score will be monitored. When this value no longer changes significantly, the model will stop. This score is used to calculate accuracy for multiple sentences, such as a paragraph or a document. MULTI_LABEL_FBETA—The weighted harmonic mean of precision and recall will be monitored. When this value no longer changes significantly, the model will stop. This is often referred to as the F-beta score.	String
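
Several of the monitor metrics described above can be computed from counts of true positives (tp), false positives (fp), and false negatives (fn). The following is a minimal sketch of those textbook definitions for a single class. The function names are illustrative only and are not part of arcpy, and the tool's internal calculations (for example, per-class averaging for MIOU) may differ.

```python
def precision(tp, fp):
    """Positives correctly classified / all samples classified as positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Positives correctly classified / all samples that are truly positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    p, r = precision(tp, fp), recall(tp, fn)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


def iou(tp, fp, fn):
    """Intersection over union for one class."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def dice(tp, fp, fn):
    """Dice coefficient for one class; ranges from 0 to 1."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for one class
tp, fp, fn = 80, 20, 10
print(precision(tp, fp), recall(tp, fn), f_beta(tp, fp, fn), iou(tp, fp, fn), dice(tp, fp, fn))
```

For a single binary class, the Dice coefficient and the F1 score are the same quantity, which is why the last two function results agree when beta is 1.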

Derived Output

Name	Explanation	Data Type
out_model_file	The output trained model file.	File
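
The interaction between the stop_training, max_epochs, and monitor parameters can be illustrated with a generic early-stopping loop. This is a sketch only: the patience and min_delta values are hypothetical, and the criterion the tool actually applies to decide that a model is "no longer improving" is not documented here and may differ.

```python
def epochs_run(valid_losses, max_epochs, early_stopping=True,
               patience=3, min_delta=0.01):
    """Return how many epochs run before training stops.

    valid_losses: monitored value per epoch (lower is better here).
    patience, min_delta: hypothetical thresholds, not tool parameters.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without a significant improvement
    completed = 0
    for loss in valid_losses[:max_epochs]:
        completed += 1
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if early_stopping and stale >= patience:
                break  # analogous to STOP_TRAINING
    return completed


# Made-up validation losses that plateau after the third epoch
losses = [1.0, 0.8, 0.7, 0.699, 0.698, 0.70, 0.71]
stopped_early = epochs_run(losses, max_epochs=20)                      # stops at epoch 6
ran_to_end = epochs_run(losses, max_epochs=20, early_stopping=False)   # runs all 7
```

With early stopping enabled, the loop ends before max_epochs is reached; with it disabled (analogous to CONTINUE_TRAINING), every available epoch runs.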

Code sample

TrainDeepLearningModel example 1 (Python window)

This example trains a tree classification model using the U-Net approach.

# Import system modules  
import arcpy  
from arcpy.ia import *  
 
# Check out the ArcGIS Image Analyst extension license 
arcpy.CheckOutExtension("ImageAnalyst") 
 
# Execute 
TrainDeepLearningModel(r"C:\DeepLearning\TrainingData\Roads_FC", 
     r"C:\DeepLearning\Models\Fire", 40, "UNET", 16, "# #", None, 
     "RESNET34", None, 10, "STOP_TRAINING", "FREEZE_MODEL")
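
Because the tool expects the training data folder to contain the images and labels subfolders, a quick check before calling TrainDeepLearningModel can save a failed run. The following sketch uses only the Python standard library. The helper name is hypothetical, and the subfolder names are taken from the usage notes for this tool; adjust them if your export uses different names.

```python
import os
import tempfile


def missing_training_subfolders(in_folder, required=("images", "labels")):
    """Return the required subfolder names that are absent from in_folder."""
    return [name for name in required
            if not os.path.isdir(os.path.join(in_folder, name))]


# Demonstration on a temporary folder that has images but no labels
with tempfile.TemporaryDirectory() as demo_folder:
    os.mkdir(os.path.join(demo_folder, "images"))
    missing = missing_training_subfolders(demo_folder)

print(missing)
```

An empty list means both subfolders are present.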

TrainDeepLearningModel example 2 (stand-alone script)

This example trains an object detection model using the SSD approach.

# Import system modules  
import arcpy  
from arcpy.ia import *  
 
# Check out the ArcGIS Image Analyst extension license 
arcpy.CheckOutExtension("ImageAnalyst") 
 
# Define input parameters
in_folder = "C:\\DeepLearning\\TrainingData\\Cars" 
out_folder = "C:\\Models\\Cars"
max_epochs = 100
model_type = "SSD"
batch_size = 2
arg = "grids '[4, 2, 1]';zooms '[0.7, 1.0, 1.3]';ratios '[[1, 1], [1, 0.5], [0.5, 1]]'"
learning_rate = 0.003
backbone_model = "RESNET34" 
pretrained_model = "C:\\Models\\Pretrained\\vehicles.emd"
validation_percent = 10
stop_training = "STOP_TRAINING"
freeze = "FREEZE_MODEL"


# Execute
TrainDeepLearningModel(in_folder, out_folder, max_epochs, model_type, 
     batch_size, arg, learning_rate, backbone_model, pretrained_model, 
     validation_percent, stop_training, freeze)
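
The arguments string in example 2 follows a name 'value' pattern with pairs separated by semicolons. When there are many arguments, it may be easier to build that string from a dictionary. The helper below is a hypothetical convenience function, not part of arcpy, and it assumes the format shown in example 2 applies to the arguments being passed.

```python
import json


def build_arguments(args):
    """Join name/value pairs as "name 'value';name 'value'"."""
    parts = []
    for name, value in args.items():
        # Strings pass through unchanged; lists and numbers are serialized
        text = value if isinstance(value, str) else json.dumps(value)
        parts.append(f"{name} '{text}'")
    return ";".join(parts)


arg = build_arguments({
    "grids": [4, 2, 1],
    "zooms": [0.7, 1.0, 1.3],
    "ratios": [[1, 1], [1, 0.5], [0.5, 1]],
})
print(arg)
```

The resulting string is identical to the arg value defined in example 2 and can be passed to the tool in the same position.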

Environments

Current Workspace, Processor Type, GPU ID, Scratch Workspace

Licensing information

Basic: Requires Image Analyst
Standard: Requires Image Analyst
Advanced: Requires Image Analyst
