Train Deep Learning Model (Image Analyst)—ArcGIS Pro

Available with Image Analyst license.

Summary

Trains a deep learning model using the output from the Export Training Data For Deep Learning tool.

Usage

This tool trains a deep learning model using deep learning frameworks.
To set up your machine to use deep learning frameworks in ArcGIS Pro, see Install deep learning frameworks for ArcGIS.
If you will be training models in a disconnected environment, see Installation for Disconnected Environment for additional information.
This tool can also be used to fine-tune an existing trained model. For example, an existing model that has been trained for cars can be fine-tuned to train a model that identifies trucks.
To run this tool using a GPU, set the Processor Type environment to GPU. If you have more than one GPU, specify the GPU ID environment instead.
The input training data for this tool must include the images and labels folders that are generated from the Export Training Data For Deep Learning tool.
For information about requirements for running this tool and issues you may encounter, see Deep Learning frequently asked questions.
For more information about deep learning, see Deep learning in ArcGIS Pro.
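
The GPU and input-folder notes above can be sketched as a small pre-flight helper. This is a minimal sketch and not part of the tool: the helper names has_training_layout and configure_processor are hypothetical, and the sketch assumes the Processor Type and GPU ID environments are exposed in Python as arcpy.env.processorType and arcpy.env.gpuId. A stand-in object is used when arcpy is not installed, so the logic can be read and run anywhere.

```python
import os
import tempfile
from types import SimpleNamespace


def has_training_layout(folder):
    """Return True if folder contains the images and labels subfolders."""
    return all(os.path.isdir(os.path.join(folder, name))
               for name in ("images", "labels"))


def configure_processor(env, gpu_id=None):
    """Request GPU processing on a geoprocessing environment object.

    With a single GPU, set Processor Type to GPU. With several GPUs,
    set the GPU ID instead, as the usage note describes.
    """
    if gpu_id is None:
        env.processorType = "GPU"
    else:
        env.gpuId = gpu_id
    return env


try:
    import arcpy  # only available in an ArcGIS Pro Python environment
    env = arcpy.env
except ImportError:
    env = SimpleNamespace()  # stand-in so the sketch runs without arcpy

configure_processor(env)

# Demonstrate the folder check on a throwaway directory.
with tempfile.TemporaryDirectory() as demo:
    before = has_training_layout(demo)
    os.makedirs(os.path.join(demo, "images"))
    os.makedirs(os.path.join(demo, "labels"))
    after = has_training_layout(demo)
    print(before, after)
```

Running the folder check before training is a design choice for this sketch only; it surfaces a wrong path early instead of waiting for the tool to report it.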

Parameters

Label	Explanation	Data Type
Input Training Data	The folders containing the image chips, labels, and statistics required to train the model. This is the output from the Export Training Data For Deep Learning tool. Multiple input folders are supported when all the following conditions are met: The metadata format must be one of the following types: classified tiles, labeled tiles, multilabeled tiles, PASCAL Visual Object Classes, or RCNN masks. All training data must have the same metadata format. All training data must have the same number of bands. All training data must have the same tile size.	Folder
Output Model	The output folder location that will store the trained model.	Folder
Max Epochs (Optional)	The maximum number of epochs for which the model will be trained. A maximum epoch of one means the dataset will be passed forward and backward through the neural network one time. The default value is 20.	Long
Model Type (Optional)	Specifies the model type that will be used to train the deep learning model. Single Shot Detector (Object detection)—The Single Shot Detector (SSD) approach will be used to train the model. SSD is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format. U-Net (Pixel classification)—The U-Net approach will be used to train the model. U-Net is used for pixel classification. Feature classifier (Object classification)—The Feature Classifier approach will be used to train the model. This is used for object or image classification. Pyramid Scene Parsing Network (Pixel classification)—The Pyramid Scene Parsing Network (PSPNET) approach will be used to train the model. PSPNET is used for pixel classification. RetinaNet (Object detection)—The RetinaNet approach will be used to train the model. RetinaNet is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format. MaskRCNN (Object detection)—The MaskRCNN approach will be used to train the model. MaskRCNN is used for object detection. This approach is used for instance segmentation, which is precise delineation of objects in an image. This model type can be used to detect building footprints. It uses the MaskRCNN metadata format for training data as input. Class values for input training data must start at 1. This model type can only be trained using a CUDA-enabled GPU. YOLOv3 (Object detection)—The YOLOv3 approach will be used to train the model. YOLOv3 is used for object detection. DeepLabV3 (Pixel classification)—The DeepLabV3 approach will be used to train the model. DeepLab is used for pixel classification. FasterRCNN (Object detection)—The FasterRCNN approach will be used to train the model. FasterRCNN is used for object detection. BDCN Edge Detector (Pixel classification)— The Bi-Directional Cascade Network (BDCN) architecture will be used to train the model. 
The BDCN Edge Detector is used for pixel classification. This approach is useful to improve edge detection for objects at different scales. HED Edge Detector (Pixel classification)—The Holistically-Nested Edge Detection (HED) architecture will be used to train the model. The HED Edge Detector is used for pixel classification. This approach is useful for edge and object boundary detection. Multi Task Road Extractor (Pixel classification)—The Multi Task Road Extractor architecture will be used to train the model. The Multi Task Road Extractor is used for pixel classification. This approach is useful for road network extraction from satellite imagery. ConnectNet (Pixel classification)—The ConnectNet architecture will be used to train the model. ConnectNet is used for pixel classification. This approach is useful for road network extraction from satellite imagery. Pix2Pix (Image translation)—The Pix2Pix approach will be used to train the model. Pix2Pix is used for image-to-image translation. This approach creates a model object that generates images of one type from another. The input training data for this model type uses the Export Tiles metadata format. CycleGAN (Image translation)—The CycleGAN approach will be used to train the model. CycleGAN is used for image-to-image translation. This approach creates a model object that generates images of one type from another. This approach is unique in that the images to be trained do not need to overlap. The input training data for this model type uses the CycleGAN metadata format. Super-resolution (Image translation)—The Super-resolution approach will be used to train the model. Super-resolution is used for image-to-image translation. This approach creates a model object that increases the resolution and improves the quality of images. The input training data for this model type uses the Export Tiles metadata format. Change detector (Pixel classification)—The Change detector approach will be used to train the model.
Change detector is used for pixel classification. This approach creates a model object that uses two spatial-temporal images to create a classified raster of the change. The input training data for this model type uses the Classified Tiles metadata format. Image captioner (Image translation)—The Image captioner approach will be used to train the model. Image captioner is used for image-to-text translation. This approach creates a model that generates text captions for an image. Siam Mask (Object tracker)—The Siam Mask approach will be used to train the model. Siam Mask is used for object detection in videos. The model is trained using frames of the video and detects the classes and bounding boxes of the objects in each frame. The input training data for this model type uses the MaskRCNN metadata format. MMDetection (Object detection)—The MMDetection approach will be used to train the model. MMDetection is used for object detection. The supported metadata formats are PASCAL Visual Object Class rectangles and KITTI rectangles. MMSegmentation (Pixel classification)—The MMSegmentation approach will be used to train the model. MMSegmentation is used for pixel classification. The supported metadata format is Classified Tiles. Deep Sort (Object tracker)—The Deep Sort approach will be used to train the model. Deep Sort is used for object detection in videos. The model is trained using frames of the video and detects the classes and bounding boxes of the objects in each frame. The input training data for this model type uses the Imagenet metadata format. Whereas Siam Mask is useful for tracking a single object, Deep Sort is useful for training a model to track multiple objects. Pix2PixHD (Image translation)—The Pix2PixHD approach will be used to train the model. Pix2PixHD is used for image-to-image translation. This approach creates a model object that generates images of one type from another. The input training data for this model type uses the Export Tiles metadata format.	String
Batch Size (Optional)	The number of training samples to be processed for training at one time. The default value is 2. If you have a powerful GPU, this number can be increased to 8, 16, 32, or 64.	Long
Model Arguments (Optional)	The function arguments are defined in the Python raster function class. This is where you list additional deep learning parameters and arguments for experiments and refinement, such as a confidence threshold for adjusting sensitivity. The names of the arguments are populated from reading the Python module. When you choose Single Shot Detector (Object detection) as the Model Type parameter value, the Model Arguments parameter will be populated with the following arguments: grids—The number of grids the image will be divided into for processing. Setting this argument to 4 means the image will be divided into 4 x 4 or 16 grid cells. If no value is specified, the optimal grid value will be calculated based on the input imagery. zooms—The number of zoom levels each grid cell will be scaled up or down. Setting this argument to 1 means all the grid cells will remain at the same size or zoom level. A zoom level of 2 means all the grid cells will become twice as large (zoomed in 100 percent). Providing a list of zoom levels means all the grid cells will be scaled using all the numbers in the list. The default is 1.0. ratios—The list of aspect ratios to use for the anchor boxes. In object detection, an anchor box represents the ideal location, shape, and size of the object being predicted. Setting this argument to [1.0,1.0], [1.0, 0.5] means the anchor box is a square (1:1) or a rectangle in which the horizontal side is half the size of the vertical side (1:0.5). The default is [1.0, 1.0]. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss and average_precision. The default is valid_loss. 
When you choose a pixel classification model such as Pyramid Scene Parsing Network (Pixel classification), U-Net (Pixel classification), or DeepLabV3 (Pixel classification) as the Model Type parameter value, the Model Arguments parameter will be populated with the following arguments: use_unet—Specifies whether the U-Net decoder will be used to recover data once the pyramid pooling is complete. The default is True. This argument is specific to the Pyramid Scene Parsing Network model. pyramid_sizes—The number and size of convolution layers to be applied to the different subregions. The default is [1,2,3,6]. This argument is specific to the Pyramid Scene Parsing Network model. mixup—Specifies whether mixup augmentation and mixup loss will be used. The default is False. class_balancing—Specifies whether the cross-entropy loss inverse will be balanced to the frequency of pixels per class. The default is False. focal_loss—Specifies whether focal loss will be used. The default is False. ignore_classes—Contains the list of class values on which the model will not incur loss. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss and accuracy. The default is valid_loss. When you choose RetinaNet (Object detection) as the Model Type parameter value, the Model Arguments parameter will be populated with the following arguments: scales—The number of scale levels each cell will be scaled up or down. The default is [1, 0.8, 0.63]. ratios—The aspect ratio of the anchor box. The default is [0.5,1,2]. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss and average_precision. The default is valid_loss.
When you choose Multi Task Road Extractor (Pixel classification) or ConnectNet (Pixel classification) as the Model Type parameter value, the Model Arguments parameter will be populated with the following arguments: gaussian_thresh—Sets the Gaussian threshold, which sets the required road width. The valid range is 0.0 to 1.0. The default is 0.76. orient_bin_size—Sets the bin size for orientation angles. The default is 20. orient_theta—Sets the width of orientation mask. The default is 8. mtl_model—Sets the architecture type that will be used to create the model. Valid choices are linknet or hourglass for linknet-based or hourglass-based, respectively, neural architectures. The default is hourglass. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss, accuracy, miou, and dice. The default is valid_loss. When you choose Image captioner (Image translation) as the Model Type parameter value, the Model Arguments parameter will be populated with the following arguments: decode_params—A dictionary that controls how the Image captioner will run. The default value is {'embed_size':100, 'hidden_size':100, 'attention_size':100, 'teacher_forcing':1, 'dropout':0.1, 'pretrained_emb':False}. chip_size—Sets the image size to train the model. Images are cropped to the specified chip size. If image size is less than chip size, image size is used. The default size is 224 pixels. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss, accuracy, corpus_bleu, and multi_label_fbeta. The default is valid_loss. The decode_params argument is composed of the following six parameters: embed_size—Sets the embedding size. The default is 100 layers in the neural network. hidden_size—Sets the hidden layer size. The default is 100 layers in the neural network. attention_size—Sets the intermediate attention layer size. The default is 100 layers in the neural network.
teacher_forcing—Sets the probability of teacher forcing. Teacher forcing is a strategy for training recurrent neural networks. It uses the ground-truth output from a prior time step as an input, instead of the model's own previous output, during training. The valid range is 0.0 to 1.0. The default is 1. dropout—Sets the dropout probability. The valid range is 0.0 to 1.0. The default is 0.1. pretrained_emb—Sets the pretrained embedding flag. If True, it will use fast text embedding. If False, it will not use the pretrained text embedding. The default is False. When you choose Change detector (Pixel classification) as the Model Type parameter value, the Model Arguments parameter will be populated with the following arguments: attention_type—Specifies the module type. The module choices are PAM (Pyramid Attention Module) or BAM (Basic Attention Module). The default is PAM. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss, precision, recall, and f1. The default is valid_loss. When you choose MMDetection (Object detection) as the Model Type parameter value, the Model Arguments parameter will be populated with the following arguments: model—The backbone model used to train the model. The available choices are atss, carafe, cascade_rcnn, cascade_rpn, dcn, detectors, double_heads, dynamic_rcnn, empirical_attention, fcos, foveabox, fsaf, ghm, hrnet, libra_rcnn, nas_fcos, pafpn, pisa, regnet, reppoints, res2net, sabl, and vfnet. The default is cascade_rcnn. model_weight—Specifies whether the pretrained model weights will be used. The default is false. The value can also be a path to a configuration file containing the weights of a model, from the MMDetection repository. When you choose MMSegmentation (Pixel classification) as the Model Type parameter value, the Model Arguments parameter will be populated with the following arguments: model—The backbone model used to train the model.
The available choices are ann, apcnet, ccnet, cgnet, danet, deeplabv3, deeplabv3plus, dmnet, dnlnet, emanet, encnet, fastscnn, fcn, gcnet, hrnet, mobilenet_v2, mobilenet_v3, nonlocal_net, ocrnet, ocrnet_base, pointrend, psanet, pspnet, resnest, sem_fpn, unet, and upernet. The default is deeplabv3. model_weight—Specifies whether the pretrained model weights will be used. The default is false. The value can also be a path to a configuration file containing the weights of a model, from the MMSegmentation repository. All model types support the chip_size argument, which is the image chip size of the training samples. The image chip size is extracted from the .emd file from the folder specified in the Input Training Data parameter.	Value Table
Learning Rate (Optional)	The rate at which existing information will be overwritten with newly acquired information throughout the training process. If no value is specified, the optimal learning rate will be extracted from the learning curve during the training process.	Double
Backbone Model (Optional)	Specifies the preconfigured neural network that will be used as the architecture for training the new model. This method is known as Transfer Learning. DenseNet-121—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 121 layers deep. Unlike RESNET, which combines the layers using summation, DenseNet combines the layers using concatenation. DenseNet-161—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 161 layers deep. Unlike RESNET, which combines the layers using summation, DenseNet combines the layers using concatenation. DenseNet-169—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 169 layers deep. Unlike RESNET, which combines the layers using summation, DenseNet combines the layers using concatenation. DenseNet-201—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 201 layers deep. Unlike RESNET, which combines the layers using summation, DenseNet combines the layers using concatenation. MobileNet version 2—The preconfigured model will be trained on the Imagenet Database, is 54 layers deep, and is geared toward edge device computing because it uses less memory. ResNet-18—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 18 layers deep. ResNet-34—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 34 layers deep. This is the default. ResNet-50—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 50 layers deep.
ResNet-101—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 101 layers deep. ResNet-152—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 152 layers deep. VGG-11—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 11 layers deep. VGG-11 with batch normalization—This preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the Imagenet dataset and has 11 layers. VGG-13—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 13 layers deep. VGG-13 with batch normalization—This preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the Imagenet dataset and has 13 layers. VGG-16—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 16 layers deep. VGG-16 with batch normalization—This preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the Imagenet dataset and has 16 layers. VGG-19—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 19 layers deep.
VGG-19 with batch normalization—This preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the Imagenet dataset and has 19 layers. DarkNet-53—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images and is 53 layers deep. Reid_v1—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that is used for object tracking. Reid_v2—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that is used for object tracking.	String
Pre-trained Model (Optional)	A pretrained model that will be used to fine-tune the new model. The input is an Esri Model Definition file (.emd) or a deep learning package file (.dlpk). A pretrained model with similar classes can be fine-tuned to fit the new model. The pretrained model must have been trained with the same model type and backbone model that will be used to train the new model.	File
Validation % (Optional)	The percentage of training samples that will be used for validating the model. The default value is 10.	Double
Stop when model stops improving (Optional)	Specifies whether early stopping will be implemented. Checked—Early stopping will be implemented, and the model training will stop when the model is no longer improving, regardless of the Max Epochs parameter value specified. This is the default. Unchecked—Early stopping will not be implemented, and the model training will continue until the Max Epochs parameter value is reached.	Boolean
Freeze Model (Optional)	Specifies whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. Checked—The backbone layers will be frozen, and the predefined weights and biases will not be altered in the Backbone Model parameter. This is the default. Unchecked—The backbone layers will not be frozen, and the weights and biases of the Backbone Model parameter can be altered to fit the training samples. This takes more time to process but typically produces better results.	Boolean
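
The conditions for combining multiple folders, listed under Input Training Data, can be sketched as a simple compatibility check. This is an illustrative sketch only: the function check_multi_folder and the descriptor keys metadata_format, bands, and tile_size are hypothetical, and how those values are read from the exported training data is left out.

```python
# Metadata formats that the parameter description lists as supporting
# multiple input folders (names normalized to lowercase here).
MULTI_FOLDER_FORMATS = {
    "classified tiles",
    "labeled tiles",
    "multilabeled tiles",
    "pascal visual object classes",
    "rcnn masks",
}


def check_multi_folder(descriptors):
    """Return a list of problems for two or more training folders.

    Each descriptor is a hypothetical dict with the keys
    'metadata_format', 'bands', and 'tile_size'. An empty list means
    the folders satisfy all of the documented conditions.
    """
    problems = []
    formats = {d["metadata_format"].lower() for d in descriptors}
    if len(formats) > 1:
        problems.append("metadata formats differ")
    unsupported = formats - MULTI_FOLDER_FORMATS
    if unsupported:
        problems.append(
            "unsupported metadata format: " + ", ".join(sorted(unsupported))
        )
    if len({d["bands"] for d in descriptors}) > 1:
        problems.append("band counts differ")
    if len({d["tile_size"] for d in descriptors}) > 1:
        problems.append("tile sizes differ")
    return problems


folders = [
    {"metadata_format": "Classified Tiles", "bands": 3, "tile_size": 256},
    {"metadata_format": "Classified Tiles", "bands": 3, "tile_size": 256},
    {"metadata_format": "Classified Tiles", "bands": 4, "tile_size": 256},
]
print(check_multi_folder(folders[:2]))  # first two folders are compatible
print(check_multi_folder(folders))      # third folder has a different band count
```

Returning a list of problems rather than a single Boolean is a choice made for this sketch, so that every violated condition can be reported at once.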

Derived Output

Label	Explanation	Data Type
Output Model	The output trained model file.	File

TrainDeepLearningModel(in_folder, out_folder, {max_epochs}, {model_type}, {batch_size}, {arguments}, {learning_rate}, {backbone_model}, {pretrained_model}, {validation_percentage}, {stop_training}, {freeze})
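
The signature above can be exercised with a small wrapper that assembles and checks the arguments before the tool runs. This is a hedged sketch, not an official sample: build_training_params and train are hypothetical helpers, the paths are placeholders, the range checks are sanity checks rather than documented limits, and it assumes the tool is reachable as arcpy.ia.TrainDeepLearningModel and accepts keyword arguments named as in the signature.

```python
# Keywords for model_type as listed in the parameter table.
MODEL_TYPES = {
    "SSD", "UNET", "FEATURE_CLASSIFIER", "PSPNET", "RETINANET", "MASKRCNN",
    "YOLOV3", "DEEPLAB", "FASTERRCNN", "BDCN_EDGEDETECTOR", "HED_EDGEDETECTOR",
    "MULTITASK_ROADEXTRACTOR", "CONNECTNET", "PIX2PIX", "CYCLEGAN",
    "SUPERRESOLUTION", "CHANGEDETECTOR", "IMAGECAPTIONER", "SIAMMASK",
    "MMDETECTION", "MMSEGMENTATION", "DEEPSORT", "PIX2PIXHD",
}


def build_training_params(in_folder, out_folder, model_type,
                          max_epochs=20, batch_size=2,
                          validation_percentage=10):
    """Assemble keyword arguments for TrainDeepLearningModel.

    Defaults mirror the documented defaults (20 epochs, batch size 2,
    10 percent validation). The range checks are sanity checks added
    for this sketch, not limits stated in the documentation.
    """
    if model_type not in MODEL_TYPES:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    if max_epochs < 1 or batch_size < 1:
        raise ValueError("max_epochs and batch_size must be positive")
    if not 0 < validation_percentage < 100:
        raise ValueError("validation_percentage must be between 0 and 100")
    return {
        "in_folder": in_folder,
        "out_folder": out_folder,
        "max_epochs": max_epochs,
        "model_type": model_type,
        "batch_size": batch_size,
        "validation_percentage": validation_percentage,
    }


def train(params):
    """Run the tool; requires ArcGIS Pro with the Image Analyst extension."""
    import arcpy  # imported here so the rest of the sketch runs without it
    arcpy.CheckOutExtension("ImageAnalyst")
    return arcpy.ia.TrainDeepLearningModel(**params)


# Placeholder paths, for illustration only.
params = build_training_params(
    r"C:\data\training_chips", r"C:\data\models\demo",
    model_type="UNET", max_epochs=40, batch_size=8,
)
print(params["model_type"], params["max_epochs"], params["batch_size"])
# train(params)  # uncomment inside an ArcGIS Pro Python environment
```

Validating the model_type keyword up front catches a misspelled keyword before any training time is spent.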

Name	Explanation	Data Type
in_folder [in_folder,...]	The folders containing the image chips, labels, and statistics required to train the model. This is the output from the Export Training Data For Deep Learning tool. Multiple input folders are supported when all the following conditions are met: The metadata format must be one of the following types: classified tiles, labeled tiles, multilabeled tiles, PASCAL Visual Object Classes, or RCNN masks. All training data must have the same metadata format. All training data must have the same number of bands. All training data must have the same tile size.	Folder
out_folder	The output folder location that will store the trained model.	Folder
max_epochs (Optional)	The maximum number of epochs for which the model will be trained. A maximum epoch of one means the dataset will be passed forward and backward through the neural network one time. The default value is 20.	Long
model_type (Optional)	Specifies the model type that will be used to train the deep learning model. SSD—The Single Shot Detector (SSD) approach will be used to train the model. SSD is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format. UNET—The U-Net approach will be used to train the model. U-Net is used for pixel classification. FEATURE_CLASSIFIER—The Feature Classifier approach will be used to train the model. This is used for object or image classification. PSPNET—The Pyramid Scene Parsing Network (PSPNET) approach will be used to train the model. PSPNET is used for pixel classification. RETINANET—The RetinaNet approach will be used to train the model. RetinaNet is used for object detection. The input training data for this model type uses the Pascal Visual Object Classes metadata format. MASKRCNN—The MaskRCNN approach will be used to train the model. MaskRCNN is used for object detection. This approach is used for instance segmentation, which is precise delineation of objects in an image. This model type can be used to detect building footprints. It uses the MaskRCNN metadata format for training data as input. Class values for input training data must start at 1. This model type can only be trained using a CUDA-enabled GPU. YOLOV3—The YOLOv3 approach will be used to train the model. YOLOv3 is used for object detection. DEEPLAB—The DeepLabV3 approach will be used to train the model. DeepLab is used for pixel classification. FASTERRCNN—The FasterRCNN approach will be used to train the model. FasterRCNN is used for object detection. BDCN_EDGEDETECTOR— The Bi-Directional Cascade Network (BDCN) architecture will be used to train the model. The BDCN Edge Detector is used for pixel classification. This approach is useful to improve edge detection for objects at different scales. HED_EDGEDETECTOR— The Holistically-Nested Edge Detection (HED) architecture will be used to train the model. 
The HED Edge Detector is used for pixel classification. This approach is useful for edge and object boundary detection. MULTITASK_ROADEXTRACTOR—The Multi Task Road Extractor architecture will be used to train the model. The Multi Task Road Extractor is used for pixel classification. This approach is useful for road network extraction from satellite imagery. CONNECTNET—The ConnectNet architecture will be used to train the model. ConnectNet is used for pixel classification. This approach is useful for road network extraction from satellite imagery. PIX2PIX—The Pix2Pix approach will be used to train the model. Pix2Pix is used for image-to-image translation. This approach creates a model object that generates images of one type from another. The input training data for this model type uses the Export Tiles metadata format. CYCLEGAN—The CycleGAN approach will be used to train the model. CycleGAN is used for image-to-image translation. This approach creates a model object that generates images of one type from another. This approach is unique in that the images to be trained do not need to overlap. The input training data for this model type uses the CycleGAN metadata format. SUPERRESOLUTION—The Super-resolution approach will be used to train the model. Super-resolution is used for image-to-image translation. This approach creates a model object that increases the resolution and improves the quality of images. The input training data for this model type uses the Export Tiles metadata format. CHANGEDETECTOR—The Change detector approach will be used to train the model. Change detector is used for pixel classification. This approach creates a model object that uses two spatial-temporal images to create a classified raster of the change. The input training data for this model type uses the Classified Tiles metadata format. IMAGECAPTIONER—The Image captioner approach will be used to train the model. Image captioner is used for image-to-text translation.
This approach creates a model that generates text captions for an image. SIAMMASK—The Siam Mask approach will be used to train the model. Siam Mask is used for object detection in videos. The model is trained using frames of the video and detects the classes and bounding boxes of the objects in each frame. The input training data for this model type uses the MaskRCNN metadata format. MMDETECTION—The MMDetection approach will be used to train the model. MMDetection is used for object detection. The supported metadata formats are PASCAL Visual Object Class rectangles and KITTI rectangles. MMSEGMENTATION—The MMSegmentation approach will be used to train the model. MMSegmentation is used for pixel classification. The supported metadata format is Classified Tiles. DEEPSORT—The Deep Sort approach will be used to train the model. Deep Sort is used for object detection in videos. The model is trained using frames of the video and detects the classes and bounding boxes of the objects in each frame. The input training data for this model type uses the Imagenet metadata format. Whereas Siam Mask is useful for tracking a single object, Deep Sort is useful for training a model to track multiple objects. PIX2PIXHD—The Pix2PixHD approach will be used to train the model. Pix2PixHD is used for image-to-image translation. This approach creates a model object that generates images of one type from another. The input training data for this model type uses the Export Tiles metadata format.	String
batch_size (Optional)	The number of training samples to be processed for training at one time. The default value is 2. If you have a powerful GPU, this number can be increased to 8, 16, 32, or 64.	Long
arguments [arguments,...] (Optional)	The function arguments are defined in the Python raster function class. This is where you list additional deep learning parameters and arguments for experiments and refinement, such as a confidence threshold for adjusting sensitivity. The names of the arguments are populated from reading the Python module. When you choose SSD as the model_type parameter value, the arguments parameter will be populated with the following arguments: grids—The number of grids the image will be divided into for processing. Setting this argument to 4 means the image will be divided into 4 x 4 or 16 grid cells. If no value is specified, the optimal grid value will be calculated based on the input imagery. zooms—The number of zoom levels each grid cell will be scaled up or down. Setting this argument to 1 means all the grid cells will remain at the same size or zoom level. A zoom level of 2 means all the grid cells will become twice as large (zoomed in 100 percent). Providing a list of zoom levels means all the grid cells will be scaled using all the numbers in the list. The default is 1.0. ratios—The list of aspect ratios to use for the anchor boxes. In object detection, an anchor box represents the ideal location, shape, and size of the object being predicted. Setting this argument to [1.0,1.0], [1.0, 0.5] means the anchor box is a square (1:1) or a rectangle in which the horizontal side is half the size of the vertical side (1:0.5). The default is [1.0, 1.0]. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss and average_precision. The default is valid_loss. When you choose any of the pixel classification models such as PSPNET, UNET, or DEEPLAB as the model_type parameter value, the arguments parameter will be populated with the following arguments: USE_UNET—The U-Net decoder will be used to recover data once the pyramid pooling is complete. The default is True. 
This argument is specific to the PSPNET model. PYRAMID_SIZES—The number and size of convolution layers to be applied to the different subregions. The default is [1,2,3,6]. This argument is specific to the PSPNET model. MIXUP—Specifies whether mixup augmentation and mixup loss will be used. The default is False. CLASS_BALANCING—Specifies whether the cross-entropy loss inverse will be balanced to the frequency of pixels per class. The default is False. FOCAL_LOSS—Specifies whether focal loss will be used. The default is False. IGNORE_CLASSES—Contains the list of class values on which the model will not incur loss. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss and accuracy. The default is valid_loss. When you choose RETINANET as the model_type parameter value, the arguments parameter will be populated with the following arguments: SCALES—The number of scale levels each cell will be scaled up or down. The default is [1, 0.8, 0.63]. RATIOS—The aspect ratio of the anchor box. The default is [0.5,1,2]. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss and average_precision. The default is valid_loss. When you choose MULTITASK_ROADEXTRACTOR or CONNECTNET as the model_type parameter value, the arguments parameter will be populated with the following arguments: gaussian_thresh—Sets the Gaussian threshold, which sets the required road width. The valid range is 0.0 to 1.0. The default is 0.76. orient_bin_size—Sets the bin size for orientation angles. The default is 20. orient_theta—Sets the width of orientation mask. The default is 8. mtl_model—Sets the architecture type that will be used to create the model. Valid choices are linknet or hourglass for linknet-based or hourglass-based, respectively, neural architectures. The default is hourglass. monitor—Specifies which metric to monitor while checkpointing and early stopping. 
Available metrics are valid_loss, accuracy, miou, and dice. The default is valid_loss. When you choose IMAGECAPTIONER as the model_type parameter value, the arguments parameter will be populated with the following arguments: decode_params—A dictionary that controls how the image captioner will run. The default value is {'embed_size':100, 'hidden_size':100, 'attention_size':100, 'teacher_forcing':1, 'dropout':0.1, 'pretrained_emb':False}. chip_size—Sets the size of the images used to train the model. Images are cropped to the specified chip size. If the image size is less than the chip size, the image size is used. The default size is 224 pixels. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss, accuracy, corpus_bleu, and multi_label_fbeta. The default is valid_loss. The decode_params argument is composed of the following six parameters: embed_size—Sets the embedding size. The default is 100 layers in the neural network. hidden_size—Sets the hidden layer size. The default is 100 layers in the neural network. attention_size—Sets the intermediate attention layer size. The default is 100 layers in the neural network. teacher_forcing—Sets the probability of teacher forcing. Teacher forcing is a strategy for training recurrent neural networks. It feeds the ground truth from the prior time step as input, instead of the model's own previous output, during training. The valid range is 0.0 to 1.0. The default is 1. dropout—Sets the dropout probability. The valid range is 0.0 to 1.0. The default is 0.1. pretrained_emb—Sets the pretrained embedding flag. If True, it will use fast text embedding. If False, it will not use the pretrained text embedding. The default is False. When you choose CHANGEDETECTOR as the model_type parameter value, the arguments parameter will be populated with the following arguments: attention_type—Specifies the module type. The module choices are PAM (Pyramid Attention Module) or BAM (Basic Attention Module). 
The default is PAM. monitor—Specifies which metric to monitor while checkpointing and early stopping. Available metrics are valid_loss, precision, recall, and f1. The default is valid_loss. When you choose MMDETECTION as the model_type parameter value, the arguments parameter will be populated with the following arguments: model—The backbone model used to train the model. The available choices are atss, carafe, cascade_rcnn, cascade_rpn, dcn, detectors, double_heads, dynamic_rcnn, empirical_attention, fcos, foveabox, fsaf, ghm, hrnet, libra_rcnn, nas_fcos, pafpn, pisa, regnet, reppoints, res2net, sabl, and vfnet. The default is cascade_rcnn. model_weight—Specifies whether the pretrained model weights will be used. The default is false. The value can also be a path to a configuration file containing the weights of a model, from the MMDetection repository. When you choose MMSegmentation as the model_type parameter value, the arguments parameter will be populated with the following arguments: model—The backbone model used to train the model. The available choices are ann, apcnet, ccnet, cgnet, danet, deeplabv3, deeplabv3plus, dmnet, dnlnet, emanet, encnet, fastscnn, fcn, gcnet, hrnet, mobilenet_v2, mobilenet_v3, nonlocal_net, ocrnet, ocrnet_base, pointrend, psanet, pspnet, resnest, sem_fpn, unet, and upernet. The default is deeplabv3. model_weight—Specifies whether the pretrained model weights will be used. The default is false. The value can also be a path to a configuration file containing the weights of a model, from the MMSegmentation repository. All model types support the chip_size argument, which is the chip size of the tiles in the training samples. The image chip size is extracted from the .emd file from the folder specified in the in_folder parameter.	Value Table
learning_rate (Optional)	The rate at which existing information will be overwritten with newly acquired information throughout the training process. If no value is specified, the optimal learning rate will be extracted from the learning curve during the training process.	Double
backbone_model (Optional)	Specifies the preconfigured neural network that will be used as the architecture for training the new model. This method is known as Transfer Learning. DENSENET121—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 121 layers deep. Unlike RESNET, which combines the layers using summation, DenseNet combines the layers using concatenation. DENSENET161—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 161 layers deep. Unlike RESNET, which combines the layers using summation, DenseNet combines the layers using concatenation. DENSENET169—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 169 layers deep. Unlike RESNET, which combines the layers using summation, DenseNet combines the layers using concatenation. DENSENET201—The preconfigured model will be a dense network trained on the Imagenet Dataset that contains more than 1 million images and is 201 layers deep. Unlike RESNET, which combines the layers using summation, DenseNet combines the layers using concatenation. MOBILENET_V2—The preconfigured model will be trained on the Imagenet Dataset and is 54 layers deep. It is geared toward edge device computing, since it uses less memory. RESNET18—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 18 layers deep. RESNET34—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 34 layers deep. This is the default. RESNET50—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 50 layers deep. 
RESNET101—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 101 layers deep. RESNET152—The preconfigured model will be a residual network trained on the Imagenet Dataset that contains more than 1 million images and is 152 layers deep. VGG11—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 11 layers deep. VGG11_BN—This preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the Imagenet Dataset and has 11 layers. VGG13—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 13 layers deep. VGG13_BN—This preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the Imagenet Dataset and has 13 layers. VGG16—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 16 layers deep. VGG16_BN—This preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the Imagenet Dataset and has 16 layers. VGG19—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images to classify images into 1,000 object categories and is 19 layers deep. VGG19_BN—This preconfigured model will be based on the VGG network but with batch normalization, which means each layer in the network is normalized. It is trained on the Imagenet Dataset and has 19 layers. 
DARKNET53—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that contains more than 1 million images and is 53 layers deep. REID_V1—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that is used for object tracking. REID_V2—The preconfigured model will be a convolutional neural network trained on the Imagenet Dataset that is used for object tracking.	String
pretrained_model (Optional)	A pretrained model that will be used to fine-tune the new model. The input is an Esri Model Definition file (.emd) or a deep learning package file (.dlpk). A pretrained model with similar classes can be fine-tuned to fit the new model. The pretrained model must have been trained with the same model type and backbone model that will be used to train the new model.	File
validation_percentage (Optional)	The percentage of training samples that will be used for validating the model. The default value is 10.	Double
stop_training (Optional)	Specifies whether early stopping will be implemented. STOP_TRAINING—Early stopping will be implemented, and the model training will stop when the model is no longer improving, regardless of the max_epochs parameter value specified. This is the default. CONTINUE_TRAINING—Early stopping will not be implemented, and the model training will continue until the max_epochs parameter value is reached.	Boolean
freeze (Optional)	Specifies whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. FREEZE_MODEL—The backbone layers will be frozen, and the predefined weights and biases will not be altered in the backbone_model parameter. This is the default. UNFREEZE_MODEL—The backbone layers will not be frozen, and the weights and biases of the backbone_model parameter can be altered to fit the training samples. This takes more time to process but typically produces better results.	Boolean
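
The interaction between the stop_training and max_epochs parameters can be illustrated with a minimal sketch. This is not the tool's implementation. The function name and the patience value (how many epochs without improvement are tolerated) are hypothetical, and the validation losses are supplied as a plain list rather than computed by a model.

```python
def train_with_early_stopping(valid_losses, max_epochs, stop_training=True, patience=5):
    """Simulate a training loop over precomputed validation losses.

    Returns (epochs_run, best_loss). `patience` is an illustrative
    assumption, not a documented parameter of the tool.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    # Never run more epochs than max_epochs allows.
    for epoch, loss in enumerate(valid_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # STOP_TRAINING behavior: stop once the monitored metric stalls.
        if stop_training and epochs_without_improvement >= patience:
            break
    return epochs_run, best_loss


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
print(train_with_early_stopping(losses, max_epochs=20, stop_training=True, patience=3))
print(train_with_early_stopping(losses, max_epochs=20, stop_training=False, patience=3))
```

In this sketch, the first call stops before the final, lower loss is reached, which shows the trade-off that CONTINUE_TRAINING avoids at the cost of longer training.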

Derived Output

Name	Explanation	Data Type
out_model_file	The output trained model file.	File

Code sample

TrainDeepLearningModel example 1 (Python window)

This example trains a tree classification model using the U-Net approach.

# Import system modules  
import arcpy  
from arcpy.ia import *  
 
# Check out the ArcGIS Image Analyst extension license 
arcpy.CheckOutExtension("ImageAnalyst") 
 
# Execute 
TrainDeepLearningModel(r"C:\DeepLearning\TrainingData\Roads_FC", 
     r"C:\DeepLearning\Models\Fire", 40, "UNET", 16, "# #", None, 
     "RESNET34", None, 10, "STOP_TRAINING", "FREEZE_MODEL")

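Before a call like the one in example 1, the keyword values could be checked in plain Python so that a typo is caught before a long training run starts. The sketch below is hypothetical and does not call arcpy. The function name is an assumption, and so are the numeric bounds on max_epochs and validation_percentage. The keyword choices are the ones listed in the parameter table.

```python
STOP_TRAINING_CHOICES = ("STOP_TRAINING", "CONTINUE_TRAINING")
FREEZE_CHOICES = ("FREEZE_MODEL", "UNFREEZE_MODEL")


def check_training_options(max_epochs, validation_percentage, stop_training, freeze):
    """Return a list of problems found in the option values (empty if none)."""
    problems = []
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_epochs, bool) or not isinstance(max_epochs, int) or max_epochs < 1:
        problems.append("max_epochs must be a positive integer")
    # Assumed bounds: a percentage strictly between 0 and 100.
    if not 0 < validation_percentage < 100:
        problems.append("validation_percentage must be between 0 and 100")
    if stop_training not in STOP_TRAINING_CHOICES:
        problems.append(f"stop_training must be one of {STOP_TRAINING_CHOICES}")
    if freeze not in FREEZE_CHOICES:
        problems.append(f"freeze must be one of {FREEZE_CHOICES}")
    return problems


# Values taken from example 1.
print(check_training_options(40, 10, "STOP_TRAINING", "FREEZE_MODEL"))
```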
TrainDeepLearningModel example 2 (stand-alone script)

This example trains an object detection model using the SSD approach.

# Import system modules  
import arcpy  
from arcpy.ia import *  
 
# Check out the ArcGIS Image Analyst extension license 
arcpy.CheckOutExtension("ImageAnalyst") 
 
# Define input parameters
in_folder = "C:\\DeepLearning\\TrainingData\\Cars" 
out_folder = "C:\\Models\\Cars"
max_epochs = 100
model_type = "SSD"
batch_size = 2
arg = "grids '[4, 2, 1]';zooms '[0.7, 1.0, 1.3]';ratios '[[1, 1], [1, 0.5], [0.5, 1]]'"
learning_rate = 0.003
backbone_model = "RESNET34" 
pretrained_model = "C:\\Models\\Pretrained\\vehicles.emd"
validation_percent = 10
stop_training = "STOP_TRAINING"
freeze = "FREEZE_MODEL"


# Execute
TrainDeepLearningModel(in_folder, out_folder, max_epochs, model_type, 
     batch_size, arg, learning_rate, backbone_model, pretrained_model, 
     validation_percent, stop_training, freeze)
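
The arg string in example 2 packs each argument as a name followed by a single-quoted value, with pairs separated by semicolons. A small helper could build that string from ordinary Python values. This is a sketch that assumes the format shown in example 2 is what the tool expects. The function name format_arguments is hypothetical and is not part of arcpy.

```python
def format_arguments(arguments):
    """Build a value-table string such as "grids '[4, 2, 1]';zooms '[0.7, 1.0, 1.3]'".

    `arguments` is a dict mapping argument names to Python values.
    Dict insertion order is preserved in the output.
    """
    return ";".join(f"{name} '{value}'" for name, value in arguments.items())


ssd_arguments = {
    "grids": [4, 2, 1],
    "zooms": [0.7, 1.0, 1.3],
    "ratios": [[1, 1], [1, 0.5], [0.5, 1]],
}
arg = format_arguments(ssd_arguments)
print(arg)
```

Keeping the values as Python lists until the last moment makes them easier to vary between experiments than editing the quoted string by hand.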

Environments

Current Workspace, Processor Type, GPU ID, Parallel Processing Factor, Scratch Workspace

Licensing information

Basic: Requires Image Analyst
Standard: Requires Image Analyst
Advanced: Requires Image Analyst
