Adjust how locations and attributes are extracted

Available with LocateXT license.

After scanning a set of documents or text and evaluating the results, you may want to adjust what is extracted and how the content is evaluated. Your approach for a wide range of documents in varying formats will differ from your approach for separate collections of documents that each have a known format containing semistructured information.

The Extract Locations pane uses various default settings designed to recognize the most common locations and support extracting recent dates. When you have a better understanding of the content of your documents or text, you can adjust these settings and optimize the information that is extracted. These settings are adjusted on the Properties tab.

The collection of default settings is associated with the Default Unstructured Data template. Once you determine the settings that work best for a collection of documents or specific format of text, you can save them to a custom template. Use the template when you receive a new batch of documents in the collection or similar text.

Learn more about templates for extracting locations

Options

By default, when you click the Properties tab, the Options tab is selected. It allows you to turn on or turn off the toggles associated with categories of information that can be extracted from the input documents or text, and to control how that information is processed. It also allows you to specify the symbol that will be used by the output map layer.

  • Extract locations
    • Coordinates—The Coordinates toggle is turned on by default. When documents are scanned, they are examined for spatial coordinates. A point is created in the output feature class to represent each location found.
    • Custom locations—The Custom locations toggle is turned off by default. When documents are scanned, they are examined for place names specified in a custom locations file. The custom locations file associates a place name with a spatial coordinate. A point is created in the output feature class to represent each location found.
    • Fuzzy match—The Fuzzy match toggle is turned off by default. When you are looking for custom locations, a fuzzy match can be used to compare the input documents' content to the custom locations, for example, to account for misspellings.
  • Extract attributes
    • Dates—The Dates toggle is turned on by default. When documents are scanned, they are examined for recent dates. Dates found are extracted and stored in fields in the output feature class's attribute table.
    • Custom attributes—The Custom attributes toggle is turned off by default. When documents are scanned, they are examined for keywords specified in a custom attribute file. The custom attribute file determines the keywords you are looking for and what text is extracted when the keywords are found, and it defines a custom field that will be created in the output feature class's attribute table to store the extracted content.
  • Search Control
    • Require word breaks—The Require word breaks toggle is turned on by default. When documents are scanned, the text is split into words, where a word is text bounded by whitespace or punctuation characters, as in European languages. This setting affects how words are identified when looking for custom locations and custom attributes in a document. It also affects how coordinates and dates are identified, for example, when text that could represent a coordinate or date is surrounded by other characters.
  • Symbology—A solid red circle is the default symbol. When the output map layer is created, points in the output feature class will be displayed using the specified symbol.
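
The effect of the Require word breaks setting can be illustrated with a small sketch. This is not how LocateXT is implemented: `find_term` is a hypothetical helper, and treating a word break as "not directly adjacent to another word character" is an assumption made for illustration only.

```python
import re


def find_term(text: str, term: str, require_word_breaks: bool = True) -> list[str]:
    """Return every occurrence of term in text (hypothetical helper)."""
    pattern = re.escape(term)
    if require_word_breaks:
        # Assumption: a word break means the term is not directly
        # preceded or followed by another word character.
        pattern = rf"(?<!\w){pattern}(?!\w)"
    return re.findall(pattern, text, flags=re.IGNORECASE)


text = "Meet in Paris; the comparison was written by Parisians."
strict = find_term(text, "Paris", require_word_breaks=True)
loose = find_term(text, "Paris", require_word_breaks=False)
```

With word breaks required, only the standalone word is found. Without them, the letters inside "comparison" and "Parisians" also count as matches, which shows why turning the toggle off can increase false positives.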

Arrow buttons are present next to some toggles. Clicking an arrow button moves you to another tab in the Extract Locations pane, where you can customize how coordinates, custom locations, dates, or custom attributes are evaluated and extracted.

The following options are also available in the Extract Locations pane and can be used to customize what files are processed, what content is extracted, and what output is created. However, these options are not represented by toggles on the Options tab.

  • Scan files—Allows you to control which files are scanned.
  • Output—Allows you to control how many features and dates are evaluated, and what content is included in the output feature class's attribute table.

Coordinates

The Coordinates tab determines what coordinate formats will be considered when input documents are scanned. Pairs of numbers and alphanumeric combinations are examined to see if they match the enabled coordinate formats. The spatial coordinate candidates are checked against all enabled formats:

  • X Y formats—Coordinates specified as x,y values
  • DD formats—Decimal degrees format
  • DM formats—Degrees decimal minutes format
  • DMS formats—Degrees, minutes, and seconds format
  • UTM formats—Universal Transverse Mercator format
  • MGRS formats—Military Grid Reference System format

A location is created in the output feature class to represent the first coordinate format match that is found.

Each coordinate format is associated with a different set of options that are set on or off by default to provide a reasonable set of output locations. Some options can produce output locations when the input documents contain pairs of numbers or alphanumeric combinations that resemble spatial coordinates but don't actually describe a location on the ground; these are referred to as false positives. Options that are turned off by default are more likely to produce false positives. However, if you know your documents contain locations in these formats, these options should be enabled. When fewer coordinate formats are enabled, documents will be scanned in less time.

The supported coordinate formats can be customized to suit a set of documents. For example, the documents may be written in a different language, or may have spatial coordinates written using a non-standard notation. The supported coordinate formats are described in more detail below along with the procedure for customizing how those coordinates are evaluated.

The Coordinates tab also allows you to specify the coordinate system with which the spatial coordinates are associated. By default, the coordinates found in documents are handled as if they were defined based on the GCS_WGS_1984 coordinate system. If you know coordinates were collected based on a different coordinate system, click the Select coordinate system button and click the correct coordinate system.

All spatial coordinates in the input documents are processed until the end of the document or the limit on the number of locations that can be extracted is reached.

Learn about limiting how many features are extracted

Access the Coordinates tab

  1. In the Extract Locations pane, click the Properties tab.
  2. Access the Coordinates tab.
    • Click the Options tab, and click the arrow next to the Coordinates toggle.
    • Click the Extract locations tab, and click the Coordinates tab.

Turn on or turn off the coordinates toggle

  1. In the Extract Locations pane, click the Properties tab.
  2. Turn on or turn off the coordinates toggle.
    • Click the Options tab, and click the Coordinates toggle.
    • Click the Extract locations tab, click the Coordinates tab, and click the Create features from coordinates toggle.

X Y formats

Candidate spatial coordinates are compared against the following coordinate formats, if they are enabled. When a candidate matches one of these formats, a location is created in the output feature class. The format of the original coordinate will be specified as x,y in the output feature class.

By default, the x,y coordinate formats as a whole are not enabled. With these formats, coordinates are represented as a pair of numbers that indicate a measurement in the units of the specified coordinate system. They can produce locations that are false positives since they closely resemble sequences of numbers or measurements with no spatial relationship. Also, when text is found to match these coordinate formats, the locations produced will be incorrect if they are associated with the wrong coordinate system.

  • X Y with unit text—Alphanumeric text is recognized as a location when it has the following structure: 71.2071779dd 46.8075410dd or 630084m 4833438m. The units are set to match the coordinate system of the input documents, but they can be changed to recognize other units or additional notations for the same units that exist in your documents. These formats are unlikely to produce locations that are false positives if the coordinate system is correct for the coordinates that are found. This is enabled by default.
  • X Y without unit text—Alphanumeric text is recognized as a location when it has the following structure: 630084 4833438 or 235407.742 900560.004. This coordinate format, and the decimal degrees coordinate format X Y with no symbols both check pairs of numbers, and both formats could find a match for the same x,y coordinate pair. A warning will appear indicating there is a conflict when both formats are enabled. If both are enabled and both find a match, the decimal degrees result will be used as the output location. The two formats are less likely to produce a conflict when a projected coordinate system is specified. This is enabled by default.

When Log invalid coordinates is checked, any candidate spatial coordinates that have invalid values or fall outside the defined coordinate system are recorded as invalid in a log file. You can review this log file when the process is complete. Invalid coordinates are logged by default.

Set coordinate units

You can change the units associated with the x,y formats to produce accurate locations based on the information contained by the input documents.

  1. Access the Coordinates tab.
  2. Turn coordinates on.
  3. Click the Coordinate System drop-down list or the Select coordinate system button and click the coordinate system associated with the spatial coordinates present in the input documents. For example, specify a projected coordinate system.
  4. Check the X Y formats option.
  5. Click to expand the options associated with the X Y with unit text format.

    The units are set by default to match the units of the coordinate system. For example, a coordinate system based on the units US Feet will have the units set to ftUS.

  6. Click the Set Units button to change the notations that will be recognized as units in the documents.

    The Allowed Units dialog box appears.

  7. Click the Add From List button to add a well-known, predefined unit of measure to the list, if appropriate.
  8. Add a custom unit to the list, if appropriate.
    1. In the new row at the bottom of the table, click in the Unit Text column and type the characters that should be recognized as a representation of this unit of measure. For example, type ft (US) to recognize this as an additional way to represent the ftUS units.
    2. Specify the distance in meters that is associated with this unit of measure.
    3. Click OK.
  9. Click to expand the options associated with the X Y without unit text format.
  10. Click the Set Units button to change the units that will be associated with any coordinate pairs found in the documents.

    The Default Units dialog box appears.

  11. Click the Unit Name drop-down list and click one of the internationally-recognized units defined in the list, or type the name of another unit of measure for distance that does not appear in the list.

    When you select a unit in the list, the distance in meters associated with the selected unit of measure appears in the Meters/Unit text box.

  12. If you typed the name of a custom unit of measure into the Unit Name text box, type the number of meters it represents into the Meters/Unit text box.
  13. Click OK.
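
The relationship between unit text and the Meters/Unit value can be sketched as a simple lookup. The dictionary below is a hypothetical stand-in for the Allowed Units table, not the product's data structure; the conversion factors for the international foot and the US survey foot are the standard definitions.

```python
# Meters represented by one unit. Keys are the unit text as it might
# appear in a document. This table is a hypothetical illustration.
METERS_PER_UNIT = {
    "m": 1.0,
    "ft": 0.3048,         # international foot
    "ftUS": 1200 / 3937,  # US survey foot
}

# Register an additional notation for an existing unit, similar to
# typing "ft (US)" in the Unit Text column.
METERS_PER_UNIT["ft (US)"] = METERS_PER_UNIT["ftUS"]


def to_meters(value: float, unit_text: str) -> float:
    """Convert a measurement to meters using the unit text found beside it."""
    return value * METERS_PER_UNIT[unit_text]
```

For example, `to_meters(3937, "ft (US)")` evaluates to approximately 1200 meters, because 3937 US survey feet are defined as exactly 1200 meters.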

DD formats

Candidate spatial coordinates are compared against the following coordinate formats, if they are enabled. When a candidate matches one of these formats, a location is created in the output feature class. The format of the original coordinate will be specified as decimal degrees in the output feature class.

  • Latitude and longitude—Alphanumeric text is recognized as a location when it has the following structure: 38.8N 77.035W or W77N38.88909. These formats are unlikely to produce locations that are false positives. This is enabled by default.
  • X Y with degree symbols—Alphanumeric text is recognized as a location when it has the following structure: 38.8° -77.035° or -077d+38.88909d. These formats are unlikely to produce locations that are false positives. This is enabled by default.
  • X Y with no symbols—Alphanumeric text is recognized as a location when it has the following structure: 38.8 -77.035 or -077.0, +38.88909. These formats are likely to produce locations that are false positives since they closely resemble sequences of numbers with no spatial relationship. These formats can also resemble numbers that define a spatial location in a projected coordinate system—a warning will appear indicating there is a conflict when this format and the X Y without unit text option are both enabled. This is enabled by default.

When Log invalid coordinates is checked, any candidate spatial coordinates that do not match any of the enabled formats are recorded as invalid in a log file. You can review this log file when the process is complete. Invalid coordinates are logged by default.
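
As an illustration of how a Latitude and longitude candidate such as 38.8N 77.035W might be recognized and validated, the following sketch uses a regular expression. The pattern and the `parse_dd` helper are assumptions made for this example; they cover only the first notation shown above and do not reflect the actual matching rules of the product.

```python
import re
from typing import Optional

# Hypothetical pattern: latitude with N/S, then longitude with E/W.
DD_PATTERN = re.compile(
    r"(?<![\d.])(?P<lat>\d{1,2}(?:\.\d+)?)\s*(?P<ns>[NS])[\s,]+"
    r"(?P<lon>\d{1,3}(?:\.\d+)?)\s*(?P<ew>[EW])"
)


def parse_dd(text: str) -> Optional[tuple[float, float]]:
    """Return (latitude, longitude) in decimal degrees, or None."""
    match = DD_PATTERN.search(text)
    if match is None:
        return None
    lat = float(match["lat"]) * (1 if match["ns"] == "N" else -1)
    lon = float(match["lon"]) * (1 if match["ew"] == "E" else -1)
    # A candidate with out-of-range values is treated as invalid here.
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon
```

In this sketch, `parse_dd("Site at 38.8N 77.035W")` returns `(38.8, -77.035)`, while `parse_dd("95.0N 77.0W")` returns `None` because the latitude is out of range.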

DM formats

Candidate spatial coordinates are compared against the following coordinate formats, if they are enabled. When a candidate matches one of these formats, a location is created in the output feature class. The format of the original coordinate will be specified as degrees decimal minutes in the output feature class.

  • Latitude and longitude—Alphanumeric text is recognized as a location when it has the following structure: 3853.3N 7702.100W or W7702N3853.3458. These formats are unlikely to produce locations that are false positives. This is enabled by default.
  • X Y with minutes symbols—Alphanumeric text is recognized as a location when it has the following structure: 3853' -7702.1' or -07702m+3853.3458m. These formats are unlikely to produce locations that are false positives. This is enabled by default.

When Log invalid coordinates is checked, any candidate spatial coordinates that do not match any of the enabled formats are recorded as invalid in a log file. You can review this log file when the process is complete. Invalid coordinates are logged by default.
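
The packed notation in examples such as 3853.3N combines whole degrees and decimal minutes in one number. A minimal sketch of the arithmetic, assuming the token has already been isolated and ends with a hemisphere letter (`packed_dm_to_dd` is a hypothetical helper):

```python
def packed_dm_to_dd(token: str) -> float:
    """Convert a token such as '3853.3N' (DDMM.m) to decimal degrees."""
    hemisphere = token[-1].upper()
    if hemisphere not in "NSEW":
        raise ValueError(f"missing hemisphere letter: {token!r}")
    # The last two integer digits and the fraction are minutes.
    degrees, minutes = divmod(float(token[:-1]), 100)
    if minutes >= 60:
        raise ValueError(f"minutes out of range: {token!r}")
    decimal_degrees = degrees + minutes / 60
    limit = 90 if hemisphere in "NS" else 180
    if decimal_degrees > limit:
        raise ValueError(f"degrees out of range: {token!r}")
    # South and west are negative by convention.
    return -decimal_degrees if hemisphere in "SW" else decimal_degrees
```

Here 3853.3N is read as 38 degrees and 53.3 minutes, or about 38.8883 decimal degrees, and 7702.100W is read as 77 degrees and 2.1 minutes west, or about -77.035.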

DMS formats

Candidate spatial coordinates are compared against the following coordinate formats, if they are enabled. When a candidate matches one of these formats, a location is created in the output feature class. The format of the original coordinate will be specified as degrees, minutes, and seconds in the output feature class.

  • Latitude and longitude—Alphanumeric text is recognized as a location when it has the following structure: 385320.7N 770206.000W or W770206N385320.76. These formats are unlikely to produce locations that are false positives. This is enabled by default.
  • X Y with seconds symbols—Alphanumeric text is recognized as a location when it has the following structure: 385320" -770206.0" or -0770206.0s+385320.76s. These formats are unlikely to produce locations that are false positives. This is enabled by default.
  • X Y with separators—Alphanumeric text is recognized as a location when it has the following structure: 38:53:20 -77:2:6.0 or -077/02/06/, +38/53/20.76. These formats sometimes produce locations that are false positives, since they resemble other types of formatted numbers such as dates and times. This is enabled by default.

When Log invalid coordinates is checked, any candidate spatial coordinates that do not match any of the enabled formats are recorded as invalid in a log file. You can review this log file when the process is complete. Invalid coordinates are logged by default.
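
The X Y with separators notation can be converted to decimal degrees with simple arithmetic. The sketch below assumes a colon separator and a single value per token; `separated_dms_to_dd` is a hypothetical helper, not part of the product.

```python
def separated_dms_to_dd(token: str) -> float:
    """Convert a token such as '-77:2:6.0' to decimal degrees."""
    token = token.strip()
    sign = -1.0 if token.startswith("-") else 1.0
    degrees, minutes, seconds = (
        float(part) for part in token.lstrip("+-").split(":")
    )
    if minutes >= 60 or seconds >= 60:
        raise ValueError("minutes and seconds must be below 60")
    return sign * (degrees + minutes / 60 + seconds / 3600)
```

With this helper, -77:2:6.0 converts to about -77.035. A time of day such as 12:30:45 also parses without error, to 12.5125, which illustrates why this notation sometimes produces false positives.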

UTM formats

Candidate spatial coordinates are compared against the following coordinate formats, if they are enabled. When a candidate matches one of these formats, a location is created in the output feature class. The format of the original coordinate will be specified as Universal Transverse Mercator in the output feature class.

  • Universal Transverse Mercator—Alphanumeric text is recognized as a location when it has the following structure: 18S 323503 4306438 or 18 north 323503.25 4306438.39. These formats are unlikely to produce locations that are false positives. This is enabled by default.
  • UPS north polar—Alphanumeric text is recognized as a location when it has the following structure: Y 2722399 2000000 or north 2711399 2000000. These formats are unlikely to produce locations that are false positives, but it is not common to find these coordinates in typical documents. This is not enabled by default.
  • UPS south polar—Alphanumeric text is recognized as a location when it has the following structure: A 2000000 3168892 or south 2000000 3168892. These formats are unlikely to produce locations that are false positives, but it is not common to find these coordinates in typical documents. This is not enabled by default.

MGRS formats

Candidate spatial coordinates are compared against the following coordinate formats, if they are enabled. When a candidate matches one of these formats, a location is created in the output feature class. The format of the original coordinate will be specified as Military Grid Reference System in the output feature class.

  • Military Grid Reference System—Alphanumeric text is recognized as a location when it has the following structure: 18S UJ 13503 06438 or 18SUJ0306. These formats are unlikely to produce locations that are false positives. This is enabled by default.
  • North polar—Alphanumeric text is recognized as a location when it has the following structure: Y TG 56814 69009 or YTG5669. These formats are unlikely to produce locations that are false positives, but it is not common to find these coordinates in typical documents. This is not enabled by default.
  • South polar—Alphanumeric text is recognized as a location when it has the following structure: A TN 56814 30991 or ATN5630. These formats sometimes produce locations that are false positives, since they can resemble regular numbers. This is not enabled by default.

When Log invalid coordinates is checked, any candidate spatial coordinates that do not match any of the enabled formats are recorded as invalid in a log file. You can review this log file when the process is complete. Invalid coordinates are logged by default.

Customize how spatial coordinates are recognized

The documents you are working with may contain spatial coordinates that can't be detected with the standard coordinate format settings. For example, the author of the documents may not have had GIS training and may have written spatial coordinates in a nonstandard manner. A common example is adding extra text between latitude and longitude values. For example, in the text "+45.56° and -69.66°", the extra word "and" prevents the text from being recognized as a spatial coordinate.

Similarly, if the documents you are analyzing were written in a mixture of languages, by default text will only be recognized as a spatial coordinate for documents written in English, or where the directional notations use English words or abbreviations. For example, if the text in the document is French, and a direction is represented in the spatial coordinate using an O for Ouest such as 60.91°N, 147.34°O, instead of using the English W for West, the text will not be recognized as a spatial coordinate. Coordinate formats can be customized to recognize the formats used in other languages in addition to or instead of English, depending on how you want to process the documents.

You can customize how spatial coordinates are recognized in documents using the Customize dialog box. Default settings are provided for some languages—select the language of your documents on the Settings tab. In an Asian-language document, spatial coordinates defined using a combination of Asian characters and full-width Hindu-Arabic numerals such as 北緯51.50°、西経175.63° are not recognized as a spatial coordinate at this time.

  1. Access the Coordinates tab.
  2. Turn coordinates on.
  3. Click the Customize button at the top of the list of spatial coordinate formats.
  4. If the documents are written in another language and settings are available for that language on the Settings tab in the Customize dialog box, click the language in the list.
  5. Add the settings for the selected language to the Customize dialog box.
    • Click Replace Settings to scan the documents using only the settings associated with the selected language. If the current language is English and the selected language is French, after replacing the English settings in the dialog box with the French settings, only spatial coordinates written using a French format will be recognized in the documents.
    • Click Merge Settings to scan the documents using the settings for the current language as well as the additional language. If the current language is English and the selected language is French, after merging the French settings into the settings in the dialog box, spatial coordinates written using both English and French formats will be recognized in the documents.
  6. A spatial coordinate has many components, including several that are specific to a group of languages. Choose a tab under the Coordinates heading associated with one component of a spatial coordinate, for example, North or Between Latitude/Longitude.
  7. Modify the list of terms for this component to include the notations used in the documents that are being scanned.
    1. Click in the new row at the bottom of the grid in the Term Text column.
    2. Type the value that appears in the documents and should be recognized as a component of a spatial coordinate. For example, add the misspelling "Nort" to the list of terms on the North tab if it is common in a group of documents. Add "and" to the list of terms on the Between Latitude/Longitude tab to account for documents where this extra text appears between latitude and longitude values.
    3. Press Enter.
  8. Warnings will appear if the same term has been entered on multiple tabs in the Customize dialog box. While terms can be duplicated, this will decrease the accuracy with which locations are recognized in documents. Remove any duplicate terms that are not essential to the process of recognizing text as a location.
    1. Click one of the affected tabs.
    2. Click a row in the grid to select the duplicate term that should not be used.
    3. Click the Remove button to remove the selected row from the grid.

    If the duplicate terms are left in place, a warning message will appear at the bottom of the Extract Locations pane next to the Extract button.

  9. Click OK.

The next time locations are extracted from a set of documents, the custom definitions will be used to evaluate text and determine if it represents a spatial coordinate.
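
The difference between Replace Settings and Merge Settings, and the reason duplicate terms trigger a warning, can be sketched with sets of terms per tab. The term lists and function names below are hypothetical and much smaller than the real settings; they only illustrate the behavior described above.

```python
Settings = dict[str, set[str]]

# Hypothetical, abbreviated term lists keyed by tab name.
ENGLISH: Settings = {
    "North": {"N", "North"},
    "South": {"S", "South"},
    "East": {"E", "East"},
    "West": {"W", "West"},
}
FRENCH: Settings = {
    "North": {"N", "Nord"},
    "South": {"S", "Sud"},
    "East": {"E", "Est"},
    "West": {"O", "Ouest"},
}


def replace_settings(current: Settings, selected: Settings) -> Settings:
    """Keep only the selected language's terms."""
    return {tab: set(terms) for tab, terms in selected.items()}


def merge_settings(current: Settings, selected: Settings) -> Settings:
    """Keep the current terms and add the selected language's terms."""
    merged = {tab: set(terms) for tab, terms in current.items()}
    for tab, terms in selected.items():
        merged.setdefault(tab, set()).update(terms)
    return merged


def duplicate_terms(settings: Settings) -> dict[str, list[str]]:
    """Map each term that appears on more than one tab to those tabs."""
    tabs_by_term: dict[str, list[str]] = {}
    for tab, terms in settings.items():
        for term in terms:
            tabs_by_term.setdefault(term, []).append(tab)
    return {t: sorted(tabs) for t, tabs in tabs_by_term.items() if len(tabs) > 1}
```

After merging, the West tab holds both W and O, so coordinates such as 60.91°N, 147.34°O could be matched alongside English notations. If the same term were then added to a second tab, `duplicate_terms` would report it, mirroring the warning described in step 8.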

Use comma as a decimal separator

By default, documents are scanned for coordinates that use a period (.) or a mid-dot (·) as the decimal separator, for example: Lat 01° 10·80’ N Long 103° 28·60’ E. If you are working with documents in which numbers use commas as the decimal separator, for example: 52° 8′ 32,14″ N; 5° 24′ 56,09″ E, you should instead check the option Use comma as decimal separator.

This setting only controls how alphanumeric text is evaluated to determine if it is a spatial coordinate. This setting does not affect how the text is evaluated to determine if it represents a custom location or matches a keyword that should be stored in a custom attribute. That is, this setting does not provide a shortcut to indicate the text is written in a European language such as French where numbers often use commas as the decimal separator. The computer's regional settings are not used to control this setting.
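
The effect of the decimal separator option on a single coordinate value can be sketched as follows. The pattern and the `parse_dms` helper are assumptions for illustration; they handle only the degree, prime, and double-prime symbols used in the example above.

```python
import re
from typing import Optional


def parse_dms(text: str, comma_is_decimal: bool = False) -> Optional[float]:
    """Return decimal degrees for text such as '52° 8′ 32,14″ N', or None."""
    # Accepted decimal marks depend on the option, as described above.
    decimal_marks = "," if comma_is_decimal else ".·"
    pattern = (
        r"(\d{1,3})°\s*(\d{1,2})′\s*"
        r"(\d{1,2}(?:[" + re.escape(decimal_marks) + r"]\d+)?)″\s*([NSEW])"
    )
    match = re.search(pattern, text)
    if match is None:
        return None
    degrees, minutes, seconds_text, hemisphere = match.groups()
    seconds = float(seconds_text.replace(",", ".").replace("·", "."))
    value = int(degrees) + int(minutes) / 60 + seconds / 3600
    return -value if hemisphere in "SW" else value
```

In this sketch, the text 52° 8′ 32,14″ N is not matched with the default setting, but is read as about 52.14226 decimal degrees when `comma_is_decimal` is true.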

Interpret as longitude, latitude

When coordinate pairs are provided without symbols or directional notations, the correct spatial location is likely to be produced if one number is between 0 and 90 and the other number is between 90 and 180. If both numbers are between 0 and 90, it is more difficult to determine the correct location.

Because latitude-longitude is such a strong convention in geography, coordinate pairs where both numbers are between 0 and 90 are evaluated in this manner by default, in other words, where the first number is a value on the y-axis, and the second number is a value on the x-axis. However, coordinate pairs are often provided as x,y combinations in other disciplines, such as mathematics.

Check the option Interpret as longitude, latitude if you prefer for these ambiguous coordinate pairs to be evaluated as x,y combinations instead, that is, where the first number is a longitude and the second number is a latitude.
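
The decision described in this section can be summarized in a short sketch. `resolve_pair` is a hypothetical helper that always returns the result as (latitude, longitude); it is an illustration of the rule, not the product's code.

```python
def resolve_pair(
    first: float, second: float, interpret_as_lon_lat: bool = False
) -> tuple[float, float]:
    """Return (latitude, longitude) for an unlabeled coordinate pair."""
    if abs(first) > 180 or abs(second) > 180:
        raise ValueError("value outside the valid range for degrees")
    first_must_be_lon = abs(first) > 90
    second_must_be_lon = abs(second) > 90
    if first_must_be_lon and second_must_be_lon:
        raise ValueError("neither value can be a latitude")
    if first_must_be_lon:
        return second, first
    if second_must_be_lon:
        return first, second
    # Ambiguous: both values could be a latitude, so the option decides.
    return (second, first) if interpret_as_lon_lat else (first, second)
```

For the pair 38.8 -77.035, the default returns latitude 38.8 and longitude -77.035, while setting `interpret_as_lon_lat` swaps them. For -122.4 37.8, the option has no effect because only one ordering is valid.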

Determine how coordinates are evaluated

Coordinates must be turned on to change how spatial coordinates are evaluated when documents are examined.

  1. Access the Coordinates tab.
  2. Turn coordinates on.
  3. Click the Coordinate System drop-down list or the Select coordinate system button and click the coordinate system associated with the spatial coordinates present in the input documents.
  4. Check the coordinate formats you want to use to evaluate candidate spatial coordinates. Uncheck coordinate formats that you do not want to use.
  5. Specify any customizations that should be used when evaluating text to determine if it represents a spatial coordinate.
  6. Check or uncheck the Log invalid coordinates options to use the log files to evaluate the results.
  7. Check Use comma as decimal separator if the input documents have content in which the spatial coordinates are specified using commas as the decimal separator.
  8. Check Interpret as longitude, latitude if the input documents have content in which the spatial coordinates are specified as longitude-latitude coordinates instead of latitude-longitude coordinates.

The next time locations are extracted, these coordinate settings will be used to evaluate candidate spatial coordinates and determine which locations are included in the output feature class.

Identify custom locations with a fuzzy match

When custom locations are turned on, content in the documents that are being scanned is compared against the place names specified in the custom locations file. By default, the content has to exactly match one of the specified place names to create a location in the output feature class.

When fuzzy matching is turned on, an approximate match is used instead to compare the document's content to the specified place names. A location is created in the output feature class if the input content matches 70 percent of a place name's characters. This can account for some misspellings and also variations such as using the plural form of a word in a place name instead of the singular form. The 70 percent assessment is strictly based on a count of the number of letters that match; natural language processing algorithms such as stemming are not used to determine if a word in a document matches a custom location.

A useful workflow is to first extract locations with fuzzy matching turned off, and then try it again with fuzzy matching turned on to find additional place names. The results can then be compared to determine the best results. While in some cases this setting will help you find additional locations that would otherwise be missed, content in the documents can also be matched incorrectly to a place name, resulting in a location that is a false positive.

Fuzzy matching is only used with custom locations. If the custom locations toggle is turned off, turning on the fuzzy match toggle has no effect. This option doesn't change the way a document's content is compared to keywords specified in a custom attribute file, for example.
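
The idea of a character-count threshold can be sketched with Python's standard `difflib` module. The exact comparison that LocateXT performs is not described here, so the way matching characters are counted below is an assumption; only the 70 percent threshold comes from the text above.

```python
from difflib import SequenceMatcher


def fuzzy_match(candidate: str, place_name: str, threshold: float = 0.7) -> bool:
    """Return True if enough of place_name's characters match candidate."""
    if not place_name:
        return False
    a, b = candidate.lower(), place_name.lower()
    # Assumption: count characters that fall in common, in-order blocks.
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / len(b) >= threshold
```

Under these assumptions, the misspelling "Missisippi" and the plural "Springfields" match the place names "Mississippi" and "Springfield", while "Boston" does not match "Mississippi". No stemming or other language processing is involved, consistent with the description above.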

Turn on or turn off the fuzzy match toggle

  1. In the Extract Locations pane, click the Properties tab.
  2. Turn on or turn off the fuzzy match toggle.
    • Click the Options tab, and click the Fuzzy match toggle.
    • Click the Extract locations tab, click the Custom Locations tab, and click the Use fuzzy matching toggle.

Dates

The Dates tab determines what date formats will be considered when input documents are scanned. Alphanumeric combinations are examined to see if they match the enabled date formats. The date candidates are checked against all enabled formats in order as specified below. Sometimes regular numbers are mistakenly identified as a date; these are referred to as false positives.

The supported date formats can be customized to suit a set of documents. For example, the documents may be written in a different language, or may have dates written using a non-standard notation. The date formats are described in more detail below along with the procedure for customizing how those dates are evaluated.

All dates in the input documents are processed until the end of the document or the limit on the number of dates that can be extracted is reached.

Learn about limiting how many dates are extracted

  • Month name used—The month name is spelled out in the text, either in full or as an abbreviation, for example: January 1, 2010 or 2 FEB 11. In languages other than English, the dates recognized when this option is enabled may not, strictly speaking, use a month name because months may be identified by number, for example. However, the dates identified when this option is used are those written in a more traditional manner instead of using a variation of the ISO 8601 date formats. These formats are unlikely to produce dates that are false positives. This is enabled by default.
  • M/D/Y and D/M/Y—The date format is either month, day, and year, or day, month, and year, with separators between the values, for example, 10/31/2017 or 28-2-11. These formats sometimes produce dates that are false positives. The actual date represented is ambiguous when both the month and day are represented by numbers less than or equal to 12. Options are available to choose how ambiguous dates are interpreted when they are found. By default, the Interpret as MDY when ambiguous option is selected, and the text 03/02/2012 will be interpreted as March 2, 2012; this option is appropriate when working with documents authored in the US, where the default date format is MM/DD/YYYY. When working with documents authored in another country where the default date format is DD/MM/YYYY, select the Interpret as DMY when ambiguous option instead; in this case, the text 4-12-13 will be interpreted as December 4, 2013. Dates are recognized both when the month and day are single digits and when those single digits have leading zeros. This format is enabled by default.
  • YYYYMMDD—The date format is year, month, day, for example, 2015-06-03 or 20140502. When separators are used between the different parts of the date, single-digit month and day values will be recognized. For example, 2015-6-3 would also be recognized as June 3, 2015, but 201452 would not be recognized as May 2, 2014. The standardized date that is produced will have leading zeros for the month and day when the original value is a single digit, with a four-digit year. These formats sometimes produce dates that are false positives. This is enabled by default.
  • YYMMDD—The date format is year, month, day, for example, 160722 or 170304. The month and day will have leading zeros when the value is a single digit, with a two-digit year. These formats are likely to produce dates that are false positives. This is enabled by default.
  • YYJJJ—The year followed by the Julian date, which is the position of the day in the year written as a number from 1 to 366, with leading zeros when the day is a one- or two-digit number, for example, 18001 or 19365. The format YYYYJJJ is also supported, where the year is fully qualified; for example, 2020060 represents Feb 29, 2020. These formats are likely to produce dates that are false positives. This is enabled by default.
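
The interpretation rules for the numeric formats above can be sketched in Python. This is only an illustration of the described behavior, not the logic used by the Extract Locations pane. The function names, the ambiguous_as parameter, and the window_start parameter are hypothetical, and the two-digit year handling in the first function assumes a 100-year window that begins in 1970.

```python
import re
from datetime import date, datetime


def interpret_numeric_date(text, ambiguous_as="MDY", window_start=1970):
    """Interpret text such as 10/31/2017 or 28-2-11 as a date.

    Hypothetical helper: ambiguous_as mirrors the MDY/DMY choice, and
    window_start is an assumed 100-year window for two-digit years.
    """
    match = re.fullmatch(r"(\d{1,2})[/\-.](\d{1,2})[/\-.](\d{2}|\d{4})", text)
    if match is None:
        return None
    a, b, year = (int(part) for part in match.groups())
    if len(match.group(3)) == 2:
        # Place a two-digit year inside the 100-year window.
        year += window_start - window_start % 100
        if year < window_start:
            year += 100
    if a > 12 and b <= 12:
        day, month = a, b      # only D/M/Y is possible
    elif b > 12 and a <= 12:
        month, day = a, b      # only M/D/Y is possible
    elif ambiguous_as == "MDY":
        month, day = a, b      # ambiguous: use the selected preference
    else:
        day, month = a, b
    try:
        return date(year, month, day)
    except ValueError:
        return None            # not a real calendar date


def interpret_julian_date(text):
    """Interpret YYJJJ or YYYYJJJ text, where JJJ is the day of the year."""
    fmt = {5: "%y%j", 7: "%Y%j"}.get(len(text))
    if fmt is None or not text.isdigit():
        return None
    try:
        # Note: strptime applies its own pivot for %y (69-99 map to 1969-1999).
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


print(interpret_numeric_date("03/02/2012"))                    # MDY preference
print(interpret_numeric_date("4-12-13", ambiguous_as="DMY"))   # DMY preference
print(interpret_julian_date("2020060"))
```

With these assumptions, the three calls reproduce the examples in the list: March 2, 2012; December 4, 2013; and February 29, 2020.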

The first match that is found is extracted and stored in the output feature class's attribute table in the First Date column, as long as the date falls within the date range that is being evaluated. Similarly, the oldest date found is stored in the Earliest Date column, and the most recent date found is stored in the Latest Date column. All dates found in the document are listed in the All Dates column separated by commas, to the maximum size allowed in the table. All of these dates are recorded in the YYYY-MM-DD format, regardless of the format used in the original text. In contrast, the Extracted Date Text column records the text that was found in the document that was interpreted as a date, exactly as it was found in the document.
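
As a rough illustration of how these columns relate to each other, the following Python sketch derives them from a list of dates in the order they were found. The function name is hypothetical, the default range and the 254-character limit are taken from the default settings described elsewhere in this topic, and the exact separator and truncation behavior (whole dates only, a comma with no space) are assumptions.

```python
from datetime import date


def summarize_dates(found, range_start=date(1985, 1, 1),
                    range_end=date(2030, 12, 31), max_chars=254):
    """Build the date columns from dates listed in document order."""
    in_range = [d for d in found if range_start <= d <= range_end]
    if not in_range:
        return None
    all_dates = ""
    for d in in_range:
        text = d.isoformat()  # always YYYY-MM-DD
        candidate = text if not all_dates else all_dates + "," + text
        if len(candidate) > max_chars:
            break             # assumed: stop before exceeding the field size
        all_dates = candidate
    return {
        "First Date": in_range[0].isoformat(),
        "Earliest Date": min(in_range).isoformat(),
        "Latest Date": max(in_range).isoformat(),
        "All Dates": all_dates,
    }


found = [date(2017, 10, 31), date(1950, 1, 1), date(2011, 2, 28), date(2020, 2, 29)]
print(summarize_dates(found))
```

In this example, the 1950 date falls outside the default range, so it does not appear in any column.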

Learn about setting the date range

If you know your documents only contain dates in certain formats, the other date formats can be disabled. When fewer date formats are enabled, documents will be scanned in less time.

Access the Dates tab

  1. In the Extract Locations pane, click the Properties tab.
  2. Access the Dates tab.
    • Click the Options tab, and click the Jump To Option arrow next to the Dates toggle.
    • Click the Extract attributes tab, and click the Dates tab.

Turn on or turn off the dates toggle

  1. In the Extract Locations pane, click the Properties tab.
  2. Turn on or turn off the dates toggle.
    • Click the Options tab, and click the Dates toggle.
    • Click the Extract attributes tab, click the Dates tab, and click the Create fields from dates toggle.

Customize how dates are recognized

The documents you are working with may contain dates that can't be detected with the standard date format settings. For example, if the Month name used option is enabled, but the author of a set of documents habitually misspelled February as Febuary, that text will not be recognized as a date.

Similarly, if the documents you are analyzing were written in a mixture of languages, by default, text will only be recognized as a date for documents written in English. For example, with the Month name used option, the English date July 17, 2018 is recognized. However, in a French document the equivalent date 17 juillet, 2018 is not recognized as a date, by default. Date formats can be customized to recognize the formats used in other languages in addition to or instead of English, depending on how you want to process the documents.

You can customize how dates are recognized in documents using the Customize dialog box. Default settings are provided for some languages—select the language of your documents on the Settings tab. In an Asian-language document, options on the Numerals tab allow dates to be recognized when specified using only Asian characters such as 平成三十年六月十八日, and a combination of Asian characters and full-width Hindu-Arabic numerals such as 平成 2 8年 4月 14日.

Some settings control whether two- and four-digit numbers that occur in a document are recognized as a year, which affects whether text is recognized as a date and, in turn, whether it falls in the acceptable range of dates to extract from documents. When you are working with digital versions of historical documents or documents that provide a projection of future events, you may need to adjust the range of numbers that are recognized as a year to suit those documents, in addition to modifying the Limit extracted dates to this range setting on the Output tab in the Extract Locations pane.

  1. Access the Dates tab.
  2. Turn on the dates toggle.
  3. Click the Customize button at the top of the list of date formats.
  4. If the documents are written in another language and settings are available for that language on the Settings tab in the Customize dialog box, click that language in the list.
  5. Add the settings for the selected language to the Customize dialog box.
    • Click Replace Settings to scan the documents using only the settings associated with the selected language. If the current language is English and the selected language is French, after replacing the English settings in the dialog box with the French settings, only dates written using a French format will be recognized in the documents.
    • Click Merge Settings to scan the documents using the settings for the current language as well as the additional language. If the current language is English and the selected language is French, after merging the French settings into the settings in the dialog box, dates written using both English and French formats will be recognized in the documents.
  6. A date can have many components when it is written down. Choose a tab under the Dates heading associated with one component of a date, for example, February.
  7. Modify the list of terms to include the notations used in the documents that are being scanned.
    1. Click in the new row at the bottom of the grid in the Term Text column.
    2. Type the appropriate value that appears in the documents such as the misspelling Febuary as one of the values that can identify the month of February.
    3. Press Enter.
  8. Warnings will appear if the same term has been entered on multiple tabs in the Customize dialog box. While terms can be duplicated, this will decrease the accuracy with which dates are recognized in documents. Remove any duplicate terms that are not essential to the process of recognizing text as a date.
    1. Click one of the affected tabs.
    2. Click a row in the grid to select the duplicate term that should not be used.
    3. Click the Remove button to remove the selected row from the grid.

    If the duplicate terms are left in place, a warning message will appear at the bottom of the Extract Locations pane next to the Extract button.

  9. On the Year Ranges tab, specify a range of numbers that you want to interpret as years within your documents.
  10. On the Numerals tab, specify what types of characters can be recognized as a date.
  11. Click OK.
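
The duplicate-term warning described in the steps above can be pictured with a short Python sketch. The term lists, the tab names used as dictionary keys, and the case-insensitive comparison are all assumptions made for illustration; they are not the contents of the Customize dialog box.

```python
from collections import defaultdict


def find_duplicate_terms(terms_by_tab):
    """Return terms that were entered on more than one tab."""
    tabs_for_term = defaultdict(set)
    for tab, terms in terms_by_tab.items():
        for term in terms:
            # Assumed: terms are compared without regard to case.
            tabs_for_term[term.casefold()].add(tab)
    return {term: sorted(tabs)
            for term, tabs in tabs_for_term.items() if len(tabs) > 1}


# Hypothetical term lists, keyed by the tab they were entered on.
terms_by_tab = {
    "February": ["February", "Feb", "Febuary"],   # misspelling added by hand
    "June": ["June", "Jun"],
    "July": ["July", "Jul", "juillet", "jun"],    # "jun" entered here by mistake
}

duplicates = find_duplicate_terms(terms_by_tab)
for term, tabs in duplicates.items():
    print(f"Warning: '{term}' appears on tabs {', '.join(tabs)}")
```

Here the term jun is reported because it appears on both the June and July tabs, which is the kind of duplicate that reduces the accuracy of date recognition.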

Determine how dates are evaluated

The dates toggle must be turned on to change how the input documents are evaluated with respect to dates, and to include this information in the output feature class.

  1. Access the Dates tab.
  2. Turn on the dates toggle.
  3. Check the date formats you want to use to evaluate candidate dates. Uncheck date formats that you do not want to use.
  4. Specify any customizations that should be used when evaluating text to determine if it represents a date.

The next time dates are extracted, these date settings will be used to evaluate candidate dates and determine which dates are included in the output feature class's attribute table.

Require word breaks

The Require word breaks setting determines how text is considered to be a word. When word breaks are required, text is considered a word when it is bounded by whitespace or punctuation characters as in European languages. For example, the English word Pacific would correctly not produce a match against the text The City of Pacifica is located just 15 minutes south of San Francisco. However, with the text I flew to Tokyo in Japanese, 私は東京に飛んで, you would not be able to find the word Tokyo, 東京.

With Require word breaks turned off, text does not have to be bounded by whitespace or punctuation characters to match a given set of text. For example, a custom location that was looking for the word Pacific would incorrectly produce a match against the text The City of Pacifica is located just 15 minutes south of San Francisco. However, a custom location that was looking for the Japanese text for Tokyo, 東京, would successfully produce a match against the Japanese text for I flew to Tokyo, 私は東京に飛んで.

This setting affects how documents are scanned for words that match custom locations, custom attributes, coordinates, and dates. Depending on the language of the text in the documents, the same setting can produce many false positives or few. It is best to process documents written in different languages separately, with this setting turned on or off as appropriate for each language.
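
The effect of the setting can be demonstrated with a regular expression in Python. This is a minimal sketch of the concept, not the matching logic used by the software; the find_term function and its require_word_breaks parameter are hypothetical names.

```python
import re


def find_term(term, text, require_word_breaks=True):
    """Report whether term occurs in text, optionally only as a whole word."""
    pattern = re.escape(term)
    if require_word_breaks:
        # The term must not touch another letter or digit on either side.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    return re.search(pattern, text) is not None


english = "The City of Pacifica is located just 15 minutes south of San Francisco."
japanese = "私は東京に飛んで"

print(find_term("Pacific", english))                             # False
print(find_term("Pacific", english, require_word_breaks=False))  # True
print(find_term("東京", japanese))                                # False
print(find_term("東京", japanese, require_word_breaks=False))     # True
```

Because Japanese text does not separate words with whitespace, the characters on either side of 東京 count as word characters, so the term is only found when word breaks are not required.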

Turn on or turn off the require word breaks toggle

  1. In the Extract Locations pane, click the Properties tab.
  2. Click the Options tab.
  3. Turn on or turn off the Require word breaks toggle by clicking the toggle.

When the Require word breaks toggle is on, the next time documents are processed, text will only be considered a word if it is bounded by whitespace or punctuation characters. When the Require word breaks toggle is off, the next time documents are processed, any text that matches the text you are looking for will be considered a word.

Symbology

You can customize the symbol that is used to represent the locations found in the input documents when an output map layer is created. Only a single symbol can be specified for map layers in this manner.

  1. Open the Extract Locations pane.
  2. In the Extract Locations pane, click the Properties tab.
  3. Click the Options tab.
  4. Click the point symbol, for example, the solid red circle, under the Symbology heading.

    The Format Point Symbol panel appears in the Extract Locations pane.

  5. Click a point symbol in the gallery, or customize the symbol's properties and apply your changes. Or, click the Back button to cancel your changes and return to the Options tab.

The next time locations are extracted and an output map layer is created, the specified symbol will be used to draw the locations on the map.

Symbolize locations by category or quantity

After extracting locations from a set of documents, you can use custom attributes to change how the output locations are symbolized. For example, you can provide different symbols to represent the keywords found at each location. The next time you extract locations using the same settings, you can append them to the existing map layer. The resulting points will automatically be symbolized in the same manner.

If you later want to use the same Extract Locations template to create a new map layer with the same symbolization, you need to first capture the original map layer's symbolization as a schema-only layer package. The layer package can be used to create a new feature class and accompanying map layer to which you can append locations from a new set of documents.

  1. Open the map containing the map layer whose symbolization you would like to reuse.
  2. Create a schema-only layer package from the existing map layer.
  3. Add the schema-only layer package to the new map to which you want to extract a new set of locations.

    A new feature class is created in the project's default geodatabase using the schema defined in the layer package. A new map layer is created using the layer definition from the layer package.

    Learn more about layers and layer packages

  4. Follow the workflow to extract locations to the existing map layer created in the previous step.

The locations extracted to the map layer are automatically symbolized based on the custom attribute values that were extracted from the documents and text.

Scan files

The Scan files tab allows you to control what documents are scanned or skipped.

Scan specific file types

A file type in this context is the file name extension. For example, if you have a table.txt file, TXT would be the file type. When you provide a folder as input and the folder contains many files, you can limit the files that are scanned by specifying a set of file types to work with. You can either eliminate files that you know are not relevant, or restrict your scan to files that you know are relevant.

  1. In the Extract Locations pane, click the Properties tab.
  2. Click the Scan files tab.
  3. Click the File types heading.
  4. Choose to scan or to skip the specified file types.
    • Scan all files except these types—Specify the file types to skip. This is the default option.
    • Scan only these file types—Specify the file types to scan.
  5. Add extensions to the file types list.
    • Click Add extensions. On the Add Extensions dialog box, type one or more file extensions in the Extensions text box. If you type many file extensions, separate them using spaces only; do not put a comma after the file extension. For example, type txt doc csv. A period can be used before the file extension, if desired. Click OK.
    • Drag files from Windows Explorer onto the file types list.

    The specified file extensions are added to the file types list.

If the computer recognizes a file extension, the icon and type string that are used in Windows Explorer to represent that file type are included in the list. For example, if you provide the file extension .docx, the file extension .DOCX and the icon used to represent these files on your computer appear in the list in the Extension column. The Type column will contain the value Microsoft Word Document.
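
The two file type options behave like an exclusion list and an inclusion list. The following Python sketch illustrates that logic under simple assumptions: the function names and the scan_only_listed parameter are hypothetical, and extensions are compared without regard to case.

```python
from pathlib import Path


def parse_extensions(text):
    """Parse space-separated extensions such as 'txt .doc csv'."""
    # A leading period is optional, so normalize every entry to '.ext'.
    return {"." + token.lstrip(".").lower()
            for token in text.split() if token.strip(".")}


def should_scan(path, extensions, scan_only_listed=False):
    """Apply 'Scan only these file types' or 'Scan all files except these types'."""
    listed = Path(path).suffix.lower() in extensions
    return listed if scan_only_listed else not listed


extensions = parse_extensions("txt .doc csv")
print(sorted(extensions))
print(should_scan("table.txt", extensions))                         # skipped by default
print(should_scan("table.txt", extensions, scan_only_listed=True))  # scanned
```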

Skip specific files and folders

When you are scanning a folder or disk containing many files, it may be helpful to avoid scanning individual files or folders. The scan will complete faster and include fewer false positive locations. For example, folders containing financial reports may contain numbers that resemble spatial coordinates.

When scanning disks, consider excluding folders that contain installed software, operating system files, hardware drivers, and so on. Hidden files and system files, which often don't appear in Windows Explorer, will be skipped by default, but you can uncheck these options if it is appropriate for your scenario.

  1. In the Extract Locations pane, click the Properties tab.
  2. Click the Scan files tab.
  3. Click the Skip types heading.
  4. Uncheck Hidden or System under the File attributes heading, if appropriate.
  5. Add files and folders that should be skipped to the Files and folders list.
    • Click Add files and folders. The Add Files and Folders dialog box appears. Browse to and select the files and folders that should be skipped and click Open.
    • Drag files and folders from Windows Explorer onto the Files and folders list.

    The specified files and folders are added to the list.

The icon used in Windows Explorer to represent the item and its name appear in the list in the Name column. The Path column displays the path to the file or folder.
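
Conceptually, a file is skipped when it is in the list itself or when it is stored anywhere below a folder in the list. The Python sketch below illustrates this with hypothetical paths and a hypothetical is_skipped function; it does not model the Hidden and System file attributes.

```python
from pathlib import PureWindowsPath


def is_skipped(path, skipped_items):
    """Report whether path is a skipped file or lies inside a skipped folder."""
    path = PureWindowsPath(path)
    for item in map(PureWindowsPath, skipped_items):
        # Windows paths compare without regard to case.
        if path == item or item in path.parents:
            return True
    return False


# Hypothetical entries in the Files and folders list.
skipped_items = [r"C:\Data\Finance", r"C:\Data\notes\draft.txt"]

print(is_skipped(r"C:\Data\Finance\2019\q1.xlsx", skipped_items))  # inside a skipped folder
print(is_skipped(r"C:\Data\notes\final.txt", skipped_items))       # not in the list
```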

Some files are not processed

Documents are processed using the same technology that Windows Search uses to examine files on your computer—a plug-in known as an IFilter. The Extract Locations pane and its associated tools do not use Windows Search; they use the IFilter plug-ins that are already available on your computer to examine the input documents and text.

Several IFilters are included with Microsoft Windows operating systems that can process text files, HTML files, some Microsoft Office documents, and so on. The IFilters available are different on different operating systems. Other applications that are installed on your computer may provide additional IFilters that can be used to process the documents they handle. For example, when you install Adobe Acrobat Reader DC or Adobe Acrobat, it may provide an IFilter that can be used to process the contents of PDF files. When files are scanned, a specific IFilter for that file type will be used if one is available; otherwise, the files will be scanned using the standard IFilters and as much information as possible will be extracted.

Because ArcGIS Pro is a 64-bit application, it can only use 64-bit IFilters to process the input documents and text. A 32-bit application typically only provides 32-bit IFilters that can be used for processing its documents; ArcGIS Pro can't use these IFilters.

If you haven't set a specific file type to be skipped, such as PDF files, but are unable to extract locations from files where you know they exist, make sure an appropriate 64-bit IFilter is installed on your computer.

With Windows 10, an IFilter that ArcGIS Pro can use to process PDF files should be available. With other versions of Windows, if you have the 32-bit version of Adobe Reader installed, a 64-bit IFilter might not be available to process PDF documents. Content cannot be extracted from PDF documents using the standard Windows IFilters. You can download a 64-bit PDF IFilter from the Adobe website.

Output

The Output tab allows you to control what content is extracted from the documents and stored in the output feature class.

Document limits

Limits can be placed on the locations and dates that are extracted from the input documents. When you are scanning a set of input documents for the first time, you may come across a file that contains a large set of numbers that resemble but are not spatial coordinates, or where a sequence of numbers looks like a date but is actually a different type of data. By default, limits are placed on how many features and dates are extracted from the input documents. This keeps you from generating millions of points in error, or from storing many meaningless dates in the attribute table. After evaluating the output locations and the dates that are stored in their attributes, you might choose to disable this limitation or change the limit before scanning the documents again.

Sometimes you don't know anything about the documents you are scanning. At other times, you might periodically scan sets of documents that are semistructured, such as reports. Reports often begin with the date on which the report was written, and the location in which it was written; however, the subject of the report concerns events that occurred on a different day in a different location. You can choose to skip a specified number of locations and dates at the beginning of each document so your output feature class captures the content of interest.

You can limit how many features and dates are extracted from the input documents, and which ones. These limits are described below:

  • Feature limits
    • Limit number of features per document—By default, only the first 3,000 locations found in a document are extracted and stored in the output feature class. With this option checked, you can increase or decrease the limit on the number of features extracted from a single document. Uncheck this option to evaluate all candidate spatial coordinates and custom locations in a document and extract all features found. This is enabled by default.
    • Ignore first number of features per document—By default, the first candidate spatial coordinate or custom location found in an input document is evaluated, followed by all other candidate coordinates and custom locations until either the feature limit or the end of the document is reached. With this option checked, you can skip a specified number of features at the beginning of a document, and then extract all subsequent features up to the limit; by default, only the first feature would be skipped, but you can increase this number, if appropriate. Uncheck this option to evaluate all candidate spatial coordinates and custom locations up to the limit. This is not enabled by default.
  • Date limits
    • Limit number of dates per document—By default, only the first 30 dates found in a document are extracted and stored in the output feature class's attribute table. With this option checked, you can increase or decrease the limit on the number of dates extracted from a single document. Uncheck this option to evaluate all candidate dates in a document and extract all dates found. This is enabled by default.
    • Ignore first number of dates per document—By default, the first candidate date found in an input document is evaluated, followed by all other candidate dates until either the date limit or the end of the document is reached. With this option checked, you can skip a specified number of dates at the beginning of a document, and then extract all subsequent dates up to the limit; by default, only the first date would be skipped, but you can increase this number, if appropriate. Uncheck this option to evaluate all candidate dates up to the limit. This is not enabled by default.
  1. In the Extract Locations pane, click the Properties tab.
  2. Click the Output tab.
  3. Click the Document limits heading.
  4. Check or uncheck the options for limiting how many features and dates are extracted, as appropriate.
  5. Click in the enabled Features and Dates text boxes and type the number that represents the maximum number of features or dates that should be extracted.
  6. Check or uncheck the options to skip a given number of features and dates at the beginning of the input document or text, as appropriate.
  7. Click in the enabled Features and Dates text boxes and type the number that represents how many features or dates should be skipped before any additional features or dates present are extracted.
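
The combined effect of the ignore and limit options can be sketched in Python as a slice over the candidates in the order they are found. The function and parameter names are hypothetical, and the sketch assumes the limit counts only the items kept after the ignored ones, which this topic does not state explicitly.

```python
from itertools import islice


def apply_document_limits(candidates, ignore_first=0, limit=3000):
    """Skip the first ignore_first candidates, then keep at most limit.

    Pass limit=None to represent an unchecked limit option.
    """
    stop = None if limit is None else ignore_first + limit
    return list(islice(candidates, ignore_first, stop))


locations = [f"location {n}" for n in range(1, 11)]

# Skip the location in the report header, then keep the next three.
print(apply_document_limits(locations, ignore_first=1, limit=3))
```

The same function applies to dates by passing 30 as the limit, which is the default number of dates per document.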

Pre-Text and Post-Text limits

When a spatial coordinate or a custom location is extracted from the document and stored in the output feature class, several pieces of information are stored in the output feature class's attribute table to help you evaluate those locations later. An excerpt of the document that precedes the location is stored in a Pre-Text field in the feature class's attribute table. An excerpt of the document that follows the location is stored in a Post-Text field in the feature class's attribute table. These attributes help you to establish the context of the location—is it a real location, and if so, what happened there, and is that relevant to your analysis?

The amount of text surrounding a location that is extracted and stored in the feature class is determined by the following settings:

  • Pre-Text—By default, 254 characters of text before the location will be extracted from the document and stored in the Pre-Text field. You can increase or decrease this value, as appropriate.
  • Post-Text—By default, 254 characters of text after the location will be extracted from the document and stored in the Post-Text field. You can increase or decrease this value, as appropriate.
  1. In the Extract Locations pane, click the Properties tab.
  2. Click the Output tab.
  3. Click the Pre-Text and Post-Text limits heading.
  4. Click in the Pre-Text text box and type the number that represents the maximum number of characters preceding a location that will be extracted from the input document.
  5. Click in the Post-Text text box and type the number that represents the maximum number of characters following a location that will be extracted from the input document.
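
To see what the two limits control, consider the following Python sketch, which cuts the excerpts before and after a location out of the document text. The function name and the sample sentence are hypothetical; the default of 254 characters matches the default settings described above.

```python
def surrounding_text(document, start, end, pre_chars=254, post_chars=254):
    """Return the text before and after the location found at document[start:end]."""
    pre_text = document[max(0, start - pre_chars):start]
    post_text = document[end:end + post_chars]
    return pre_text, post_text


document = "The survey team camped at 34.0552N 117.1717W before moving north."
coordinate = "34.0552N 117.1717W"
start = document.index(coordinate)
end = start + len(coordinate)

# Small limits are used here so the truncation is easy to see.
print(surrounding_text(document, start, end, pre_chars=10, post_chars=7))
```

Smaller limits produce a more compact attribute table, while larger limits preserve more of the context needed to judge whether a location is real and relevant.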

Other text field limits

Various pieces of information are recorded in the attribute table of the output feature class that help you evaluate the extracted locations and dates, in addition to the Pre-Text and Post-Text fields. You can tailor the size of these fields to hold more or less information to suit the content in the current collection of documents.

The amount of text stored in the feature class is determined by the following settings:

  • Name—By default, 50 characters of text can be stored in the Name field to represent the file name in which the location was found. You can increase or decrease this value, as appropriate.
  • Extracted Text—By default, 120 characters of text can be stored in the Extracted Text field to represent the spatial coordinate or custom location that was found. You can increase or decrease this value, as appropriate.
  • Extracted Type—By default, 50 characters of text can be stored in the Extracted Type field to represent the type of spatial coordinate or custom location that was found. You can increase or decrease this value, as appropriate.
  • All Dates—By default, 254 characters of text representing the dates found in the document can be stored in the All Dates field. These dates are standardized in YYYY-MM-DD format. You can increase or decrease this value, as appropriate.
  • Extracted Date Text—By default, 254 characters of text representing the dates found in the document can be stored in the Extracted Date Text field. The text from the original document that was recognized as a date is extracted and recorded. You can increase or decrease this value, as appropriate.
  • Filename—By default, 254 characters of text can be stored in the Filename field to represent the full path of the file in which the location was found. You can increase or decrease this value, as appropriate.
  • File Type—By default, 10 characters of text can be stored in the File Type field to represent the type of file that was processed. You can increase or decrease this value, as appropriate.

Learn more about the output feature class's fields

  1. In the Extract Locations pane, click the Properties tab.
  2. Click the Output tab.
  3. Click the Other text field limits heading.
  4. Click in the field text boxes and type the number that represents the maximum number of characters that can be recorded in each field.

Date range

Some numbers can resemble both spatial coordinates and dates. By default, dates are only extracted from an input document if they match one of the selected date formats and if the resulting date falls within a specified date range. This reduces the chance of extracting a date that is a false positive. The default date range is January 1, 1985, to December 31, 2030. Even if a date is found within an input document, if it falls outside of the specified date range, it will not be extracted and stored in the output feature class's attribute table.

Uncheck Limit extracted dates to this range to extract any possible date from the input documents. This will increase the time it takes to evaluate the contents of a document, as all numbers have to be evaluated against the selected date formats.

If you are only interested in events that took place during a given span of time, check the Limit extracted dates to this range option and adjust the range of dates to more closely match the time period when those events occurred.

  • From—By default, January 1, 1985. Click the drop-down menu and click the beginning date of the valid date range on the calendar control.
  • To—By default, December 31, 2030. Click the drop-down menu and click the ending date of the valid date range on the calendar control.

The calendar control provides access to one month at a time. Use the arrows in the upper corners to access an earlier month or a later month. Click the month and year at the top of the calendar to see a list of months. Click the year at the top of the list of months to access a list of years. Use the arrows in the upper corners to access an earlier year or a later year.

If you are working with historical documents, additional settings on the Year Ranges tab in the Customize dialog box affect whether text is recognized as a date and how the Limit extracted dates to this range setting works. The Year Ranges tab settings determine whether two- and four-digit numbers are interpreted as years. This assessment occurs before determining if the text adjoining the year is a date.

By default, four-digit numbers between 1900 and 2099 are recognized as a year. As long as the years for the Limit extracted dates to this range setting fall within this range, it will work effectively to restrict any dates with a four-digit year that are found. If you are working with historical documents that have become available digitally, you must adjust both the Limit extracted dates to this range setting on the Output tab and the four-digit year range on the Year Ranges tab in the Customize dialog box to account for the time period in which the documents were written.

Similarly, when analyzing two-digit numbers to determine if they represent a year, a 100-year window is used that begins with the year 1970, by default. As long as the years for the Limit extracted dates to this range setting fall within this range, it will work effectively to restrict any dates with a two-digit year that are found. However, if you are working with historical documents or reports concerning projections for the future, you may need to adjust the 100-year window on the Year Ranges tab in the Customize dialog box as well as the Limit extracted dates to this range setting on the Output tab to account for the time period of the documents.
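
The interaction between the year ranges and the date range can be sketched in Python as two separate checks. The function and parameter names are hypothetical, the defaults mirror the values described above, and the order of the checks follows the description in this section rather than any knowledge of the internal implementation.

```python
from datetime import date


def resolve_year(number_text, four_digit_range=(1900, 2099), window_start=1970):
    """Return the year a two- or four-digit number represents, or None."""
    if not number_text.isdigit():
        return None
    value = int(number_text)
    if len(number_text) == 4:
        low, high = four_digit_range
        return value if low <= value <= high else None
    if len(number_text) == 2:
        # Place the year inside the 100-year window starting at window_start.
        year = window_start - window_start % 100 + value
        return year if year >= window_start else year + 100
    return None


def in_date_range(candidate, start=date(1985, 1, 1), end=date(2030, 12, 31),
                  limit_to_range=True):
    """Apply the Limit extracted dates to this range setting."""
    return (not limit_to_range) or start <= candidate <= end


# With the defaults, 1850 is never recognized as a year, so widening only
# the date range would not be enough to extract dates from the 1850s.
print(resolve_year("1850"))
print(resolve_year("1850", four_digit_range=(1800, 2099)))
print(resolve_year("69"), resolve_year("70"))
```

This is why both settings must be adjusted for historical documents: a number that is not recognized as a year is rejected before the date range is ever consulted.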

Learn more about customizing how text is recognized as a date

  1. In the Extract Locations pane, click the Properties tab.
  2. Click the Output tab.
  3. Click the Date range heading.
  4. Check or uncheck the option Limit extracted dates to this range, as appropriate.
  5. If the option is enabled, click the From drop-down arrow, and browse to and select the beginning date for the range of dates to extract.
  6. If the option is enabled, click the To drop-down arrow, and browse to and select the ending date for the range of dates to extract.
  7. Specify any customizations that should be used when evaluating text to determine if it represents a date.

Standardized coordinate

When a spatial coordinate or a custom location is extracted from the document and stored in the output feature class, several pieces of information are stored in the output feature class's attribute table to help you evaluate those locations later. The original text of the document that represents the location is stored in the attribute table in the Extracted Text field, and the type of location that was found is recorded in the Extracted Type field.

Additionally, a consistent representation of all locations that were found is stored in the standardized coordinate field, whose alias is Stand. Coord. The x,y coordinates associated with the point feature are recorded in the format specified by the Standardized coordinate option.

Choose the coordinate format that meets your requirements from the following options. For example, a coordinate found in an input document such as 117.1717550°W 34.0552456°N will appear in the standardized coordinate field as specified below when each of the coordinate formats is selected.

  • DD - Decimal Degrees: 34.055246N 117.171755W (selected by default)
  • DM - Decimal Minutes: 34 03.3147N 117 10.3053W
  • DMS - Degrees Minutes Seconds: 34 03 18.88N 117 10 18.32W
  • UTM - Universal Transverse Mercator: 11S 484149 3768294
  • MGRS - Military Grid Reference System: 11SMT8414968295
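
The first three formats are different notations for the same latitude and longitude, as the following Python sketch shows. The function names are hypothetical, and the number of decimal places and the zero padding are inferred from the examples in the list above. UTM and MGRS are omitted because they require a map projection rather than simple arithmetic.

```python
def _dd(value):
    return f"{abs(value):.6f}"


def _dm(value):
    # Work in ten-thousandths of a minute to avoid rounding up to 60 minutes.
    total = round(abs(value) * 600_000)
    degrees, remainder = divmod(total, 600_000)
    return f"{degrees} {remainder / 10_000:07.4f}"


def _dms(value):
    # Work in hundredths of a second to avoid rounding up to 60 seconds.
    total = round(abs(value) * 360_000)
    degrees, remainder = divmod(total, 360_000)
    minutes, centiseconds = divmod(remainder, 6_000)
    return f"{degrees} {minutes:02d} {centiseconds / 100:05.2f}"


def standardize(lat, lon, fmt="DD"):
    """Format a latitude and longitude in decimal degrees as DD, DM, or DMS."""
    formatter = {"DD": _dd, "DM": _dm, "DMS": _dms}[fmt]
    lat_hemisphere = "N" if lat >= 0 else "S"
    lon_hemisphere = "E" if lon >= 0 else "W"
    return f"{formatter(lat)}{lat_hemisphere} {formatter(lon)}{lon_hemisphere}"


# The coordinate 117.1717550°W 34.0552456°N from the example above.
for fmt in ("DD", "DM", "DMS"):
    print(fmt, standardize(34.0552456, -117.1717550, fmt))
```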

  1. In the Extract Locations pane, click the Properties tab.
  2. Click the Output tab.
  3. Click the Standardized coordinate heading.
  4. Click the drop-down list and click the coordinate format in which the extracted locations will be recorded.
