




Contact Us

California Environmental Health Tracking Program

850 Marina Bay Pkwy, P-3
Richmond, CA 94804

(510) 620-3038
Last Edited: 12/8/14

Frequently Asked Questions about Geocoding

Geocoding is the process of finding associated geographic identifiers or coordinates (often expressed as longitude and latitude, or x and y) from text data, such as street addresses. Geographic information system (GIS) software matches each record in an attribute database with the geographic reference files. For example, address geocoding takes the attributes of a street address (such as number, street, city, state, zip code), compares them to a database of addresses in a GIS, and assigns coordinates based on the best match. Geocoding is also called georeferencing.

Below are some Frequently Asked Questions about geocoding.


What geographic terms should I know?

  • Attribute - Data containing information on any layer of interest, which can either be contained within a specific layer, or linked to a layer by a unique or common field. Attribute data can include morbidity and mortality information as well as social, demographic, economic, and environmental data.
  • Centroid - The central location within a specified geographic area.
  • Contiguous - A descriptive term for geographic areas adjacent to one another, sharing either a common boundary or a common point.
  • Coordinate System - References used to display the locations of points in space in either two or three dimensions. Points, lines, and polygons are generally expressed on a two-dimensional map using x and y coordinates.
  • Lines - Features such as streets.
  • Points - Exact locations on earth, with specific x and y coordinates (e.g., longitude and latitude).
  • Polygons - Specific two dimensional areas on a map enclosed by boundary lines, such as county boundaries or census tracts.
  • Projection - The mathematical model that transforms three-dimensional features on the earth's curved surface to a two-dimensional map.

These definitions were adapted from "Introduction to Geographic Information Systems in Public Health" by Alan Melnick, Aspen; 2002.

 


  


Why is geocoding important?

Geocoding is essential for many reasons. Georeferenced data can be useful for visualization, such as mapping locations where events of interest occur. Geocoding is also often the first step in linking environmental and health data for a variety of public health purposes, such as research and surveillance. Georeferencing personal health outcome data and demographics offers an ability to add individual level information to the neighborhood level data (e.g., census demographics) and environmental hazards (e.g., air pollution or pesticide use). Additionally, Healthy People 2010 objective 23-3 aimed to "increase the proportion of major national health data systems that use geocoding to promote nationwide use of geographic information systems."

 


 


What should I know before geocoding?

Geocoding is both a science and an art. Often, data users are not well versed in the complexities of software and data essential to produce the best quality geocodes for their business needs.

It is important to understand a number of background concepts well before undertaking a geocoding project. For example, how familiar are you with the data you will be geocoding? Do you know, in general, how geocoding happens? Are you aware of possible errors or limitations of geocoded data? What will you do with the data once you have assigned latitude and longitude to your records? Are there specific confidentiality issues you need to be aware of with your data?

Below are some starting points to consider. First and foremost, you should understand your data and check it for accuracy and completeness prior to geocoding.

  1. Know the big picture
    • What is the purpose of geocoding these particular data?
    • What is the appropriate geographic level for these data: address point, census block, block group, tract, zip code, county, other?
    • Can or should these data be aggregated to other geographies?
    • What coordinate system and projection is or should be used for your data?
    • Are these data going to be linked to other data (exposure, demographics)?
    • Can other data be assembled for the same geography and what other layers will be used?
    • What is the coordinate system and projection of any other data you will be using?
    • Is there a possibility of misrepresentation of your geocoded data (clustering, political boundaries, inappropriate overlays)?
    • What kind of software are you going to use to map your data after geocoding?
  2. Know your data
    • What kind of addresses or location information do you have (street address, road intersection, building name, direction from a landmark, urban, rural)?
    • Who collected and reported the addresses or location information (patient, payer, provider, abstracted record, administrative or billing data, real-time address collection, data entry after the fact)?
    • When was the address or location information collected (timeliness of address collection in relationship to the study of interest)?
    • Where is the address (mailing vs. residence, work, PO Box)?
    • What are the data issues (missing or blank addresses, missing parts of address, vague address, PO Box, streets in new developments)?
    • To what level will you be geocoding (address point, census block, block group, tract, zip code, county)?
    • How will you deal with PO Boxes and Rural Route addresses?
  3. Know the software (or tool) you will be using to geocode
    • Does it standardize data? (For more about standardization, see “Why is completeness important?”)
    • What kind of reference/base layer/database does it use (street centerline, parcel, etc.)?
    • What is the quality of reference database(s) it uses?
    • How many reference databases does it use?
    • What is the currency of reference database(s) (older databases may be missing new streets, for example)?
    • How does it handle poor data?
    • What are its match and accuracy rates?
    • How fast is it?
    • How secure is it?
  4. Clean your data
    • Have address elements in appropriate fields (number and street, city, state, zip code)
    • Make sure street names are spelled out (for example, “Martin Luther King Blvd” not “MLK Blvd”)
    • Standardize the information (see “Why is completeness important?” for more). Note: address standardization may be taken care of by the standardization software that is part of the geocoding tool you are using.
  5. Know pitfalls
    • Misalignment (depending on the reference layer used to geocode your data, other layers may not align properly)
    • Temporal mismatch (layers of different vintages may be mismatched in time; for example, using 1990 census tracts and geocoded 2007 birth cohort)
    • Geocoding accuracy (does the address you geocoded correspond with its real position on the earth’s surface?)
    • Zip codes (zip codes are U.S. Postal Service delivery routes and do not have real, physical boundaries; for more see “What should I know about using zip codes in geocoding?”)
    • PO Boxes (post office box locations do not represent residences; in some rural areas, most geocoded points may “fall on top of each other,” corresponding to the one post office box location in that area)
  6. Know confidentiality issues
    • Understand data privacy (confidential data include any individually identifiable information and may also be subject to HIPAA, the Health Insurance Portability and Accountability Act)
    • Protect identity of your study subjects (identity can be inferred from a geocoded address and an address can be inferred from a given latitude/longitude)
    • Decide how to deal with small numbers when mapping your data (apply masks, do spatial adjustments, aggregate, etc.; for more see “What is confidential data? How do I deal with confidential data?”)
    • Practice information security by stripping your dataset of all extraneous confidential identifiers prior to geocoding (they can be joined to geocoded addresses later) 

Once the data is geocoded, you can use software like ArcGIS, Google Earth, Maptitude, or a variety of open source products or websites to display and analyze your data. You can overlay geocoded points onto other data sources, such as geographically-referenced environmental data.

For more information on how to use GIS software for public health applications, see Recommended Books (PDF, 77KB).

 


 


How does geocoding work?

There are several methods of converting a location or an address to latitude/longitude (i.e. a place on a digital map). You can view the geocoding illustration (115KB, requires MS PowerPoint 2003 or higher) to get a general idea of geocoding mechanics.

An address could be geocoded to an area, such as a county, congressional district, census tract, or a zip code. Usually, the point's latitude and longitude are assigned in the center of the polygon outlining the boundaries of this area (also called polygon centroid). This is the least desirable level of geocoding because it does not show the exact place where the case patient or event is located, but rather maps all points from that area to a single spot in the center of a polygon.
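As a rough illustration of where an area-level geocode lands, the sketch below computes the geometric centroid of a simple polygon with the standard shoelace formula. It assumes planar (projected) x and y coordinates and a polygon that does not cross itself; the function name `polygon_centroid` and the sample rectangle are made up for this example and are not taken from any particular GIS package.

```python
def polygon_centroid(vertices):
    """Return the geometric centroid (x, y) of a simple polygon.

    vertices is a list of (x, y) tuples in order around the boundary.
    Assumes planar coordinates, not raw longitude/latitude.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("polygon has zero area")
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area.
    return (cx / (3 * twice_area), cy / (3 * twice_area))


# Hypothetical rectangular "tract": every address inside it would be
# mapped to this single point when geocoding to the area level.
tract = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
centroid = polygon_centroid(tract)
```

Every record geocoded to this tract would receive the same coordinate, which is why area-level geocodes hide where within the area an event actually occurred.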

An address could be geocoded to a street centerline. A street centerline database resembles a street map one would see in a Thomas Brothers atlas or on Google Maps. This method interpolates the location of the point based on the address ranges provided with the street database and assigns latitude and longitude to an estimated location along a street segment. For example, 175 Main Street would be placed about three-quarters of the way along the 101-199 block of Main Street in the given city and zip code. This is a more accurate method than geocoding to an area, but it can still contain a number of errors (such as incorrect address ranges in the reference layer).
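The 175 Main Street example can be sketched as a simple linear interpolation. This is a minimal sketch, not how any specific geocoding tool works: it assumes a straight street segment, planar coordinates, and house numbers spread evenly along the block, and it ignores the odd/even side of the street. The function name and the segment endpoints are hypothetical.

```python
def interpolate_along_segment(house_number, range_low, range_high, start, end):
    """Estimate where a house number falls on a straight street segment.

    start and end are the (x, y) endpoints of the segment that correspond
    to range_low and range_high in the reference street database.
    Returns (fraction_along_segment, (x, y)).
    """
    if not range_low <= house_number <= range_high:
        raise ValueError("house number is outside the segment's address range")
    if range_high == range_low:
        fraction = 0.0
    else:
        fraction = (house_number - range_low) / (range_high - range_low)
    x = start[0] + fraction * (end[0] - start[0])
    y = start[1] + fraction * (end[1] - start[1])
    return fraction, (x, y)


# 175 Main Street on the 101-199 block; the endpoints are made-up values.
fraction, point = interpolate_along_segment(
    175, 101, 199, start=(0.0, 0.0), end=(98.0, 0.0)
)
```

Here the fraction is (175 - 101) / (199 - 101), roughly 0.76, which matches the "three-quarters of the way" placement described above. If the address range in the reference layer is wrong, the interpolated point will be wrong too.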

An address could be geocoded to a parcel layer. Land parcels (also known as cadastres) are the official property lines, as one would get from a tax assessor's office. A parcel layer could contain property polygons or points representing centers of those polygons. The highest quality parcel layers may contain building footprints. When geocoding to a parcel layer, the coordinate is usually assigned to the center of the parcel or the center of the building. This is the most accurate method of geocoding.

Depending on what kind of data you input into a geocoding tool, you may get a different result. For example, if you submit a full address (such as 123 Main Street, My City, My zip code) to a geocoding tool, you are most likely to get a match. If you only have a street intersection (such as Main Street at Broadway), the match most likely will be at the cross of the two streets; but the point may not match at all, depending on how the geocoding tool handles intersections. See table below for scenarios of how different locations may match.

| ID | Address | City | State | Zip | Geocoding outcome |
|----|---------|------|-------|-----|-------------------|
| A1 | 123 Main Street | My City | CA | 12345 | Because this is a complete address, a match at any geocoding level is possible if that data is available in the reference database (e.g., parcel, zip code, county, census tract). |
| A2 | Main Street & Broadway | My City | CA | | Because this location is an intersection, it will either geocode to this intersection or not at all. You may be able to geocode to a city centroid if that data is available in the reference database. |
| A3 | | | | 12345 | Providing a zip code only, you may be able to geocode to the zip code centroid if that data is available in the reference database. Note: CEHTP geocoding tools currently do not offer zip code geocoding, so this record would not match. |
| A4 | | My City | CA | | Providing a city and state only, you may be able to geocode to the city centroid if that data is available in the reference database. Otherwise, this record would not match. |

Note: When an address is geocoded to a street centerline, the coordinate is placed on the street line (i.e. the center of the street). In reality, buildings are not placed in the middle of the street. Therefore, you should specify an offset, which is the distance at which to put the geocoded point away from the street.
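One way to picture the offset is as a move perpendicular to the street segment. The sketch below is an illustration under stated assumptions: planar coordinates in the same units as the offset distance, a straight segment, and a hypothetical `side` argument for choosing which side of the street to move toward. Real geocoding tools expose their own offset settings.

```python
import math


def offset_from_centerline(point, start, end, distance, side="left"):
    """Move a centerline point perpendicular to the street segment.

    start and end define the direction of travel along the segment;
    side is "left" or "right" relative to that direction.
    """
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("segment endpoints are identical")
    # Unit vector pointing to the left of the direction of travel.
    nx, ny = -dy / length, dx / length
    if side == "right":
        nx, ny = -nx, -ny
    elif side != "left":
        raise ValueError("side must be 'left' or 'right'")
    return (point[0] + distance * nx, point[1] + distance * ny)


# Made-up values: a point 74 units along an east-west street, moved
# 10 units off the centerline on either side.
left_point = offset_from_centerline((74.0, 0.0), (0.0, 0.0), (98.0, 0.0), 10.0)
right_point = offset_from_centerline(
    (74.0, 0.0), (0.0, 0.0), (98.0, 0.0), 10.0, side="right"
)
```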

 


 


What kinds of geocoding errors are there?

For a discussion on geocoding errors, we recommend three books: Geocoding Health Data, GIS in Public Health, and Introduction to Geographic Information Systems in Public Health.  Information on these books can be found in Recommended Books (PDF, 77KB).

 


 


What is confidential data? How do I deal with confidential data?

Confidential data include any individually identifiable information. It could be a name, address, date of birth, diagnosis, and other information that may be used to identify an individual. When an address is geocoded, a latitude and longitude (X and Y coordinates), in some cases, may also be considered confidential because it may be possible to figure out an exact address from the coordinates (called reverse geocoding).

When geocoding confidential data, you need to use tools that will not compromise or release this information. Understanding how a geocoding tool handles security is important. In some cases, you may also need to manipulate the resulting data so that the real patient locations are not shown on the resulting map. There are many techniques to do that:

  • Shifting all points a fixed distance and direction from their original location
  • Expanding or contracting all points by a scaling factor
  • Rotating all points a fixed angle
  • Randomly moving all points around within a defined buffer
  • Aggregating points to a higher level of geography, such as census tracts or counties
  • Adding fictitious points to the mix

Note that any of these techniques will affect any spatial analysis, if you need to analyze the point-level data. For more discussion on handling confidential data when mapping, see Recommended Books (PDF, 77KB).
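Two of the masking techniques listed above, a fixed shift and a random move within a buffer, can be sketched as follows. This is only an illustration: it assumes planar coordinates in the same units as the shift and buffer radius, the function names are made up, and it does not by itself guarantee that locations cannot be re-identified.

```python
import math
import random


def shift_points(points, dx, dy):
    """Shift every point by the same fixed distance and direction."""
    return [(x + dx, y + dy) for x, y in points]


def jitter_points(points, radius, rng):
    """Move each point to a random location within `radius` of its origin.

    rng is a random.Random instance. Taking the square root of the random
    draw spreads the moved points evenly over the circular buffer.
    """
    moved = []
    for x, y in points:
        r = radius * math.sqrt(rng.random())
        theta = 2 * math.pi * rng.random()
        moved.append((x + r * math.cos(theta), y + r * math.sin(theta)))
    return moved


# Made-up planar coordinates for illustration only.
original = [(100.0, 200.0), (150.0, 250.0), (300.0, 120.0)]
shifted = shift_points(original, dx=25.0, dy=-10.0)
jittered = jitter_points(original, radius=50.0, rng=random.Random(0))
```

The shift parameters, or the seed used for the random moves, would need to be protected like any other confidential information, since knowing them makes the masking easier to reverse.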

You can practice safe information security by stripping your dataset of all extraneous confidential identifiers (not the address elements, of course) prior to using the geocoding service. You can later join (or merge) geocoded data back to your original dataset by using the retained unique ID field.
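The strip-then-rejoin workflow might look like the sketch below. The field names, the sample records, and the coordinates standing in for a geocoding service's output are all hypothetical placeholders.

```python
# Fields sent to the geocoding service: the unique ID plus address elements.
ADDRESS_FIELDS = ("id", "address", "city", "state", "zip")


def strip_identifiers(records, keep=ADDRESS_FIELDS):
    """Return copies of the records containing only the fields in `keep`."""
    return [{field: rec[field] for field in keep} for rec in records]


def rejoin(records, geocoded):
    """Attach lat/lon from geocoded rows to the original records by ID."""
    by_id = {row["id"]: row for row in geocoded}
    merged = []
    for rec in records:
        match = by_id.get(rec["id"], {})
        merged.append({**rec, "lat": match.get("lat"), "lon": match.get("lon")})
    return merged


# Hypothetical original dataset with extraneous confidential identifiers.
records = [
    {"id": 1, "name": "Person A", "dob": "1980-01-01",
     "address": "123 Main Street", "city": "My City", "state": "CA", "zip": "12345"},
    {"id": 2, "name": "Person B", "dob": "1975-06-30",
     "address": "PO Box 9", "city": "My City", "state": "CA", "zip": "12345"},
]

to_geocode = strip_identifiers(records)

# Stand-in for what a geocoding service might return (made-up coordinates);
# record 2 did not match, so it is absent.
geocoded = [{"id": 1, "lat": 37.0, "lon": -122.0}]

merged = rejoin(records, geocoded)
```

Records that did not geocode keep empty coordinates after the join, so they remain visible when you assess completeness.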

 


 


What should I know about using zip codes in geocoding?

While a lot of health and environmental data is geocoded to and presented at a zip code level, zip codes have a large number of limitations.

First and foremost, zip codes are not areas and their exact spatial boundaries are not generally known. Most people think of zip codes as sub-divisions of their cities or counties, but in fact zip codes are simply mail delivery routes. When you see zip code boundaries on a map, those are merely approximations, usually based on a set of addresses where mail is being delivered. Data vendors have various techniques to estimate zip code boundaries.

According to the USPS, “The ZIP Code system was created and designed to provide an efficient postal distribution and delivery network. ZIP Code assignments are, therefore, closely linked to factors such as mail volume, delivery area size, geographic location, and topography, but not necessarily to municipal or perceived community boundaries.”

The first digit of a zip code divides the US into large regions. Digits two and three represent subdivisions within states. Fourth and fifth zip code digits identify small post offices or local delivery areas. Zip+4 contains a 4-digit add-on code, which identifies a geographic segment within the 5-digit delivery area to aid efficient mail sorting and delivery.
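The digit structure described above can be made concrete with a small parsing sketch. The function name and the dictionary keys are invented for this example, and the +4 suffix in the sample value is made up.

```python
def split_zip(zip_code):
    """Split a 5-digit zip code or Zip+4 into the parts described above."""
    five, _, plus4 = zip_code.strip().partition("-")
    if len(five) != 5 or not five.isdigit():
        raise ValueError("expected a 5-digit zip code")
    if plus4 and (len(plus4) != 4 or not plus4.isdigit()):
        raise ValueError("expected a 4-digit add-on code")
    return {
        "region": five[0],            # first digit: large region of the US
        "subdivision": five[1:3],     # digits two and three
        "delivery_area": five[3:5],   # digits four and five
        "plus4": plus4 or None,       # optional 4-digit add-on
    }


parts = split_zip("94804-1234")  # the "-1234" add-on is a made-up value
```

Keeping zip codes as text rather than numbers avoids losing leading zeros.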

A single large building with high mail volume may have its own zip code (sometimes, even a single floor in a building may have a unique zip code). College campuses often have a zip code different from the city in which they are located. Many PO Box offices have their own zip codes. Some zip codes do not have contiguous areas, which means the same zip code may represent addresses several miles apart, with other zip codes in-between.

Second, zip codes change very frequently, according to population changes, mail volume changes, or other USPS operational needs. Some data vendors update their zip code files quarterly or semi-annually. Therefore, a zip code boundary file you currently have might be outdated and result in temporal mismatch with your health or environmental data.

Third, there is no real correlation between zip codes and Census geography. Thus, estimating zip code populations and calculating health outcome figures requires some statistical methodology.

That being said, the Census Bureau has created a statistical area called the ZIP Code Tabulation Area (ZCTA) for Census 2000. According to the Bureau, “ZCTAs are generalized area representations of U.S. Postal Service (USPS) ZIP Code service areas. Simply put, each one is built by aggregating the Census 2000 blocks, whose addresses use a given ZIP Code, into a ZCTA which gets that ZIP Code assigned as its ZCTA code. They represent the majority USPS five-digit ZIP Code found in a given area. For those areas where it is difficult to determine the prevailing five-digit ZIP Code, the higher-level three-digit ZIP Code is used for the ZCTA code.” It is important to know that ZCTAs are representative of the 2000 Census only and do not include all zip codes in existence at that time.

When geocoding to a zip code level using commercially available zip code boundary approximations, the geocoded coordinate is usually assigned to the geometric center of the zip code polygon (centroid) or a delivery-weighted center (i.e. population-weighted centroid) if such information is available in the database.

In general, you should exercise caution when geocoding to zip code level or interpreting maps showing health or environmental data mapped to zip codes. 

 


 


How do I increase completeness and accuracy of my geocoded data?

Here are some techniques for increasing the quality of your geocoding output:

  • Use geocoding tools with the best possible reference datasets.
  • Make sure the addresses in your database have complete and accurate components (number, street, direction, city, zip code).
  • Make sure there is little or no variability in street spelling (geocoding tool address standardization should take care of this, for the most part).
  • Decrease the amount of data entry errors.
  • Limit number of PO Box addresses.
  • If possible, do not geocode (or limit geocoding) to the zip code level.
  • Do quality control on your final geocoded data by examining a sample and evaluating whether the geocoded points fell where they were supposed to. You may do so by examining administrative boundaries, aerial imagery, or using geospatial tools.

 


 


What is a coordinate system? 

A coordinate system is a set of rules that defines the position of a point, line, or polygon in space (i.e. on earth’s curved surface). Because the earth is three-dimensional and a map is two-dimensional, one needs to represent locations (x and y coordinates or longitude and latitude) on the flat map accurately, using mathematical formulas pre-programmed in GIS software.

Using an appropriate coordinate system is important as it may affect the results of your spatial analysis. For more information about coordinate systems and projections, see

http://www.gis.com/content/gis-glossaries or
http://support.esri.com/index.cfm?fa=knowledgebase.gisDictionary.gateway

 


 


Why is completeness important? 

During geocoding, addresses in your database are compared to information in the reference database(s). If a match is found, an address is geocoded.

Completeness, also known as match rate, is the proportion of addresses that are successfully geocoded. A geocoding service uses standards and algorithms to decide whether addresses match. Incorrect, incomplete, or ambiguous address information will reduce the match rate.

If many records in a database can’t be geocoded for some reason, it will have an effect on the end result of the study. For example, if people living at the addresses that did not match are somehow different from those whose addresses geocoded, the results will be skewed (biased).
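A match rate, overall and by subgroup, can be computed along the lines of the sketch below. The record layout (`matched` and `area` fields) and the sample values are hypothetical; comparing rates across groups is one simple way to look for the kind of bias described above.

```python
from collections import defaultdict


def match_rate(results):
    """Proportion of records that were successfully geocoded."""
    if not results:
        raise ValueError("no records to evaluate")
    return sum(1 for rec in results if rec["matched"]) / len(results)


def match_rate_by_group(results, key):
    """Match rate computed separately for each value of rec[key]."""
    groups = defaultdict(list)
    for rec in results:
        groups[rec[key]].append(rec)
    return {group: match_rate(rows) for group, rows in groups.items()}


# Made-up geocoding results for illustration.
results = [
    {"id": 1, "area": "urban", "matched": True},
    {"id": 2, "area": "urban", "matched": True},
    {"id": 3, "area": "urban", "matched": True},
    {"id": 4, "area": "rural", "matched": True},
    {"id": 5, "area": "rural", "matched": False},
]

overall = match_rate(results)
by_area = match_rate_by_group(results, "area")
```

In this made-up sample the overall rate is 80 percent, but rural records match only half the time, so any analysis of the matched records would under-represent rural addresses.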

 


 


Why is accuracy important?

First, we have to note that completeness of geocoding, as discussed above, does not necessarily mean accuracy. For example, you may have two addresses without zip codes: 123 Main Street, Santa Clara and 123 Main Street, Santa Clarita. Based on its algorithms, a geocoding tool may decide that both are one and the same and give you two geocodes in Santa Clara and none in Santa Clarita. Both matched, but one of them is inaccurate.

Positional accuracy is how close a geocoded point is to the “true” location of that address on earth’s surface. There are other accuracy considerations, such as assumptions used by geocoding algorithms and the probability that the geocoded point is the one that you wanted. Geocoding errors may be more significant in rural areas because of large distances between homes and because homes may be set far from the road, which is particularly troublesome for addresses geocoded to street centerline reference databases. Addresses matched to parcel centroid reference databases tend to be more accurate, especially in urban areas. However, depending on the objectives of a geocoding project, especially if it is desired to place a point on top of a residential structure, a rural geocode on a property centroid can have poor positional accuracy because a large property might only have a small residential structure on one end of the property. For more, see the geocoding illustration (115KB, requires MS PowerPoint 2003 or higher).

Geocoding accuracy may matter even more than completeness. For instance, studies on health outcomes related to toxic site emissions require precise distance between a given address and a specified location of a facility.

As another example, traffic-related pollutants disperse within the first 500-1000 feet from the street. A positional error of a geocoded address sometimes can be as large as 300 meters (almost 1000 feet), which means if you’re studying the relationship between asthma symptoms and traffic exhaust, you may not be able to calculate accurate epidemiologic estimates.
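To see why positional error matters for distance-based studies, the sketch below computes a great-circle distance with the haversine formula and converts meters to feet. It assumes a spherical earth with a mean radius of 6,371 km, which is an approximation; the function names are chosen for this example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean earth radius; a spherical approximation
METERS_PER_FOOT = 0.3048


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def meters_to_feet(meters):
    """Convert meters to feet."""
    return meters / METERS_PER_FOOT


# A 300-meter positional error expressed in feet (about 984 feet).
error_ft = meters_to_feet(300.0)

# One degree of longitude along the equator, as a sanity check.
one_degree_m = haversine_m(0.0, 0.0, 0.0, 1.0)
```

A 300-meter error is about 984 feet, which is as wide as the 500-1000 foot zone in which traffic-related pollutants disperse, so a geocode with that error could easily place a residence inside or outside the exposure zone incorrectly.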

You can increase accuracy of geocoded data by supplying the best possible data to a geocoding tool (see What should I know before geocoding? and How do I increase completeness and accuracy of my geocoded data?).

You can also increase accuracy by using software that geocodes against higher accuracy reference layers or multiple reference layers. For example, geocoding to a parcel layer is more accurate than geocoding to a street centerline (for more, see “How does geocoding work?”).

 


 


Why is security important?

Individually identifiable information is protected by law and must be kept confidential. This is particularly important for health registry and outcome databases. These data are typically protected by the Health Insurance Portability and Accountability Act (HIPAA) or other federal, state, or local laws. Therefore, you must ensure that datasets that include patient-level addresses are processed securely by any geocoding service that you use.

 
