Address parsing with recurrent neural networks


Introduction and background

Street addresses are a fundamental component of our civic infrastructure, identifying houses, businesses and other points of interest. Moreover, they form the economic backbone of postal services and e-commerce, whose operation depends on the ability to maintain and process addresses at scale.

Address parsing, or address tokenization, is a natural language task that decomposes an address string into location-specific components (or tokens) that pinpoint a location. These tokens typically comprise a building number, street name and type, postal code, city or town, and so on. Parsing is an essential step before geocoding or verifying unstructured address data. Unfortunately, building a parser with extensive address coverage is a delicate matter, even when restricted to a single country or language. For example, there can be:

All of these can lead to non-unique ways of interpreting an address. Common address parsing methods rely on regular expressions or other formal grammars. Such parsers are brittle when faced with even one of the variations above. In recent years, data-driven (i.e. statistical or neural) methods for natural language tasks have emerged and become ubiquitous. Such methods have significant potential for implementing large-scale parsers with broad address coverage. A noteworthy example is libpostal, an international-scale address parser that uses a large statistical model covering over 200 countries!

Project overview

Motivated by neural methods, I showcase a character-based Canadian address parser mockup, built with machine learning. The parser uses a recurrent neural network, dubbed CCAPNet, that classifies each character of an input address string to a token. During inference, a separate routine synthesizes the character classifications into a decomposition of the address. On the training side, real address data is often imbalanced due to the structured distribution of civic addresses. To tackle this issue, I train the network on addresses randomly generated with a context-free grammar. To prevent overfitting, training is regularized by augmenting the data with typos.

The entirety of this project is implemented in Python, with the code available on GitHub. PyTorch is used for the model definition and training.

Note that my mockup draws inspiration from Jason Rigby's AddressNet, a recurrent neural network for parsing Australian addresses. Specifically, I adopt its character-level classification and typo-augmentation techniques. Beyond this, the CCAPNet implementation is written independently of Rigby's work.

Data methodology

This section summarizes the data representation and data assumptions we use for the parser.

Text normalization. Input strings are normalized to contain only alphanumeric characters or a space, i.e. 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 '. Two consecutive spaces cannot occur, nor can a space appear as the first or last character of the output. Lastly, non-ASCII characters (e.g. letters with grave accents) are converted to their ASCII "equivalent" via unidecode. Address normalization (standardizing the content of the address itself) is not applied before parsing, since this is not our focus. However, CCAPNet is trained on data that implicitly expects a normalized address. In particular, numbers have no ordinal indicators and are written with digits. Both abbreviations and full forms of street types, directions, and provinces are included.
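As an illustration, below is a minimal sketch of this text normalization step. It is not the repository's exact implementation; the normalize_text name and the choice to replace punctuation with spaces before collapsing are assumptions.

```python
import re

from unidecode import unidecode  # third-party: pip install unidecode

ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")

def normalize_text(raw: str) -> str:
    """Map a raw address string to the character set CCAPNet expects."""
    text = unidecode(raw).upper()              # strip accents, uppercase
    text = re.sub(r"[^A-Z0-9 ]", " ", text)    # replace punctuation and other symbols with spaces
    text = re.sub(r" +", " ", text).strip()    # collapse runs of spaces, trim ends
    assert set(text) <= ALLOWED
    return text

print(normalize_text("1482 W Broadway, Burnaby, BC V6H1H4"))
# 1482 W BROADWAY BURNABY BC V6H1H4
```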

Address structure. We consider Canada Post's address formatting when generating data; the format guidelines are detailed here. Moreover, although Canada is bilingual in English and French, for simplicity we limit our scope to English address formats.

Address tokens. Since token categories for civic address data vary by municipality in Canada, we choose the categories somewhat arbitrarily. They are listed below, with the first being a dummy category used to classify spaces in the text input.

  1. SEPARATOR : space character
  2. UNIT : unit number
  3. HOUSE_NUM : house number
  4. ST_NAME : street name
  5. ST_TYPE : street type
  6. DIR_PREFIX : direction prefix, appearing before the street identifier
  7. DIR_SUFFIX : direction suffix, appearing after the street identifier
  8. CITY : city or town name
  9. PROVINCE : province or territory
  10. POST_CODE : postal code

Procedural generation. To generate addresses, we first randomly draw a "template" for each address. A template is an ordered list of non-separator address tokens in which each token occurs at most once, such as [HOUSE_NUM, ST_NAME, ST_TYPE]. A manually chosen list of templates conforming to Canada Post's address formats is defined here. Each non-separator token has a specific generation algorithm: some are purely algorithmic, like the postal code, while others draw from a fixed dataset, such as cities and street names. The generation algorithms are defined in a script and accompanying class definitions; the fixed datasets are here.
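The sketch below illustrates this template-based generation under assumed names and toy data: TEMPLATES, GENERATORS, and the sample street, city and province values are placeholders rather than the repository's actual definitions.

```python
import random

# Toy stand-ins for the fixed datasets; the real project draws from much larger lists.
STREET_NAMES = ["MAIN", "BROADWAY", "PORTAGE"]
STREET_TYPES = ["ST", "STREET", "AVE", "AVENUE", "RD"]
CITIES = ["WINNIPEG", "BURNABY", "HALIFAX"]

# Ordered lists of non-separator tokens, each occurring at most once.
TEMPLATES = [
    ["HOUSE_NUM", "ST_NAME", "ST_TYPE"],
    ["UNIT", "HOUSE_NUM", "ST_NAME", "ST_TYPE", "CITY", "PROVINCE", "POST_CODE"],
]

GENERATORS = {
    "UNIT": lambda: str(random.randint(1, 999)),
    "HOUSE_NUM": lambda: str(random.randint(1, 9999)),
    "ST_NAME": lambda: random.choice(STREET_NAMES),
    "ST_TYPE": lambda: random.choice(STREET_TYPES),
    "CITY": lambda: random.choice(CITIES),
    "PROVINCE": lambda: random.choice(["MB", "MANITOBA", "BC", "NS"]),
    # Alternating letter-digit draws roughly matching the Canadian postal code pattern.
    "POST_CODE": lambda: "".join(
        random.choice(s) for s in ["ABCEGHJKLMNPRSTVWXYZ", "0123456789"] * 3
    ),
}

def generate_address():
    """Draw a template, generate each token, and return (text, per-character labels)."""
    template = random.choice(TEMPLATES)
    words, labels = [], []
    for token in template:
        value = GENERATORS[token]()
        words.append(value)
        labels.extend([token] * len(value) + ["SEPARATOR"])
    return " ".join(words), labels[:-1]  # drop the trailing separator label

text, labels = generate_address()  # e.g. '204 MAIN ST' with one label per character
```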

Model definition

Figure 1: Diagram of CCAPNet. The network takes a character sequence as input and classifies each character to an address token.

The recurrent neural network CCAPNet, short for Canadian Civic Address Parsing neural Network, is defined in Figure 1. It takes an arbitrary-length character sequence as input and maps each character to a vector embedding; the embedded sequence is then fed to a bidirectional GRU. The bidirectional output sequences xf and xb are processed by independent 2-layer GRUs with residual connections. The resulting sequences are concatenated and fed to a fully-connected layer, whose output is a sequence of per-character classification logits.

Following common practice with residual connections and gating mechanisms, we use residual GRUs to stabilize training and improve generalization performance. The layer sizes are shown in Table 2; in total, the network has 34,427 trainable parameters.

Component | Parameters
Character embedding | Vocabulary size (37), embedding dimension (8)
Bidirectional GRU | Hidden size (32)
GRU | Hidden size (32)
Affine | Input size (64), output size (11)
Table 2: CCAPNet parameters and layer sizes.
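For concreteness, here is a minimal PyTorch sketch of an architecture matching Figure 1 and Table 2. It is an assumed reconstruction rather than the repository code, and details such as how the residual stacks are wired (and hence the exact parameter count) may differ slightly.

```python
import torch
import torch.nn as nn

class CCAPNetSketch(nn.Module):
    """Sketch of CCAPNet: embedding -> bidirectional GRU -> residual GRU stacks -> affine head."""

    def __init__(self, vocab_size=37, embed_dim=8, hidden=32, num_classes=11):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden, bidirectional=True, batch_first=True)
        # One independent stack of two residual GRU layers per direction.
        self.fwd_grus = nn.ModuleList(nn.GRU(hidden, hidden, batch_first=True) for _ in range(2))
        self.bwd_grus = nn.ModuleList(nn.GRU(hidden, hidden, batch_first=True) for _ in range(2))
        self.head = nn.Linear(2 * hidden, num_classes)

    @staticmethod
    def _residual_stack(x, grus):
        for gru in grus:
            out, _ = gru(x)
            x = x + out  # residual connection around each GRU layer
        return x

    def forward(self, chars):                      # chars: (batch, seq_len) character ids
        x = self.embed(chars)                      # (batch, seq_len, 8)
        x, _ = self.bigru(x)                       # (batch, seq_len, 64)
        xf, xb = x.chunk(2, dim=-1)                # forward and backward streams
        xf = self._residual_stack(xf, self.fwd_grus)
        xb = self._residual_stack(xb, self.bwd_grus)
        return self.head(torch.cat([xf, xb], dim=-1))  # per-character logits
```

Inference then reduces to an argmax (or softmax) over the last dimension of the returned logits.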

Results and analysis

Setup

For data, we randomly generate 192,000 addresses for training and 32,000 addresses for testing. We also use real test data by randomly drawing 56,320 addresses from the city of Winnipeg's civic addresses. Note that the real data only contains UNIT, HOUSE_NUM, ST_NAME, ST_TYPE and DIR_SUFFIX tokens.

CCAPNet is trained with RMSprop using cross-entropy loss on batches of addresses. We do this for 80 epochs with a batch size of 128 and a learning rate of 5e-4. Note that an epoch refers to a single pass over the training data, which in our case is 1,500 batches. To regularize the network, we apply L2 weight decay with parameter 5e-5 and augment the training data with random typos. Specifically, the typos are deletions, adjacent swaps, duplications and replacements. To keep replacements realistic, we only replace characters with neighbouring keys on a standard US keyboard. The number of typos in each address is a Poisson random variable with rate 1.
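Below is a minimal sketch of such typo augmentation; the add_typos name and the small keyboard-adjacency excerpt are assumptions, and the project's actual routine may differ.

```python
import random
import numpy as np

# Hypothetical excerpt of a US-keyboard adjacency map; a real one covers every alphanumeric key.
KEYBOARD_NEIGHBOURS = {"A": "QWSZ", "S": "AWEDXZ", "E": "WSDR", "1": "2Q", "0": "9P"}

def add_typos(text: str, rate: float = 1.0) -> str:
    """Corrupt an address with a Poisson(rate) number of random typos."""
    chars = list(text)
    for _ in range(np.random.poisson(rate)):
        if not chars:
            break
        i = random.randrange(len(chars))
        kind = random.choice(["delete", "swap", "duplicate", "replace"])
        if kind == "delete":
            del chars[i]
        elif kind == "swap" and i + 1 < len(chars):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif kind == "duplicate":
            chars.insert(i, chars[i])
        elif kind == "replace" and chars[i] in KEYBOARD_NEIGHBOURS:
            chars[i] = random.choice(KEYBOARD_NEIGHBOURS[chars[i]])
    return "".join(chars)
```

Note that when a typo changes the text length (a deletion or duplication), the per-character labels must be edited in lockstep; this sketch shows only the text side.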

Training metrics

Figure 3: Per-character accuracy and parser accuracy for training data (left), generated and real test data (right). Training metrics are divided into segments of 100 batches, with the segment mean, maximum and minimum displayed.

To measure performance, we consider two key metrics: per-character accuracy and parser accuracy. Per-character accuracy is the fraction of individual characters that are classified correctly. Parser accuracy is the fraction of sequences in which every character is classified correctly. The metrics are displayed in Figure 3. Training is monitored at every batch, whereas testing is monitored at the end of each epoch. CCAPNet achieves 95% parser accuracy on both the generated and real test data at the 76th epoch.
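For clarity, the two metrics can be computed from a batch of logits as in the sketch below; the pad_mask argument for handling variable-length, padded sequences is an assumption about the batching.

```python
import torch

def accuracy_metrics(logits: torch.Tensor, labels: torch.Tensor, pad_mask: torch.Tensor):
    """Per-character and parser accuracy for a batch.

    logits: (batch, seq_len, num_classes), labels: (batch, seq_len),
    pad_mask: (batch, seq_len) boolean, True on real (non-padding) characters.
    """
    preds = logits.argmax(dim=-1)
    correct = (preds == labels) & pad_mask
    per_char = correct.sum().item() / pad_mask.sum().item()
    # A sequence is parsed correctly only if every real character is classified correctly.
    parser = (correct.sum(dim=1) == pad_mask.sum(dim=1)).float().mean().item()
    return per_char, parser
```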

Model performance and example output

Figure 4: CCAPNet confusion matrices on generated (left) and real (right) test data. The row and column indices correspond to true and predicted values, respectively. Grey squares represent zero.

For this final section, we use the 76th epoch CCAPNet, which has 95% parser accuracy on both test datasets. Confusion matrices for the model are computed for the test data and visualized in Figure 4. Although the model predominantly makes correct predictions, the matrices suggest that it occasionally misclassifies:

The former indicates the model leverages positional structure in the generated data. The latter indicates the model interprets certain text ambiguously, which may be amplified by the typo augmentation.

Finally, let us show two examples of model inference. An example with correct tokenization is:

Input text: '1482 W Broadway, Burnaby, BC V6H1H4'
Normalized text: '1482 W BROADWAY BURNABY BC V6H1H4'
Tokens: {'HOUSE_NUM': '1482', 'DIR_PREFIX': 'W', 'ST_NAME': 'BROADWAY', 'CITY': 'BURNABY', 'PROVINCE': 'BC', 'POST_CODE': 'V6H1H4'}
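For reference, the routine that synthesizes per-character predictions into such a token dictionary could look like the sketch below. It is an assumed reconstruction: the synthesize_tokens name and the word-level majority vote are not taken from the repository, and it assumes each token occurs in at most one contiguous run, as in the templates.

```python
from itertools import groupby

def synthesize_tokens(text: str, char_labels: list[str]) -> dict[str, str]:
    """Group per-character predictions into an address decomposition."""
    words, word_labels = [], []
    start = 0
    for word in text.split(" "):
        labels = char_labels[start:start + len(word)]
        words.append(word)
        word_labels.append(max(set(labels), key=labels.count))  # majority vote per word
        start += len(word) + 1  # skip the separator character
    tokens: dict[str, str] = {}
    for label, group in groupby(zip(words, word_labels), key=lambda pair: pair[1]):
        # Consecutive words sharing a label are joined into one token value.
        tokens[label] = " ".join(word for word, _ in group)
    return tokens
```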

Below is an example of misclassification. The model does not identify the street type correctly:

Input text: '150 Saint Andrews St'
Normalized text: '150 SAINT ANDREWS ST'
Tokens: {'HOUSE_NUM': '150', 'ST_NAME': 'SAINT', 'ST_TYPE': 'ANDREWS ST'}

For each character in ANDREWS, CCAPNet assigns a softmax probability greater than 0.8 to the street type token. This may be influenced by the fact that many street names in the training data consist of a single word.

As a final remark, further improvements to the mockup can be made. Some examples include:

Nonetheless, this mockup implementation hopefully served as a proof of concept that makes a case for data-driven address parsers!