Address parsing with recurrent neural networks
Introduction and background
Street addresses are a fundamental component of our civic infrastructure, identifying houses, businesses and other points of interest. Moreover, they form the economic backbone of postal services and e-commerce, whose operations depend on the ability to maintain and process addresses at scale.
Address parsing, or address tokenization, is the natural language task of decomposing an address string into the location-specific components (or tokens) that pinpoint a location. These tokens typically comprise a building number, street name and type, postal code, city or town, and so on. Parsing is an essential step before geocoding or verifying unstructured address data. Unfortunately, building a parser with extensive address coverage is a delicate matter, even when restricting to a single country or language. For example, there can be:
- equivalent address formats ("101-500 Broadway" and "500 Broadway Unit 101")
- language quirks like abbreviations ("Street" to "St") and ordinal indicators ("1", "1st", "First")
- edge cases not covered by common assumptions about addresses
- missing tokens
- typos in address entry
All of these can lead to non-unique ways of interpreting addresses. Common address parsing methods rely on regular expressions or other formal grammars, and such parsers are brittle when faced with even one of the issues above. In recent years, data-driven (i.e. statistical or neural) methods for natural language tasks have emerged and become ubiquitous. These methods have significant potential for implementing large-scale parsers with broad address coverage. A noteworthy example is libpostal, an international-scale address parser that uses a large statistical model covering over 200 countries!
Project overview
Motivated by neural methods, I showcase a character-based Canadian address parser mockup built with machine learning. The parser uses a recurrent neural network, dubbed CCAPNet, that classifies each character in an input address string into a token category. During inference, a separate routine synthesizes the classifications into a decomposition of the address. As for training, real address data is often imbalanced due to the structured distribution of civic addresses. To tackle this issue, I train the network on addresses randomly generated by a context-free grammar. To prevent overfitting, training is regularized by augmenting the data with typos.
The entirety of this project is implemented in Python, with the code available on GitHub. PyTorch is used for the model definition and training.
Note that my mockup draws inspiration from Jason Rigby's AddressNet, a recurrent neural network for parsing Australian addresses. Specifically, I adopt its character-level classification and typo augmentation techniques. Beyond this, the CCAPNet implementation is written independently of Rigby's work.
Data methodology
This section summarizes the data representation and data assumptions we use for the parser.
Text normalization.
The input strings are normalized to contain only alphanumeric characters or a space, i.e. characters from 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 '. Two consecutive spaces cannot occur, nor can a space appear as the first or last character of the output. Lastly, non-ASCII characters (e.g. letters with grave accents) are converted to their ASCII "equivalent" via unidecode.
Address normalization is not applied before parsing, since it is not our focus. However, CCAPNet is trained on data that implicitly assumes a normalized address. In particular, numbers carry no ordinal indicators and are written as digits, and both the abbreviations and full words of street types, directions, and provinces are included.
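As an illustration, below is a minimal sketch of a normalization routine consistent with these rules; the function name and structure are my own rather than the repository's, which may differ in detail.

```python
from unidecode import unidecode  # maps non-ASCII characters to ASCII "equivalents"

ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")

def normalize_text(raw: str) -> str:
    """Normalize an address string to uppercase ASCII alphanumerics and single spaces."""
    text = unidecode(raw).upper()                             # strip accents, uppercase
    text = "".join(c if c in ALLOWED else " " for c in text)  # replace punctuation with spaces
    return " ".join(text.split())                             # collapse repeated spaces, trim ends

print(normalize_text("1482 W. Broadway, Burnaby, BC  V6H1H4"))
# '1482 W BROADWAY BURNABY BC V6H1H4'
```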
Address structure. We follow Canada Post's address formatting when generating data. The format guidelines are detailed here. Moreover, since Canada is bilingual in English and French, for simplicity we limit our scope to English address formats.
Address tokens. Since token categories for civic address data vary by municipality in Canada, we choose our categories somewhat arbitrarily. They are listed below, with the first being a dummy category for classifying spaces in the input text.
- SEPARATOR: space character
- UNIT: unit number
- HOUSE_NUM: house number
- ST_NAME: street name
- ST_TYPE: street type
- DIR_PREFIX: direction prefix, appearing before the street identifier
- DIR_SUFFIX: direction suffix, appearing after the street identifier
- CITY: city or town name
- PROVINCE: province or territory
- POST_CODE: postal code
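In code, these categories can be represented as an integer enum, sketched below. The names follow the list above, but the numeric values (and whether an extra padding class accounts for the 11 output logits in Table 2) are my assumptions, not the repository's definitions.

```python
from enum import IntEnum

class AddressToken(IntEnum):
    """Token categories used to label each character (illustrative ordering)."""
    SEPARATOR = 0
    UNIT = 1
    HOUSE_NUM = 2
    ST_NAME = 3
    ST_TYPE = 4
    DIR_PREFIX = 5
    DIR_SUFFIX = 6
    CITY = 7
    PROVINCE = 8
    POST_CODE = 9
```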
Procedural generation.
To generate addresses, we first randomly draw a "template" for each address.
A template is an ordered list of non-separator address tokens where each token occurs at most once, such as [HOUSE_NUM, ST_NAME, ST_TYPE].
A manually chosen list of templates conforming to Canada Post's address formats is defined here.
Each non-separator token has a specific generation algorithm. Some are purely algorithmic, like the postal code, while others pull from a fixed dataset, such as cities and street names. The generation algorithms are defined in a script and accompanying class definitions.
The fixed datasets used are here.
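To illustrate the idea, here is a toy version of the generation loop with hypothetical templates and per-token generators; the repository's actual templates, generation script and fixed datasets are far more extensive.

```python
import random
import string

# Hypothetical templates and generators, standing in for the repository's definitions.
TEMPLATES = [
    ["HOUSE_NUM", "ST_NAME", "ST_TYPE"],
    ["HOUSE_NUM", "ST_NAME", "ST_TYPE", "CITY", "PROVINCE", "POST_CODE"],
]

GENERATORS = {
    "HOUSE_NUM": lambda: str(random.randint(1, 9999)),
    "ST_NAME": lambda: random.choice(["BROADWAY", "MAIN", "PORTAGE"]),
    "ST_TYPE": lambda: random.choice(["ST", "STREET", "AVE", "AVENUE"]),
    "CITY": lambda: random.choice(["BURNABY", "WINNIPEG", "OTTAWA"]),
    "PROVINCE": lambda: random.choice(["BC", "MB", "ON"]),
    "POST_CODE": lambda: "".join(
        random.choice(string.ascii_uppercase if i % 2 == 0 else string.digits)
        for i in range(6)
    ),
}

def generate_address():
    """Draw a template, generate each token, and label every character."""
    template = random.choice(TEMPLATES)
    chars, labels = [], []
    for i, token in enumerate(template):
        text = GENERATORS[token]()
        if i > 0:
            chars.append(" ")
            labels.append("SEPARATOR")
        chars.extend(text)
        labels.extend([token] * len(text))
    return "".join(chars), labels
```

Each call returns the address text together with one label per character, which is exactly the supervision CCAPNet needs.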
Model definition
The recurrent neural network CCAPNet, short for Canadian Civic Address Parsing neural Network, is defined in Figure 1. It takes an arbitrary-length character sequence as input and maps each character to a vector embedding; the embedded sequence is then fed to a bidirectional GRU. The bidirectional output sequences xf and xb are processed by independent 2-layer GRUs with residual connections. The resulting sequences are concatenated and fed to a fully-connected layer, producing a sequence of logits that classify each character.
Following common practice with residual connections and gating mechanisms, we use residual GRUs to stabilize training and improve generalization. The network parameters are shown in Table 2; in total, the model has 34,427 trainable parameters.
| Component | Parameters |
|---|---|
| Character embedding | Vocabulary size (37), embedding layer (8) |
| Bidirectional GRU | Hidden layer (32) |
| GRU | Hidden layer (32) |
| Affine | Input layer (64), output layer (11) |
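To make the architecture concrete, below is a rough PyTorch re-creation of Figure 1. Layer sizes follow Table 2, but the exact residual wiring and the class name are my reading of the figure rather than the repository's code, and this sketch's parameter count may differ slightly from the reported 34,427.

```python
import torch
import torch.nn as nn

class CCAPNetSketch(nn.Module):
    """Approximate reconstruction of CCAPNet: embedding -> bidirectional GRU ->
    two independent residual 2-layer GRUs -> fully-connected classifier."""

    def __init__(self, vocab_size=37, embed_dim=8, hidden=32, num_classes=11):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.gru_f = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.gru_b = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                       # x: (batch, seq_len) of character ids
        e = self.embed(x)                       # (batch, seq_len, embed_dim)
        h, _ = self.bigru(e)                    # (batch, seq_len, 2 * hidden)
        xf, xb = h.chunk(2, dim=-1)             # forward and backward output sequences
        yf, _ = self.gru_f(xf)
        yb, _ = self.gru_b(xb)
        yf, yb = yf + xf, yb + xb               # residual connections
        return self.fc(torch.cat([yf, yb], dim=-1))  # (batch, seq_len, num_classes)

logits = CCAPNetSketch()(torch.randint(0, 37, (4, 30)))
print(logits.shape)  # torch.Size([4, 30, 11])
```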
Results and analysis
Setup
For data, we randomly generate 192,000 addresses for training and 32,000 addresses for testing.
We also use real test data by randomly drawing 56,320 addresses from the city of Winnipeg's civic addresses. Note that the real data only contains UNIT, HOUSE_NUM, ST_NAME, ST_TYPE and DIR_SUFFIX tokens.
CCAPNet is trained with RMSprop using cross-entropy loss on batches of addresses. We train for 80 epochs with a batch size of 128 and a step size of 5e-4, where an epoch refers to a single pass over the training data, which in our case is 1,500 batches. To regularize the network, we apply L2 weight decay with parameter 5e-5 and augment the training data with random typos. Specifically, the typos are deletions, adjacent swaps, duplications and replacements. To keep replacements realistic, characters are only replaced by neighbouring characters on a standard US keyboard. The number of typos injected into each address is a Poisson random variable with rate 1.
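The typo augmentation can be sketched as follows. Only the four typo types and the Poisson(1) count come from the setup above; the (partial) keyboard-neighbour map and the sampling details are my assumptions.

```python
import random
import numpy as np

# Partial keyboard-neighbour map for illustration; a full US layout would be used in practice.
NEIGHBOURS = {"A": "QWSZ", "S": "AWEDXZ", "E": "WSDR", "1": "2Q", "0": "9P"}

def add_typos(text: str, rate: float = 1.0) -> str:
    """Inject a Poisson(rate) number of random typos into an address string."""
    chars = list(text)
    for _ in range(np.random.poisson(rate)):
        if not chars:
            break
        i = random.randrange(len(chars))
        kind = random.choice(["delete", "swap", "duplicate", "replace"])
        if kind == "delete":
            chars.pop(i)
        elif kind == "swap" and i + 1 < len(chars):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif kind == "duplicate":
            chars.insert(i, chars[i])
        elif kind == "replace":
            # Replace with a keyboard neighbour; no-op if the character is not in the map.
            chars[i] = random.choice(NEIGHBOURS.get(chars[i], chars[i]))
    return "".join(chars)
```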
Training metrics
To measure performance, we consider two key metrics: per-character accuracy and parser accuracy. Per-character accuracy measures how often an individual character is classified correctly, while parser accuracy measures how often every character in a sequence is classified correctly. The metrics are displayed in Figure 3. Training metrics are recorded at every batch, whereas test metrics are recorded at the end of each epoch. CCAPNet achieves 95% parser accuracy on both the generated and real test data at the 76th epoch.
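As a sketch, both metrics can be computed from a batch of logits and targets as follows; this version ignores any padding of variable-length sequences, which the actual training code would need to mask out.

```python
import torch

def accuracy_metrics(logits: torch.Tensor, targets: torch.Tensor):
    """Per-character and parser (whole-sequence) accuracy for one batch.
    logits: (batch, seq_len, num_classes); targets: (batch, seq_len)."""
    correct = logits.argmax(dim=-1) == targets
    per_char = correct.float().mean().item()             # fraction of characters correct
    parser = correct.all(dim=-1).float().mean().item()   # fraction of sequences fully correct
    return per_char, parser
```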
Model performance and example output
For this final section, we use the 76th-epoch CCAPNet, which has 95% parser accuracy on both test datasets. Confusion matrices computed on the test data are visualized in Figure 4. Although the model predominantly makes correct predictions, the matrices suggest that it occasionally misclassifies:
- words that neighbour each other in the address
- tokens that produce similar text, such as unit numbers, house numbers and directions, or street names, street types and cities
Finally let us show two examples of model inference. An example with correct tokenization is:
Input text: '1482 W Broadway, Burnaby, BC V6H1H4'
Normalized text: '1482 W BROADWAY BURNABY BC V6H1H4'
Tokens: {'HOUSE_NUM': '1482', 'DIR_PREFIX': 'W', 'ST_NAME': 'BROADWAY', 'CITY': 'BURNABY', 'PROVINCE': 'BC', 'POST_CODE': 'V6H1H4'}
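The synthesis routine mentioned in the overview can be approximated as below: each word receives the majority label of its characters, and words sharing a label are joined. This is my simplification rather than the repository's exact logic, but it reproduces token dictionaries like the one above.

```python
from collections import Counter

def synthesize(text: str, labels: list) -> dict:
    """Collapse per-character predictions into a token dictionary.
    text is a normalized address; labels holds one predicted token name per character."""
    tokens, start = {}, 0
    for word in text.split(" "):
        if word:
            label = Counter(labels[start:start + len(word)]).most_common(1)[0][0]
            tokens[label] = f"{tokens[label]} {word}" if label in tokens else word
        start += len(word) + 1  # skip the separator character
    return tokens
```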
Below is an example of misclassification. The model does not identify the street type correctly:
Input text: '150 Saint Andrews St'
Normalized text: '150 SAINT ANDREWS ST'
Tokens: {'HOUSE_NUM': '150', 'ST_NAME': 'SAINT', 'ST_TYPE': 'ANDREWS ST'}
For each character in ANDREWS, CCAPNet predicts a softmax probability > 0.8 for the street type token.
This may be influenced by many street names occurring as one word in the training data.
As a final remark, further improvements to the mockup are possible. Some examples include:
- training a larger model on a larger dataset
- extending address generation with more data, e.g. more address templates and token examples
- preprocessing input with address normalization, e.g. word expansions and abbreviations
- adding a text segmentation model to avoid discontinuous character classifications