Projects
Distinct from my academic work, here is a showcase of programming work and hobbyist projects that I'm proud to share.
Address parsing with recurrent neural networks
(Jun 26, 2023)
I created a mockup character-based Canadian address parser, built with machine learning and natural language processing.
The parser uses a recurrent neural network that classifies each character of a string as belonging to an address token, such as a house number, street name, or street type.
The parser is written in Python, using torch for the neural network code.
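To make the per-character classification concrete, here is a minimal sketch of a character tagger in torch. It is not the project's actual model: the class name CharTagger, the layer sizes, the choice of a bidirectional GRU, and the four-label output are all assumptions made for illustration.

```python
import torch
from torch import nn


class CharTagger(nn.Module):
    """Hypothetical character-level tagger: one label score vector per character."""

    def __init__(self, vocab_size=128, embed_dim=32, hidden_dim=64, num_labels=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional, so each character sees context on both sides.
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, char_ids):
        hidden, _ = self.rnn(self.embed(char_ids))
        return self.out(hidden)  # shape: (batch, characters, num_labels)


def encode(text):
    """Map characters to integer ids, clamping anything outside ASCII to 127."""
    return torch.tensor([[min(ord(ch), 127) for ch in text]])


model = CharTagger()
logits = model(encode("123 Main St"))
predicted = logits.argmax(dim=-1)  # one (untrained) label id per character
```

The model here is untrained, so the predicted labels are meaningless; the point is only the shape of the problem, with one prediction per input character.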
This project tackles the issue of imbalanced data when building data-driven address parsers, by training on randomly generated addresses.
This suggests a potential for efficient and scalable address parsing with data-driven approaches.
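As a rough illustration of training on randomly generated addresses, the sketch below builds a random address together with one label per character. The word lists, label names, and the function random_labelled_address are hypothetical and far simpler than real Canadian addresses; it only shows how generated data can come with labels for free.

```python
import random

# Hypothetical vocabularies; a real generator would draw on much larger lists.
STREET_NAMES = ["Main", "King", "Elm", "Maple"]
STREET_TYPES = ["St", "Ave", "Rd", "Blvd"]


def random_labelled_address(rng):
    """Return (address_text, labels) with exactly one label per character."""
    parts = [
        (str(rng.randint(1, 9999)), "house_number"),
        (rng.choice(STREET_NAMES), "street_name"),
        (rng.choice(STREET_TYPES), "street_type"),
    ]
    chars, labels = [], []
    for index, (text, label) in enumerate(parts):
        if index > 0:
            # Spaces between tokens get their own label.
            chars.append(" ")
            labels.append("separator")
        chars.extend(text)
        labels.extend([label] * len(text))
    return "".join(chars), labels


rng = random.Random(0)  # seeded so the example is reproducible
address, labels = random_labelled_address(rng)
```

Because every token is sampled independently, rare street types can be generated as often as common ones, which is one way to sidestep imbalance in collected data.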
San Francisco fire service analysis
(May 23, 2023)
In this project, I independently explore San Francisco's open data on fire service calls, safety complaints, and incidents.
I extract and clean the fire service data, then analyze it by computing descriptive statistics and examining trends with inferential statistics.
This is done in Python using pandas, geopandas, scipy.stats and seaborn.
This analysis could offer insight into fire department usage and effectiveness, and help inform emergency decision-making to increase public safety.
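The descriptive-then-inferential workflow can be sketched with pandas and scipy.stats as follows. The rows are made up for illustration and are not taken from the San Francisco dataset; the column names, the focus on response time, and the choice of a Mann-Whitney U test are assumptions, not a description of the actual analysis.

```python
import pandas as pd
from scipy import stats

# Made-up records standing in for cleaned call data (hypothetical columns).
received = pd.to_datetime([
    "2023-01-01 10:00", "2023-01-01 10:30", "2023-01-01 11:00",
    "2023-01-01 12:00", "2023-01-01 13:00", "2023-01-01 14:00",
])
on_scene = received + pd.to_timedelta([5, 7, 4, 6, 9, 8], unit="min")
calls = pd.DataFrame({
    "call_type": ["Medical", "Fire", "Medical", "Fire", "Medical", "Fire"],
    "received": received,
    "on_scene": on_scene,
})

# Derived column: minutes between receiving the call and arriving on scene.
calls["response_min"] = (calls["on_scene"] - calls["received"]).dt.total_seconds() / 60

# Descriptive statistics per call type.
summary = calls.groupby("call_type")["response_min"].agg(["count", "mean", "median"])

# Inferential step: do the two call types differ in response time?
medical = calls.loc[calls["call_type"] == "Medical", "response_min"]
fire = calls.loc[calls["call_type"] == "Fire", "response_min"]
result = stats.mannwhitneyu(medical, fire, alternative="two-sided")
```

With only three rows per group the test has no real power; on the full dataset the same calls would run unchanged on far more records.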
Strongly solving Quantik
(Mar 22, 2023)
Quantik is a two-player adversarial abstract strategy game by Nouri Khalifa, published by Gigamic. The rules are simple and reminiscent of Tic-tac-toe, but Quantik has a larger state-space complexity, and describing optimal play for either player is nontrivial.
I wrote a Quantik solver in C that determines which moves can force a win for the current player, if any exist. These winning moves are found in under one second from any state, using an optimized combinatorial search, an opening book, and cache-friendly code.
Links: [code]
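To show what "moves that force a win" means, here is a small exhaustive search in Python. It is an illustrative sketch, not a port of the C solver: the board encoding (a 16-character string, upper case for one player and lower case for the other) and all function names are assumptions, and it has none of the solver's optimizations or opening book. It relies on the published rules: a shape may not be placed in a row, column, or 2x2 region where the opponent already has that shape, a player wins by placing the fourth distinct shape in any of those lines, and a player with no legal move loses.

```python
from functools import lru_cache

SHAPES = "ABCD"
EMPTY = "."

# Rows, columns and 2x2 regions of the 4x4 board; cells are numbered 0..15 row by row.
ROWS = [[4 * r + c for c in range(4)] for r in range(4)]
COLS = [[4 * r + c for r in range(4)] for c in range(4)]
REGIONS = [[4 * (r + dr) + (c + dc) for dr in (0, 1) for dc in (0, 1)]
           for r in (0, 2) for c in (0, 2)]
LINES_OF = {cell: [line for line in ROWS + COLS + REGIONS if cell in line]
            for cell in range(16)}


def piece(player, shape):
    """Player 0 is written in upper case, player 1 in lower case."""
    return shape if player == 0 else shape.lower()


def legal_moves(board, player):
    moves = []
    for shape in SHAPES:
        if board.count(piece(player, shape)) >= 2:
            continue  # both copies of this shape are already placed
        blocked = piece(1 - player, shape)
        for cell in range(16):
            if board[cell] != EMPTY:
                continue
            # Illegal if the opponent has the same shape in a shared line.
            if any(board[i] == blocked for line in LINES_OF[cell] for i in line):
                continue
            moves.append((cell, shape))
    return moves


def play(board, player, move):
    cell, shape = move
    return board[:cell] + piece(player, shape) + board[cell + 1:]


def completes_line(board, cell):
    """True if some line through `cell` holds all four shapes, whoever owns them."""
    return any({board[i].upper() for i in line} == set(SHAPES)
               for line in LINES_OF[cell])


@lru_cache(maxsize=None)
def can_force_win(board, player):
    """True if `player`, to move on `board`, wins with best play."""
    children = [(play(board, player, m), m[0]) for m in legal_moves(board, player)]
    # Check immediate wins first so obvious positions return without deep search.
    if any(completes_line(child, cell) for child, cell in children):
        return True
    # Otherwise a move wins only if the opponent cannot force a win afterwards.
    # With no legal moves, this is False: the player to move loses.
    return any(not can_force_win(child, 1 - player) for child, _ in children)


def winning_moves(board, player):
    result = []
    for move in legal_moves(board, player):
        child = play(board, player, move)
        if completes_line(child, move[0]) or not can_force_win(child, 1 - player):
            result.append(move)
    return result
```

For example, can_force_win("ABC" + "." * 13, 0) is true, because placing D in the fourth cell completes the top row. In pure Python, winning_moves is only practical late in the game; searching from early positions this way is slow, which is where optimizations of the kind mentioned above matter.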
OpenTabulate - a data tabulation tool in Python
(Jun 2018 - Aug 2020)
When I worked as a data scientist at Statistics Canada, I coded and documented opentabulate, a Python command-line tool to tabulate data under a common schema.
The key motivation for opentabulate is to support the Linkable Open Data Environment (LODE), a data portal that builds and distributes public Canadian infrastructure datasets under an open data license.
Most of the data originates from municipal, provincial and federal authorities in Canada, but varies in format and schema by jurisdiction.
My section uses opentabulate to automate the extraction, cleaning and merging of such data.
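The idea of tabulating differently shaped sources under one schema can be sketched with the standard library alone. This does not use or reproduce opentabulate's actual interface: the schema fields, the tabulate function, the column mappings, and the sample records are all hypothetical.

```python
import csv
import io

# Hypothetical common schema that every source is mapped onto.
COMMON_SCHEMA = ["name", "street_no", "street_name", "city"]


def tabulate(source_text, column_map):
    """Rewrite CSV text into rows keyed by COMMON_SCHEMA.

    column_map maps a schema field to the source's column name;
    fields the source does not provide are left empty.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(source_text)):
        row = {}
        for field in COMMON_SCHEMA:
            source_column = column_map.get(field)
            value = record.get(source_column) if source_column else None
            row[field] = (value or "").strip()  # basic cleaning: trim whitespace
        rows.append(row)
    return rows


# Two made-up sources with different headers, as different jurisdictions might publish.
city_a = "Facility,Civic No,Street\nCentral Library, 12 ,Main St\n"
city_b = "NAME,ADDR_NUM,ADDR_STREET,MUNICIPALITY\nFire Hall 3,400,King Rd,Exampleville\n"

merged = (
    tabulate(city_a, {"name": "Facility", "street_no": "Civic No", "street_name": "Street"})
    + tabulate(city_b, {"name": "NAME", "street_no": "ADDR_NUM",
                        "street_name": "ADDR_STREET", "city": "MUNICIPALITY"})
)
```

Once each source has its own column mapping, merging becomes simple concatenation, since every row already has the same fields.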