From May to August 2024, I completed a summer project in the Methodology department at Statistics Canada in Ottawa, working with Andrew Stelmack. We developed a robust solution to the record linkage problem, integrating data from heterogeneous sources with a focus on accuracy and scalability.
We cleaned and preprocessed the data, including handling missing values, normalizing formats, and removing duplicates to ensure consistency across datasets.
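As an illustration only, a cleaning step of this kind might look like the minimal sketch below. It assumes records are plain Python dictionaries; the field names, the list of missing-value tokens, and the helper names `normalize_text` and `clean_records` are hypothetical and not taken from the project.

```python
import unicodedata

# Hypothetical set of strings treated as "missing" after normalization.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "unknown"}


def normalize_text(value):
    """Return a lowercased, accent-free, whitespace-collapsed string, or None if missing."""
    if value is None:
        return None
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and collapse runs of whitespace.
    text = " ".join(text.lower().split())
    return None if text in MISSING_TOKENS else text


def clean_records(records, fields):
    """Normalize the given fields and drop records that become exact duplicates."""
    cleaned, seen = [], set()
    for record in records:
        row = {field: normalize_text(record.get(field)) for field in fields}
        key = tuple(row[field] for field in fields)
        if key in seen:
            continue  # same normalized values already kept once
        seen.add(key)
        cleaned.append(row)
    return cleaned


raw = [
    {"name": "  José  Tremblay ", "city": "Ottawa"},
    {"name": "jose tremblay", "city": "OTTAWA"},
    {"name": "Ann Lee", "city": "N/A"},
]
cleaned = clean_records(raw, ["name", "city"])
```

In this toy input the first two records collapse into one after normalization, and the placeholder city in the third becomes an explicit missing value.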
We engineered features likely to improve the performance of the machine learning models, such as string similarity metrics and domain-specific attributes.
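To make the idea of string similarity features concrete, the sketch below computes three common ones for a candidate record pair using only the Python standard library. The choice of metrics (normalized Levenshtein similarity, Jaccard overlap of character bigrams, and the `difflib.SequenceMatcher` ratio), the feature names, and the scoring of missing values as 0 are assumptions for illustration, not the project's actual feature set.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigrams(s):
    """Set of overlapping two-character substrings."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_bigrams(a, b):
    """Jaccard similarity of the two strings' character-bigram sets."""
    union = bigrams(a) | bigrams(b)
    if not union:  # both strings shorter than two characters
        return 1.0 if a == b else 0.0
    return len(bigrams(a) & bigrams(b)) / len(union)


def pair_features(rec_a, rec_b, fields):
    """Build a feature dictionary for one candidate pair of records."""
    features = {}
    for field in fields:
        a = rec_a.get(field) or ""
        b = rec_b.get(field) or ""
        if not a or not b:
            # Assumption: a missing value carries no evidence of a match.
            lev_sim = jaccard = seq_ratio = 0.0
        else:
            lev_sim = 1.0 - levenshtein(a, b) / max(len(a), len(b))
            jaccard = jaccard_bigrams(a, b)
            seq_ratio = SequenceMatcher(None, a, b).ratio()
        features[field + "_lev_sim"] = lev_sim
        features[field + "_jaccard"] = jaccard
        features[field + "_seq_ratio"] = seq_ratio
    return features


example = pair_features({"name": "jon smith"}, {"name": "john smith"}, ["name"])
```

Each candidate pair then becomes one row of numeric features that a classifier can label as a match or a non-match.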
We evaluated and selected machine learning algorithms to enhance model performance, covering both individual models and ensemble methods (bagging and boosting), including support vector machines, random forests, and XGBoost.
We trained the selected models on labeled data, tuning hyperparameters and evaluating performance with precision, recall, and F1-score under nested cross-validation.
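The structure of nested cross-validation can be sketched as follows: an inner loop selects a hyperparameter using only the outer training data, and an outer loop scores that choice on data the inner loop never saw. To stay dependency-free, this sketch swaps in a tiny k-nearest-neighbour classifier for the models named above; the toy data, the fold scheme (no shuffling or stratification), and every function name are assumptions for illustration.

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = match, 0 = non-match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_X[i], x))
    nearest = sorted(range(len(train_X)), key=dist)[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def folds(indices, n_folds):
    """Yield (train, test) index lists; fold f holds out every n_folds-th index."""
    for f in range(n_folds):
        test = indices[f::n_folds]
        held_out = set(test)
        yield [i for i in indices if i not in held_out], test


def score(X, y, train, test, k):
    """Fit k-NN on the train indices and return (precision, recall, F1) on test."""
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    preds = [knn_predict(train_X, train_y, X[i], k) for i in test]
    return prf1([y[i] for i in test], preds)


def nested_cv(X, y, k_grid, outer_folds=3, inner_folds=2):
    """Return one (best_k, (precision, recall, F1)) entry per outer fold."""
    results = []
    for outer_train, outer_test in folds(list(range(len(X))), outer_folds):
        # Inner loop: pick k by mean F1, using outer-training data only.
        def inner_f1(k):
            scores = [score(X, y, tr, va, k)[2]
                      for tr, va in folds(outer_train, inner_folds)]
            return sum(scores) / len(scores)
        best_k = max(k_grid, key=inner_f1)
        # Outer loop: evaluate the selected k on the untouched outer test fold.
        results.append((best_k, score(X, y, outer_train, outer_test, best_k)))
    return results


# Toy similarity features for candidate pairs, with match (1) / non-match (0) labels.
X = [[0.95, 0.90], [0.10, 0.20], [0.90, 0.85], [0.20, 0.10],
     [0.85, 0.95], [0.15, 0.30], [0.80, 0.90], [0.30, 0.20],
     [0.92, 0.80], [0.25, 0.15], [0.88, 0.93], [0.05, 0.10]]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
results = nested_cv(X, y, k_grid=[1, 3])
```

Because hyperparameters are chosen without access to the outer test fold, the outer scores are not inflated by the tuning itself. In practice a library implementation with shuffled, stratified folds would be the more usual choice.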
We refined the models iteratively as feedback and new data arrived, retraining and retesting to improve accuracy and effectiveness.
We analyzed the model outputs, interpreting the linked records and checking that the findings were consistent with the project goals.
We documented the methodology, results, and insights gained from the project, preparing reports and presentations for stakeholders to communicate the effectiveness of the record linkage approach.
We authored a comprehensive project report detailing the problem formulation, data preprocessing pipeline, model implementation, and in-depth analysis of results.