I completed a summer project in the Methodology department at Statistics Canada in Ottawa from May to August 2024, working with Andrew Stelmack. We developed a robust solution for the record linkage problem, integrating data from heterogeneous sources with a focus on accuracy and scalability.

  • We cleaned and preprocessed the data, including handling missing values, normalizing formats, and removing duplicates to ensure consistency across datasets.

  • We developed and extracted relevant features from the data that could improve the performance of the machine learning models, such as string similarity metrics and domain-specific attributes.

  • We evaluated and selected machine learning algorithms, covering both individual models and ensemble methods (bagging and boosting), including support vector machines, random forests, and XGBoost, to enhance model performance.

  • We trained the selected models on labeled data, optimizing hyperparameters and evaluating model performance with precision, recall, and F1-score under nested cross-validation.

  • We continuously refined models based on feedback and new data, iterating through the process of retraining and testing to improve accuracy and effectiveness.

  • We analyzed the model outputs, interpreting the linked records and checking that the findings were consistent with the project goals.

  • We documented the methodology, results, and insights gained from the project, preparing reports and presentations for stakeholders to communicate the effectiveness of the record linkage approach.

  • We authored a comprehensive project report detailing the problem formulation, data preprocessing pipeline, model implementation, and in-depth analysis of results.
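
To make the preprocessing, feature extraction, and evaluation steps above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the project's actual pipeline: the field names ("name", "address"), the helper functions, and the choice of `difflib.SequenceMatcher` and token-set Jaccard as string similarity metrics are all hypothetical stand-ins.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    if value is None:
        return ""  # treat missing values as empty strings
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def jaccard(a, b):
    """Token-set Jaccard similarity between two normalized strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def pair_features(rec_a, rec_b, fields=("name", "address")):
    """Build a feature dict for one candidate record pair.

    The field names are hypothetical; real sources would define their own.
    """
    features = {}
    for field in fields:
        a, b = normalize(rec_a.get(field)), normalize(rec_b.get(field))
        # Flag pairs where either side is missing so a model can learn from it.
        features[f"{field}_missing"] = float(not a or not b)
        # Character-level similarity; 0.0 when a value is missing.
        features[f"{field}_ratio"] = (
            SequenceMatcher(None, a, b).ratio() if a and b else 0.0
        )
        # Token-level similarity, insensitive to word order.
        features[f"{field}_jaccard"] = jaccard(a, b)
    return features


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary match (1) / non-match (0) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

Feature dicts like the ones returned by `pair_features` could then be fed to a classifier such as a random forest or XGBoost, with `precision_recall_f1` (or a library equivalent) scoring the predicted matches against labeled pairs.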