This project focuses on record linkage using census data from Statistics Canada.

  • We cleaned and preprocessed the data, including handling missing values, normalizing formats, and removing duplicates to ensure consistency across datasets.

  • We engineered features intended to improve the performance of the machine learning models, such as string similarity metrics and domain-specific attributes.

  • We evaluated and selected machine learning algorithms, covering both individual learners such as support vector machines and ensemble methods based on bagging (random forests) and boosting (XGBoost).

  • We trained the selected models on labeled data, tuning hyperparameters and evaluating model performance with metrics such as precision, recall, and F1-score through nested cross-validation.

  • We refined the models iteratively based on feedback and new data, retraining and retesting to improve accuracy and effectiveness.

  • We analyzed the model outputs, interpreting the linked records and checking that the findings were consistent with the project goals.

  • We documented the methodology, results, and insights gained from the project, preparing reports and presentations for stakeholders to communicate the effectiveness of the record linkage approach.

  • We authored a comprehensive project report detailing the problem formulation, data preprocessing pipeline, model implementation, and in-depth analysis of results.
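
The feature-engineering and evaluation steps above can be illustrated with a minimal R sketch. It is not the project's actual pipeline: the record pairs, the column names, the helper functions name_similarity() and linkage_metrics(), and the 0.8 threshold are all hypothetical, and a simple threshold rule stands in for the trained classifiers. It uses only base R, with utils::adist() supplying the Levenshtein edit distance.

```r
# Hypothetical candidate record pairs; names and labels are made up for illustration.
pairs <- data.frame(
  name_a   = c("Jon Smith", "Marie Tremblay", "Li Wei"),
  name_b   = c("John Smith", "Maria Tremblay", "Ahmed Khan"),
  is_match = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Normalized Levenshtein similarity in [0, 1]: 1 means identical strings.
name_similarity <- function(a, b) {
  a <- tolower(trimws(a))
  b <- tolower(trimws(b))
  # Edit distance for each pair (element-wise, not all-against-all).
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  longest <- pmax(nchar(a), nchar(b))
  ifelse(longest == 0, 1, 1 - d / longest)
}

# Precision, recall, and F1-score for predicted versus true match labels.
linkage_metrics <- function(predicted, actual) {
  tp <- sum(predicted & actual)
  fp <- sum(predicted & !actual)
  fn <- sum(!predicted & actual)
  precision <- if (tp + fp == 0) NA_real_ else tp / (tp + fp)
  recall    <- if (tp + fn == 0) NA_real_ else tp / (tp + fn)
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(precision = precision, recall = recall, f1 = f1)
}

# Similarity feature for each candidate pair.
pairs$sim <- name_similarity(pairs$name_a, pairs$name_b)

# Illustrative threshold rule standing in for a trained model (assumed value).
predicted <- pairs$sim >= 0.8

print(linkage_metrics(predicted, pairs$is_match))
```

In a fuller workflow, features such as sim would be passed to the classifiers, and the metrics would be computed on the held-out outer folds of the nested cross-validation rather than on the data used for tuning.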

Toolkit for the project

  • R, RStudio, R Markdown.