This project focuses on record linkage using census data from Statistics Canada.
We cleaned and preprocessed the data, including handling missing values, normalizing formats, and removing duplicates to ensure consistency across datasets.
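As an illustration of these cleaning steps, here is a minimal base R sketch. The column names (`name`, `birth_date`), the toy values, and the policy of dropping rows with a missing name are all assumptions made for the example, not the project's actual pipeline.

```r
# Toy records; column names and values are hypothetical.
records <- data.frame(
  name = c("  Tremblay,  Marie ", "TREMBLAY, MARIE", "Roy, Jean", NA),
  birth_date = c("1980-01-05", "1980/01/05", "1975/12/30", "1990-07-14"),
  stringsAsFactors = FALSE
)

clean_records <- function(df) {
  # Handle missing values: here, drop rows with no name (one possible policy)
  df <- df[!is.na(df$name), , drop = FALSE]
  # Normalize names: trim, collapse repeated spaces, convert to upper case
  df$name <- toupper(gsub("\\s+", " ", trimws(df$name)))
  # Normalize dates: unify the separator, then parse as Date
  df$birth_date <- as.Date(gsub("/", "-", df$birth_date), format = "%Y-%m-%d")
  # Remove exact duplicate rows that remain after normalization
  df <- unique(df)
  rownames(df) <- NULL
  df
}

cleaned <- clean_records(records)
print(cleaned)
```

Normalizing before deduplicating matters here: the first two toy rows only become identical once case, spacing, and date separators are unified.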
We engineered features from the data that could improve the performance of the machine learning models, such as string similarity metrics and domain-specific attributes.
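One common string similarity feature is a normalized Levenshtein similarity, which can be sketched with base R's `utils::adist`. The helper `lev_similarity`, the candidate-pair columns, and the year-agreement feature below are hypothetical names chosen for illustration. The project may well have used other metrics.

```r
# Normalized Levenshtein similarity in [0, 1]; 1 means identical strings.
lev_similarity <- function(a, b) {
  # Edit distance for each aligned pair (a[i], b[i])
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  longest <- pmax(nchar(a), nchar(b))
  # Two empty strings are treated as identical
  ifelse(longest == 0, 1, 1 - d / longest)
}

# Hypothetical candidate pairs of records to compare
pairs <- data.frame(
  name_a = c("SMITH", "JOHNSON", "ROY"),
  name_b = c("SMYTH", "JOHNSTON", "TREMBLAY"),
  year_a = c(1980, 1975, 1990),
  year_b = c(1980, 1976, 1990),
  stringsAsFactors = FALSE
)

pairs$name_sim <- lev_similarity(pairs$name_a, pairs$name_b)   # string feature
pairs$year_match <- as.integer(pairs$year_a == pairs$year_b)   # agreement flag
print(pairs)
```

For example, "SMITH" and "SMYTH" differ by one substitution over five characters, giving a similarity of 0.8.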
We evaluated and selected machine learning algorithms, covering both individual learners such as support vector machines and ensemble methods based on bagging and boosting, such as random forests and XGBoost, to enhance model performance.
We trained the selected models on labeled data, optimizing hyperparameters and evaluating model performance through nested cross-validation using metrics such as precision, recall, and F1-score.
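The nested cross-validation idea can be sketched in base R as follows. This is an illustrative setup only: it uses a logistic regression from `stats::glm` as a stand-in model, treats the decision threshold as the tuned hyperparameter, and runs on synthetic data. The names `prf`, `fit_and_score`, `nested_cv`, `is_match`, and `name_sim` are hypothetical.

```r
# Precision, recall, and F1 for binary labels (1 = match, 0 = non-match)
prf <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  precision <- if (tp + fp == 0) 0 else tp / (tp + fp)
  recall <- if (tp + fn == 0) 0 else tp / (tp + fn)
  f1 <- if (precision + recall == 0) 0 else
    2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Fit on `train`, then score `test` at the given decision threshold
fit_and_score <- function(train, test, threshold) {
  model <- glm(is_match ~ name_sim, data = train, family = binomial())
  prob <- predict(model, newdata = test, type = "response")
  prf(test$is_match, as.integer(prob >= threshold))
}

nested_cv <- function(data, thresholds, k_outer = 5, k_inner = 3) {
  outer_fold <- sample(rep(seq_len(k_outer), length.out = nrow(data)))
  outer_scores <- sapply(seq_len(k_outer), function(i) {
    train <- data[outer_fold != i, ]
    test <- data[outer_fold == i, ]
    inner_fold <- sample(rep(seq_len(k_inner), length.out = nrow(train)))
    # Inner loop: mean validation F1 for each candidate threshold
    inner_f1 <- sapply(thresholds, function(th) {
      mean(sapply(seq_len(k_inner), function(j) {
        fit_and_score(train[inner_fold != j, ], train[inner_fold == j, ], th)[["f1"]]
      }))
    })
    best <- thresholds[which.max(inner_f1)]
    # Outer loop: refit on the full outer-training set, score the held-out fold
    fit_and_score(train, test, best)
  })
  rowMeans(outer_scores)  # average precision, recall, F1 over outer folds
}

# Synthetic candidate pairs: matches tend to have higher name similarity
set.seed(1)
n <- 200
is_match <- rbinom(n, 1, 0.4)
name_sim <- pmin(pmax(rnorm(n, ifelse(is_match == 1, 0.8, 0.5), 0.15), 0), 1)
pairs_cv <- data.frame(is_match, name_sim)

scores <- nested_cv(pairs_cv, thresholds = c(0.3, 0.5, 0.7))
print(scores)
```

The hyperparameter is chosen using only the inner folds, so the outer-fold scores are not biased by the tuning step.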
We continuously refined models based on feedback and new data, iterating through the process of retraining and testing to improve accuracy and effectiveness.
We analyzed the model outputs, interpreting the linked records and checking that the findings were consistent with the project goals.
We documented the methodology, results, and insights gained from the project, preparing reports and presentations for stakeholders to communicate the effectiveness of the record linkage approach.
We authored a comprehensive project report detailing the problem formulation, data preprocessing pipeline, model implementation, and in-depth analysis of results.
Toolkit for the project
- R language, RStudio, R Markdown.