Tracking Patients Across Years With Record Linkage

By DataKind San Francisco

About Muso 

Muso is an NGO with a mission to eliminate preventable deaths rooted in poverty. In Mali, Muso operates at ground zero of health inequity. Mali has the fifth-highest rate of child mortality in the world, at 110 deaths among children under five per 1,000 live births.

According to Muso’s website:

“A decade ago a small group of Malians and Americans came together to address the injustices of health and poverty they witnessed around them. Working together out of a converted storage closet, Muso was born. Muso means woman in Bambara, a lingua franca of Mali. A commonly heard Malian proverb asserts, ‘If you educate a woman, you educate her family, her community and her entire country.’ Women are considered responsible for protecting the health of their families.

Muso recognizes that to save lives in the world’s poorest communities, a reactive model is not enough. Early access to care is crucial to survival. To address this challenge, we built a different kind of health system — one that removes barriers and brings care to patients proactively. Through a decade of research, Muso developed a proactive health care system designed to save lives.” 

DataKind San Francisco was lucky to partner with Muso from late 2018 to early 2020 to tackle a tricky record linkage problem.


The problem we tackled with data science

In 2017, Muso launched a three-year randomized controlled trial in Bankass, a rural region of Mali, to test the impact of a proactive community case management intervention on child mortality. The trial asks whether Community Health Workers who proactively search for patients increase early access to effective treatment and decrease child mortality compared to the standard passive model of care, in which Community Health Workers provide care from a stationary health post.

Unique household-based ID numbers were assigned to each patient in Bankass during an annual, population-based census survey. However, Muso learned that in practice, study participants may be assigned more than one ID number over time, due to natural migration and marriage patterns as well as considerable population displacement amidst increasing insecurity in the region. This is problematic because:

  • A one-to-many relationship between patient and ID makes it difficult to track patients over time and across community and clinic settings
  • Failing to reliably track patients over time contributes to patients being “lost to follow up”, which can considerably reduce the study’s sample size and its power to measure the impact of proactive case management on the outcomes of interest (e.g. child mortality)

Muso identified a need to perform probabilistic record linkage across census survey years and create a Master Person Index. The trial covers over 100,000 people in 137 clusters in rural Mali. After some research and proof-of-concept implementations, we settled on a mature open-source record linkage package written in Python. It calculates similarity scores for each field and produces the final probability of a match with the Expectation/Conditional Maximization algorithm (see “Technical specs of our solution” below for more detail).
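To give a feel for what a “final probability of a match” means, here is a minimal sketch in plain Python. It is not the project’s code and it does not use the package’s classifier: it fits a two-class model with ordinary Expectation-Maximization, a simpler relative of ECM, on made-up agree/disagree vectors. The function names, starting guesses and toy data are all assumptions for illustration.

```python
def likelihoods(vector, p, m, u):
    """Joint likelihood of an agreement vector under 'match' and 'non-match'."""
    like_match, like_non = p, 1.0 - p
    for agrees, m_i, u_i in zip(vector, m, u):
        like_match *= m_i if agrees else 1.0 - m_i
        like_non *= u_i if agrees else 1.0 - u_i
    return like_match, like_non


def match_probability(vector, p, m, u):
    """Posterior probability that a candidate pair is a true match."""
    like_match, like_non = likelihoods(vector, p, m, u)
    return like_match / (like_match + like_non)


def fit_em(vectors, n_iter=50, eps=1e-6):
    """Estimate p (share of matches) and per-field agreement rates m and u."""
    k = len(vectors[0])
    n = len(vectors)
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # starting guesses (assumed)

    def clamp(x):
        # Keep parameters strictly inside (0, 1) to avoid division by zero.
        return min(1.0 - eps, max(eps, x))

    for _ in range(n_iter):
        # E-step: posterior match probability for every candidate pair.
        w = [match_probability(v, p, m, u) for v in vectors]
        # M-step: re-estimate parameters as posterior-weighted averages.
        total = sum(w)
        p = clamp(total / n)
        for i in range(k):
            m[i] = clamp(sum(wi * v[i] for wi, v in zip(w, vectors)) / total)
            u[i] = clamp(
                sum((1.0 - wi) * v[i] for wi, v in zip(w, vectors)) / (n - total)
            )
    return p, m, u


# Toy data: each tuple says whether (name, ID, phone) agree for one pair.
vectors = [(1, 1, 1)] * 10 + [(1, 1, 0)] * 2 + [(0, 0, 1)] * 8 + [(0, 0, 0)] * 40
p, m, u = fit_em(vectors)
prob_all_agree = match_probability((1, 1, 1), p, m, u)
prob_none_agree = match_probability((0, 0, 0), p, m, u)
```

Pairs that agree on every field end up with a probability near 1 and pairs that agree on nothing end up near 0. The interesting cases are the partial agreements, which are weighted by how informative each field turned out to be.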

Measuring the success of our solution

We agreed on two primary success metrics to optimize for in the creation of our record linkage model, which matched patients across two census years (year 0 and year 1):

  • Maximize the percent of the year 0 population that we were able to find a match for in year 1
  • Minimize the number of year 1 matches per year 0 patient 

We were happy to end the project with a match for 92% of the year 0 population, and an average of only 1.5 matches per patient.
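As a rough sketch of how these two metrics could be computed from a list of accepted links, consider the following. The function name and the toy IDs are hypothetical, and the choice to average only over patients with at least one match is an assumption, since the exact definition used in the project is not spelled out here.

```python
from collections import Counter


def linkage_metrics(year0_ids, links):
    """Return (percent of year 0 patients matched, average matches per matched patient).

    links is an iterable of (year0_id, year1_id) pairs accepted by the model.
    """
    matches_per_patient = Counter(y0 for y0, _ in set(links))
    matched = [pid for pid in year0_ids if matches_per_patient[pid] > 0]
    pct_matched = 100.0 * len(matched) / len(year0_ids)
    # Assumption: the average is taken over matched patients only.
    avg_matches = (
        sum(matches_per_patient[pid] for pid in matched) / len(matched)
        if matched
        else 0.0
    )
    return pct_matched, avg_matches


# Made-up example: four year 0 patients, three of them matched.
year0_ids = ["A", "B", "C", "D"]
links = [("A", "x1"), ("A", "x2"), ("B", "x3"), ("C", "x4")]
pct_matched, avg_matches = linkage_metrics(year0_ids, links)
```

The two numbers pull against each other: loosening the model raises the share of patients with a match but also raises the number of candidates per patient that someone has to review.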

A secondary explanatory metric that we investigated throughout the project, and the original indicator to Muso that they needed a Master Person Index, was the percentage of women who were present for the census survey in year 0 and participated in a women’s survey, but did not participate in the women’s survey in year 1 (a situation referred to as “lost-to-follow-up”). Without the record linkage, the lost-to-follow-up rate was 25%.

After performing the record linkage, we were able to reduce the percentage of women truly lost-to-follow-up between year 0 and year 1 by 12%.

Technical specs of our solution

We created a series of three Python notebooks that could be run by Muso’s data operations team as each study year’s data became available. 

01: data transformation

The data transformation file was necessary to standardize fields across years: for example, the values of gender changed from year 0 to year 1, from “male”/“female” to “1”/“2”. To guard against any future changes, we created inputs for each value to ensure they were standardized in future years after we passed the model to the Muso data operations team.
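The standardization step described above can be sketched as a small lookup with a loud failure for unexpected values. This is an illustration rather than the notebook’s actual code: the mapping name, the output codes "M"/"F", and the assumption that “1” means male and “2” means female are all hypothetical and would need to be checked against the survey codebook.

```python
# Hypothetical mapping; extend it when a new study year introduces new codes.
# Assumption: "1" corresponds to male and "2" to female.
GENDER_MAP = {"male": "M", "female": "F", "1": "M", "2": "F"}


def standardize_gender(value, mapping=GENDER_MAP):
    """Map a raw gender value from any study year to one standard code."""
    key = str(value).strip().lower()
    if key not in mapping:
        # Fail loudly so a changed coding scheme is noticed, not silently dropped.
        raise ValueError(f"Unrecognized gender value: {value!r}")
    return mapping[key]


year0_values = ["male", "Female "]
year1_values = [1, "2"]
standardized = [standardize_gender(v) for v in year0_values + year1_values]
```

Keeping the mapping as an input rather than hard-coding it in the logic means a future change in coding only requires editing one dictionary.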

02: modeling

The modeling file is where the record linkage is run with the following features:

  • Uses the Python package recordlinkage, since many of the required features are built in
  • Calculates similarity scores for each feature:
    • Latitude/longitude: geographic (haversine) distance between coordinates, with linear similarity
    • First/last name: normalized Levenshtein similarity between strings
    • Unique identification number: linear similarity between numbers (these IDs were assigned based on location and family, so a linear similarity was applicable)
    • Phone number: exact similarity
    • Health area: exact similarity (these areas were based on where the patient received care, so exact similarity was applicable)
  • Utilizes blocking by gender and either birth date or phonetic name, for efficiency and accuracy
  • Utilizes chunking by rows of 5000 for efficiency
  • Creates final probability of match using Expectation/Conditional Maximization algorithm, which weights probabilities using conditional maximum likelihood estimators
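To show what the similarity scores, blocking and chunking in the list above look like in practice, here is a minimal standard-library sketch. It does not call the record linkage package and is not the project’s notebook: the function names, the 10 km and 100-unit scales, the field names and the sample records are all made up for illustration, and blocking is shown on gender plus birth date only.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def linear_similarity(distance, scale):
    """1.0 at distance 0, falling linearly to 0.0 at `scale` and beyond."""
    return max(0.0, 1.0 - abs(distance) / scale)


def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Levenshtein distance normalized to a 0..1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def block_key(rec):
    """Only records sharing this key are ever compared."""
    return (rec["gender"], rec["birth_date"])


def candidate_pairs(year0, year1, key=block_key):
    """Yield (year 0 record, year 1 record) pairs that fall in the same block."""
    blocks = defaultdict(list)
    for rec in year1:
        blocks[key(rec)].append(rec)
    for rec0 in year0:
        for rec1 in blocks.get(key(rec0), []):
            yield rec0, rec1


def chunks(rows, size=5000):
    """Split a list of rows into pieces of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def compare(rec0, rec1):
    """Similarity score per field for one candidate pair (scales are assumed)."""
    dist = haversine_km(rec0["lat"], rec0["lon"], rec1["lat"], rec1["lon"])
    return {
        "geo": linear_similarity(dist, scale=10.0),
        "first_name": name_similarity(rec0["first_name"], rec1["first_name"]),
        "id": linear_similarity(rec0["id"] - rec1["id"], scale=100.0),
        "phone": float(rec0["phone"] == rec1["phone"]),
    }


# Made-up records, not real patient data.
year0 = [{"id": 1001, "first_name": "aminata", "gender": "F",
          "birth_date": "1990-05-01", "lat": 14.07, "lon": -3.52, "phone": "70000001"}]
year1 = [{"id": 1003, "first_name": "aminta", "gender": "F",
          "birth_date": "1990-05-01", "lat": 14.08, "lon": -3.52, "phone": "70000001"},
         {"id": 2050, "first_name": "amadou", "gender": "M",
          "birth_date": "1988-11-12", "lat": 14.30, "lon": -3.40, "phone": "70000002"}]

scores = [compare(r0, r1)
          for chunk in chunks(year0)
          for r0, r1 in candidate_pairs(chunk, year1)]
```

Blocking is what keeps the comparison count manageable: without it, every year 0 record would be compared with every year 1 record, while here the second year 1 record is never scored because it falls in a different block.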

03: final datasets

There are three final datasets necessary for the project:

  • Review dataset: This dataset has one row per patient and year, with the same Master Person Index linking them across years, but only for those with multiple matches. This dataset is used for Muso’s manual review and flagging of most accurate match.
  • Database dataset: This dataset has one row per patient and year, with the same Master Person Index linking them across years. This dataset is used to upload into Muso’s database.
  • Modeling dataset: This dataset has only one row per patient with the most up to date survey data. This dataset is used to run the next year’s modeling.
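One way to picture how the three datasets relate to each other is the sketch below. It is a simplified illustration under several assumptions that are not confirmed by the project: the function and field names are hypothetical, each year 1 record is linked to at most one year 0 patient, unlinked year 1 records receive a new Master Person Index, and patients still awaiting manual review are left out of the modeling dataset.

```python
from collections import defaultdict


def build_datasets(year0_ids, year1_ids, links):
    """Return (review, database, modeling) row lists from accepted links."""
    # Assign one Master Person Index (mpi) per year 0 patient.
    mpi_of = {pid: n for n, pid in enumerate(year0_ids, start=1)}
    candidates = defaultdict(list)
    for y0, y1 in links:
        candidates[y0].append(y1)

    # Database dataset: one row per patient and year, sharing the same mpi.
    database = []
    for y0 in year0_ids:
        database.append({"mpi": mpi_of[y0], "year": 0, "record_id": y0})
        for y1 in candidates.get(y0, []):
            database.append({"mpi": mpi_of[y0], "year": 1, "record_id": y1})

    # Year 1 records with no link are treated as new people (assumption).
    linked_year1 = {y1 for _, y1 in links}
    next_mpi = len(year0_ids) + 1
    for y1 in year1_ids:
        if y1 not in linked_year1:
            database.append({"mpi": next_mpi, "year": 1, "record_id": y1})
            next_mpi += 1

    # Review dataset: only people with more than one candidate match.
    ambiguous = {mpi_of[y0] for y0, c in candidates.items() if len(c) > 1}
    review = [row for row in database if row["mpi"] in ambiguous]

    # Modeling dataset: the most recent row per person, skipping people
    # whose match still needs manual review (assumption).
    latest = {}
    for row in database:
        if row["mpi"] in ambiguous:
            continue
        current = latest.get(row["mpi"])
        if current is None or row["year"] > current["year"]:
            latest[row["mpi"]] = row
    modeling = list(latest.values())
    return review, database, modeling


# Made-up IDs for illustration.
review, database, modeling = build_datasets(
    year0_ids=["a", "b", "c"],
    year1_ids=["x", "y", "z", "w"],
    links=[("a", "x"), ("a", "y"), ("b", "z")],
)
```

In this toy run, patient "a" has two candidate matches and lands in the review dataset, while "b" carries forward into the modeling dataset through the year 1 record "z".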

Impact and learnings from creating the solution

Because Muso will be able to build a Master Person Index for these three study years and for the other datasets that Muso uses, they’ll be able to better track patients across years and measure not only the impact of proactive case management on child mortality, but also the impact of other factors on their patients. Moving forward, Muso plans to use this Master Person Index to perform record linkages and update patient tables in important secondary data sources for the randomized controlled trial, including a mobile application used by Muso community health workers to record their delivery of primary health care services, as well as an electronic medical record system. This linkage of patient records across years and settings will allow Muso to better leverage available data on patient health care service delivery and to answer important secondary study aims, such as the fidelity and cost of delivering the proactive care intervention model. These lessons will contribute to understanding the factors behind the success of this model and will be shared throughout the global health community.

We have some learnings to share from this project:

  • Level-setting is important as timelines change. Although the main objective outlined above was to link three study years, over the course of the project Muso refined their objective and also wanted the capability to link other datasets to these study years. We were able to meet this secondary objective by ensuring that the code would link records whether or not each field was available in a dataset.
  • A growth mindset can bring teams together. Because our volunteer team was fluent in Python, and the technical team at Muso used Stata, we created our work in user-friendly Python notebooks, but also took the time to ensure the technical team at Muso was comfortable with modifying the notebooks. We were fortunate that Muso was open to learning a new language to utilize the model.
  • The partner’s capacity is an important guide in creating solutions. Initially, we had created a dataset with an average of 8 matches per patient. Although we all agreed this was a success given the data available, we also agreed a change was necessary to move the project forward, given the resources available to review matches. We modified the modeling file so that, for any year 0 patient with multiple matches, if one of those matches was an exact match on the original ID, only that match was kept. This dropped the average to the 1.5 matches per patient shown above.
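The filter described in the last point can be sketched in a few lines. This is an illustration under assumptions, not the project’s code: the function name is hypothetical, links are represented as simple (year 0 ID, year 1 ID) pairs, and an “exact match of the original ID” is taken to mean that the year 1 record carries the same ID number as the year 0 record.

```python
from collections import defaultdict


def keep_exact_id_matches(links):
    """For patients with several candidates, keep only the exact-ID match if one exists."""
    by_patient = defaultdict(list)
    for year0_id, year1_id in links:
        by_patient[year0_id].append(year1_id)

    kept = []
    for year0_id, candidates in by_patient.items():
        if len(candidates) > 1 and year0_id in candidates:
            # An exact ID match exists, so the other candidates are dropped.
            candidates = [year0_id]
        kept.extend((year0_id, year1_id) for year1_id in candidates)
    return kept


# Made-up links: patient 101 has three candidates, one with the same ID.
links = [(101, 101), (101, 205), (101, 309), (102, 410), (102, 411), (103, 103)]
filtered = keep_exact_id_matches(links)
```

Patient 102 keeps both candidates because neither shares the original ID, so those cases still go to manual review. Only the cases with a clear tie-breaker are resolved automatically.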


Original histogram of matches per year 0 patient with an average of 8 per patient


Thanks to Jane Yang and her colleagues at Muso for being engaged with us throughout the project and ensuring its success. We’d also like to recognize our team from DataKind: our Data Ambassador Justine Schott, and our volunteers Madeline Campbell, Adam Jones, Xi Liang, Eddie Mattia, Manjula Mishra, Julia Sachs, Will Schuerman, and Shengjie Zhou. Thank you all for contributing your valuable time to make this project a success!


Julia, Xi, and Justine during the first DataDive weekend in summer 2019

Join Us

To get involved in current and upcoming projects with DataKind San Francisco, please check the DataKind website or follow DataKind San Francisco on Facebook or LinkedIn for more information.

The header image above courtesy of Muso.
