By DataKind San Francisco
Worldreader is a global nonprofit working with partners to help underserved communities transform through the power of digital reading. The organization provides students and their families with a digital library available on tablets, e-readers, and mobile phones, complemented by a suite of reading support activities. To track engagement on their web-based platform, Worldreader analyzes anonymized user activity. When users register and sign in, tracking engagement is straightforward. However, when users don’t sign in, the process for “unregistered user activity” is more complex. Worldreader and DataKind San Francisco partnered to better understand unregistered user activity, monitor impact, and analyze engagement.
Photo above courtesy of Worldreader
Worldreader has built a strong engineering and data infrastructure. Reading activity is processed and stored in an Amazon Redshift database, generating millions of rows of data per month. This activity is then further processed as part of a Jenkins analytics pipeline that powers regular daily and weekly reports. In the specific case of unregistered user activity, Worldreader generates a set of custom labels called “client IDs” to stand in for the registered “user ID” equivalent.
However, in some cases, it appears that multiple client IDs are assigned to the same true user. For example, a reader on page 56 of a book under one client ID might be assigned a different client ID upon flipping to page 57. This is a challenge because Worldreader evaluates impact and analyzes engagement by understanding this user activity. Duplicated client ID labels for the same reader would inflate the overall user count.
As a result, Worldreader currently adjusts user counts conservatively downward based on their internal research. DataKind San Francisco had two objectives for this project:
- Estimate the level of user duplication: This number ultimately informs the end metric of user engagement and so is the most important to nail down.
- Investigate methods to attribute reading activity: Being able to attribute specific reading activity across duplicated IDs would give some more confidence to the overall duplication estimate. It would also enable targeted analysis of user reading activity and potentially help the Worldreader team identify the source of the duplication errors.
Objective 1: Estimating the level of duplication
DataKind San Francisco started by assessing the size of the problem. A useful initial analysis is to inventory the data available. In this case, we realized that we had a potentially simple and powerful benchmark in registered user activity. Because the Worldreader system still generates internal client IDs even when users have signed in, comparing the client IDs with registered user IDs essentially provides an answer key.
We compared the number of unique users based on actual sign-ins vs. the Worldreader-generated client IDs and derived a roughly consistent duplication rate of around 9% for data from 2018 to 2019. This was lower than Worldreader’s estimates to date, suggesting that the number of active readers was higher than previously reported!
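As a rough illustration of this comparison, the sketch below computes a duplication rate from registered activity. The exact formula the team used is not stated here, so the definition (one minus the ratio of unique user IDs to unique client IDs) is an assumption, as are the function name and the toy records.

```python
def duplication_rate(events):
    """Estimate duplication from registered activity.

    `events` is an iterable of (user_id, client_id) pairs.
    Assumed definition: 1 - unique users / unique client IDs.
    """
    events = list(events)
    user_ids = {user_id for user_id, _ in events}
    client_ids = {client_id for _, client_id in events}
    if not client_ids:
        raise ValueError("no events supplied")
    return 1 - len(user_ids) / len(client_ids)


# Toy data (hypothetical): user u1 was assigned two client IDs.
sample = [("u1", "c1"), ("u1", "c2"), ("u2", "c3"), ("u3", "c4")]
rate = duplication_rate(sample)
print(f"duplication rate: {rate:.0%}")
```

Under this definition, 100 true users appearing as 110 client IDs would give a rate of about 9%.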
This estimate relies on the assumption that registered and unregistered activity have similar rates of duplication. To test this, we looked at the IP address distribution of the two activity groups. For the same two-year time period, two IP-address-based metrics were relatively consistent between the two groups: the average number of IP addresses per client ID and proportion of client IDs tied to more than one IP address. This close alignment validated the assumption and provided more confidence in the estimate.
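The two IP-address-based metrics can be sketched as follows. This is only an illustration: the record layout, function name, and sample addresses are hypothetical, not Worldreader's schema.

```python
from collections import defaultdict


def ip_metrics(events):
    """Return (average IPs per client ID, share of client IDs with >1 IP).

    `events` is an iterable of (client_id, ip_address) pairs.
    """
    ips_by_client = defaultdict(set)
    for client_id, ip_address in events:
        ips_by_client[client_id].add(ip_address)
    counts = [len(ips) for ips in ips_by_client.values()]
    if not counts:
        raise ValueError("no events supplied")
    average_ips = sum(counts) / len(counts)
    share_multi_ip = sum(1 for c in counts if c > 1) / len(counts)
    return average_ips, share_multi_ip


# Hypothetical toy data for each activity group.
registered = [("c1", "10.0.0.1"), ("c1", "10.0.0.2"), ("c2", "10.0.0.3")]
unregistered = [("c9", "10.0.1.1"), ("c9", "10.0.1.1"),
                ("c8", "10.0.1.2"), ("c8", "10.0.1.3")]

reg_metrics = ip_metrics(registered)
unreg_metrics = ip_metrics(unregistered)
print(reg_metrics, unreg_metrics)
```

If the two tuples are close for the same time period, that supports treating the registered duplication rate as a proxy for unregistered activity.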
Objective 2: Investigating methods to attribute reading activity
The next step after assessing the size of the problem is to attribute specific reading activity to the deduplicated users. This would involve analyzing characteristics such as the books and page numbers involved, IP addresses, web browser type (e.g. Firefox), and country setting (e.g. Nigeria). This serves three purposes. First, it adds further confidence to the estimate from the previous objective. Second, it enables more targeted analysis of aggregate user behavior: for example, the number of unique users reading a given book. Third, looking at characteristics related to duplication could help pinpoint the bug that is causing client IDs to be duplicated.
Looking again to the same “answer key” of registered user activity, we defined a relatively straightforward heuristic involving IP address, web browser type, and country setting. If reading activity came from the same unique IP address, browser, and country combination as that of a previous ID, we assigned it to that user; if it was a new unique combination, we generated a new ID and assigned that to the user instead. Using this approach, we reduced the number of unique IDs by 4%. This reduction was less than the 9% estimate we obtained previously, which makes sense when considering the possibility of users sharing devices that would show the same IP address, browser, and country.
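A minimal sketch of this kind of heuristic is shown below. The field names, the "imputed-N" label format, and the sample records are assumptions for illustration, not the project's actual code.

```python
def impute_user_ids(events):
    """Assign one imputed ID per unique (ip, browser, country) combination.

    `events` is a list of dicts with "ip", "browser", and "country" keys.
    Returns the imputed ID for each event, in input order.
    """
    id_by_combo = {}
    imputed = []
    for event in events:
        combo = (event["ip"], event["browser"], event["country"])
        if combo not in id_by_combo:
            # New combination: generate a fresh ID.
            id_by_combo[combo] = f"imputed-{len(id_by_combo) + 1}"
        imputed.append(id_by_combo[combo])
    return imputed


# Hypothetical records: c1 and c2 share the same combination.
events = [
    {"client_id": "c1", "ip": "10.0.0.1", "browser": "Firefox", "country": "Nigeria"},
    {"client_id": "c2", "ip": "10.0.0.1", "browser": "Firefox", "country": "Nigeria"},
    {"client_id": "c3", "ip": "10.0.0.2", "browser": "Firefox", "country": "Ghana"},
]
imputed_ids = impute_user_ids(events)
original_count = len({e["client_id"] for e in events})
reduction = 1 - len(set(imputed_ids)) / original_count
print(imputed_ids, f"reduction: {reduction:.0%}")
```

Because the key contains nothing that distinguishes people on a shared device, two real readers behind the same combination collapse into one imputed ID.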
We then looked into whether these imputed IDs could be used for more tailored aggregated reporting. This line of analysis was not as promising. Because it could not separate users sharing a device, the method sometimes grouped together multiple actual users. We also experimented with adding logic based on the books read or even the specific page progress for the books, but this in turn reduced the ability of the heuristic to identify when multiple IDs were tied to the same user. Tightening the rule to avoid merging distinct users made it miss genuine duplicates, a trade-off with no free lunch.
Finally, we looked at the statistical power of several factors in attributing reading activity to help identify potential long-term fixes. IP address was a particularly informative factor, with over 80% of IP addresses tied to just one user (shown in Figure 1). The long tail of IP addresses shared by up to 200 users, however, means this factor was not enough to solve the issue by itself.
Figure 1. Distribution of user IDs across IP addresses
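The distribution in Figure 1 could be tabulated along the following lines. This is a sketch on hypothetical records, with assumed function and variable names.

```python
from collections import Counter, defaultdict


def users_per_ip_distribution(events):
    """Count how many IP addresses are tied to 1, 2, 3, ... distinct users.

    `events` is an iterable of (user_id, ip_address) pairs.
    """
    users_by_ip = defaultdict(set)
    for user_id, ip_address in events:
        users_by_ip[ip_address].add(user_id)
    return Counter(len(users) for users in users_by_ip.values())


# Hypothetical toy data: one address is shared by two users.
events = [
    ("u1", "10.0.0.1"), ("u2", "10.0.0.2"),
    ("u3", "10.0.0.3"), ("u4", "10.0.0.3"),
    ("u5", "10.0.0.4"),
]
distribution = users_per_ip_distribution(events)
single_user_share = distribution[1] / sum(distribution.values())
print(dict(distribution), f"single-user IPs: {single_user_share:.0%}")
```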
From a troubleshooting perspective, browser type and country were more interesting. This is because some browsers had 0% duplication rates (certain Mozilla variants), while others had duplication rates more than three times the average (certain Opera variants). Country mapped closely to different duplication rates as well. Among the countries with the most users, Nigeria, the country with the most records, had a higher than average rate of duplication; Ghana and South Africa were solidly below average. Some countries, such as Jamaica, had no duplication at all. On closer inspection, this still ties back to browser type: Jamaican users were all using browsers that generated 0 duplicates. The data on different browser behaviors provides direction for where bug-tackling efforts might focus.
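A per-browser or per-country breakdown like the one described can be sketched by grouping registered activity before computing the rate. The duplication definition (one minus unique users over unique client IDs), the function name, and the placeholder group labels are all assumptions.

```python
from collections import defaultdict


def duplication_rate_by_group(events):
    """Compute an assumed duplication rate per group (browser or country).

    `events` is an iterable of (user_id, client_id, group) triples.
    """
    users = defaultdict(set)
    clients = defaultdict(set)
    for user_id, client_id, group in events:
        users[group].add(user_id)
        clients[group].add(client_id)
    return {g: 1 - len(users[g]) / len(clients[g]) for g in clients}


# Hypothetical toy data with placeholder browser labels.
events = [
    ("u1", "c1", "browser_a"), ("u1", "c2", "browser_a"),
    ("u2", "c3", "browser_b"), ("u3", "c4", "browser_b"),
]
rates = duplication_rate_by_group(events)
print(rates)
```

Sorting the resulting dictionary by rate would surface the groups most worth investigating first.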
Delivering a solution for production use
DataKind San Francisco’s solution for this problem was to deliver a Python script that could be run daily or as needed. The script calculates the level of user duplication in registered activity for a custom period of time and verifies that the IP address distributions of registered vs. unregistered groups appear similar. The script accepts any two dates as the start and endpoint for the analysis, which allows for flexibility in usage across different reporting applications. The Python script queries Worldreader’s Amazon Redshift database for the necessary user count information. The motivation was to integrate smoothly with Worldreader’s existing data pipeline and monitoring infrastructure. A graph similar to Figure 2 below can then be used to track duplication rates over time and flag any potential spikes.
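The date-range interface might look something like the sketch below. It covers only argument handling; the argument names and validation rule are assumptions, and the Redshift query itself is deliberately left out because the schema is not described here.

```python
import argparse
from datetime import date


def parse_period(argv):
    """Parse start and end dates (YYYY-MM-DD) for the analysis window."""
    parser = argparse.ArgumentParser(
        description="Estimate client ID duplication for a custom period."
    )
    parser.add_argument("start", type=date.fromisoformat,
                        help="first day of the period, e.g. 2018-01-01")
    parser.add_argument("end", type=date.fromisoformat,
                        help="last day of the period, e.g. 2019-12-31")
    args = parser.parse_args(argv)
    if args.end < args.start:
        parser.error("end date must not precede start date")
    return args.start, args.end


# In a real script this would be parse_period(sys.argv[1:]).
start, end = parse_period(["2018-01-01", "2019-12-31"])
print(f"analyzing {start} through {end}")
```

Accepting plain ISO dates keeps the script easy to call from a scheduled job or by hand for ad hoc reporting.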
Figure 2. Quarterly duplication rate over time
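One simple way to flag spikes in a series like this is to compare each period against the median. The rule, the tolerance, and the quarterly values below are hypothetical and are not the project's actual monitoring logic.

```python
from statistics import median


def flag_spikes(rates_by_period, tolerance=0.03):
    """Return periods whose rate exceeds the median by more than `tolerance`.

    `rates_by_period` maps a period label to a duplication rate (0 to 1).
    """
    baseline = median(rates_by_period.values())
    return [period for period, rate in rates_by_period.items()
            if rate - baseline > tolerance]


# Hypothetical quarterly duplication rates.
quarterly = {"Q1": 0.090, "Q2": 0.088, "Q3": 0.150, "Q4": 0.091}
spikes = flag_spikes(quarterly)
print(spikes)
```

A flagged period is a prompt to look for context, such as a new partnership or a browser change, rather than proof of a bug.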
During the course of the collaboration, a few takeaways came up:
- Balance analytical rigor with simplicity: Simple rule-based approaches led to useful insights. Graph networks and machine learning models would have been an interesting direction to explore for attributing specific reader activity given more time, but for the first objective of measuring the level of duplication, we were able to triangulate off more easily accessible metrics. The machine learning principle of identifying the ground truth (in this case, registered user activity) still applied even when not training a formal model.
- Lean on knowledge from the team: Worldreader’s team members had an incredible amount of subject-matter expertise and practical understanding from their work. For example, it was their suggestion to validate assumptions based on consistency in IP address distribution and to analyze differences in duplication rates between countries. When anomalous spikes were found in the data, they could attribute them to a recent partnership, providing context for the trend.
- Integrate with existing processes: When designing the production solution, it was helpful to identify the ways in which Worldreader already leverages its data and to build something flexible enough to slot in within Worldreader’s strong infrastructure. This reduces the cost of integrating project work, which can be a pain point in collaboration.
Thank you to the Worldreader team, Rachel Heavner, Irina Timoshenko, and Sergio Grau, for their thoughtful guidance, partnership, and energy. DataKind San Francisco is very grateful to have had the opportunity to work together toward global literacy and appreciates the work of Edwin Zhang, Data Ambassador, on the project.
Header image above courtesy of Worldreader.