Predicting Well Groundwater Quality Using Cloud-Based Machine Learning: DataKind San Francisco Partners with Aquaya
May 02, 2023

By DataKind San Francisco


Access to clean water, sanitation, and hygiene (WASH) is critical for healthy and humane living conditions. Globally, two billion people lack sufficient access to safely managed drinking water at home, according to the CDC.

Increasing access to safe drinking water, adequate sanitation, and hygiene is essential for improving quality of life, leading to greater school attendance, fewer sick days, and an increased sense of safety and dignity. However, in remote or rural areas, it’s challenging to obtain the high-quality data needed to target the people most in need and to make rigorous decisions.

Aquaya, a data-driven nonprofit organization operating in 24 countries around the world, was founded to help overcome these challenges by combining data collection, advanced analytics, and research to help make informed WASH program, policy, and financing decisions.

DataKind San Francisco (DKSF) was privileged to work with Aquaya throughout 2022, initially on a DataCorps® project analyzing satellite imagery, and more recently on a Data Advisory project described below to incorporate cloud-based storage and machine learning models to predict water well quality in remote regions of Uganda. Through these collaborations, DKSF was able to partner with Aquaya to centralize data and deploy scalable models in a seamless and cost-effective manner. 

A water well. (Source: Getty Images) 

Data Advisory Project Execution

For this project, DKSF collaborated with Aquaya to build machine learning models that predict the groundwater quality of wells in Uganda. Historical measurements of E. coli levels in wells (an indicator of fecal contamination in water) were available from various organizations, including Water and Health for All, Whave, Water Mission, Charity Water, BGS, and Aquaya. These measurements were recorded at daily resolution, though only sporadically, from 2010 to 2022 and provided a categorical assessment of E. coli levels. In addition, features describing the area surrounding each well were extracted from satellite imagery to help train the machine learning models.

At a high level, the Data Advisory team: (1) identified and consolidated all datasets containing relevant features and predictors, (2) set up AWS tools including S3 and SageMaker, and (3) applied modeling techniques to the training data to predict well groundwater quality.

Data Preparation 

DKSF’s first initiative was to assist the Aquaya team in consolidating all datasets with relevant features and predictors. Aquaya had around 27 GB of heterogeneous data stored on an external hard drive in CDF, TIFF, CSV, and TXT file formats. The team helped extract the well groundwater quality data as well as other parameters such as the presence of livestock, poverty levels, population density, and land cover characteristics around the well. Where possible, these were combined with latitude, longitude, temperature, and precipitation information for each measurement.

Next, DKSF advised Aquaya on cloud storage solutions, and Aquaya uploaded all the relevant data to Amazon Web Services’ S3 cloud storage service, which could handle their workload while keeping costs as low as possible.

DKSF then set up a SageMaker notebook instance with Jupyter notebooks, using a combination of R and Python, to leverage AWS’s built-in machine learning pipeline infrastructure. DKSF helped Aquaya set up notebook instances with R, as they were more cost-effective than RStudio on SageMaker.

After preparing the data, DKSF and Aquaya leveraged Amazon’s SageMaker Autopilot feature, which makes machine learning classification and regression analysis easy to deploy from tabular data stored in S3. Through this final step, the team was able to quickly segment the data into train and test splits (split chronologically by timestamp), evaluate the performance of multiple models, and fine-tune parameters to select the top-performing model.
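The chronological split mentioned above can be sketched as follows. This is a minimal illustration rather than the Autopilot configuration the team used: the record layout, the `date` and `unsafe` field names, and the 80/20 ratio are all assumptions made for the example.

```python
from datetime import date


def chronological_split(records, date_key="date", train_fraction=0.8):
    """Split records so every training row is no later than every test row."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    # Sort by timestamp first, then cut once: no future rows leak into training.
    ordered = sorted(records, key=lambda r: r[date_key])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical well measurements (0 = safe, 1 = unsafe).
measurements = [
    {"well_id": "A", "date": date(2016, 3, 1), "unsafe": 0},
    {"well_id": "B", "date": date(2012, 7, 15), "unsafe": 1},
    {"well_id": "A", "date": date(2018, 1, 9), "unsafe": 1},
    {"well_id": "C", "date": date(2015, 11, 30), "unsafe": 0},
    {"well_id": "B", "date": date(2017, 5, 2), "unsafe": 0},
]

train, test = chronological_split(measurements)
print(len(train), "training rows;", len(test), "test rows")
```

Splitting on time rather than at random keeps later measurements out of the training set, so the test score better reflects how a model would perform on wells sampled after it was built.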

Exploring the Data

The first step in cleaning and exploring the data was to consolidate data across various file formats into one unified CSV and then drop or impute any missing values of key variables. The target variable (well groundwater quality) was available at the daily level per well for roughly 10 years of history. Days with missing values for water quality were removed, as were features for which a majority of the data was missing.

Figure 1: Number of records available over time. The figure above depicts the number of aggregated well-days per month where well groundwater quality data was available. Most high quality data was collected between 2015 and 2018.
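The cleaning rules described in this section (drop rows with no water quality value, then drop features that are missing in a majority of rows) might look roughly like the sketch below. The column names and the 0.5 threshold are assumptions for illustration, and imputation of the remaining gaps is not shown.

```python
def clean_records(records, target="quality", max_missing_fraction=0.5):
    """Drop rows with no target, then drop features that are mostly missing."""
    rows = [r for r in records if r.get(target) is not None]
    if not rows:
        return []
    features = {key for r in rows for key in r if key != target}
    # Keep a feature only if at most half of the remaining rows lack it.
    keep = {
        f for f in features
        if sum(r.get(f) is None for r in rows) / len(rows) <= max_missing_fraction
    }
    return [
        {target: r[target], **{f: r.get(f) for f in sorted(keep)}}
        for r in rows
    ]


# Hypothetical consolidated rows; None marks a missing value.
raw = [
    {"quality": 0, "precip_mm": 1.2, "poverty": None},
    {"quality": None, "precip_mm": 0.0, "poverty": 0.4},
    {"quality": 2, "precip_mm": None, "poverty": None},
    {"quality": 1, "precip_mm": 3.4, "poverty": None},
]

cleaned = clean_records(raw)
print(cleaned)
```

In this toy input the `poverty` column is dropped because it is missing in every row that has a target, while `precip_mm` is kept with one gap left for later imputation.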

DKSF used several exploratory techniques to better understand the features related to the outcome of interest, groundwater quality. A correlation matrix showed which features had the highest correlation with well groundwater quality and also identified collinear features; see the visualization below. Of particular interest, the features with the highest correlation included region and daily precipitation. In addition, many features were correlated with each other, such as the livestock-related features, suggesting some feature reduction was necessary.

Figure 2: Correlation Matrix Results. The visualization above depicts the correlation matrix between well groundwater quality and some selected features.

Feature Engineering and Model Building 

Using the results of our correlation matrix, DKSF and Aquaya removed similar features to eliminate redundancy. For instance, the original dataset had three separate variables indicating the volume of livestock poultry within 5000, 2000, and 1000 kilometers of the well. All three of these variables were found to have a similar impact on water quality, so Aquaya kept the one with the highest Pearson coefficient.

Similar methods were used for other features such as grassland coverage around the well, for which separate variables were originally present at several radii. In these scenarios, the team kept the land cover feature with the highest Pearson coefficient with the predicted outcome.
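As a rough sketch of this selection step, the snippet below computes the Pearson coefficient of each radius variant against the outcome and keeps the strongest one. The feature names and values are hypothetical, and ranking by the absolute value of the coefficient is an assumption; the text above only says the highest coefficient was kept.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / sqrt(var_x * var_y)


def strongest_feature(columns, target):
    """Name of the column whose |r| with the target is largest (assumed rule)."""
    return max(columns, key=lambda name: abs(pearson(columns[name], target)))


# Hypothetical grassland coverage at three radii, plus a binary outcome.
outcome = [0, 1, 1, 0, 1]
grass_cover = {
    "grass_1000": [0.1, 0.8, 0.7, 0.2, 0.9],
    "grass_2000": [0.3, 0.5, 0.6, 0.4, 0.5],
    "grass_5000": [0.5, 0.4, 0.6, 0.5, 0.5],
}

kept = strongest_feature(grass_cover, outcome)
print("kept:", kept)
```

The other two variants would then be dropped from the feature set, leaving one representative per group of collinear variables.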

The goal of this project was to determine whether groundwater extracted from wells is safe to drink according to WHO standards. We therefore converted the multi-class output (originally scored as 0 for clean, 1 for acceptable, and 2 or 3 for non-potable water) into a binary variable, with 0 for safe water and 1 for unsafe water.
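The recoding can be written as a small mapping function. This sketch assumes that the "acceptable" category (1) counts as safe alongside "clean" (0), which is how the description above reads; the function name is hypothetical.

```python
def to_binary_label(category):
    """Map the four-level category (0-3) to 0 = safe, 1 = not safe."""
    if category not in (0, 1, 2, 3):
        raise ValueError(f"unexpected category: {category!r}")
    # Assumption: 0 (clean) and 1 (acceptable) are safe; 2 and 3 are not.
    return 0 if category <= 1 else 1


labels = [to_binary_label(c) for c in [0, 1, 2, 3, 1]]
print(labels)
```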

Listed below are the models that DKSF and Aquaya built using libraries such as PyCaret, scikit-learn, and SageMaker Studio AutoML. For all these libraries, the team leveraged cloud-based tools such as AWS and Google Colab to give Aquaya a seamless analytics workflow.


Models compared (the metric values, including F1 score, are missing from this copy of the table):

  • SageMaker Studio: Weighted Ensemble (AutoML)
  • Histogram-based Gradient Boosting
  • Extra Trees Classifier
  • Random Forest Classifier
  • K Neighbors Classifier
  • Logistic Regression

Figure 3: Model Performance Table

In the end, the team selected the weighted ensemble model from AWS AutoML because it had the highest observed precision. DKSF and Aquaya chose to optimize for precision, treating "safe" as the positive class, because the team wanted to minimize the number of false positives (i.e., wells where groundwater is predicted as safe when in reality it isn’t). The team was less concerned with recall for this application because false negatives are acceptable (i.e., predicting water as not potable when it actually is potable is the safe choice).

Figure 4: Feature Importance graph. The visualization above depicts the top predictive features in our winning model (weighted ensemble).
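To make the precision-versus-recall reasoning in this section concrete, here is a small sketch that computes precision, recall, and F1 with "safe" as the positive class. The labels are invented for illustration, and the 0 = safe, 1 = unsafe coding follows the binary variable described earlier; which class the team's tooling actually treated as positive is an assumption.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


SAFE = 0  # assumed coding: 0 = safe, 1 = unsafe

# Hypothetical ground truth and predictions for eight wells.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=SAFE)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here one unsafe well is wrongly predicted as safe, which is the costly error that precision penalizes, while the two safe wells flagged as unsafe only lower recall.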

Looking Ahead: Next Steps 

While a final model with impressive scoring metrics was built during the course of this project, the DKSF team had a few thoughts for follow-up work to further improve model performance and aid the dissemination of knowledge across Aquaya.

One direction to improve the model is to bin the high-variability quantitative variables into 15-20 bins per variable. Many of the 35 input features have high variability. In the relatively small dataset the team was working with, such variability can introduce noise and outliers, which are more likely to influence model results. Binning quantitative variables could improve model scores and also make the model easier to interpret.
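One simple way to do this is equal-width binning, sketched below. The choice of equal-width bins is an assumption (quantile-based bins are a common alternative), and the precipitation values are invented.

```python
def equal_width_bins(values, n_bins=15):
    """Assign each value a bin index from 0 to n_bins - 1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant column: everything in one bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Small check with 4 bins over the range 0-10.
demo = equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)

# Hypothetical daily precipitation (mm), binned into 15 bins as suggested above.
precip_mm = [0.0, 0.4, 1.1, 3.8, 12.5, 27.0, 44.2, 60.0]
precip_bins = equal_width_bins(precip_mm, n_bins=15)
print(demo, precip_bins)
```

With skewed variables such as precipitation, equal-width bins leave many bins nearly empty, which is one reason quantile bins may be worth comparing.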

Another idea to improve the model is to build a model specific to each region in Uganda. As observed from the feature importance graph in the previous section, the most important feature for model training was the region of the well. This suggests there is potentially high variability in the amount (refer to Figure 5) and type of data collected in each region. By building models specific to each region, the team can remove this region-associated variability, and optimize model parameters to relevant groupings of wells. However, it’s also possible that this may lead to overfitting as the sample size that each model is trained on is reduced.  

Figure 5: Distribution of Data collected from Regions.
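The overfitting concern raised for per-region models suggests checking how many rows each region has before giving it its own model. The sketch below groups rows by region and pools the regions that fall under a minimum size; the region names, the `min_rows` threshold, and the pooling fallback are assumptions for illustration, not part of the project.

```python
from collections import defaultdict


def partition_by_region(records, min_rows, region_key="region"):
    """Group rows by region; pool regions too small for their own model."""
    groups = defaultdict(list)
    for row in records:
        groups[row[region_key]].append(row)
    large = {name: rows for name, rows in groups.items() if len(rows) >= min_rows}
    pooled = [
        row
        for name, rows in groups.items()
        if len(rows) < min_rows
        for row in rows
    ]
    return large, pooled


# Hypothetical rows; real regions and counts would come from the dataset.
rows = [
    {"region": "North", "unsafe": 1},
    {"region": "North", "unsafe": 0},
    {"region": "North", "unsafe": 1},
    {"region": "East", "unsafe": 0},
    {"region": "West", "unsafe": 1},
    {"region": "West", "unsafe": 0},
]

per_region, pooled = partition_by_region(rows, min_rows=3)
print(sorted(per_region), len(pooled))
```

Each entry in `per_region` could then be used to train a region-specific model, with the pooled rows handled by a shared model.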

Finally, in order to improve dissemination of knowledge more extensively across Aquaya, a next step would be to create an interactive visual dashboard that maps well locations and predicted water quality across Uganda. This would enable internal members of the organization to understand historic trends of well groundwater quality as well as future trends. In the long-term, this could help to influence where Aquaya resources should be allocated based on areas that have the highest need for improvement. 

As Data Advisory grows, DKSF hopes to partner with more nonprofits in need of data infrastructure guidance! We hope this project can be a useful framework for future data advisory projects requiring cloud computing. If you or your organization are looking to partner with us, reach out to learn more.

"Aquaya's experience with DataKind was truly exceptional. Their assistance in creating a low-cost data pipeline and offering valuable guidance on data analysis enabled us to identify the best platform and the most suitable machine-learning models. We are grateful for DataKind's strong involvement in our project, their availability and their insightful recommendations that increased Aquaya's efficiency." ~Chloé Poulin, PhD, Senior Research and Program Manager, Aquaya Institute

The Data Advisory project team and those who co-wrote this blog included:

  • Chloé Poulin, Senior Research Manager, Aquaya Institute
  • DKSF Volunteers: Anjana Sundaram, Jaya Pokuri, Padma Chandramouli

We’d like to also thank Melinda Tellez (DKSF) and Melissa DiLoreto (DataKind) for their support and guidance throughout this project.

Header image above: Hand pump repair in northern Uganda. (Source: The Aquaya Institute)
