By DataKind San Francisco
Hello Sunday Morning (HSM) is an Australia-based nonprofit that has grown to be the largest online movement for alcohol behavior change. The organization aims to change alcohol behavior around the world by helping its members reduce high-risk drinking through peer support and counseling. It does this through Daybreak, its virtual peer support social network, which enables members to communicate with and support each other. On the platform, members can create posts to share their experiences with alcohol, and other members can react to (like/love) or comment on (text) those posts. Daybreak relies on moderators to maintain safety within the community and ensure participation is in line with community guidelines.
Currently, moderators flag problematic posts (either those that indicate potentially harmful behavior or those in breach of community guidelines), which are then escalated to a clinical team. The clinical team further analyzes the post and takes action as needed. As membership grows, the moderators' task is becoming unmanageable, with hundreds of thousands of community activities (posts, comments, reactions) to review and flag where necessary. Moderators need an automated, efficient, and scalable way to flag and categorize activity that is either risky or in breach of community guidelines.
DataKind San Francisco had the privilege of working with Hello Sunday Morning on building a solution to identify and categorize problematic posts.
The first part of the partnership with HSM involved working on a machine learning model to help identify risky posts on the Daybreak platform. HSM pulled and provided historical (January - September 2019), labeled post data to help build the model. This data consisted of the raw text of each post (with PII removed), the timestamp of the post, and its risk category (if any). While there was a large amount of data, there was significant class imbalance, with fewer than 0.1% of the posts categorized as risky.
Utilizing the historical post data, a model was built to predict the probability of a post being risky. The steps of this process can be seen below:
HSM tested the model on a sample of Daybreak post data that the model had not seen (November 2019 - January 2020). Different probability thresholds produced different scores, as shown in the table below.
Table 1: Model performance on test data at varying thresholds
Metric | Threshold = 0.1 | Threshold = 0.3 | Threshold = 0.5 | Threshold = 0.7 |
Recall | 0.8 | 0.5 | 0.3 | 0.2 |
Precision | 0.8 | 0.9 | 0.9 | 0.9 |
F1 score | 0.8 | 0.7 | 0.5 | 0.4 |
Because missing a risky post can have serious consequences, it is important to minimize false negatives even if that means increasing false positives: it is preferable to incorrectly label some non-risky posts as risky than to miss a post that is genuinely risky. HSM is therefore looking to use a probability of 0.1 as the threshold for determining risky posts, since it gives the highest recall of the thresholds tested.
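To make the threshold trade-off concrete, here is a minimal sketch of how recall, precision, and F1 can be computed at several thresholds from predicted probabilities. The function names (`metrics_at_threshold`, `sweep`) and the toy labels and scores are invented for illustration; they are not HSM's data or the project's actual evaluation code.

```python
from typing import Dict, List, Sequence


def metrics_at_threshold(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Dict[str, float]:
    """Precision, recall and F1 when posts scoring >= threshold are flagged as risky."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1  # risky post correctly flagged
        elif flagged and label == 0:
            fp += 1  # non-risky post flagged by mistake
        elif not flagged and label == 1:
            fn += 1  # risky post missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"threshold": threshold, "precision": precision, "recall": recall, "f1": f1}


def sweep(
    y_true: Sequence[int],
    scores: Sequence[float],
    thresholds: Sequence[float] = (0.1, 0.3, 0.5, 0.7),
) -> List[Dict[str, float]]:
    """Evaluate the same predictions at each candidate threshold."""
    return [metrics_at_threshold(y_true, scores, t) for t in thresholds]


# Toy labels (1 = risky) and model probabilities, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.05, 0.4, 0.08, 0.02, 0.01, 0.03, 0.04]

results = sweep(y_true, scores)
for row in results:
    print(row)
```

Raising the threshold can only shrink the set of flagged posts, so recall never increases as the threshold goes up. That is why the lowest threshold is the natural choice when missed risky posts are the costlier error.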
Alongside the model, a heuristic approach was used to identify posts containing potentially problematic words. HSM provided a list of keywords that moderators look out for when assessing a post, and a simple keyword identifier was built to flag posts that contain any of those keywords.
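A keyword identifier of this kind could look roughly like the sketch below. It assumes whole-word, case-insensitive matching, which the write-up does not specify, and `PLACEHOLDER_KEYWORDS` is a made-up stand-in for HSM's actual keyword list, which is not reproduced here.

```python
import re
from typing import Iterable, List, Pattern

# Hypothetical placeholders; the real list was supplied by HSM moderators.
PLACEHOLDER_KEYWORDS = ["relapse", "black out"]


def build_keyword_pattern(keywords: Iterable[str]) -> Pattern[str]:
    """Compile one case-insensitive, whole-word pattern covering all keywords."""
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        raise ValueError("at least one keyword is required")
    # Longest first, so multi-word phrases win over shorter overlapping keywords.
    escaped = sorted((re.escape(k) for k in cleaned), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def find_keywords(text: str, pattern: Pattern[str]) -> List[str]:
    """Return the distinct keywords found in a post, lowercased and sorted."""
    return sorted({match.group(0).lower() for match in pattern.finditer(text)})


pattern = build_keyword_pattern(PLACEHOLDER_KEYWORDS)
print(find_keywords("I had a Relapse last night", pattern))
print(find_keywords("Feeling good today", pattern))
```

A post is flagged whenever `find_keywords` returns a non-empty list. Whole-word matching avoids hits inside unrelated words, but it also means inflected forms such as "relapsed" are not caught unless they are listed explicitly.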
The second part of the partnership with HSM involved working on techniques to help identify breach posts on the Daybreak platform. Breach posts were defined as posts containing PII or profanity. Because the historical data contained only a small number of labeled breach posts, and those examples do not cover every kind of breach that could appear in the future, heuristic methods were favored over a trained model.
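As a rough illustration of what heuristic breach detection can look like, the sketch below checks a post for email addresses, phone-like digit runs, and words from a profanity list. The regular expressions, the `PLACEHOLDER_PROFANITY` set, and the `check_breach` function are all assumptions made for this example; they are not the rules the project actually used.

```python
import re
from typing import List

# Assumed patterns for illustration; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least 7 digits/separators, then a digit (9+ characters in total).
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical mild placeholders standing in for a real profanity list.
PLACEHOLDER_PROFANITY = {"darn", "heck"}


def check_breach(text: str) -> List[str]:
    """Return the reasons a post looks like a breach; an empty list means no hit."""
    reasons = []
    if EMAIL_RE.search(text):
        reasons.append("pii:email")
    if PHONE_RE.search(text):
        reasons.append("pii:phone")
    words = re.findall(r"[a-z']+", text.lower())
    if any(word in PLACEHOLDER_PROFANITY for word in words):
        reasons.append("profanity")
    return reasons


print(check_breach("Email me at jane.doe@example.com"))
print(check_breach("Call 0412 345 678, darn it"))
print(check_breach("Thirty days sober today"))
```

Rules like these are deliberately broad. The phone pattern, for example, will also match other long digit runs such as a year range, so flagged posts still need a moderator's judgment.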
The steps taken to identify breach posts can be seen below:
To make the solutions for detecting risky and breach posts usable, a Flask app was built. The Flask app has three endpoints which output the following:
The outputs of the Flask app are intended as tools to help HSM moderators assess risky and breach posts on the Daybreak platform.
Hello Sunday Morning team. Photo courtesy of Hello Sunday Morning.
We want to thank Hello Sunday Morning for entrusting DataKind San Francisco with this project and for collaborating with us on it. It's our hope that the solutions built throughout this engagement help HSM improve the anonymous and supportive environment they've created for members to change their relationship with alcohol. We also want to recognize the Data Ambassador, Jaya Pokuri, who worked with HSM for the duration of the project.