By DataKind San Francisco
An unresolved conflict at work, unconscious bias from a boss, an unexpected layoff. Most of us have experienced something like this at some point, so we understand how a seemingly peaceful workplace can suddenly become tough, and how important it is to have someone to talk to who can guide us through these obstacles. But what if you don’t have a supporter to confide your stress and frustration in, or a wise mentor to point you to the light at the end of the tunnel? Certainly, you can turn to professional therapists, but what if you can’t afford one or have little or no prior experience with them? Enter Empower Work, a nonprofit organization that aims to solve this exact problem by providing free, confidential counseling services to help distressed employees with their work-related issues.
From November 2018 to July 2019, DataKind San Francisco was privileged to work with Empower Work to improve their services using what we do best: data science.
Empower Work works with volunteers and trains them to become skilled counselors, who then interact directly with the users via text messages to talk them through a specific issue and brainstorm a solution. At the time of our partnership, the organization had collected about 500 conversations (de-identified to protect user privacy) and asked us to perform the following two tasks:
Below, we describe our approach and conclusions for each of these two tasks. Read on!
We approached this question as a classic inference problem and built a logistic regression model to understand how different factors correlate with the outcome. In particular:
After fitting our model with these features, we found the following factors particularly insightful:
In addition to the significant factors described above, we also identified a few counseling practices that, although not statistically significant in our model, are interesting enough to warrant an A/B test. For example, some recurring patterns in counselor messages include questions designed to draw the solution out of the texters themselves (e.g., “What would you hope the outcome to be?”) and reframing (e.g., “What would you do if you were in his shoes?”). After all, our model was built using fewer than 500 conversations, which limited its explanatory power.
As a result of these findings, Empower Work has been able to refine its training and support to stress the use of tools that correlate with better outcomes.
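To make the inference setup above concrete, here is a minimal sketch of reading a logistic regression as odds ratios. It is not the project's code: the two binary features (`feature_a`, `feature_b`), the toy rows, and the hand-rolled gradient descent are all assumptions made for illustration, and in practice a statistics library would also report standard errors and p-values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit logistic regression by batch gradient descent; returns (bias, weights)."""
    n, d = len(X), len(X[0])
    b, w = 0.0, [0.0] * d
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            # Prediction error for this row under the current parameters.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return b, w

# Synthetic toy rows (NOT project data): [feature_a, feature_b] -> outcome.
# feature_a raises the positive rate from 1/3 to 2/3; feature_b has no effect.
X = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1],
     [0, 0], [0, 0], [0, 0], [0, 1], [0, 1], [0, 1]]
y = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0]

bias, weights = fit_logistic(X, y)
# exp(coefficient) is the odds ratio associated with each feature.
odds_ratios = [math.exp(wj) for wj in weights]
for name, ratio in zip(["feature_a", "feature_b"], odds_ratios):
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio well above 1 suggests the feature is associated with better outcomes, and a ratio near 1 suggests no association. With a sample this small, such estimates are noisy, which is the same limitation noted above for the real data.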
We built one XGBoost model per tag. As for the model features, in addition to the same counseling style and text features used in the inference task above, we also included a TF-IDF matrix, which represents each conversation as a numeric vector (i.e., a “bag of words”) and weights each word based on its within-document and cross-document frequencies. This simple representation turned out to be quite effective in our case, given that certain words have an almost 1:1 relationship with certain tags.
One challenge was that there were over 100 tags across 500 conversations, and half of them appeared in fewer than 10 conversations. Obviously, there’s little we can do for those rare tags, but where should we draw the line in terms of the minimum sample size? There are traditional rules of thumb that suggest cutoffs such as 30, but how do we derive and validate them empirically for our particular task?
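As a rough illustration of the TF-IDF weighting described above, the sketch below computes it by hand with the standard library. The example sentences are made up, and the formula is the plain textbook variant (term frequency times the log of inverse document frequency); library implementations typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows): one TF-IDF weight vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each word at least once.
    df = Counter(word for doc in docs for word in set(doc))
    vocab = sorted(df)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        rows.append([
            (counts[w] / length) * math.log(n_docs / df[w])
            for w in vocab
        ])
    return vocab, rows

# Made-up example texts, not real conversations.
docs = [
    "my manager cut my pay".split(),
    "my manager ignores my ideas".split(),
    "i was laid off without notice".split(),
]
vocab, rows = tfidf(docs)

# Words unique to one document get more weight than words shared across documents.
first = dict(zip(vocab, rows[0]))
print(sorted(first.items(), key=lambda kv: -kv[1])[:3])
```

Each row can then be joined with the other features and passed to a per-tag classifier. A word that appears in only the conversations carrying a given tag ends up with a high weight exactly where the tag applies, which is why this representation works well when words and tags map almost one to one.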
To answer that, we first built a simple model for each tag, regardless of its size, and validated its performance on a separate holdout set. In addition, to measure the stability of the models, we built five versions of each with different random seeds and computed the standard deviation of their performance scores. Intuitively speaking, a good model should have a high performance score (the F1 score in our case) and low variance. After running this experiment, we visualized the results by plotting the average F1 score and its standard deviation against each tag’s sample size. As expected, the more data a given tag has, the better our model is at predicting it (as indicated by the higher F1 score) and the more stable the model is (as indicated by the lower standard deviation). Based on this empirical analysis, we picked 50 as our threshold and only built models for the tags that have at least 50 data points.
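The bookkeeping for this experiment can be sketched as follows. The function names (`summarize_runs`, `tags_to_model`), the tag names, and every number in the example are placeholders for illustration and are not the project's results; only the threshold of 50 comes from the text above.

```python
from statistics import mean, stdev

def summarize_runs(f1_by_tag, n_by_tag):
    """Return (tag, sample_size, mean_f1, std_f1) rows sorted by sample size."""
    rows = [
        (tag, n_by_tag[tag], mean(scores), stdev(scores))
        for tag, scores in f1_by_tag.items()
    ]
    return sorted(rows, key=lambda row: row[1])

def tags_to_model(n_by_tag, min_samples=50):
    """Keep only the tags with at least `min_samples` labeled conversations."""
    return sorted(tag for tag, n in n_by_tag.items() if n >= min_samples)

# Placeholder scores: one holdout F1 per random seed, five seeds per tag.
f1_by_tag = {
    "tag_a": [0.20, 0.45, 0.10, 0.60, 0.35],
    "tag_b": [0.78, 0.80, 0.79, 0.81, 0.77],
}
n_by_tag = {"tag_a": 8, "tag_b": 120}

summary = summarize_runs(f1_by_tag, n_by_tag)
for tag, n, avg, sd in summary:
    print(f"{tag}: n={n}, mean F1={avg:.2f}, std={sd:.2f}")
print(tags_to_model(n_by_tag))
```

Plotting the mean and the standard deviation from `summary` against sample size gives the kind of chart described above, from which a cutoff can be read off visually.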
We selected the F1 score as the metric in this case because it balances the model’s precision (i.e., when a model predicts a tag exists, how often does it actually exist?) vs. its recall (i.e., when a tag actually exists, how often does the model catch it?), a tradeoff similar to balancing false positives vs. false negatives. We also examined the model’s performance from an accuracy perspective, where some of our better-performing models exceeded 90% accuracy. However, accuracy can be misleading for the less common tags. As an extreme example, an algorithm that predicts “no tag” for all the examples might achieve high accuracy but would, of course, be useless.
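The extreme example above is easy to verify with a few lines of code. In this sketch the label counts (5 positives out of 100) are made up purely to illustrate a rare tag.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = tag present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# A rare tag: present in 5 of 100 conversations (illustrative counts).
y_true = [1] * 5 + [0] * 95
always_no = [0] * 100  # a "model" that never predicts the tag

acc = accuracy(y_true, always_no)
precision, recall, f1 = precision_recall_f1(y_true, always_no)
print(f"accuracy={acc:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

The always-negative predictor is right 95% of the time yet never finds the tag, so its recall and F1 are both zero. This is why F1 is the more honest metric for uncommon tags.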
Now that the models are built, they still have limited use if they’re just sitting on our laptops. To make it easier for the organization to use them, we built a web app using Flask that accepts one or more conversations (saved in a CSV file) and outputs the probability for a given tag. A screenshot of our interface is shown below.
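The core of such an app, reading conversations from a CSV and returning a probability per row, might look like the sketch below. This is not the project's code: the column name `conversation`, the function `score_conversations`, and the keyword-based `placeholder_model` are all hypothetical, and the web layer is omitted. A Flask upload handler could pass the uploaded file's text to this function and render the result.

```python
import csv
import io

def score_conversations(csv_text, predict_proba, text_column="conversation"):
    """Return one {'row': index, 'probability': p} dict per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    results = []
    for index, row in enumerate(reader):
        results.append({
            "row": index,
            "probability": predict_proba(row[text_column]),
        })
    return results

def placeholder_model(text):
    # Stand-in for a trained per-tag model (NOT the project's model):
    # a keyword check that returns a fake probability.
    return 0.9 if "laid off" in text.lower() else 0.1

# Made-up CSV content with a single hypothetical text column.
sample_csv = (
    "conversation\n"
    '"I was laid off last week"\n'
    '"My manager ignores my ideas"\n'
)
results = score_conversations(sample_csv, placeholder_model)
for item in results:
    print(item)
```

Keeping the scoring logic in a plain function like this, separate from the web framework, also makes it easy to test without starting a server.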
As a proof of concept, our model has demonstrated how a machine learning approach can help identify best practices, which in turn have improved the way Empower Work trains and supports volunteers to provide the greatest impact to people who come to them for help. With more data gathered in the future, we’re excited, along with Empower Work, to retrain the model and further expand its impact.
For this project, we collaborated with Empower Work to analyze the factors that may affect the outcome of a counseling conversation and made recommendations to improve the service. In addition, we built models to predict the relevant tags for a given conversation, which should reduce the manual labeling effort in the future.
We want to thank Empower Work for entrusting us with this interesting and impactful project. We have learned a lot from it. We’d also like to recognize our talented and devoted volunteers from DataKind, namely, Runze Wang (Data Ambassador), Vishal Motwani, Edwin Zhang, and Peter Adelson. The project wouldn’t have materialized without your hard work.
Lastly, but importantly, we’d like to take this opportunity to raise awareness of work-induced stress and the importance of mental health. If you are facing pressure, discrimination, or burnout at work, please know that you are not alone in taking action. Talking is often an effective first step, and if you can’t find anyone who is readily available, come and talk to the good folks at Empower Work! There’s always someone who’s willing to lend an ear. You just need to speak.
In this appendix, we briefly describe the steps we took to process the raw text data. If you’re working on something similar, we hope this will provide you with some ideas.
Because we needed to run the same processing pipeline for different tasks and subsets of the data, we modularized our development code into a library that can be easily called and extended, which allowed us to iterate quickly. Sometimes it does pay off to spend time upfront to facilitate things in the long run.
The header image above is courtesy of Empower Work.