Chris was the Data Ambassador for Mobilizing Health at our San Francisco DataDive, where he led a team of non-profit technologists and data scientists to analyze MH’s mobile health program. During the event, Chris’s team sussed out optimal techniques for sending messages from patients to doctors, identified trends in symptoms over time, and even built a prototype of a classification system to predict doctors’ diagnoses based on patient symptoms. Not bad for a weekend’s work, huh? Chris is one of those rare people who are as passionate about helping the world as they are about data science, and you can read all about his thoughts on data for the greater good below!
What’s your day job?
By day, I’m the data scientist for Jive Software. Jive develops the leading social business platform supporting both internal collaboration and external engagement with an organization’s customers. My current charge is to develop advanced analytics for understanding user behavior within a Jive instance. I’m particularly focused on new ways for us to understand how our customers are deriving value through their use of the platform. This involves a lot of fun interactions with our customers as we work together to find solutions that are beneficial all around.
Tell us about your work with DataKind.
About a year ago, DataKind was organizing a DataDive here in San Francisco. I planned to attend and hack away with the rest of the volunteers. A few days before, Drew Conway emailed me and asked if I’d be willing to serve as a Data Ambassador during the event. We talked about the role and I readily agreed to help out in that capacity.
I served as a Data Ambassador for Mobilizing Health, a nonprofit mHealth organization focused on extending the reach of healthcare in India through the use of SMS. Mobilizing Health’s founder, Pooja Upadhyaya, was there along with another colleague to help our volunteer team understand their challenges and the details of the available data. After an initial discussion to get us all oriented, our talented team of volunteers dug in to see what they could uncover. Not surprisingly, we all spent a fair amount of time cleaning and normalizing the data. That work paid off later in the evening.
When I was not helping coordinate the group’s efforts, I’d sit down to kick around ideas with Mark Huberty. Mark and I were both interested in analyzing the message content between the doctors and the community health workers. We thought it would be interesting to run topic modeling on the SMS content to see if any interesting patterns emerged. Mark is a masterful R hacker so he made rapid progress on conditioning the data. In the end, he ran Latent Dirichlet Allocation on the content from the community health workers and had Pooja examine the resulting topics. Nothing terribly interesting emerged. So in the final hours, we pondered what to do next.
One of our fellow volunteers had done some significant work to normalize the drug prescription information contained in the SMS traffic from the doctors to the community health workers. This led me to wonder if we could predict those drugs based on the reported symptoms sent in by the community health workers. To attack that problem quickly, I suggested that we use a text regression R package that John Myles White at Princeton had just released days earlier. Mark got to work quickly to get it installed and understand how to use the package. In short order, he was learning sparse linear predictors and evaluating their performance. We were pleasantly surprised to see that we could indeed predict the doctor’s response fairly well with a small number of terms.
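To make the idea of a sparse linear predictor concrete, here is a minimal sketch in Python. It is not the R package the team used at the DataDive, and it does not use Mobilizing Health's data: the messages, the labels, and every function name below are hypothetical and invented for illustration. The sketch fits an L1-penalized logistic regression on bag-of-words counts by proximal gradient descent, so that most term weights shrink to exactly zero and only a few terms drive the prediction.

```python
import math
from collections import Counter

# Hypothetical toy data, invented for illustration only:
# (symptom text from a health worker, 1 if the doctor prescribed a given drug).
MESSAGES = [
    ("fever and headache since two days", 1),
    ("high fever with body pain", 1),
    ("fever chills and weakness", 1),
    ("mild fever and headache", 1),
    ("cough and sore throat", 0),
    ("stomach pain and vomiting", 0),
    ("dry cough since one week", 0),
    ("skin rash and itching", 0),
]


def bag_of_words(text):
    """Count lowercase, whitespace-separated terms."""
    return Counter(text.lower().split())


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_logistic(examples, l1=0.05, lr=0.5, epochs=2000):
    """L1-penalized logistic regression via proximal gradient descent."""
    docs = [bag_of_words(text) for text, _ in examples]
    labels = [label for _, label in examples]
    weights = {term: 0.0 for doc in docs for term in doc}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad = dict.fromkeys(weights, 0.0)
        grad_bias = 0.0
        for doc, label in zip(docs, labels):
            z = bias + sum(weights[t] * c for t, c in doc.items())
            error = 1.0 / (1.0 + math.exp(-z)) - label
            grad_bias += error
            for t, c in doc.items():
                grad[t] += error * c
        # The bias is not penalized; term weights take a gradient step
        # followed by soft-thresholding, which is what produces sparsity.
        bias -= lr * grad_bias / n
        for t in weights:
            weights[t] = soft_threshold(weights[t] - lr * grad[t] / n, lr * l1)
    return weights, bias


def predict_probability(text, weights, bias):
    """Probability that the drug is prescribed for a new symptom message."""
    z = bias + sum(weights.get(t, 0.0) * c for t, c in bag_of_words(text).items())
    return 1.0 / (1.0 + math.exp(-z))


weights, bias = fit_sparse_logistic(MESSAGES)
selected = {term: w for term, w in weights.items() if w != 0.0}
```

Inspecting `selected` shows which terms the model kept, and `predict_probability("fever and chills", weights, bias)` scores a new message. In a real setting the penalty strength `l1` would be chosen by cross-validation rather than fixed by hand as it is here.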
Then came the final question: was this even a useful capability? I sat down with Pooja to explain what we had done. When I finished, she was very excited. This was something she was hoping to tackle at some point in the future. We managed to leverage the day’s work to get a proof of principle done.
Upon further reflection, it was clear this could be a useful capability in many respects. One immediate benefit would be in accelerating the doctor’s responses. Instead of having to type out a lengthier reply for each incoming message, such a predictor could be part of a prescription autocomplete system where the doctor could simply confirm the predicted prescription and add the dosage information. If the predictor could correctly identify the doctor’s intended response most of the time, that would allow the doctor to respond to more community health workers in a given period of time.
What inspires you to use your data skills for good in your spare time?
I’m fundamentally motivated by impact. There are so many wicked problems in the world that require folks with data skills to engage. And there are simply not enough of us around. I’d like to do my part.
What is one of the most surprising things you’ve learned or seen in working with data?
One of the most fascinating phenomena that I stumbled on while working with data is how people convey social signals through language. Starting in 2007, I became interested in the problem of predicting manager-subordinate relationships in organizational email. I was working with the publicly available Enron email corpus. Using an Enron document released during the government investigation, I was able to construct ground truth on a number of manager-subordinate relationships within the company. From there, I was able to train a high-dimensional linear ranker that would prioritize communication relationships by how likely each was to be a manager-subordinate relationship. By design, the linear ranker relied exclusively on message content to achieve this goal.
When processing the message content, one of my colleagues suggested I remove all of the stop words – the terms that are viewed as non-information-bearing from the perspective of information retrieval. As a machine learning researcher, I did not want to apply such pre-processing heuristics. If these terms were not relevant to the task, I wanted to remove them by more principled means within the learning process itself. That personal bias turned out to be crucially important.
It turns out that stop words play an important role in the social signaling that happens as we communicate with one another. The linear ranker that I had learned from the data suggested this in the relative weights ascribed to different words. Many of the most significant words were stop words. At first, I thought this was some sort of anomaly. Thanks to my friends in social psychology at UT Austin, I learned that this result was indeed real and something that they had been studying for some time. I just happened to arrive at that result in a way that was completely foreign to them. The data opened doors to a completely new world for me. (You can read more about Chris’s work on social signaling here).
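As a rough illustration of that design choice, the Python sketch below trains a small pairwise linear ranker on raw term counts with no stop-word filtering. It is an assumption-laden toy, not Chris's actual model: the example messages, the labels, and all function names are hypothetical, and a margin perceptron over pairs stands in for whatever learning method was really used. The point is only that function words such as "i" and "you" stay in the vocabulary, and the learner decides how much weight each one gets.

```python
from collections import Counter

# Hypothetical toy data, invented for illustration only:
# (text sent along a relationship, 1 if the sender manages the recipient).
RELATIONSHIPS = [
    ("please send me the report by friday", 1),
    ("i need you to review this today", 1),
    ("you should get this done by monday", 1),
    ("would it be possible to review my draft", 0),
    ("i was wondering if we could meet", 0),
    ("thanks i will send it to you today", 0),
]


def features(text):
    """Raw term counts; stop words are deliberately kept."""
    return Counter(text.lower().split())


def score(weights, feats):
    """Linear score of one relationship under the current weights."""
    return sum(weights.get(term, 0.0) * count for term, count in feats.items())


def train_pairwise_ranker(examples, epochs=100, lr=0.1, margin=1.0):
    """Margin perceptron over (manager, non-manager) pairs."""
    positives = [features(text) for text, label in examples if label == 1]
    negatives = [features(text) for text, label in examples if label == 0]
    weights = {}
    for _ in range(epochs):
        for pos in positives:
            for neg in negatives:
                # Update only when a manager relationship fails to outrank
                # a non-manager relationship by at least the margin.
                if score(weights, pos) - score(weights, neg) < margin:
                    for term, count in pos.items():
                        weights[term] = weights.get(term, 0.0) + lr * count
                    for term, count in neg.items():
                        weights[term] = weights.get(term, 0.0) - lr * count
    return weights


ranker = train_pairwise_ranker(RELATIONSHIPS)
# Terms ordered by how strongly they influence the ranking, in either direction.
top_terms = sorted(ranker, key=lambda term: abs(ranker[term]), reverse=True)
```

Reading through `top_terms` alongside the signed values in `ranker` is the toy equivalent of inspecting the relative weights the learned ranker ascribed to different words. On six invented sentences the ordering means nothing, so no conclusion about real email should be drawn from it.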
What’s the most interesting or visually striking data project you’ve seen recently?
One data project I’m watching with great interest is a startup my friend Joe Reisinger is working on called Premise. Joe has a bold, long-term goal of measuring economic activity worldwide that is currently hidden from view. Through automated online and crowdsourced offline data collection, Joe and his colleagues want to present new economic indicators that give a more dynamic and nuanced view of economic activity around the world. I’m expecting big things from Joe and his team.
What does someone getting started with data science need to learn?
There are many skills that, taken together, make a data scientist successful. Let me focus on just one: asking the right question. Framing the problem is often one of the most challenging aspects in any engagement. Too often, people are inclined to rush through this step. Teasing apart nuance to get to the heart of the matter requires patience and persistence. It is an art – one that frankly I had not mastered by the time I finished graduate school. I was fortunate to work with some brilliant colleagues in my research career who helped me develop that skill. Find others who you believe have that skill and learn from their example. It will pay off many times over in the future.
What blogs do you read?
Honestly, I don’t follow any particular blogs. Twitter is my cueing mechanism. I follow sharp folks across many domains of interest and let them guide me to the latest ideas and happenings. I gain value from my Twitter network each and every day. They keep me informed and entertained as well! That is the magic of Twitter. You have the opportunity to transcend time and distance barriers in the offline world to meet some truly incredible people. As a case in point, I can trace my connection to DataKind back to relationships that formed on Twitter.
What is your favorite vacation destination?
Anyplace that expands my horizons. I’ve been fortunate to travel to many countries outside the US and to wonderful destinations within the US. Each journey has been special and I’m fortunate to have many a story to tell. Outdoor adventure is one particular passion of mine. I’ve learned so much about myself and my friends while adventuring. Some of those bonds have been cemented in challenging circumstances. They will never be broken.