Note: This post originally appeared on Pivotal's blog.
Earlier this year, I worked as part of the Pivotal for Good (P4G) program, which enables Pivotal data scientists to donate three months of time to collaborate with nonprofits that lack the skills or resources to perform data science functions. Since August 2013, Crisis Text Line’s trained crisis counselors have assisted hundreds of at-risk teens every day. During our initial analysis, we measured the value of conversations with Crisis Text Line’s crisis counselors for individual texters by detecting signs of gratitude. Since then, we have been diving further into the data to uncover characteristics that could be used in subsequent analyses and in predictive models that identify positive engagements.
A challenge with Crisis Text Line’s texting-only method of crisis counseling is the difficulty of detecting emotions and nuances that would be evident in spoken conversations. Texting conversations carry their own linguistic subtleties, meaning that while we may lack the emotional cues of spoken conversations, the language used in crisis-based text conversations is filled with meaning, complexity, and emotional expression.
Such sentiment analysis is a problem uniquely suited to data science. My colleagues on the Pivotal Data Science team recently posted about typical natural language processing (NLP) tasks, which include various techniques, from sentence segmentation (identifying where one sentence ends and another begins) to named entity recognition (identifying entities such as persons, locations, and times within documents). In practical NLP applications, out-of-the-box analysis tools often have to be modified to accommodate nuances in a particular domain. In this post, I’ll share some of the methods Crisis Text Line and I used, and the modifications we made for their unique datasets.
In some of Crisis Text Line’s previous analyses, punctuation marks were not considered features in and of themselves. However, we discovered that there was analytical value in retaining them as part of the corpus.
For Crisis Text Line, tokenization of the data includes the identification of individual words, emoticons and specific punctuation marks in each message. Typical tokenization rules could identify word boundaries in messages with punctuation as follows:
“I guess so…:(” → [I] [guess] [so] [.] [.] [.] [:] [(]
White space tokenization does well at delimiting individual words in a text message:
“I guess so” → [I] [guess] [so]
However, this method fails to separate runs of punctuation and emoticons into their own tokens:
“I guess so…:(” → [I] [guess] [so…:(]
It would also fail for texters who stylistically choose not to use spaces in their messages:
“Iguessso:(” → [Iguessso:(]
Given all of these cases, we built a custom rule-based model for tokenization. Using our model, the first example is tokenized as:
“I guess so…:(” → [I] [guess] [so] […] [:(]
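The core idea behind such a rule-based tokenizer can be sketched with a prioritized regular expression: emoticon patterns are tried first so that “:(” survives as a single token, then ellipses, then words, and only then is leftover punctuation split one mark at a time. This is an illustrative sketch, not Crisis Text Line’s actual rule set; the emoticon inventory here is a small made-up subset.

```python
import re

# Illustrative rule-based tokenizer. Alternation order matters:
# emoticons are matched before individual punctuation marks so that
# ":(" is kept whole rather than split into [:] [(].
EMOTICONS = [r":\)", r":\(", r":D", r":/", r"<3"]  # small illustrative subset
TOKEN_RE = re.compile(
    "|".join(EMOTICONS)   # emoticons first
    + r"|\.{2,}|…"        # ellipses: runs of dots or the single "…" character
    + r"|\w+"             # words
    + r"|[^\w\s]"         # any remaining punctuation mark, one at a time
)

def tokenize(message):
    return TOKEN_RE.findall(message)

print(tokenize("I guess so…:("))  # ['I', 'guess', 'so', '…', ':(']
```

Note that this also behaves sensibly on space-free messages like “Iguessso:(”, pulling the trailing emoticon off the fused word, although no regex alone can split “Iguessso” back into its component words.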
Stop words are common words that are often filtered as part of the text cleaning process. One could consider words like “is” or “it” as filler words in a sentence, whereas words used less frequently are more interesting to study.
There are several lists of stop words readily available for use in text analysis. Here’s a snippet of one available through NLTK:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', ...]
Consider the following sample fabricated text messages:
“Do you want to hurt yourself?”
“I want to hurt myself”
“He said he wants to hurt me”
“I said to her that I want to hurt myself”
With stop words and punctuation removed, these become:
“Do you want to hurt yourself?” → “want hurt”
“I want to hurt myself” → “want hurt”
“He said he wants to hurt me” → “said want hurt”
“I said to her that I want to hurt myself” → “said want hurt”
For Crisis Text Line, understanding the subjects (e.g. ‘I’, ‘they’, ‘you’) is critical in discerning these messages. Stripping out all the typical stop words also strips away much of the intent of the sentences, and it is important to understand whether a texter’s crisis is related to other people’s actions. With that in mind, we opted to keep the following words when they appeared in the corpus:
'i', 'me', 'we', 'us'
'he', 'she', 'him', 'her', 'they', 'them'
'my', 'your', 'his', 'hers', 'our', 'their', 'ours', 'yours', 'theirs'
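In code, this amounts to subtracting the retained pronouns from the stop word list before filtering. The sketch below uses a small hand-inlined subset of a typical English stop word list for self-containment; in practice one would start from a full list such as NLTK’s.

```python
# Subset of a typical English stop word list (illustrative only)
STOP_WORDS = {
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
    'you', 'your', 'yours', 'yourself', 'yourselves',
    'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
    'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
    'do', 'to', 'that', 'the', 'a', 'is',
}

# Pronouns retained for Crisis Text Line's analysis (the list above)
KEEP_WORDS = {
    'i', 'me', 'we', 'us',
    'he', 'she', 'him', 'her', 'they', 'them',
    'my', 'your', 'his', 'hers', 'our', 'their', 'ours', 'yours', 'theirs',
}

EFFECTIVE_STOP_WORDS = STOP_WORDS - KEEP_WORDS

def remove_stop_words(tokens):
    """Drop stop words, but keep the subject/object pronouns."""
    return [t for t in tokens if t.lower() not in EFFECTIVE_STOP_WORDS]

print(remove_stop_words("i want to hurt myself".split()))
# ['i', 'want', 'hurt']  -- the subject survives filtering
```

With the default list, “I want to hurt myself” and “Do you want to hurt yourself?” both collapse to “want hurt”; with the trimmed list, the first retains its crucial ‘i’.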
With these various techniques in place, we were then able to construct a dictionary that included a wide range of emoticons, punctuation and words specifically relevant to our analysis.
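With tokenization and the trimmed stop word list in place, building such a dictionary can be as simple as counting tokens across the cleaned corpus. The messages below are fabricated for illustration; the point is that emoticons and punctuation are counted alongside words.

```python
from collections import Counter

# Fabricated messages, already tokenized and stop-word-filtered;
# emoticon and ellipsis tokens enter the dictionary like any word.
cleaned_messages = [
    ['i', 'want', 'hurt', ':('],
    ['he', 'said', 'he', 'wants', 'hurt', 'me'],
    ['i', 'guess', '…', ':('],
]
dictionary = Counter(tok for msg in cleaned_messages for tok in msg)
print(dictionary.most_common(5))
```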
After properly cleansing and preparing the text data, we were able to create topic models and word-based features. This fits into a wider analytical framework that we built at Crisis Text Line to support their growing data science practice. Ultimately, this work and the final output can be reused for additional text analytics tasks that feed into both counselor training and product direction.
In our next post, I’ll discuss how our collaboration with Crisis Text Line has changed their data science practice and how they are continuing to improve the care for their texters. Data science models and analyses have led to changes to their platform and their crisis counselor training program. In the past few months alone, they have seen higher satisfaction ratings. Stay tuned!