Note: This post originally appeared on Pivotal's blog.
Earlier this year, I worked as part of the Pivotal for Good (P4G) program, which enables Pivotal data scientists to donate three months of time to collaborate with nonprofits that lack the skills or resources to perform data science functions. Since August 2013, Crisis Text Line’s trained crisis counselors have assisted hundreds of at-risk teens every day. During our initial analysis, we measured the value of conversations with Crisis Text Line’s crisis counselors for individual texters by detecting signs of gratitude. Since then, we have been diving further into the data to uncover characteristics that could be used in subsequent analysis and in predictive models that determine positive engagements.
A challenge with Crisis Text Line’s texting-only method of crisis counseling is the difficulty of detecting emotions and nuances that would be evident in spoken conversations. Texting conversations carry their own linguistic subtleties, meaning that while we may lack the emotional cues of spoken conversations, the language used in crisis-based text conversations is filled with meaning, complexity, and emotional expression.
Sentiment analysis of this kind is a problem well suited to data science. My colleagues on the Pivotal Data Science team recently posted about typical natural language processing (NLP) tasks, which range from sentence segmentation (identifying where one sentence ends and another begins) to named entity recognition (identifying entities such as persons, locations, and times within documents). In practical NLP applications, out-of-the-box analysis tools often have to be modified to accommodate nuances in a particular domain. In this post, I’ll share some of the methods Crisis Text Line and I used, and modifications we made for their unique datasets.
In some of Crisis Text Line’s previous analyses, punctuation marks were not considered as features in and of themselves. However, we discovered that there was analytical value in retaining them as part of the corpus.
- Question-to-statement ratio: Crisis Text Line teaches crisis counselors several active listening methods, including asking the texter questions during the conversation, rather than making statements. Measuring the usage of questions versus statements aids in further analyzing the effectiveness of these interactions.
- Emoticons: People frequently express themselves over text messaging with the use of emoticons, which are almost entirely constructed with punctuation characters. Thus, we included emoticons in the Crisis Text Line dictionary. As expected, the most common emoticons used are :) and :(.
- Pauses: While a pregnant pause is easy for humans to detect in vocal conversations, it can be difficult to identify over text messaging. Many people make liberal use of ellipses in texts to express a pause, a choice that can reflect hesitation, nervousness, or simple stylistic preference. (An ellipsis in a text message appears most frequently as three consecutive periods ‘…’. We also see variations such as ‘..’, ‘….’, and ‘. . .’.) We decided to consider ellipses as individual words with their own meaning.
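To make these punctuation-based features concrete, here is a minimal sketch of how a question-to-statement ratio and an ellipsis count could be computed. It is not the code we ran at Crisis Text Line: the function name, the "ends with a question mark" rule, and the sample messages are all assumptions made up for illustration.

```python
import re

# Two or more periods (optionally separated by a space), or the single "…" character.
# Covers the variants mentioned above: "..", "...", "....", ". . ." and "…".
ELLIPSIS = re.compile(r"[.…](?:\s?[.…])+|…")

def punctuation_features(messages):
    """Count questions, statements, and ellipses in a list of messages."""
    # Assumption: a message counts as a question if it ends with "?".
    questions = sum(1 for m in messages if m.rstrip().endswith("?"))
    statements = len(messages) - questions
    ellipses = sum(len(ELLIPSIS.findall(m)) for m in messages)
    ratio = questions / statements if statements else float("inf")
    return {
        "questions": questions,
        "statements": statements,
        "question_to_statement": ratio,
        "ellipses": ellipses,
    }

# Fabricated messages, for illustration only.
sample_messages = [
    "How are you feeling right now?",
    "That sounds really hard.",
    "What happened next?",
    "I'm here for you...",
]
features = punctuation_features(sample_messages)
print(features)
```

A single sentence-ending period is deliberately not counted as a pause; only runs of two or more periods are.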
For Crisis Text Line, tokenization of the data includes the identification of individual words, emoticons and specific punctuation marks in each message. Typical tokenization rules could identify word boundaries in messages with punctuation as follows:
“I guess so…:(” → [I] [guess] [so] [.] [.] [.] [:] [(]
Whitespace tokenization does well at delimiting individual words in a text message:
“I guess so” → [I] [guess] [so]
However, this method would fail to separate groups of punctuation and emoticons into tokens of their own:
“I guess so…:(” → [I] [guess] [so…:(]
It would also fail for texters who stylistically choose not to use spaces in their messages:
“Iguessso:(” → [Iguessso:(]
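The two baseline behaviours described above can be reproduced in a few lines. This is only a sketch: the function names are hypothetical, and the regular expression is one simple way to split every punctuation mark into its own token, not the rule set of any particular library. The ellipsis is written here as three ASCII periods.

```python
import re

def whitespace_tokenize(message):
    """Split on runs of whitespace only."""
    return message.split()

def punctuation_tokenize(message):
    """Runs of word characters are tokens; every other non-space character is its own token."""
    return re.findall(r"\w+|[^\w\s]", message)

print(whitespace_tokenize("I guess so"))        # ['I', 'guess', 'so']
print(whitespace_tokenize("I guess so...:("))   # ['I', 'guess', 'so...:(']
print(whitespace_tokenize("Iguessso:("))        # ['Iguessso:(']
print(punctuation_tokenize("I guess so...:("))  # ['I', 'guess', 'so', '.', '.', '.', ':', '(']
```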
Given all of these cases, we built a custom rule-based model for tokenization. Using our model, the first example would be tokenized as:
“I guess so…:(” → [I] [guess] [so] […] [:(]
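As an illustration of what a rule-based tokenizer along these lines might look like, here is a small regular-expression sketch. It is not the model we built: the pattern, the tiny emoticon set, and the choice to normalise every ellipsis variant to a single "..." token are all assumptions for the sake of the example.

```python
import re

# Rules are tried in order, so ellipses and emoticons win over single punctuation marks.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<ellipsis>[.…](?:\s?[.…])+|…)       # two or more periods, spaced or not, or the "…" character
  | (?P<emoticon>[:;=][-']?[()DPp/\\|])    # a small illustrative emoticon set
  | (?P<word>\w+(?:'\w+)?)                 # words, with an optional internal apostrophe
  | (?P<punct>[^\w\s])                     # any other single punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(message):
    """Split a message into words, emoticons, ellipses, and punctuation marks."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(message):
        if match.lastgroup == "ellipsis":
            tokens.append("...")  # normalise every ellipsis variant to one token
        else:
            tokens.append(match.group())
    return tokens

print(tokenize("I guess so...:("))  # ['I', 'guess', 'so', '...', ':(']
print(tokenize("Iguessso:("))       # ['Iguessso', ':(']
```

Note that this sketch separates the emoticon from "Iguessso" but does not attempt to split the run-together words, and a pattern this small would misread sequences such as the ":/" in a URL as an emoticon. A production rule set would need a longer emoticon list and more guards.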
Stop Word Removal
Stop words are common words that are often filtered as part of the text cleaning process. One could consider words like “is” or “it” as filler words in a sentence, whereas words used less frequently are more interesting to study.
There are several lists of stop words readily available for use in text analysis. Here’s a snippet of one available through NLTK:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours','yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers','herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',...]
Consider the following sample fabricated text messages:
“Do you want to hurt yourself?”
“I want to hurt myself”
“They said they want to hurt me”
“I said to her that I want to hurt myself”
With stop words and punctuation removed, these would turn into:
“Do you want to hurt yourself?” → “want hurt”
“I want to hurt myself” → “want hurt”
“They said they want to hurt me” → “said want hurt”
“I said to her that I want to hurt myself” → “said want hurt”
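The collapse shown above is easy to reproduce. The sketch below uses a small hand-written subset of a typical English stop word list, so that it runs without any downloads; the subset and the function name are assumptions for illustration, and a full list such as NLTK's is much longer.

```python
import re

# Illustrative subset of a typical English stop word list (assumption, not a full list).
STOP_WORDS = {
    "i", "me", "my", "myself", "we", "our", "you", "your", "yourself",
    "he", "him", "his", "she", "her", "they", "them", "their",
    "do", "to", "that", "is", "it",
}

def remove_stop_words(message):
    """Lowercase the message, drop punctuation, and filter out stop words."""
    words = re.findall(r"[a-z']+", message.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)

messages = [
    "Do you want to hurt yourself?",
    "I want to hurt myself",
    "They said they want to hurt me",
    "I said to her that I want to hurt myself",
]
for message in messages:
    print(f"{message!r} -> {remove_stop_words(message)!r}")
```

Four messages with very different meanings reduce to just two distinct strings, "want hurt" and "said want hurt".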
For Crisis Text Line, understanding the subjects (i.e. ‘I’, ‘they’, ‘you’) is critical in discerning these messages. Stripping out all the typical stop words here also strips away much of the intent of the sentences. It is important to understand if a texter’s crisis is related to other people’s actions. With that in mind, we opted to keep the following words when they appeared in the corpus:
Personal pronouns:
- First person: 'i', 'me', 'we', 'us'
- Second person: 'you'
- Third person: 'he', 'she', 'him', 'her', 'they', 'them'
Possessive pronouns:
- First person: 'my', 'our', 'ours'
- Second person: 'your', 'yours'
- Third person: 'his', 'hers', 'their', 'theirs'
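One simple way to apply a keep-list like this is to subtract it from the stop word set before filtering. The sketch below assumes a small hand-written stop word subset rather than a full list, and the names are hypothetical. 'you' is included in the keep-list on the basis of the subjects named in the text ('I', 'they', 'you').

```python
import re

# Illustrative subset of a typical English stop word list (assumption, not a full list).
STOP_WORDS = {
    "i", "me", "my", "myself", "we", "our", "you", "your", "yourself",
    "he", "him", "his", "she", "her", "they", "them", "their",
    "do", "to", "that", "is", "it",
}

# Pronouns retained so that the subject and object of a message survive cleaning.
KEPT_PRONOUNS = {
    "i", "me", "we", "us", "you",
    "he", "she", "him", "her", "they", "them",
    "my", "your", "his", "hers", "our", "their", "ours", "yours", "theirs",
}

CUSTOM_STOP_WORDS = STOP_WORDS - KEPT_PRONOUNS

def clean(message):
    """Lowercase the message, drop punctuation, and filter out the custom stop words."""
    words = re.findall(r"[a-z']+", message.lower())
    return " ".join(w for w in words if w not in CUSTOM_STOP_WORDS)

print(clean("Do you want to hurt yourself?"))             # you want hurt
print(clean("I want to hurt myself"))                     # i want hurt
print(clean("They said they want to hurt me"))            # they said they want hurt me
print(clean("I said to her that I want to hurt myself"))  # i said her i want hurt
```

With the pronouns retained, the four sample messages stay distinguishable after cleaning, and it remains visible who is doing the hurting and who is being hurt.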
With these various techniques in place, we were then able to construct a dictionary that included a wide range of emoticons, punctuation and words specifically relevant to our analysis.
After properly cleansing and preparing the text data, we were able to create topic models and word-based features. This fits into a wider analytical framework that we built at Crisis Text Line to support their growing data science practice. Ultimately, this work and the final output can be reused for additional text analytics tasks that feed into both counselor training and product direction.
In our next post, I’ll discuss how our collaboration with Crisis Text Line has changed their data science practice and how they are continuing to improve the care for their texters. Data science models and analyses have led to changes to their platform and their crisis counselor training program. In the past few months alone, they have seen higher satisfaction ratings. Stay tuned!
- Listen to the Data Skeptic podcast
- To start exploring Crisis Text Line’s data yourself, check out crisistrends.org
- Blog: Pivotal for Good Connects Data Scientist with Crisis Text Line to Help At-Risk Teens
- Blog: Pivotal For Good with Crisis Text Line: A First Look
- Blog: Text Analytics and Natural Language Processing in the Era of Big Data
- Blog: 3 Key Capabilities Necessary for Text Analytics & Natural Language Processing in the Era of Big Data