
Understanding Text Data

over 5 years ago by Ryan Stuart • 2 min read

This is an article I wrote for the March–April issue of Research News Magazine on behalf of Kapiche.

Advances in natural language processing are allowing increasing numbers of businesses to realise the potential of free text data.

The amount of data available to organisations is increasing exponentially. In fact, the amount of data we generate currently doubles every two years. The collection and storage of data is a booming business worth in excess of US$36 billion annually, according to International Data Corporation. However, a recent International Data Group report reveals that CIOs are still dissatisfied with the analytical capabilities currently available to them. This is particularly true of unstructured data, which is estimated to make up 80 per cent of all available data.

Unstructured text data is a significantly under-utilised resource. Efficiently eliciting reliable and actionable insight from open-ended survey questions, focus groups and similar data has the potential to catalyse a major industry shift.

Insights drawn from free text data are uniquely valuable in that they directly encapsulate the ideas, feelings and sentiments of the customer. Consequently, businesses have potential access to unprecedented levels of customer understanding. Fortunately, the progress being made in the natural language processing (NLP) field of artificial intelligence (AI) is allowing increasing numbers of businesses to realise this potential.

NLP is a longstanding field of AI research with a complex history. It is generally thought to have started in the 1950s with the Turing test. Since then, it has undergone many transformations, evolving from sets of hard-coded rules for adding structure to text through to the adoption of machine learning techniques for translation, part-of-speech tagging and entity extraction. However, most of these technologies are designed to improve the operational handling of text for use in support ticketing systems and similar applications.

Topic modelling: from ‘black box’ to ‘white box’

More recently, new technologies have emerged that aim to facilitate a deeper and more comprehensive understanding of text data. Chief among these is ‘topic modelling’, which is designed to make text data more accessible and, as the technology has evolved, more understandable. Topic modelling is a purely statistical approach to identifying patterns and relationships within text. More formally, it is described as a method for finding and tracing clusters of words (called ‘topics’ in shorthand) in large bodies of text.
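To make the idea of 'clusters of words' concrete, here is a deliberately tiny sketch. It is not LDA, LSA or any particular product's method. It simply counts how often pairs of words turn up in the same document and groups words that are linked by frequent pairs. The corpus, the stopword list and the `min_pair_count` threshold are all invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; real inputs would be survey answers, tweets, etc.
DOCS = [
    "delivery was late and the delivery driver was rude",
    "late delivery again and a rude driver",
    "the app crashed at login and the app is slow",
    "login is slow and the app crashed twice",
]
# Hypothetical stopword list, just large enough for this corpus.
STOPWORDS = {"was", "and", "the", "a", "at", "is", "again", "twice"}


def tokens(doc):
    """Return the set of distinct non-stopword words in one document."""
    return {w for w in doc.lower().split() if w not in STOPWORDS}


def word_clusters(docs, min_pair_count=2):
    """Group words that share a document at least `min_pair_count` times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(combinations(sorted(tokens(doc)), 2))

    # Link frequently co-occurring words to each other.
    neighbours = {}
    for (a, b), n in pair_counts.items():
        if n >= min_pair_count:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    # Walk the links to collect each connected group of words.
    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            word = stack.pop()
            if word not in group:
                group.add(word)
                stack.extend(neighbours[word] - group)
        seen |= group
        clusters.append(sorted(group))
    return clusters


for cluster in word_clusters(DOCS):
    print(cluster)
```

On this toy corpus the words fall into two groups, one about delivery and one about the app. Real topic models use far richer statistics, but the underlying intuition is similar: words that keep appearing together probably belong to the same theme.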

The early history of topic modelling is dominated by two approaches: latent Dirichlet allocation (LDA) and latent semantic analysis (LSA). One of the key limitations of both approaches is that they operate in a ‘black box’ fashion, producing models that can be difficult to interpret directly. This makes it difficult to demonstrate their veracity. Moreover, their application often requires expert knowledge and significant tuning for each domain of interest, and neither technology works particularly well on short documents (like tweets).

Recently, emerging technologies have placed greater emphasis on the accessibility of topic modelling and the interpretability of the generated models. This puts more power in the hands of users, helping them to better identify and report on the narrative of their data. Operating in ‘white box’ fashion, these technologies provide greater visibility over the factors that have led to the identification of a topic. Further, they have significantly lowered the barrier to entry, in some cases providing meaningful results at the click of a button. This is in stark contrast to their precursors, which are largely technical modelling processes that require expert knowledge.

Alleviating frustration

The emergence of these new approaches can be partially attributed to the frustrations of researchers who were using manual techniques for modelling and understanding text data. In the worst case, this would mean hand-crafting a ‘codebook’ that applied codes to segments of text data. The process of applying the codebook to the text data (usually called classification or coding) was either entirely manual or partly automated with computer assistance. Finally, the frequency and co-occurrence statistics of codes were examined with the goal of identifying insights. There are several problems with this approach, including:

human bias
reproducibility of results
maintenance of codebooks as language evolves
requirement of separate codebooks for different data domains
time to results
complexity of the entire process.
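The manual workflow described above can be sketched in a few lines. This is a simplified illustration, assuming a codebook is just a mapping from each code to a set of trigger keywords. The codes, keywords and survey responses are all hypothetical, and real codebooks are usually far more nuanced.

```python
from collections import Counter
from itertools import combinations

# Hypothetical codebook: each code maps to keywords that trigger it.
CODEBOOK = {
    "price": {"expensive", "cheap", "cost"},
    "service": {"staff", "rude", "helpful"},
    "delivery": {"late", "shipping", "courier"},
}

# Hypothetical open-ended survey responses.
RESPONSES = [
    "Shipping was late and the staff were rude",
    "Too expensive for what you get",
    "Helpful staff but the courier was late",
    "Cheap and cheerful",
]


def apply_codebook(text, codebook):
    """Return the sorted codes whose keywords appear in the text."""
    words = set(text.lower().split())
    return sorted(code for code, keywords in codebook.items() if words & keywords)


def code_statistics(responses, codebook):
    """Count how often each code, and each pair of codes, is applied."""
    frequency, co_occurrence = Counter(), Counter()
    for text in responses:
        codes = apply_codebook(text, codebook)
        frequency.update(codes)
        co_occurrence.update(combinations(codes, 2))
    return frequency, co_occurrence


frequency, co_occurrence = code_statistics(RESPONSES, CODEBOOK)
print(frequency)
print(co_occurrence)
```

Even this small example hints at the maintenance burden: every new product, slang term or data domain means revisiting the keyword sets by hand.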

The emerging topic modelling technologies significantly mitigate these problems. With the click of a button and often within a few seconds, a model (or what might historically have been considered a codebook) can be generated entirely from the data selected for analysis. Conventional statistics such as frequency and correlation are calculated automatically, and a query interface allows exploration of the underlying data. Perhaps even more importantly, the user is then given a high degree of control to tweak the model. That control comes in many forms, including adjusting the granularity of the topics identified, editing the names of the topics, and even merging or removing topics. All of these features are particularly useful when reporting results to stakeholders who may not want or need to understand the technologies being used to uncover insights.

The next frontier

The automation of the coding process has also enabled dynamic juxtaposition of structured data with unstructured insights, which in turn provides highly contextualised understanding.

This powerful technique arguably represents the next frontier in data analysis.

For example, consider analysing social media data such as Twitter, where demographic information for the tweet’s author is accessible in addition to the actual tweet content. When examining the structured data together with an understanding of tweets provided by the automated topic modelling, it becomes possible to interrogate the data with questions such as ‘What are people in Melbourne saying compared with those in Sydney?’ Following on, greater levels of detail can be explored with requests such as ‘Show me tweets from males in Melbourne that were positive.’
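As a rough sketch of what such questions look like in practice, suppose each tweet has already been tagged with a topic and a sentiment by an automated model. The records, field names and helper functions below are hypothetical and exist only to illustrate the idea.

```python
from collections import Counter

# Hypothetical records: structured fields plus a topic and sentiment that an
# automated model is assumed to have assigned already.
TWEETS = [
    {"city": "Melbourne", "gender": "male", "topic": "coffee", "sentiment": "positive"},
    {"city": "Melbourne", "gender": "female", "topic": "trams", "sentiment": "negative"},
    {"city": "Melbourne", "gender": "male", "topic": "trams", "sentiment": "negative"},
    {"city": "Sydney", "gender": "male", "topic": "beaches", "sentiment": "positive"},
    {"city": "Sydney", "gender": "female", "topic": "coffee", "sentiment": "positive"},
]


def topics_by(records, field):
    """Count topics separately for each value of a structured field."""
    counts = {}
    for record in records:
        counts.setdefault(record[field], Counter())[record["topic"]] += 1
    return counts


def select(records, **criteria):
    """Return the records whose fields match every keyword criterion."""
    return [r for r in records if all(r[k] == v for k, v in criteria.items())]


# 'What are people in Melbourne saying compared with those in Sydney?'
for city, counts in topics_by(TWEETS, "city").items():
    print(city, counts.most_common())

# 'Show me tweets from males in Melbourne that were positive.'
print(select(TWEETS, city="Melbourne", gender="male", sentiment="positive"))
```

The point is that once the text has been turned into topics, it can be sliced by any structured field in exactly the same way as conventional data.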

The ability to understand text data with the latest AI technologies represents a major shift in the capabilities market researchers are able to offer clients. Not only are findings from data sources like focus groups or open-ended surveys imbued with a greater degree of rigour, they are reached more rapidly, with high reproducibility, and at a much lower cost. Additionally, coupling structured data with an understanding of text data greatly increases the level of insight that can be drawn from the data. Most importantly, the technology that powers these advances is available to, and usable by, almost anyone today.