Kapiche had a question: what’s the best way to help a new user begin to understand the text data they upload? How could we transform a large, complex dataset into a compelling visualization that presents a narrative and conveys the scope of the data?

Did someone say… word cloud?

A word cloud fits the description at first glance: it’s a nifty little visualization that shows the keywords of a textual dataset, with words sized by frequency. You’ve probably seen them before on blogs, in news articles, and on the left side of this article’s header image.

Word clouds only go so far

As Jacob Harris writes, word clouds throw out all of the principles that good data journalism strives for:

A narrative that pares away extraneous information to find a story in the data; context to help the reader understand the basics of the subject; interviewing the data to find its flaws and be sure of our conclusions.

Word clouds probably never set out to do these things, mind you. They are simply a visually appealing and accessible way to display word frequency. How people misuse them is not what we’re here to talk about; that has already been discussed enough.

I spent more time than I’m proud of to make it look this good. Made with Worditout.

Word clouds are the opposite of what we want to achieve at Kapiche: we want our users to be accurate, exploratory, and critical when analyzing their data. We want to give them the tools to understand the whole narrative and the context surrounding what people are talking about. So if word clouds aren’t the solution, what is?

Kapiche Storyboard 👆

Enter the Storyboard

We knew word clouds weren’t enough, so we set out to build something better. In particular, we aimed to:

  • Provide a starting point — a big picture view — for users to orient themselves and begin delving into a dataset;
  • Provide enough context and useful information without overwhelming the viewer;
  • Invite interaction and exploration to promote understanding of a dataset; and
  • Avoid injecting bias via aesthetic choices or design elements.

The Storyboard combines the visually exciting, accessible nature of word clouds with the depth and structure of topic models to display large amounts of complex text data in an easily digestible and informative way.

Breakdown of the Storyboard

Glancing at the above image should give you a general understanding of what is going on in this anonymous dataset, even without being told anything about it. It looks like a concept map, but we’ve made some deliberate design choices to make it easier to understand and more useful.

Let’s break it down:

  • Terms form topics. For example, “staff” is a term, and “friendly, helpful, staff” is a topic. All of the terms are actual words in the dataset: we do not introduce any human bias by categorizing through codebooks or dictionaries, which also means there’s no tedious setup required. It’s all done automatically through unsupervised machine learning algorithms.
  • Topics & terms are colored & sized by frequency, from purple (highest) to yellow (lowest). The size of the circles behind the terms also helps to reflect frequency. You can see in the above image that “friendly, helpful, staff” and “good, quality, products, brands, high” are the highest frequency topics.
  • Topics & terms are positioned by relatedness. For example, in the above dataset we can see that “good, quality, products, brands, high” is related to “price”. By contrast, “price” is not likely to be related to “customer, service”.
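
To make the three encodings above concrete, here is a minimal sketch in Python. It is not Kapiche’s actual method (the Storyboard is built with unsupervised machine learning, and the details are not described here). The toy responses, the stopword list, the helper names (tokenize, color_for, radius_for), the exact hex colors, and the radius range are all assumptions made for illustration. Frequency is approximated by counting how many responses mention a term, and relatedness by counting how often two terms appear in the same response.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy responses; real input would be free-text survey answers.
responses = [
    "friendly staff and helpful staff",
    "helpful staff good price",
    "good quality products at a good price",
    "quality products high price",
]

STOPWORDS = {"and", "at", "a"}  # assumed, deliberately tiny


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


# Frequency: the number of responses that mention each term.
doc_terms = [set(tokenize(r)) for r in responses]
frequency = Counter(term for terms in doc_terms for term in terms)

# Relatedness (a simple stand-in): the number of responses in which
# two terms appear together. Pairs are stored in alphabetical order.
cooccurrence = Counter(
    pair for terms in doc_terms for pair in combinations(sorted(terms), 2)
)

# Assumed endpoint colors for the purple (highest) to yellow (lowest) scale.
PURPLE = (68, 1, 84)
YELLOW = (253, 231, 37)


def scale(value, lo, hi):
    """Map value from [lo, hi] onto [0, 1]; 0 if the range is empty."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def color_for(freq, lo, hi):
    """Interpolate linearly from yellow (lowest) to purple (highest)."""
    t = scale(freq, lo, hi)
    rgb = tuple(round(y + (p - y) * t) for p, y in zip(PURPLE, YELLOW))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def radius_for(freq, lo, hi, min_r=10, max_r=40):
    """Circle radius grows linearly with frequency."""
    return min_r + (max_r - min_r) * scale(freq, lo, hi)


lo, hi = min(frequency.values()), max(frequency.values())
for term, freq in frequency.most_common():
    print(term, freq, color_for(freq, lo, hi), radius_for(freq, lo, hi))
```

In this toy data, “price” appears in the most responses, so it gets the purple end of the scale and the largest circle, while “helpful” and “staff” co-occur often enough that a layout algorithm would place them close together. Positioning itself (for example, a force-directed layout driven by the co-occurrence counts) is left out of the sketch.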

Because color, size, and position each convey a different aspect of the data, our users can navigate the Storyboard very quickly and then use it as a launch point into deeper insights.

Clicking into a topic takes you to a separate screen that shows the full details of that topic, including sentiment, co-occurrences with other topics, topic drivers, NPS data, data coverage stats, and more. This screen also shows you the expanded topic with all of its terms included, which we call the Context Network, shown here:

Context Network showing the full composition of the “friendly, helpful, staff” topic.

One major advantage is that all of this happens within a few minutes of uploading a dataset. You don’t have to wait days or weeks for someone to process the data. Kapiche has put countless hours into figuring out how to generate this automatically from your data. It also means that the size of the data source has little effect on how long generation takes: putting 50,000 survey responses through Kapiche won’t take much longer than 1,000.

Final thoughts

We’ve received a lot of great feedback on the Storyboard, and it has been achieving the goals we set for it with our clients. It’s an excellent starting point to begin understanding the narrative of a dataset, and the interactivity makes it a powerful springboard for delving deeper into interesting points of the data. There are no random aesthetics — every design element has a purpose and conveys something meaningful. I’m sure there are more improvements to come (everything can be improved) as we see how users interact with the product as a whole.