Text Analysis – Harkive Stories

Several people have expressed an interest in how the stories collected by The Harkive Project are analysed, so this post is a quick overview of some of the methods I’ve used. The post is accompanied by a sample data set, some code, and the walkthrough video above. If you would like to perform some analysis of your own, you should be able to replicate the work shown here by adapting the script I’ve provided to your own datasets. However, if you are just interested in what happens to the data that the project gathers, I hope this post will still be useful and interesting.

The central methodological challenge of my research has been to devise a way of making sense of the large collection of texts Harkive has gathered since 2013. Because of my specific interest in the role that digital and data technologies play in the ways in which we experience music, one of the routes I’ve explored is computational processing. Broadly speaking, these are methods associated with what Gary Hall has called the ‘computational turn’ in humanities research, which he describes as “the process whereby techniques and methodologies drawn from computer science and related fields […] are used to create new ways of approaching and understanding texts in the humanities”. You may have come across terms such as Digital Humanities or Cultural Analytics – these are ways of describing the kind of academic work that Hall is talking about. From a media and cultural studies perspective, which is where I am located within BCMCR, digital and data technologies are of great interest because of their relationship to the ways in which cultural goods – which of course includes popular music – are produced, distributed and consumed. Ultimately this means that they are very much a part of the cultures that are associated with those goods, and as such an understanding of those cultures means (in part) getting to grips with these technologies.

One way to approach this is via an observation from David Berry (2011), who points out that in order ‘to mediate an object, a digital or computational device requires that this object be translated into the digital code that it can understand’. I’m interested in what happens during that process of translation, and through practice-based research (and the Harkive project) I’m attempting to engage with the processes involved when real-world experiences are abstracted into data points and analysis, and with how this in turn plays a role in those real-world experiences.

We can think here, for instance, of algorithmic recommendation services, the ways in which we use them, and how these in turn influence (or not) the music we hear. To a greater or lesser extent we each have an everyday relationship with data technologies, yet we don’t fully understand them, how they work, or what the potential consequences of our use of them may be. The aim of my research is to begin building an understanding of the relationship between computational technologies and our experiences of popular music. In the specific terms of Harkive, this becomes a question of what happens when a person experiencing music uses an online interface to describe that experience, which in turn creates a set of data points that can be processed and ultimately used to help ‘produce’ a form of knowledge in terms of research findings. Clearly there are a huge number of steps involved here (abstractions, reductions, assumptions, and so on), each of which raises questions about how we use digital technologies in our everyday lives, and also about how we as researchers may approach this. What follows, then, is a very quick overview of how the descriptions of real-world experiences collected by the Harkive project get processed, and how from a single line of text – a tweet – a large number of numeric and categorical abstractions can be created.

The data

For the purposes of this overview I have created a sample data set of 50 tweets gathered on 25th July 2017. This is a comparatively small data set, so the analysis presented below is not intended to demonstrate any solid findings. Rather, the data is being used here to illustrate a small number of computational processes.

The original data set contains four variables: unique numbers for both the stories and the users, the text of each tweet, and the time each tweet was sent. The process below will take those four variables as a starting point and create around 30 new variables that can be used to explore and visualise the stories.
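As a minimal sketch, the sample data might be loaded as follows. The file name and column names here are illustrative assumptions, not necessarily those of the actual data set:

```r
# Load the sample data set (file and column names are illustrative assumptions)
tweets <- read.csv("harkive_sample.csv", stringsAsFactors = FALSE)

# Expected structure: a story id, a user id, the tweet text, and a timestamp
str(tweets)
```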

Creating and visualising additional variables

In order to demonstrate what I mean by additional variables, the first part of the R script performs some simple calculations and data tidying. Counting the number of characters and words in each tweet produces the visualisations below, in which we can see some differences between the 50 tweets in terms of the number of words and characters within each.
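A sketch of this step, assuming the data frame and column names from the loading example above (char_count and word_count are the new variables being created):

```r
library(ggplot2)
library(stringr)

# New variables: character and word counts for each tweet
tweets$char_count <- nchar(tweets$text)
tweets$word_count <- str_count(tweets$text, "\\S+")

# Bar chart of character counts, one bar per tweet
ggplot(tweets, aes(x = seq_along(char_count), y = char_count)) +
  geom_col() +
  labs(x = "Tweet", y = "Characters")
```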

NB: 3 tweets appear to be over the 140-character limit for Twitter. As shown in the video, this is because they are either replies to multiple accounts, or else contain images.

[Figure: character counts per tweet]

[Figure: word counts per tweet]

Creating and exploring a Document Term Matrix

The first stage of the analysis proper is to prepare the text within the tweets for processing. This includes the following steps (a code sketch follows the list):

  • Removal of all punctuation and other extraneous characters (e.g. @, #, //)
  • Removal of words that occur with very high frequency in written text, commonly known as ‘stopwords’. (e.g. the, it, at, were)
  • All text is converted to lower case (i.e. to avoid the counting of ‘Vinyl’ and ‘vinyl’ as separate entities)
  • Removal of ‘whitespace’, such as that which occurs between paragraphs
  • All words are ‘stemmed’ to their roots (i.e. to avoid ‘played’, ‘play’, ‘player’ and other derivations being counted as separate entries)
  • Removal of specific additional stopwords that occurred with very high frequency in this particular dataset – for example: ‘harkive’, ‘music’
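A sketch of these steps using the tm package; the additional stopwords are the ones mentioned above, and lowercasing is applied first so that stopword matching works as expected:

```r
library(tm)
library(SnowballC)  # supplies the stemmer used by stemDocument

# Build a corpus from the tweet text and apply the cleaning steps listed above
corpus <- VCorpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))   # lower case
corpus <- tm_map(corpus, removePunctuation)              # strip @, #, // etc.
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # common stopwords
corpus <- tm_map(corpus, removeWords, c("harkive", "music")) # dataset-specific stopwords
corpus <- tm_map(corpus, stripWhitespace)                # remove extra whitespace
corpus <- tm_map(corpus, stemDocument)                   # stem words to their roots
```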

Following that, a document term matrix is created. This represents each word within the corpus along one axis, and each document within the corpus along the other. Each cell contains the number of times a unique word from the corpus appears in a given document. This enables the words within the matrix to be counted and visualised. If certain words appear at this point that are not required, or which may skew the analysis, they can be removed by adding them to the list of stopwords and repeating the process of creating the document term matrix. Here we can see that the following words appear frequently within the dataset.
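Continuing the sketch, the matrix can be built and the frequent terms visualised; the wordcloud package is one common option here, though any frequency plot would do:

```r
library(wordcloud)

# Create the document term matrix: one row per tweet, one column per term
dtm <- DocumentTermMatrix(corpus)

# Word frequencies across the whole corpus, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 10)

# Visualise the most frequent terms
wordcloud(names(freq), freq, min.freq = 2, random.order = FALSE)
```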

[Figure: word cloud of frequent terms]

Topic modelling

David Blei defines Topic Modelling as a process that ‘provides a suite of algorithms to discover hidden thematic structure in large collections of texts. The results of topic modeling algorithms can be used to summarize, visualize, explore, and theorize about a corpus’. Topics can be understood as recurring data points (in this case, words) across a dataset (a corpus of text documents). The model, meanwhile, represents the extent to which each individual entry in the dataset (the tweets) contains those data points (topics/words). For a more detailed overview of using Topic Modelling, see Kailash Awati’s excellent post, from which my own script is derived, or read David Blei’s overview of the process.
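A sketch of the topic modelling step using the topicmodels package, broadly following the approach in Awati’s post; k = 3 and the sampler settings are illustrative choices rather than fixed parts of the method:

```r
library(topicmodels)

# Drop any documents left empty by the cleaning steps, which LDA cannot handle
row_totals <- apply(dtm, 1, sum)
dtm_lda <- dtm[row_totals > 0, ]

# Fit a 3-topic model with Gibbs sampling (k and the seed are illustrative)
lda_model <- LDA(dtm_lda, k = 3, method = "Gibbs",
                 control = list(seed = 1234, burnin = 1000, iter = 2000))

# Top 5 terms per topic, and the most likely topic for each tweet
terms(lda_model, 5)
topics(lda_model)
```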

Because the data set in this particular instance is small, the results will not be too instructive, but by setting the process to organise the documents into 3 topics we get the following results. Here are the top 5 words associated with each topic.

TOPIC 1: bbcmusic; nowplaying; spotify; perfect; piano

TOPIC 2: radio; begin; home; play; alarm

TOPIC 3: bus; morning; listen; start; ace

In terms of how this looks across the whole dataset, we can see that there is a fairly even split between topics. This process has also produced several new numeric and categorical variables that can be used at a later stage.

[Figure: topic allocation across the dataset]

A closer look at the numbers involved here, however, reveals that the differences between documents, and thus their alignment with discrete topics, are more subtle than the overview suggests. Topic modelling is a process that assumes documents within a corpus exhibit similarities to all topics in varying degrees. The differences between documents and their relationships to topics are often marginal, suggesting that further enquiry is necessary before drawing conclusions based on topic allocation. Nevertheless, the process is a useful step in helping to think about the themes within a large collection of documents, particularly as it helps reveal associations between groups of words that may not necessarily be apparent through a manual reading of the texts.

Sentiment analysis

Sentiment Analysis has been described by Bing Liu as the ‘computational study of opinions, sentiments and emotions expressed in text’. This process searches documents for the appearance of certain words that are individually scored, producing an overall value that marks a document as exhibiting either a positive, negative or neutral sentiment. This produces numeric scores based on the text, enabling individual documents to be grouped together according to numeric similarities, differences and statistical relationships. This part of the analysis is based on Julia Silge’s work on her own tweets. For further reading and a more critical view I would also suggest Annie Swafford’s work.
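One way to reproduce this step is with the NRC lexicon via the syuzhet package, which is the approach Silge’s post uses; treat this as a sketch rather than the exact script:

```r
library(syuzhet)

# NRC sentiment scores: eight emotions plus positive/negative, one row per tweet
nrc <- get_nrc_sentiment(tweets$text)

# Overall sentiment score per tweet: positive minus negative word counts
tweets$sentiment <- nrc$positive - nrc$negative

# Inspect the tweets flagged as containing anger
tweets$text[nrc$anger > 0]
```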

In the case of the 50 tweets under examination here, the following visualisation is produced. As discussed in the video, we can see that three tweets have been marked as containing anger. This highlights certain limitations discussed by Annie Swafford in terms of the ability of sentiment analysis libraries to deal with nuanced issues such as sarcasm and the use of certain words in different contexts. In the case of one of the tweets designated as ‘angry’, the respondent named a song by DJ Shadow called ‘Horror Show’ – it would seem that the word ‘horror’ is responsible for the angry rating, when in actual fact the tweet (to my reading, at least) was anything but. As with Topic Modelling, the results of Sentiment Analysis need to be considered alongside a closer, manual engagement with the texts.

[Figure: sentiment scores across the 50 tweets]

As with Topic Modelling above, Sentiment Analysis also produces several new variables. We can now use these alongside the other additional variables to produce some further visualisations.

Combining variables

The additional variables created by both the Topic Modelling and Sentiment Analysis processes, along with the variables related to character counts and time, can be used in combination to explore the corpus a little further. For example, the following visualisation shows the relationship between topic allocation and sentiment. We can see that Topic 1 contains a higher proportion of positively scored texts, whereas Topic 2 contains a higher proportion of negatively scored texts.
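A sketch of one such combination, assuming the topic and sentiment variables created in the earlier examples have been joined back onto the tweets data frame:

```r
# Join the topic allocation back onto the tweets (NA for any empty documents
# dropped before modelling), then plot sentiment against topic
tweets$topic <- NA
tweets$topic[row_totals > 0] <- topics(lda_model)

ggplot(tweets, aes(x = factor(topic), y = sentiment)) +
  geom_jitter(width = 0.2) +
  labs(x = "Topic", y = "Sentiment score")
```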

[Figure: topic allocation and sentiment]

In the next, we can see Topic and Sentiment scores combined with the time at which each tweet was made. Those closer to 8am appear on the left-hand side, moving towards 8.30am on the right-hand side.

[Figure: topic and sentiment over time]

We can recall from the topic allocation that the words Spotify and Radio appeared frequently, so we may want to compare the results of the analysis in these terms. In the visualisations below we can see different combinations of analysis based on stories that contain the words Radio and Spotify.

[Figures: topic and sentiment for stories containing ‘Radio’ and ‘Spotify’]

Correlations

Another potentially interesting thing to do once we have generated additional numeric variables is to see whether there is any statistical relationship between them. Again, with such a small data set we would not expect to see anything of great significance, but in order to demonstrate the process here is a correlation matrix based on the additional variables created.
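As a sketch, assuming the variables created in the examples above, the matrix can be computed with cor() and drawn with the corrplot package:

```r
library(corrplot)

# Correlation matrix across the new numeric variables (counts, overall
# sentiment, and the NRC emotion scores)
num_vars <- cbind(tweets[, c("char_count", "word_count", "sentiment")], nrc)
cor_mat <- cor(num_vars, use = "pairwise.complete.obs")

# Visualise the matrix; circle size and colour encode correlation strength
corrplot(cor_mat, method = "circle", tl.cex = 0.7)
```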

[Figure: correlation matrix of the additional variables]

Discussion

The abstraction of complex, real-world activity into data points on the one hand makes analysing large collections of texts more manageable, but on the other it can often produce misleading results – think, for example, of the ‘angry’ tweets above. This means that we should always question both the process and its results.

Using computational processes to analyse and explore text-based corpora is an interesting route, but it is not without significant issues. In the first instance, and speaking from my own experience, learning how to perform analysis of this kind is tricky and time-consuming, particularly when – as in my case – the researcher does not come from a computational background. I have learned (and am still learning) how these processes work. This, I think, is representative of a wider question in the humanities when it comes to work of this kind: scholars are attracted to the affordances of large datasets and computational techniques through their increasing availability and falling barriers to entry, but are simultaneously ill-equipped to use them adequately, fully understand the results, or usefully explain the nuts and bolts of the methods of analysis once such techniques are deployed. Through sharing these works-in-progress I am attempting to contribute to Sandvig and Hargittai’s recent call for academics to share the details of their ‘messy’ benchwork following attempts to put such techniques to use. The desired outcome, they say, is a space where ‘researchers can reveal the messy details of what they are actually doing, aiming towards mutual reflection, creativity, and learning that advances the state of the art’.

Resources

The post is accompanied by a sample data set and R script. If you would like to replicate the work shown here, you will need both of those files, along with an installation of R and the packages used in the examples above.

Selected bibliography

Berry, D.M., 2011. The computational turn: Thinking about the digital humanities. Cult. Mach. 12, 2.

Blei, D.M., 2012. Topic modeling and digital humanities. J. Digit. Humanit. 2, 8–11.

Hall, G., 2013. Toward a postdigital humanities: Cultural analytics and the computational turn to data-driven scholarship. Am. Lit. 85, 781–809.

Liu, B., 2010. Sentiment Analysis and Subjectivity. Handb. Nat. Lang. Process. 2, 627–666.

Sandvig, C., Hargittai, E., 2015. How to Think about Digital Research. Digit. Res. Confid. Secrets Stud. Behav. Online 1.