I recently booked my ticket to attend the London Big Data Week conference, which takes place on 25th November. I thought this would be a good opportunity to discuss some of the reasons why I’m attending, and at the same time provide an update on how things are progressing with The Harkive Project, and with the PhD research that builds upon it.
As with a lot of things around the development of Harkive, it is often my own personal experiences as a music listener and fan that kickstart ideas and directions for my research, and I’ll discuss one such experience below.
The initial impetus for Harkive, in fact, came from a personal experience. Harkive’s attempt to document our collective experience with music in the Digital Age came from a frustration at my own failing memory. At some point in 2012 I had struggled to remember how I’d been listening to music back in 2004, 2005 and 2006; I couldn’t accurately recall the devices and services I had been using at the time. This, I reasoned, was partly because the landscape of listening has changed so much, so quickly, and Harkive was my attempt to capture people’s experiences and memories that may, like mine, have otherwise been lost as new practices, habits and ways of listening quickly replaced others.
We perhaps forget sometimes that the pace of change in popular music is and has been rapid. Many of the digital services we now take for granted are only around a decade old, but they have collectively helped create a very different landscape of possibilities for listening to that which existed in the late 20th Century. That is not to say that ‘old’ listening methods, such as Radio, Vinyl and CDs, have disappeared, but rather that they have been joined by many new ways of listening, and one of the things that I find fascinating about my research is that this landscape is still changing rapidly. New services continue to emerge, our legacy practices still find their place, and the 21st Century newcomers are still evolving. It’s a heady and interesting mix. One such relative newcomer is Spotify, and it is one of my own recent experiences with that service that is discussed below, because the experience helped me begin to think about some of the important issues around Harkive and my broader project.
One major issue is this: A huge number of people have kindly told their music listening stories to Harkive since it first ran in 2013, and a central challenge of my PhD will be to devise a way of understanding, organising and analysing what is a complex and rich dataset. However, whilst it is a big data set, it is not ‘Big Data’, certainly not in the sense that I’m beginning to understand what ‘Big Data’ is. Yet, as I’ll discuss below, some of the experiences that the Harkive data records can be seen as having been influenced by technical and/or business practices that can be understood in terms of ‘Big Data’ (for instance, a playlist recommendation, or a pop-up ad, or a link in a social media feed), or else were originally posted in environments where ‘Big Data’ practices are built in to the interfaces and technical infrastructure of the services concerned (a Tweet, or a Facebook post, for example). The issue of what to do with the Harkive dataset boils down to this: I’m a reasonably tech-savvy media scholar, not a data scientist or a coder, yet I need to find a way of making sense of the Harkive data that takes into consideration ideas around Big Data.
I’ve spent a lot of my research time over the last few months getting to grips with an emerging body of academic literature on the subject, and my attendance at the London Big Data Week conference is an attempt to develop that understanding further by seeing new commercial and technical developments in the field. I will be interested to see how some of the same issues academics are currently wrestling with, such as data protection, use, ethics, monetisation, ownership, access, surveillance, storage, and archiving, are addressed by conference speakers in what promises to be an interesting day.
The personal experience, described below, along with my engagement with commercial and academic writing, has helped me begin to crystallise some of my thoughts and provide me with what I hope is a productive way forward in addressing the issues with my research project. There are some very ‘Big Questions’ being asked about Big Data, and by looking at the way in which we listen to popular music, particularly in digital environments, I hope I can make some sort of contribution towards helping to answer some of them.
Did Algorithms (Nearly) Make Me Cry?
I am a Spotify subscriber and have recently been enjoying their new Discover Weekly playlist service. For those of you who are not aware of it, the Discover Weekly playlist is an automatically generated, personalised playlist of around 30 songs that updates weekly for each subscriber. In a Spotify press release from July 2015 it was described by the company as ‘two hours of custom-made music recommendations, tailored specifically to you and delivered as a unique Spotify playlist. It’s like having your best friend make you a personalised mixtape every single week.’
My own Discover Weekly playlist recently included the song ‘Sexuality’ by Billy Bragg, which led me to follow the links in Spotify’s interface and play the 1992 album from which it came, ‘Don’t Try This At Home’. Although I would broadly describe myself as being fond of Billy Bragg as an artist and political activist, it is also the case that, apart from the records he made in the late 1990s and early 2000s with Wilco, which featured new songs based on Woody Guthrie lyrics, I had not really paid too much attention to Bragg’s recent work. In fact, ‘Don’t Try This At Home’ was the last of his albums I had properly paid attention to. Even so, at a rough estimate, it was probably 15 or 20 years since I had last listened to it. The fact that Sexuality appeared in my Discover Weekly playlist in the first place is concrete evidence, as far as my understanding of the service extends, that not only had I never played the song in my 7 years of using the service (this being an algorithmic pre-requisite of the system suggesting the song in the first place), but that I had also not played many songs by Billy Bragg, hence Spotify’s algorithmic attempt to steer me in his direction.
Listening to the whole LP sent me almost directly back to 1992, when I was 18 years of age and worked in a record shop. I recalled buying an expensive, limited-edition version of the album (thanks to my dealer price staff discount) on a strange format: all 16 songs spread across 8 different 7” singles, which came packaged together in a nice box. Even at the time it was an unwieldy format for listening to an album, so I had also borrowed a CD copy from the shop (as we were unofficially allowed to do) and taped it at home. Most of my listening to the album had been on that copied tape version, which I still have, along with the box-set that remains largely unplayed to this day. I also began to recall other things related to Billy Bragg and to my time in the record shop, and fondly remembered going to a Bragg gig in Birmingham with my then-colleagues on the eve of the 1992 General Election. During the gig Bragg was joined onstage by the Labour MP, Roy Hattersley, and generated a wave of hope and optimism in the room that evaporated the very next day when, as history records, the election was lost.
As well as a wave of memories and nostalgia, however, I also approached the record from the vantage point of the present day, as a 41-year-old man, a husband, a father of young children. As well as the uplifting Sexuality and a few wry, lost-love songs (“I saw them in the hardware store. He looked boring and she looked bored”), there were also several other songs on the record that I found extremely sad. In particular the song Tank Park Salute – about the death of a father (presumably Bragg’s) – hit me pretty hard as I rode on the packed 7.52 train into Birmingham New Street. So much so, in fact, that I found myself close to tears and had to try very hard not to embarrass myself in front of a carriage of complete strangers.
Discussing the incident with friends on Facebook later that morning, it seems Tank Park Salute has elicited similar reactions from others over the years. Some, of course, took the opportunity to say that hearing Billy Bragg also made them cry, but for different reasons!
What is interesting about this incident, however, is that my wave of nostalgia and narrowly averted public emotional collapse were, in part at least, prompted by an algorithm. Had Spotify Discover Weekly not provided Sexuality in my list that day, I would not have been sent back to Don’t Try This At Home, and subsequently to my emotional response to Tank Park Salute. Of course, the exact same thing could have happened had Sexuality been played on the radio, or if it had been mentioned in my Twitter or Facebook feed, but this leads to a second interesting point about the incident: it wasn’t played or mentioned in a public forum. It was situated instead in a ‘personalised’ playlist, and one that had been generated for me by a machine performing analysis on data about me and many others. In that sense, it was quite unlike a shared experience, or a public one; it was a private and personalised one, albeit one based on the public broadcast of my own and many others’ listening habits and taste.
I began to think about the extent to which Big Data is ‘producing’ our experience in online environments, and how that might spill over into our experience of popular music more generally, in a similar way as it had done once my experience became a conversation with others. I began to realise that I needed a way to understand these potential effects.
Understanding Big Data
As mentioned above, there is a great deal of work currently being undertaken by scholars in a number of fields, and I’m attempting to engage with as much of that as I can. This engagement will develop into a piece of writing in due course, which will reference many of the papers that have informed my thinking. In the meantime, however, I’ve provided acknowledgements at the bottom of this post to the various academics and writers whose work has been so valuable. Based on my initial reading I’ve sketched out the beginnings of a model that I hope will help broaden my understanding – as a reasonably tech-savvy media scholar who isn’t a data scientist, or a coder – of what Big Data technologies, practices and business models look like, and how these may manifest themselves in the field of popular music.
By way of briefly explaining the model, I will map Spotify’s Discover Weekly service on to it. By illustrating what is perhaps quite a dry, academic work-in-progress with a real-world service, I hope you might find it interesting and informative.
A PDF version of this image is available here. Please note: along with being a reasonably tech-savvy media scholar who is not a data scientist, I am also not a designer. My apologies for the cumbersome nature of the image.
The model attempts to break ‘Big Data’ down into a set of components, which each contribute to a cyclical process:
1: Data is generated internally, usually through the public Interface of the service provider, and is also acquired from 3rd parties, either via Open APIs or commercial deals. Internal data will commonly take the form of account and demographic information about users, but will also include activity logs. Data about media content and other assets held by the service provider will also be collected. External data may include information that can be gathered from social networks, search engines, and elsewhere.
2: Data is Categorised according to the needs of the service provider, placing certain users/sets of users in groups based on demographic and activity data. Categorisation will also take place in order to organise content and other assets.
3: Algorithms are generated according to the business needs of the service provider, based in part on the type of data and how it is categorised, and will attempt to extract information salient to the generation of a competitive advantage.
4: Results of algorithmic processing produce knowledge about both consumers and content in the form of Analysis, and also additional, useful data in the form of ‘Exhaust Data’, which may re-enter the process at Stage 1. Knowledge produced at this stage may be deployed internally, or may be made available commercially.
5: The results of Analysis are deployed via Interface design, which in turn creates more user data, some of which is made available to 3rd parties through APIs or commercial deals.
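For readers who prefer code to diagrams, the five stages above can be sketched in a few lines of Python. This is a toy illustration only: the function names, the (user, track) event format and the play-counting ‘algorithm’ are all assumptions invented here for the purpose, and do not describe how any real service works.

```python
from collections import Counter


def generate(interface_events, third_party_events):
    """Stage 1: pool internally generated and externally acquired data."""
    return list(interface_events) + list(third_party_events)


def categorise(events):
    """Stage 2: group (user, track) play events by user."""
    groups = {}
    for user, track in events:
        groups.setdefault(user, []).append(track)
    return groups


def run_algorithm(groups):
    """Stage 3: a deliberately trivial 'algorithm' that counts plays."""
    return {user: Counter(tracks) for user, tracks in groups.items()}


def analyse(counts):
    """Stage 4: derive knowledge (most played track) plus exhaust data."""
    analysis = {user: c.most_common(1)[0][0] for user, c in counts.items()}
    # Exhaust data: new records produced as a by-product of the analysis.
    exhaust = [(user, track) for user, track in analysis.items()]
    return analysis, exhaust


def deploy(analysis):
    """Stage 5: the interface surfaces the result as a prompt to each user."""
    return {user: f"Play {track} again?" for user, track in analysis.items()}


def run_cycle(interface_events, third_party_events):
    """One pass through the cycle; exhaust may re-enter at Stage 1 next time."""
    events = generate(interface_events, third_party_events)
    analysis, exhaust = analyse(run_algorithm(categorise(events)))
    return deploy(analysis), exhaust
```

Calling run_cycle with a handful of play events returns the prompts the ‘interface’ would show and the exhaust records that could be pooled with fresh events on the next pass, which is the cyclical element the model tries to capture.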
The activity around this basic, simplified cyclical model takes place in either the Public or the Private realm, depending on the commercially sensitive nature of data, processes or interfaces. In the Private realm, some data, processes or interfaces will reside Internally, whilst others will be made available privately to External organisations, through commercial deals and partnerships.
In the Public realm, certain data, and the knowledge gleaned from processing it, are made available via Interfaces. Such interfaces may take the form of the front-end User experience, parts of which provide basic analytical tools built in to that experience, or of APIs. There is a degree of ‘user knowledge’ required for the access, use and processing of certain data, which I have attempted to represent by making the distinction between Users (who will have varying degrees of competence) and Developers (who likewise will have varying degrees of skill).
A common factor across the Public and Private realms is Data Visualisation, which I have included here as a component. In the Public realm it attempts to cover items such as publicly available interfaces, analytical tools, and 3rd party creations based on data gleaned from Open APIs; in the Private realm it covers Internal and Externally available back-end interfaces, reports and analytical tools.
The concept of ‘Exhaust’ data is included here to attempt to explain two things: 1) new or reconfigured data generated by the algorithmic/analytical process; and 2) data derived from 3rd Party services, either commercially or through APIs.
This very basic model is a work-in-progress, and I expect it to evolve over the coming months to account for any holes in my knowledge and understanding (and please do feel free to point any of those out to me!), but as a starting point it has proved useful to me in terms of unpacking the large and somewhat nebulous subject of Big Data. It is allowing me, for instance, to separately formulate useful questions for each stage of the cyclical process: What is a unit of data? How is it constructed? Does the nature of it, and its relationship to other units, change depending on the category in which it is placed? How are algorithms constructed, and to what ends? What form does the knowledge produced through a process of analysis take, and does this change the nature of that knowledge? Who has access to it? How does algorithmically-generated knowledge inform interface design, and – coming full circle – does this affect the type of data interfaces are able to produce?
I’m grateful to my BCU colleague, Paul Bradshaw, for some initial feedback on this model, which will help inform the next iteration. He points out that an element of categorisation is inherent in interfaces, and he is correct. A very simple example here would be a Yes/No question on a website, which immediately places data collected into a certain category. He also points out that each of the five elements in my model has a bearing on the others. Interfaces, for example, can be algorithmically-generated as well as human-designed. A basic example here would be different content being delivered to a website based on the profile of the visitor. I will continue to work on the model over the coming months but, as stated, as a starting point it is proving useful.
As is usually the case, however, I’m left with more questions than answers (and more questions than I started with!), but I nevertheless feel that progress is being made. To conclude this piece, then, I shall attempt to map Spotify’s Discover Weekly service on to the model above, and in doing so I would like to acknowledge the work of journalists John Paul Titlow and Ben Popper. Their articles, based on interviews with Spotify employees, provided valuable insight into the manner in which the service works.
Spotify’s Discover Weekly Playlist
Launched in July 2015, Spotify’s Discover Weekly playlist offers each user a unique, 30-song playlist that is updated each Monday morning. I’ve been enjoying using the service and have found my way towards some interesting songs and artists as a consequence, and having access to the rest of the Spotify catalogue makes it easy to disappear down rabbit holes of discovery. I’ve also been impressed with how the weekly playlists fit together – songs from artists in different genres bump up against each other in interesting ways, and so the playlist functions quite well as a standalone mix.
It’s not perfect, granted. There are things the service recommends to me that I don’t like (although this is quite rare, so far), and there are often things that I’m already aware of, because I have many, many years of listening under my belt that happened outside of my Spotify use: records I have on vinyl or CD, or know from radio play, for instance. These are minor gripes, however, in what is otherwise a genuinely impressive service, and one that will surely develop more sophisticated methods of recommendation over time. Using the model above, I’d like to explore how it works.
In terms of Data, there are some basic principles to understand. The first is that, as it is for each of its millions of users, my relationship with Spotify’s catalogue can be divided neatly into two parts: songs I’ve listened to, and songs I haven’t listened to. This information has been gathered over the 7 or 8 years I’ve been using the service. Further to that, there have been over 2Bn user-generated playlists, within which users, including myself, have placed songs and artists in a myriad of new configurations and contexts that give meaning to songs way beyond their generic conventions: For a certain user, creating a playlist at a certain time, a Bob Dylan song may sound great following a song by Kraftwerk, for example.
The listening habits, playlists and the catalogue itself have been Categorised in a number of ways. Songs and artists are understood – again, in very basic terms – according to genres, and each user has an ‘affinity score’ with artists and genres based on their past listening. Algorithmic processing can suggest, for instance, that a perceived preference for Artist A would mean a likely potential affinity with the work of Artist B.
Further, algorithmically-driven Analysis will find playlists containing songs and artists you have already demonstrated a liking for, along with those that you may have a potential affinity for, and cross-reference this information with the basic split between ‘have listened to’/‘have not listened to’. The manner in which other users have ordered their playlists informs the order in which the recommended songs are placed in the individual’s Discover Weekly playlist, which provides the crucial element of ‘flow’ to the list of songs.
Finally, this information is presented via the Spotify Interface in the form of the playlist. This brings us full circle to the point of Data, as each user will generate new information based on their engagement with the list of songs: some will be skipped, or partially played, and others will lead users down further rabbit holes. This data informs subsequent playlist generation.
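To make the overview above a little more concrete, here is a minimal sketch in Python of the core idea as I understand it: take the songs a user has listened to, find other users’ playlists that contain some of them, and suggest the songs in those playlists that the user has not yet heard. The function name, the data shapes and the simple overlap score are my own assumptions for illustration; the sketch leaves out genres, affinity scores and playlist ‘flow’ entirely, and it should not be read as Spotify’s actual method.

```python
from collections import Counter


def recommend(listened, playlists, size=30):
    """Suggest unheard songs that share playlists with songs already heard.

    listened:  set of song names the user has played (hypothetical data).
    playlists: list of playlists, each a list of song names.
    """
    scores = Counter()
    for playlist in playlists:
        # How many songs in this playlist has the user already played?
        overlap = sum(1 for song in playlist if song in listened)
        if overlap == 0:
            continue  # nothing ties this playlist to the user's taste
        for song in playlist:
            if song not in listened:
                scores[song] += overlap
    # Highest score first; ties broken alphabetically to keep output stable.
    ranked = sorted(scores, key=lambda song: (-scores[song], song))
    return ranked[:size]
```

Each time the user plays or skips one of the suggested songs, the ‘listened’ set changes, so running the same function a week later would produce a different list. That is the feedback loop described above, in miniature.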
In a sense, then, and bearing in mind that it is a very large assumption that I’m correct in this overview of how Discover Weekly works, we can see how the cyclical nature of human engagement with data-driven systems such as Spotify’s helps facilitate what is commonly referred to as ‘machine-learning’. Such interfaces not only reflect back at us our own taste, but also the aggregated taste of many others, and represent a considerable step forward from the ‘Customers who bought X also bought Y’ recommendation systems in evidence a few years ago. These are more sophisticated and capable of adding more nuance, so that the recommendation is more akin to something resembling: ‘Customers who bought X also bought Y, but only at this time of day, after experiencing a certain kind of weather, and in a particular location (and so on)’. Rather than a linear recommendation, it is instead multi-directional, multi-dimensional, and potentially capable of understanding complex contextual variables that could mean the difference between a song sounding great in one setting, and awful in another.
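The difference between the two kinds of recommendation can be shown with a small sketch. Here the familiar ‘also bought’ count is simply keyed on a context label as well as on the item, so the same song can lead to different suggestions in different settings. The session format, the context labels and both function names are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter, defaultdict


def build_index(sessions):
    """Count which items appear together, separately for each context.

    sessions: list of (context, items) pairs, e.g. ("morning", ["x", "y"]).
    """
    index = defaultdict(Counter)
    for context, items in sessions:
        for item in items:
            for other in items:
                if other != item:
                    index[(item, context)][other] += 1
    return index


def also_liked(index, item, context):
    """Items most often seen alongside `item` in the given context."""
    counts = index.get((item, context), Counter())
    return sorted(counts, key=lambda other: (-counts[other], other))
```

A real system would presumably weigh many such variables at once rather than looking up a single label, but even this crude version shows how context changes the answer.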
Whilst this is, on the one hand, very exciting and will be an interesting thing to see develop over the coming years, it is also potentially problematic. Similar systems to that described above are increasingly being employed in a host of places beyond that of popular music, from financial services, to oil exploration, to healthcare, and the aggregation of data from diverse platforms is a key component of data-driven business models. This is what is leading academics and commercial entities alike to variously raise the questions, as mentioned above, around data protection, use, monetisation, ownership, access, surveillance, storage, and archiving.
Thus far, Harkive has gathered stories from people that cover a huge range of different, individual listening experiences. Some of these have involved online interfaces, such as Spotify, and others have involved ‘old’ listening methods, such as vinyl and CD. Others still have detailed listening that occurs when walking down the street, or of songs ‘playing’ in people’s heads, conjured up by memory. Most of the stories involve various combinations of the above.
I’m becoming increasingly interested in how the popular music industries make sense of those experiences, and how much the data collection we are all subject to in our daily life (in and outside of popular music listening) is helping to produce our experience as listeners. One of the ways I hope to explore that will be through the Harkive interfaces currently in development. The Data Explorer and the Harkive Platform are, at present, very basic interfaces that enable some simple capture and search functions, but I hope we can develop these into something a little more sophisticated and engaging over the coming months. The development of these will be informed by a line of enquiry that I’m in the very early stages of following, but I hope that eventually it may be able to provide some interesting questions, provocations and – perhaps – some answers.
Get in touch
If you’ve found this blog post useful, problematic, or even horrendously wide of the mark, please do feel free to get in touch – email@example.com. Similarly, if you’d like to know more about Harkive and think we may be able to collaborate, drop me a line.
Thanks for reading.
As well as the work of Ben Popper and John Paul Titlow, and my colleague Paul Bradshaw, each thanked above, the work of the following academics has proved extremely useful in the creation of this blog post. Their work will be fully referenced in a paper I am currently drafting, which will be available upon request later in 2015/early 2016:
Mike Ananny, Mark Andrejevic, William Housley, Rob Procter, Adam Edwards, Peter Burnap, Matthew Williams, Luke Sloan, Omer Rana, Jeffrey Morgan, Alex Voss, Anita Greenhill, Jimmy Lin, Dmitriy Ryaboy, Philipp Max Hartmann, Mohamed Zaki, Niels Feldmann, Andy Neely, danah boyd, Kate Crawford, Lawrence Busch, Nick Couldry, Joseph Turow, Kate Miltner, Mary L. Gray, Wei Fan, Albert Bifet, Rob Kitchin, Lev Manovich, Xavier Amatriain, Dawn Nafus, Jamie Sherman, Glenn Parry, Ferran Vendrell-Herrero, Oscar F. Bustinza, Cornelius Puschmann, Jean Burgess, Jim Thatcher, Bernhard Rieder.