In this blog post I’m going to attempt to explain the process I use to collect data for The Harkive Project, and how collecting it in the manner that I do considerably helps reduce the amount of time required in organising and cleaning data ahead of analysis. The resulting database created by this process is organised according to the principles of Tidy Data, principles that are extremely useful when using the R package (as I am) as the primary means of data analysis. Hopefully you may find this post and the accompanying video useful if you are considering using social media and/or other digital data in your own research projects.
It should be noted at the outset that Harkive as a project has specific needs in terms of data collection, and the process I will describe has been devised with those in mind. This necessarily means that the process I use may not be entirely replicable for your own needs, so I will instead attempt to explain in terms of general principles and tasks so that it may be useful to you with some minor adjustments.
Before I begin, and for those of you reading this who may be unaware of The Harkive Project, I will provide a little context and background that may be useful. Following an explanation of the process, I have included some general discussion about the process and its limitations.
Harkive is an online research project that seeks to gather information from people about the detail of their engagement with popular music on a single day each year. Since 2013 the project has operated on a day in July, and participants are invited to tell the ‘story’ of their music listening day by either posting to social media platforms using the #harkive hashtag, emailing the project directly, or by completing an online form. When the project ran in the years 2013-2015, I used a combination of different techniques to get the data from various sources that resulted in several different databases, each organised according to different schema. I was particularly grateful to Martin Hawksey and his work on TAGS, which enabled me to gather data from Twitter. By using GoogleDocs I was able to create and embed a simple Form on the project website to capture ‘stories’ there, as well as through emails sent to the project. For most other services, I used a combination of IFTTT recipes.
Whilst these separate collection techniques all worked well in isolation, it left me with data in several separate spreadsheets, each of which were organised according to different schema. This meant that a period of data sorting and cleaning was required to get data from different sources into a single spreadsheet before analysis could begin. This was time-consuming and error-prone. Following discussions with my BCU colleague, Nick Moreton, we began to investigate the Zapier service as a means by which to collect data in a manner that also sorted and organised it, thus making the entire process more efficient
Here is a video that walks through the process described below.
General Overview of Process
As with a number of other services, Zapier will enable you to connect to the APIs of a number of different 3rd party online platforms, including Twitter, Tumblr, Instagram, and then collect data from these based on a specified searches, or other conditions. Zapier can also be used to extract data from emails, forms and other online media.
In order to do this via Zapier you create a separate ‘Zap’ for each service/place that you wish to collect data from. During the creation of ‘Zaps’ you are be able to specify which elements of the third party APIs to collect data from, which means that you can discard elements you do not need.
Each Zap can be augmented by adding additional steps that allow you to write the collected data to a specified location within a database of your choice. Common elements from different services (for example, usernames) can thus be written to a single username column in a destination database, even though the naming conventions and/or data formats of those datapoints may differ from one API to another.
Adding a fixed variable to each ‘Zap’ – such as one that specifies the source of each entry (e.g. Twitter, Tumblr, Instagram) in the example below – will write a separate column in the destination database that records this fixed variable in each row.
Likewise, fixed variables denoting NULL values can also be added where the API of one service does not provide an element that is present in another. For example, Tumblr and Instagram provide tags (hashtags) added by users as elements of their APIs, yet Twitter does not. By giving a NULL value to the tags element of the Twitter Zap, whilst adding the relevant tags elements to the equivalent variables in the Instagram and Tumblr Zaps, the column in the destination database is populated with either a true value, or a NULL value.
These NULL values become useful when a present variable in an API is not populated by a user. For example, if a Twitter user does not add an image to their post, but you are collecting that element, the Zap will populate the database with a blank value. Blank values (not present) can thus be differentiated from NULL values (not available).
Additional steps can also be added that can pre-process certain data before it is written to the destination spreadsheet. This is particularly useful in terms of Date and Time stamp formats, which often differ from service to service.
The process will involve some trial and error, but eventually you should end up with a single spreadsheet of data collected from numerous sources that contains data organised and formatted according to a schema of your choice.
What You Will Need To Get Started
- An account with Zapier.com
- Accounts on the Social Media channels you wish to collect data from
- A Gmail account
For ease of set up I recommend that you log in to each of the relevant services above, and have each open in separate tabs within your browser before proceeding with this workflow.
For the purposes of this simple example we will collect data from Twitter, Tumblr and Instagram, and will limit our collection to just the usernames of posters, the text of their posts, and date/time stamps. Adding additional variables is simply a case of extending the process described below.
Step 1: Creating A Schema
Within GoogleDocs (or similar) you need to create a blank spreadsheet into which Zapier can write the data collected from different services. The aim here is to create columns within this sheet into which common data points can be written. For the purposes of this example, create four columns and call them service, user_name, text, and date. Name your new spreadsheet zapier_test.
Your new, blank spreadsheet should look like this:
It may sound slightly counter-intuitive to suggest creating this step first, since you will not yet know what data are available from the different services that you wish to collect from, or where commonalities occur. This is where the Trial and Error element comes in and you may need to add or remove columns until you get the right API elements into the right columns.
Step 2: Creating A Zap
You now need to create separate Zaps for each of the three services (Twitter, Tumblr and Instagram). Let’s set up Twitter first.
Click on Make A Zap on the main dashboard, search for and select Twitter as your Trigger App, and then follow through the steps prescribed by Zapier to authorise it to access your Twitter account. After that, specify your search term. Finally, test the collection of Twitter data based on that search. If the test is successful, you will be able to view the different elements of data returned from the Twitter API. The elements that we want to capture in this instance are: user__name, text & created_at, which are those that correspond to the user_name, text and date elements of the destination spreadsheet, zapier_test. Make a note of these as you will need them in the next step.
Next, and still within the Make A Zap window, you need to add the step that will write data to your newly created GoogleDocs sheet, zapier_test. Select ‘GoogleSheets’ from the available dashboard, and then the radio button for ‘Create Spreadsheet Row’, and (after authorising your GoogleDocs account) you should select zapier_test and Sheet 1 from the pull-down menu. If it doesn’t appear, use the search option to find it.
You can now add the elements from the Twitter API to the required fields in the spreadsheet, like in the image below. Note that we have added twitter as a ‘constant’ entry.
NB: The first time you set this up you will need to follow the instruction that allow Zapier to write a test row to the spreadsheet before continuing. This is to test Zapier’s ability to write to the spreadsheet. You will only need to do this once.
Once completed click ‘Finish’ and then ‘Name Your Zap’.
You can now repeat this process for Tumblr by returning the beginning and selecting to Make A New Zap. After selecting Tumblr as your Trigger App you will notice that the option to search on a specific term is not available in the same way as it was through Twitter. The closest equivalent in Tumblr is to search for tags added to posts. This is will be discussed further in the discussion below, but it is an inherent flaw in collection methods of this kind and ultimately relate to the fact that you can only perform tasks that are permitted by the 3rd party concerned.
The elements of the Tumblr API that match those that we want to gather from the Twitter API are as follows: blog_name; body; date. As before, we also add tumblr as a constant on each entry.
Repeating this step for Instagram (which like Tumblr requires that you search on tags, rather than search terms) results in the following:
Once you have completed the process of setting up this third Zap, and depending on the frequency with which your search term is used by posters to theses three services, you will then begin to see your zapier_test spreadsheet begin to update with new entries.
You may wish to add some pre-processing to the steps you add to your Zaps in order to render data in a united format. The collection method described above is collecting Date/Timestamps from each service. However, these appear in different formats.
Twitter API date/time format: Fri Jul 09 09:51:53 +0000 2010
Tumblr API date/time format: 2015-10-27 07:13:17 GMT
It is therefore useful to convert these to a common format before writing to the zapper_test spreadsheet.
I am by no means an expert when it comes to either tutorial production or social media data collection. However, I have found that this process is suitable to my own research needs (although not without issues – which I shall highlight below). If you have any questions, corrections or other comments about this post and/or video, please do get in touch and I will attempt to assist you. If you have found it useful, please also feel free to share, adapt, build upon or otherwise repurpose elements of it.
You should be aware that Zapier offers tiered subscriptions based on usage. Although the free tier may be suitable for your needs, if your research project has the potential for the collection of a large amount of data, you should consult their documentation on pricing before proceeding.
The process described above contains within it the potential for fragility because it relies on the availability of data from the owners of 3rd party platforms. As boyd and Crawford observe, ‘data companies have no responsibility to make their data available, and they have total control over who gets to see them’ (2012). The data available via APIs is thus limited and subject to change at any moment, and as such you are advised to monitor the collection regularly to ensure that you are capturing what you expect to see. As boyd and Crawford also observe, it is not clear and indeed largely impossible to discover, whether the process described above is capable of capturing everything posted, or just a sample. Tweets, for instance, from accounts where users have instigated privacy settings (known as ‘protected tweets’ in the language of Twitter) are excluded from searches of this kind and so will not be collected. Therefore, data collected in this way can only ever be described as a sample, rather than the entirety of posted data. This issue highlights the caution that many scholars have in terms of using Social Media data as representational of human behaviour and experience.
In addition to the fragility of the API-derived data, there is also an issue with the manner in which the availability of data necessarily directs the methods by which researchers are able to collect data. This is evident in the workflow above, where the manner in which data is searched differs from service to service. The Facebook API, not covered in this post, is also a case in point. This will only collection of data that is posted by a user to a specific page. Anything posted to the user’s own timeline is not collected. This clearly differs from the main means by which many users engage with Facebook – i.e. through their own feed, not the pages of others. As such data collected from Facebook (and other sources) needs to be considered in terms of its limitations, and not just affordances of the relative ease by which it can be gathered.
I’ve been meaning to post this workflow for some time, but a couple of things have made it expedient to do so now. In the first instance I’ve recently completed a first draft of the methodology chapter of my PhD thesis, so this (and other) practical elements of the project are the things I’m currently thinking through in detail. The word-count constraints of a PhD thesis do not really allow for the inclusion of finer details about processes such as that being described here, so this and subsequent posts are intended as appendices of sorts. Further to that, next week I’ll be part of a panel at an event at The British Library that is looking at how non-text outputs of PhD and other academic research projects can be helpfully made available via the EThOS system. Although this post is quite clearly predominantly text-based, I would nevertheless include in the general category of non-text outputs that I am producing through my research, alongside such things as datasets, visualisations, R scripts, and other elements that do not necessarily fit into a ‘traditional’ format of a thesis. Finally, a group conversation that recently took place on the emailing list of the AOIR, of which I’m a member, suggested that sharing this workflow may be useful to others. This, I feel, is an important point more broadly and relates to my interest in the ETHoS project: the results or outcomes of research (commonly referred to as Impact in academic circles) have the opportunity through mediums such a blogs and other digital repositories to be potentially useful outside of the context of the original research, particularly when elements of a larger methodological processes can be isolated and presented as general guides. Sandvig and Hargittai have argued recently that the ‘workaday’ practice of the humanities/social sciences research process needs to be highlighted, particularly in areas of work that look at digital media and the Internet, because these are producing ‘new methods, new opportunities, and new challenges for understanding human behaviour and society.’ As such, this post will be the first in a series of ‘practical’ posts that I hope will make a small contribution to that and may prove to be useful.