Slow progress towards screen scraping conversations

It has been a long time since I blogged on this site. This is partly due to my involvement in a couple of really interesting research projects (the Digital Data Analysis project of the network with Dr Helen Kennedy and Dr Giles Moss, and the Leeds Media Ecology project which will result in a forthcoming book edited by Prof. Stephen Coleman). It is partly due to my taking over as the programme leader of the BA Hons New Media degree at the Institute of Communications Studies at the University of Leeds (most of this summer has been spent developing a new level three module entitled “Mobile Media”, looking at the impacts and influences of and on mobile communications and introducing students to the basics of mobile web and app development). So I’ve been pretty busy.

I have continued to develop my PhD work, though, reaching the stage where I am about to start harvesting a lot of conversational data from all across the web where people are talking about UK political issues. When I say a lot, I mean thousands of contributions. Hopefully tens of thousands, maybe lots more. To enable this, I managed to find three weeks this summer to devote to developing a tool that will let me grab all this data pretty quickly and get it all stuffed nicely into my database. The Conversation Scraper is the solution I came up with: a Mozilla Firefox plug-in that operates in the same way as many screen scraping applications out there. Users select parts of a web page, mark them up as a category of relevant content, and build up a profile for a particular website before clicking a button and watching the data be selected and harvested automatically before their eyes (at least in theory, as long as the user has marked up the page carefully enough). The difference with this screen scraper is that it is customised for conversation, allowing users to mark up fields like usernames, dates, message content, reply-to names and ratings or likes.
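To make the idea of a per-site profile a bit more concrete, here is a minimal sketch in Python. The plug-in itself is written in JavaScript, so this is an illustration only, not the tool's actual code. The profile layout, the class names and the sample page are all invented for the example, and it uses the standard library's limited XPath support on a well-formed snippet rather than a real browser DOM.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: one path for the repeating contribution element,
# plus one path per conversation field. All class names are made up.
PROFILE = {
    "item": ".//div[@class='comment']",
    "fields": {
        "username": ".//span[@class='author']",
        "date": ".//span[@class='date']",
        "message": ".//p[@class='body']",
        "reply_to": ".//span[@class='reply-to']",
        "rating": ".//span[@class='likes']",
    },
}

# Invented, well-formed sample page standing in for a real conversation space.
SAMPLE_PAGE = """
<html><body>
  <div class="comment">
    <span class="author">alice</span>
    <span class="date">2012-09-01</span>
    <p class="body">The proposal needs costing.</p>
    <span class="likes">3</span>
  </div>
  <div class="comment">
    <span class="author">bob</span>
    <span class="date">2012-09-02</span>
    <span class="reply-to">alice</span>
    <p class="body">Agreed, where are the figures?</p>
    <span class="likes">1</span>
  </div>
</body></html>
"""


def harvest(page_source, profile):
    """Return one dict per marked-up contribution; missing fields become None."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for item in root.findall(profile["item"]):
        row = {}
        for field, path in profile["fields"].items():
            node = item.find(path)
            has_text = node is not None and node.text
            row[field] = node.text.strip() if has_text else None
        rows.append(row)
    return rows


contributions = harvest(SAMPLE_PAGE, PROFILE)
for contribution in contributions:
    print(contribution)
```

The point of keeping the profile as plain data is that supporting a new site should only mean writing a new set of paths, not new harvesting code. Real pages are rarely well-formed XML, so a working version would need a proper HTML parser or the browser's own DOM.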

The tool works really well. I was quite surprised at how easily and quickly I could produce it (after the initial delay of trying to come to terms with how to build a Firefox plug-in). Using just a XUL sidebar and some custom JavaScript code (with AJAX and jQuery bundled in to make it all a bit nicer) I now have a fairly user-friendly tool that allows me to mark up web pages and build profiles for conversation spaces all over the web. I have profiles for spaces like and, some local forums and some government spaces such as the redtapechallenge, and I intend to keep building up the list over the next couple of months until some big story lands and I can start to harvest conversation about it. With a few clicks of the mouse I can harvest several pages of comments from a web page straight into my database. All the usernames are anonymised before insertion, and the data is encrypted, so I don't know anything about real individuals; I just have a large store of contributions that I can use to calculate metrics about the different conversations.
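One way the anonymisation step could work is sketched below, again in Python and purely as an illustration. The post does not say how the plug-in anonymises names, so the keyed hash (HMAC-SHA256), the placeholder key and the function names here are all assumptions. The useful property is that the same username always maps to the same token, so reply-to links between contributions survive even though the real names are never stored. Encryption of the stored data is a separate step and is not shown.

```python
import hashlib
import hmac

# Assumption: a secret key kept outside the database. Placeholder value only.
SECRET_KEY = b"replace-with-a-private-random-key"


def pseudonym(username, key=SECRET_KEY):
    """Map a username to a stable token that is hard to reverse without the key."""
    normalised = username.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).hexdigest()
    return "user_" + digest[:16]


def anonymise(row, key=SECRET_KEY):
    """Return a copy of a contribution with its name fields replaced by tokens."""
    clean = dict(row)
    for field in ("username", "reply_to"):
        if clean.get(field):
            clean[field] = pseudonym(clean[field], key)
    return clean


# Invented sample contributions.
row = {"username": "Alice", "reply_to": None, "message": "The proposal needs costing."}
reply = {"username": "bob", "reply_to": "alice", "message": "Agreed."}

safe_row = anonymise(row)
safe_reply = anonymise(reply)

# The reply still points at the same (anonymised) author.
print(safe_row["username"] == safe_reply["reply_to"])
```

Lower-casing and trimming the name before hashing is a design choice: it treats "Alice" and "alice" as one participant, which may or may not be right for a given site.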

I will eventually release this tool under the GPL. At present it puts data into my own database (which is no good to anyone else and is not robust enough to handle crowd sourcing) but I hope to modify it to produce a JSON or CSV export instead of a database insert, so that anyone can use it. Get in touch if you want to have a look. Maybe we could do a trade: you tell me some good ways to measure properties like domination in a conversation and I'll give you a look at the plug-in!
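As a rough idea of what the planned export might look like, here is a small Python sketch that writes the same harvested rows out as JSON or CSV instead of inserting them into a database. The field list mirrors the conversation fields mentioned earlier in this post; the function names and sample rows are hypothetical, and the real export would live in the plug-in's JavaScript.

```python
import csv
import io
import json

# Column order for the CSV export, based on the fields the scraper marks up.
FIELDS = ["username", "date", "message", "reply_to", "rating"]


def to_json(rows):
    """Serialise contributions as a JSON array, keeping non-ASCII text readable."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_csv(rows, fields=FIELDS):
    """Serialise contributions as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Missing or None fields are written as empty cells.
        writer.writerow({f: "" if row.get(f) is None else row[f] for f in fields})
    return buffer.getvalue()


# Invented, already-anonymised sample rows.
rows = [
    {"username": "user_a", "date": "2012-09-01",
     "message": "The proposal needs costing.", "reply_to": None, "rating": 3},
    {"username": "user_b", "date": "2012-09-02",
     "message": "Agreed, where are the figures?", "reply_to": "user_a", "rating": 1},
]

json_text = to_json(rows)
csv_text = to_csv(rows)
print(csv_text)
```

CSV is the easier format to open in a spreadsheet, while JSON keeps empty fields and numbers distinct, which matters if the reply-to structure is going to be analysed later.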


About birchallchris

Research Associate in the School of Media and Communication, University of Leeds, teaching digital media practice and theory to students on the BA/MA New/Digital Media programmes. I research digital citizenship using innovative digital methods, trying to bridge the gap between very large-scale phenomena and the individual human.
This entry was posted in PhD, Technology, thoughts.

1 Response to Slow progress towards screen scraping conversations

  1. peterlevine says:

    I am not sure I have anything worthy of trading, but I would be very interested in taking a look at the data. I’d be particularly interested in an example of a long conversation thread that is truly deliberative–with lots of actual replies to previous posts, arguments, and evidence (instead of name-calling).
