Text analysis and website archiving

I have been keeping an eye out for new and useful technologies that might be important in my studies and have recently come across a couple that are worthy of note:

Launched just this month, DiscoverText is a web-based text analysis tool from Texifter, a spin-out company built on the text analysis research of Dr Stuart Shulman of the University of Massachusetts. It allows users to upload or identify text data sources to be analysed using a range of automated and “human-in-the-loop” processes. There is a simple free service, with more complex analysis available in paid-for solutions. The software can capture live data from Facebook, Twitter or any RSS feed, and can also work with uploaded files such as large volumes of PDF or Microsoft Office documents or an email archive. In this way, users can create an online archive of text data which can be analysed at will. The software combines automated text analysis with expert analysis by allowing collaboration between networks of trusted peers. Initially, comments can be grouped into “clusters” of similar contextual meaning and common themes identified. Some of Shulman’s recent research focuses on the detection of threats and novel ideas communicated in public comments, blogs and other media, which raises the prospect of using the system to detect formulated public opinion in text content relevant to my studies. I have yet to identify the exact mechanisms and theories used in this product, but hopefully some study of Shulman’s work will shed some light on the issue.

WinHTTrack Website Copier
This free, open source (GPL) download allows users to store entire copies of websites, archived in a clear storage structure, for future offline browsing or analysis. Rather than simply requesting pages and storing them in a structure exactly like that of the target site, the software can also create a copy of dynamic data served from databases onto a website. Every hyperlink is investigated, and the HTML returned is stored as a page that remains accessible by its URL, query string and all. This is a valuable tool for my research as it enables me to capture entire e-participation initiatives and store them locally before they disappear when their operational period comes to an end.
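To make the idea of following every hyperlink and storing each response under its full URL more concrete, here is a minimal Python sketch. It is not how WinHTTrack itself is implemented, which I have not examined. The names `LinkExtractor`, `local_path` and `mirror`, the file naming scheme, and the `fetch` function passed in by the caller are all hypothetical, chosen only for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import quote, urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def local_path(url):
    """Map a URL, query string included, to a relative file path.

    The naming scheme is an assumption made for this sketch.
    """
    parts = urlsplit(url)
    path = parts.path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"
    if parts.query:
        # Keep the query string so each dynamic page gets its own file.
        path += "_" + quote(parts.query, safe="")
    return parts.netloc + "/" + path


def mirror(start_url, fetch, max_pages=100):
    """Breadth-first crawl of a single host.

    `fetch` is any callable that takes a URL and returns the page HTML
    as a string, or None if the page is unavailable.
    Returns a dict mapping local paths to the HTML stored there.
    """
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[local_path(url)] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target, _ = urldefrag(urljoin(url, href))
            # Stay on the starting host and visit each URL only once.
            if urlsplit(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library, and means it can be tried against a plain dictionary of pages before being pointed at a live site. A real copier would also need to handle images, stylesheets, robots.txt and link rewriting, none of which is attempted here.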


About birchallchris

Research Associate in the School of Media and Communication, University of Leeds, teaching digital media practice and theory to students on the BA/MA New/Digital Media programmes. I research digital citizenship using innovative digital methods, trying to bridge the gap between very large-scale phenomena and the individual human.
This entry was posted in PhD, Research Notes, Technology.
