Text Hackathon

Q: What is it?
A: A two-day Hackathon on Extracting Knowledge from Big Digital Texts, from 12.30pm on 10 November to 12.30pm on 12 November 2017, at De Montfort University

If you are interested in what we can do with computers to extract knowledge from big digital texts, then this event will interest you. How big do we mean? How about all the 19th-century novels? Or all the speeches in the UK Parliament since the Second World War? Or all the printed books published in England between the arrival of printing in 1475 and the year 1800? Or all the 18th-century newspaper reviews of London theatrical performances? Or all the 11,500,000 leaked Panama Papers (= the Mossack Fonseca files)? If there is a big dataset of text to be investigated, this event can show you how to extract knowledge from it.

Everyone is welcome, from those who know nothing about using digital texts to answer interesting questions and would like to find out how it is done, to those who study this topic and those who teach and research it. Free high-quality food--with vegetarian, vegan, gluten-free, kosher, halal, and coeliac options--will be served throughout the event. There are bursaries available to help with the costs of individuals and groups attending the event.

What kind of 'challenges' will we tackle? Participants can propose their own challenges to the organizers, but to give you an idea of the sort of thing that groups at the Hackathon might tackle, we offer these suggestions:

* Do female writers use language differently from male writers?

* What parts of the body are most often mentioned in medical texts from the 19th century?

* What was said about my home town in newspapers from the 16th and 17th centuries?

* What is it about a person's writing that counts as a distinctive style? Do we each have our own distinct style that a computer can tell apart from everyone else's?

* How has the language used to discuss homosexuality in the UK parliament changed over the past 100 years? For example, when did 'queer' start to be used in a positive way?

* Does each of the supposed authors of a 'gospel' in the Christian Bible have a distinct way of writing?

* Did William Shakespeare write the plays that are attributed to him?

* Are today's novels easier to read than 19th-century ones?

* Do Queen Victoria's journals reveal anything about her sex life and recreational drug taking?

* What kinds of language did American newspapers use to describe African Americans during the Civil War?

* What kinds of adjectives were used about Mahatma Gandhi when he was first mentioned in British newspapers and in government reports?

* Were 'teenagers', as a distinct social group, invented in the 1960s, or earlier?
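
To give a flavour of how a group might make a start on a challenge like the one about parts of the body, here is a minimal sketch in Python. It is only an illustration, not a method endorsed by the organizers or the Subject Matter Experts: the word list BODY_PARTS, the function count_terms, and the sample sentence are all made up for the example, and a real study would need a much fuller word list and a real collection of texts.

```python
import re
from collections import Counter

# Hypothetical, illustrative word list; a real study would need a fuller one.
BODY_PARTS = {"heart", "lung", "lungs", "liver", "stomach", "brain", "skin"}


def count_terms(text, terms):
    """Count how often each word in `terms` occurs in `text`, ignoring case."""
    # Lower-case the text and split it into runs of letters (a crude tokenizer).
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word in terms)


# A made-up sentence standing in for a 19th-century medical text.
sample = "The heart and the lungs were examined; the Heart was found enlarged."
counts = count_terms(sample, BODY_PARTS)

# List the terms found, most frequent first.
for term, n in counts.most_common():
    print(term, n)
```

The same counting idea could be pointed at any plain-text file by reading its contents into the sample variable, which is one reason simple word counts are a common first step before more sophisticated analysis.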

When you say 'challenge', does that mean that the event is competitive? No, not at all: it's about challenging ourselves and learning new things in a totally non-competitive, collaborative environment.

Programme (what's happening when)

It all happens in the Eric Wood Learning Zone next to the Kimberlin Library on De Montfort University's Leicester City campus, with breakouts to the Learning Development Zone in the Kimberlin Library itself.

Here is the secret info for this event: credentials for logging into our PCs and WiFi network, and links for accessing the ProQuest and Gale Cengage databases provided. To access the secret info you'll need a userid and password that will be divulged orally at the event.


Friday 12.30pm
Lunch (provided)

Friday 1-1.15pm
Brief welcoming address from the organizer Gabriel Egan, including housekeeping matters (incl. fire, toilets, and food), thanks to funders, logging in to PCs and finding secret info, introductions to the Subject Matter Experts, advice to find group-partners (if you have them) and start discussing your data, and suggestions for how to spend this first day.

Friday 1.15-3pm (with refreshments break at 2.30pm)
Hands-on demo from Paul Brown and Gabriel Egan
"The surprising amount of text-mining you can do with Microsoft Word, Excel and Notepad"

Friday 3-6pm
Free Hacking Time OR join one of the following groups (by walking over to it)

Friday 4-6pm
Subject Expertise Huddle (not a talk but a gathering of interested parties in one part of the room)
Tom English of Gale Cengage will discuss, answer questions about, and demonstrate the big textual databases that Gale Cengage sells

Friday 4-6pm
Subject Expertise Huddle (not a talk but a gathering of interested parties in one part of the room)
Jonathan Cates of Jisc will discuss, answer questions about, and demonstrate the big textual databases that Jisc sells

Friday 4-6pm
Subject Expertise Huddle (not a talk but a gathering of interested parties in one corner of the room)
John Pegum of ProQuest will discuss, answer questions about, and demonstrate the big textual databases that ProQuest sells

Friday 6pm
Dinner (provided)

Friday 6.30pm to Saturday 8.30am
Free Hacking Time. The venue will be open the whole time (with Security present) so come and go as you please. (But do go somewhere to get some sleep!)



Saturday 8.30am
Breakfast (provided), followed by brief guide to the day from Gabriel Egan.

Saturday 9-10am
Talk by Jonathan Hope
"Searching the Oxford English Dictionary (OED), early print, and Jisc Historical Texts"

Saturday 10am-11am
Free Hacking Time (refreshments arrive at 10.30am)

Saturday 11am-12noon
Talk by Jonathan Culpeper
"From simple word counts to collocates and keywords" Handout

Saturday 12noon-1pm
Free Hacking Time (lunch arrives at 12.30pm)

Saturday 1-2pm
Talk by Paul Rayson
"Adjusting a semantic taxonomy and annotation tool for historical corpora" Handout

Saturday 2-3pm
Free Hacking Time (refreshments arrive at 2.30pm)

Saturday 3-4pm
Talk by Nick Smith
"Language change in a popular radio show: does it matter who we sample?"

Saturday 4-5pm
Free Hacking Time

Saturday 5-5.45pm
Talk by Iain Emsley
"How did I GET that data?: Reproducing steps"

Saturday 6pm
Dinner (provided)

Saturday 7pm to Sunday 8.30am
Free Hacking Time. The venue will be open the whole time (with Security present) so come and go as you please. (But do go somewhere to get some sleep!)



Sunday 8.30am
Breakfast (provided)

Sunday 9-10am
Free Hacking Time

Sunday 10-11am
Talk by Brett Greatley-Hirsch
"Measuring style in writing"

Sunday 11-11.10am
Challenge results presentation: Laurence T. Droy on "How does the way e-cigarettes are talked about vary across old and new media?"

Sunday 11.10am-12noon
Free Hacking Time

Sunday 12noon-12.10pm
Challenge results presentation: Stephanie Collins, Ellen Roberts, Andressa Gomide, and Amir Andwari on "Did Shakespeare write the plays attributed to him?"

Sunday 12.10-12.20pm
Challenge results presentation: Robyn Pritzker, Rianna Walcott, Olivia Ferguson, and Suzanne Black on "Investigating affective contemporary responses to historical and contemporary Gothic writing"

Sunday 12.30-12.40pm
Challenge results presentation: Isobelle Clarke, Jacqueline Cordell, Katherine Pearce, and Viola Wiegand on "Can we identify textual patterns creating the discourse of 'sexual harassment' prior to 1973, and how have attitudes towards sexual harassment changed over time?"

Sunday 12.40pm
Closing remarks and Farewell lunch

Q: Really, anyone can come?
A: Yes! Just register for free using the link on the left.

And you can come and go as you please across the 48 hours. The event is being hosted by De Montfort University's Centre for Textual Studies as part of the AHRC-funded Research Leadership of its Director Prof Gabriel Egan. Anybody with an interest in this topic, including school groups, may apply to participate. The event is aimed at people of all levels of expertise, from "none at all" to those running cutting-edge research projects. As well as the usual activities of collaborative problem-solving, the Hackathon will feature a series of talks, presentations, and hands-on demonstrations for the sharing of knowledge and skills. A full programme of these activities appears above.

Who are the Subject Matter Experts who'll be guiding us in the various challenges?

Dr Elizabeth Williamson of Exeter University
Expertise: Early modern print and manuscript culture, text encoding, textual scholarship, digital publication
Can advise on: Idiosyncrasies of early modern print and manuscript, and digital collections thereof. What can we tell about early modern power networks by looking at letters in the State Papers Online?

Dr Edmund G. C. King of the Open University
Expertise: Shakespeare, the history of reading, digitization
Can advise on: Locating reading experiences, identifying named entities

Prof Jonathan Hope of Strathclyde University
Expertise: Linguistics
Can advise on: Did Shakespeare really invent all those words?

Dr Brett Greatley-Hirsch of Leeds University
Expertise: Textual studies, computational stylistics, and literary/cultural history. In particular, I'm interested in authorship attribution, editing and publishing, and computational methods of literary study.
Can advise on: Does punctuation-use change over time? Do patterns emerge in the choice of character names in literature/film?

Dr Paul Rayson of Lancaster University
Expertise: Natural Language Processing
Can advise on: VARDing to modernize spellings in historical texts for improved corpus analysis

Prof Jonathan Culpeper of Lancaster University
Expertise: Corpus stylistics
Can advise on: Using corpus tools (e.g. CQPweb, WMatrix) to explore meanings and styles

Dr Nick Smith of the University of Leicester
Expertise: Applied linguistics
Will be around Friday and Saturday
Can advise on: What kinds of grammatical changes can we detect in UK parliamentary speeches over time?

Iain Emsley of Oxford University
Expertise: Python, automation, and reproducibility
Can advise on: Python, APIs, and automation

Tom Salyers of Sheffield University
Expertise: Literary and linguistic computing. I have a background both in software development and literary and performance studies, with a strong emphasis on Shakespeare.
Can advise on: Syntactic tagging, cluster analysis, authorship identification, and using Python, SQL, and Java

Tom English of Gale Cengage
Expertise: Digital primary sources (especially Gale's)
Can advise on: Thomson Gale's big textual databases

Dr John Pegum of ProQuest and Alexander Street
Expertise: English Literature, Digital Resources, Literature Online (LION)
Can advise on: ProQuest's big textual databases

John Rothwell of ProQuest (their Lead Software Engineer)
Expertise: Database architecture and data structure within EEBO, LION and other ProQuest databases
Can advise on: ProQuest's big textual databases

Prof Gabriel Egan of De Montfort University
Expertise: Plays of Shakespeare's time, digital methods
Can advise on: Getting started with digital methods

Dr Paul Brown of De Montfort University
Expertise: Early modern drama, digital methods
Can advise on: Getting started with digital methods

Jonathan Cates of Jisc Collections
Expertise: Digital collections and archives
Can advise on: Jisc Historical Texts and Jisc Journal Archives

What data will we have to play with?

The Hackathon is concerned with any and all big collections of textual data. Attendees can bring their own data and/or use the datasets that the event will provide, which will include all publicly accessible websites (such as WikiLeaks, Hansard, and Project Gutenberg) and also locally downloaded resources such as the Text Creation Partnership searchable transcriptions of 25,000 books printed in English from the invention of printing to the year 1700. Because Jisc, ProQuest, and Gale Cengage are all coming to the event, we will be offering free access to some of their most exciting new big-text databases.

Throughout the Hackathon there will be a series of demonstrations and talks about the various sources of big textual collections and the various ways that they can be used to answer the kinds of questions that are now, for the first time in history, askable and answerable because we have these digital collections.

So far, we know for sure that we will have access to:

Early English Books Online (EEBO) being everything published in England from the invention of printing to the year 1700. We will have this dataset in the form of the Jisc Historical Texts version, the ProQuest version, and the Brigham Young University transcriptions from the Text Creation Partnership EEBO Phase One (25,000 books) that have had part-of-speech, lemma, and semantic tagging applied.

Eighteenth-Century Collections Online (ECCO) being everything published in England from 1701 to 1800

British Library 19th Century Texts being 65,000 editions of nineteenth-century books

Literature Online (LION) being 355,000 literary works in English across all periods from ancient to modern

Associated Press Collection Online being mainly 20th-century actual wire copy and correspondence reporting on news from bureaux around the world

US Declassified Documents Online being American government documents from such sources as the Central Intelligence Agency, the State Department, and the White House giving insight into post-World-War-II domestic and foreign policy in America

Gale Historical Newspapers being an amalgamation of The Times Digital Archive, 17th and 18th Century Burney Collection, The Financial Times Historical Archive, The Economist Historical Archive, and many more, giving 15 million digitized pages of newspaper content spanning four centuries

Crime, Punishment and Popular Culture 1790-1920 being trial transcripts, detective agency records, newspaper and police reports, and fictionalizations of crime (penny dreadfuls, dime novels, detective fictions) from the British Isles

State Papers Online 1509-1714 & 1714-1782 being the digital transcriptions of manuscript documents arising from the practicalities of governing Great Britain from the reign of Henry VIII to that of George III.

Black Abolitionist Papers, 1830-1865 being the writings and publications of the African American anti-slavery activists themselves.

British Periodicals being full text of hundreds of periodicals from the late seventeenth century to the early twentieth.

Colonial State Papers being over 7,000 hand-written documents and more than 40,000 bibliographic records covering British relations with the Americas and other European rivals for power, and the Caribbean and Atlantic world.

Country Life Archive being the full text of this weekly British culture and lifestyle magazine from 1897 to 2005.

Early European Books being page images of a cross-selection of books to convey the history of printing in Europe from its origins through to the close of the seventeenth century.

Entertainment Industry Magazine Archive being essential primary sources for studying the history of the film and entertainment industries, from the era of vaudeville and silent movies through to 2000.

Historic Literary Criticism being a collection of over 20,000 historical contemporary reviews, essays and commentary related to more than 500 influential authors from the 17th to the early 20th century.

Historical Statistical Abstracts of the United States being more than 600,000 published tables from statistical information produced by U.S. Federal agencies, states, private organizations, and major intergovernmental organizations.

House of Commons and House of Lords Parliamentary Papers being the working documents of government for all areas of social, political, economic and foreign policy, showing how issues were explored and legislation was formed.

Index Islamicus being pointers to publications on Islamic subjects throughout the world from 1906 to the present, including records from 3,000 journals together with conference proceedings, monographs, multi-authored works and book reviews.

Nineteenth Century Short Title Catalogue being a bibliography of over 1.2 million records for the 19th-century holdings of eight of the world's top research libraries.

Digital U.S. Bills and Resolutions, 1789-Present being the most comprehensive collection of historic and current congressional information available anywhere.

Historical Newspapers being the full texts of all the major English-language newspapers from the US, Ireland, and the UK from the 18th century to the end of the 20th.

Queen Victoria's Journals being high-resolution, colour images of every page of the surviving volumes of her journals, from her first diary entry in 1832 to shortly before her death in 1901.

The Cecil Papers being 30,000 manuscript documents written by some of the most significant figures of Elizabethan and Jacobean history.

The Vogue Archive being every issue of American Vogue, from the first in 1892 to the current month, with indexes to find images by garment type, designer and brand names.

The Women's Wear Daily Archive being every issue from the first in 1910 to the current year, with searchable text and indexes.

The Panama Papers being documents leaked from the Panamanian law firm Mossack Fonseca, a provider of offshore financial services; the leak recently brought down the Pakistani Prime Minister Nawaz Sharif.

What specific challenges have come in so far?

"How does the way e-cigarettes are talked about vary across old and new media?" (from Laurence T. Droy, who'll be using data scraped from Twitter and newspapers and hoping to apply such things as sentiment analysis to it).

"To what extent do women work collectively in assigning cultural value to certain texts in datasets such as Facebook groups like 'Women in the Arts Scotland' and in historical archives such as the 'Reading Experience Database'?" (from Robyn Pritzker, Rianna Walcott, Olivia Ferguson, and Suzanne Black).

"Can we identify textual patterns creating the discourse of 'sexual harassment' prior to 1973 (when that phrase was coined), and how have attitudes towards sexual harassment (i.e. the semantic prosody) changed over time?" (from Isobelle Clarke, Jacqueline Cordell, Katherine Pearce, and Viola Wiegand, who'll be using evidence from social media, fiction, and legal texts).

"What can use of the word 'zeal' in seventeenth-century English religious and political discourse tell us about the nature of religious conflict and emotional community?" (from Dhiaa Kareem Ali, Wendy Li, Ravindra Babu, and Martyn Cutmore).

"Did Shakespeare write the plays attributed to him?" (from Amir Andwari, Stephanie Collins, Ellen Roberts, and Andressa Gomide).

We want to thank ...

Funding for this event came from De Montfort University, of course, and the Arts and Humanities Research Council (grant AH/N007654/1).
