Text Hackathon

Q: What is it?
A: A two-day Hackathon on Extracting Knowledge from Big Digital Texts, from 12.30pm on 10 November to 12.30pm on 12 November 2017, at De Montfort University

If you are interested in what we can do with computers to extract knowledge from big digital texts, then this event will interest you. How big do we mean? How about all the 19th-century novels? Or all the speeches in the UK Parliament since the Second World War? Or all the printed books published in England between the arrival of printing in 1475 and the year 1800? Or all the 18th-century newspaper reviews of London theatrical performances? Or all the 11,500,000 leaked Panama Papers (= the Mossack Fonseca files)? If there is a big dataset of text to be investigated, this event can show you how to extract knowledge from it.

Everyone is welcome, from those who know nothing about using digital texts to answer interesting questions and would like to find out how it is done, to those who study this topic and those who teach and research it. Free high-quality food--with vegetarian, vegan, gluten-free, kosher, halal, and coeliac options--will be served throughout the event. There are bursaries available to help with the costs of individuals and groups attending the event.

What kind of 'challenges' will we tackle? Participants can propose their own challenges to the organizers, but to give you an idea of the sort of thing that groups at the Hackathon might tackle we offer these suggestions:

* Do female writers use language differently from male writers?

* What parts of the body are most often mentioned in medical texts from the 19th century?

* What was said about my home town in newspapers from the 16th and 17th centuries?

* What is it about a writer that counts as a distinctive writing style? Do we each have our own distinct style that a computer can tell apart from everyone else's?

* How has the language used to discuss homosexuality in the UK Parliament changed over the past 100 years? For example, when did 'queer' start to be used in a positive way?

* Does each of the supposed authors of a 'gospel' in the Christian Bible have a distinct way of writing?

* Did William Shakespeare write the plays that are attributed to him?

* Are today's novels easier to read than 19th-century ones?

* Do Queen Victoria's journals reveal anything about her sex life and recreational drug taking?

* What kinds of language did American newspapers use to describe African Americans during the Civil War?

* What kinds of adjectives were used about Mahatma Gandhi when he was first mentioned in British newspapers and in government reports?

* Were 'teenagers', as a distinct social group, invented in the 1960s, or earlier?

When you say 'challenge', does that mean that the event is competitive? No, not at all: it's about challenging ourselves and learning new things in a totally non-competitive, collaborative environment.

Bursaries, you say?

Students and tutors at all levels of Higher Education may apply for a bursary to help with their costs in attending this event. We are offering:

* 15 x 100 GBP Individual Bursaries are available for individual students to attend the event. All that applicants have to do is write to the Hackathon organizer, Prof Gabriel Egan (using the link on the left), stating in 100 words why they want to attend the event.

* 5 x 600 GBP Group Bursaries are available for groups of at least four students to attend the event. All that a group has to do is write to the Hackathon organizer, Prof Gabriel Egan (using the link on the left), stating in 100 words why they want to attend the event and naming a 'challenge' (an interesting textual question) that they want to explore.

* 5 x 300 GBP Subject Matter Expert Bursaries are available for tutors from Higher Education to attend the event and lead a group of attendees in the exploration of their 'challenge'. All that applicants have to do is write to the Hackathon organizer, Prof Gabriel Egan (using the link on the left), stating in 100 words what their Subject Matter Expertise consists of.

The judgement of the Hackathon organizer in choosing the successful applicants will be final. The deadline for bursary applications is 25 October 2017 and all applicants will be notified of the outcome on 30 October 2017.

Q: Where, exactly, is it?
A: The Eric Wood Learning Zone (next to the Kimberlin Library) on De Montfort University's Leicester campus

(The Learning Development Zone on the ground floor of Kimberlin Library will also be available as a breakout space.)


Q: Really, anyone can come?
A: Yes! Just register for free using the link on the left.

The event is being hosted by De Montfort University's Centre for Textual Studies as part of the AHRC-funded Research Leadership of its Director Prof Gabriel Egan. Anybody with an interest in this topic, including school groups, may apply to participate. The event is aimed at people of all levels of expertise, from "none at all" to those running cutting-edge research projects. As well as the usual activities of collaborative problem-solving, the Hackathon will feature a series of talks, presentations, and hands-on demonstrations for the sharing of knowledge and skills. A full programme of these activities will appear here shortly.

Who'll be there to speak to and guide us?


Prof Jonathan Culpeper (stylistics) Lancaster University

Prof Jonathan Hope (English) Strathclyde University

Dr Brett Greatley-Hirsch (textual studies) Leeds University

Prof Marc Alexander (linguistics) University of Glasgow

Iain Emsley (digital humanities) Oxford University

Tom Salyers (literary and linguistic computing)

Prof Matt Steggle (English) Sheffield Hallam University

Tom English (digital primary sources) Gale Cengage

Dr Paul Rayson (corpus analysis) Lancaster University


Jisc Collections, the state-owned procurers for digital content for all UK higher education

ProQuest Chadwyck-Healey, the commercial sellers of the databases Literature Online (LION) and Early English Books Online (EEBO) and others

Gale Cengage, the commercial sellers of the databases Eighteenth Century Collections Online (ECCO) and American Fiction, 1774-1920 and others

What data will we have to play with?

The Hackathon is concerned with any and all big collections of textual data. Attendees can bring their own data and/or use the datasets that the event will provide, which will include all publicly accessible websites (such as WikiLeaks, Hansard, and Project Gutenberg) and also locally downloaded resources such as the Text Creation Partnership searchable transcriptions of 25,000 books printed in English from the invention of printing to the year 1700. Because Jisc, ProQuest, and Gale Cengage are all coming to the event, we will be offering free access to some of their most exciting new big-text databases.

Throughout the Hackathon there will be a series of demonstrations and talks about the various sources of big textual collections and the various ways that they can be used to answer the kinds of questions that are now, for the first time in history, askable and answerable because we have these digital collections.

So far, we know for sure that we will have access to:

* Early English Books Online (EEBO), being everything published in England from the invention of printing to the year 1700

* Eighteenth Century Collections Online (ECCO), being everything published in England from 1701 to 1800

* British Library 19th Century Texts, being 65,000 editions of nineteenth-century books

* Literature Online (LION), being 355,000 literary works in English across all periods from ancient to modern

* Associated Press Collection Online, being mainly 20th-century actual wire copy and correspondence reporting on news from bureaux around the world

* US Declassified Documents Online, being American government documents from such sources as the Central Intelligence Agency, the State Department, and the White House giving insight into post-World-War-II domestic and foreign policy in America

* Gale Historical Newspapers, being an amalgamation of The Times Digital Archive, 17th and 18th Century Burney Collection, The Financial Times Historical Archive, The Economist Historical Archive, and many more, giving 15 million digitized pages of newspaper content spanning four centuries

* Crime, Punishment and Popular Culture 1790-1920, being trial transcripts, detective agency records, newspaper and police reports, and fictionalizations of crime (penny dreadfuls, dime novels, detective fictions) from the British Isles

* State Papers Online 1509-1714 & 1714-1782, being the digital transcriptions of manuscript documents arising from the practicalities of governing Great Britain from the reign of Henry VIII to that of George III

* The Panama Papers, being leaked documents from the Panamanian law firm Mossack Fonseca, a provider of offshore financial services; the leak recently brought down the Pakistani Prime Minister Nawaz Sharif

Plus the databases that ProQuest makes available to us, to be announced shortly.