GATE: Full Lifecycle Text Analytics

(20 years' work in 20 minutes, plus questions)

Hamish Cunningham,
University of Sheffield

These slides: tinyurl.com/figint13
(FIG: First International GATE
symposium, number 6)

Contents

0. Background; Housekeeping
1. GATE in the Wild
- Semantics in the Media
- Text Mining for Biomedicine
- Other Users
2. The GATE Family
- Developer, Embedded
- Teamware, Process
- Cloud
- KIM, OWLIM, Linked Data
- Mímir
3. A Lifecycle for Text Analysis

Background

c. 1991: Welcome to the miserable crew?

Les Misérables

2013 ☺: what changed?

the key factor, of course, is how many elements of our lives have moved on-line, and the role of text as a communication medium in those elements is absolutely central
this has meant transformations in many areas, including
- the size of the data we work with
- the importance of that data (sadly, the most profitable area is advertising — sometimes it seems like our entire field has become a question of who clicks on which advert — some examples of non-advertising stuff below...)

What didn't change?

Square Fish Syndrome

Imagine you're looking at a river as a fish
swims by beneath the surface
You can show the ripples and eddies
- to an artist and ask them to draw a fish
- to a statistician and ask them to model a fish
In both cases you're liable to get a square fish
Human language processing can be done by
- linguists intuiting about grammar
- machine learning creating statistical models
Language is a surface phenomenon but
communication is about intelligence

Progress:
- modern approaches use elements of both methods, and large quantities of data, and large quantities of human judgements about that data
- ways to represent results have changed, with linked data, location-dependent services, etc.
But, fundamentally, we have yet to see any dramatic breakthrough

Which means...

the most important questions have often been:
- how can we build systems that achieve the plateau as cheaply and efficiently as possible?
- how can we minimise the effort in adapting these systems to new problem domains?
- how can we ensure that they work reliably and predictably over time?
- how can we maximise reuse across systems?
these were the questions that we started work on in the early 1990s, and that's what became the GATE research programme...
what are people doing with it in 2013?

Housekeeping Information (1)

If you hear a fire alarm?

...run in circles, scream and shout

Housekeeping Information (2)

The building is fully equipped with toilets

Pan

Housekeeping Information (3)

Members of the GATE team are available at all times

Munch, The Scream

The greatest sporting event ever?

The London Olympics.

And the leading local media organisation covering it?

The BBC.

And the best text analytics ecosystem? (You can see where this is going...)

GATE, of course, which is why the BBC are using it under all their sports web coverage :-)

...

GATE at the Olympics

BBC Olympics coverage using GATE

The BBC's sports website uses GATE for text mining
(see this BBC blog for some of the gory details)

GATE at the Olympics (2)

The BBC system has some interesting characteristics:

big (or medium) data, very high query volumes
extremely flexible in the face of data evolution (an unknown gets a gold, a trainer gets arrested with steroids, ...)
linked data is key
- documents are annotated relative to a domain-specific OWL model (using GATE's LKB gazetteer from Ontotext)
- everything is served out of a clustered semantic repository
changing the pages served and their content is a semantic operation — which starts to move the production focus away from the DBA and the web developer and back to the journalist
the austerity quote: their World Cup 2010 system achieved cost savings of ~80% compared to a conventional database-backed web system
now serves more than 10,000 pages

Semantics in Media — the New Thing?

(slides from [ex-]PA's Jarred McGinnis)

The Press Association are running a similar programme (following on from a long-running GATE project that processes captions in a massive image library)

PA Olympics coverage using GATE

Media (2)

system helps the journalist create the metadata
feedback from the journalist helps system accuracy
when you've got good metadata in an OWL store you can access the data with extreme flexibility... e.g.
- generate the pages on the BBC site,
- sell custom feeds into niche markets
- drive visualisations or populate COTS data analysers

now partners with Sheffield/Ontotext/IMR in the AnnoMarket project (more later)

Media (3)

CNN commissioned a similar system
News International contacted us in 2011 but we ignored them and they went away :-)
media is a good application area for text analysis and semantic modelling technology, partly because journalistic language is very well-behaved (relatively speaking!), partly because the content is extremely valuable, and partly because existing classification schemes are typically applied quite rigorously
life sciences and biomedical data have some similar characteristics...

Media (4) — Social/Mobile

EPSRC programme in summarisation and mining of consumer-generated media
TrendMiner: cross-lingual trend analysis in real-time media streams
- won the Hypertext 2013 Ted Nelson best newcomer award for Where's Wally?, geolocation of Twitter posts
new in 2013/4: computing veracity (the 4th V of big data, after Gartner's volume, velocity, variety)

Biomedical Example (1): EHR Mining

...sneak preview...

Bio Example 2: Genetic Epidemiology

it is hypothesised that
- genetic factors play a strong role in susceptibility to disease
- in future targeted pharmaceuticals will be tailored to individual genetics
a substantial body of work looks for associations between mutations and diseases
World Health Organization's International Agency for Research on Cancer (IARC), the world's biggest cancer epidemiology lab.
genetics groups: which mutations (SNPs) associate with carcinogenesis?
new trends in genetic association studies (stimulated by decreased cost of sequencing):
- objective: identify common genetic variants involved in susceptibility to disease
- candidate gene approach: genes selected and tested based on prior knowledge/hypotheses
- GWAS approach: test “all” common genetic variants with no prior knowledge/hypothesis (Genome-Wide Association Studies)

Genetic Epidemiology (2)

The problem: needle (significant associations) in a haystack (large-scale gene sequence probes)

P value results

Genetic Epidemiology (3)

The WHO results

Nature paper 2008: GWAS result showing a particular genetic polymorphism correlates with increased risk of lung cancer in smokers
required a large amount of manual work to examine data from sensor arrays
the usual statistical techniques need large numbers of samples to make the analysis usable and reliable

An annotation experiment

our experiment used text analysis to add prior knowledge about genes
e.g. if a gene is expressed in lung tissue, represent this in the BFDP model when calculating relevance of sensor data for related polymorphisms
use annotation to find papers that discuss particular genes, diseases, anatomy and so on (AdAPT — Adjusting Association Priors with Text)
works using half the data (potential saving: €250k)
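
To make the role of the prior concrete, here is a minimal sketch of how a Bayesian False Discovery Probability moves when text-derived evidence raises the prior on an association. It assumes the commonly cited approximate Bayes factor formulation of BFDP; the function names, the prior values and the example effect size are illustrative assumptions, not figures or code from the AdAPT work.

```python
import math


def approx_bayes_factor(beta_hat, std_err, prior_sd):
    """Approximate Bayes factor for H0 (no association) versus H1.

    Assumes a normal likelihood for the estimated log odds ratio and a
    N(0, prior_sd**2) prior on the true effect under H1.
    """
    v = std_err ** 2        # variance of the estimate
    w = prior_sd ** 2       # prior variance of the effect under H1
    z = beta_hat / std_err  # Wald statistic
    return math.sqrt((v + w) / v) * math.exp(-(z ** 2) * w / (2 * (v + w)))


def bfdp(beta_hat, std_err, prior_sd, prior_assoc):
    """Probability that a reported association is a false discovery.

    prior_assoc is the prior probability that the variant is truly
    associated with the disease; this is the knob that text mining adjusts.
    """
    prior_odds_null = (1 - prior_assoc) / prior_assoc
    x = approx_bayes_factor(beta_hat, std_err, prior_sd) * prior_odds_null
    return x / (1 + x)


# Hypothetical SNP: log odds ratio 0.25 with standard error 0.06.
flat = bfdp(0.25, 0.06, prior_sd=0.2, prior_assoc=1e-4)

# Same data, but the literature says the gene is expressed in the relevant
# tissue, so the prior is (illustratively) raised by a factor of ten.
informed = bfdp(0.25, 0.06, prior_sd=0.2, prior_assoc=1e-3)
```

With identical sensor data, the informed prior yields a lower BFDP than the flat one, which is the intuition behind needing fewer samples to reach the same confidence.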

Genetic Epidemiology (4)

Confirmation of the methodology: new association found using AdAPT for head and neck cancer
- Using Prior Information from the Medical Literature in GWAS of Oral Cancer Identifies Novel Susceptibility Variant on Chromosome 4 — the AdAPT Method, Johansson et al, PLoS ONE, May 2012: http://dx.plos.org/10.1371/journal.pone.0036888
- See also PLoS Computational Biology paper, Feb 2013
Future plans: epigenetics and gene-environment interaction studies — GxE

Other Users

SLaM, South London and Maudsley Hospital Biomedical Research Center
- largest UK mental health patient cohort
- sophisticated clinical record information system
British Library
Food and Environment Research Agency
TSO, the Stationery Office
TNA, the UK National Archives
SMEs: Fizzback; Innovantage; Sentimetrix; Ontotext; ...
Corporates:
- pharmas (all the majors)
- publishers, media
- bizintel users (demand: ~$1 billion in 2010)
...
You next?

The GATE Family

an architecture
an IDE: GATE Developer: an integrated development environment for language processing components bundled with a widely used Information Extraction system and a comprehensive set of other plugins
a framework: GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the services used by GATE Developer and more
- used worldwide by thousands of scientists, companies, teachers and students (>30k downloads per year at present, not counting SVN)
- open source (LGPL), 100% java
a web app: GATE Teamware: a collaborative annotation environment for factory-style semantic annotation projects built around a workflow engine
- a process: not "get this software and it will revolutionise your life" but "this is how to implement robust and maintainable services"
GATE Cloud: a parallel and distributed service infrastructure running on Amazon EC2
GATE Mímir: (Multi-paradigm Information Management Index and Repository) a scaleable multiparadigm index built on Ontotext's semantic repository family, GATE's annotation structures database plus full-text indexing from MG4J
and finally...
- GATE Prospector (semantic search UI)
- related tools from Ontotext (OWLIM, KIM, Linked Data endpoints)
- a wiki/CMS (GATE Wiki.sf.net), mainly to host our own websites and as a testbed for some of our experiments
- a community

GATE Developer (1)

Motivation: 1990s apparatus envy: physicists had supercolliders; medics had MRI scanners; language processing researchers had.... Perl?

A specialist Integrated Development Environment for language engineering R&D
Analogous to
- Eclipse or Netbeans for programmers
- Mathematica or SPSS for maths and stats
Visualisation and editing text, annotations, ontologies, parse trees, etc.
Constructing applications from components
Measurement, evaluation, benchmarking
Etc., etc.

GATE Embedded (1)

Object-oriented Java framework. Architectural principles:

Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, OWLIM, Weka, LingPipe, OpenNLP, SVM Light, etc. etc....)
(Almost) everything is a component, and component sets are user-extendable
(Almost) all operations are available both from API (Embedded) and GUI (Developer)

CREOLE: a Collection of REusable Objects for Language Engineering

GATE components: modified Java Beans with XML configuration
The minimal component = 10 lines of Java, 10 lines of XML, 1 URL

GATE Embedded (3)

persistence, visualisation and editing
a finite state transduction language (JAPE)
extraction of training instances for machine learning (ML)
pluggable ML implementations (Weka, YALE, SVM, ...)
components for language processing, e.g. parsers, machine learning tools, stemmers, a few IR tools (Lucene, GYM query plugins), IE components for various languages...
bundled with a very widely used Information Extraction system (ANNIE)
- MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.
simple API for RDF or OWL (metadata) via OWLIM
a suite of tools for biomedical text processing
kitchen sink

Process, workflow, GATE Teamware (1)

A typical annotation project:

client discussion, task exploration,
draft extraction specification
manual annotation, inter-annotator
agreement, iterate the task spec
prototype a machine solution
more manual annotation for training
and test data (gold standard)
implement production solution
more manual annotation for quality
control, maintenance, adaptation...
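
The inter-annotator agreement step above can be sketched in a few lines. This is a minimal illustration using Cohen's kappa for two annotators labelling the same items; kappa is one common agreement measure and is chosen here as an assumption, since the slides do not say which measures a given project uses. The function name and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    labels_a and labels_b are parallel lists: position i holds the label
    each annotator gave to item i.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labelled at random with
    # their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["Person", "Person", "Org", "Org"]
annotator_2 = ["Person", "Org", "Org", "Org"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A low score at this stage is usually a signal to revise the task specification rather than to retrain the annotators, which is why the step sits inside the "iterate the task spec" loop.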

Process, workflow, GATE Teamware (2)

The GATE process is a set of steps to follow in the definition, prototyping, development, deployment and maintenance of semantic annotation processes.
GATE Teamware is a workflow-based web engine supporting these processes.
Based on JBoss Process Management engine (BPEL compatible)
Teamware supports marshalling the manual annotation team, job allocation, quality control, training, communication, process monitoring...
Case study: Lighthouse Group runs teams of annotators in Cebu (Philippines), e.g. supplying 10,000 hours to Khresmoi project for on-line medical information.

Teamware (3): workflow configuration

Teamware (4): process monitoring

Teamware (5): quality control

Teamware (6): staff communication

GATE Cloud (1): instant scaling with no CAPEX

Cloud computing means many things in many contexts. On GATECloud.net it means:

zero fixed costs: you don't buy software licences or server hardware, just pay for the compute time that you use
near zero startup time: in a matter of minutes you can specify, provision and deploy the type of computation that used to take months of planning
easy in, easy out: if you try it and don't like it, go elsewhere! you can even take the software with you, it's all open source
someone else takes the admin load:
- the GATE team from the University of Sheffield make sure you're running the best of breed technology for text, search and semantics
- cloud providers' data center managers (e.g. at Amazon Inc.) make sure the hardware and operating platform for your work is scaleable, reliable and cheap

GATE Cloud (2): engineering

a parallel execution engine for automatic annotation processes + distributed execution of that engine
scalability: auto-scaling of processor swarms running on top of AWS EC2
flexibility: parameters configure behaviour, select the GATE application being executed, the input protocol used for reading documents, the output protocol used for exporting the resulting annotations, ...
robustness: jobs run unattended over large data sets
- extensively tested and profiled (no memory leaks)
- errors and exceptions that occur during processing are trapped and reported
- if the process crashes (e.g. hardware failure), can be restarted and resumes execution where it left off
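
The robustness points above (trap and report errors, resume where the job left off) can be illustrated with a small checkpointing loop. This is a sketch under stated assumptions, not the GATE Cloud implementation: `run_batch`, the JSON checkpoint file and the `annotate` callback are all hypothetical names, and document ids are assumed to be strings.

```python
import json
import os


def run_batch(doc_ids, annotate, checkpoint_path):
    """Process documents, skipping any that an earlier run already finished.

    annotate is any callable taking a document id. Failures are recorded
    rather than aborting the batch, and are retried on the next run.
    """
    done, failed = set(), {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)
        done, failed = set(state["done"]), state["failed"]

    for doc_id in doc_ids:
        if doc_id in done:
            continue  # finished before the restart
        try:
            annotate(doc_id)
            done.add(doc_id)
            failed.pop(doc_id, None)
        except Exception as exc:  # trap and report, keep the job alive
            failed[doc_id] = repr(exc)
        # Write to a temp file and rename, so a crash mid-write cannot
        # corrupt the checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump({"done": sorted(done), "failed": failed}, fh)
        os.replace(tmp_path, checkpoint_path)

    return done, failed
```

Calling `run_batch` again with the same checkpoint path after a crash only touches documents that are not yet marked as done.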

GATE Cloud (3): how it works

A cloud annotation job

GATE Cloud (4): a research perspective

something like the facility that the IRF was trying to set up for IR more generally
host a growing family of experimental system configurations, data sets, results
biased heavily towards information extraction (perhaps some mileage in adding more mainstream IR?)
persistence and reuse of experimental setups: virtualisation makes it possible to store not just data but the entire compute platform operable for particular experiments or analyses
plans for 2013:
- bigger big data
- Hadoop
- a marketplace, 3rd-party contributions, AnnoMarket

First Cousins — the Ontotext family

Complementing the GATE tools, KIM provides a straightforward front-end deployment option, and Ontotext's Linked Data offerings provide a good baseline for model building:

Ontotext KIM: UIs demonstrating multiple conceptual and faceted search modes
- see also GATE Prospector (below)
Ontotext OWLIM: the fastest and most scaleable semantic repository
Ontotext FactForge: ~4 billion statements from the Linked Data cloud
Ontotext Linked Life Data: over 4 billion statements from life sci databases including UniProt, PubMed, EntrezGene and 20 more

GATE Mímir: hitting the indexing problem

Circa 2007:

A new project on patent searching at the IRF = culture shock!
- Full text, boolean ("100% recall"!)
Initial prototyping and demo work:
- Conceptual and semantic search and navigation (KIM, as above)
- ANNotations In Context (ANNIC) (right)
User requirement: put it all together
Ooops: ANNIC scaled to 200 short docs...

ANNIC (ANNotations In Context):

Annotations: the Missing Data Structure (1)

How do we search a billion-node annotation graph...?

Model:

(Cf. TIPSTER, TEI/XCES, ATLAS, UIMA, ...)
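
As a rough picture of what "annotation graph" means here, the sketch below models stand-off annotations: typed spans with features that point at character offsets in an unmodified text. It follows the general idea shared by the models cited above and is not GATE's actual class design; `Annotation`, `Document` and their methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    type: str   # e.g. "Person", "Location", "Token"
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, start, end, type, **features):
        if not 0 <= start <= end <= len(self.text):
            raise ValueError("offsets outside the document")
        ann = Annotation(start, end, type, features)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]

    def overlapping(self, start, end, type=None):
        """Annotations sharing at least one character with [start, end)."""
        return [a for a in self.annotations
                if a.start < end and start < a.end
                and (type is None or a.type == type)]


doc = Document("Hamish Cunningham works in Sheffield.")
place_start = doc.text.index("Sheffield")
place = doc.annotate(place_start, place_start + len("Sheffield"),
                     "Location", kind="city")
person = doc.annotate(0, len("Hamish Cunningham"), "Person")
```

Because spans may overlap or nest arbitrarily, the result is a graph over offsets rather than a tree, which is what makes tree-oriented XML indexing a poor fit.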

UI example:

Annotations: the Missing Data Structure (2)

Our first thoughts: where can we steal one?
Annotations: how to index the graph?
- XML indexing and retrieval work doesn't solve it (biased towards trees)
- RDBMS doesn't solve it (biased towards relations)
- augmented full-text indices can help with efficient access, but the data storage requirements of our prototype (based on Lucene) grew exponentially with the cardinality of the annotation sets

May 2008: workshop on Persisting, Indexing and Querying Multi-Paradigm Text Models, IRF, Vienna

MG4J (Eric Graf, Glasgow)
Terrier (Gianni Amati, FUB/Glasgow)
INEX (Norbert Fuhr, Essen-Duisburg)
KIM, OWLIM (Atanas Kiryakov, Ontotext)

ANNIC (Valentin Tablan, Sheffield)
HTML-XML Search Engines (Ralf Schenkel, MPG)
MonetDB (Arjen de Vries, CWI)

May 2009: custom solution based on MG4J (Sebastiano Vigna) + OWLIM called Mímir...

Mímir: Multi-paradigm Information Management Index and Repository

Mímir is an index engine that can search over:

text
textual and semantic annotations
ontologies and knowledge bases
scales to the terabyte level

Built on top of: the MG4J text indexing library; GATE's annotation index (remodelled in MG4J); Ontotext's semantic repository family
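
To give a feel for searching text and annotations together, here is a toy positional index in which a query can mix plain words with annotation types. It is only an illustration of the idea: the real system builds on MG4J and a semantic repository, and the `TinyIndex` class and the `{Type}` query notation used here are assumptions made for this sketch.

```python
from collections import defaultdict


class TinyIndex:
    def __init__(self):
        self.tokens = defaultdict(list)  # word -> [(doc_id, position)]
        self.spans = defaultdict(list)   # type -> [(doc_id, start, end)]

    def add(self, doc_id, words, annotations=()):
        """Index a tokenised document plus (type, start, end) token spans."""
        for pos, word in enumerate(words):
            self.tokens[word.lower()].append((doc_id, pos))
        for ann_type, start, end in annotations:
            self.spans[ann_type].append((doc_id, start, end))

    def _matches(self, item):
        """Yield (doc_id, start, end) for one query item."""
        if item.startswith("{") and item.endswith("}"):
            yield from self.spans.get(item[1:-1], [])
        else:
            for doc_id, pos in self.tokens.get(item.lower(), []):
                yield (doc_id, pos, pos + 1)

    def search(self, *items):
        """Hits where the query items occur one directly after another."""
        if not items:
            return []
        hits = list(self._matches(items[0]))
        for item in items[1:]:
            # Group candidate continuations by where they start.
            starts = defaultdict(list)
            for doc_id, start, end in self._matches(item):
                starts[(doc_id, start)].append(end)
            hits = [(d, s, e2) for d, s, e in hits
                    for e2 in starts.get((d, e), [])]
        return hits


index = TinyIndex()
index.add("d1", ["Hamish", "Cunningham", "said", "GATE", "scales"],
          [("Person", 0, 2)])
index.add("d2", ["Nobody", "said", "anything"])
hits = index.search("{Person}", "said")
```

The query finds only the document where a Person annotation is directly followed by the word "said", something neither a pure full-text index nor a pure annotation store answers on its own.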

May 2010: version 2: incremental indices; federation
May 2011: version 3: full source release under the AGPL; cloud release
2012: version 4: document centric results mode; 64-bit doc ids
...?: version 5: http://gate.ac.uk/mimir/doc/mimir-guide.pdf (p. 61)

GATE Prospector

All that data, all those query languages, what about a UI?! There are some developer UIs; or KIM; or:

Prospector co-occurrence

Full lifecycle information extraction

1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth).
2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy, or something from the Linked Data cloud) and call this your ontology.
3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.). (For smaller jobs use GATE Developer.)
4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard.
5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded). Use it to bootstrap more manual (now semi-automatic) work in Teamware.
6. Use GATE Mímir to store the annotations relative to the ontology in a multiparadigm index server.
7. (Probably) write a domain-specific UI to go on top of Mímir; see demos.gate.ac.uk/pin for a simple example.
8. Hey presto, you have search that applies your annotations and your ontology to your corpus (and a sustainable process for coping with changing information needs and/or changing text).
9. Your users are happy (and GATE.ac.uk has a "donate" button ;-) ).
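
"Measure performance against the gold standard" in the lifecycle above usually comes down to precision, recall and F1 over annotation sets. A minimal sketch follows, assuming annotations are represented as (start, end, type) tuples and that only exact matches count; partial-overlap scoring is deliberately left out, and the function name and example spans are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Strict scoring: an annotation is correct only if its offsets and
    type both match a gold annotation exactly."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(0, 17, "Person"), (27, 36, "Location"), (40, 44, "Organization")}
predicted = {(0, 17, "Person"), (27, 36, "Organization")}
p, r, f = precision_recall_f1(gold, predicted)
```

Tracking these numbers each time the pipeline or the gold standard changes is what turns the steps above into a repeatable process rather than a one-off build.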
