The Infernal Beauty of Text

Hamish Cunningham,
University of Sheffield

Contents

1. Introduction
- Examples: TV Sport; Genetic Epidemiology; ...
2. The GATE family
- Developer, Embedded
- Teamware, Process
- Cloud
- KIM, OWLIM, Linked Data
- Mímir
3. [Demos
- an IDE for text analysis specialists
- collaborative manual annotation workflows
- GATECloud.net
- Mímir: a mixed-mode index server]
4. A lifecycle for text analysis

Context (1): the Infernal Beauty of Text

Language is the quintessential product of human cooperation

text is a beautiful gift that projects moments in time and place forwards
we're so good at language that it seems easy, but...
...it is infernally difficult to compute (a measure of our ignorance of human intelligence?)

A large proportion of what we know is externalised only in text

structured data (DBs, taxonomies, dictionaries, ontologies...): machine tractable, but expensive and inflexible
how do we bridge these two worlds?
text analysis becoming a predictable and robust engineering process
deriving structured data from textual sources now much easier

Context (2): Finding, Navigating, Abstracting

Value, volume

if you have content that is sufficiently high value or low volume (or you're Google) then you can use sophisticated methods to help people find, browse or abstract over that content
these methods include building symbolic models (taxonomic, logical, conceptual, semantic...) of the subject matter and annotating content with references to those models

social vs. technological success factors
- technological: the expressivity of the modelling languages and the quality of the annotation algorithms, and of the indexing, search and browsing tools
- social: the level of expertise and the quantity of time and effort deployed by the people building models and extraction patterns (or creating training data and running learning algorithms), or tuning indices or user interfaces
GATE: family of tools that can minimise time and effort in developing and maintaining rich information retrieval and management systems, while attempting to stay close to the state of the technological art (sometimes by favouring interoperation and reuse over innovation or reinvention)
covers the full lifecycle of text analysis & rich search

Example 1: TV Sport

The BBC served its 2010 World Cup pages out of BigOWLIM. They used text mining to make links into other relevant pages according to their (ontological) data model. They now use GATE for this text mining function -- next stop the 2012 Olympics...

The BBC's system achieved cost savings of ~80% compared to a conventional database-backed web system.

The Press Association is also going full speed with similar efforts, following on from their long-running GATE project that processes the captions in their massive image library.

Media is a perfect application area for our text analysis and semantic modelling technology, partly because journalistic language is very well-behaved (relatively speaking!), partly because the content is extremely valuable, and partly because existing classification schemes are typically applied quite rigorously.

Example 2: Genetic Epidemiology

it is hypothesised that
- genetic factors play a strong role in susceptibility to disease
- in future targetted pharmaceuticals will be tailored to individual genetics
a substantial body of work looks for associations between mutations and diseases
World Health International Agency for Research in Cancer, the world's biggest cancer epidemiology lab.
genetics groups: which mutations (SNPs) associate with carcinogenesis?
new trends in genetic association studies (stimulated by decreased cost of sequencing):
- objective: identify common genetic variants involved in susceptibility to disease
- candidate gene approach: genes selected and tested based on prior knowledge/hypotheses
- GWAS approach: test “all” common genetic variants with no prior knowledge/hypothesis (Genome-Wide Association Studies)

Genetic Epidemiology (2)

The problem: needle (significant associations) in a haystack (large-scale gene sequence probes)

Genetic Epidemiology (3)

The WHO results

Nature paper 2008: GWAS result showing a particular genetic polymorphism correlates with increased risk of lung cancer in smokers
required a large amount of manual work to examine data from sensor arrays
the usual statistical techniques need large numbers of samples to make the analysis usable and reliable

An annotation experiment

our experiment used text analysis to add prior knowledge about genes
e.g. if a gene is expressed in lung tissue, represent this in the BFDP model when calculating relevance of sensor data for related polymorphisms
use annotation to find papers that discuss particular genes, diseases, anatomy and so on (AdAPT -- Adjusting Association Priors with Text)
works using half the data (potential saving: €250k)

2010: applied to head and neck cancer and found a new association

(Next: epigenetics and gene-environment interaction studies -- GxE)

More examples

TNA, the UK National Archives
TSO, the Stationery Office
SLaM, South London and Maudsley Hospital
SMEs: Fizzback; Innovantage; Sentimetrix; Ontotext; ...
Corporates: pharmas, publishers, bizintel users
- demand: ~$1 billion in 2010
starting up: British Library; Food and Environment Research Agency; Health on the Net
(you next?)

1. Introduction
- Context
- Examples: TV Sport; Genetic Epidemiology; etc.
2. The GATE family
- Wiki
3. [Demos
- an IDE for text analysis specialists
- collaborative manual annotation workflows
- GATECloud.net
- Mímir: a mixed-mode index server]
4. A lifecycle for text analysis

The GATE Family

an architecture
an IDE: GATE Developer: an integrated development environment for language processing components bundled with a widely used Information Extraction system and a comprehensive set of other plugins
a framework: GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the services used by GATE Developer and more
- used worldwide by thousands of scientists, companies, teachers and students (>30k downloads per year at present, not counting SVN)
- open source (LGPL), 100% java
a web app: GATE Teamware a collaborative annotation environment for factory-style semantic annotation projects built around a workflow engine
- a process: not "get this software and it will revolutionise your life" but "this is how to implement robust and maintainable services"
GATE Cloud: a parallel and distributed service infrastructure running on Amazon EC2
GATE Mímir: (Multi-paradigm Information Management Index and Repository) a scaleable multiparadigm index built on Ontotext's semantic repository family, GATE's annotation structures database plus full-text indexing from MG4J
and finally...
- related tools from Ontotext (OWLIM, KIM, Linked Data endpoints)
- GATE Wiki
- a community

GATE Developer (1)

Motivation: 1990s apparatus envy: physicists had supercolliders; medics had MRI scanners; language processing researchers had.... Perl?

A specialist Integrated Development Environment for language engineering R&D
Analogous to
- Eclipse or Netbeans for programmers
- Mathematica or SPSS for maths and stats
Visualisation and editing text, annotations, ontologies, parse trees, etc.
Constructing applications from components
Measurement, evaluation, benchmarking
Etc., etc.

GATE Embedded (1)

Object-oriented Java framework. Architectural principles:

Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, OWLIM, Weka, Lingpipe, OpenNLP, SVM Lite, etc. etc....)
(Almost) everything is a component, and component sets are user-extendable
(Almost) all operations are available both from API (Embedded) and GUI (Developer)

CREOLE: a Collection of REusable Objects for Language Engineering

GATE components: modified Java Beans with XML configuration
The minimal component = 10 lines of Java, 10 lines of XML, 1 URL

GATE Embedded (3)

persistence, visualisation and editing
a finite state transduction language (JAPE)
extraction of training instances for machine learning (ML)
pluggable ML implementations (Weka, YALE, SVM, ...)
components for language processing, e.g. parsers, machine learning tools, stemmers, a few IR tools (Lucene, GYM query plugins), IE components for various languages...
bundled with a very widely used Information Extraction system (ANNIE)
- MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.
simple API for RDF or OWL (metadata) via OWLIM
kitchen sink

Process, workflow, GATE Teamware (1)

A typical annotation project:

client discussion, task exploration,
draft extraction specification
manual annotation, inter-annotator
agreement, iterate the task spec
prototype a machine solution
more manual annotation for training
and test data (gold standard)
implement production solution
more manual annotation for quality
control, maintenance, adaptation...

Process, workflow, GATE Teamware (2)

The GATE process is a set of steps to follow in the definition, prototyping, development, deployment and maintenance of semantic annotation processes.
GATE Teamware is a workflow-based web engine supporting these processes.
Based on JBoss Process Management engine (BPEL compatible)
Teamware supports marshalling the manual annotation team, job allocation, quality control, training, communication, process monitoring...
Case study: Lighthouse Group runs teams of annotators in Cebu (Philippines), e.g. supplying 10,000 hours to Khresmoi project for on-line medical information.

Teamware (3): workflow configuration

Teamware (4): process monitoring

Teamware (5): quality control

Teamware (6): staff communication

GATE Cloud (1): the marketing BS

Cloud computing means many things in many contexts. On GATECloud.net it means:

zero fixed costs: you don't buy software licences or server hardware, just pay for the compute time that you use
near zero startup time: in a matter of minutes you can specify, provision and deploy the type of computation that used to take months of planning
easy in, easy out: if you try it and don't like it, go elsewhere! you can even take the software with you, it's all open source
someone else takes the admin load:
- the GATE team from the University of Sheffield make sure you're running the best of breed technology for text, search and semantics
- cloud providers' data center managers (e.g. at Amazon Inc.) make sure the hardware and operating platform for your work is scaleable, reliable and cheap

GATE Cloud (2): engineering

parallel execution engine of automatic annotation processes + distributed execution of parallel engine
scalability: auto-scaling of processor swarms running on top of AWS EC2
flexibility: parameters configure behaviour, select the GATE application being executed, the input protocol used reading documents, the output protocol used for exporting the resulting annotations, ...
robustness: jobs run unattended over large data sets
- extensively tested and profiled (no memory leaks)
- errors and exceptions that occur during processing are trapped and reported
- if the process crashes (e.g. hardware failure), can be restarted and resumes execution where it left off

A cloud annotation job

GATE Cloud (3): a research perspective

something like the facility that the the IRF was trying to set up for IR more generally
host a growing family of experimental system configurations, data sets, results
biased heavily towards information extraction (perhaps some mileage in adding more mainstream IR?)
persistence and reuse of experimental setups: virtualisation makes it possible to store not just data but the entire compute platform operable for particular experiments or analyses

First Cousins -- the Ontotext family

Complementing the GATE tools KIM provides a straightforward front-end deployment option and their Linked Data offerings a good baseline for model building:

Ontotext KIM: UIs demonstrating multiple conceptual and facetted search modes
Ontotext OWLIM: the fastest and most scaleable semantic repository
Ontotext FactForge: ~4 billion statements from the Linked Data cloud
Ontotext Linked Life Data: over 4 billion statements from life sci databases including UniProt, PubMed, EntrezGene and 20 more

GATE Mímir: hitting the indexing problem

Circa 2007:

A new project on patent searching at the IRF = culture shock!
- Full text, boolean ("100% recall"!)
Initial prototyping and demo work:
- Conceptual and semantic search and navigation (KIM, as above)
- ANNotations In Context (ANNIC) (right)
User requirement: put it all together
Ooops: ANNIC scaled to 200 short docs...

ANNIC (ANNotations In Context):

Annotations: the Missing Data Structure (1)

How do we search a billion-node annotation graph...?

Model:

(Cf. TIPSTER, TEI/XCES, ATLAS, UIMA, ...)

UI example:

Annotations: the Missing Data Structure (2)

Our first thoughts: where can we steal one?
Annotations: how to index the graph?
- XML indexing and retrieval work doesn't solve it (biased towards trees)
- RDBMS doesn't solve it (biased towards relations)
- augmented full-text indices can help with efficient access, but the data storage requirements of our prototype (based on Lucene) grew exponentially with the cardinality of the annotation sets

May 2008: workshop on Persisting, Indexing and Querying Multi-Paradigm Text Models, IRF, Vienna

MG4J (Eric Graf, Glasgow)
Terrier (Gianni Amati, FUB/Glasgow)
INEX (Norbert Fuhr, Essen-Duisburg)
KIM, OWLIM (Atanas Kiryakov, OntoText)

ANNIC (Valentin Tablan, Sheffield)
HTML-XML Search Engines (Ralf Schenkel, MPG)
Monet DB (Arjen de Vries, CWI)

May 2009: custom solution based on MG4J (Sebastiano Vigna) + OWLIM called Mímir...
May 2010: version 2: incremental indices; federation
May 2011: version 3: full source release under the AGPL; cloud release

Mímir: Multi-paradigm Information Management Index and Respository

Mímir is an index engine that can search over:

text
textual and semantic annotations
ontologies and knowledge bases

Built on top of:

the MG4J text indexing library
GATE's annotation index (remodelled in MG4J)
Ontotext's semantic repository family

(Just about) scales to a terabyte of annotated text

More information: some query examples; demos; user and developer guide

The Poor Relation: GATE Wiki (1)

Why another wiki? Scratching three itches:

adding interaction to a largish static site (15k HTML files, 40k other files)
wiki style collaborative document creation with asynchonous off-line editing
a test-bed for experiments in controlled languages for round-trip ontology engineering

Hence CoW, a Controllable Wiki (aka GATEWiki): http://gate.ac.uk/gatewiki/cow/

GATE Wiki (2)

Main features

designed from the ground up to support concurrent editing and off-line working with straightforward synchronisation using Subversion (SVN)
uses the YAM language, which
- outputs LaTeX as well as HTML
- allows paths as links (i.e. does not limit the namespace to a single directory like e.g. JSPWiki does) and consequently allows a tree-structured page store (and later graph-structured navigation via an ontology)
allows mixing of all types of files in its page store (which is just an SVN sandbox, in fact)
supports versioning and differencing via SVN, and allows other tools that manipulate SVN repositories to be used with the wiki data (e.g. SVN itself, Eclipse, ViewCVS, etc.)
may optionally support embedded CLOnE (Controlled Language for Ontology Editing), and therefore experiments with applications that store their data in semantic repositories whose schema is user-defined and maintained

1. Introduction
- Context
- Examples: TV Sport; Genetic Epidemiology; etc.
2. The GATE family
- Developer, Embedded
- Teamware, Process
- Cloud
- KIM, OWLIM, Linked Data
- Mímir
- Wiki
3. [Demos
- Mímir: a mixed-mode index server]
4. A lifecycle for text analysis

Demos

			an IDE for text analysis specialists
			collaborative manual annotation workflows

			GATECloud.net

			Mímir: a mixed-mode index server

1. Introduction
- Context
- Examples: TV Sport; Genetic Epidemiology; etc.
2. The GATE family
- Developer, Embedded
- Teamware, Process
- Cloud
- KIM, OWLIM, Linked Data
- Mímir
- Wiki
3. [Demos
- an IDE for text analysis specialists
- collaborative manual annotation workflows
- GATECloud.net
- Mímir: a mixed-mode index server]
4. A lifecycle for text analysis

Full lifecycle information extraction

Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth).
Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy, or something from the Linked Data cloud) -- call this your ontology.
Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.).
Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard.
Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded). Use it to bootstrap more manual (now semi-automatic) work in Teamware.
Use GATE Mímir to store the annotations relative to the ontology in a multiparadigm index server.
(Probably) write a half-decent UI to go on top of Mímir.
Hey presto, you have search that applies your annotations and you ontology to your corpus (and a sustainable process for coping with changing information need and/or changing text).
Your users are happy (and GATE.ac.uk has a "donate" button ;-) ).

The Infernal Beauty of Text

Context (1): the Infernal Beauty of Text

Context (2): Finding, Navigating, Abstracting

Example 1: TV Sport

Example 2: Genetic Epidemiology

Genetic Epidemiology (2)

Genetic Epidemiology (3)

More examples

Contents

The GATE Family

GATE Developer (1)

GATE Developer (2)

GATE Embedded (1)

GATE Embedded (2)

GATE Embedded (3)

Process, workflow, GATE Teamware (1)

Process, workflow, GATE Teamware (2)

Teamware (3): workflow configuration

Teamware (4): process monitoring

Teamware (5): quality control

Teamware (6): staff communication

GATE Cloud (1): the marketing BS

GATE Cloud (2): engineering

GATE Cloud (3): a research perspective

First Cousins -- the Ontotext family

GATE Mímir: hitting the indexing problem

Annotations: the Missing Data Structure (1)

Annotations: the Missing Data Structure (2)

Mímir: Multi-paradigm Information Management Index and Respository

The Poor Relation: GATE Wiki (1)

GATE Wiki (2)

Contents

Demos

Contents

Full lifecycle information extraction

Links