20 years' work in 20 minutes

(plus two demos)

Hamish Cunningham,
University of Sheffield


Contents

  • 1. Background: the GATE family
    • Developer, Embedded
    • Teamware, Process
    • Cloud
    • KIM, OWLIM, Linked Data
    • Mímir
    • [Wiki]
  • 2. Demos
    • [an IDE for text analysis specialists]
    • [collaborative manual annotation workflows]
    • GATECloud.net
    • Mímir: a mixed-mode index server
  • [3. Case studies]
    • [IARC: the WHO's cancer lab]
    • [TNA: UK National Archives]
    • [SLaM: NHS clinical records mining]
  • 4. A lifecycle for text analysis

Expensive IR systems

Value, volume

Social vs. technological success factors

The GATE Family

GATE Developer (1)

Motivation: 1990s apparatus envy. Physicists had supercolliders; medics had MRI scanners; language processing researchers had... Perl?

GATE Developer (2)

GATE Embedded (1)

Object-oriented Java framework. Architectural principles:

CREOLE: a Collection of REusable Objects for Language Engineering

GATE Embedded (2)

GATE Embedded (3)

Process, workflow, GATE Teamware (1)

A typical annotation project:

  • client discussion, task exploration, draft extraction specification
  • manual annotation, inter-annotator agreement, iterate the task spec
  • prototype a machine solution
  • more manual annotation for training and test data (gold standard)
  • implement production solution
  • more manual annotation for quality control, maintenance, adaptation...
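
The inter-annotator agreement step in the list above can be made concrete with a small sketch. It uses Cohen's kappa, one common agreement measure for two annotators; the slides do not say which measure a given project or Teamware itself uses, so treat the choice of kappa, the function name and the toy labels as assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for six text spans (illustration only).
annotator_1 = ["Person", "Person", "Org", "None", "Org", "None"]
annotator_2 = ["Person", "Org", "Org", "None", "Org", "Person"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A low score at this stage usually points at an ambiguous task specification rather than careless annotators, which is why the list says to iterate the spec.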
   

Process, workflow, GATE Teamware (2)

Teamware (3): workflow configuration

Teamware (4): process monitoring

Teamware (5): quality control

Teamware (6): staff communication

GATE Cloud (1): the marketing BS

Cloud computing means many things in many contexts. On GATECloud.net it means:

GATE Cloud (2): engineering

 

  • a parallel execution engine for automatic annotation processes, plus distributed execution of that engine across machines
  • scalability: auto-scaling of processor swarms running on top of AWS EC2
  • flexibility: parameters configure behaviour, selecting the GATE application to execute, the input protocol used for reading documents, the output protocol used for exporting the resulting annotations, ...
  • robustness: jobs run unattended over large data sets
    • extensively tested and profiled (no memory leaks)
    • errors and exceptions that occur during processing are trapped and reported
    • if a process crashes (e.g. hardware failure), the job can be restarted and resumes execution where it left off
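
The robustness points above (trap and report errors, resume after a crash) can be sketched in a few lines. This is not GATE Cloud's implementation: the checkpoint file format, the `run_job` function and the simulated failures are all hypothetical, chosen only to show the pattern.

```python
import json
import os
import tempfile


def run_job(doc_ids, annotate, checkpoint_path):
    """Process documents one by one, trapping errors and checkpointing progress."""
    state = {"done": [], "failed": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            state = json.load(fh)  # resume from an earlier, interrupted run
    finished = set(state["done"]) | set(state["failed"])
    for doc_id in doc_ids:
        if doc_id in finished:
            continue  # already handled before the restart
        try:
            annotate(doc_id)
            state["done"].append(doc_id)
        except Exception as exc:  # trap and report; keep the job alive
            state["failed"][doc_id] = repr(exc)
        # Write to a temp file, then rename, so a crash never leaves a torn checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, checkpoint_path)
    return state


class SimulatedCrash(BaseException):
    """Stands in for a hardware failure; deliberately not trapped by run_job."""


calls = []


def flaky_annotate(doc_id):
    calls.append(doc_id)
    if doc_id == "doc2":
        raise ValueError("malformed input")  # an ordinary, reportable error
    if doc_id == "doc3" and calls.count("doc3") == 1:
        raise SimulatedCrash()  # the whole process dies on the first attempt


checkpoint = os.path.join(tempfile.mkdtemp(), "job.json")
docs = ["doc1", "doc2", "doc3", "doc4"]
try:
    run_job(docs, flaky_annotate, checkpoint)
except SimulatedCrash:
    pass  # the "machine" went down part-way through
final_state = run_job(docs, flaky_annotate, checkpoint)  # restart: picks up at doc3
print(final_state)
```

Checkpointing after every document is the simplest choice; a real system would probably batch the writes.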

GATE Cloud (3): the research point

First Cousins -- the Ontotext family

Complementing the GATE tools, KIM provides a straightforward front-end deployment option, and Ontotext's Linked Data offerings provide a good baseline for model building:

GATE Mímir: hitting the indexing problem

Circa 2007:

  • A new project on patent searching at the IRF = culture shock!
    • Full text, boolean ("100% recall"!)
  • Initial prototyping and demo work:
    • Conceptual and semantic search and navigation (KIM, as above)
    • ANNotations In Context (ANNIC)
  • User requirement: put it all together
  • Oops: ANNIC scaled to 200 short docs...

ANNIC (ANNotations In Context):

Annotations: the Missing Data Structure (1)

How do we search a billion-node annotation graph...?




Model:


(Cf. TIPSTER, TEI/XCES, ATLAS, UIMA, ...)
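
The model diagram did not survive in this text, so here is a minimal sketch of the stand-off style of annotation shared by the systems cited above: the text is left untouched, and each annotation is a typed span of character offsets carrying a feature map. The class and method names are illustrative assumptions, not GATE's API.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int       # character offset where the span begins
    end: int         # character offset just past the end of the span
    ann_type: str    # e.g. "Person", "Sentence"
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, start, end, ann_type, **features):
        ann = Annotation(start, end, ann_type, features)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]

    def within(self, outer):
        """Annotations whose span lies inside `outer` (excluding `outer` itself)."""
        return [a for a in self.annotations
                if a is not outer and outer.start <= a.start and a.end <= outer.end]


def span_of(text, phrase):
    """Offsets of the first occurrence of `phrase` (helper for this demo only)."""
    start = text.index(phrase)
    return start, start + len(phrase)


doc = Document("Hamish Cunningham works at the University of Sheffield.")
sentence = doc.annotate(0, len(doc.text), "Sentence")
person = doc.annotate(*span_of(doc.text, "Hamish Cunningham"), "Person")
org = doc.annotate(*span_of(doc.text, "University of Sheffield"),
                   "Organization", kind="university")

print(doc.covered_text(person))
print([a.ann_type for a in doc.within(sentence)])
```

The linear scan in `within` is fine for one document and hopeless for a billion annotations, which is the indexing problem the next slides address.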

         

UI example:

Annotations: the Missing Data Structure (2)

Mímir: Multi-paradigm Information Management Index and Repository

Mímir is an index engine that can search over:

Built on top of:

(Just about) scales to a terabyte of annotated text
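
To give a flavour of what "mixed-mode" search means, here is a toy in-memory index that answers one kind of query: a literal token immediately followed by an annotation of a given type. It is only a sketch; the class, its posting-list layout and the sample documents are invented, and nothing here reflects Mímir's real index structures or query language.

```python
from collections import defaultdict


class ToyMixedIndex:
    """Inverted index over both tokens and annotations (token positions as offsets)."""

    def __init__(self):
        self.token_postings = defaultdict(list)       # token -> [(doc_id, position)]
        self.annotation_postings = defaultdict(list)  # type -> [(doc_id, start, end)]

    def add_document(self, doc_id, tokens, annotations):
        for position, token in enumerate(tokens):
            self.token_postings[token.lower()].append((doc_id, position))
        for ann_type, start, end in annotations:
            self.annotation_postings[ann_type].append((doc_id, start, end))

    def token_then_annotation(self, token, ann_type):
        """Hits where `token` is immediately followed by an `ann_type` annotation."""
        starts = defaultdict(list)
        for doc_id, start, end in self.annotation_postings.get(ann_type, []):
            starts[(doc_id, start)].append(end)
        hits = []
        for doc_id, position in self.token_postings.get(token.lower(), []):
            for end in starts.get((doc_id, position + 1), []):
                hits.append((doc_id, position, end))  # span covers token + annotation
        return hits


index = ToyMixedIndex()
index.add_document(
    "d1",
    ["patent", "filed", "by", "Acme", "Corp", "in", "Vienna"],
    [("Organization", 3, 5), ("Location", 6, 7)],
)
index.add_document(
    "d2",
    ["visited", "by", "tourists", "in", "Vienna"],
    [("Location", 4, 5)],
)

by_org = index.token_then_annotation("by", "Organization")
in_loc = index.token_then_annotation("in", "Location")
print(by_org)
print(in_loc)
```

The point of the query is that the user never types "Acme Corp" or "Vienna": the annotation type does the generalising, while the literal token keeps the full-text precision that patent searchers expect.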

More information: some query examples; demos; user and developer guide

The Poor Relation: GATE Wiki (1)

Why another wiki? Scratching three itches:

  1. adding interaction to a largish static site (15k HTML files, 40k other files)
  2. wiki-style collaborative document creation with asynchronous off-line editing
  3. a test-bed for experiments in controlled languages for round-trip ontology engineering

Hence CoW, a Controllable Wiki (aka GATEWiki): http://gate.ac.uk/gatewiki/cow/

GATE Wiki (2)

Main features

Demos


an IDE for text analysis specialists


collaborative manual annotation workflows


GATECloud.net


Mímir: a mixed-mode index server

Genome-Wide Association Studies at the WHO (1)

The World Health Organization's International Agency for Research on Cancer (IARC), the world's biggest cancer epidemiology lab.

Genetics groups: which mutations (SNPs) associate with carcinogenesis?

New trends in genetic association studies (stimulated by decreased cost of sequencing):

GWAS (2)

The problem: needle (significant associations) in a haystack (large-scale gene sequence probes)

GWAS (3)

The WHO results

An annotation experiment

2010: applied to head and neck cancer and found a new association

The National Archives (TNA) of the UK (1)

TNA (2): current search facilities

TNA (3): new facilities

Method:

TNA (4): architecture

TNA (5): annotation types

South London and Maudsley NHS Trust (1)

Mining clinical records

NHS (2)

Mini-Mental State Examination (MMSE)

  • Generic entities such as anatomical location, diagnosis and drug are sometimes of interest
  • But many of the enquiries we have seen are more interested in large numbers of very specific, ad hoc entities or events
  • This example is with a UK National Biomedical Research Centre
  • An example – cognitive ability as shown by the MMSE score
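
A "very specific" entity like an MMSE score can be illustrated with a small pattern-matching sketch. The slides do not say how the extraction was actually built, so the regular expression, the function name and the sample note below are all invented; the only outside fact relied on is that the MMSE is scored out of 30.

```python
import re

# Keyword, then up to 30 non-digit characters, then "<score>/30" or "<score> out of 30".
MMSE_PATTERN = re.compile(
    r"\b(?:MMSE|mini[- ]mental(?: state)?(?: exam(?:ination)?)?)\b"
    r"[^0-9\n]{0,30}?"
    r"(\d{1,2})\s*(?:/|out of)\s*30\b",
    re.IGNORECASE,
)


def extract_mmse_scores(text):
    """Return MMSE scores mentioned in free text, in order of appearance."""
    scores = []
    for match in MMSE_PATTERN.finditer(text):
        score = int(match.group(1))
        if 0 <= score <= 30:  # discard impossible values such as 45/30
            scores.append(score)
    return scores


# Invented clinical note (illustration only).
note = ("Seen in clinic today. MMSE score was 24/30. "
        "Previous Mini-Mental State Exam: 27 out of 30. "
        "Weight 72 kg, BP 130/85.")
scores = extract_mmse_scores(note)
print(scores)
```

Real clinical notes are far messier (dates next to scores, partial tests, negation), which is why ad hoc entities like this need their own gold standard and evaluation rather than a generic entity recogniser.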

Some results:

NHS (3)

Full lifecycle information extraction

How it all hangs together:

  1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth).
  2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy, or something from the Linked Data cloud) -- call this your ontology.
  3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.).
  4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard.
  5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded). Use it to bootstrap more manual (now semi-automatic) work in Teamware.
  6. Use GATE Mímir to store the annotations relative to the ontology in a multiparadigm index server.
  7. (Probably) write a half-decent UI to go on top of Mímir.
  8. Hey presto, you have search that applies your annotations and your ontology to your corpus (and a sustainable process for coping with changing information needs and/or changing text).
  9. Your users are happy (though possibly also broke :-().
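
Step 4 above says to measure the pipeline against the gold standard. A minimal sketch of that measurement, assuming strict matching (an automatic annotation counts only if its offsets and type both equal a gold annotation), looks like this; the function name and the sample spans are illustrative, and lenient (overlap-based) matching would give different numbers.

```python
def evaluate(gold, predicted):
    """Strict-match precision, recall and F1 over (start, end, type) annotations."""
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented gold-standard and pipeline output for one document.
gold = [
    (0, 17, "Person"),
    (31, 54, "Organization"),
    (60, 66, "Location"),
    (70, 74, "Date"),
]
predicted = [
    (0, 17, "Person"),
    (31, 54, "Organization"),
    (60, 66, "Organization"),  # right span, wrong type: counts as an error
]

precision, recall, f1 = evaluate(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Re-running the same measurement after every change to the pipeline (and after every change to the text or the information need) is what makes the process in step 8 sustainable.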

Links

More information