<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>20 years work in 20 minutes</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="copyright"
content="GATE Team, University of Sheffield - gate.ac.uk"/>
<link rel="stylesheet" type="text/css" media="screen, projection, print"
href="gslidy/slidy.css"/>
<script src="gslidy/slidy.js"
type="text/javascript"></script>
</head>
<body>
<div class="background">
<img id="head-icon" alt="" align="right" src="http://gate.ac.uk/sale/images/gate4/logo-colour.png" width="150"/>
</div>
<div class="slide">
<h1 class="cow-title-heading">GATE: Full Lifecycle Text Analytics</h1>
<p><em>(20 years work in 20 minutes, plus questions)</em></p>



<table> <tr><td>
<p>Hamish Cunningham, <br> University of Sheffield <br> <br> <br>
<b>These slides:</b> <a class="cow-url" href="http://tinyurl.com/figint13">tinyurl.com/figint13</a>
<br> (FIG: <em>First International GATE <br> symposium</em>, number 6)
<br> <br> <br>
<img src="splash.png" alt="GATE" width="280" height="215" align="top" border="0"></p>
</td><td>
<p><b>Contents</b></p>
<ul>
<li><u><b>0. Background; Housekeeping</b></u></li>
<li>1. GATE in the Wild
<ul>
<li>Semantics in the Media</li>
<li>Text Mining for Biomedicine</li>
<li>Other Users</li>
</ul></li>
<li>2. The GATE Family
<ul>
<li>Developer, Embedded</li>
<li>Teamware, Process</li>
<li>Cloud</li>
<li>KIM, OWLIM, Linked Data</li>
<li>M&iacute;mir</li>
</ul></li>
<li>3. A Lifecycle for Text Analysis</li>
</ul>
</td></tr>
</table>
<p></div><div class="slide"></p><h1 class="cow-heading">Background</h1>

<p>c. 1991: Welcome to the miserable crew?</p>
<p><img src="les-mis1.jpg" alt="Les Miserables" width="540">
<img src="les-mis2.jpg" alt="Les Miserables 2" width="450"></p>

<p>2013 ☺: what changed?</p>
<ul>
<li>the key factor, of course, is how many elements of our lives have moved
on-line, and the role of text as a communication medium in those elements is
absolutely central</li>
<li>this has meant transformations in many areas, including
<ul>
<li>the size of the data we work with</li>
<li>the importance of that data (sadly, the most profitable area is
advertising &mdash; sometimes it seems like our entire field has become a
question of who clicks on which advert &mdash; some examples of non-advertising
stuff below...)</li>
</ul></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">What didn't change?</h1>
<table> <tr><td>
<p><b>Square Fish Syndrome</b></p>
<ul>
<li>Imagine you're looking at a river, beneath <br> whose surface a fish swims by</li>
<li>You can show the ripples and eddies
<ul>
<li>to an artist and ask them to draw a fish</li>
<li>to a statistician and ask them to model a fish</li>
</ul></li>
<li>In both cases you're liable to get a square fish</li>
<li>Human language processing can be done by
<ul>
<li>linguists intuiting about grammar</li>
<li>machine learning creating statistical models</li>
</ul></li>
<li>Language is a surface phenomenon but <br> communication is about intelligence</li>
</ul>
</td><td>
<img src="square-fish.jpg" alt="Square fish" width="450">
</td></tr>
</table>
<ul>
<li>Progress:
<ul>
<li>modern approaches use elements of both methods, and large quantities of
data, and large quantities of human judgements about that data</li>
<li>ways to represent results have changed, with linked data,
location-dependent services, etc.</li>
</ul></li>
<li>But, fundamentally, we have yet to see any dramatic breakthrough</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Which means...</h1>
<ul>
<li>the most important questions have often been:
<ul>
<li>how can we build systems that achieve the plateau as cheaply and
efficiently as possible?</li>
<li>how can we minimise the effort in adapting these systems to new problem
domains?</li>
<li>how can we ensure that they work reliably and predictably over time?</li>
<li>how can we maximise reuse across systems?</li>
</ul></li>
<li>these were the questions that we started work on in the early 1990s, and
that's what became the GATE research programme...</li>
<li>what are people doing with it in 2013?</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Housekeeping Information (1)</h1>
<p>If you hear a fire alarm?</p>
<p class="incremental">
...run in circles, scream and shout <br>
<img src="panic1.jpg" alt="Run in circles scream and shout 1" width="450">
<img src="panic2.jpg" alt="Run in circles scream and shout 2" width="450"></p>


<p></div><div class="slide"></p><h1 class="cow-heading">Housekeeping Information (2)</h1>
<p>The building is fully equipped with toilets</p>



<p><img src="bog.jpg" alt="Pan" width="450"></p>

<p></div><div class="slide"></p><h1 class="cow-heading">Housekeeping Information (3)</h1>
<p>Members of the GATE team are available at all times</p>
<p><img src="the-scream.jpg" alt="Munch: The Scream" width="650"></p>





<p></div><div class="slide"></p><h1 class="cow-heading">Contents</h1>
<ul>
<li>0. Background; Housekeeping</li>
<li><u><b>1. GATE in the Wild</b></u>
<ul>
<li><u><b>Semantics in the Media</b></u></li>
<li><u><b>Text Mining for Biomedicine</b></u></li>
<li><u><b>Other Users</b></u></li>
</ul></li>
<li>2. The GATE Family
<ul>
<li>Developer, Embedded</li>
<li>Teamware, Process</li>
<li>Cloud</li>
<li>KIM, OWLIM, Linked Data</li>
<li>M&iacute;mir</li>
</ul></li>
<li>3. A Lifecycle for Text Analysis</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">The greatest sporting event ever?</h1>
<p class="incremental">
The London Olympics.</p>
<p class="incremental">
And the leading local media organisation covering it?</p>
<p class="incremental">
The BBC.</p>
<p class="incremental">
And the best text analytics ecosystem? (You can see where this is going...)</p>
<p class="incremental">
GATE, of course, which is why the BBC are using it under all their
sports web coverage :-)</p>
<p class="incremental">
...</p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE at the Olympics</h1>
<p><img src="bbc-olympics.png" alt="BBC Olympics coverage using GATE" width="900"></p>
<p>The <a class="cow-url" href="http://www.bbc.co.uk/sport/0/olympics/2012/">BBC's sports website</a>
uses GATE for text mining <br> (see
<a class="cow-url" href="http://www.bbc.co.uk/blogs/bbcinternet/2012/04/sports_dynamic_semantic.html">this BBC blog</a> for some of the gory details)</p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE at the Olympics (2)</h1>
<p>The BBC system has some interesting characteristics:</p>
<ul>
<li>big (or medium) data, very high query volumes</li>
<li>extremely flexible in the face of data evolution (an unknown gets a gold, a
trainer gets arrested with steroids, ...)</li>
<li>linked data is key
<ul>
<li>documents are annotated relative to a domain-specific OWL model (using
GATE's LKB gazetteer from Ontotext)</li>
<li>everything is served out of a clustered semantic repository</li>
</ul></li>
<li>changing the pages served and their content is a <b>semantic operation</b> &mdash;
which starts to move the production focus away from the DBA and the web
developer and back to the journalist</li>
<li>the austerity quote: their World Cup 2010 system achieved <b>cost savings of
~80%</b> compared to a conventional database-backed web system</li>
<li>now serves more than 10,000 pages</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Semantics in Media &mdash; the New Thing?</h1>
<p>(slides from [ex-]PA's <a class="cow-url" href="http://wickedtomocktheafflicted.com/">Jarred McGinnis</a>)</p>
<p>The <b>Press Association</b> are running a similar programme (following on from
a long-running GATE project that processes captions in a massive image library)</p>
<p><img src="pa-olympics.png" alt="PA Olympics coverage using GATE" width="900"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">Media (2)</h1>
<table><tr><td>
<ul>
<li>system helps the journalist create the metadata</li>
<li>feedback from the journalist helps system accuracy</li>
<li>when you've got good metadata in an OWL store you can access the data with
extreme flexibility... e.g.
<ul>
<li>generate the pages on the BBC site,</li>
<li>sell custom feeds into niche markets</li>
<li>drive visualisations or populate COTS data analysers</li>
</ul></li>
</ul>
</td><td>
<img src="pa-olympics.png" alt="PA Olympics coverage using GATE" width="400">
</td></tr>
</table>
<ul>
<li>now partners with Sheffield/Ontotext/IMR in the AnnoMarket project (more
later)</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Media (3)</h1>
<ul>
<li><b>CNN</b> commissioned a similar system</li>
<li>News International contacted us in 2011 but we ignored them and they went
away :-)</li>
<li>media is a good application area for text analysis and semantic modelling
technology, partly because journalistic language is very well-behaved
(relatively speaking!), partly because the content is extremely valuable,
and partly because existing classification schemes are typically applied
quite rigorously</li>
<li>life sciences and biomedical data have some similar characteristics...</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Media (4) &mdash; Social/Mobile</h1>
<ul>
<li>EPSRC programme in summarisation and mining of consumer-generated media</li>
<li>TrendMiner: cross-lingual trend analysis in real-time media streams
<ul>
<li>won the Hypertext 2013 Ted Nelson best newcomer award for
<a class="cow-url" href="http://t.co/YLbykYKsJM">Where's Wally?</a>, geolocation of Twitter posts</li>
</ul></li>
<li>new in 2013/4: computing veracity (the 4th V of big data, after Gartner's
volume, velocity, variety)</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Biomedical Example (1): EHR Mining</h1>
<p>...sneak preview...</p>
<p></div><div class="slide"></p><h1 class="cow-heading">Bio Example 2: Genetic Epidemiology</h1>
<ul>
<li>it is <b>hypothesised</b> that
<ul>
<li>genetic factors play a strong role in susceptibility to disease </li>
<li>in future, targeted pharmaceuticals will be tailored to individual
genetics</li>
</ul></li>
<li>a substantial body of work looks for associations between mutations and
diseases</li>
<li>World Health Organization <a class="cow-url" href="http://www.iarc.fr">International Agency for Research on
Cancer</a> (IARC), the world's biggest cancer epidemiology lab.</li>
<li>genetics groups: which mutations (SNPs) associate with carcinogenesis?</li>
<li>new trends in genetic association studies (stimulated by decreased cost of
sequencing):
<ul>
<li><b>objective</b>: identify common genetic variants involved in susceptibility
to disease</li>
<li><b>candidate gene approach</b>: genes selected and tested based on prior
knowledge/hypotheses</li>
<li><b>GWAS approach</b>: test &ldquo;all&rdquo; common genetic variants with no prior
knowledge/hypothesis (Genome-Wide Association Studies)</li>
</ul></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Genetic Epidemiology (2)</h1>
<p>The problem: needle (significant associations) in a haystack (large-scale
gene sequence probes)</p>
<p><img src="p-value-results.png" alt="P value results" width="600"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">Genetic Epidemiology (3)</h1>
<p>The WHO results</p>
<ul>
<li><a class="cow-url" href="http://www.nature.com/nature/journal/v452/n7187/abs/nature06885.html">Nature paper</a> 2008: GWAS result showing a particular genetic
polymorphism correlates with increased risk of lung cancer in smokers</li>
<li>required a large amount of manual work to examine data from sensor arrays</li>
<li>the usual statistical techniques need large numbers of samples to make the
analysis usable and reliable</li>
</ul>
<p>An annotation experiment</p>
<ul>
<li>our experiment used text analysis to add prior knowledge about genes</li>
<li>e.g. if a gene is expressed in lung tissue, represent this in the BFDP model
when calculating relevance of sensor data for related polymorphisms</li>
<li>use annotation to find papers that discuss particular genes, diseases,
anatomy and so on (AdAPT &mdash; Adjusting Association Priors with Text)</li>
<li>works using half the data (potential saving: &euro;250k)</li>
</ul>
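<p>As a rough illustration of how a text-derived prior can enter a BFDP
calculation, the sketch below uses Wakefield's approximate Bayes factor form of
the Bayesian false discovery probability. It is not the AdAPT implementation:
the function names, the effect-size prior variance and the example numbers are
all assumptions made for illustration.</p>
<pre>
```python
import math

def approximate_bayes_factor(beta_hat, std_err, prior_var):
    """Approximate Bayes factor for H0 (no association) against H1."""
    v = std_err ** 2                      # sampling variance of the estimate
    z = beta_hat / std_err                # Wald statistic
    shrink = prior_var / (v + prior_var)  # shrinkage factor W / (V + W)
    return math.sqrt((v + prior_var) / v) * math.exp(-0.5 * z * z * shrink)

def bfdp(beta_hat, std_err, prior_assoc, prior_var=0.04):
    """Bayesian false discovery probability for one variant.

    prior_assoc is the prior probability that the variant is truly
    associated; prior_var is an assumed prior variance on the effect size.
    """
    abf = approximate_bayes_factor(beta_hat, std_err, prior_var)
    prior_odds_null = (1.0 - prior_assoc) / prior_assoc
    return abf * prior_odds_null / (abf * prior_odds_null + 1.0)

# The same (made-up) association signal under two priors: a default one, and
# one raised because the literature links the gene to the tissue of interest.
default_prior = bfdp(beta_hat=0.25, std_err=0.06, prior_assoc=0.0001)
text_informed = bfdp(beta_hat=0.25, std_err=0.06, prior_assoc=0.01)
```
</pre>
<p>With identical data, the variant with the literature-informed prior gets a
lower false discovery probability, which is the sense in which prior knowledge
can stand in for additional samples.</p>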
<p></div><div class="slide"></p><h1 class="cow-heading">Genetic Epidemiology (4)</h1>
<ul>
<li><b>Confirmation of the methodology</b>: new association found using AdAPT for
head and neck cancer
<ul>
<li><em>Using Prior Information from the Medical Literature in GWAS of Oral
Cancer Identifies Novel Susceptibility Variant on Chromosome 4 &mdash; the
AdAPT Method</em>, Johansson et al, <b>PLoS ONE</b>, May 2012:
<a class="cow-url" href="http://dx.plos.org/10.1371/journal.pone.0036888">http://dx.plos.org/10.1371/journal.pone.0036888</a></li>
<li>See also
<a class="cow-url" href="http://tinyurl.com/gate-life-sci">PLoS Computational Biology paper</a>, Feb
2013</li>
</ul></li>
<li>Future plans: epigenetics and gene-environment interaction studies &mdash; GxE</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Other Users</h1>
<ul>
<li>SLaM, South London and Maudsley Hospital Biomedical Research Centre
<ul>
<li>largest UK mental health patient cohort</li>
<li>sophisticated clinical record information system</li>
</ul></li>
<li>British Library</li>
<li>Food and Environment Research Agency</li>
<li>TSO, the Stationery Office</li>
<li>TNA, the UK National Archives</li>
<li>SMEs: Fizzback; Innovantage; Sentimetrix; Ontotext; ...</li>
<li>Corporates:
<ul>
<li>pharmas (all the majors)</li>
<li>publishers, media</li>
<li>bizintel users (demand:
<a class="cow-url" href="http://www.informationweek.com/news/software/bi/229500096">~$1 billion
in 2010</a>)</li>
</ul></li>
<li>...</li>
<li>You next?</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Contents</h1>
<ul>
<li>0. Background; Housekeeping</li>
<li>1. GATE in the Wild
<ul>
<li>Semantics in the Media</li>
<li>Text Mining for Biomedicine</li>
<li>Other Users</li>
</ul></li>
<li><u><b>2. The GATE Family</b></u>
<ul>
<li><u><b>Developer, Embedded</b></u></li>
<li><u><b>Teamware, Process</b></u></li>
<li><u><b>Cloud</b></u></li>
<li><u><b>KIM, OWLIM, Linked Data</b></u></li>
<li><u><b>M&iacute;mir</b></u></li>
</ul></li>
<li>3. A Lifecycle for Text Analysis</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">The GATE Family</h1>
<ul>
<li>an <b>architecture</b></li>
<li>an IDE: <b>GATE Developer</b>: an integrated development environment for language
processing components bundled with a widely used Information Extraction system
and a comprehensive set of
<a class="cow-non-existant-url" href="http://gate.ac.uk/gate/doc/plugins.html?m=1">other plugins</a></li>
<li>a framework: <b>GATE Embedded</b>: an object library optimised for inclusion in
diverse applications giving access to all the services used by GATE Developer
and more
<ul>
<li>used worldwide by thousands of scientists, companies, teachers and students
(<u>&gt;30k downloads per year at present</u>, not counting SVN)</li>
<li>open source (LGPL), 100% Java</li>
</ul></li>
<li>a web app: <b>GATE Teamware</b>: a collaborative annotation environment for
factory-style semantic annotation projects built around a workflow engine
<ul>
<li>a process: not &quot;get this software and it will revolutionise your life&quot; but
&quot;this is how to implement robust and maintainable services&quot;</li>
</ul></li>
<li><b>GATE Cloud</b>: a parallel and distributed service infrastructure running on
Amazon EC2</li>
<li><b>GATE M&iacute;mir</b>: (Multi-paradigm Information Management Index and Repository) a
scalable multiparadigm index built on <a class="cow-url" href="http://www.ontotext.com/">Ontotext</a>'s
<a class="cow-url" href="http://www.ontotext.com/owlim/">semantic repository family</a>, GATE's
annotation structures database plus full-text indexing from
<a class="cow-url" href="http://mg4j.dsi.unimi.it/">MG4J</a></li>
<li>and finally...
<ul>
<li><b>GATE Prospector</b> (semantic search UI)</li>
<li>related tools from Ontotext (OWLIM, KIM, Linked Data endpoints)</li>
<li>a wiki/CMS (<b><a class="cow-url" href="http://gatewiki.sf.net">GATE Wiki.sf.net</a></b>), mainly to host
our own websites and as a testbed for some of our experiments</li>
<li>a <b>community</b></li>
</ul></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Developer (1)</h1>
<p>Motivation: 1990s apparatus envy: physicists had supercolliders; medics had
MRI scanners; language processing researchers had.... Perl?</p>
<ul>
<li>A specialist Integrated Development Environment for language engineering R&amp;D</li>
<li>Analogous to
<ul>
<li>Eclipse or Netbeans for programmers</li>
<li>Mathematica or SPSS for maths and stats</li>
</ul></li>
<li>Visualisation and editing of text, annotations, ontologies, parse trees, etc.</li>
<li>Constructing applications from components</li>
<li>Measurement, evaluation, benchmarking</li>
<li>Etc., etc.</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Developer (2)</h1>
<p><img src="life-nerc-example.png" alt="LifeNERC" width="900"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Embedded (1)</h1>
<p>Object-oriented Java framework. Architectural principles:</p>
<ul>
<li>Non-prescriptive, theory neutral (strength and weakness) </li>
<li>Re-use, interoperation, not reimplementation (e.g. diverse XML support,
integration of Prot&eacute;g&eacute;, OWLIM, Weka, LingPipe, OpenNLP, SVM Light, etc.) </li>
<li>(Almost) everything is a component, and component sets are user-extendable </li>
<li>(Almost) all operations are available both from API (Embedded) and GUI
(Developer)</li>
</ul>
<p>CREOLE: a Collection of REusable Objects for Language Engineering</p>
<ul>
<li>GATE components: modified Java Beans with XML configuration</li>
<li>The minimal component = 10 lines of Java, 10 lines of XML, 1 URL</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Embedded (2)</h1>
<p><img src="http://gate.ac.uk/sale/talks/gate-apis.png" alt="GATE Embedded APIs" width="800"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Embedded (3)</h1>
<ul>
<li>persistence, visualisation and editing</li>
<li>a finite state transduction language (JAPE)</li>
<li>extraction of training instances for machine learning (ML)</li>
<li>pluggable ML implementations (Weka, YALE, SVM, ...)</li>
<li>components for language processing, e.g. parsers, machine learning tools,
stemmers, a few IR tools (Lucene, GYM query plugins), IE components for
various languages...</li>
<li>bundled with a very widely used Information Extraction system (ANNIE)
<ul>
<li>MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.</li>
</ul></li>
<li>simple API for RDF or OWL (metadata) via OWLIM </li>
<li>a suite of tools for biomedical text processing</li>
<li>kitchen sink</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Process, workflow, GATE Teamware (1)</h1>
<table> <tr><td>
<p>A typical annotation project:</p>
<ul>
<li>client discussion, task exploration, <br> draft <em>extraction specification</em></li>
<li>manual annotation, inter-annotator <br> agreement, iterate the task spec</li>
<li>prototype a machine solution</li>
<li>more manual annotation for training <br> and test data (gold standard)</li>
<li>implement production solution</li>
<li>more manual annotation for quality <br> control, maintenance, adaptation...</li>
</ul>
</td><td> &nbsp;&nbsp;&nbsp;</td><td>
<img src="http://gate.ac.uk/family/process/images/problem-definition.png" alt="Process" width="330">
</td></tr>
</table>
<p></div><div class="slide"></p><h1 class="cow-heading">Process, workflow, GATE Teamware (2)</h1>
<ul>
<li><b>The GATE process</b> is a set of steps to follow in the definition, prototyping,
development, deployment and maintenance of semantic annotation processes.</li>
<li><b>GATE Teamware</b> is a workflow-based web engine supporting these processes.</li>
<li>Based on JBoss Process Management engine (BPEL compatible)</li>
<li>Teamware supports marshalling the manual annotation team, job allocation,
quality control, training, communication, process monitoring...</li>
<li>Case study: <a class="cow-url" href="http://www.lighthouseipg.com/">Lighthouse Group</a> runs teams of
annotators in Cebu (Philippines), e.g. supplying 10,000 hours to Khresmoi
project for on-line medical information.</li>
</ul>

<p></div><div class="slide"></p><h1 class="cow-heading">Teamware (3): workflow configuration</h1>
<table> <tr><td>
<img src="http://gate.ac.uk/sale/talks/tal/teamware-tasks.gif" alt="Teamware tasks" width="500">
</td><td>
<img src="http://gate.ac.uk/sale/talks/tal/templateoverview.png" alt="Template overview" width="500">
</td></tr>
</table>
<p></div><div class="slide"></p><h1 class="cow-heading">Teamware (4): process monitoring</h1>
<p><img src="http://gate.ac.uk/sale/talks/tal/annotationstatusoverview.png" alt="Annotation status overview"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">Teamware (5): quality control</h1>
<table> <tr><td>
<img src="http://gate.ac.uk/sale/talks/tal/iaacaculation.png" alt="IAA calculation" width="500">
</td><td>
<img src="http://gate.ac.uk/sale/talks/tal/iaaresult.png" alt="IAA result" width="500">
</td></tr>
</table>
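<p>Inter-annotator agreement of this kind is often summarised with a
chance-corrected statistic such as Cohen's kappa. The sketch below is a generic
illustration rather than Teamware's own calculation; the function name and the
label lists are invented for the example.</p>
<pre>
```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labelled at random while keeping
    # their own label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels assigned by two annotators to the same six spans.
annotator_1 = ["Person", "Person", "Location", "None", "Location", "None"]
annotator_2 = ["Person", "Location", "Location", "None", "Location", "Person"]
kappa = cohens_kappa(annotator_1, annotator_2)
```
</pre>
<p>A low kappa on a pilot batch is usually a signal to revisit the extraction
specification before commissioning more manual annotation.</p>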
<p></div><div class="slide"></p><h1 class="cow-heading">Teamware (6): staff communication</h1>
<p><img src="http://gate.ac.uk/sale/talks/tal/forum.png" alt="Forum"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Cloud (1): instant scaling with no CAPEX</h1>
<p>Cloud computing means many things in many contexts. On <b>GATECloud.net</b> it
means:</p>
<ul>
<li><b>zero fixed costs</b>: you don't buy software licences or server hardware, just
pay for the compute time that you use</li>
<li><b>near zero startup time</b>: in a matter of minutes you can specify, provision
and deploy the type of computation that used to take months of planning</li>
<li><b>easy in, easy out</b>: if you try it and don't like it, go elsewhere! you can
even take the software with you, it's all open source</li>
<li><b>someone else takes the admin load</b>:
<ul>
<li><a class="cow-url" href="http://gate.ac.uk/">the GATE team</a> from the <a class="cow-url" href="http://www.shef.ac.uk/">University of Sheffield</a> make sure you're running the best of breed
technology for text, search and semantics</li>
<li>cloud providers' data center managers (e.g. at <a class="cow-url" href="http://aws.amazon.com/">Amazon Inc.</a>) make sure the hardware and operating platform for your work
is scalable, reliable and cheap</li>
</ul></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Cloud (2): engineering</h1>
<ul>
<li><b>parallel</b> execution engine of automatic annotation processes + <b>distributed</b>
execution of parallel engine</li>
<li><b>scalability</b>: auto-scaling of processor swarms running on top of AWS EC2</li>
<li><b>flexibility</b>: parameters configure behaviour, select the GATE application
being executed, the input protocol used for reading documents, the output protocol
used for exporting the resulting annotations, ...</li>
<li><b>robustness</b>: jobs run unattended over large data sets
<ul>
<li>extensively tested and profiled (no memory leaks)</li>
<li>errors and exceptions that occur during processing are trapped and reported</li>
<li>if the process crashes (e.g. hardware failure), it can be restarted and resumes
execution where it left off</li>
</ul></li>
</ul>
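<p>The robustness properties listed above (trapped errors, restart without
repeating finished work) can be sketched in a few lines. This is not GATE
Cloud code: <code>annotate</code> is a hypothetical stand-in for a real
annotation pipeline, and the JSON checkpoint file is an assumption made to keep
the example small.</p>
<pre>
```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def annotate(doc_id, text):
    """Hypothetical stand-in for running an annotation pipeline on one document."""
    if not text.strip():
        raise ValueError("empty document")
    return {"doc": doc_id, "tokens": len(text.split())}

def run_job(docs, checkpoint_path, workers=4):
    """Process docs in parallel, trapping errors and skipping finished work."""
    state = {"done": {}, "failed": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)

    def safe(item):
        doc_id, text = item
        try:
            return doc_id, annotate(doc_id, text), None
        except Exception as exc:      # trap and report, never kill the job
            return doc_id, None, str(exc)

    todo = [(d, t) for d, t in docs.items() if d not in state["done"]]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for doc_id, result, error in pool.map(safe, todo):
            if error is None:
                state["done"][doc_id] = result
                state["failed"].pop(doc_id, None)
            else:
                state["failed"][doc_id] = error
            with open(checkpoint_path, "w") as fh:   # checkpoint after each result
                json.dump(state, fh)
    return state

docs = {"d1": "GATE runs on EC2", "d2": "   ", "d3": "text mining at scale"}
checkpoint = os.path.join(tempfile.mkdtemp(), "job.json")
first_run = run_job(docs, checkpoint)
second_run = run_job(docs, checkpoint)   # restart: only the failed doc is retried
```
</pre>
<p>A production system would also write the checkpoint atomically and spread
the work across machines; the sketch only shows the control flow.</p>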
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Cloud (3): how it works</h1>
<p><img src="http://gate.ac.uk/sale/talks/gate-course-may11/gatecloud.net-intro/images/annotation-job.png" alt="A cloud annotation job" width="700"></p>
<p></div><div class="slide"></p><h1 class="cow-heading"></h1>
<p><img src="http://gate.ac.uk/talks/gate-course-may11/gatecloud.net-intro/images/job-editor-details.png" alt="Job editor: details" width="700">
</div><div class="slide"></p><h1 class="cow-heading"></h1>
<p><img src="http://gate.ac.uk/talks/gate-course-may11/gatecloud.net-intro/images/job-editor-inputs.png" alt="Job editor: inputs" width="850">
</div><div class="slide"></p><h1 class="cow-heading"></h1>
<p><img src="http://gate.ac.uk/talks/gate-course-may11/gatecloud.net-intro/images/job-editor-outputs.png" alt="Job editor: outputs" width="850">
</div><div class="slide"></p><h1 class="cow-heading"></h1>
<p><img src="http://gate.ac.uk/talks/gate-course-may11/gatecloud.net-intro/images/job-editor-progress.png" alt="Job editor: progress" width="850">
</div><div class="slide"></p><h1 class="cow-heading"></h1>
<p><img src="http://gate.ac.uk/talks/gate-course-may11/gatecloud.net-intro/images/job-editor-complete.png" alt="Job editor: complete" width="850">
</div><div class="slide"></p><h1 class="cow-heading"></h1>
<p><img src="http://gate.ac.uk/talks/gate-course-may11/gatecloud.net-intro/images/job-editor-results.png" alt="Job editor: results" width="850"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Cloud (4): a research perspective</h1>
<ul>
<li>something like the <em>facility</em> that the IRF was trying to set up for IR
more generally</li>
<li>host a growing family of experimental system configurations, data sets,
results</li>
<li>biased heavily towards information extraction (perhaps some mileage in
adding more mainstream IR?)</li>
<li>persistence and reuse of experimental setups: virtualisation makes it
possible to store not just data but the entire compute platform operable for
particular experiments or analyses</li>
<li><b>plans for 2013:</b>
<ul>
<li>bigger big data</li>
<li>Hadoop</li>
<li>a marketplace, 3rd-party contributions, <a class="cow-url" href="http://annomarket.eu">AnnoMarket</a></li>
</ul></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">First Cousins &mdash; the Ontotext family</h1>
<p>Complementing the GATE tools, KIM provides a straightforward front-end deployment
option, and Ontotext's Linked Data offerings provide a good baseline for model building:</p>
<ul>
<li><b>Ontotext <a class="cow-url" href="http://www.ontotext.com/kim">KIM</a></b>: UIs demonstrating multiple
conceptual and faceted search modes
<ul>
<li>see also GATE Prospector (below)</li>
</ul></li>
<li><b>Ontotext <a class="cow-url" href="http://www.ontotext.com/owlim">OWLIM</a></b>: the fastest and most
scalable semantic repository</li>
<li><b>Ontotext <a class="cow-url" href="http://ontotext.com/factforge/">FactForge</a></b>: ~4 billion statements
from the Linked Data cloud</li>
<li><b>Ontotext <a class="cow-url" href="http://www.linkedlifedata.com/">Linked Life Data</a></b>: over 4 billion
statements from life sci databases including UniProt, PubMed, EntrezGene and
20 more</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE M&iacute;mir: hitting the indexing problem</h1>
<p><b>Circa 2007</b>:</p>
<ul>
<li>A new project on patent searching at <a class="cow-url" href="http://www.ir-facility.org/">the IRF</a> =
<b>culture shock</b>!
<ul>
<li>Full text, boolean (&quot;100% recall&quot;!)</li>
</ul></li>
<li>Initial prototyping and demo work:
<ul>
<li>Conceptual and semantic search and navigation (KIM, as above)</li>
<li>ANNotations In Context (ANNIC) (right)</li>
</ul></li>
<li>User requirement: put it all together</li>
<li>Ooops: ANNIC scaled to 200 short docs...</li>
</ul>
<p><b>ANNIC (ANNotations In Context)</b>: <br> <br>
<img src="http://gate.ac.uk/sale/talks/tal/annic.png" alt="ANNIC" width="550"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">Annotations: the Missing Data Structure (1)</h1>
<table> <tr><td>
<table> <tr><td>
<b>How do we search a billion-node annotation graph...?</b>
</td></tr>
<tr><td> <p><br> <br> <br>
<b>Model</b>:</p>
<p><img src="http://gate.ac.uk/sale/talks/tal/annotationGraph.png" alt="An annotation graph" width="550"></p>
</td></tr>
<tr><td> <p><br>
(Cf. TIPSTER, TEI/XCES, ATLAS, UIMA, ...)</p>
</td></tr>
</table>
</td><td> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</td><td>
<p><b>UI example</b>:</p>
<p><img src="http://gate.ac.uk/sale/talks/tal/chinese1.png" alt="Chinese annotations" width="400"></p>
</td></tr>
</table>
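<p>To make the model concrete, here is a minimal sketch of stand-off
annotation: typed arcs between character offsets, each carrying a feature map,
over text that is never modified. The class and method names are invented for
the illustration and are not GATE's API.</p>
<pre>
```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A typed arc between two character offsets, carrying a feature map."""
    ann_id: int
    ann_type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)

class AnnotatedDocument:
    """Stand-off annotations: the text is never modified, only pointed into."""

    def __init__(self, text):
        self.text = text
        self.annotations = []

    def add(self, ann_type, start, end, **features):
        bounds = [0, start, end, len(self.text)]
        if sorted(bounds) != bounds:   # offsets must be ordered and inside the text
            raise ValueError("bad offsets")
        ann = Annotation(len(self.annotations), ann_type, start, end, features)
        self.annotations.append(ann)
        return ann

    def covering(self, offset):
        """All annotations whose span includes the given character offset."""
        return [a for a in self.annotations if offset in range(a.start, a.end)]

    def text_of(self, ann):
        return self.text[ann.start:ann.end]

doc = AnnotatedDocument("Hamish Cunningham works in Sheffield.")
doc.add("Token", 0, 6, category="NNP")
person = doc.add("Person", 0, 17, firstName="Hamish")
place = doc.add("Location", 27, 36, locType="city")
```
</pre>
<p>Because annotations may nest and overlap freely, the structure is a graph
rather than a tree. The linear scan in <code>covering</code> is fine for one
document but is exactly what stops working at a billion nodes.</p>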
<p></div><div class="slide"></p><h1 class="cow-heading">Annotations: the Missing Data Structure (2)</h1>
<ul>
<li>Our first thoughts: where can we steal one?</li>
<li>Annotations: how to index the graph?
<ul>
<li>XML indexing and retrieval work doesn't solve it (biased towards trees)</li>
<li>RDBMS doesn't solve it (biased towards relations)</li>
<li>augmented full-text indices can help with efficient access, but the data
storage requirements of our prototype (based on Lucene) grew exponentially
with the cardinality of the annotation sets</li>
</ul></li>
<li>May 2008: workshop on <em>Persisting, Indexing and Querying Multi-Paradigm Text
Models</em>, IRF, Vienna
<table> <tr><td>
<ul>
<li>MG4J (Eric Graf, Glasgow)</li>
<li>Terrier (Gianni Amati, FUB/Glasgow)</li>
<li>INEX (Norbert Fuhr, Essen-Duisburg)</li>
<li>KIM, OWLIM (Atanas Kiryakov, Ontotext)</li>
</ul> </td><td>
<ul>
<li>ANNIC (Valentin Tablan, Sheffield)</li>
<li>HTML-XML Search Engines (Ralf Schenkel, MPG)</li>
<li>MonetDB (Arjen de Vries, CWI)</li>
</ul> </td></tr>
</table></li>
<li>May 2009: custom solution based on MG4J (Sebastiano Vigna) + OWLIM called
<b>M&iacute;mir</b>...</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">M&iacute;mir: Multi-paradigm Information Management Index and Repository</h1>
<p>M&iacute;mir is an index engine that can search over:</p>
<ul>
<li>text</li>
<li>textual and semantic annotations</li>
<li>ontologies and knowledge bases</li>
<li>scales to the terabyte level</li>
</ul>
<p>Built on top of: the MG4J text indexing library; GATE's annotation index
(remodelled in MG4J); Ontotext's semantic repository family</p>
<ul>
<li>May 2010: version 2: incremental indices; federation</li>
<li>May 2011: version 3: full source release under the AGPL; cloud release</li>
<li>2012: version 4: document centric results mode; 64-bit doc ids</li>
<li>...?: version 5: <a class="cow-url" href="http://gate.ac.uk/mimir/doc/mimir-guide.pdf">http://gate.ac.uk/mimir/doc/mimir-guide.pdf</a> (p. 61)</li>
</ul>

<p></div><div class="slide"></p><h1 class="cow-heading">GATE Prospector</h1>
<p>All that data, all those query languages, what about a UI?! There are
<a class="cow-url" href="http://gate.ac.uk/mimir/">some developer UIs</a>; or KIM; or:</p>
<p><img src="prospector-search-example.png" alt="Prospector" width="700">
<img src="prospector-co-occurrence-ui.png" alt="Prospector co-occurrence" width="700"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">Contents</h1>
<ul>
<li>0. Background; Housekeeping</li>
<li>1. GATE in the Wild
<ul>
<li>Semantics in the Media</li>
<li>Text Mining for Biomedicine</li>
<li>Other Users</li>
</ul></li>
<li>2. The GATE Family
<ul>
<li>Developer, Embedded</li>
<li>Teamware, Process</li>
<li>Cloud</li>
<li>KIM, OWLIM, Linked Data</li>
<li>M&iacute;mir</li>
</ul></li>
<li><u><b>3. A Lifecycle for Text Analysis</b></u></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Full lifecycle information extraction</h1>

<ol>
<li>Take one large pile of text (documents, emails, tweets, patents, papers,
transcripts, blogs, comments, acts of parliament, and so on and so forth).</li>
<li>Pick a structured description of interesting things in the text (a telephone
directory, or chemical taxonomy, or something from the
<a class="cow-url" href="http://linkeddata.org/">Linked Data</a> cloud) &mdash; call this your <em>ontology</em>. </li>
<li>Use <a class="cow-non-existant-url" href="http://gate.ac.uk/teamware/?m=1">GATE Teamware</a> to mark up a <em>gold standard</em> example set of
annotations of the corpus (1.) relative to the ontology (2.). (For smaller
jobs use GATE Developer.)</li>
<li>Use <a class="cow-non-existant-url" href="http://gate.ac.uk/family/developer.html?m=1">GATE Developer</a> to build a <em>semantic annotation
pipeline</em> to do the annotation job automatically and measure performance
against the gold standard.</li>
<li>Take the pipeline from 4. and apply it to your text pile using
<a class="cow-url" href="http://gatecloud.net/">GATE Cloud</a> (or embed it in your own systems
using <a class="cow-non-existant-url" href="http://gate.ac.uk/family/embedded.html?m=1">GATE Embedded</a>). Use it to bootstrap more manual
(now semi-automatic) work in Teamware.</li>
<li>Use <a class="cow-non-existant-url" href="http://gate.ac.uk/family/mimir.html?m=1">GATE M&iacute;mir</a> to store the annotations relative to the
ontology in a <em>multiparadigm index server</em>.</li>
<li>(Probably) write a domain-specific UI to go on top of M&iacute;mir &mdash; see
<a class="cow-url" href="http://demos.gate.ac.uk/pin/">demos.gate.ac.uk/pin</a> for a simple example.</li>
<li>Hey presto, you have search that applies your annotations and your
ontology to your corpus (and a <a class="cow-non-existant-url"
href="http://gate.ac.uk/family/process.html?m=1">sustainable process</a>
for coping with changing information needs and/or changing text).</li>
<li>Your users are happy (and <a class="cow-url" href="http://gate.ac.uk/">GATE.ac.uk</a> has a &quot;donate&quot;
button ;-) ).</li>
</ol>
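<p>Step 4 turns on measuring a pipeline against the gold standard. A minimal
sketch of strict-match scoring follows; the tuple representation, the function
name and the example annotations are assumptions for illustration, and GATE's
own evaluation tools offer more (for example lenient matching of overlapping
spans).</p>
<pre>
```python
def score(gold, predicted):
    """Strict-match precision, recall and F1 over (start, end, type) annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold.intersection(predicted))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical gold-standard and system annotations for one document.
gold = [(0, 17, "Person"), (27, 36, "Location"), (40, 44, "Organization")]
system = [(0, 17, "Person"), (27, 36, "Organization")]
precision, recall, f1 = score(gold, system)
```
</pre>
<p>Here the system finds the right span for the second entity but gives it the
wrong type, so under strict matching it counts as both a miss and a spurious
annotation.</p>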
<p></div><div class="slide"></p><h1 class="cow-heading">Links</h1>
<table> <tr><td>
<p>More information</p>
<ul>
<li>GATE home page: <a class="cow-url" href="http://gate.ac.uk/">gate.ac.uk</a></li>
<li>these slides: <a class="cow-url" href="http://tinyurl.com/figint13">tinyurl.com/figint13</a></li>
<li>GATE for life sciences (and an up-to-date overview):
<br> <a class="cow-url" href="http://tinyurl.com/gate-life-sci">tinyurl.com/gate-life-sci</a>
<br> (PLoS Computational Biology, Feb 2013)</li>
<li>GATE Cloud: <a class="cow-url" href="http://gatecloud.net/">gatecloud.net</a>; <br>
<a class="cow-url" href="http://tinyurl.com/gate-cloud-royal-soc">tinyurl.com/gate-cloud-royal-soc</a>
<br> (Royal Soc Phil Trans A, Dec 2012)</li>
<li>my home page: <a class="cow-url" href="http://gate.ac.uk/hamish/">gate.ac.uk/hamish</a></li>
</ul>
<p><br> <br> <br> <br> <br> <br></p>
</td><td>
<p class="incremental">
<img src="keep-calm-bg.png" alt="Keep calm and move to Bulgaria" width="450"></p>
</td></tr>
</table>



</div> </body></html>
