<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>The Infernal Beauty of Text</title>
<meta name="copyright"
content="GATE Team, University of Sheffield - gate.ac.uk"/>
<link rel="stylesheet" type="text/css" media="screen, projection, print"
href="gslidy/slidy.css"/>
<script src="gslidy/slidy.js"
type="text/javascript"></script>
</head>
<body>
<div class="background">
<img id="head-icon" alt="" align="right" src="http://gate.ac.uk/sale/images/gate4/logo-colour.png" width="150"/>
</div>
<div class="slide">
<h1 class="cow-title-heading">The Infernal Beauty of Text</h1>
<table> <tr><td>
<p>Hamish Cunningham, <br> University of Sheffield
<br> <br> <br> <br> <br> <br> <br> <br> <br> <br>
<img src="http://gate.ac.uk/sale/images/gate4/splash.png" alt="GATE" width="280" height="215" align="top" border="0"/></p>
</td><td>
<p><b>Contents</b></p>
<ul>
<li><u><b>1. Introduction</b></u>
<ul>
<li><u><b>Context</b></u></li>
<li><u><b>Examples: TV Sport; Genetic Epidemiology; ...</b></u></li>
</ul></li>
<li>2. The GATE family
<ul>
<li>Developer, Embedded</li>
<li>Teamware, Process</li>
<li>Cloud</li>
<li>KIM, OWLIM, Linked Data</li>
<li>M&iacute;mir</li>
<li>Wiki</li>
</ul></li>
<li>3. [Demos
<ul>
<li>an IDE for text analysis specialists</li>
<li>collaborative manual annotation workflows</li>
<li>GATECloud.net </li>
<li>M&iacute;mir: a mixed-mode index server]</li>
</ul></li>
<li>4. A lifecycle for text analysis</li>
</ul>
</td></tr>
</table>

<p></div><div class="slide"></p><h1 class="cow-heading">Context (1): the Infernal Beauty of Text</h1>
<p>Language is the quintessential product of human cooperation</p>
<ul>
<li>text is a beautiful gift that projects moments in time and place forwards</li>
<li>we're so good at language that it seems easy, but...</li>
<li>...it is infernally difficult to compute (a measure of our ignorance of
human intelligence?)</li>
</ul>
<p>A large proportion of what we know is externalised only in text</p>
<ul>
<li>structured data (DBs, taxonomies, dictionaries, ontologies...): machine
tractable, but expensive and inflexible</li>
<li>how do we bridge these two worlds?</li>
<li>text analysis is becoming a predictable and robust engineering process</li>
<li>deriving structured data from textual sources is now much easier</li>
</ul>

<p></div><div class="slide"></p><h1 class="cow-heading">Context (2): Finding, Navigating, Abstracting</h1>
<p>Value, volume</p>
<ul>
<li>if you have content that is sufficiently <em>high value</em> or <em>low volume</em> (or
you're Google) then you can use sophisticated methods to help people find,
browse or abstract over that content</li>
<li>these methods include building symbolic models (taxonomic, logical,
conceptual, semantic...) of the subject matter and annotating content with
references to those models</li>
</ul>

<ul>
<li>social vs. technological success factors
<ul>
<li><em>technological</em>: the expressivity of the modelling languages and the
quality of the annotation algorithms, and of the indexing, search and
browsing tools</li>
<li><em>social</em>: the level of expertise and the quantity of time and effort
deployed by the people building models and extraction patterns (or
creating training data and running learning algorithms), or tuning indices
or user interfaces</li>
</ul></li>
<li>GATE: family of tools that can minimise time and effort in developing and
maintaining rich information retrieval and management systems, while
attempting to stay close to the state of the technological art (sometimes by
favouring interoperation and reuse over innovation or reinvention)</li>
<li>covers the full lifecycle of text analysis &amp; rich search</li>
</ul>


<p></div><div class="slide"></p><h1 class="cow-heading">Example 1: TV Sport</h1>
<p>The BBC served its
<a class="cow-url" href="http://news.bbc.co.uk/sport1/hi/football/world_cup_2010/default.stm">2010
World Cup pages</a> out of <a class="cow-url" href="http://www.ontotext.com/owlim">BigOWLIM</a>. They used
text mining to make links into other relevant pages according to their
(ontological) data model. They now use GATE for this text mining function --
next stop the 2012 Olympics...</p>
<p>The BBC's system achieved <b>cost savings of ~80%</b> compared to a conventional
database-backed web system.</p>
<p>The <a class="cow-url" href="http://www.pressassociation.com/">Press Association</a> is also going full
speed with similar efforts, following on from their long-running GATE project
that processes the captions in their massive image library.</p>
<p>Media is a perfect application area for our text analysis and semantic modelling
technology, partly because journalistic language is very well-behaved
(relatively speaking!), partly because the content is extremely valuable, and
partly because existing classification schemes are typically applied quite
rigorously.</p>
<p></div><div class="slide"></p><h1 class="cow-heading">Example 2: Genetic Epidemiology</h1>
<ul>
<li>it is <b>hypothesised</b> that
<ul>
<li>genetic factors play a strong role in susceptibility to disease </li>
<li>in future, targeted pharmaceuticals will be tailored to individual
genetics</li>
</ul></li>
<li>a substantial body of work looks for associations between mutations and
diseases</li>
<li>World Health Organization's <a class="cow-url" href="http://www.iarc.fr">International Agency for Research on
Cancer</a> (IARC), the world's biggest cancer epidemiology lab</li>
<li>genetics groups: which mutations (SNPs) associate with carcinogenesis?</li>
<li>new trends in genetic association studies (stimulated by decreased cost of
sequencing):
<ul>
<li><b>objective</b>: identify common genetic variants involved in susceptibility
to disease</li>
<li><b>candidate gene approach</b>: genes selected and tested based on prior
knowledge/hypotheses</li>
<li><b>GWAS approach</b>: test &ldquo;all&rdquo; common genetic variants with no prior
knowledge/hypothesis (Genome-Wide Association Studies)</li>
</ul></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Genetic Epidemiology (2)</h1>
<p>The problem: needle (significant associations) in a haystack (large-scale
gene sequence probes)</p>
<p><img src="http://gate.ac.uk/sale/talks/tal/p-value-results.png" alt="P value results" width="600"/></p>
<p></div><div class="slide"></p><h1 class="cow-heading">Genetic Epidemiology (3)</h1>
<p>The WHO results</p>
<ul>
<li><a class="cow-url" href="http://www.nature.com/nature/journal/v452/n7187/abs/nature06885.html">Nature paper</a> 2008: GWAS result showing a particular genetic
polymorphism correlates with increased risk of lung cancer in smokers</li>
<li>required a large amount of manual work to examine data from sensor arrays</li>
<li>the usual statistical techniques need large numbers of samples to make the
analysis usable and reliable</li>
</ul>
<p>An annotation experiment</p>
<ul>
<li>our experiment used text analysis to add prior knowledge about genes</li>
<li>e.g. if a gene is expressed in lung tissue, represent this in the BFDP
(Bayesian False Discovery Probability) model when calculating the relevance of
sensor data for related polymorphisms</li>
<li>use annotation to find papers that discuss particular genes, diseases,
anatomy and so on (AdAPT -- Adjusting Association Priors with Text)</li>
<li>works using half the data (potential saving: &euro;250k)</li>
</ul>
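<p>A minimal sketch of the idea, not the AdAPT implementation: it assumes
Wakefield's approximate-Bayes-factor formulation of the BFDP, and every number,
function name and prior below is made up for illustration. It shows how raising
the prior probability of association for a gene (e.g. because the literature
says it is expressed in lung tissue) lowers the BFDP for the same sensor data.</p>
<pre>
```python
import math

def approx_bayes_factor(beta_hat, std_err, prior_var):
    """Approximate Bayes factor for H0 (no association) against H1."""
    v = std_err ** 2                    # variance of the effect estimate
    z2 = (beta_hat / std_err) ** 2      # squared z statistic
    shrink = prior_var / (v + prior_var)
    return math.sqrt((v + prior_var) / v) * math.exp(-0.5 * z2 * shrink)

def bfdp(beta_hat, std_err, prior_var, prior_assoc):
    """False discovery probability given a prior probability of association."""
    prior_odds_null = (1.0 - prior_assoc) / prior_assoc
    odds = approx_bayes_factor(beta_hat, std_err, prior_var) * prior_odds_null
    return odds / (1.0 + odds)

# Hypothetical SNP: log odds ratio 0.25 with standard error 0.06.
flat_prior = bfdp(0.25, 0.06, prior_var=0.04, prior_assoc=0.0001)
# Same data, but text analysis suggests the gene is relevant, so the
# (assumed) prior probability of association is raised tenfold.
text_prior = bfdp(0.25, 0.06, prior_var=0.04, prior_assoc=0.001)
print(round(flat_prior, 3), round(text_prior, 3))
```
</pre>
<p>With a better-informed prior, the same evidence clears a given BFDP threshold
more easily, which is one way to read the claim that the analysis can work with
fewer samples.</p>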
<p>2010: applied to head and neck cancer and found a new association</p>
<p>(Next: epigenetics and gene-environment interaction studies -- GxE)</p>
<p></div><div class="slide"></p><h1 class="cow-heading">More examples</h1>
<ul>
<li>TNA, the UK National Archives</li>
<li>TSO, the Stationery Office</li>
<li>SLaM, South London and Maudsley Hospital</li>
<li>SMEs: Fizzback; Innovantage; Sentimetrix; Ontotext; ...</li>
<li>Corporates: pharmas, publishers, bizintel users
<ul>
<li>demand: <a class="cow-url" href="http://www.informationweek.com/news/software/bi/229500096">~$1
billion in 2010</a></li>
</ul></li>
<li>starting up: British Library; Food and Environment Research Agency; Health
on the Net</li>
<li>(you next?)</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Contents</h1>
<ul>
<li>1. Introduction
<ul>
<li>Context</li>
<li>Examples: TV Sport; Genetic Epidemiology; etc.</li>
</ul></li>
<li><u><b>2. The GATE family</b></u>
<ul>
<li><u><b>Developer, Embedded</b></u></li>
<li><u><b>Teamware, Process</b></u></li>
<li><u><b>Cloud</b></u></li>
<li><u><b>KIM, OWLIM, Linked Data</b></u></li>
<li><u><b>M&iacute;mir</b></u></li>
<li><u><b>Wiki</b></u></li>
</ul></li>
<li>3. [Demos
<ul>
<li>an IDE for text analysis specialists</li>
<li>collaborative manual annotation workflows</li>
<li>GATECloud.net </li>
<li>M&iacute;mir: a mixed-mode index server]</li>
</ul></li>
<li>4. A lifecycle for text analysis</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">The GATE Family</h1>
<ul>
<li>an <b>architecture</b></li>
<li>an IDE: <b>GATE Developer</b>: an integrated development environment for language
processing components bundled with a widely used Information Extraction system
and a comprehensive set of
<a class="cow-url" href="http://gate.ac.uk/gate/doc/plugins.html">other plugins</a></li>
<li>a framework: <b>GATE Embedded</b>: an object library optimised for inclusion in
diverse applications giving access to all the services used by GATE Developer
and more
<ul>
<li>used worldwide by thousands of scientists, companies, teachers and students
(<u>&gt;30k downloads per year at present</u>, not counting SVN)</li>
<li>open source (LGPL), 100% java</li>
</ul></li>
<li>a web app: <b>GATE Teamware</b>: a collaborative annotation environment for
factory-style semantic annotation projects built around a workflow engine
<ul>
<li>a process: not &quot;get this software and it will revolutionise your life&quot; but
&quot;this is how to implement robust and maintainable services&quot;</li>
</ul></li>
<li><b>GATE Cloud</b>: a parallel and distributed service infrastructure running on
Amazon EC2</li>
<li><b>GATE M&iacute;mir</b>: (Multi-paradigm Information Management Index and Repository) a
scalable multiparadigm index built on <a class="cow-url" href="http://www.ontotext.com/">Ontotext</a>'s
<a class="cow-url" href="http://www.ontotext.com/owlim/">semantic repository family</a>, GATE's
annotation structures database plus full-text indexing from
<a class="cow-url" href="http://mg4j.dsi.unimi.it/">MG4J</a></li>
<li>and finally...
<ul>
<li>related tools from Ontotext (OWLIM, KIM, Linked Data endpoints)</li>
<li><a class="cow-url" href="http://gate.ac.uk/gatewiki/cow/">GATE Wiki</a></li>
<li>a <b>community</b></li>
</ul></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Developer (1)</h1>
<p>Motivation: 1990s apparatus envy: physicists had supercolliders; medics had
MRI scanners; language processing researchers had.... Perl?</p>
<ul>
<li>A specialist Integrated Development Environment for language engineering R&amp;D</li>
<li>Analogous to
<ul>
<li>Eclipse or Netbeans for programmers</li>
<li>Mathematica or SPSS for maths and stats</li>
</ul></li>
<li>Visualisation and editing of text, annotations, ontologies, parse trees, etc.</li>
<li>Constructing applications from components</li>
<li>Measurement, evaluation, benchmarking</li>
<li>Etc., etc.</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Developer (2)</h1>
<p><img src="life-nerc-example.png" alt="LifeNERC" width="900"/></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Embedded (1)</h1>
<p>Object-oriented Java framework. Architectural principles:</p>
<ul>
<li>Non-prescriptive, theory neutral (strength and weakness) </li>
<li>Re-use, interoperation, not reimplementation (e.g. diverse XML support,
integration of Prot&eacute;g&eacute;, OWLIM, Weka, LingPipe, OpenNLP, SVM Light,
etc.) </li>
<li>(Almost) everything is a component, and component sets are user-extendable </li>
<li>(Almost) all operations are available both from API (Embedded) and GUI
(Developer)</li>
</ul>
<p>CREOLE: a Collection of REusable Objects for Language Engineering</p>
<ul>
<li>GATE components: modified Java Beans with XML configuration</li>
<li>The minimal component = 10 lines of Java, 10 lines of XML, 1 URL</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Embedded (2)</h1>
<p><img src="http://gate.ac.uk/sale/talks/gate-apis.png" alt="GATE Embedded APIs" width="800"/></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Embedded (3)</h1>
<ul>
<li>persistence, visualisation and editing</li>
<li>a finite state transduction language (JAPE)</li>
<li>extraction of training instances for machine learning (ML)</li>
<li>pluggable ML implementations (Weka, YALE, SVM, ...)</li>
<li>components for language processing, e.g. parsers, machine learning tools,
stemmers, a few IR tools (Lucene, GYM query plugins), IE components for
various languages...</li>
<li>bundled with a very widely used Information Extraction system (ANNIE)
<ul>
<li>MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.</li>
</ul></li>
<li>simple API for RDF or OWL (metadata) via OWLIM </li>
<li>kitchen sink</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Process, workflow, GATE Teamware (1)</h1>
<table> <tr><td>
<p>A typical annotation project:</p>
<ul>
<li>client discussion, task exploration, <br> draft <em>extraction specification</em></li>
<li>manual annotation, inter-annotator <br> agreement, iterate the task spec</li>
<li>prototype a machine solution</li>
<li>more manual annotation for training <br> and test data (gold standard)</li>
<li>implement production solution</li>
<li>more manual annotation for quality <br> control, maintenance, adaptation...</li>
</ul>
</td><td> &nbsp;&nbsp;&nbsp;</td><td>
<img src="http://gate.ac.uk/family/process/images/problem-definition.png" alt="Process" width="330"/>
</td></tr>
</table>
<p></div><div class="slide"></p><h1 class="cow-heading">Process, workflow, GATE Teamware (2)</h1>
<ul>
<li><b>The GATE process</b> is a set of steps to follow in the definition, prototyping,
development, deployment and maintenance of semantic annotation processes.</li>
<li><b>GATE Teamware</b> is a workflow-based web engine supporting these processes.</li>
<li>Based on the JBoss process management engine (BPEL compatible)</li>
<li>Teamware supports marshalling the manual annotation team, job allocation,
quality control, training, communication, process monitoring...</li>
<li>Case study: <a class="cow-url" href="http://www.lighthouseipg.com/">Lighthouse Group</a> runs teams of
annotators in Cebu (Philippines), e.g. supplying 10,000 hours to the Khresmoi
project for on-line medical information.</li>
</ul>

<p></div><div class="slide"></p><h1 class="cow-heading">Teamware (3): workflow configuration</h1>
<table> <tr><td>
<img src="http://gate.ac.uk/sale/talks/tal/teamware-tasks.gif" alt="Teamware tasks" width="500"/>
</td><td>
<img src="http://gate.ac.uk/sale/talks/tal/templateoverview.png" alt="Template overview" width="500"/>
</td></tr>
</table>
<p></div><div class="slide"></p><h1 class="cow-heading">Teamware (4): process monitoring</h1>
<p><img src="http://gate.ac.uk/sale/talks/tal/annotationstatusoverview.png" alt="Annotation status overview"/></p>
<p></div><div class="slide"></p><h1 class="cow-heading">Teamware (5): quality control</h1>
<table> <tr><td>
<img src="http://gate.ac.uk/sale/talks/tal/iaacaculation.png" alt="IAA calculation" width="500"/>
</td><td>
<img src="http://gate.ac.uk/sale/talks/tal/iaaresult.png" alt="IAA result" width="500"/>
</td></tr>
</table>
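<p>Inter-annotator agreement (IAA) can be quantified in several ways, and the
screenshots do not say which measure is configured. Purely as an illustration,
here is a sketch of Cohen's kappa for two annotators labelling the same items;
the labels and the function name are invented for the example.</p>
<pre>
```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labelled at random using
    # their own observed label frequencies.
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one identical label throughout
    return (observed - expected) / (1.0 - expected)

ann_1 = ["Person", "Person", "Org", "Org", "Loc", "Person"]
ann_2 = ["Person", "Org", "Org", "Org", "Loc", "Person"]
print(round(cohen_kappa(ann_1, ann_2), 3))
```
</pre>
<p>Raw agreement here is 5 out of 6, but kappa is lower because some of that
agreement would be expected by chance.</p>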
<p></div><div class="slide"></p><h1 class="cow-heading">Teamware (6): staff communication</h1>
<p><img src="http://gate.ac.uk/sale/talks/tal/forum.png" alt="Forum"/></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Cloud (1): the marketing BS</h1>
<p>Cloud computing means many things in many contexts. On <b>GATECloud.net</b> it
means:</p>
<ul>
<li><b>zero fixed costs</b>: you don't buy software licences or server hardware, just
pay for the compute time that you use</li>
<li><b>near zero startup time</b>: in a matter of minutes you can specify, provision
and deploy the type of computation that used to take months of planning</li>
<li><b>easy in, easy out</b>: if you try it and don't like it, go elsewhere! you can
even take the software with you, it's all open source</li>
<li><b>someone else takes the admin load</b>:
<ul>
<li><a class="cow-url" href="http://gate.ac.uk/">the GATE team</a> from the <a class="cow-url" href="http://www.shef.ac.uk/">University of Sheffield</a> make sure you're running the best of breed
technology for text, search and semantics</li>
<li>cloud providers' data center managers (e.g. at <a class="cow-url" href="http://aws.amazon.com/">Amazon Inc.</a>) make sure the hardware and operating platform for your work
is scalable, reliable and cheap</li>
</ul></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Cloud (2): engineering</h1>
<ul>
<li><b>parallel</b> execution engine for automatic annotation processes + <b>distributed</b>
execution of the parallel engine</li>
<li><b>scalability</b>: auto-scaling of processor swarms running on top of AWS EC2</li>
<li><b>flexibility</b>: parameters configure behaviour, selecting the GATE application
to execute, the input protocol used for reading documents, the output protocol
used for exporting the resulting annotations, ...</li>
<li><b>robustness</b>: jobs run unattended over large data sets
<ul>
<li>extensively tested and profiled (no memory leaks)</li>
<li>errors and exceptions that occur during processing are trapped and reported</li>
<li>if the process crashes (e.g. hardware failure), it can be restarted and resumes
execution where it left off</li>
</ul></li>
</ul>
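<p>The restart behaviour can be pictured with a small sketch. This is not
GATECloud.net code: the checkpoint file format, the function names and the
stand-in annotate step are all assumptions made for illustration. Each finished
document is recorded, per-document errors are trapped and reported, and a
restarted job skips whatever is already recorded.</p>
<pre>
```python
import json
import os
import tempfile

def run_job(doc_ids, process, checkpoint_path):
    """Process documents, recording progress so a restart can resume."""
    done, failed = set(), {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)
        done, failed = set(state["done"]), state["failed"]
    for doc_id in doc_ids:
        if doc_id in done or doc_id in failed:
            continue  # already handled before the restart
        try:
            process(doc_id)
            done.add(doc_id)
        except Exception as exc:  # trap and report, keep the job alive
            failed[doc_id] = repr(exc)
        with open(checkpoint_path, "w") as fh:
            json.dump({"done": sorted(done), "failed": failed}, fh)
    return done, failed

def annotate(doc_id):
    """Stand-in for running an annotation pipeline over one document."""
    if doc_id == "doc-3":
        raise ValueError("malformed input")

checkpoint = os.path.join(tempfile.mkdtemp(), "job.json")
print(run_job(["doc-1", "doc-2", "doc-3"], annotate, checkpoint))
# A second run with one more document only processes the new one.
print(run_job(["doc-1", "doc-2", "doc-3", "doc-4"], annotate, checkpoint))
```
</pre>
<p>Writing the checkpoint after every document keeps the sketch simple; a real
service would more plausibly batch the writes and make them atomic.</p>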
<p><img src="http://gate.ac.uk/sale/talks/gate-course-may11/gatecloud.net-intro/images/annotation-job.png" alt="A cloud annotation job" width="700"></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Cloud (3): a research perspective</h1>
<ul>
<li>something like the <em>facility</em> that the IRF was trying to set up for IR
more generally</li>
<li>host a growing family of experimental system configurations, data sets,
results</li>
<li>biased heavily towards information extraction (perhaps some mileage in adding
more mainstream IR?)</li>
<li>persistence and reuse of experimental setups: virtualisation makes it possible
to store not just data but the entire compute platform operable for particular
experiments or analyses</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">First Cousins -- the Ontotext family</h1>
<p>Complementing the GATE tools, Ontotext's KIM provides a straightforward front-end
deployment option, and their Linked Data offerings provide a good baseline for
model building:</p>
<ul>
<li><b>Ontotext <a class="cow-url" href="http://www.ontotext.com/kim">KIM</a></b>: UIs demonstrating multiple
conceptual and faceted search modes</li>
<li><b>Ontotext <a class="cow-url" href="http://www.ontotext.com/owlim">OWLIM</a></b>: the fastest and most
scalable semantic repository</li>
<li><b>Ontotext <a class="cow-url" href="http://ontotext.com/factforge/">FactForge</a></b>: ~4 billion statements
from the Linked Data cloud</li>
<li><b>Ontotext <a class="cow-url" href="http://www.linkedlifedata.com/">Linked Life Data</a></b>: over 4 billion
statements from life sci databases including UniProt, PubMed, EntrezGene and
20 more</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE M&iacute;mir: hitting the indexing problem</h1>
<table> <tr><td>
<p><b>Circa 2007</b>:</p>
<ul>
<li>A new project on patent searching at <a class="cow-url" href="http://www.ir-facility.org/">the IRF</a> =
<b>culture shock</b>!
<ul>
<li>Full text, boolean (&quot;100% recall&quot;!)</li>
</ul></li>
<li>Initial prototyping and demo work:
<ul>
<li>Conceptual and semantic search and navigation (KIM, as above)</li>
<li>ANNotations In Context (ANNIC) (right)</li>
</ul></li>
<li>User requirement: put it all together</li>
<li>Oops: ANNIC scaled to 200 short docs...</li>
</ul>
</td><td>
<p><b>ANNIC (ANNotations In Context)</b>: <br> <br>
<img src="http://gate.ac.uk/sale/talks/tal/annic.png" alt="ANNIC" width="550"/></p>
</td></tr>
</table>
<p></div><div class="slide"></p><h1 class="cow-heading">Annotations: the Missing Data Structure (1)</h1>
<table> <tr><td>
<table> <tr><td>
<b>How do we search a billion-node annotation graph...?</b>
</td></tr> <tr><td> <p><br> <br> <br>
<b>Model</b>:</p>
<p><img src="http://gate.ac.uk/sale/talks/tal/annotationGraph.png" alt="An annotation graph" width="550"/></p>
</td></tr> <tr><td> <p><br>
(Cf. TIPSTER, TEI/XCES, ATLAS, UIMA, ...)</p>
</td></tr>
</table>
</td><td> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</td><td>
<p><b>UI example</b>:</p>
<p><img src="http://gate.ac.uk/sale/talks/tal/chinese1.png" alt="Chinese annotations" width="400"/></p>
</td></tr>
</table>
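<p>As a rough sketch of the model (loosely in the stand-off style of the systems
cited above; the class and field names are illustrative and are not GATE's API):
annotations are typed arcs between character offsets in the text, each carrying
a feature map, and several annotations may cover the same or overlapping text.</p>
<pre>
```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int   # offset of the first character covered
    end: int     # offset just past the last character covered
    type: str
    features: dict = field(default_factory=dict)

class AnnotatedText:
    def __init__(self, text):
        self.text = text
        self.annotations = []

    def add(self, start, end, type_, **features):
        ann = Annotation(start, end, type_, features)
        self.annotations.append(ann)
        return ann

    def covered(self, ann):
        """The span of text an annotation points at."""
        return self.text[ann.start:ann.end]

    def within(self, outer, type_):
        """Annotations of a given type lying inside another annotation."""
        # Offsets allowed for both ends of a contained annotation.
        span = range(outer.start, outer.end + 1)
        return [a for a in self.annotations
                if a.type == type_ and a.start in span and a.end in span]

doc = AnnotatedText("Hamish Cunningham works in Sheffield.")
sentence = doc.add(0, 37, "Sentence")
doc.add(0, 17, "Person")
doc.add(27, 36, "Location", kind="city")
for ann in doc.within(sentence, "Location"):
    print(ann.type, repr(doc.covered(ann)), ann.features)
```
</pre>
<p>Because the text itself is never modified, any number of annotation layers
can coexist, which is what makes the result a graph rather than a tree.</p>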
<p></div><div class="slide"></p><h1 class="cow-heading">Annotations: the Missing Data Structure (2)</h1>
<ul>
<li>Our first thoughts: where can we steal one?</li>
<li>Annotations: how to index the graph?
<ul>
<li>XML indexing and retrieval work doesn't solve it (biased towards trees)</li>
<li>RDBMS doesn't solve it (biased towards relations)</li>
<li>augmented full-text indices can help with efficient access, but the data
storage requirements of our prototype (based on Lucene) grew exponentially
with the cardinality of the annotation sets</li>
</ul></li>
<li>May 2008: workshop on <em>Persisting, Indexing and Querying Multi-Paradigm Text
Models</em>, IRF, Vienna
<table> <tr><td>
<ul>
<li>MG4J (Eric Graf, Glasgow)</li>
<li>Terrier (Gianni Amati, FUB/Glasgow)</li>
<li>INEX (Norbert Fuhr, Essen-Duisburg)</li>
<li>KIM, OWLIM (Atanas Kiryakov, Ontotext)</li>
</ul> </td><td>
<ul>
<li>ANNIC (Valentin Tablan, Sheffield)</li>
<li>HTML-XML Search Engines (Ralf Schenkel, MPG)</li>
<li>MonetDB (Arjen de Vries, CWI)</li>
</ul> </td></tr>
</table></li>
<li>May 2009: custom solution based on MG4J (Sebastiano Vigna) + OWLIM called
<b>M&iacute;mir</b>...</li>
<li>May 2010: version 2: incremental indices; federation</li>
<li>May 2011: version 3: full source release under the AGPL; cloud release</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">M&iacute;mir: Multi-paradigm Information Management Index and Repository</h1>
<p>M&iacute;mir is an index engine that can search over:</p>
<ul>
<li>text</li>
<li>textual and semantic annotations</li>
<li>ontologies and knowledge bases</li>
</ul>
<p>Built on top of:</p>
<ul>
<li>the MG4J text indexing library</li>
<li>GATE's annotation index (remodelled in MG4J)</li>
<li>Ontotext's semantic repository family</li>
</ul>
<p>(Just about) <b>scales to a terabyte of annotated text</b></p>
<p>More information: <a class="cow-url" href="http://gate.ac.uk/sale/talks/tal/query-examples/index.html">some query
examples</a>; <a class="cow-url" href="http://gate.ac.uk/mimir/">demos</a>;
<a class="cow-url" href="http://gate.ac.uk/family/mimir.html">user and developer guide</a></p>
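<p>To give a feel for what a mixed-mode query means, here is a toy sketch. It
bears no relation to M&iacute;mir's actual implementation or query syntax (the
query examples linked above show those): the index layout, names and data are
invented, and the ontology side is left out entirely. It answers queries of the
form &quot;documents where a given word occurs within a few tokens of an
annotation of a given type&quot; by combining a word index with an annotation
index over shared token positions.</p>
<pre>
```python
from collections import defaultdict

class ToyMixedIndex:
    """Word and annotation postings over shared token positions."""

    def __init__(self):
        self.words = defaultdict(set)        # word to {(doc, position)}
        self.annotations = defaultdict(set)  # type to {(doc, position)}

    def add_document(self, doc_id, tokens, annotations):
        for pos, token in enumerate(tokens):
            self.words[token.lower()].add((doc_id, pos))
        for ann_type, first, last in annotations:
            for pos in range(first, last + 1):
                self.annotations[ann_type].add((doc_id, pos))

    def near(self, ann_type, word, window=2):
        """Documents where the word is within window tokens of an annotation."""
        hits = set()
        for doc_id, pos in self.words[word.lower()]:
            nearby = range(pos - window, pos + window + 1)
            if any((doc_id, p) in self.annotations[ann_type] for p in nearby):
                hits.add(doc_id)
        return sorted(hits)

index = ToyMixedIndex()
index.add_document("d1", "Smith joined Acme in 2009".split(),
                   [("Person", 0, 0), ("Organization", 2, 2)])
index.add_document("d2", "The committee joined forces".split(), [])
print(index.near("Person", "joined"))
```
</pre>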
<p></div><div class="slide"></p><h1 class="cow-heading">The Poor Relation: GATE Wiki (1)</h1>
<p>Why another wiki? Scratching three itches:</p>
<ol>
<li>adding interaction to a largish static site (15k HTML files, 40k other
files)</li>
<li>wiki style collaborative document creation with asynchronous off-line editing</li>
<li>a test-bed for experiments in controlled languages for round-trip ontology
engineering </li>
</ol>
<p>Hence CoW, a Controllable Wiki (aka GATEWiki): <a class="cow-url" href="http://gate.ac.uk/gatewiki/cow/">http://gate.ac.uk/gatewiki/cow/</a></p>
<p></div><div class="slide"></p><h1 class="cow-heading">GATE Wiki (2)</h1>
<p>Main features</p>
<ul>
<li>designed from the ground up to support concurrent editing and off-line
working with straightforward synchronisation using Subversion (SVN)</li>
<li>uses the YAM language, which
<ul>
<li>outputs LaTeX as well as HTML</li>
<li>allows paths as links (i.e. does not limit the namespace to a single
directory, as e.g. JSPWiki does) and consequently allows a
tree-structured page store (and later graph-structured navigation via an
ontology)</li>
</ul></li>
<li>allows mixing of all types of files in its page store (which is just an SVN
sandbox, in fact)</li>
<li>supports versioning and differencing via SVN, and allows other tools that
manipulate SVN repositories to be used with the wiki data (e.g. SVN itself,
Eclipse, ViewCVS, etc.)</li>
<li>may optionally support embedded CLOnE
(<a class="cow-url" href="http://gate.ac.uk/sale/lrec2006/clie/clie.pdf">Controlled Language for
Ontology Editing</a>), and therefore experiments with applications that store
their data in semantic repositories whose schema is user-defined and
maintained</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Contents</h1>
<ul>
<li>1. Introduction
<ul>
<li>Context</li>
<li>Examples: TV Sport; Genetic Epidemiology; etc.</li>
</ul></li>
<li>2. The GATE family
<ul>
<li>Developer, Embedded</li>
<li>Teamware, Process</li>
<li>Cloud</li>
<li>KIM, OWLIM, Linked Data</li>
<li>M&iacute;mir</li>
<li>Wiki</li>
</ul></li>
<li>3. <u><b>[Demos</b></u>
<ul>
<li><u><b>an IDE for text analysis specialists</b></u></li>
<li><u><b>collaborative manual annotation workflows</b></u></li>
<li><u><b>GATECloud.net</b></u></li>
<li><u><b>M&iacute;mir: a mixed-mode index server]</b></u></li>
</ul></li>
<li>4. A lifecycle for text analysis</li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Demos</h1>
<table> <tr><td>
</td><td> <img src="http://gate.ac.uk/sale/images/gate4/logo-colour.png" alt="GATE logo" width="170"/>
</td><td> <br> </td><td> <a class="cow-url" href="http://gate.ac.uk/download/">an IDE for text analysis specialists</a>
</td></tr> <tr><td> <br>
</td><td> <img src="http://gate.ac.uk/teamware/teamware-logo.png" alt="Teamware logo" width="200"/>
</td><td> <br> </td><td> <a class="cow-url" href="https://gatecloud.net/yourAccount/machineReservationDetails/39">collaborative manual annotation workflows</a>
</td></tr> <tr><td> <br> </td></tr> <tr><td>
</td><td> <img src="logo-cloud.png" alt="Cloud logo" width="150"/>
</td><td> <br> </td><td> <a class="cow-url" href="http://gatecloud.net/">GATECloud.net</a>
</td></tr> <tr><td> <br> </td></tr> <tr><td>
</td><td> <img src="logo-mimir1.png" alt="M&iacute;mir logo" width="150"/>
</td><td> <br> </td><td> <a class="cow-url" href="http://gate.ac.uk/mimir/">M&iacute;mir: a mixed-mode index server</a>
</td></tr>
</table>
<p></div><div class="slide"></p><h1 class="cow-heading">Contents</h1>
<ul>
<li>1. Introduction
<ul>
<li>Context</li>
<li>Examples: TV Sport; Genetic Epidemiology; etc.</li>
</ul></li>
<li>2. The GATE family
<ul>
<li>Developer, Embedded</li>
<li>Teamware, Process</li>
<li>Cloud</li>
<li>KIM, OWLIM, Linked Data</li>
<li>M&iacute;mir</li>
<li>Wiki</li>
</ul></li>
<li>3. [Demos
<ul>
<li>an IDE for text analysis specialists</li>
<li>collaborative manual annotation workflows</li>
<li>GATECloud.net </li>
<li>M&iacute;mir: a mixed-mode index server]</li>
</ul></li>
<li><u><b>4. A lifecycle for text analysis</b></u></li>
</ul>
<p></div><div class="slide"></p><h1 class="cow-heading">Full lifecycle information extraction</h1>

<ol>
<li>Take one large pile of text (documents, emails, tweets, patents, papers,
transcripts, blogs, comments, acts of parliament, and so on and so forth).</li>
<li>Pick a structured description of interesting things in the text (a telephone
directory, or chemical taxonomy, or something from the
<a class="cow-url" href="http://linkeddata.org/">Linked Data</a> cloud) -- call this your <em>ontology</em>. </li>
<li>Use <a class="cow-non-existant-url" href="teamware/?m=1">GATE Teamware</a> to mark up a <em>gold standard</em> example set of
annotations of the corpus (1.) relative to the ontology (2.).</li>
<li>Use <a class="cow-non-existant-url" href="family/developer.html?m=1">GATE Developer</a> to build a <em>semantic annotation
pipeline</em> to do the annotation job automatically and measure performance
against the gold standard.</li>
<li>Take the pipeline from 4. and apply it to your text pile using
<a class="cow-url" href="http://gatecloud.net/">GATE Cloud</a> (or embed it in your own systems
using <a class="cow-non-existant-url" href="family/embedded.html?m=1">GATE Embedded</a>). Use it to bootstrap more manual
(now semi-automatic) work in Teamware.</li>
<li>Use <a class="cow-non-existant-url" href="family/mimir.html?m=1">GATE M&iacute;mir</a> to store the annotations relative to the
ontology in a <em>multiparadigm index server</em>.</li>
<li>(Probably) write a half-decent UI to go on top of M&iacute;mir.</li>
<li>Hey presto, you have search that applies your annotations and your
ontology to your corpus (and a <a class="cow-non-existant-url" href="family/process.html?m=1">sustainable process</a>
for coping with changing information needs and/or changing text).</li>
<li>Your users are happy (and <a class="cow-url" href="http://gate.ac.uk/">GATE.ac.uk</a> has a &quot;donate&quot;
button ;-) ).</li>
</ol>
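<p>Measuring performance against the gold standard (step 4) usually comes down
to something like precision, recall and F-measure. A minimal sketch, assuming
strict matching (same span and same type) and annotations given as
(start, end, type) tuples; the data is invented, and lenient matching of
partially overlapping spans is deliberately left out.</p>
<pre>
```python
def evaluate(gold, predicted):
    """Strict precision, recall and F1 over (start, end, type) tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold.intersection(predicted))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = [(0, 17, "Person"), (27, 36, "Location"), (40, 44, "Date")]
predicted = [(0, 17, "Person"), (27, 36, "Organization")]
print(evaluate(gold, predicted))
```
</pre>
<p>Here the pipeline found the right span for the second annotation but gave it
the wrong type, so under strict matching it counts as both a miss and a false
alarm.</p>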
<p></div><div class="slide"></p><h1 class="cow-heading">Links</h1>
<p>More information</p>
<ul>
<li>GATE home page: <a class="cow-url" href="http://gate.ac.uk/">http://gate.ac.uk/</a></li>
<li>these slides: <a class="cow-url" href="http://gate.ac.uk/hamish/talks/ibot-slidy.html">http://gate.ac.uk/hamish/talks/ibot-slidy.html</a></li>
<li>GATECloud.net: <a class="cow-url" href="http://gatecloud.net/">http://gatecloud.net/</a></li>
</ul>

</div> </body></html>
