<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>MLi WP5</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="copyright"
content="GATE Team, University of Sheffield - gate.ac.uk"/>
<link rel="stylesheet" type="text/css" media="screen, projection, print"
href="../gslidy/slidy.css"/>
<script src="../gslidy/slidy.js"
type="text/javascript"></script>
</head>
<body>
<div class="background">
<img id="head-icon" alt="" align="right" src="../mli-logo.png" width="150"/>
</div>
<div class="slide">
<h1 class="cow-title-heading">MLi Research Observatory: WP5</h1>
<table> <tr><td>
<p>&nbsp;&nbsp;<b>Hamish Cunningham</b> <br> &nbsp;&nbsp;Research Professor of <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Internet Computing &nbsp;&nbsp;&nbsp;<br>
&nbsp;&nbsp;Department of Computer Science &nbsp;&nbsp;&nbsp;<br>
&nbsp;&nbsp;University of Sheffield <br>
<br>
&nbsp;&nbsp;<a class="cow-url" href="https://gate.ac.uk/">https://gate.ac.uk/</a> <br>
&nbsp;&nbsp;<a class="cow-url" href="https://hamish.gate.ac.uk/">https://hamish.gate.ac.uk/</a> <br>
&nbsp;<br></p>
</td><td> <p><a class="cow-url" href="http://mli-project.eu/"><img class="cow-img" src="../mli-logo.png" alt="MLi" width="300"/></a> <br>
<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
December 2014, Luxembourg</p>
</td></tr>
</table>

</div><div class="slide"><h1 class="cow-heading">Summary</h1>
<ul>
<li>Mission: highlight the potential of infrastructure in promoting tech
transfer roll-over for EU language and knowledge industries</li>
<li>Method: digging into the startup and SME scene</li>
<li>Outcome: recommendations for MLi and for future EC interventions
(including the long-term and the blue sky!)</li>
</ul>
</div><div class="slide"><h1 class="cow-heading">The Terrain</h1>
<ul>
<li>Social Media
<ul>
<li>problems of access and challenges of scale:
<ul>
<li>Twitter: DataSift and NTT Data only independents with firehose access
(Apple bought Topsy, Twitter bought GNIP)</li>
<li>Facebook: a closed book</li>
<li>The Rest: a long tail</li>
</ul></li>
<li>the four Vs:
<ul>
<li>recent focus: volume, variety, velocity</li>
<li>upcoming: veracity</li>
</ul></li>
</ul></li>
<li>Big Data
<ul>
<li>infrastructure:
<ul>
<li>leading player still AWS (Amazon Web Services)</li>
<li>some inroads by App Engine, Azure, Rackspace &amp; smaller players</li>
<li>open alternatives (OpenStack, CloudStack, Eucalyptus, OpenNebula, ...)
reaching towards critical mass</li>
</ul></li>
<li>analytics startups: throw a rock</li>
</ul></li>
<li>Problematic issues:
<ul>
<li>privacy and security in the age of mass surveillance (below)</li>
<li>predominant model: advertising (zero contribution to inclusion,
resilience, quality of life)</li>
</ul></li>
</ul>
</div><div class="slide"><h1 class="cow-heading">EU Industry: Contradictory Pressures</h1>
<ul>
<li>majority of EU LT suppliers in the small-to-medium size range</li>
<li>social/mobile creating contradictory forces:
<ul>
<li>constriction of the market spaces for small players due to:
<ul>
<li>rising data volumes and commensurately higher infrastructure costs;</li>
<li>the economies of scale open to the big players become proportionately
greater as the gap between large and small increases</li>
</ul></li>
<li>new opportunities for innovative first movers, but only available to those
able to exploit 3rd-party infrastructure for scaling without prohibitive
cost
<ul>
<li>distribution of these abilities currently skewed towards the
north-western and transatlantic technology communities</li>
</ul></li>
</ul></li>
</ul>
</div><div class="slide"><h1 class="cow-heading">EU Industry: Contradictory Pressures (2)</h1>
<p>Specific examples of operations that are currently prohibitively expensive for
European SMEs and research labs:</p>
<ul>
<li>the filtering and supply of large web fragments (for example all the new or
changed pages in France yesterday, or all the regularly changing commercial
pages on <tt>.com</tt>, or all the pages with high page-rank, etc.)</li>
<li>the filtering and aggregation of real-time social data streams (for a
promising &mdash; and European &mdash; filtering example see <a class="cow-url" href="http://datasift.com">DataSift.com</a>)</li>
<li>time series analyses or geographic aggregation of opinion or sentiment (it
was impossible last year, for example, to get a broad measure of European
thinking with respect to the Greek situation, as a leading Athens text
processing researcher pointed out to me recently)</li>
</ul>
</div><div class="slide"><h1 class="cow-heading">A European Language Data Success Story...</h1>
<p>Technology transfer for <em>Compact Predictive Typing</em> on touch devices</p>
<ul>
<li>when the inventors of a leading Android predictive typing system won time on
CERN's Large Hadron Collider they used it to crawl as much fluent
multilingual text as they could</li>
<li>the result: a successful business that employs 100 people in London</li>
<li>they took the massive volume of language data that they were able to collect
on CERN's machines and built <a class="cow-url" href="http://en.wikipedia.org/wiki/Language_model">statistical models</a></li>
<li>encoded them in a very compact format (using
<a class="cow-url" href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing">approximate
hashing</a>) </li>
<li>use them to predict the next word during typing on smart phones and other
touch-based devices</li>
</ul>
<p><b>MLi / WP5</b>: how best to promote more stories like this? (Via a. requirements
on the MLi Hub architecture(s), and b. programmatic interventions, e.g. for
CEF/AT.)</p>
</div><div class="slide"><h1 class="cow-heading">Success Story (2)</h1>
<p><b>Lessons</b>:</p>
<ul>
<li>it would have been impossible to achieve without access to an extremely
large computational infrastructure: the data volumes are truly enormous
(the CERN Data Centre processes about one petabyte of data every day and
hosts 10,000 servers with 90,000 processor cores)</li>
<li>the computation was <em>bursty</em> &mdash; a classic case for cloud computing</li>
<li>Google could have done it, but didn't</li>
</ul>
<p>These suggest that:</p>
<ul>
<li>small players can still enter the market spaces of the giants in cases where
the infrastructural requirements are not constant</li>
<li>these are typically client-side with only intermittent large-scale
computation</li>
<li>this is <em>not</em> true of 24/7/365 applications like public search
infrastructure &mdash; here the EC would need to make a different category of
intervention (see below)</li>
</ul>
</div><div class="slide"><h1 class="cow-heading">Preliminary Recommendations (1)</h1>
<ul>
<li>Prioritise availability of extremely large infrastructure for language
technology (see deliverable 5.1 <a class="cow-url" href="#swiftkey">section 2</a>; also <a class="cow-url" href="#web-cern">section 7</a>).</li>
<li>Prime the technology transfer pipeline with research and startup use cases
(see <a class="cow-url" href="#pipeline">section 3</a>).</li>
<li>Develop XaaS marketplaces for the deployment and monetisation of CEF
building blocks (see <a class="cow-url" href="#xaas-market">section 4</a>).</li>
<li>In the specific case of translation (see <a class="cow-url" href="#mt">section 5</a>):
<ul>
<li>prioritise translation of user-generated content</li>
<li>incorporate crowd-sourced models of translation</li>
</ul></li>
<li>Taxonomise the technology transfer roll-over point (see <a class="cow-url" href="#taxon">section
6</a>) as a foundation for infrastructure design.</li>
</ul>
<p>(XaaS: IaaS, PaaS, SaaS...)</p>

</div><div class="slide"><h1 class="cow-heading">Preliminary Recommendations (2)</h1>
<ul>
<li><em>Reuse not reinvention</em>: leverage the success of AWS, OpenStack, Hadoop, S4,
and the like.</li>
<li><em>Applications-driven</em>: infrastructural activity that is insufficiently bound
to applications is a resource sink with little chance of long-term utility.</li>
<li>Active support for decentralised social networks as a vector for privacy
preservation and European diversity.</li>
<li>Provisioning of the European Language Cloud.</li>
<li>Network as societal infrastructure:
<ul>
<li>search and social media are the lifeblood of the digital economy</li>
<li>they should therefore be regarded as social infrastructure</li>
<li>we don't expect our roads to make money; neither should we expect that of
the basic functions of the network</li>
</ul></li>
</ul>
</div><div class="slide"><h1 class="cow-heading">Futures (1): the Device Frontier &amp; LT Solutions</h1>
<ul>
<li>LT was at the heart of the mobile revolution; continues as main conduit
between users and on-line needs via
<ul>
<li>search, geolocation, translation, etc.</li>
</ul></li>
<li>beyond mobile, advances in low power chips (chiefly the EU's ARM) are
spawning a new generation of LT-powered devices</li>
<li>e.g. Amazon Echo was announced last week &mdash; <a class="cow-url" href="http://www.amazon.com/oc/echo/">http://www.amazon.com/oc/echo/</a>
<ul>
<li>combination of a small device doing speech recognition on the client and
question answering from a KB of information extraction data in the cloud</li>
</ul></li>
<li>Europe has both LT expertise and a resurgent embedded hardware ecosystem
(e.g. ARM, but also Arduino or Raspberry Pi)</li>
<li>combination of these two strengths could drive innovative products and
services meeting societal challenges
<ul>
<li>e.g. devices that help the elderly control smart homes, or address
exclusion by increasing local community connectedness</li>
</ul></li>
</ul>
</div><div class="slide"><h1 class="cow-heading">Futures (2): Euroogle?! (a CERN for the Web)</h1>
<p>If the EU Parliament thinks we should
<a class="cow-url" href="http://www.theguardian.com/technology/2014/nov/27/european-parliament-votes-yes-google-breakup-motion">break up Google</a>, perhaps the time is ripe for a new research programme...</p>
<blockquote><p>
The web was invented at CERN, the European particle physics lab, and grew
beyond all precedent to be the centerpiece of the information revolution...
and just as we need to understand the physics of our world to prepare for our
future needs so we also must understand and profit from the web.</p>
<p>We live in times when the European ethos of a strong and stable civil society
with high cohesion and excellent social services must weather a tempestuous
financial storm. It is more vital than ever to promote scientific and
technological leadership in order to drive industrial growth and social
progress. We need a <b>CERN for the web</b>!</p>
</blockquote>
<p>What does that mean?</p>
</div><div class="slide"><h1 class="cow-heading">A CERN for the Language Web (2)</h1>
<ul>
<li>provision of a shared infrastructure for web R&amp;D that can scale to petabytes
and is open and flexible enough to cater for scientists and startups,
governments and libraries, technologists and citizens (perhaps building on
the bare metal layers from e.g. <a class="cow-url" href="http://www.helix-nebula.eu/">http://www.helix-nebula.eu/</a>; note new GATE
work with the Science and Technology Facilities Council)</li>
<li>transition of research programmes in language technology onto the shared
infrastructure, and the creation of a market for web analysis solutions, &amp;
foundation of an R&amp;D programme on open search for Europe</li>
<li>create new marketplaces for services on top of the cloud infrastructure
(e.g. customer relations no longer sitting next to the phone; now reading
250,000 tweets per week)</li>
</ul>
<p>The cost of an alternative to Google may be too large to swallow in one go; a
set of smaller chunks can chart the path to a <b>solution</b> in the medium term.</p>
</div><div class="slide"><h1 class="cow-heading">Futures (3): Privacy in the Age of Surveillance</h1>
<ul>
<li>Edward Snowden and others have exposed how the NSA (and its Five Eyes
collaborators in the UK/Canada/Australia/NZ) is attempting to record and
analyse <em>all</em> electronic communication, <em>everywhere</em>, <em>all of the time</em></li>
<li>to do this they actively subvert security systems by
<ul>
<li>tapping cables (between countries or between data centers)</li>
<li>weakening standards (e.g. NIST elliptic curve crypto)</li>
<li>coercing telecoms and internet companies <em>in secret</em> (gagging orders)</li>
</ul></li>
<li>regardless of whether this is a good idea, it inevitably compromises both online
privacy and online security</li>
<li>closed systems must be assumed compromised by default
<ul>
<li>Heartbleed and Shellshock: bugs in open systems that have been fixed</li>
<li>closed systems shrouded in secrecy and open to coerced backdoor insertion</li>
</ul></li>
</ul>
</div><div class="slide"><h1 class="cow-heading">Links</h1>
<ul>
<li>these slides:
<ul>
<li>HTML: <a class="cow-url" href="https://hamish.gate.ac.uk/pages/about/talks/mli-review1/">https://hamish.gate.ac.uk/pages/about/talks/mli-review1/</a></li>
<li>PDF: <a class="cow-url" href="https://hamish.gate.ac.uk/pages/about/talks/mli-review1/index.pdf">https://hamish.gate.ac.uk/pages/about/talks/mli-review1/index.pdf</a></li>
</ul></li>
<li>MLi: <a class="cow-url" href="http://mli-project.eu/">http://mli-project.eu/</a>
<ul>
<li>WP5 deliverables: <a class="cow-url" href="http://mli-project.eu/?p=490">http://mli-project.eu/?p=490</a></li>
</ul></li>
<li>GATE: <a class="cow-url" href="https://gate.ac.uk/">https://gate.ac.uk/</a></li>
<li>me: <a class="cow-url" href="https://hamish.gate.ac.uk/">https://hamish.gate.ac.uk/</a></li>
</ul>







</div> </body></html>
