12 Putting into Practice: Mashups with YAHOO! PIPES and XProc
Mashups are Web applications that integrate and combine data from multiple Web sources to
present them in a new way to a user. This chapter shows two different ways to construct mashup
applications in practice: YAHOO! PIPES, a graphical user interface for building mashups, and XProc,
a W3C language for describing workflows of transformations over XML documents. The pros
and cons of each approach will become clear as you follow the indicated steps. The
goal will be to present information about news events, each event being accompanied
by its localization displayed on a map. For that purpose, we integrate three sources of
information:
- A Web feed about current events in the world, in RSS format (e.g., CNN’s top stories
at http://rss.cnn.com/rss/edition.rss). Any such RSS feed is fine, though
English is preferable to ensure precision of the geolocalization.
- A geolocalization service. We use information from the GeoNames
geographical database, and specifically their RSS to GeoRSS converter, whose API is
described at http://www.geonames.org/rss-to-georss-converter.html.
- A mapping service. We use Yahoo! Maps.
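To make the second source concrete, the GeoNames converter is invoked through a plain HTTP GET request. The general shape of such a call is shown below; the feedUrl parameter name is taken from the API documentation linked above, and the exact host and any additional required parameters (recent versions of the service also require a username) should be double-checked there:

```
http://ws.geonames.org/rssToGeoRSS?feedUrl=http%3A%2F%2Frss.cnn.com%2Frss%2Fedition.rss
```

Note that the value of feedUrl is itself a URL and must therefore be percent-encoded.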
12.1 YAHOO! PIPES: A Graphical Mashup Editor
YAHOO! PIPES
lets users create simple mashup applications (each simply called a pipe) through a graphical interface based
on the construction of a pipeline of boxes connected to each other, each box performing a given
operation (fetching information, annotating it, reorganizing it, etc.) until the final output of the
pipeline. It can be used by non-programmers, though defining complex mashups still requires skill
and experience with the platform. The mashup we want to build is demonstrated at
http://pipes.yahoo.com/webdam/geolocalized_news: it asks the user for a feed
URL, and displays with markers on a map the result of the geolocalization of each news
item.
- Go to the YAHOO! PIPES website and either log in using an existing Yahoo! account
or create a free account. Once you follow the links for creating a pipe, you will be
presented with the interface of the graphical editor: on the left, a list of all boxes that
can be used inside a pipe; in the center, the workspace where you can build your pipe;
in the bottom part, a debugger shows the output of the currently selected box.
- Drag a “Fetch Feed” box on the workspace. Enter the URL in the box and connect it
to the “Pipe Output” box at the bottom of the workspace by dragging a link from the
output port of the initial box (shown as a circle on its bottom border) to the input port
of the final box. Save your pipe and click on “Run pipe…” to see the result.
- We are going to add some geolocalization information by using the “Location
Extractor” operator of YAHOO! PIPES, which should be placed between the two
existing boxes. Save and run the pipe.
- The location extractor of YAHOO! PIPES is not always as precise or complete as
GeoNames. Study the documentation of the RSS to GeoRSS converter REST API. Use
this API by trying to form URLs directly in your browser until you fully understand
how it works. Then integrate it into your pipe by using a “URL Builder” whose output
port is connected to the url parameter of the existing “Fetch Feed” box. Compare the
results to what you had before.
- To give a final touch to your pipe, add a “URL input” box to ask the user for the URL
of the feed to be geolocalized. Save and test.
You can publish your pipe to give other users access to it. If you want to keep playing with
YAHOO! PIPES, you can try enriching your pipe by retrieving data from multiple RSS feeds,
using a Web search operator to discover feeds dealing with a given topic, adding to feed
items images obtained by querying Flickr with keywords from the description of the
item, and so on. You can also look at the vast library of published pipes to get some
inspiration.
12.2 XProc: An XML Pipeline Language
XProc is a W3C Recommendation for describing transformation
workflows on XML documents. Throughout this section, refer to the XProc
specification
for more detail about the language. As with YAHOO! PIPES, a workflow is seen as a pipeline of
operations (here called steps) that fetch or process information; these operations heavily rely on other
XML standards (XPath, XSLT, XInclude, XML Schema, XQuery, etc.). In YAHOO! PIPES, connections
between boxes are described in a graphical manner; in XProc they are described using an XML
syntax. Finally, unlike YAHOO! PIPES, which deals with Web data at large, XProc is dedicated
to the processing of XML data only. Any XProc processor can be used; we recommend XML
CALABASH,
a Java implementation that is easy to install and to use.
- Download the skeleton pipeline skeleton.xpl from the book website,
http://webdam.inria.fr/Jorge/. Test your XProc processor; if you use XML
CALABASH and its installation directory is in your PATH environment variable, you can just
type
calabash skeleton.xpl > result.html
No error message should show (only information messages), and result.html should
contain a basic view of CNN’s feed.
- Look at the skeleton.xpl file. The whole pipeline is described inside a top-level
<p:pipeline> element. First, a variable is declared; declared variables can be used in XPath
expressions further in the file (all values of select attributes are XPath expressions). Then the
RSS file is loaded with the help of the standard <p:load> step (again, see the XProc
specification for the definition of standard steps). All items of the RSS feed are
put into a sequence (<p:for-each>), and this sequence is then wrapped under a
common item element (<p:wrap-sequence>); these two steps are arguably not very
useful, but this structure will help us extend this pipeline. Finally, an inline
XSLT stylesheet reformats the list of items into a table in which each row has
a single cell containing the title of the item and linking to the corresponding
article.
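As a rough sketch, the structure just described could look like the pipeline below. This is a reconstruction for illustration only; the actual skeleton.xpl from the book website may differ in details such as the feed URL, the wrapper element name, and the inline XSLT stylesheet (elided here):

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- A variable, usable in XPath expressions further down -->
  <p:variable name="feed" select="'http://rss.cnn.com/rss/edition.rss'"/>

  <!-- Load the RSS document -->
  <p:load>
    <p:with-option name="href" select="$feed"/>
  </p:load>

  <!-- Turn each news item into one document of a sequence -->
  <p:for-each>
    <p:iteration-source select="//item"/>
    <p:identity/>
  </p:for-each>

  <!-- Wrap the sequence back under a single root element -->
  <p:wrap-sequence wrapper="items"/>

  <!-- An inline XSLT stylesheet (not shown) then builds the HTML table -->
</p:pipeline>
```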
- Change the <p:load> so that a geolocalized version of the RSS feed is loaded instead of the
original one. Once again, refer to the documentation of the API of GeoNames to determine
which URL to load. You can use the XPath 2.0 function encode-for-uri() to properly encode
special characters in a URL.
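For instance, assuming the feed URL is held in a variable $feed and the converter accepts a feedUrl parameter as described in the GeoNames documentation (the host name and any extra parameters, such as a username, should be checked against the current documentation), the step could become:

```xml
<p:load>
  <p:with-option name="href"
      select="concat('http://ws.geonames.org/rssToGeoRSS?feedUrl=',
                     encode-for-uri($feed))"/>
</p:load>
```

The call to encode-for-uri() is what ensures that characters such as `:` and `/` in the feed URL do not break the query string.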
- Items should now have geo:lat and geo:long child elements with geolocalization
information. Test this by adding in the XSLT stylesheet, after the item’s title, two
<xsl:value-of> elements that show both coordinates. Test.
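Assuming the geolocalized feed uses the W3C WGS84 geo vocabulary, the geo prefix must be bound to its namespace on the stylesheet, and the two elements could be written as follows:

```xml
<!-- On the xsl:stylesheet element:
     xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" -->
<xsl:value-of select="geo:lat"/>,
<xsl:value-of select="geo:long"/>
```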
- We now want to filter out those items (if any) that lack geolocalization information. For
this purpose, you can modify the select attribute of the <p:iteration-source> to keep
only items with geo:long and geo:lat child elements.
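With the geo prefix bound to the WGS84 namespace in the pipeline as well, this amounts to something like:

```xml
<p:iteration-source select="//item[geo:lat and geo:long]"/>
```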
- We will use the Yahoo! Maps Image
API
to add a map for each news item. Carefully study the API documentation and apply for a
Yahoo! Application ID.
- Replace the <p:identity> step with a <p:load> step that calls the Yahoo! Maps Image API
appropriately. Remember you can use any XPath expression inside a select attribute. In the
XSLT stylesheet, add a cell:
<td><xsl:value-of select="." /></td>
before the existing cell to display the URL of the map image. The display of the title and link
to the article no longer works because we discarded the news items, keeping only the
map image. We will fix this later on.
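One possible shape for this step is sketched below. The endpoint and parameter names are reconstructed from memory of the Yahoo! Maps Image API and must be verified against the official documentation; $appid is a hypothetical variable holding your Application ID:

```xml
<p:load>
  <p:with-option name="href"
      select="concat('http://local.yahooapis.com/MapsService/V1/mapImage',
                     '?appid=', $appid,
                     '&amp;latitude=', geo:lat,
                     '&amp;longitude=', geo:long)"/>
</p:load>
```

Note that the select attribute is an ordinary XPath expression evaluated against the current item, which is how geo:lat and geo:long are retrieved.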
- Replace the display of the URL with an HTML <img> element that loads this URL. In XSLT, to
input an XPath expression inside an arbitrary attribute, surround the XPath expression with
curly braces.
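For instance, if the context node is the document returned by the Map Image API, whose text content is the image URL, the cell could be written as:

```xml
<td><img src="{.}" alt="map"/></td>
```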
- To keep in the sequence both map images and information about news items, you will need
two <p:for-each> steps and a <p:pack> step to combine the two sequences. Refer to the
XProc specification. The <p:pack> step will introduce an extra wrapping element, so
remember to adapt the XPath expressions used in the XSLT stylesheets.
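Schematically, assuming the two <p:for-each> steps are given the names items and maps and expose their default result ports, the combination could look like the following (the pair wrapper element name is arbitrary):

```xml
<p:pack wrapper="pair">
  <p:input port="source">
    <p:pipe step="items" port="result"/>
  </p:input>
  <p:input port="alternate">
    <p:pipe step="maps" port="result"/>
  </p:input>
</p:pack>
```

Each output document is then a pair element wrapping one news item and one map image document, which is why the XPath expressions in the stylesheet need one extra navigation step.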