Rainbow IE Component

Last changed: March 2005
Please send bug reports and ideas for improvement to Labsky [at] vse.cz

Overview

This website demonstrates an information extraction (IE) component designed for the Rainbow project. The purpose of the IE component is to extract structured data from legacy web sites. For the demo, we chose the restricted domain of bicycle product advertisements, from which we extract structured information such as bicycle name, make, price, picture, color, size, and year, as well as a number of components a bike may have (about 12 attributes in total).

Using the demo

In the upper panel, type the URL of the web page from which you want to extract data. This web page should contain bicycle ads. Alternatively, you can choose a page from the sample documents (1-133) shown in the list box on the right. Sample documents 1-100 include the training data, so extraction results on some of these documents will be overly optimistic. Sample documents 101-133 do not include any training data.

Once you have chosen your document, you can perform extraction by pressing the Extract button. When extraction completes, you will see the original web page annotated with different colors corresponding to the individual attributes. When you hover the mouse pointer over an annotation, a text showing the attribute name will appear.

Depending on the extraction results, a table with extracted instances may appear at the beginning of the annotated page. Each row then corresponds to an extracted instance, which consists of some of the annotated attributes.

If you want the extracted instances to be in XML format, check the xml checkbox and you will receive an XML list of instances, without the annotated page.

If you want just the annotated document and no instance table in the output, use the Annotate button instead of Extract.

On the left, you can choose the IE model used to perform annotation. Currently, only the naive model is enabled, but we plan to add further models.

Technology used

The extractor is based on a Hidden Markov Model (HMM) trained on 90 web pages in which the desired product attributes were manually labeled (these 90 documents are scattered among the first 100 sample documents). The model consists of extraction (target) states, which produce tokens to be extracted, and of other states, which produce uninteresting tokens. For details, see Product Information Extraction from Semistructured Documents using HMMs. The HMM outputs a document with annotated attributes. To extract structured instances, we group attributes that belong together (e.g. a bike's name, price and picture) using a simple algorithm described in Multimedia Information Extraction from HTML Product Catalogues (this demo does not contain the image analyser described in the paper).
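
The split between target states and other states can be illustrated with a toy example. The following is only a minimal sketch, not the demo's trained model: the two states, all probabilities, and the emission rule are made-up values chosen for illustration. It runs standard Viterbi decoding and keeps the tokens assigned to the target state.

```python
import math

# Toy HMM: one target state ("price") and one background state ("other").
# All numbers below are invented for illustration; they are NOT the
# parameters of the model used by the demo.
STATES = ["other", "price"]
START = {"other": 0.8, "price": 0.2}
TRANS = {
    "other": {"other": 0.7, "price": 0.3},
    "price": {"other": 0.6, "price": 0.4},
}


def emission(state, token):
    """Hypothetical emission model: numeric tokens are likely prices."""
    is_number = token.replace(".", "", 1).isdigit()
    if state == "price":
        return 0.9 if is_number else 0.1
    return 0.1 if is_number else 0.9


def viterbi(tokens):
    """Return the most probable state sequence (Viterbi in log space)."""
    if not tokens:
        return []
    scores = [{s: math.log(START[s]) + math.log(emission(s, tokens[0]))
               for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for token in tokens[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = (prev[best] + math.log(TRANS[best][s])
                      + math.log(emission(s, token)))
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Follow the back pointers from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def extract(tokens, target="price"):
    """Keep only the tokens that were decoded into the target state."""
    return [t for t, s in zip(tokens, viterbi(tokens)) if s == target]


print(extract(["Trek", "bike", "only", "499"]))
```

A real model would have one or more target states per attribute and would learn its transition and emission probabilities from the labeled training pages.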

Web Service

All demonstrated functionality is also available as a web service; the demo itself is implemented using simple web service calls. A WSDL description of the service is available.

Clients can send a doAnnotate request with the following attributes, all of which are strings:

action
This can be either extract, annotate, or display. Setting action to extract will return a list of extracted instances, annotate will return an annotated page, and display will only return the original (unannotated) page.
url
The absolute URL of the page to be processed. Sample documents can be addressed using the dedicated test:// scheme, e.g. test://h0001.html.
model
This is the HMM model to be used for annotation. Currently, only bikes/all_naive/trn_all_0 is enabled.
format
Set this to xml to obtain an XML list of instances as output. If not specified, a complete annotated page will be returned. An instance table will be at the top of the page if action was extract.
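
As a rough sketch, the four attributes of a doAnnotate request could be assembled as follows. The helper name build_request and its keyword arguments are hypothetical; only the attribute names and values come from the description above. Actually sending the request requires a SOAP client generated from the WSDL, which is not shown here.

```python
# Values taken from the attribute descriptions above.
ALLOWED_ACTIONS = {"extract", "annotate", "display"}
DEFAULT_MODEL = "bikes/all_naive/trn_all_0"  # the only model currently enabled


def build_request(url, action="extract", model=DEFAULT_MODEL, xml=False):
    """Return the doAnnotate attributes as a dict of strings.

    build_request is a hypothetical helper, not part of the service.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError("action must be one of: " + ", ".join(sorted(ALLOWED_ACTIONS)))
    params = {"action": action, "url": url, "model": model}
    if xml:
        # Omitting format yields a complete annotated page instead.
        params["format"] = "xml"
    return params


# Request an XML list of instances for the first sample document.
print(build_request("test://h0001.html", xml=True))
```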

A response will consist of a doAnnotate message which has a single parameter:

return
This will contain the BASE64-encoded output of the extractor, which is either an annotated HTML page or an XML list of instances.
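
On the client side, the return value has to be BASE64-decoded before use. A minimal sketch is shown below; the function name decode_output is hypothetical, and the UTF-8 character encoding is an assumption, since the encoding of the extractor output is not stated here.

```python
import base64


def decode_output(return_value, encoding="utf-8"):
    """Decode the BASE64 'return' parameter into text.

    The default encoding is an assumption; adjust it if the decoded
    page or XML declares a different one.
    """
    raw = base64.b64decode(return_value)
    return raw.decode(encoding, errors="replace")


# Round trip with a stand-in payload (not real service output).
sample = base64.b64encode(b"<instances/>").decode("ascii")
print(decode_output(sample))
```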