This website demonstrates an information extraction (IE) component designed for the Rainbow project. The purpose of the IE component is to extract structured data from legacy web sites. For the demo, we chose the restricted domain of bicycle product advertisements, from which we extract structured information such as bicycle name, make, price, picture, color, size, and year, as well as a number of components a bike may have (about 12 attributes in total).
In the upper panel, type the URL of a web page from which you want to extract data. This web page should contain bicycle ads. Alternatively, you can choose a page from the sample documents (1-133) shown in the list box on the right. Sample documents 1-100 include training data, so extraction results on some of these documents will be overly optimistic. Sample documents 101-133 do not include any training data.
Once you have chosen a document, press the Extract button to perform extraction. When it completes, you will see the original web page annotated with different colors corresponding to the individual attributes. When you hover the mouse pointer over an annotation, a text showing the attribute name appears.
Depending on the extraction results, a table with extracted instances may appear at the beginning of the annotated page. Each row then corresponds to an extracted instance, which consists of some of the annotated attributes.
If you want the extracted instances to be in XML format, check the xml checkbox and you will receive an XML list of instances, without the annotated page.
If you want just the annotated document with no instance table in the output, use the Annotate button instead of Extract.
On the left, you can choose the IE model used to perform annotation. Currently, only the naive model is enabled, but we plan to add further models.
The extractor is based on a hidden Markov model (HMM) trained on 90 web pages in which the desired product attributes were manually labeled (these 90 documents are scattered through the first 100 sample documents). The model consists of extraction (target) states, which produce tokens to be extracted, and other states, which produce uninteresting tokens. For details, see Product Information Extraction from Semistructured Documents using HMMs. The HMM provides a document with annotated attributes. To extract structured instances, we group attributes that belong together (e.g. a bike's name, price, and picture) using a simple algorithm described in Multimedia Information Extraction from HTML Product Catalogues (this demo doesn't contain the image analyser described in the paper).
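To make the idea of target and other states concrete, here is a minimal sketch of HMM annotation by Viterbi decoding. It is not the trained model behind this demo: the three states, the token classes, and all probabilities are invented for illustration, and the helper names (`token_class`, `viterbi`, `labelled_spans`) are hypothetical.

```python
import math
from itertools import groupby

# "other" is the background state; "name" and "price" are target states.
# All states and probabilities are invented for illustration only.
STATES = ["other", "name", "price"]
START = {"other": 0.6, "name": 0.3, "price": 0.1}
TRANS = {
    "other": {"other": 0.6, "name": 0.2, "price": 0.2},
    "name": {"other": 0.3, "name": 0.6, "price": 0.1},
    "price": {"other": 0.7, "name": 0.1, "price": 0.2},
}
# Emissions are defined over coarse token classes rather than raw words.
EMIT = {
    "other": {"num": 0.1, "cap": 0.2, "low": 0.7},
    "name": {"num": 0.1, "cap": 0.8, "low": 0.1},
    "price": {"num": 0.9, "cap": 0.05, "low": 0.05},
}


def token_class(token):
    """Map a token to a toy observation class."""
    if any(ch.isdigit() for ch in token):
        return "num"
    if token[:1].isupper():
        return "cap"
    return "low"


def viterbi(tokens):
    """Return the most likely state sequence for a non-empty token list."""
    obs = [token_class(t) for t in tokens]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            pointers[s] = best
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def labelled_spans(tokens, labels):
    """Merge runs of identical target labels into (attribute, text) pairs."""
    spans, i = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        if label != "other":
            spans.append((label, " ".join(tokens[i:i + n])))
        i += n
    return spans


if __name__ == "__main__":
    tokens = ["for", "sale", "Trek", "Madone", "$1200", "only"]
    print(labelled_spans(tokens, viterbi(tokens)))
```

Tokens assigned to a target state become annotations, and tokens assigned to `other` are left unmarked. Grouping the resulting annotations into product instances is a separate step and is not shown here.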
All demonstrated functionality is also available as a web service - the demo itself is implemented using simple web service calls. A WSDL description of the service is available.
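As a rough illustration of what such a call might look like, the sketch below builds a SOAP 1.1 envelope for the doAnnotate request described next. It assumes the service speaks SOAP, which the WSDL suggests but this page does not confirm. Only the operation name `doAnnotate` and the `action` attribute with its values `extract`, `annotate`, and `display` come from this page. The namespace `urn:example:rainbow-ie`, the `url` element, and the function name are placeholders; the real names should be taken from the WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 namespace
SERVICE_NS = "urn:example:rainbow-ie"  # hypothetical; use the namespace from the WSDL
ACTIONS = {"extract", "annotate", "display"}  # values named on this page


def build_do_annotate(page_url, action="extract"):
    """Return a doAnnotate request envelope as a string (nothing is sent)."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}doAnnotate")
    # All request attributes are strings.
    ET.SubElement(request, "action").text = action
    # "url" is an assumed element name for the page to process.
    ET.SubElement(request, "url").text = page_url
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_do_annotate("http://example.com/bikes.html", "annotate"))
```

The function only builds the message. Posting it to the service endpoint is left out because the endpoint address is not given here.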
Clients can send a doAnnotate request with the following attributes, all of which are strings:
display. Setting action to extract will return a list of extracted instances, annotate will return an annotated page, and display will only return the original (unannotated) page.
xml to obtain an XML list of instances on output. If not specified, a complete annotated page will be returned. An instance table will be at the top of the page if action was
A response will consist of a doAnnotate message which has a single parameter: