Ex information extraction system
General description
Ex is an information extraction (IE) system based on extraction ontologies, developed by the Knowledge Engineering Group
(KEG) at UEP since 2006.
Extraction ontologies are used to extract both standalone named entities (standalone attributes) and instances
(groups of attributes that "belong together"). The advantage of this technology is that it can
combine multiple sources of extraction knowledge, which should lower the requirement for training data.
Ex can be used for extraction from heavily structured (e.g. tabular) documents,
semi-structured documents and also from free-text documents.
For a domain of interest, the user writes an extraction ontology.
An extraction ontology is structurally similar to a conventional domain ontology; however, it reflects the way
information is presented on the web rather than the inherent state of affairs, and is extended with
extraction knowledge that can be used to identify the described objects in text.
An extraction ontology can be viewed as a set of attribute definitions, class definitions and axiom definitions.
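The view of an extraction ontology as a set of attribute, class and axiom definitions can be sketched as a small data structure. The sketch below is only an illustration under stated assumptions: all class names, the "Contact" domain, the regular expressions and the sample axiom are hypothetical and are not taken from the Ex code base or from its ontology format. It also assumes, for simplicity, that the extraction knowledge attached to an attribute is a single value pattern.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch only: every class, field and pattern here is hypothetical
// and is not taken from the Ex code base or its ontology format.
public class ExtractionOntologySketch {

    /** Attribute definition: a name plus a value pattern acting as extraction knowledge. */
    static final class AttributeDef {
        final String name;
        final Pattern valuePattern;

        AttributeDef(String name, String valueRegex) {
            this.name = name;
            this.valuePattern = Pattern.compile(valueRegex);
        }

        boolean matches(String candidate) {
            return valuePattern.matcher(candidate).matches();
        }
    }

    /** Axiom definition: a named constraint over the attribute values of one instance. */
    interface AxiomDef {
        String name();
        boolean holds(Map<String, String> instance);
    }

    /** Class definition: attributes that "belong together", plus axioms constraining them. */
    static final class ClassDef {
        final String name;
        final List<AttributeDef> attributes = new ArrayList<>();
        final List<AxiomDef> axioms = new ArrayList<>();

        ClassDef(String name) { this.name = name; }

        /** Accepts an instance if every present value matches its attribute and all axioms hold. */
        boolean accepts(Map<String, String> instance) {
            for (AttributeDef a : attributes) {
                String value = instance.get(a.name);
                if (value != null && !a.matches(value)) return false;
            }
            for (AxiomDef ax : axioms) {
                if (!ax.holds(instance)) return false;
            }
            return true;
        }
    }

    /** The ontology itself: a set of attribute, class and axiom definitions. */
    static final class ExtractionOntology {
        final List<AttributeDef> attributes = new ArrayList<>();
        final List<ClassDef> classes = new ArrayList<>();
        final List<AxiomDef> axioms = new ArrayList<>();
    }

    /** Builds a tiny hypothetical ontology for contact information. */
    static ExtractionOntology buildContactOntology() {
        ExtractionOntology onto = new ExtractionOntology();
        AttributeDef phone = new AttributeDef("phone", "\\+?[0-9][0-9 ]{5,20}");
        AttributeDef email = new AttributeDef("email", "[\\w.]+@[\\w.]+\\.[a-z]{2,}");
        AxiomDef hasDetail = new AxiomDef() {
            public String name() { return "at-least-one-contact-detail"; }
            public boolean holds(Map<String, String> instance) {
                return instance.containsKey("phone") || instance.containsKey("email");
            }
        };
        ClassDef contact = new ClassDef("Contact");
        contact.attributes.add(phone);
        contact.attributes.add(email);
        contact.axioms.add(hasDetail);

        onto.attributes.add(phone);
        onto.attributes.add(email);
        onto.axioms.add(hasDetail);
        onto.classes.add(contact);
        return onto;
    }

    public static void main(String[] args) {
        ExtractionOntology onto = buildContactOntology();
        Map<String, String> candidate = new LinkedHashMap<>();
        candidate.put("phone", "+420 123 456 789");
        candidate.put("email", "someone@example.org");
        System.out.println("accepted: " + onto.classes.get(0).accepts(candidate));
    }
}
```

In this sketch a standalone attribute is simply an attribute definition matched on its own, while an instance is a map of attribute values checked against a class definition and its axioms. A real extraction ontology would typically carry richer evidence than a single pattern per attribute.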
Development of Ex is ongoing; on this website we publish development snapshots when we feel the code is
stable enough to be useful. The code is written in Java.
Ex is distributed under the LGPL license.
More information on Ex is downloadable here (newer material listed first):
- Labsky, M., Svatek, V., Nekvasil, M.: Multi-Paradigm and Multi-Lingual Information Extraction as Support for Medical Web Labelling Authorities, Journal of System Integration, Vol 1, No 4, 2010.
- Information Extraction using Extraction Ontologies (in Czech), invited talk at the Znalosti 2010 conference.
- Information Extraction from Websites using Extraction Ontologies, PhD Thesis, May 2009. Slides in Czech: ie_eo_dis2009.ppt
A summary of the thesis in Czech
- Short tutorial and new distribution "Carp2" containing sample extraction ontologies, May 2009 (fixes several Carp1 bugs)
- IE Based on Extraction Ontologies: Design, Deployment and Evaluation, Artificial Intelligence Seminar, 2008. Slides: ie_by_eo.ppt
- Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation, In: Workshop on Ontology Based IE Systems (OBIES), Kaiserslautern, 2008.
- Combining Multiple Sources of Evidence in Web Information Extraction, In: Proc. Intl. Symposium on Intelligent Systems (ISMIS), Toronto, 2008.
- Labsky, M., Svatek, V., Nekvasil, M., Rak, D.:
The Ex Project: Web Information Extraction using Extraction Ontologies.
In: Proc. Workshop on Prior Conceptual Knowledge in Machine Learning and Knowledge Discovery (PriCKL'07)
held within ECML/PKDD'07, Warsaw, Poland, September 21, 2007.
- Labsky, M., Nekvasil, M., Svatek, V.:
Towards Web Information Extraction using Extraction Ontologies and (Indirectly) Domain Ontologies.
Poster paper. In: Proc. 4th International Conference on Knowledge Capture, K-Cap 2007, Whistler, BC, Canada. ACM, to appear.
- Technical report, 2006 ex.pdf
- Poster about Ex from the BOEMIE Workshop, Podebrady 2006 exposter.zip (ppt)
- Web Image Classification for Information Extraction, In: 1st. Intl. Workshop on Representation and Analysis of Web Space (RAWS), Tocna, 2005.
- Information Extraction from HTML Product Catalogues: from Source Code and Images to RDF, In: Proc. Web Intelligence Conference (WI), Compiegne, 2005.
Distributions for download
(Ex requires Java 1.5 or higher)
- Distribution "carp2" May 2009 (still the "latest greatest" release; the next version may be exposed in the future as a REST web service - volunteer(s) wanted!)
- Distribution "tilapia" November 2008
binaries
sources
- Distribution "frogmouth" from September 15, 2007
binaries
- Distribution "anglerfish" from May 3, 2007
binaries
Contact
Martin Labsky, labsky at seznam dot cz
Marek Nekvasil, marek.nekvasil at gmail dot com
Vojtech Svatek, svatek at vse dot cz
This research is supported by the EC projects MedIEQ and K-Space.
Last modified: April 19, 2012.