Terminologies for text-mining; an experiment in the lipoprotein metabolism domain

Dimitra Alexopoulou; Thomas Wächter; Laura Pickersgill; Cecilia Eyre; Michael Schroeder

doi:10.1186/1471-2105-9-S4-S2

Research output: Contribution to journal › Research article › Contributed › peer-review

Contributors

Dimitra Alexopoulou, Biotechnology Center, Chair of Molecular Developmental Genetics (Author)
Thomas Wächter (Author)
Laura Pickersgill (Author)
Cecilia Eyre (Author)
Michael Schroeder, Chair of Bioinformatics (Author)

Abstract

BACKGROUND: The engineering of ontologies, especially with a view to text-mining use, is still a new research field. No well-defined theory or technology for ontology construction exists yet. Many ontology design steps remain manual and rely on personal experience and intuition. However, there are a few efforts toward automatic ontology construction, in the form of extracted lists of terms and the relations between them.

RESULTS: We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the terminology derived automatically by four different automatic term recognition (ATR) methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms, the best method still generates 51% relevant terms. A corpus of 3066 documents contains 53% of the LMO terms, and 38% can be generated with one of the methods.
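The percentages above are of two kinds: the share of relevant terms among the top-ranked candidates (precision at a cut-off k) and the share of ontology terms that can be recovered (coverage). The following is a minimal sketch of both measures, not the authors' evaluation code; the function names and the toy term lists are assumptions made up for illustration.

```python
def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the top-k ranked candidates that are judged relevant."""
    top_k = ranked_terms[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for term in top_k if term in relevant_terms)
    return hits / len(top_k)


def coverage(reference_terms, found_terms):
    """Fraction of reference terms (e.g. ontology terms) also present in found_terms."""
    if not reference_terms:
        return 0.0
    return len(reference_terms & found_terms) / len(reference_terms)


# Toy data, invented for illustration only (not taken from the paper).
ranked = ["lipoprotein", "cholesterol", "patient", "ldl receptor", "study"]
relevant = {"lipoprotein", "cholesterol", "ldl receptor", "chylomicron"}

print(precision_at_k(ranked, relevant, 3))  # 2 of the top 3 candidates are relevant
print(coverage(relevant, set(ranked)))      # 3 of the 4 reference terms were found
```

In this reading, "89% relevant terms in the top 50" corresponds to a precision at k = 50 of 0.89, and "53% of LMO terms are contained in the corpus" corresponds to a coverage of 0.53.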

CONCLUSIONS: Given high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed semi-automatically, taking ATR results as input and following the guidelines we described.

AVAILABILITY: The TFIDF term recognition is available as a Web Service, described at http://gopubmed4.biotec.tu-dresden.de/IdavollWebService/services/CandidateTermGeneratorService?wsdl.
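For readers unfamiliar with TFIDF-based term recognition, the general idea can be sketched as follows. This is a generic textbook TF-IDF ranking over single tokens, assuming a simple lowercase tokenizer and toy documents; it does not reproduce the Web Service above, and the names `tokenize` and `tfidf_rank` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_rank(documents):
    """Rank tokens by their TF-IDF score summed over all documents, highest first."""
    n_docs = len(documents)
    tokenized = [tokenize(doc) for doc in documents]

    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)                # relative frequency in this document
            idf = math.log(n_docs / doc_freq[term])  # 0 for tokens present in every document
            scores[term] += tf * idf

    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy documents, invented for illustration only.
docs = [
    "the lipoprotein binds the receptor",
    "the patient was treated",
    "the lipoprotein lipase hydrolyses the triglyceride",
]
for term, score in tfidf_rank(docs)[:5]:
    print(f"{term}\t{score:.3f}")
```

Tokens that occur in every document (here "the") receive a score of zero, which is what pushes generic vocabulary down the ranking. A practical term recognizer would also consider multi-word candidates and a background corpus, which this sketch omits.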

Details

Original language	English
Pages (from-to)	S2
Journal	BMC Bioinformatics
Volume	9 Suppl 4
Publication status	Published - 25 Apr 2008
Peer-reviewed	Yes

External IDs

PubMedCentral	PMC2367629
Scopus	44649186340
ORCID	/0000-0003-2848-6949/work/141543401

Keywords

Algorithms, Database Management Systems, Databases, Factual, Lipoproteins/classification, Natural Language Processing, Periodicals as Topic, Semantics, Software, Terminology as Topic

Research Portal of the TU Dresden
