
OMICS: A Journal of Integrative Biology

Volume 7, Number 1, 2003 © Mary Ann Liebert, Inc.

The Need for Dictionaries, Ontologies, and Controlled Vocabularies

HELEN M. BERMAN and JOHN WESTBROOK

Biology has now seen an enormous increase in the amount and types of data collected. Databases range from archival collections that hold the results of analyses from a broad field of biology to highly curated databases that hold a great deal of information about a very small field. In addition, laboratories routinely use laboratory information management systems (LIMS) to collect and organize experimental data. The need for a sound informatics infrastructure is therefore very high.

Any sound informatics infrastructure should provide for the following:

• Acquisition: Accurate collection/harvesting of information

• Exchange: Data exchange including format conversion with semantic precision

• Dissemination: Documentation, database schema definition, and application program interface definition

• Maximum software reusability

Data dictionaries provide an important underpinning to the informatics infrastructure. Such a dictionary needs to contain the following:

• Precise definitions and examples (the most valuable elements)

• Controlled vocabularies

• Allowed ranges and boundary conditions

• Data types

• Data relationships (parent-child/foreign key)
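The elements listed above can be pictured as fields of a single dictionary entry. The following is a minimal sketch under stated assumptions, not the structure of any real dictionary: the class `ItemDefinition`, the function `validate`, and the item names `experiment.method` and `experiment.resolution` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ItemDefinition:
    """One hypothetical dictionary entry; all field names are illustrative."""
    name: str
    description: str                        # the precise definition
    data_type: type
    examples: list = field(default_factory=list)
    allowed_values: Optional[set] = None    # controlled vocabulary, if any
    minimum: Optional[float] = None         # inclusive lower bound, if any
    maximum: Optional[float] = None         # inclusive upper bound, if any
    parent: Optional[str] = None            # item this one refers to (foreign key)


def validate(definition: ItemDefinition, value) -> list:
    """Return a list of problems found; an empty list means the value passes."""
    if not isinstance(value, definition.data_type):
        expected = definition.data_type.__name__
        return [f"{definition.name}: expected a value of type {expected}"]
    problems = []
    if definition.allowed_values is not None and value not in definition.allowed_values:
        problems.append(f"{definition.name}: {value!r} is not in the controlled vocabulary")
    if definition.minimum is not None and value < definition.minimum:
        problems.append(f"{definition.name}: {value!r} is below {definition.minimum}")
    if definition.maximum is not None and value > definition.maximum:
        problems.append(f"{definition.name}: {value!r} is above {definition.maximum}")
    return problems


# Hypothetical entries, not taken from any published dictionary.
METHOD = ItemDefinition(
    name="experiment.method",
    description="The experimental technique used to determine the structure.",
    data_type=str,
    examples=["X-RAY DIFFRACTION"],
    allowed_values={"X-RAY DIFFRACTION", "SOLUTION NMR"},
)
RESOLUTION = ItemDefinition(
    name="experiment.resolution",
    description="The high-resolution limit of the data, in angstroms.",
    data_type=float,
    examples=[1.8],
    minimum=0.0,
)
```

Calling `validate(METHOD, "X-RAY DIFFRACTION")` returns an empty list, while a value outside the vocabulary or below the lower bound returns one message per violated rule. Keeping the rules as data rather than code is what lets the same generic checker serve every item in the dictionary.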

The dictionary content can be expressed in any of many concrete formats and is not bound to a particular technology. In creating such a dictionary, the data model must be adequate to represent the particular subject matter. It must also be accessible to real users, in that it must be easy to understand and extend, and it needs to be supportable with reusable software tools.

mmCIF IS AN EXAMPLE OF SUCH A DICTIONARY

Several dictionaries of this type exist, including the Gene Ontology. The one used to support the Protein Data Bank (PDB, http://www.pdb.org) is the macromolecular Crystallographic Information File (mmCIF, http://www.deposit.pdb.org/mmcif). The mmCIF dictionary was created as part of a community effort mandated by the International Union of Crystallography (IUCr). It took several years to create the dictionary. The people involved had deep knowledge of the field and drew on other experts in writing the dictionary definitions. The first version of the dictionary was published in 1996. It had 1700 definitions that spanned the terms used to describe the crystallographic experiment as well as the results of the experiment: the three-dimensional structure. The data model is reusable, extensible, and easily imported into standard relational database engines.
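To illustrate why a dictionary with explicit data types and parent-child relationships maps readily onto a relational engine, the sketch below generates table definitions from a small set of category descriptions and loads them into an in-memory SQLite database. It is an assumption-laden toy, not the PDB's loader: the `CATEGORIES` layout, the table and column names, and the helper functions are hypothetical.

```python
import sqlite3

# Hypothetical category descriptions. Each column is
# (name, SQL type, is_primary_key, parent "table.column" or None).
CATEGORIES = {
    "entry": [("id", "TEXT", True, None)],
    "atom_site": [
        ("id", "INTEGER", True, None),
        ("entry_id", "TEXT", False, "entry.id"),
        ("cartn_x", "REAL", False, None),
    ],
}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement from one category description."""
    parts = []
    for name, sql_type, is_key, _ in columns:
        parts.append(f"{name} {sql_type}" + (" PRIMARY KEY" if is_key else ""))
    for name, _, _, parent in columns:
        if parent is not None:
            parent_table, parent_column = parent.split(".")
            parts.append(
                f"FOREIGN KEY ({name}) REFERENCES {parent_table} ({parent_column})"
            )
    return f"CREATE TABLE {table} ({', '.join(parts)})"


def build_database(categories):
    """Create every table in an in-memory database with key checks enabled."""
    connection = sqlite3.connect(":memory:")
    connection.execute("PRAGMA foreign_keys = ON")
    for table, columns in categories.items():
        connection.execute(create_table_sql(table, columns))
    return connection
```

With these tables in place, inserting an `atom_site` row whose `entry_id` has no matching `entry` row raises `sqlite3.IntegrityError`, so the parent-child relationships recorded in the dictionary are enforced by the database itself.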


Because so much care was taken with both the content and the syntax of the dictionary, the data can now be expressed in alternate formats. Thus, an mmCIF data file can be expressed either in the legacy PDB format or in the more modern XML. In addition, the dictionary itself can be expressed as an XML schema. This dictionary has proven to be an effective underpinning for every aspect of the PDB operation and is now serving as the basis for an Application Programming Interface and data exchange.
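The idea that one set of data can be written out in more than one concrete format can be sketched as follows. The example assumes a simplified tag-value layout and a simplified XML layout. Neither function is a real mmCIF or PDB serializer, and the record, its values, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A hypothetical record: one category with named items and made-up values.
RECORD = {
    "category": "cell",
    "items": {"length_a": "58.39", "length_b": "86.70"},
}


def to_tag_value(record):
    """Write the record as simplified '_category.item value' lines."""
    category = record["category"]
    lines = [f"_{category}.{name} {value}" for name, value in record["items"].items()]
    return "\n".join(lines)


def to_xml(record):
    """Write the same record as a simplified XML element."""
    element = ET.Element(record["category"])
    for name, value in record["items"].items():
        ET.SubElement(element, name).text = value
    return ET.tostring(element, encoding="unicode")
```

Both functions read the same in-memory record, so the format becomes a detail of output rather than of the data model, which is the property the dictionary is meant to guarantee.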

LESSONS LEARNED

The creation of the mmCIF dictionary was a voluntary effort. The data model and all of the definitions were created by committed members of the community who were convinced that such an effort was required if we were ever to cope with the large amounts of complex data that needed to be handled. Funds to support these sorts of initiatives will ensure that they go forward more efficiently and with the necessary peer review.

Address reprint requests to:

Dr. H. Berman
Department of Chemistry & Chemical Biology
Rutgers University
610 Taylor Road
Piscataway, NJ 08854
E-mail: berman@rcsb.rutgers.edu
