
Data Management Challenges for Molecular and Cell Biology: An Industry Perspective


OMICS: A Journal of Integrative Biology

Volume 7, Number 1, 2003 © Mary Ann Liebert, Inc.

VICTOR M. MARKOWITZ

Data management for molecular and cell biology involves the traditional areas of data generation and acquisition, data modeling, data integration, and data analysis. In industry, the main focus of the past several years has been the development of methods and technologies supporting high-throughput data generation, especially for DNA sequence and gene expression data (Cambridge Healthtech Institute, 2001). New technology platforms for generating biological data present data management challenges arising from the need to capture, organize, interpret, and archive vast amounts of experimental data. Platforms keep evolving, with new versions benefiting from technological improvements such as higher-density arrays and better probe selection for microarrays. This evolution raises the additional problem of collecting potentially incompatible data generated using different versions of the same platform, a problem encountered both when these data need to be integrated and when they need to be analyzed. Further challenges include qualifying the data generated using inherently imprecise tools and techniques and the high complexity of integrating data residing in diverse and poorly correlated repositories.

The data management challenges mentioned above, as well as other data management challenges (Jagadish and Olken, 2003), have been examined in the context of both traditional and scientific database applications. When considering these challenges, it is important to determine whether they require new or additional research, or can be addressed by adapting and/or applying existing data management tools and methods to the biological domain.

The experience gained at Gene Logic in developing data management systems for gene expression data (Gene Logic Products, 2003) suggests that existing data management tools and methods, such as commercial database management systems, data warehousing tools, and statistical methods, can be adapted effectively to the biological domain. For example, the development of Gene Logic's gene expression data management system has involved modeling and analyzing microarray data in the context of gene annotations (including sequence data from a variety of sources), pathways, and sample (e.g., morphology, demography, clinical) annotations, and has been carried out using or adapting existing tools. Dealing with data uncertainty or inconsistency for experimental data has required statistical, rather than data management, methods; adapting statistical methods to gene expression data analysis at various levels of granularity has been the subject of intense research and development in recent years (Oz, 2003). The most difficult problems have been encountered in the area of data semantics: properly qualifying data values (e.g., an estimated expression value) and their relationships, especially in the context of continuously changing platforms and evolving biological knowledge. While such problems are encountered across all data management areas, from data generation through data collection and integration to data analysis, the solutions require domain-specific knowledge and extensive data definition and curation work, with data management providing only the framework (e.g., controlled vocabularies, ontologies) to address these problems.

In an industry setting, solutions to data management challenges need to be considered in terms of complexity, cost, robustness, performance, and other user- and product-specific requirements. Devising effective solutions for biological data management problems requires a thorough understanding of the biological application, the data management field, and the overall context in which the problems are considered. Inadequate understanding of the biological application and of data management technology and practices seems to present more problems than the limitations of existing data management technology in supporting structures or queries specific to biological data.

REFERENCES

CAMBRIDGE HEALTHTECH INSTITUTE. (2001). Bioinformatics: getting results in the era of high-throughput genomics. Cambridge Healthtech Institute Report 9.

GENE LOGIC PRODUCTS. (2003). Available: www.genelogic.com/products.htm.

JAGADISH, H.V., and OLKEN, F. (2003). NSF workshop proposal. Available: http://pueblo.lbl.gov/~olken/wdmbio/wsproposal/1.htm.

OZ. (2003). Available: http://oz.berkeley.edu/users/terry/zarray/Html/index/html.

Address reprint requests to:
Dr. Victor M. Markowitz
Data Management Systems
Gene Logic, Inc.
2001 Center Street
Berkeley, CA 94704
E-mail: markowitz@genelogic.com
