Towards a Toxico-Chemogenomic Future: The Transformation of Public Gene Expression Data and Consideration for its Use.

Abstract

The term “toxico-chemogenomics” is used to convey the extension of toxicogenomics to a broader survey of gene expression changes across chemical space. Moving towards an improved, publicly available toxico-chemogenomics capability requires not only common data standards and protocols across public resources, but also broad data coverage within the chemical, genomics and toxicological information domains, and transparent, functional linkages between Internet data resources. The first goal of this project was to assess the current extent of standardization, interoperability, and chemical indexing of public genomics resources with respect to their toxico-chemogenomics utility. Focusing on the largest of these public data resources, Gene Expression Omnibus (GEO) and ArrayExpress, the second goal was to chemically index the full experimental content of these repositories in order to assess the current coverage of chemical exposure-related microarray experiments in relation to chemical space and toxicology, and to make these data accessible alongside other publicly available, chemically-indexed toxicological information. Current standards for chemical annotation within ArrayExpress and GEO are inadequate to this task, so new methodologies for mining the author-submitted content had to be developed. A series of automated Perl programs was used, along with extensive manual review, to transform the raw experiment/study descriptions and text files into a standardized, chemically-indexed inventory of microarray experiments in both resources. These files and top-level experiment annotations allowed identification of all current chemical-associated experimental content, as well as the subset of chemical exposure-related (or “Treatment”) content deemed most relevant to toxicogenomics, in the GEO Series and ArrayExpress Repository experiment inventories.
With chemical exposure experiments suitably indexed by chemical structure, it is possible for the first time to assess the breadth of chemical study space represented in these databases, as well as the overlapping chemical content, and to begin to assess whether the data are sufficient for making chemical similarity inferences. Chemical indexing of public genomics databases is also the first step towards integrating chemical, toxicological and genomics data into predictive toxicology, because it provides linkages across public resources. The main products of this effort are the following: (1) published, downloadable and structure-searchable DSSTox Structure-Index (Locator) files for both the GEO Series (GEOGDS) and ArrayExpress Repository (ARYEXP), containing standard chemical fields for the unique chemical “Treatment” subset, accompanied by URLs to AccessionID experiment pages in GEO and ArrayExpress; (2) published, downloadable DSSTox Aux data files for GEOGDS and ARYEXP, providing a chemical-experiment pair index to all chemical-associated content in each resource and containing 14 standard genomics fields (e.g., Experiment_Title, Experiment_Description, Experiment_ArrayType, Species, Number_Samples, etc.) and source-specific fields extracted from each resource (e.g., MIAME_Protocol, MIAMI_Factors, etc. for ArrayExpress); and (3) incorporation of the “Treatment” chemical-experiment pair index, with URLs linked directly to AccessionID pages for GEO and ArrayExpress, into the National Center for Biotechnology Information (NCBI) PubChem resource. A secondary product of this effort is a methodological discussion of the proper use of public microarray data, with a demonstrative analysis showing how the newly identified public microarray data might be used.
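
As an illustration only, the idea of a chemical-experiment pair index can be sketched in a few lines of Python. This is not the Perl pipeline used in the thesis, and it omits the manual review step entirely. The chemical dictionary, the identifiers (CHEM-0001, E-0001, ...), the record layout and the function name build_pair_index are all hypothetical and chosen only for the example; the sketch simply matches known chemical names as whole words in each experiment's title and description.

```python
import re

# Hypothetical chemical dictionary: name -> made-up identifier (assumption).
CHEMICALS = {
    "acetaminophen": "CHEM-0001",
    "benzene": "CHEM-0002",
}

# Hypothetical experiment records; accession IDs are invented for the example.
EXPERIMENTS = [
    {"accession": "E-0001",
     "title": "Liver response to acetaminophen",
     "description": "Rats treated with acetaminophen or vehicle"},
    {"accession": "E-0002",
     "title": "Time course of heat shock",
     "description": "No chemical treatment"},
    {"accession": "E-0003",
     "title": "Benzene exposure in mice",
     "description": "Bone marrow profiling after benzene inhalation"},
]


def build_pair_index(experiments, chemicals):
    """Return (chemical_id, chemical_name, accession) tuples.

    One tuple is emitted per chemical per experiment whenever the chemical
    name occurs as a whole word in the title or description.
    """
    pairs = []
    for exp in experiments:
        text = f"{exp['title']} {exp['description']}".lower()
        for name, chem_id in sorted(chemicals.items()):
            # Whole-word, case-insensitive match on the free-text fields.
            if re.search(rf"\b{re.escape(name)}\b", text):
                pairs.append((chem_id, name, exp["accession"]))
    return pairs


if __name__ == "__main__":
    for pair in build_pair_index(EXPERIMENTS, CHEMICALS):
        print(pair)
```

A simple name match like this would miss synonyms, abbreviations and misspellings in author-submitted text, which is one reason a real indexing effort needs curated chemical identifiers and manual review.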

Keywords

gene expression, text mining, database, microarray, chemoinformatics, toxicogenomics

Degree

PhD

Discipline

Bioinformatics
