Pancreatic Expression database: a generic model for the organization, integration and mining of complex cancer datasets
Pancreatic cancer is the 5th leading cause of cancer death in both males and females. In recent years, a wealth of gene and protein expression studies has been published, broadening our understanding of pancreatic cancer biology. Owing to the explosive growth in publicly available data from multiple sources, it is becoming increasingly difficult for individual researchers to integrate these data into their current research programmes. The Pancreatic Expression database, a generic web-based system, aims to close this gap by providing the research community with an open access tool, not only to mine currently available pancreatic cancer data sets but also to include their own data in the database.
Currently, the database holds 32 datasets comprising 7636 gene expression measurements extracted from 20 different published gene or protein expression studies from various pancreatic cancer types, pancreatic precursor lesions (PanINs) and chronic pancreatitis. The pancreatic data are stored in a data management system based on the BioMart technology alongside the human genome gene and protein annotations, sequence, homologue, SNP and antibody data. The database can be interrogated through both a web-based query interface and web services, using combined criteria from pancreatic data (disease stages, regulation, differential expression, expression, platform technology, publication) and/or public data (antibodies, genomic region, gene-related accessions, ontology, expression patterns, multi-species comparisons, protein data, SNPs). Thus, our database enables connections between otherwise disparate data sources and allows relatively simple navigation between all data types and annotations.
The database structure and content provide a powerful and high-speed data-mining tool for cancer research. It can be used for target discovery (i.e. of biomarkers from body fluids), identification and analysis of genes associated with the progression of cancer, cross-platform meta-analysis, SNP selection for pancreatic cancer association studies, cancer gene promoter analysis and mining of cancer ontology information. The data model is generic and can be easily extended and applied to other types of cancer. The database is available online with no restrictions for the scientific community at http://www.pancreasexpression.org/.
Keywords: Pancreatic Cancer; Chronic Pancreatitis; Intraductal Papillary Mucinous Neoplasm; cDNA Array; Normal Pancreas
Pancreatic ductal adenocarcinoma (PDAC) usually presents at an advanced stage, so that surgical cure is rarely achieved and conventional chemotherapy and radiotherapy have little impact, resulting in a very low 5-year survival rate (0.5%–5%). A number of laboratories have therefore focused on studying the evolution of pancreatic cancer from its earliest stages (pancreatic intraepithelial neoplasias or PanINs), putting pancreatic cancer among the best studied tumour tissue types at the molecular level. As a result, a wealth of information regarding mutated and aberrantly expressed genes, miRNAs and proteins is now available, not only significantly boosting our biological understanding of the disease but also helping to identify new (early) diagnostic and therapeutic targets. Unfortunately, the huge and still rising volume and diversity of public pancreatic datasets make it increasingly difficult for researchers to integrate this information into their current research efforts. In this report, we describe a dedicated Pancreatic Expression database that aims to overcome this restriction, and furthermore propose it as a generic model for the organization, integration and presentation of complex cancer research data. The model is designed to address various research problems, ranging from the specimen origin and type, through cancer development stages, to expression patterns. By bringing complex profiling data together, the Pancreatic Expression database should enable scientists worldwide to perform a whole range of user-friendly queries, from deciphering the biological mechanisms underlying pancreatic disease to target discovery.
Construction and Content
The aim of the Pancreatic Expression database is to provide a comprehensive mining tool for large-scale genomic, transcriptomic and proteomic data sets. To achieve this, we designed a robust internal structure encompassing specific pre-defined modules (which can be found under the "Filters" section in the database), including the "pancreatic specimen/cell type", "pancreatic differential expression information", "genes differentially expressed in" and "genes expressed in" modules. Our design enables uploading of any available (pancreatic) datasets that comply with the structure of the pre-defined modules. Each module contains a number of subcategories related to the module name; these are essential for storing user-defined sub-datasets in the database and for retrieving them by setting filters on the specific subcategories within each module. The "pancreatic specimen/cell type" module covers normal specimens (microdissected ductal cells (ND) or bulk normal pancreas (NP), acinar cells, islet cells, stromal cells and pancreatic stellate cells) and disease specimens of both exocrine origin (pancreatic intraepithelial neoplasias (PanIN-1A, PanIN-1B, PanIN-2, PanIN-3), chronic pancreatitis (CP), pancreatic adenocarcinoma (PDAC), intraductal papillary mucinous neoplasms (IPMN), mucinous cystic tumours and ampullary carcinoma) and endocrine origin (functioning and non-functioning tumours). Moreover, pancreatic juice, plasma, urine, serum and fine needle aspirates are included as additional options to allow for future expansion of the database. The "pancreatic differential expression information" module provides information on direction of regulation (up- and down-regulation), fold-change, SAGE tag number and whether a gene or protein was found to be expressed only in pancreatic adenocarcinoma (PDAC) or in normal pancreas.
The "genes differentially expressed in" module enables more defined selection of comparisons, such as pancreatic adenocarcinoma (PDAC) versus normal pancreas (bulk tissue or microdissected normal ductal cells), chronic pancreatitis (CP) versus normal pancreas (bulk tissue or microdissected normal ductal cells), chronic pancreatitis (CP) versus pancreatic adenocarcinoma (PDAC), pancreatic intraepithelial neoplasias (PanIN-1A, PanIN-1B, PanIN-2 or PanIN-3) versus normal pancreas (NP) or microdissected normal ductal cells (ND), etc. The "genes expressed in" module lists the genes expressed in the tissue types defined in the "pancreatic specimen/cell type" module, irrespective of their mode of regulation (whether they are differentially expressed or not). The "platform technology" module enables the selection of the technology used, such as Affymetrix arrays, cDNA arrays, Sanger human 10K cDNA arrays version 1.2.1, Sanger custom 5K1 cDNA arrays, Clontech Atlas Human Cancer cDNA Expression Array, SAGE, Agilent Human Genome CGH array, 2D PAGE, SELDI, etc.
The data are stored in a data management system created using MySQL and based on the open-source BioMart technology, a simple, federated query system designed specifically for use with large datasets. We imported the available Ensembl human genome annotations (Ensembl release 41) for genes and proteins, SNP information, sequences, gene structure and multi-species data, enabling the integration and annotation of heterogeneous pancreatic cancer data. To avoid integration and annotation errors, we used the pre-established Ensembl annotations and microarray probe set mapping. Ensembl links to the UniProt/Swiss-Prot, RefSeq and UniProt/TrEMBL databases are made on the basis of sequence similarity. All other subsequent links are inferred from these mappings. Ensembl also establishes mappings to microarray probe set identifiers by matching probe set sequences to Ensembl transcripts.
We also integrated the antibody data from the Human Protein Atlas  based on Ensembl gene ID.
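As an illustration only, the module structure and the Ensembl-based integration described above can be sketched in a few lines of Python. The record fields loosely mirror the pre-defined modules, and public annotations (here, antibodies) are attached through the shared Ensembl gene ID. All class names, field names, identifiers and values are hypothetical and do not reflect the actual database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpressionRecord:
    """One curated measurement; field names are illustrative, not the real schema."""
    ensembl_gene_id: str            # join key shared with the public annotations
    specimen: str                   # "pancreatic specimen/cell type" module
    comparison: str                 # "genes differentially expressed in" module
    regulation: str                 # "up" or "down"
    fold_change: Optional[float]    # None when the study reports no fold-change
    platform: str                   # "platform technology" module


def apply_filters(records, **criteria):
    """Keep only the records whose fields equal every supplied criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


def annotate(records, antibodies_by_gene):
    """Pair each record with the antibodies registered for its Ensembl gene ID."""
    return [(r, antibodies_by_gene.get(r.ensembl_gene_id, [])) for r in records]


# Made-up rows and identifiers, for illustration only.
records = [
    ExpressionRecord("ENSG_EXAMPLE_1", "PDAC", "PDAC vs NP", "up", 4.2, "cDNA arrays"),
    ExpressionRecord("ENSG_EXAMPLE_2", "CP", "CP vs NP", "down", 2.5, "SAGE"),
    ExpressionRecord("ENSG_EXAMPLE_3", "PDAC", "PDAC vs ND", "up", None, "2D PAGE"),
]
antibodies_by_gene = {"ENSG_EXAMPLE_1": ["AB_EXAMPLE_1"]}

hits = apply_filters(records, specimen="PDAC", regulation="up")
annotated = annotate(hits, antibodies_by_gene)
```

Combining several keyword criteria in `apply_filters` corresponds to setting filters in more than one module at once; genes without a matching antibody simply receive an empty list.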
The Pancreatic Expression database currently contains 32 datasets from 20 different published sources, from 14 international laboratories encompassing 22 different platforms (Affymetrix GeneChip Human Full Length Array HuGeneFL, Affymetrix GeneChip Human Genome U95 Set (HG-U95A, HG-U95B, HG-U95C, HG-U95D, HG-U95E), Affymetrix GeneChip Human Genome U133 Array Set (HG-U133A, HG-U133B), 2D PAGE, cDNA arrays, SAGE, Operon oligo array version 2.0, Clontech Atlas Human Cancer cDNA Expression Array, immunohistochemistry, in situ hybridisation, Oligo array, MALDI, mass spectrometry, Sanger human 10K cDNA arrays version 1.2.1, Sanger custom 5K1 cDNA arrays, United Gene Technique Ltd, BD PowerBlot Western array and qRT-PCR) [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]. These initial datasets provide valuable information on 7636 gene expression measurements from a first-pass selection of relevant papers in the field of pancreatic research; the inclusion of additional relevant datasets will be a continuous process. All the datasets were manually processed, checked for accuracy and consistency, and loaded into our relational database alongside annotations from several public resources such as Ensembl, GO, dbSNP, UniProt and the Human Protein Atlas. Currently, several modules are present for which data are either not yet incorporated or not available (ICAT, iTRAQ); it is our intention to continuously extend the current data content and to populate all the existing modules as and when the data become available.
Utility and Discussion
Examples of use
Navigation between all data types is simple and user-friendly; a variety of possible query combinations allows researchers to quickly determine the most de-regulated genes and proteins across all platforms.
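One simple way to picture such a cross-platform query is to rank genes by the number of distinct platforms that report them as de-regulated in the same direction. The Python sketch below does this for a handful of made-up observations. The ranking criterion, the function name and the gene labels are assumptions made for illustration; the text does not specify how "most de-regulated" is determined.

```python
from collections import defaultdict


def rank_by_platform_support(observations):
    """Rank (gene, direction) pairs by how many distinct platforms report them."""
    platforms = defaultdict(set)
    for gene, direction, platform in observations:
        platforms[(gene, direction)].add(platform)
    rows = [(gene, direction, len(seen)) for (gene, direction), seen in platforms.items()]
    # Most widely supported first; ties broken alphabetically for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))


# Made-up observations: (gene, direction of regulation, platform).
observations = [
    ("GENE_A", "up", "Affymetrix arrays"),
    ("GENE_A", "up", "SAGE"),
    ("GENE_A", "up", "cDNA arrays"),
    ("GENE_B", "down", "cDNA arrays"),
    ("GENE_B", "down", "cDNA arrays"),  # repeated reports on one platform count once
    ("GENE_C", "up", "2D PAGE"),
    ("GENE_C", "up", "SAGE"),
]

ranking = rank_by_platform_support(observations)
```

Counting distinct platforms rather than raw reports keeps a single heavily sampled technology from dominating the ranking; other criteria, such as fold-change, could be substituted in the sort key.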
Our integration model brings together relevant pancreatic cancer datasets and annotations from public sources and enables scientists to perform a wide variety of complex queries on various types of data. The design of the database allows easy integration of additional modules and annotations from new public databases.
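For programmatic access, BioMart-based servers generally accept an XML query document that names a dataset, the filters to apply and the attributes to return. The Python sketch below only builds such a document and the corresponding request URL; it does not contact the server. It assumes that the service endpoint listed in the references accepts the usual BioMart XML query passed as a `query` parameter. The dataset, filter and attribute names are hypothetical placeholders, and the real names would have to be looked up in the web-based query interface.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Web services endpoint as listed in the references of this article.
MARTSERVICE_URL = "http://www.pancreasexpression.org/biomart/martservice"


def build_query_xml(dataset, filters, attributes):
    """Build a BioMart-style XML query for one dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def build_request_url(query_xml, base_url=MARTSERVICE_URL):
    """URL-encode the XML document into the `query` parameter."""
    return base_url + "?" + urlencode({"query": query_xml})


# All names below are hypothetical placeholders, not real configuration names.
query_xml = build_query_xml(
    dataset="hypothetical_pancreatic_dataset",
    filters={"hypothetical_specimen_filter": "PDAC"},
    attributes=["hypothetical_gene_id_attribute", "hypothetical_fold_change_attribute"],
)
request_url = build_request_url(query_xml)
```

The resulting `request_url` could then be fetched with any HTTP client, and the tab-separated response parsed line by line.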
The Pancreatic Expression database constitutes a unique and valuable resource for the wider cancer research community, and is in rapid and constant development. We aim to continuously import new data sources and update the database on a regular basis, and invite scientists worldwide to deposit and share their data.
Although it was initially constructed using pancreatic cancer expression datasets, the system we have designed and implemented is generic and can be easily modified and applied to any other type of cancer. The system is available for collaboration with all interested research groups, either by extending it to include other cancer data or by sharing our model should they wish to adopt it for their own data.
Availability and requirements
Project name: Pancreatic Expression database
Project home page: http://www.pancreasexpression.org
Operating system(s): Platform independent; Standard WWW browser (Safari, Firefox)
Programming language: Perl, SQL, BioMart data management system
Licence: The database is freely available to academic and non-academic users. However, should you find the Pancreatic Expression database useful to your work, please cite this paper.
The funding for this project was obtained through FW6 EU project MolDiag-Paca and Cancer Research UK programme grant C355/A6253.
Disclaimer: Information in the Pancreatic Expression database is curated from highly relevant published pancreatic cancer papers. However, the quality and accuracy of the published data are solely the responsibility of the authors. The Pancreatic Expression database is a tool for mining the literature rather than a substitute for experiments. We strongly recommend that researchers trace the origin of the data to check whether they comply with their own quality standards. We also recommend that researchers apply independent technologies to confirm data retrieved through our mining tool before integrating them into their individual research efforts.
- 2. Pancreatic Expression database. [http://www.pancreasexpression.org]
- 3. MySQL. [http://www.mysql.com]
- 4. BioMart. [http://www.biomart.org]
- 5. Ensembl. [http://www.ensembl.org]
- 6. Ensembl microarray probeset mapping. [http://www.ensembl.org/info/about/docs/microarray_probe_set_mapping.html]
- 7. Human Protein Atlas. [http://www.proteinatlas.org]
- 8. Van Heek NT, Maitra A, Koopmann J, Fedarko N, Jain A, Rahman A, Iacobuzio-Donahue CA, Adsay V, Ashfaq R, Yeo CJ, Cameron JL, Offerhaus JA, Hruban RH, Berg KD, Goggins M: Gene expression profiling identifies markers of ampullary adenocarcinoma. Cancer Biol Ther. 2004, 3 (7): 651-656.
- 10. Anderson NL, Polanski M, Pieper R, Gatlin T, Tirumalai RS, Conrads TP, Veenstra TD, Adkins JN, Pounds JG, Fagan R, Lobley A: The human plasma proteome: a nonredundant list developed by combination of four separate sources. Mol Cell Proteomics. 2004, 3 (4): 311-326. 10.1074/mcp.M300127-MCP200.
- 14. Crnogorac-Jurcevic T, Gangeswaran R, Bhakta V, Capurso G, Lattimore S, Akada M, Sunamura M, Prime W, Campbell F, Brentnall TA, Costello E, Neoptolemos J, Lemoine NR: Proteomic analysis of chronic pancreatitis and pancreatic adenocarcinoma. Gastroenterology. 2005, 129 (5): 1454-1463. 10.1053/j.gastro.2005.08.012.
- 15. Crnogorac-Jurcevic T, Missiaglia E, Blaveri E, Gangeswaran R, Jones M, Terris B, Costello E, Neoptolemos JP, Lemoine NR: Molecular alterations in pancreatic carcinoma: expression profiling shows that dysregulated expression of S100 genes is highly prevalent. J Pathol. 2003, 201 (1): 63-74. 10.1002/path.1418.
- 17. Grutzmann R, Pilarsky C, Ammerpohl O, Luttges J, Bohme A, Sipos B, Foerder M, Alldinger I, Jahnke B, Schackert HK, Kalthoff H, Kremer B, Kloppel G, Saeger HD: Gene expression profiling of microdissected pancreatic ductal carcinomas using high-density DNA microarrays. Neoplasia. 2004, 6 (5): 611-622. 10.1593/neo.04295.
- 19. Iacobuzio-Donahue CA, Maitra A, Shen-Ong GL, van Heek T, Ashfaq R, Meyer R, Walter K, Berg K, Hollingsworth MA, Cameron JL, Yeo CJ, Kern SE, Goggins M, Hruban RH: Discovery of novel tumor markers of pancreatic cancer using global gene expression technology. Am J Pathol. 2002, 160 (4): 1239-1249.
- 20. Logsdon CD, Simeone DM, Binkley C, Arumugam T, Greenson JK, Giordano TJ, Misek DE, Kuick R, Hanash S: Molecular profiling of pancreatic adenocarcinoma and chronic pancreatitis identifies multiple genes differentially regulated in pancreatic cancer. Cancer Res. 2003, 63 (10): 2649-2657.
- 23. Nakamura T, Furukawa Y, Nakagawa H, Tsunoda T, Ohigashi H, Murata K, Ishikawa O, Ohgaki K, Kashimura N, Miyamoto M, Hirano S, Kondo S, Katoh H, Nakamura Y, Katagiri T: Genome-wide cDNA microarray analysis of gene expression profiles in pancreatic cancers using populations of tumor cells and normal ductal epithelial cells selected for purity by laser microdissection. Oncogene. 2004, 23 (13): 2385-2400. 10.1038/sj.onc.1207392.
- 24. Segara D, Biankin AV, Kench JG, Langusch CC, Dawson AC, Skalicky DA, Gotley DC, Coleman MJ, Sutherland RL, Henshall SM: Expression of HOXB2, a retinoic acid signaling target in pancreatic cancer and pancreatic intraepithelial neoplasia. Clin Cancer Res. 2005, 11 (9): 3587-3596. 10.1158/1078-0432.CCR-04-1813.
- 25. Shen J, Person MD, Zhu J, Abbruzzese JL, Li D: Protein expression profiles in pancreatic adenocarcinoma compared with normal pancreatic tissue and tissue affected by pancreatitis as detected by two-dimensional gel electrophoresis and mass spectrometry. Cancer Res. 2004, 64 (24): 9018-9026. 10.1158/0008-5472.CAN-04-3262.
- 28. Pancreatic Expression database web-based query interface. [http://www.pancreasexpression.org/biomart/martview]
- 29. Bioconductor. [http://www.bioconductor.org]
- 31. R project. [http://www.r-project.org]
- 33. Galaxy. [http://main.g2.bx.psu.edu]
- 34. Pancreatic Expression database access through web services. [http://www.pancreasexpression.org/biomart/martservice]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.