Evaluation of a gene information summarization system by users during the analysis process of microarray datasets
Summarization of gene information in the literature has the potential to help genomics researchers translate basic research into clinical benefits. Gene expression microarrays have been used to study biomarkers for disease and to discover novel types of therapeutics, and finding information in journal articles on sets of genes is a common task for translational researchers working with microarray data. However, manually searching and scanning the literature references returned from PubMed is time-consuming for scientists. We built and evaluated an automatic summarizer of information on genes studied in microarray experiments. The Gene Information Clustering and Summarization System (GICSS) integrates two related steps of the microarray data analysis process: functional gene clustering and gene information gathering. The system was evaluated while genomic researchers were analyzing their own experimental microarray datasets.
The clusters generated by GICSS were validated by scientists during their microarray analysis process. In addition, presenting sentences in the abstract provided significantly more important information to the users than just showing the title in the default PubMed format.
The evaluation results suggest that GICSS can be useful for researchers in the genomics area. In addition, the hybrid evaluation method, partway between intrinsic and extrinsic system evaluation, may enable researchers to gauge the true usefulness of the tool for scientists in their natural analysis workflow and also elicit suggestions for future enhancements.
GICSS can be accessed online at: http://ir.ohsu.edu/jianji/index.html
Keywords: Gene Ontology, Microarray Experiment, MeSH, Name Entity Recognition, Relevance Judgment
Gene microarray technology is frequently used in biomedical research to investigate the differential expression levels of genes across the whole genome under different conditions, e.g. control vs. diseased, young vs. aged. For instance, experiments can be performed to compare gene expression between normal and breast cancer tissues. These results are already translating into changes in clinical practice. Since these experiments can measure the expression level of tens of thousands of genes simultaneously, the analysis of the results is nontrivial because of the large data size.
Searching literature databases such as MEDLINE for information on the differentially expressed genes is a necessary task for translational researchers during the analysis of a microarray experiment. With the increasing volume of published full-text scientific articles, even the most robust information retrieval (IR) engine returns more documents than scientists are able to review manually. One approach to this issue is to automatically produce customized summaries for users who are analyzing the results of a specific microarray experiment. Summarization is defined by Sparck Jones as "a reductive transformation of source text to summary text through content reduction selection and/or generalization on what is important in the source". Automatic summarization systems have been studied since the late 1950s [3, 4] and applied in different domains with some notable success, but they have been less well studied in the biomedical domain. The information of most interest to scientists may reside in sentences describing a specific biological process, such as phosphorylation or activation, or the relationship between genes and certain medical conditions. These specific information requirements can be exploited in the biomedical domain by emphasizing domain-specific keywords when extracting important information and constructing summaries.
By exploiting domain terminology and the analysis workflow of microarray experiments, we adapted the automatic summarization technology of Edmundson to the biomedical domain. Focusing on the analysis of differentially expressed gene sets from microarray data, the Gene Information Clustering and Summarization System (GICSS) consists of a two-step process. First, the gene set is clustered into functionally related groups based on free text, Medical Subject Headings (MeSH), and Gene Ontology (GO) terms. Next, a summary for each gene is generated as sentences ranked by features such as domain vocabulary, length, representation of its functional cluster, cue words and recency. This is a novel approach, since previous work focused either on functional gene clustering [7, 8] or on gene information summarization, but did not integrate these two related steps of the microarray data analysis process.
Evaluation is a critical part of any system development. Since the ultimate goal of a summarizer is to present succinct information from the literature to practicing biomedical researchers, extrinsic evaluation, which measures how useful the system is to the intended end users, has been heralded by experts in the field. However, text-mining and automatic summarization systems still lag behind information retrieval systems in routine use, and user-oriented evaluation presents greater challenges in labor intensiveness and study design than intrinsic evaluation (which measures how good the system is). In the evaluation of this project, biomedical researchers served as the subjects, evaluating the output of our system in a real-world context using their own microarray data. This allowed us to create a gold standard and use intrinsic evaluation measurements. The experts judged how well the clustering and sentence ranking algorithms worked with the microarray experiment data they were analyzing at the time of evaluation. Our approach is a hybrid paradigm, providing measures based on actual tasks important to the subjects with reduced human labor. This is one step closer to the ultimate goal of real-time, user-oriented evaluation.
Clustering of genes into functionally related groups
We modeled the genes using three categories of features: MeSH Headings, GO terms, and text words from sentences with at least one reference to a gene. Each gene is modeled as a vector combining these three categories of features. The direct k-means clustering algorithm was used to find the functional clusters.
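The clustering step can be illustrated with a minimal sketch. It is not the implementation used in GICSS: the gene names, feature values, cosine similarity measure, and the seeding from the first k vectors are all assumptions made for illustration, and the helper names (`build_vector`, `kmeans`) are hypothetical.

```python
import math

def build_vector(mesh, go, words):
    """Combine the three feature categories into one sparse vector (term -> count)."""
    vec = {}
    for prefix, terms in (("mesh", mesh), ("go", go), ("text", words)):
        for term in terms:
            key = f"{prefix}:{term}"
            vec[key] = vec.get(key, 0.0) + 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for k, v in vec.items():
            total[k] = total.get(k, 0.0) + v
    return {k: v / len(vectors) for k, v in total.items()}

def kmeans(vectors, k, iterations=20):
    """Plain k-means: assign each vector to its most similar centroid, then update."""
    centroids = [dict(v) for v in vectors[:k]]  # assumption: seed with the first k vectors
    assignment = [None] * len(vectors)
    for _ in range(iterations):
        new_assignment = [
            max(range(k), key=lambda c: cosine(vec, centroids[c])) for vec in vectors
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [vec for vec, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment

# Hypothetical toy genes: two apoptosis-related, two cell-cycle-related.
genes = {
    "GeneA": build_vector(["Apoptosis"], ["GO:0006915"], ["caspase", "death"]),
    "GeneB": build_vector(["Apoptosis"], ["GO:0006915"], ["caspase"]),
    "GeneC": build_vector(["Cell Cycle"], ["GO:0007049"], ["mitosis", "cyclin"]),
    "GeneD": build_vector(["Cell Cycle"], ["GO:0007049"], ["cyclin"]),
}
labels = dict(zip(genes, kmeans(list(genes.values()), k=2)))
```

Prefixing each term with its category keeps a MeSH heading and a text word with the same spelling as separate dimensions, which is one simple way to combine the three feature types in a single vector.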
Ranking of sentences for each gene
Sentences are modeled as word vectors after parsing, stop word removal and stemming. Each sentence is assigned a score by linear combination of features. Specifically, sentence score S is calculated as:
$$S = w_1\,\mathrm{CluSim} + w_2\,\mathrm{NGene} + w_3\,\mathrm{CTword} + w_4\,\mathrm{TPword} + w_5\,L + w_6\,\mathrm{Recency}$$
Cluster representation (CluSim). The top five descriptive features (a set of MeSH, GO terms and/or text words) for each gene cluster from the previous step are used as this ranking measure.
Gene relations (NGene). Sentences that reference more than one gene or protein name score higher; all other sentences score 0.
Cue phrases (CTword). This is identical to Edmundson's Cue feature, which assumes that the importance of a sentence is indicated by the presence of certain key terms. For example, the term 'conclusion' may indicate importance.
Domain specific keywords (TPword). Biologically relevant keywords were extracted from the Textpresso ontology.
Length (L). L is calculated as the length of the sentence expressed as a fraction of the length of the longest sentence.
Recency (Recency). Recency is scored on a linear scale from one to zero.
The weighting scheme of w was adjusted empirically based on feedback from users during the first stage of the evaluation process. This used a data set distinct from those used for the evaluation discussed below.
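As a rough illustration of the scoring formula above, the sketch below computes S as a weighted sum of the six features. The weights, the feature values, and the helper names are hypothetical, since the tuned weights are not reported here. Measuring length in words and mapping the most recent publication year to one are also assumptions.

```python
from dataclasses import dataclass, astuple

@dataclass
class SentenceFeatures:
    clu_sim: float   # CluSim: overlap with the cluster's descriptive features
    n_gene: float    # NGene: non-zero only if more than one gene/protein is mentioned
    ct_word: float   # CTword: cue phrase feature
    tp_word: float   # TPword: domain-specific keyword feature
    length: float    # L: fraction of the longest sentence
    recency: float   # Recency: linear scale between one and zero

def length_feature(n_words, longest_n_words):
    """Sentence length as a fraction of the longest sentence (word counts assumed)."""
    return n_words / longest_n_words

def recency_feature(year, oldest_year, newest_year):
    """Linear scale; assumption: newest year maps to 1.0 and oldest year to 0.0."""
    if newest_year == oldest_year:
        return 1.0
    return (year - oldest_year) / (newest_year - oldest_year)

def sentence_score(features, weights):
    """S = w1*CluSim + w2*NGene + w3*CTword + w4*TPword + w5*L + w6*Recency."""
    return sum(w * x for w, x in zip(weights, astuple(features)))

# Hypothetical weights and features, for illustration only.
weights = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
features = SentenceFeatures(
    clu_sim=1.0,
    n_gene=1.0,
    ct_word=0.0,
    tp_word=1.0,
    length=length_feature(10, 20),
    recency=recency_feature(2003, 1994, 2003),
)
score = sentence_score(features, weights)
```

Sorting a gene's sentences by this score in descending order would then give the ranked summary.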
Evaluating the clustering algorithm
Evaluating ranking of informative sentences
Sentences for ten genes (genes from each of the clusters evaluated in the previous step) were used in this step. Sentences from the output of the system and from PubMed searches were pooled together and judged by the same scientists who studied the gene sets. The PubMed searches were run with the E-Search utility provided by Entrez. The queries consisted of the name of the gene, with synonym expansion based on a synonym dictionary. The search results were filtered on Date of Publication (DP) to limit them to the period 1994–2003 and on the MeSH term (MH) 'Mice'. These filtering criteria were the same as those used for the text collection, making the comparison between PubMed search results and system output possible.
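A query of the kind described above might be assembled as in the following sketch. The exact query strings used in the study are not given, so the Boolean structure, the example gene and synonym, and the `retmax` value are assumptions. The sketch only builds the request URL for the NCBI E-utilities `esearch.fcgi` endpoint and does not send it.

```python
from urllib.parse import urlencode

# Current public E-utilities endpoint for ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_term(gene, synonyms, start_year=1994, end_year=2003, mesh_term="Mice"):
    """Gene name OR-ed with its synonyms, filtered by publication date and MeSH term."""
    names = [gene, *synonyms]
    name_clause = " OR ".join(f'"{name}"' for name in names)
    return f"({name_clause}) AND {start_year}:{end_year}[DP] AND {mesh_term}[MH]"

def build_esearch_url(term, retmax=100):
    """Encode the query as an ESearch request against the PubMed database."""
    return ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})

# Illustrative gene and synonym, not taken from the evaluated gene sets.
term = build_pubmed_term("Trp53", ["p53"])
url = build_esearch_url(term)
```

Here `[DP]` and `[MH]` are the PubMed field tags for Date of Publication and MeSH terms, matching the two filters named in the text.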
1. Our system output: sentences with reference to the gene, extracted from the abstracts and ranked by the scoring algorithm.
2. The same sentence set as in #1 above, but in reverse chronological order (the same as PubMed's ranking).
3. Output from the PubMed search (titles of abstracts in reverse chronological order).
For each sentence presentation, the average precision (AveP) score was calculated for each of the genes, and the scores were analyzed using a repeated-measures analysis with post-hoc comparisons using the Sidak adjustment.
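For readers unfamiliar with the measure, the sketch below shows one common way to compute average precision from a ranked list of relevance judgments. It assumes binary judgments and normalizes by the number of relevant items in the ranked list. The exact variant used in the study is not specified, and the judgment lists are made up.

```python
def average_precision(relevance):
    """Mean of the precision values at each rank where a relevant item appears.

    `relevance` is a list of booleans in ranked order (True = judged relevant).
    Returns 0.0 when no item in the list is relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# The same hypothetical judgments under two different orderings.
ranked_well = average_precision([True, False, True, False])    # (1/1 + 2/3) / 2
ranked_poorly = average_precision([False, True, False, True])  # (1/2 + 2/4) / 2
```

Because the measure rewards placing relevant items near the top, two presentations of the same sentence set can receive different scores, which is what allows the ranked and chronological orderings to be compared.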
The five gene sets evaluated were results from microarray experiments covering different areas of translational research, such as mouse models of myelodysplastic syndromes and genes in the brain circuits involved in drug dependence and withdrawal. They were obtained from similar platforms: Affymetrix 430A or 430 2.0. The number of clusters for each gene set depended on the number of genes in the list and the natural diversity of the gene set. The size of the five gene sets ranged from 53 to 275, which we believed was representative of the number of differentially expressed genes produced in microarray experiments. The criteria for selecting these sets of differentially expressed genes were set individually by each scientist for their own data, without any intervention from the authors. The number of genes in the 10 clusters evaluated by the scientists ranged from four to 12.
Average precision scores for the three sentence presentations. The GICSS system performed significantly better than the PubMed title presentation.
Comments on the system
In general, the clustering algorithm gave better gene groups than random, as supported by the fact that clusters generated with both MeSH and GO terms were significantly better than random groupings of genes. The comparison between the feature types showed no significant differences, even though the trend in the confidence intervals suggested that MeSH and GO may be better than text as features for clustering. It appeared that the perception of a good cluster did not depend on the scientist knowing the clusters' descriptive terms. Once the participants found the better cluster, they were likely to stick with it even after seeing the keywords. The preference for MeSH and GO clusters suggests that controlled vocabularies fare better than text words for generating clusters. Controlled vocabularies usually have more domain-specific content, which may give more information to users.
Providing sentences in the abstract gave much more relevant information than titles alone. The results showed significant differences between the system output and the PubMed search output. These results suggest that it may be useful for the PubMed result list to include highlighted keywords from the sentences in the abstracts to provide more information to the searchers. In the current title-only list, some relevant articles may be missed because the titles do not provide enough information to warrant further exploration of the abstract, especially when the returned list is long.
Further analysis of the relation between recency and relevance indicated that the more recent the date of publication, the more likely the sentence was to be judged relevant. This relation may be explained by the fact that the experts making the relevance judgments were, to a certain extent, aware of the knowledge accumulation timeline of the specific genes. This seems a reasonable assumption because the experts were instructed to select two genes they were familiar with to perform the evaluation. Because of their pre-existing knowledge of the genes, they were likely to pick newly discovered information as more relevant than well-known facts about the genes.
GICSS is built in a modular fashion. This open framework allows other NER systems to be substituted in order to change the recognition method or the species covered. Furthermore, the sentence ranking approach can be enhanced with new useful features as they become apparent.
Reflection on the evaluation process
The time for cluster evaluation of each gene set ranged from 20 to 35 minutes. During the experiment, we found that judging cluster pairs was not an easy task for the scientists. Even though each cluster contained at least one of the genes they had chosen as familiar, it was very common that some genes in the cluster (with an average size of 8 genes) were not very familiar to them. In order to judge the quality of the cluster, they needed to follow links in the evaluation screen to obtain information on the other genes in the cluster. This created a larger than expected workload for the evaluators, and by the end of the session they made their best judgments without going through the information for the less familiar genes, possibly due to fatigue. On the other hand, without a system such as GICSS, which provided hyperlinked text to articles relevant to genes and clusters, the scientists would have had a much larger workload to make any sense of the clusters at all. Systems that only provide gene cluster output effectively leave scientists in this situation.
The workload was also the reason each participant was asked to judge clusters for only two genes, which amounted to judging 12 cluster pairs in our experimental design. The fatigue factor may also have influenced the quality of the judgments. To overcome the difficulty of recruiting participants and to ensure the quality of the judgments, it would be better to conduct future evaluation experiments in a real-world, routine-use setting, spread over a longer period of time during which the scientists analyze their data in the normal course of their gene microarray investigations. For example, the system could log the clicks and timing of the participants in addition to presenting them with the defined questions. In this manner, we may be able to capture some of the information that might have been lost during this evaluation. For example, evaluators mentioned that after reading the summary sentences they had found informative abstracts that were important to their analyses. These examples were important illustrations of the system's success, but they were not recorded for evaluative purposes because the relevance judgments were the focus of the sentence extraction evaluation.
How best to measure the quality of clusters is still a general issue, especially in this case, where we define quality as how useful the clusters were for the analysis of a specific microarray experiment. Some analytical measures, such as internal and external similarities, entropy and mutual information, are likely not to correlate closely with this notion of quality. These measures are commonly used to quantify the quality of clusters in many comparative experiments [7, 16]. We are unaware of any previous work on how the purity of a cluster, as defined by these statistical measures, correlates with the biological meaning of the cluster for a user. Furthermore, in this study, the raw list of differentially expressed genes from the experiment was clustered. By nature, such a list contains genes that were differentially expressed and that most likely come from many different pathways and groups, especially considering that many genes may perform multiple functions. We expect such a list to be harder to cluster than a set assembled from several distinct known gene groups, such as GO and cell cycle groups, which are then clustered back into their known classes; the latter is the most common evaluation method used [7, 16, 17].
The GICSS system can support user-defined query terms for selecting the important sentences. In this way, it gives users more leverage in getting to the information of greatest interest. For example, they can enter the name of a disease in order to retrieve sentences referring to both the gene and the disease. This feature is intended to provide a customized sentence presentation that fits the different needs of users. Due to the limited resources in this project, this feature was not evaluated, and studying its use is left for future work.
We built and evaluated a gene information summarization system for translational researchers. The evaluation was performed during the time that real scientists were analyzing their own microarray experiment results. It was a hybrid evaluation method, partway between intrinsic and extrinsic system evaluation, and may enable researchers to more truly gauge the usefulness of a system to its intended users. The result of the evaluation suggested the system could be a useful tool for translational genomics researchers.
This work was supported in part by NLM Training Grant 1T15 LM009461. The authors thank all the evaluation participants at OHSU for their time and effort.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 2, 2009: Selected Proceedings of the First Summit on Translational Bioinformatics 2008. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S2.
- 1. Soderlund K, Skoog L, Fornander T, Askmalm MS: The BRCA1/BRCA2/Rad51 complex is a prognostic and predictive factor in early breast cancer. Breast Radiobiology. 2007, 84 (3): 242-251.
- 2. Sparck Jones K: Automatic summarizing: factors and directions. Advances in Automatic Text Summarization. Edited by: Mani I, Maybury MT. 1999, London: MIT Press.
- 4. Luhn H: The Automatic Creation of Literature Abstracts. IBM Journal. 1958, 159-165.
- 7. Liu Y, Ciliax BJ: Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering. Proc IEEE Comput Syst Bioinform Conf. 2004, 394-404.
- 11. Yang J, Cohen AM, Hersh W: Automatic Summarization of Mouse Gene Information by Clustering and Sentence Extraction from MEDLINE Abstracts. AMIA Annu Symp Proc. 2007, 831-835.
- 12. Cluto clustering software package. [http://glaros.dtc.umn.edu/gkhome/views/cluto]
- 13. Cohen A: Unsupervised gene/protein entity normalization using automatically extracted dictionaries. Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Proceedings of the BioLINK2005 Workshop; Detroit, MI. 2005, Association for Computational Linguistics, 17-24.
- 15. Entrez E-Search Utility. [http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html]
- 17. Glenisson P, Antal P, Mathys J, Moreau Y, Moor BD: Evaluation of the vector space representation in text-based gene clustering. Pac Symp Biocomput. 2003, 391-402.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.