Identifying diseases-related metabolites using random walk
Metabolites disrupted by an abnormal state of the human body are regarded as an effect of disease. Compared with causes of disease such as genes, these markers are easier to capture for the prevention and diagnosis of metabolic diseases. A large number of metabolic markers of disease remain to be explored, which motivated this work.
Existing metabolite-disease associations were extracted from the Human Metabolome Database (HMDB) using a text mining tool, NCBO Annotator, as prior knowledge. Next, we calculated the similarity of each pair of metabolites based on the similarity of their disease sets. All metabolite-pair similarities were then used to construct a weighted metabolite association network (WMAN). Finally, the network was used to predict novel metabolic markers of diseases by random walk.
In total, 604 metabolites and 228 diseases were extracted from HMDB. Of the 604 metabolites, 453 were selected to construct the WMAN, in which each metabolite is a node and the similarity of two metabolites is the weight of the edge linking them. The performance of the network was validated using the leave-one-out method, which gave an area under the receiver operating characteristic curve (AUC) of 0.7048. Case studies identifying novel metabolites of diabetes mellitus were supported by recent studies.
In this paper, we presented a novel method for prioritizing metabolite-disease pairs. Its performance in validation supports its reliability for exploring novel metabolic markers of diseases.
Keywords: Metabolites, Similarity of diseases, Similarity of metabolites, Random walk, InfDisSim, MISIM
Complex and ordered chemical reactions in the human body are essential for maintaining life. The whole process is called metabolism [1, 2]. The maintenance, growth and reproduction of organisms depend on metabolism. In terms of energy, metabolism is divided into two parts. One is catabolism, which releases energy by breaking down large molecules, as in cellular respiration. The other is anabolism, which uses energy for synthesis inside cells, such as the synthesis of proteins and nucleic acids. When people become ill, the exchange of substances and energy becomes abnormal, and a series of abnormal metabolites is generated. Therefore, metabolites can be used to help diagnose and treat diseases.
Nowadays, advanced technology makes it possible to recognize diseases at the molecular level, which is very helpful to researchers [6, 7, 8, 9, 10, 11, 12, 13, 14]. Many researchers aim to determine the role of a single gene, mRNA transcript or protein in disease, which has explained a great deal about diseases. However, because genes and micro-RNAs often interact with one another in complex ways, it is hard to analyze the underlying mechanism of diseases. Metabolites, in contrast, are the final products of these mechanisms and have already become a significant factor in identifying diseases.
First, because different diseases are correlated, the similarity of diseases can be calculated from genes and their corresponding proteins. For example, colorectal cancer has a strong relationship with ulcerative colitis, as reported by PM Choi. Achalasia and Parkinson's disease share similar features to some extent, and SJ Qualman et al. described the similarity of the two diseases. Furthermore, various studies have reported methods for obtaining the similarity of diseases. J Li et al. developed a method named DOSim to compute the similarity of diseases, and the method has been packaged as an R-based software package. J Wang et al. [19] proposed a method to calculate phenotype similarity scores, which can then be used to obtain the similarity of diseases. Rischer et al. [20] built a gene-to-metabolite network to explain mechanisms in Catharanthus roseus cells. Mounet et al. also built a network of genes and metabolites to find candidate genes for tomato composition and development. To improve the robustness of metabolite networks, Huss divided the network into small subnetworks and removed the most abundant substrates. Based on the 3D-structure similarity of metabolites, Ohtana et al. [23] identified relationships between biological activities and metabolites. Steve O'Hagan and Douglas B. Kell analyzed the similarity between drugs and metabolites. Kang et al. [25] classified plants by the similarity of their metabolites.
Metabolites are key to explaining disease mechanisms. Analyzing the metabolome is attractive to researchers because the number of compounds that need to be identified and quantified is relatively low. In 2009, Vladimir V. Tolstikov developed a method that brings more related metabolites into the data analysis. In 2010, H Zur et al. predicted enzymes' metabolic flux with a novel method, 'iMAT'. Paige et al. collected and analyzed the metabolic profiles of depressed patients. M Cuperlović-Culf et al. [30] distinguished individual cell lines, groups of cancer and normal cell lines, and non-invasive and invasive tumor cell lines by their metabolites.
Therefore, we try to find more disease-related metabolites by analyzing data on metabolites and diseases. First, we calculated the similarity of different diseases. The similarity of metabolites was then obtained from the similarity of diseases. Finally, a network was built in which each disease can reach metabolites through the network, allowing us to obtain more disease-related metabolites.
To obtain the basic relationships between metabolites and diseases, three resources were used: HMDB, NCBO Annotator and Disease Ontology.
Data collection and database content
Human metabolome database
We downloaded the metabolite data from the Human Metabolome Database (HMDB). This widely used and comprehensive database covers more than 40,000 metabolites. It contains three kinds of data: chemical data, clinical data and biochemical data, collected from thousands of public sources.
The dataset we obtained contains the disease-related metabolites and consists of many complex files, so we used the other resources to further interpret these data.
Disease Ontology started as part of the NUgene project at Northwestern University in 2003. By integrating other datasets, Disease Ontology supports the study of hereditary, environmental and other causes of diseases, which helps researchers understand diseases better.
Each disease or disease concept is a node. Each node has cross-references to the literature and is assigned a DOID. Nodes in a lower layer are subclasses or subtypes of nodes in the upper layer, and the parent-child relationships between DOIDs are preserved in the data. All diseases are classified into seven groups: diseases caused by environmental origin, diseases caused by infectious agent, diseases of anatomical entity, diseases of behavior, diseases of biological process, hereditary disease, disease syndrome and gene ontology. All nodes are connected in a directed acyclic graph (DAG).
After obtaining the disease-related metabolite data from HMDB, we used Disease Ontology to annotate the diseases, which gave us the name and related information of each disease.
National center for biomedical ontology
To improve the semantic expressiveness and open interconnection of data, the National Center for Biomedical Ontology (NCBO) proposed a data sharing project to address the lack of integration tools for scientific ontologies. Datasets in each domain exist as information islands. Most of the information cannot be semantically identified by machines, which obstructs interaction between information nodes and hinders biomedical research and knowledge discovery. NCBO has six core components: computer science and biomedical informatics research, promoting biology projects and external research collaboration, infrastructure, education, communication and management.
We further interpreted and annotated the HMDB data through NCBO, which yielded a disease-to-metabolite data file.
Calculating similarity of pair-wise diseases
Diseases show a certain degree of similarity, which is often due to shared molecular origins. Interactions among protein-coding genes reflect disease mechanisms to some extent. Therefore, the similarity of diseases can be derived from the genes behind the diseases.
In this paper, we used the method named 'InfDisSim' [13, 34] to calculate the similarity of diseases. This method measures disease similarity using a gene functional network, which provides the information flow used to calculate the similarity. To analyze the information flow, ITM Probe is employed, which includes three models: absorbing, emitting and channel. Each disease is a boundary node in the network, and each gene is a transient node.
Where $G_1$ and $G_2$ indicate the metabolite sets of $t_1$ and $t_2$, respectively, $G_{MICA}$ is the metabolite set of $t_3$, and $|\cdot|$ represents the number of terms in the specified set.
Then we could obtain the similarity of the diseases.
Calculating similarity of pair-wise metabolites
A method named 'MISIM' was proposed by Dong Wang et al. to estimate the similarity of micro-RNAs. They pointed out that genes with similar functions are often associated with similar diseases, so the similarity of diseases can be computed from the DAG. This idea is similar to our work in 'InfDisSim' and is also the premise for calculating the similarity of metabolites. Based on this idea and miRNA-disease association data, they presented 'MISIM' to infer the functional similarity of miRNAs from disease relationships.
In our research, we computed the similarity of metabolites instead. Since the background and theoretical basis are the same, we applied 'MISIM' to calculate the similarity of metabolites from the similarity of diseases.
Here d represents one disease and D is one disease set. S(d, D) is the maximum similarity between disease d and the diseases in set D.
Then the similarity between M1 and M2 could be obtained.
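The equations for S(d, D) and for the similarity between M1 and M2 are not shown in this text. As a minimal sketch, assuming the best-match-average form usually associated with MISIM (each disease's best match against the other set, summed over both sets and divided by the total number of diseases), the calculation might look as follows. The function names and the toy disease similarities are hypothetical.

```python
def best_match(d, disease_set, sim):
    """S(d, D): maximum similarity between disease d and any disease in D."""
    return max(sim(d, other) for other in disease_set)


def metabolite_similarity(d1, d2, sim):
    """Best-match average over the disease sets of two metabolites (assumed form)."""
    total = sum(best_match(d, d2, sim) for d in d1)
    total += sum(best_match(d, d1, sim) for d in d2)
    return total / (len(d1) + len(d2))


# Made-up symmetric disease similarities; identical diseases score 1.0.
_toy = {
    frozenset(("a", "b")): 0.5,
    frozenset(("a", "c")): 0.2,
    frozenset(("b", "c")): 0.4,
}


def toy_sim(x, y):
    return 1.0 if x == y else _toy[frozenset((x, y))]


# Metabolite M1 is linked to diseases {a, b}, metabolite M2 to {b, c}.
score = metabolite_similarity({"a", "b"}, {"b", "c"}, toy_sim)
```

With these toy values the four best matches are 0.5, 1.0, 1.0 and 0.4, so the score is 2.9 / 4 = 0.725.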
Predicting novel disease-metabolite relationships using random walk
Random walk is an important class of stochastic process. For example, suppose an ant starts from $X_t$ and takes a step forward with probability 0.5 ($X_{t+1} = X_t + 1$) or a step back with probability 0.5 ($X_{t+1} = X_t - 1$). The points at which the ant arrives at each moment constitute a one-dimensional random walk.
D is the degree matrix of A, a diagonal matrix whose diagonal elements are $D(i, i) = \sum_j A(i, j)$. P is the random walk matrix, in which the jump probabilities from each node to all other nodes sum to 1.
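The ant example can be simulated in a few lines. This is only an illustrative sketch; the function name and the seed are arbitrary choices and are not part of the original method.

```python
import random


def simulate_walk(steps, start=0, seed=None):
    """Return the positions X_0 .. X_steps of a symmetric one-dimensional walk."""
    rng = random.Random(seed)
    positions = [start]
    for _ in range(steps):
        # Step forward (+1) or back (-1), each with probability 0.5.
        positions.append(positions[-1] + rng.choice((1, -1)))
    return positions


path = simulate_walk(10, seed=0)
```

Every consecutive pair of positions in `path` differs by exactly one, which is the defining property of this walk.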
When πP = π, the equilibrium state is reached.
Where I is the identity matrix, P is the corresponding random walk matrix, and W is the matrix whose rows are all equal to the equilibrium distribution. For a regular Markov chain, W is the limit of $P^n$ as n tends to infinity.
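The displayed equations for P and W are not shown in this text. A minimal sketch, assuming the standard row-normalized definition of P (each entry of A divided by the degree D(i, i) of its row) and approximating the equilibrium distribution π by repeated multiplication, is given below. The toy adjacency matrix is hypothetical.

```python
def transition_matrix(adj):
    """Row-normalize A: P[i][j] = A[i][j] / D(i, i), with D(i, i) = sum_j A[i][j]."""
    return [[w / sum(row) for w in row] for row in adj]


def stationary_distribution(p, iterations=1000):
    """Approximate pi with pi P = pi, starting from the uniform distribution."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical weighted, undirected network with three nodes.
A = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 2.0],
     [1.0, 2.0, 0.0]]
P = transition_matrix(A)
pi = stationary_distribution(P)
```

For an undirected network such as this one, π is proportional to the node degrees (2, 3 and 3 here), and every row of W would equal π.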
Step 1: Given an initial iteration point x, set the step length λ and the control accuracy ℓ.
Step 2: Set the number of iterations N and let k be the current iteration count.
Step 3: While k < N, randomly generate an n-dimensional vector u = (u1, u2, …, un), then take a walk step x1 = x + λu′.
Step 4: If f(x1) < f(x), set k = 1 and return to step 2; otherwise set k = k + 1 and return to step 3.
Step 5: If no better solution is found in N iterations, the optimal solution is taken to lie in the neighborhood centered on the current best solution.
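The search procedure above can be sketched as follows. Several details are assumptions because the text leaves them open: u′ is taken to be u scaled to unit length, an improving point x1 replaces x, and the step length is halved until it falls below the control accuracy. The function and parameter names are hypothetical.

```python
import math
import random


def random_walk_minimize(f, x, step=1.0, tol=1e-4, max_tries=100, seed=None):
    """Minimize f by random-walk search, shrinking the step when no move helps."""
    rng = random.Random(seed)
    x = list(x)
    while step >= tol:
        k = 1
        while k < max_tries:
            # Random direction u, scaled to unit length (assumed meaning of u').
            u = [rng.uniform(-1.0, 1.0) for _ in x]
            norm = math.sqrt(sum(c * c for c in u)) or 1.0
            x1 = [xi + step * c / norm for xi, c in zip(x, u)]
            if f(x1) < f(x):
                x, k = x1, 1  # accept the move and reset the counter
            else:
                k += 1
        step /= 2.0  # assumed refinement: halve the step and search again
    return x


def bowl(v):
    """Toy objective with its minimum at (1, -2)."""
    return (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2


best = random_walk_minimize(bowl, [0.0, 0.0], seed=1)
```

On this toy objective the search ends close to (1, −2). This optimization variant is separate from the network-based random walk used for ranking metabolites.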
Here, A is the column-normalized adjacency matrix, $P_0$ is the initial probability vector and $P_t$ is the probability vector whose i-th element is the probability of being at node i at step t. Following a previous study, γ was set to 0.85.
First, we used NCBO Annotator and Disease Ontology to process the data obtained from HMDB. The data were then integrated into one-to-one pairs of metabolites and diseases. Finally, we compiled statistics on the corresponding diseases and metabolites.
After analyzing the two figures, we speculated that more metabolites are related to the diseases than are currently recorded. To understand disease mechanisms, we need to know all the related metabolites.
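The update equation itself is not shown in this text. The sketch below assumes the commonly used form p(t+1) = (1 − γ) A p(t) + γ p(0), with γ acting as the restart probability; if γ is instead meant to weight the walk term, the two coefficients swap. The toy network and all function names are hypothetical.

```python
def column_normalize(w):
    """Scale each column of the weight matrix so that it sums to 1."""
    n = len(w)
    col_sums = [sum(w[i][j] for i in range(n)) for j in range(n)]
    return [[w[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def random_walk_with_restart(a, p0, gamma=0.85, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - gamma) * A p + gamma * p0 until the change is below tol."""
    n = len(a)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [(1 - gamma) * sum(a[i][j] * p[j] for j in range(n)) + gamma * p0[i]
               for i in range(n)]
        if sum(abs(x - y) for x, y in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical chain of four metabolites; metabolite 0 is the only known seed.
W = [[0.0, 0.9, 0.0, 0.0],
     [0.9, 0.0, 0.8, 0.0],
     [0.0, 0.8, 0.0, 0.7],
     [0.0, 0.0, 0.7, 0.0]]
scores = random_walk_with_restart(column_normalize(W), [1.0, 0.0, 0.0, 0.0])
ranking = sorted(range(4), key=lambda i: scores[i], reverse=True)
```

Sorting the converged scores gives the candidate ranking for the seed's disease. In this chain the score falls with distance from the seed.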
The metabolites related to diseases
Next, we calculated the similarity of diseases with InfDisSim. In total, we obtained 3524 diseases and calculated the similarity between each pair of diseases.
Then we calculated the similarity of metabolites with MISIM. The metabolite similarities are summarized in the following figure:
In total we have 604 metabolites, which gives 182,710 similarities (604 × 605 / 2). Of these, 90.8% are lower than 0.1, so we used only the remaining similarities, which are higher than 0.1, to draw Fig. 6. As Fig. 6 shows, very few similarities are higher than 0.7. Each point in the figure gives the proportion of similarities that fall between two adjacent values on the x axis. For the first point, for example, about 10% of the remaining similarities are higher than 0.1 and lower than 0.2. Because of the large number of similarities, we filtered out those lower than 0.7, using 0.7 as the selection threshold. This excluded more than 90% of the remaining similarities from the rest of the study. The number of similarities retained is 2589.
We placed 20 nodes on a circle of radius 1 and connected them with lines according to their similarity. Each node represents a metabolite in the network. If two nodes are related, they are connected by a line; if two nodes have no similarity, they are left unconnected. Through the lines of the network, a disease can be linked to more metabolites via several known metabolites. From these links we obtain a probability for every metabolite. We can sort these probabilities to obtain candidate disease-related metabolites.
After building the network of 453 metabolites, we used the RW algorithm to find metabolites related to the 228 diseases. Each disease may be related to only a few metabolites in the known dataset. With the network, we can identify more metabolites related to each disease.
For example, Alzheimer's disease is related to 86 metabolites in our original dataset. However, we do not know which metabolite has the strongest relationship with it, nor the relative importance of different metabolites to this disease. After running the RW, we obtain the ranking of metabolites shown in the following figure:
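The thresholding step can be sketched as below. The data layout (a nested dictionary of pairwise similarities), the metabolite labels and the choice to keep values equal to the threshold are all assumptions made for illustration.

```python
from itertools import combinations


def build_network(similarity, threshold=0.7):
    """Keep metabolite pairs whose similarity reaches the threshold as weighted edges."""
    edges = {}
    for m1, m2 in combinations(sorted(similarity), 2):
        s = similarity[m1].get(m2, 0.0)
        if s >= threshold:
            edges[(m1, m2)] = s
    # Only metabolites with at least one retained edge become nodes.
    nodes = sorted({m for pair in edges for m in pair})
    return nodes, edges


# Hypothetical pairwise similarities between four metabolites.
toy_similarity = {
    "M1": {"M2": 0.82, "M3": 0.05},
    "M2": {"M1": 0.82, "M3": 0.71},
    "M3": {"M1": 0.05, "M2": 0.71},
    "M4": {},
}
nodes, edges = build_network(toy_similarity)
```

In this sketch, a metabolite with no similarity at or above the threshold (M4 here) does not appear in the node list. This would be consistent with the network having fewer nodes (453) than the 604 extracted metabolites.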
Performance evaluation using leave-one-out validation
To validate the performance of our method for prioritizing metabolite-disease pairs, leave-one-out validation was performed on the existing metabolite-disease associations. Step 1, one metabolite-disease pair was removed from the prior knowledge. Step 2, the metabolite network was constructed from the remaining metabolite-disease associations. Step 3, the removed metabolite-disease pair was defined as the positive group (PG), and the other pairs of metabolites with this disease that are not in the prior knowledge were defined as the negative group (NG). Step 4, we used the RWR method to score all metabolite-disease pairs in the PG and NG based on the network. Step 5, the above steps were repeated for every metabolite-disease pair in the prior knowledge. The area under the receiver operating characteristic curve (AUC) was then calculated from all the NGs and PGs. The resulting AUC of 0.7048 supports the performance of our method for predicting novel metabolite-disease associations.
After mapping the metabolites to the diseases, we found more metabolites related to the diseases. To check that the relationships we found are correct, we conducted a case study.
A good case in point is diabetes mellitus. It was originally related to 28 metabolites, and we found it to be related to 242 metabolites. Although some of these relationships are weak, they suggest some connection between diabetes mellitus and these metabolites.
To verify the novel relationships, we selected one of the newly related metabolites and examined whether it is related to diabetes mellitus. We selected HMDB00479 (3-Methylhistidine), which is not reported in the dataset we used in section 2(A). Kuan-Hsing Chen et al. found that this metabolite is related to diabetes mellitus.
DPK Ng et al. reported that Hydroxyphenylacetic acid is related to diabetes mellitus. Although the original database did not include this metabolite as related to diabetes mellitus, we found the relationship between Hydroxyphenylacetic acid and diabetes mellitus by RW.
These two pieces of evidence support that our method is suitable and effective for identifying relationships between diseases and metabolites.
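The five steps can be sketched as follows. The `score_fn` hook, which would rebuild the network from the remaining associations and run the random walk, is hypothetical, and the fixed toy scores stand in for it only to make the example runnable. The AUC is computed as the probability that a positive pair scores above a negative pair, which is equivalent to the area under the ROC curve.

```python
def leave_one_out(associations, metabolites, score_fn):
    """Collect scores of held-out pairs (PG) and of unlabeled pairs (NG).

    associations: set of (metabolite, disease) pairs used as prior knowledge.
    score_fn(training_pairs, disease) -> {metabolite: score} (hypothetical hook).
    """
    pos, neg = [], []
    for held_out in sorted(associations):
        metabolite, disease = held_out
        training = associations - {held_out}      # Steps 1 and 2
        scores = score_fn(training, disease)      # Step 4
        pos.append(scores[metabolite])            # PG: the removed pair
        known = {m for m, d in associations if d == disease}
        neg.extend(scores[m] for m in metabolites if m not in known)  # NG
    return pos, neg


def auc(positive_scores, negative_scores):
    """Probability that a positive outranks a negative (ties count as half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy stand-in for the network scoring; the values are made up.
toy_scores = {"m1": 0.9, "m2": 0.6, "m3": 0.4, "m4": 0.1}
toy_associations = {("m1", "d1"), ("m3", "d1")}
pos, neg = leave_one_out(toy_associations, ["m1", "m2", "m3", "m4"],
                         lambda training, disease: toy_scores)
toy_auc = auc(pos, neg)
```

In the toy run the positives score 0.9 and 0.4 against negatives 0.6 and 0.1 (counted once per held-out pair), giving 6 wins out of 8 comparisons and an AUC of 0.75.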
We obtained the data from three public resources: HMDB, Disease Ontology and NCBO. We then derived data in which metabolites and diseases are in one-to-one correspondence. We first examined how metabolites map to diseases, and then speculated that more metabolites should be related to the diseases.
First, we used 'InfDisSim' to calculate the similarity of diseases from the genes related to them. The similarity of metabolites was then obtained from the similarity of diseases. 'MISIM' allowed us to build a network of metabolite similarities. Finally, we used random walk to find more metabolites related to the diseases.
Through the network of metabolite similarities, more metabolites can be connected to the diseases. An association score between each disease and metabolite can also be obtained. We can then sort these scores to see which metabolites are most likely to be associated with a disease and which are less related. This ranking can be important information for researchers looking for candidate metabolites. Researchers need not be limited to the metabolites already reported; the complex metabolite network may give them more opportunities to understand the mechanisms behind diseases.
The approach presented in this paper is also used to predict central nervous system disease-related SNPs and risk pathways by constructing a virtual SNP-SNP network and a pathway-pathway network [12, 40, 41, 42, 43].
Complex diseases are caused by complex gene interactions, and it is hard to explain disease mechanisms through these complex gene networks. The corresponding micro-RNAs may also not fully explain how diseases work. Metabolites, as products of these complex mechanisms, have become a vital factor in understanding diseases.
The results show the power of our method, which should be helpful for further research. We found that metabolites not reported in our dataset as related to diabetes mellitus have been reported in other researchers' work. Through our network, these previously unknown metabolites can be mapped to the diseases.
Tianyi Zang, Jun Zhang and Liang Cheng are the corresponding authors. Yang Hu and Tianyi Zhao are the co-first authors.
Publication costs were funded by the Fundamental Research Funds for the Central Universities (Grant No. HIT NSRIF 201856), National Natural Science Foundation of China (Grant No. 61502125), Heilongjiang Postdoctoral Fund (Grant No. LBH-Z6064 and LBH-Z15179), and China Postdoctoral Science Foundation (Grant No. 2016M590291).
Availability of data and materials
All the datasets used in this paper can be downloaded from their websites.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 5, 2018: Selected articles from the Biological Ontologies and Knowledge bases workshop 2017. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-5.
LC did the data preprocessing and TZ did the algorithm simulation under the direction of YH. NZ, TZ and JZ helped proofread the manuscript. All authors have read and approved the final version of the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 2. Cheng L, Yang H, Zhao H, Pei X, Shi H, Sun J, Zhang Y, Wang Z, Zhou M. MetSigDis: a manually curated resource for the metabolic signatures of diseases. Brief Bioinform. 2017; https://doi.org/10.1093/bib/bbx103
- 9. Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinformatics. 2017;18(16).
- 13. Yang H, Meng Z, Shi H, et al. Measuring disease similarity and predicting disease-related ncRNAs by a novel method. BMC Medical Genomics. 2017;10(5):71.
- 14. Peng J, Zhang X, Hui W, Lu J, Li Q, Shang X. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach. BMC Systems Biology. 2018;12(2):18.
- 19. Wang J, Zhou X, Zhu J, Guo Z. Bias of phenotype similarity scores between diseases. International Conference on Bioinformatics and Biomedical Engineering. 2010:1–4.
- 20. Rischer H, Oresic M, Seppänenlaakso T, Katajamaa M, Lammertyn F, Ardilesdiaz W, Van Montagu MC, Inzé D, Oksmancaldentey KM, Goossens A. Gene-to-metabolite networks for terpenoid indole alkaloid biosynthesis in Catharanthus roseus cells. Proc Natl Acad Sci U S A. 2006;103(14):5614.
- 23. Ohtana Y, Abdullah AA, Altaf-Ul-Amin M, Huang M, Ono N, Sato T, Sugiura T, Horai H, Nakamura Y, Morita HA. Clustering of 3D-structure similarity based network of secondary metabolites reveals their relationships with biological activities. Molecular Informatics. 2014;33(11–12):790–801.
- 25. Kang L, Abdullah AA, Ming H, Nishioka T, Altafulamin M, Kanaya S. Novel approach to classify plants based on metabolite-content similarity. Biomed Res Int. 2017;2017(2):5296729.
- 27. Foote RS, Lee JW. Micro and Nano Technologies in bioanalysis: Humana press; 2009.
- 30. Cuperlović-Culf M, Belacel N, Culf AS, Chute IC, Ouellette RJ, Burton IW, et al. NMR metabolic analysis of samples using fuzzy K-means clustering. Magn Reson Chem. 2009;47 Suppl 1(S1):S96.
- 34. Hu Y, Zhou M, Shi H, Ju H, Jiang Q, Cheng L. InfDisSim: a novel method for measuring disease similarity based on information flow. In: IEEE International Conference on Bioinformatics and Biomedicine. 2017:20–6.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.