Truth Discovery in Material Science Databases
Instead of performing expensive experiments, industry commonly predicts important material properties from existing experimental results. Databases of experimental observations are widely used in materials science and engineering. However, these databases are expected to be noisy, both because they rely on human measurements and because they amalgamate various independent sources (research papers). As a result, different sources can report conflicting information. In this paper, we introduce a novel truth discovery approach that reduces the noise and filters out the incorrect conflicting information hidden in scientific databases. Our method ranks the data sources by considering the relationships between them, i.e., the amount of conflicting information and the amount of agreement, and also eliminates the conflicting information. The scalable Gaussian process interpolation technique (SGP) is then applied to the cleaned dataset to predict material properties. A comprehensive performance study has been carried out on a real-life scientific database. With our new approach, we are able to substantially improve the accuracy of SGP predictions and provide a more reliable database.
Keywords: Noisy Data; Interpolation Technique; Truth Discovery; Gaussian Process Regression; Training Database
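The abstract describes ranking sources by how much they agree or conflict with one another and then eliminating conflicting values. The sketch below illustrates that general idea only; it is not the algorithm from the paper, whose details are not given here. The function names (`score_sources`, `resolve_conflicts`), the relative tolerance `rel_tol`, the neutral score of 0.5 for sources that are never compared, and the sample measurements are all assumptions made for illustration.

```python
from collections import defaultdict


def score_sources(claims, rel_tol=0.05):
    """Score each source by the fraction of pairwise comparisons in which
    its value agrees (within rel_tol) with another source's value for the
    same item. `claims` is a list of (source, item, value) tuples."""
    by_item = defaultdict(list)
    for source, item, value in claims:
        by_item[item].append((source, value))

    agree = defaultdict(int)
    total = defaultdict(int)
    for entries in by_item.values():
        for i, (src_a, val_a) in enumerate(entries):
            for src_b, val_b in entries[i + 1:]:
                if src_a == src_b:
                    continue  # a source is not compared with itself
                scale = max(abs(val_a), abs(val_b), 1e-12)
                ok = abs(val_a - val_b) <= rel_tol * scale
                for src in (src_a, src_b):
                    total[src] += 1
                    agree[src] += int(ok)

    sources = {source for source, _, _ in claims}
    # Assumption: sources that were never compared get a neutral score.
    return {s: (agree[s] / total[s] if total[s] else 0.5) for s in sources}


def resolve_conflicts(claims, rel_tol=0.05):
    """For every item, keep the value reported by the highest-scoring source."""
    scores = score_sources(claims, rel_tol)
    by_item = defaultdict(list)
    for source, item, value in claims:
        by_item[item].append((source, value))

    cleaned = {}
    for item, entries in by_item.items():
        _, best_value = max(entries, key=lambda entry: scores[entry[0]])
        cleaned[item] = best_value
    return scores, cleaned


# Hypothetical measurements of one property, reported by three papers.
claims = [
    ("paper_A", "sample_1", 1.00),
    ("paper_B", "sample_1", 1.02),
    ("paper_C", "sample_1", 1.80),
    ("paper_A", "sample_2", 2.50),
    ("paper_C", "sample_2", 2.52),
    ("paper_B", "sample_3", 0.70),
    ("paper_C", "sample_3", 0.95),
]

scores, cleaned = resolve_conflicts(claims)
for source in sorted(scores, key=scores.get, reverse=True):
    print(f"{source}: {scores[source]:.2f}")
print(cleaned)
```

In this toy data, `paper_C` conflicts with the other sources most often, so it receives the lowest score and its values are dropped wherever a more trusted source reports the same sample. The cleaned dictionary could then serve as training data for an interpolation model such as the SGP mentioned in the abstract.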