Abstract
Web data often manifest high levels of uncertainty. We focus on categorical Web data and represent these uncertainty levels as first- or second-order uncertainty. By means of concrete examples, we show how to quantify and handle these uncertainties using the Beta-Binomial and the Dirichlet-Multinomial models, as well as how to take into account possibly unseen categories in our samples by using the Dirichlet process. We conclude by exemplifying how these higher-order models can be used as a basis for analyzing datasets once at least part of their uncertainty has been taken into account. We demonstrate how to use the Bhattacharyya statistical distance to quantify the similarity between Dirichlet distributions, and we use the results to analyze a Web dataset of piracy attacks both visually and automatically.
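As a rough illustration of the last step, the sketch below compares two Dirichlet distributions obtained from categorical counts. It is not the chapter's own code: the counts are hypothetical, the symmetric prior of 1 per category is an assumption, and the function names are invented for this example. It relies on the closed form of the Bhattacharyya distance for Dirichlet densities, D(α, β) = ½(ln B(α) + ln B(β)) − ln B((α + β)/2), where B is the multivariate Beta function.

```python
import math


def log_multivariate_beta(alpha):
    """Log of the multivariate Beta function, the Dirichlet normalizer."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def dirichlet_posterior(counts, prior=1.0):
    """Posterior Dirichlet parameters for observed category counts.

    Assumes a symmetric Dirichlet prior (prior=1.0 is an illustrative choice).
    """
    return [c + prior for c in counts]


def bhattacharyya_distance(alpha, beta):
    """Bhattacharyya distance between Dirichlet(alpha) and Dirichlet(beta)."""
    if len(alpha) != len(beta):
        raise ValueError("both distributions need the same number of categories")
    midpoint = [(a + b) / 2.0 for a, b in zip(alpha, beta)]
    return (0.5 * (log_multivariate_beta(alpha) + log_multivariate_beta(beta))
            - log_multivariate_beta(midpoint))


# Hypothetical counts of three categories observed in two samples.
counts_a = [12, 5, 3]
counts_b = [4, 9, 7]

post_a = dirichlet_posterior(counts_a)
post_b = dirichlet_posterior(counts_b)

d_same = bhattacharyya_distance(post_a, post_a)  # identical distributions
d_diff = bhattacharyya_distance(post_a, post_b)  # differing distributions
print(d_same, d_diff)
```

A distance close to zero indicates that the two samples support nearly the same second-order distribution over categories, while larger values indicate that they diverge.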
Notes
The code is available at http://trustingwebdata.org/books/URSW_III/DP.zip.
Acknowledgments
This research was partially supported by the Data2Semantics Media project in the Dutch national program COMMIT.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Ceolin, D., van Hage, W.R., Fokkink, W., Schreiber, G. (2014). Uncertainty Estimation and Analysis of Categorical Web Data. In: Bobillo, F., et al. Uncertainty Reasoning for the Semantic Web III. URSW 2011, URSW 2012, URSW 2013. Lecture Notes in Computer Science, vol 8816. Springer, Cham. https://doi.org/10.1007/978-3-319-13413-0_14
DOI: https://doi.org/10.1007/978-3-319-13413-0_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13412-3
Online ISBN: 978-3-319-13413-0
eBook Packages: Computer Science, Computer Science (R0)