Rules of Thumb for Information Acquisition from Large and Redundant Data

Gatterbauer, Wolfgang

doi:10.1007/978-3-642-20161-5_47

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6611)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

We develop an abstract model of information acquisition from redundant data. We assume a random sampling process over data that contain information with bias, and we are interested in the fraction of information we expect to learn as a function of (i) the sampled fraction (recall) and (ii) the varying bias of information (redundancy distributions). We develop two rules of thumb with varying robustness. We first show that, when information bias follows a Zipf distribution, the 80-20 rule or Pareto principle surprisingly does not hold; rather, we expect to learn less than 40% of the information when randomly sampling 20% of the overall data. We then analytically prove that, for large data sets, randomized sampling from power-law distributions leads to “truncated distributions” with the same power-law exponent. This second rule is very robust and also holds for distributions that deviate substantially from a strict power law. We further give one particular family of power-law functions that remain completely invariant under sampling. Finally, we validate our model with two large Web data sets: link distributions to web domains and tag distributions on delicious.com.
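
The first rule of thumb can be illustrated with a small numerical sketch. The abstract does not spell out the exact model, so the following rests on assumptions that are ours, not necessarily the paper's: each piece of information occurs k times in the data (its redundancy), the fraction of pieces with redundancy k is proportional to k^-2 (one common reading of a Zipf distribution), every occurrence is sampled independently with probability r, and a piece counts as learned if at least one of its occurrences is sampled. The function names and the cutoff `k_max` are hypothetical.

```python
import random


def zipf_redundancy_weights(k_max, exponent=2.0):
    """Fraction of pieces with redundancy k, proportional to k**-exponent.

    Assumption: exponent 2 for the redundancy frequency distribution,
    truncated at k_max so that the weights can be normalized numerically.
    """
    raw = [k ** -exponent for k in range(1, k_max + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def expected_learned_fraction(r, weights):
    """Expected fraction of distinct pieces seen when sampling a fraction r.

    A piece with redundancy k is missed only if all k of its occurrences
    are missed, which happens with probability (1 - r)**k.
    """
    return sum(w * (1.0 - (1.0 - r) ** k)
               for k, w in enumerate(weights, start=1))


def simulated_learned_fraction(r, weights, n_pieces=20000, seed=0):
    """Monte Carlo check of the closed-form expectation above."""
    rng = random.Random(seed)
    redundancies = rng.choices(range(1, len(weights) + 1),
                               weights=weights, k=n_pieces)
    learned = sum(1 for k in redundancies
                  if any(rng.random() < r for _ in range(k)))
    return learned / n_pieces


if __name__ == "__main__":
    weights = zipf_redundancy_weights(k_max=10_000)
    print("expected :", expected_learned_fraction(0.2, weights))
    print("simulated:", simulated_learned_fraction(0.2, weights))
```

Under these assumptions, sampling 20% of the data yields an expected learned fraction of roughly 0.35, which is consistent with the "less than 40%" bound stated above. A different exponent or a different sampling model would change the number.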

Author information

Authors and Affiliations

Computer Science and Engineering, University of Washington, Seattle, USA
Wolfgang Gatterbauer

Editor information

Editors and Affiliations

Information School, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Paul Clough
CLARITY: Centre for Sensor Web Technologies, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
Colum Foley, Cathal Gurrin & Hyowon Lee
Centre for Next Generation Localisation, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
Gareth J. F. Jones
TNO Human Factors, Brassersplein 2, 2612 CT, Delft, The Netherlands
Wessel Kraaij
Yahoo! Research, 177 Diagonal, 08018, Barcelona, Spain
Vanessa Murdock

About this paper

Cite this paper

Gatterbauer, W. (2011). Rules of Thumb for Information Acquisition from Large and Redundant Data. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_47

DOI: https://doi.org/10.1007/978-3-642-20161-5_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20160-8
Online ISBN: 978-3-642-20161-5
eBook Packages: Computer Science (R0)
