Adaptive and Parallel Data Acquisition from Online Big Graphs

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10827)

Abstract

Acquiring content from online big graphs (OBGs), such as linked Web pages, social networks, and knowledge graphs, is a critical part of the data infrastructure for Web applications and massive data analysis. However, effective data acquisition is challenging because OBGs are massive, heterogeneous, and dynamically evolving, and their global topological structure is unknown. In this paper, we present an adaptive and parallel approach to effective data acquisition from OBGs. Drawing on the ideas of Quasi Monte Carlo (QMC) and branch & bound methods, we propose an adaptive Web-scale sampling algorithm for parallel data collection, implemented on Spark. Experimental results show the effectiveness and efficiency of our method.
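The abstract names Quasi Monte Carlo as one ingredient of the sampling algorithm but does not spell out the procedure. As a rough illustration of the general QMC idea only, and not the authors' algorithm, the sketch below generates a Halton low-discrepancy sequence and uses it to choose nodes from a crawl frontier more evenly than pseudo-random draws typically would. The function names and the `pick_frontier` helper are hypothetical and introduced purely for this example.

```python
def radical_inverse(index: int, base: int) -> float:
    """Van der Corput radical inverse: mirror the base-`base` digits of
    `index` around the radix point, giving a value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton(n_points: int, bases=(2, 3)):
    """First `n_points` of the Halton sequence, one coordinate per base.
    Indices start at 1 so the all-zero point is skipped."""
    return [
        tuple(radical_inverse(i, b) for b in bases)
        for i in range(1, n_points + 1)
    ]


def pick_frontier(frontier, n_picks):
    """Hypothetical helper (an assumption, not from the paper): map 1-D
    low-discrepancy points onto positions in a list of candidate nodes."""
    picks = []
    for (u,) in halton(n_picks, bases=(2,)):
        # u is always < 1, so the index stays within the list bounds.
        picks.append(frontier[int(u * len(frontier))])
    return picks


if __name__ == "__main__":
    print(halton(3))
    print(pick_frontier(["a", "b", "c", "d"], 3))
```

Successive Halton points fill the unit interval progressively, so the selected positions spread across the whole frontier rather than clustering. How the paper combines such sampling with branch & bound and with Spark is not stated in the abstract and is not modeled here.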

Keywords

Online big graph, Data acquisition, Adaptive collection, Parallel crawler, Spark

Acknowledgment

This paper was supported by the National Natural Science Foundation of China (Nos. 61472345, 61562090), the Program for Excellent Young Talents of Yunnan University (No. WX173602), the Research Foundation of Yunnan University (No. 2017YDJQ06), and the Research Foundation of the Educational Department of Yunnan Province (No. 2017ZZX228).

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. School of Information Science and Engineering, Yunnan University, Kunming, China
