World Wide Web, Volume 22, Issue 4, pp 1577–1610

Deep Web crawling: a survey

  • Inma Hernández
  • Carlos R. Rivero
  • David Ruiz

Abstract

Deep Web crawling refers to the problem of traversing the pages of a deep Web site, which are generated dynamically in response to a query submitted through a search form. To achieve this, crawlers need features that go beyond merely following links, such as the ability to automatically discover the search forms that are entry points to the deep Web, fill in those forms, and follow certain paths to reach the deep Web pages with relevant information. Current surveys of the state of the art in deep Web crawling do not provide a framework for comparing the most up-to-date proposals across all the aspects involved in the deep Web crawling process. In this article, we propose a framework that analyses the main features of existing deep Web crawling techniques, including the most recent proposals, and provides an overall picture of deep Web crawling, including novel features that previous surveys had not analysed. Our main conclusion is that crawler evaluation is an immature research area, because there is no standard set of performance measures and no benchmark or publicly available dataset with which to evaluate crawlers. In addition, we conclude that future work in this area should focus on devising crawlers that cope with ever-evolving Web technologies and on improving crawling efficiency and scalability, in order to create effective crawlers that can operate in real-world contexts.
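
As a rough illustration of the first two capabilities named in the abstract (discovering search forms and filling them in), the following Python sketch uses only the standard library. It is not taken from the article or from any surveyed proposal: the names `FormFinder`, `find_search_forms` and `fill_form` are hypothetical, and the rule used to recognise a search form (a GET form with a free-text field and no password field) is a simplifying assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFinder(HTMLParser):
    """Collects every <form> on a page with its action, method and named inputs."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif tag == "input" and self._current is not None and attrs.get("name"):
            field_type = (attrs.get("type") or "text").lower()
            self._current["fields"].append(
                (attrs["name"], field_type, attrs.get("value") or "")
            )

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_search_forms(html):
    """Return forms that look like search entry points.

    Assumed heuristic: GET method, at least one text/search input,
    and no password input (which would suggest a login form).
    """
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    candidates = []
    for form in finder.forms:
        types = [field_type for _, field_type, _ in form["fields"]]
        has_text = "text" in types or "search" in types
        if form["method"] == "get" and has_text and "password" not in types:
            candidates.append(form)
    return candidates


def fill_form(page_url, form, keyword):
    """Build the query URL obtained by typing `keyword` into each free-text field.

    Hidden fields keep their default values; other input types are ignored
    in this sketch, as are actions that already carry a query string.
    """
    params = []
    for name, field_type, value in form["fields"]:
        if field_type in ("text", "search"):
            params.append((name, keyword))
        elif field_type == "hidden":
            params.append((name, value))
    return urljoin(page_url, form["action"]) + "?" + urlencode(params)
```

A crawler built along these lines would fetch each URL returned by `fill_form` and then follow links from the result pages. Choosing which keywords to submit and which paths to follow are separate problems that this sketch does not address.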

Keywords

Deep Web, Web crawling, Form filling, Query selection, Survey

Notes

Acknowledgements

The authors would like to thank Dr. Rafael Corchuelo for his support and assistance throughout the entire research process that led to this article, and for his helpful and constructive comments, which greatly contributed to improving the article. They would also like to thank the anonymous reviewers of this and past submissions, whose comments helped shape the current version. Supported by the European Commission (FEDER) and the Spanish and Andalusian R&D&I programmes (grants TIN2016-75394-R and TIN2013-40848-R).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Languages and Computer Systems, University of Seville, Seville, Spain
  2. Department of Computer Science, Rochester Institute of Technology, Rochester, USA