Abstract
Many companies collect vast amounts of data from the web by using web crawlers such as Apache Nutch. Available for more than ten years, Nutch is an open-source product provided by Apache and has a large community of committed users. Solr, an open-source search platform built on Apache Lucene, can be used with Nutch to index and search the data that Nutch collects. When you combine this functionality with Hadoop, you can store the resulting large volume of data directly in a distributed file system.
Notes
1. I own the site and its contents, so there are no issues with processing the site contents and displaying them here.
Copyright information
© 2015 Michael Frampton
About this chapter
Cite this chapter
Frampton, M. (2015). Collecting Data with Nutch and Solr. In: Big Data Made Easy. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-0094-0_3
DOI: https://doi.org/10.1007/978-1-4842-0094-0_3
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-0095-7
Online ISBN: 978-1-4842-0094-0
eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)