Collecting Data with Nutch and Solr

  • Michael Frampton
Chapter

Abstract

Many companies collect vast amounts of data from the web by using web crawlers such as Apache Nutch. Available for more than ten years, Nutch is an open-source product provided by Apache and has a large community of committed users. Solr, an open-source search platform built on Apache Lucene, can be used with Nutch to index and search the data that Nutch collects. When you combine this functionality with Hadoop, you can store the resulting large volume of data directly in a distributed file system.

Keywords

File System; Seed File; Distribute File System; Command Sequence; Schema File
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Michael Frampton 2015

Authors and Affiliations

  • Michael Frampton
    • 1
  1. Paraparaumu, New Zealand
