Scalability of the Nutch search engine (Conference Paper)


  • Nutch is an open source search engine that is gaining increasing popularity in the commercial world. The Nutch architecture lends itself to a wide range of parallelization techniques. Multiple backend servers can be used both to partition the corpus of search data, thus increasing the rate of queries serviced, and to increase the size of the search data while preserving the service rate. Alternatively, multiple search engines can operate in parallel, further increasing the query rate. In this paper, we analyze the performance and scalability of various configurations of Nutch. The configurations were implemented as part of the Commercial Scale Out project at IBM Research, and were used to investigate the applicability of scale-out architectures in commercial environments. We conclude that Nutch is highly scalable, with the different configurations behaving differently from a performance perspective. Copyright 2007 ACM.

name of conference

  • Proceedings of the 21st annual international conference on Supercomputing

published proceedings

  • Proceedings of the 21st annual international conference on Supercomputing

author list (cited authors)

  • Moreira, J. E., Michael, M. M., Da Silva, D., Shiloach, D., Dube, P., & Zhang, L.

citation count

  • 21

complete list of authors

  • Moreira, José E; Michael, Maged M; Da Silva, Dilma; Shiloach, Doron; Dube, Parijat; Zhang, Li

publication date

  • January 2007