Does Lucene Scale?

Enterprise Search: The Case for Lucene: "Generally speaking, for systems with light to moderate traffic with reasonably simple queries on datasets up to 100,000 documents, our current impression is that Lucene should be adequate. We have seen reports of Lucene performing well on a 300,000 document dataset, and we have run queries on 800,000 document sets. Simple queries still performed reasonably.

If your datasets are routinely in the 100,000 document range, or if you will ever be searching more than 1 million records, you should investigate performance carefully.

If you require an average of more than 10 queries per second, we encourage you to at least do some performance testing before making decisions. This holds true for commercial vendors as well. Lucene does support some amount of threading.

Lucene does not do as well for systems with highly volatile data. When source data changes, the Lucene indices must be updated to reflect the new terms present in the modified content. For each "update", Lucene requires a pair of "delete" and an "add" transactions; and the "add" will only be visible to newly opened search sessions. This can cause search synchronization and/or latency issues if not properly handled."
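
To make the update behaviour in the quote concrete, here is a minimal sketch of the idea. It is a toy in-memory model, not Lucene's API: ToyIndex, Searcher, update and openSearcher are hypothetical names invented for illustration. An update is modelled as a delete followed by an add, and a search session only sees the documents that existed when it was opened.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the quoted update behaviour. Hypothetical names, not Lucene's API. */
class ToyIndex {
    private final Map<String, String> docs = new HashMap<>();

    /** An "update" is modelled as a "delete" followed by an "add". */
    void update(String id, String text) {
        docs.remove(id);    // the "delete" transaction
        docs.put(id, text); // the "add" transaction
    }

    /** Opens a search session over a point-in-time copy of the index. */
    Searcher openSearcher() {
        return new Searcher(new HashMap<>(docs));
    }

    /** A search session that sees only what existed when it was opened. */
    static class Searcher {
        private final Map<String, String> snapshot;

        Searcher(Map<String, String> snapshot) {
            this.snapshot = snapshot;
        }

        String get(String id) {
            return snapshot.get(id);
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.update("doc1", "old text");
        Searcher before = index.openSearcher();

        index.update("doc1", "new text");
        Searcher after = index.openSearcher();

        System.out.println(before.get("doc1")); // prints "old text"
        System.out.println(after.get("doc1"));  // prints "new text"
    }
}
```

The session opened before the second update keeps returning the old text until the application opens a new one, which is the synchronization and latency issue the quote warns about.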

If you start running into these problems with Lucene, you will need more advanced indexing and searching techniques; a good article to get started with is Advanced Text Indexing with Lucene.
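
Before reaching for those techniques, it is worth checking whether you are actually near the 10 queries per second mentioned in the quote. The following is a rough sketch of a single-threaded throughput probe. ThroughputProbe and measureQps are hypothetical names, the query is a stand-in Runnable rather than a real Lucene search, and the probe does no JIT warm-up, so treat its numbers as a first estimate only.

```java
/** Rough single-threaded throughput probe. Hypothetical helper, not part of Lucene. */
class ThroughputProbe {

    /** Runs the query the given number of times and returns queries per second. */
    static double measureQps(Runnable query, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            query.run();
        }
        // Guard against a zero elapsed time on very fast workloads.
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return iterations / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        // Stand-in workload: replace with a call into your own search code.
        Runnable fakeQuery = () -> {
            try {
                Thread.sleep(5); // pretend each query takes about 5 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        double qps = measureQps(fakeQuery, 100);
        System.out.printf("%.1f queries/second%n", qps);
    }
}
```

A realistic test would replay a representative set of queries against an index of production size, ideally from several threads, since the quote notes that Lucene supports some amount of threading.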

Comments

burtonator said…
No idea where they got their figures from. The key point about Lucene is that it supports a distributed query infrastructure. If you start to become overloaded on ONE box, you can easily build a parallel search infra on commodity hardware and the RemoteMultiSearcher...