Enterprise Search support for Apache Lucene and Solr by Lucid Imagination

From, Date

  1. lionel duboeuf, 2010-02-05 04:27
  2. Uwe Schindler, 2010-02-05 04:40
  3. Ard Schrijvers, 2010-02-05 04:53
  4. lionel duboeuf, 2010-02-08 12:12
  5. lionel duboeuf, 2010-02-08 12:13

[general] Document Frequency for a set of documents

Subject: Re: Document Frequency for a set of documents
From: lionel duboeuf <lionel.duboeuf@...>
Date: 2010-02-08 12:13

Thanks, Ard, for your response. I found it useful.

regards.
lionel

Ard Schrijvers wrote:
Crossposting to the user list, as I think this issue belongs there. See my comments inline.

On Fri, Feb 5, 2010 at 10:27 AM, lionel duboeuf <lionel.duboeuf@boozter.com> wrote:
Hi,

Sorry for asking again. I still have not found a scalable solution to get the document frequency of a term t restricted to a set of documents. Lucene only stores the document frequency for the global corpus, but I would like to be able to get the document frequency of a term within a subset of documents only (i.e. a user's collection of documents). I guess that querying the index to get the number of hits for each term and for each field, filtered by user, will be too slow. Any ideas?

I have recently developed out-of-the-box faceted navigation exposed over JCR (Hippo Repository on top of Jackrabbit), and I think you are looking for efficient faceted navigation as well, right? I am also interested to hear whether others have something to add to my findings.

You can approach your issue from two different angles. Depending on the number of results versus the number of terms (unique facet values), I think you can best switch at runtime between the two approaches.

Approach (1): the Lucene TermEnum is leading. If the Lucene field has *many* (say more than 100,000) unique values, this becomes slow, and approach (2) might be better. You have a BitSet matchingDocs, and you want the count for every term of the field that occurs in at least one of the documents in matchingDocs. Supposing your field is 'brand', you can do:

    // internalFacetName must be the interned field name ("brand" in this example)
    TermEnum termEnum = indexReader.terms(new Term("brand", ""));
    // iterate through all the values of this facet and look at the number of hits per term
    try {
        // open termDocs only once, and use seek: this is more efficient
        TermDocs termDocs = indexReader.termDocs();
        try {
            do {
                Term term = termEnum.term();
                int count = 0;
                if (term != null && term.field() == internalFacetName) { // interned comparison
                    termDocs.seek(term);
                    while (termDocs.next()) {
                        if (matchingDocs.get(termDocs.doc())) {
                            count++;
                        }
                    }
                    if (count > 0) {
                        if (!"".equals(term.text())) {
                            facetValueCountMap.put(term.text(), new Count(count));
                        }
                    }
                } else {
                    // past the last term of this field
                    break;
                }
            } while (termEnum.next());
        } finally {
            termDocs.close();
        }
    } finally {
        termEnum.close();
    }

Approach (2): the matching docs are leading. All Lucene fields that should be usable for your facet counts must be indexed with TermVectors. This approach becomes slow when the matching docs grow beyond 100,000 hits; in that case, rather use approach (1). Create your own HitCollector, and have its collect method look something like:

    public final void collect(final int docid, final float score) {
        try {
            if (facetMap != null) {
                final TermFreqVector tfv = reader.getTermFreqVector(docid, internalName);
                if (tfv != null) {
                    for (int i = 0; i < tfv.getTermFrequencies().length; i++) {
                        addToFacetMap(tfv.getTerms()[i]);
                    }
                }
            }
        } catch (IOException e) {
            // handle or log the IOException thrown by getTermFreqVector
        }
    }

Note that HitCollectors are not advised for large hit sets; see also [1].

This is how I currently have a really performant faceted navigation exposed as a JCR tree. If somebody has tried other ways, or has something to add, I would be interested.

Regards,
Ard

[1] http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/HitCollector.html

regards,
Lionel
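
The two counting strategies discussed in this thread can be sketched without any Lucene dependency. The following is only an illustration under stated assumptions: the class SubsetDocFreq, its method names, and the in-memory maps standing in for posting lists and term vectors are all hypothetical and are not part of Lucene or of the code posted above. The cutoff used in count() to switch between the two strategies is likewise an arbitrary assumption, not a figure from the thread.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: plain-Java stand-ins for an index, no Lucene calls. */
final class SubsetDocFreq {

    /** Terms lead: postings maps each term to the ids of the docs containing it. */
    static Map<String, Integer> countByTerms(Map<String, int[]> postings, BitSet matchingDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            int count = 0;
            for (int doc : entry.getValue()) {
                if (matchingDocs.get(doc)) {
                    count++;
                }
            }
            if (count > 0) { // skip terms that never occur in the subset
                counts.put(entry.getKey(), count);
            }
        }
        return counts;
    }

    /** Docs lead: termVectors maps each doc id to the distinct terms of that doc. */
    static Map<String, Integer> countByDocs(Map<Integer, String[]> termVectors, BitSet matchingDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc = matchingDocs.nextSetBit(0); doc >= 0; doc = matchingDocs.nextSetBit(doc + 1)) {
            String[] terms = termVectors.get(doc);
            if (terms == null) {
                continue; // no term vector stored for this doc
            }
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Assumed heuristic: let the smaller side (subset docs or unique terms) lead. */
    static Map<String, Integer> count(Map<String, int[]> postings,
                                      Map<Integer, String[]> termVectors,
                                      BitSet matchingDocs) {
        if (matchingDocs.cardinality() < postings.size()) {
            return countByDocs(termVectors, matchingDocs);
        }
        return countByTerms(postings, matchingDocs);
    }
}
```

Both methods return the same per-term document counts for a given subset; they differ only in which side drives the iteration, which is why the cost of one grows with the number of unique terms and the cost of the other with the size of the subset.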


Apache Solr, Apache Lucene, ApacheCon and their logos are trademarks of the Apache Software Foundation.

© 2010 Lucid Imagination. All rights reserved.