Messages in this thread (From, Date):

  1. Stefano Cherchi, 2010-02-04 11:00
  2. Stefano Cherchi, 2010-02-08 07:26
  3. Julien Nioche, 2010-02-08 08:24
  4. Stefano Cherchi, 2010-02-09 09:51
  5. Julien Nioche, 2010-02-09 11:01

[nutch-user] Nutch + Solr: filtering URL while indexing

Subject: Re: Nutch + Solr: filtering URL while indexing
From: Julien Nioche <lists.digitalpebble@...>
Date: 2010-02-08 08:24

Hi,

You'd need to filter the URLs from the segments as well before you
index. Removing the entries from the linkDB will just prevent them
from getting anchor fields - they'll still be added to the index.
Look at the class IndexerMapReduce for more details.

An option would be to add support for URLFilters in the map method, so
that it can decide which URLs to drop from indexing altogether. This is
pretty trivial to implement and could be a nice contribution. Feel free
to submit it to JIRA if you implement it.
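
To make the idea concrete, here is a minimal, self-contained sketch of the kind of check a map step could apply before emitting a document for indexing. It is not Nutch code: the class IndexUrlFilterSketch and its methods are hypothetical, and it does not touch IndexerMapReduce or the Hadoop types the real map method works with. It also assumes the rule semantics usually described for regex-urlfilter.txt: rules are tried in order, the first pattern found in the URL decides, and a URL that matches no rule is rejected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a URL filter check in an indexing map step; not Nutch code. */
public class IndexUrlFilterSketch {

    /** One rule line: '+' accepts, '-' rejects, the rest is a regular expression. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses rule lines in the style of regex-urlfilter.txt, skipping blanks and '#' comments. */
    static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
        return rules;
    }

    /** Returns the URL if the first matching rule accepts it, otherwise null. */
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // assumption: a URL that matches no rule is rejected
    }

    /** Keeps only the URLs that should reach the index; a map step would skip the others. */
    static List<String> urlsToIndex(List<Rule> rules, List<String> segmentUrls) {
        List<String> kept = new ArrayList<>();
        for (String url : segmentUrls) {
            if (filter(rules, url) != null) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Index-time rules from the original question: keep post pages only.
        List<Rule> rules = parseRules(Arrays.asList(
                "+^http://www.website.com/post--",
                "-."));
        // Example URLs; the post title and about.html are made up for illustration.
        List<String> segmentUrls = Arrays.asList(
                "http://www.website.com/list.html?page=00",
                "http://www.website.com/post--a_b_c.html",
                "http://www.website.com/about.html");
        System.out.println(urlsToIndex(rules, segmentUrls));
        // prints [http://www.website.com/post--a_b_c.html]
    }
}
```

In a real patch the same decision would presumably be delegated to Nutch's configured URL filter plugins rather than to hand-rolled rule parsing, so that crawling and indexing can use different filter configurations.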

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 8 February 2010 12:26, Stefano Cherchi <stefanocherchi@yahoo.it> wrote:
Is there nobody out there who can provide some kind of hint? I'm really stuck with this problem and I cannot figure out what else I can do.

Thanks

S

----- Original message -----
From: Stefano Cherchi <stefanocherchi@yahoo.it>
To: nutch-user@lucene.apache..org
Sent: Thu 4 February 2010, 17:00:35
Subject: Nutch + Solr: filtering URL while indexing

Hi everybody.

I've been struggling for three days now with what looks like a fairly trivial problem, without finding a solution.

I need to index a few web sites with the following structure:

Page type 1: list of posts (http://www.website.com/list.html?page=XXx), where XXx is a progressive number from 00 to 999. Each page has links to the next and the previous list page.

Page type 2: the actual post page (http://www.website.com/post--x_y_z.html), where xyz is an arbitrary string of letters and numbers representing the post title.

Page type 3: other content such as static pages, external links, and other unwanted and useless stuff.

I need to crawl pages of both type 1 and type 2, but I want to index only type 2. Crawling pages of type 1 is the only way to reach type 2, because pages of type 2 have unpredictable URLs.

So I'm performing a step-by-step indexing this way. I set the following regular expressions in regex-urlfilter.txt:

  +^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
  +^http://www.website.com/post--
  -.

Then I run:

  inject (http://www.website.com/list.html?page=00)

and cycle N times through:

  generate
  fetch
  parse
  updatedb

I can see that only type 1 and type 2 pages are actually crawled and fetched. Great.

Then I edit regex-urlfilter.txt, leaving only:

  +^http://www.website.com/post--
  -.

and perform:

  invertlinks (with filtering on)
  solrindex

Now I would expect all type 1 pages to be stripped from the linkdb and only type 2 pages to be added to the Solr index, but when I browse the indexed documents I still find both type 1 and type 2 pages.

Can someone please explain why?

Thank you.

S

----------------------------------

"Anyone proposing to run Windows on servers should be prepared to explain what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham

"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)
