[nutch-user] Nutch + Solr: filtering URL while indexing

Subject:
Nutch + Solr: filtering URL while indexing
From:
Stefano Cherchi <stefanocherchi@...>
Date:
2010-02-04 11:00
Hi everybody. I've been struggling for three days with what should be a trivial problem, and I still have no solution.

I need to index a few web sites with the following structure:

Page type 1: list of posts (http://www.website.com/list.html?page=XXX), where XXX is a sequential number from 00 to 999. Each page links to the next and the previous list page.
Page type 2: the actual post page (http://www.website.com/post--x_y_z.html), where x_y_z is an arbitrary string of letters and numbers representing the post title.
Page type 3: other content such as static pages, external links, and other unwanted material.

I need to crawl pages of both type 1 and type 2, but I want to index only type 2. Crawling type 1 pages is the only way to reach type 2 pages, because type 2 URLs are unpredictable. So I am indexing step by step, as follows:

I set the following regular expressions in regex-urlfilter.txt:
+^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
+^http://www.website.com/post--
-.
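
To sanity-check rules like these outside Nutch, here is a minimal Python sketch. It assumes the filter tries the rules in file order, lets the first matching rule decide, and drops a URL that matches no rule. The function name `accepts` and the sample URLs are made up for illustration, and Python's `re` stands in for the Java regex engine that Nutch uses.

```python
import re

# Rules copied from the post, in file order: (sign, pattern).
CRAWL_RULES = [
    ("+", r"^http://www.website.com/list.html[?]page[=][0-9]{2,3}$"),
    ("+", r"^http://www.website.com/post--"),
    ("-", r"."),
]

def accepts(url, rules):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

if __name__ == "__main__":
    # Hypothetical sample URLs, one per page type.
    samples = [
        "http://www.website.com/list.html?page=00",
        "http://www.website.com/post--x_y_z.html",
        "http://www.website.com/about.html",
    ]
    for url in samples:
        print(url, accepts(url, CRAWL_RULES))
```

Note that the dots in these patterns are unescaped, so they match any character; that is looser than intended but does not change the result for the URLs described here.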

inject (http://www.website.com/list.html?page=00)

then I cycle N times
generate
fetch
parse
updatedb

and I can see that only type 1 and type 2 pages are actually crawled and fetched. Great.
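
For reference, the generate/fetch/parse/updatedb cycle above could be scripted roughly like this. It is only a sketch: the `bin/nutch` launcher path and the `crawl/crawldb` and `crawl/segments` directories are assumptions about the local layout, and the helper names are invented. It relies on segment directories being named by timestamp, so the newest one sorts last.

```python
import os
import subprocess

NUTCH = "bin/nutch"          # assumed path to the Nutch launcher script
CRAWLDB = "crawl/crawldb"    # assumed crawl database directory
SEGMENTS = "crawl/segments"  # assumed parent directory of all segments

def newest_segment(segments_dir):
    """Pick the most recent segment; timestamp names sort chronologically."""
    names = sorted(os.listdir(segments_dir))
    return os.path.join(segments_dir, names[-1])

def post_generate_commands(segment):
    """Commands to run on the segment that 'generate' has just created."""
    return [
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", CRAWLDB, segment],
    ]

def run_cycle():
    """One round: generate a segment, then fetch, parse and update the crawldb."""
    subprocess.run([NUTCH, "generate", CRAWLDB, SEGMENTS], check=True)
    for command in post_generate_commands(newest_segment(SEGMENTS)):
        subprocess.run(command, check=True)
```

Calling `run_cycle()` N times in a loop corresponds to the N rounds described above.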

Then I edit regex-urlfilter.txt, leaving only:
+^http://www.website.com/post--
-.

and perform
invertlinks (with filtering on)
solrindex

Now I would expect all type 1 pages to be stripped from the linkdb and only type 2 pages to be added to the Solr index, but when I browse the indexed documents I still find both type 1 and type 2 pages.
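
To put a number on what ended up in the index, the document URLs could be classified by page type along these lines. This is a sketch only: how the URLs are pulled out of Solr is left open, and `page_type` and `count_types` are hypothetical helper names. The patterns restate the three page types described above, with the dots escaped.

```python
import re
from collections import Counter

# Patterns for the page types described above.
LIST_PAGE = re.compile(r"^http://www\.website\.com/list\.html\?page=[0-9]{2,3}$")
POST_PAGE = re.compile(r"^http://www\.website\.com/post--")

def page_type(url):
    """Label a URL as type 1 (list page), type 2 (post page) or type 3 (other)."""
    if LIST_PAGE.search(url):
        return "type 1"
    if POST_PAGE.search(url):
        return "type 2"
    return "type 3"

def count_types(urls):
    """Count how many of the given URLs fall into each page type."""
    return Counter(page_type(url) for url in urls)
```

Any non-zero count for "type 1" over the indexed URLs confirms that list pages reached the index.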

Can someone please explain why?

Thank you.

S

---------------------------------- 
"Anyone proposing to run Windows on servers should be prepared to explain 
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham


"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)
