Lucid Imagination


  From              Date
  Stefano Cherchi   2010-02-04 11:00
  Stefano Cherchi   2010-02-08 07:26
  Julien Nioche     2010-02-08 08:24
  Stefano Cherchi   2010-02-09 09:51
  Julien Nioche     2010-02-09 11:01

[nutch-user] Nutch + Solr: filtering URL while indexing

Subject: Re: Nutch + Solr: filtering URL while indexing
From: Julien Nioche <lists.digitalpebble@...>
Date: 2010-02-09 11:01

If I understand correctly, at the moment I have to perform a mergesegments cycle before the final indexing to filter out the undesired URLs?

It would probably be easier and quicker to filter the crawlDB instead. Have a look at the command mergedb, which will create a filtered copy of your crawldb that you can then use for the indexing.

J.

As for adding filtering to the map method, I need to take some time to investigate further, and I hope I can contribute that feature soon.

S

----------------------------------
"Anyone proposing to run Windows on servers should be prepared to explain what they know about servers that Google, Yahoo, and Amazon don't." Paul Graham
"A mathematician is a device for turning coffee into theorems." Paul Erdos (who obviously never met a sysadmin)

----- Original message -----
From: Julien Nioche <lists.digitalpebble@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Mon 8 February 2010, 14:24:40
Subject: Re: Nutch + Solr: filtering URL while indexing

Hi,

You'd need to filter the URLs from the segments as well before you index. Removing the entries from the linkDB will just prevent them from getting anchor fields; they'll still be added to the index. Look at the class IndexerMapReduce for more details.

An option would be to add support for URLfilters in the map method to be able to determine which URLs to remove from the indexing altogether. This is pretty trivial to implement and could be a nice contribution. Feel free to submit it to JIRA if you implement it.

HTH
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

On 8 February 2010 12:26, Stefano Cherchi wrote:

Is there nobody out there who can provide some kind of hint? I'm really stuck with this problem and I cannot figure out what else I can do.

Thanks
S

----- Original message -----
From: Stefano Cherchi
To: nutch-user@lucene.apache.org
Sent: Thu 4 February 2010, 17:00:35
Subject: Nutch + Solr: filtering URL while indexing

Hi everybody.

I've been struggling for three days now with a quite trivial problem, without finding a solution. I need to index a few web sites with the following structure:

Page type 1: list of posts (http://www.website.com/list.html?page=XXx), where XXx is a progressive number from 00 to 999. Each page has links to the following and the previous list page.

Page type 2: the actual post page (http://www.website.com/post--x_y_z.html), where xyz is an arbitrary string of letters and numbers representing the post title.

Page type 3: other content such as static pages, external links, and other unwanted and useless stuff.

I need to crawl pages of both type 1 and type 2, but I want to index only type 2. Crawling pages of type 1 is the only way to reach type 2, because pages of type 2 have unpredictable URLs.

So I'm performing a step-by-step indexing this way. I set the following regular expressions in regex-urlfilter.txt:

  +^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
  +^http://www.website.com/post--
  -.

I inject http://www.website.com/list.html?page=00, then I cycle N times:

  generate
  fetch
  parse
  updatedb

and I can see that only type 1 and type 2 pages are actually crawled and fetched. Great.

Then I edit regex-urlfilter.txt, leaving only:

  +^http://www.website.com/post--
  -.

and perform:

  invertlinks (with filtering on)
  solrindex

Now I would expect all type 1 pages to be stripped from the linkdb and only type 2 pages to be added to the Solr index, but when I browse the indexed documents I still find both type 1 and type 2 pages.

Can someone please explain why?

Thank you.
S

-- DigitalPebble Ltd http://www.digitalpebble.com
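
Editor's sketch, not part of the original thread: the filter rules discussed above can be modelled in a few lines of Python to check which URLs each rule set would let through. This is only a simulation under stated assumptions. It assumes the regex URL filter applies the first rule whose pattern is found in the URL and rejects a URL that no rule matches. The names parse_rules, accepts and filter_records, and the three sample URLs, are hypothetical and are not Nutch APIs. It shows what a filtered copy of the crawldb, or a URL filter applied in the indexer's map step, would need to keep.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def filter_records(records, rules):
    """Keep only the records whose URL key passes the rules."""
    return {url: value for url, value in records.items() if accepts(rules, url)}


# Rule sets copied from the original message.
CRAWL_RULES = """
+^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
+^http://www.website.com/post--
-.
"""
INDEX_RULES = """
+^http://www.website.com/post--
-.
"""

# Hypothetical sample records keyed by URL, one per page type.
records = {
    "http://www.website.com/list.html?page=00": "type 1: list page",
    "http://www.website.com/post--a_b_c.html": "type 2: post page",
    "http://www.website.com/about.html": "type 3: static page",
}

crawled = filter_records(records, parse_rules(CRAWL_RULES))
indexable = filter_records(records, parse_rules(INDEX_RULES))

print(sorted(crawled))    # list page and post page
print(sorted(indexable))  # post page only
```

Under these assumptions the second rule set rejects the list pages, so the rules themselves are not the problem. As Julien explains, the list pages reach Solr because the filter is only applied to the linkdb during invertlinks, not to the data that the indexing step reads.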

