If I understand correctly, at the moment I have to run a mergesegs cycle before the final indexing to filter out the undesired URLs?
It would probably be easier and quicker to filter the crawlDB instead.
Have a look at the command mergedb; it will create a filtered copy
of your crawldb, which you can then use for the indexing.
J.
As for adding filtering to the map method, I need some time to investigate further, and I hope I can contribute that feature soon.
S
----------------------------------
"Anyone proposing to run Windows on servers should be prepared to explain
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham
"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)
----- Original message -----
From: Julien Nioche <lists.digitalpebble@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Mon 8 February 2010, 14:24:40
Subject: Re: Nutch + Solr: filtering URL while indexing
Hi,
You'd need to filter the URLs from the segments as well before you
index. Removing the entries from the linkDB will just prevent them
from getting anchor fields - they'll still be added to the index.
Look at the class IndexerMapReduce for more details.
An option would be to add support for URLfilters in the map method to
be able to determine which URLs to remove from the indexing
altogether. This is pretty trivial to implement and could be a nice
contribution. Feel free to submit it to JIRA if you implement it.
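
A minimal sketch of the idea, in Python rather than Nutch's Java and not
the actual IndexerMapReduce code: the map step asks a URL filter about each
record and emits only the ones that pass. The names url_filter, index_map
and segment are hypothetical, the rules are the index-phase rules from the
original question quoted further down, and the "return the URL or nothing"
convention is an assumption made for illustration.

```python
import re

# Index-phase rules from the original question: keep only post pages.
INDEX_RULES = [
    ("+", r"^http://www.website.com/post--"),
    ("-", r"."),
]

def url_filter(url):
    """Return the URL if accepted, else None (first matching rule wins)."""
    for sign, pattern in INDEX_RULES:
        if re.search(pattern, url):
            return url if sign == "+" else None
    return None  # assumption: a URL matching no rule is rejected

def index_map(records):
    """Hypothetical map step: emit only records whose URL passes the filter."""
    for url, value in records:
        if url_filter(url) is not None:
            yield url, value

# Toy stand-in for segment data: (url, content) pairs.
segment = [
    ("http://www.website.com/list.html?page=00", "list page content"),
    ("http://www.website.com/post--x_y_z.html", "post content"),
]
indexed = list(index_map(segment))
print(indexed)
```

With filtering done at this point, the list pages would still be crawled
and stored in the segments but would never reach the index.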
HTH
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
On 8 February 2010 12:26, Stefano Cherchi wrote:
Is there nobody out there who can provide some kind of hint?
I'm really stuck with this problem and I cannot figure out what else I can do.
Thanks
S
----- Original message -----
From: Stefano Cherchi
To: nutch-user@lucene.apache.org
Sent: Thu 4 February 2010, 17:00:35
Subject: Nutch + Solr: filtering URL while indexing
Hi everybody. I've been struggling for three days now with what looks like
a quite trivial problem, without finding a solution.
I need to index a few web sites with the following structure:
Page type 1: list of posts (http://www.website.com/list.html?page=XXx), where
XXx is a progressive number from 00 to 999. Each page links to the next and
the previous list page.
Page type 2: the actual post page (http://www.website.com/post--x_y_z.html),
where x_y_z is an arbitrary string of letters and numbers representing the
post title.
Page type 3: other content such as static pages, external links, and other
unwanted and useless stuff.
I need to crawl pages of both type 1 and 2 but I want to index only type 2.
Crawling pages of type 1 is the only way to reach type 2 because pages of
type 2 have unpredictable URLs. So I'm performing a step-by-step indexing
this way:
I set the following regular expressions in regex-urlfilter.txt
+^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
+^http://www.website.com/post--
-.
inject (http://www.website.com/list.html?page=00)
then I cycle N times
generate
fetch
parse
updatedb
and I can see that only type 1 and type 2 pages are actually crawled and
fetched. Great.
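
The crawl-phase behaviour above can be reproduced with a small sketch. It
assumes the rules are tried top to bottom, that the first pattern found in
the URL decides (+ keeps, - drops), and that a URL matching no rule is
dropped. The function accepts and the sample URL about.html are
hypothetical and only serve the illustration.

```python
import re

# Crawl-phase rules copied from the message above: (sign, pattern).
CRAWL_RULES = [
    ("+", r"^http://www.website.com/list.html[?]page[=][0-9]{2,3}$"),
    ("+", r"^http://www.website.com/post--"),
    ("-", r"."),
]

def accepts(url, rules):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL matching no rule is rejected

urls = [
    "http://www.website.com/list.html?page=00",   # type 1
    "http://www.website.com/post--x_y_z.html",    # type 2
    "http://www.website.com/about.html",          # type 3 (hypothetical example)
]
results = {u: accepts(u, CRAWL_RULES) for u in urls}
for url, ok in results.items():
    print("keep" if ok else "drop", url)
```

Note that the unescaped dots in www.website.com match any character, so the
patterns are slightly looser than they look; writing them as \. would make
them exact.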
Then I edit the regex-urlfilter.txt leaving only
+^http://www.website.com/post--
-.
and perform
invertlinks (with filtering on)
solrindex
Now I would expect all type 1 pages to be stripped away from the linkdb and
only type 2 pages to be added to the Solr index, but when I browse the
indexed documents I still find both type 1 and type 2 pages.
Can someone please explain why?
Thank you.
S