Hi,
You'd need to filter the URLs from the segments as well before you
index. Removing the entries from the linkDB will just prevent them
from getting anchor fields - they'll still be added to the index.
Look at the class IndexerMapReduce for more details.
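To make the point above concrete, here is a toy model of that join. It is only a sketch: the two maps stand in for the segment data and the linkDB, and the class and method names (IndexJoinSketch, buildIndex) are made up for illustration and are not Nutch's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these maps stand in for Nutch's segment and linkDB data.
class IndexJoinSketch {

    // Builds one "document" line per URL found in the segments.
    // The linkDB is consulted only to look up anchors for that URL.
    static List<String> buildIndex(Map<String, String> segments,
                                   Map<String, List<String>> linkDb) {
        List<String> docs = new ArrayList<>();
        for (Map.Entry<String, String> page : segments.entrySet()) {
            List<String> anchors =
                linkDb.getOrDefault(page.getKey(), Collections.<String>emptyList());
            docs.add(page.getKey() + " anchors=" + anchors);
        }
        return docs;
    }

    public static void main(String[] args) {
        // Both page types were fetched and parsed, so both are in the segments.
        Map<String, String> segments = new LinkedHashMap<>();
        segments.put("http://www.website.com/list.html?page=00", "list page text");
        segments.put("http://www.website.com/post--a_b_c.html", "post page text");

        // After a filtered invertlinks, only the post page has a linkDB entry.
        Map<String, List<String>> linkDb = new HashMap<>();
        linkDb.put("http://www.website.com/post--a_b_c.html",
                   Collections.singletonList("A post"));

        for (String doc : buildIndex(segments, linkDb)) {
            System.out.println(doc);
        }
    }
}
```

Running main prints two documents: the list page is still there, just with an empty anchor list.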
An option would be to add support for URLfilters in the map method to
be able to determine which URLs to remove from the indexing
altogether. This is pretty trivial to implement and could be a nice
contribution. Feel free to submit it to JIRA if you implement it.
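As a rough idea of what such a check could look like, here is a self-contained sketch. It does not use Nutch's classes: UrlFilterSketch, Rule, filter and keepIndexable are hypothetical names, and the rule semantics (first matching rule wins, no match means rejection) are an assumption modelled on how regex-urlfilter.txt is usually described. In Nutch itself the configured URL filter plugins would presumably do this job inside the map method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex URL filter; not Nutch's actual classes.
class UrlFilterSketch {

    // One rule line: '+' accepts, '-' rejects, the rest is a regular expression.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            rules.add(new Rule(line));
        }
        return rules;
    }

    // Returns the URL if the first matching rule accepts it, otherwise null.
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // assumption: no matching rule means the URL is dropped
    }

    // The check a map step could apply: skip records whose URL is filtered out.
    static List<String> keepIndexable(List<Rule> rules, List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (filter(rules, url) != null) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The two rules left in regex-urlfilter.txt before indexing.
        List<Rule> rules = parse("+^http://www.website.com/post--", "-.");
        List<String> urls = Arrays.asList(
            "http://www.website.com/list.html?page=00",
            "http://www.website.com/post--a_b_c.html");
        System.out.println(keepIndexable(rules, urls));
    }
}
```

With the two rules from the original message, only the post URL survives; the list page is rejected by the catch-all "-." rule.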
HTH
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
On 8 February 2010 12:26, Stefano Cherchi <stefanocherchi@yahoo.it> wrote:
Is there nobody out there who can provide some kind of hint?
I'm really stuck with this problem and I cannot figure out what else I can do.
Thanks
S
----- Original message -----
From: Stefano Cherchi <stefanocherchi@yahoo.it>
To: nutch-user@lucene.apache.org
Sent: Thu 4 February 2010, 17:00:35
Subject: Nutch + Solr: filtering URL while indexing
Hi everybody. I've been struggling for three days now with what looks like a
fairly trivial problem, and I still have no solution.
I need to index a few web sites with the following structure:
Page type 1: List of posts (http://www.website.com/list.html?page=XXx) where XXx
is a sequential number from 00 to 999. Each page has links to the next and
the previous list page.
Page type 2: the actual post page (http://www.website.com/post--x_y_z.html)
where x_y_z is an arbitrary string of letters and numbers representing the post
title.
Page type 3: other content such as static pages, external links, and other
unwanted and useless stuff.
I need to crawl pages of both type 1 and 2 but I want to index only type 2.
Crawling pages of type 1 is the only way to reach type 2 because pages of type 2
have unpredictable URLs. So I'm performing a step-by-step indexing this way:
I set the following regular expressions in regex-urlfilter.txt
+^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
+^http://www.website.com/post--
-.
inject (http://www.website.com/list.html?page=00)
then I cycle N times
generate
fetch
parse
updatedb
and I can see that only type 1 and type 2 pages are actually crawled and
fetched. Great.
Then I edit the regex-urlfilter.txt leaving only
+^http://www.website.com/post--
-.
and perform
invertlinks (with filtering on)
solrindex
Now I would expect all type 1 pages to be stripped from the linkdb and
only type 2 pages to be added to the Solr index, but when I browse the indexed
documents I still find both type 1 and type 2 pages.
Can someone please explain why?
Thank you.
S
----------------------------------
"Anyone proposing to run Windows on servers should be prepared to explain
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham
"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)