Sent 2010-02-28 by "Ian M. Evans" <ianevans@...>
Using Nutch as a crawler for solr.
I've been digging around the nutch-user archives a bit and have seen
some people discussing how to ignore menu items or other unnecessary div
areas like common footers, etc. I still haven't come across a full
answer yet.
Is there a way to define a div by id tha...
Sent 2010-02-27 by QueroVc <yuri.gopfert@...>
hello, I have a problem.
Can you configure how Nutch (crawl or search?) creates the summaries?
--
View this message in context: http://old.nabble.com/Summary-tp27731301p27731301.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Sent 2010-02-27 by Apache Hudson Server <hudson@...>
See
Sent 2010-02-27 by Ted Yu <yuzhihong@...>
Please disregard my previous email - the command was launched from incorrect
directory.
I don't see improvement for my latest run:
[root@snv-qa-lin-domain-crawler1 software]# hfs -text
/user/tomcatadmin/lpm/15-100226111258118-tomcatadmin/parse/0/part-m-00000
10/02/27 07:36:28 INFO util.NativeCod...
Sent 2010-02-27 by Ted Yu <yuzhihong@...>
Now I see this in the log:
[root@snv-qa-lin-domain-crawler1 webmap_workflow]# hfs -text
/user/tomcatadmin/lpm/15-100226111258118-tomcatadmin/generate/0/part-r-00000
2010-02-27 07:25:08,062 WARN [main] conf.Configuration DEPRECATED:
hadoop-site.xml found in the classpath. Usage of hadoop-site.xml...
Sent 2010-02-27 by Julien Nioche <lists.digitalpebble@...>
Look at the Hadoop option -libjars and use it to point to the nutch-1.0.jar,
that should work
J.
On 27 February 2010 13:08, Ted Yu wrote:
> Hi,
> We use nutch to perform domain crawl but I see strange 'can't load class'
> error:
>
> [root@snv-qa-lin-domain-crawler1 softwar...
Sent 2010-02-27 by Ted Yu <yuzhihong@...>
Hi,
We use nutch to perform domain crawl but I see strange 'can't load class'
error:
[root@snv-qa-lin-domain-crawler1 software]# hfs -text
/user/tomcatadmin/lpm/12-100226111258118-tomcatadmin/parse/0/part-m-00000
10/02/27 04:45:10 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
10/0...
Sent 2010-02-27 by Patricio Galeas <pgaleas@...>
Hello,
Two weeks ago, we started a web crawl (depth=6, threads=10), and today the process aborted because our hard disk is full. We defined a 100GB partition for the hadoop.tmp.dir.
Last night, I checked the size of hadoop.tmp.dir for the latest crawl and it was 23GB. Some hours later ...
Sent 2010-02-26 by Felix Zimmermann <felizimm@...>
Hi,
when dumping segments with "bin/nutch readseg -dump ...", special
characters of non-utf8 encoded pages are lost. For example the
"ö" (ö) is replaced by a "?"...
I am really in need of the dumped files with correct representation of
special chars. How can I deal with this problem?
Tha...
Sent 2010-02-25 by Eddie Drapkin <oorza2k5@...>
Hello,
I'm trying to upgrade from Nutch 0.9 to Nutch 1.0 and I've solved all of the
issues that I seem to be having, except for one.
When I run a web crawl, everything fetches fine until it gets to dedup, at
which point I get this stack trace:
2010-02-25 14:31:46,592 WARN mapred.LocalJobRunner ...