On Tue, Mar 16, 2010 at 12:55 AM, Susam Pal <susam.pal@gmail.com> wrote:
On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti
<graziano.aliberti@eng.it> wrote:
On 13/03/2010 at 22.55, Susam Pal wrote:
On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal<susam.pal@gmail.com> wrote:
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
<graziano.aliberti@eng.it> wrote:
On 11/03/2010 at 16.20, Susam Pal wrote:
On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
<graziano.aliberti@eng.it> wrote:
Hi everyone,
I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I
try to fetch my website list, the log file shows that the proxy
authentication failed.
I've configured my nutch-site.xml file with all the properties needed
for proxy authentication, but I still get this error:
httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port
Did you replace 'protocol-http' with 'protocol-httpclient' in the
value of the 'plugin.includes' property in 'conf/nutch-site.xml'?
Regards,
Susam Pal
Hi Susam,
Yes, of course! :) Here is my configuration file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>my.agent.name</value>
<description>
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>
</description>
</property>
<property>
<name>http.auth.file</name>
<value>my_file.xml</value>
<description>Authentication configuration file for
'protocol-httpclient' plugin.
</description>
</property>
<property>
<name>http.proxy.host</name>
<value>ip.my.proxy</value>
<description>The proxy hostname. If empty, no proxy is
used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value>my.port</value>
<description>The proxy port.</description>
</property>
<property>
<name>http.proxy.username</name>
<value>my.user</value>
<description>
</description>
</property>
<property>
<name>http.proxy.password</name>
<value>my.pwd</value>
<description>
</description>
</property>
<property>
<name>http.proxy.realm</name>
<value>my_realm</value>
<description>
</description>
</property>
<property>
<name>http.agent.host</name>
<value>my.local.pc</value>
<description>The agent host.</description>
</property>
<property>
<name>http.useHttp11</name>
<value>true</value>
<description>
</description>
</property>
</configuration>
One more question: where must I put the user authentication parameters
(user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I
use for authentication?
Thank you for your attention,
--
-----------
Graziano Aliberti
Engineering Ingegneria Informatica S.p.A
Via S. Martino della Battaglia, 56 - 00185 ROMA
*Tel.:* 06.49.201.387
*E-Mail:* graziano.aliberti@eng.it
The configuration looks okay to me. Yes, the proxy authentication
details are set in 'conf/nutch-site.xml'. The file named in the
'http.auth.file' property is used only for configuring the credentials
for authenticating to a web server, not to the proxy.
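For comparison, here is a minimal sketch of what the file named in
'http.auth.file' might contain. It assumes the auth-configuration
format of the sample 'conf/httpclient-auth.xml' that ships with Nutch.
The user name, password, host, port and realm below are all
placeholders, not values from your setup.

```xml
<?xml version="1.0"?>
<!-- Web server credentials only; proxy credentials stay in nutch-site.xml. -->
<auth-configuration>
  <!-- Placeholder user name and password. -->
  <credentials username="web.user" password="web.pwd">
    <!-- Placeholder scope: the web server (not the proxy) that asks for a login. -->
    <authscope host="www.example.com" port="80" realm="example realm"/>
  </credentials>
</auth-configuration>
```

Nothing in this file affects the "No credentials available" message
you are seeing, since that challenge comes from the proxy.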
Unfortunately, there are no log statements in the part of the code
that reads the proxy authentication details, so turning on debug logs
will not give you any clues about the issue. However, if you want to
troubleshoot it yourself by building Nutch from source, I can point you
to the code that deals with this.
The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup
The line number is: 200.
If I get time this weekend, I will try to insert some log statements
into this code and send a modified JAR file to you which might help
you to troubleshoot what is going on. But I can't promise this since
it depends on my weekend plans.
Two questions before I end this mail. Did you set the value of
'http.proxy.realm' property as: Squid proxy-caching web server ?
Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
file? I'm not sure whether this line should appear for proxy
authentication but it does appear for web server authentication.
Regards,
Susam Pal
I managed to find some time to insert more log statements into
protocol-httpclient and build a JAR. I have attached it to this
email.
Please replace your
'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
one that I have attached. Also, edit your 'conf/log4j.properties' file
to add these two lines:
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
When you run a crawl now, you should see more logs in
'logs/hadoop.log' than before. I hope they give you some clues. In
case you want to follow the control flow in the source code alongside
the logs, I have attached the Java file as well.
Regards,
Susam Pal
Hi Susam,
First of all, I want to thank you for your support :). I've tried your
solution, and the log file shows that the authentication parameters
were read correctly by the application.
In the log file I found these lines about auth.AuthChallengeProcessor:
2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
2010-03-15 09:52:33,140 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2010-03-15 09:52:33,140 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port
The part 'Squid proxy-caching web server'@my.proxy:my.port names the
realm, host and port for which credentials are being requested, so it
should match the proxy details in your configuration.
It means that 'http.proxy.realm' should be specified as: Squid
proxy-caching web server
You can also try omitting the value of the 'http.proxy.realm' property.
I also wanted to confirm whether you got the following line in
'logs/hadoop.log':
Custom logs for troubleshooting authentication (set 4)
If you have got this line and your configuration is correct, I don't
see a reason why HttpMethodDirector should complain about missing
credentials. It could be a bug either in Nutch or in the Jakarta
Commons HttpClient library, which Nutch uses to do the
authentication. It could also be a mistake in the configuration.
In case you find a way to resolve it, please let us know what the
problem was and how you resolved it.
Regards,
Susam Pal
Here is an update. It is most likely a configuration problem. I just
tested the proxy authentication feature with a Squid proxy server with
the same realm as yours. It works well.
Also, the issue you are facing seems to be due to an incorrectly
specified realm. I would suggest that you omit the realm and see if it
works. When you omit the realm, the corresponding XML code for the
configuration should look like this:
<property>
<name>http.proxy.realm</name>
<value></value>
<description></description>
</property>
In case you do want to specify the realm, your XML code should look like this:
<property>
<name>http.proxy.realm</name>
<value>Squid proxy-caching web server</value>
<description></description>
</property>
Note that this is the exact string appearing in the log message:
INFO httpclient.HttpMethodDirector - No credentials available for
BASIC 'Squid proxy-caching web server'@my.proxy:my.port
There should be no quotes around the string.
If everything goes fine, the logs should appear like the following.
These logs are from my system.
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
2010-03-16 02:45:28,280 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
2010-03-16 02:45:28,281 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2010-03-16 02:45:28,282 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
2010-03-16 02:45:28,283 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2010-03-16 02:45:28,284 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,286 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,286 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
2010-03-16 02:45:28,287 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
2010-03-16 02:45:28,288 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2010-03-16 02:45:28,288 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,288 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,284 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
2010-03-16 02:45:28,289 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
2010-03-16 02:45:28,330 DEBUG httpclient.Http - url: http://en.wikipedia.org/robots.txt; status code: 200; bytes received: 4853; Content-Length: 4853; Content-Encoding: gzip; extracted to 26147 bytes
I hope this helps.
If you still face issues, please send me the complete log file
(logs/hadoop.log) and the complete configuration file
(conf/nutch-site.xml). It is easier to spot configuration mistakes in
the complete files. Please remove the existing hadoop.log file before
starting a new crawl so that the log file you send isn't too large.
Regards,
Susam Pal