Lucid Imagination


  From, Date
  1. Graziano Aliberti, 2010-03-11 09:54
  2. Susam Pal, 2010-03-11 10:20
  3. Graziano Aliberti, 2010-03-12 03:39
  4. Susam Pal, 2010-03-12 04:47
  5. Susam Pal, 2010-03-13 16:55
  6. Graziano Aliberti, 2010-03-15 05:02
  7. Susam Pal, 2010-03-15 15:25
  8. Susam Pal, 2010-03-15 17:29

[nutch-user] Proxy Authentication

Subject:
Re: Proxy Authentication
From:
Susam Pal <susam.pal@...>
Date:
2010-03-15 17:29
On Tue, Mar 16, 2010 at 12:55 AM, Susam Pal <susam.pal@gmail.com> wrote:
On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti <graziano.aliberti@eng.it> wrote:
On 13/03/2010 22:55, Susam Pal wrote:
On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal <susam.pal@gmail.com> wrote:
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti <graziano.aliberti@eng.it> wrote:
On 11/03/2010 16:20, Susam Pal wrote:
On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti <graziano.aliberti@eng.it> wrote:
Hi everyone,

I'm trying to use Nutch 1.0 on a system that sits behind a Squid proxy. When I try to fetch my website list, the log file shows that authentication failed. I've configured my nutch-site.xml file with all the properties needed for proxy authentication, but I get this error:

    httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@proxy.my.host:my.port
Did you replace 'protocol-http' with 'protocol-httpclient' in the value of the 'plugin.includes' property in 'conf/nutch-site.xml'?

Regards,
Susam Pal
Hi Susam,

yes, of course!! :) Maybe I can post the configuration file for you:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>my.agent.name</value>
        <description></description>
      </property>
      <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        <description></description>
      </property>
      <property>
        <name>http.auth.file</name>
        <value>my_file.xml</value>
        <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
      </property>
      <property>
        <name>http.proxy.host</name>
        <value>ip.my.proxy</value>
        <description>The proxy hostname.  If empty, no proxy is used.</description>
      </property>
      <property>
        <name>http.proxy.port</name>
        <value>my.port</value>
        <description>The proxy port.</description>
      </property>
      <property>
        <name>http.proxy.username</name>
        <value>my.user</value>
        <description></description>
      </property>
      <property>
        <name>http.proxy.password</name>
        <value>my.pwd</value>
        <description></description>
      </property>
      <property>
        <name>http.proxy.realm</name>
        <value>my_realm</value>
        <description></description>
      </property>
      <property>
        <name>http.agent.host</name>
        <value>my.local.pc</value>
        <description>The agent host.</description>
      </property>
      <property>
        <name>http.useHttp11</name>
        <value>true</value>
        <description></description>
      </property>
    </configuration>

Just one more question: where must I put the user authentication parameters (user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I use for authentication?

Thank you for your attention,

--
Graziano Aliberti
Engineering Ingegneria Informatica S.p.A
Via S. Martino della Battaglia, 56 - 00185 ROMA
Tel.: 06.49.201.387
E-Mail: graziano.aliberti@eng.it
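A configuration like the one above can be sanity-checked mechanically before starting a crawl. The following is only a minimal sketch, not part of Nutch: the helper names (`load_properties`, `check_proxy_config`) and the shortened `SAMPLE` config are made up for illustration, and the check on 'plugin.includes' is a plain substring test that would miss a regex form such as 'protocol-(http|httpclient)'.

```python
import xml.etree.ElementTree as ET

# Proxy properties that the thread says belong in conf/nutch-site.xml.
REQUIRED = ("http.proxy.host", "http.proxy.port",
            "http.proxy.username", "http.proxy.password")


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Nutch-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def check_proxy_config(props):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Substring test only (hypothetical simplification, see the note above).
    if "protocol-httpclient" not in props.get("plugin.includes", ""):
        problems.append("plugin.includes does not mention protocol-httpclient")
    for name in REQUIRED:
        if not props.get(name):
            problems.append("missing or empty: " + name)
    return problems


# Shortened stand-in for conf/nutch-site.xml, using the thread's placeholders.
SAMPLE = """<configuration>
  <property><name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex</value></property>
  <property><name>http.proxy.host</name><value>ip.my.proxy</value></property>
  <property><name>http.proxy.port</name><value>my.port</value></property>
  <property><name>http.proxy.username</name><value>my.user</value></property>
  <property><name>http.proxy.password</name><value>my.pwd</value></property>
  <property><name>http.proxy.realm</name><value>my_realm</value></property>
</configuration>"""

if __name__ == "__main__":
    for problem in check_proxy_config(load_properties(SAMPLE)):
        print(problem)
```

A check like this only confirms that the properties are present. It cannot tell whether the realm value matches what the proxy actually announces.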
The configuration looks okay to me. Yes, the proxy authentication details are set in 'conf/nutch-site.xml'. The file named in the 'http.auth.file' property is used for configuring the details for authenticating to a web server.

Unfortunately, there aren't any log statements in the part of the code that reads the proxy authentication details, so I can't suggest turning on debug logs to get some clues about the issue. However, in case you want to troubleshoot it yourself by building Nutch from source, I can tell you the code that deals with this. The file is:

    src/java/org/apache/nutch/protocol/httpclient/Http.java :
    http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup

The line number is: 200.

If I get time this weekend, I will try to insert some log statements into this code and send you a modified JAR file, which might help you troubleshoot what is going on. But I can't promise this, since it depends on my weekend plans.

Two questions before I end this mail. Did you set the value of the 'http.proxy.realm' property to:

    Squid proxy-caching web server

Also, do you see any 'auth.AuthChallengeProcessor' lines in the log file? I'm not sure whether this line should appear for proxy authentication, but it does appear for web server authentication.

Regards,
Susam Pal
I managed to find some time to insert more log statements into protocol-httpclient and build a JAR. I have attached it to this email. Please replace your 'plugins/protocol-httpclient/protocol-httpclient.jar' file with the attached one. Also, edit your 'conf/log4j.properties' file to add these two lines:

    log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
    log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

When you run a crawl now, you should see more logs in 'logs/hadoop.log' than before. I hope they give you some clues. In case you want to compare the logs with the control flow in the source code, I have attached the Java file as well.

Regards,
Susam Pal
Hi Susam,

first of all I want to thank you for your support :). I've tried your solution, and I can see in the log file that the authentication parameters were read correctly by the application. In the log file I found these lines about auth.AuthChallengeProcessor:

    2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
    2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
    2010-03-15 09:52:33,140 INFO  auth.AuthChallengeProcessor - basic authentication scheme selected
    2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
    2010-03-15 09:52:33,140 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
    2010-03-15 09:52:33,140 INFO  httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port
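The last log line above already names the realm the proxy is asking for. As a small sketch (the regular expression and the function name `find_challenge` are illustrative assumptions, written against the message format quoted in this thread only), the scheme, realm, host and port can be pulled out of such a line like this:

```python
import re

# Matches the HttpMethodDirector message quoted in the thread, e.g.
# "No credentials available for BASIC 'Some realm'@host:port".
NO_CREDENTIALS = re.compile(
    r"No credentials available for (?P<scheme>\w+) "
    r"'(?P<realm>[^']*)'@(?P<host>[^:\s]+):(?P<port>\S+)"
)


def find_challenge(log_line):
    """Return scheme, realm, host and port as a dict, or None if no match."""
    match = NO_CREDENTIALS.search(log_line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = ("2010-03-15 09:52:33,140 INFO  httpclient.HttpMethodDirector - "
            "No credentials available for BASIC "
            "'Squid proxy-caching web server'@my.proxy:my.port")
    print(find_challenge(line))
```

The extracted realm is the text between the single quotes, without the quotes themselves.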
'Squid proxy-caching web server'@my.proxy:my.port is the authentication scope announced by the proxy, and it should match the details given in the proxy configuration. This means that 'http.proxy.realm' should be specified as:

    Squid proxy-caching web server

You can also try omitting the value of the 'http.proxy.realm' property.

I also wanted to confirm whether you got the following line in 'logs/hadoop.log':

    Custom logs for troubleshooting authentication (set 4)

If you got this line and your configuration is correct, I don't see a reason why HttpMethodDirector should complain about missing credentials. It could be a bug either in Nutch or in the Jakarta Commons HttpClient library, which Nutch uses to do the authentication. It could also be a mistake in the configuration.

If you find a way to resolve it, please let us know what the problem was and how you resolved it.

Regards,
Susam Pal
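The advice about the realm can be pictured with a deliberately simplified model. It assumes, as the suggestion to omit the value implies, that an empty configured realm acts as a wildcard and that a non-empty one must equal the announced realm exactly. The function `realm_matches` is hypothetical and is not the actual Nutch or HttpClient code.

```python
def realm_matches(configured_realm, challenge_realm):
    """Simplified model of realm matching (an assumption, not library code).

    An empty or missing configured realm is treated as "any realm";
    otherwise the two strings must be identical, character for character.
    """
    if not configured_realm:
        return True
    return configured_realm == challenge_realm


if __name__ == "__main__":
    challenge = "Squid proxy-caching web server"
    for configured in ("", challenge, "my_realm", "'" + challenge + "'"):
        print(repr(configured), realm_matches(configured, challenge))
```

Under this model, the placeholder value 'my_realm' from the posted configuration would not match the realm announced by the proxy, which is consistent with the "No credentials available" message.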
Here is an update. It is most likely a configuration problem. I just tested the proxy authentication feature against a Squid proxy server with the same realm as yours, and it works well. The issue you are facing seems to be caused by an incorrect realm. I would suggest that you omit the realm and see if it works.

When you omit the realm, the corresponding XML in the configuration should look like this:

    <property>
      <name>http.proxy.realm</name>
      <value></value>
      <description></description>
    </property>

In case you do want to specify the realm, your XML should look like this:

    <property>
      <name>http.proxy.realm</name>
      <value>Squid proxy-caching web server</value>
      <description></description>
    </property>

Note that this is the exact string that appears in the log message:

    INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Squid proxy-caching web server'@my.proxy:my.port

There should be no quotes around the string in the configuration.

If everything goes fine, the logs should look like the following. These logs are from my system.

    2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
    2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
    2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
    2010-03-16 02:45:28,280 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
    2010-03-16 02:45:28,280 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
    2010-03-16 02:45:28,281 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
    2010-03-16 02:45:28,282 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
    2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
    2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
    2010-03-16 02:45:28,283 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
    2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
    2010-03-16 02:45:28,283 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
    2010-03-16 02:45:28,284 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
    2010-03-16 02:45:28,286 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
    2010-03-16 02:45:28,286 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
    2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Challenge for ntlm authentication scheme not available
    2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Challenge for digest authentication scheme not available
    2010-03-16 02:45:28,287 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
    2010-03-16 02:45:28,287 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: basic
    2010-03-16 02:45:28,288 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
    2010-03-16 02:45:28,288 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
    2010-03-16 02:45:28,288 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
    2010-03-16 02:45:28,284 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(Credentials, HttpMethod)
    2010-03-16 02:45:28,289 DEBUG auth.BasicScheme - enter BasicScheme.authenticate(UsernamePasswordCredentials, String)
    2010-03-16 02:45:28,330 DEBUG httpclient.Http - url: http://en.wikipedia.org/robots.txt; status code: 200; bytes received: 4853; Content-Length: 4853; Content-Encoding: gzip; extracted to 26147 bytes

I hope this helps. If you still face issues, please send me the complete log file (logs/hadoop.log) and the complete configuration file (conf/nutch-site.xml). It is easier to spot configuration mistakes when the complete files are available. Please remove the existing hadoop.log file before starting a new crawl, so that the log file you send isn't too large.

Regards,
Susam Pal
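When checking a fresh 'logs/hadoop.log' after a change like this, the lines of interest are the 'httpclient.Http - url: ...; status code: ...' entries shown above. The sketch below (the function names and the regular expression are illustrative assumptions based on that one line format) lists the fetched URLs with their status codes and flags responses with status 407, the standard HTTP code for "Proxy Authentication Required".

```python
import re

# Matches the debug line format shown in the thread:
# "httpclient.Http - url: <url>; status code: <code>; ..."
FETCH_LINE = re.compile(
    r"httpclient\.Http - url: (?P<url>\S+); status code: (?P<status>\d+)"
)


def fetch_results(log_text):
    """Return a list of (url, status_code) pairs found in the log text."""
    return [(m.group("url"), int(m.group("status")))
            for m in FETCH_LINE.finditer(log_text)]


def proxy_auth_failures(log_text):
    """Return the URLs whose fetch ended with HTTP 407."""
    return [url for url, status in fetch_results(log_text) if status == 407]


if __name__ == "__main__":
    sample = ("2010-03-16 02:45:28,330 DEBUG httpclient.Http - url: "
              "http://en.wikipedia.org/robots.txt; status code: 200; "
              "bytes received: 4853; Content-Length: 4853; "
              "Content-Encoding: gzip; extracted to 26147 bytes")
    print(fetch_results(sample))
    print(proxy_auth_failures(sample))
```

To run it against a real log, read the file contents and pass them to `fetch_results`. Whether these lines appear at all depends on the DEBUG logging being enabled as described earlier in the thread.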
