Lucid Imagination

  From, Date
  1. Pravin Karne, 2010-03-05 02:26
  2. Pravin Karne, 2010-03-08 07:32
  3. MilleBii, 2010-03-08 09:32
  4. Pravin Karne, 2010-03-09 01:53
  5. MilleBii, 2010-03-09 02:35
  6. Pravin Karne, 2010-03-09 05:14
  7. MilleBii, 2010-03-09 08:36
  8. Gora Mohanty, 2010-03-09 09:45
  9. eks dev, 2010-03-09 12:07
  10. eks dev, 2010-03-09 12:11

[nutch-user] Two Nutch parallel crawl with two conf folder.

Subject: Re: Two Nutch parallel crawl with two conf folder.
From: MilleBii <millebii@...>
Date: 2010-03-08 09:32

How parallel is parallel in your case?
Don't forget that Hadoop in distributed mode will serialize your jobs anyhow.

For the rest, why don't you create two Nutch directories and run things
totally independently?
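
A minimal sketch of the "two Nutch directories" idea, assuming the Nutch install lives in a single directory that can be copied as a whole and that the `urls` seed folder sits inside it. The helper name `make_crawl_copy` and the paths `/home/nutch`, `/home/nutch-abc` and `/home/nutch-xyz` are hypothetical; only the conf folders and the crawl commands come from the original post. Each copy gets its own `conf/crawl-urlfilter.txt`, so the stock `bin/nutch` script can be used without the `--nutch_conf_dir` modification.

```shell
#!/bin/sh
# Sketch only: the helper name and all paths below are hypothetical.

# make_crawl_copy SRC DEST FILTER
#   SRC    existing Nutch install directory
#   DEST   new, independent copy to create (must not exist yet)
#   FILTER crawl-urlfilter.txt to install into the copy's conf/
make_crawl_copy() {
    src=$1
    dest=$2
    filter=$3
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    # Copy the whole install, then swap in the per-crawl URL filter.
    cp -R "$src" "$dest" || return 1
    cp "$filter" "$dest/conf/crawl-urlfilter.txt"
}

# Example use, with the crawl commands from the original post:
#   make_crawl_copy /home/nutch /home/nutch-abc /home/conf1/crawl-urlfilter.txt
#   make_crawl_copy /home/nutch /home/nutch-xyz /home/conf2/crawl-urlfilter.txt
#   (cd /home/nutch-abc && bin/nutch crawl urls -dir test1 -depth 1)
#   (cd /home/nutch-xyz && bin/nutch crawl urls -dir test2 -depth 1)
```

Because each crawl then runs from its own directory with its own conf, neither run can pick up the other's filter file. Whether the two crawls actually execute at the same time still depends on how Hadoop schedules the jobs, as noted above.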


2010/3/8, Pravin Karne <pravin_karne@persistent.co.in>:
> Hi guys,
>
> Any pointers on the following? Your help will be highly appreciated.
>
> Thanks
> -Pravin
>
> -----Original Message-----
> From: Pravin Karne
> Sent: Friday, March 05, 2010 12:57 PM
> To: nutch-user@lucene.apache.org
> Subject: Two Nutch parallel crawl with two conf folder.
>
> Hi,
>
> I want to run two Nutch crawls in parallel, each with its own conf
> folder. I am using the crawl command to do this.
>
> I have two separate conf folders. All files in them are the same except
> crawl-urlfilter.txt, which holds a different domain filter in each, e.g.:
>
> 1st conf has:
> +.^http://([a-z0-9]*\.)*abc.com/
>
> 2nd conf has:
> +.^http://([a-z0-9]*\.)*xyz.com/
>
> I start the two crawls with the above configuration on separate
> consoles, one after the other, using the following crawl commands:
>
> bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1
> bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1
>
> [Note: We have modified nutch.sh for '--nutch_conf_dir']
>
> The urls file has the following entries:
>
> http://www.abc.com
> http://www.xyz.com
> http://www.pqr.com
>
> Expected result: CrawlDB test1 should contain abc.com's data and
> CrawlDB test2 should contain xyz.com's data.
>
> Actual result: the URL filter of the first run is overridden by the URL
> filter of the second run, so both CrawlDBs have xyz.com's data.
>
> Please provide pointers regarding this.
>
> Thanks in advance.
> -Pravin

--
-MilleBii-
