[nutch-user] problem crawling entire internal website

Subject: problem crawling entire internal website
From: ksee <ksee@...>
Date: 2010-03-15 15:08

Hi,

I'm a new Nutch user. My company wants me to look into using it to index
our internal wiki website as well as SharePoint docs (using Tika).

Right now, I just want Nutch to index the entire wiki site, but I'm having
problems. I've read about other people's problems with this, but I haven't
found a solution that works for me.

I have Nutch 1.0 installed.
The wiki site is MoinMoin, if that helps. The pages don't have extensions
like .html. They're in the form of http://wiki:8000/Engineering, for
example, so every page path is only one level deep.

I'm running Nutch with the following command:
bin/nutch crawl urls -dir crawl -depth 100 -topN 1000000 > crawl.log

I have a urls folder with a file called wiki that points to the top-level
page of the site.

I set the crawl-urlfilter.txt to accept everything except the default
exclusions:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
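
To check which URLs a rule list like the one above would let through, here is a minimal Python sketch. It assumes the filter behaves the way Nutch's regex URL filter is generally described: rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL matching no rule is rejected. The helper names (parse_rules, accepts) and the ?action=print URL are only illustrative; they are not part of Nutch or of the original post.

```python
import re

# Rules copied from the crawl-urlfilter.txt in the post:
# a leading '+' accepts, a leading '-' rejects.
RULES_TEXT = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
"""


def parse_rules(text):
    """Turn the rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://wiki:8000/Engineering",
        "http://wiki:8000/Engineering?action=print",  # illustrative query URL
        "http://wiki:8000/logo.png",
    ):
        print(url, "->", "accept" if accepts(url, rules) else "reject")
```

Under these assumptions, a plain page URL such as http://wiki:8000/Engineering is accepted, while any URL containing ?, *, !, @ or = is rejected by the third rule. If some wiki pages are only reachable through links of that form, this rule would be one place to look.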

And I set the db.ignore.external.links property in nutch-default.xml to true
so the crawl doesn't go outside of the site. (db.ignore.internal.links is set
to false.)

After the crawl command completes, the search returns some pages, but some
pages that are only 2 or 3 links away from the starting page still don't
show up in the search results.

Any help would be appreciated.

Thanks,
Kane
-- 
View this message in context: http://old.nabble.com/problem-crawling-entire-internal-website-tp27908943p27908943.html
Sent from the Nutch - User mailing list archive at Nabble.com.
