Lucid Imagination

From, Date

  1. "Withanage, Dulip", 2010-02-03 06:08
  2. Ken Krugler, 2010-02-03 13:52
  3. Alexander Aristov, 2010-02-03 14:59
  4. "Withanage, Dulip", 2010-02-04 04:58
  5. Alexander Aristov, 2010-02-04 06:11

[nutch-user] PDF Parsing

Subject: Re: PDF Parsing
From: Ken Krugler <kkrugler_lists@...>
Date: 2010-02-03 13:52

On Feb 3, 2010, at 3:08am, Withanage, Dulip wrote:

> I parse a PDF collection using the web crawler. Some PDFs are corrupt, and that makes the whole Lucene index unusable. Does anybody have any idea how to get around this problem?

How does it make the "whole Lucene index unusable"? Normally a corrupt PDF can cause an exception to be thrown during parsing, or it can cause the parser to hang. It might output a bunch of garbage, but that shouldn't cause the index to become invalid.

-- Ken

> Best regards,
> Dulip Withanage, M.Sc
> Cluster of Excellence
> Karl Jaspers Centre
> Heidelberg
> e-mail: withanage@asia-europe.uni-heidelberg.de

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
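
The two failure modes named in the reply above (an exception thrown during parsing, or a parser that hangs) can both be contained so that one bad file is skipped instead of stopping the run. What follows is only a minimal sketch in Python, not Nutch code: `parse_safely`, `parse_collection`, and the `parse` callable are hypothetical names, and `parse` stands in for whatever PDF parser is actually in use.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as ParseTimeout


def parse_safely(parse, raw_bytes, timeout_s=30.0):
    """Run one parse with a time limit; return (text, error).

    `parse` is a hypothetical callable: bytes in, extracted text out.
    Exactly one element of the returned pair is None.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(parse, raw_bytes)
    try:
        return future.result(timeout=timeout_s), None
    except ParseTimeout:
        # The parser did not finish in time (the "hang" case).
        return None, "timeout"
    except Exception as exc:
        # The parser raised (the "exception during parsing" case).
        return None, f"error: {exc}"
    finally:
        # Do not block the caller waiting for a stuck worker thread.
        pool.shutdown(wait=False)


def parse_collection(docs, parse, timeout_s=30.0):
    """Parse every document; collect failures instead of aborting."""
    parsed, skipped = {}, {}
    for name, raw in docs.items():
        text, error = parse_safely(parse, raw, timeout_s)
        if error is None:
            parsed[name] = text
        else:
            skipped[name] = error
    return parsed, skipped
```

Only the entries in `parsed` would be handed on for indexing; `skipped` records which files failed and why. One caveat of this design: Python cannot kill a thread, so a parser that truly never returns keeps its worker thread alive in the background. Running the parser in a separate process would be needed to reclaim it.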

