On Feb 3, 2010, at 3:08am, Withanage, Dulip wrote:
I parse a PDF collection using the web crawler.
Some PDFs are corrupt, and they make the whole Lucene index unusable.
Does anybody have any idea how to work around this problem?
How does it make the "whole Lucene index unusable"?
Normally a corrupt PDF can cause an exception to be thrown during
parsing, or it can cause the parser to hang.
It might output a bunch of garbage, but that shouldn't cause the index
to become invalid.
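
Either way, one bad file can be kept from stopping the crawl by isolating each parse. Here is a minimal sketch in Python, assuming a hypothetical `parse_fn` that stands in for whatever PDF parser the crawler calls. The function name, the timeout value, and the return shape are illustrative and not part of any real crawler or Lucene API.

```python
import concurrent.futures


def safe_parse(parse_fn, data, timeout_s=30.0):
    """Run parse_fn(data) in a worker thread.

    Returns (text, None) on success, or (None, reason) if the parser
    raised an exception or did not finish within timeout_s seconds.
    parse_fn is a hypothetical placeholder for the real PDF parser.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(parse_fn, data)
    try:
        return future.result(timeout=timeout_s), None
    except concurrent.futures.TimeoutError:
        # The parser appears to hang; give up on this document.
        return None, "timeout"
    except Exception as exc:
        # The parser threw on corrupt input; record why and move on.
        return None, f"{type(exc).__name__}: {exc}"
    finally:
        # Do not block on a stuck worker. Note that a thread cannot be
        # killed, so a truly hung parser keeps running in the background;
        # a separate process would be needed to terminate it.
        executor.shutdown(wait=False)
```

A caller would index a document only when the first element of the result is not None, and log the reason otherwise, so that corrupt PDFs are skipped rather than passed on to the indexer.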
-- Ken
Best regards,
Dulip Withanage, M.Sc
Cluster of Excellence
Karl Jaspers Centre
Heidelberg
e-mail: withanage@asia-europe.uni-heidelberg.de
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g