Your problem has nothing to do with PDFs. Do you see any messages or
exceptions when you merge the indexes?
Best Regards
Alexander Aristov
On 4 February 2010 12:58, Withanage, Dulip <withanage@asia-europe.uni-heidelberg.de> wrote:
Thanks for the initial ideas.
Are they really corrupt, or do they get corrupted when they are downloaded?
Sorry for my false assumption at the beginning. I am completely new to
both Lucene and Nutch.
I think the individual indexes are not corrupt. The corruption happens
during the mergecrawl process.
These are my steps:
1. I have a PDF index of around 2000 documents on a web server.
2. I generate one index for every 100 documents.
3. Then I use a modified version of the mergecrawl script to merge the indexes:
   http://wiki.apache.org/nutch/MergeCrawl
4. I add each directory one after the other to build a complete index.
5. The merged Lucene index becomes corrupt once I reach an index directory
   of about 400 MB.
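
To narrow down which directory triggers the problem, the step-by-step merge above could be wrapped in a small driver that checks the merged index after every step. This is only a sketch: `merge_step` and `index_is_healthy` are hypothetical callables standing in for the actual merge command and for whatever check is used to decide that the index is still readable. Neither is part of Nutch or Lucene.

```python
from typing import Callable, Optional, Sequence


def first_breaking_directory(
    index_dirs: Sequence[str],
    merge_step: Callable[[str], None],
    index_is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Merge directories one at a time and return the first one after which
    the merged index is no longer healthy, or None if all merges succeed.

    merge_step and index_is_healthy are placeholders (assumptions), to be
    replaced by the real merge invocation and the real index check.
    """
    for index_dir in index_dirs:
        try:
            merge_step(index_dir)
        except Exception as exc:
            # An exception during the merge is exactly the kind of message
            # worth reporting back to the list.
            print(f"merge of {index_dir} raised: {exc!r}")
            return index_dir
        if not index_is_healthy():
            return index_dir
    return None
```

Knowing the exact directory, and whether the merge raised an exception or silently produced a bad index, makes the problem much easier to report.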
-----Original Message-----
From: Alexander Aristov [mailto:alexander.aristov@gmail.com]
Sent: Wednesday, February 03, 2010 9:00 PM
To: nutch-user@lucene.apache.org
Subject: Re: PDF Parsing
Hi,
Are they really corrupt, or do they get corrupted when they are downloaded?
There is a parameter in Nutch that limits the downloaded content size. It
simply truncates files, and they become corrupted. Check this setting.
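
As far as I know, the parameter in question is the `http.content.limit` property (with `file.content.limit` for the file protocol), but please verify that against your own nutch-default.xml. One rough way to see whether the downloaded PDFs were cut off is to look for the `%PDF-` header near the start and the `%%EOF` marker near the end of each file. The following is a minimal sketch of that heuristic, not a real PDF validator. The function names and the 1024-byte windows are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import List


def looks_truncated(pdf_bytes: bytes, window: int = 1024) -> bool:
    """Heuristic check: a complete PDF normally has a %PDF- header near the
    start and a %%EOF marker near the end. A file cut off by a download
    size limit usually loses the trailing marker."""
    has_header = b"%PDF-" in pdf_bytes[:window]
    has_trailer = b"%%EOF" in pdf_bytes[-window:]
    return not (has_header and has_trailer)


def find_truncated(directory: str, pattern: str = "*.pdf") -> List[Path]:
    """Return the files under `directory` that fail the heuristic above."""
    return [
        path
        for path in sorted(Path(directory).rglob(pattern))
        if looks_truncated(path.read_bytes())
    ]
```

If many files fail the check and their sizes all equal the configured limit, truncation at download time is the likely cause.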
Best Regards
Alexander Aristov
On 3 February 2010 21:52, Ken Krugler <kkrugler_lists@transpac.com> wrote:
On Feb 3, 2010, at 3:08am, Withanage, Dulip wrote:
I parse a PDF collection using the web crawler.
Some PDFs are corrupt, and this makes the whole Lucene index unusable.
Does anybody have an idea how to work around this problem?
How does it make the "whole Lucene index unusable"?
Normally a corrupt PDF can cause an exception to be thrown during parsing,
or it can cause the parser to hang.
It might output a bunch of garbage, but that shouldn't cause the index to
become invalid.
-- Ken
Best regards,
Dulip Withanage, M.Sc
Cluster of Excellence
Karl Jaspers Centre
Heidelberg
e-mail: withanage@asia-europe.uni-heidelberg.de
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g