Yes, the term count reported by CheckIndex is the total number of unique terms.
It indeed looks like you are exceeding the unique term count limit --
16777214 * 128 (= the default term index interval) is 2147483392, which
is mighty close to the max/min 32-bit int value (2147483647 /
-2147483648). This makes sense,
because CheckIndex steps through the terms in order, one by one. So
the first term just over the limit triggered the exception.
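
To make the arithmetic above concrete, here is a minimal Java sketch. It is not Lucene code: the class and method names are made up for illustration, and it only assumes that a term position is held in a signed 32-bit int and divided by the term index interval, which is what the negative array index suggests.

```java
// Illustration only; TermIndexOverflow and its methods are hypothetical names.
public class TermIndexOverflow {

    // Terms covered by a given number of term index entries, computed in long
    // so that the product itself cannot overflow.
    static long termsCovered(int indexEntries, int termIndexInterval) {
        return (long) indexEntries * termIndexInterval;
    }

    // Java int arithmetic wraps silently: MAX_VALUE + 1 becomes MIN_VALUE.
    static int wrappedIncrement(int counter) {
        return counter + 1;
    }

    public static void main(String[] args) {
        int interval = 128;        // default term index interval
        int indexEntry = 16777214; // magnitude of the index in the exception

        long terms = termsCovered(indexEntry, interval);
        System.out.println(terms);                     // 2147483392
        System.out.println(Integer.MAX_VALUE - terms); // 255

        // One step past the largest int gives a negative position, and
        // dividing it by the interval gives a negative array index.
        int position = wrappedIncrement(Integer.MAX_VALUE);
        System.out.println(position);                  // -2147483648
        System.out.println(position / interval);       // -16777216
    }
}
```

The point is only that a term position counted in an int goes negative once a segment holds more than about 2.1 billion unique terms, so an index derived from it falls in the same range as the -16777214 in the trace.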
Hmm -- can you try a patched Lucene in your environment? I have one small
change to try that may increase the limit to termIndexInterval
(default 128) * 2.1 billion.
Mike
On Tue, Feb 9, 2010 at 12:23 PM, Tom Burton-West <tburtonwest@gmail.com> wrote:
Thanks Lance and Michael,
We are running Solr 1.3.0.2009.09.03.11.14.39 (Complete version info from
Solr admin panel appended below)
I tried running CheckIndex (with the -ea: switch) on one of the shards.
CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger
segment containing 500K+ documents. (Complete CheckIndex output appended
below)
Is it likely that all 10 shards are corrupted? Is it possible that we have
simply exceeded some Lucene limit?
I'm wondering if we could have exceeded the Lucene limit of 2.1 billion unique
terms, as mentioned towards the end of the Lucene Index File Formats
document. If the small 731-document index has nine million unique terms, as
reported by CheckIndex, then even though many terms are repeated, it is
conceivable that the 500,000-document index could have more than 2.1 billion
terms.
Do you know if the number of terms reported by CheckIndex is the number of
unique terms?
On the other hand, we previously optimized a 1 million document index down
to 1 segment and had no problems. That was with an earlier version of Solr
and did not include CommonGrams which could conceivably increase the number
of terms in the index by 2 or 3 times.
Tom
-----------------------------------------------------------------------------------
Solr Specification Version: 1.3.0.2009.09.03.11.14.39
Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 11:14:39
Lucene Specification Version: 2.9-dev
Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55
[tburtonw@slurm-4 ~]$ java -Xmx4096m -Xms4096m -cp
/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib
-ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
/l/solrs/1/.snapshot/serve-2010-02-07/data/index
Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index
Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene
2.9]
1 of 2: name=_29dn docCount=554799
compound=false
hasProx=true
numFiles=9
size (MB)=267,131.261
diagnostics = {optimize=true, mergeFactor=2,
os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_29dn_7.del]
test: open reader.........OK [184 deleted docs]
test: fields, norms.......OK [6 fields]
test: terms, freq, prox...FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.ArrayIndexOutOfBoundsException: -16777214
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715)
2 of 2: name=_29im docCount=731
compound=false
hasProx=true
numFiles=8
size (MB)=421.261
diagnostics = {optimize=true, mergeFactor=3,
os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields, norms.......OK [6 fields]
test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs; 144869629 tokens]
test: stored fields.......OK [3550 total field count; avg 4.856 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
WARNING: 1 broken segments (containing 554615 documents) detected
WARNING: would write new segments file, and 554615 documents would be lost,
if -fix were specified
[tburtonw@slurm-4 ~]$
The index is corrupted. In some places ArrayIndex and NPE are not
wrapped as CorruptIndexException.
Try running your code with the Lucene assertions on. Add this to the
JVM arguments: -ea:org.apache.lucene...
--
View this message in context: http://old.nabble.com/TermInfosReader.get-ArrayIndexOutOfBoundsException-tp27506243p27518800.html
Sent from the Solr - User mailing list archive at Nabble.com.