We've been thinking about running a classifier against each book to pick out
the ones with a high percentage of dirty OCR for special processing. We haven't
worked out a multilingual feature set yet, other than the
punctuation/alphanumeric and character block ideas mentioned above.

I'm not sure I understand your suggestion. Real-word hapax legomena are
generally pretty common (maybe 40-60% of unique words), so wouldn't using them
as the "no" set send mixed signals to the classifier?
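
To make the punctuation/alphanumeric and character block ideas concrete, here is a rough Python sketch of a per-book score. Everything in it is an assumption for illustration: the function names, the 0.3 threshold, and the trick of reading a character's script off the first word of its Unicode name (which only approximates real Unicode block or script data).

```python
import unicodedata

def punct_ratio(token):
    """Share of characters in a token that are neither letters nor digits."""
    if not token:
        return 0.0
    non_alnum = sum(1 for ch in token if not ch.isalnum())
    return non_alnum / len(token)

def scripts_in(token):
    """Approximate the scripts used in a token from Unicode character names."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # First word of the name, e.g. LATIN, CYRILLIC, GREEK.
                scripts.add(name.split()[0])
    return scripts

def suspicious_fraction(tokens, max_punct=0.3):
    """Fraction of tokens that trip either heuristic (threshold is a guess)."""
    if not tokens:
        return 0.0
    flagged = sum(
        1 for t in tokens
        if punct_ratio(t) > max_punct or len(scripts_in(t)) > 1
    )
    return flagged / len(tokens)
```

A book whose `suspicious_fraction` is well above that of the rest of the collection would be a candidate for the special processing. Neither heuristic depends on a dictionary, which is why they might carry over to a multilingual collection, but the cutoff would have to be tuned on real data.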
Tom
Walter Underwood-2 wrote:
Hmm, how about a classifier? Common words are the "yes" training set,
hapax legomenons are the "no" set, and n-grams are the features.
But why isn't the OCR program already doing this?
wunder
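
For readers following the thread, the suggestion above could be sketched roughly as below. This is only one reading of it, not anything either poster built: a small naive Bayes model over character trigrams with add-one smoothing, where the class name, the `^`/`$` boundary padding, and the training lists are all assumptions. In practice the "yes" list would be common words from the index and the "no" list would be hapax legomena.

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with ^ and $ boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramNaiveBayes:
    """Tiny naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"yes": Counter(), "no": Counter()}
        self.words = {"yes": 0, "no": 0}

    def train(self, words, label):
        for w in words:
            self.counts[label].update(char_ngrams(w, self.n))
            self.words[label] += 1

    def log_odds(self, word):
        """Positive means the word looks more like the "yes" set."""
        vocab = len(set(self.counts["yes"]) | set(self.counts["no"])) + 1
        total_yes = sum(self.counts["yes"].values())
        total_no = sum(self.counts["no"].values())
        # Smoothed class prior, then one smoothed likelihood ratio per n-gram.
        score = math.log((self.words["yes"] + 1) / (self.words["no"] + 1))
        for g in char_ngrams(word, self.n):
            p_yes = (self.counts["yes"][g] + 1) / (total_yes + vocab)
            p_no = (self.counts["no"][g] + 1) / (total_no + vocab)
            score += math.log(p_yes / p_no)
        return score

# Toy training data, made up for the example.
model = NgramNaiveBayes()
model.train(["the", "and", "that", "with", "this", "there"], "yes")
model.train(["tbe", "arid", "tlnat", "wjth", "th1s", "tlere"], "no")
print(model.log_odds("then"), model.log_odds("tlnen"))
```

Because genuine rare words also land in the "no" set, the n-gram counts for that class would be a mix of real-language and OCR-noise patterns, which is the mixed-signal concern raised in the reply.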
--
View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871444.html
Sent from the Solr - User mailing list archive at Nabble.com.