Lucid Imagination

[solr-user] Re: Cleaning up dirty OCR

From:
Tom Burton-West <tburtonwest@...>
We've been thinking about running some kind of classifier against each book
to select books with a high percentage of dirty OCR for special processing.
We haven't quite figured out a multilingual feature set yet, other than the
punctuation/alphanumeric ratio and character block ideas mentioned earlier
in this thread.
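
As a rough illustration of those two features, here is a minimal Python sketch computed over one book's OCR text. The function name ocr_features and the feature names are made up for this example, and the first word of each character's Unicode name (for instance LATIN or CYRILLIC) is only a stand-in for a real character block lookup, which the Python standard library does not provide.

```python
import unicodedata
from collections import Counter


def ocr_features(text):
    """Return two crude dirtiness features for a string of OCR text.

    Hypothetical helper for illustration only; thresholds are not included
    and would have to be tuned per collection.
    """
    alnum = sum(1 for ch in text if ch.isalnum())
    # Unicode general categories starting with P (punctuation) or S (symbol).
    punct = sum(
        1 for ch in text if unicodedata.category(ch).startswith(("P", "S"))
    )

    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            # First word of the Unicode name, e.g. "LATIN" or "CYRILLIC",
            # used as an approximation of the character block.
            name = unicodedata.name(ch, "UNKNOWN")
            scripts[name.split()[0]] += 1

    letters = sum(scripts.values())
    dominant = scripts.most_common(1)[0][1] if scripts else 0
    return {
        # Punctuation and symbols per alphanumeric character.
        "punct_to_alnum": punct / max(alnum, 1),
        # Share of letters that fall outside the most frequent script.
        "minority_script_share": 1 - dominant / letters if letters else 0.0,
    }


clean = ocr_features("The quick brown fox jumps over the lazy dog.")
dirty = ocr_features("Th~ qu|ck br^wn f*x ;:'., ~~ j#mps")
print(clean["punct_to_alnum"] < dirty["punct_to_alnum"])
```

A book that is legitimately bilingual would also score high on the second feature, so it is at best one input among several and not a verdict on its own.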

I'm not sure I understand your suggestion. Since real-word hapax legomena
are generally pretty common (maybe 40-60% of unique words), wouldn't using
them as the "no" set send mixed signals to the classifier?

Tom


Walter Underwood-2 wrote:
Hmm, how about a classifier? Common words are the "yes" training set, hapax legomena are the "no" set, and n-grams are the features.

But why isn't the OCR program already doing this?

wunder

--
View this message in context: http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871444.html
Sent from the Solr - User mailing list archive at Nabble.com.
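
For readers following the thread, the classifier idea quoted above could be sketched roughly as follows. This is only one possible reading of the suggestion: a naive Bayes model over character trigrams with add-one smoothing, written in plain Python. The class name NgramWordClassifier, the boundary markers ^ and $, and the toy word lists are all assumptions made for this example, not anything proposed in the thread.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]


class NgramWordClassifier:
    """Naive Bayes over character n-grams with add-one smoothing (a sketch)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"yes": Counter(), "no": Counter()}
        self.words = {"yes": 0, "no": 0}

    def train(self, words, label):
        for word in words:
            self.counts[label].update(char_ngrams(word, self.n))
            self.words[label] += 1

    def log_odds(self, word):
        """Positive means the word looks more like the "yes" set."""
        vocab = len(set(self.counts["yes"]) | set(self.counts["no"])) + 1
        # Smoothed prior from the number of training words per class.
        score = math.log((self.words["yes"] + 1) / (self.words["no"] + 1))
        for label, sign in (("yes", 1), ("no", -1)):
            total = sum(self.counts[label].values())
            for gram in char_ngrams(word, self.n):
                prob = (self.counts[label][gram] + 1) / (total + vocab)
                score += sign * math.log(prob)
        return score


clf = NgramWordClassifier()
# Toy data: "yes" = common words, "no" = words seen once in the corpus.
clf.train(["the", "there", "then", "other", "them", "these", "weather"], "yes")
clf.train(["tbc", "rnodem", "xqzv", "iiil", "wjth"], "no")
print(clf.log_odds("they") > 0, clf.log_odds("xqzv") < 0)
```

The toy "no" set here contains only OCR-style garbage. In a real corpus it would also contain many legitimate rare words, which is exactly the mixed-signal concern raised in the reply, so the hapax list would probably need filtering before it could serve as negative training data.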
