Early modern OCR project (eMOP) at Texas A&M University

Conference Paper

abstract

  • Great effort is being made to collect and preserve historic manuscripts from the early modern and eighteenth-century periods; unfortunately, searching the Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) collections can be extremely difficult for researchers because current Optical Character Recognition (OCR) engines struggle to read and recognize various historic fonts, especially in manuscripts of declining quality. To address this problem, the Early Modern OCR Project (eMOP) at the Initiative for the Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University seeks to train OCR engines to read historic documents more effectively in order to make the entirety of these collections accessible to searching. The first step in this project involves using the Aletheia Desktop Tool, developed by the PRImA Research Lab at the University of Salford, to create training sets from documents in the EEBO and ECCO collections. These training sets aid OCR engines, such as Google's Tesseract, in recognizing the special characters, such as ligatures, italics, and blackletter, found within early modern fonts. In the year that the Aletheia team has been working to create these font training libraries, we have overcome several problems, including learning how to select, extract, and deliver the data that best suits Tesseract's training requirements. This work with Aletheia is part of a larger scholarly project that endeavors not only to make the EEBO and ECCO collections more accessible to researchers for data mining, but also to make available to the public the methodologies, workflow, and digital tools developed during the eMOP project, in order to aid libraries, museums, and scholars in other fields in their efforts to preserve and study our combined cultural history. © 2013 ACM.

name of conference

  • Proceedings of the 2013 ACM symposium on Document engineering

published proceedings

  • Proceedings of the 2013 ACM symposium on Document engineering

altmetric score

  • 6.986

author list (cited authors)

  • Torabi, K., Durgan, J., & Tarpley, B.

citation count

  • 5

complete list of authors

  • Torabi, Katayoun; Durgan, Jessica; Tarpley, Bryan

publication date

  • September 2013