Mass Digitization of Early Modern Texts With Optical Character Recognition

abstract

Optical character recognition (OCR) engines work poorly on texts published with premodern printing technologies. Engaging the key technological contributors from the IMPACT project, an earlier project attempting to solve the OCR problem for early modern and modern texts, the Early Modern OCR Project (eMOP) of Texas A&M received funding from the Andrew W. Mellon Foundation to improve OCR outputs for early modern texts from the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) proprietary database products, or some 45 million pages. Added to print problems is the poor quality of the page images in these collections, which would be too time consuming and expensive to reimage. This article describes eMOP's attempts to OCR 307,000 documents digitized from microfilm to make our cultural heritage available for current and future researchers. We describe the reasoning behind our choices as we undertook the project based on other relevant studies; discoveries we made; the data and the system we developed for processing it; the software, algorithms, training procedures, and tools that we developed; and future directions that should be taken for further work in developing OCR engines for cultural heritage materials.

authors

Mandell, Laura

published proceedings

ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE

altmetric score

3

author list (cited authors)

Christy, M., Gupta, A., Grumbach, E., Mandell, L., Furuta, R., & Gutierrez-Osuna, R.

citation count

6

complete list of authors

Christy, Matthew||Gupta, Anshul||Grumbach, Elizabeth||Mandell, Laura||Furuta, Richard||Gutierrez-Osuna, Ricardo

publication date

January 2018

publisher

Association for Computing Machinery (ACM)

published in

ACM Journal on Computing and Cultural Heritage

keywords

Digital Humanities
Machine Learning

Digital Object Identifier (DOI)

10.1145/3075645

start page

6

end page

25

volume

11

issue

1

URL

http://dx.doi.org/10.1145/3075645
