A Font Setting based Bayesian Model to Extract Mathematical Expression in PDF Files
Conference Paper


abstract

© 2017 IEEE. This paper proposes a Font Setting based Bayesian (FSB) model to extract mathematical expressions (MEs) from portable document format (PDF) files. The FSB model is a self-adaptive unsupervised algorithm that first uses rules to identify ME and non-ME (NME) content and then extracts the remaining MEs using Bayesian inference, based on the observation that MEs tend to be repeatedly represented in a particular style. PDF files are first processed using a PDF parser, and the document layout is analyzed using a projection-profiling-cutting-based algorithm to detect columns and lines. Heuristic rules derived from knowledge of math usage and writing practices are employed to reason about the posterior probability of a character being ME vs. NME, conditional upon the font and value information. Based on the character-level posterior probability, Bayesian inference is used to infer whether a non-separable character set (NSCS) is an ME or not. Consecutive (fragmented) ME NSCSs are merged to produce the final results. Experimental results show that our approach achieves a false rate of 0.006 (0.135) and a miss rate of 0.111 (0.093) for IME (EME) extraction. For NSCS classification, our approach achieves 93.1% precision, 90.5% recall, and an F1 score of 0.918. The processing time is markedly shorter than that of supervised machine learning techniques, and the extracted information and analytics products can be used for high-level applications.
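The last two steps described in the abstract (inferring an NSCS label from character-level posteriors, then merging consecutive ME NSCSs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact inference formula, so the sketch assumes characters are conditionally independent given the class and that every character-level posterior was computed under the same prior. The function names `nscs_posterior` and `merge_consecutive`, the default prior of 0.5, and the 0.5 decision threshold are all hypothetical.

```python
import math
from typing import List, Tuple


def nscs_posterior(char_posteriors: List[float], prior: float = 0.5) -> float:
    """Combine per-character P(ME | char) values into P(ME | NSCS).

    Assumption (not from the paper): characters are conditionally
    independent given the class, and each per-character posterior was
    computed under the same prior, so each character contributes its
    likelihood ratio (posterior odds divided by prior odds).
    """
    eps = 1e-9  # clamp to avoid log(0) for posteriors of exactly 0 or 1
    prior_logit = math.log(prior / (1.0 - prior))
    logit = prior_logit
    for p in char_posteriors:
        p = min(max(p, eps), 1.0 - eps)
        logit += math.log(p / (1.0 - p)) - prior_logit
    # Numerically stable sigmoid to turn log-odds back into a probability.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def merge_consecutive(labels: List[bool]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of consecutive ME-labeled NSCSs."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, is_me in enumerate(labels):
        if is_me and start is None:
            start = i
        elif not is_me and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(labels) - 1))
    return spans


# Hypothetical character-level posteriors for five NSCSs on one text line.
line = [[0.1, 0.2], [0.9, 0.8], [0.7, 0.95, 0.6], [0.3], [0.85]]
labels = [nscs_posterior(chars) > 0.5 for chars in line]
merged = merge_consecutive(labels)
```

Working in log-odds keeps the product of many character-level terms numerically stable; a real system would replace the uniform prior and the fixed threshold with values estimated from the document itself.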

name of conference

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)

authors

Liu, Jyh-Charn

published proceedings

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1

author list (cited authors)

Wang, X., & Liu, J.

citation count

6

complete list of authors

Wang, Xing; Liu, Jyh-Charn

publication date

November 2017

publisher

Institute of Electrical and Electronics Engineers (IEEE)

published in

Proceedings of the ... International Conference on Document Analysis and Recognition / sponsored by the IAPR TC-11 and TC-10, in cooperation with the IEEE Computer Society and IGS. International Conference on Document Analysis and Recog...