A Font Setting Based Bayesian Model to Extract Mathematical Expression in PDF Files
- Additional Document Info
- View All
© 2017 IEEE. This paper proposes a Font Setting based Bayesian (FSB) model to extract mathematical expressions (MEs) in the portable document format (PDF) files. The FSB model is a self-adaptive unsupervised algorithm which first uses rules to identify ME and non-ME (NME) and then extracts the remaining ME using the Bayesian inference based on the observation that MEs tend to repeatedly represented in a particular style. PDF files are first processed using a PDF parser and document layout is analyzed using projection profiling cutting based algorithm to detect columns and lines. Heuristic rules derived from the knowledge of math usage and writing practices are employed to reason about the posterior probability of a char being ME vs. NME, conditional upon the font and value information. Based on the char level posterior probability, Bayesian inference is used to infer a none-separable character set (NSCS) being ME or not. Consecutive (fragmented) ME NSCS are merged to produce final results. Experimental results show that our approach achieves 0.006 (0.135) false rate and 0.111/0.093 miss rate for IME (EME) extraction. As for NSCS classification, our approach achieves 93.1% precision, 90.5% recall rate, and F1 score of 0.918. The processing time is markedly shorter than supervised machine learning techniques, and the extracted information and analytics products can be used for high level applications.
author list (cited authors)