College of Natural and Applied Sciences

Ensemble methods with simple features for document zone classification

Tayo Obafemi-Ajayi, Missouri University of Science and Technology
Gady Agam
Bingqing Xie

Abstract

Document layout analysis is of fundamental importance for document image understanding and information retrieval. It requires the identification of blocks extracted from a document image via feature extraction and block classification. In this paper, we focus on the classification of the extracted blocks into five classes: text (machine printed), handwriting, graphics, images, and noise. We propose a new set of features for efficient classification of these blocks. We present a comparative evaluation of three ensemble-based classification algorithms (boosting, bagging, and combined model trees) in addition to other known learning algorithms. Experimental results are demonstrated for a set of 36,503 zones extracted from 416 document images that were randomly selected from the tobacco legacy document collection. The results obtained verify the robustness and effectiveness of the proposed set of features in comparison to the commonly used Ocropus recognition features. When used in conjunction with the Ocropus feature set, the proposed features further improve the performance of the block classification system, yielding a classification accuracy of 99.21%. © 2011 Copyright Society of Photo-Optical Instrumentation Engineers (SPIE).
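To make the ensemble idea in the abstract concrete, the following is a minimal sketch of bagging applied to five-class zone classification. It is not the authors' implementation: the two-dimensional features are synthetic Gaussian blobs, the nearest-centroid base learner is a stand-in chosen for brevity, and all function names (`make_zones`, `train_bagging`, and so on) are hypothetical. The abstract does not describe the proposed features or the base learners, so none of that is reproduced here.

```python
import random
from collections import Counter, defaultdict

# The five zone classes named in the abstract. Everything else is assumed.
CLASSES = ["text", "handwriting", "graphics", "images", "noise"]


def make_zones(n_per_class, rng):
    """Synthetic stand-in data: class k is a Gaussian blob centred at (2k, 2k)."""
    data = []
    for k, label in enumerate(CLASSES):
        for _ in range(n_per_class):
            features = (rng.gauss(2.0 * k, 0.5), rng.gauss(2.0 * k, 0.5))
            data.append((features, label))
    return data


def train_centroids(samples):
    """Base learner (assumed): store one mean feature vector per class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (f0, f1), label in samples:
        acc = sums[label]
        acc[0] += f0
        acc[1] += f1
        acc[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict_centroids(model, x):
    """Return the class whose centroid is nearest to x (squared Euclidean)."""
    return min(
        model,
        key=lambda c: (x[0] - model[c][0]) ** 2 + (x[1] - model[c][1]) ** 2,
    )


def train_bagging(samples, n_models, rng):
    """Bagging: fit each base learner on a bootstrap resample drawn with replacement."""
    return [
        train_centroids(rng.choices(samples, k=len(samples)))
        for _ in range(n_models)
    ]


def predict_bagging(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_centroids(m, x) for m in models)
    return votes.most_common(1)[0][0]


def accuracy(models, samples):
    """Fraction of samples whose voted label matches the true label."""
    hits = sum(predict_bagging(models, x) == label for x, label in samples)
    return hits / len(samples)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train = make_zones(40, rng)
held_out = make_zones(20, rng)
ensemble = train_bagging(train, n_models=11, rng=rng)
print(f"held-out accuracy: {accuracy(ensemble, held_out):.3f}")
```

Boosting differs from this sketch in that base learners are trained sequentially, with misclassified samples reweighted at each round, and the final vote is weighted rather than uniform.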

Department(s)

Engineering Program

Document Type

Conference Proceeding

DOI

https://doi.org/10.1117/12.912103

Keywords

Document image analysis, ensemble classifiers, layout analysis, zone classification

Publication Date

2-27-2012

Recommended Citation

Obafemi-Ajayi, Tayo, Gady Agam, and Bingqing Xie. "Ensemble methods with simple features for document zone classification." In Document Recognition and Retrieval XIX, vol. 8297, p. 829706. International Society for Optics and Photonics, 2012.

Journal Title

Proceedings of SPIE - The International Society for Optical Engineering

