Physical distortions, alongside digital artefacts, are common in document images. Their presence impairs optical character recognition (OCR), which not only reduces the amount of automatically retrievable content but also degrades the performance of other document analysis algorithms that rely on layout analysis or content recognition. This paper proposes a method to identify and remove certain types of physical distortion from document images. By exploiting the intensity and spatial relations of distorted pixels, we construct a conditional random field (CRF) based method for distortion identification. Furthermore, a peak-searching method is proposed so that the model parameters of the energy functions in the conditional probability are learnt automatically from the image. Pixels belonging to the original document content are discriminated from those caused by physical noise by maximizing the conditional probability in the CRF model. Examples from real-life image samples demonstrate the effectiveness of the proposed method.
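
The abstract does not give the energy functions or the details of the peak-searching step, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes an 8-bit greyscale image whose text, distortion, and background pixels form three separate histogram peaks; it uses a squared-difference unary term, a Potts pairwise term over 4-neighbours, and a synchronous ICM-style update to approximately lower the energy (equivalently, raise the conditional probability). All names and parameters (`find_histogram_peaks`, `label_pixels`, `remove_distortion`, `make_demo_image`, `sigma`, `beta`, `min_distance`) are hypothetical.

```python
import numpy as np


def find_histogram_peaks(image, n_peaks=3, smooth_sigma=3.0, min_distance=20):
    """Return the n_peaks strongest, well-separated intensity peaks (sorted)."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    # Smooth with a Gaussian kernel so small fluctuations do not create peaks.
    radius = int(3 * smooth_sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / smooth_sigma) ** 2)
    hist = np.convolve(hist, kernel / kernel.sum(), mode="same")
    peaks = []
    for _ in range(n_peaks):
        p = int(np.argmax(hist))
        if hist[p] <= 0:
            break  # no populated intensity range left
        peaks.append(p)
        # Suppress the neighbourhood so the next peak is a different mode.
        hist[max(0, p - min_distance):p + min_distance + 1] = 0.0
    return sorted(peaks)


def label_pixels(image, means, sigma=15.0, beta=1.0, n_iters=5):
    """Assign each pixel the index of one class mean (assumed energy terms)."""
    img = image.astype(float)
    means = np.asarray(means, dtype=float)
    classes = np.arange(len(means))[:, None, None, None]
    # Unary term: squared distance between pixel intensity and class mean.
    unary = (img[None] - means[:, None, None]) ** 2 / (2.0 * sigma ** 2)
    labels = np.argmin(unary, axis=0)
    for _ in range(n_iters):
        padded = np.pad(labels, 1, mode="edge")
        neighbours = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                               padded[1:-1, :-2], padded[1:-1, 2:]])
        # Potts term: number of 4-neighbours that disagree with each class.
        disagree = (neighbours[None] != classes).sum(axis=1)
        # Synchronous update: every pixel takes its locally cheapest label.
        labels = np.argmin(unary + beta * disagree, axis=0)
    return labels


def remove_distortion(image, labels, distortion_label, fill_value):
    """Overwrite pixels labelled as distortion with a fill intensity."""
    restored = image.copy()
    restored[labels == distortion_label] = fill_value
    return restored


def make_demo_image():
    """Synthetic page: bright background, dark 'text', mid-grey 'stain'."""
    img = np.full((40, 40), 200, dtype=np.uint8)
    img[5:15, 5:15] = 30      # text block
    img[20:30, 20:35] = 120   # stain block
    img[35, 5] = 150          # isolated speck inside the background
    return img


if __name__ == "__main__":
    page = make_demo_image()
    means = find_histogram_peaks(page, n_peaks=3)
    labels = label_pixels(page, means)
    cleaned = remove_distortion(page, labels, distortion_label=1,
                                fill_value=means[2])
    print(means, np.unique(labels), cleaned[25, 25])
```

In this sketch the class means are read from the image itself, loosely mirroring the idea of learning parameters by peak searching, while the pairwise term lets an isolated off-intensity pixel follow its neighbours rather than its own intensity. The mapping of the middle peak to "distortion" is an assumption of the demo image, not something the abstract states.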

Original language: English
Title of host publication: Proceedings of the 2018 7th European Workshop on Visual Information Processing, EUVIP 2018
Editors: K. Egiazarian, C. Larabi, L. Oudre, I. Tabus, A. Beghdadi, F. Battisti
Place of Publication: Tampere, Finland
Publisher: IEEE
Pages: 1-6
Number of pages: 6
Volume: 2018-November
ISBN (Electronic): 978-1-5386-6897-9
ISBN (Print): 978-1-5386-6897-9, 978-1-5386-6898-6
Publication status: Published - 14 Jan 2019
Event: 7th European Workshop on Visual Information Processing - Tampere, Finland
Duration: 26 Nov 2018 to 28 Nov 2018
http://www.tut.fi/euvip2018/

Conference

Conference: 7th European Workshop on Visual Information Processing
Abbreviated title: EUVIP
Country: Finland
City: Tampere
Period: 26/11/18 to 28/11/18

Research areas

• CRF, document analysis, peak searching, physical noise

ID: 44200798