Digitisation is the process of representing real-world objects in digital format. Rapidly converting printed material into editable digital form requires a significant amount of work if the format and style of the electronic documents are to remain similar to those of their printed counterparts. Most existing digitisation procedures cover only text. They need to go beyond text-only OCR to make the text inside objects such as tables searchable, while preserving the format and converting the objects to an editable form so that the content can be modified and reprinted; for this, the processes of detection and recognition are important. In past research, table detection mostly relied on certain assumptions, such as i) tables having either rule lines or no lines and ii) a Manhattan or multi-column layout. The recognition process, in turn, assumed that the tables under consideration had already been detected, and it failed to preserve their formatting features for future modification and reprinting. To address these issues, we propose a simple and fast algorithm that uses local thresholds for word space and line height to locate all types of tables and extract their formatting features. In experiments performed on 353 records, the algorithm achieved a much higher detection ability than the earlier algorithm. It is also superior in that it performs extended layout analysis and eliminates header-footer and bulleted-numbered sections before reconstructing the tables. When an extracted table is reconstructed, its most prominent features are preserved in their earlier formats. The algorithm has the advantage of linear complexity, as it performs the conversion locally.
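The idea of a local word-space threshold tied to line height can be illustrated with a minimal sketch. This is not the algorithm of the paper, whose details are not given in this abstract. It assumes that OCR has already produced text lines with word bounding boxes, and that a gap wider than a multiple of the line's own height marks a column boundary. The names `Word`, `Line`, `wide_gaps`, `find_table_regions`, and the parameters `gap_factor` and `min_rows` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    x0: float  # left edge of the word box
    x1: float  # right edge of the word box


@dataclass
class Line:
    height: float      # height of this text line
    words: List[Word]  # word boxes found on this line


def wide_gaps(line: Line, gap_factor: float = 1.0) -> int:
    """Count inter-word gaps wider than gap_factor * line height.

    The threshold is local: it depends only on this line's height
    (an assumption of this sketch, not a value from the paper).
    """
    threshold = gap_factor * line.height
    words = sorted(line.words, key=lambda w: w.x0)
    return sum(1 for a, b in zip(words, words[1:]) if b.x0 - a.x1 > threshold)


def find_table_regions(
    lines: List[Line], gap_factor: float = 1.0, min_rows: int = 2
) -> List[Tuple[int, int]]:
    """Return (first, last) line indices of runs of candidate table rows."""
    regions: List[Tuple[int, int]] = []
    start = None  # index where the current run of candidate rows began
    for i, line in enumerate(lines):
        is_row = wide_gaps(line, gap_factor) >= 1
        if is_row and start is None:
            start = i
        elif not is_row and start is not None:
            if i - start >= min_rows:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the last line of the page.
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines) - 1))
    return regions


if __name__ == "__main__":
    def make_line(height, spans):
        return Line(height, [Word(a, b) for a, b in spans])

    page = [
        make_line(10, [(0, 30), (33, 60), (63, 90)]),    # prose: narrow gaps
        make_line(10, [(0, 30), (60, 80), (120, 150)]),  # wide gaps: candidate row
        make_line(10, [(0, 25), (60, 85), (120, 140)]),  # wide gaps: candidate row
        make_line(10, [(0, 40), (43, 70)]),              # prose again
    ]
    print(find_table_regions(page))  # prints [(1, 2)]
```

Because each line is judged only against its own height, the sketch needs a single pass over the lines and no page-wide statistics, which is in the spirit of the local processing described above.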
How to Cite:
Jahan, M. & Ragel, R.G. (2016). Reproducing tables in scanned documents. Journal of the National Science Foundation of Sri Lanka, 44(4), pp. 367–377. DOI: http://doi.org/10.4038/jnsfsr.v44i4.8019