IJSTR

International Journal of Scientific & Technology Research

IJSTR >> Volume 9 - Issue 5, May 2020 Edition



Website: http://www.ijstr.org

ISSN 2277-8616



Auto-Table-Extract: A System To Identify And Extract Tables From PDF To Excel

 

AUTHOR(S)

Rohit Sahoo, Chinmay Kathale, Milind Kubal, Shaveta Malik

 

KEYWORDS

Table Detection, Table Extraction, Layout Analysis, Machine Learning, PDFMiner, K-Means Clustering, Tesseract OCR.

 

ABSTRACT

Detecting tables and extracting the information they contain plays an essential role in document analysis. Tables are among the simplest ways to present vital information in a structured format. Making further use of an ever-growing body of documents requires effective tools that can automatically extract such information into a desired format. Table detection and extraction is a challenging task because tables come in a wide variety of layouts. Considerable research has been carried out on table detection, but most existing methods cannot identify and extract information from borderless and partially bordered tables. In this paper, we propose a machine-learning-based system called Auto-Table-Extract. The tool identifies and extracts tables from PDF documents and writes the data to Excel sheets. It handles PDFs containing bordered, borderless, or partially bordered tables, and it can extract data from both searchable and scanned PDFs. The system’s performance is comparable to that of other table detection and extraction methods, while it overcomes their limitations in detecting borderless and partially bordered tables, making it an efficient solution for detecting tables in diverse documents.
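
The abstract does not say how the system recovers table structure, although K-Means Clustering appears among the keywords. The following is only a minimal sketch of one plausible use, assuming that words have already been extracted with (x, y) positions and that the number of columns is known: the x-positions are clustered into columns, and words on roughly the same y are grouped into rows. The function names (`kmeans_1d`, `words_to_grid`, `grid_to_csv`), the `row_tol` parameter, and the sample data are all hypothetical, and CSV output stands in for Excel so the example needs only the standard library.

```python
import csv
import io


def kmeans_1d(values, k, iters=50):
    """Tiny 1-D k-means (illustrative only); returns the sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers evenly over the sorted values.
    if k == 1:
        centers = [ordered[len(ordered) // 2]]
    else:
        centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def words_to_grid(words, n_cols, row_tol=3.0):
    """Turn (x, y, text) words into rows of n_cols cells.

    Assumes y grows downward and that words on one table row share
    nearly the same y (within row_tol).
    """
    centers = kmeans_1d([x for x, _, _ in words], n_cols)
    rows = []  # each entry: (y of the first word in the row, list of cells)
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        col = min(range(n_cols), key=lambda i: abs(x - centers[i]))
        if not rows or abs(y - rows[-1][0]) > row_tol:
            rows.append((y, [""] * n_cols))
        cells = rows[-1][1]
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


def grid_to_csv(grid):
    """Serialize the grid as CSV text (a stand-in for writing an Excel sheet)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(grid)
    return buffer.getvalue()


# Hypothetical word positions for a small borderless table.
sample_words = [
    (50, 10, "Name"), (200, 10, "Qty"), (350, 10, "Price"),
    (52, 30, "Bolt"), (205, 30, "12"), (348, 30, "0.40"),
    (49, 50, "Hex"), (80, 50, "nut"), (203, 50, "7"), (352, 50, "0.15"),
]
sample_grid = words_to_grid(sample_words, n_cols=3)
print(grid_to_csv(sample_grid))
```

In a real pipeline the word positions would come from a PDF text extractor for searchable files or from an OCR engine for scanned ones, and the column count would have to be estimated rather than supplied.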

 
