tabula-py: Extract table from PDF into Python DataFrame

(Note: Oct 7th, 2019)
As of Oct. 2019, I launched
a documentation site and Google Colab notebook for tabula-py. The FAQ would be good place to execute accurate extraction.

Today, I released tabula-py 0.3.0, which extracts table from PDF into Python pandas’s DataFrame.

chezou/tabula-py
_tabula-py - Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame_github.com
[](https://github.com/chezou/tabula-py)

It is simple wrapper of tabula-java and it enables you to extract table into DataFrame or JSON with Python. You also can extract tables from PDF into CSV, TSV or JSON file.

tabula is a tool to extract tables from PDFs. It is GUI based software, but tabula-java is a tool based on CUI. Though there were Ruby, R, and Node.js bindings of tabula-java, before tabula-py there isn’t any Python binding of it. I believe PyData is a great ecosystem for data analysis and that’s why I created tabula-py. If you are familiar with R, I highly recommend to use tabulizer, which has the most richest bindings including rich GUI.

You can install tabula-py via pip:

pip install tabula-py

With tabula-py, you can get DataFrame with read_pdf() method.

example of read\_pdf() example of read_pdf()

You can also extract tables as JSON format:

example of JSON example of JSON

You can extract tables into a file like JSON, CSV or TSV with convert_into() method.

You can see more examples in Jupyter notebook.

chezou/tabula-py
_tabula-py - Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame_github.com
[](https://github.com/chezou/tabula-py/blob/master/examples/tabula_example.ipynb)

I hope you will enjoy data wrangling with tabula-py. Any feedback would be welcome!

Waiting for your collaboration!

If you have any trouble with tabula-py, please file an issue on GitHub. I don’t want to receive emails because the answer will not share with other people. Make sure to fill the issue template, it will reduce many costs for me to solve the problem. Or, I also check StackOverflow. You can ask about it.

Other tabula-py articles

Avatar
Aki Ariga
Machine Learning Engineer

Interested in Machine Learning, ML Ops, and Data driven business.