A recent update of tabula-py

Photo by [Joshua Rawson-Harris](https://unsplash.com/@joshrh19?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

This article is a repost of a Patreon article published last December. I’m planning to release the next version of tabula-py within a few weeks.

(Note: Oct 7th, 2019)
As of October 2019, I have launched a documentation site and a Google Colab notebook for tabula-py. The FAQ is a good place to look for tips on accurate extraction.

This is my first post on Patreon. Apologies for the delayed announcement of the recent tabula-py update. In this post, I will introduce its key features.

Use Tabula app template

The Tabula app can export a template so that the same bounding boxes can be reused for extraction. tabula-py can now load a Tabula app template and extract tables with it.

import tabula
dfs = tabula.read_pdf_with_template('./examples/data.pdf', './examples/data.tabula-template.json', pandas_options={'header': 0})
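
For intuition, a template is a JSON list with one entry per selected region. The sketch below assumes each entry carries page, extraction_method, x1, y1, x2 and y2 keys, and shows how one entry could be mapped to read_pdf-style options. The helper name template_entry_to_options and the sample values are hypothetical and made up for illustration; this is not tabula-py's own code.

```python
import json


def template_entry_to_options(entry):
    """Map one template entry to read_pdf-style keyword options.

    Assumes the entry has page, extraction_method, x1, y1, x2, y2 keys.
    """
    options = {
        "pages": entry["page"],
        # tabula expects an area as [top, left, bottom, right]
        "area": [entry["y1"], entry["x1"], entry["y2"], entry["x2"]],
    }
    method = entry.get("extraction_method")
    if method in ("guess", "lattice", "stream"):
        # Turn the extraction method into the matching boolean flag
        options[method] = True
    return options


# Hypothetical template content, not taken from a real exported file
template_json = (
    '[{"page": 1, "extraction_method": "stream", '
    '"x1": 10.0, "y1": 20.0, "x2": 300.0, "y2": 400.0}]'
)
options_list = [template_entry_to_options(e) for e in json.loads(template_json)]
print(options_list)
```

Each resulting dictionary describes one region, which is why extracting with a template returns a list of DataFrames rather than a single one.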

Support file-like object

Like many Python libraries, tabula-py can now extract from a file-like object. It also accepts pathlib.Path objects.

import tabula

# With a file-like object (opened in binary mode)
pdf_path = 'tests/resources/data.pdf'
with open(pdf_path, 'rb') as f:
    df = tabula.read_pdf(f)

# With pathlib
from pathlib import Path
pdf_path = 'tests/resources/data.pdf'
df = tabula.read_pdf(Path(pdf_path))
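
If you wonder how a path-based tool can accept all three input types, one common approach is to normalize them to a file path first. The following is a minimal sketch of that idea using only the standard library. The function name localize_input is hypothetical, and this is an illustration of the general technique rather than tabula-py's actual implementation.

```python
import io
import os
import shutil
import tempfile
from pathlib import Path


def localize_input(source):
    """Return (path, is_temporary) for a str, Path, or binary file-like object."""
    if isinstance(source, (str, os.PathLike)):
        # Strings and Path objects already point at a file on disk
        return os.fspath(source), False
    if hasattr(source, "read"):
        # Copy the stream to a temporary file so a path-based tool can read it
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
            shutil.copyfileobj(source, tmp)
        return tmp.name, True
    raise TypeError(f"unsupported input type: {type(source).__name__}")


# Demo with in-memory bytes standing in for a real PDF
demo_path, demo_is_temp = localize_input(io.BytesIO(b"%PDF-1.4 example bytes"))
with open(demo_path, "rb") as f:
    demo_bytes = f.read()
os.remove(demo_path)  # the caller cleans up temporary copies
```

The is_temporary flag tells the caller whether the file should be deleted after extraction.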

Allow multiple area option

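As of tabula-java v1.0.2, tabula can handle multiple areas in a single call. Each area is given as [top, left, bottom, right]; with relative_area=True, the values are percentages of the page instead of absolute positions.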

import tabula

pdf_path = 'tests/resources/MultiColumn.pdf'
# Relative area
df_relative = tabula.read_pdf(
    pdf_path, pages=1, area=[[0, 0, 100, 50], [0, 50, 100, 100]], relative_area=True)
# Absolute area
df_absolute = tabula.read_pdf(
    pdf_path, pages=1, area=[[0, 0, 451, 212], [0, 212, 451, 425]])
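
To see how the two forms relate, here is a small sketch that converts relative areas (percentages) to absolute ones. It assumes a page that is 451 high and 425 wide, which is only inferred from the absolute example above and is not stated anywhere for this PDF. relative_to_absolute is a hypothetical helper, not part of tabula-py.

```python
def relative_to_absolute(area, page_height, page_width):
    """Convert [top, left, bottom, right] percentages to absolute positions."""
    top, left, bottom, right = area
    return [
        top * page_height / 100,
        left * page_width / 100,
        bottom * page_height / 100,
        right * page_width / 100,
    ]


# Assumed page size, inferred from the absolute example
PAGE_HEIGHT, PAGE_WIDTH = 451, 425

relative_areas = [[0, 0, 100, 50], [0, 50, 100, 100]]
absolute_areas = [
    relative_to_absolute(a, PAGE_HEIGHT, PAGE_WIDTH) for a in relative_areas
]
print(absolute_areas)
# [[0.0, 0.0, 451.0, 212.5], [0.0, 212.5, 451.0, 425.0]]
```

Under that assumption, the two relative areas split the page into a left half and a right half, which roughly matches the absolute values used above.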

Tip: Get table position

This is not a new feature, but I think it might be helpful for some PDFs.
Detailed post: https://github.com/chezou/tabula-py/issues/102

The JSON output of read_pdf includes position information, so you can get the table position as follows:

In [5]: tables = read_pdf("./examples/data.pdf", output_format="json", pages=2)
In [9]: top = tables[0]['top']
In [10]: left = tables[0]['left']
In [11]: bottom = tables[0]['height'] + top
In [12]: right = tables[0]['width'] + left
In [13]: top, bottom, left, right
Out[13]: (0.0, 528.8800048828125, 0.0, 564.8800048828125)
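
The same arithmetic can be wrapped in a small function that returns the position in the [top, left, bottom, right] order used by the area option. This is a minimal sketch: table_to_area is a hypothetical helper, and the sample dictionary only reuses the top, left, height and width values from the session above.

```python
def table_to_area(table):
    """Build [top, left, bottom, right] from one table of the JSON output."""
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    return [top, left, bottom, right]


# Values taken from the session above
sample_table = {
    "top": 0.0,
    "left": 0.0,
    "height": 528.8800048828125,
    "width": 564.8800048828125,
}
area = table_to_area(sample_table)
print(area)
```

Note that the order differs from the tuple printed in the session, which is (top, bottom, left, right). The resulting list could then be passed as the area argument of a later read_pdf call.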

If you have any questions, please ask on Stack Overflow!

Aki Ariga
Machine Learning Engineer

Interested in Machine Learning, ML Ops, and Data driven business.
