tabula-py is now able to extract remote PDFs and multiple tables at once

2017-05-27 19:18:39 -07:00 · Aki Ariga

Note

(Note: Oct 7th, 2019) As of Oct. 2019, I have launched a documentation site and a Google Colab notebook for tabula-py. The FAQ is a good place to look for tips on accurate extraction.

tabula-py is a Python library that lets you extract tables from PDFs into pandas DataFrames. Today, I released v0.8.0. In this post, I will introduce the improvements made since my previous post about tabula-py. If you aren’t familiar with tabula-py, see the previous post.

Change Notes

Able to read a remote PDF by passing its URL
[Experimental] Add multiple_tables mode
Add batch conversion method: convert_into_by_batch()
Add encoding option
Add java_options
Will deprecate read_pdf_table() method

I will explain the important features below.

Read a remote PDF by passing a URL

If you want to extract a DataFrame from a PDF on the internet, you can now read the remote PDF directly, without downloading it manually.

read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/12s0324.pdf")
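To make clear what this saves you, here is a rough sketch of the manual steps you would otherwise write yourself: detect a URL, download it to a temporary file, and hand the local path to read_pdf. The helper names (is_remote, localize, read_tables) are hypothetical and exist only for illustration. This is not how tabula-py implements the feature internally.

```python
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_remote(path):
    """Return True when `path` looks like an http(s) or ftp URL."""
    return urlparse(path).scheme in ("http", "https", "ftp")


def localize(path):
    """Return a local file path, downloading `path` first if it is a URL."""
    if not is_remote(path):
        return path
    # Hypothetical manual download: copy the remote PDF into a temporary file.
    with urllib.request.urlopen(path) as response:
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


def read_tables(path, **kwargs):
    """Run tabula's read_pdf on a local copy of `path`."""
    # Imported lazily: requires tabula-py and a Java runtime to be installed.
    from tabula import read_pdf
    return read_pdf(localize(path), **kwargs)
```

With v0.8.0 none of this is needed in your own code: import read_pdf from tabula and pass the URL to it directly.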

[Experimental] Add "multiple_tables" mode

Because tabula-py is a simple wrapper of tabula-java, it used to be hard to handle multiple tables on a page. Now you can extract multiple tables on a page using the multiple_tables option.

read_pdf('tests/resources/data.pdf', pages=2, multiple_tables=True)

This function creates a list of DataFrames via JSON from tabula-java, so if tabula-java’s JSON format changes, the output could break. If you see CParserError, try setting the multiple_tables option.
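The idea of building one table per JSON entry can be sketched as follows. The JSON shape used here (a list of tables, each with a "data" array of rows whose cells carry a "text" field) is an assumption about tabula-java's output, and tables_from_json is a hypothetical helper, not tabula-py's actual code. The sketch uses plain lists; each inner list of rows is what would be handed to pandas.DataFrame.

```python
import json


def tables_from_json(json_text):
    """Turn tabula-java-style JSON into one list of rows per table.

    Assumed shape: [{"data": [[{"text": ...}, ...], ...]}, ...]
    """
    tables = []
    for table in json.loads(json_text):
        # Keep only the cell text; position fields, if present, are ignored.
        rows = [[cell["text"] for cell in row] for row in table["data"]]
        tables.append(rows)
    return tables


# Made-up sample: two tables with different column counts.
sample = json.dumps([
    {"data": [[{"text": "a"}, {"text": "b"}], [{"text": "1"}, {"text": "2"}]]},
    {"data": [[{"text": "x"}]]},
])

tables = tables_from_json(sample)
# tables == [[["a", "b"], ["1", "2"]], [["x"]]]
```

Each table keeps its own column count, which is likely why this mode avoids the parser error that can occur when tables of different widths are read as a single CSV.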

Add batch conversion method: "convert_into_by_batch()"

Since tabula-java v0.9.2, we can extract tables from PDFs in batch. You can use this feature through the convert_into_by_batch() method.

convert_into_by_batch(path_to_dir, output_format='csv')

You should pass the path of the directory containing the PDFs, not the path of a specific PDF.

tabula-py writes the extracted tables to the same directory as the input files.
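As a small sketch of where the results end up, the helper below lists the output files you would expect after a batch run. It assumes each output keeps the input file's name with the extension replaced by the output format, and that only files ending in ".pdf" are picked up. Both are assumptions, and expected_outputs and convert_directory are hypothetical names.

```python
from pathlib import Path


def expected_outputs(path_to_dir, output_format="csv"):
    """List the output paths a batch run is expected to create.

    Assumes "name.pdf" becomes "name.<output_format>" in the same directory.
    """
    pdfs = sorted(p for p in Path(path_to_dir).iterdir() if p.suffix == ".pdf")
    return [p.with_suffix("." + output_format) for p in pdfs]


def convert_directory(path_to_dir, output_format="csv"):
    """Run the batch conversion, then report the expected output paths."""
    # Imported lazily: requires tabula-py and a Java runtime to be installed.
    from tabula import convert_into_by_batch
    convert_into_by_batch(path_to_dir, output_format=output_format)
    return expected_outputs(path_to_dir, output_format)
```

Calling convert_directory on a folder of PDFs with output_format='csv' would then return one .csv path per PDF, which you can check with Path.exists() to confirm the conversion wrote what you expected.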

TODOs

There are several problems that may be fixed after the release of tabula-java 0.9.3, e.g. handling embedded fonts, including Japanese…

Waiting for your collaboration!

If you have any trouble with tabula-py, please file an issue on GitHub. I would rather not receive emails, because the answers cannot be shared with other people. Please make sure to fill in the issue template; it greatly reduces the effort I need to solve the problem.
