tabula-py can now extract remote PDFs and multiple tables at once
tabula-py is a Python library that lets you extract tables from PDFs into pandas DataFrames. Today, I released v0.8.0. In this post, I will introduce the improvements made since my previous post about tabula-py. If you aren't familiar with tabula-py, you can read the previous one.
tabula-py: Extract table from PDF into Python DataFrame
Today, I released tabula-py 0.3.0, which extracts table from PDF into Python pandas's DataFrame.
https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302
- Able to read a remote PDF by passing a URL
- [Experimental] Add multiple_tables option
- Add batch conversion method
- Will deprecate
I will explain the important features.
Read a remote PDF by passing a URL
If you want to extract a DataFrame from a PDF on the internet, you can pass its URL to read_pdf, so you no longer need to download the file manually.
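Here is a minimal sketch of how this might look. The helper names `is_remote_pdf` and `read_tables` are hypothetical and only there for illustration; tabula-py itself just needs the URL passed to `read_pdf`. Running it requires tabula-py and Java to be installed.

```python
from urllib.parse import urlparse


def is_remote_pdf(path_or_url):
    """Return True when the argument looks like an http(s) URL (hypothetical helper)."""
    return urlparse(str(path_or_url)).scheme in ("http", "https")


def read_tables(path_or_url, **options):
    """Pass a local path or a URL straight to tabula-py's read_pdf."""
    # Imported lazily so the helper above works without tabula-py installed.
    import tabula  # third-party: pip install tabula-py (also needs Java)

    source = "remote URL" if is_remote_pdf(path_or_url) else "local file"
    print(f"Reading tables from {source}: {path_or_url}")
    # The same call handles both cases; no manual download step is needed.
    return tabula.read_pdf(path_or_url, **options)
```

For example, `read_tables("https://example.com/report.pdf")` would read from a placeholder URL, and `read_tables("tests/resources/data.pdf")` would read the local file used elsewhere in this post.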
[Experimental] Add multiple_tables option
Because tabula-py is a simple wrapper of tabula-java, it was hard to handle multiple tables on a page. Now you can extract multiple tables on a page using the multiple_tables option:
read_pdf('tests/resources/data.pdf', pages=2, multiple_tables=True)
This function creates a list of DataFrames via JSON from tabula-java, so if tabula-java's JSON format changes, the output could break. If you see CParserError, try setting this option.
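As a rough sketch of working with the result, assuming `multiple_tables=True` returns a list of DataFrames as described above, you could loop over the tables like this. `read_all_tables` and `table_shapes` are hypothetical helper names, not part of tabula-py.

```python
def read_all_tables(pdf_path, pages=2):
    """Extract every table on the given page(s) as a list of DataFrames."""
    import tabula  # third-party: pip install tabula-py (also needs Java)

    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


def table_shapes(tables):
    """Return the (rows, columns) shape of each extracted table."""
    return [tuple(table.shape) for table in tables]
```

Calling `table_shapes(read_all_tables('tests/resources/data.pdf'))` is a quick way to check how many tables were found and how large each one is before processing them further.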
Add batch conversion method
After tabula-java v0.9.2, we can extract tables from PDFs in batch. tabula-py exposes this through its batch conversion method.
Pass the path of the directory that contains the PDFs, not the path of a specific PDF.
tabula-py writes the extracted tables to the same directory as the input files.
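A minimal sketch of a batch run, assuming the method is `convert_into_by_batch` as in current tabula-py releases. The wrapper name `batch_convert` and the up-front directory check are my own additions for illustration, and the file listing assumes the outputs use the format name as their extension.

```python
import os


def batch_convert(pdf_dir, output_format="csv"):
    """Convert every PDF in pdf_dir and list the files written next to them."""
    # The batch method expects a directory, not a single PDF path.
    if not os.path.isdir(pdf_dir):
        raise ValueError(f"{pdf_dir!r} is not a directory of PDFs")

    import tabula  # third-party: pip install tabula-py (also needs Java)

    tabula.convert_into_by_batch(pdf_dir, output_format=output_format, pages="all")

    # Outputs land in the same directory as the inputs.
    suffix = "." + output_format
    return sorted(name for name in os.listdir(pdf_dir) if name.lower().endswith(suffix))
```

For example, `batch_convert("tests/resources")` would convert each PDF in that directory to CSV, while passing a single file such as `tests/resources/data.pdf` fails early with a clear error.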
There are several problems that may be fixed after the release of tabula-java 0.9.3, e.g. handling embedded fonts, including Japanese…
Waiting for your collaboration!
If you have any trouble with tabula-py, please file an issue on GitHub. I don't want to receive emails because the answers will not be shared with other people. Make sure to fill in the issue template; it greatly reduces the cost for me to solve the problem.