tabula-py 2.8.0 now uses jpype to launch the JVM

2023-09-09 17:13:08 -07:00 · 2 min read
blog python

Recently, I released tabula-py 2.8.0. It is a major release because it now uses jpype to launch the JVM. This reduces JVM launch overhead because jpype reuses a single JVM via JNI instead of starting a new Java process for every call.

How fast is it?

I measured the execution time of the read_pdf_with_template function, which launched a Java process repeatedly in the previous version.

The example template contains 4 rules, which means it calls tabula-java 4 times.

$ cat examples/data.tabula-template.json | jq
[
  {
    "page": 1,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 565.5698550000001,
    "y1": 123.999615,
    "y2": 531.7446150000001,
    "width": 411.57,
    "height": 407.745
  },
  {
    "page": 2,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 453.879855,
    "y1": 123.99884999999993,
    "y2": 210.44384999999994,
    "width": 299.88,
    "height": 86.44500000000001
  },
  {
    "page": 2,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 487.53985500000005,
    "y1": 410.99625000000003,
    "y2": 497.44125,
    "width": 333.54,
    "height": 86.44500000000001
  },
  {
    "page": 3,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 235.85485500000001,
    "y1": 123.99885000000012,
    "y2": 322.8988500000001,
    "width": 81.855,
    "height": 198.9
  }
]
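
To make the template concrete, here is a minimal sketch of how each rule could be mapped to a page and an area in the (top, left, bottom, right) order that tabula's area option uses. The helper name rule_to_area is hypothetical and this is not tabula-py's actual implementation. It only assumes that x1/y1 is the top-left corner and x2/y2 the bottom-right corner, which is consistent with the width and height fields.

```python
import json

# First rule of the example template, copied from the JSON above.
TEMPLATE_JSON = """
[
  {
    "page": 1,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 565.5698550000001,
    "y1": 123.999615,
    "y2": 531.7446150000001,
    "width": 411.57,
    "height": 407.745
  }
]
"""


def rule_to_area(rule):
    """Hypothetical helper: return (page, [top, left, bottom, right])."""
    area = [rule["y1"], rule["x1"], rule["y2"], rule["x2"]]
    return rule["page"], area


rules = json.loads(TEMPLATE_JSON)
# One extraction per rule, so a 4-rule template means 4 extractions.
extractions = [rule_to_area(rule) for rule in rules]

for page, (top, left, bottom, right) in extractions:
    # The computed size should agree with the rule's width/height fields.
    print(f"page={page} width={right - left:.2f} height={bottom - top:.3f}")
```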

The result is as follows:

v2.7.0:

$ python -m timeit 'import tabula; tabula.read_pdf_with_template("examples/data.pdf", "examples/data.tabula-template.json")' 2> /dev/null
1 loop, best of 5: 1.31 sec per loop

v2.8.0:

$ python -m timeit 'import tabula; tabula.read_pdf_with_template("examples/data.pdf", "examples/data.tabula-template.json")' 2> /dev/null
1 loop, best of 5: 75 msec per loop

It is about 17 times faster than the previous version!
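
For reference, the speedup follows directly from the two timeit results above. The variable names are only for illustration.

```python
# Best-of-5 timings reported by `python -m timeit` above.
v2_7_0_seconds = 1.31
v2_8_0_seconds = 75 / 1000  # 75 msec

speedup = v2_7_0_seconds / v2_8_0_seconds
print(f"{speedup:.1f}x faster")
```

The ratio comes out at roughly 17.5, which is rounded down to 17 in the text.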

Caveats

Since jpype doesn’t allow restarting the JVM, java_options only take effect the first time the JVM is launched. If you want to change java_options, you need to restart the Python process.
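
The caveat can be illustrated with a small sketch of the "start the JVM only once" pattern. This is not tabula-py's actual code: start_jvm_once and its jvm parameter are hypothetical names used for illustration, while jpype.isJVMStarted() and jpype.startJVM() are the real jpype calls. The jvm parameter exists only so the logic can be exercised without a Java runtime.

```python
def start_jvm_once(java_options, jvm=None):
    """Start the JVM unless one is already running in this process.

    Returns True if this call started the JVM, and False if the given
    java_options were ignored because a JVM already exists.
    """
    if jvm is None:
        # Real jpype; requires the JPype1 package and a Java runtime.
        import jpype as jvm

    if jvm.isJVMStarted():
        # The running JVM keeps the options it was started with.
        return False

    jvm.startJVM(*java_options, convertStrings=False)
    return True
```

Under this pattern, a first call such as start_jvm_once(["-Xmx256m"]) would start the JVM, and a later call with different options would return False without changing anything.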

Challenges for releasing v2.8.0

I had to solve several challenges to release this version.

The test issue with different java_options

As I mentioned, jpype doesn’t allow restarting the JVM. This caused unit tests that use different java_options to fail. I solved this by running them in separate nox sessions.

See https://github.com/chezou/tabula-py/pull/356/files#r1306600161 for details.

This limitation is not a big deal for tabula-py users because they rarely need to change java_options.

Read the Docs default behavior change

Read the Docs changed the packages it installs by default for Sphinx. I hadn’t declared the Sphinx dependency, so the build failed.

Previously, RTD installed the latest versions of Sphinx and sphinx-rtd-theme by default. Now it installs very old versions of them, as described in https://github.com/readthedocs/readthedocs.org/issues/10670#issuecomment-1694761746

I solved this by pinning the versions of Sphinx and sphinx-rtd-theme.

Aki Ariga
Authors
Staff Software Engineer
AI Product Engineer. Interested in Machine Learning, MLOps, and Data driven business. If you like my blog post, I’m glad if you can buy me a tea 😉
