tabula-py 2.8.0 now uses jpype to launch JVM

Recently, I released tabula-py 2.8.0. It is a major release because it uses jpype to launch JVM. This means that it reduces JVM launch time since jpype reuse JVM via JNI.

How fast is it?

I measured read_pdf_with_template function execution time, which repeatedly launches Java process in the previous version.

The example template contains 4 rules, which means it calls tabula-java 4 times.

$ cat examples/data.tabula-template.json | jq
[
  {
    "page": 1,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 565.5698550000001,
    "y1": 123.999615,
    "y2": 531.7446150000001,
    "width": 411.57,
    "height": 407.745
  },
  {
    "page": 2,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 453.879855,
    "y1": 123.99884999999993,
    "y2": 210.44384999999994,
    "width": 299.88,
    "height": 86.44500000000001
  },
  {
    "page": 2,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 487.53985500000005,
    "y1": 410.99625000000003,
    "y2": 497.44125,
    "width": 333.54,
    "height": 86.44500000000001
  },
  {
    "page": 3,
    "extraction_method": "guess",
    "x1": 153.99985500000003,
    "x2": 235.85485500000001,
    "y1": 123.99885000000012,
    "y2": 322.8988500000001,
    "width": 81.855,
    "height": 198.9
  }

The result is as follows:

v2.7.0:

$ python -m timeit 'import tabula; tabula.read_pdf_with_template("examples/data.pdf", "examples/data.tabula-template.json")' 2> /dev/null
1 loop, best of 5: 1.31 sec per loop

v2.8.0:

$ python -m timeit 'import tabula; tabula.read_pdf_with_template("examples/data.pdf", "examples/data.tabula-template.json")' 2> /dev/null
1 loop, best of 5: 75 msec per loop

It is 17 times faster than the previous version!

Caveats

Since jpype doesn’t allow to reboot JVM, you can pass java_options for the first time only. If you want to change java_options, you need to restart Python process.

Challenges for releasing v2.8.0

I had to solve several challenges to release this version.

The test issue with different java_options

As I mentioned, jpype doesn’t allow to reboot JVM. This causes unit test with different java_options to fail. I solved this by separating run with nox session.

See https://github.com/chezou/tabula-py/pull/356/files#r1306600161 for details.

This limitation is not a big deal for tabula-py users because tabula-py users don’t need to change java_options frequently.

Read the docs default behavior change

Read the docs changed the default installation packages for Sphinx. I didn’t declared the dependency for Sphinx, so it caused the build failure.

The default behavior of RTD was just installing the latest version of Sphinx and sphinx-rtd-theme, however, now it installs very old version of them like: https://github.com/readthedocs/readthedocs.org/issues/10670#issuecomment-1694761746

I solved this by pinning the versions of dependency for Sphinx and sphinx-rtd-theme.

Aki Ariga
Aki Ariga
Staff Software Engineer

Interested in Machine Learning, ML Ops, and Data driven business. If you like my blog post, I’m glad if you can buy me a tea 😉

  Gift a cup of Tea

Related