Blog Posts | Democratizing Data

Migrated From Netlify to Cloudflare Pages

Fri, 02 Feb 2024 16:42:14 -0800

Netlify is a great service, but it is also known to be slow in Japan. I had been using Netlify to host my blog for a long time, but I decided to migrate to Cloudflare Pages to speed up access to my blog from Japan.

The migration from Netlify is pretty straightforward. I just needed to follow the official guide: Migrating from Netlify to Pages.

My blog is built with Hugo, and I use Hugoblox, f.k.a. Wowchemy, as the theme. I manage my blog content on GitHub, so I just added the Cloudflare Pages app to my GitHub repository, and it automatically detected the settings and built the site.

If you use Cloudflare for DNS, it automatically sets up the DNS settings for you.

The one special consideration in the build settings is that I need to set the environment variable HUGO_VERSION to the version of Hugo that I use. In my case, I use Hugo 0.88.1, so I set HUGO_VERSION to 0.101.0. Also, I need to pass the -b URL option; its value was $URL on Netlify, but it is $CF_PAGES_URL on Cloudflare Pages.

The build is pretty fast, and the PageSpeed Insights score has also improved. Access feels faster in my browser as well. I’m happy with the migration. Actually, the major cause of the slowness was downloading fonts, and using Font cache on Cloudflare solved the problem.

Scrape Notion and convert into PDF

Fri, 26 Jan 2024 18:40:00 -0700

I love VanGohan, a Japanese meal kit provider in Vancouver. Their meal kits are really tasty, authentic Japanese food. I can’t live without them. When I visited Japan last year, I wasn’t too eager to find nice Japanese restaurants because of them.

Recipe on Notion is good, if it’s printable

They provide their recipes on Notion. Reading the recipes there is great since they can fix recipes quite quickly.

However, there’s one caveat with Notion: it doesn’t provide printable pages. It’s super annoying to copy and paste the recipes into a memo app and print them out. I asked Notion’s support team, but their answer implied that it isn’t a priority.

Ok, it’s automation time!

Scrape Notion with Python

Python has been my handy tool for this kind of automation for years. Originally, I used beautifulsoup, which is a great package for web scraping, but I gave up on it because Notion’s content is rendered dynamically by JavaScript.

I chose Selenium and it works like a charm.

Here is the GitHub repository:

The key takeaways are:

The chromedriver-autoinstaller package is useful for avoiding the extra effort of installing the Chrome driver.
Selenium makes it easy enough to write the PDF export code.
Running the script on GitHub Actions is easy. Don’t forget to install fonts if the page is not in English.
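As a rough illustration of the takeaways above, here is a minimal sketch of exporting a JavaScript-rendered page to PDF with Selenium. It is not the code from my repository: the function names, the fixed sleep used to wait for rendering, and the headless flag are assumptions made for this example.

```python
import base64
import time
from pathlib import Path


def save_pdf(pdf_base64: str, path: str) -> int:
    """Decode a base64-encoded PDF and write it to disk; return bytes written."""
    data = base64.b64decode(pdf_base64)
    Path(path).write_bytes(data)
    return len(data)


def export_page_as_pdf(url: str, path: str, wait_seconds: float = 5.0) -> int:
    """Render a JavaScript-heavy page in headless Chrome and save it as a PDF."""
    # Imported lazily so that save_pdf can be used without a browser installed.
    import chromedriver_autoinstaller
    from selenium import webdriver

    chromedriver_autoinstaller.install()  # fetch a matching chromedriver if needed
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude wait for client-side rendering (assumption)
        # print_page() returns the PDF as a base64-encoded string.
        return save_pdf(driver.print_page(), path)
    finally:
        driver.quit()
```

A fixed sleep is the simplest way to wait for the page to render; an explicit wait on a known element would be more robust if the page structure is stable.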

Originally, I thought I had to prepare a Docker image, but I realized it was not mandatory. Maintaining a Docker image for this kind of hobby script would be costly. So, I’m going to keep this approach and will revisit later whether it is the right way.

Currently, the GitHub Actions workflow runs on a schedule. It updates the PDFs in the repository automatically.

https://github.com/chezou/vangohan-pdf/tree/main/docs

Edit: Now I use Cloudflare Pages to host the PDFs. You can check at https://vangohan.chezo.uno/.

No Python environment on a local machine is needed anymore.

Yay, automation is completed! 😁

tabula-py 2.8.0 now uses jpype to launch JVM

Sat, 09 Sep 2023 17:13:08 -0700

Recently, I released tabula-py 2.8.0. It is a major release because it uses jpype to launch the JVM. This reduces JVM launch time, since jpype reuses the JVM via JNI.

How fast is it?

I measured the execution time of the read_pdf_with_template function, which repeatedly launched a Java process in the previous version.

The example template contains 4 rules, which means it calls tabula-java 4 times.

$ cat examples/data.tabula-template.json | jq
[
 {
 "page": 1,
 "extraction_method": "guess",
 "x1": 153.99985500000003,
 "x2": 565.5698550000001,
 "y1": 123.999615,
 "y2": 531.7446150000001,
 "width": 411.57,
 "height": 407.745
 },
 {
 "page": 2,
 "extraction_method": "guess",
 "x1": 153.99985500000003,
 "x2": 453.879855,
 "y1": 123.99884999999993,
 "y2": 210.44384999999994,
 "width": 299.88,
 "height": 86.44500000000001
 },
 {
 "page": 2,
 "extraction_method": "guess",
 "x1": 153.99985500000003,
 "x2": 487.53985500000005,
 "y1": 410.99625000000003,
 "y2": 497.44125,
 "width": 333.54,
 "height": 86.44500000000001
 },
 {
 "page": 3,
 "extraction_method": "guess",
 "x1": 153.99985500000003,
 "x2": 235.85485500000001,
 "y1": 123.99885000000012,
 "y2": 322.8988500000001,
 "width": 81.855,
 "height": 198.9
 }
]

The result is as follows:

v2.7.0:

$ python -m timeit 'import tabula; tabula.read_pdf_with_template("examples/data.pdf", "examples/data.tabula-template.json")' 2> /dev/null
1 loop, best of 5: 1.31 sec per loop

v2.8.0:

$ python -m timeit 'import tabula; tabula.read_pdf_with_template("examples/data.pdf", "examples/data.tabula-template.json")' 2> /dev/null
1 loop, best of 5: 75 msec per loop

It is about 17 times faster than the previous version!
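For reference, the ratio can be checked directly from the two timeit results above. This is just the arithmetic, with the measured values copied in as constants:

```python
old_seconds = 1.31   # v2.7.0, best of 5
new_seconds = 0.075  # v2.8.0, best of 5 (75 msec)

speedup = old_seconds / new_seconds
print(f"{speedup:.1f}x faster")
```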

Caveats

Since jpype doesn’t allow restarting the JVM, java_options take effect only on the first call. If you want to change java_options, you need to restart the Python process.

Challenges for releasing v2.8.0

I had to solve several challenges to release this version.

The test issue with different `java_options`

As I mentioned, jpype doesn’t allow restarting the JVM. This causes unit tests with different java_options to fail. I solved this by running them in separate nox sessions.

See https://github.com/chezou/tabula-py/pull/356/files#r1306600161 for details.

This limitation is not a big deal for tabula-py users because tabula-py users don’t need to change java_options frequently.

Read the Docs default behavior change

Read the Docs changed the default packages it installs for Sphinx. I hadn’t declared the Sphinx dependencies, so this caused a build failure.

Previously, RTD simply installed the latest versions of Sphinx and sphinx-rtd-theme by default; now it installs very old versions of them, as described in: https://github.com/readthedocs/readthedocs.org/issues/10670#issuecomment-1694761746

I solved this by pinning the versions of Sphinx and sphinx-rtd-theme.

4 Steps to Release a CLI in Python

Fri, 20 May 2022 23:32:41 -0700

This is what I learned from creating a Python CLI (digdaglog2sql) in a day.

In just 4 steps, you can release a CLI written in Python easily.

Create a project by using poetry

Poetry is a modern Python packaging and dependency management tool. It is rapidly becoming popular and the de facto standard.

Poetry enables us to manage package dependencies, create a project from a template, and publish to PyPI.

To set up a project with Poetry, this article is the best to read, even if you are building a CLI.

https://future--architect-github-io.translate.goog/articles/20210611a/?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=ja&_x_tr_pto=wapp (originally written in Japanese)

One thing I added to my project is isort, which sorts imports automatically.

Here is the example of my project.

[tool.taskipy.tasks]
test = { cmd = "pytest tests", help = "runs all unit tests" }
pr_test = "task lint"
fmt = { cmd = "black tests digdaglog2sql && isort digdaglog2sql tests", help = "format code" }
lint = { cmd = "task lint_black && task lint_flake8 && task lint_isort && task lint_mypy", help = "exec lint" }
lint_flake8 = "flake8 --max-line-length=88 tests digdaglog2sql"
lint_mypy = "mypy tests digdaglog2sql"
lint_black = "black --check tests digdaglog2sql"
lint_isort = "isort digdaglog2sql tests --check-only"

Create a CLI with Click/Cloup

Click is a popular Python package for building command line tools. You can easily create a CLI by using decorators.

Here is the example from the Click website:

import click

@click.command()
@click.option("--count", default=1, help="Number of greetings.")
@click.option("--name", prompt="Your name",
              help="The person to greet.")
def hello(count, name):
    """Simple program that greets NAME for a total of COUNT times."""
    for _ in range(count):
        click.echo("Hello, %s!" % name)

if __name__ == '__main__':
    hello()

Cloup is an extension of Click.

With Cloup, you can handle option groups and complex constraints like mutually_exclusive:

@option_group(
    "Cool options",
    option('--foo', help='This text should describe the option --foo.'),
    option('--bar', help='This text should describe the option --bar.'),
    constraint=mutually_exclusive,
)

Constraints of Cloup can validate the dependency and it also renders constraints in help.

Use poetry-dynamic-versioning for version management

poetry-dynamic-versioning is a Python package that does the same thing as setuptools-scm. You don’t need to write the version number by hand since this package takes the version from a Git tag, e.g., “v.0.1.0”.

Managing the version with Git enables you to release to PyPI from GitHub Actions. This means you can even release to PyPI from a mobile device by creating a release on GitHub.

After installing poetry-dynamic-versioning, you just add three things to pyproject.toml:

[tool.poetry]
version = "0.0.0"

[tool.poetry-dynamic-versioning]
enable = true

[build-system]
requires = ["poetry-core>=1.0.0", "poetry-dynamic-versioning"]
build-backend = "poetry.core.masonry.api"

Note that the build-system configuration may vary depending on how you install poetry-dynamic-versioning. See its documentation for details.

Introduce GitHub Actions to release the package to PyPI

As I mentioned above, I highly recommend using GitHub Actions to release a package to PyPI.

Since GitHub now provides a release notes generation feature, creating a release on GitHub that triggers the PyPI release is the best way to publish a new version.

Here is a GitHub Actions snippet that releases to PyPI by using Poetry.

name: Upload Python Package

on:
  release:
    types: [created]

permissions:
  contents: read

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v3
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install poetry
    - name: Build and publish package
      run: |
        poetry version $(git describe --tags --abbrev=0)
        poetry build
        poetry publish --username __token__ --password ${{ secrets.PYPI_API_TOKEN }}

Note that you can create a PyPI API token on PyPI, but if you need a project-scoped token, you need to upload the package manually first.

Create data lineage from Trino/Hive queries in digdag log with Python

Thu, 05 May 2022 20:31:05 -0700

What’s data lineage?

Data lineage describes “where does this data come from, and where does it go?”

I learned this term at my previous job. They provided “Cloudera Navigator”, which builds data lineage from the execution logs of Hive, Spark, etc.

lineage of Cloudera Navigator via https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cn_lineage_generation.html

sqllineage is an awesome open source tool for visualizing lineage

Recently, I learned there is a Python package called sqllineage, which analyzes and visualizes data lineage from SQL.

sqllineage consists of a Python implementation that analyzes SQL and a web application written in React.

Visualize data lineage from Treasure Data’s workflow logs

I found that Treasure Data’s workflow logs include the SQL queries that were run. But the log still needs some processing to extract plain SQL.

So, I created digdaglog2sql to extract SQL from Treasure Workflow logs.

You can use it with Python 3.7+. Here is an overview of the usage; check the details on GitHub.

Install via pip:

pip install --user digdaglog2sql

If you have a workflow log downloaded from Treasure Data, you can convert it into SQL as:

digdaglog2sql --input workflow-log.txt --output output.sql

Or, if you want to extract SQL from a specific workflow, you can use its session ID.

export TD_API_KEY=1234XXX/YYYYYYYY
digdaglog2sql --session-id 12345 --site us --output output.sql

You can fetch SQL from your own hosted digdag as follows:

digdaglog2sql --session-id 12345 --endpoint digdag.example.com --output output.sql

~~Note that, as of May 5, 2022, sqllineage and sqlparse, which is an important backend of sqllineage, are not fully compatible with Trino and Hive queries.~~

As of 2022/05/11, the issues in sqllineage around Hive/Trino were fixed, and the fix is available in 1.3.5 on PyPI. This means you no longer need to install sqllineage from source, which requires node.

As of 2022/10/06, the issue in sqlparse was resolved in 0.4.3.

~~These are the PRs that approaches the issues:~~

✅ https://github.com/reata/sqllineage/pull/252 -> Released in 1.3.5
✅ https://github.com/reata/sqllineage/pull/255 -> Released in 1.3.5
✅ https://github.com/andialbrecht/sqlparse/pull/662 -> Released in 0.4.3
✅ https://github.com/andialbrecht/sqlparse/pull/664 -> Released in 0.4.3

~~Don’t worry about it. I prepared patched branches on GitHub. You can install sqllineage and sqlparse as the following:~~

pip install git+https://github.com/chezou/sqlparse.git@trino#egg=sqlparse==0.4.3.dev0
pip install sqllineage

~~If you see some error on installation of sqllineage, double-check if you have node installed.~~

Then, you can visualize your SQL file as:

$ sqllineage -g -f output.sql
 * SQLLineage Running on http://localhost:5000/?f=output.sql

Now you can see the visualization of data lineage at both table level and column level.

Example of SQL lineage

Let’s try sqllineage!

3 configs to add recommended articles to your Hugo blog with GitHub Actions

Tue, 25 Jan 2022 19:37:52 -0800

Hugo has a feature to show keyword-based related articles.

Yeah, keyword-based related articles might be useful for people who can manage keywords, categories, etc. consistently. I’d love to add content-based recommendations that don’t require me to write explicit keywords myself. Then, I found an open source tool named “Prelims”, which is developed by takuti.

Prelims is a post-processing tool for the front matter of Hugo/Jekyll, that is, the metadata of an article. The recommendation method implemented for now is a classical one: create TF-IDF based word vectors and find similar articles by cosine similarity.
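To show what “TF-IDF vectors plus cosine similarity” means in practice, here is a small standard-library sketch. It is not Prelims’ implementation: the whitespace tokenizer, the plain tf × log(N/df) weighting, and all function names are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using a basic TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, index):
    """Index of the document most similar to docs[index], excluding itself."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine_similarity(vectors[index], v), i)
        for i, v in enumerate(vectors)
        if i != index
    ]
    return max(scores)[1]
```

For Japanese articles, the whitespace split would have to be replaced by a real tokenizer, since words are not separated by spaces.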

The reason why I love Prelims is that it’s simple and flexible. Post-processing the front matter doesn’t break your articles or blog system at all. You can remove the extra metadata Prelims generated whenever you want.

Practical, isn’t it?

One downside of Prelims is that it requires you to write Python code for tokenization and TF-IDF vectorization. I don’t want to bring my laptop for blog writing; I want to use Netlify CMS and an iPad without a Python environment.

So, I built a CLI tool for Prelims, named prelims-cli, which lets you add recommended articles by writing just one YAML configuration file. It also runs on GitHub Actions.

The three things you need to prepare are:

Configuration YAML file for prelims-cli. e.g., scripts/config/myconfig.yaml
Hugo HTML partial layout, e.g., layouts/partials/page_related.html
GitHub Actions workflow for prelims-cli

Here is an example gist of what you need to write.

where content/blog is the directory for English articles and content/post is the directory for Japanese articles.

Adding these three files lets you show recommended articles on your Hugo blog, like the screenshot at the top of this article.

Internally, it uses SudachiPy for Japanese tokenization. Since the keywords Prelims generates are a bit noisy and I didn’t want to clean them up, I stopped using them.

The good things, I feel, are that I can use my blog articles for my hobby recommendation project, and that I don’t need to manage tags and categories seriously.

You can enjoy recommendations without a Python environment, so you can write your articles on an iPad with Netlify CMS!

py> operator development guide for Python users

Thu, 05 Mar 2020 14:15:52 +0000

Japanese version is here

How to build & test custom scripts on local env before pushing

General strategy:

Give each Python task a reasonable granularity so it can run on the local env

Since Treasure Workflow doesn’t have intermediate storage between tasks, a task can become huge. Considering container launch time, it would be better to create a single huge task, but that makes debugging difficult. Start by creating reasonably sized functions that are easy to debug. Then, you can create a function that calls those minimal functions at once.
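The strategy above might look like the following sketch. The function names and the hard-coded data are made up for illustration; the point is only that each step can be called and debugged on its own, while the py> task points at the single entry point.

```python
def load_rows():
    """Small step: fetch input data (hard-coded here for illustration)."""
    return [{"user": "a", "amount": "10"}, {"user": "b", "amount": "5"}]


def transform(rows):
    """Small step: convert types so later steps can do arithmetic."""
    return [{"user": r["user"], "amount": int(r["amount"])} for r in rows]


def summarize(rows):
    """Small step: aggregate the cleaned rows."""
    return sum(r["amount"] for r in rows)


def run_all():
    """Entry point for the py> task: chains the small steps in one call."""
    return summarize(transform(load_rows()))


if __name__ == "__main__":
    print(run_all())
```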

There are a few options for developing a py> operator task in the local environment.

Use TD docker image
Create a Python virtual environment on local env

1. Use TD docker image

To develop a single py> operator task, you can use the official Docker image to run Python tasks locally. As in an ordinary Python script, you can add the main guard like:

if __name__ == "__main__":
    your_function("default_argument")

As of Mar. 5, 2020, our latest official images are as follows:

digdag/digdag-python:3.7 https://hub.docker.com/r/digdag/digdag-python
digdag/digdag-anaconda3:2019.03 https://hub.docker.com/r/digdag/digdag-anaconda3

If you want to run a debugger against a Docker container, we recommend using PyCharm to run a remote debugger. See also the PyCharm documentation.

2. Create a Python virtual environment on local env

Python provides venv to create virtual environments, so you can create the same environment by using pip.

Download requirements.txt and constraints.txt from the gist, and you can install the same dependencies as digdag-python:3.7 with:

$ python -m venv .venv
$ source .venv/bin/activate
(.venv)$ pip install -r requirements.txt -c constraints.txt

Using this virtual environment, you can develop by using the same packages on the local environment.

Note that this approach can’t account for OS differences: the production environment runs on Debian, but the development environment might be Windows or macOS. This causes errors when executing OS-dependent commands like apt-get.

If you want to create the same environment as the anaconda image, you can download environment.yml from the gist, and run:

conda env update -n base -f environment.yml

Now you have the same Python packages as digdag/digdag-anaconda3:2019.03.

Note that this command will overwrite the existing conda environment. We highly recommend changing name in environment.yml from base to your own environment name, like my-env, and running:

conda env create -f environment.yml

Test a workflow including Python

If you want to run an entire workflow on the local environment, ~~you can use digdag v0_10 branch~~.

As of Mar 5, 2020, Treasure Data uses digdag v0_10 branch, but it may change in the near future.

As of Feb 14, 2021, Treasure Data moved to v0_11 branch. You may use the latest release branch. https://github.com/treasure-data/digdag/pull/1502 https://github.com/treasure-data/digdag/pull/1504

Passing Parameters to py> operator

There are three ways to pass parameters into py> operator:

ordinary digdag argument
environment variable
digdag variable

1. digdag argument

Assuming we have a Python script named py_scripts/examples.py like:

def print_arg(msg):
    print(f"Message is {msg}")

Passing msg argument from simple_with_arg task can be like:

+simple_with_arg:
  py>: py_scripts.examples.print_arg
  msg: "Hello World"
  docker:
    image: "digdag/digdag-python:3.7"

If you want to pass multiple arguments, you can add arguments in your function, then add them into digdag arguments as well.

Note that digdag passes its arguments into Python seamlessly, so you might receive unintended variables if your function accepts keyword arguments via **kwargs.

For example, in this case, the docker variable can be passed as a dictionary {“image”: “digdag/digdag-python:3.7”}. We recommend declaring explicit arguments on a Python function.
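To make the **kwargs pitfall concrete, here is a small simulation. call_like_runner is a hypothetical, simplified stand-in written for this example, not digdag’s actual runner; it assumes that a function accepting **kwargs receives every parameter while other functions receive only the parameters they name.

```python
import inspect


def call_like_runner(func, params):
    """Hypothetical stand-in for the py> runner: pass params by keyword.

    Functions that accept **kwargs receive every parameter; other functions
    receive only the parameters they name.
    """
    signature = inspect.signature(func)
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in signature.parameters.values()
    )
    if accepts_kwargs:
        return func(**params)
    return func(**{k: v for k, v in params.items() if k in signature.parameters})


def explicit(msg):
    return {"msg": msg}


def catch_all(msg, **kwargs):
    return {"msg": msg, **kwargs}


# Parameters as they might appear for the simple_with_arg task above.
params = {"msg": "Hello World", "docker": {"image": "digdag/digdag-python:3.7"}}

print(call_like_runner(explicit, params))   # only msg arrives
print(call_like_runner(catch_all, params))  # docker arrives too
```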

Note that there might be unintended conflicts between digdag and py> operator. Assuming you set some digdag variables like the following:

_export:
  td:
    database: my_db

+simple_with_arg2:
  py>: py_scripts.examples.print_arg_td
  msg: "Hello World"
  docker:
    image: "digdag/digdag-python:3.7"

with a Python function print_arg_td that takes a td argument like the following:

def print_arg_td(msg, td=None):
    print(f"'msg' is {msg} and 'td' is {td}")

In this case, the td variable can never be None, since the exported td variable, i.e., {“database”: “my_db”}, is always passed. This may cause type mismatches, such as a dictionary where a string is expected. We recommend avoiding argument names reserved by digdag, such as the td variables:

td.endpoint
td.apikey
td.use_ssl
td.proxy.enabled
td.proxy.host
td.proxy.port
td.proxy.password
td.proxy.user

Note that these variables might change in the future. There are also built-in digdag variables. See digdag built-in variables at http://docs.digdag.io/workflow_definition.html#using-variables

Also, digdag might convert values to unintended types, e.g., a string into an integer, so we recommend validating or explicitly converting types in the Python function.
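A minimal sketch of that explicit conversion is shown below. The function and parameter names are made up for illustration; the idea is simply to normalize each argument at the top of the function instead of trusting the incoming type.

```python
def train(n_epochs, learning_rate):
    """Convert parameters explicitly instead of trusting the incoming types."""
    n_epochs = int(n_epochs)
    learning_rate = float(learning_rate)
    return n_epochs, learning_rate


def as_text(value):
    """Keep identifiers as strings even if they arrive as numbers."""
    return str(value)
```

With this, train("10", "0.1") and train(10, 0.1) behave the same way, and an identifier such as a date string stays a string after as_text.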

2. environment variable

Environment variables can be another option to pass parameters to py> operator. An environment variable is reasonable for passing secure information or secrets.

For example, if we have a task simple_with_env:

+simple_with_env:
  py>: py_scripts.examples.print_env
  _env:
    MY_ENV_VAR: "hello"
  docker:
    image: "digdag/digdag-python:3.7"

This MY_ENV_VAR can be accessed by using os.environ like:

import os

def print_env():
    print(f'Env var is {os.environ["MY_ENV_VAR"]}')

Using an environment variable is especially important when you need to use secret information, e.g., a Treasure Data API key or an AWS secret key.

digdag has a feature to store secrets. Secrets are stored in the digdag (or Treasure Workflow) database when you execute the td workflow secrets subcommand.

Assume you’ve set a secret named td.apikey. This secret can be passed to py> operator like:

+simple_with_env2:
  py>: py_scripts.examples.access_td
  _env:
    TD_API_KEY: ${secret:td.apikey}
  docker:
    image: "digdag/digdag-python:3.7"

and read in py_scripts/examples.py like:

import os

def access_td():
    apikey = os.environ["TD_API_KEY"]
    # Do awesome execution
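When a script depends on a secret passed through the environment, it can help to fail early with a clear message if the variable is missing, for example during a local run. The sketch below is one way to do that; require_env and access_td_checked are hypothetical helper names, not part of digdag.

```python
import os


def require_env(name):
    """Return the environment variable or raise a clear error if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def access_td_checked():
    apikey = require_env("TD_API_KEY")
    # Never print the key itself; report only that it was found.
    return f"API key loaded ({len(apikey)} characters)"
```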

If you try to pass secrets as ordinary digdag arguments, they will never be fetched from the secrets DB. For example, if you have a task like the following:

+simple_with_env_ng:
  py>: py_scripts.examples.access_td_ng
  apikey: ${secret:td.apikey}
  docker:
    image: "digdag/digdag-python:3.7"

with a script like the following:

def access_td_ng(apikey):
    print(apikey)
    # Always shows "${secret:td.apikey}" instead of actual API key like "1234/XXXX"

3. digdag variable

If you want to read a digdag variable in a Python script, you can use digdag.env.params as follows:

def read_workflow_env(msg):
    import digdag
    print(digdag.env.params["my_msg"])

Note that import digdag works only when the script runs as a digdag py> operator task. If you want to avoid an import error, you should use try-except like:

try:
    import digdag
    digdag.env.store({"feature_query": feature_query})
except ImportError:
    pass
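The same pattern can be wrapped in a small helper so that a script runs both inside a py> task and in a plain local Python session. This is a sketch; store_params is a hypothetical name, and printing the values is just one possible local fallback.

```python
def store_params(params):
    """Store params via digdag when available; return whether it happened."""
    try:
        import digdag
    except ImportError:
        # Running outside a py> task (e.g. local debugging): just show the values.
        print(f"digdag not available, would store: {params}")
        return False
    digdag.env.store(params)
    return True
```

Keeping only the import inside the try block means that an error raised by digdag.env.store itself is not silently swallowed.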

Directory structures

I recommend having the following directory structure.

my_project
├── README.md
├── config
│ ├── params.test.yml <- Configuration file for run through test. Mirror params.yml except for `td.database`
│ └── params.yml <- Configuration file for production
├── awesome_workflow.dig <- Main workflow to be executed
├── ingest.dig <- Data ingestion workflow
├── py_scripts <- Python scripts directory
│ ├── __init__.py
│ ├── data.py <- Script to upload data to Arm Treasure Data
│ └── my_script.py <- Main script to execute e.g. Data enrichment, ML training
├── queries <- SQL directory
│ └── example.sql
├── run_test.sh <- Test shell script for local run through test
└── test.dig <- Test workflow for local run through test

You can generate this structure from a template by using cookiecutter-digdag.

chezou/cookiecutter-digdag

How to install Python packages / OS packages

For installation of Python packages, you can use os.system or subprocess.run like:

import os, sys
os.system(f"{sys.executable} -m pip install --upgrade pytd==1.4.3")

import subprocess, sys
# arguments should be passed as a list
subprocess.run([sys.executable, "-m", "pip", "install", "--upgrade", "pytd==1.4.3"])

Make sure you pin the version number of the Python package.
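One way to make version pinning hard to forget is to build the pip command from a package name and a required version. The helper names below are assumptions for this sketch; check=True makes subprocess.run raise CalledProcessError when pip fails, so a broken install stops the task instead of failing later at import time.

```python
import subprocess
import sys


def pip_install_command(package, version):
    """Build the argument list for installing one exactly pinned package."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", f"{package}=={version}"]


def pip_install(package, version):
    """Run pip and raise CalledProcessError if the installation fails."""
    subprocess.run(pip_install_command(package, version), check=True)
```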

To install OS packages, you can execute like the following:

import os
os.system("apt-get update") # Need to run before doing apt-get install
os.system("apt-get install -y wkhtmltopdf")

How to read/write tiny variables between digdag tasks

To read a digdag variable, you can use digdag.env.params as mentioned above.

To pass variables to another task, you can use digdag.env.store.

def store_workflow_env(msg):
    import digdag
    digdag.env.store({"my_msg": msg})

This example code sets the my_msg variable, which can be used in subsequent tasks like:

+store_msg:
  py>: py_scripts.examples.store_workflow_env
  msg: "Hello World"
  docker:
    image: "digdag/digdag-python:3.7"

+restore_msg:
  echo>: ${my_msg}

Error notification with Python stack trace

digdag has the _error: syntax to send a notification when an error occurs. You can access the ${error.message} digdag variable to send the notification to Slack or email.

Assume we have the following workflow:

+simple_raise_error:
  py>: py_scripts.examples.error_sample
  docker:
    image: "digdag/digdag-python:3.7"

_error:
  echo>: ${error.message}

with this Python script:

def error_sample():
    int("a1234") # raises ValueError

This script always raises ValueError, and the workflow log shows the Python stack trace as follows:

2019-12-24 23:06:32 +0900 [INFO] (0039@[0:python]+simple^error): echo>: Python command failed with code 1: invalid literal for int() with base 10: 'a1234' (ValueError)
 from Traceback (most recent call last):
 from File ".digdag/tmp/digdag-py-2-1815457087076518360/runner.py", line 165, in <module>
 result = callable_type(**args)
 from File "/private/var/folders/y9/bnjb3krn39s22rmg_wvlnf7m0000gp/T/digdag-tempdir2111531196420040503/workspace/1_simple_1_2_2945225080250994454/py_scripts/examples.py", line 5, in print_arg
 int("a1234")
 from ValueError: invalid literal for int() with base 10: 'a1234' (runtime)

In this example, we use echo> operator to show the error message, but you can use mail> operator for sending Email or http> operator to send a Slack message.

How to release Python package from GitHub Actions

Tue, 26 Nov 2019 01:42:11 +0900

Photo by Hitesh Choudhary on Unsplash

Recently, I moved my CI from Travis to GitHub Actions. GitHub Actions is handy for testing and publishing Python packages.

Testing Python code on GitHub Actions

Migrating from Travis is super easy; you just write a simple workflow like:

https://github.com/chezou/tabula-py/blob/master/.github/workflows/pythontest.yml

The benefits of GitHub Actions for Python are:

We can use a build matrix (e.g., OS and Python versions) as in Travis
GitHub Actions launches faster than Travis
Installing additional dependencies is easy with the uses syntax, which reuses another action

For example, installing JDK can be written as:

- uses: actions/setup-java@v1
  with:
    java-version: '12'
    java-package: jdk
    architecture: x64

The downsides of GitHub Actions are:

Unable to write to the Windows temp directory
Hard to find debugging resources on the internet, and unable to ssh into the instance

Releasing Python package from GitHub Actions to PyPI

I set up the workflow with the following sequence:

Push a tag from local, or create a tag on GitHub. Using setuptools-scm enables you to derive a new version from the Git tag
GitHub Actions creates a GitHub release from the tag
GitHub Actions publishes the wheel to PyPI by using a PyPI API token

You can see the actual workflow on GitHub:

https://github.com/chezou/tabula-py/blob/master/.github/workflows/pythonpublish.yml

The key points are:

1. Triggering the workflow from a Git tag

on:
  push:
    tags:
      - 'v*'

2. Adding dependency for deploy task

deploy:
  needs: release

The needs syntax lets you declare job dependencies. In this case, I define a release job that creates the GitHub release, and then the deploy job publishes the package to PyPI.

3. Preparing secrets for PyPI

PyPI recently started providing API tokens for publishing packages, so you can get an API token scoped to a specific project. See the official documentation for details, since the feature is in beta and the spec might change.

https://pypi.org/help/#apitoken

After getting an API token from PyPI, you can set secrets on GitHub by clicking “Settings” -> “Secrets” on the project page. Using my example workflow, you should set PYPI_USERS to __token__, and PYPI_PASSWORD to the token starting with pypi- that you obtained from the PyPI settings.

Now, you can publish a Python package to PyPI just by tagging on GitHub.

How to test a new Docker image for digdag workflow on CircleCI?

Sun, 06 Oct 2019 05:17:30 +0900

Photo by Campaign Creators on Unsplash

Testing whether a workflow can run is important when we build a complex workflow. digdag is a workflow engine with a simple syntax that can run tasks with SQL, Python, Ruby, shell scripts, etc. digdag has a Docker executor, and it works like a charm with the py>, rb>, and sh> operators.

How can we ensure that a new Docker image runs with an existing digdag workflow? I’ll show a way to run through it on CircleCI.

You can see the example repo on GitHub:

chezou/digdag_circleci

An issue with digdag Docker executor on CircleCI

Although the CircleCI docker executor is the primary choice for CircleCI 2.0 and easily runs arbitrary Docker images, it doesn’t support volume mounts for docker since it launches a remote sibling docker container. Because the digdag Docker executor assumes it can mount a volume, like -v /tmp:/tmp, you need a workaround.

FileNotFoundError occurs in python operator in docker · Issue #649 · treasure-data/digdag

In this article, I’ll show you how to execute local mode digdag, a.k.a. digdag run, on CircleCI with the digdag Docker executor.

Use CircleCI machine executor

tl;dr, use CircleCI machine executor, which runs VM on CircleCI.

version: 2
jobs:
  test:
    working_directory: ~/app
    machine:
      image: ubuntu-1604:201903-01
      docker_layer_caching: true
    steps:
      - checkout
      - run:
          name: Install digdag
          command: |
            curl -o ~/bin/digdag --create-dirs -L "https://dl.digdag.io/digdag-latest"
            chmod +x ~/bin/digdag
            echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
      - run:
          name: Test digdag run
          command: |
            set -x
            digdag run test.dig

Machine executor has Python, Ruby, Java, and Docker CE by default, so you can easily run digdag on CircleCI.

Here are the dig file and Python script.

# test.dig
+task:
  py>: test.show
  docker:
    image: "python:3.7-slim-buster"

Python script:

# test.py
def show():
    print("Hello CircleCI")

Build custom Docker image and test with digdag

In some cases, you want to test whether a new Docker image works appropriately with existing workflow.

If you build a new Docker image for digdag Docker executor and test with existing workflow, you can write like the following:

version: 2jobs: build_and_test: working_directory: ~/app machine: image: ubuntu-1604:201903-01 docker_layer_caching: true steps: - checkout - run: name: Install digdag command: | curl -o ~/bin/digdag --create-dirs -L "https://dl.digdag.io/digdag-latest" chmod +x ~/bin/digdag echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc - run: name: Build application Docker image command: | docker build -f ./Dockerfile -t chezou/my-image:latest . - run: name: Test treasure-boxes workflows command: | set -x digdag run test_custom.dig

Building a Docker image on CircleCI, you can use it form digdag run command with the following workflow and Dockerfile.

# test_custom.dig+task: py>: test.show docker: image: "chezou/my-image:latest"

# Dockerfile
FROM python:3.7-slim-buster

RUN pip install tabula-py

CMD ["python3"]

Conclusion

Using CircleCI’s machine executor enables us to use digdag run with the digdag Docker executor.
It lets us run an end-to-end test of a new Docker image with an existing workflow on CircleCI.

You can try it with this GitHub repo:

https://github.com/chezou/digdag_circleci

The first conference of Operational Machine Learning: OpML ‘19

Tue, 04 Jun 2019 13:50:07 +0900

I attended OpML ’19, a conference on “Operational Machine Learning” held in Santa Clara on May 20th.

OpML ‘19
_The 2019 USENIX Conference on Operational Machine Learning (OpML ‘19) will take place on Monday, May 20, 2019, at the…_www.usenix.org

The scope of this conference is broad and does not seem to be clearly defined yet, even after attending it. I’ll borrow the description from the OpML website.

The 2019 USENIX Conference on Operational Machine Learning (OpML ’19) provides a forum for both researchers and industry practitioners to develop and bring impactful research advances and cutting edge solutions to the pervasive challenges of ML production lifecycle management. ML production lifecycle is a necessity for wide-scale adoption and deployment of machine learning and deep learning across industries and for businesses to benefit from the core ML algorithms and research advances.

Overview of the conference

There were 210 attendees, who came from LinkedIn, Microsoft, Google, Airbnb, Facebook, etc.
The target of “Operational Machine Learning” is diverse. I thought it would focus on MLOps topics such as reproducibility, ML DSLs for productionization, visualization, and stakeholder management, but there were also many talks about ML for systems, system utilization optimization, SRE for ML, hardware accelerators, etc.
There is a contrast between tech giants, e.g. Google, Uber, Facebook, Airbnb, Microsoft, and LinkedIn, and the companies following them. While the leading ML companies talked about their OSS or ML infrastructure, the followers tended to talk about their specific use cases or solutions (those speakers seemed to be from small ML ventures).

Some interesting talks

Keynote: Ray: A Distributed Framework for Emerging AI Applications

https://www.usenix.org/conference/opml19/presentation/jordan
The current target of machine learning is pattern recognition, but Jordan said decision-making will be the future of ML/AI
Creating a “recommendation market” is the key

MLOp Lifecycle Scheme for Vision-based Inspection Process in Manufacturing

https://www.usenix.org/conference/opml19/presentation/lim
https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_lim.pdf
A challenge of defect recognition from images at the edge, applied to Samsung smartphones.
They need to run inference on 3,000 GB of images per day.
The team structure, which involves product inspectors and product managers, is interesting

From https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_lim.pdf

AIOps: Challenges and Experiences in Azure

https://www.usenix.org/conference/opml19/presentation/li-ze
Anomaly detection and diagnosis with lambda architecture for Azure
Disk failure prediction for Azure, which enables proactive live migration of workloads to a healthy disk

How the Experts Do It: Production ML at Scale

A panel discussion for ML infrastructures

Lead and moderator: Joel Young, LinkedIn

Panelists:

Sandhya Ramu, Director, AI SRE, LinkedIn
Andrew Hoh, Product Manager, ML Infra and Applied ML, AirBNB
Aditya Kalro, Engineering Manager, AI Infra Services and Platform, Facebook
Faisal Siddiqi, Engineering Manager, Personalization Infrastructure, Netflix
Pranav Khaitan, Engineering Manager, Personalization and Dialog ML Infra, Google

The important things to keep in mind at the top level are

the lead time from experiment to production
Flows built for production involve different teams
Not everything is the highest priority. Metrics and dashboards are important

Cost of run/train vs Agility

It’s hard to find downstream use cases. (Airbnb)
Monitor model resource usage
Keep ML infrastructure extremely flexible
Hard to force using a single framework

What are the important things for your ML platform?

Facebook

Reliability
Scalability
Developer productivity

Agility (Available libraries etc)
Enabling the latest technology
Cost and impact of Machine Learning

Netflix

How quickly and how many A/B tests we can run
How quickly new researchers can get started

Airbnb

Business impact
# of users for the infrastructures
How many inferences/scoring is done?
Availability, scalability, cost, and long-term decision making

Google

Innovation aspect
How will the ML infrastructure empower products over the next 5 years?

Continuous Training for Production ML in the TensorFlow Extended (TFX) Platform

https://www.usenix.org/conference/opml19/presentation/baylor
TFX provides a library for recording and retrieving metadata for ML: ML Metadata https://www.tensorflow.org/tfx/guide/mlmd

From https://www.usenix.org/system/files/opml19papers-baylor.pdf

Disdat: Bundle Data Management for Machine Learning Pipelines

https://www.usenix.org/conference/opml19/presentation/yocum
https://github.com/kyocum/disdat
Talk about OSS for ML pipeline and data versioning.

Predictive Caching@Scale

https://www.usenix.org/conference/opml19/presentation/janardhan
Traffic prediction for CDN (Akamai)
An interesting cache strategy that covers prediction errors

Ruby for Data Science and Machine Learning

Wed, 24 Apr 2019 12:10:28 +0900

I attended RubyKaigi 2019, held in Fukuoka from Apr 18 to Apr 21. This year’s RubyKaigi was a really great opportunity for me to learn about the possibilities of Data Science and Machine Learning in Ruby.

Data Science and Ruby

As many of you may know, Ruby is widely known for web applications built with frameworks such as Ruby on Rails, but there is also momentum for data science in Ruby and other non-Python languages. Here is the list of the sessions about Data Science.

Ruby for NLP
A Deep Learning Adventure [repo] (talked by Paolo Perrotta, the author of Metaprogramming Ruby!)
Ruby Data Workshop
Reducing ActiveRecord memory consumption using Apache Arrow
Red Chainer and Cumo: Practical Deep Learning in Ruby
Make Ruby Differentiable

Center of data science with Ruby

There are three core pieces of software supporting these movements:

Apache Arrow
Numo/Cumo
Red Chainer (Deep Learning framework ported from Chainer, implemented in Python)

Apache Arrow is a cross-language data structure for in-memory data. Kohei Sutou, the creator of Red Arrow, the Ruby binding of Apache Arrow, is a Japanese PMC member of Apache Arrow. He has also been organizing an initiative called Red Data tools, with monthly developer meet-ups for Ruby data tools. The meetups drive the Ruby data ecosystem, especially for beginners. I heard from mrkn, a Ruby committer, that Arrow is trying to implement the data manipulations that pandas does in C++ code. That means calculations on tabular data, a.k.a. DataFrame, can be done in Apache Arrow’s Table format, so Ruby could become suitable for data manipulation.

Another essential piece is Numo, which handles numeric arrays like NumPy and is the fundamental part of DS/ML execution. Cumo is the GPU version of Numo and is 75 times faster than Numo on the hello world of Deep Learning, a.k.a. MNIST. The talk about Cumo suggested that many Deep Learning related executions depend on CUDA, so scripting languages can be just a wrapper around them.

Red Chainer enables Deep Learning tasks, but it still seems young. Instead, Menoh-Ruby can be a great tool: it allows inference/prediction with models pre-trained in PyTorch, Chainer, or any other framework that can export ONNX, the intermediate format for DL.

So, where is Ruby data science heading?

Looking at the momentum of Apache Arrow and Cumo, I feel data science in Ruby will become much easier, since the core problems related to execution speed can be hidden in the C++/GPU layer. And Menoh-Ruby can be a good opportunity for Ruby on Rails applications to serve prediction results from Ruby!

Red Data tools also creates opportunities for many software engineers to jump into the ML/DS world. One of my friends told me that he started working on Red Data tools because he wanted to change his field, and it’s an excellent area to join.

If you are interested in this movement, let’s join Red Data tools!

A recent update of tabula-py

Mon, 18 Feb 2019 01:26:00 +0900

Photo by Joshua Rawson-Harris on Unsplash

This article is a repost of a Patreon article published last December. I’m planning to release the next version of tabula-py within a few weeks.

(Note: Oct 7th, 2019) As of Oct. 2019, I launched a documentation site and Google Colab notebook for tabula-py. The FAQ would be a good place to learn how to extract accurately.

This is my first post on Patreon. Apologies for the delayed announcement of the recent updates to tabula-py. I will introduce the key features of the updates.

Use Tabula app template

The Tabula app has a template export feature for reusing the same bounding boxes for extraction. tabula-py can now load a Tabula app template and extract with it.

dfs = tabula.read_pdf_with_template(
 './examples/data.pdf',
 './examples/data.tabula-template.json',
 pandas_options={'header': 0})

Support file-like object

Like many Python libraries, tabula-py can now extract from a file-like object.

# With file-like object
pdf_path = 'tests/resources/data.pdf'
with open(pdf_path, 'rb') as f:
    df = tabula.read_pdf(f)

# With pathlib 
from pathlib import Path
pdf_path = 'tests/resources/data.pdf'
df = tabula.read_pdf(Path(pdf_path))

Allow multiple area option

As of tabula-java v1.0.2, tabula can handle multiple area options.

pdf_path = 'tests/resources/MultiColumn.pdf'
# Relative area 
df_relative = tabula.read_pdf(
 pdf_path, pages=1,
 area=[[0, 0, 100, 50], [0, 50, 100, 100]], relative_area=True)

# Absolute area
df_absolute = tabula.read_pdf(
    pdf_path, pages=1, area=[[0, 0, 451, 212], [0, 212, 451, 425]])
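To see how the relative and absolute forms relate, here is a minimal sketch. It assumes each area is ordered as [top, left, bottom, right] and that relative values are percentages of the page height and width. The helper name and the page size are made up for illustration and are not part of tabula-py.

```python
def relative_to_absolute(area, page_width, page_height):
    """Convert a percentage-based [top, left, bottom, right] area into points.

    Hypothetical helper for illustration only; not a tabula-py function.
    """
    top, left, bottom, right = area
    return [
        top * page_height / 100,     # top and bottom scale with the page height
        left * page_width / 100,     # left and right scale with the page width
        bottom * page_height / 100,
        right * page_width / 100,
    ]


# Assumed page size in points, chosen only to keep the numbers readable.
page_width, page_height = 400, 600

left_half = relative_to_absolute([0, 0, 100, 50], page_width, page_height)
right_half = relative_to_absolute([0, 50, 100, 100], page_width, page_height)
print(left_half)
print(right_half)
```

The two relative areas split the page into a left and a right column, which is the same idea as the two absolute boxes in the example above.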

Tip: Get table position

This is not a new feature, but I think it might be helpful for some PDFs.
Detailed post: https://github.com/chezou/tabula-py/issues/102

The JSON output of read_pdf contains position info, so you can get the table position as follows:

In [5]: tables = read_pdf("./examples/data.pdf", output_format="json", pages=2)
In [9]: top = tables[0]['top']
In [10]: left = tables[0]['left']
In [11]: bottom = tables[0]['height'] + top
In [12]: right = tables[0]['width'] + left
In [13]: top, bottom, left, right
Out[13]: (0.0, 528.8800048828125, 0.0, 564.8800048828125)
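The same arithmetic can be wrapped in a small helper so that the result can be reused as an area for a later extraction. This is only a sketch: the function name is mine, the sample values are made up, and it assumes each table entry in the JSON output has the top, left, width, and height keys used in the session above.

```python
def table_area(table):
    """Return [top, left, bottom, right] for one table entry of the JSON output.

    Hypothetical helper; assumes the entry carries top/left/width/height keys.
    """
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    return [top, left, bottom, right]


# Made-up entry with the same keys as the JSON output shown above.
sample = {"top": 10.0, "left": 20.0, "width": 300.0, "height": 150.0}
area = table_area(sample)
print(area)
```

A list in this order could then be tried as the area option of read_pdf for PDFs where automatic detection picks up too much or too little.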

If you have any question, ask on Stack Overflow!

Use Markdown document on brand new PyPI

Tue, 17 Apr 2018 13:21:33 +0900

Yesterday, PyPI was relaunched as a next-generation site. It is modern and stylish.

@aodag told me that PEP 566, which was accepted in Feb. 2018, allows documents on PyPI to use not only reStructuredText but also other formats such as Markdown.

So I enabled my Markdown document on brand-new PyPI.

Upgrade Python packages (if necessary)

We can use Markdown with setuptools as of v38.6.0. Let’s upgrade your Python packages if needed. Without that, the Markdown description will not be rendered appropriately.

$ python -m pip install --upgrade pip
$ pip install --upgrade wheel
$ pip --version
pip 10.0.0 from c:\users\chezo\documents\source\tabula-py\venv\lib\site-packages\pip (python 3.6)
$ pip list
Package Version Location
---------- ------- --------
(…snip…)
setuptools 38.1.0
(…snip…)
wheel 0.31.0

Modify setup.py

If you’ve already used README.md as a long description on PyPI, all you have to do is to add long_description_content_type to setup.py as follows:

long_description=open('README.md').read(),
long_description_content_type="text/markdown",
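If you prefer to read the README with an explicit encoding, the two arguments can be built by a small helper like the sketch below. The helper name is hypothetical and the README here is a throwaway file created just for the demo. In a real setup.py you would pass the resulting dictionary to setup() as keyword arguments.

```python
import io
import os
import tempfile


def long_description_kwargs(readme_path):
    """Build the two setup() keyword arguments for a Markdown README.

    Hypothetical helper for illustration; reads the file as UTF-8.
    """
    with io.open(readme_path, encoding="utf-8") as f:
        text = f.read()
    return {
        "long_description": text,
        "long_description_content_type": "text/markdown",
    }


# Demo with a temporary README.md so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    readme = os.path.join(tmp, "README.md")
    with io.open(readme, "w", encoding="utf-8") as f:
        f.write("# demo\n")
    kwargs = long_description_kwargs(readme)

print(kwargs["long_description_content_type"])
```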

You can see the full description in the PR:

Handle markdown long description for Pypi by chezou · Pull Request #85 · chezou/tabula-py
_Thanks for PEP 566, as of setuptools v38.6.0, PyPI can render long description written in markdown. This PR allows…_github.com

Build a wheel and upload with twine

Now, you can build a wheel and upload with twine.

$ python setup.py bdist_wheel
$ twine upload dist/*

The Markdown document was rendered!

CAVEAT: I didn’t update the package on PyPI because bumping the version just for rendering Markdown is too much. I tested on test.pypi.org.

References

Python basics: package management

Wed, 30 Aug 2017 11:31:15 +0900

Python is a very popular programming language for machine learning. In this article, I will introduce the basic Python environment.

Glossary

I will introduce basic terms about Python package management.

pip: A tool for package installation. It retrieves Python packages from PyPI. pip is the equivalent of Ruby’s gem command.
virtualenv: A package isolation tool for Python. It has a similar function to Ruby’s bundler, but it can also switch between Python versions such as 2.x and 3.x.
venv: The official tool for package isolation, introduced in Python 3.3. But if you want to use Python 2.x or you are a Debian/Ubuntu user, I recommend using virtualenv.

venv is invoked with a command like python3.5 -m venv some-awesome-env, so it can’t switch between Python 2 and 3. The venv installed by Debian/Ubuntu pulls in dependencies that are useless on other OSs, so as an Ubuntu user I don’t use venv.

These are common tool sets for many Pythonistas. They are recommended tools of PyPA, a working group that maintains many of the relevant projects in Python packaging.

There is one more tool for a specific purpose.

conda: conda is a package management tool for scientific computation developed by Anaconda, Inc. It can manage not only Python but also R. The PyData community loves conda.

I use conda for my work, but I recommend that you learn the pros and cons of conda and virtualenv/venv and choose the right tool for your purpose.

Installation of Python

Since it is 2017, Python beginners should use the latest version of Python 3. However, there are some cases to use Python 2.x for some painful reasons.

If you need to install Python 2 and 3, you can install multiple Python with package management tools like apt or yum. In Ubuntu, you can install Python 2.7 with apt install python-dev, and you can install Python 3.6 via apt install python3-dev.

After installation, you can see the Pythons under /usr/bin:

/usr/bin/python     #<- 2.7
/usr/bin/python2    #<- 2.7
/usr/bin/python2.7  #<- 2.7
/usr/bin/python3    #<- 3.6
/usr/bin/python3.6  #<- 3.6

If you’re a macOS user, you can install both Python 2 and 3 via brew install or port install.

Windows users can install Python 2 and 3 using the official installer or Chocolatey. From Python 3.6 for Windows, there is a py command that switches Python versions.

Caution: Never try to keep using the system Python. The system Python is often old, and system-critical tools such as yum depend on it. If you run sudo pip install carelessly, there is a risk of destroying the environment of the OS itself.

Package management

As I mentioned, you should not run sudo pip install awesome-package. Many important system tools depend on the system Python, so don’t use sudo pip.
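Before installing anything, it can help to check whether the current interpreter is running inside a virtual environment at all. The following is a minimal sketch of one common check. The function name is mine, and the check relies on sys.base_prefix (set by venv and recent virtualenv) and sys.real_prefix (set by older virtualenv releases).

```python
import sys


def in_virtualenv():
    """Best-effort check for an active virtualenv/venv (hypothetical helper)."""
    # venv and recent virtualenv make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases set sys.real_prefix instead.
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


print("virtual environment active:", in_virtualenv())
```

If this prints False, pip install would target the interpreter’s own site-packages, which is exactly the situation to avoid with the system Python.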

If you’re a venv user, this tutorial will help you.
https://docs.python.jp/3/tutorial/venv.html

For virtualenv users, I will write a tutorial of virtualenv. It is a translation of the document written by aodag.
https://gist.github.com/aodag/bea141d255e22d204a2140fba658ebf2

Why should we use virtualenv/venv?

virtualenv avoids:
- Conflicting Python packages with system Python
- Conflicting packages between projects
- Losing sight of which project depends on those packages

Install virtualenv

First, you can install virtualenv under user home directory.

$ wget https://bootstrap.pypa.io/get-pip.py
$ export PATH="$HOME/.local/bin:$PATH"
$ python get-pip.py --user
$ pip install virtualenv --user
# Windows users can install just via `pip install`
> pip install virtualenv

With the --user option, you can install packages under your user directory.

virtualenv can create a Python virtual environment. Creating the environment under the project root is common.

Run virtualenv as follows:

$ virtualenv venv -p python3.6

Then you get a virtual environment.
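For comparison, the standard library’s venv module can create a similar environment from Python code. This is only a sketch of the venv route, not of virtualenv itself. It builds the environment in a temporary directory standing in for a project root, and skips installing pip to keep it quick.

```python
import os
import tempfile
import venv

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as project_root:
    env_dir = os.path.join(project_root, "venv")
    # with_pip=False skips bootstrapping pip so the demo stays fast.
    venv.create(env_dir, with_pip=False)
    # Every venv-created environment has a pyvenv.cfg at its top level.
    created = os.path.isfile(os.path.join(env_dir, "pyvenv.cfg"))

print(created)
```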

Since Python packages will be installed under the venv directory, don’t forget to add venv directory into .gitignore.

$ source venv/bin/activate
(venv) $
# For Windows
> venv\Scripts\activate

Install Python packages via pip

You can install packages via pip. After activating virtualenv/venv, pip will install packages under venv directory.

(venv) $ pip install pyramid

If you want to install a specific version of the package, you can set the version number:

(venv) $ pip install pyramid==1.8.1

Without a version number, pip will install the latest stable version.
https://www.python.org/dev/peps/pep-0440/

You can list installed packages with pip list command.

(venv) $ pip list
numpy (1.13.1)
pandas (0.20.3)
pip (9.0.1)
pkginfo (1.4.1)
pytest (3.2.0)
python-dateutil (2.6.1)
pytz (2017.2)
wheel (0.29.0)

Managing package version

From pip 7.1, we can pin package versions with constraints.txt. Using the pip freeze command, you can list packages with their version numbers.

(venv)$ pip freeze -l
numpy==1.13.1
pandas==0.20.3
pkginfo==1.4.1
pytest==3.2.0
python-dateutil==2.6.1
pytz==2017.2
(venv)$ pip freeze -l > constraints.txt

You should list your required packages in requirements.txt:

(venv)$ cat requirements.txt
pandas
numpy

Then you can install required packages as follows:

(venv)$ pip install -r requirements.txt -c constraints.txt
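To make the split between the two files concrete, here is a small sketch that parses pip freeze style output and reports which required packages have no pin. Both function names are hypothetical, and the parser only understands simple name==version lines, not the full requirements file syntax.

```python
def parse_pins(freeze_output):
    """Parse simple `name==version` lines into a dict (hypothetical helper)."""
    pins = {}
    for line in freeze_output.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def unpinned(requirements, pins):
    """Return the required package names that have no pinned version."""
    return [name for name in requirements if name.lower() not in pins]


# Same content as the constraints.txt generated by pip freeze above.
constraints = """numpy==1.13.1
pandas==0.20.3
pkginfo==1.4.1
pytest==3.2.0
python-dateutil==2.6.1
pytz==2017.2
"""

pins = parse_pins(constraints)
missing = unpinned(["pandas", "numpy"], pins)
print(pins["pandas"], missing)
```

The idea is that requirements.txt states what the project needs, while constraints.txt states which exact versions to use for anything that ends up being installed.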

Leveraging wheelhouse

Modern Python packages are distributed in the wheel format, which is a binary format. There is another format, sdist, which is a source format and requires compiling from source if the package depends on native code. I highly recommend using the wheel format, because installation is faster than with sdist since no compilation is needed, and you can deploy the project easily even in an offline environment that can’t connect to PyPI.

If you put all dependent .whl package files under the wheelhouse directory, you can install them as follows:

$ pip install -r requirements.txt -c constraints.txt -f wheelhouse --no-index

The -w or --wheel-dir option sets the wheel directory. The -f or --find-links option makes pip look in the wheelhouse directory first. The --no-index option prevents pip from connecting to PyPI.

If you want to export all the dependencies into the wheelhouse directory, you can use the pip wheel command.

$ pip wheel -r requirements.txt -c constraints.txt -w wheelhouse

Should I use conda?

Anaconda is a Python distribution for scientific computing such as machine learning. The Anaconda family consists of Anaconda, which includes recommended packages, and Miniconda, which is the minimum environment for conda, where you install only the necessary packages yourself. Anaconda sometimes includes heavy packages. It used to include Django, so check the default packages and use it properly.

Unlike virtualenv, Anaconda creates its own kind of virtual environment. Characteristically, using the --copy option makes it possible to copy system level libraries, .so files, etc. without creating symbolic links. If you archive a virtual environment with zip or tar, you can use it on other machines.

$ conda create -n myenv --copy python=3.6
$ conda activate myenv

In other words, libraries that are usually managed by OS level package management tools such as apt are also managed by conda. Conda has its own package repository, different from PyPI, and binaries for each OS are uploaded to it. Since the same package, such as OpenCV, is registered in the repository by multiple users, you should take care to pick the best one.

Many machine learning books say that you can use conda, but I think it is better not to rely on it much outside Windows.

The reasons are as follows:

In 2017, wheel is the de facto binary package format, so conda’s original purpose, handling scientific packages like NumPy or SciPy, can be served without conda.
conda will replace commands such as openssl/curl/python on macOS / Linux systems (strictly speaking, conda puts its own directory first in PATH) [issue]
Package developers are often not conda users, so they end up being asked for support in an environment that they do not normally use, much like JRuby or Rubinius (or Windows specific trouble).
In the conda world, it is difficult to pass information that should be included in a build of a native extension (such as a Cython dependency)

So I recommend conda for Windows users or for people who do not develop heavily but want to experience machine learning. Or, put Miniconda under pyenv control. I use conda in a Docker environment.

However, we cannot install packages like SciPy on Windows via pip install; you need to download the wheel on your own. Honestly, I think conda is better on this point.

The historical details are described in this article. In short, conda was created because the old binary format, egg, was not good.

Conclusion

I introduced how to install Python and how to manage Python packages. I think we can manage Python packages well via virtualenv/venv without conda, but conda is a good fit when you want to pack an environment together with system libraries.

References

Original Japanese document:

Why OSS based machine learning is good?

Thu, 03 Aug 2017 12:56:59 +0900

This article is a translation of the Japanese version.

After the release of TensorFlow, the movement toward OSS-based machine learning has been accelerating. François Chollet, the creator of Keras, describes the essential point of this change. I think his phrase is enough, but in this article, I would like to organize why open source machine learning is great, and what the recent trends are.

tl;dr

Machine learning and deep learning frameworks have become standard tools for software engineers
Since arXiv has become very popular, many papers are published before peer review at international conferences. This change has made it easier for other companies to validate algorithms.
Many researchers have started to study machine learning, so machine learning research in academia has become a red ocean.
The strategy “Make a great algorithm, but keep the implementation secret” has become a thing of the past.

Halcyon days

Five or ten years ago, almost all players working on advanced machine learning were in laboratories at universities or large enterprises, or at some advanced companies. In particular, the amount of labeled data was smaller than at present, and many researchers improved performance by researching algorithms and by feature engineering.

Many researchers from academia studied state-of-the-art machine learning and submitted their work to international conferences. Most of the insights were shared after peer review. Implementations were not shared as much as now, and each researcher had to reimplement the preceding research from scratch. A typical cycle for releasing new algorithms was half a year, in some cases more than a year.

There were few open source machine learning libraries/frameworks like Weka. scikit-learn, released in 2010, was not famous among software engineers. Many of us used libraries with single/few algorithms such as libsvm and liblinear.

Fast moving era

As of 2017, the number of people who work in machine learning has significantly increased compared with 10 years ago. The center of machine learning has moved from academia to companies with large data. In particular, software engineers who have never worked on machine learning are entering the deep learning world. I was surprised to hear that a friend from the community, who had never worked on machine learning in business, had started working on Deep Learning. The reasons for this movement are 1) it became common for companies to store large data that can be used for machine learning, 2) the number of excellent machine learning frameworks has increased, and 3) GPU power makes Deep Learning calculations efficient.

Many open source libraries became popular: not only Deep Learning frameworks such as TensorFlow, Chainer, MXNet, Caffe 2, and PyTorch, but also XGBoost and LightGBM, which are famous among Kagglers. scikit-learn is also a common tool as a framework for experimenting with multiple algorithms.

The rise of “open papers”

This movement is supported by the machine learning competition site “kaggle” and by “arXiv”, a place to post open papers. (There is a discussion that arXiv does not have a peer review process and quality is not assured, so can we call those documents research papers? In this post, though, I will call research-paper-style reports “papers”.)

The following article describes the number of machine learning (especially Deep Learning) papers submitted to arXiv. According to this article, the number of papers related to machine learning has more than quadrupled in 2017 compared to five years ago.

A Peek at Trends in Machine Learning
_Have you looked at Google Trends? It’s pretty cool — you enter some keywords and see how Google Searches of that term…_medium.com

Papers are posted to arXiv every day. This means state-of-the-art results from companies such as Google, Facebook, and Microsoft are increasingly published before peer review. This makes it challenging for the central laboratories of traditional large enterprises, which usually set targets for a year or half a year, to research and develop cutting-edge machine learning algorithms themselves. There is also criticism that some papers are “just adding parts”, but it is clear that machine learning algorithms are being developed significantly faster.

In the field of machine translation, the breakthrough in deep learning was encoder-decoder and attention. The subsequent papers are not interesting, “I just put existing parts here.” I can’t understand why these papers come to the top conference.

Recently, for those who read new arXiv papers day and night, there is a system called “arXivTimes” for keeping track of newly arrived papers.

arXivTimes Indicator
arxivtimes.herokuapp.com

Open papers accelerate open source machine learning

This March, a paper about “Deep Forest” was published on arXiv, and it became a hot topic because the authors claimed that “performance is better than Deep Learning”.

[1702.08835] Deep Forest: Towards An Alternative to Deep Neural Networks
_Abstract: In this paper, we propose gcForest, a decision tree ensemble approach with performance highly competitive to…_arxiv.org

About one week (2017/3/5) after the publication (2017/2/28) of the method proposed in this paper, an R implementation came out, followed by a Python implementation. A discussion took place in the following LightGBM issue on GitHub, and it turned out that the paper’s results were not reproducible: they couldn’t confirm the performance.

Support gcForest · Issue #331 · Microsoft/LightGBM
_LightGBM - A fast, distributed, high performance gradient boosting (GBDT, GBRT, GBM or MART) framework based on…_github.com

It is a symbolic event: an OSS implementation of the paper appeared within a week after it was published on arXiv, and discussion in the community began.

I hear that the number of international conferences that require disclosing the implementation when a paper is submitted is increasing.

Conclusion

Developing machine learning algorithms is an essential task. Thanks to open papers, ML competition websites, and fast OSS implementations of new algorithms, we can rapidly adopt state-of-the-art knowledge into business.

IMHO, it is becoming fun to focus on where we can make use of ML in business rather than developing the algorithm itself.

In other words, it is now too hard to claim “special machine learning algorithms that only our company can do”. Of course, people in academia will push these cutting-edge initiatives if they can prepare data. What evidence is there that one company can invent a better algorithm faster than the state-of-the-art people at tech giants like Google, Facebook, Microsoft, etc.? That is the reason for the strength of open source based machine learning.

In academia, there is a famous phrase, “Standing on the shoulders of giants”, which means that we can go on to the next step thanks to previous research. Even in machine learning based on open source, we cannot ignore this phrase. We cannot ignore giants.

How to run Cloudera Director on your macOS/Windows 10

Wed, 02 Aug 2017 12:12:31 +0900

Cloudera Director is a provisioning tool for CDH and Cloudera Enterprise. We can launch a cluster with the Web GUI or the CLI tool. Using the Cloudera Director CLI tool, you can manage your cluster with a configuration file, which enables you to manage configurations with git. In this article, I will show how to install Cloudera Director on your local macOS or Windows 10.

For usage of Cloudera Director, see also the documentation.

Cloudera Director 2.5.x Documentation
www.cloudera.com

Install Cloudera Director on your macOS with homebrew

If you’re a homebrew user, you can install Cloudera Director easily.

chezou/homebrew-cloudera
_homebrew-cloudera - Homebrew Formulas for cloudera tools_github.com

$ brew tap chezou/cloudera
$ brew install cloudera-director-server

Then, you can launch/terminate Cloudera Director as follows:

# Start Cloudera Director Server in the background
$ cloudera-director-server-start
# After launching the director server, you can open http://localhost:7189/

# Stop Cloudera Director Server running in the background
$ cloudera-director-server-stop

Install Cloudera Director on your Windows 10

If you are a Windows 10 user, you can install Ubuntu as the Linux Subsystem.

Launch bash on Windows, then run as follows:

Make sure to get not the IP address of Windows but that of Ubuntu.

Use Docker image

If you don’t want to install it on your machine directly, you can use the Docker image of Cloudera Director.

tsuyo/cloudera-boot
_cloudera-boot - Cloudera Director Utilities_github.com

After installing Docker, run the following commands and your Director will launch.

$ git clone https://github.com/tsuyo/cloudera-boot
$ cd cloudera-boot
$ . bin/cloudera-boot.sh # load several functions/aliases
$ cb-build # may take a while
# set your secrets

You can launch a Director server or use the client as well. For further information, see README.md.

Simple way to distribute your private Python packages within your organization

Mon, 24 Jul 2017 01:21:40 +0900

https://www.irasutoya.com/2017/05/blog-post_22.html

This article is a translation of this article, originally written by aodag in Japanese. I translated it with his permission. This article aims to introduce simple ways to prepare an internal Python package host, like a local gem server in Ruby.

Methods

Include your packages in your git repository
Publish a directory including your packages via HTTP server
Build a local PyPI-equivalent server

Building a local PyPI-equivalent server (translator note: like devpi) is costly, and I don’t think it is usually necessary, so I will describe the first two options.

Include your packages in your Git repository

If your packages are required for a particular project, it is straightforward to include them in the Git repository. You can put them in a directory named wheelhouse, which comes from the name of the previous default directory created by pip wheel. (translator note: this method assumes you know wheel. If not, this story and this JIRA would be helpful.) If you put the private package foo in the wheelhouse, you can install it as follows:

$ pip install foo -f wheelhouse

Note that -f is the short option for --find-links. With that option, pip will search for packages in the directory first, then fall back to PyPI.

Publish a directory including your packages via HTTP server

We can use the --find-links option to search not only a local directory but also a remote server via HTTP. If you have a package used by multiple projects, this method will help you.

The easiest way to distribute your packages with this method is to execute python -m http.server with Python 3.x (or python -m SimpleHTTPServer with Python 2.7) in the wheelhouse directory. This simple server provides directory listings, so we can just point --find-links at it. Open http://localhost:8000 in a web browser to make sure you can see the list of files under the wheelhouse directory.

To install foo package via HTTP server you launched, you can execute as follows:

$ pip install foo -f http://localhost:8000
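The same check can be done from Python instead of a browser. The sketch below serves a directory with the standard library’s http.server and downloads the listing page that pip would read. The function names serve_directory and fetch_listing are hypothetical, and the code assumes Python 3.7 or later.

```python
import functools
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Serve the directory over HTTP in a background thread and return the server."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_listing(url: str) -> str:
    """Download the directory listing page served at the URL."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

For example, start the server with serve_directory("wheelhouse"), confirm that the wheel file name appears in fetch_listing("http://localhost:8000/"), and call shutdown() on the returned server when you are done.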

Since this is only a simple server, for production it is better to put the packages in cloud storage such as AWS S3 (check how to provide directory listings there), or to use Apache with directory listings (Options +Indexes) enabled.

Conclusion

I recommend these methods because they are simple and do not require a dedicated application server.

tabula-py now able to extract remote PDF and multiple tables at once

Sun, 28 May 2017 11:18:39 +0900

(Note: Oct 7th, 2019) As of Oct. 2019, I launched a documentation site and Google Colab notebook for tabula-py. The FAQ would be a good place to learn how to extract tables accurately.

tabula-py is a Python library that enables you to extract tables from PDFs into pandas DataFrames. Today, I released v0.8.0. In this post, I will introduce the improvements made since the previous post about tabula-py. If you aren’t familiar with tabula-py, see the previous one.

Change Notes

Able to read remote PDF passing URL
[Experimental] Add multiple_tables mode
Add batch conversion method: convert_into_by_batch()
Add encoding option
Add java_options
Will deprecate read_pdf_table() method

I will explain important features.

Read remote PDF passing URL

If you want to extract a DataFrame from a PDF on the internet, you can read the remote PDF without downloading it manually.

read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/12s0324.pdf")

[Experimental] Add multiple_tables mode

Because tabula-py is a simple wrapper of tabula-java, it was hard to handle multiple tables on a page. But now, you can extract multiple tables on a page using the multiple_tables option.

read_pdf('tests/resources/data.pdf', pages=2, multiple_tables=True)

This function creates a list of DataFrames via JSON from tabula-java, so if tabula-java’s JSON format changes, the output could break. If you see CParserError, try setting the multiple_tables option.
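To illustrate why the output depends on the JSON format, here is a rough sketch of the conversion idea using only the standard library. It is not tabula-py’s actual implementation. It assumes the JSON is a list of tables, each with a data field holding rows of cells, and each cell holding a text field; please check that assumption against the output of your tabula-java version. The function name tables_from_tabula_json is hypothetical.

```python
import json
from typing import List


def tables_from_tabula_json(raw: str) -> List[List[List[str]]]:
    """Turn tabula-java style JSON into one list of text rows per table."""
    tables = []
    for table in json.loads(raw):
        # Each row is a list of cell objects; keep only the cell text.
        rows = [[cell.get("text", "") for cell in row] for row in table.get("data", [])]
        tables.append(rows)
    return tables
```

If a field such as data or text were renamed upstream, a conversion like this would silently return empty tables, which is the kind of breakage mentioned above.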

Add batch conversion method: convert_into_by_batch()

Since tabula-java v0.9.2, tables can be extracted from PDFs in batch. You can use this feature through the convert_into_by_batch() method.

convert_into_by_batch(path_to_dir, output_format='csv')

You should pass the path of the directory that contains the PDFs, not the path of a specific PDF.

tabula-py writes the extracted tables to the same directory as the input files.

TODOs

There are several problems that may be fixed after the release of tabula-java 0.9.3, e.g., handling embedded fonts, including Japanese…

Waiting for your collaboration!

If you have any trouble with tabula-py, please file an issue on GitHub. I don’t want to receive emails because the answers will not be shared with other people. Make sure to fill in the issue template; it greatly reduces the cost for me to solve the problem.

An easy way to get URL list of your Medium publication

Tue, 02 May 2017 11:01:01 +0900

I imported blog posts from my own WordPress site, but I have to redirect the old articles to Medium manually. There is a WordPress plugin that redirects articles, but it requires a URL mapping in CSV format. To get a Medium publication’s URL list, you might turn to the official API, but it lacks a function to list posts. I looked for other options but couldn’t find one that returns the post list. In this article, I will show you an easy way to get the URL list of your Medium publication.

Note: I tried this method with a publication of fewer than 150 articles. It might not work with a huge number of articles.

How-to

You can use the following Python script after displaying all the articles by accessing /latest of a publication. For example, after opening https://blog.chezo.uno/latest, you can load the whole list by scrolling down and down and down…
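The original script is not reproduced in this text, so here is a minimal sketch of the idea rather than the script itself. It assumes you saved the fully scrolled /latest page as HTML and passed its content as a string; it collects links on the same host and drops query strings. The names LinkCollector and publication_urls are hypothetical, and the result may still contain non-article pages such as tag or author pages that you need to filter out by hand.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag in page order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def publication_urls(html: str, base_url: str) -> List[str]:
    """Return unique same-host links in page order, without query strings."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(base_url).netloc
    seen, urls = set(), []
    for href in collector.hrefs:
        parts = urlsplit(urljoin(base_url, href))
        # Skip other hosts and links back to the listing page or site root.
        if parts.netloc != host or parts.path in ("", "/", "/latest"):
            continue
        url = f"{parts.scheme}://{parts.netloc}{parts.path}"
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Writing the returned list to a CSV file next to the old WordPress URLs gives the mapping that the redirect plugin needs.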

sparkavro: Manipulate Apache Avro file with sparklyr

Sun, 26 Mar 2017 21:02:01 +0900

I created a simple sparklyr extension to handle Apache Avro files. It is just a simple wrapper of Databricks’ spark-avro. It is listed in the official documentation of sparklyr extensions.

chezou/sparkavro
sparkavro - Load Avro data into Spark with sparklyr (github.com)

Installation

Use {devtools} to install sparkavro.

devtools::install_github("chezou/sparkavro")

Simple usage

You can read and write Avro files as follows:

library(sparklyr)
library(sparkavro)
sc <- spark_connect(master = "spark://HOST:PORT")
df <- spark_read_avro(sc, "test_table", "/user/foo/test.avro")
spark_write_avro(df, "/tmp/output")

This is the very first version, so there might be bugs, especially around options. If you find any bug, please raise a GitHub issue.

How to connect secure Impala cluster from RStudio on macOS with implyr

Sun, 26 Mar 2017 06:35:45 +0900

Impala is a very fast SQL-on-Hadoop engine, and it will enhance your R experience through implyr, a dplyr-based interface for Apache Impala (incubating) created by Ian Cook. I will show you how to set up a connection to a Kerberized Impala cluster with implyr from local macOS. You can find my GitHub repo below:

chezou/implyr-example
implyr-example - Example repository of implyr (github.com)

Setting up ODBC environment for macOS

Install unixODBC with homebrew

First, we will install unixODBC to handle Impala with ODBC. In the R world, ODBC is preferred for connecting to Impala because of its performance and compatibility. Let’s install unixODBC with Homebrew.

$ brew install unixodbc

Download and install the latest version of the Impala ODBC driver from Cloudera

You can download the latest Impala ODBC Driver.

Configure your .odbc.ini and .odbcinst.ini

After installing Impala ODBC driver for macOS, basic configuration templates can be found in /opt/cloudera/impalaodbc/Setup/.

cp /opt/cloudera/impalaodbc/Setup/odbc.ini ~/.odbc.ini
cp /opt/cloudera/impalaodbc/Setup/odbcinst.ini ~/.odbcinst.ini

Before using the following settings, you must replace HOST and KrbRealm with appropriate values. Modify your .odbc.ini as follows:

[ODBC]
# Specify any global ODBC configuration here such as ODBC tracing.

[ODBC Data Sources]
Impala=Cloudera ODBC Driver for Impala

[Impala]

# Description: DSN Description.
# This key is not necessary and is only to give a description of the data source.
Description=Cloudera Impala ODBC Driver DSN

# Driver: The location where the ODBC driver is installed to.
Driver=/opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib

# The DriverUnicodeEncoding setting is only used for SimbaDM
# When set to 1, SimbaDM runs in UTF-16 mode.
# When set to 2, SimbaDM runs in UTF-8 mode.
#DriverUnicodeEncoding=2

# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
# They can also be specified on the connection string.
HOST=[REPLACE_YOUR_IMPALA_HOST]
PORT=21050
Schema=default

# The authentication mechanism.
# 0 — No authentication (NOSASL)
# 1 — Kerberos authentication (SASL)
# 2 — Username authentication (SASL)
# 3 — Username/password authentication (NOSASL or SASL depending on UseSASL configuration)
AuthMech=1

# Set to 1 to use SASL for authentication.
# Set to 0 to not use SASL.
# When using Kerberos authentication (SASL) or Username authentication (SASL) SASL is always used
# and this configuration is ignored. SASL is always not used for No authentication (NOSASL).
UseSASL=1

# Kerberos related settings.
KrbFQDN=_HOST
KrbRealm=[REPLACE_YOUR_REALM]
KrbServiceName=impala

# Username/password authentication with SASL settings.
UID=
PWD=

# Set to 0 to disable SSL.
# Set to 1 to enable SSL.
SSL=1
CAIssuedCertNamesMismatch=1
TrustedCerts=/opt/cloudera/impalaodbc/lib/universal/cacerts.pem

# If you use SSL with AllowSelfSignedServerCert, you can set this configuration.
#AllowSelfSignedServerCert=1

# Specify the proxy user ID to use.
#DelegationUID=

# General settings
TSaslTransportBufSize=1000
RowsFetchedPerBlock=10000
SocketTimeout=0
StringColumnLength=32767
UseNativeQuery=0

After setting up .odbc.ini, your application can refer to these settings by DSN name, which is Impala in this case.
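Forgetting to replace a placeholder is an easy mistake, so here is a small sketch that reports which keys of a DSN section still hold a [REPLACE_...] value. It reads the file content with the standard library’s configparser; the function name unreplaced_placeholders is hypothetical, and the check only knows about the placeholder style used in the template above.

```python
import configparser
from typing import List


def unreplaced_placeholders(ini_text: str, dsn: str = "Impala") -> List[str]:
    """Return the keys in the DSN section that still hold [REPLACE_...] values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case such as HOST and KrbRealm
    parser.read_string(ini_text)
    return [
        key
        for key, value in parser[dsn].items()
        if value.startswith("[REPLACE_")
    ]
```

You could pass it the text of ~/.odbc.ini (for example via pathlib.Path.home() / ".odbc.ini") and expect an empty list once HOST and KrbRealm are filled in.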

Check the configuration

After configuration, you should run kinit with your principal.

$ kinit $USER@YOUR_REALM

You should replace `$USER` and `YOUR_REALM` with your own user name and realm.

Before using RStudio on your Mac, you can check the configuration with the `isql` command.

Implyr Example

After setting up .odbc.ini, you can connect to the secure Impala cluster with {implyr}. As an example, we will visualize airport data.

First, install R packages.

install.packages(c("implyr", "odbc", "DBI", "dplyr", "ggplot2", "ggExtra"))

Then, connect the Impala cluster.

library(implyr)
library(odbc)
drv <- odbc::odbc()
impala <- src_impala(
  drv = drv,
  dsn = "Impala"
)

If your .odbc.ini is configured properly, you can connect to the Impala cluster.

Let’s visualize the airports data. In this case, we assume the data is in the u_ariga database, so we switch databases with the SQL statement use u_ariga.

library(DBI)
# Change database
dbExecute(impala, "use u_ariga")
dbGetQuery(impala, "show tables")
airports <- tbl(impala, "airports_pq")

# Show the head of airports data
View(airports)

airports %>% filter(latitude < 35) %>% count()
#903

Finally, we will show a joint histogram of longitude and latitude.

airports_by_geo <- airports %>% select(longitude, latitude) %>% collect()

library(ggplot2)

p <- ggplot(airports_by_geo, aes(longitude, latitude)) + geom_point() + theme_classic()
ggExtra::ggMarginal(p, type = "histogram")

Conclusion

{implyr} is a great package for Impala and dplyr, but it is a pretty young project. If you find any problems, why don’t you file a GitHub issue?

Visualize your massive data with Impala and Redash

Sat, 11 Feb 2017 14:14:44 +0900

Redash is a well-known OSS visualization tool that lets you visualize your data with SQL. It supports Apache Impala (incubating), a fast SQL-on-Hadoop engine suitable for BI tools and exploratory analysis. With Impala, you can run SQL queries against tables on Amazon S3.

In this post, we connect to Impala from Redash and visualize data.

Set up Redash

You can set up Redash in various ways. This time, I used the AMI for Redash. After launching it, you can open it in your browser and log in with admin/admin.

Add Data Source of Impala

After clicking Database icon, you can add data sources.

This time, I set configurations as follows:

Example configuration

Type: Impala
Database: default
Host: hostname of Impala daemon
Ldap_password/user: (empty)
Port: 21050 (default port)
Please specify beeswax or hiveserver2: hiveserver2
Timeout: 3600
Use_ldap: (empty)

Now, you can select Impala as a data source.

Result of Impala query

tabula-py: Extract table from PDF into Python DataFrame

Mon, 09 Jan 2017 14:09:08 +0900

(Oct 7th, 2019) As of Oct. 2019, I launched a documentation site and Google Colab notebook for tabula-py. The FAQ would be a good place to learn how to extract tables accurately.

Screenshots in this article are based on the old version of the interface. See the latest example in the Colab notebook.

Today, I released tabula-py 0.3.0, which extracts tables from PDFs into pandas DataFrames.

It is a simple wrapper of tabula-java that enables you to extract tables into a DataFrame or JSON with Python. You can also extract tables from a PDF into a CSV, TSV, or JSON file.

tabula is a tool to extract tables from PDFs. It is GUI-based software, whereas tabula-java is a command-line tool. Though there were Ruby, R, and Node.js bindings of tabula-java, there was no Python binding before tabula-py. I believe PyData is a great ecosystem for data analysis, and that’s why I created tabula-py. If you are familiar with R, I highly recommend tabulizer, which has the richest bindings, including a rich GUI.

You can install tabula-py via pip:

pip install tabula-py

With tabula-py, you can get a DataFrame with the read_pdf() method.

example of read_pdf()


You can also extract tables in JSON format:

example of JSON

You can extract tables into a JSON, CSV, or TSV file with the convert_into() method.

You can see more examples in the Jupyter notebook.

I hope you will enjoy data wrangling with tabula-py. Any feedback would be welcome!

Waiting for your collaboration!

If you have any trouble with tabula-py, please file an issue on GitHub. I don’t want to receive emails because the answers will not be shared with other people. Make sure to fill in the issue template; it greatly reduces the cost for me to solve the problem. I also check StackOverflow, so you can ask about it there.

Livy & Jupyter Notebook & Sparkmagic = Powerful & Easy Notebook for Data Scientist

Fri, 30 Dec 2016 15:15:23 +0900

livy is a REST server for Spark. As you can see in the talk at Spark Summit 2016, Microsoft uses livy for HDInsight with Jupyter notebook and sparkmagic. Jupyter notebook is one of the most popular OSS notebooks among data scientists. Using sparkmagic + Jupyter notebook, data scientists can execute ad-hoc Spark jobs easily.

Why is livy good?

According to the official document, livy has features like:

Have long running SparkContexts that can be used for multiple Spark jobs, by multiple clients
Share cached RDDs or Dataframes across multiple jobs and clients
Multiple SparkContexts can be managed simultaneously, and they run on the cluster (YARN/Mesos) instead of the Livy Server for good fault tolerance and concurrency
Jobs can be submitted as precompiled jars, snippets of code, or via Java/Scala client API
Ensure security via secure authenticated communication
Apache License, 100% open source

Why livy + sparkmagic?

sparkmagic is a livy client for Jupyter notebook. When we write Spark code in our local Jupyter client, sparkmagic runs the Spark job through livy. Using sparkmagic + Jupyter notebook, data scientists can use Spark from their own Jupyter notebook running on localhost. We don’t need to copy any Spark configuration from the CDH cluster, so we can execute Spark jobs on a cluster as if they were running on a local machine.

diagram from https://github.com/jupyter-incubator/sparkmagic/raw/master/screenshots/diagram.png

Requirements

1. Spark cluster

Cloudera Director is nice for preparing one
Install git and maven
I tried CDH 5.7 with CentOS 7

2. Local Jupyter client

virtualenv and virtualenvwrapper are awesome

Preparation

In order to use livy with sparkmagic, we should install livy on the Spark gateway server and sparkmagic on the local machine.

Install R

$ sudo yum install -y epel-release
$ sudo yum install -y R

Build livy

$ git clone git@github.com:cloudera/livy.git
$ cd livy
$ mvn -Dspark.version=1.6.0 -DskipTests clean package

Because a test was failing at that time, I added -DskipTests to the build.

Run livy

Set environment variables as follows:

$ export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark
$ export HADOOP_CONF_DIR=/etc/hadoop/conf

Add the following configuration into livy.conf:

livy.server.session.factory = yarn

Let’s run the livy server:

$ ./bin/livy-server

Open another terminal and check the server:

$ curl localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}

As livy’s default port number is 8998, we should open or forward the port.

Prepare sparkmagic on the local machine

Install sparkmagic by following the document:

$ pip install sparkmagic
$ jupyter nbextension enable --py --sys-prefix widgetsnbextension

Then install the wrapper kernels. Run pip show sparkmagic to see the Location info. In the following example, Location is /Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages.

$ pip show sparkmagic
---
Metadata-Version: 2.0
Name: sparkmagic
Version: 0.2.3
Summary: SparkMagic: Spark execution via Livy
Home-page: https://github.com/jupyter-incubator/sparkmagic/sparkmagic
Author: Jupyter Development Team
Author-email: jupyter@googlegroups.org
Installer: pip
License: BSD 3-clause
Location: /Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages
Requires: ipywidgets, pandas, ipython, requests, mock, autovizwidget, numpy, nose, ipykernel, notebook, hdijupyterutils
Classifiers:
  Development Status :: 4 - Beta
  Environment :: Console
  Intended Audience :: Science/Research
  License :: OSI Approved :: BSD License
  Natural Language :: English
  Programming Language :: Python :: 2.6
  Programming Language :: Python :: 2.7
  Programming Language :: Python :: 3.3
  Programming Language :: Python :: 3.4
$ cd /Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages
$ jupyter-kernelspec install sparkmagic/kernels/sparkkernel
$ jupyter-kernelspec install sparkmagic/kernels/pysparkkernel

Copy the config.json into ~/.sparkmagic/config.json and modify it.
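The modification mainly means pointing the kernels at your livy server. As a hedged sketch, the helper below rewrites the URL in a loaded config. The key names kernel_python_credentials and kernel_scala_credentials are my assumption about the example config, so compare them with the config.json shipped with your sparkmagic version. The function name point_config_at_livy is hypothetical.

```python
import copy
from typing import Any, Dict

# Assumed key names; verify them against your sparkmagic config.json.
CREDENTIAL_KEYS = ("kernel_python_credentials", "kernel_scala_credentials")


def point_config_at_livy(config: Dict[str, Any], livy_url: str) -> Dict[str, Any]:
    """Return a copy of the config whose kernel credentials use the given Livy URL."""
    updated = copy.deepcopy(config)
    for key in CREDENTIAL_KEYS:
        # Create the credentials section if the loaded config lacks it.
        updated.setdefault(key, {})["url"] = livy_url
    return updated
```

In practice you would load ~/.sparkmagic/config.json with the json module, pass the resulting dict with a URL such as http://YOUR_HOSTNAME:8998, and write the returned dict back to the same file.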

Run jupyter notebook

Before running jupyter, I recommend checking the connection from the local machine to the livy server.

$ curl YOUR_HOSTNAME:8998/sessions
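The same connection check can be scripted in Python, which is handy inside a notebook. This minimal sketch calls the /sessions endpoint shown above and counts the sessions in the JSON response. It assumes the response has the shape printed by curl earlier, with a sessions list; the function name count_livy_sessions is hypothetical.

```python
import json
import urllib.request


def count_livy_sessions(base_url: str, timeout: float = 5.0) -> int:
    """Call GET /sessions on a Livy server and return the number of sessions listed."""
    with urllib.request.urlopen(f"{base_url}/sessions", timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return len(body["sessions"])
```

For example, count_livy_sessions("http://YOUR_HOSTNAME:8998") should return 0 on a fresh server, and it raises an error if the port is not reachable.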

Launch Jupyter notebook and create a PySpark notebook (of course, you can use Spark as well).

$ jupyter notebook

The example notebook is here

Jupyter Notebook Viewer (nbviewer.jupyter.org)

In nbviewer, we cannot see the result of SQL, but the %%sql magic command lets us visualize it. That’s awesome :)

If you use %%local, you can use local Python libraries such as scikit-learn and seaborn with the results received from PySpark.


Text-to-speech based on deep learning for Web site using Amazon Polly and Ruby

Thu, 01 Dec 2016 15:00:02 +0900

Amazon Polly, a text-to-speech service from AWS, was announced at today’s re:Invent. Amazon Polly is a speech synthesis system based on deep learning.

Amazon Polly — Text to Speech in 47 Voices and 24 Languages

[updated] I added generated speech of this article.

[updated2] I created a simple CLI tool and a Ruby gem for Polly

https://rubygems.org/gems/pollynomial

The great thing about Amazon Polly is that we can use TTS easily with the AWS CLI. It is free for up to 5 million characters a month; beyond that limit, it is very cheap at $0.000004 per character. Synthesizing Adventures of Huckleberry Finn costs only about $2.4.
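The arithmetic behind that figure can be written down as a tiny sketch. At $0.000004 per character, $2.4 corresponds to about 600,000 characters (2.4 / 0.000004); that character count is derived from the numbers above, not measured from the book. The function name synthesis_cost is hypothetical.

```python
PRICE_PER_CHARACTER = 0.000004  # USD, the per-character price quoted above


def synthesis_cost(characters: int, free_characters: int = 0) -> float:
    """Estimate the cost in USD after subtracting any free allowance."""
    billable = max(0, characters - free_characters)
    return billable * PRICE_PER_CHARACTER
```

With the monthly free allowance taken into account, synthesis_cost(600_000, free_characters=5_000_000) is zero, so a single book would fit within the free tier.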

Here is the example code of Polly with AWS CLI tool.

$ aws polly synthesize-speech \
    --output-format mp3 --voice-id Joanna \
    --text "Hello my name is Joanna." \
    joanna.mp3

As of December 1, 2016, they support the following 24 languages, mainly European ones.

Icelandic
Italian
Welsh
Dutch
Swedish
Spanish (Castile)
Spanish (USA)
Danish
Turkish
German
Norwegian
French
French (Canada)
Portuguese
Portuguese (Brazil)
Polish
Romanian
Russian
Japanese
English (India)
English (Welsh)
English (Australia)
English (US)
English (UK)

I think the Japanese speech sounds very natural. Sometimes the accent is strange, but by registering words with a Lexicon, we can improve the quality ourselves. A Japanese sample voice is as follows:

I often find interesting articles on Medium, but reading long English articles is a bit tough for a non-native English speaker like me. So I came up with the idea that if I converted an article to speech, I could listen to it easily. That’s why I wrote code to convert articles to speech with Ruby, as follows:

There are some important restrictions on the API:

Each API request accepts up to 1500 characters
Long audio is truncated after 5 minutes
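Because of the per-request character limit, a long article has to be split before it is sent to the API. The following is a minimal sketch of one way to do that in Python, not the Ruby code I used: it cuts the text at sentence ends where possible and falls back to fixed-size pieces for a sentence that is longer than the limit. The function name split_for_tts is hypothetical.

```python
import re
from typing import List


def split_for_tts(text: str, limit: int = 1500) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut into fixed-size pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio files concatenated in order.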

Building predictive Model with Ibis, Impala and scikit-learn

Sat, 15 Oct 2016 06:10:31 +0900

tl;dr

Visualize MovieLens 20M data (a famous movie rating dataset) with Ibis
Build a predictive model of movie preference with scikit-learn
repo / notebook

What is Ibis?

Ibis is a bridge between Python and Big Data. Ibis enables pandas-style handling of Big Data.

architecture of Ibis

For more detail, see Wes’s presentation.

As you know, pandas is known as a killer application for data analysis. At my previous job, a company known for developing the world’s largest monolithic Ruby on Rails application, many Rails developers were attracted to pandas and Jupyter notebook for sharing analysis results.

Why Ibis?

pandas loads data into memory, so we have to filter it with SQL before analyzing it. But what we actually want is to get insights and handle the data without SQL.

Preparation

Impala cluster

CDH 5.7 with Cloudera Director 2.1
table is created with parquet on S3

required port

impalad node’s 21050 port
NN’s 50070 port

Ibis

Python 3.5
using wheel and virtualenv, I didn’t use anaconda

Notebook

The full notebook repo is here. I also executed the same code against Redshift, but several dialect differences prevented execution…

chezou/ibis-demo
ibis-demo - Demo notebook of Ibis for “Spark + Python + Data science Festival” (github.com)

FAQ

What is the difference from PySpark?

Easy to set up. It is just like connecting to a DB
It is 10x faster, so we can run 10x as many experiments. That drives innovation!
We can do rapid prototyping with Ibis.

Which is preferable for building a model: Ibis + scikit-learn or Spark + MLlib?

It depends on data size.
Netflix uses Spark and R for building predictive models: R to model filtered data, such as data for a specific country, and Spark for the global model.
