Building predictive Model with Ibis, Impala and scikit-learn


  • visualizing MovieLens 20M data (famous movie rating data) with Ibis
  • build predictive model for movie favor with scikit-learn
  • repo / notebook

What is Ibis?

Ibis is a bridge between Python and Big Data. Ibis enables pandas handling Big Data.

architecture of Ibis
architecture of Ibis
architecture of Ibis

For more detail, see Wes’s presentation.

As you know, pandas is known as a killer application for data analysis. In my previous job, which is known as a developer of world largest monolithic Ruby on Rails application, many Rails developer attracted with pandas and Jupyter notebook for sharing analysis result.

Why Ibis?

pandas loads data on memory, so we have to filter with some SQL before analyzing. But we actually want to get insight and handle without SQL.


Impala cluster

  • CDH 5.7 with Cloudera Director 2.1
  • table is created with parquet on S3

required port

  • impalad node’s 21050 port
  • NN’s 50070 port


  • Python 3.5
  • using wheel and virtualenv, I didn’t use anaconda


Full notebook repo is here. I also executed same code for Redshift, but several dialects prevent execution…

_ibis-demo - Demo notebook of Ibis for “Spark + Python + Dita science Festival”


What is the difference between PySpark?

  • Easy to setup. It is just like connecting DB
  • Fast x10. So that we can x10 experiences. It makes us innovations!
  • We can rapid prototyping with Ibis.

Which is prefer to build model Ibis + scikit-learn or Spark + MLlib?

Aki Ariga
Aki Ariga
Principal Software Engineer

Interested in Machine Learning, ML Ops, and Data driven business. If you like my blog post, I’m glad if you can buy me a tea 😉

  Gift a cup of Tea