Building predictive Model with Ibis, Impala and scikit-learn
- visualizing MovieLens 20M data (famous movie rating data) with Ibis
- build predictive model for movie favor with scikit-learn
- repo / notebook
What is Ibis?
Ibis is a bridge between Python and Big Data. Ibis enables pandas handling Big Data.
architecture of Ibis
For more detail, see Wes’s presentation.
As you know, pandas is known as a killer application for data analysis. In my previous job, which is known as a developer of world largest monolithic Ruby on Rails application, many Rails developer attracted with pandas and Jupyter notebook for sharing analysis result.
pandas loads data on memory, so we have to filter with some SQL before analyzing. But we actually want to get insight and handle without SQL.
- CDH 5.7 with Cloudera Director 2.1
- table is created with parquet on S3
- impalad node’s 21050 port
- NN’s 50070 port
- Python 3.5
- using wheel and virtualenv, I didn’t use anaconda
Full notebook repo is here. I also executed same code for Redshift, but several dialects prevent execution…
What is the difference between PySpark?
- Easy to setup. It is just like connecting DB
- Fast x10. So that we can x10 experiences. It makes us innovations!
- We can rapid prototyping with Ibis.
Which is prefer to build model Ibis + scikit-learn or Spark + MLlib?
- It depends on data size.
- Netflix uses Spark and R for building predictive models. Netflix uses R in order to model filtered data such as specific country, and they use Spark for global model.