Blog Posts


An easy way to get the URL list of your Medium publication

I imported blog posts from my own WordPress site into Medium, but I have to redirect the old articles to Medium manually. There is a WordPress plugin that redirects articles, but it requires a URL mapping in CSV format. To get the URL list of a Medium publication you might turn to the official API, but it has no function for listing posts. We need some other way, but I couldn’t get the post list.
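The CSV mapping mentioned above could be produced along these lines once the Medium URLs are known. This is only a minimal sketch: the two-column layout (old path, new URL), the helper name write_redirect_csv, and the example URLs are all assumptions for illustration, since the excerpt does not say which format the plugin expects.

```python
import csv
import io
from urllib.parse import urlparse


def write_redirect_csv(pairs, out):
    """Write (old_url, new_url) pairs as two-column CSV rows.

    Assumption: the redirect plugin matches on the path of the old URL,
    so only the path component of the WordPress URL is kept.
    """
    writer = csv.writer(out, lineterminator="\n")
    for old_url, new_url in pairs:
        writer.writerow([urlparse(old_url).path, new_url])


# Hypothetical example data; replace with your own URL pairs.
pairs = [
    ("https://old.example.com/2016/12/hello/",
     "https://medium.com/example-pub/hello-abc123"),
]

buf = io.StringIO()
write_redirect_csv(pairs, buf)
print(buf.getvalue())
```

Writing to any file-like object keeps the helper easy to check in memory before pointing it at a real file.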

sparkavro: Manipulate Apache Avro files with sparklyr

I created a simple sparklyr extension to handle Apache Avro files. It is just a simple wrapper of Databricks’ spark-avro, and it is listed in the official documentation of sparklyr extensions. Repository: chezou/sparkavro. Installation: use {devtools} to install sparkavro: devtools::install_github("chezou/sparkavro"). Simple usage: you can read and write Avro files as follows: library(sparklyr); library(sparkavro); sc <- spark_connect(master = "spark://HOST:PORT"); df <- spark_read_avro(sc, "test_table", "/user/foo/test.

How to connect to a secure Impala cluster from RStudio on macOS with implyr

Impala is a very fast SQL-on-Hadoop engine, and implyr, a dplyr-based interface for Apache Impala (incubating) created by Ian Cook, brings it to your R workflow. I will show you how to set up a connection to a Kerberized Impala cluster with implyr from local macOS. You can find my GitHub repo here: chezou/implyr-example. Setting up the ODBC environment for macOS: install unixODBC with Homebrew. First, we will install unixODBC to handle Impala with ODBC.

Visualize your massive data with Impala and Redash

Redash is a popular OSS visualization tool that lets you visualize your data with SQL. It supports Apache Impala (incubating), a fast SQL-on-Hadoop engine suitable for BI tools and exploratory analysis. With Impala, you can run SQL queries against tables on Amazon S3. In this post, we connect to Impala from Redash and visualize data. Set up Redash: you can set up Redash in various ways. This time, I use the AMI for Redash.

tabula-py: Extract tables from PDF into Python DataFrame

(Note: Oct 7th, 2019) As of Oct. 2019, I launched a documentation site and a Google Colab notebook for tabula-py. The FAQ is a good place to look for tips on accurate extraction. Today, I released tabula-py 0.3.0, which extracts tables from PDF into pandas DataFrames. Repository: chezou/tabula-py. It is a simple wrapper of tabula-java, and it enables you to extract tables into a DataFrame or JSON with Python.
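As a rough illustration of the extraction described above, the sketch below wraps tabula.read_pdf in a small helper. The helper name extract_tables and the file name in the usage comment are hypothetical. The return type of read_pdf has differed between tabula-py releases (a single DataFrame in early versions, a list of DataFrames in later ones), so the helper normalizes to a list; treat that as an assumption to check against the version you have installed.

```python
def extract_tables(pdf_path, pages="all"):
    """Return the tables found in a PDF as a list of pandas DataFrames.

    Requires the tabula-py package and a Java runtime, because tabula-py
    delegates the actual extraction to tabula-java.
    """
    import tabula  # imported lazily so the module loads without tabula-py

    result = tabula.read_pdf(pdf_path, pages=pages)
    # Normalize: some versions return one DataFrame, others a list of them.
    return result if isinstance(result, list) else [result]


# Usage (hypothetical file name):
#   tables = extract_tables("report.pdf")
#   print(tables[0].head())
```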

Livy & Jupyter Notebook & Sparkmagic = Powerful & Easy Notebook for Data Scientists

Livy is a REST server for Spark. As shown in a talk at Spark Summit 2016, Microsoft uses Livy for HDInsight with Jupyter notebook and sparkmagic. Jupyter notebook is one of the most popular OSS notebooks among data scientists. Using sparkmagic + Jupyter notebook, data scientists can easily execute ad-hoc Spark jobs. Why is Livy good? According to the official documentation, Livy has features like: long-running SparkContexts that can be used for multiple Spark jobs by multiple clients; cached RDDs or DataFrames shared across multiple jobs and clients; multiple SparkContexts that can be managed simultaneously and that run on the cluster (YARN/Mesos) instead of the Livy Server, for good fault tolerance and concurrency; jobs that can be submitted as precompiled jars, snippets of code, or via the Java/Scala client API; security via secure authenticated communication; Apache License, 100% open source. Why Livy + sparkmagic?
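To make "snippets of code" submitted over REST a little more concrete, here is a minimal sketch that only builds the HTTP requests for Livy's session and statement endpoints, without sending them. The server address (localhost on port 8998), the helper names, and the example snippet are assumptions for illustration; a secured cluster would also need authentication, which is not shown.

```python
import json
from urllib import request

# Assumption: a Livy server reachable at this address.
LIVY_URL = "http://localhost:8998"


def build_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy server."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        LIVY_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_session_request(kind="pyspark"):
    """Request that asks Livy to start an interactive session."""
    return build_request("/sessions", {"kind": kind})


def statement_request(session_id, code):
    """Request that submits a code snippet to an existing session."""
    return build_request(f"/sessions/{session_id}/statements", {"code": code})


req = statement_request(0, "sc.parallelize(range(10)).sum()")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing a request to urllib.request.urlopen would actually send it. sparkmagic performs this kind of exchange for you from inside the notebook.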

Text-to-speech based on deep learning for Web sites using Amazon Polly and Ruby

Amazon Polly, a text-to-speech service from AWS, was announced at today’s re:Invent. Amazon Polly is a speech synthesis system based on deep learning. See the announcement: "Amazon Polly – Text to Speech in 47 Voices and 24 Languages". [updated] I added generated speech of this article. [updated2] I created simple CLI tools and a RubyGem for Polly. The great thing about Amazon Polly is that we can use TTS easily with the AWS CLI.

Building a predictive model with Ibis, Impala and scikit-learn

tl;dr: visualize MovieLens 20M data (a famous movie rating dataset) with Ibis; build a predictive model of movie preference with scikit-learn; repo / notebook. What is Ibis? Ibis is a bridge between Python and big data: it brings pandas-style handling to big data. (Figure: architecture of Ibis.) For more detail, see Wes’s presentation. As you know, pandas is known as a killer application for data analysis. At my previous job, at a company known for developing the world’s largest monolithic Ruby on Rails application, many Rails developers were attracted to pandas and Jupyter notebook for sharing analysis results.