Create data lineage from Trino/Hive queries in digdag log with Python
sqllineage visualize your query log into lineage
What’s data lineage?
Data lineage is something to describe “Where this data comes from and where it goes?”
I learned this term in my previous job. They provided “Cloudera Navigator” which includes data lineage from execution logs of Hive/Spark etc.
sqllineage is awesome open source tool for visualizing lineage
Recently, I learned there is a Python package so called sqllinage, that makes analyze and visualize data lineage from SQLs.
sqllineage consists of Python implementation to analyze SQL and web application written in React.
Visualize data lineage from Treasure Data’s workflow logs
I found that Treasure Data’s workflow log outputs SQLs in its log. But it still needs to format pure SQLs.
Then, I create digdaglog2sql to extract SQLs from Treasure Workflow logs.
You can use it with Python 3.7+. Here is the overview of the usage and check details on GitHub.
Install via pip:
pip install --user digdaglog2sql
If you have a workflow log downloaded from Treasure Data, you can convert into SQL as:
digdaglog2sql --input workflow-log.txt --output output.sql
Or, if you want extract SQLs from specific workflow, you can use Session ID of it.
export TD_API_KEY=1234XXX/YYYYYYYY digdaglog2sql --session-id 12345 --site us --output output.sql
You can fetch SQLs from your hosted digdag as the following:
digdaglog2sql --session-id 12345 --endpoint digdag.example.com --output output.sql
Note that, as of May 5, 2022, sqllineage and sqlparse, which is an important backend of sqllineage, are not fully compatible with Trino and Hive queries.
These are the PRs that approaches the issues:
- ✅ https://github.com/reata/sqllineage/pull/252 -> Released in 1.3.5
- ✅ https://github.com/reata/sqllineage/pull/255 -> Released in 1.3.5
- [WIP] https://github.com/andialbrecht/sqlparse/pull/662
- [WIP] https://github.com/andialbrecht/sqlparse/pull/664
Don’t worry about it. I prepared patched branches on GitHub. You can install sqllineage and sqlparse as the following:
pip install git+https://github.com/chezou/sqlparse.git@trino#egg=sqlparse==0.4.3.dev0 pip install sqllineage
If you see some error on installation of sqllineage, double-check if you have node installed.
Then, you can visualize your SQL file as:
$ sqllineage -g -f output.sql * SQLLineage Running on http://localhost:5000/?f=output.sql
Now you can see visualization of data linage, both table level and column level.
Let’s try sqllineage!