<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog Posts | Democratizing Data</title><link>https://chezo.uno/blog/</link><atom:link href="https://chezo.uno/blog/index.xml" rel="self" type="application/rss+xml"/><description>Blog Posts</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><copyright>©</copyright><lastBuildDate>Fri, 24 Apr 2026 18:53:00 -0700</lastBuildDate><image><url>https://chezo.uno/media/icon_hu_423f10ccd06de889.png</url><title>Blog Posts</title><link>https://chezo.uno/blog/</link></image><item><title>TikTok kept letting strangers create accounts with my email, so I filed PIPEDA and CASL complaints</title><link>https://chezo.uno/blog/2026-04-24-tiktok-kept-letting-strangers-create-accounts-wit/</link><pubDate>Fri, 24 Apr 2026 18:53:00 -0700</pubDate><guid>https://chezo.uno/blog/2026-04-24-tiktok-kept-letting-strangers-create-accounts-wit/</guid><description>
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-red-100 dark:bg-red-900 border-red-500"
data-callout="caution"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-red-600 dark:text-red-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M12 9v3.75m-9.303 3.376c-.866 1.5.217 3.374 1.948 3.374h14.71c1.73 0 2.813-1.874 1.948-3.374L13.949 3.378c-.866-1.5-3.032-1.5-3.898 0zM12 15.75h.007v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Disclaimer&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;This is a personal account of how I, a non-lawyer Canadian resident, used PIPEDA and CASL to address a privacy issue with TikTok. &lt;strong&gt;It is not legal advice.&lt;/strong&gt; Procedures, deadlines, and applicable laws change. If you are facing a similar situation, please verify everything against current OPC and CRTC guidance, and consider consulting a lawyer for matters with significant legal or financial stakes. Parts of the research and drafting in this article were done with the help of an LLM. I reviewed the output, but factual errors are still possible. Please check the primary sources I link to before acting on anything here.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;TikTok accounts I never created kept showing up under my email address. The first one was &lt;code&gt;@chezou2&lt;/code&gt;. After it was deleted, a different handle appeared.&lt;/li&gt;
&lt;li&gt;TikTok lets people register accounts without verifying that they control the email address. This is a structural failure to obtain consent.&lt;/li&gt;
&lt;li&gt;TikTok Support eventually deleted the first account after I escalated. A different account appeared shortly after the deadline I had given them.&lt;/li&gt;
&lt;li&gt;I filed two complaints with Canadian regulators:
&lt;ul&gt;
&lt;li&gt;A
&lt;strong&gt;violation report&lt;/strong&gt; to the CRTC Spam Reporting Centre, under CASL.
&lt;/li&gt;
&lt;li&gt;A
&lt;strong&gt;formal complaint&lt;/strong&gt; to the OPC, under PIPEDA.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The OPC had already issued a joint Report of Findings against TikTok in September 2025, so my complaint joins an existing thread of regulatory attention.&lt;/li&gt;
&lt;li&gt;Below is what I did. Whether anything from this is useful for your own situation is for you to judge.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="what-is-pipeda-and-casl"&gt;What are PIPEDA and CASL?&lt;/h2&gt;
&lt;p&gt;If you live in Canada and you have never come across these acronyms, here is the short version.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PIPEDA&lt;/strong&gt; stands for the Personal Information Protection and Electronic Documents Act. It is the federal law that governs how private-sector organizations collect, use, and disclose personal information in the course of commercial activity. Your email address is personal information under PIPEDA. Complaints about PIPEDA go to the &lt;strong&gt;OPC&lt;/strong&gt;, the Office of the Privacy Commissioner of Canada.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CASL&lt;/strong&gt; stands for Canada&amp;rsquo;s Anti-Spam Legislation. It regulates the sending of &amp;ldquo;commercial electronic messages&amp;rdquo; (think marketing emails) without consent. Complaints about CASL go to the &lt;strong&gt;CRTC&lt;/strong&gt; through its Spam Reporting Centre.&lt;/p&gt;
&lt;p&gt;The two laws cover different things. PIPEDA is about what an organization does with your personal information. CASL is about whether an organization is allowed to send you a particular email. In my case, both were relevant for different reasons.&lt;/p&gt;
&lt;h2 id="what-happened"&gt;What happened&lt;/h2&gt;
&lt;p&gt;In November 2025, I started receiving emails from &lt;code&gt;noreply@account.tiktok.com&lt;/code&gt; reporting &amp;ldquo;a new device login.&amp;rdquo; The login was from an Android phone I do not own. More fundamentally, I have never created a TikTok account.&lt;/p&gt;
&lt;p&gt;The emails were addressed to my private email and referenced a handle, &lt;code&gt;@chezou2&lt;/code&gt;, that someone had registered with a name close to mine. There was an awkward fact behind this: &lt;strong&gt;TikTok lets people register accounts without verifying that the email address belongs to them.&lt;/strong&gt; No &amp;ldquo;is this you?&amp;rdquo; confirmation email had ever arrived. The account was simply created, and I, the actual address holder, was on the receiving end of TikTok&amp;rsquo;s automated correspondence.&lt;/p&gt;
&lt;p&gt;This is a structural problem, not an incidental one. Anyone can register a TikTok account using anyone else&amp;rsquo;s email address, and the actual address holder receives the platform&amp;rsquo;s notifications indefinitely.&lt;/p&gt;
&lt;h2 id="march-2026-contacting-tiktok-directly"&gt;March 2026: contacting TikTok directly&lt;/h2&gt;
&lt;p&gt;On March 15, 2026, I submitted a &amp;ldquo;Report a potential privacy violation&amp;rdquo; request through TikTok&amp;rsquo;s privacy report form.&lt;/p&gt;
&lt;p&gt;The first reply, from &lt;code&gt;feedback@tiktok.com&lt;/code&gt;, was a templated message: &amp;ldquo;If you want to delete your account, log in and follow these steps.&amp;rdquo; The reply assumed I owned the account. The whole problem was that I did not.&lt;/p&gt;
&lt;p&gt;I replied, explaining that the account was not mine, that someone had created it using my email address without my consent, that I could not log in because I had never created the account, and that PIPEDA applied. The next reply asked me to verify ownership: signup date, first login location, registered device, registered phone number, linked third-party accounts.&lt;/p&gt;
&lt;p&gt;I could not answer any of these, and that was precisely the point. I replied saying so. My inability to provide ownership details was itself evidence that I had not created the account, and a process that demanded ownership details from a non-user was structurally incompatible with non-user privacy complaints. I also stated that if the matter was not resolved within 30 days, I would file a formal complaint with the OPC.&lt;/p&gt;
&lt;p&gt;After further escalation, &lt;code&gt;@chezou2&lt;/code&gt; was eventually deleted. I did not get an explicit confirmation; I just noticed, when I checked on April 16, that the profile no longer existed.&lt;/p&gt;
&lt;h2 id="april-2026-the-same-problem-again"&gt;April 2026: the same problem, again&lt;/h2&gt;
&lt;p&gt;The same day I confirmed &lt;code&gt;@chezou2&lt;/code&gt; was gone (April 16, 2026, the day after the 30-day deadline I had given TikTok), I started receiving emails from a different sender (&lt;code&gt;notification@service.tiktok.com&lt;/code&gt;, with &lt;code&gt;Reply-To: edm.feedback@tiktok.com&lt;/code&gt;). This time they were addressed to a new handle that I also did not own. The messages were in French and purported to be social notifications from various TikTok users.&lt;/p&gt;
&lt;p&gt;After &lt;code&gt;@chezou2&lt;/code&gt; was deleted, someone else had created another account using my email address, and TikTok&amp;rsquo;s notification machinery had picked up where it left off.&lt;/p&gt;
&lt;p&gt;This made the issue clear. &lt;strong&gt;Deleting individual accounts will not fix anything as long as TikTok does not verify email ownership at registration.&lt;/strong&gt; The cycle can repeat indefinitely.&lt;/p&gt;
&lt;p&gt;A small but telling detail: the &lt;code&gt;Reply-To&lt;/code&gt; address on these emails is &lt;code&gt;edm.feedback@tiktok.com&lt;/code&gt;. In email marketing, &amp;ldquo;EDM&amp;rdquo; stands for &amp;ldquo;Electronic Direct Mail.&amp;rdquo; TikTok&amp;rsquo;s own infrastructure treats these notifications as part of its marketing email stack. This matters under CASL, as I will get to.&lt;/p&gt;
&lt;h2 id="two-regulatory-routes-in-canada"&gt;Two regulatory routes in Canada&lt;/h2&gt;
&lt;p&gt;I decided that filing individual deletion requests was a losing game and turned to the regulators. There are two distinct complaints that apply.&lt;/p&gt;
&lt;h3 id="casl--crtc"&gt;CASL → CRTC&lt;/h3&gt;
&lt;p&gt;CASL regulates the sending of unsolicited &amp;ldquo;commercial electronic messages&amp;rdquo; (CEMs). The relevant points for my case:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The CRTC has the power to impose substantial monetary penalties for violations.&lt;/li&gt;
&lt;li&gt;TikTok&amp;rsquo;s notification emails encourage engagement with the TikTok platform. They can plausibly be characterized as promoting commercial activity.&lt;/li&gt;
&lt;li&gt;The transactional or account-notification exemption should not apply here, because &lt;strong&gt;I am not the account holder&lt;/strong&gt;. TikTok has not verified, and structurally cannot verify, that my email address belongs to the account being notified.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The reporting channel is the Spam Reporting Centre (SRC)
at &lt;code&gt;spam@fightspam.gc.ca&lt;/code&gt;. A useful detail: you can simply email reports there, and forwarding the offending messages as &lt;code&gt;.eml&lt;/code&gt; attachments preserves the full headers. This is better evidence than what the online form captures.&lt;/p&gt;
&lt;p&gt;I sent one context email and attached five sample messages in &lt;code&gt;.eml&lt;/code&gt; format.&lt;/p&gt;
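&lt;p&gt;To illustrate what a raw &lt;code&gt;.eml&lt;/code&gt; file preserves, here is a minimal Python sketch that reads the routing headers with the standard library &lt;code&gt;email&lt;/code&gt; package. The sample message is a hypothetical stand-in: only the From and Reply-To addresses come from the messages described in this post, and the recipient, subject, and body are placeholders. &lt;code&gt;summarize_headers&lt;/code&gt; is an illustrative helper, not part of any reporting tool.&lt;/p&gt;

```python
from email import message_from_string

# Hypothetical sample: only the From and Reply-To addresses come from the
# messages described in this post; everything else is a placeholder.
SAMPLE_EML = """\
From: notification@service.tiktok.com
Reply-To: edm.feedback@tiktok.com
To: address-holder@example.com
Subject: placeholder subject

placeholder body
"""


def summarize_headers(raw_message):
    """Return a few headers that are available when the raw message is kept."""
    message = message_from_string(raw_message)
    wanted = ("From", "Reply-To", "To", "Subject")
    return {name: message[name] for name in wanted}


if __name__ == "__main__":
    for name, value in summarize_headers(SAMPLE_EML).items():
        print(f"{name}: {value}")
```

&lt;p&gt;A screenshot or a copy-pasted body would typically lose fields such as &lt;code&gt;Reply-To&lt;/code&gt;, which is why the attachment format is worth the extra step.&lt;/p&gt;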
&lt;p&gt;A caveat on expectations: the SRC does not provide individual case feedback. Submissions feed an enforcement-targeting dataset; you do not get a &amp;ldquo;your case has been resolved&amp;rdquo; notice. The CASL route is about contributing to enforcement signal, not about getting a personal resolution.&lt;/p&gt;
&lt;h3 id="pipeda--opc"&gt;PIPEDA → OPC&lt;/h3&gt;
&lt;p&gt;PIPEDA governs the collection, use, and disclosure of personal information. The relevant provision here is Schedule 1,
Principle 4.3 (Consent), read together with
s.6.1 &lt;strong&gt;(Valid consent)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Formal complaints to the OPC are filed through their
online complaint form. There are two things to know in advance.&lt;/p&gt;
&lt;h4 id="the-forms-automatic-stop"&gt;The form&amp;rsquo;s automatic stop&lt;/h4&gt;
&lt;p&gt;The form asks whether you have filed a complaint about the same matter with another body. I had filed with the CRTC SRC eight days earlier, so I answered &amp;ldquo;Yes.&amp;rdquo; The form ended my session: &amp;ldquo;If your complaint to another body covers any of your concerns with your personal information, please end this session and complete that complaint process first.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The problem: the SRC does not issue case outcomes. Waiting for that process to &amp;ldquo;complete&amp;rdquo; would mean never filing a PIPEDA complaint at all. The OPC&amp;rsquo;s online form was treating CASL and PIPEDA as overlapping, when in fact they cover different conduct.&lt;/p&gt;
&lt;h4 id="recovering-through-the-information-centre"&gt;Recovering through the Information Centre&lt;/h4&gt;
&lt;p&gt;The OPC&amp;rsquo;s
Information Centre accepts free-text inquiries up to 2,000 characters. I sent one explaining the situation and the difference between the CRTC and PIPEDA matters, and asking how to proceed past the form&amp;rsquo;s stop.&lt;/p&gt;
&lt;p&gt;A reply came within a day. The OPC suggested:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Contacting TikTok&amp;rsquo;s privacy contact (TikTok Inc., Culver City) in writing first.&lt;/li&gt;
&lt;li&gt;Then filing a formal complaint if unresolved.&lt;/li&gt;
&lt;li&gt;For the form&amp;rsquo;s &amp;ldquo;complaint with another body&amp;rdquo; question, &lt;strong&gt;answer &amp;ldquo;No&amp;rdquo; and explain the situation in the free-text section&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first point was easy. I had already done the equivalent in March, through TikTok&amp;rsquo;s own privacy report form (the same channel the OPC pointed me to) and the email correspondence that followed.&lt;/p&gt;
&lt;h4 id="filing-the-formal-complaint"&gt;Filing the formal complaint&lt;/h4&gt;
&lt;p&gt;The OPC online form is structured in three parts. Part A asks about steps already taken. Part B determines jurisdiction. Part C covers details and remedy. Part C has four free-text fields with character limits ranging from 500 to 2,500.&lt;/p&gt;
&lt;p&gt;The argument I built:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Principle 4.3 (Consent), read with s.6.1 (Valid consent).&lt;/strong&gt; TikTok cannot have any basis to believe the address holder consented, because it has no mechanism to verify email ownership. Section 6.1 requires that &amp;ldquo;an individual to whom the organization&amp;rsquo;s activities are directed would understand the nature, purpose and consequences&amp;rdquo; of the collection, use, or disclosure. TikTok cannot satisfy this when it has not even confirmed who the address holder is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No s.7 exception applies.&lt;/strong&gt; I am not a customer; there is no investigation, emergency, or any other listed basis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reasonable expectations of a non-user.&lt;/strong&gt; PIPEDA Report of Findings
(Facebook using non-members&amp;rsquo; email addresses to suggest friends) is directly on point. The Commissioner there found that a non-user could not reasonably expect a platform to use their email to create social connections, particularly when there is no prior relationship to the organization. The same logic applies to me and TikTok.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recurrence as evidence.&lt;/strong&gt; A second account appearing after the first one was deleted demonstrates that the issue is systemic, not an isolated incident.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What I asked for, in order of priority:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;Compliance Agreement under&lt;/strong&gt;
s.17.1 requiring TikTok to implement email verification at registration and a non-ownership-based privacy complaint process.&lt;/li&gt;
&lt;li&gt;Deletion of all accounts currently associated with my email.&lt;/li&gt;
&lt;li&gt;Adding my email to a registration blocklist.&lt;/li&gt;
&lt;li&gt;A public Report of Findings, given the systemic nature of the issue.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Attachments (within the 8-page and 25 MB limit): the March TikTok Support thread as a PDF, the OPC Information Centre response, the November security notification, and one representative April notification message.&lt;/p&gt;
&lt;h2 id="existing-opc--tiktok-context"&gt;Existing OPC × TikTok context&lt;/h2&gt;
&lt;p&gt;Worth knowing: in September 2025, the OPC, the CAI (Quebec), the OIPC BC, and the OIPC AB jointly issued
a Report of Findings on TikTok. That investigation focused on minors&amp;rsquo; consent and targeted advertising, not on non-user account creation, but the regulators have already concluded that TikTok&amp;rsquo;s overall consent practices have problems.&lt;/p&gt;
&lt;p&gt;That report also addressed the BC PIPA and PIPEDA jurisdictional question. In cross-border data flows, the federal and provincial laws operate in an &amp;ldquo;airtight seal&amp;rdquo; rather than excluding each other. For me, a BC resident filing against a Singapore-based organization, this confirms that PIPEDA jurisdiction is appropriate.&lt;/p&gt;
&lt;p&gt;So my complaint is not an isolated grievance. It joins an existing thread of regulatory scrutiny, from a different angle.&lt;/p&gt;
&lt;h2 id="what-i-did-in-summary"&gt;What I did, in summary&lt;/h2&gt;
&lt;p&gt;If you found this post because you are dealing with a similar situation, here is the sequence of what I did. The disclaimer at the top of the post still applies: this is not legal advice, and your situation may differ in ways that matter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 0: Preserve evidence.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Save the suspicious emails: sender, date, body, and full headers.&lt;/li&gt;
&lt;li&gt;If they are in your spam folder, rescue them before they auto-delete (Gmail purges spam after 30 days).&lt;/li&gt;
&lt;li&gt;Note the unauthorized account handle(s) referenced in the messages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Contact the organization in writing.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use whatever privacy reporting channel the organization provides (form, email, etc.).&lt;/li&gt;
&lt;li&gt;Cite PIPEDA explicitly. It changes the tone of the conversation.&lt;/li&gt;
&lt;li&gt;Make it clear you are not the account holder and are demanding deletion as the email holder, not as the user.&lt;/li&gt;
&lt;li&gt;Set a reasonable deadline. Mine was 30 days. This is not a statutory requirement, just a tactical one.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Report to the CRTC SRC under CASL&lt;/strong&gt; (if the messages have any commercial or promotional flavor).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Email &lt;code&gt;spam@fightspam.gc.ca&lt;/code&gt; directly. Forward the offending messages as &lt;code&gt;.eml&lt;/code&gt; attachments to preserve headers.&lt;/li&gt;
&lt;li&gt;Do not expect individual case feedback. The SRC uses submissions for enforcement targeting.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Step 3: File a PIPEDA complaint with the OPC.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If unsure how to proceed, send an inquiry to the OPC
Information Centre first (2,000-character limit; replies typically come within a day or two).&lt;/li&gt;
&lt;li&gt;File the
formal complaint through the online form once you have a clear path.&lt;/li&gt;
&lt;li&gt;Attachments are limited to roughly 8 pages and 25 MB. Choose evidence carefully. The full thread of correspondence with the organization is the most important.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Things to be aware of.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The OPC has a current backlog. Investigations can take months.&lt;/li&gt;
&lt;li&gt;The OPC does not have order-making power. It cannot impose fines or issue binding orders directly. The strongest tool it has is the Compliance Agreement under s.17.1, which is enforceable in Federal Court. Worth asking for explicitly.&lt;/li&gt;
&lt;li&gt;BC OIPC may have concurrent jurisdiction, but in cross-border situations the OPC handles routing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="a-note-on-ai-assisted-research"&gt;A note on AI-assisted research&lt;/h2&gt;
&lt;p&gt;Parts of the research, drafting, and form responses for this complaint were done with the help of an LLM (Claude). It was useful for surfacing precedents, structuring arguments, and drafting boilerplate quickly. It was also wrong about some specifics. For example, an early version of the analysis presented &amp;ldquo;30 days for organization response&amp;rdquo; as a PIPEDA requirement, which it is not. I caught that on a later pass, but the lesson applies generally.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you do this kind of research with an AI assistant, treat its output as a starting point, not as a citation.&lt;/strong&gt; Verify every legal reference (statute number, section, finding number, deadline, jurisdiction) against the primary source, such as laws-lois.justice.gc.ca for legislation, priv.gc.ca for OPC findings, or fightspam.gc.ca for CASL, before acting on it. AI tools hallucinate, and acting on a hallucinated regulatory deadline is the kind of mistake that has real consequences.&lt;/p&gt;
&lt;p&gt;For me, the value was in the working pattern. The LLM did the research scaffolding, I verified the primary sources, and I made the decisions about what went into formal submissions. That division of labour was practical and, I think, the right way to use these tools for legal-adjacent work.&lt;/p&gt;
&lt;hr&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-purple-100 dark:bg-purple-900 border-purple-500"
data-callout="important"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-purple-600 dark:text-purple-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M12 9v3.75m9-.75a9 9 0 1 1-18 0a9 9 0 0 1 18 0m-9 3.75h.008v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Important&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;&lt;strong&gt;Reminder:&lt;/strong&gt; Not legal advice. Verify primary sources. Consult a lawyer if your situation involves significant legal or financial stakes.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description></item><item><title>Embedding workflow templates in skills: shifting the LLM's role from "generation" to "rendering"</title><link>https://chezo.uno/blog/2026-03-29-embedding-workflow-templates-in-skills-shifting-the-llm-s-role-from-generation-to-rendering/</link><pubDate>Sat, 28 Mar 2026 18:23:00 -0700</pubDate><guid>https://chezo.uno/blog/2026-03-29-embedding-workflow-templates-in-skills-shifting-the-llm-s-role-from-generation-to-rendering/</guid><description>&lt;h2 id="the-dream-of-llm-powered-ml-workflow-generation"&gt;The dream of LLM-powered ML workflow generation&lt;/h2&gt;
&lt;p&gt;At my company, an ML feature called
provides capabilities like RFM analysis, recommendation, and contextual bandits. To run ML predictions at scale, the system calls an ML API that spins up parallel workers on AWS Batch behind the scenes. To make this parallelization work, input tables are aggregated per profile, which is a deliberate trade-off for scalability. These processes are orchestrated through digdag workflows (.dig files, executed on Treasure Workflow, a hosted digdag service) containing SQL (Hive or Trino), where the ML API is invoked via digdag&amp;rsquo;s &lt;code&gt;http&amp;gt;&lt;/code&gt; operator.&lt;/p&gt;
&lt;p&gt;Originally, the pre-processing and post-processing workflows were built by MLEs on a paid Professional Services team using their own templates, then deployed to customers who purchased PS engagements. But there was a desire to scale this beyond PS customers, and LLM-based workflow and SQL generation was seen as the path forward. Despite models getting better every day, generating stable workflows and SQL with LLMs proved difficult. For example, TD-specific UDFs in SQL don&amp;rsquo;t come naturally to the models. After several attempts, we had given up.&lt;/p&gt;
&lt;h2 id="the-rise-of-claude-code-and-agent-friendly-clis"&gt;The rise of Claude Code and agent-friendly CLIs&lt;/h2&gt;
&lt;p&gt;Then our CEO called for a push toward becoming an AI-native organization, and Claude Code was rolled out company-wide. Adoption spread beyond software engineers to PdMs, Solution Architects, and even sales. Two developments were particularly impactful:
tdx, a unified agent-friendly CLI that can call TD&amp;rsquo;s various microservice APIs, and
Treasure Studio, a desktop application built on top of tdx. With Claude Code and tdx working together, agents could create marketing journeys, analyze tables in TD, and visualize results. The CEO personally created onboarding challenge tasks for employees to accelerate adoption, and as a result, automation has been spreading across both internal and customer environments.&lt;/p&gt;
&lt;p&gt;Out of this momentum, a Skills marketplace was born for both
and internal use. These skills improve reproducibility and accelerate automation of complex tasks.&lt;/p&gt;
&lt;h2 id="what-the-agent-friendly-cli-made-possible"&gt;What the agent-friendly CLI made possible&lt;/h2&gt;
&lt;p&gt;The biggest contribution of tdx was exposing Treasure Workflow&amp;rsquo;s endpoints through a CLI, which let Claude Code autonomously create workflows, push them, run them, inspect the results, and iterate. Workflows can take anywhere from a few minutes to over an hour to execute, which has always made automated testing painful.&lt;/p&gt;
&lt;p&gt;Thanks to Claude Code + tdx, it became possible to generate a workflow, queue it for execution in the background, and verify the results. This was a game changer.&lt;/p&gt;
&lt;p&gt;An agent-friendly CLI that handles API interactions end-to-end is no longer a nice-to-have. It&amp;rsquo;s essential.&lt;/p&gt;
&lt;h2 id="turning-ml-workflow-templates-into-skills"&gt;Turning ML workflow templates into skills&lt;/h2&gt;
&lt;p&gt;That said, as I mentioned at the top, no matter how smart Claude Code&amp;rsquo;s models get, it&amp;rsquo;s still hard for an LLM to generate workflows that are consistently reliable for anyone to run, especially for customers with no prior knowledge.&lt;/p&gt;
&lt;p&gt;So I reframed the problem. If generating workflows and SQL from scratch is too hard, why not create templates that generate workflows and SQL from configuration parameters? By embedding templates in a skill, the LLM&amp;rsquo;s responsibility narrows from &amp;ldquo;generate workflows and SQL from scratch&amp;rdquo; to &amp;ldquo;choose the right parameters for the data and the problem.&amp;rdquo; The workflows themselves become deterministic since they&amp;rsquo;re pre-built as templates. Deterministic logic should live in scripts, not in the LLM&amp;rsquo;s probabilistic output.&lt;/p&gt;
&lt;p&gt;This idea was inspired by how cdp-api dynamically generates digdag workflows from database values.&lt;/p&gt;
&lt;script defer class="speakerdeck-embed" data-slide="29" data-id="dcef99361823438cb3b542784fa07b56" data-ratio="1.7772511848341233" src="//speakerdeck.com/assets/embed.js"&gt;&lt;/script&gt;
&lt;p&gt;
(Japanese, see slide 29 for the code example)&lt;/p&gt;
&lt;h2 id="jinja2-templates-for-digdag-workflows"&gt;Jinja2 templates for digdag workflows&lt;/h2&gt;
&lt;p&gt;Concretely, I templated the .dig files with Jinja2 and had the LLM focus on determining configuration values, with config.yml as the single source of truth for all modifiable parameters. Claude renders the templates directly from config.yml without any external tooling. Seeing the &lt;code&gt;.dig.j2&lt;/code&gt; extension for the first time gave me a small thrill.&lt;/p&gt;
&lt;p&gt;Internally, I use two kinds of variables: render-time parameters with &lt;code&gt;{{ }}&lt;/code&gt; and digdag runtime parameters with &lt;code&gt;${ }&lt;/code&gt;. The former handles things like branching on whether the SQL engine is Hive or Trino, or when algorithms and hyperparameter candidates can be fixed ahead of time. The latter is for cases like storing hyperparameter tuning results in a table, then dynamically assigning those values to a training task at runtime via SQL.&lt;/p&gt;
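&lt;p&gt;The split between the two variable kinds can be sketched in a few lines of Python. This is a simplified stand-in, not the actual setup: a small regex substitution plays the role of Jinja2 rendering, and the template fragment, the &lt;code&gt;engine&lt;/code&gt; config key, and the &lt;code&gt;tuned.learning_rate&lt;/code&gt; runtime variable are all hypothetical names chosen for illustration, not a validated digdag workflow.&lt;/p&gt;

```python
import re

# Hypothetical .dig.j2 fragment: {{ engine }} is resolved at render time,
# while ${tuned.learning_rate} is left for digdag to resolve at runtime.
TEMPLATE = """\
+train:
  engine: {{ engine }}
  learning_rate: ${tuned.learning_rate}
"""

# Matches only double-brace placeholders such as {{ engine }}.
RENDER_TIME = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, config):
    """Fill render-time placeholders from config; leave ${...} untouched."""

    def substitute(match):
        return str(config[match.group(1)])

    return RENDER_TIME.sub(substitute, template)


if __name__ == "__main__":
    print(render(TEMPLATE, {"engine": "trino"}))
```

&lt;p&gt;Because the two syntaxes do not overlap, the render step can run without any knowledge of what the workflow engine will substitute later.&lt;/p&gt;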
&lt;h2 id="openapi-as-the-contract-with-the-agent"&gt;OpenAPI as the contract with the agent&lt;/h2&gt;
&lt;p&gt;One tricky part of templating was that the ML API accepts complex parameters, and somehow the agent needs to understand all of them. Fortunately, our project managed ML endpoint parameters with OpenAPI, so we could hand the full spec to the agent.&lt;/p&gt;
&lt;p&gt;Our project uses
to generate model.py from the OpenAPI spec for parameter validation at runtime. Giving the machine-readable openapi.yml to the agent and having it translated into a markdown document within the skill turned out to work great. Long live standard formats.&lt;/p&gt;
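&lt;p&gt;As a rough illustration of what such a translation produces, the Python sketch below turns a schema fragment into a markdown parameter table. In the workflow described here the agent did the translation itself, so this is only one plausible shape of the result. The &lt;code&gt;SCHEMA&lt;/code&gt; dictionary, its parameter names, and &lt;code&gt;schema_to_markdown&lt;/code&gt; are hypothetical, and the fragment is assumed to be already loaded from the spec.&lt;/p&gt;

```python
# Hypothetical schema fragment in the style of an OpenAPI request body.
# The parameter names are made up for illustration.
SCHEMA = {
    "properties": {
        "algorithm": {"type": "string", "description": "Model family to train"},
        "max_iter": {"type": "integer", "description": "Upper bound on iterations"},
    },
    "required": ["algorithm"],
}


def schema_to_markdown(schema):
    """Render the schema properties as a markdown table, one row per parameter."""
    required = set(schema.get("required", []))
    lines = [
        "| name | type | required | description |",
        "| --- | --- | --- | --- |",
    ]
    for name, spec in schema["properties"].items():
        flag = "yes" if name in required else "no"
        lines.append(
            "| {} | {} | {} | {} |".format(
                name, spec.get("type", ""), flag, spec.get("description", "")
            )
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(schema_to_markdown(SCHEMA))
```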
&lt;h2 id="skill-creator-agent-vs-skill-user-agent"&gt;Skill-creator agent vs. skill-user agent&lt;/h2&gt;
&lt;p&gt;While building the skills, I asked Claude how best to test them. Its suggestion: spin up a separate agent process to exercise the skills through trial and error. I tried it, and it was an excellent experience.&lt;/p&gt;
&lt;p&gt;When you&amp;rsquo;re using the skill-user agent, you&amp;rsquo;re not reading the OpenAPI spec or skill documentation yourself. Instead, you start thinking in terms of what you want to try: &amp;ldquo;I want to run this algorithm with this parameter combination.&amp;rdquo; Normally, when doing manual sanity checks, the spec is already in your head, and you tend to skip the tedious, complex parameter combinations. But when an agent can do it for you, you get greedy.&lt;/p&gt;
&lt;p&gt;The skill-user agent came back and told me: &amp;ldquo;I looked at the skill, but that parameter combination isn&amp;rsquo;t supported in the OpenAPI spec yet.&amp;rdquo; I had assumed our QA end-to-end tests would have caught this, but it was a close call. Because the OpenAPI spec was maintained manually, the Python code internally supported the combination, but the spec was missing the parameter, so requests couldn&amp;rsquo;t pass through.&lt;/p&gt;
&lt;p&gt;I fixed the bug quickly, deployed to the development environment, and updated the skill. Claude then picked up the new parameter combination and used it as if it had always been there. Impressive.&lt;/p&gt;
&lt;p&gt;Building the skill alongside the product taught me that having a live execution environment pays for itself many times over.&lt;/p&gt;
&lt;h2 id="wrapping-up"&gt;Wrapping up&lt;/h2&gt;
&lt;p&gt;By sharing these skills on the internal marketplace, the workflow creation step that used to require paid PS engagements was simplified, and even customers without PS contracts could benefit.&lt;/p&gt;
&lt;p&gt;Treasure Studio also helped here: ML prediction results can now be visualized directly, making it easy to run analysis and model improvement cycles. Turning those analysis patterns into skills too seems like a natural next step, but that&amp;rsquo;s out of scope for this post.&lt;/p&gt;
&lt;p&gt;When I was drafting this post, I bounced ideas off Claude, and it argued that &amp;ldquo;the LLM&amp;rsquo;s strengths are understanding problem structure and parameter inference.&amp;rdquo; Providing an end-to-end CLI to support that lets agents run the generate-execute-fix loop autonomously. And by templating the complex parts of that loop, you create a clean division of labor: domain experts build the templates, and agents (used by people without that domain knowledge) fill in the parameters.&lt;/p&gt;</description></item><item><title>Migrated from Pages CMS to Sveltia CMS</title><link>https://chezo.uno/blog/2026-03-19-migrated-from-pages-cms-to-sveltia-cms/</link><pubDate>Thu, 19 Mar 2026 16:41:00 -0700</pubDate><guid>https://chezo.uno/blog/2026-03-19-migrated-from-pages-cms-to-sveltia-cms/</guid><description>&lt;p&gt;I
, but after encountering several concerns, I migrated to Sveltia CMS.&lt;/p&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;p&gt;As I wrote in
, the move was triggered by the following issues with Pages CMS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It handled time zones poorly (essentially forcing everything to +00:00), and even after
, there didn&amp;rsquo;t seem to be much interest in fixing it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It was quite troublesome to achieve the directory structure recommended by Hugoblox, where content (index.md) and images are placed in the same folder. I ended up having to upload images manually.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-i-did"&gt;What I did&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Deployed Sveltia CMS Auth to Cloudflare Workers for authentication.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Had Claude migrate the .pages.yml from Pages CMS to static/admin/config.yml.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Copy-pasted static/admin/index.html from the documentation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Used Claude to restore the time zone data that Pages CMS had dropped.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The biggest hassle was deploying Sveltia CMS Auth, but since I was already using Cloudflare Pages, all I had to do was click the deploy button in the README at
and follow the instructions. It was simple. It reminded me of Heroku.&lt;/p&gt;
&lt;p&gt;For details, please refer to the following PRs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="impressions"&gt;Impressions&lt;/h2&gt;
&lt;p&gt;Pages CMS had been bothering me with a few minor annoyances, and it was great to see them resolved here. For instance, loading 400+ posts takes over 10 seconds in Pages CMS, but Sveltia CMS handles it in about 2. It&amp;rsquo;s fast enough, isn&amp;rsquo;t it?&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;M↓&lt;/code&gt; button is a lifesaver too. Being able to drop into raw Markdown when the editor misbehaves means I no longer have to open GitHub and edit files directly, which was a nightmare, especially on mobile.&lt;/p&gt;
&lt;p&gt;The attention to detail really shows. While writing this post, a minor GitHub outage hit, and I actually got a warning about it. Impressive for something that runs entirely client-side.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Warning notification of GitHub issue on Sveltia CMS"
srcset="https://chezo.uno/blog/2026-03-19-migrated-from-pages-cms-to-sveltia-cms/pasted-image-1773964049625_hu_7890cd6f51891eaa.webp 320w, https://chezo.uno/blog/2026-03-19-migrated-from-pages-cms-to-sveltia-cms/pasted-image-1773964049625_hu_4e64cb3d48029ebe.webp 480w, https://chezo.uno/blog/2026-03-19-migrated-from-pages-cms-to-sveltia-cms/pasted-image-1773964049625_hu_3899ba75bf4a4d04.webp 691w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2026-03-19-migrated-from-pages-cms-to-sveltia-cms/pasted-image-1773964049625_hu_7890cd6f51891eaa.webp"
width="691"
height="94"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Uploading this screenshot straight from the clipboard was seamless as well. In Pages CMS, getting a screenshot into the right folder alongside the content was a real pain, so this one stood out. (This feature
. What a sense of speed!)&lt;/p&gt;
&lt;p&gt;The author, kyoshino, is a Japanese speaker, and it is clear that they are mindful of the IME input issues that we CJK (Chinese, Japanese, Korean) users often encounter. Being able to type without stress is truly important.&lt;/p&gt;
&lt;p&gt;I plan to enjoy trying it out for a while, and unless any major issues arise, I think I will stick with it for the foreseeable future.&lt;/p&gt;</description></item><item><title>Between Principal and Glue Work</title><link>https://chezo.uno/blog/2026-03-08-2026-03-08-between-principan-and-glue-work/</link><pubDate>Sun, 08 Mar 2026 13:07:00 -0700</pubDate><guid>https://chezo.uno/blog/2026-03-08-2026-03-08-between-principan-and-glue-work/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I have always worked with the mindset that being a Staff+ Engineer is essentially doing high-level &amp;ldquo;
.&amp;rdquo; Because the tasks are surprisingly diverse, I have written about them
, but I have never quite been able to cohesively organize my thoughts.&lt;/p&gt;
&lt;p&gt;And for a while, I also struggled with how to effectively express the impact I&amp;rsquo;ve made on my resume. This is because the role requires catching things that fall through the cracks far more often than one might expect, and while delivering value through these tasks is incredibly important, resumes tend to demand flashy, shiny achievements (like
).&lt;/p&gt;
&lt;p&gt;My stance has always been, &amp;ldquo;I will do whatever it takes to deliver a valuable product to the customer.&amp;rdquo; The scope of &amp;ldquo;whatever it takes&amp;rdquo; probably varies from person to person. So, as a starting point, let&amp;rsquo;s unravel this by comparing it with Will Larson&amp;rsquo;s
, which are well-known to those familiar with
.&lt;/p&gt;
&lt;h2 id="the-4-staff-archetypes"&gt;The 4 Staff Archetypes&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s recap Will Larson&amp;rsquo;s four Staff Archetypes.&lt;/p&gt;
&lt;h3 id="1-tech-lead"&gt;1. Tech Lead&lt;/h3&gt;
&lt;p&gt;A role deeply involved with a specific team (or a few teams), leading technical direction and execution.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Characteristics: Takes responsibility for the team&amp;rsquo;s technical decisions, fleshing out complex tasks, and unblocking progress.&lt;/li&gt;
&lt;li&gt;Primary Activities: Focuses more on shaping the overall technical vision for the team, mentoring members, and coordinating with product managers rather than pure implementation. This is the most common archetype and a natural extension from a senior engineer role.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="2-architect"&gt;2. Architect&lt;/h3&gt;
&lt;p&gt;A role responsible for cross-organizational success and quality within a specific technical domain (e.g., API design, frontend, infrastructure strategy).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Characteristics: Formulates technical strategies spanning multiple teams and maintains long-term technical alignment.&lt;/li&gt;
&lt;li&gt;Primary Activities: Deeply understands business needs and technical constraints to guide the overall architecture of the organization. Needed in large organizations or companies with complex systems burdened by accumulated debt.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="3-solver"&gt;3. Solver&lt;/h3&gt;
&lt;p&gt;A &amp;ldquo;firefighter&amp;rdquo; role that moves beyond a specific team to solve critical and difficult technical challenges for the organization.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Characteristics: Deployed to problems with high execution risk or complex issues where the solution is unclear.&lt;/li&gt;
&lt;li&gt;Primary Activities: Goes to the frontline where the problem is occurring based on requests from leadership, and moves on to the next challenge once resolved. Requires pure technical breakthrough ability rather than organizational coordination.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="4-right-hand"&gt;4. Right Hand&lt;/h3&gt;
&lt;p&gt;A role acting as the &amp;ldquo;right hand&amp;rdquo; to executives like the CTO or VP, borrowing their authority to solve complex organizational problems.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Characteristics: Operates at the intersection of technology, business, people, and processes to spread the executive&amp;rsquo;s intent throughout the organization.&lt;/li&gt;
&lt;li&gt;Primary Activities: Attends executive meetings and helps remove organizational bottlenecks and execute strategies. A rare archetype found in massive organizations with hundreds of engineers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, Will Larson also states the following (quoted from
):&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;This taxonomy is more focused on being &lt;em&gt;useful&lt;/em&gt; than complete, but so far, I’ve been able to fit every Staff-plus engineer I’ve spoken to into one of these categories. Admittedly, some folks are easier to classify than others.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I started writing this article because I couldn&amp;rsquo;t help but feel that I might be one of those &amp;ldquo;harder to classify&amp;rdquo; ones.&lt;/p&gt;
&lt;h2 id="breaking-down-my-work-by-archetypes"&gt;Breaking Down My Work by Archetypes&lt;/h2&gt;
&lt;p&gt;Let me briefly explain my positioning as a Principal Software Engineer or Tech Lead at my current workplace.&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t have the official title of Tech Lead, but I operated as the engineer overall responsible for driving the engineering of the company&amp;rsquo;s ML product development. Regarding the organizational structure, it was reorganized about a year ago by the CTO and VPoE. I, as a Principal, and the Engineering Manager of the ML team existed as peers, reporting directly to the Sr Engineering Director who oversaw multiple engineering teams. This is what is often called a &amp;ldquo;Two-in-a-box&amp;rdquo; style.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s break down the main things I&amp;rsquo;ve done over the past three years using the archetypes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;As a Tech Lead / Architect&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Consistently led the grand design, PoC, scale validation, and release of the ML training and prediction infrastructure (Python, FastAPI, AWS Batch), successfully releasing it with 2 people in 5 months.&lt;/li&gt;
&lt;li&gt;Scaled the RFM prediction processing, making it up to 100x faster and supporting processing for 1 billion users.&lt;/li&gt;
&lt;li&gt;Implemented the PoC for the recommendation ML solution model and selected scalable algorithms.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;As a Right Hand&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Directly explained the engineering roadmap to the CTO and product leadership to secure sponsorship.&lt;/li&gt;
&lt;li&gt;Secured sponsorship for a complete revamp of the ML infrastructure over a year and drove it through to release.&lt;/li&gt;
&lt;li&gt;Drafted the product roadmap and proposed product direction not only to Engineering but also to the Product Manager and VPoP.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;As a Solver&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;When an escalation came from the PS team saying, &amp;ldquo;The customer needs this feature right now,&amp;rdquo; I jumped in as a firefighter, resolved the issue, and released it to production in 1-2 weeks.&lt;/li&gt;
&lt;li&gt;Rewrote incomplete code from a project where the engineer who originally worked on it had moved on, elevating it to production-grade.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As I was writing this, I realized I also paid attention to things like the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Task assignment considering members&amp;rsquo; motivation and project planning with their careers in mind (e.g., taking on tedious tasks like building admin consoles myself and handing challenging tasks to members).&lt;/li&gt;
&lt;li&gt;Creating draft UI design proposals to communicate the complexity of the data model to UX designers.&lt;/li&gt;
&lt;li&gt;Fundamentally rewriting JDs and acting as a gatekeeper in interviews.&lt;/li&gt;
&lt;li&gt;Providing indirect performance evaluation input to the EM during 1:1s.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, what I was doing was essentially
&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; . However, my focus was on catching things that fall through the cracks to prevent the project from failing and ensure its success, giving it my all according to their importance. Because of this, I often wondered what my actual responsibilities were, but reading Staff Engineer books out there made me realize that more or less everyone does this, so I gritted my teeth and kept unblocking things.&lt;/p&gt;
&lt;h2 id="why-does-the-scope-of-staff-expand"&gt;Why Does the Scope of Staff+ Expand?&lt;/h2&gt;
&lt;p&gt;Generally, the Two-in-a-box model is well known, although the breadth of role division varies. For example, in &amp;ldquo;
,&amp;rdquo; Charity Majors writes about the collaboration between EMs and TLs:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;There is an enormous demand for technical engineering leaders — far more demand than supply.  The most common hackaround is to pair a people manager (who can speak the language and knows the concepts, but stopped engineering ages ago) with a tech lead, and make them collaborate to co-lead the team.  This unwieldy setup often works pretty well.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Also, at
, every team has an EM, and for many projects, a TL is placed separately to collaborate with the EM, or if there is no TL, a Tech Lead Manager holds both roles. In their case, the division of roles between EM and TL is as follows:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Engineering Manager primarily focuses on people management (staffing, coaching &amp;amp; growth) and organizational strategy (organizational risk, operational efficiency, team charter &amp;amp; outcomes)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tech Lead primarily focuses on technical leadership (technical execution, technical strategy, technical culture, roadmap feasibility &amp;amp; execution).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this way, the advantage of Two-in-a-box is sharing the cognitive load so that attention can be paid to various areas.&lt;/p&gt;
&lt;p&gt;However, in domains requiring deep technical knowledge like ML, if there is a gap in technical depth between the EM and the TL, it inevitably becomes difficult to make highly accurate technical decisions. As a result, the &amp;ldquo;division&amp;rdquo; turns into a broad &amp;ldquo;delegation&amp;rdquo; to the TL, and the scope of the TL expands. In my case, for instance, my decisions as a TL extended to task assignment and resource allocation, so I was delegated some of the responsibilities that Asana assigns to the EM.&lt;/p&gt;
&lt;p&gt;In my case, I felt the Two-in-a-box system worked better than expected. Especially after we clearly defined our boundaries and respective territories, we just ran autonomously. By fundamentally dividing roles—People Management for the EM and Technical Leadership for the TL—and reporting as peers to a common Sr Engineering Director, the expansion of the TL&amp;rsquo;s scope became organizationally acceptable. Before this structure was put in place, we frequently had minor clashes that looked like we were stepping on each other&amp;rsquo;s toes, which made me realize how important it is to build a good organizational structure.&lt;/p&gt;
&lt;p&gt;Ultimately, if the EM and TL trust each other and have a system where they can work autonomously, it&amp;rsquo;s a trivial matter which way the border leans a bit.&lt;/p&gt;
&lt;h2 id="reflecting-on-my-career-or-redefining-glue-work"&gt;Reflecting on My Career, or Redefining &amp;ldquo;Glue Work&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve written a lot so far, but if I were to reorganize the work I&amp;rsquo;ve done and the scope I&amp;rsquo;ve expanded based on the Archetypes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Technical decision-making and delivery as a Tech Lead/Architect&lt;/li&gt;
&lt;li&gt;Proposing technical strategies to executives and gaining approval as a Right Hand&lt;/li&gt;
&lt;li&gt;Critical problem-solving skills as a Solver
&lt;ul&gt;
&lt;li&gt;Occasionally stepping into scopes beyond Staff+ (PdM, PjM, UX Designer, parts of EM) when necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It turns out that doing glue work as a Staff+ is not just doing random chores, but rather &amp;ldquo;carrying multiple Staff+ functions necessary for the organization all by oneself.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Will Larson might be a little surprised too.&lt;/p&gt;
&lt;p&gt;But well, I think someone who can move like this would be highly valued in a company with a strong startup culture.&lt;/p&gt;
&lt;h2 id="to-those-with-the-same-struggles"&gt;To Those With the Same Struggles&lt;/h2&gt;
&lt;p&gt;If you ever feel like you&amp;rsquo;re losing sight of your core value while doing glue work as a Staff+ engineer, I recommend looking back at the Archetypes and verbalizing what you are actually achieving.&lt;/p&gt;
&lt;p&gt;In the AI era, verbalizing allows for deeper exploration, so it might be good to bounce ideas off an AI to articulate your thoughts. I myself have been trying this after a former colleague of mine,
, told me that dialoguing with AI for introspection deepens self-understanding and helps find directions for growth. It demands real concentration and can be draining, but the rewards are significant.&lt;/p&gt;
&lt;p&gt;Also, to keep me sane, the case studies in books about Staff Engineers have been a great support. It would be good to keep both
and
close at hand. Reading them after a painful experience is deeply enriching, and they resonate more the more you digest them.&lt;/p&gt;
&lt;p&gt;Doing glue work might just be a sign that you are thriving as a Staff+. However, never forget to prioritize your tasks by considering their business impact.&lt;/p&gt;
&lt;div class="footnotes" role="doc-endnotes"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;In the Japanese engineering community, there&amp;rsquo;s a term &amp;ldquo;高機能雑用&amp;rdquo; (kōkinō zatsuyō) — roughly &amp;ldquo;High-functioning general-purpose grunt work.&amp;rdquo; It originated from
who, when asked by his director what he actually does, replied: &amp;ldquo;I do Hadoop, Hive, MySQL, machine learning, log analysis, NLP, VBA, and firefighting — but that&amp;rsquo;s too much to explain, so I just say &amp;lsquo;High-functioning general-purpose grunt work.&amp;rsquo;&amp;rdquo; That&amp;rsquo;s pretty much what being a Staff+ engineer feels like.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink"&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</description></item><item><title>Tackling Review Fatigue by Document Driven Agentic Coding</title><link>https://chezo.uno/blog/2025-09-19-review-fatigue/</link><pubDate>Fri, 19 Sep 2025 21:26:00 -0700</pubDate><guid>https://chezo.uno/blog/2025-09-19-review-fatigue/</guid><description>&lt;p&gt;This year, the amount of time I spend on reviews has exploded. This applies to both code and documentation. And the fatigue from this has also increased dramatically.&lt;/p&gt;
&lt;p&gt;Of course, this is because the speed and volume of output have increased with the help of LLMs. However, the quality hasn&amp;rsquo;t necessarily improved; in fact, I feel my productivity has declined.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m going to do a brain dump to write down what the challenges are and how I&amp;rsquo;m dealing with them.&lt;/p&gt;
&lt;h2 id="proxy-prompting-through-other-humans"&gt;Proxy Prompting Through Other Humans&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve been working as a tech lead on machine learning projects for a while now. As part of my job, I often write technical documents like
, but I also have many opportunities to review them.&lt;/p&gt;
&lt;p&gt;The difficult thing about machine learning is that when a project changes and the problem to be solved shifts even slightly, a completely different set of knowledge becomes necessary, requiring you to read new research papers.&lt;/p&gt;
&lt;p&gt;Imagine you&amp;rsquo;ve been building web applications with Rails for years. Then one day, someone hands you the specification document for a Ruby parser generator, saying, &amp;ldquo;It&amp;rsquo;s Ruby, so it should be the same, right?&amp;rdquo; You can tell it was all generated by an AI, but how do you review it? How do you provide feedback to make it a sound specification? You would probably start by reading implementations in other languages or research papers to build a mental index before you could even begin the review.&lt;/p&gt;
&lt;p&gt;Having to review specifications in such a world is truly exhausting. It&amp;rsquo;s a struggle just to get to a point where you have more knowledge than the generator. And after you&amp;rsquo;ve made the effort to provide appropriate comments, what awaits you are rebuttals made entirely by a generative AI, and the revisions are drastic, overwriting everything on a large scale without regard for the previous context.&lt;/p&gt;
&lt;p&gt;I describe this situation as &amp;ldquo;prompting through others,&amp;rdquo; and frankly, in most cases, it&amp;rsquo;s faster for me to just read the papers and discuss them with an LLM myself.&lt;/p&gt;
&lt;p&gt;Some readers might think, &amp;ldquo;Why not just have an LLM review the text generated by an LLM?&amp;rdquo; But this is fraught with a difficult problem: who guarantees the correctness of the LLM&amp;rsquo;s review? I often feel this even when I&amp;rsquo;m interacting with an LLM myself - &amp;ldquo;
.&amp;rdquo; (
), or
. The output is inconsistent; if you ask the same question three times, you&amp;rsquo;ll often get three different answers.&lt;/p&gt;
&lt;p&gt;I believe that the limits of current LLMs and generative AI are defined entirely by the limits of human ability. If the person acting as a proxy continues to be just a proxy, they will become unnecessary. It&amp;rsquo;s important to escalate this appropriately to a manager.&lt;/p&gt;
&lt;p&gt;As a side note, text generated by LLMs can often be detected by the high frequency of certain words or the
. But for non-native English speakers like me, it&amp;rsquo;s often given away by the presence or absence of quirks specific to the writer&amp;rsquo;s native language. So, I&amp;rsquo;m trying to be confident in the results I generate and pass them on to others as my own opinions.&lt;/p&gt;
&lt;h2 id="lack-of-self-review-or-the-eyes-glazing-over-problem"&gt;Lack of Self-Review, or the &amp;ldquo;Eyes Glazing Over&amp;rdquo; Problem&lt;/h2&gt;
&lt;p&gt;As I wrote in the previous section, it&amp;rsquo;s often the case that people don&amp;rsquo;t self-review what they&amp;rsquo;ve generated. This only breeds mistrust, so it&amp;rsquo;s best to stop immediately. However, the reality is that even with self-review, it&amp;rsquo;s easy for your eyes to just glaze over the content.&lt;/p&gt;
&lt;p&gt;I learned in a 1-on-1 with my manager that there&amp;rsquo;s a tendency for people, even the same person, to be more lenient when reviewing code they generated themselves and stricter when reviewing code generated by others. I get it.&lt;/p&gt;
&lt;p&gt;When I think about why this happens, it&amp;rsquo;s because non-hand-crafted code often doesn&amp;rsquo;t get stored in your brain&amp;rsquo;s cache. Furthermore, when generating code with something like Claude Code, the user experience makes it tedious to look at every diff. So, you end up generating something that works in &amp;ldquo;auto-approval mode&amp;rdquo; and then reviewing it. But when there are massive changes, you can&amp;rsquo;t keep up with the details. And you can&amp;rsquo;t resist thinking, &amp;ldquo;Well, the unit tests are passing, so it should be okay.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="big-bang-commit"&gt;Big Bang Commit&lt;/h2&gt;
&lt;p&gt;Recently, I received a pull request whose diff was &lt;code&gt;+10,000 -7,000&lt;/code&gt; lines in a single commit. Yes, a big bang commit PR.&lt;/p&gt;
&lt;p&gt;When you generate code with an LLM, especially in an auto-approval mode, the agent goes through a lot of trial and error. And a characteristic of agentic coding is that it overwrites large chunks of code aggressively. The existing design, which would have been respected if the same person were writing it by hand, was largely ignored in the PR generated by the LLM.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s research suggesting that
. I read a paper that
. Yeah, that makes sense.&lt;/p&gt;
&lt;p&gt;Also, LLMs don&amp;rsquo;t automatically break down tasks into an appropriate granularity, even though Claude Code and GitHub Copilot create a TODO list. It&amp;rsquo;s quite difficult unless you give them careful instructions.&lt;/p&gt;
&lt;p&gt;By the way, making appropriately sized commits is a universal and very difficult problem that remains open, especially for exploratory implementations like those in machine learning.&lt;/p&gt;
&lt;h2 id="the-difference-in-incentive-structures-between-the-generator-and-the-reviewer"&gt;The Difference in Incentive Structures Between the Generator and the Reviewer&lt;/h2&gt;
&lt;p&gt;Many leadership teams want to claim that &amp;ldquo;using LLMs increases coding productivity.&amp;rdquo; I understand why: many people strongly believe in AI-driven productivity improvements, they need to prove the ROI, and the hype says it should be true.&lt;/p&gt;
&lt;p&gt;Now, I&amp;rsquo;m aware there are various debates about code generation throughput, but I think it&amp;rsquo;s possible it will increase. However, the reviewer&amp;rsquo;s side hasn&amp;rsquo;t gotten any faster yet. It will probably take a little more time (at least a year?) before we can entrust all reviews to LLMs. Also, we can&amp;rsquo;t avoid the extra step of reviewing LLM-generated code.&lt;/p&gt;
&lt;p&gt;We end up with a growing volume of code that received only a lenient self-review because the creator&amp;rsquo;s eyes glazed over. In other words, the generator&amp;rsquo;s goal is to maximize generation throughput, while the reviewer, regardless of that, must ensure quality. This difference in incentive structures is likely the main cause of review fatigue.&lt;/p&gt;
&lt;h2 id="how-to-approach-code-reviews-mostly-generated-by-llm"&gt;How to Approach Code Reviews Mostly Generated by LLM&lt;/h2&gt;
&lt;p&gt;So, what am I doing about it? Honestly, I haven&amp;rsquo;t found a silver bullet yet, but I want to share the results of my various experiments.&lt;/p&gt;
&lt;h3 id="1-the-generator-reviews-their-own-code-as-someone-elses-on-github"&gt;1. The Generator Reviews Their Own Code as Someone Else&amp;rsquo;s on GitHub&lt;/h3&gt;
&lt;p&gt;This comes from my own reflection on a time I made a rather large commit (+700 lines or so). Recently, in addition to reviewing and approving the code generated by the LLM in my local VSCode every time during implementation, I&amp;rsquo;ve also started reviewing the PR in a draft state on GitHub.&lt;/p&gt;
&lt;p&gt;By doing this, my mindset switches to &amp;ldquo;reviewer mode,&amp;rdquo; and I&amp;rsquo;ve been able to find various flaws.&lt;/p&gt;
&lt;p&gt;It may seem obvious, but I think using a dedicated review view is good because it puts you in the same mental model as when you&amp;rsquo;re reviewing someone else&amp;rsquo;s code.&lt;/p&gt;
&lt;h3 id="2-document-driven-development"&gt;2. Document-Driven Development&lt;/h3&gt;
&lt;p&gt;In the story of developing a Java version manager called
(Rust based Java version manager
, several helpful initiatives were taken, so I tried them myself.&lt;/p&gt;
&lt;p&gt;The three important points I learned from the
article and the actual commits are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Always give the LLM a task size that a human can implement in about 30 minutes (to avoid context overflow).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Follow a flow of requirements definition -&amp;gt; external design -&amp;gt; work plan.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Furthermore, commit the above documents along with the code.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A very important point is that by committing the requirements document, external design document, and work plan document (and acceptance tests) as documentation along with the implementation, the intent becomes easier for humans to understand, even if the commit is somewhat large. Also, by properly organizing the work plan, you give the LLM guardrails for developing and committing at an appropriate granularity.&lt;/p&gt;
&lt;p&gt;I tried this method myself by open-sourcing a script I had written as an internal tool.&lt;/p&gt;
&lt;p&gt;I can&amp;rsquo;t show you the proper commits from when I was adjusting the internal code for open-sourcing, but the general workflow was as follows: (GitHub Copilot + Sonnet 4)&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;First, discuss what I want to do with Sonnet 4 in Agent mode and organize it into &lt;code&gt;docs/requirements.md&lt;/code&gt;, &lt;code&gt;docs/interface.md&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Have it create a TODO list with checkboxes for each phase of the implementation and save it as &lt;code&gt;docs/plan.md&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Have it create an &lt;code&gt;AGENTS.md&lt;/code&gt; file containing these documents and development conventions, toolsets, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Implement each phase and commit. At that time, check off the item in &lt;code&gt;plan.md&lt;/code&gt; and commit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When moving to the next phase, clear the agent&amp;rsquo;s context and have it read &lt;code&gt;AGENTS.md&lt;/code&gt; and &lt;code&gt;plan.md&lt;/code&gt; to start the work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When the planned phase is finished and it&amp;rsquo;s time to move to the next milestone, create a folder like &lt;code&gt;docs/milestone1&lt;/code&gt;, move &lt;code&gt;plan.md&lt;/code&gt; there, and make &lt;code&gt;docs/plan.md&lt;/code&gt; an empty file to start development on the new milestone.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
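&lt;p&gt;As a rough illustration of steps 4 and 6, the bookkeeping around &lt;code&gt;plan.md&lt;/code&gt; can be scripted. This is only a minimal Python sketch under assumptions: the function names (&lt;code&gt;next_open_item&lt;/code&gt;, &lt;code&gt;check_off&lt;/code&gt;, &lt;code&gt;archive_milestone&lt;/code&gt;) are hypothetical, and the plan is assumed to use Markdown task-list checkboxes (&lt;code&gt;- [ ]&lt;/code&gt; and &lt;code&gt;- [x]&lt;/code&gt;).&lt;/p&gt;

```python
import re
import tempfile
from pathlib import Path

# Matches a Markdown task-list line such as "- [ ] Parse CLI args".
# Groups: (prefix up to "[", mark, "] ", item text). Assumed plan format.
CHECKBOX = re.compile(r"^(\s*[-*] \[)([ xX])(\] )(.*)$")


def next_open_item(plan_text):
    """Return the text of the first unchecked item, or None if all are done."""
    for line in plan_text.splitlines():
        m = CHECKBOX.match(line)
        if m and m.group(2) == " ":
            return m.group(4)
    return None


def check_off(plan_text, item):
    """Mark the first unchecked line whose text equals `item` as done."""
    out, done = [], False
    for line in plan_text.splitlines():
        m = CHECKBOX.match(line)
        if m and not done and m.group(2) == " " and m.group(4) == item:
            line = m.group(1) + "x" + m.group(3) + m.group(4)
            done = True
        out.append(line)
    return "\n".join(out) + "\n"


def archive_milestone(docs_dir, name):
    """Move docs/plan.md into docs/NAME/ and leave an empty docs/plan.md."""
    docs = Path(docs_dir)
    target = docs / name
    target.mkdir(parents=True, exist_ok=True)
    (docs / "plan.md").rename(target / "plan.md")
    (docs / "plan.md").write_text("")
    return target / "plan.md"


if __name__ == "__main__":
    # Demo in a throwaway directory so no real repository is touched.
    plan = "# Phase 1\n- [x] Write requirements\n- [ ] Parse CLI args\n"
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp) / "docs"
        docs.mkdir()
        (docs / "plan.md").write_text(check_off(plan, "Parse CLI args"))
        print(archive_milestone(docs, "milestone1"))
```

&lt;p&gt;An agent or a human could call &lt;code&gt;next_open_item&lt;/code&gt; at the start of each fresh context to pick the next task, which keeps each commit tied to one checkbox.&lt;/p&gt;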
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="Commit history"
srcset="https://chezo.uno/blog/2025-09-19-review-fatigue/featured_hu_f403e92ef53c0678.webp 320w, https://chezo.uno/blog/2025-09-19-review-fatigue/featured_hu_83511fb07c3df511.webp 480w, https://chezo.uno/blog/2025-09-19-review-fatigue/featured_hu_97f6c05b091334f3.webp 481w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2025-09-19-review-fatigue/featured_hu_f403e92ef53c0678.webp"
width="481"
height="760"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;The actual PR for the open-source project looks like this:&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;Honestly, the commit granularity is large, and several features are committed together, but if I wanted to, I could easily make a commit for each checkbox because I&amp;rsquo;m managing the &lt;code&gt;plan.md&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;In other words, the good thing is that you can understand if there are any major problems at the design level from the documents. If the design is flawed, you can go back and discuss it there without having to read the implementation, which I believe helps avoid pointless reviews.&lt;/p&gt;
&lt;h2 id="how-to-approach-document-reviews-mostly-generated-by-llm"&gt;How to Approach Document Reviews Mostly Generated by LLM&lt;/h2&gt;
&lt;p&gt;Honestly, I have no answer yet. What a colleague said really resonates with me: &amp;ldquo;People who blindly accept and pass on LLM-generated text are no different from the types who believe something is true just because a TV host said it on a variety show.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Based on my experience, I discuss how to give feedback and how to word it with Gemini. The LLM helps me find appropriate wording for feedback or an escalation and suggests an escalation strategy as needed. (Gemini is a lonely Principal&amp;rsquo;s mentor!) I have it enumerate the problematic data points and summarize them. Summarization is inherently a strong suit of LLMs, and models tuned by American companies are good at the American style of feedback.&lt;/p&gt;
&lt;h2 id="continue-struggling"&gt;Continue Struggling&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve written about various challenges and what I&amp;rsquo;ve done about them, but honestly, I&amp;rsquo;m still feeling my way through this. However, fundamentally, the limit of an LLM is the limit of the human using it. Therefore, it will still be necessary for humans to create workflows with a primary focus on how to maximize their own abilities.&lt;/p&gt;</description></item><item><title>Configured Pages CMS</title><link>https://chezo.uno/blog/2025-08-24-configured-pages-cms/</link><pubDate>Sun, 24 Aug 2025 15:11:00 -0700</pubDate><guid>https://chezo.uno/blog/2025-08-24-configured-pages-cms/</guid><description>&lt;p&gt;I had been looking for a way to write Hugo articles on mobile devices like my iPad, and that&amp;rsquo;s when I came across this article by mehori and decided to try Pages CMS.&lt;/p&gt;
&lt;p&gt;To be honest, it doesn&amp;rsquo;t properly support Hugo&amp;rsquo;s folder structure (&lt;code&gt;/articlename/index.md&lt;/code&gt;), which makes image uploads a bit tricky. On top of that, I couldn&amp;rsquo;t upload images from my iPhone because of a 413 error. But, since I was able to set up and write text-only articles in Japanese from my iPad, I&amp;rsquo;ll count it as a win.&lt;/p&gt;
&lt;p&gt;Along the way, I ran into a very minor YAML frontmatter parsing error, but I was able to find a workaround.
&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a platform where you can figure things out by reading the code, and it has the bare minimum functionality, so it&amp;rsquo;s good enough for my needs.&lt;/p&gt;
&lt;p&gt;Here are my current settings:
&lt;/p&gt;
&lt;p&gt;I also migrated from &lt;strong&gt;wowchemy&lt;/strong&gt; to &lt;strong&gt;hugo-blox&lt;/strong&gt;, which was quite a hassle. You can find more details in the PR. But seriously, the names have changed way too many times: &lt;strong&gt;Hugo Academic&lt;/strong&gt; -&amp;gt; &lt;strong&gt;Wowchemy&lt;/strong&gt; -&amp;gt; &lt;strong&gt;Hugo Blox&lt;/strong&gt;&amp;hellip;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I also fixed an issue where Amazon affiliate images weren&amp;rsquo;t showing up by changing the settings to not display images.
&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After the migration, I realized the search wasn&amp;rsquo;t working, but I managed to get it running by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Adding a setting to create a &lt;strong&gt;pagefind&lt;/strong&gt; index in the Cloudflare build command -&amp;gt; &lt;code&gt;&amp;amp;&amp;amp; npx pagefind --source 'public'&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Trying various things with &lt;strong&gt;Cloudflare&amp;rsquo;s Rocket Loader&lt;/strong&gt;, but it didn&amp;rsquo;t work, so I ended up disabling it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now it&amp;rsquo;s working somehow, but it seems like the Japanese search won&amp;rsquo;t work properly. Oh well, what can you do?&lt;/p&gt;</description></item><item><title>Machine Learning Project and Scrum</title><link>https://chezo.uno/blog/2025-05-02-ml-project-and-scrum/</link><pubDate>Fri, 02 May 2025 15:49:17 -0700</pubDate><guid>https://chezo.uno/blog/2025-05-02-ml-project-and-scrum/</guid><description>&lt;p&gt;I&amp;rsquo;ve worked on several machine learning projects, and intuitively, I&amp;rsquo;ve felt that Scrum doesn&amp;rsquo;t seem well-suited for machine learning. However, during an internal discussion, a colleague said, &amp;ldquo;If we use Technical Stories, we should be able to break down tasks to fit within two weeks for any tasks. And if we do that, we should be able to deliver value in two weeks for ML products.&amp;rdquo; I couldn&amp;rsquo;t properly counter this, so I&amp;rsquo;m writing this article to articulate my thoughts, including how others in the world are approaching this.&lt;/p&gt;
&lt;h2 id="disclaimer"&gt;Disclaimer&lt;/h2&gt;
&lt;p&gt;I have extensive experience in ML but am not particularly knowledgeable about Scrum, having only participated in a few projects as a development member.&lt;/p&gt;
&lt;h2 id="research-method"&gt;Research Method&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Brainstorming with Gemini 2.5 Pro (experimental)
&lt;/li&gt;
&lt;li&gt;Investigating experiences mainly on Reddit&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="anticipated-challenges"&gt;Anticipated Challenges&lt;/h2&gt;
&lt;p&gt;Iterative development invariably occurs in ML projects. This is especially true because exploratory phases (EDA and model development) are always involved, and these phases generally involve rework.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve always felt a strong incompatibility between this high probability of rework in ML tasks and the &amp;ldquo;deliver customer value in two weeks&amp;rdquo; sprint.&lt;/p&gt;
&lt;h2 id="using-technical-stories"&gt;Using Technical Stories&lt;/h2&gt;
&lt;p&gt;What exactly are Technical Stories? I looked into it, and the example at the beginning of the following article resonated with me.&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;As a developer,
I want an automated build
So that I can be sure my code works.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve actually seen stories like this here and there.&lt;/p&gt;
&lt;p&gt;However, the article above asks where the business benefit (which I understand to be synonymous with customer value) lies in such a story, and argues for the following template, which states the business value first and ties the feature to it:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;In order to &amp;lt;deliver some business benefit&amp;gt;
&amp;lt;these people&amp;gt;
will need &amp;lt;these features&amp;gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Further refining this leads to:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;In order to &amp;lt;deliver some business benefit&amp;gt;
As a &amp;lt;role&amp;gt; I want &amp;lt;some other role&amp;gt;
to &amp;lt;do something, or use or be restricted by some feature&amp;gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;In order to stop bots from spamming the site
As a member of the commercial team, I want users
to fill in a captcha box.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Happy ending? Really?&lt;/p&gt;
&lt;h2 id="business-value-in-ml-projects"&gt;Business Value in ML Projects&lt;/h2&gt;
&lt;p&gt;The business value in ML projects and ML products remains providing something that positively impacts customers.&lt;/p&gt;
&lt;p&gt;Achieving a model with good accuracy doesn&amp;rsquo;t immediately deliver customer value; it&amp;rsquo;s only when the prediction results can be served in the actual production environment that value is delivered to the customer. Alternatively, it might involve summarizing analysis results based on predictions and presenting actionable proposals in a report.&lt;/p&gt;
&lt;p&gt;Conversely, I find it quite difficult to deliver such customer value with tickets spanning a few days to two weeks. Often, when a model is being developed, the deployment environment might not even exist yet.&lt;/p&gt;
&lt;p&gt;One direction, as Gemini suggests, is to consider &amp;ldquo;risk reduction and uncertainty reduction as sprint goals.&amp;rdquo; While not direct customer value, there is certainly business value. Gaining insights like &amp;ldquo;this feature seems effective for prediction&amp;rdquo; or &amp;ldquo;this algorithm was unsuitable for this problem&amp;rdquo; is valuable in itself. That sounds reasonable.&lt;/p&gt;
&lt;p&gt;However, by reading
, I got a hint about the question &amp;ldquo;Why is it difficult to fit ML tasks into two weeks?&amp;rdquo;&lt;/p&gt;
&lt;h2 id="differences-in-task-level-development-flow"&gt;Differences in Task-Level Development Flow&lt;/h2&gt;
&lt;p&gt;
compares the task progression in typical software development versus ML development based on the following diagrams.&lt;/p&gt;
&lt;p&gt;It states that while task progression is linear at the task level in traditional software development, tasks in ML projects are often cyclical. (The following diagrams are both quoted from
)&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img src="./software_flow.png" alt="Linear flow in software development" loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Linear tasks in traditional software development are also said to be &amp;ldquo;completion-oriented&amp;rdquo; tasks, where the goal is simply to complete the project.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img src="./ml_flow.png" alt="Circular flow in machine learning" loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Of course, ML projects also have linear tasks similar to traditional software development, such as data ingestion pipelines, basic preprocessing, developing demo applications with
, and developing prediction APIs.&lt;/p&gt;
&lt;p&gt;On the other hand, exploratory and experimental tasks are cyclical processes that repeat data understanding, hypothesis formulation, and verification. The difficulty is that a two-week sprint offers no guarantee of improved prediction accuracy, so the team cannot commit to any outcome.&lt;/p&gt;
&lt;h2 id="how-to-implement-scrum-in-ml-projects"&gt;How to Implement Scrum in ML Projects&lt;/h2&gt;
&lt;p&gt;So, what should we do? Even so, some might be thinking, &amp;ldquo;Top-down, we&amp;rsquo;ve been told to do two-week sprints, so how do we implement it?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Looking at several discussions on Reddit, I found the following two approaches:&lt;/p&gt;
&lt;h3 id="1-stop-using-scrum-and-use-kanban"&gt;1. Stop Using Scrum and Use Kanban&lt;/h3&gt;
&lt;p&gt;This might sound blunt, but adopting Scrum as the workflow isn&amp;rsquo;t mandatory for agile development.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Agile Data Science with R&amp;rdquo; points this out in
:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Both methodologies are applied with great success and it’s important to keep in mind that they are a means to an end, not religions. The Agile values and principles should be the primary guideline and when selecting one of the workflows you do so because it is the best way to work in an Agile way because its the best fit for the given situation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The point is that choosing Scrum or Kanban isn&amp;rsquo;t a matter of religion; what&amp;rsquo;s important is for the team to choose the best option to achieve its goals and to keep checking whether it remains the best fit.&lt;/p&gt;
&lt;p&gt;The author&amp;rsquo;s company initially adopted Scrum due to the many experienced Scrum practitioners, but it didn&amp;rsquo;t work well for model development, and they eventually moved to Kanban (
).&lt;/p&gt;
&lt;p&gt;The reasons cited include that Scrum is too rigid and lacks flexibility for exploratory and experimental phases. Also, as mentioned earlier, it&amp;rsquo;s impossible to commit to model prediction performance improvements without seeing the data. Therefore, there&amp;rsquo;s a suggestion that time boxing is better than using story points.&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Scoping for data science is then not just estimating how long a task will take to complete, it is also time boxing. If used in this way, the scoping should be done in time units, not in a subjective measure such as story points. The data scientist should not take longer for the task than the team agreed upfront, wrapping up even when he does not feel completely finished.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Since Data Scientists tend to continue experimenting until they are satisfied, setting time limits rather than strict completion criteria is more rational in the research phase.&lt;/p&gt;
&lt;p&gt;In practice, they seem to have adopted a Kanban board with the following six lanes. However, for some tasks, they ended up stopping after confirming that the model accuracy didn&amp;rsquo;t improve during the hypothesis testing phase.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;to do&lt;/li&gt;
&lt;li&gt;test hypothesis&lt;/li&gt;
&lt;li&gt;code review hypothesis&lt;/li&gt;
&lt;li&gt;update model&lt;/li&gt;
&lt;li&gt;code review update model&lt;/li&gt;
&lt;li&gt;done&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A very similar point (use Kanban, do time boxing) can be found on
.&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Don’t confuse agility with solely scrum and its sprints, which are the root of the problem and work poorly in research mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This specific example is easy to understand:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Example: build a PoC in a week. If AUC exceeds X then it’s promising and let’s spend another 3 months on further extensions (data, features, architecture, hyperopt) and putting all into production. If there was no AUC gain on the last week, we do not extend any further. Inside this 3 month time box - execute pure Kanban, task by task, which allows you to take different paths as needed (agility), not waiting till your sprint finishes in 3 days. You already know your new feature is poorly designed and you need to start on tweaking it right now.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In other words, time-box, test hypotheses within that time, and if successful, decide to continue further extensions.&lt;/p&gt;
&lt;p&gt;Another key point the author of &amp;ldquo;Agile Data Science with R&amp;rdquo; proposes is to quickly create an MVM (Minimum Viable Model) instead of an MVP. For example, initially deploy a simple model like linear regression to a limited set of users as an MVM, then add features, and finally deploy a more complex model (Random Forest is mentioned in the text, but now a NN-based approach might also be considered).&lt;/p&gt;
&lt;p&gt;This connects with the idea of time-boxing mentioned earlier. Using an MVM as a hook is a good idea as part of an effort to prune the numerous options and find a high-probability path.&lt;/p&gt;
&lt;h3 id="2-use-hypothesis-based-stories-modifying-scrum"&gt;2. Use Hypothesis Based Stories (Modifying Scrum)&lt;/h3&gt;
&lt;p&gt;Another
discussed an approach that mixes Scrum and Kanban. They use Scrum for long-term projects and time boxing for research tasks.&lt;/p&gt;
&lt;p&gt;Specifically, they seem to have made the following changes to Scrum:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;hypothesis based stories (instead of user based)&lt;/li&gt;
&lt;li&gt;foregoe stand ups, people keep their tickets as research logs and @ people when they need help. - Product owner can read the tickets if they want to know where we are&lt;/li&gt;
&lt;li&gt;monthly retro rather than per scrum, wider focus&lt;/li&gt;
&lt;li&gt;tickets largely written by the data scientists then priorities by product owner&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Skip stand-ups and have people keep research logs in their tickets, mentioning others in the tickets when they need help. The Product Owner can proactively check the tickets for updates. Data scientists largely write the tickets, and the Product Owner prioritizes them. The goal is to allow DS/MLE to focus on their work.&lt;/p&gt;
&lt;p&gt;Hypothesis Based Stories are particularly unique. If we recall the discussion about business value in Technical Stories, the business value in the hypothesis testing phase of ML can only be described as &amp;ldquo;reducing risk and uncertainty.&amp;rdquo; Also, by making &amp;ldquo;formulating and testing a hypothesis&amp;rdquo; the goal of a ticket, it creates an awareness of the need to formulate proper hypotheses, and the outcome can be closed as &amp;ldquo;hypothesis was correct/incorrect.&amp;rdquo;&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;The point is that the exit criteria is provable and the delivery is typically the proof. Likewise, disapproving the hypothesis is still a success, we learned something.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here are some examples of such stories:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;“We believe a fine tuned distilbert architectures will allow us to identify cases with a precision of greater than .95 and a recall of greater than .7.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;“We believe it should be possible to transform a given article within our dataset to our standardised form without additional datasets or augmentation”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;“We believe x metric can best be explained to stakeholders using a combination of shap values and distribution charts”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="aside-what-does-a-product-owner-do-in-ml-projects"&gt;Aside: What Does a Product Owner Do in ML Projects?&lt;/h2&gt;
&lt;p&gt;As an aside, there&amp;rsquo;s a detailed chapter on what a Product Owner does if they don&amp;rsquo;t create tickets (
).&lt;/p&gt;
&lt;p&gt;In summary:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Takes on all communication with stakeholders, allowing DS/MLE to focus on tasks.&lt;/li&gt;
&lt;li&gt;Helps discuss and organize the tasks created by DS/MLE through scoping.&lt;/li&gt;
&lt;li&gt;Prioritizes the tasks created by DS/MLE.&lt;/li&gt;
&lt;li&gt;Points out business concerns that DS/MLE might not be aware of.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Reading this, I feel like the time of DS/MLE is considered more valuable than that of SWE.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;After a quick investigation, I&amp;rsquo;ve reached the following conclusions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Strict Scrum is not suitable for ML projects.&lt;/li&gt;
&lt;li&gt;If possible, separate workflows for research/exploration phases and development phases.&lt;/li&gt;
&lt;li&gt;In the research/exploration phase, use time limits instead of story points.&lt;/li&gt;
&lt;li&gt;In the research/exploration phase, proceed based on hypothesis testing (write Hypothesis Based Stories, use Kanban).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Time boxing and centering research tasks on hypothesis testing may be common sense in ML projects, but having them spelled out is a very powerful aid.&lt;/p&gt;
&lt;p&gt;So, in response to the initial statement, &amp;ldquo;If we use Technical Stories, we should be able to break down tasks to fit within two weeks. And if we do that, we should be able to deliver value in two weeks,&amp;rdquo; my reply would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understand business value as &amp;ldquo;reducing uncertainty.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Use Hypothesis Based Stories instead of Technical Stories, or Kanban.&lt;/li&gt;
&lt;li&gt;Instead of setting concrete goals within a two-week timeframe, quickly iterate through possibilities by setting time limits.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Migrated From Netlify to Cloudflare Pages</title><link>https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/</link><pubDate>Fri, 02 Feb 2024 16:42:14 -0800</pubDate><guid>https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/</guid><description>&lt;p&gt;Netlify is a great service, but it is known to be slow in Japan. I had been using Netlify to host my blog for a long time, but I decided to migrate to Cloudflare Pages to speed up access to my blog from Japan.&lt;/p&gt;
&lt;p&gt;Migrating from Netlify is pretty straightforward. I just needed to follow this official guide:
.&lt;/p&gt;
&lt;p&gt;My blog is built with Hugo, and I use Hugoblox, formerly known as Wowchemy, as the theme. I manage the blog content on GitHub, so I just added the Cloudflare Pages app to my GitHub repository, and it automatically detected the settings and built the site.&lt;/p&gt;
&lt;p&gt;If you use Cloudflare for DNS, it automatically sets up the DNS settings for you.&lt;/p&gt;
&lt;p&gt;The one special consideration in the build settings is that I need to set the environment variable &lt;code&gt;HUGO_VERSION&lt;/code&gt; to the version of Hugo that I use. In my case, I use Hugo 0.88.1, so I set &lt;code&gt;HUGO_VERSION&lt;/code&gt; to &lt;code&gt;0.101.0&lt;/code&gt;. I also need to pass the &lt;code&gt;-b&lt;/code&gt; (base URL) option; its value was &lt;code&gt;$URL&lt;/code&gt; on Netlify, but it is &lt;code&gt;$CF_PAGES_URL&lt;/code&gt; on Cloudflare Pages.&lt;/p&gt;
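&lt;p&gt;As a rough sketch, the settings described above might look like the following in the Cloudflare Pages dashboard. The exact build command is an assumption reconstructed from the description (a plain &lt;code&gt;hugo&lt;/code&gt; invocation with Hugo&amp;rsquo;s default &lt;code&gt;public&lt;/code&gt; output directory), so adjust it to your own project:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;# Environment variable, set in the project settings (value as mentioned above)
HUGO_VERSION=0.101.0

# Build command (assumed): -b sets the base URL from the variable Cloudflare Pages provides
hugo -b $CF_PAGES_URL

# Build output directory (assumed): public, which is Hugo's default
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;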
&lt;p&gt;The build is pretty fast, and the PageSpeed Insights score improved as well. Access feels faster in my browser, too. I&amp;rsquo;m happy with the migration. Actually, the main cause of the slowness was downloading fonts, and using the font cache on Cloudflare solved the problem.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="PageSpeed Insights on Netlify"
srcset="https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/before_hu_70d706d13854921b.webp 320w, https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/before_hu_b3081ed14c04d16e.webp 480w, https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/before_hu_751463bb4e3d1b29.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/before_hu_70d706d13854921b.webp"
width="760"
height="433"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt="PageSpeed Insights on Cloudflare Pages with font cache"
srcset="https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/featured_hu_c387175f8dbc1141.webp 320w, https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/featured_hu_6257a1260eaf439.webp 480w, https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/featured_hu_2be40bed12ae65c5.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2024-02-02-migrated-from-netlify-to-cloudflare-pages/featured_hu_c387175f8dbc1141.webp"
width="760"
height="460"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;</description></item><item><title>Scrape Notion and convert into PDF</title><link>https://chezo.uno/blog/2024-01-26_scrape-notion-to-pdf/</link><pubDate>Fri, 26 Jan 2024 17:40:00 -0800</pubDate><guid>https://chezo.uno/blog/2024-01-26_scrape-notion-to-pdf/</guid><description>&lt;p&gt;I love
, a Japanese meal kit provider in Vancouver. Their meal kits are really tasty, authentic Japanese food. I can&amp;rsquo;t live without them. When I visited Japan last year, I wasn&amp;rsquo;t too eager to find nice Japanese restaurants because of them.&lt;/p&gt;
&lt;h2 id="recipe-on-notion-is-good-if-its-printable"&gt;Recipe on Notion is good, if it&amp;rsquo;s printable&lt;/h2&gt;
&lt;p&gt;They publish their recipes on Notion. That works well because they can fix a recipe quite quickly.&lt;/p&gt;
&lt;p&gt;However, Notion has one drawback: it doesn&amp;rsquo;t provide printable pages. It&amp;rsquo;s super annoying to copy and paste a recipe into a memo app and print it out. I asked Notion&amp;rsquo;s support team, but their answer implied that it isn&amp;rsquo;t a priority.&lt;/p&gt;
&lt;p&gt;Ok, it&amp;rsquo;s automation time!&lt;/p&gt;
&lt;h2 id="scrape-notion-with-python"&gt;Scrape Notion with Python&lt;/h2&gt;
&lt;p&gt;Python has been my go-to tool for this kind of automation for years. I first tried beautifulsoup, which is a great package for web scraping, but I gave up on it because Notion content is rendered dynamically by JavaScript.&lt;/p&gt;
&lt;p&gt;I chose
and it works like a charm.&lt;/p&gt;
&lt;p&gt;Here is the GitHub repository:&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://github.com/chezou/vangohan-pdf" data-iframely-url="//iframely.net/cP0eFmn?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js"&gt;&lt;/script&gt;
&lt;p&gt;The key takeaways are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;chromedriver-autoinstaller&lt;/code&gt; package saves the extra effort of installing ChromeDriver.&lt;/li&gt;
&lt;li&gt;Exporting a PDF with Selenium is easy enough
.&lt;/li&gt;
&lt;li&gt;Running the script on GitHub Actions is easy. Don&amp;rsquo;t forget to install fonts if the page isn&amp;rsquo;t in English.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Originally, I thought I had to prepare a Docker image, but I realized it wasn&amp;rsquo;t mandatory. Managing a Docker image for this kind of hobby script would be costly. So I&amp;rsquo;m going to keep this approach and revisit later whether it was the right call.&lt;/p&gt;
&lt;p&gt;Currently, I scheduled the
. It will update the PDFs on the repository automatically.&lt;/p&gt;
&lt;p&gt;Edit: Now I use Cloudflare Pages to host the PDFs. You can check at
.&lt;/p&gt;
&lt;p&gt;No Python environment on a local machine is needed anymore.&lt;/p&gt;
&lt;p&gt;Yay, automation is completed! 😁&lt;/p&gt;</description></item><item><title>tabula-py 2.8.0 now uses jpype to launch JVM</title><link>https://chezo.uno/blog/2023-09-09-tabula-py-280/</link><pubDate>Sat, 09 Sep 2023 17:13:08 -0700</pubDate><guid>https://chezo.uno/blog/2023-09-09-tabula-py-280/</guid><description>&lt;p&gt;Recently, I released tabula-py 2.8.0. It is a major release because it uses
to launch the JVM. This reduces JVM launch time, since jpype reuses the JVM via JNI.&lt;/p&gt;
&lt;h2 id="how-fast-is-it"&gt;How fast is it?&lt;/h2&gt;
&lt;p&gt;I measured the execution time of the &lt;code&gt;read_pdf_with_template&lt;/code&gt; function, which repeatedly launched a Java process in previous versions.&lt;/p&gt;
&lt;p&gt;The example template contains 4 rules, which means it calls tabula-java 4 times.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ cat examples/data.tabula-template.json &lt;span class="p"&gt;|&lt;/span&gt; jq
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;page&amp;#34;&lt;/span&gt;: 1,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;extraction_method&amp;#34;&lt;/span&gt;: &lt;span class="s2"&gt;&amp;#34;guess&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;x1&amp;#34;&lt;/span&gt;: 153.99985500000003,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;x2&amp;#34;&lt;/span&gt;: 565.5698550000001,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;y1&amp;#34;&lt;/span&gt;: 123.999615,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;y2&amp;#34;&lt;/span&gt;: 531.7446150000001,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;width&amp;#34;&lt;/span&gt;: 411.57,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;height&amp;#34;&lt;/span&gt;: 407.745
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;page&amp;#34;&lt;/span&gt;: 2,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;extraction_method&amp;#34;&lt;/span&gt;: &lt;span class="s2"&gt;&amp;#34;guess&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;x1&amp;#34;&lt;/span&gt;: 153.99985500000003,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;x2&amp;#34;&lt;/span&gt;: 453.879855,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;y1&amp;#34;&lt;/span&gt;: 123.99884999999993,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;y2&amp;#34;&lt;/span&gt;: 210.44384999999994,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;width&amp;#34;&lt;/span&gt;: 299.88,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;height&amp;#34;&lt;/span&gt;: 86.44500000000001
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;page&amp;#34;&lt;/span&gt;: 2,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;extraction_method&amp;#34;&lt;/span&gt;: &lt;span class="s2"&gt;&amp;#34;guess&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;x1&amp;#34;&lt;/span&gt;: 153.99985500000003,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;x2&amp;#34;&lt;/span&gt;: 487.53985500000005,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;y1&amp;#34;&lt;/span&gt;: 410.99625000000003,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;y2&amp;#34;&lt;/span&gt;: 497.44125,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;width&amp;#34;&lt;/span&gt;: 333.54,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;height&amp;#34;&lt;/span&gt;: 86.44500000000001
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;page&amp;#34;&lt;/span&gt;: 3,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;extraction_method&amp;#34;&lt;/span&gt;: &lt;span class="s2"&gt;&amp;#34;guess&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;x1&amp;#34;&lt;/span&gt;: 153.99985500000003,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;x2&amp;#34;&lt;/span&gt;: 235.85485500000001,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;y1&amp;#34;&lt;/span&gt;: 123.99885000000012,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;y2&amp;#34;&lt;/span&gt;: 322.8988500000001,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;width&amp;#34;&lt;/span&gt;: 81.855,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;height&amp;#34;&lt;/span&gt;: 198.9
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The result is as follows:&lt;/p&gt;
&lt;p&gt;v2.7.0:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ python -m timeit &lt;span class="s1"&gt;&amp;#39;import tabula; tabula.read_pdf_with_template(&amp;#34;examples/data.pdf&amp;#34;, &amp;#34;examples/data.tabula-template.json&amp;#34;)&amp;#39;&lt;/span&gt; 2&amp;gt; /dev/null
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="m"&gt;1&lt;/span&gt; loop, best of 5: 1.31 sec per loop
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;v2.8.0:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ python -m timeit &lt;span class="s1"&gt;&amp;#39;import tabula; tabula.read_pdf_with_template(&amp;#34;examples/data.pdf&amp;#34;, &amp;#34;examples/data.tabula-template.json&amp;#34;)&amp;#39;&lt;/span&gt; 2&amp;gt; /dev/null
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="m"&gt;1&lt;/span&gt; loop, best of 5: &lt;span class="m"&gt;75&lt;/span&gt; msec per loop
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That is about 17 times faster than the previous version!&lt;/p&gt;
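&lt;p&gt;As a quick sanity check, the speedup is just the ratio of the two best-of-5 timings reported above. A minimal sketch in Python (the variable names are only for illustration):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;# Best-of-5 timings copied from the timeit output above, in seconds.
v270_seconds = 1.31
v280_seconds = 0.075  # 75 msec

# Ratio of the old timing to the new one.
speedup = v270_seconds / v280_seconds
print(f"{speedup:.1f}x faster")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;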
&lt;h2 id="caveats"&gt;Caveats&lt;/h2&gt;
&lt;p&gt;Since jpype cannot restart the JVM once it has started, &lt;code&gt;java_options&lt;/code&gt; only takes effect the first time you pass it. If you want to change &lt;code&gt;java_options&lt;/code&gt;, you need to restart the Python process.&lt;/p&gt;
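&lt;p&gt;To make the caveat concrete, here is a toy model of the behavior. It is not tabula-py or jpype code: &lt;code&gt;OneShotRuntime&lt;/code&gt; is a hypothetical stand-in for a runtime that can be started only once per process, so options passed after the first start have no effect.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;class OneShotRuntime:
    """Hypothetical stand-in for a JVM that starts once per process."""

    def __init__(self):
        self.options = None  # None means "not started yet"

    def start(self, options):
        # Only the first call decides the options; later calls reuse them.
        if self.options is None:
            self.options = list(options)
        return self.options


runtime = OneShotRuntime()
first = runtime.start(["-Xmx256m"])
second = runtime.start(["-Xmx1g"])  # too late: the runtime is already up
print(first, second)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this sketch, the only way to get different options is to create a fresh runtime, which mirrors restarting the Python process.&lt;/p&gt;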
&lt;h2 id="challenges-for-releasing-v280"&gt;Challenges for releasing v2.8.0&lt;/h2&gt;
&lt;p&gt;I had to solve several challenges to release this version.&lt;/p&gt;
&lt;h3 id="the-test-issue-with-different-java_options"&gt;The test issue with different &lt;code&gt;java_options&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;As I mentioned, jpype doesn&amp;rsquo;t allow restarting the JVM. This causes unit tests that use different &lt;code&gt;java_options&lt;/code&gt; to fail when they run in the same process. I solved this by running them in separate nox sessions.&lt;/p&gt;
&lt;p&gt;See
for details.&lt;/p&gt;
&lt;p&gt;This limitation is not a big deal for tabula-py users because they rarely need to change &lt;code&gt;java_options&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="read-the-docs-default-behavior-change"&gt;Read the Docs default behavior change&lt;/h3&gt;
&lt;p&gt;Read the Docs changed the packages it installs by default for Sphinx. I hadn&amp;rsquo;t declared Sphinx as a dependency, so the build failed.&lt;/p&gt;
&lt;p&gt;RTD used to install the latest versions of Sphinx and sphinx-rtd-theme by default; now, however, it installs very old versions of them, like:
&lt;/p&gt;
&lt;p&gt;I solved this by pinning the versions of Sphinx and sphinx-rtd-theme.&lt;/p&gt;</description></item><item><title>4 Steps to Release a CLI in Python</title><link>https://chezo.uno/blog/2022-05-21_fastest-way-to-release-python-cli/</link><pubDate>Fri, 20 May 2022 23:32:41 -0700</pubDate><guid>https://chezo.uno/blog/2022-05-21_fastest-way-to-release-python-cli/</guid><description>&lt;p&gt;This is what I learned from creating a Python CLI (
) in a day.&lt;/p&gt;
&lt;p&gt;In just 4 steps, you can release a CLI written in Python easily.&lt;/p&gt;
&lt;h2 id="create-a-project-by-using-poetry"&gt;Create a project by using poetry&lt;/h2&gt;
&lt;p&gt;
Poetry is a modern Python packaging and dependency management tool. It is rapidly becoming popular and the de facto standard.&lt;/p&gt;
&lt;p&gt;Poetry lets us manage package dependencies, create a project template, and publish to PyPI.&lt;/p&gt;
&lt;p&gt;To set up a project with Poetry, the following article is the best read, even if you are building a CLI.&lt;/p&gt;
&lt;p&gt;
(originally written in Japanese)&lt;/p&gt;
&lt;p&gt;One thing I added to my project is isort, which sorts imports automatically.&lt;/p&gt;
&lt;p&gt;Here is the taskipy configuration from my project:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-toml" data-lang="toml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;taskipy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;test&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cmd&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;pytest tests&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;help&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;runs all unit tests&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;pr_test&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;task lint&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cmd&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;black tests digdaglog2sql &amp;amp;&amp;amp; isort digdaglog2sql tests&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;help&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;format code&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;lint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cmd&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;task lint_black &amp;amp;&amp;amp; task lint_flake8 &amp;amp;&amp;amp; task lint_isort &amp;amp;&amp;amp; task lint_mypy&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;help&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;exec lint&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;lint_flake8&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;flake8 --max-line-length=88 tests digdaglog2sql&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;lint_mypy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;mypy tests digdaglog2sql&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;lint_black&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;black --check tests digdaglog2sql&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;lint_isort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;isort digdaglog2sql tests --check-only&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="create-a-cli-with-clickcloup"&gt;Create a CLI with Click/Cloup&lt;/h2&gt;
&lt;p&gt;
Click is a popular Python package for building command-line tools.
You can easily create a CLI by using decorators.&lt;/p&gt;
&lt;p&gt;Here is an example from the Click website:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;click&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@click.command&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@click.option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;--count&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Number of greetings.&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@click.option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;--name&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Your name&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;The person to greet.&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;Simple program that greets NAME for a total of COUNT times.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Hello, &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s2"&gt;!&amp;#34;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;
Cloup is an extension of Click.&lt;/p&gt;
&lt;p&gt;With Cloup, you can handle option groups and complex constraints like &lt;code&gt;mutually_exclusive&lt;/code&gt;:&lt;/p&gt;
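&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cloup&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;option_group&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;cloup.constraints&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mutually_exclusive&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@option_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;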
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;Cool options&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--foo&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;This text should describe the option --foo.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--bar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;This text should describe the option --bar.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;constraint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mutually_exclusive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Constraints of Cloup can validate the dependency and it also renders constraints in help.&lt;/p&gt;
&lt;h2 id="use-poetry-dynamic-versioning-for-version-management"&gt;Use poetry-dynamic-versioning for version management&lt;/h2&gt;
&lt;p&gt;
poetry-dynamic-versioning is a Python package that does the same thing as
. You don&amp;rsquo;t need to write the version number by hand because this package takes the version from a Git tag, e.g., &amp;ldquo;v0.1.0&amp;rdquo;.&lt;/p&gt;
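&lt;p&gt;The idea of deriving a version from a tag can be sketched in a few lines of Python. This is not how poetry-dynamic-versioning is implemented: &lt;code&gt;version_from_tag&lt;/code&gt; is a hypothetical helper, and the tag format (an optional leading v followed by dotted numbers) is an assumption made for illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;import re

# Assumed tag format: optional "v" prefix, then dotted numbers (e.g. "v0.1.0").
TAG_PATTERN = re.compile(r"^v?(\d+(?:\.\d+)*)$")


def version_from_tag(tag):
    """Return the version part of a Git tag, or raise ValueError."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return match.group(1)


print(version_from_tag("v0.1.0"))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In a real project the tag would come from Git itself, for example from the output of &lt;code&gt;git describe --tags --abbrev=0&lt;/code&gt;.&lt;/p&gt;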
&lt;p&gt;Managing the version with Git lets you release to PyPI from GitHub Actions. This means you can even publish to PyPI from a mobile device by creating a release on GitHub.&lt;/p&gt;
&lt;p&gt;After installing poetry-dynamic-versioning, you just add three things to pyproject.toml:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-toml" data-lang="toml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;poetry&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;0.0.0&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;poetry-dynamic-versioning&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;build-system&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;poetry-core&amp;gt;=1.0.0&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;poetry-dynamic-versioning&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nx"&gt;build-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;poetry.core.masonry.api&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that the build-system configuration may vary depending on how you install poetry-dynamic-versioning. See its documentation for details.&lt;/p&gt;
&lt;h2 id="introduce-github-actions-to-release-the-package-to-pypi"&gt;Introduce GitHub Actions to release the package to PyPI&lt;/h2&gt;
&lt;p&gt;As I mentioned above, I highly recommend using GitHub Actions to release a package to PyPI.&lt;/p&gt;
&lt;p&gt;Since GitHub provides
now, creating a release on GitHub that triggers a PyPI release is the best way to publish a new version.&lt;/p&gt;
&lt;p&gt;Here is the GitHub Actions workflow that releases to PyPI by using Poetry.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://github.com/chezou/digdaglog2sql/blob/ce35ce9b0220b77a79998f594304d850da231a94/.github/workflows/python-publish.yml" data-iframely-url="//iframely.net/39Qsg8o?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Upload Python Package&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;release&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="l"&gt;created]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;permissions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;read&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;deploy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;runs-on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;ubuntu-latest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;actions/checkout@v3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Set up Python&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;actions/setup-python@v3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;python-version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3.x&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Install dependencies&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="sd"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; python -m pip install --upgrade pip
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; pip install poetry&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;Build and publish package&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="sd"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; poetry version $(git describe --tags --abbrev=0)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; poetry build
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="sd"&gt; poetry publish --username __token__ --password ${{ secrets.PYPI_API_TOKEN }}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that you can create an API token on PyPI, but to create a project-scoped token you first need to upload the package manually once.&lt;/p&gt;</description></item><item><title>Create data lineage from Trino/Hive queries in digdag log with Python</title><link>https://chezo.uno/blog/2022-05-05-sqllineage-with-digdag-log/</link><pubDate>Thu, 05 May 2022 20:31:05 -0700</pubDate><guid>https://chezo.uno/blog/2022-05-05-sqllineage-with-digdag-log/</guid><description>&lt;h2 id="whats-data-lineage"&gt;What&amp;rsquo;s data lineage?&lt;/h2&gt;
&lt;p&gt;Data lineage describes where data comes from and where it goes.&lt;/p&gt;
&lt;p&gt;I learned this term at my previous job. The company provided &amp;ldquo;Cloudera Navigator&amp;rdquo;, which includes data lineage built from the execution logs of Hive, Spark, and so on.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2022-05-05-sqllineage-with-digdag-log/nav_lineage.webp"&gt;&lt;figcaption&gt;
&lt;h4&gt;lineage of Cloudera Navigator via https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cn_lineage_generation.html&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2 id="sqllineage-is-awesome-open-source-tool-for-visualizing-lineage"&gt;sqllineage is an awesome open source tool for visualizing lineage&lt;/h2&gt;
&lt;p&gt;Recently, I learned about a Python package called sqllineage, which analyzes and visualizes data lineage from SQL.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://github.com/reata/sqllineage" data-iframely-url="//iframely.net/4q6WPtz?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;sqllineage consists of a Python implementation that analyzes SQL and a web application written in React.&lt;/p&gt;
&lt;h2 id="visualize-data-lineage-from-treasure-datas-workflow-logs"&gt;Visualize data lineage from Treasure Data&amp;rsquo;s workflow logs&lt;/h2&gt;
&lt;p&gt;I found that Treasure Data&amp;rsquo;s workflow logs include the SQL that was executed, but the logs still need to be processed to get pure SQL.&lt;/p&gt;
&lt;p&gt;So I created digdaglog2sql to extract SQL from Treasure Workflow logs.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://github.com/chezou/digdaglog2sql" data-iframely-url="//iframely.net/5Up1iQ9?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;You can use it with Python 3.7+. Here is an overview of the usage; see GitHub for details.&lt;/p&gt;
&lt;p&gt;Install via pip:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install --user digdaglog2sql
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you have a workflow log downloaded from Treasure Data, you can convert it into SQL as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;digdaglog2sql --input workflow-log.txt --output output.sql
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Or, if you want to extract SQL from a specific workflow session, you can use its session ID:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;TD_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1234XXX/YYYYYYYY
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;digdaglog2sql --session-id &lt;span class="m"&gt;12345&lt;/span&gt; --site us --output output.sql
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can also fetch SQL from your own hosted digdag server as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;digdaglog2sql --session-id &lt;span class="m"&gt;12345&lt;/span&gt; --endpoint digdag.example.com --output output.sql
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;del&gt;Note that, as of May 5, 2022, sqllineage and sqlparse, which is an important backend of sqllineage, are not fully compatible with Trino and Hive queries.&lt;/del&gt;&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;As of 2022/05/11, the Hive/Trino issues in sqllineage are fixed in 1.3.5, which is available on PyPI.
This means you no longer need Node.js to install sqllineage from source.&lt;/p&gt;
&lt;p&gt;As of 2022/10/06, the issue in sqlparse was resolved in 0.4.3.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;del&gt;These are the PRs that address the issues:&lt;/del&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅
-&amp;gt; Released in 1.3.5&lt;/li&gt;
&lt;li&gt;✅
-&amp;gt; Released in 1.3.5&lt;/li&gt;
&lt;li&gt;✅
-&amp;gt; Released in 0.4.3&lt;/li&gt;
&lt;li&gt;✅
-&amp;gt; Released in 0.4.3&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;del&gt;Don&amp;rsquo;t worry about it. I prepared patched branches on GitHub. You can install sqllineage and sqlparse as follows:&lt;/del&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install git+https://github.com/chezou/sqlparse.git@trino#egg&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;sqlparse&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.4.3.dev0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install sqllineage
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;del&gt;If you see an error while installing sqllineage, double-check that Node.js is installed.&lt;/del&gt;&lt;/p&gt;
&lt;p&gt;Then, you can visualize your SQL file as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ sqllineage -g -f output.sql
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; * SQLLineage Running on http://localhost:5000/?f&lt;span class="o"&gt;=&lt;/span&gt;output.sql
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now you can see a visualization of the data lineage at both the table level and the column level.&lt;/p&gt;
&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2022-05-05-sqllineage-with-digdag-log/featured.webp"&gt;&lt;figcaption&gt;
&lt;h4&gt;An example of SQL lineage&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Let&amp;rsquo;s try sqllineage!&lt;/p&gt;</description></item><item><title>3 configs add recommend articles into your Hugo blog by GitHub Actions</title><link>https://chezo.uno/blog/2022-01-25_hugo-content-based-recommendation/</link><pubDate>Tue, 25 Jan 2022 19:37:52 -0800</pubDate><guid>https://chezo.uno/blog/2022-01-25_hugo-content-based-recommendation/</guid><description>&lt;p&gt;Hugo has a feature that shows keyword-based related articles.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://gohugo.io/content-management/related/" data-iframely-url="//iframely.net/q1grvUY?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;Keyword-based related articles can be useful for people who consistently maintain keywords, categories, and so on.
I wanted content-based recommendations that don&amp;rsquo;t require me to write explicit keywords. Then I found an open source tool named &amp;ldquo;Prelims&amp;rdquo;, which is developed by
.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://github.com/takuti/prelims" data-iframely-url="//iframely.net/omDBVa8?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;Prelims is a post-processing tool for the front matter of Hugo/Jekyll articles, that is, an article&amp;rsquo;s metadata.
The recommendation method implemented so far is a classical one: build TF-IDF based word vectors and find similar articles by cosine similarity.&lt;/p&gt;
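&lt;p&gt;To make the idea concrete, here is a minimal standard-library sketch of TF-IDF vectors and cosine similarity. It is not how Prelims itself is implemented; the function names and the pre-tokenized toy documents are assumptions made for illustration only.&lt;/p&gt;

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, index):
    """Return the index of the document most similar to docs[index]."""
    vectors = tfidf_vectors(docs)
    scores = {
        i: cosine_similarity(vectors[index], v)
        for i, v in enumerate(vectors)
        if i != index
    }
    return max(scores, key=scores.get)


# Hypothetical, already tokenized articles.
articles = [
    ["hugo", "blog", "recommend"],
    ["hugo", "blog", "actions"],
    ["python", "digdag", "workflow"],
]
print(most_similar(articles, 0))
```

&lt;p&gt;A term that appears in every document gets a weight of zero, which is why common words contribute little to the similarity score.&lt;/p&gt;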
&lt;p&gt;The reason I love Prelims is that it&amp;rsquo;s simple and flexible. Post-processing the front matter doesn&amp;rsquo;t break your articles or your blog system at all. You can remove the extra metadata Prelims generated whenever you want.&lt;/p&gt;
&lt;p&gt;Isn&amp;rsquo;t that practical?&lt;/p&gt;
&lt;p&gt;One downside of Prelims is that it requires writing Python code for tokenization and TF-IDF vectorization. I don&amp;rsquo;t want to carry my laptop just to write blog posts; I want to use Netlify CMS on an iPad without a Python environment.&lt;/p&gt;
&lt;p&gt;So, I built a CLI tool for Prelims, named
, which lets you add recommended articles just by writing one configuration YAML file. It also runs with GitHub Actions.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://github.com/chezou/prelims-cli" data-iframely-url="//iframely.net/m9C9uKt?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;The three things you need to prepare are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Configuration YAML file for prelims-cli. e.g., &lt;code&gt;scripts/config/myconfig.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Hugo HTML partial layout, e.g., &lt;code&gt;layouts/partials/page_related.html&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;GitHub Actions workflow for prelims-cli&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here is an example gist showing what you need to write.&lt;/p&gt;
&lt;script src="https://gist.github.com/chezou/a9cb0ab2a086b3ce9ce9bf1abbc5b347.js"&gt;&lt;/script&gt;
&lt;p&gt;where &lt;code&gt;content/blog&lt;/code&gt; is the directory for English articles and &lt;code&gt;content/post&lt;/code&gt; is the directory for Japanese articles.&lt;/p&gt;
&lt;p&gt;Adding these three files lets you show recommended articles on your Hugo blog, like the screenshot at the top of this article.&lt;/p&gt;
&lt;p&gt;Internally, it uses SudachiPy for Japanese tokenization. The &lt;code&gt;keywords&lt;/code&gt; that Prelims generates are a bit noisy and I didn&amp;rsquo;t want to clean them up, so I stopped using them.&lt;/p&gt;
&lt;p&gt;The benefits I see are that I can use my blog articles for my hobby recommendation project, and that I don&amp;rsquo;t need to manage tags and categories carefully.&lt;/p&gt;
&lt;p&gt;You can enjoy recommendations without a Python environment, so you can write your articles on an iPad with Netlify CMS!&lt;/p&gt;</description></item><item><title>py&gt; operator development guide for Python users</title><link>https://chezo.uno/blog/2020-03-05_py-operator-development-guide-for-python-users/</link><pubDate>Thu, 05 Mar 2020 14:15:52 -0800</pubDate><guid>https://chezo.uno/blog/2020-03-05_py-operator-development-guide-for-python-users/</guid><description>&lt;p&gt;
&lt;/p&gt;
&lt;h1 id="how-to-build--test-custom-scripts-on-local-env-before-pushing"&gt;How to build &amp;amp; test custom scripts on local env before pushing&lt;/h1&gt;
&lt;p&gt;General strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Give each Python task a reasonable granularity so that it can run on the local environment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since Treasure Workflow doesn&amp;rsquo;t have intermediate storage between tasks, a task can become huge. Considering container launch time, a single huge task would be more efficient, but it is difficult to debug. Start by writing reasonably sized functions that are easy to debug. Then you can create a function that calls those small functions in one go.&lt;/p&gt;
&lt;p&gt;There are a few options for developing py&amp;gt; operator tasks on the local environment.&lt;/p&gt;
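&lt;p&gt;As a rough sketch of this strategy, the script below splits the work into two small functions that can be tested on their own and one entry point that chains them. All function names and the hard-coded rows are hypothetical and only for illustration.&lt;/p&gt;

```python
def load_rows():
    """Hypothetical step 1: fetch input rows (hard-coded for this sketch)."""
    return [
        {"user": "a", "amount": 10},
        {"user": "b", "amount": 5},
        {"user": "a", "amount": 7},
    ]


def aggregate(rows):
    """Hypothetical step 2: sum the amounts per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals


def run_all():
    """Entry point that a single workflow task could call; it chains the small steps."""
    return aggregate(load_rows())


if __name__ == "__main__":
    print(run_all())
```

&lt;p&gt;Each small function can be debugged locally in isolation, while the workflow only needs to reference the single entry point.&lt;/p&gt;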
&lt;ol&gt;
&lt;li&gt;Use TD docker image&lt;/li&gt;
&lt;li&gt;Create a Python virtual environment on local env&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="1-use-td-docker-image"&gt;1. Use TD docker image&lt;/h2&gt;
&lt;p&gt;To develop a single py&amp;gt; operator task, you can use the official docker image to run Python tasks locally. As with an ordinary Python script, you can add a main guard like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;__main__&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;default_argument&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As of Mar. 5, 2020, the latest official images are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;digdag/digdag-python:3.7
&lt;/li&gt;
&lt;li&gt;digdag/digdag-anaconda3:2019.03
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to attach a debugger to the Docker container, we recommend using PyCharm to run a remote debugger. See also
.&lt;/p&gt;
&lt;h2 id="2-create-a-python-virtual-environment-on-local-env"&gt;2. Create a Python virtual environment on local env&lt;/h2&gt;
&lt;p&gt;Python provides venv to create virtual environments, and you can reproduce the same environment by using pip.&lt;/p&gt;
&lt;p&gt;Download requirements.txt and constraints.txt from
and you can install the same dependencies as digdag-python:3.7 as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ python -m venv .venv
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;.venv&lt;span class="o"&gt;)&lt;/span&gt;$ pip install -r requirements.txt -c constraints.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Using this virtual environment, you can develop by using the same packages on the local environment.&lt;/p&gt;
&lt;p&gt;Note that this approach can&amp;rsquo;t account for OS differences: the production environment runs on Debian, but the development environment might be Windows or macOS. This causes errors when executing OS-dependent commands like apt-get.&lt;/p&gt;
&lt;p&gt;If you want to create the same environment as the anaconda image, you can download environment.yml from
, and run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;conda env update -n base -f environment.yml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now you have the same Python packages as digdag/digdag-anaconda3:2019.03.&lt;/p&gt;
&lt;p&gt;Note that this command overwrites the existing conda environment. We highly recommend changing name in environment.yml from base to your own environment name, such as my-env, and running:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;conda env create -f environment.yml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="test-a-workflow-including-python"&gt;Test a workflow including Python&lt;/h2&gt;
&lt;p&gt;If you want to run an entire workflow on the local environment, &lt;del&gt;you can use
&lt;/del&gt;.&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-orange-100 dark:bg-orange-900 border-orange-500"
data-callout="warning"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-orange-600 dark:text-orange-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M12 9v3.75m-9.303 3.376c-.866 1.5.217 3.374 1.948 3.374h14.71c1.73 0 2.813-1.874 1.948-3.374L13.949 3.378c-.866-1.5-3.032-1.5-3.898 0zM12 15.75h.007v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Warning&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;As of Mar 5, 2020, Treasure Data uses digdag v0_10 branch, but it may change in the near future.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-orange-100 dark:bg-orange-900 border-orange-500"
data-callout="warning"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-orange-600 dark:text-orange-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M12 9v3.75m-9.303 3.376c-.866 1.5.217 3.374 1.948 3.374h14.71c1.73 0 2.813-1.874 1.948-3.374L13.949 3.378c-.866-1.5-3.032-1.5-3.898 0zM12 15.75h.007v.008H12z"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Warning&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;As of Feb 14, 2021, Treasure Data moved to v0_11 branch. You may use the latest release branch.
&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h1 id="passing-parameters-to-py-operator"&gt;Passing Parameters to py&amp;gt; operator&lt;/h1&gt;
&lt;p&gt;There are three ways to pass parameters into py&amp;gt; operator:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;ordinary digdag argument&lt;/li&gt;
&lt;li&gt;environment variable&lt;/li&gt;
&lt;li&gt;digdag variable&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="1-digdag-argument"&gt;1. digdag argument&lt;/h2&gt;
&lt;p&gt;Assuming we have a Python script named py_scripts/examples.py like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Message is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can pass the msg argument from the simple_with_arg task like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;+simple_with_arg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;py&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;py_scripts.examples.print_arg&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Hello World&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;digdag/digdag-python:3.7&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you want to pass multiple arguments, add them to your function signature, then add matching keys to the digdag task as well.&lt;/p&gt;
&lt;p&gt;Note that digdag variables are passed into Python seamlessly, so if your function accepts keyword arguments via **kwargs, it may receive variables you did not intend.&lt;/p&gt;
&lt;p&gt;For example, in this case, the docker variable can be passed as a dictionary {&amp;ldquo;image&amp;rdquo;: &amp;ldquo;digdag/digdag-python:3.7&amp;rdquo;}. We recommend declaring explicit arguments on a Python function.&lt;/p&gt;
&lt;p&gt;Note that there might be unintended conflicts between digdag variables and py&amp;gt; operator arguments. Suppose you set some digdag variables like the following:&lt;/p&gt;
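&lt;p&gt;The difference between an explicit signature and a **kwargs catch-all can be simulated locally. In the sketch below, call_with_params is a hypothetical stand-in for a task runner, not digdag&amp;rsquo;s actual code, and task_params mirrors the keys of the simple_with_arg task.&lt;/p&gt;

```python
import inspect


def print_arg_explicit(msg):
    """Explicit signature: only msg is accepted."""
    return f"Message is {msg}"


def print_arg_kwargs(msg, **kwargs):
    """Catch-all signature: any extra variables land in kwargs."""
    return f"Message is {msg}, extras are {sorted(kwargs)}"


def call_with_params(func, params):
    """Hypothetical runner: pass everything to **kwargs functions, else only named arguments."""
    sig = inspect.signature(func)
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()
    )
    if accepts_kwargs:
        return func(**params)
    return func(**{k: v for k, v in params.items() if k in sig.parameters})


# Simulated task parameters (assumed shape, for illustration only).
task_params = {
    "msg": "Hello World",
    "docker": {"image": "digdag/digdag-python:3.7"},
}

print(call_with_params(print_arg_explicit, task_params))
print(call_with_params(print_arg_kwargs, task_params))
```

&lt;p&gt;With the explicit signature the function sees only msg, while the **kwargs version also receives the docker dictionary.&lt;/p&gt;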
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;_export&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;my_db&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;+simple_with_arg2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;py&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;py_scripts.examples.print_arg_td&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Hello World&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;digdag/digdag-python:3.7&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;and a Python function print_arg_td with a td argument like the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_arg_td&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;td&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&amp;#39;msg&amp;#39; is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; and &amp;#39;td&amp;#39; is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;td&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this case, the td argument can never be None, because the exported td variable, i.e., {&amp;ldquo;database&amp;rdquo;: &amp;ldquo;my_db&amp;rdquo;}, is always passed. This may cause type mismatches, such as receiving a dictionary where a string is expected. We recommend not using argument names that are reserved for digdag, such as the td variables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;td.endpoint&lt;/li&gt;
&lt;li&gt;td.apikey&lt;/li&gt;
&lt;li&gt;td.use_ssl&lt;/li&gt;
&lt;li&gt;td.proxy.enabled&lt;/li&gt;
&lt;li&gt;td.proxy.host&lt;/li&gt;
&lt;li&gt;td.proxy.port&lt;/li&gt;
&lt;li&gt;td.proxy.password&lt;/li&gt;
&lt;li&gt;td.proxy.user&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that these variables might change in the future. There are also built-in digdag variables. See the digdag built-in variables at
&lt;/p&gt;
&lt;p&gt;Also, digdag might convert a value to an unintended type, e.g., a string into an integer, so we recommend validating or explicitly converting types in your Python function.&lt;/p&gt;
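&lt;p&gt;One defensive pattern is to normalize each argument at the top of the function, so the body behaves the same whether a value arrives as a string or as an integer. The function and argument names below are hypothetical.&lt;/p&gt;

```python
def print_count(count, label):
    """Normalize argument types before using them."""
    count = int(count)  # accepts 3 as well as "3"
    label = str(label)  # accepts "007" as well as 7
    return f"{label}: {count + 1}"


print(print_count("3", 7))
print(print_count(3, "007"))
```

&lt;p&gt;If a value cannot be converted, int() raises a ValueError right away, which is usually easier to diagnose than a type error deep inside the task.&lt;/p&gt;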
&lt;h2 id="2-environment-variable"&gt;2. environment variable&lt;/h2&gt;
&lt;p&gt;Environment variables are another option for passing parameters to py&amp;gt; operator. They are a reasonable choice for passing sensitive information or secrets.&lt;/p&gt;
&lt;p&gt;For example, suppose we have a task simple_with_env:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;+simple_with_env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;py&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;py_scripts.examples.print_env&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;_env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;MY_ENV_VAR&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;hello&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;digdag/digdag-python:3.7&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This MY_ENV_VAR can be accessed by using os.environ like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_env&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Env var is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;MY_ENV_VAR&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Using an environment variable is especially important when you need to handle secret information, e.g., a Treasure Data API key or an AWS secret key.&lt;/p&gt;
&lt;p&gt;digdag has a feature to store secrets. Secrets are stored in the digdag (or Treasure Workflow) database when you run the td workflow secrets subcommand.&lt;/p&gt;
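&lt;p&gt;When a task depends on such a variable, it can help to fail early with a readable message if the variable is missing. The helper below is a minimal sketch; get_required_env is a hypothetical name, not part of digdag.&lt;/p&gt;

```python
import os


def get_required_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Only the variable name is reported, never a secret value.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


if __name__ == "__main__":
    os.environ["MY_ENV_VAR"] = "hello"  # set here only so the demo runs locally
    print(get_required_env("MY_ENV_VAR"))
```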
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img src="https://chezo.uno/post/2019-12-24-python-custom-scripting/digdag_secrets.png" alt="" loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Suppose you&amp;rsquo;ve set a secret named td.apikey. This secret can be passed to py&amp;gt; operator like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;+simple_with_env2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;py&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;py_scripts.examples.access_td&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;_env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;TD_API_KEY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;${secret:td.apikey}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;digdag/digdag-python:3.7&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;and read from py_scripts/examples.py like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;access_td&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;apikey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;TD_API_KEY&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Do awesome execution&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you try to pass secrets through ordinary digdag arguments, they are never fetched from the secrets DB. For example, if you have a task like the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;+simple_with_env_ng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;py&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;py_scripts.examples.access_td_ng&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;apikey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;${secret:td.apikey}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;digdag/digdag-python:3.7&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;with the following script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;access_td_ng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;apikey&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;apikey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# Always shows &amp;#34;${secret:td.apikey}&amp;#34; instead of the actual API key like &amp;#34;1234/XXXX&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="3-digdag-variable"&gt;3. digdag variable&lt;/h2&gt;
&lt;p&gt;If you want to read a digdag variable in a Python script, you can use digdag.env.params as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_workflow_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;digdag&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digdag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;my_msg&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note that import digdag succeeds only when the script runs as a digdag py&amp;gt; operator task. To avoid an import error elsewhere, wrap it in a try-except block like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;digdag&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;digdag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;feature_query&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;feature_query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h1 id="directory-structures"&gt;Directory structures&lt;/h1&gt;
&lt;p&gt;I recommend having the following directory structure.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;my_project&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;README&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;md&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yml&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;through&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Mirror&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yml&lt;/span&gt; &lt;span class="n"&gt;except&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;td&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yml&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;awesome_workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dig&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Main&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;executed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;ingest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dig&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="n"&gt;ingestion&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;py_scripts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt; &lt;span class="n"&gt;directory&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;__init__&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="ne"&gt;Script&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;upload&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;Arm&lt;/span&gt; &lt;span class="n"&gt;Treasure&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;my_script&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Main&lt;/span&gt; &lt;span class="n"&gt;script&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;execute&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="n"&gt;enrichment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ML&lt;/span&gt; &lt;span class="n"&gt;training&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;SQL&lt;/span&gt; &lt;span class="n"&gt;directory&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;run_test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;shell&lt;/span&gt; &lt;span class="n"&gt;script&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;local&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;through&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dig&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;local&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;through&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can generate this structure from a template by using cookiecutter-digdag.&lt;/p&gt;
&lt;h1 id="how-to-install-python-packages--os-packages"&gt;How to install Python packages / OS packages&lt;/h1&gt;
&lt;p&gt;To install Python packages, you can use os.system or subprocess.run like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; -m pip install --upgrade pytd==1.4.3&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;subprocess&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# arguments should be passed by list&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;-m&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;pip&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;install&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;--upgrade&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;pytd==1.4.3&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Be sure to pin the version of each Python package.&lt;/p&gt;
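&lt;p&gt;As a minimal sketch of the version advice above, the install call can be wrapped in a small helper. The names build_pip_install_command and install_pinned are hypothetical and not part of digdag or pytd. The sketch assumes you want a failed install to stop the task, so it passes check=True to subprocess.run.&lt;/p&gt;

```python
import subprocess
import sys


def build_pip_install_command(package, version):
    """Return the argument list that installs one pinned package."""
    # sys.executable points at the interpreter that runs the task itself,
    # so the package lands in the same environment the script imports from.
    return [
        sys.executable, "-m", "pip", "install", "--upgrade",
        f"{package}=={version}",
    ]


def install_pinned(package, version):
    """Install a pinned package; raises CalledProcessError if pip fails."""
    # check=True turns a non-zero pip exit code into an exception
    # (hypothetical helper, shown for illustration only).
    subprocess.run(build_pip_install_command(package, version), check=True)
```

&lt;p&gt;Calling install_pinned("pytd", "1.4.3") at the top of a task function would then behave like the subprocess.run example above, except that a failed install raises an exception.&lt;/p&gt;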
&lt;p&gt;To install OS packages, you can run commands like the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;apt-get update&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Need to run before doing apt-get install&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;apt-get install -y wkhtmltopdf&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h1 id="how-to-readwrite-tiny-variables-between-digdag-tasks"&gt;How to read/write tiny variables between digdag tasks&lt;/h1&gt;
&lt;p&gt;To read a digdag variable, you can use digdag.env.params as mentioned above.&lt;/p&gt;
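&lt;p&gt;Reading a variable can be combined with the import guard shown earlier, so that the same script also runs outside digdag. The following is only a sketch: read_param is a hypothetical helper name, and it assumes digdag.env.params behaves like a regular Python dict.&lt;/p&gt;

```python
def read_param(name, default=None):
    """Return a digdag variable, or the default when it is unavailable."""
    try:
        # digdag is importable only while the script runs as a digdag
        # Python operator task.
        import digdag
    except ImportError:
        # Outside digdag (for example in a local unit test) use the default.
        return default
    # Assumption: params is dict-like, so .get() avoids a KeyError
    # when the variable has not been set.
    return digdag.env.params.get(name, default)
```

&lt;p&gt;With this helper, read_param("my_msg", "n/a") returns the stored message inside a workflow and "n/a" when the script is run directly.&lt;/p&gt;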
&lt;p&gt;To pass variables to a later task, import digdag and call digdag.env.store.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_workflow_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;digdag&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;digdag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;my_msg&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This example sets the my_msg variable, which subsequent tasks can use like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;+store_msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;py&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;py_scripts.examples.store_workflow_env&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Hello World&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;digdag/digdag-python:3.7&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;+restore_msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;echo&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;${my_msg}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h1 id="error-notification-with-python-stack-trace"&gt;Error notification with Python stack trace&lt;/h1&gt;
&lt;p&gt;digdag has the _error: syntax for sending a notification when a task fails. You can use the ${error.message} digdag variable to include the error message in a Slack or email notification.&lt;/p&gt;
&lt;p&gt;Suppose we have the following workflow:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;+simple_raise_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;py&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;py_scripts.examples.error_sample&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;digdag/digdag-python:3.7&amp;#34;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;echo&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;${error.message}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;with this Python script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;error_sample&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;a1234&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# raises ValueError&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This script always raises ValueError, and the workflow log shows the Python stack trace as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mi"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;0900&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0039&lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;simple&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;echo&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;invalid&lt;/span&gt; &lt;span class="n"&gt;literal&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="ne"&gt;int&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span 
class="s1"&gt;&amp;#39;a1234&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;Traceback&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="ne"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;.digdag/tmp/digdag-py-2-1815457087076518360/runner.py&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;165&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callable_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="ne"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;/private/var/folders/y9/bnjb3krn39s22rmg_wvlnf7m0000gp/T/digdag-tempdir2111531196420040503/workspace/1_simple_1_2_2945225080250994454/py_scripts/examples.py&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;print_arg&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="ne"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;a1234&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;invalid&lt;/span&gt; &lt;span class="n"&gt;literal&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="ne"&gt;int&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;a1234&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this example, we use the echo&amp;gt; operator to show the error message, but you can use the mail&amp;gt; operator to send an email or the http&amp;gt; operator to send a Slack message.&lt;/p&gt;</description></item><item><title>How to release Python package from GitHub Actions</title><link>https://chezo.uno/blog/2019-11-26_how-to-release-python-package-from-github-actions-d5a1d8edba6e/</link><pubDate>Mon, 25 Nov 2019 08:42:11 -0800</pubDate><guid>https://chezo.uno/blog/2019-11-26_how-to-release-python-package-from-github-actions-d5a1d8edba6e/</guid><description>&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img src="./0__hOksODxf9TX1BkS0.jpg" alt="" loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Recently, I changed my CI from Travis to GitHub Actions. GitHub Actions is handy for testing and publishing Python packages.&lt;/p&gt;
&lt;h3 id="testing-python-code-on-githubactions"&gt;Testing Python code on GitHub Actions&lt;/h3&gt;
&lt;p&gt;Migrating from Travis is easy: all it takes is a simple workflow file.&lt;/p&gt;
&lt;p&gt;The benefits of GitHub Actions for Python are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We can use a build matrix (e.g., OS and Python versions) as in Travis&lt;/li&gt;
&lt;li&gt;GitHub Actions jobs start faster than Travis jobs&lt;/li&gt;
&lt;li&gt;Additional dependencies are easy to install with the &lt;code&gt;uses&lt;/code&gt; syntax, which runs an existing action&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, installing JDK can be written as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- uses: actions/setup-java@v1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; with:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; java-version: &amp;#39;12&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; java-package: jdk
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; architecture: x64
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The downsides of GitHub Actions are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unable to
&lt;/li&gt;
&lt;li&gt;Debugging resources are hard to find on the internet, and you cannot SSH into the instance&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="releasing-python-package-from-github-actions-topypi"&gt;Releasing Python package from GitHub Actions to PyPI&lt;/h3&gt;
&lt;p&gt;The workflow I created runs in the following sequence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Push a tag from local, or create a tag on GitHub. Using
enables you to make a new version from Git tag&lt;/li&gt;
&lt;li&gt;GitHub Actions creates a GitHub release from the tag&lt;/li&gt;
&lt;li&gt;GitHub Actions publishes the wheel to PyPI using a PyPI API token&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You can see the actual workflow on GitHub:&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;The key points are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Triggering the workflow from Git tag&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;on:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; push:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; tags:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; - &amp;#39;v*&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;2. Adding a dependency for the deploy job&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;deploy:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; needs: release
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;needs&lt;/code&gt; syntax declares a dependency between jobs. In this case, the &lt;code&gt;release&lt;/code&gt; job creates the GitHub release, and then the &lt;code&gt;deploy&lt;/code&gt; job publishes the package to PyPI.&lt;/p&gt;
&lt;p&gt;3. Preparing secrets for PyPI&lt;/p&gt;
&lt;p&gt;PyPI recently started providing API tokens for publishing packages, so you can get an API token scoped to a specific project. See the official documentation for details, since the feature is in beta and the spec might change.&lt;/p&gt;
&lt;p&gt;After getting an API token from PyPI, you can set secrets on GitHub by clicking “Settings” -&amp;gt; “Secrets” on the project page. With my example workflow, set &lt;code&gt;PYPI_USERS&lt;/code&gt; to &lt;code&gt;__token__&lt;/code&gt;, and set &lt;code&gt;PYPI_PASSWORD&lt;/code&gt; to the token starting with &lt;code&gt;pypi-&lt;/code&gt; that you obtained from the PyPI settings.&lt;/p&gt;
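&lt;p&gt;Before saving the two secrets, it can help to sanity-check their shape. The following sketch encodes only the two rules stated above (the user name is the literal __token__ and the token starts with pypi-). The function name credential_problems is hypothetical, and the check does not contact PyPI or verify that the token is valid.&lt;/p&gt;

```python
def credential_problems(username, password):
    """Return a list of human-readable problems; empty means both look fine."""
    problems = []
    # With API tokens, the user name is always the literal string __token__.
    if username != "__token__":
        problems.append("user name must be the literal string __token__")
    # API tokens issued by PyPI start with the prefix pypi-.
    if not password.startswith("pypi-"):
        problems.append("password must be an API token starting with pypi-")
    return problems
```

&lt;p&gt;An empty list means the values have the expected shape; anything else lists what to fix before storing the secrets on GitHub.&lt;/p&gt;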
&lt;p&gt;Now you can publish a Python package to PyPI just by creating a tag on GitHub.&lt;/p&gt;</description></item><item><title>How to test a new Docker image for digdag workflow on CircleCI?</title><link>https://chezo.uno/blog/2019-10-06_how-to-test-a-new-docker-image-for-digdag-workflow-on-circleci--c8bb92987877/</link><pubDate>Sat, 05 Oct 2019 13:17:30 -0700</pubDate><guid>https://chezo.uno/blog/2019-10-06_how-to-test-a-new-docker-image-for-digdag-workflow-on-circleci--c8bb92987877/</guid><description>&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2019-10-06_how-to-test-a-new-docker-image-for-digdag-workflow-on-circleci--c8bb92987877/0__Sj4niOaDd__W4bydD.jpg"&gt;&lt;figcaption&gt;
&lt;h4&gt;Photo by [Campaign Creators](https://unsplash.com/@campaign_creators?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Testing whether a workflow can run becomes important when we build a complex workflow.
digdag is a workflow engine with a simple syntax that can run tasks written in SQL, Python, Ruby, shell script, etc. digdag has a Docker executor, and it works like a charm with the &lt;code&gt;py&amp;gt;&lt;/code&gt;, &lt;code&gt;rb&amp;gt;&lt;/code&gt;, and &lt;code&gt;sh&amp;gt;&lt;/code&gt; operators.&lt;/p&gt;
&lt;p&gt;How can you ensure that a new Docker image works with an existing digdag workflow? I’ll show how to run such a test on CircleCI.&lt;/p&gt;
&lt;p&gt;You can see the example repo on GitHub:&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;h3 id="an-issue-with-digdag-docker-executor-oncircleci"&gt;An issue with digdag Docker executor on CircleCI&lt;/h3&gt;
&lt;p&gt;Although the CircleCI docker executor is the primary choice for CircleCI 2.0 and easily runs arbitrary Docker images, it launches a remote sibling Docker container, so volumes from the job cannot be mounted into it. Because the digdag Docker executor assumes it can mount a volume, like &lt;code&gt;-v /tmp:/tmp&lt;/code&gt;, you need a workaround.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;In this article, I’ll show you how to execute digdag in local mode, a.k.a. &lt;code&gt;digdag run&lt;/code&gt;, on CircleCI with the digdag Docker executor.&lt;/p&gt;
&lt;h3 id="use-circleci-machineexecutor"&gt;Use CircleCI machine executor&lt;/h3&gt;
&lt;p&gt;tl;dr: use the CircleCI machine executor, which runs your job in a VM on CircleCI.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2j&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;working_directory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;~/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ubuntu&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1604&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;201903&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="n"&gt;docker_layer_caching&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;true&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;checkout&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Install&lt;/span&gt; &lt;span class="n"&gt;digdag&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;curl&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;~/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;digdag&lt;/span&gt; &lt;span 
class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dirs&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;https://dl.digdag.io/digdag-latest&amp;#34;&lt;/span&gt; &lt;span class="n"&gt;chmod&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;~/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;digdag&lt;/span&gt; &lt;span class="n"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;export PATH=&amp;#34;$HOME/bin:$PATH&amp;#34;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;~/.&lt;/span&gt;&lt;span class="n"&gt;bashrc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;digdag&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;digdag&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dig&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Machine executor has Python, Ruby, Java, and Docker CE by default, so you can easily run digdag on CircleCI.&lt;/p&gt;
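&lt;p&gt;If you want the job to fail early with a clear message when one of these tools is missing, a preflight check can help. The following is a minimal sketch and not part of the example repo: the names &lt;code&gt;missing_tools&lt;/code&gt; and &lt;code&gt;preflight&lt;/code&gt; and the tool list are assumptions, and it only checks that each command can be found on &lt;code&gt;PATH&lt;/code&gt;.&lt;/p&gt;

```python
import shutil


def missing_tools(tools):
    """Return the commands from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def preflight():
    """Raise if a command assumed to be needed for `digdag run` is missing."""
    # Assumed list: digdag runs on Java, and its Docker executor calls docker.
    missing = missing_tools(["java", "docker", "digdag"])
    if missing:
        raise RuntimeError("Missing tools: " + ", ".join(missing))
    print("All tools found")
```

&lt;p&gt;You could call &lt;code&gt;preflight()&lt;/code&gt; from an extra &lt;code&gt;run&lt;/code&gt; step placed after the digdag install step and before &lt;code&gt;digdag run&lt;/code&gt;.&lt;/p&gt;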
&lt;p&gt;Here are the dig file and Python script.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# test.dig+task: py&amp;gt;: test.show docker: image: &amp;#34;python:3.7-slim-buster&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Python script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# test.pydef show(): print(&amp;#34;Hello CircleCI&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="build-custom-docker-image-and-test-withdigdag"&gt;Build custom Docker image and test with digdag&lt;/h3&gt;
&lt;p&gt;In some cases, you want to test whether a new Docker image works properly with an existing workflow.&lt;/p&gt;
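&lt;p&gt;One way to make such a test meaningful is to have the task fail when the image lacks a package that the workflow needs. Here is a minimal sketch, assuming the workflow needs tabula-py (import name &lt;code&gt;tabula&lt;/code&gt;); the helper &lt;code&gt;missing_modules&lt;/code&gt; is hypothetical, and this &lt;code&gt;show&lt;/code&gt; is an illustrative variant of the one in &lt;code&gt;test.py&lt;/code&gt;.&lt;/p&gt;

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def show():
    # Assumption: the workflow depends on tabula-py, imported as "tabula".
    missing = missing_modules(["tabula"])
    if missing:
        raise RuntimeError("Image is missing modules: " + ", ".join(missing))
    print("Hello CircleCI")
```

&lt;p&gt;When the exception is raised, the &lt;code&gt;py&amp;gt;&lt;/code&gt; task fails, so the CI step running &lt;code&gt;digdag run&lt;/code&gt; should fail as well.&lt;/p&gt;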
&lt;p&gt;To build a new Docker image for the digdag Docker executor and test it with an existing workflow, you can write a config like the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2j&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;build_and_test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;working_directory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;~/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ubuntu&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1604&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;201903&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="n"&gt;docker_layer_caching&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;true&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;checkout&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Install&lt;/span&gt; &lt;span class="n"&gt;digdag&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;curl&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;~/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;digdag&lt;/span&gt; 
&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dirs&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;https://dl.digdag.io/digdag-latest&amp;#34;&lt;/span&gt; &lt;span class="n"&gt;chmod&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;~/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;digdag&lt;/span&gt; &lt;span class="n"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;export PATH=&amp;#34;$HOME/bin:$PATH&amp;#34;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;~/.&lt;/span&gt;&lt;span class="n"&gt;bashrc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Build&lt;/span&gt; &lt;span class="n"&gt;application&lt;/span&gt; &lt;span class="n"&gt;Docker&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;docker&lt;/span&gt; &lt;span class="n"&gt;build&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;Dockerfile&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;chezou&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span 
class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;treasure&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;boxes&lt;/span&gt; &lt;span class="n"&gt;workflows&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;digdag&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;test_custom&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dig&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Building a Docker image on CircleCI, you can use it form &lt;code&gt;digdag run&lt;/code&gt; command with the following workflow and Dockerfile.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# test_custom.dig+task: py&amp;gt;: test.show docker: image: &amp;#34;chezou/my-image:latest&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# DockerfileFROM python:3.7-slim-buster
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;RUN pip install tabula-py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;CMD [&amp;#34;python3&amp;#34;]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Using CircleCI’s machine executor lets you use &lt;code&gt;digdag run&lt;/code&gt; with the digdag Docker executor.&lt;/li&gt;
&lt;li&gt;It lets us run an end-to-end test of a new Docker image against an existing workflow on CircleCI.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can try it with this GitHub repo:&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;</description></item><item><title>The first conference of Operational Machine Learning: OpML ‘19</title><link>https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/</link><pubDate>Mon, 03 Jun 2019 21:50:07 -0700</pubDate><guid>https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/</guid><description>&lt;p&gt;I attended OpML ’19, a conference for “Operational Machine Learning” held in Santa Clara on May 20th.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;The scope of this conference is broad and does not seem to be clearly defined yet, even to me after attending it. I’ll borrow the description from the OpML website.&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;&lt;em&gt;The 2019 USENIX Conference on Operational Machine Learning (OpML ’19) provides a forum for both researchers and industry practitioners to develop and bring impactful research advances and cutting edge solutions to the pervasive challenges of ML production lifecycle management. ML production lifecycle is a necessity for wide-scale adoption and deployment of machine learning and deep learning across industries and for businesses to benefit from the core ML algorithms and research advances.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="overview-of-the-conference"&gt;Overview of the conference&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;There were 210 attendees, from LinkedIn, Microsoft, Google, Airbnb, Facebook, etc.&lt;/li&gt;
&lt;li&gt;The target of “Operational Machine Learning” is diverse. I expected it to focus on MLOps topics such as reproducibility, ML DSLs for productionization, visualization, and stakeholder management, but there were also many talks about ML for systems, system utilization optimization, SRE for ML, hardware accelerators, etc.&lt;/li&gt;
&lt;li&gt;There is a contrast between tech giants, e.g. Google, Uber, Facebook, Airbnb, Microsoft, and LinkedIn, and the companies following them. While the leading ML companies talked about their OSS or ML infrastructure, the followers tended to talk about specific use cases or their own solutions (those speakers seem to be from small ML ventures).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="some-interesting-talks"&gt;Some interesting talks&lt;/h3&gt;
&lt;h3 id="keynote-ray-a-distributed-framework-for-emerging-ai-applications"&gt;Keynote: Ray: A Distributed Framework for Emerging AI Applications&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The current target of Machine Learning is pattern recognition, but Jordan said decision-making will be the future of ML/AI&lt;/li&gt;
&lt;li&gt;Creating a “recommendation market” is the key&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_rWIRSVcGYE5uuZ1ISuFMjg_hu_8bef18344b51a41f.webp 320w, https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_rWIRSVcGYE5uuZ1ISuFMjg_hu_641a40d50cec64de.webp 480w, https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_rWIRSVcGYE5uuZ1ISuFMjg_hu_4a47ea9b80cd86b8.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_rWIRSVcGYE5uuZ1ISuFMjg_hu_8bef18344b51a41f.webp"
width="760"
height="570"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_eMGR-WfddebwmyheNe3OAg_hu_eac237abcb6ac670.webp 320w, https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_eMGR-WfddebwmyheNe3OAg_hu_551d40ae440e0ac3.webp 480w, https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_eMGR-WfddebwmyheNe3OAg_hu_287e7b8a84ae0825.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_eMGR-WfddebwmyheNe3OAg_hu_eac237abcb6ac670.webp"
width="760"
height="570"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="mlop-lifecycle-scheme-for-vision-based-inspection-process-in-manufacturing"&gt;MLOp Lifecycle Scheme for Vision-based Inspection Process in Manufacturing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A challenge of defect recognition from images at the edge, applied to Samsung smartphones.&lt;/li&gt;
&lt;li&gt;They need to run inference on 3,000 GB of images per day.&lt;/li&gt;
&lt;li&gt;The team structure which involves product inspectors and product managers is interesting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_5Ab748i-ppe-Lt1DreRiGQ.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;From &lt;a href="https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_lim.pdf"&gt;https://www.usenix.org/sites/default/files/conference/protected-files/opml19_slides_lim.pdf&lt;/a&gt;&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="aiops-challenges-and-experiences-inazure"&gt;AIOps: Challenges and Experiences in Azure&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Anomaly detection and diagnosis with lambda architecture for Azure&lt;/li&gt;
&lt;li&gt;Disk failure prediction for Azure, which enables proactive live migration of workloads to a healthy disk&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_TtMe-If9qvcuUr5_7dnITQ_hu_837339dbfbb16d94.webp 320w, https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_TtMe-If9qvcuUr5_7dnITQ_hu_d6f345fb8d78a85d.webp 480w, https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_TtMe-If9qvcuUr5_7dnITQ_hu_b9263caf1125a4fe.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_TtMe-If9qvcuUr5_7dnITQ_hu_837339dbfbb16d94.webp"
width="760"
height="570"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="how-the-experts-do-it-production-ml-atscale"&gt;How the Experts Do It: Production ML at Scale&lt;/h3&gt;
&lt;p&gt;A panel discussion for ML infrastructures&lt;/p&gt;
&lt;p&gt;Lead and moderator: Joel Young, LinkedIn&lt;/p&gt;
&lt;p&gt;Panelists:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sandhya Ramu, Director, AI SRE, LinkedIn&lt;/li&gt;
&lt;li&gt;Andrew Hoh, Product Manager, ML Infra and Applied ML, Airbnb&lt;/li&gt;
&lt;li&gt;Aditya Kalro, Engineering Manager, AI Infra Services and Platform, Facebook&lt;/li&gt;
&lt;li&gt;Faisal Siddiqi, Engineering Manager, Personalization Infrastructure, Netflix&lt;/li&gt;
&lt;li&gt;Pranav Khaitan, Engineering Manager, Personalization and Dialog ML Infra, Google&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="the-important-thing-to-keep-top-levelis"&gt;The important thing to keep top level is&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;the lead time from experiment to production&lt;/li&gt;
&lt;li&gt;Flows built for production that involve different teams&lt;/li&gt;
&lt;li&gt;Not everything is the highest priority. Metrics, dashboards are important&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="cost-of-runtrain-vsagility"&gt;Cost of run/train vs Agility&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;It’s hard to find downstream use cases. (Airbnb)&lt;/li&gt;
&lt;li&gt;Monitor model resource usage&lt;/li&gt;
&lt;li&gt;Keep ML infrastructure extremely flexible&lt;/li&gt;
&lt;li&gt;Hard to force using a single framework&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="what-are-the-important-things-for-your-ml-platform"&gt;What are the important things for your ML platform?&lt;/h4&gt;
&lt;p&gt;Facebook&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;li&gt;Scalability&lt;/li&gt;
&lt;li&gt;Developer productivity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;LinkedIn&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Agility (Available libraries etc)&lt;/li&gt;
&lt;li&gt;Enabling the latest technology&lt;/li&gt;
&lt;li&gt;Cost and impact of Machine Learning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Netflix&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How quickly and how many A/B tests we can run&lt;/li&gt;
&lt;li&gt;How quickly can a new researcher become productive?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Airbnb&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Business impact&lt;/li&gt;
&lt;li&gt;# of users for the infrastructures&lt;/li&gt;
&lt;li&gt;How many inferences/scoring is done?&lt;/li&gt;
&lt;li&gt;Availability, scalability, cost, and long-term decision making&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Google&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Innovation aspect&lt;/li&gt;
&lt;li&gt;How will the ML infrastructure empower the products of the next 5 years?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="continuous-training-for-production-ml-in-the-tensorflow-extended-tfxplatform"&gt;Continuous Training for Production ML in the TensorFlow Extended (TFX) Platform&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;TFX provides a library for recording and retrieving metadata for ML: ML Metadata
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2019-06-04_the-first-conference-of-operational-machine-learning--opml--19-308baad36108/1_JjjlNJJ7xndhiOSddZv-zA.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;From &lt;a href="https://www.usenix.org/system/files/opml19papers-baylor.pdf"&gt;https://www.usenix.org/system/files/opml19papers-baylor.pdf&lt;/a&gt;&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="disdat-bundle-data-management-for-machine-learning-pipelines"&gt;Disdat: Bundle Data Management for Machine Learning Pipelines&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Talk about OSS for ML pipeline and data versioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="predictive-cachingscale"&gt;Predictive Caching@Scale&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Traffic prediction for CDN (Akamai)&lt;/li&gt;
&lt;li&gt;An interesting cache strategy that accounts for prediction error&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Ruby for Data Science and Machine Learning</title><link>https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/</link><pubDate>Tue, 23 Apr 2019 20:10:28 -0700</pubDate><guid>https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/</guid><description>&lt;p&gt;I attended RubyKaigi 2019,
held in Fukuoka from Apr 18 to Apr 21. This year’s RubyKaigi was a really great opportunity for me to learn about the possibilities of Data Science and Machine Learning in Ruby.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_yHSEXuY1I2U_4ysS_hu_ee24cb5af842463d.webp 320w, https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_yHSEXuY1I2U_4ysS_hu_734b455d02d0947d.webp 480w, https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_yHSEXuY1I2U_4ysS_hu_aa35dc01168f4ab9.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_yHSEXuY1I2U_4ysS_hu_ee24cb5af842463d.webp"
width="760"
height="570"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="data-science-andruby"&gt;Data Science and Ruby&lt;/h3&gt;
&lt;p&gt;As many of you may know, Ruby is widely known for web applications built with frameworks such as Ruby on Rails, but there is also growing momentum for data science in Ruby and other non-Python languages. Here is the list of the sessions about Data Science.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A Deep Learning Adventure
[&lt;a href="https://github.com/nusco/deep_learning_adventure"&gt;repo&lt;/a&gt;] (a talk by Paolo Perrotta, the author of Metaprogramming Ruby!)&lt;/li&gt;
&lt;li&gt;Reducing ActiveRecord memory consumption using Apache Arrow&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="center-of-data-science-withruby"&gt;Center of data science with Ruby&lt;/h3&gt;
&lt;p&gt;Three core pieces of software support this movement:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Apache Arrow&lt;/li&gt;
&lt;li&gt;Numo/Cumo&lt;/li&gt;
&lt;li&gt;Red Chainer (a Deep Learning framework ported from Chainer, which is implemented in Python)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Apache Arrow is a cross-language data format for in-memory data. Kohei Sutou, the creator of Red Arrow, the Ruby binding of Apache Arrow, is a Japanese PMC member of Apache Arrow. He has also been organizing an initiative called Red Data Tools, with monthly developer meetups for Ruby data tools. The meetups drive the Ruby data ecosystem, especially for beginners. I heard from mrkn, a Ruby committer, that Arrow is working on implementing the data manipulations that pandas provides in C++. That means calculations on tabular data, a.k.a. DataFrames, can be done in Apache Arrow’s Table format, which would make Ruby suitable for data manipulation.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_r3ToqacydBaYmyh1_hu_a5c48076ad1c754c.webp 320w, https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_r3ToqacydBaYmyh1_hu_ca46402a0b7a12a7.webp 480w, https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_r3ToqacydBaYmyh1_hu_f2f7483d78c1e8a3.webp 600w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_r3ToqacydBaYmyh1_hu_a5c48076ad1c754c.webp"
width="600"
height="409"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Another essential piece is Numo, which handles numeric arrays like NumPy and is the fundamental part of DS/ML execution. Cumo is the GPU version of Numo and is 75 times faster than Numo on the hello world of Deep Learning, a.k.a. MNIST. The talk about Cumo suggested that many Deep Learning related computations depend on CUDA, so scripting languages can be just a wrapper around them.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_0R6wvTiO6WQw79bD_hu_786121caf00a478e.webp 320w, https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_0R6wvTiO6WQw79bD_hu_81dc0311b197c10e.webp 480w, https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_0R6wvTiO6WQw79bD_hu_36840dcfbd5e9614.webp 600w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2019-04-24_ruby-for-data-science-and-machine-learning-9f03e99125e0/0_0R6wvTiO6WQw79bD_hu_786121caf00a478e.webp"
width="600"
height="450"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Red Chainer enables Deep Learning tasks, but it still seems young. Instead, Menoh-Ruby
can be a great tool: it allows inference/prediction with models pre-trained in PyTorch, Chainer, or any other framework that can export ONNX, the intermediate format of DL.&lt;/p&gt;
&lt;h3 id="so-how-will-be-the-ruby-data-science-goingon"&gt;So, how will be the Ruby data science going on?&lt;/h3&gt;
&lt;p&gt;Looking at the momentum of Apache Arrow and Cumo, I feel data science in Ruby will become much easier, since the core problems related to execution speed can be hidden in the C++/GPU layer. And using Menoh-Ruby can be a good opportunity for Ruby on Rails applications to serve prediction results from Ruby!&lt;/p&gt;
&lt;p&gt;Red Data Tools also creates opportunities for many software engineers to jump into the ML/DS world. One of my friends told me he started working on Red Data Tools because he wanted to change his field, and it’s an excellent area to join.&lt;/p&gt;
&lt;p&gt;If you are interested in this movement, let’s join
!&lt;/p&gt;</description></item><item><title>A recent update of tabula-py</title><link>https://chezo.uno/blog/2019-02-18_a-recent-update-of-tabula-py-a923d2ab667b/</link><pubDate>Sun, 17 Feb 2019 08:26:00 -0800</pubDate><guid>https://chezo.uno/blog/2019-02-18_a-recent-update-of-tabula-py-a923d2ab667b/</guid><description>&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2019-02-18_a-recent-update-of-tabula-py-a923d2ab667b/0__9HRqzqcWldOqKJCK.jpg"&gt;&lt;figcaption&gt;
&lt;h4&gt;Photo by &lt;a href="https://unsplash.com/@joshrh19?utm_source=medium&amp;amp;utm_medium=referral"&gt;Joshua Rawson-Harris&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This article was&lt;/em&gt;
&lt;em&gt;published last December. I’m planning to release the next version of tabula-py within a few weeks.&lt;/em&gt;&lt;/p&gt;
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;(Note: Oct 7th, 2019)
As of Oct. 2019, I launched
and
for tabula-py. The FAQ would be a good place to learn how to get accurate extraction.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is my first post on Patreon. Apologies for the delayed announcement of the recent updates to tabula-py. I will introduce the key features of these updates.&lt;/p&gt;
&lt;h3 id="use-tabula-apptemplate"&gt;Use Tabula app template&lt;/h3&gt;
&lt;p&gt;The Tabula app has a template
feature to reuse the same bounding box for extraction. tabula-py can now load a Tabula app template and extract with it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;dfs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tabula&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_pdf_with_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;./examples/data.pdf&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s1"&gt;&amp;#39;./examples/data.tabula-template.json&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pandas_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;header&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="support-file-like-object"&gt;Support file-like object&lt;/h3&gt;
&lt;p&gt;Like many Python libraries, tabula-py can now read from a file-like object.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# With file-like object &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tests/resources/data.pdf&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tabula&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# With pathlib &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tests/resources/data.pdf&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tabula&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="allow-multiple-areaoption"&gt;Allow multiple area option&lt;/h3&gt;
&lt;p&gt;As of tabula-java v1.0.2, tabula can handle multiple areas in a single call.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;tests/resources/MultiColumn.pdf&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Relative area &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;df_relative&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tabula&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;relative_area&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Absolute area &lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;df_absolute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tabula&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;451&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;212&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;212&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;451&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;425&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="tip-get-tableposition"&gt;Tip: Get table position&lt;/h3&gt;
&lt;p&gt;This is not a new feature, but I think it might be helpful for some PDFs.&lt;br&gt;
Detailed post:
&lt;/p&gt;
&lt;p&gt;The JSON output of &lt;code&gt;read_pdf&lt;/code&gt; contains position information, so you can get the table position as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;tables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;./examples/data.pdf&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;json&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;top&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;bottom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;width&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bottom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;528.8800048828125&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;564.8800048828125&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you have any questions, ask on
!&lt;/p&gt;
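&lt;p&gt;As a minimal sketch of how the position values from the tip above could be reused, the hypothetical helper below (it is not part of tabula-py) turns one table entry of the JSON output into a [top, left, bottom, right] list, which is the order the &lt;code&gt;area&lt;/code&gt; option expects. The sample dict is made up and only mimics the keys used in the session above.&lt;/p&gt;

```python
def table_area(table):
    """Return [top, left, bottom, right] for one table entry of the JSON output."""
    top = table["top"]
    left = table["left"]
    # The JSON entry stores a size, so convert it to the opposite edges.
    bottom = top + table["height"]
    right = left + table["width"]
    return [top, left, bottom, right]


# Made-up entry that only mimics the keys used in the session above.
sample = {"top": 10.0, "left": 20.0, "width": 300.0, "height": 150.0}
area = table_area(sample)
print(area)
```

&lt;p&gt;The resulting list could then be passed back through the &lt;code&gt;area&lt;/code&gt; option of &lt;code&gt;read_pdf&lt;/code&gt; to extract just that region again.&lt;/p&gt;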
</description></item><item><title>Use Markdown document on brand new PyPI</title><link>https://chezo.uno/blog/2018-04-17_use-markdown-document-on-brand-new-pypi-9723024f09c2/</link><pubDate>Mon, 16 Apr 2018 21:21:33 -0700</pubDate><guid>https://chezo.uno/blog/2018-04-17_use-markdown-document-on-brand-new-pypi-9723024f09c2/</guid><description>&lt;p&gt;Yesterday, PyPI was relaunched as a next-generation site. It is modern and stylish.&lt;/p&gt;
&lt;p&gt;
told me that
, which was accepted in Feb. 2018, allows package descriptions on PyPI to use not only reStructuredText but also other formats such as Markdown.&lt;/p&gt;
&lt;p&gt;So I enabled my Markdown document on the brand-new PyPI.&lt;/p&gt;
&lt;h3 id="upgrade-python-packages-if-necessary"&gt;Upgrade Python packages (if necessary)&lt;/h3&gt;
&lt;p&gt;We can use Markdown with setuptools
. Let’s upgrade your Python packages if needed. Otherwise, the Markdown description will not be rendered properly.&lt;/p&gt;
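&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ python -m pip install --upgrade pip
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip install --upgrade wheel
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip --version
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip 10.0.0 from c:\users\chezo\documents\source\tabula-py\venv\lib\site-packages\pip (python 3.6)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip list
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Package           Version     Location
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;----------------- ----------- --------------------------------------
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;(...snip...)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;setuptools        38.1.0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;(...snip...)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;wheel             0.31.0
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;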
&lt;h3 id="modify-setuppy"&gt;Modify setup.py&lt;/h3&gt;
&lt;p&gt;If you’ve already used README.md as a long description on PyPI, all you have to do is to add &lt;code&gt;long_description_content_type&lt;/code&gt; to setup.py as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;long_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;README.md&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;long_description_content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text/markdown&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can see the full description of the PR :&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;h3 id="build-a-wheel-and-upload-withtwine"&gt;Build a wheel and upload with twine&lt;/h3&gt;
&lt;p&gt;Now, you can build a wheel and upload with twine.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ python setup.py bdist_wheel
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ twine upload dist/*
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2018-04-17_use-markdown-document-on-brand-new-pypi-9723024f09c2/1__TsTQiTt6wOa5zxTxQzpTsQ.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;The Markdown document was rendered!&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;CAVEAT: I didn’t release a new version to PyPI, because bumping the version just to render Markdown is too much. I
.&lt;/p&gt;
</description></item><item><title>Python basics: package management</title><link>https://chezo.uno/blog/2017-08-30_python-basics--package-management-462918458f96/</link><pubDate>Tue, 29 Aug 2017 19:31:15 -0700</pubDate><guid>https://chezo.uno/blog/2017-08-30_python-basics--package-management-462918458f96/</guid><description>&lt;p&gt;Python is a very popular programming language for machine learning. In this article, I will introduce the basics of setting up a Python environment.&lt;/p&gt;
&lt;h3 id="glossary"&gt;Glossary&lt;/h3&gt;
&lt;p&gt;I will introduce basic terms about Python package management.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pip: A tool for package installation. It retrieves Python packages from
. pip corresponds to Ruby’s gem command.&lt;/li&gt;
&lt;li&gt;virtualenv: A package isolation tool for Python. It plays a role similar to Ruby’s Bundler, and it can also switch between Python 2.x and 3.x.&lt;/li&gt;
&lt;li&gt;venv: The official package isolation tool, introduced in Python 3.3. However, if you want to use Python 2.x or you are a Debian/Ubuntu user, I recommend virtualenv.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;venv is invoked through a specific interpreter, as in &lt;code&gt;python3.5 -m venv some-awesome-env&lt;/code&gt;, so it cannot switch between Python 2 and 3. The venv package on Debian/Ubuntu also pulls in dependencies that other OSs do not need. I’m an Ubuntu user, so I don’t use venv.&lt;/p&gt;
&lt;p&gt;These are the common tools for many Pythonistas. They are
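&lt;p&gt;The same step can also be driven from Python through the standard-library &lt;code&gt;venv&lt;/code&gt; module. The sketch below is only an illustration under a few assumptions: it builds the environment in a temporary directory (an arbitrary choice), skips pip to keep it quick, and, just like the command line, always uses the interpreter that runs it. The helper name &lt;code&gt;make_env&lt;/code&gt; is made up.&lt;/p&gt;

```python
import tempfile
import venv
from pathlib import Path


def make_env(parent):
    """Create a virtual environment under `parent` and return its path."""
    env_dir = Path(parent) / "some-awesome-env"
    # with_pip=False keeps the sketch quick; set it to True to bootstrap pip.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = make_env(tmp)
    # pyvenv.cfg records which interpreter the environment was built from.
    created = (env_dir / "pyvenv.cfg").exists()
    print(created)
```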
of
, a working group that maintains many of the relevant projects in Python packaging.&lt;/p&gt;
&lt;p&gt;There is one more tool, which serves a specific purpose.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;conda: conda is a tool for package management for scientific computation developed by
, Inc. It can manage not only Python but also R. The PyData community loves conda.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I use
, but I recommend that you learn the pros and cons of conda and virtualenv/venv and choose the right tool for your purpose.&lt;/p&gt;
&lt;h3 id="installation-ofpython"&gt;Installation of Python&lt;/h3&gt;
&lt;p&gt;Since it is 2017, Python beginners should use the latest version of Python 3. However, some situations still require Python 2.x for painful reasons.&lt;/p&gt;
&lt;p&gt;If you need both Python 2 and 3, you can install multiple Python versions with package management tools like &lt;code&gt;apt&lt;/code&gt; or &lt;code&gt;yum&lt;/code&gt;. On Ubuntu, you can install Python 2.7 with &lt;code&gt;apt install python-dev&lt;/code&gt; and Python 3.6 with &lt;code&gt;apt install python3-dev&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;After installation, you can see the Pythons under &lt;code&gt;/usr/bin&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;/usr/bin/python #&amp;lt;- 2.7&lt;br&gt;
/usr/bin/python2 #&amp;lt;- 2.7&lt;br&gt;
/usr/bin/python2.7 #&amp;lt;- 2.7&lt;br&gt;
/usr/bin/python3 #&amp;lt;- 3.6&lt;br&gt;
/usr/bin/python3.6 #&amp;lt;- 3.6&lt;/p&gt;
&lt;p&gt;If you’re a macOS user, you can install both Python 2 and 3 via &lt;code&gt;brew install&lt;/code&gt; or &lt;code&gt;port install&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Windows users can install Python 2 and 3 using the official installer or Chocolatey. The Windows installer also provides the &lt;code&gt;py&lt;/code&gt; launcher (available since Python 3.3), which switches between Python versions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Caution&lt;/strong&gt;: Avoid relying on the system Python. It is often old, and critical system tools such as yum depend on it. If you run &lt;code&gt;sudo pip install&lt;/code&gt; carelessly, you risk breaking the environment of the OS itself.&lt;/p&gt;
&lt;h3 id="package-management"&gt;Package management&lt;/h3&gt;
&lt;p&gt;As I mentioned, you should not run &lt;code&gt;sudo pip install awesome-package&lt;/code&gt;. Many important system tools depend on the system Python, so don’t use &lt;code&gt;sudo pip&lt;/code&gt;.&lt;/p&gt;
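&lt;p&gt;One quick way to check that you are not about to touch the system Python is to ask the interpreter whether it is running inside a virtual environment. This is a minimal sketch, and the helper name &lt;code&gt;in_virtual_env&lt;/code&gt; is made up for illustration. It relies on &lt;code&gt;sys.prefix&lt;/code&gt; differing from &lt;code&gt;sys.base_prefix&lt;/code&gt; inside venv, and on the &lt;code&gt;sys.real_prefix&lt;/code&gt; attribute that older virtualenv releases set.&lt;/p&gt;

```python
import sys


def in_virtual_env():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside venv (and recent virtualenv), sys.prefix differs from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases set sys.real_prefix instead.
    return sys.prefix != base or hasattr(sys, "real_prefix")


if in_virtual_env():
    print("virtual environment:", sys.prefix)
else:
    print("not in a virtual environment; avoid sudo pip install here")
```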
&lt;p&gt;If you’re a venv user, this tutorial will help you.&lt;br&gt;
&lt;/p&gt;
&lt;p&gt;For virtualenv users, here is a virtualenv tutorial. It is a translation of a document written by aodag.&lt;br&gt;
&lt;/p&gt;
&lt;h4 id="why-should-we-use-virtualenvvenv"&gt;Why should we use virtualenv/venv?&lt;/h4&gt;
&lt;p&gt;virtualenv avoids:&lt;br&gt;
- Conflicting Python packages with system Python&lt;br&gt;
- Conflicting packages between projects&lt;br&gt;
- Losing sight of which project depends on those packages&lt;/p&gt;
&lt;h4 id="install-virtualenv"&gt;Install virtualenv&lt;/h4&gt;
&lt;p&gt;First, install &lt;code&gt;virtualenv&lt;/code&gt; under your home directory.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ wget https://bootstrap.pypa.io/get-pip.py
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;#34;&lt;span class="nv"&gt;$HOME&lt;/span&gt;/.local/bin:&lt;span class="nv"&gt;$PATH&lt;/span&gt;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ python get-pip.py --user
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip install virtualenv --user
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Windows users can install it with just `pip install`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; pip install virtualenv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With the &lt;code&gt;--user&lt;/code&gt; option, packages are installed under your user directory.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;virtualenv&lt;/code&gt; can create a Python virtual environment. Creating the environment under the project root is common.&lt;/p&gt;
&lt;p&gt;Run &lt;code&gt;virtualenv&lt;/code&gt; as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ virtualenv venv -p python3.6
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This creates the virtual environment.&lt;/p&gt;
&lt;p&gt;Since Python packages will be installed under the &lt;code&gt;venv&lt;/code&gt; directory, don’t forget to add it to &lt;code&gt;.gitignore&lt;/code&gt;. Then activate the environment:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ &lt;span class="nb"&gt;source&lt;/span&gt; venv/bin/activate
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt; $
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# For Windows&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;gt; venv\Scripts\activate
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="install-python-packages-viapip"&gt;Install Python packages via pip&lt;/h4&gt;
&lt;p&gt;You can install packages via &lt;code&gt;pip&lt;/code&gt;. After you activate the virtualenv/venv environment, pip installs packages under the &lt;code&gt;venv&lt;/code&gt; directory.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt; $ pip install pyramid
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;To install a specific version of a package, specify the version number:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt; $ pip install &lt;span class="nv"&gt;pyramid&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.8.1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Without a version number, &lt;code&gt;pip&lt;/code&gt; installs the latest stable version.&lt;/p&gt;
&lt;p&gt;You can list installed packages with &lt;code&gt;pip list&lt;/code&gt; command.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt; $ pip list
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;numpy &lt;span class="o"&gt;(&lt;/span&gt;1.13.1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pandas &lt;span class="o"&gt;(&lt;/span&gt;0.20.3&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip &lt;span class="o"&gt;(&lt;/span&gt;9.0.1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pkginfo &lt;span class="o"&gt;(&lt;/span&gt;1.4.1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pytest &lt;span class="o"&gt;(&lt;/span&gt;3.2.0&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python-dateutil &lt;span class="o"&gt;(&lt;/span&gt;2.6.1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pytz &lt;span class="o"&gt;(&lt;/span&gt;2017.2&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;wheel &lt;span class="o"&gt;(&lt;/span&gt;0.29.0&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="managing-packageversion"&gt;Managing package version&lt;/h4&gt;
&lt;p&gt;Since pip 7.1, we can pin package versions with a constraints file such as &lt;code&gt;constraints.txt&lt;/code&gt;. The &lt;code&gt;pip freeze&lt;/code&gt; command lists installed packages with their version numbers.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt;$ pip freeze -l
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.13.1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.20.3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;pkginfo&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.4.1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;pytest&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;3.2.0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;python-dateutil&lt;span class="o"&gt;==&lt;/span&gt;2.6.1
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;pytz&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2017.2
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt;$ pip freeze -l &amp;gt; constraints.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;List the packages you directly require in &lt;code&gt;requirements.txt&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt;$ cat requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pandas
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;numpy
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then you can install required packages as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv&lt;span class="o"&gt;)&lt;/span&gt;$ pip install -r requirements.txt -c constraints.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="levelaging-wheelhouse"&gt;Leveraging wheelhouse&lt;/h4&gt;
&lt;p&gt;Modern Python packages are distributed in the wheel format, which is a binary distribution format. The other format, sdist, is a source distribution and requires compiling from source if the package depends on native code. I highly recommend the wheel format: it installs faster than sdist because no compilation is needed, and it makes deployment easy even in an offline environment that cannot reach PyPI.&lt;/p&gt;
&lt;p&gt;If you put all the required &lt;code&gt;.whl&lt;/code&gt; files under a &lt;code&gt;wheelhouse&lt;/code&gt; directory, you can install from it as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip install -r requirements.txt -c constraints.txt -f wheelhouse --no-index
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;-f&lt;/code&gt; or &lt;code&gt;--find-links&lt;/code&gt; option tells pip to look for packages in the wheelhouse directory, and the &lt;code&gt;--no-index&lt;/code&gt; option prevents pip from connecting to PyPI. For &lt;code&gt;pip wheel&lt;/code&gt;, the &lt;code&gt;-w&lt;/code&gt; or &lt;code&gt;--wheel-dir&lt;/code&gt; option sets the directory the wheels are written to.&lt;/p&gt;
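&lt;p&gt;A wheel file name already tells you whether the package is platform specific. As a rough sketch, assuming the PEP 427 naming convention (distribution, version, optional build tag, Python tag, ABI tag, platform tag), the hypothetical helper below splits a file name into those parts. The two file names are illustrative and are not taken from a real index listing.&lt;/p&gt;

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its PEP 427 components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel name: " + filename)
    # The optional build tag sits between the version and the Python tag.
    name, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
        # "none-any" means no compiled code: the wheel works on any platform.
        "pure_python": abi_tag == "none" and platform_tag == "any",
    }


# Illustrative file names, not taken from a real index listing.
for wheel in (
    "numpy-1.13.1-cp36-cp36m-manylinux1_x86_64.whl",
    "pytz-2017.2-py2.py3-none-any.whl",
):
    info = parse_wheel_filename(wheel)
    print(info["name"], info["platform"], info["pure_python"])
```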
&lt;p&gt;If you want to export all the dependencies into the &lt;code&gt;wheelhouse&lt;/code&gt; directory, you can use the &lt;code&gt;pip wheel&lt;/code&gt; command.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip wheel -r requirements.txt -c constraints.txt -w wheelhouse
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="should-i-useconda"&gt;Should I use conda?&lt;/h3&gt;
&lt;p&gt;Anaconda is a Python distribution for scientific computing such as machine learning. It comes in two flavors: Anaconda, which bundles a recommended set of packages, and Miniconda, a minimal conda environment into which you install only the packages you need. Anaconda sometimes bundles heavy packages (it used to include Django), so check the default package list and pick the flavor that fits.&lt;/p&gt;
&lt;p&gt;Anaconda creates its own kind of virtual environment, separate from virtualenv. Notably, the &lt;code&gt;--copy&lt;/code&gt; option copies system-level libraries (.so files and so on) instead of creating symbolic links. If you archive such an environment with zip or tar, you can use it on other machines.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ conda create -n myenv --copy &lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3.6
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ conda activate myenv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In other words, conda also manages libraries that are normally handled by OS-level package managers such as &lt;code&gt;apt&lt;/code&gt;. Conda has its own package repository, separate from PyPI, which hosts binaries for each OS. Because the same package (OpenCV, for example) can be registered in the repository by multiple users, you need to check which one is the best choice.&lt;/p&gt;
&lt;p&gt;Many machine learning books recommend conda, but I think it is better not to rely on it much outside Windows.&lt;/p&gt;
&lt;p&gt;The reasons are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;As of 2017, wheel is the de facto standard binary package format, so conda’s original purpose, installing scientific packages such as NumPy and SciPy, can be achieved without conda.&lt;/li&gt;
&lt;li&gt;conda shadows system commands such as openssl, curl, and python on macOS and Linux (strictly speaking, conda puts its own directory first in PATH)
[&lt;a href="https://github.com/ContinuumIO/anaconda-issues/issues/1119"&gt;issue&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;Package developers are often not conda users, so they end up being asked for support in an environment they do not normally use, much like being asked about JRuby or Rubinius (or Windows-specific trouble).&lt;/li&gt;
&lt;li&gt;In the conda world, it is difficult to pass along the information needed to build a native extension (such as a Cython dependency).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So I recommend conda for Windows users, or for people who do not develop heavily but want to try machine learning. Alternatively, put Miniconda under pyenv control. I use conda inside a Docker environment.&lt;/p&gt;
&lt;p&gt;However, on Windows you cannot install packages such as SciPy via &lt;code&gt;pip install&lt;/code&gt;; you need to download the wheel yourself. On this point, I honestly think conda is better.&lt;/p&gt;
&lt;p&gt;The historical details are described in
. In short, conda was created because the old binary format, egg, was not good enough.&lt;/p&gt;
&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;I introduced how to install Python and how to manage Python packages. I think virtualenv/venv is enough to manage Python packages without conda, but conda is a good fit when you need to package an environment together with its system libraries.&lt;/p&gt;
&lt;h3 id="references"&gt;References&lt;/h3&gt;
&lt;p&gt;Original Japanese document:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Why OSS based machine learning is good?</title><link>https://chezo.uno/blog/2017-08-03_why-oss-based-machine-learning-is-good--3ab45a1a5e52/</link><pubDate>Wed, 02 Aug 2017 20:56:59 -0700</pubDate><guid>https://chezo.uno/blog/2017-08-03_why-oss-based-machine-learning-is-good--3ab45a1a5e52/</guid><description>&lt;p&gt;&lt;em&gt;This article is translation of&lt;/em&gt;
&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Since the release of TensorFlow, the move toward OSS-based machine learning has been accelerating.
, the creator of Keras, captured the essential point of this change. I think his words say enough, but in this article I would like to lay out why open source machine learning is great and what the recent trends are.&lt;/p&gt;
&lt;h3 id="tldr"&gt;tl;dr&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Machine learning and deep learning frameworks have become standard tools for software engineers.&lt;/li&gt;
&lt;li&gt;Since arXiv became widely known, many papers are published before they go through peer review at international conferences. This has made it easier for other companies to validate the algorithms.&lt;/li&gt;
&lt;li&gt;Many researchers have started to study machine learning, so machine learning research in academia has become a red ocean.&lt;/li&gt;
&lt;li&gt;The strategy of “make a great algorithm, but keep the implementation secret” has become a thing of the past.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="halcyon-days"&gt;Halcyon days&lt;/h3&gt;
&lt;p&gt;Five or ten years ago, almost everyone working on advanced machine learning was in a laboratory at a university or a large enterprise, or at a handful of advanced companies. Labeled data was scarcer than it is today, and many researchers improved performance through algorithm research and feature engineering.&lt;/p&gt;
&lt;p&gt;Many researchers in academia studied state-of-the-art machine learning and submitted their work to international conferences. Most insights were shared only after peer review. Implementations were not shared as much as they are now, so each researcher had to reimplement prior work from scratch. A typical cycle for releasing a new algorithm was half a year, and in some cases more than a year.&lt;/p&gt;
&lt;p&gt;There were only a few open source machine learning libraries and frameworks, such as Weka. scikit-learn,
, was not well known among software engineers. Many of us used libraries that implemented one or a few algorithms, such as libsvm and liblinear.&lt;/p&gt;
&lt;h3 id="fast-movingera"&gt;Fast moving era&lt;/h3&gt;
&lt;p&gt;As of 2017, the number of people working in machine learning has increased significantly compared with 10 years ago. The center of machine learning has moved from academia to companies with large amounts of data. In particular, software engineers who have never worked on machine learning are entering the deep learning world. I was surprised to hear that a friend from the community, who had never worked on machine learning professionally, had started working on Deep Learning. The reasons for this shift are that 1) it has become common for companies to store large amounts of data that can be used for machine learning, &lt;br&gt;
2) the number of excellent machine learning frameworks has increased, and 3) GPU power makes Deep Learning computation efficient.&lt;/p&gt;
&lt;p&gt;Many open source libraries have become popular: not only Deep Learning frameworks such as TensorFlow, Chainer, MXNet, Caffe2, and PyTorch, but also XGBoost and LightGBM, which are famous among Kagglers. scikit-learn is also a common tool for experimenting with multiple algorithms.&lt;/p&gt;
&lt;h3 id="the-rise-of-openpapers"&gt;The rise of “open papers”&lt;/h3&gt;
&lt;p&gt;This movement is supported by the machine learning competition site “Kaggle” and by “arXiv”, a place to post open papers. (There is debate over whether documents on arXiv can be called research papers, since arXiv has no peer review process and quality is not assured. In this post, however, I will call these research-paper-style reports “papers”.)&lt;/p&gt;
&lt;p&gt;The following article reports the number of machine learning papers (especially Deep Learning) submitted to arXiv. According to the article, the number of machine learning papers in 2017 has more than quadrupled compared with five years earlier.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;New papers are posted to arXiv every day. This means that state-of-the-art results from companies such as Google, Facebook, and Microsoft are increasingly published before peer review. That makes it hard for the central laboratories of traditional large enterprises, which usually set targets on a yearly or half-yearly basis, to research and develop cutting-edge machine learning algorithms themselves. Some criticize recent papers as “just adding parts”, but it is clear that machine learning algorithms are being developed significantly faster than before.&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;In the field of machine translation, the breakthrough in deep learning was encoder-decoder and attention. The subsequent papers are not interesting, “I just put existing parts here.” I can’t understand why these papers come to the top conference.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Recently, for those who read new arXiv papers day and night, there is a system called “arXiv Times” for keeping up with newly arrived papers.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;h3 id="open-papers-accelerates-open-source-machinelearning"&gt;Open papers accelerate open source machine learning&lt;/h3&gt;
&lt;p&gt;This March, a paper on “Deep Forest” was published on arXiv, and it became a hot topic because the authors claimed that its “performance is better than Deep Learning”.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;About one week after the paper was published (2017/2/28), an R implementation of the proposed method appeared (2017/3/5), followed by a Python implementation. A discussion took place in the following LightGBM issue on GitHub, and it turned out that the paper’s results could not be reproduced: the participants could not confirm the claimed performance.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;It is a symbolic event: an OSS implementation of the paper &lt;br&gt;
appeared within a week of its publication on arXiv, and a community discussion began.&lt;/p&gt;
&lt;p&gt;I hear that a growing number of international conferences require authors to disclose their implementation when submitting a paper.&lt;/p&gt;
&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Developing machine learning algorithms is an essential task. Thanks to open papers, ML competition websites, and fast OSS implementations of new algorithms, we can rapidly bring state-of-the-art knowledge into business.&lt;/p&gt;
&lt;p&gt;IMHO, it is becoming more fun to focus on where we can make use of ML in business than on developing the algorithms themselves.&lt;/p&gt;
&lt;p&gt;In other words, it is now very hard to claim “special machine learning algorithms that only our company can build”. Of course, people in academia will keep pushing cutting-edge work as long as they can prepare data. But what evidence is there that a single company can invent a better algorithm faster than the leading researchers at tech giants such as Google, Facebook, and Microsoft? That is why open source based machine learning is so strong.&lt;/p&gt;
&lt;p&gt;In academia, there is a famous phrase, “
”, which means that we can move on to the next step only by building on, and being grateful for, previous research. This holds for open source based machine learning too. We cannot ignore giants.&lt;/p&gt;</description></item><item><title>How to run Cloudera Director on your macOS/Windows 10</title><link>https://chezo.uno/blog/2017-08-02_how-to-run-cloudera-director-on-your-macos-windows-10-710f82aa1d63/</link><pubDate>Tue, 01 Aug 2017 20:12:31 -0700</pubDate><guid>https://chezo.uno/blog/2017-08-02_how-to-run-cloudera-director-on-your-macos-windows-10-710f82aa1d63/</guid><description>&lt;p&gt;Cloudera Director is a provisioning tool for CDH and Cloudera Enterprise. You can launch a cluster with the Web GUI or the CLI tool. With the Cloudera Director CLI tool, you manage your cluster through a configuration file, which lets you keep your configuration in git. In this article, I will show how to install Cloudera Director on your local macOS or Windows 10 machine.&lt;/p&gt;
&lt;p&gt;For how to use Cloudera Director, see the documentation.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;h3 id="install-cloudera-director-on-you-macos-withhomebrew"&gt;Install Cloudera Director on your macOS with Homebrew&lt;/h3&gt;
&lt;p&gt;If you are a Homebrew user, you can install Cloudera Director easily.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;$ brew tap chezou/cloudera&lt;br&gt;
$ brew install cloudera-director-server&lt;/p&gt;
&lt;p&gt;Then, you can launch/terminate Cloudera Director as follows:&lt;/p&gt;
&lt;p&gt;# Start Cloudera Director Server in the background&lt;br&gt;
$ cloudera-director-server-start&lt;br&gt;
# After the Director server has launched, you can open http://localhost:7189/&lt;/p&gt;
&lt;p&gt;# Stop the Cloudera Director Server running in the background&lt;br&gt;
$ cloudera-director-server-stop&lt;/p&gt;
&lt;h3 id="install-cloudera-director-on-you-windows10"&gt;Install Cloudera Director on your Windows 10&lt;/h3&gt;
&lt;p&gt;If you are a Windows 10 user, you can install Ubuntu as the
.&lt;/p&gt;
&lt;p&gt;Launch Bash on Windows, then run the following:&lt;/p&gt;
&lt;p&gt;Make sure to use the IP address of Ubuntu, not that of Windows.&lt;/p&gt;
&lt;h3 id="use-dockerimage"&gt;Use Docker image&lt;/h3&gt;
&lt;p&gt;If you don’t want to install it directly on your machine, you can use a Docker image of Cloudera Director.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;After installing Docker, run the following commands to launch Director.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ git clone https://github.com/tsuyo/cloudera-boot
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ cd cloudera-boot
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ . bin/cloudera-boot.sh # load several functions/aliases
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ cb-build # may take a while
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# set your secrets
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can launch a Director server or use client as well. To get further information, see also README.md.&lt;/p&gt;</description></item><item><title>Simple way to distribute your private Python packages within your organization</title><link>https://chezo.uno/blog/2017-07-24_simple-way-to-distribute-your-private-python-packages-within-your-organization-fb7af5dbd4c9/</link><pubDate>Sun, 23 Jul 2017 09:21:40 -0700</pubDate><guid>https://chezo.uno/blog/2017-07-24_simple-way-to-distribute-your-private-python-packages-within-your-organization-fb7af5dbd4c9/</guid><description>&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2017-07-24_simple-way-to-distribute-your-private-python-packages-within-your-organization-fb7af5dbd4c9/0_YSlLMz01REAp_q_y.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;&lt;a href="https://www.irasutoya.com/2017/05/blog-post_22.html"&gt;https://www.irasutoya.com/2017/05/blog-post_22.html&lt;/a&gt;&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This article is a translation of&lt;/em&gt;
&lt;em&gt;, originally written by&lt;/em&gt;
&lt;em&gt;in Japanese. I translated it with his permission. This article introduces simple ways to set up an internal Python package host like a&lt;/em&gt;
&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id="methods"&gt;Methods&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Include your packages in your git repository&lt;/li&gt;
&lt;li&gt;Publish a directory including your packages via HTTP server&lt;/li&gt;
&lt;li&gt;Build a local PyPI-equivalent server&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Building a local PyPI-equivalent server is costly &lt;em&gt;(translator note: like&lt;/em&gt;
&lt;em&gt;)&lt;/em&gt;, and I don’t think it is necessary, so I will describe only the first two options.&lt;/p&gt;
&lt;h4 id="include-your-packages-in-your-git-repository"&gt;Include your packages in your Git repository&lt;/h4&gt;
&lt;p&gt;If your packages are required only by a particular project, the straightforward approach is to include them in the Git repository. You can put them in a directory named &lt;code&gt;wheelhouse&lt;/code&gt;, a name that comes from the former default output directory of &lt;code&gt;pip wheel&lt;/code&gt;. (&lt;em&gt;translator note: this method assumes you know about wheels. If not,&lt;/em&gt;
&lt;em&gt;and&lt;/em&gt;
&lt;em&gt;would be helpful.&lt;/em&gt;) If you put the private package &lt;code&gt;foo&lt;/code&gt; in &lt;code&gt;wheelhouse&lt;/code&gt;, you can install it as follows:&lt;/p&gt;
&lt;p&gt;$ pip install foo -f wheelhouse&lt;/p&gt;
&lt;p&gt;Note that &lt;code&gt;-f&lt;/code&gt; is the short form of &lt;code&gt;--find-links&lt;/code&gt;. With this option, pip searches for packages in the given directory first, then falls back to PyPI.&lt;/p&gt;
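&lt;p&gt;The same install can also be driven from a script. What follows is a minimal sketch rather than anything from the original article: the helper names &lt;code&gt;build_pip_install_command&lt;/code&gt; and &lt;code&gt;install&lt;/code&gt; are hypothetical, &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;wheelhouse&lt;/code&gt; are the example names used above, and only the standard pip options &lt;code&gt;--find-links&lt;/code&gt; and &lt;code&gt;--no-index&lt;/code&gt; are assumed.&lt;/p&gt;

```python
import subprocess
import sys


def build_pip_install_command(package, find_links, no_index=False):
    """Build the argv for `pip install PACKAGE --find-links FIND_LINKS`.

    `find_links` may be a local directory such as "wheelhouse" or a URL.
    This helper is hypothetical; it only assembles standard pip options.
    """
    command = [sys.executable, "-m", "pip", "install", package,
               "--find-links", find_links]
    if no_index:
        # --no-index stops pip from consulting PyPI at all.
        command.append("--no-index")
    return command


def install(package, find_links, no_index=False):
    """Run pip in a subprocess and raise CalledProcessError on failure."""
    subprocess.run(build_pip_install_command(package, find_links, no_index),
                   check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_pip_install_command("foo", "wheelhouse")))
```

&lt;p&gt;Passing &lt;code&gt;no_index=True&lt;/code&gt; is one way to make sure a private package is never resolved from PyPI by accident.&lt;/p&gt;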
&lt;h4 id="publish-a-directory-including-your-packages-via-httpserver"&gt;Publish a directory including your packages via HTTP server&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;--find-links&lt;/code&gt; option can search not only a local directory but also a remote server over &lt;code&gt;http&lt;/code&gt;. If you have a package used by multiple projects, this method will help you.&lt;/p&gt;
&lt;p&gt;The easiest way to distribute your packages with this method is to run &lt;code&gt;python -m http.server&lt;/code&gt; with Python 3.x (or &lt;code&gt;python -m SimpleHTTPServer&lt;/code&gt; with Python 2.7) in the &lt;code&gt;wheelhouse&lt;/code&gt; directory. This simple server provides directory listings, so you can point &lt;code&gt;--find-links&lt;/code&gt; at it directly. Open &lt;code&gt;http://localhost:8000&lt;/code&gt; in a web browser and make sure you can see the list of files under the &lt;code&gt;wheelhouse&lt;/code&gt; directory.&lt;/p&gt;
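&lt;p&gt;For a quick check without a browser, the same kind of server can be started from Python. This is only a sketch under a few assumptions: the helper names &lt;code&gt;serve_directory&lt;/code&gt; and &lt;code&gt;fetch_listing&lt;/code&gt; are hypothetical, Python 3.7 or later is assumed (for the &lt;code&gt;directory&lt;/code&gt; argument and &lt;code&gt;ThreadingHTTPServer&lt;/code&gt;), and the server binds to localhost only.&lt;/p&gt;

```python
import functools
import http.server
import threading
import urllib.request


def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP with directory listings, much like
    running `python -m http.server` inside that directory."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Serve in a daemon thread so the caller keeps control.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_listing(server):
    """Return the HTML directory listing that pip scans for links."""
    host, port = server.server_address[:2]
    # An empty ProxyHandler bypasses any HTTP proxy configured in the
    # environment, since the server only listens on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://{host}:{port}/") as response:
        return response.read().decode("utf-8", errors="replace")
```

&lt;p&gt;For example, &lt;code&gt;server = serve_directory(&amp;#34;wheelhouse&amp;#34;)&lt;/code&gt; followed by &lt;code&gt;print(fetch_listing(server))&lt;/code&gt; should print an HTML page that links to each wheel file; call &lt;code&gt;server.shutdown()&lt;/code&gt; when you are done.&lt;/p&gt;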
&lt;p&gt;To install the &lt;code&gt;foo&lt;/code&gt; package from the HTTP server you launched, run:&lt;/p&gt;
&lt;p&gt;$ pip install foo -f http://localhost:8000&lt;/p&gt;
&lt;p&gt;Since this is only a simple server, for production it is better to put the packages in cloud storage such as AWS S3 (check how to provide directory listings there), or to use Apache with directory listings enabled (&lt;code&gt;Options +Indexes&lt;/code&gt;).&lt;/p&gt;
&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;I recommend these methods because they are simple and do not require a dedicated application server.&lt;/p&gt;</description></item><item><title>tabula-py now able to extract remote PDF and multiple tables at once</title><link>https://chezo.uno/blog/2017-05-28_tabula-py-now-able-to-extract-remote-pdf-and-multiple-tables-at-once-6108e24ac07c/</link><pubDate>Sat, 27 May 2017 19:18:39 -0700</pubDate><guid>https://chezo.uno/blog/2017-05-28_tabula-py-now-able-to-extract-remote-pdf-and-multiple-tables-at-once-6108e24ac07c/</guid><description>
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;(Note: Oct 7th, 2019)
As of Oct. 2019, I launched
and
for tabula-py. The FAQ is a good place to start if you want accurate extraction.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;tabula-py is a Python library that lets you extract tables from PDFs into pandas DataFrames. Today, I released v0.8.0. In this post, I will introduce the improvements made since my previous post on tabula-py. If you are not familiar with tabula-py, see the previous post.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/" data-iframely-url="//iframely.net/WEoEyU7"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;h3 id="change-notes"&gt;Change Notes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Able to read a remote PDF by passing its URL&lt;/li&gt;
&lt;li&gt;
[Experimental] Add &lt;code&gt;multiple_tables&lt;/code&gt; mode&lt;/li&gt;
&lt;li&gt;Add batch conversion method: &lt;code&gt;convert_into_by_batch()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;encoding&lt;/code&gt; option&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;java_options&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Will deprecate &lt;code&gt;read_pdf_table()&lt;/code&gt; method&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I will explain the important features.&lt;/p&gt;
&lt;h4 id="read-remote-pdf-passingurl"&gt;Read remote PDF passing URL&lt;/h4&gt;
&lt;p&gt;If you want to extract a DataFrame from a PDF on the internet, you can read the remote PDF directly without downloading it manually.&lt;/p&gt;
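&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tabula&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;read_pdf&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;read_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/12s0324.pdf&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;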
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="add-multiple_tables-mode"&gt;[Experimental] Add &amp;ldquo;&lt;code&gt;multiple_tables&lt;/code&gt;&amp;rdquo; mode&lt;/h4&gt;
&lt;p&gt;Because tabula-py is a simple wrapper of tabula-java, it used to be hard to handle multiple tables on a page. Now you can extract multiple tables on a page using the &lt;code&gt;multiple_tables&lt;/code&gt; option.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;read_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tests/resources/data.pdf&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;multiple_tables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This mode creates a list of DataFrames from tabula-java’s JSON output, so if tabula-java’s JSON format changes, the output could break. If you see &lt;code&gt;CParserError&lt;/code&gt;, try setting the &lt;code&gt;multiple_tables&lt;/code&gt; option.&lt;/p&gt;
&lt;h4 id="add-batch-conversion-method-convert_into_by_batch"&gt;Add batch conversion method: &amp;ldquo;&lt;code&gt;convert_into_by_batch()&lt;/code&gt;&amp;rdquo;&lt;/h4&gt;
&lt;p&gt;Since tabula-java v0.9.2, tables can be extracted from PDFs in batch. You can use this feature through the &lt;code&gt;convert_into_by_batch()&lt;/code&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-py" data-lang="py"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;convert_into_by_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_to_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Pass the path of the directory that contains the PDFs, not the path of a specific PDF.&lt;/p&gt;
&lt;p&gt;tabula-py writes the extracted tables to the same directory as the input files.&lt;/p&gt;
&lt;h3 id="todos"&gt;TODOs&lt;/h3&gt;
&lt;p&gt;Several problems may be fixed once tabula-java 0.9.3 is released, e.g. handling of embedded fonts, including Japanese…&lt;/p&gt;
&lt;h3 id="waiting-for-your-collaboration"&gt;Waiting for your collaboration!&lt;/h3&gt;
&lt;p&gt;If you have any trouble with tabula-py, please file
. I would rather not receive emails, because the answers cannot be shared with other people. Make sure to fill in
, which greatly reduces the effort it takes me to solve the problem.&lt;/p&gt;
&lt;h4 id="other-tabula-py-articles"&gt;Other tabula-py articles&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>An easy way to get URL list of your Medium publication</title><link>https://chezo.uno/blog/2017-05-02_an-easy-way-to-get-url-list-of-your-medium-publication-c60c61244101/</link><pubDate>Mon, 01 May 2017 19:01:01 -0700</pubDate><guid>https://chezo.uno/blog/2017-05-02_an-easy-way-to-get-url-list-of-your-medium-publication-c60c61244101/</guid><description>&lt;p&gt;I imported blog posts from my own WordPress site, but I have to redirect the old articles to Medium manually. There is a WordPress plugin that redirects articles, but it requires a URL mapping in CSV format. When you want to get a Medium publication’s URL list, you may use
, but officially, it
. We need
, but I couldn’t get the post list. In this article, I will show you an easy way to get the URL list of your Medium publication.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note: I tried this method on a publication with fewer than 150 articles. It might not work with a huge number of articles.&lt;/strong&gt;&lt;/p&gt;
&lt;h4 id="how-to"&gt;How-to&lt;/h4&gt;
&lt;p&gt;You can use the following Python script after displaying all the articles by accessing &lt;code&gt;/latest&lt;/code&gt; of a publication. For example, after opening
, you can get whole contents with scrolling down and down and down…&lt;/p&gt;</description></item><item><title>sparkavro: Manupilate Apache Avro file with sparklyr</title><link>https://chezo.uno/blog/2017-03-26_sparkavro--manupilate-apache-avro-file-with-sparklyr-a53c61eaf0b0/</link><pubDate>Sun, 26 Mar 2017 05:02:01 -0700</pubDate><guid>https://chezo.uno/blog/2017-03-26_sparkavro--manupilate-apache-avro-file-with-sparklyr-a53c61eaf0b0/</guid><description>&lt;p&gt;I created a simple
extension to handle Apache Avro files. It is just a simple wrapper of Databricks’
. It is listed in
.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;h3 id="installation"&gt;Installation&lt;/h3&gt;
&lt;p&gt;Use &lt;code&gt;{devtools}&lt;/code&gt; to install sparkavro.&lt;/p&gt;
&lt;p&gt;devtools::install_github(&amp;#34;chezou/avrospark&amp;#34;)&lt;/p&gt;
&lt;h3 id="simple-usage"&gt;Simple usage&lt;/h3&gt;
&lt;p&gt;You can read and write Avro files as follows:&lt;/p&gt;
&lt;p&gt;library(sparklyr)&lt;br&gt;
library(sparkavro)&lt;br&gt;
sc &amp;lt;- spark_connect(master = &amp;#34;spark://HOST:PORT&amp;#34;)&lt;br&gt;
df &amp;lt;- spark_read_avro(sc, &amp;#34;test_table&amp;#34;, &amp;#34;/user/foo/test.avro&amp;#34;)&lt;br&gt;
spark_write_avro(df, &amp;#34;/tmp/output&amp;#34;)&lt;/p&gt;
&lt;p&gt;This is the very first version, so there might be bugs, especially around options. If you find a bug, please report it on the
.&lt;/p&gt;</description></item><item><title>How to connect secure Impala cluster from RStudio on macOS with implyr</title><link>https://chezo.uno/blog/2017-03-26_how-to-connect-secure-impala-cluster-from-rstudio-on-macos-with-implyr-213c6536e4c7/</link><pubDate>Sat, 25 Mar 2017 14:35:45 -0700</pubDate><guid>https://chezo.uno/blog/2017-03-26_how-to-connect-secure-impala-cluster-from-rstudio-on-macos-with-implyr-213c6536e4c7/</guid><description>&lt;p&gt;Impala is very fast SQL-on-Hadoop, and it will enhance your R experience with
, a
based interface for
created by
. I will show you how to set up a connection to a Kerberized Impala cluster with implyr from a local macOS machine. You can find my GitHub repo below:&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;h3 id="setting-up-odbc-environment-formacos"&gt;Setting up ODBC environment for macOS&lt;/h3&gt;
&lt;h4 id="install-unixodbc-withhomebrew"&gt;Install unixODBC with homebrew&lt;/h4&gt;
&lt;p&gt;First, we will install
to handle Impala with ODBC. In the R world, ODBC is the preferred way to connect to Impala because of its performance and compatibility. Let’s install unixODBC with Homebrew.&lt;/p&gt;
&lt;p&gt;$ brew install unixodbc&lt;/p&gt;
&lt;h4 id="download-and-install-the-latest-version-of-the-impala-odbc-driver-fromcloudera"&gt;Download and install the latest version of the Impala ODBC driver from Cloudera&lt;/h4&gt;
&lt;p&gt;You can download
.&lt;/p&gt;
&lt;h4 id="configure-yourodbcini-andodbcinstini"&gt;Configure your .odbc.ini and .odbcinst.ini&lt;/h4&gt;
&lt;p&gt;After installing the Impala ODBC driver for macOS, you can find basic configuration templates in &lt;code&gt;/opt/cloudera/impalaodbc/Setup/&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;cp /opt/cloudera/impalaodbc/Setup/odbc.ini ~/.odbc.ini&lt;br&gt;
cp /opt/cloudera/impalaodbc/Setup/odbcinst.ini ~/.odbcinst.ini&lt;/p&gt;
&lt;p&gt;Before using the following settings, you must replace &lt;code&gt;HOST&lt;/code&gt; and &lt;code&gt;KrbRealm&lt;/code&gt; with appropriate values. Modify your &lt;code&gt;.odbc.ini&lt;/code&gt; as follows:&lt;/p&gt;
&lt;p&gt;[ODBC]&lt;br&gt;
# Specify any global ODBC configuration here such as ODBC tracing.&lt;/p&gt;
&lt;p&gt;[ODBC Data Sources]&lt;br&gt;
Impala=Cloudera ODBC Driver for Impala&lt;/p&gt;
&lt;p&gt;[Impala]&lt;br&gt;
# Description: DSN Description.&lt;br&gt;
# This key is not necessary and is only to give a description of the data source.&lt;br&gt;
Description=Cloudera Impala ODBC Driver DSN&lt;/p&gt;
&lt;p&gt;# Driver: The location where the ODBC driver is installed to.&lt;br&gt;
Driver=/opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib&lt;/p&gt;
&lt;p&gt;# The DriverUnicodeEncoding setting is only used for SimbaDM&lt;br&gt;
# When set to 1, SimbaDM runs in UTF-16 mode.&lt;br&gt;
# When set to 2, SimbaDM runs in UTF-8 mode.&lt;br&gt;
#DriverUnicodeEncoding=2&lt;/p&gt;
&lt;p&gt;# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.&lt;br&gt;
# They can also be specified on the connection string.&lt;br&gt;
HOST=[REPLACE_YOUR_IMPALA_HOST]&lt;br&gt;
PORT=21050&lt;br&gt;
Schema=default&lt;/p&gt;
&lt;p&gt;# The authentication mechanism.&lt;br&gt;
# 0 — No authentication (NOSASL)&lt;br&gt;
# 1 — Kerberos authentication (SASL)&lt;br&gt;
# 2 — Username authentication (SASL)&lt;br&gt;
# 3 — Username/password authentication (NOSASL or SASL depending on UseSASL configuration)&lt;br&gt;
AuthMech=1&lt;/p&gt;
&lt;p&gt;# Set to 1 to use SASL for authentication.&lt;br&gt;
# Set to 0 to not use SASL.&lt;br&gt;
# When using Kerberos authentication (SASL) or Username authentication (SASL) SASL is always used&lt;br&gt;
# and this configuration is ignored. SASL is always not used for No authentication (NOSASL).&lt;br&gt;
UseSASL=1&lt;/p&gt;
&lt;p&gt;# Kerberos related settings.&lt;br&gt;
KrbFQDN=_HOST&lt;br&gt;
KrbRealm=[REPLACE_YOUR_REALM]&lt;br&gt;
KrbServiceName=impala&lt;/p&gt;
&lt;p&gt;# Username/password authentication with SASL settings.&lt;br&gt;
UID=&lt;br&gt;
PWD=&lt;/p&gt;
&lt;p&gt;# Set to 0 to disable SSL.&lt;br&gt;
# Set to 1 to enable SSL.&lt;br&gt;
SSL=1&lt;br&gt;
CAIssuedCertNamesMismatch=1&lt;br&gt;
TrustedCerts=/opt/cloudera/impalaodbc/lib/universal/cacerts.pem&lt;/p&gt;
&lt;p&gt;# If you use SSL with AllowSelfSignedServerCert, you can set this configuration.&lt;br&gt;
#AllowSelfSignedServerCert=1&lt;/p&gt;
&lt;p&gt;# Specify the proxy user ID to use.&lt;br&gt;
#DelegationUID=&lt;/p&gt;
&lt;p&gt;# General settings&lt;br&gt;
TSaslTransportBufSize=1000&lt;br&gt;
RowsFetchedPerBlock=10000&lt;br&gt;
SocketTimeout=0&lt;br&gt;
StringColumnLength=32767&lt;br&gt;
UseNativeQuery=0&lt;/p&gt;
&lt;p&gt;After setting up &lt;code&gt;.odbc.ini&lt;/code&gt;, your application can refer to these settings by the DSN name, &lt;code&gt;Impala&lt;/code&gt; in this case.&lt;/p&gt;
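&lt;p&gt;Before testing the connection, it can help to confirm that the DSN section defines the values that the comments above ask for (HOST, PORT, KrbFQDN, and KrbServiceName). The following is a minimal Python sketch, not part of the original setup: the sample section, the &lt;code&gt;missing_keys&lt;/code&gt; helper, and the example host and realm are all hypothetical, and it assumes your &lt;code&gt;.odbc.ini&lt;/code&gt; is simple enough for Python’s &lt;code&gt;configparser&lt;/code&gt; to read.&lt;/p&gt;

```python
import configparser

# Hypothetical DSN section mirroring the settings above; the host and realm
# values are placeholders, not real endpoints.
SAMPLE_ODBC_INI = """
[Impala]
HOST=impala.example.com
PORT=21050
AuthMech=1
KrbFQDN=_HOST
KrbRealm=EXAMPLE.COM
KrbServiceName=impala
"""

# Keys that the config comments say should be set for the DSN.
REQUIRED_KEYS = ("HOST", "PORT", "KrbFQDN", "KrbServiceName")


def missing_keys(ini_text, dsn="Impala"):
    """Return the required keys that are absent or empty in the DSN section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(dsn):
        return list(REQUIRED_KEYS)
    return [
        key for key in REQUIRED_KEYS
        if not parser.get(dsn, key, fallback="").strip()
    ]


print(missing_keys(SAMPLE_ODBC_INI))
```

&lt;p&gt;An empty list means every required key has a value. To check a real file, you could pass the text of &lt;code&gt;~/.odbc.ini&lt;/code&gt; instead of the sample string.&lt;/p&gt;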
&lt;h4 id="check-the-configuration"&gt;Check the configuration&lt;/h4&gt;
&lt;p&gt;After configuration, obtain a Kerberos ticket for your principal with kinit.&lt;/p&gt;
&lt;p&gt;$ kinit $USER@YOUR_REALM&lt;/p&gt;
&lt;p&gt;Replace &lt;code&gt;$USER&lt;/code&gt; and &lt;code&gt;YOUR_REALM&lt;/code&gt; with your own user name and realm.&lt;/p&gt;
&lt;p&gt;Before using RStudio on your Mac, you can check the configuration with the &lt;code&gt;isql&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;$ isql -v "Impala"&lt;br&gt;
+---------------------------------------+&lt;br&gt;
| Connected! |&lt;br&gt;
| |&lt;br&gt;
| sql-statement |&lt;br&gt;
| help [tablename] |&lt;br&gt;
| quit |&lt;br&gt;
| |&lt;br&gt;
+---------------------------------------+&lt;br&gt;
SQL&amp;gt;&lt;/p&gt;
&lt;h3 id="implyr-example"&gt;Implyr Example&lt;/h3&gt;
&lt;p&gt;After setting up &lt;code&gt;.odbc.ini&lt;/code&gt;, you can connect to a secure Impala cluster with &lt;code&gt;{implyr}&lt;/code&gt;. As an example, we will visualize the airports data.&lt;/p&gt;
&lt;p&gt;First, install R packages.&lt;/p&gt;
&lt;p&gt;install.packages(c("implyr", "odbc", "DBI", "dplyr", "ggplot2", "ggExtra"))&lt;/p&gt;
&lt;p&gt;Then, connect the Impala cluster.&lt;/p&gt;
&lt;p&gt;library(implyr)&lt;br&gt;
library(odbc)&lt;br&gt;
drv &amp;lt;- odbc::odbc()&lt;br&gt;
impala &amp;lt;- src_impala(&lt;br&gt;
drv = drv,&lt;br&gt;
dsn = "Impala"&lt;br&gt;
)&lt;/p&gt;
&lt;p&gt;If your &lt;code&gt;.odbc.ini&lt;/code&gt; is configured properly, you can connect to the Impala cluster.&lt;/p&gt;
&lt;p&gt;Let’s visualize the airports data. In this case, we assume the data is in the &lt;code&gt;u_ariga&lt;/code&gt; database, so we switch databases with the SQL statement &lt;code&gt;use u_ariga&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;library(DBI)&lt;br&gt;
# Change database&lt;br&gt;
dbExecute(impala, "use u_ariga")&lt;br&gt;
dbGetQuery(impala, "show tables")&lt;br&gt;
airports &amp;lt;- tbl(impala, "airports_pq")&lt;/p&gt;
&lt;p&gt;# Show the head of airports data&lt;br&gt;
View(airports)&lt;/p&gt;
&lt;p&gt;airports %&amp;gt;% filter(latitude &amp;lt; 35) %&amp;gt;% count()&lt;br&gt;
#903&lt;/p&gt;
&lt;p&gt;Finally, we will show a scatter plot of longitude and latitude with marginal histograms.&lt;/p&gt;
&lt;p&gt;airports_by_geo &amp;lt;- airports %&amp;gt;% select(longitude, latitude) %&amp;gt;% collect()&lt;/p&gt;
&lt;p&gt;library(ggplot2)&lt;/p&gt;
&lt;p&gt;p &amp;lt;- ggplot(airports_by_geo, aes(longitude, latitude)) + geom_point() + theme_classic()&lt;br&gt;
ggExtra::ggMarginal(p, type = "histogram")&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;&lt;img src="1_SscW2sneYR_lphETyF7y1A.png" alt="" loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;{implyr}&lt;/code&gt; is a great package for using Impala with dplyr, but it is still a young project. If you find a problem, why don’t you post it to
?&lt;/p&gt;</description></item><item><title>Visualize your massive data with Impala and Redash</title><link>https://chezo.uno/blog/2017-02-11_visualize-your-massive-data-with-impala-and-redash-afe31133c644/</link><pubDate>Fri, 10 Feb 2017 21:14:44 -0800</pubDate><guid>https://chezo.uno/blog/2017-02-11_visualize-your-massive-data-with-impala-and-redash-afe31133c644/</guid><description>&lt;p&gt;
Redash is a popular open-source visualization tool that lets you visualize your data with SQL. It supports
Impala, a fast SQL-on-Hadoop engine suitable for BI tools and exploratory analysis. With Impala, you can
.&lt;/p&gt;
&lt;p&gt;In this post, we connect to Impala from Redash and visualize data.&lt;/p&gt;
&lt;h3 id="set-upredash"&gt;Set up Redash&lt;/h3&gt;
&lt;p&gt;You can set up Redash in various ways. This time, I use
. Then you can log in from your browser with admin/admin.&lt;/p&gt;
&lt;h3 id="add-data-source-ofimpala"&gt;Add Data Source of Impala&lt;/h3&gt;
&lt;p&gt;After clicking the Database icon, you can add data sources.&lt;/p&gt;
&lt;p&gt;This time, I set configurations as follows:&lt;/p&gt;
&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2017-02-11_visualize-your-massive-data-with-impala-and-redash-afe31133c644/1_gMPHyBohg3nZKTDxtm_b_w.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Example configuration&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type: Impala&lt;/li&gt;
&lt;li&gt;Database: default&lt;/li&gt;
&lt;li&gt;Host: hostname of Impala daemon&lt;/li&gt;
&lt;li&gt;Ldap_password/user: (empty)&lt;/li&gt;
&lt;li&gt;Port: 21050 (default port)&lt;/li&gt;
&lt;li&gt;Please specify beeswax or hiveserver2: hiveserver2&lt;/li&gt;
&lt;li&gt;Timeout: 3600&lt;/li&gt;
&lt;li&gt;Use_ldap: (empty)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now, you can select Impala as a data source.&lt;/p&gt;
&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2017-02-11_visualize-your-massive-data-with-impala-and-redash-afe31133c644/1_Kk90BhI7L42fmIXPAn_mgg.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;Result of Impala query&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;</description></item><item><title>tabula-py: Extract table from PDF into Python DataFrame</title><link>https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/</link><pubDate>Sun, 08 Jan 2017 21:09:08 -0800</pubDate><guid>https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/</guid><description>
&lt;div class="callout flex px-4 py-3 mb-6 rounded-md border-l-4 bg-blue-100 dark:bg-blue-900 border-blue-500"
data-callout="note"
data-callout-metadata=""&gt;
&lt;span class="callout-icon pr-3 pt-1 text-blue-600 dark:text-blue-300"&gt;
&lt;svg height="24" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"&gt;&lt;path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="m16.862 4.487l1.687-1.688a1.875 1.875 0 1 1 2.652 2.652L6.832 19.82a4.5 4.5 0 0 1-1.897 1.13l-2.685.8l.8-2.685a4.5 4.5 0 0 1 1.13-1.897zm0 0L19.5 7.125"/&gt;&lt;/svg&gt;
&lt;/span&gt;
&lt;div class="callout-content dark:text-neutral-300"&gt;
&lt;div class="callout-title font-semibold mb-1"&gt;Note&lt;/div&gt;
&lt;div class="callout-body"&gt;&lt;p&gt;(Oct 7th, 2019)
As of Oct. 2019, I launched a
and
for tabula-py. The FAQ is a good place to learn how to extract tables accurately.&lt;/p&gt;
&lt;p&gt;Screenshots in this article are based on the old interface. See the Colab notebook for an example with the latest version.&lt;/p&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Today, I released tabula-py 0.3.0, which extracts tables from PDFs into pandas DataFrames.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://github.com/chezou/tabula-py" data-iframely-url="//iframely.net/0WmgXWY?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;It is a simple wrapper of tabula-java,
and it enables you to extract tables into a DataFrame or JSON with Python. You can also extract tables from a PDF into a CSV, TSV, or JSON file.&lt;/p&gt;
&lt;p&gt;
Tabula is a tool to extract tables from PDFs. It is GUI-based software, whereas tabula-java is a command-line tool. Although there were
,
, and
bindings for tabula-java, there was no Python binding before tabula-py. I believe PyData is a great ecosystem for data analysis, and that’s why I created tabula-py. If you are familiar with R, I highly recommend using
, which has the richest bindings, including a rich GUI.&lt;/p&gt;
&lt;p&gt;You can install tabula-py via pip:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install tabula-py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With tabula-py, you can get a DataFrame with the &lt;code&gt;read_pdf()&lt;/code&gt; method.&lt;/p&gt;
&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_w0uPTg2qfvBbmHYEYxqjYw.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;example of read_pdf()&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;You can also extract tables in JSON format:&lt;/p&gt;
&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_wtSMgtCmBgy15PdP6Lq_jQ.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;example of JSON&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;You can write the extracted tables to a JSON, CSV, or TSV file with the &lt;code&gt;convert_into()&lt;/code&gt; method.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_tLQ2aqjM_zD_Ls6qNY6E0g_hu_aa84eb58c296557e.webp 320w, https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_tLQ2aqjM_zD_Ls6qNY6E0g_hu_4afdb897f94b7496.webp 480w, https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_tLQ2aqjM_zD_Ls6qNY6E0g_hu_aebd3a5076b0625a.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_tLQ2aqjM_zD_Ls6qNY6E0g_hu_aa84eb58c296557e.webp"
width="760"
height="334"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_ir9O2abAz1emEUdVqiwT0Q_hu_d239e59c74185345.webp 320w, https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_ir9O2abAz1emEUdVqiwT0Q_hu_a8d21a0dee94fcd.webp 480w, https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_ir9O2abAz1emEUdVqiwT0Q_hu_a9d95787c6508055.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2017-01-09_tabula-py--extract-table-from-pdf-into-python-dataframe-6c7acfa5f302/1_ir9O2abAz1emEUdVqiwT0Q_hu_d239e59c74185345.webp"
width="760"
height="304"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
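&lt;p&gt;Since the examples above are screenshots, here is a minimal text sketch of the same two calls. It assumes tabula-py and a Java runtime are installed and that a local file named &lt;code&gt;data.pdf&lt;/code&gt; exists; the helper names &lt;code&gt;output_path_for&lt;/code&gt;, &lt;code&gt;read_tables&lt;/code&gt;, and &lt;code&gt;convert_pdf&lt;/code&gt; are hypothetical and only wrap &lt;code&gt;read_pdf()&lt;/code&gt; and &lt;code&gt;convert_into()&lt;/code&gt;.&lt;/p&gt;

```python
from pathlib import Path


def output_path_for(pdf_path, output_format="csv"):
    """Derive an output file name such as data.csv from data.pdf."""
    return str(Path(pdf_path).with_suffix("." + output_format.lower()))


def read_tables(pdf_path):
    """Read tables from the PDF into pandas objects."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    # Depending on the tabula-py version, this returns a single DataFrame
    # or a list of DataFrames.
    return tabula.read_pdf(pdf_path, pages="all")


def convert_pdf(pdf_path, output_format="csv"):
    """Write the tables found in the PDF to a CSV, TSV, or JSON file."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    target = output_path_for(pdf_path, output_format)
    tabula.convert_into(pdf_path, target, output_format=output_format, pages="all")
    return target


# Example calls, assuming data.pdf exists in the current directory:
# tables = read_tables("data.pdf")
# convert_pdf("data.pdf", "json")
print(output_path_for("data.pdf", "tsv"))
```

&lt;p&gt;The imports live inside the functions so that the file-name helper can be tried even on a machine without Java.&lt;/p&gt;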
&lt;p&gt;You can see more examples in the Jupyter notebook.&lt;/p&gt;
&lt;div class="iframely-embed"&gt;&lt;div class="iframely-responsive" style="height: 140px; padding-bottom: 0;"&gt;&lt;a href="https://github.com/chezou/tabula-py/blob/master/examples/tabula_example.ipynb" data-iframely-url="//iframely.net/yCWTraF?card=small"&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;script async src="//iframely.net/embed.js" charset="utf-8"&gt;&lt;/script&gt;
&lt;p&gt;I hope you will enjoy data wrangling with tabula-py. Any feedback would be welcome!&lt;/p&gt;
&lt;h3 id="waiting-for-your-collaboration"&gt;Waiting for your collaboration!&lt;/h3&gt;
&lt;p&gt;If you have any trouble with tabula-py, please file
. I would rather not receive emails, because the answers cannot be shared with other people. Make sure to fill in
; it greatly reduces the effort I need to solve the problem. I also check Stack Overflow, so you can ask there as well.&lt;/p&gt;
&lt;h4 id="other-tabula-py-articles"&gt;Other tabula-py articles&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Livy &amp; Jupyter Notebook &amp; Sparkmagic = Powerful &amp; Easy Notebook for Data Scientist</title><link>https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/</link><pubDate>Thu, 29 Dec 2016 22:15:23 -0800</pubDate><guid>https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/</guid><description>&lt;p&gt;livy is a REST server of Spark. You can see
,
.
Jupyter notebook is one of the most popular open-source notebooks among data scientists. Using sparkmagic + Jupyter notebook, data scientists can easily run ad-hoc Spark jobs.&lt;/p&gt;
&lt;h3 id="why-livy-isgood"&gt;Why is livy good?&lt;/h3&gt;
&lt;p&gt;According to
, livy has features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Have long running SparkContexts that can be used for multiple Spark jobs, by multiple clients&lt;/li&gt;
&lt;li&gt;Share cached RDDs or Dataframes across multiple jobs and clients&lt;/li&gt;
&lt;li&gt;Multiple SparkContexts can be managed simultaneously, and they run on the cluster (YARN/Mesos) instead of the Livy Server for good fault tolerance and concurrency&lt;/li&gt;
&lt;li&gt;Jobs can be submitted as precompiled jars, snippets of code, or via Java/Scala client API&lt;/li&gt;
&lt;li&gt;Ensure security via secure authenticated communication&lt;/li&gt;
&lt;li&gt;Apache License, 100% open source&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="why-livy--sparkmagic"&gt;Why livy + sparkmagic?&lt;/h3&gt;
&lt;p&gt;
sparkmagic is a livy client for Jupyter notebook. When we write Spark code in our local Jupyter client, sparkmagic runs the Spark job through livy. Using sparkmagic + Jupyter notebook, data scientists can use Spark from their own Jupyter notebook running on localhost. We don’t need to copy any Spark configuration from the CDH cluster, so we can execute Spark jobs on a cluster as if they were running on a local machine.&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/0_lwKpnEq0Tpi3Tlj_hu_8aa03525a8b89ea2.webp 320w, https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/0_lwKpnEq0Tpi3Tlj_hu_78e47a14c2c7040e.webp 480w, https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/0_lwKpnEq0Tpi3Tlj_hu_97577e1b2d5bfd01.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/0_lwKpnEq0Tpi3Tlj_hu_8aa03525a8b89ea2.webp"
width="760"
height="258"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;diagram from
&lt;/p&gt;
&lt;h3 id="requirements"&gt;Requirements&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Spark Cluster
&lt;ul&gt;
&lt;li&gt;Cloudera Director is handy for preparing one&lt;/li&gt;
&lt;li&gt;Install git and maven&lt;/li&gt;
&lt;li&gt;I tried CDH 5.7 with CentOS 7&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Local jupyter client
&lt;ul&gt;
&lt;li&gt;virtualenv and virtualenvwrapper are awesome&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id="preparation"&gt;Preparation&lt;/h3&gt;
&lt;p&gt;To use livy with sparkmagic, we should install livy on the Spark gateway server and sparkmagic on the local machine.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="install-r"&gt;Install R&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ sudo yum install -y epel-release
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ sudo yum install -y R
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="build-livy"&gt;Build livy&lt;/h3&gt;
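&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ git clone git@github.com:cloudera/livy.git
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ cd livy
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ mvn -Dspark.version=1.6.0 -DskipTests clean package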
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Because a test was failing at the time, I added &lt;code&gt;-DskipTests&lt;/code&gt; to the build.&lt;/p&gt;
&lt;h3 id="run-livy"&gt;Run livy&lt;/h3&gt;
&lt;p&gt;Set environment variables as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-sh" data-lang="sh"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ export HADOOP_CONF_DIR=/etc/hadoop/conf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Add the following configuration into livy.conf:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;livy.server.session.factory = yarn
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Let’s run the livy server:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ ./bin/livy-server
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Open another terminal and check the server&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ curl localhost:8998/sessions
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{&amp;#34;from&amp;#34;:0,&amp;#34;total&amp;#34;:0,&amp;#34;sessions&amp;#34;:[]}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;livy’s default port is 8998, so we should open or forward that port.&lt;/p&gt;
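&lt;p&gt;The same check can be scripted. The sketch below is a hypothetical Python helper, not part of livy or sparkmagic: &lt;code&gt;parse_sessions&lt;/code&gt; summarizes the JSON body returned by &lt;code&gt;/sessions&lt;/code&gt;, and &lt;code&gt;fetch_sessions&lt;/code&gt; assumes a livy server is reachable at the given URL.&lt;/p&gt;

```python
import json
from urllib.request import urlopen


def parse_sessions(body):
    """Summarize the JSON body returned by livy's /sessions endpoint."""
    data = json.loads(body)
    return {
        "total": data.get("total", 0),
        "ids": [session.get("id") for session in data.get("sessions", [])],
    }


def fetch_sessions(base_url="http://localhost:8998", timeout=5):
    """Fetch /sessions from a livy server (requires a reachable server)."""
    with urlopen(base_url + "/sessions", timeout=timeout) as response:
        return parse_sessions(response.read().decode("utf-8"))


# The response from the curl example, parsed without any network access.
print(parse_sessions('{"from":0,"total":0,"sessions":[]}'))
```

&lt;p&gt;Calling &lt;code&gt;fetch_sessions()&lt;/code&gt; from the local machine with the server’s host name is a quick way to confirm that the port is really open.&lt;/p&gt;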
&lt;h3 id="prepare-sparkmagic-in-localmachine"&gt;Prepare sparkmagic in local machine&lt;/h3&gt;
&lt;p&gt;Install sparkmagic by following the
:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip install sparkmagic
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ jupyter nbextension enable --py --sys-prefix widgetsnbextension
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then install the wrapper kernels. Run &lt;code&gt;pip show sparkmagic&lt;/code&gt; to see the Location field. In the following example, Location is /Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages.&lt;/p&gt;
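&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ pip show sparkmagic
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;---
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Metadata-Version: 2.0
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Name: sparkmagic
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Version: 0.2.3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Summary: SparkMagic: Spark execution via Livy
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Home-page: https://github.com/jupyter-incubator/sparkmagic/sparkmagic
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Author: Jupyter Development Team
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Author-email: jupyter@googlegroups.org
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Installer: pip
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;License: BSD 3-clause
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Location: /Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Requires: ipywidgets, pandas, ipython, requests, mock, autovizwidget, numpy, nose, ipykernel, notebook, hdijupyterutils
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Classifiers:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Development Status :: 4 - Beta
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Environment :: Console
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Intended Audience :: Science/Research
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  License :: OSI Approved :: BSD License
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Natural Language :: English
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Programming Language :: Python :: 2.6
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Programming Language :: Python :: 2.7
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Programming Language :: Python :: 3.3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;  Programming Language :: Python :: 3.4
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ cd /Users/ariga/.virtualenvs/ibis/lib/python3.5/site-packages
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ jupyter-kernelspec install sparkmagic/kernels/sparkkernel
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ jupyter-kernelspec install sparkmagic/kernels/pysparkkernel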
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Copy the
into ~/.sparkmagic/config.json and modify it.&lt;/p&gt;
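&lt;p&gt;The main thing to modify is the livy URL. As a rough sketch, the Python snippet below points both kernels at a remote livy server. It assumes the config uses the keys &lt;code&gt;kernel_python_credentials&lt;/code&gt; and &lt;code&gt;kernel_scala_credentials&lt;/code&gt;, each with a &lt;code&gt;url&lt;/code&gt; entry; please check the example config shipped with your sparkmagic version, because key names may differ. The helper &lt;code&gt;point_config_at_livy&lt;/code&gt; and the shortened sample config are hypothetical.&lt;/p&gt;

```python
import json


def point_config_at_livy(config, livy_url):
    """Return a copy of a sparkmagic config whose kernels use livy_url."""
    updated = json.loads(json.dumps(config))  # deep copy of plain JSON data
    for key in ("kernel_python_credentials", "kernel_scala_credentials"):
        section = updated.setdefault(key, {})
        section["url"] = livy_url
    return updated


# Hypothetical, shortened starting point; the real example config has more settings.
example = {
    "kernel_python_credentials": {"username": "", "password": "", "url": "http://localhost:8998"},
    "kernel_scala_credentials": {"username": "", "password": "", "url": "http://localhost:8998"},
}

print(json.dumps(point_config_at_livy(example, "http://YOUR_HOSTNAME:8998"), indent=2))
```

&lt;p&gt;The printed JSON could then be saved as &lt;code&gt;~/.sparkmagic/config.json&lt;/code&gt;, with &lt;code&gt;YOUR_HOSTNAME&lt;/code&gt; replaced by the host running livy.&lt;/p&gt;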
&lt;h3 id="run-jupyternotebook"&gt;Run jupyter notebook&lt;/h3&gt;
&lt;p&gt;Before running jupyter, I recommend checking the connection from the local machine to the livy server.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ curl YOUR_HOSTNAME:8998/sessions
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Launch jupyter notebook and create a PySpark notebook (of course, you can also use the Spark kernel).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;$ jupyter notebook
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The example notebook is here&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;In nbviewer, we cannot see the result of the SQL, but we can visualize SQL results with the &lt;code&gt;%%sql&lt;/code&gt; magic command. That’s awesome :)&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-full" &gt;
&lt;img alt=""
srcset="https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/0_l8PW0TpvVfuoLdVv_hu_d5de29033a33568.webp 320w, https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/0_l8PW0TpvVfuoLdVv_hu_ccca514bcbd0e34e.webp 480w, https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/0_l8PW0TpvVfuoLdVv_hu_fbdfa0305b22868f.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://chezo.uno/blog/2016-12-30_livy---jupyter-notebook---sparkmagic---powerful---easy-notebook-for-data-scientist-a8b72345ea2d/0_l8PW0TpvVfuoLdVv_hu_d5de29033a33568.webp"
width="760"
height="572"
loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;If you use &lt;code&gt;%%local&lt;/code&gt;, you can use local Python libraries such as scikit-learn and seaborn with the results received from PySpark.&lt;/p&gt;
&lt;h3 id="references"&gt;References&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Text-to-speech based on deep learning for Web site using Amazon Polly and Ruby</title><link>https://chezo.uno/blog/2016-12-01_text-to-speech-based-on-deep-learning-for-web-site-using-amazon-polly-and-ruby-adc1923212cb/</link><pubDate>Wed, 30 Nov 2016 22:00:02 -0800</pubDate><guid>https://chezo.uno/blog/2016-12-01_text-to-speech-based-on-deep-learning-for-web-site-using-amazon-polly-and-ruby-adc1923212cb/</guid><description>&lt;p&gt;Amazon Polly, a text-to-speech service from AWS, was announced at today’s re:Invent. Amazon Polly is a speech synthesis system based on deep learning.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;[updated] I added the generated speech of this article.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;[updated2] I created a simple CLI tool and a Ruby gem for Polly.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;The great thing about Amazon Polly is that we can use TTS easily with the AWS CLI. It is free for up to 5 million characters a month; beyond that limit, it is very cheap at $0.000004 per character. If you synthesize &lt;a href="https://en.wikipedia.org/wiki/Adventures_of_Huckleberry_Finn"&gt;Adventures of Huckleberry Finn&lt;/a&gt;, it costs only about $2.4.&lt;/p&gt;
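&lt;p&gt;The arithmetic behind that estimate is simple enough to sketch in Python. The character count of 600,000 is my assumption, chosen because it is the length implied by the $2.4 figure at the quoted rate; the function name &lt;code&gt;polly_cost&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
PRICE_PER_CHARACTER = 0.000004  # USD, the rate quoted above
FREE_CHARACTERS_PER_MONTH = 5_000_000  # free allowance quoted above


def polly_cost(characters, free_characters=0):
    """Estimate the cost in USD after subtracting any free allowance."""
    billable = max(0, characters - free_characters)
    return round(billable * PRICE_PER_CHARACTER, 6)


# Assumption: about 600,000 characters, the length implied by the $2.4 figure.
print(polly_cost(600_000))
print(polly_cost(600_000, FREE_CHARACTERS_PER_MONTH))
```

&lt;p&gt;The first call ignores the free allowance and the second applies it, which shows that a single book of this size would fit within the monthly free tier.&lt;/p&gt;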
&lt;p&gt;Here is an example of using Polly with the AWS CLI.&lt;/p&gt;
&lt;p&gt;$ aws polly synthesize-speech \&lt;br&gt;
--output-format mp3 --voice-id Joanna \&lt;br&gt;
--text &amp;#34;Hello my name is Joanna.&amp;#34; \&lt;br&gt;
joanna.mp3&lt;/p&gt;
&lt;p&gt;As of December 1, 2016, Polly supports the following 24 languages and regional variants, mainly European languages.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Icelandic&lt;/li&gt;
&lt;li&gt;Italian&lt;/li&gt;
&lt;li&gt;Welsh&lt;/li&gt;
&lt;li&gt;Dutch&lt;/li&gt;
&lt;li&gt;Swedish&lt;/li&gt;
&lt;li&gt;Spanish (Castile)&lt;/li&gt;
&lt;li&gt;Spanish (USA)&lt;/li&gt;
&lt;li&gt;Danish&lt;/li&gt;
&lt;li&gt;Turkish&lt;/li&gt;
&lt;li&gt;German&lt;/li&gt;
&lt;li&gt;Norwegian&lt;/li&gt;
&lt;li&gt;French&lt;/li&gt;
&lt;li&gt;French (Canada)&lt;/li&gt;
&lt;li&gt;Portuguese&lt;/li&gt;
&lt;li&gt;Portuguese (Brazil)&lt;/li&gt;
&lt;li&gt;Polish&lt;/li&gt;
&lt;li&gt;Romanian&lt;/li&gt;
&lt;li&gt;Russian&lt;/li&gt;
&lt;li&gt;Japanese&lt;/li&gt;
&lt;li&gt;English (India)&lt;/li&gt;
&lt;li&gt;English (Welsh)&lt;/li&gt;
&lt;li&gt;English (Australia)&lt;/li&gt;
&lt;li&gt;English (US)&lt;/li&gt;
&lt;li&gt;English (UK)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I think the Japanese speech sounds very natural. Sometimes the accent is strange, but by registering words in a Lexicon, we can improve the quality ourselves. Here is a Japanese sample voice:&lt;/p&gt;
&lt;p&gt;I often find interesting articles on Medium, but reading long English articles is a bit tough for a non-native English speaker like me. So I figured that if I converted an article to speech, I could listen to it easily. That’s why I wrote the following Ruby code to convert articles to speech:&lt;/p&gt;
&lt;p&gt;The API has some important restrictions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each API request accepts up to 1,500 characters&lt;/li&gt;
&lt;li&gt;Long audio is truncated after 5 minutes&lt;/li&gt;
&lt;/ul&gt;
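&lt;p&gt;To work within the per-request character limit, a long article has to be split before synthesis. The following Python sketch shows one way to do it; it is not the Ruby code from this article. It assumes that splitting on whitespace is acceptable and that boto3 and AWS credentials are available for the &lt;code&gt;synthesize&lt;/code&gt; function. The function names are hypothetical, and joining the MP3 bytes of each chunk is a simplification.&lt;/p&gt;

```python
import textwrap

MAX_CHARACTERS = 1500  # per-request limit mentioned above


def split_for_polly(text, limit=MAX_CHARACTERS):
    """Split text on whitespace into chunks no longer than limit characters."""
    return textwrap.wrap(text, width=limit)


def synthesize(text, voice_id="Joanna"):
    """Synthesize each chunk with Polly and return the joined MP3 bytes."""
    import boto3  # imported lazily: needs boto3 and AWS credentials

    polly = boto3.client("polly")
    audio = b""
    for chunk in split_for_polly(text):
        response = polly.synthesize_speech(
            Text=chunk, OutputFormat="mp3", VoiceId=voice_id
        )
        audio += response["AudioStream"].read()
    return audio


# Local demonstration of the splitting step only; no AWS call is made here.
article = "Amazon Polly turns text into speech. " * 100
chunks = split_for_polly(article)
print(len(chunks), max(len(chunk) for chunk in chunks))
```

&lt;p&gt;Splitting on sentence boundaries instead of arbitrary whitespace would probably sound more natural, at the cost of a slightly longer helper.&lt;/p&gt;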
&lt;p&gt;Read more in detail…&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;I actually tried converting the following article, which I had just found on Hckr news. I could listen to it comfortably.&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;p&gt;With a bit more work, I could generate audio for the latest articles on a specific site from its RSS feed, save it to Dropbox, and play it back on my phone.&lt;/p&gt;
&lt;p&gt;Honestly, Amazon Polly is cheap, multilingual, and natural-sounding as it is, and its API is as easy to use as other AWS services. It makes me feel that companies in Japan that have worked hard on existing TTS systems are facing a very difficult time. As a developer, I am looking forward to using Polly for various purposes and to seeing better services built with it.&lt;/p&gt;</description></item><item><title>Building predictive Model with Ibis, Impala and scikit-learn</title><link>https://chezo.uno/blog/2016-10-15_building-predictive-model-with-ibis--impala-and-scikit-learn-356b41f404e0/</link><pubDate>Fri, 14 Oct 2016 14:10:31 -0700</pubDate><guid>https://chezo.uno/blog/2016-10-15_building-predictive-model-with-ibis--impala-and-scikit-learn-356b41f404e0/</guid><description>&lt;h3 id="tldr"&gt;tl;dr&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;visualizing
20M data (famous movie rating data) with
&lt;/li&gt;
&lt;li&gt;building a predictive model of movie preferences with scikit-learn&lt;/li&gt;
&lt;li&gt;
/
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="what-isibis"&gt;What is Ibis?&lt;/h3&gt;
&lt;p&gt;Ibis is a bridge between Python and Big Data. Ibis enables pandas-style handling of Big Data.&lt;/p&gt;
&lt;p&gt;&lt;figure&gt;&lt;img src="https://chezo.uno/blog/2016-10-15_building-predictive-model-with-ibis--impala-and-scikit-learn-356b41f404e0/1_pLXvJbXk8kJU09iwc4Dbdg.png"&gt;&lt;figcaption&gt;
&lt;h4&gt;architecture of Ibis&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;For more detail, see Wes’s presentation.&lt;/p&gt;
&lt;p&gt;As you know, pandas is known as a killer application for data analysis. At my previous company, which is known as the developer of
, many Rails developers were attracted to pandas and Jupyter notebook for sharing analysis results.&lt;/p&gt;
&lt;h3 id="why-ibis"&gt;Why Ibis?&lt;/h3&gt;
&lt;p&gt;pandas loads data into memory, so we have to filter the data with SQL before analyzing it. But what we actually want is to explore the data and get insights without writing SQL.&lt;/p&gt;
&lt;h3 id="preparation"&gt;Preparation&lt;/h3&gt;
&lt;h3 id="impala-cluster"&gt;Impala cluster&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;CDH 5.7 with Cloudera Director 2.1&lt;/li&gt;
&lt;li&gt;table is created with parquet on S3&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="required-port"&gt;required port&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Port 21050 on the impalad node&lt;/li&gt;
&lt;li&gt;Port 50070 on the NameNode (NN)&lt;/li&gt;
&lt;/ul&gt;
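&lt;p&gt;Before connecting with Ibis, it is worth confirming that both ports are reachable from the client machine. Below is a small Python sketch using only the standard library; the host names are placeholders and the helper names &lt;code&gt;port_is_open&lt;/code&gt; and &lt;code&gt;check_required_ports&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical host names; replace them with your impalad and NameNode hosts.
REQUIRED_PORTS = {"impalad.example.com": 21050, "namenode.example.com": 50070}


def check_required_ports(ports=REQUIRED_PORTS):
    """Map each host:port label to whether it is reachable."""
    return {f"{host}:{port}": port_is_open(host, port) for host, port in ports.items()}
```

&lt;p&gt;Calling &lt;code&gt;check_required_ports()&lt;/code&gt; with your own hosts returns a dictionary of reachability flags, which makes firewall or security group problems easy to spot.&lt;/p&gt;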
&lt;h3 id="ibis"&gt;Ibis&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Python 3.5&lt;/li&gt;
&lt;li&gt;using wheel and virtualenv, I didn’t use anaconda&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="notebook"&gt;Notebook&lt;/h3&gt;
&lt;p&gt;The full notebook repo is
. I also ran the same code against Redshift, but several SQL dialect differences prevented it from running…&lt;/p&gt;
&lt;p&gt;
&lt;/p&gt;
&lt;h3 id="faq"&gt;FAQ&lt;/h3&gt;
&lt;h4 id="what-is-the-difference-betweenpyspark"&gt;What is the difference from PySpark?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Easy to set up. It is just like connecting to a database.&lt;/li&gt;
&lt;li&gt;About 10x faster, so we can run about 10x more experiments. That leads to innovation!&lt;/li&gt;
&lt;li&gt;We can prototype rapidly with Ibis.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="which-is-prefer-to-build-model-ibis--scikit-learn-or-spark-mllib"&gt;Which is preferable for building a model: Ibis + scikit-learn or Spark + MLlib?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;It depends on data size.&lt;/li&gt;
&lt;li&gt;
. Netflix uses R to model filtered data, such as data for a specific country, and uses Spark for its global model.&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>