This post compiles a list of useful command line tools and techniques for data science, based on a recent [Hacker News](https://news.ycombinator.com/item?id=40244097) discussion of Data Science at the Command Line, 2nd Edition. These tools can help streamline data processing workflows and provide efficient alternatives to traditional spreadsheets and databases.

VisiData

VisiData is a TUI (Text User Interface) application that has revolutionized ad-hoc data processing. It is scriptable and serves as a powerful replacement for spreadsheets and local DB instances when working with tabular data. VisiData is a must-have for anyone using tools like jq, q, awk, and grep for data processing pipelines.
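As a quick, hedged sketch of how it slots into a pipeline (the file name and URL are hypothetical), VisiData can open a file directly or read from stdin at the end of a chain of other tools:

```sh
# Open a CSV interactively in the VisiData TUI (file name is hypothetical)
vd results.csv

# Explore JSON piped in from another tool; -f tells VisiData the input filetype
curl -s https://example.com/api/items.json | vd -f json
```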

Command Line Data Analysis

The online book “Ad Hoc Data Analysis From The Unix Command Line” (roughly 15 years old, but still gold) is a great resource for learning how to perform serious data analysis using command line tools such as cat, find, grep, wc, cut, sort, and uniq.
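In that spirit, here is a minimal sketch of the classic counting pipeline (the file name and column number are assumptions for illustration):

```sh
# Frequency count of the values in the third comma-separated column
cut -d',' -f3 events.csv | sort | uniq -c | sort -rn | head

# Quick row and match counts
wc -l events.csv
grep -c 'ERROR' events.csv
```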

Make and Parallel Processing

Make may not be an obvious choice for ETL (Extract, Transform, Load), but it is a powerful tool for defining and running data processing pipelines. A Makefile defines a DAG (Directed Acyclic Graph) of tasks, and make runs independent tasks in parallel, waiting on shared dependencies and fanning out otherwise. The same machinery that handles complex build processes turns out to be well-suited to ETL pipelines.

Makefiles can be used to orchestrate the steps of a command line data science workflow, letting make determine when intermediate data needs to be regenerated. This technique is described in the book and has come up in previous HN comments. A good example of using make for a simple map-reduce pipeline can be found [here](https://www.benevolent.com/news-and-media/blog-and-videos/how-use-makefiles-run-simple-map-reduce-data-pipeline/).
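As an illustrative sketch (the file names and scripts are hypothetical, not taken from the book or the linked post), a Makefile expresses a small pipeline as a DAG; running `make -j4 report.csv` rebuilds only out-of-date targets and processes the two independent cleaning steps in parallel:

```make
# make derives the DAG from these prerequisite lists and rebuilds a target
# only when one of its inputs is newer than it (recipes must be tab-indented)
report.csv: cleaned_a.csv cleaned_b.csv
	./join_and_summarize.sh cleaned_a.csv cleaned_b.csv > $@

# The two cleaning steps are independent, so `make -j` runs them in parallel
cleaned_a.csv: raw_a.csv
	./clean.sh raw_a.csv > $@

cleaned_b.csv: raw_b.csv
	./clean.sh raw_b.csv > $@
```

Touching raw_a.csv and re-running make regenerates only cleaned_a.csv and report.csv, which is exactly the incremental-rebuild behavior that makes this useful for intermediate data.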

Ripgrep and jq

Ripgrep (rg) is a fast command line search tool that can be combined with jq for efficient data processing workflows. For example, `rg | xargs jq` and `find -exec jq` are common patterns for quickly extracting and processing JSON data from large datasets.
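A hedged sketch of both patterns (the directory, field names, and search string are made up for illustration):

```sh
# rg -l lists the files that mention a field; xargs hands them to jq
# (-r tells GNU xargs not to run jq at all if nothing matched)
rg -l '"user_id"' logs/ | xargs -r jq -r '.user_id'

# find walks the tree and invokes jq directly on each JSON file
find logs/ -name '*.json' -exec jq -r '.timestamp' {} +
```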

Git Scraping

Git scraping is a technique for tracking changes in data over time or saving snapshots of data to a git repository for diffing purposes. This can be useful for monitoring changes in build artifacts or other data sources.
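A minimal sketch of the idea (the URL and file name are hypothetical); run it on a schedule from cron or CI inside a git repository and the history becomes diffable:

```sh
# Fetch the current snapshot; jq pretty-prints so diffs stay line-oriented
curl -s https://example.com/data.json | jq . > data.json

# Commit only when something actually changed
git add data.json
git diff --cached --quiet || git commit -m "Snapshot data.json $(date -u +%Y-%m-%dT%H:%MZ)"
```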

Datasette, ClickHouse Local, and DuckDB

Datasette, ClickHouse Local, and DuckDB are additional tools worth exploring for data science workflows. They provide efficient ways to store, query, and analyze data with SQL directly from the command line.
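A few hedged one-liners showing the flavor of each (file names are hypothetical):

```sh
# DuckDB: run SQL directly against a local CSV file
echo "SELECT category, count(*) FROM 'events.csv' GROUP BY category ORDER BY 2 DESC" | duckdb

# ClickHouse Local: the same idea with ClickHouse's SQL dialect and table functions
clickhouse-local --query "SELECT count() FROM file('events.csv', CSVWithNames)"

# Datasette: serve a SQLite database as a browsable, queryable web UI
datasette serve events.db
```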