Data Processing

Transform, analyze, and process data efficiently with AI-powered assistance from Gemini CLI.

Why Use Gemini CLI for Data Processing?

Data engineering tasks — parsing logs, converting formats, cleaning datasets, extracting insights — traditionally require writing one-off scripts or mastering specialized tools like awk, jq, or pandas. Gemini CLI collapses this workflow: you describe what you want in plain English and pipe your data through it, or ask it to write the transformation script for you and then execute it.

This is especially powerful for data exploration. When you receive an unfamiliar dataset, you can ask Gemini CLI to describe its structure, identify anomalies, suggest a cleaning strategy, and generate the code to implement that strategy — all in a single interactive session. What used to take an afternoon of exploratory coding now takes minutes.

Gemini CLI works as a Unix pipe citizen: it reads from stdin and writes to stdout, so it composes naturally with every other tool in your data pipeline. You can chain it with sort, uniq, jq, curl, and any database CLI.

Scenario 1: CSV Analysis and Summarization

CSV is one of the most common data exchange formats. Whether it is a sales report, a product catalog, or a survey export, Gemini CLI can describe the data, compute statistics, and extract specific insights without requiring you to open a spreadsheet or write a pandas script.

Describe and summarize a CSV

# Understand the structure and statistics of any CSV

gemini 'Describe this CSV: column names, data types, row count, and basic statistics (min, max, mean) for numeric columns. Identify missing values.' < sales-q4.csv

Example output for a sales CSV with 6 columns and 5,000 rows:

Columns: date (datetime), region (string), product_id (string), units_sold (int), revenue (float), cost (float)

Rows: 5,000 | Missing values: cost (3 rows)

revenue: min=12.50, max=48,200.00, mean=1,842.33

Top region by revenue: APAC (38% of total)
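Figures reported by a language model are worth cross-checking before they end up in a report. The following is a minimal sketch, using only the Python standard library, that recomputes the row count, missing-value counts, and min/max/mean for numeric columns. The function name summarize_csv is illustrative and not part of Gemini CLI.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Return row count, per-column missing counts, and numeric stats."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []
    summary = {"rows": len(rows), "missing": {}, "numeric": {}}
    for col in columns:
        values = [(row[col] or "").strip() for row in rows]
        present = [v for v in values if v]
        missing = len(values) - len(present)
        if missing:
            summary["missing"][col] = missing
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            continue  # at least one non-numeric value: treat the column as text
        if numbers:
            summary["numeric"][col] = {
                "min": min(numbers),
                "max": max(numbers),
                "mean": statistics.fmean(numbers),
            }
    return summary
```

To compare against the model's summary, read the file with open('sales-q4.csv', newline='').read() and pass the text to summarize_csv.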

Filter and extract specific rows

# Extract rows matching a complex condition

gemini 'Extract all rows where revenue > 10000 AND region is EMEA. Output as CSV with the same headers.' < sales-q4.csv > high-value-emea.csv

Generate a Python cleaning script

# Ask Gemini CLI to write the cleaning script for you

gemini 'Write a Python script using pandas that reads sales-q4.csv and: 1) fills missing cost values with the column mean, 2) removes duplicate rows, 3) adds a profit column (revenue - cost). Save the result to cleaned.csv. Output only the Python code, with no Markdown fences or commentary.' < sales-q4.csv > clean_data.py

python3 clean_data.py

Generating a script rather than processing inline is the right approach for large files or for pipelines that will be run repeatedly. The script can be committed to version control and run on any machine without Gemini CLI installed.
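For orientation, here is a sketch of the three cleaning steps from the prompt above. It is not the script Gemini CLI would generate: the prompt asks for pandas, whereas this version uses only the standard library so that it runs anywhere. The function name clean_sales is hypothetical, and the column names revenue and cost are taken from the example output earlier in this section.

```python
import csv
import io
import statistics


def clean_sales(text):
    """Fill missing cost with the column mean, drop duplicates, add profit."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fieldnames = list(reader.fieldnames or [])

    # 1) Mean of the cost values that are actually present.
    costs = [float(r["cost"]) for r in rows if (r["cost"] or "").strip()]
    mean_cost = statistics.fmean(costs) if costs else 0.0

    cleaned, seen = [], set()
    for r in rows:
        if not (r["cost"] or "").strip():
            r["cost"] = f"{mean_cost:.2f}"
        # 2) Skip rows that are exact duplicates of an earlier row.
        key = tuple(r[name] for name in fieldnames)
        if key in seen:
            continue
        seen.add(key)
        # 3) Derived column: profit = revenue - cost.
        r["profit"] = f"{float(r['revenue']) - float(r['cost']):.2f}"
        cleaned.append(r)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames + ["profit"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(cleaned)
    return out.getvalue()
```

Reviewing a generated script against a small reference like this is a quick way to confirm that it fills, deduplicates, and derives columns in the order you intended.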

Scenario 2: JSON Transformation and Reshaping

JSON is the lingua franca of APIs. Transforming deeply nested JSON into a flat structure, extracting specific fields, or converting between JSON and other formats are tasks that come up constantly in backend development and data integration work.

Flatten nested JSON to CSV

# Input: nested API response

# Output: flat CSV with dot-notation column names

gemini 'Convert this JSON array to CSV. Flatten nested objects using dot notation. Include a header row. Handle missing fields with empty strings.' < api-response.json > flat.csv

Input JSON sample:

[{"id": 1, "user": {"name": "Alice", "email": "a@x.com"}, "score": 95}]

Output CSV:

id,user.name,user.email,score

1,Alice,a@x.com,95
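The dot-notation flattening shown above can also be done deterministically. The sketch below is one possible implementation using the Python standard library; the function names flatten and json_array_to_csv are illustrative. It flattens nested objects only and leaves arrays as single values, which is an assumption about the desired behavior.

```python
import csv
import io
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level with dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def json_array_to_csv(text):
    """Convert a JSON array of objects to CSV text with a header row."""
    records = [flatten(item) for item in json.loads(text)]
    # Collect columns in first-seen order so the header is stable.
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    # restval="" writes an empty string for fields missing from a record.
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

Run on the input sample above, json_array_to_csv returns the same two-line CSV shown in the example.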

Reshape JSON structure

# Transform from one API format to another

gemini "Transform this Stripe webhook payload into our internal order format. The target schema is: {id, customer_email, total_cents, items: [{sku, quantity, price_cents}]}." < stripe-event.json > internal-order.json

JSON to YAML conversion

# Convert config formats

gemini 'Convert this JSON config to YAML. Use 2-space indentation. Add a comment above each top-level key explaining its purpose.' < config.json > config.yaml

Scenario 3: Log Parsing and Analysis

Application logs contain a wealth of diagnostic information, but reading raw log files manually is tedious and error-prone. Gemini CLI can summarize error patterns, extract structured data, identify the root cause of an incident, or monitor a live log stream.

Analyze error patterns

# Summarize error frequency and patterns

gemini 'Analyze error patterns in this log file' < app.log

# Extract key metrics from web server logs

gemini 'Extract key metrics from web server logs' < access.log

Incident root cause analysis

# Find what caused a spike in 500 errors

gemini 'Our error rate spiked at 14:32 UTC. Analyze these logs and identify the root cause. Focus on errors that appeared for the first time around that time.' < server.log

Example findings from a real analysis session:

Root cause: Database connection timeout starting at 14:31:47

First occurrence: 14:31:47 — 'Connection pool exhausted: max_connections=20 reached'

Affected endpoints: /api/orders (87%), /api/products (13%)

Recovery: Connections stabilized at 14:38:12 after 400 failed requests

Recommendation: Increase connection pool size or add connection retry logic

Extract structured data from unstructured logs

# Convert log lines to JSON for further processing

gemini 'Parse each line of this log file and output JSON with fields: timestamp, level, service, message. Output one JSON object per line (JSONL format).' < mixed.log > parsed.jsonl

# Then query with jq

jq 'select(.level == "ERROR") | .message' parsed.jsonl
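When the log layout is regular, the same log-to-JSONL conversion can be written as a small parser. The sketch below assumes a hypothetical line layout of timestamp, level, service, and message separated by whitespace; real logs will need a different pattern. Lines that do not match are skipped.

```python
import json
import re

# Hypothetical layout: "<timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def parse_log(lines):
    """Yield one JSON string per log line that matches LINE_RE."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            yield json.dumps(match.groupdict())
```

A reasonable division of labor is to let a fixed parser handle the lines that follow the pattern and reserve Gemini CLI for the irregular lines it skips.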

Scenario 4: Data Cleaning and Normalization

Raw data is rarely clean. Phone numbers come in dozens of formats, addresses have typos, dates are inconsistent, and categorical fields have misspellings. Gemini CLI can identify data quality issues and generate the code to fix them consistently.

Audit data quality

# Identify all data quality issues in a CSV

gemini 'Audit this CSV for data quality issues: inconsistent formats, outliers, duplicate rows, invalid values, and missing data. Report findings grouped by column.' < customer-export.csv

Normalize inconsistent values

# Normalize phone numbers, dates, and categories

gemini 'Write a Python script that reads messy-data.csv and normalizes it: 1) Phone numbers to E.164 format (+1XXXXXXXXXX), 2) Dates to ISO 8601 (YYYY-MM-DD), 3) Country names to ISO 3166-1 alpha-2 codes. Save to normalized.csv. Output only the Python code, with no Markdown fences or commentary.' < messy-data.csv > normalize.py

python3 normalize.py
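To give a sense of what the first two normalization rules involve, here is a minimal sketch. It assumes US phone numbers (country code 1) and a short, hypothetical list of input date formats; the function names are illustrative. Slash-separated dates are read month-first, which is itself an assumption worth confirming against your data. Country-name mapping is omitted because it needs a lookup table.

```python
import re
from datetime import datetime

# Assumed input formats, tried in order; extend to match your data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")


def normalize_phone(raw, country_code="1"):
    """Return an E.164 string, or None if the digit count is unexpected."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"+{country_code}{digits}"
    if len(digits) == 11 and digits.startswith(country_code):
        return f"+{digits}"
    return None


def normalize_date(raw):
    """Return an ISO 8601 date (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None for values that cannot be normalized keeps bad rows visible instead of silently guessing.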

Deduplicate with fuzzy matching

# Find near-duplicate records (e.g., 'John Smith' vs 'Jon Smith')

gemini 'Write a Python script using rapidfuzz that reads customers.csv and finds duplicate customer records where the name similarity is > 90% and the email domain matches. The script should print a report of suspected duplicates with their similarity scores. Output only the Python code, with no Markdown fences or commentary.' < customers.csv > dedup.py
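The matching rule in that prompt can be sketched as follows. This version substitutes difflib.SequenceMatcher from the standard library for rapidfuzz, so the similarity scores will not be identical to what a rapidfuzz-based script reports. The function name and the name and email record keys are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_suspected_duplicates(records, threshold=0.9):
    """Return (name_a, name_b, score) for pairs that look like duplicates."""
    suspects = []
    for a, b in combinations(records, 2):
        # Only compare records whose email domains match.
        domain_a = a["email"].rsplit("@", 1)[-1].lower()
        domain_b = b["email"].rsplit("@", 1)[-1].lower()
        if domain_a != domain_b:
            continue
        score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if score > threshold:
            suspects.append((a["name"], b["name"], round(score, 3)))
    return suspects
```

Comparing every pair is quadratic in the number of records, so for large customer tables it helps to group records by email domain first and compare only within each group.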

Additional Data Processing Examples

Format Conversion

gemini 'Convert this JSON to CSV format' < data.json > data.csv

gemini 'Transform this XML to YAML' < config.xml > config.yaml

Data Analysis

gemini 'Summarize trends in this dataset' < sales-data.csv

gemini 'Find anomalies in this time series data' < metrics.json

Batch Data Processing

Mass Data Transformation

#!/bin/bash

# process-data-files.sh

# Create the output directory if it does not exist yet

mkdir -p processed

for file in data/*.json; do

  echo "Processing $file..."

  gemini 'Clean and normalize this data' < "$file" > "processed/$(basename "$file")"

done

Frequently Asked Questions

Can Gemini CLI process very large CSV files?

For files larger than a few megabytes, split them into chunks first using split or awk, then pipe each chunk to gemini. Alternatively, ask gemini to write a Python or Node.js script that processes the file in a streaming manner, then run the generated script directly.
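One detail that is easy to miss when chunking a CSV is that every chunk needs its own header row, otherwise the model cannot tell what the columns mean. The sketch below shows one way to do this in Python. The function name split_csv and the chunk size are illustrative, not something Gemini CLI provides.

```python
import csv
import io


def render_chunk(header, rows):
    """Serialize a header plus rows back to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


def split_csv(lines, rows_per_chunk):
    """Yield CSV text chunks, each starting with the original header."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    current = []
    for row in reader:
        current.append(row)
        if len(current) == rows_per_chunk:
            yield render_chunk(header, current)
            current = []
    if current:
        yield render_chunk(header, current)
```

Because split_csv reads its input lazily, it can be given an open file object (opened with newline='') and will hold only one chunk in memory at a time.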

How do I convert JSON to CSV while keeping nested fields flat?

Use a specific prompt:

gemini 'Convert this JSON array to CSV. Flatten nested objects using dot notation (e.g., address.city). Include a header row.' < data.json > output.csv

Depending on the input, Gemini CLI may output the CSV directly (typical for small files) or produce jq or Python code that performs the flattening, so check the contents of output.csv before using it.

Can Gemini CLI analyze log files in real time?

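Yes, by piping tail -f into gemini:

tail -f app.log | gemini 'Summarize every 50 lines. Alert if you see ERROR or CRITICAL.'

This creates a streaming pipeline in which Gemini CLI monitors and summarizes your logs continuously. Note that tail -f never closes its output, so this only works if gemini responds before its input ends. If your version waits for the end of input, feed it bounded batches instead, for example by running tail -n 50 app.log on a schedule.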
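If you want the "every 50 lines" grouping to be exact rather than left to the model, the stream can be batched before it reaches Gemini CLI. The following is a minimal sketch of such a batching helper; the function name batches is illustrative, and how each batch is handed to gemini (a temporary file, a separate invocation per batch) is left open.

```python
from itertools import islice


def batches(lines, size=50):
    """Group an iterable of lines into lists of at most `size` items."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With a live stream such as sys.stdin fed by tail -f, each batch is yielded only once it is full, so a quiet log delays the next summary until 50 new lines have arrived.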

What data formats does Gemini CLI understand natively?

Gemini CLI accepts any text-based format as stdin: CSV, JSON, JSONL, XML, YAML, TSV, log files, Markdown tables, and plain text. For binary formats (Parquet, Excel, SQLite), first convert to a text format using standard tools like csvkit, pandas, or sqlite3 before piping to Gemini CLI.