How to make your SQL 10x faster (and silence the “AI can’t help with that” crowd)

A 4-week deep dive into schema-aware prompts, query plans, and AI-assisted indexing, built for engineers who give a damn about performance.

Yordan Ivanov
Jul 01, 2025

This mini-course is part of the Level-up data engineering playlist. Click here to explore the full series.

[Image: a female engineer using AI to skyrocket her database performance]

I used to think AI couldn’t help with real SQL work.

Sure, it could autocomplete a SELECT, but optimizing a 30M-row query? That’s senior engineer territory.

This mini-course proves otherwise.

You’ll learn how to prompt like a systems thinker, diagnose EXPLAIN plans like a pro, and simulate indexing with zero risk.

In the next 4 weeks, you’ll tackle one real-world performance bottleneck. By the end, you won’t just write faster SQL.

You’ll think faster too.


How to work with this mini-course

Reading this article alone would take you over 20 minutes. So:

  • Bookmark this guide and set a reminder to revisit it weekly.

  • Skim the entire article once to understand the big picture.

  • Each week, complete the exercises before applying them to your own projects.

Take your time. Don’t rush to implement everything at once. Master each step before moving to the next.

Also, reading the whole thing and writing the code in one sitting takes about an hour. It’s much easier to spend 15 minutes per week!


Who is this for

This is for the engineer who’s done all the right things and still feels stuck.

You know how to write a solid query.

You’ve skimmed AI blog posts that say “Just use GPT!”

You’ve watched it write garbage SQL and thought: There’s no way this helps in prod.

You’re not wrong.

But also, it’s not the model.

It’s the prompt, the schema, and the way you teach it to reason.

This is for you if:

  • You write analytical queries that run slow, and want to know why

  • You’re tired of guessing what EXPLAIN plans mean

  • You want AI to be your performance copilot and not your crutch

  • You don’t want to play “SQL roulette” every time a query hits prod

  • You care about real metrics: buffer hits, index paths, cost deltas

This mini-course won’t make you a DBA.

It’ll make you dangerous in performance reviews, schema design meetings, and tuning sessions.

No magic. Just muscle.


What you’ll learn (by the end of 4 weeks)

You’re not here to get “better at SQL.”

You’re here to turn that skill into firepower, backed by evidence, not vibes.

By the end of this, you’ll know how to:

  • Prompt LLMs with schema, row counts, and constraints, and catch when they hallucinate

  • Read an EXPLAIN (ANALYZE) plan like a performance engineer

  • Use ChatGPT to diagnose query bottlenecks (and challenge its answers with facts)

  • Simulate indexes with HypoPG before writing a single CREATE INDEX

  • Rewrite slow queries using window functions, CTEs, and partition-aware patterns (see the sketch after this list)

  • Spot when AI is helpful and when it’s confidently wrong
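
Here’s the flavor of that rewriting skill, sketched on our clickstream schema (an illustration, not the course’s exact exercise): a per-row correlated subquery replaced with a one-pass window function.

-- Slow pattern: the subquery re-runs for every row of the outer scan.
select prev_title, curr_title, n
from clickstream c
where n = (
  select max(n) from clickstream c2
  where c2.prev_title = c.prev_title
);

-- Faster pattern: one scan, ranked per referring page.
select prev_title, curr_title, n
from (
  select prev_title, curr_title, n,
         rank() over (partition by prev_title order by n desc) as rnk
  from clickstream
) ranked
where rnk = 1;

Both return each referring page’s top destination (ties included). Only one of them makes the planner weep.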

This isn’t autocomplete training.

It’s optimization thinking with AI as your assistant.


Set up your playground + tools

Before we start optimizing anything, we need a playground worth tuning.

We’re using the Wikipedia Clickstream dataset (Jan 2025) — messy, massive, and built for OLAP-style exploration.

You’ll load it into PostgreSQL on your local machine (Mac only — no Linux or Windows instructions here).

Here’s what we’re building:

  • A fact table with ~36M rows of user clickstream data

  • Two dimension tables to help with labeling and grouping

  • A test-only copy for slow queries you can safely experiment on

Step 1: download the data

Grab the Jan 2025 clickstream dataset:

curl -O https://dumps.wikimedia.org/other/clickstream/2025-01/clickstream-enwiki-2025-01.tsv.gz
gunzip clickstream-enwiki-2025-01.tsv.gz

This file is ~1.3GB uncompressed and uses tab-separated values.

Step 2: install PostgreSQL (if you haven’t yet)

Using Homebrew:

brew install postgresql
brew services start postgresql

Check version:

psql --version

You want PostgreSQL 14 or higher.

Step 3: create the database

From your terminal, create the database and open the Postgres CLI:

createdb clickstream_ai
psql clickstream_ai

Inside psql, create your base table:

create table clickstream (
  prev_title TEXT,
  curr_title TEXT,
  link_type TEXT,
  n BIGINT
);

This captures:

  • prev_title: the referring page

  • curr_title: the page that was visited

  • link_type: how the reader arrived (link, external, or other)

  • n: number of transitions

Step 4: clean and load the data

The raw Wikipedia data is not in great shape. Let’s do our favourite thing as data engineers and clean it. The awk one-liner below keeps only well-formed four-column rows, drops the noisy “other” link type, and escapes quotes so the output is valid CSV.

awk -F '\t' 'NF==4 && $3 != "other" {
  gsub(/"/, "\"\"", $1); gsub(/"/, "\"\"", $2); gsub(/"/, "\"\"", $3); gsub(/"/, "\"\"", $4);
  print "\"" $1 "\",\"" $2 "\",\"" $3 "\",\"" $4 "\""
}' clickstream-enwiki-2025-01.tsv > clickstream_cleaned.csv

Now load it:

psql clickstream_ai -c "\\copy clickstream from 'clickstream_cleaned.csv' with (format csv, header false)"

Check count:

select count(*) from clickstream;

You should see around 36 million rows.
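
Optionally, run two extra sanity checks. The first confirms the cleaning really dropped the “other” rows; the second gives you a first taste of the data.

-- Link types remaining after cleaning; "other" should be gone.
select link_type, count(*) as row_count, sum(n) as transitions
from clickstream
group by link_type;

-- The most-visited target pages in January 2025.
select curr_title, sum(n) as total_visits
from clickstream
group by curr_title
order by total_visits desc
limit 10;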

Step 5: create dimension tables

We’ll create two lookup tables:

create table dim_page (
  title TEXT PRIMARY KEY,
  is_disambiguation BOOLEAN DEFAULT FALSE
);

create table dim_link_type (
  link_type TEXT PRIMARY KEY,
  description TEXT
);

Optional: add a few dim_page entries to use in joins later.
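
For example, a few illustrative seed rows (the titles are placeholders; pick ones that actually appear in your clickstream table):

-- Hypothetical seed data, just to have something to join against later.
insert into dim_page (title, is_disambiguation) values
  ('PostgreSQL', false),
  ('Mercury', true);

insert into dim_link_type (link_type, description) values
  ('link', 'internal link from another article'),
  ('external', 'traffic from outside Wikipedia');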

Step 6: create a slow copy for experiments

create table clickstream_slow as select * from clickstream;

You’ll use this to run bad queries on purpose and learn to fix them.
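
For example, here’s a query that’s slow on purpose (one of many possible offenders): wrapping the filter column in a function and using a leading wildcard forces a full sequential scan over every row.

-- Deliberately bad: lower() defeats any plain index on curr_title,
-- and the leading % makes the pattern unindexable anyway.
select prev_title, curr_title, n
from clickstream_slow
where lower(curr_title) like '%postgres%'
order by n desc;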

Step 7: tools you’ll use

  • ChatGPT / Claude / Copilot / Cursor for SQL generation + plan interpretation

  • HypoPG to simulate indexes without committing to schema changes

  • EXPLAIN (ANALYZE, BUFFERS, VERBOSE) to see exactly what your query is doing
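
To show how these fit together, here’s a minimal sketch of the loop, assuming you’ve installed the hypopg extension (it ships separately from PostgreSQL):

-- Load the extension and create a hypothetical index: no disk writes.
create extension if not exists hypopg;
select * from hypopg_create_index(
  'create index on clickstream (curr_title)'
);

-- Plain EXPLAIN (without ANALYZE) now considers the hypothetical index.
explain
select sum(n) from clickstream where curr_title = 'PostgreSQL';

-- EXPLAIN ANALYZE actually runs the query, so it ignores hypothetical
-- indexes; use it to measure the real plan you have today.
explain (analyze, buffers, verbose)
select sum(n) from clickstream where curr_title = 'PostgreSQL';

-- Drop all hypothetical indexes when you're done experimenting.
select hypopg_reset();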

Did you miss the announcement? You and I will spend the next year turning your technical work into profit-driving business systems, without playing politics, changing jobs, or necessarily becoming a manager. Subscribe now and secure your next year of growth.
