
podcast_search

This app provides a UI for my podcast index to enable semantic queries.

Getting a database sample

To set up a dev environment with realistic data, export a subset from a populated database and import it into your local instance.

Export

Run dump_sample.sh against the source DB. It reads connection info from .env (the same file docker-compose uses) and connects in read-only mode (PGOPTIONS='-c default_transaction_read_only=on'), so it cannot modify the source database.
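
A minimal sketch of the read-only mechanism, assuming the script exports PGOPTIONS before it calls its Postgres client. The script itself is not reproduced here, so treat this as an illustration rather than its actual contents:

```shell
# PGOPTIONS is read by libpq-based clients (psql, pg_dump) at connect time.
export PGOPTIONS='-c default_transaction_read_only=on'

# Assumption: host, user and database come from the environment / .env.
# With the variable exported, a client started from this shell could confirm
# the session default with:
#   psql -c "SHOW default_transaction_read_only;"
printf '%s\n' "$PGOPTIONS"
```

This sets the default for every transaction in the session, so INSERT, UPDATE, DELETE and DDL statements are rejected unless the session explicitly switches the setting off.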

Sampling is deterministic (id % N = 0), which means the same chunk IDs are selected every run. All referenced episodes and feeds are included automatically, preserving FK integrity.

./dump_sample.sh      # ~10% of chunks (default)
./dump_sample.sh 5    # ~20%
./dump_sample.sh 20   # ~5%

Output lands in podsearch_sample/ (three CSV files).
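
To see why the argument maps to those percentages, here is a small sketch of the id % N = 0 rule applied to a contiguous range of ids. The function name sample_ids is hypothetical, and real chunk ids may have gaps, which is why the percentages above are approximate:

```shell
# sample_ids N MAX: print the ids in 1..MAX for which id % N = 0.
sample_ids() {
  seq 1 "$2" | awk -v n="$1" '$1 % n == 0'
}

# Count how many of 1000 contiguous ids each modulus keeps.
for n in 5 10 20; do
  kept=$(sample_ids "$n" 1000 | awk 'END { print NR }')
  echo "N=$n keeps $kept of 1000 ids"
done
```

Because the rule depends only on the id, re-running it with the same N always selects the same rows.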

Import

Load the CSVs into your local Postgres in FK order (feeds → episodes → chunks). With the dockerized Postgres from this repo:

docker exec -i podcast_search-pgv-1 psql -U postgres -d podsearch_dev \
  -c "\copy feeds FROM STDIN CSV HEADER" < podsearch_sample/feeds.csv

docker exec -i podcast_search-pgv-1 psql -U postgres -d podsearch_dev \
  -c "\copy episodes FROM STDIN CSV HEADER" < podsearch_sample/episodes.csv

docker exec -i podcast_search-pgv-1 psql -U postgres -d podsearch_dev \
  -c "\copy chunks_512_qwen3_4b_2000d FROM STDIN CSV HEADER" < podsearch_sample/chunks.csv

The chunks import is slow (~15-20 min for a 10% sample) because Postgres updates the HNSW index for every inserted vector.
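
The three import commands can also be written as one loop that enforces the FK order. This is a sketch only: the container, database and table names are copied from the commands above, while import_sample and DRY_RUN are hypothetical names introduced for illustration. It defaults to a dry run so nothing is touched until you opt in:

```shell
CONTAINER=podcast_search-pgv-1
DB=podsearch_dev
SAMPLE_DIR=podsearch_sample

# Load each CSV into its table, parents before children (feeds, episodes, chunks).
import_sample() {
  for pair in feeds:feeds.csv episodes:episodes.csv chunks_512_qwen3_4b_2000d:chunks.csv; do
    table=${pair%%:*}   # text before the colon
    file=${pair#*:}     # text after the colon
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would load $SAMPLE_DIR/$file into $table"
    else
      docker exec -i "$CONTAINER" psql -U postgres -d "$DB" \
        -c "\copy $table FROM STDIN CSV HEADER" < "$SAMPLE_DIR/$file"
    fi
  done
}

import_sample
```

Set DRY_RUN=0 to perform the actual import.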

Testing

Tests run against podsearch_test, a clone of podsearch_dev. We use CREATE DATABASE ... TEMPLATE to make an exact copy — data, HNSW indexes, extensions, everything — so there's no expensive CSV import or index rebuild on each test run.
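
As a rough illustration of the template clone, the SQL involved might look like the output of the sketch below. This is a hypothetical reconstruction; the real clone_test_db.sh may do more or do it differently, and clone_sql is a name made up for this example:

```shell
# clone_sql SOURCE TARGET: print the statements that replace TARGET with a copy of SOURCE.
clone_sql() {
  printf 'DROP DATABASE IF EXISTS %s;\n' "$2"
  printf 'CREATE DATABASE %s TEMPLATE %s;\n' "$2" "$1"
}

clone_sql podsearch_dev podsearch_test

# To apply it against the dockerized Postgres (container name taken from the
# import commands), the statements could be piped to psql on the maintenance db:
#   clone_sql podsearch_dev podsearch_test | \
#     docker exec -i podcast_search-pgv-1 psql -U postgres -d postgres
```

Piping through stdin matters here because psql then sends each statement on its own, and neither DROP DATABASE nor CREATE DATABASE can run inside a transaction block.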

The Ecto Sandbox wraps every test in a transaction that gets rolled back, so the cloned data stays intact across repeated mix test runs. Each test can read the full realistic dataset and make writes that disappear when the test ends. This gives us the standard Phoenix testing happy path (isolated, self-contained tests using DataCase) while also having a large corpus of real data available for integration-level assertions on search quality, trends analysis, and query correctness.

Setup

  1. Make sure podsearch_dev is populated (either from your own data, or by loading a sample via dump_sample.sh + psql \copy).

  2. Stop the dev server. CREATE DATABASE ... TEMPLATE requires no active connections to the source database.

  3. Clone dev into test:

    ./clone_test_db.sh
    
  4. Run tests as usual:

    cd podsearch && mix test
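
If the clone in step 3 fails because the source database is still in use, a query against pg_stat_activity can show which sessions are holding it open. A small sketch, where active_connections_sql is a hypothetical helper name and the psql invocation in the comment assumes the same container as the import commands:

```shell
# active_connections_sql DBNAME: print a query listing sessions connected to DBNAME.
active_connections_sql() {
  printf "SELECT pid, usename, application_name, state FROM pg_stat_activity WHERE datname = '%s';\n" "$1"
}

active_connections_sql podsearch_dev

# Possible usage against the dockerized Postgres:
#   active_connections_sql podsearch_dev | \
#     docker exec -i podcast_search-pgv-1 psql -U postgres -d postgres
```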
    

When to re-clone

Re-run clone_test_db.sh only when podsearch_dev has changed meaningfully (e.g. after ingesting new podcasts or running new migrations). Day-to-day test runs reuse the existing podsearch_test — the sandbox keeps it clean.