podcast_search
This app provides a UI for my podcast index to enable semantic queries.
Getting a database sample
To set up a dev environment with realistic data, export a subset from a populated database and import it into your local instance.
Export
Run dump_sample.sh against the source DB. It reads
connection info from .env (the same file docker-compose uses) and connects
in read-only mode (PGOPTIONS='-c default_transaction_read_only=on'),
so it cannot modify the source database.
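As a rough sketch of how the read-only guard works: libpq passes the contents of PGOPTIONS to the server at connect time, so every transaction in that session starts read-only by default. The wrapper name readonly_psql is hypothetical and not part of this repo; the connection settings themselves come from your .env, which is not shown here.

```shell
# Hypothetical wrapper: run psql with read-only transactions as the session default.
# The parentheses run the body in a subshell, so PGOPTIONS does not leak to the caller.
readonly_psql() (
  PGOPTIONS='-c default_transaction_read_only=on'
  export PGOPTIONS
  psql "$@"
)
```

With a reachable database, `readonly_psql -Atc 'SHOW default_transaction_read_only'` should report `on`. This is a session default rather than a hard lock, so it guards against accidental writes by the script's own queries.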
Sampling is deterministic (id % N = 0, where N is the optional argument and
defaults to 10), so the same chunk IDs are selected on every run. All episodes
and feeds referenced by the sampled chunks are included automatically,
preserving FK integrity.
./dump_sample.sh # ~10% of chunks (default)
./dump_sample.sh 5 # ~20%
./dump_sample.sh 20 # ~5%
Output lands in podsearch_sample/ (three CSV files).
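To illustrate how the modulus maps to sample size, the sketch below counts how many IDs in a contiguous range satisfy id % N = 0. It assumes chunk IDs are dense integers starting at 1, which may not hold for the real table (gaps would shift the percentages slightly). sample_count is a hypothetical helper, not part of dump_sample.sh.

```shell
# Hypothetical helper: count ids in 1..TOTAL that satisfy id % N = 0.
# Usage: sample_count N [TOTAL]   (TOTAL defaults to 1000)
sample_count() {
  n="$1"
  total="${2:-1000}"
  count=0
  id=1
  while [ "$id" -le "$total" ]; do
    if [ $((id % n)) -eq 0 ]; then
      count=$((count + 1))
    fi
    id=$((id + 1))
  done
  echo "$count"
}

sample_count 10   # prints 100 (about 10% of 1000 ids)
sample_count 5    # prints 200 (about 20%)
sample_count 20   # prints 50  (about 5%)
```

A larger N therefore means a smaller sample, which matches the percentages in the commands above.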
Import
Load the CSVs into your local Postgres in FK order (feeds → episodes → chunks). With the dockerized Postgres from this repo:
docker exec -i podcast_search-pgv-1 psql -U postgres -d podsearch_dev \
-c "\copy feeds FROM STDIN CSV HEADER" < podsearch_sample/feeds.csv
docker exec -i podcast_search-pgv-1 psql -U postgres -d podsearch_dev \
-c "\copy episodes FROM STDIN CSV HEADER" < podsearch_sample/episodes.csv
docker exec -i podcast_search-pgv-1 psql -U postgres -d podsearch_dev \
-c "\copy chunks_512_qwen3_4b_2000d FROM STDIN CSV HEADER" < podsearch_sample/chunks.csv
The chunks import is slow (~15-20 min for a 10% sample) because Postgres maintains the HNSW index on each inserted vector.
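If you re-import often, the three commands can be wrapped in a small helper that enforces the FK order. This is only a sketch: import_table, import_sample and the DRY_RUN switch are illustrative names, and the container, database and directory defaults are simply copied from the commands above.

```shell
# Sketch only: CONTAINER, DB, SAMPLE_DIR and DRY_RUN are illustrative variables.
CONTAINER="${CONTAINER:-podcast_search-pgv-1}"
DB="${DB:-podsearch_dev}"
SAMPLE_DIR="${SAMPLE_DIR:-podsearch_sample}"

# import_table TABLE FILE: stream one CSV into one table via \copy.
import_table() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Dry run: report what would happen without touching Docker or Postgres.
    echo "would load $2 into $1"
    return 0
  fi
  docker exec -i "$CONTAINER" psql -U postgres -d "$DB" \
    -c "\\copy $1 FROM STDIN CSV HEADER" < "$2"
}

# Parents before children, so foreign keys always resolve.
# The && chain skips later tables if an earlier step reports failure.
import_sample() {
  import_table feeds "$SAMPLE_DIR/feeds.csv" &&
    import_table episodes "$SAMPLE_DIR/episodes.csv" &&
    import_table chunks_512_qwen3_4b_2000d "$SAMPLE_DIR/chunks.csv"
}
```

Running `DRY_RUN=1 import_sample` prints the planned order without executing anything; plain `import_sample` performs the import.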
Testing
Tests run against podsearch_test, a clone of podsearch_dev. We use
CREATE DATABASE ... TEMPLATE to make an exact copy — data, HNSW indexes,
extensions, everything — so there's no expensive CSV import or index rebuild
on each test run.
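The contents of clone_test_db.sh are not reproduced in this README, so the following is only a guess at the kind of SQL a template-based clone needs. clone_sql is a hypothetical helper that prints the statements rather than running them.

```shell
# Hypothetical helper: print the SQL for a template-based clone.
# Usage: clone_sql SOURCE_DB TARGET_DB
clone_sql() {
  # Remove any previous copy, then recreate it from the source as a template.
  printf 'DROP DATABASE IF EXISTS %s;\n' "$2"
  printf 'CREATE DATABASE %s TEMPLATE %s;\n' "$2" "$1"
}

clone_sql podsearch_dev podsearch_test
```

The output could be piped into psql while connected to a different database, for example `clone_sql podsearch_dev podsearch_test | docker exec -i podcast_search-pgv-1 psql -U postgres -d postgres`. Neither statement can run inside a transaction block, which is why they are sent as separate statements.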
The Ecto Sandbox wraps every test in a transaction that gets rolled back, so
the cloned data stays intact across repeated mix test runs. Each test can
read the full realistic dataset and make writes that disappear when the test
ends. This gives us the standard Phoenix testing happy path (isolated,
self-contained tests using DataCase) while also having a large corpus of
real data available for integration-level assertions on search quality,
trends analysis, and query correctness.
Setup
- Make sure podsearch_dev is populated (either from your own data, or by loading a sample via dump_sample.sh + psql \copy).
- Stop the dev server: CREATE DATABASE ... TEMPLATE requires no active connections to the source database.
- Clone dev into test:
  ./clone_test_db.sh
- Run tests as usual:
  cd podsearch && mix test
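If the clone step fails, a lingering connection to podsearch_dev is a likely cause. One way to check, sketched here with a hypothetical helper named active_sessions_sql, is to count sessions in the standard pg_stat_activity view:

```shell
# Hypothetical helper: print a query that counts other sessions on a database.
# Usage: active_sessions_sql DB_NAME
active_sessions_sql() {
  printf "SELECT count(*) FROM pg_stat_activity WHERE datname = '%s' AND pid <> pg_backend_pid();\n" "$1"
}

active_sessions_sql podsearch_dev
```

Piping the output into psql on another database, for example `active_sessions_sql podsearch_dev | docker exec -i podcast_search-pgv-1 psql -U postgres -d postgres -At`, should print 0 once the dev server and any open psql or IEx sessions are stopped.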
When to re-clone
Re-run clone_test_db.sh only when podsearch_dev has changed meaningfully
(e.g. after ingesting new podcasts or running new migrations). Day-to-day
test runs reuse the existing podsearch_test — the sandbox keeps it clean.