
tafkb2 - Document Chunking and Embedding

A system for processing documents into searchable vector embeddings. Includes an optional email-based intake pipeline.

Use Cases

  1. Full Pipeline - Process emails → fetch articles → chunk → embed → search
  2. Chunking/Embedding Only - Bring your own documents, use chunking and embedding

Installation

Prerequisites

  • Python 3.13+ (tested with 3.13.3)
  • PostgreSQL 17+ with pgvector extension
  • pyenv (recommended for Python version management)

Setup

# Clone with submodule (shot-scraper fork, needed only for intake)
git clone --recursive https://gitlab.com/t-f/document_intake_and_embedding.git
cd tafkb2

# Set Python version (if using pyenv)
pyenv install 3.13.3
pyenv local 3.13.3

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install the package
pip install -e .

Environment Configuration

Create a .env file (dotenv-compatible format):

# PostgreSQL connection
POSTGRES_HOSTNAME=localhost
POSTGRES_PORT=5432
POSTGRES_USER=youruser
POSTGRES_PASSWORD=yourpassword
POSTGRES_DB=kb2

# Test database (used by pytest and when TAFKB2_INTAKE_TEST=1)
TEST_POSTGRES_HOSTNAME=localhost
TEST_POSTGRES_PORT=5432
TEST_POSTGRES_USER=testuser
TEST_POSTGRES_PASSWORD=testpassword
TEST_POSTGRES_DB=kb2_test

# Project root (required for intake scripts)
REPO_ROOT=/path/to/tafkb2

Load before running:

set -a; source .env; set +a
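For reference, the primary connection variables combine into a standard PostgreSQL DSN. The sketch below only illustrates the shape; `tafkb2.db` reads the variables itself via `get_db()`, so you normally never build this string by hand:

```python
# Illustrative only: how the .env variables map onto a libpq-style URL.
env = {
    "POSTGRES_USER": "youruser",
    "POSTGRES_PASSWORD": "yourpassword",
    "POSTGRES_HOSTNAME": "localhost",
    "POSTGRES_PORT": "5432",
    "POSTGRES_DB": "kb2",
}
dsn = (
    "postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}"
    "@{POSTGRES_HOSTNAME}:{POSTGRES_PORT}/{POSTGRES_DB}"
).format(**env)
# dsn == "postgresql://youruser:yourpassword@localhost:5432/kb2"
```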

Verify Installation

# Check CLI is available
tafkb-query-configs

# Test database connection
python -c "
from tafkb2.db import get_db, init_db
db = get_db()
init_db(db)
print('Database connection successful')
"

Part A: Chunking and Embedding (No Intake)

Three steps to semantic search: bring your own documents, then chunk, embed, and search.

# 1. Prepare your documents as JSON (note: _id, not id)
cat > my_docs.json << 'EOF'
[
  {"_id": "DOC01", "content": "Elixir is a functional programming language..."},
  {"_id": "DOC02", "content": "Phoenix is a web framework for Elixir..."}
]
EOF

# 2. Import, chunk, and embed
tafkb-json-import my_docs.json
python -m tafkb2.embed.chunk.main
tafkb-setup-download-models          # one-time: download embedding models
python -m tafkb2.embed.embeddings.main

# 3. Search!
tafkb-query "What is Elixir?" --config e5-large-v2_500
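The chunking step splits each document's content into windows before embedding. A minimal sketch of the idea, purely for intuition (the real chunker in tafkb2.embed.chunk may use different sizes, overlap, and boundary logic; reading the 500 in the config name as a chunk length is an assumption):

```python
def chunk_text(text, size=500, overlap=50):
    """Naive fixed-window chunker: overlapping character windows."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
    return chunks

parts = chunk_text("x" * 1200)
# Three windows: chars 0-500, 450-950, 900-1200
```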

JSON Format

Everything except _id becomes the document's data field. The content field is required for chunking:

[
  {
    "_id": "ABC12",
    "content": "The full text content to chunk and embed...",
    "url": "https://example.com/source",
    "title": "Optional metadata"
  }
]
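As a sketch of the mapping described above (illustrative only, not the actual importer code behind tafkb-json-import):

```python
import json

def split_record(record):
    """Per the format above: _id identifies the document;
    everything else (including content) becomes its data field."""
    rec = dict(record)
    doc_id = rec.pop("_id")
    return doc_id, rec

doc_id, data = split_record(json.loads(
    '{"_id": "ABC12", "content": "text...", "url": "https://example.com/source"}'
))
# doc_id == "ABC12"; data keeps "content" and "url"
```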


Part B: Full Pipeline with Intake

The intake system processes emails containing URLs, fetches article content via shot-scraper, and stores documents automatically.

See tafkb2/intake/README.md for full intake setup instructions.

Quick overview:

  1. Configure mbsync to sync an IMAP folder to a local maildir
  2. Install shot-scraper: pip install -e shot-scraper/
  3. Set intake environment variables (DOC_INTAKE_MAILDIR, REMOTE_CDP_URL, etc.)
  4. Run: python -m tafkb2.intake.main

After intake, run chunking and embedding as described in Part A.


CLI Reference

After pip install -e .:

Command                       Description
tafkb-query                   Semantic search
tafkb-query-configs           List available EmbedderConfigs
tafkb-json-import             Import documents from JSON
tafkb-json-export             Export documents to JSON
tafkb-setup-download-models   Download embedding models
tafkb-benchmark               Benchmark search performance

Testing

set -a; source .env; set +a
export TAFKB2_INTAKE_TEST=1
pytest tests/ -v

Important: In test mode, the database user must be "claude" or "test". This prevents accidental test runs against production.
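The guard works along these lines (a sketch of the described behavior; the real check lives inside tafkb2.db, and the function name here is made up):

```python
ALLOWED_TEST_USERS = {"claude", "test"}

def assert_safe_test_user(user):
    """Refuse test-mode connections for any other database user."""
    if user not in ALLOWED_TEST_USERS:
        raise RuntimeError(
            f"TAFKB2_INTAKE_TEST=1 but user is {user!r}; use 'claude' or 'test'"
        )
    return user
```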


Development

Database Connection Safety

All database connections must go through tafkb2.db:

# Correct
from tafkb2.db import get_db
db = get_db()

# For tests
from tafkb2.db import get_test_db
db = get_test_db()

# NEVER do this - bypasses safety checks
from playhouse.postgres_ext import PostgresqlExtDatabase
db = PostgresqlExtDatabase(...)  # UNSAFE

Adding CLI Scripts

  1. Create tafkb2/cli/<name>.py with a main() function
  2. Add to console_scripts in setup.py
  3. Reinstall: pip install -e .

Naming convention: a command named tafkb-foo-bar lives in tafkb2/cli/foo_bar.py.
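For example, a hypothetical tafkb-foo-bar command would be wired up like this in setup.py (fragment only; tafkb-foo-bar and foo_bar are placeholder names, not real commands in this repo):

```python
# setup.py fragment (hypothetical command shown for illustration)
setup(
    # ...
    entry_points={
        "console_scripts": [
            "tafkb-foo-bar=tafkb2.cli.foo_bar:main",
        ],
    },
)
```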


Working with Claude Code

This project is developed with Claude Code in a sandboxed Docker container. The container has no access to the host environment, secrets, or production data.

Files are synced in/out of a claude_tree/ directory (gitignored). See claude.sh for container management commands.