
tafkb2 - Document Chunking and Embedding

A system for processing documents into searchable vector embeddings. Includes an optional email-based intake pipeline.

Use Cases

  1. Full Pipeline - Process emails → fetch articles → chunk → embed → search
  2. Chunking/Embedding Only - Bring your own documents, use chunking and embedding

Installation

Prerequisites

  • Python 3.13+ (tested with 3.13.3)
  • PostgreSQL 17+ with pgvector extension
  • pyenv (recommended for Python version management)

Setup

# Clone with submodule (shot-scraper fork, needed only for intake)
git clone --recursive https://gitlab.com/t-f/document_intake_and_embedding.git
cd tafkb2

# Set Python version (if using pyenv)
pyenv install 3.13.3
pyenv local 3.13.3

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install the package
pip install -e .

Environment Configuration

Create a .env file (dotenv-compatible format):

# PostgreSQL connection
POSTGRES_HOSTNAME=localhost
POSTGRES_PORT=5432
POSTGRES_USER=youruser
POSTGRES_PASSWORD=yourpassword
POSTGRES_DB=kb2

# Test database (used by pytest and when TAFKB2_INTAKE_TEST=1)
TEST_POSTGRES_HOSTNAME=localhost
TEST_POSTGRES_PORT=5432
TEST_POSTGRES_USER=testuser
TEST_POSTGRES_PASSWORD=testpassword
TEST_POSTGRES_DB=kb2_test

# Project root (required for intake scripts)
REPO_ROOT=/path/to/tafkb2

Load before running:

set -a; source .env; set +a
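For reference, the primary connection variables combine into a standard PostgreSQL DSN. The sketch below only illustrates the shape; `tafkb2.db` reads the variables itself via `get_db()`, so you normally never build this string by hand:

```python
# Illustrative only: how the .env variables map onto a libpq-style URL.
env = {
    "POSTGRES_USER": "youruser",
    "POSTGRES_PASSWORD": "yourpassword",
    "POSTGRES_HOSTNAME": "localhost",
    "POSTGRES_PORT": "5432",
    "POSTGRES_DB": "kb2",
}
dsn = (
    "postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}"
    "@{POSTGRES_HOSTNAME}:{POSTGRES_PORT}/{POSTGRES_DB}"
).format(**env)
# dsn == "postgresql://youruser:yourpassword@localhost:5432/kb2"
```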

Verify Installation

# Check CLI is available
tafkb-query-configs

# Test database connection
python -c "
from tafkb2.db import get_db, init_db
db = get_db()
init_db(db)
print('Database connection successful')
"

Part A: Chunking and Embedding (No Intake)

Three steps to semantic search: bring your own documents, then chunk, embed, and search.

# 1. Prepare your documents as JSON (note: _id, not id)
cat > my_docs.json << 'EOF'
[
  {"_id": "DOC01", "content": "Elixir is a functional programming language..."},
  {"_id": "DOC02", "content": "Phoenix is a web framework for Elixir..."}
]
EOF

# 2. Import, chunk, and embed
tafkb-json-import my_docs.json
python -m tafkb2.embed.chunk.main
tafkb-setup-download-models          # one-time: download embedding models
python -m tafkb2.embed.embeddings.main

# 3. Search!
tafkb-query "What is Elixir?" --config e5-large-v2_500
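The chunking step splits each document's content into windows before embedding. A minimal sketch of the idea, purely for intuition (the real chunker in tafkb2.embed.chunk may use different sizes, overlap, and boundary logic; reading the 500 in the config name as a chunk length is an assumption):

```python
def chunk_text(text, size=500, overlap=50):
    """Naive fixed-window chunker: overlapping character windows."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
    return chunks

parts = chunk_text("x" * 1200)
# Three windows: chars 0-500, 450-950, 900-1200
```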

JSON Format

Everything except _id becomes the document's data field. The content field is required for chunking:

[
  {
    "_id": "ABC12",
    "content": "The full text content to chunk and embed...",
    "url": "https://example.com/source",
    "title": "Optional metadata"
  }
]
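As a sketch of the mapping described above (illustrative only, not the actual importer code behind tafkb-json-import):

```python
import json

def split_record(record):
    """Per the format above: _id identifies the document;
    everything else (including content) becomes its data field."""
    rec = dict(record)
    doc_id = rec.pop("_id")
    return doc_id, rec

doc_id, data = split_record(json.loads(
    '{"_id": "ABC12", "content": "text...", "url": "https://example.com/source"}'
))
# doc_id == "ABC12"; data keeps "content" and "url"
```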


Part B: Full Pipeline with Intake

The intake system processes emails containing URLs, fetches article content via shot-scraper, and stores documents automatically.

See tafkb2/intake/README.md for full intake setup instructions.

Quick overview:

  1. Configure mbsync to sync an IMAP folder to a local maildir
  2. Install shot-scraper: pip install -e shot-scraper/
  3. Set intake environment variables (DOC_INTAKE_MAILDIR, REMOTE_CDP_URL, etc.)
  4. Run: python -m tafkb2.intake.main

After intake, run chunking and embedding as described in Part A.


CLI Reference

After pip install -e .:

Command                       Description
tafkb-query                   Semantic search
tafkb-query-configs           List available EmbedderConfigs
tafkb-json-import             Import documents from JSON
tafkb-json-export             Export documents to JSON
tafkb-setup-download-models   Download embedding models
tafkb-benchmark               Benchmark search performance

Testing

set -a; source .env; set +a
export TAFKB2_INTAKE_TEST=1
pytest tests/ -v

Important: In test mode, the database user must be "claude" or "test". This prevents accidental test runs against production.
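The guard works along these lines (a sketch of the described behavior; the real check lives inside tafkb2.db, and the function name here is made up):

```python
ALLOWED_TEST_USERS = {"claude", "test"}

def assert_safe_test_user(user):
    """Refuse test-mode connections for any other database user."""
    if user not in ALLOWED_TEST_USERS:
        raise RuntimeError(
            f"TAFKB2_INTAKE_TEST=1 but user is {user!r}; use 'claude' or 'test'"
        )
    return user
```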


Development

Database Connection Safety

All database connections must go through tafkb2.db:

# Correct
from tafkb2.db import get_db
db = get_db()

# For tests
from tafkb2.db import get_test_db
db = get_test_db()

# NEVER do this - bypasses safety checks
from playhouse.postgres_ext import PostgresqlExtDatabase
db = PostgresqlExtDatabase(...)  # UNSAFE

Adding CLI Scripts

  1. Create tafkb2/cli/<name>.py with a main() function
  2. Add to console_scripts in setup.py
  3. Reinstall: pip install -e .

Naming convention: a command named tafkb-foo-bar lives in tafkb2/cli/foo_bar.py.
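For example, a hypothetical tafkb-foo-bar command would be wired up like this in setup.py (fragment only; tafkb-foo-bar and foo_bar are placeholder names, not real commands in this repo):

```python
# setup.py fragment (hypothetical command shown for illustration)
setup(
    # ...
    entry_points={
        "console_scripts": [
            "tafkb-foo-bar=tafkb2.cli.foo_bar:main",
        ],
    },
)
```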


Working with Claude Code

This project is developed with Claude Code in a sandboxed Docker container. The container has no access to the host environment, secrets, or production data.

Files are synced in/out of a claude_tree/ directory (gitignored). See claude.sh for container management commands.