tafkb2 - Document Chunking and Embedding
A system for processing documents into searchable vector embeddings. Includes an optional email-based intake pipeline.
Use Cases
- Full Pipeline - Process emails → fetch articles → chunk → embed → search
- Chunking/Embedding Only - Bring your own documents, use chunking and embedding
Installation
Prerequisites
- Python 3.13+ (tested with 3.13.3)
- PostgreSQL 17+ with pgvector extension
- pyenv (recommended for Python version management)
Setup
# Clone with submodule (shot-scraper fork, needed only for intake)
git clone --recursive https://gitlab.com/t-f/document_intake_and_embedding.git
cd tafkb2
# Set Python version (if using pyenv)
pyenv install 3.13.3
pyenv local 3.13.3
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install the package
pip install -e .
Environment Configuration
Create a .env file (dotenv-compatible format):
# PostgreSQL connection
POSTGRES_HOSTNAME=localhost
POSTGRES_PORT=5432
POSTGRES_USER=youruser
POSTGRES_PASSWORD=yourpassword
POSTGRES_DB=kb2
# Test database (used by pytest and when TAFKB2_INTAKE_TEST=1)
TEST_POSTGRES_HOSTNAME=localhost
TEST_POSTGRES_PORT=5432
TEST_POSTGRES_USER=testuser
TEST_POSTGRES_PASSWORD=testpassword
TEST_POSTGRES_DB=kb2_test
# Project root (required for intake scripts)
REPO_ROOT=/path/to/tafkb2
Load before running:
set -a; source .env; set +a
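Alternatively, the variables can be loaded from Python. A minimal sketch, assuming the python-dotenv package is installed (it is not listed as a project requirement in this README):

```python
# Sketch: load the dotenv-format .env into os.environ from Python.
# Assumes python-dotenv is installed (pip install python-dotenv); the project
# only needs the variables to be present in the environment.
import os
from dotenv import load_dotenv

load_dotenv()  # reads ./.env by default

# Sanity-check the variables described above
required = ("POSTGRES_HOSTNAME", "POSTGRES_PORT", "POSTGRES_USER",
            "POSTGRES_PASSWORD", "POSTGRES_DB", "REPO_ROOT")
missing = [v for v in required if v not in os.environ]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```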
Verify Installation
# Check CLI is available
tafkb-query-configs
# Test database connection
python -c "
from tafkb2.db import get_db, init_db
db = get_db()
init_db(db)
print('Database connection successful')
"
Part A: Chunking and Embedding (No Intake)
Three steps to semantic search. Bring your own documents, chunk, embed, search.
# 1. Prepare your documents as JSON (note: _id, not id)
cat > my_docs.json << 'EOF'
[
{"_id": "DOC01", "content": "Elixir is a functional programming language..."},
{"_id": "DOC02", "content": "Phoenix is a web framework for Elixir..."}
]
EOF
# 2. Import, chunk, and embed
tafkb-json-import my_docs.json
python -m tafkb2.embed.chunk.main
tafkb-setup-download-models # one-time: download embedding models
python -m tafkb2.embed.embeddings.main
# 3. Search!
tafkb-query "What is Elixir?" --config e5-large-v2_500
JSON Format
Everything except _id becomes the document's data field. The content field is required for chunking:
[
{
"_id": "ABC12",
"content": "The full text content to chunk and embed...",
"url": "https://example.com/source",
"title": "Optional metadata"
}
]
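If the import file is produced programmatically, a minimal standard-library sketch (file name and field values are illustrative):

```python
# Sketch: build an import file for tafkb-json-import.
# Only _id and content are required; all other keys end up in the document's
# data field. The file name and values are illustrative.
import json

docs = [
    {
        "_id": "DOC01",
        "content": "Elixir is a functional programming language...",
        "url": "https://example.com/elixir",  # optional metadata
        "title": "Intro to Elixir",           # optional metadata
    },
    {"_id": "DOC02", "content": "Phoenix is a web framework for Elixir..."},
]

with open("my_docs.json", "w", encoding="utf-8") as f:
    json.dump(docs, f, ensure_ascii=False, indent=2)

# Import with: tafkb-json-import my_docs.json
```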
Customization
- Add chunking strategies: see tafkb2/embed/chunk/README.md
- Add embedding models: see tafkb2/embed/README.md
- Use search in Python: see tafkb2/README__SEARCH.md
Part B: Full Pipeline with Intake
The intake system processes emails containing URLs, fetches article content via shot-scraper, and stores documents automatically.
See tafkb2/intake/README.md for full intake setup instructions.
Quick overview:
- Configure mbsync to sync an IMAP folder to a local maildir
- Install shot-scraper: pip install -e shot-scraper/
- Set intake environment variables (DOC_INTAKE_MAILDIR, REMOTE_CDP_URL, etc.)
- Run: python -m tafkb2.intake.main
After intake, run chunking and embedding as described in Part A.
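The stages can also be chained from one Python script. A minimal orchestration sketch that only shells out to the module entry points documented in this README (it assumes the .env variables are already exported):

```python
# Sketch: run intake, chunking, and embedding back to back.
# Assumes the environment variables from the .env file are already set;
# each stage is invoked exactly as documented, with no extra options.
import subprocess
import sys

stages = [
    [sys.executable, "-m", "tafkb2.intake.main"],            # Part B: email/URL intake
    [sys.executable, "-m", "tafkb2.embed.chunk.main"],       # chunk stored documents
    [sys.executable, "-m", "tafkb2.embed.embeddings.main"],  # embed the chunks
]

for cmd in stages:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop at the first failing stage
```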
CLI Reference
After pip install -e .:
| Command | Description |
|---|---|
| tafkb-query | Semantic search |
| tafkb-query-configs | List available EmbedderConfigs |
| tafkb-json-import | Import documents from JSON |
| tafkb-json-export | Export documents to JSON |
| tafkb-setup-download-models | Download embedding models |
| tafkb-benchmark | Benchmark search performance |
Testing
set -a; source .env; set +a
export TAFKB2_INTAKE_TEST=1
pytest tests/ -v
Important: In test mode, the database user must be "claude" or "test". This prevents accidental test runs against production.
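A minimal smoke-test sketch using only the helpers named in this README (get_test_db and init_db); the file name is illustrative:

```python
# tests/test_db_smoke.py (illustrative file name)
# Requires TAFKB2_INTAKE_TEST=1 and the TEST_POSTGRES_* variables, with a
# test database user named "claude" or "test".
from tafkb2.db import get_test_db, init_db


def test_test_database_initializes():
    db = get_test_db()  # safe test connection (see Database Connection Safety)
    init_db(db)         # mirrors the verification snippet above, against the test DB
```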
Development
Database Connection Safety
All database connections must go through tafkb2.db:
# Correct
from tafkb2.db import get_db
db = get_db()
# For tests
from tafkb2.db import get_test_db
db = get_test_db()
# NEVER do this - bypasses safety checks
from playhouse.postgres_ext import PostgresqlExtDatabase
db = PostgresqlExtDatabase(...) # UNSAFE
Adding CLI Scripts
- Create tafkb2/cli/<name>.py with a main() function
- Add to console_scripts in setup.py
- Reinstall: pip install -e .
Naming convention: tafkb-foo-bar → foo_bar.py
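A sketch of what such a script might look like; the module name follows the convention above, and the setup.py entry mentioned afterwards is illustrative rather than copied from the actual file:

```python
# tafkb2/cli/foo_bar.py (illustrative; exposed as tafkb-foo-bar per the convention)
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Example tafkb-foo-bar command")
    parser.add_argument("--verbose", action="store_true", help="print extra detail")
    args = parser.parse_args()
    if args.verbose:
        print("foo-bar: running in verbose mode")
    print("foo-bar: done")


if __name__ == "__main__":
    main()
```

In setup.py, the corresponding console_scripts entry would be something like "tafkb-foo-bar = tafkb2.cli.foo_bar:main" (illustrative), after which pip install -e . makes the command available.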
Working with Claude Code
This project is developed with Claude Code in a sandboxed Docker container. The container has no access to the host environment, secrets, or production data.
Files are synced in/out of a claude_tree/ directory (gitignored). See claude.sh for container management commands.