# Public RAG Retrieval Service

Production FastAPI app for hosted source retrieval, served at `https://hyfl.uk/rag/v1/*` via Cloudflare Tunnel.

Created by Bilawal Riaz: https://bilawal.net

The service returns **source snippets and metadata only**. Client applications receive the snippets and decide how to use them as context before generation.

The hosted public corpus is Wikipedia-derived. Treat returned text as background context, not clinical guidance, and do not use this service to diagnose, treat, or make decisions about real patients.

---

## Table of Contents

1. [Quick Start](#quick-start)
2. [Bootstrapping a Master API Key](#bootstrapping-a-master-api-key)
3. [Creating Child API Keys](#creating-child-api-keys)
4. [API Reference](#api-reference)
5. [Cloudflare Tunnel Setup](#cloudflare-tunnel-setup)
6. [Security](#security)
7. [Configuration](#configuration)
8. [Operational Checks](#operational-checks)
9. [Troubleshooting](#troubleshooting)

---

## Quick Start

### 1. Create a strong bootstrap secret

```bash
SECRET="sk_live_$(openssl rand -base64 32 | tr -d '=+/' | head -c 32)"
echo "$SECRET"  # SAVE THIS - it cannot be recovered
```

### 2. Add it to `.env` on the server

```bash
echo "RAG_BOOTSTRAP_API_KEY=$SECRET" >> .env
echo "Save this key: $SECRET"
docker compose up -d --build
```

## Bootstrapping a Master API Key

> **Important:** Store the bootstrap key in a password manager. If you lose it, the database must be rebuilt (`rm data/rag_service.sqlite && docker compose restart`).

### Confirm the bootstrap key works

```bash
curl https://hyfl.uk/rag/v1/usage \
  -H "Authorization: Bearer sk_live_your_long_random_secret_here"
```

---

## Creating Child API Keys

Child keys have limited scopes and budgets. Always create a child key per application - never share the bootstrap key.
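One way to keep to the one-key-per-application rule is to generate each request body from a small helper. The sketch below is illustrative only: `build_child_key_request`, `KNOWN_SCOPES`, and the `ttl_days` parameter are hypothetical names, and the fields (`name`, `scopes`, `expires_at`, `budget`) mirror the request bodies used in this section's curl examples. It only builds the JSON body; sending it is left to the caller.

```python
import json
from datetime import datetime, timedelta, timezone

# Scope names as listed in this README.
KNOWN_SCOPES = {"retrieve", "keys:create", "admin"}


def build_child_key_request(name, scopes=("retrieve",), ttl_days=None,
                            budget=None, now=None):
    """Return a JSON-serialisable body for POST /v1/keys (hypothetical helper).

    `now` must be timezone-aware when supplied; it defaults to the current UTC time.
    """
    unknown = set(scopes) - KNOWN_SCOPES
    if unknown:
        raise ValueError(f"unknown scopes: {sorted(unknown)}")

    body = {"name": name, "scopes": list(scopes)}

    if ttl_days is not None:
        now = now or datetime.now(timezone.utc)
        expires = (now + timedelta(days=ttl_days)).astimezone(timezone.utc)
        # Same "Z"-suffixed UTC format as the expires_at value in the curl example.
        body["expires_at"] = expires.strftime("%Y-%m-%dT%H:%M:%SZ")

    if budget:
        # Only the limits being overridden need to be listed.
        body["budget"] = dict(budget)

    return body


if __name__ == "__main__":
    body = build_child_key_request(
        "mobile-app", ttl_days=30, budget={"rpm_limit": 60, "max_top_k": 5}
    )
    print(json.dumps(body, indent=2))
```

Defaulting `scopes` to `("retrieve",)` keeps new keys read-only unless a wider scope is requested explicitly.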
### Create a read-only key (with a fixed expiry)

```bash
curl -X POST https://hyfl.uk/rag/v1/keys \
  -H "Authorization: Bearer $BOOTSTRAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-app",
    "scopes": ["retrieve"],
    "expires_at": "2027-01-01T00:00:00Z"
  }'
```

Response (this is the **only** time the full `secret` is returned):

```json
{
  "prefix": "sk_live_AbCdEfGhIj",
  "secret": "sk_live_AbCdEfGhIjKlMnOpQrStUvWxYz123456",
  "name": "my-app",
  "scopes": ["retrieve"],
  "expires_at": "2027-01-01T00:00:00+00:00",
  "budget": {
    "rpm_limit": 600,
    "monthly_request_limit": null,
    "max_top_k": 25,
    "max_context_chars": 50000,
    "max_request_bytes": 1000000
  }
}
```

### Available scopes

| Scope | Allows |
|-------|--------|
| `retrieve` | Call `/retrieve` and `/corpora` |
| `keys:create` | Create and revoke child keys |
| `admin` | All scopes, no parent check, can revoke any key |

### Restricted child key (lower limits)

```bash
curl -X POST https://hyfl.uk/rag/v1/keys \
  -H "Authorization: Bearer $BOOTSTRAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mobile-app",
    "scopes": ["retrieve"],
    "budget": {
      "rpm_limit": 60,
      "max_top_k": 5,
      "max_context_chars": 10000
    }
  }'
```

### Revoke a child key

```bash
curl -X DELETE https://hyfl.uk/rag/v1/keys/sk_live_AbCdEfGhIj \
  -H "Authorization: Bearer $BOOTSTRAP_KEY"
```

### Check usage

```bash
curl https://hyfl.uk/rag/v1/usage \
  -H "Authorization: Bearer $CHILD_KEY"
```

---

## API Reference

Base URL: `https://hyfl.uk/rag/v1`

Auth: `Authorization: Bearer sk_live_...` on retrieval, usage, key administration, and MCP endpoints

### `GET /v1/health`

Liveness probe. **No auth required.**

```bash
curl https://hyfl.uk/rag/v1/health
# {"ok": true, "version": "0.1.0"}
```

### `GET /v1/ready`

Readiness probe - verifies DB files, embedding model load, and a warm retrieval. Returns 503 if not ready.
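A deploy script can wait on this probe before sending traffic. The following is a minimal sketch and not part of the service: `wait_until_ready`, its parameters, and the retry policy are assumptions, and it treats any status other than 200 (including the documented 503) as "not ready yet".

```python
import time
import urllib.error
import urllib.request


def _http_status(url):
    """Return the HTTP status for a GET on `url`, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses; the status code is still available.
        return exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar transport errors.
        return None


def wait_until_ready(url, attempts=10, delay_s=3.0,
                     fetch_status=_http_status, sleep=time.sleep):
    """Poll `url` until it answers 200; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        if fetch_status(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    raise TimeoutError(f"{url} not ready after {attempts} attempts")


if __name__ == "__main__":
    used = wait_until_ready("https://hyfl.uk/rag/v1/ready")
    print(f"ready after {used} attempt(s)")
```

`fetch_status` and `sleep` are injectable so the loop can be exercised without network access or real delays.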
```bash
curl https://hyfl.uk/rag/v1/ready
# {"ok": true, "version": "0.1.0", "corpora": ["medical_core", ...]}
```

### `GET /v1/corpora`

List available corpora and their public metadata (no filesystem paths exposed).

```bash
curl https://hyfl.uk/rag/v1/corpora
```

Response:

```json
{
  "corpora": [
    {
      "id": "medical_core",
      "title": "Medical Core",
      "available": true,
      "article_count": 33291,
      "chunk_count": 188028,
      "graph_edge_count": 254344
    }
  ]
}
```

### `POST /v1/retrieve`

The primary endpoint. Returns source snippets for a query.

```bash
curl -X POST https://hyfl.uk/rag/v1/retrieve \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is hypertension?",
    "corpus": ["medical_core"],
    "top_k": 5,
    "max_context_chars": 6000
  }'
```

Request body:

| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `query` | string | required | 1–4000 chars |
| `corpus` | string[] | `["medical_core"]` | Max 4 entries, must be valid corpus IDs |
| `top_k` | int | `8` | 1–50, capped by key limit |
| `max_context_chars` | int | `6000` | 500–100000, capped by key limit |
| `messages` | object[] | `[]` | Max 12 recent conversation turns |
| `conversation` | object | null | Compact state: `current_topic`, `summary`, `salient_terms` |
| `include_trace` | bool | `true` | Include retrieval trace metadata |

Response:

```json
{
  "query": "What is hypertension?",
  "contextualized_query": "What is hypertension?",
  "results": [
    {
      "id": "medical_core:chunk:43851",
      "title": "Systolic hypertension",
      "heading": "Systolic hypertension",
      "url": "https://en.wikipedia.org/wiki/Systolic_hypertension",
      "score": 0.7532,
      "text": "# Systolic hypertension ...",
      "trace": {
        "corpus": "medical_core",
        "retriever": "fts",
        "source_query": "is hypertension",
        "chunk_index": 0,
        "start_char": 0,
        "end_char": 726
      }
    }
  ],
  "latency_ms": 218,
  "conversation": {
    "mode": "standalone",
    "used_recent_user_messages": 0,
    "used_summary": false,
    "used_terms": []
  }
}
```

**Errors:**

- `400` - invalid corpus name, `top_k`/`max_context_chars` exceeds key limit
- `401` - missing or invalid bearer token
- `403` - token lacks `retrieve` scope
- `413` - request body too large
- `422` - validation error (e.g. empty query)
- `429` - rate limit exceeded
- `504` - retrieval timed out (8s default)

### `POST /v1/retrieve/stream`

Server-Sent Events version of `/retrieve`. Send the same JSON body and bearer token, optionally with `Accept: text/event-stream`.

Events:

- `status` - retrieval has started
- `result` - one returned source snippet with `index` and `item`
- `results` - contains `result_count`, `latency_ms`, and `payload` with the normal `/retrieve` response
- `done` - stream completion marker

### `POST /v1/chat`

**Always returns 501.** Hosted generation is intentionally not offered. Use `/v1/retrieve` and have your client LLM write the answer.

```json
{"detail": "hosted generation is not offered; call /retrieve and let the client LLM write the answer"}
```

### `POST /v1/keys`

Create a child key. Requires `keys:create` or `admin` scope. [See examples above.](#creating-child-api-keys)

### `DELETE /v1/keys/{prefix}`

Revoke a child key. Requires `keys:create` or `admin` scope. Non-admins can only revoke their own children.

```bash
curl -X DELETE https://hyfl.uk/rag/v1/keys/sk_live_AbCdEfGhIj \
  -H "Authorization: Bearer $KEY"
# {"ok": true}
```

### `GET /v1/usage`

Current key usage and limits. Requires `retrieve` scope.

```bash
curl https://hyfl.uk/rag/v1/usage \
  -H "Authorization: Bearer $KEY"
```

---

## Cloudflare Tunnel Setup

The service runs on `127.0.0.1:8091` inside Docker and is exposed publicly only through a Cloudflare Tunnel.

### 1. Install cloudflared on the server

```bash
curl -fsSL https://pkg.cloudflare.com/cloudflare-main.gpg | gpg --dearmor -o /usr/share/keyrings/cloudflare.gpg
echo "deb [signed-by=/usr/share/keyrings/cloudflare.gpg] https://pkg.cloudflare.com/cloudflared $(lsb_release -cs) main" > /etc/apt/sources.list.d/cloudflared.list
apt-get update && apt-get install -y cloudflared
```

### 2. Authenticate and create a tunnel

```bash
cloudflared tunnel login
cloudflared tunnel create hyfl-rag
```

### 3. Configure the tunnel

The tunnel needs **two rules** for `hyfl.uk`:

- One with `path: /rag/*` to proxy API requests
- One without a path to serve the landing page

In the Cloudflare dashboard (Zero Trust → Networks → Tunnels → your tunnel → Configure → Public hostname):

| Hostname | Path | Service |
|----------|------|---------|
| `hyfl.uk` | `/rag/*` | `http://127.0.0.1:8091` |
| `hyfl.uk` | _(none - empty)_ | `http://127.0.0.1:8091` |

If you manage the config via file (`~/.cloudflared/config.yml`):

```yaml
tunnel:
credentials-file: /root/.cloudflared/.json
ingress:
  - hostname: hyfl.uk
    path: /rag/.*
    service: http://127.0.0.1:8091
  - hostname: hyfl.uk
    service: http://127.0.0.1:8091
  - service: http_status:404
```

### 4. Route DNS

```bash
cloudflared tunnel route dns hyfl-rag hyfl.uk
```

### 5. Run the tunnel

```bash
cloudflared tunnel run hyfl-rag
```

### 6. Verify

```bash
curl https://hyfl.uk/rag/v1/health
```

---

## Security

### Authentication

- Retrieval, usage, child-key administration, and MCP endpoints require `Authorization: Bearer sk_live_...`. Health, readiness, corpora, docs, status, and anonymous key creation are public.
- Keys are stored as salted PBKDF2-SHA256 hashes (210,000 iterations).
- The full secret is never persisted - it is returned **only** at creation time.

### Authorization

- Three scopes: `retrieve`, `keys:create`, `admin`.
- Child key scopes must be a subset of the parent's (unless parent is `admin`).
- Child key expiry cannot outlive parent's.
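The two child-key rules above can be expressed as a small check. This is a sketch of the stated rules, not the service's actual code: `check_child_key` and its parameters are hypothetical, and it assumes that a missing expiry means "never expires" and that the expiry rule applies to `admin` parents as well.

```python
from datetime import datetime, timezone


def check_child_key(parent_scopes, child_scopes,
                    parent_expires_at=None, child_expires_at=None):
    """Return a list of rule violations; an empty list means the child key is allowed."""
    parent = set(parent_scopes)
    child = set(child_scopes)
    problems = []

    # Rule 1: child scopes must be a subset of the parent's, unless the parent is admin.
    if "admin" not in parent:
        extra = child - parent
        if extra:
            problems.append(f"scopes not held by parent: {sorted(extra)}")

    # Rule 2: the child may not outlive the parent.
    # Assumption: None means "no expiry", so a non-expiring child under an
    # expiring parent is rejected.
    if parent_expires_at is not None:
        if child_expires_at is None or child_expires_at > parent_expires_at:
            problems.append("child key would outlive parent")

    return problems


if __name__ == "__main__":
    parent_exp = datetime(2027, 1, 1, tzinfo=timezone.utc)
    child_exp = datetime(2026, 6, 1, tzinfo=timezone.utc)
    print(check_child_key(["retrieve", "keys:create"], ["retrieve"],
                          parent_exp, child_exp))
```

Returning a list of problems rather than raising on the first one lets a caller report every violated rule in a single error response.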
### Rate limiting

- Per-key rolling-window: 600 RPM default (configurable per key).
- Exceeding returns `429`.

### CORS

- Browser preflight is supported for `GET`, `POST`, `DELETE`, and `OPTIONS`.
- Default origins are `*`; bearer authentication is still required for protected endpoints.

### Input validation

- Request body capped by `Content-Length` and `max_request_bytes` per key.
- `corpus` names validated against configured corpora.
- `top_k`, `max_context_chars` capped by key limits.
- Pydantic validates all input fields.

### Information exposure

- `/v1/corpora` never exposes filesystem paths.
- Errors do not leak stack traces to clients (500 errors logged server-side only).
- Public retrieval disables external article fetching and model query expansion.

### Network exposure

- Docker binds `127.0.0.1:8091:8091` only - not reachable on the public VPS interface.
- All public access goes through Cloudflare Tunnel.

---

## Configuration

Environment variables (set in `.env` or `docker-compose.yml`):

| Variable | Default | Description |
|----------|---------|-------------|
| `RAG_BOOTSTRAP_API_KEY` | `""` | Master admin key inserted on startup |
| `RAG_CORS_ORIGINS` | `*` | Comma-separated CORS origins, or `*` for browser clients from any origin |
| `RAG_WARM_ON_STARTUP` | `0` | Run a warmup retrieval on startup |
| `RAG_READY_CORPUS` | `medical_core` | Corpus used for the ready check |
| `RAG_READY_QUERY` | `ADHD treatment` | Query used for the ready check |
| `RAG_VECTOR_TOP_K` | `0` | faiss HNSW vector retrieval top-k (0 = disabled). When >0, semantic search joins the FTS + graph candidate pool. |
| `RAG_VECTOR_HNSW_M` | `16` | HNSW graph degree (M). Higher = better recall, more memory. |
| `RAG_VECTOR_HNSW_EF_CONSTRUCTION` | `200` | HNSW build-time search depth. Higher = better-quality index, slower build. |
| `RAG_VECTOR_HNSW_EF` | `50` | HNSW query-time search depth. Higher = better recall, slightly slower. |
| `RAG_VECTOR_CANDIDATE_MULTIPLIER` | `3` | Per-corpus vector candidate pool = `top_k * multiplier`. |
| `RAG_RETRIEVAL_TIMEOUT_S` | `8.0` | Per-request retrieval timeout |
| `RAG_STAGE_TIMEOUT_S` | `6.0` | Per-stage FTS/graph timeout |
| `RAG_MAX_QUERY_REWRITES` | `4` | Max deterministic query rewrites |
| `RAG_FTS_TOP_K` | `top_k * 4` | FTS candidate pool size |
| `RAG_GRAPH_TOP_K` | `top_k * 2` | Graph candidate pool size |
| `RAG_MAX_PROBE_TERMS` | `10` | Max terms in the compact retrieval probe |

> **Vector search is enabled in production.** The HNSW index is built once at warmup from the chunk embeddings already stored in each corpus SQLite DB and persisted to `.db.hnsw.faiss` next to the DB file. Subsequent container restarts skip the build (just an mmap). To force a rebuild after re-ingestion, run `python -m rag_service.cli reindex db/.db` inside the container, or delete the `.hnsw.faiss` and `.hnsw.meta.json` files and restart.

---

## Operational Checks

### Health from the host

```bash
curl -s http://127.0.0.1:8091/v1/health
```

### Container status

```bash
docker ps --filter name=rag-api
docker inspect hyfl-rag-rag-api-1 --format '{{.State.Health.Status}}'
```

### Container resources

```bash
docker stats hyfl-rag-rag-api-1 --no-stream
```

### Recent logs

```bash
docker logs hyfl-rag-rag-api-1 --tail 100 --since 10m
```

### Database integrity

```bash
docker exec hyfl-rag-rag-api-1 sqlite3 /app/data/rag_service.sqlite "PRAGMA integrity_check;"
docker exec hyfl-rag-rag-api-1 sqlite3 /app/data/rag_service.sqlite "SELECT COUNT(*) FROM api_keys;"
```

### Tunnel status

```bash
cloudflared tunnel info hyfl-rag
ps aux | grep cloudflared
```

---

## Troubleshooting

### `500 Internal Server Error` on retrieve

- Check container logs: `docker logs hyfl-rag-rag-api-1 --tail 50`
- Verify corpus DBs exist: `docker exec hyfl-rag-rag-api-1 ls /app/db/`
- Confirm config: `docker exec hyfl-rag-rag-api-1 cat /app/rag_config.json`

### `401 Unauthorized` on a valid key

- Verify the key is in the DB: `docker exec hyfl-rag-rag-api-1 sqlite3 /app/data/rag_service.sqlite "SELECT prefix, name, revoked_at FROM api_keys;"`
- Check the key hasn't expired: `SELECT prefix, expires_at FROM api_keys;`
- Confirm the `Authorization` header is formatted correctly: `Authorization: Bearer sk_live_...`

### `404 Not Found` on `/rag/v1`

- Check Cloudflare Tunnel ingress rule includes `path: /rag/.*`
- Verify DNS: `dig hyfl.uk` should show a CNAME to the tunnel
- Check tunnel logs: `journalctl -u cloudflared -f`

### High latency (>2s)

- First request after cold start is slowest (embedder load + FTS warmup). Subsequent requests should be <500ms.
- If consistently slow, check: `docker stats hyfl-rag-rag-api-1` for memory pressure.
- If vector search is on (`RAG_VECTOR_TOP_K > 0`), the HNSW index build happens once at warmup (~60s for 188k chunks on 4 ARM cores). Inspect `/v1/ready` → `vector_index.loaded_corpora` to confirm `from_disk: true` after the first boot.
- Lower `RAG_VECTOR_HNSW_EF` to speed up vector search at the cost of recall.

### Lost bootstrap key

```bash
# Stop the container, remove the key store, restart with a new bootstrap key
docker compose down
rm data/rag_service.sqlite
echo "RAG_BOOTSTRAP_API_KEY=sk_live_NEW_SECRET" >> .env
docker compose up -d
```

---

## MCP Endpoint

The service also exposes a bearer-authenticated MCP JSON-RPC endpoint:

- Local: `http://127.0.0.1:8091/mcp`
- Public: `https://hyfl.uk/rag/mcp`

It provides one tool: `retrieve_evidence`. See the MCP endpoint at `https://hyfl.uk/rag/mcp` for protocol integration.
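For orientation, a tool invocation over JSON-RPC might be assembled as follows. This is a hedged sketch: the `tools/call` method and the `name`/`arguments` parameter shape follow the general MCP convention, but the `query` argument name is an assumption about the `retrieve_evidence` schema, `build_tool_call` is a hypothetical helper, and any initialization handshake or `Accept` headers the server may expect are omitted. The function only builds the request; nothing is sent.

```python
import json
import urllib.request

MCP_URL = "https://hyfl.uk/rag/mcp"


def build_tool_call(query, api_key, request_id=1, url=MCP_URL):
    """Build (but do not send) a JSON-RPC 2.0 request that calls retrieve_evidence."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "retrieve_evidence",
            # Assumption: the tool accepts a "query" argument; check the
            # tool schema reported by the endpoint before relying on this.
            "arguments": {"query": query},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_tool_call("What is hypertension?", "sk_live_example")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it; a real key with the appropriate scope is needed for that step.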