# HyDE Retrieval
HyDE (Hypothetical Document Embeddings) improves RAG and memory retrieval by generating a hypothetical answer before embedding. Instead of embedding the raw user query, HyDE first asks an LLM to produce a plausible answer, then embeds that answer for vector search. The hypothesis is semantically closer to the stored documents than the question is, yielding better recall.
Based on:
- Gao et al. 2023 "Precise Zero-Shot Dense Retrieval without Relevance Labels"
- Lei et al. 2025 "Never Come Up Empty: Adaptive HyDE Retrieval for Improving LLM Developer Support"
## How It Works

```
Standard: Query --> Embed(query) --> Vector Search --> Results

HyDE:     Query --> LLM(hypothesis) --> Embed(hypothesis) --> Vector Search --> Results
                    ^                   ^
                    extra LLM call      better semantic match
```
The key insight: questions and answers live in different regions of embedding space. A question like "What causes memory leaks in Node?" is far from the answer text "Memory leaks in Node.js are caused by...". But a hypothetical answer generated from the question is much closer to the stored answer, producing higher cosine similarity scores.
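To make this concrete, here is a minimal, self-contained sketch of that insight, reusing the OpenAI SDK that the examples below already assume. The model name, texts, and the `embed` and `cosine` helpers are illustrative assumptions, not part of the AgentOS API.

```ts
import OpenAI from 'openai';

const openai = new OpenAI();

// Illustrative helper (not an AgentOS API): embed a string with OpenAI.
async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return res.data[0].embedding;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const question = 'What causes memory leaks in Node?';
const hypothesis = 'Memory leaks in Node.js are caused by retained closures, unbounded caches, ...'; // LLM-generated
const storedDoc = 'Memory leaks in Node.js are caused by...'; // document in the vector store

const [q, h, d] = await Promise.all([question, hypothesis, storedDoc].map(embed));
console.log('question   vs doc:', cosine(q, d)); // typically lower
console.log('hypothesis vs doc:', cosine(h, d)); // typically higher
```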
## When to Use HyDE

**Good candidates:**
- Knowledge base queries where the question phrasing differs from document style
- Vague or exploratory queries ("that thing about deployment")
- Memory recall where stored traces are statement-form, not question-form
- Background/batch processing where latency is less critical
**Avoid when:**
- Real-time chat with tight latency budgets (adds one LLM call per query)
- Simple keyword-style lookups where direct embedding already works well
- The query is already in statement/answer form (a simple gate encoding these rules is sketched below)
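One way to apply these guidelines is a small per-request gate. The heuristics below (latency budget, word count, question detection) are illustrative assumptions, not library behavior; tune them to your workload.

```ts
// Illustrative heuristic gate for the per-request `hyde.enabled` flag.
function shouldUseHyde(query: string, latencyBudgetMs: number): boolean {
  if (latencyBudgetMs < 500) return false; // tight real-time budget: skip the extra LLM call
  const words = query.trim().split(/\s+/);
  if (words.length <= 3) return false; // keyword-style lookup: direct embedding usually suffices
  const looksLikeQuestion =
    query.trimEnd().endsWith('?') ||
    /^(what|why|how|when|where|who|which|can|does|is|are)\b/i.test(query);
  if (!looksLikeQuestion) return false; // already statement/answer form
  return true;
}

// Used with the per-request flag shown in the Programmatic API below:
// const result = await augmentor.retrieveContext(query, {
//   hyde: { enabled: shouldUseHyde(query, budgetMs) },
// });
```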
## Configuration

HyDE is enabled per-request; `agent.config.json` supplies the defaults that requests inherit and can override (see the Programmatic API below). The `HydeRetriever` class and its config types are exported from `@framers/agentos/rag`.
```json
{
  "rag": {
    "hyde": {
      "enabled": true,
      "initialThreshold": 0.7,
      "minThreshold": 0.3,
      "thresholdStep": 0.1,
      "adaptiveThreshold": true,
      "maxHypothesisTokens": 200,
      "fullAnswerGranularity": true
    }
  }
}
```
### Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Master switch for HyDE |
| `initialThreshold` | number | `0.7` | Starting similarity threshold |
| `minThreshold` | number | `0.3` | Lowest threshold before giving up |
| `thresholdStep` | number | `0.1` | How much the threshold is reduced per step |
| `adaptiveThreshold` | boolean | `true` | Enable step-down when no results are found |
| `maxHypothesisTokens` | number | `200` | Max tokens for hypothesis generation |
| `fullAnswerGranularity` | boolean | `true` | Generate full prose answers instead of keywords |
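
With these defaults and `adaptiveThreshold` on, retrieval tries thresholds 0.7, 0.6, 0.5, 0.4, and finally 0.3, i.e. at most (0.7 - 0.3) / 0.1 = 4 step-downs, before returning empty results.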
## Programmatic API

### 1. RetrievalAugmentor (main RAG pipeline)
```ts
import { RetrievalAugmentor } from '@framers/agentos/rag';

const augmentor = new RetrievalAugmentor();
await augmentor.initialize(config, embeddingManager, vectorStoreManager);

// Register an LLM caller for hypothesis generation
augmentor.setHydeLlmCaller(async (systemPrompt, userPrompt) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userPrompt },
    ],
    max_tokens: 200,
  });
  return response.choices[0].message.content ?? '';
});

// Enable HyDE per-request
const result = await augmentor.retrieveContext('What causes memory leaks?', {
  hyde: {
    enabled: true,
    // Optional: pre-supply a hypothesis to skip the LLM call
    // hypothesis: 'Memory leaks are caused by...',
    // Optional: tune thresholds for this request
    // initialThreshold: 0.8,
    // minThreshold: 0.4,
  },
});

// HyDE diagnostics are in the result
console.log(result.diagnostics?.hyde);
// {
//   hypothesis: 'Memory leaks in Node.js are typically caused by...',
//   hypothesisLatencyMs: 342,
//   effectiveThreshold: 0.7,
//   thresholdSteps: 0,
// }
```
### 2. MultimodalIndexer (cross-modal search)
```ts
import { MultimodalIndexer, HydeRetriever } from '@framers/agentos/rag';

const indexer = new MultimodalIndexer({
  embeddingManager,
  vectorStore,
  visionProvider,
});

// Attach a HyDE retriever
indexer.setHydeRetriever(new HydeRetriever({
  llmCaller: myLlmCaller,
  embeddingManager,
  config: { enabled: true },
}));

// Search with HyDE
const results = await indexer.search('architecture diagram', {
  modalities: ['image'],
  hyde: { enabled: true },
});
```
### 3. CognitiveMemoryManager (memory recall)
```ts
import { CognitiveMemoryManager, HydeRetriever } from '@framers/agentos';

const memoryManager = new CognitiveMemoryManager();
await memoryManager.initialize(config);

// Attach a HyDE retriever
memoryManager.setHydeRetriever(new HydeRetriever({
  llmCaller: myLlmCaller,
  embeddingManager,
  config: { enabled: true },
}));

// Retrieve memories with HyDE
const result = await memoryManager.retrieve(
  'that deployment discussion',
  currentMood,
  { hyde: true },
);
```
### 4. Standalone HydeRetriever
```ts
import { HydeRetriever } from '@framers/agentos/rag';

const retriever = new HydeRetriever({
  llmCaller: async (system, user) => {
    // Your LLM call here; return the generated hypothesis text
    return hypotheticalAnswer;
  },
  embeddingManager,
  config: {
    enabled: true,
    adaptiveThreshold: true,
    initialThreshold: 0.7,
    minThreshold: 0.3,
  },
});

// Generate a hypothesis only
const { hypothesis, latencyMs } = await retriever.generateHypothesis(
  'What is retrieval augmented generation?',
);

// Full retrieval cycle with adaptive thresholding
const result = await retriever.retrieve({
  query: 'What is RAG?',
  vectorStore: myVectorStore,
  collectionName: 'knowledge-base',
});
```
## Adaptive Thresholding

HyDE supports adaptive threshold stepping: if no results are found at the initial similarity threshold, it steps the threshold down until content is found or the minimum threshold is reached. This ensures HyDE never "comes up empty."

```
Initial threshold: 0.7 --> no results
Step down to:      0.6 --> no results
Step down to:      0.5 --> found 3 results (stop here)
```

The `thresholdSteps` diagnostic tells you how many step-downs were needed.
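In code, the step-down amounts to a simple loop. The sketch below is a minimal reimplementation of that behavior under the config defaults, with an assumed `(embedding, minScore) => results` search signature; it is not the library's internal code.

```ts
// Minimal sketch of adaptive threshold stepping (assumed search signature).
async function searchWithStepDown<T>(
  embedding: number[],
  search: (embedding: number[], minScore: number) => Promise<T[]>,
  cfg = { initialThreshold: 0.7, minThreshold: 0.3, thresholdStep: 0.1 },
): Promise<{ results: T[]; effectiveThreshold: number; thresholdSteps: number }> {
  let threshold = cfg.initialThreshold;
  let steps = 0;
  for (;;) {
    const results = await search(embedding, threshold);
    // Stop when something was found, or when the threshold has bottomed out.
    // The epsilon guards against floating-point drift from repeated subtraction.
    if (results.length > 0 || threshold <= cfg.minThreshold + 1e-9) {
      return { results, effectiveThreshold: threshold, thresholdSteps: steps };
    }
    threshold = Math.max(cfg.minThreshold, threshold - cfg.thresholdStep);
    steps += 1;
  }
}
```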
## Audit Trail

When `includeAudit: true` is passed to `retrieveContext()`, HyDE operations appear in the audit trail with operation type `'hyde'`:
```ts
const result = await augmentor.retrieveContext(query, {
  hyde: { enabled: true },
  includeAudit: true,
});

const hydeOp = result.auditTrail?.operations.find(
  (op) => op.operationType === 'hyde',
);
// hydeOp.hydeDetails.hypothesis
// hydeOp.hydeDetails.effectiveThreshold
// hydeOp.hydeDetails.thresholdSteps
// hydeOp.tokenUsage (embedding + LLM tokens)
```
## Performance Implications
| Metric | Without HyDE | With HyDE |
|---|---|---|
| LLM calls per query | 0 | 1 |
| Embedding calls | 1 | 1 (hypothesis instead of query) |
| Vector searches | 1 | 1-N (N = adaptive steps) |
| Typical added latency | 0 | 200-500ms (LLM generation) |
| Recall improvement | baseline | +10-30% on vague queries |
The LLM call uses whichever model your registered caller invokes; a small, fast model such as `gpt-4o-mini` keeps the added latency under roughly 300 ms for most queries.
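To verify this overhead in your own deployment, you can log the documented diagnostics per request; a minimal sketch using the `retrieveContext` call from above:

```ts
const result = await augmentor.retrieveContext(query, { hyde: { enabled: true } });

const hyde = result.diagnostics?.hyde;
if (hyde) {
  console.log(
    `HyDE hypothesis latency: ${hyde.hypothesisLatencyMs} ms, ` +
    `threshold steps: ${hyde.thresholdSteps}`,
  );
}
```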
## Graceful Degradation
HyDE degrades gracefully in all failure scenarios:
- **No LLM caller registered:** Falls back to direct query embedding with a diagnostic message.
- **LLM call fails:** Falls back to direct query embedding.
- **Hypothesis embedding fails:** Falls back to direct query embedding.
- **No results at any threshold:** Returns empty results (same as without HyDE).
The system never throws due to HyDE failures; it always falls back to the standard retrieval path.
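For observability, the fallback can be detected from the diagnostics. A small sketch, assuming the `hyde` diagnostics block is only populated when a hypothesis was actually generated and embedded (check this against your version):

```ts
const result = await augmentor.retrieveContext(query, { hyde: { enabled: true } });

// Assumption: diagnostics.hyde carries the hypothesis only when HyDE ran;
// its absence after an enabled request suggests the fallback path was taken.
if (!result.diagnostics?.hyde?.hypothesis) {
  console.warn('HyDE fell back to direct query embedding for this request');
}
```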