Skip to main content

HyDE Retrieval

HyDE improves RAG and memory retrieval by generating a hypothetical answer before embedding. Instead of embedding the raw user query, HyDE first asks an LLM to produce a plausible answer, then embeds that answer for vector search. The hypothesis is semantically closer to actual stored documents than a question is, yielding better recall.

Based on:

  • Gao et al. 2023 "Precise Zero-Shot Dense Retrieval without Relevance Labels"
  • Lei et al. 2025 "Never Come Up Empty: Adaptive HyDE Retrieval for Improving LLM Developer Support"

How It Works

Standard:  Query --> Embed(query)       --> Vector Search --> Results
HyDE: Query --> LLM(hypothesis) --> Embed(hypothesis) --> Vector Search --> Results
^ ^
Extra LLM call Better semantic match

The key insight: questions and answers live in different regions of embedding space. A question like "What causes memory leaks in Node?" is far from the answer text "Memory leaks in Node.js are caused by...". But a hypothetical answer generated from the question is much closer to the stored answer, producing higher cosine similarity scores.

When to Use HyDE

Good candidates:

  • Knowledge base queries where the question phrasing differs from document style
  • Vague or exploratory queries ("that thing about deployment")
  • Memory recall where stored traces are statement-form, not question-form
  • Background/batch processing where latency is less critical

Avoid when:

  • Real-time chat with tight latency budgets (adds one LLM call per query)
  • Simple keyword-style lookups where direct embedding already works well
  • The query is already in statement/answer form

Configuration

agent.config.json

HyDE is configured per-request, not globally. The HydeRetriever class and its config types are exported from @framers/agentos/rag.

{
"rag": {
"hyde": {
"enabled": true,
"initialThreshold": 0.7,
"minThreshold": 0.3,
"thresholdStep": 0.1,
"adaptiveThreshold": true,
"maxHypothesisTokens": 200,
"fullAnswerGranularity": true
}
}
}

Configuration Options

OptionTypeDefaultDescription
enabledbooleanfalseMaster switch for HyDE
initialThresholdnumber0.7Starting similarity threshold
minThresholdnumber0.3Lowest threshold before giving up
thresholdStepnumber0.1How much to reduce threshold per step
adaptiveThresholdbooleantrueEnable step-down when no results found
maxHypothesisTokensnumber200Max tokens for hypothesis generation
fullAnswerGranularitybooleantrueGenerate full prose answers vs keywords

Programmatic API

1. RetrievalAugmentor (main RAG pipeline)

import { RetrievalAugmentor } from '@framers/agentos/rag';

const augmentor = new RetrievalAugmentor();
await augmentor.initialize(config, embeddingManager, vectorStoreManager);

// Register an LLM caller for hypothesis generation
augmentor.setHydeLlmCaller(async (systemPrompt, userPrompt) => {
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: systemPrompt },
{ role: 'user', content: userPrompt },
],
max_tokens: 200,
});
return response.choices[0].message.content ?? '';
});

// Enable HyDE per-request
const result = await augmentor.retrieveContext('What causes memory leaks?', {
hyde: {
enabled: true,
// Optional: pre-supply a hypothesis to skip the LLM call
// hypothesis: 'Memory leaks are caused by...',
// Optional: tune thresholds for this request
// initialThreshold: 0.8,
// minThreshold: 0.4,
},
});

// HyDE diagnostics are in the result
console.log(result.diagnostics?.hyde);
// {
// hypothesis: 'Memory leaks in Node.js are typically caused by...',
// hypothesisLatencyMs: 342,
// effectiveThreshold: 0.7,
// thresholdSteps: 0,
// }
import { MultimodalIndexer, HydeRetriever } from '@framers/agentos/rag';

const indexer = new MultimodalIndexer({
embeddingManager,
vectorStore,
visionProvider,
});

// Attach a HyDE retriever
indexer.setHydeRetriever(new HydeRetriever({
llmCaller: myLlmCaller,
embeddingManager,
config: { enabled: true },
}));

// Search with HyDE
const results = await indexer.search('architecture diagram', {
modalities: ['image'],
hyde: { enabled: true },
});

3. CognitiveMemoryManager (memory recall)

import { CognitiveMemoryManager, HydeRetriever } from '@framers/agentos';

const memoryManager = new CognitiveMemoryManager();
await memoryManager.initialize(config);

// Attach a HyDE retriever
memoryManager.setHydeRetriever(new HydeRetriever({
llmCaller: myLlmCaller,
embeddingManager,
config: { enabled: true },
}));

// Retrieve memories with HyDE
const result = await memoryManager.retrieve(
'that deployment discussion',
currentMood,
{ hyde: true },
);

4. Standalone HydeRetriever

import { HydeRetriever } from '@framers/agentos/rag';

const retriever = new HydeRetriever({
llmCaller: async (system, user) => {
// Your LLM call here
return hypotheticalAnswer;
},
embeddingManager,
config: {
enabled: true,
adaptiveThreshold: true,
initialThreshold: 0.7,
minThreshold: 0.3,
},
});

// Generate hypothesis only
const { hypothesis, latencyMs } = await retriever.generateHypothesis(
'What is retrieval augmented generation?',
);

// Full retrieve cycle with adaptive thresholding
const result = await retriever.retrieve({
query: 'What is RAG?',
vectorStore: myVectorStore,
collectionName: 'knowledge-base',
});

Adaptive Thresholding

HyDE supports adaptive threshold stepping: if no results are found at the initial similarity threshold, it steps down until content is found or the minimum threshold is reached. This ensures HyDE never "comes up empty."

Initial threshold: 0.7  -->  No results
Step down to: 0.6 --> No results
Step down to: 0.5 --> Found 3 results! (stop here)

The thresholdSteps diagnostic tells you how many steps were needed.

Audit Trail

When includeAudit: true is passed to retrieveContext(), HyDE operations appear in the audit trail with operation type 'hyde':

const result = await augmentor.retrieveContext(query, {
hyde: { enabled: true },
includeAudit: true,
});

const hydeOp = result.auditTrail?.operations.find(
(op) => op.operationType === 'hyde',
);
// hydeOp.hydeDetails.hypothesis
// hydeOp.hydeDetails.effectiveThreshold
// hydeOp.hydeDetails.thresholdSteps
// hydeOp.tokenUsage (embedding + LLM tokens)

Performance Implications

MetricWithout HyDEWith HyDE
LLM calls per query01
Embedding calls11 (hypothesis instead of query)
Vector searches11-N (N = adaptive steps)
Typical added latency0200-500ms (LLM generation)
Recall improvementbaseline+10-30% on vague queries

The LLM call uses a small, fast model by default (configured via the caller). Using gpt-4o-mini or similar keeps latency under 300ms for most queries.

Graceful Degradation

HyDE degrades gracefully in all failure scenarios:

  1. No LLM caller registered: Falls back to direct query embedding with a diagnostic message.
  2. LLM call fails: Falls back to direct query embedding.
  3. Hypothesis embedding fails: Falls back to direct query embedding.
  4. No results at any threshold: Returns empty results (same as without HyDE).

The system never throws due to HyDE failures -- it always falls back to the standard retrieval path.