Q1 vs Q2: Deep Dive

Mathematical foundations, calibration metrics, and practical interpretation of aleatoric and epistemic uncertainty

Quick Overview

Q1 - Aleatoric Uncertainty

  • Quantile: 25% (pessimistic)
  • Measures: Data noise and ambiguity
  • Reducibility: Irreducible
  • Threshold: 0.35

Q2 - Epistemic Uncertainty

  • Quantile: 75% (optimistic)
  • Measures: Model ignorance
  • Reducibility: Reducible with more data
  • Threshold: 0.35

Mathematical Definition

Q1 - Quantile 25%

Q1 represents the 25th percentile of the uncertainty distribution. It's a pessimistic estimate of uncertainty.

Q1 ∈ [0, 1]
# Interpretation (calibrated):
# In 100 similar claims with Q1 = x:
# → ~25 have veracity < x
# → ~75 have veracity > x

Example

Claim: "The Earth is flat"

Q1 prediction: 0.02 (very low aleatoric uncertainty)

In 100 claims similar to this with Q1=0.02:
→ ~25 have veracity < 0.02 (very false)
→ ~75 have veracity > 0.02 (somewhat true or completely true)

Q2 - Quantile 75%

Q2 represents the 75th percentile of the uncertainty distribution. It's an optimistic estimate of confidence.

Q2 ∈ [0, 1]
Q2 = f(embeddings, Q1) # Conditioned on Q1
# Interpretation (calibrated):
# In 100 similar claims with Q2 = y:
# → ~75 have veracity < y
# → ~25 have veracity > y

Example

Claim: "Vaccines prevent diseases"

Q2 prediction: 0.92 (high epistemic confidence)

In 100 claims similar to this with Q2=0.92:
→ ~75 have veracity < 0.92
→ ~25 have veracity > 0.92 (highly accurate)
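
This calibrated-quantile reading can be checked empirically on a labeled evaluation set: roughly 25% of veracity labels should fall below Q1 and roughly 75% below Q2. A minimal sketch, assuming you already have arrays of Q1/Q2 predictions and ground-truth veracity scores (the arrays below are made up for illustration):

import numpy as np

# Hypothetical evaluation arrays: one entry per claim
q1_pred = np.array([0.02, 0.30, 0.15, 0.40])   # predicted 25% quantiles
q2_pred = np.array([0.55, 0.80, 0.92, 0.70])   # predicted 75% quantiles
veracity = np.array([0.10, 0.45, 0.95, 0.60])  # ground-truth veracity in [0, 1]

# If Q1 and Q2 are well calibrated, these empirical coverages should
# land near 0.25 and 0.75 respectively on a large enough evaluation set.
coverage_q1 = np.mean(veracity < q1_pred)
coverage_q2 = np.mean(veracity < q2_pred)

print(f"P(veracity < Q1) = {coverage_q1:.2f}  (target ~0.25)")
print(f"P(veracity < Q2) = {coverage_q2:.2f}  (target ~0.75)")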

Why Q2 is Conditioned on Q1

Key Insight

Conditioning Q2 on Q1 improves calibration by 21%. When a question is very ambiguous (high Q1), the model should account for that when assessing its own knowledge (Q2).

❌ Without Conditioning

Q1 and Q2 predicted independently:

q1 = Q1Gate(embeddings)
q2 = Q2Gate(embeddings)

Problem: Q2 doesn't know about question ambiguity, leading to poor calibration.

✅ With Conditioning

Q2 conditioned on Q1:

q1 = Q1Gate(embeddings)
q2 = Q2Gate(embeddings, q1)

Benefit: Q2 adjusts based on Q1, improving calibration by 21%.

# Q2 Gate Architecture (Simplified)
import torch
import torch.nn as nn

class Q2Gate(nn.Module):
    def __init__(self):
        super().__init__()
        # 384-dim embedding + 1 extra input feature for Q1
        self.feature_net = nn.Linear(384 + 1, 256)
        self.q2_head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 1)
        )

    def forward(self, embeddings, q1):
        # Concatenate Q1 as an extra feature so Q2 is conditioned on it
        combined = torch.cat([embeddings, q1.unsqueeze(1)], dim=1)
        features = self.feature_net(combined)
        q2 = torch.sigmoid(self.q2_head(features))
        return q2
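
A quick usage sketch for the gate above (the batch size and randomly generated inputs are assumptions for illustration):

gate = Q2Gate()
embeddings = torch.randn(8, 384)  # batch of 8 sentence embeddings
q1 = torch.rand(8)                # Q1 predictions for the same batch
q2 = gate(embeddings, q1)         # shape (8, 1), values in (0, 1)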

Calibration Metrics

AletheionGuard uses multiple metrics to ensure Q1 and Q2 are well-calibrated:

ECE (Expected Calibration Error)

Measures the gap between predicted confidence and observed accuracy, averaged across confidence bins, with each bin weighted by its share of the samples.

ECE = Σ_bins (n_bin / N) × |conf_bin - acc_bin|
# Target: ECE < 0.10 (ideally < 0.08)
AletheionGuard Level 1: ECE ~0.07-0.10 (30-50% better than Level 0)
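
As an illustration, ECE can be computed as follows. This is a generic sketch with equal-width bins, not necessarily the exact implementation AletheionGuard uses:

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width bins, each bin weighted by its sample count."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)  # 1.0 if the prediction was right, else 0.0
    # Assign each sample to one of n_bins equal-width bins over [0, 1]
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        conf_bin = confidences[in_bin].mean()  # average predicted confidence in the bin
        acc_bin = correct[in_bin].mean()       # observed accuracy in the bin
        ece += in_bin.mean() * abs(conf_bin - acc_bin)  # weight by bin fraction n_bin / N
    return ece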

RCE (Relative Calibration Error)

Measures relative error of calibration as a percentage of observed accuracy.

RCE = |predicted_confidence - observed_accuracy| / observed_accuracy
# Target: RCE < 0.05 (5% error)
Threshold for Production: RCE < 0.05 to be considered "calibrated"
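
A direct translation of the formula (a sketch; here RCE is computed globally over the evaluation set, and the inputs are assumed to be arrays of per-sample confidences and 0/1 correctness flags):

import numpy as np

def relative_calibration_error(confidences, correct):
    """|mean predicted confidence - observed accuracy| / observed accuracy."""
    predicted_confidence = np.mean(confidences)
    observed_accuracy = np.mean(correct)  # correct: 1.0 if right, else 0.0
    return abs(predicted_confidence - observed_accuracy) / observed_accuracy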

Brier Score

Mean squared difference between predicted probabilities and actual outcomes.

Brier = (1/N) × Σ (p_pred - p_true)²
# Lower is better
Used alongside RCE for comprehensive calibration assessment.
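
In code this is simply a mean squared error over probabilities (generic sketch):

import numpy as np

def brier_score(p_pred, p_true):
    """Mean squared difference between predicted probabilities and outcomes."""
    return np.mean((np.asarray(p_pred) - np.asarray(p_true)) ** 2)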

Uncertainty Correlation

Correlation between epistemic uncertainty and actual error rate.

corr = correlation(epistemic_uncertainty, actual_error)
# Target: > 0.5
Interpretation: When the model is uncertain (high Q2), it should actually be wrong more often.
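
This can be measured with a Pearson correlation between Q2 and a per-sample error signal. A minimal sketch with made-up arrays:

import numpy as np

# Hypothetical per-sample arrays
q2 = np.array([0.10, 0.45, 0.80, 0.25])            # epistemic uncertainty per claim
actual_error = np.array([0.05, 0.40, 0.70, 0.20])  # e.g. |prediction - ground truth|

# Pearson correlation; target is > 0.5
corr = np.corrcoef(q2, actual_error)[0, 1]
print(f"Uncertainty correlation: {corr:.2f}")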

Target Thresholds

Metric               Target    Level 0       Level 1
Q1 MSE               < 0.05    ~0.06         ~0.048
Q2 MSE               < 0.05    ~0.057        ~0.045
RCE                  < 0.05    ~0.06         ~0.042
ECE                  < 0.10    ~0.10-0.15    ~0.07-0.10
Uncertainty Corr.    > 0.5     ~0.52         ~0.61

Practical Interpretation

Reading Q1 Values

Q1 < 0.20
Low aleatoric uncertainty. Question is unambiguous, has a clear correct answer.
0.20 ≤ Q1 < 0.35
Moderate aleatoric uncertainty. Some ambiguity in the question or data.
Q1 ≥ 0.35
High aleatoric uncertainty. Question is ambiguous, admits multiple valid answers. Verdict: "MAYBE"

Reading Q2 Values

Q2 < 0.20
Low epistemic uncertainty. Model has strong knowledge, low hallucination risk.
0.20 ≤ Q2 < 0.35
Moderate epistemic uncertainty. Model has some knowledge but not complete confidence.
Q2 ≥ 0.35
High epistemic uncertainty. Model lacks knowledge, high hallucination risk. Verdict: "REFUSED"

Combined Interpretation

Low Q1, Low Q2: Ideal case. Clear question, model knows the answer. → ACCEPT
High Q1, Low Q2: Question is ambiguous but model has knowledge. → MAYBE (ask for clarification)
Low Q1, High Q2: Clear question but model lacks knowledge. → REFUSED (escalate to expert)
High Q1, High Q2: Ambiguous question and model lacks knowledge. → REFUSED (highest uncertainty)
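
Taken together, the thresholds amount to a simple decision rule. The sketch below reproduces that mapping with the 0.35 thresholds from the overview; it is illustrative only, and in practice you would read result.verdict as shown in the next section:

def combined_verdict(q1: float, q2: float, threshold: float = 0.35) -> str:
    """Map (Q1, Q2) to a verdict using the documented 0.35 thresholds."""
    if q2 >= threshold:
        # Model lacks knowledge: refuse regardless of question ambiguity
        return "REFUSED"
    if q1 >= threshold:
        # Question is ambiguous but the model has knowledge: ask for clarification
        return "MAYBE"
    # Clear question and the model knows the answer
    return "ACCEPT"

print(combined_verdict(0.10, 0.08))  # ACCEPT
print(combined_verdict(0.50, 0.12))  # MAYBE
print(combined_verdict(0.15, 0.60))  # REFUSED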

Code Example

from aletheion_guard import EpistemicAuditor

auditor = EpistemicAuditor()

# Example 1: Low Q1, Low Q2
result = auditor.evaluate("Paris is the capital of France")
print(f"Q1: {result.q1:.3f}, Q2: {result.q2:.3f}")
print(f"Verdict: {result.verdict}")  # ACCEPT

# Example 2: High Q1 (ambiguous)
result = auditor.evaluate("What's the capital of the Netherlands?")
print(f"Q1: {result.q1:.3f}, Q2: {result.q2:.3f}")
print(f"Verdict: {result.verdict}")  # MAYBE

# Example 3: High Q2 (model doesn't know)
result = auditor.evaluate("What will Bitcoin cost tomorrow?")
print(f"Q1: {result.q1:.3f}, Q2: {result.q2:.3f}")
print(f"Verdict: {result.verdict}")  # REFUSED

# Access calibration info
print(f"RCE: {result.rce:.3f}")
print(f"Is calibrated: {result.calibrated}")  # True if RCE < 0.05

Performance Characteristics

Latency

  • Embedding: ~10ms
  • Q1/Q2 inference: ~5ms
  • Calibration: ~3ms
  • Total: 20-30ms per response

Throughput

  • Single: 50 req/sec
  • Batch 32: 500 req/sec
  • Batch 128: 1000+ req/sec
  • Production: ~200-400 req/sec sustained

Training Loss Function

AletheionGuard Level 1 uses Pyramidal VARO loss to train Q1 and Q2 gates:

L = λ₁ × MSE(q1, q1_true)
  + λ₂ × MSE(q2, q2_true)
  + λ₃ × MSE(height, height_true)
  + λ₄ × calibration_loss(q2, error)
  + λ₅ × fractal_loss(height, sqrt(q1² + q2²))

Component Breakdown

  • λ₁: Q1 accuracy weight
  • λ₂: Q2 accuracy weight
  • λ₃: Height regression weight
  • λ₄: Calibration weight (RCE)
  • λ₅: Fractal constraint weight

Typical Values

  • λ₁: 1.0
  • λ₂: 1.2 (slightly higher)
  • λ₃: 0.8
  • λ₄: 1.5 (prioritize calibration)
  • λ₅: 0.5
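
As a rough sketch, here is how such a weighted sum could be assembled in PyTorch with the typical λ values above. The calibration and fractal terms are simple stand-ins (assumed MSE forms), not the exact Pyramidal VARO definitions:

import torch
import torch.nn.functional as F

def pyramidal_varo_loss(q1, q2, height, q1_true, q2_true, height_true, error,
                        lambdas=(1.0, 1.2, 0.8, 1.5, 0.5)):
    """Weighted sum of the five loss components (illustrative stand-in terms)."""
    l1, l2, l3, l4, l5 = lambdas
    loss = l1 * F.mse_loss(q1, q1_true)
    loss += l2 * F.mse_loss(q2, q2_true)
    loss += l3 * F.mse_loss(height, height_true)
    # Calibration term: Q2 should track the observed error (stand-in for the RCE-based term)
    loss += l4 * F.mse_loss(q2, error)
    # Fractal constraint: height should follow sqrt(Q1² + Q2²)
    loss += l5 * F.mse_loss(height, torch.sqrt(q1**2 + q2**2))
    return loss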

Next Steps

Questions about Q1 and Q2?

Our team can help you understand uncertainty quantification for your use case.