ADR-021: Quint Code and First Principles Framework (FPF) Integration

Status: Proposed
Date: 2025-12-19
Decision Makers: Engineering, Architecture Council
Review: Pending (GPT-5.2-pro, Claude Opus 4.5, Gemini 3 Pro, Grok-4)


Context

Two complementary frameworks have emerged for structured AI-assisted reasoning:

Quint Code (github.com/m0n0x41d/quint-code)

A structured reasoning framework for AI-assisted development that creates auditable decision trails. It implements the First Principles Framework (FPF) methodology through:

| Component | Description |
|---|---|
| Abduction Phase | Generate 3-5 competing hypotheses (stored in L0/) |
| Deduction Phase | Verify logical consistency, promote to L1/ |
| Induction Phase | Gather empirical evidence, promote to L2/ |
| Trust Scoring | Weakest-link (WLNK) assurance model |
| Bias Detection | Flags anchoring bias and early-hypothesis privilege |
| Design Rationale Records | Auditable decision artifacts with expiry conditions |

Integration: works via the MCP protocol with Claude Code, Cursor, Gemini CLI, and Codex CLI.

First Principles Framework (FPF) (github.com/ailev/FPF)

A transdisciplinary "Operating System for Thought" providing:

| Component | Description |
|---|---|
| Holonic Foundation | Everything is simultaneously a whole and a part |
| Trust Formula | Trust = ⟨F, G, R⟩ (Formality, Granularity, Reliability) |
| Γ-Algebra | Universal aggregation preserving invariants |
| Bounded Contexts | Terms hold meaning only within defined boundaries |
| LLM Integration | Functions as a "bias-assistant" steering toward first principles |

Functional Alignment Analysis

Conceptual Overlap with LLM Council

| Dimension | LLM Council | Quint Code/FPF | Alignment |
|---|---|---|---|
| Multi-perspective | 4 models provide diverse viewpoints | Multiple competing hypotheses | HIGH |
| Quality assurance | Peer review + Borda count ranking | Trust scoring + WLNK model | HIGH |
| Bias detection | ADR-015 bias auditing | Anchoring bias detection | HIGH |
| Decision artifacts | Aggregate rankings + synthesis | Design Rationale Records | MEDIUM |
| Temporal validity | Per-session (ephemeral) | Evidence decay tracking | LOW |
| Knowledge levels | Flat (all responses equal) | Hierarchical (L0→L1→L2) | LOW |

Key Differences

| Aspect | LLM Council | Quint Code/FPF |
|---|---|---|
| Execution mode | Runtime (query-time) | Development-time (persistent) |
| Focus | Answer synthesis | Decision documentation |
| Verification | Peer agreement | Logical + empirical proof |
| Storage | Ephemeral (per-session) | Persistent knowledge base |
| Trust model | Vote aggregation | Weakest-link chain |

Decision

Implement a Bidirectional Integration where LLM Council enhances Quint Code's hypothesis generation and Quint Code's trust model enhances council decision confidence.

Proposed Integration Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INTEGRATION LAYER: "Principled Council"                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────┐          ┌──────────────────────┐                 │
│  │    QUINT CODE        │          │    LLM COUNCIL       │                 │
│  │    (Structured       │◄────────►│    (Multi-Model      │                 │
│  │     Reasoning)       │          │     Consensus)       │                 │
│  └──────────────────────┘          └──────────────────────┘                 │
│           │                                   │                              │
│           ▼                                   ▼                              │
│  ┌──────────────────────┐          ┌──────────────────────┐                 │
│  │  Abduction Phase     │          │  Stage 1: Collection │                 │
│  │  - Use council for   │◄─────────│  - Multiple models   │                 │
│  │    hypothesis gen    │          │    generate options  │                 │
│  └──────────────────────┘          └──────────────────────┘                 │
│           │                                   │                              │
│           ▼                                   ▼                              │
│  ┌──────────────────────┐          ┌──────────────────────┐                 │
│  │  Deduction Phase     │          │  Stage 2: Peer Review│                 │
│  │  - Verify logic via  │◄─────────│  - Cross-validate    │                 │
│  │    council critique  │          │    reasoning         │                 │
│  └──────────────────────┘          └──────────────────────┘                 │
│           │                                   │                              │
│           ▼                                   ▼                              │
│  ┌──────────────────────┐          ┌──────────────────────┐                 │
│  │  Trust Scoring       │─────────►│  Confidence Weights  │                 │
│  │  - WLNK model        │          │  - Apply to rankings │                 │
│  │  - Evidence chain    │          │  - Qualify synthesis │                 │
│  └──────────────────────┘          └──────────────────────┘                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Integration Points

1. Council-Powered Hypothesis Generation

Replace Quint Code's single-model abduction with council-based generation:

from typing import List

# Current: a single model generates hypotheses
hypotheses = await model.generate_hypotheses(problem)

# Proposed: the council generates diverse hypotheses
async def council_abduction(problem: str) -> List[Hypothesis]:
    """Use the LLM Council for the hypothesis-generation (abduction) phase."""
    result = await run_council_with_fallback(
        f"Generate 3-5 competing hypotheses for: {problem}. "
        "Each hypothesis should represent a distinct approach."
    )

    # Extract hypotheses from each model's Stage 1 response
    hypotheses = []
    for response in result["stage1_responses"]:
        hypotheses.extend(parse_hypotheses(response))

    # Deduplicate and return with model provenance
    return deduplicate_with_provenance(hypotheses)

Benefits:

  • Model diversity prevents anchoring bias
  • Each hypothesis carries model provenance
  • Natural competition between approaches
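The deduplicate_with_provenance helper used above is not defined in this ADR; a minimal sketch, assuming a Hypothesis type with content and provenance fields:

from typing import Dict, List

def deduplicate_with_provenance(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Keep one copy of each distinct hypothesis, merging model attributions."""
    unique: Dict[str, Hypothesis] = {}
    for h in hypotheses:
        # Naive key: normalized text; a real implementation would likely
        # cluster on embeddings rather than exact strings
        key = " ".join(h.content.lower().split())
        if key in unique:
            unique[key].provenance.extend(h.provenance)  # merge model attributions
        else:
            unique[key] = h
    return list(unique.values())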

2. Council-Assisted Deduction Verification

Use peer review for logical verification:

async def council_verify(hypothesis: Hypothesis) -> VerificationResult:
    """Use council peer review for logical verification."""
    result = await run_council_with_fallback(
        f"Verify logical consistency of this hypothesis:\n"
        f"{hypothesis.content}\n\n"
        "Check for: constraint violations, type errors, "
        "implicit assumptions, edge cases."
    )

    # Unanimous agreement promotes cleanly; majority promotes with recorded dissent
    if result["consensus_type"] == "unanimous":
        return VerificationResult(passed=True, level="L1")
    elif result["consensus_type"] == "majority":
        return VerificationResult(passed=True, level="L1",
                                   caveats=result["dissent_summary"])
    else:
        return VerificationResult(passed=False,
                                   reasons=result["disagreements"])

3. Trust-Weighted Council Rankings

Apply FPF's trust formula to council rankings:

from dataclasses import dataclass
from typing import List

@dataclass
class TrustWeightedRanking:
    model: str
    raw_score: float
    trust_weight: float  # From FPF ⟨F, G, R⟩
    weighted_score: float

def apply_trust_weights(
    rankings: List[Ranking],
    evidence_chain: EvidenceChain,
) -> List[TrustWeightedRanking]:
    """Apply the weakest-link trust model to council rankings."""
    weighted: List[TrustWeightedRanking] = []
    for ranking in rankings:
        # F = Formality (how rigorous the evaluation was)
        formality = calculate_formality(ranking.evaluation_text)

        # G = Granularity (scope of the claims made)
        granularity = calculate_granularity(ranking.claims)

        # R = Reliability (evidence backing)
        reliability = evidence_chain.weakest_link_score()

        # Trust = min(F, G, R) per the WLNK model
        trust = min(formality, granularity, reliability)

        weighted.append(TrustWeightedRanking(
            model=ranking.model,
            raw_score=ranking.raw_score,
            trust_weight=trust,
            weighted_score=ranking.raw_score * trust,
        ))

    return sorted(weighted, key=lambda r: r.weighted_score, reverse=True)

4. Design Rationale Records for Council Decisions

Generate DRRs from council consensus:

from dataclasses import dataclass
from datetime import datetime
from typing import List
from uuid import uuid4

@dataclass
class DesignRationaleRecord:
    decision_id: str
    timestamp: datetime
    question: str
    winning_hypothesis: str
    alternatives_considered: List[str]
    evidence_chain: List[Evidence]
    council_rankings: List[Ranking]
    consensus_type: str  # unanimous, majority, split
    trust_score: float
    valid_until: datetime  # Evidence expiry

def generate_drr(council_result: CouncilResult) -> DesignRationaleRecord:
    """Convert council result to Design Rationale Record."""
    return DesignRationaleRecord(
        decision_id=f"DRR-{uuid4()}",
        timestamp=datetime.utcnow(),
        question=council_result["query"],
        winning_hypothesis=council_result["synthesis"]["response"],
        alternatives_considered=[
            r["response"] for r in council_result["stage1_responses"]
        ],
        evidence_chain=extract_evidence(council_result),
        council_rankings=council_result["aggregate_rankings"],
        consensus_type=determine_consensus_type(council_result),
        trust_score=calculate_trust(council_result),
        valid_until=calculate_expiry(council_result),
    )

Knowledge Level Mapping

Map Quint Code's L0→L1→L2 to council consensus levels:

| Quint Level | Description | Council Equivalent |
|---|---|---|
| L0 (Raw) | Unverified hypothesis | Single model response |
| L1 (Verified) | Logically consistent | Majority consensus |
| L2 (Validated) | Empirically proven | Unanimous + external validation |

def council_result_to_knowledge_level(result: CouncilResult) -> str:
    """Map council consensus to FPF knowledge level."""
    rankings = result["aggregate_rankings"]
    top_score = rankings[0]["score"] if rankings else 0

    if result["consensus_type"] == "unanimous" and top_score > 0.9:
        return "L2"  # High confidence, validated
    elif result["consensus_type"] in ("unanimous", "majority"):
        return "L1"  # Logically verified
    else:
        return "L0"  # Raw hypothesis

Alternatives Considered

Alternative 1: Replace Council with Quint Code Entirely

Rejected: Quint Code is development-time focused; LLM Council is runtime-focused. They serve complementary purposes.

Alternative 2: No Integration (Use Separately)

Rejected: This forgoes significant synergy. Both systems address quality and bias, but from different angles.

Alternative 3: Quint Code as Council Pre-processor Only

Rejected: Loses the value of FPF's trust model for enhancing council confidence scoring.


Implementation Phases

Phase 1: Evaluation (2 weeks)

  • [ ] Benchmark council-based hypothesis generation vs. single-model
  • [ ] Measure diversity improvement in abduction phase
  • [ ] Test trust-weighted ranking quality

Phase 2: Council-Powered Abduction (3 weeks)

  • [ ] Implement /q1-hypothesize-council command
  • [ ] Add provenance tracking for council-generated hypotheses
  • [ ] Update Quint Code's L0 storage format

Phase 3: Trust-Weighted Rankings (2 weeks)

  • [ ] Implement WLNK trust calculator for council
  • [ ] Add trust scores to council metadata
  • [ ] Create LLM_COUNCIL_TRUST_MODEL=wlnk config option

Phase 4: DRR Generation (2 weeks)

  • [ ] Implement Design Rationale Record generator
  • [ ] Add DRR storage to .quint/decisions/
  • [ ] Create decay detection for council-based decisions

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Latency increase from council in abduction | High | Medium | Cache similar hypotheses, async generation |
| Trust model complexity | Medium | Medium | Start with simplified F-G-R calculation |
| Dual-system maintenance burden | Medium | High | Clear interface boundaries, optional integration |
| Knowledge level mapping mismatch | Low | Medium | Conservative defaults, explicit override option |

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Hypothesis diversity | +40% unique approaches | Compare single-model vs. council |
| Anchoring bias reduction | -60% first-hypothesis wins | Track which hypothesis wins |
| Decision confidence | +25% trust scores | Before/after trust model |
| DRR completeness | 100% of decisions documented | Audit trail coverage |

Configuration Options

# Enable council-based hypothesis generation
LLM_COUNCIL_QUINT_INTEGRATION=true

# Trust model for rankings
LLM_COUNCIL_TRUST_MODEL=wlnk|simple|none  # default: simple

# Generate Design Rationale Records
LLM_COUNCIL_GENERATE_DRR=true

# DRR storage location
LLM_COUNCIL_DRR_PATH=.quint/decisions/
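
A sketch of how these flags might be read at startup; the names mirror the variables above, but the parsing logic shown is an assumption:

import os

# Assumed startup parsing of the integration flags defined above
LLM_COUNCIL_QUINT_INTEGRATION = os.getenv("LLM_COUNCIL_QUINT_INTEGRATION", "false").lower() == "true"
LLM_COUNCIL_TRUST_MODEL = os.getenv("LLM_COUNCIL_TRUST_MODEL", "simple")  # wlnk | simple | none
LLM_COUNCIL_GENERATE_DRR = os.getenv("LLM_COUNCIL_GENERATE_DRR", "false").lower() == "true"
LLM_COUNCIL_DRR_PATH = os.getenv("LLM_COUNCIL_DRR_PATH", ".quint/decisions/")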

Council Review Summary

Status: APPROVE WITH MODIFICATIONS

Reviewed by: Gemini 3 Pro (34s), Claude Opus 4.5 (44s), Grok-4 (70s). GPT-5.2-pro: timed out (120s).

Council Verdict: Unanimous approval with significant architectural modifications. The integration is "architecturally sound but overengineered in its current form."


Consensus Analysis

1. Does Council-Based Abduction Improve Hypothesis Diversity?

Verdict: Conditionally Yes

A council does NOT automatically guarantee diversity; models often collapse into safe consensus shaped by shared training data.

Required Modifications:

  • Role-Based Prompting: assign specific roles (Scientist, Historian, Logician) to different models
  • Adversarial Seeding: require at least one model to argue against the emerging consensus
  • Model Heterogeneity: mix model families (GPT + Claude + Llama) rather than same-family instances
  • Diversity Metrics: measure semantic distance between hypotheses, not just their count

# Council-recommended diversity enforcement
from typing import List

class DiversityEnforcedCouncil:
    def generate_hypotheses(self, problem: str) -> List[Hypothesis]:
        # Assign adversarial roles
        roles = ["primary_proposer", "devil_advocate", "synthesis_agent"]

        # Collect role-conditioned hypotheses
        hypotheses = self.collect_with_roles(problem, roles)

        # Enforce a minimum semantic-distance threshold between them
        if semantic_variance(hypotheses) < DIVERSITY_THRESHOLD:
            return self.force_divergence(hypotheses)

        return hypotheses
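
The semantic_variance check above is left undefined by the council's sketch. A minimal embedding-based version might look like the following, parameterized by an embed callable for self-containment; the metric choice (mean pairwise cosine distance) and the NumPy representation are assumptions:

import numpy as np
from typing import Callable, List

def semantic_variance(texts: List[str], embed: Callable[[str], np.ndarray]) -> float:
    """Mean pairwise cosine distance between hypothesis embeddings (sketch)."""
    vectors = [embed(t) for t in texts]
    distances = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            a, b = vectors[i], vectors[j]
            cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            distances.append(1.0 - cosine)
    # No pairs → no measurable diversity
    return sum(distances) / len(distances) if distances else 0.0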

2. Is FPF Trust Model (F-G-R) Applicable to Multi-Model Consensus?

Verdict: Partially—Requires Reinterpretation

Note that the council reads F, G, R here as Fidelity, Groundedness, Robustness, diverging from FPF's original Formality, Granularity, Reliability.

| Component | Single-Model Meaning | Council Reinterpretation |
|---|---|---|
| Fidelity (F) | Accuracy to source | Inter-model agreement on factual claims |
| Groundedness (G) | Traceability to evidence | Convergent citation of the same sources |
| Robustness (R) | Stability under perturbation | Consistency across prompt variations |

Critical Issue: WLNK is Problematic

The pure weakest-link model, WLNK = min(F, G, R), becomes excessively conservative in council contexts:

WLNK_council = min(min_i(F_i), min_i(G_i), min_i(R_i))  over all models i

This "double minimum" means a single model's low score tanks the entire output: if one model scores R = 0.2 while the other four score above 0.9, the council's composite trust collapses to 0.2.

Council-Recommended Alternative: Robust Aggregate Trust (RAT)

from math import prod
from statistics import median
from typing import List

def calculate_rat(wlnk_scores: List[float], disagreement: float) -> float:
    """Replace pure WLNK with Robust Aggregate Trust."""
    α, β, γ = 0.5, 0.3, 0.2  # Weights

    geometric_mean = prod(wlnk_scores) ** (1 / len(wlnk_scores))
    median_score = median(wlnk_scores)
    min_score = min(wlnk_scores)

    # Reward low inter-model disagreement (disagreement ∈ [0, 1])
    coherence_bonus = 1 + 0.2 * (1 - disagreement)

    return (α * geometric_mean + β * median_score + γ * min_score) * coherence_bonus
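
For illustration, with one weak outlier among otherwise strong per-model scores, pure WLNK collapses trust to the outlier's value, while RAT degrades more gracefully:

scores = [0.90, 0.85, 0.88, 0.80, 0.20]  # one weak link
trust = calculate_rat(scores, disagreement=0.3)
# Pure WLNK would return min(scores) == 0.2; RAT lands well above that
# while still penalizing the outlier via the geometric-mean and min terms.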

Tiered Trust Application:

  • L0 (Facts): use strict WLNK; any factual error breaks the chain
  • L1 (Inferences): use weighted aggregation; allow outvoting of weak links
  • L2 (Hypotheses): use RAT; preserve diversity, don't force premature consensus
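
A minimal dispatch sketch for this tiered policy, reusing calculate_rat from above; the weighted branch is a placeholder mean, since the council did not specify a weighting scheme:

from enum import Enum
from statistics import mean
from typing import List

class TrustPolicy(Enum):
    STRICT_WLNK = "strict_wlnk"  # L0: any weak link breaks the chain
    WEIGHTED = "weighted"        # L1: weak links can be outvoted
    RAT = "rat"                  # L2: preserve diversity

def aggregate_trust(level: str, scores: List[float], disagreement: float) -> float:
    """Select the trust aggregation rule for a knowledge level (sketch)."""
    policy = {"L0": TrustPolicy.STRICT_WLNK,
              "L1": TrustPolicy.WEIGHTED,
              "L2": TrustPolicy.RAT}[level]
    if policy is TrustPolicy.STRICT_WLNK:
        return min(scores)                       # pure WLNK
    if policy is TrustPolicy.WEIGHTED:
        return mean(scores)                      # placeholder weighting
    return calculate_rat(scores, disagreement)   # RAT, defined above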

3. Should Knowledge Levels (L0-L2) Map to Consensus Types?

Verdict: Yes—Strongest Part of the Proposal

| Level | Definition | Consensus Requirement | Validation Method |
|---|---|---|---|
| L0 | Data/observation | Strict unanimity (5/5) | RAG/API, not just voting |
| L1 | Patterns/inference | Majority vote (4/5+) | Explicit reasoning chains |
| L2 | Theories/hypotheses | Plurality (3/5) | Preserve alternatives |

Key Insight: For L2, diversity is preferred over consensus; the goal is to generate options, not to pick a winner prematurely.

4. Risks Underestimated

| Risk | Severity | Council Mitigation |
|---|---|---|
| Latency cascade | HIGH | Implement tiered invocation (not full council for every query) |
| Attribution collapse | HIGH | Tag every claim with model_id and consensus_score |
| Shared hallucination | HIGH | Consensus ≠ truth; add citation requirements |
| Cost scaling | MEDIUM | 7× cost requires explicit invocation thresholds |
| Prompt injection amplification | MEDIUM | Council may "launder" malicious output through consensus |

Council Architectural Recommendations

1. Tiered Invocation Strategy (Not Full Council by Default)

┌─────────────────────────────────────────────────────────┐
│                    Query Classifier                      │
│  (complexity, stakes, domain novelty)                    │
└─────────────────┬───────────────────────────────────────┘
    ┌─────────────┼─────────────┬─────────────┐
    ▼             ▼             ▼             ▼
┌───────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Fast  │   │ Verify  │   │  Full   │   │ Deep    │
│ Path  │   │ Path    │   │ Council │   │ Council │
│(1 LLM)│   │(2 LLMs) │   │(5 LLMs) │   │(5+synth)│
└───────┘   └─────────┘   └─────────┘   └─────────┘
  <2s         3-5s          8-12s        15-30s

Mapping to Quint Levels:

  • Fast Lane (L0): single model + RAG verification (facts)
  • Medium Lane (L1): 3-model vote (pattern validation)
  • Slow Lane (L2): full council + FPF scoring (hypothesis generation)
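
A minimal sketch of the classifier-to-path routing, assuming normalized complexity, stakes, and novelty scores in [0, 1]; the weights and thresholds below are illustrative assumptions, not council-specified values:

from enum import Enum

class Path(Enum):
    FAST = "fast"      # 1 LLM, <2s
    VERIFY = "verify"  # 2 LLMs, 3-5s
    FULL = "full"      # 5 LLMs, 8-12s
    DEEP = "deep"      # 5 LLMs + synthesis, 15-30s

def classify_query(complexity: float, stakes: float, novelty: float) -> Path:
    """Route a query to an invocation tier (weights/thresholds are assumptions)."""
    score = 0.5 * complexity + 0.3 * stakes + 0.2 * novelty
    if score < 0.25:
        return Path.FAST
    if score < 0.50:
        return Path.VERIFY
    if score < 0.80:
        return Path.FULL
    return Path.DEEP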

2. Preserve Model Attribution in DRRs

design_rationale_record:
  query_id: "quint-hypothesis-001"
  timestamp: "2025-12-19T14:32:00Z"

  contributions:
    - model: "claude-opus-4.5"
      role: "primary_synthesis"
      claims: ["hypothesis_A", "constraint_check_passed"]
      confidence: 0.82

    - model: "gemini-3-pro"
      role: "adversarial_reviewer"
      dissents: ["edge_case_unhandled"]
      confidence: 0.71

  weakest_link_identified: "Assumption that clocks were synced"
  trust_score:
    fidelity: 0.78
    groundedness: 0.85
    robustness: 0.71
    composite_rat: 0.77

3. Circuit Breakers for Consensus Failure

from typing import List

class ConsensusCircuitBreaker:
    def evaluate(self, responses: List[ModelResponse]) -> Action:

        # Irreconcilable disagreement → escalate
        if semantic_variance(responses) > DIVERGENCE_THRESHOLD:
            return Action.ESCALATE_TO_HUMAN

        # Suspicious unanimity (possible shared hallucination);
        # groundedness_score is a hypothetical helper, parallel to agreement_score
        if agreement_score(responses) > 0.98 and groundedness_score(responses) < 0.5:
            return Action.REQUEST_CITATIONS

        # Potential prompt injection pattern
        if anomaly_score(responses) > INJECTION_THRESHOLD:
            return Action.QUARANTINE_AND_REVIEW

        return Action.PROCEED_TO_SYNTHESIS

4. The "Fact-Rule" Split

  • L0 (Facts): where possible, do NOT use LLMs to verify facts; use deterministic code/API or RAG lookups.
  • L1/L2 (Logic/Game): this is the sweet spot for the LLM Council; see the routing sketch below.
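
A sketch of how this split might route verification, assuming hypothetical Claim and Verification types, a rag_lookup helper for deterministic fact checks, and the council_verify function from Integration Point 2:

async def verify_claim(claim: Claim) -> Verification:
    """Route a claim to the right verifier by knowledge level (sketch)."""
    if claim.level == "L0":
        # Facts: deterministic code/API or RAG lookup, not model voting
        return await rag_lookup(claim)
    # L1/L2: logical inference and hypothesis claims go to the council
    return await council_verify(claim)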

Implementation Revision (Council-Informed)

| Phase | Original | Council Revision |
|---|---|---|
| Phase 1 | Evaluation | Prototype L1-only council (lowest risk) |
| Phase 2 | Council-Powered Abduction | Benchmark diversity with/without adversarial seeding |
| Phase 3 | Trust-Weighted Rankings | Replace WLNK with RAT for L1/L2 |
| Phase 4 | DRR Generation | Implement attribution schema before production |

Rollback Triggers (Council-Defined)

automatic_rollback:
  diversity:
    - semantic_variance < 0.3  # Hypotheses too similar
  latency:
    - p99_response_time > 15s
  trust:
    - shared_hallucination_detected: true
    - groundedness < 0.5 with consensus > 0.95
  attribution:
    - untraced_claims_ratio > 10%
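
A sketch of how these triggers might be evaluated in code; the metric dictionary keys and how they are collected are assumptions:

def should_rollback(metrics: dict) -> bool:
    """Evaluate the automatic rollback triggers above (sketch)."""
    return (
        metrics["semantic_variance"] < 0.3                # hypotheses too similar
        or metrics["p99_response_time_s"] > 15            # latency budget exceeded
        or metrics["shared_hallucination_detected"]       # trust violation
        or (metrics["groundedness"] < 0.5 and metrics["consensus"] > 0.95)
        or metrics["untraced_claims_ratio"] > 0.10        # attribution gap
    )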

References