ArcAgents Rating Methodology — v1.0

Performance-Risk Framework for ERC-8004 Agents

Published: 2026-05-20 | Authored by: PokoBlue (@PokoBlue99) | Version: 1.0 | Status: Draft for community review

In plain language

ArcAgents is not a bank credit rating agency for AI agents. It is a transparent performance-risk score.

The question is not whether an agent can repay debt. The question is whether the agent can reliably complete the jobs it accepts.

The methodology borrows useful structure from credit-risk modeling because PD, LGD, EAD, and Expected Loss are familiar ways to express obligation-failure risk. But the economic event being modeled is different: performance failure under an on-chain job/escrow workflow.

Concretely, ArcAgents estimates the probability that an agent accepts a job but fails to complete it successfully. Surety and contractor underwriting are useful analogies, but ArcAgents is not a guarantor, insurer, bank rating agency, or regulatory capital model.

Executive Summary

ArcAgents Rating Service produces early performance-risk ratings for autonomous AI agents registered under the ERC-8004 standard.

The goal is to estimate how likely an agent is to accept a job and fail to complete it successfully.

This is not traditional lending credit risk. The agent is not borrowing money, and the job poster is not underwriting loan repayment. The risk is performance risk: did the agent complete the job, pass validation, and avoid dispute?

The methodology borrows familiar concepts from credit-risk modeling — probability of default, loss severity, exposure, and expected loss — but adapts them to agent performance. In this document, "default" means performance default, not legal or regulatory credit default.

The output is an Arc-* rating, a Probability of Performance Default (PPD), estimated loss severity after performance failure, current exposure from in-flight jobs, and an expected loss estimate.

The output is designed to be transparent and easy for risk-minded users to interpret. It may be useful for builders, job posters, validators, and analysts who want a structured view of agent reliability.

This v1 methodology is intentionally simple, transparent, and open for community review. The dataset is still young, so ratings should be treated as directional performance-risk indicators rather than bank-grade credit ratings, investment advice, or regulatory capital inputs.

This document specifies the rating methodology, the validation framework, governance principles, and limitations. It is published openly because reputation systems are useful only when they cannot be gamed by whoever runs them.

1. Scope and Definitions

1.1 What This Methodology Covers

This methodology produces ratings for entities meeting all of the following:

Registered in an ERC-8004 IdentityRegistry on a supported chain (currently Arc testnet and Base mainnet)
Has at least one ReputationRegistry or ValidationRegistry interaction
Meets the minimum data requirement in §1.5

1.2 What It Does Not Cover

This methodology does not rate:

Agents engaged exclusively in non-financial coordination
Agents whose primary identity is off-chain
Smart contract risk of the agent's underlying logic (this is a code audit, not a counterparty rating)
Token issuance, governance, or other instruments associated with the agent

1.3 Key Definitions

Term	Definition in this context
Counterparty	An ERC-8004 agent transacting under ERC-8183 jobs or analogous escrow primitives
Performance Default	Failure to deliver a contracted job within agreed parameters, resulting in a dispute or in escrow being released back to the job poster
Probability of Performance Default (PPD)	Expected probability the agent commits a performance default within a forward 30-day window. Plays a similar modeling role to PD in credit risk, but the default event is task failure, not missed debt repayment.
Loss Severity (LGD by convention)	Expected unrecovered portion of EAD after performance failure, taking into account escrow refunds, partial releases, validator decisions, and dispute resolution.
Exposure at Default (EAD)	Remaining funded escrow across in-flight jobs for the agent
Expected Loss (EL)	EL = PPD × Loss Severity × EAD
Resolved Job	A job in a terminal state: completed, failed, cancelled, refunded, disputed, or validator-rejected. In-flight (funded but not yet terminal) jobs are not resolved.
In-Flight Job	A funded job that has not yet reached a terminal state

Where formulas elsewhere in this document use PD for readability, PD should be read as PPD.

1.4 Data Sources

ArcAgents currently uses the following data sources:

ERC-8004 IdentityRegistry events (agent registration, metadata)
ERC-8004 ReputationRegistry events (feedback signals)
ERC-8004 ValidationRegistry events (validator outcomes)
ERC-8183 AgenticCommerce (or analogous) job and escrow events
ArcAgents self-hosted Arc indexer (block-by-block decoder, source of truth)
Supplementary upstream feeds where available (e.g., Quicknode ERC-8004 API for cross-validation; treated as redundant, not critical — see §8.3)

For each rating, the API records the source block range, chain ID, methodology version, and rating timestamp. Ratings are reproducible from raw on-chain events using only this document and the published indexer + rating engine code.

1.5 Minimum Data Requirement

An agent receives a public rating only when all of the following are true:

At least 5 resolved jobs or reputation/validation interactions
At least 14 days since first observed on-chain activity
No unresolved identity conflict across supported chains
Sufficient source data to reproduce the score from this document

Agents below this threshold are returned with rated: false and a reason of 'insufficient_interactions' or 'insufficient_history'; they are not assigned a tier.
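
The two numeric gates above can be sketched in a few lines of Python. This is a minimal illustration, not the published rating engine: the names check_eligibility and Eligibility and the order in which the gates are tested are assumptions, and the identity-conflict and reproducibility conditions are left out because this document defines no reason code for them.

```python
from dataclasses import dataclass
from typing import Optional

MIN_INTERACTIONS = 5    # §1.5: resolved jobs or reputation/validation interactions
MIN_HISTORY_DAYS = 14   # §1.5: days since first observed on-chain activity


@dataclass
class Eligibility:
    rated: bool
    reason: Optional[str] = None


def check_eligibility(interactions: int, history_days: int) -> Eligibility:
    """Apply the two numeric gates from §1.5 (other conditions omitted)."""
    # Assumption: the interaction count is checked before the history length.
    if interactions < MIN_INTERACTIONS:
        return Eligibility(rated=False, reason="insufficient_interactions")
    if history_days < MIN_HISTORY_DAYS:
        return Eligibility(rated=False, reason="insufficient_history")
    return Eligibility(rated=True)
```

An agent with 3 interactions over 30 days would come back as insufficient_interactions, while one with 10 interactions over 7 days would come back as insufficient_history.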

Lookback windows used inside the model:

Use	Window	Notes
Minimum on-chain history	14 days	gate for any rating
Point-in-Time (PIT) PPD	rolling 30 days	the primary public number
Through-the-Cycle (TTC) PPD	rolling 180 days minimum, full available history	only issued when ≥180d history exists

Until sufficient history exists across the ecosystem, TTC ratings will be unavailable for most agents in v1.

2. Economic Framework: Agent Performance Risk

The economic substance of an ERC-8004 agent transaction is not lending. An agent does not receive a loan and promise repayment over time. Instead, the agent accepts a task and is expected to deliver a service or output.

For that reason, the primary risk is performance risk: the risk that the agent fails to complete the job, fails validation, receives poor outcome feedback, or becomes inactive while work remains outstanding.

A useful analogy is surety or contractor performance underwriting. In those settings, the core question is not "will the borrower repay?" but "will the principal perform the contracted obligation?"

ArcAgents uses this analogy carefully. ERC-8004 jobs are not legal surety bonds, and ArcAgents does not act as a guarantor. The analogy is used only to frame the risk problem: one party promises performance, another party needs confidence, and the key event is failure to deliver.

The vocabulary — Probability of Default, Loss Given Default, Exposure at Default, Expected Loss — is borrowed from credit-risk modeling because:

The Expected Loss formula (PPD × LGD × EAD) is a structurally sound way to express any obligation-failure problem.
The rating-scale convention (AAA → D) is legible to a broad audience.
Risk-minded users already have mental slots for these inputs, making the output easier to interpret.

Throughout this document:

"Probability of Default" should be read as Probability of Performance Default (PPD).
"Loss Given Default" should be read as Loss Severity After Performance Failure under pre-funded escrow.
The terms PD, LGD, and EAD are used in their adapted form for formula readability.

This methodology makes no claim of Basel regulatory compliance, and does not present itself as a bank-grade credit rating or a regulatory capital input. The reference to credit-risk vocabulary is an analytical analogy, not a regulatory precedent.

Why pre-funded escrow changes the picture

The funded portion of an ERC-8183 escrow is ring-fenced before the agent acts. There is no counterparty insolvency exposure in the traditional lending sense. What remains is operational performance risk — whether the agent does the work it has been paid (in escrow) to do. That is the risk this methodology measures.

3. Rating Scale

3.1 Arc-* Letter Grades

Ratings are issued on the following 9-tier Arc-* scale. The Arc- prefix is intentional. These ratings are not equivalent to S&P, Moody's, Fitch, or AM Best ratings. They are ordinal indicators of agent performance risk within a young on-chain ecosystem.

Rating	PPD Range (30d)	Description
Arc-AAA	< 0.5%	Highest quality, minimal performance risk
Arc-AA	0.5% – 1.5%	Very high quality
Arc-A	1.5% – 3.0%	Upper-medium grade
Arc-BBB	3.0% – 6.0%	Medium grade
Arc-BB	6.0% – 12.0%	Speculative
Arc-B	12.0% – 20.0%	Highly speculative
Arc-CCC	20.0% – 35.0%	Substantial risks
Arc-CC	35.0% – 60.0%	Very high risks
Arc-D	> 60.0% or in active default	Defaulted or near-defaulted

The bands are deliberately wide. Narrower bands invite false precision on a young dataset. The reasoning behind the band widths and the planned tightening schedule is in Appendix C (Calibration Philosophy).
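
As a rough sketch, the band lookup could be written as below. The cutoffs come from the table above. The boundary handling is an assumption, since the table does not say which band owns a shared edge: here each band includes its lower bound (so exactly 0.5% maps to Arc-AA), and exactly 60.0% stays in Arc-CC because Arc-D is listed as > 60.0%. The names ARC_BANDS and arc_rating are hypothetical.

```python
# (exclusive upper bound, label) pairs taken from the §3.1 table.
ARC_BANDS = [
    (0.005, "Arc-AAA"),
    (0.015, "Arc-AA"),
    (0.03, "Arc-A"),
    (0.06, "Arc-BBB"),
    (0.12, "Arc-BB"),
    (0.20, "Arc-B"),
    (0.35, "Arc-CCC"),
]


def arc_rating(ppd: float, in_active_default: bool = False) -> str:
    """Map a 30-day PPD (as a fraction, e.g. 0.09) to an Arc-* letter grade."""
    if not 0.0 <= ppd <= 1.0:
        raise ValueError("ppd must be a probability between 0 and 1")
    if in_active_default or ppd > 0.60:
        return "Arc-D"
    for upper_bound, label in ARC_BANDS:
        if ppd < upper_bound:
            return label
    # Assumption: 35% <= ppd <= 60% (inclusive of exactly 60%) is Arc-CC.
    return "Arc-CC"
```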

3.2 Confidence Tiers

Each rating includes a confidence indicator:

Confidence	Conditions
High	≥ 50 interactions in lookback window
Medium	15–49 interactions
Low	5–14 interactions; rating shown but consumers should not act on it alone
Insufficient	< 5 interactions; no rating issued (see §1.5)

The "±2% confidence interval" sometimes attached to the High tier is a planned statistical estimate based on the underlying PPD distribution. It is not yet computed for every rating in v1, and its calculation method will be documented in a subsequent minor version. Until then, the confidence tier is the primary indicator of evidence sufficiency.
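
The tier thresholds in the table can be expressed as a simple lookup. This is a minimal sketch: the function name confidence_tier is hypothetical, and the lowercase labels follow the "confidence": "medium" value shown in Appendix E.

```python
def confidence_tier(interactions: int) -> str:
    """Return the §3.2 confidence tier for an interaction count in the lookback window."""
    if interactions < 0:
        raise ValueError("interactions cannot be negative")
    if interactions >= 50:
        return "high"
    if interactions >= 15:
        return "medium"
    if interactions >= 5:
        return "low"
    return "insufficient"  # no rating is issued at this tier (§1.5)
```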

3.3 Through-the-Cycle vs Point-in-Time

Two rating views are produced for every agent with sufficient history:

Point-in-Time (PIT): reflects current 30-day rolling behavior. Used for real-time hiring decisions.
Through-the-Cycle (TTC): reflects full available history with reduced sensitivity to short-term fluctuations. Requires minimum 180 days of history. Used for longer-term counterparty monitoring or portfolio-level analysis.

A migration matrix tracks PIT rating transitions over rolling 30-day windows, enabling consumers to assess rating stability.
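
One way to build such a matrix is to count rating changes between consecutive windows and normalise each row. The sketch below assumes each agent's history is an ordered list of PIT ratings, one per 30-day window; the function names and that input shape are illustrative, not the published implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple

Transition = Tuple[str, str]  # (rating in window t, rating in window t+1)


def migration_counts(histories: List[List[str]]) -> Counter:
    """Count (from_rating, to_rating) pairs between consecutive windows."""
    counts: Counter = Counter()
    for history in histories:
        counts.update(zip(history, history[1:]))
    return counts


def migration_matrix(counts: Counter) -> Dict[Transition, float]:
    """Convert raw counts into row-normalised transition frequencies."""
    row_totals: Counter = Counter()
    for (source, _target), n in counts.items():
        row_totals[source] += n
    return {pair: n / row_totals[pair[0]] for pair, n in counts.items()}
```

A row whose mass sits mostly on the diagonal (for example Arc-BB to Arc-BB) indicates a stable rating; mass spread across many targets indicates a volatile one.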

3.4 Rating Actions

ArcAgents may assign the following rating actions:

Action	Meaning
New Rating	First rating issued after the agent meets the minimum data requirement (§1.5)
Upgrade	Rating improves due to stronger observed performance
Downgrade	Rating weakens due to failures, disputes, inactivity, or deteriorating signals
Watch Negative	Recent behavior indicates elevated risk but data is not yet conclusive for a downgrade
Watch Positive	Recent behavior improves but has not persisted long enough for an upgrade
Withdrawn	Rating removed due to stale data, identity conflict, unsupported chain, or methodology exclusion

Rating actions are written to the public change log per agent and included in the API response under rating_action whenever an action applies; the field is omitted for steady-state ratings.

4. Probability of Performance Default (PPD) Model

Why PPD, not PD?

To avoid confusion with bank lending terminology, this methodology uses Probability of Performance Default (PPD) as the primary term.

PPD estimates the probability that an agent will fail to complete a job successfully within a forward 30-day window. For readers familiar with credit risk, PPD plays a similar modeling role to PD — but the default event is task failure, validation failure, dispute, refund, or abandonment, not missed debt repayment.

Where formulas in this document use PD, it should be read as PPD.

4.1 Performance Default Definition

An agent is considered to have committed a performance default on a job when any of the following occur within the contracted job timeline:

ERC-8183 job (or analogous escrow) is cancelled, refunded, or disputed
A validator submits a failing ValidationRegistry response for the job's deliverable
The job poster submits outcome-tagged ReputationRegistry feedback with a value below a defined threshold (v1 threshold: feedback value < 50 on the 0–100 scale, treated as a performance signal). The threshold and scale are subject to change as ERC-8004 feedback semantics mature; the active threshold is recorded in the API response under feedback_default_threshold for each rating.
The agent's wallet becomes inactive for ≥ 90 days while jobs remain in-flight. The 90-day window is chosen because it is short enough to flag genuinely abandoned agents but long enough to absorb realistic operational gaps (holidays, infrastructure migrations). The threshold may be revised in future methodology versions.

Performance defaults are tagged per-job and aggregated into agent-level default rates.

Note on terminology: This is performance default, not credit default. Credit-risk frameworks use "default" for specific credit events such as missed loan payments or bankruptcy. Performance default in this methodology refers to failure-to-deliver events under pre-funded escrow.

4.2 Default Rate Calculation

For each agent across the lookback period (PIT: 30 days; TTC: 180 days minimum):

empirical_performance_default_rate = defaulted_jobs / resolved_jobs

Resolved jobs include completed jobs, failed jobs, cancelled jobs, refunded jobs, disputed jobs, and validator-rejected jobs. In-flight jobs are excluded from the denominator until they reach a terminal state. This avoids understating the default rate during periods of unusually high in-flight volume.

This empirical rate is the base PPD before factor adjustments.
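
A minimal Python sketch of this calculation follows. It assumes each job already carries the per-job default tag described in §4.1; the Job record, the state strings, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Terminal states listed in §1.3; the exact strings are hypothetical labels.
TERMINAL_STATES = {
    "completed", "failed", "cancelled", "refunded", "disputed", "validator_rejected",
}


@dataclass
class Job:
    state: str                 # a terminal state, or e.g. "in_flight"
    performance_default: bool  # per-job tag assigned under §4.1


def empirical_default_rate(jobs: Iterable[Job]) -> Optional[float]:
    """defaulted_jobs / resolved_jobs; in-flight jobs are excluded from both counts."""
    resolved = [job for job in jobs if job.state in TERMINAL_STATES]
    if not resolved:
        return None  # no resolved jobs, so no rate can be stated
    defaulted = sum(1 for job in resolved if job.performance_default)
    return defaulted / len(resolved)
```

With 37 completed jobs, 3 refunded jobs tagged as defaults, and any number of in-flight jobs, the function returns 3 / 40, matching the base rate used in Appendix D.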

4.3 PPD Factor Adjustments

The base PPD is adjusted using behavioral and structural risk factors. Each factor is a multiplicative or additive modifier applied to the base rate:

Factor	Direction	Rationale
Agent age (younger = higher PPD)	+	New agents have less observed behavior
Validator diversity (concentrated = higher PPD)	+	Single-validator histories are weaker signals
Job size variance (very high variance = higher PPD)	+	Inconsistent capacity signals operational risk
Recent feedback trend (declining = higher PPD)	+	Deterioration signal
Sybil concentration flags (from upstream feeds and §4.5 checks)	+	Reputation may be inflated
Cross-chain presence (more chains active = lower PPD)	–	More observable surface area, harder to manipulate
Validator reputation quality (better validators = lower PPD)	–	Higher-quality validators produce more reliable signals

Factor weights are documented in the published source code (rating/engine/pd.ts, constant PD_COEFFICIENTS) and reflected back in the API response under factor_contributions for any consumer that wants to audit how the final PPD was assembled.

4.4 PPD Model Specification

A logistic regression form is the intended v1 model specification once sufficient labeled outcomes are available:

logit(PPD) = β₀ + β₁·log(1+empirical_performance_default_rate)
               + β₂·log(1+agent_age_days)
               + β₃·validator_diversity_index
               + ... + βₙ·factorₙ

Logistic regression is chosen because it is explainable to model validators, well-understood, and appropriate for binary outcome modeling.
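
To make the functional form concrete, the sketch below evaluates the first three terms of the specification and applies the inverse logit. The coefficient values are invented purely for illustration; they are not the published PD_COEFFICIENTS and have not been fitted to any data.

```python
import math

# Hypothetical coefficients, for illustration only.
BETA = {
    "intercept": -3.0,
    "default_rate": 8.0,         # multiplies log(1 + empirical default rate)
    "agent_age": -0.2,           # multiplies log(1 + agent age in days)
    "validator_diversity": -0.5,
}


def ppd_logistic(default_rate: float, agent_age_days: float,
                 validator_diversity_index: float, beta: dict = BETA) -> float:
    """Evaluate logit(PPD) as in §4.4, then map it back to a probability."""
    z = (
        beta["intercept"]
        + beta["default_rate"] * math.log1p(default_rate)
        + beta["agent_age"] * math.log1p(agent_age_days)
        + beta["validator_diversity"] * validator_diversity_index
    )
    return 1.0 / (1.0 + math.exp(-z))
```

With these made-up signs, a higher empirical default rate raises PPD, while greater agent age and validator diversity lower it, which matches the factor directions in §4.3.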

Honesty about current calibration. Until the dataset contains enough resolved jobs and observed performance defaults to fit coefficients statistically, ArcAgents uses a transparent scorecard approximation with fixed, documented weights. The model type used for each published rating is included in the API response as:

model_type: "scorecard_v1" — fixed weights, current state
model_type: "logistic_v1" — once statistically calibrated coefficients replace the scorecard

The transition between scorecard and fitted logistic is itself a material change under §8.1 and will be announced with the 30-day notice window.

4.5 Anti-Gaming Controls

Because on-chain reputation can be manipulated, ArcAgents applies the following anti-gaming checks where data is available. Each may reduce confidence, increase PPD, or suppress a rating until more independent history is observed.

Signal	What it flags
Repeated interactions among a small wallet cluster	Wash-trading-style reputation building
Validator concentration	All positive validations come from one or two validators
Unusually small job values driving completion rate	Reputation inflation via dust-sized jobs
Sudden burst of positive feedback in a short window	Coordinated reputation push
Circular reputation patterns (A rates B, B rates A)	Mutual-validation rings
New-agent history with very low aggregate economic value	Cheap-to-create identities
Identity conflict across chains	Same off-chain operator behind multiple registered identities without disclosure

Anti-gaming detection is conservative in v1: the goal is to suppress ratings on suspicious agents, not to publish accusations. When a signal triggers, the API response includes a flags array (e.g., flags: ["validator_concentration"]) and the confidence tier is reduced one step.
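
As an illustration, the validator-concentration signal and the one-step confidence reduction might look like the sketch below. The function names are hypothetical, and two points are assumptions the text does not settle: the tier drops once no matter how many flags fire, and a low tier that is flagged falls to insufficient, which suppresses the rating.

```python
from typing import List, Sequence

TIER_ORDER = ["insufficient", "low", "medium", "high"]


def has_validator_concentration(positive_validators: Sequence[str]) -> bool:
    """True when all positive validations come from one or two distinct validators."""
    distinct = set(positive_validators)
    return 0 < len(distinct) <= 2


def apply_flags(confidence: str, flags: List[str]) -> str:
    """Step the confidence tier down once if any anti-gaming flag is present."""
    if not flags:
        return confidence
    index = TIER_ORDER.index(confidence)
    return TIER_ORDER[max(0, index - 1)]
```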

5. Loss Severity After Performance Failure

In lending, LGD usually means the portion of a loan not recovered after borrower default. That is not the case here.

For ArcAgents, loss severity measures the unrecovered value after an agent fails to perform. Because many jobs are pre-funded through escrow, losses may be reduced by refunds, partial releases, validator decisions, or dispute resolution.

For formula compatibility, this methodology may refer to this quantity as LGD. More precisely, it means Loss Given Performance Failure (LGPF) — the gap between escrow funded and value delivered when a performance default occurs.

5.1 Recovery Mechanisms

For ERC-8183 (or analogous) jobs, the following recovery mechanisms exist:

Escrow refund: unused USDC returns to the job poster on cancellation
Partial completion release: validator-determined partial payment to the agent, with the balance returned to the job poster
Dispute resolution: off-chain or DAO-mediated resolution
No recovery: funds disbursed before performance failure became evident

5.2 LGD Calculation

LGD = 1 - (recovered_USDC / EAD_at_default)

LGD is aggregated across all observed performance defaults in the agent's segment.
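
A short sketch of the per-default formula and one possible segment aggregation is shown below. The text does not specify how the aggregation is weighted, so the exposure-weighted pooling here (total recovered over total EAD) is an assumption, as are the function names.

```python
from decimal import Decimal
from typing import Iterable, Tuple


def lgd(recovered_usdc: Decimal, ead_at_default: Decimal) -> Decimal:
    """LGD = 1 - (recovered_USDC / EAD_at_default) for a single default."""
    if ead_at_default <= 0:
        raise ValueError("EAD at default must be positive")
    return Decimal(1) - recovered_usdc / ead_at_default


def segment_lgd(defaults: Iterable[Tuple[Decimal, Decimal]]) -> Decimal:
    """Pooled LGD over (recovered, EAD) pairs; the weighting is an assumption."""
    pairs = list(defaults)
    total_recovered = sum((recovered for recovered, _ in pairs), Decimal(0))
    total_ead = sum((ead for _, ead in pairs), Decimal(0))
    return lgd(total_recovered, total_ead)
```

For example, recovering 700 USDC on 1,000 USDC of exposure gives an LGD of 0.3. Pooling that default with a second 1,000 USDC default that recovered nothing gives 1 - 700 / 2,000 = 0.65.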

5.3 LGD Segmentation

LGD is computed per agent type segment because recovery mechanisms behave differently across job types:

Segment	Typical LGD Range (v1 prior estimate)	Reasoning
Payment relay agents	5%–25% (Low)	Funds usually retained until delivery confirmation
Trading agents	60%–95% (High)	Funds may be deployed and lost before failure is detected
Service agents (translation, analysis, etc.)	25%–60% (Medium)	Partial completion often recoverable
Validator agents	10%–40% (Low–Medium)	Reputational consequences create recovery incentive

Segmentation is included because different job types have different recovery patterns — a failed trading agent and a failed translation agent leave behind very different recoverable balances. The numeric ranges above are v1 prior estimates from limited live data; they are refined as observed defaults accumulate per segment.

5.4 Downturn LGD

For consumers requiring stressed estimates, a downturn LGD is computed using the 90th percentile of observed LGDs in the historical period. This is reported separately from the central estimate.
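
The text does not say which percentile convention is used. The sketch below uses the nearest-rank method as one reasonable choice; that choice and the function name are assumptions.

```python
from typing import Sequence


def downturn_lgd(observed_lgds: Sequence[float], percentile: int = 90) -> float:
    """Nearest-rank percentile of observed LGDs (default: 90th)."""
    if not observed_lgds:
        raise ValueError("at least one observed LGD is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(observed_lgds)
    # Integer ceiling of percentile * n / 100 avoids floating-point rounding.
    rank = max(1, -(-percentile * len(ordered) // 100))
    return ordered[rank - 1]
```

With few observed defaults the 90th percentile is close to the single worst observation, so the stressed figure should be read with the same caution as the confidence tier.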

6. Exposure at Default (EAD)

6.1 Current EAD Calculation

EAD_current = Σ(remaining_funded_escrow_value) for all in-flight jobs

EAD is recalculated continuously as new jobs are funded and as in-flight jobs are partially released or completed.

6.2 v1 Scope: Funded EAD Only

For v1, EAD is computed using actually-funded escrow only.

Funded escrow (USDC already in the ERC-8183 contract for an in-flight job) — counted.
Unfunded job requests or quoted-but-not-yet-funded capacity — excluded. They may be monitored as pipeline activity, but they are not treated as current exposure until escrow is funded.
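
The funded-only rule can be sketched as a filtered sum. The EscrowJob record and its field names are hypothetical; amounts are held as Decimal to avoid floating-point drift on USDC values.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable


@dataclass
class EscrowJob:
    funded: bool                    # escrow is actually in the contract
    terminal: bool                  # job has reached a terminal state
    remaining_escrow_usdc: Decimal  # funded escrow not yet released


def current_ead(jobs: Iterable[EscrowJob]) -> Decimal:
    """Sum remaining funded escrow over in-flight jobs only (§6.1, §6.2)."""
    return sum(
        (job.remaining_escrow_usdc for job in jobs if job.funded and not job.terminal),
        Decimal(0),
    )
```

Unfunded or quoted-only jobs contribute nothing under this rule, even when they carry a quoted value.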

Modeling expected new exposure from committed-but-unfunded capacity (CCF-style adjustments familiar from credit-risk modeling) is deferred to v2. Rationale: ERC-8183 escrow is binary — either funded or not yet initiated. Modeling "expected new jobs in the PPD horizon" introduces forecasting noise without v1 benefit. Simplicity supports auditability.

7. Validation and Backtesting

7.1 Backtesting Methodology

The PPD model is backtested against historical data using:

Out-of-time validation: model trained on the first 75% of available history, tested on the most recent 25%.
Discriminatory power: measured via Gini coefficient and Receiver Operating Characteristic (ROC) AUC.
Calibration accuracy: observed-vs-expected default rates per rating band.

Target metrics for v1:

ROC-AUC ≥ 0.70 (acceptable discriminatory power on limited data)
Gini ≥ 0.40
Calibration: no rating band where observed default rate falls outside ±50% of predicted
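
The two discrimination targets are linked by the identity Gini = 2 × AUC − 1, so an AUC of 0.70 corresponds to a Gini of 0.40. The sketch below computes both from predicted PPDs and observed outcomes and adds a per-band calibration check. Reading "±50% of predicted" as a relative tolerance is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], defaulted: Sequence[bool]) -> float:
    """Probability that a defaulted job scores above a non-defaulted one (ties count half)."""
    positives = [s for s, d in zip(scores, defaulted) if d]
    negatives = [s for s, d in zip(scores, defaulted) if not d]
    if not positives or not negatives:
        raise ValueError("need at least one default and one non-default")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def gini(auc: float) -> float:
    """Gini coefficient implied by an AUC value."""
    return 2.0 * auc - 1.0


def band_is_calibrated(predicted: float, observed: float, tolerance: float = 0.5) -> bool:
    """True if the observed rate is within +/- tolerance (relative) of the predicted rate."""
    return abs(observed - predicted) <= tolerance * predicted
```

The pairwise AUC is quadratic in sample size, which is acceptable for the small default counts expected in v1; a rank-based formula would be the usual replacement at larger scale.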

Note on thresholds. These v1 thresholds are deliberately lenient for a young dataset; production banking models typically target ROC-AUC ≥ 0.75. As more performance default observations accumulate, target thresholds will tighten and be re-documented in subsequent versions.

7.2 Initial Backtest Results

As of v1.0 draft, full statistical backtesting has not yet been completed. The target validation metrics in §7.1 are shown to define the intended acceptance criteria, not to imply that the current model has already met them.

When backtest results are published they will appear here in full, regardless of strength — honesty about model limitations is stronger than inflated metrics. Results will include:

ROC-AUC, Gini, sample size, default count
Per-band calibration table (predicted vs observed PPD)
Out-of-time test window dates
Any segments where the model failed acceptance thresholds and the resulting action (e.g., scorecard fallback, segment-specific recalibration)

7.3 Limitations Acknowledged

Dataset is young: ERC-8004 activity has been observed only since early 2026 (Arc testnet from Q1 2026; Base mainnet coverage added during this build).
Performance default observations are limited; statistical confidence on tail risk is weak.
Cross-chain agent identity matching is heuristic and may introduce noise.
Validator quality varies; downstream signal quality is bounded by upstream signal quality.
v1 covers Arc and Base only.

8. Governance Framework

8.1 Model Change Control

This methodology is versioned. Any of the following constitutes a material change requiring a new minor version:

Addition or removal of a PPD factor
Change to the performance default definition
Change to rating scale or PPD band cutoffs
Change to backtesting methodology
Transition from scorecard to statistically-fitted logistic regression (§4.4)

Material changes are announced ahead of time. The intent is a 30-day notice period during which both old and new versions are reported in parallel. In practice, smaller-scale changes from a solo maintainer may be documented in the public change history with a shorter notice window. The active policy is recorded in the change log (Appendix F), and every API response identifies the methodology in force via a methodology_version field.

8.2 Auditability and Independent Review

Even at v1 scale, the service operates with simple role separation:

Development: the publisher (PokoBlue) builds and operates the model.
Independent review: the methodology is published openly for community review; substantive feedback from named reviewers is documented in the change history.
Auditability: every published rating is reproducible from raw on-chain events using only this methodology document and the published code.

This methodology does not claim a formal three-lines-of-defense governance structure. It claims operational discipline appropriate to a v1 open-source project, with intentional transparency as the substitute for organizational separation.

8.3 Vendor Risk Management

The service runs on a self-hosted Arc node operated by the publisher. Upstream data feeds (Quicknode ERC-8004 API, hosted Base RPC providers) are integrated as redundant or supplementary sources, not critical dependencies:

If hosted RPC access is degraded, the Arc-native rating service continues at full functionality.
If Quicknode's API is discontinued or pricing changes adversely, ratings continue using only direct on-chain reads.
If the methodology of an upstream feed changes (e.g., Quicknode's reputation formula version), this service records which upstream version was consumed at any given time.

This design follows a basic data-resilience principle: ratings should not depend on a single hosted API when direct on-chain reads are available.

9. Disclaimers and Limitations

9.1 Not Investment Advice, Not Regulatory Approval

ArcAgents Rating Service produces analytical performance-risk ratings. These ratings are not investment advice, not endorsements, and have not been approved by any banking regulator or rating agency licensing body. Consumers using these ratings in regulated contexts must perform their own model validation. No claim of Basel compliance is made.

9.2 Permissionless and Open

This methodology is published openly under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share and adapt the methodology — including for commercial use — provided you give appropriate credit.

Required attribution string: "ArcAgents by PokoBlue" (with a link to the source paper or arcagents.poko.blue/methodology when the medium supports it).

The rating engine source code will be released as open source at github.com/huicom/arc-agents-explorer. The repository is kept private through the active hackathon window and will be made public after the July 2026 submission deadline. Forks of the methodology are encouraged under the terms above.

9.3 Known Limitations

v1.0 covers Arc and Base only.
Confidence-Insufficient agents receive no rating.
Performance default observations are limited by chain age.
Sybil-resistance is partially inherited from upstream feeds; independent sybil detection is staged across v1 (§4.5) and v2.
Through-the-Cycle ratings require 180+ days of history; many agents will only receive Point-in-Time ratings in v1.
API JSON field names use ppd rather than pd (e.g., ppd_30d, base_ppd, mean_ppd_30d) to match the public terminology used throughout this paper.

Appendix A: Concept Mapping (for readers from credit-risk and surety backgrounds)

The terms used in this methodology are adapted from credit-risk and surety-style frameworks. The mapping below is provided for orientation only — it is not a regulatory implementation of either framework, and no claim of regulatory equivalence is made.

ArcAgents concept	Performance-bond analog	Credit-risk analog
Performance Default	Bond claim event	Default event
Loss Severity (LGD by convention)	Bond loss severity (1 − recovery rate)	LGD
EAD	Bond face value at claim	EAD
Expected Loss	Expected bond payout	EL
Through-the-cycle rating	Long-run obligor rating	TTC rating
Point-in-time rating	Current contract performance score	PIT rating
Rating migration matrix	Rating transition analysis	Transition matrix
Agent type segmentation	Bond class (bid, performance, payment)	Exposure class
Downturn loss severity	Stressed recovery scenario	Downturn LGD

The methodology is inspired by both surety-style and credit-risk thinking, but does not claim to be either. It is an early performance-risk scoring framework for on-chain AI agents.

Appendix B: Glossary

Term	Definition
Agent	Autonomous or semi-autonomous service entity registered under ERC-8004
Job	A task accepted by an agent under ERC-8183 or an analogous workflow
Job poster	Party requesting and funding the job
Validator	Entity or mechanism that assesses whether output passed requirements
Reputation event	On-chain or indexed feedback about agent outcome
Performance default	Failure to complete accepted work within agreed parameters (see §4.1)
PPD	Probability of Performance Default over a defined horizon
LGD / LGPF	Loss severity after performance failure
EAD	Remaining funded exposure across in-flight jobs
Expected Loss (EL)	PPD × LGD × EAD
PIT rating	Current behavior-sensitive rating (30-day window)
TTC rating	Longer-history rating with reduced short-term sensitivity (≥180 days)
Confidence tier	Indicator of how much evidence supports the rating
Resolved job	Job with final state: completed, failed, cancelled, refunded, disputed, or rejected
In-flight job	Funded job not yet resolved
Sybil risk	Risk that reputation is inflated by controlled or related identities
Methodology version	Semver string published in every API response identifying which document version produced the rating
Rating action	Categorical label describing change vs. prior rating (New, Upgrade, Downgrade, Watch Negative, Watch Positive, Withdrawn)
ERC-8004	Open standard for on-chain agent identity, reputation, and validation registries
ERC-8183	On-chain job and escrow primitive used as the canonical job lifecycle in v1 (Arc reference deployment at `0x0747EEf0706327138c69792bF28Cd525089e4583`)

Appendix C: Calibration Philosophy

Arc-* bands are deliberately wide for a young dataset. Narrow bands would invite false precision — small fluctuations in a thin set of resolved jobs would push agents across tier boundaries in ways the underlying signal doesn't support.

The bands will tighten over time as observed performance defaults accumulate and the per-tier confidence intervals shrink. Tightening is a material change under §8.1 and will be announced through the standard governance window. Drivers of band-width change that should trigger a tightening review:

Per-tier sample size exceeds the threshold where calibration-test confidence intervals fall below ±25% of the predicted PPD.
Out-of-time backtest (§7.1) demonstrates consistent ordering of agents across at least two consecutive 90-day windows.
A material change in the performance-default definition (§4.1) settles into stable observation.

For readers familiar with bank credit rating scales: the letter convention (AAA → D) is intentionally familiar so the direction of the scale is immediately legible. The numeric bands are calibrated for agent performance over 30-day windows and should not be read as equivalent to any annual issuer-credit grade. ArcAgents is not a credit rating agency and makes no claim to NRSRO-equivalent calibration.

Appendix D: Worked Example

Agent A has, over the last 30 days:

40 resolved jobs
3 performance defaults
10,000 USDC of currently-funded in-flight escrow
Estimated Loss Given Performance Failure of 30% for its agent-type segment

Step 1 — Empirical performance default rate:

3 / 40 = 7.5%

Step 2 — Factor adjustments (illustrative):

The base rate is adjusted using the factors in §4.3. For this agent, factors push the rate slightly higher (young agent, moderate validator diversity), so:

PPD = 9.0%

Step 3 — Loss severity and exposure:

LGPF = 30%
EAD  = 10,000 USDC

Step 4 — Expected Loss:

EL = PPD × LGPF × EAD
   = 0.09 × 0.30 × 10,000
   = 270 USDC

The model estimates 270 USDC of expected performance-loss exposure for Agent A's current in-flight job book over the next 30 days. The agent's PPD of 9.0% lands in the Arc-BB band (6.0%–12.0%).

This example is illustrative. The actual factor weights for any published rating are documented in rating/engine/pd.ts (PD_COEFFICIENTS) and surfaced per-rating in the API under factor_contributions.

Appendix E: Example Rating Output

{
  "agent_id": "12",
  "chain": "arc",
  "rated": true,
  "rating": "Arc-BB",
  "rating_view": "point_in_time",
  "ppd_30d": 0.09,
  "lgd": 0.30,
  "ead_usdc": "10000",
  "expected_loss_usdc": "270",
  "confidence": "medium",
  "interactions": 40,
  "model_type": "scorecard_v1",
  "methodology_version": "1.0.0",
  "rating_timestamp": "2026-05-20T00:00:00Z",
  "data_window_days": 30,
  "rating_action": "New Rating",
  "flags": [],
  "factor_contributions": {
    "empirical_default_rate": "+0.06",
    "agent_age_days": "+0.01",
    "validator_diversity_index": "+0.005",
    "cross_chain_presence": "-0.005"
  }
}

Notes on field names:

chain values in v1: "arc" (testnet) and "base" (mainnet).
ppd_30d represents Probability of Performance Default over a 30-day horizon.
ead_usdc and expected_loss_usdc are returned as strings to preserve precision (the values are denominated in 6-decimal USDC and may exceed JavaScript's safe integer range when aggregated).
model_type indicates whether the rating used the v1 scorecard (scorecard_v1) or a statistically-fitted logistic (logistic_v1) — see §4.4.
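
A consumer can check a response for internal consistency by recomputing Expected Loss from its components. The sketch below parses a subset of the example above with Python's standard json module, using Decimal for both the numeric fields and the string-encoded USDC amounts. The variable names are illustrative, and the check assumes expected_loss_usdc is reported in the same units as ead_usdc.

```python
import json
from decimal import Decimal

# Subset of the example response above.
response_text = """
{
  "rating": "Arc-BB",
  "ppd_30d": 0.09,
  "lgd": 0.30,
  "ead_usdc": "10000",
  "expected_loss_usdc": "270"
}
"""

# parse_float=Decimal keeps 0.09 and 0.30 as exact decimals rather than binary floats.
rating = json.loads(response_text, parse_float=Decimal)

ppd = rating["ppd_30d"]
lgd = rating["lgd"]
ead = Decimal(rating["ead_usdc"])                   # string field, parsed explicitly
reported_el = Decimal(rating["expected_loss_usdc"])

# EL = PPD x LGD x EAD, as defined in §1.3.
recomputed_el = ppd * lgd * ead
el_matches = recomputed_el == reported_el
```

Parsing the amounts as Decimal rather than float sidesteps the precision concern that motivates returning them as strings.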

Appendix F: Change History

Version	Date	Changes
1.0	2026-05-20	Initial publication. Includes PPD terminology, Arc-* scale, segmented LGD, funded-EAD-only, scorecard model with planned logistic transition, anti-gaming controls (§4.5), rating actions (§3.4), data sources (§1.4), minimum data requirement (§1.5), worked example (Appendix D), and example API output (Appendix E).

Methodology published under CC BY 4.0. Required attribution when reused or adapted: "ArcAgents by PokoBlue". Maintained by PokoBlue (@PokoBlue99) and the ArcAgents community.

Reproducible. Auditable. Open.