Abstract
During rigorous stress testing of the Amplitude v3.1 specification, we identified and corrected seventeen mathematical errors across the ten scoring frameworks. This paper documents each correction (including division-by-zero edge cases, cold-start scoring behavior, and missing-data handling protocols) and discusses what the pattern of errors reveals about the challenges of building production-grade measurement systems for AI impact.
Background
The development of production-grade measurement systems follows a trajectory that is well-documented in metrology, the science of measurement [1]. Initial specifications define the theoretical measurement model, establishing the quantities to be measured, the mathematical relationships between them, and the scoring scales. Implementation reveals edge cases that the theoretical model did not anticipate: undefined behavior when inputs take extreme values, numerical instability in certain parameter ranges [2], and ambiguity in the specification that permits divergent implementations. Stress testing subjects the implementation to adversarial inputs designed to expose these weaknesses. The corrections that result from stress testing are not failures of the original design; they are the necessary refinements that transform a theoretical model into a production-ready instrument.
The history of established measurement systems provides instructive parallels. The CVSS specification underwent four major revisions between its initial publication in 2005 and the current version 4.0 [3], with each revision correcting scoring anomalies identified through field deployment. The Basel capital adequacy framework has passed through three major iterations (Basel I, II, and III) [4], with each successive accord addressing weaknesses exposed by financial crises and supervisory experience. The FICO score algorithm has been updated more than a dozen times since its introduction [5], with each update correcting biases, gaming vulnerabilities, and edge-case behaviors identified through analysis of billions of scoring events. The common thread across these examples is that measurement systems mature through iterative correction, not through initial perfection.
The Amplitude specification, version 3.1, defines ten scoring frameworks that collectively measure AI impact across data quality, agent behavior, and ecosystem dynamics. Each framework specifies a mathematical model for computing a composite score from multiple input dimensions, along with normalization procedures, aggregation functions, and score interpretation guidelines. The specification spans approximately 140 pages and contains several hundred mathematical formulas, conditional logic statements, and parameter definitions. The stress-testing effort described in this paper subjected every formula, conditional, and parameter to systematic boundary analysis [6], producing the seventeen corrections documented herein.
The stress-testing methodology combines three complementary techniques: analytical boundary analysis [6], where each formula is evaluated at extreme and degenerate input values; Monte Carlo fuzzing [7], where random inputs drawn from pathological distributions are fed through the scoring pipeline; and adversarial construction, where inputs are deliberately crafted to trigger specific failure modes predicted by the analytical analysis. The combination of these techniques ensures broad coverage of the input space while maintaining the ability to target specific categories of potential errors. Each identified error was verified independently by at least two analysts before being classified as a genuine correction.
Approach
The systematic error identification process began with a taxonomy of potential error categories, derived from the metrology literature [1] and from experience with prior measurement system implementations. The taxonomy includes seven categories:
- Division-by-zero errors, where a formula divides by a quantity that can be zero.
- Overflow and underflow errors, where intermediate calculations exceed numerical precision limits [2].
- Undefined behavior errors, where the specification does not define the output for a valid input combination.
- Cold-start errors, where the scoring model produces meaningless or misleading scores when insufficient historical data is available.
- Missing-data errors, where the specification does not define behavior when required input fields are absent.
- Boundary discontinuity errors, where the score function has a discontinuity at a threshold value that produces counterintuitive score jumps.
- Aggregation anomaly errors, where the aggregation function produces a composite score that violates monotonicity or other expected properties.
For each of the ten scoring frameworks, the stress-testing team enumerated every mathematical formula and conditional statement, identified the input variables and their domains, and determined the extreme and degenerate values within each domain. A degenerate value is one that causes a formula to produce an undefined result (such as division by zero) or a result that lies outside the expected output range. The team then evaluated each formula at its degenerate inputs and recorded the result. For conditional statements, the team identified boundary values where the condition transitions from true to false and verified that the score function behaves continuously and intuitively across the boundary.
The Monte Carlo fuzzing component generated 10 million random input vectors for each framework, drawn from distributions designed to overrepresent extreme and pathological values [7]. Standard Monte Carlo testing uses uniform or normal distributions, but these distributions rarely produce the degenerate inputs that trigger edge-case errors. Our fuzzing distributions are a mixture of uniform, normal, point-mass (at zero, one, and boundary values), and heavy-tailed distributions, with mixing weights calibrated to ensure that at least 5% of generated inputs fall in each degenerate category identified by the analytical analysis. Each input vector is processed through the scoring pipeline, and the output is checked against four validity criteria: the score lies within the expected range [0, 100]; the score is a finite number (not NaN or infinity); the score varies continuously with continuous input variation; and the score respects monotonicity constraints specified in the framework documentation.
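The mixture distribution and the first two validity criteria can be sketched as follows. This is a minimal illustration, not the production fuzzer: the component probabilities, the unit input domain, and the helper names are assumptions; only the point-mass/heavy-tail mixture idea and the 5% degenerate-share target come from the text.

```python
import math
import random

def draw_fuzz_value(rng, lo=0.0, hi=1.0):
    """Draw one input value, deliberately overrepresenting degenerate values."""
    u = rng.random()
    if u < 0.15:                       # point masses at zero and the boundaries
        return rng.choice([lo, hi, 0.0])
    elif u < 0.45:                     # uniform over the nominal domain
        return rng.uniform(lo, hi)
    elif u < 0.75:                     # normal centered mid-domain
        return rng.gauss((lo + hi) / 2, (hi - lo) / 4)
    else:                              # heavy-tailed; may exceed the nominal
        return lo + rng.paretovariate(1.5) - 1.0   # domain, as pathological inputs do

def is_valid_score(score, lo=0.0, hi=100.0):
    """Check two of the validity criteria: finite, and within [lo, hi]."""
    return math.isfinite(score) and lo <= score <= hi

rng = random.Random(42)
samples = [draw_fuzz_value(rng) for _ in range(10_000)]
# The point-mass component guarantees a sizeable share of degenerate inputs.
degenerate_share = sum(1 for x in samples if x in (0.0, 1.0)) / len(samples)
```

Continuity and monotonicity, the other two criteria, require evaluating the scoring pipeline at perturbed input pairs and are omitted from this sketch.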
The adversarial construction component targets specific hypotheses about potential errors. For example, the Cascade framework computes a behavioral correlation score that involves dividing the number of correlated agent pairs by the total number of agent pairs. When the network contains only one agent, the total number of pairs is zero, and the correlation score is undefined. Adversarial construction generates inputs with exactly one agent and verifies that the scoring pipeline handles this degenerate case gracefully. Similarly, the Torque framework computes a multiplicative efficiency score from five dimension scores, each in [0, 1]. When any dimension score is exactly zero, the Torque score is zero regardless of the other dimensions, which may or may not be the intended behavior depending on whether zero is a valid dimension score or an indicator of missing data.
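The two adversarial cases above can be made concrete with a short sketch. The function names and the `None` sentinel for the unscorable case are illustrative assumptions, not text from the Amplitude specification.

```python
def cascade_correlation_score(correlated_pairs, total_pairs):
    """Share of agent pairs that are behaviorally correlated."""
    if total_pairs == 0:
        # A one-agent network has zero pairs: no correlation structure
        # exists, so signal "unscorable" rather than divide by zero.
        return None
    return correlated_pairs / total_pairs

def torque_efficiency_score(dimension_scores):
    """Multiplicative efficiency score over dimension scores in [0, 1]."""
    score = 1.0
    for d in dimension_scores:
        score *= d          # any exact zero collapses the whole product
    return score

# Adversarial inputs targeting the predicted failure modes:
one_agent = cascade_correlation_score(0, 0)                     # degenerate network
zero_dim = torque_efficiency_score([0.9, 0.8, 0.0, 0.7, 1.0])   # one zero dimension
```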
Findings
Of the seventeen corrections, six are division-by-zero errors, four are cold-start scoring errors, three are missing-data handling errors, two are boundary discontinuity errors, and two are aggregation anomaly errors. The division-by-zero errors follow a consistent pattern: a formula normalizes a quantity by dividing by a measure of scale (total transaction volume, number of agent pairs, market size) that can be zero in degenerate but valid scenarios. In each case, the correction adds a guard clause that returns a default score when the denominator is zero, with the default chosen to reflect the appropriate interpretation of the degenerate case. For example, when the Harmony framework divides the innovation rate by the number of market participants and the number of participants is zero, the corrected behavior is to return a score of 50 (neutral) rather than producing a NaN, because a market with zero participants has no competitive dynamics to evaluate.
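The guard-clause pattern behind these corrections is simple to express. In this sketch the neutral default of 50 comes from the Harmony example above, while the function name, the per-participant scaling, and the 0-100 clamp are illustrative assumptions.

```python
NEUTRAL_SCORE = 50.0

def harmony_innovation_score(innovation_rate, num_participants):
    """Innovation rate normalized by market size, guarded at zero."""
    if num_participants == 0:
        # Zero participants => no competitive dynamics to evaluate:
        # return the neutral default instead of producing NaN.
        return NEUTRAL_SCORE
    per_participant = innovation_rate / num_participants
    return max(0.0, min(100.0, 100.0 * per_participant))   # clamp to [0, 100]
```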
The cold-start scoring errors arise when frameworks compute scores based on historical data and insufficient history is available. The most significant cold-start error was in the Fidelity framework, which computes a trust decay rate based on the ratio of verified to unverified alignment measurements. During the cold-start period, when fewer than 10 measurements are available, the ratio is highly unstable and can produce trust decay rates that are either unrealistically high or negative. The correction introduces a Bayesian prior [8] that pulls the cold-start estimate toward the population mean trust decay rate, with the prior's influence diminishing as more measurements accumulate. The correction uses a beta distribution prior [9] with parameters calibrated to the population-level trust decay distribution, ensuring that cold-start scores are conservative without being uninformative.
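The shrinkage behavior of a beta prior can be sketched with its posterior mean. The prior parameters here stand in for values calibrated to the population-level distribution; the numbers and the function name are assumptions.

```python
def shrunk_verified_fraction(n_verified, n_total, alpha=8.0, beta=2.0):
    """Posterior-mean verified fraction under a Beta(alpha, beta) prior.

    With few measurements the estimate stays near the prior mean
    alpha / (alpha + beta); as n_total grows, the data dominate.
    """
    return (alpha + n_verified) / (alpha + beta + n_total)

# Cold start: 1 verified of 2 total. The raw ratio (0.5) is unstable;
# the shrunk estimate sits near the assumed prior mean of 0.8.
cold_estimate = shrunk_verified_fraction(1, 2)        # (8 + 1) / (10 + 2) = 0.75
# Data-rich: 500 of 1,000; the prior contributes almost nothing.
warm_estimate = shrunk_verified_fraction(500, 1000)
```

The diminishing influence of the prior falls out of the arithmetic: the fixed pseudo-counts alpha and beta are swamped once the real counts grow large.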
The missing-data handling errors reflect a systematic gap in the original specification: the specification defines scoring behavior when all required inputs are present but does not consistently define behavior when inputs are partially missing. Three frameworks, Drift, Torque, and Meridian, contain formulas that require inputs from external data sources that may be temporarily unavailable. The corrected specification introduces a three-tier missing-data protocol [10]. Tier one, for inputs missing for fewer than 24 hours, uses the most recent available value with a staleness penalty that linearly reduces the dimension score by 1% per hour of staleness. Tier two, for inputs missing for 24-168 hours, substitutes the trailing 30-day average with a confidence penalty that reduces the dimension weight in the composite score. Tier three, for inputs missing for more than 168 hours, excludes the dimension from the composite score and flags the overall score as partial.
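The three tiers can be sketched as a single dispatch function. The tier boundaries (24 and 168 hours) and the 1%-per-hour staleness penalty come from the text; the return convention and the tier-two confidence factor are assumptions.

```python
def handle_missing_input(hours_stale, last_value, trailing_30d_avg):
    """Return (substituted value, weight multiplier, status flag)."""
    if hours_stale < 24:
        # Tier 1: most recent value, penalized 1% per hour of staleness.
        return last_value * (1.0 - 0.01 * hours_stale), 1.0, "stale"
    elif hours_stale <= 168:
        # Tier 2: trailing 30-day average with reduced dimension weight
        # (the 0.5 confidence factor is an illustrative assumption).
        return trailing_30d_avg, 0.5, "reduced-confidence"
    else:
        # Tier 3: exclude the dimension; the composite is flagged partial.
        return None, 0.0, "partial"
```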
The boundary discontinuity errors occur in frameworks that use threshold-based classification within their scoring logic. The Harmony framework, for instance, classifies market concentration as low (HHI < 1,500), moderate (1,500 <= HHI < 2,500), or high (HHI >= 2,500) [11], with different scoring formulas applied in each regime. At the regime boundaries, the scoring formula changes discontinuously, producing counterintuitive behavior: a market with HHI = 2,499 might receive a significantly higher score than a market with HHI = 2,501, despite the near-identical concentration levels. The correction replaces the hard threshold classification with a sigmoid transition function [12] that interpolates smoothly between regimes over a window of +/- 200 HHI points around each boundary, eliminating the discontinuity while preserving the regime-specific scoring semantics.
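A sigmoid blend of this kind can be sketched directly. The 1,500 and 2,500 boundaries and the +/- 200-point window come from the text; the per-regime scores below are placeholder assumptions standing in for the regime-specific formulas.

```python
import math

def regime_weight(hhi, boundary, window=200.0):
    """Sigmoid rising from ~0 below the boundary to ~1 above it,
    with most of the transition inside boundary +/- window."""
    return 1.0 / (1.0 + math.exp(-(hhi - boundary) / (window / 4.0)))

def harmony_concentration_score(hhi):
    low, moderate, high = 80.0, 60.0, 30.0   # placeholder regime scores
    w_mod = regime_weight(hhi, 1500.0)       # weight shifting low -> moderate
    w_high = regime_weight(hhi, 2500.0)      # weight shifting moderate -> high
    return low * (1 - w_mod) + moderate * (w_mod - w_high) + high * w_high
```

Under this sketch, markets at HHI 2,499 and 2,501 receive scores a fraction of a point apart, rather than jumping the full gap between the moderate and high regime formulas.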
The aggregation anomaly errors involve cases where the composite score violates monotonicity: improving a dimension score while holding all other dimensions constant causes the composite score to decrease. This counterintuitive behavior arises from the interaction between dimension normalization and aggregation in two frameworks. In one case, improving a raw dimension score causes the normalized score to increase past a threshold that triggers a different aggregation weight, and the new weight is lower than the previous weight, causing the composite score to decrease. The correction ensures monotonicity by constraining the aggregation weights to be non-decreasing functions of the dimension scores [13], so that improvement in any dimension can never cause degradation of the composite score.
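The anomaly and its fix can be contrasted in a single-dimension toy example; the specific weights and the 0.8 threshold are illustrative assumptions.

```python
def contribution_broken(d):
    """Original anomaly: crossing the threshold swaps in a LOWER
    aggregation weight, so improving d can shrink its contribution."""
    weight = 0.9 if d < 0.8 else 0.5     # weight drops past the threshold
    return weight * d

def contribution_fixed(d):
    """Correction: the weight is a continuous, non-decreasing function
    of d, so the term weight * d is non-decreasing for d >= 0."""
    weight = 0.5 + 0.4 * min(d / 0.8, 1.0)   # rises smoothly from 0.5 to 0.9
    return weight * d
```

The fix works because the product of two non-negative, non-decreasing functions is itself non-decreasing, so no improvement in a dimension can reduce its contribution to the composite.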
Implications
The pattern of seventeen errors across ten frameworks reveals structural challenges that are likely to recur in any production-grade measurement system for AI impact. The most important lesson is that mathematical correctness in the typical case is insufficient; a production specification must define behavior for every valid input combination, including degenerate cases that may seem unlikely but will inevitably occur at scale. A measurement system that processes millions of scoring events per day will encounter every edge case in its input domain within weeks of deployment. The cost of handling edge cases in advance through specification rigor is orders of magnitude lower than the cost of handling them reactively through incident response and score restatement.
The concentration of errors in division-by-zero and cold-start categories suggests that these are the most common failure modes for multi-dimensional scoring systems. Both categories arise from the same underlying cause: the mathematical model assumes a normal operating regime that is well-defined and data-rich, and fails to account for the transient and degenerate states that precede or surround normal operation. Future measurement system designers should adopt a defensive specification methodology that begins with the degenerate cases and works outward toward the normal regime, rather than the conventional approach of specifying the normal regime first and adding edge-case handling as an afterthought.
The missing-data corrections highlight a challenge that is specific to real-time measurement systems operating in distributed environments [10]. Unlike batch scoring systems that can wait for all inputs before computing a score, real-time measurement systems must produce scores on a continuous basis even when some inputs are temporarily unavailable. The three-tier missing-data protocol introduced in the corrections represents a general-purpose approach that other measurement systems can adopt: short-term substitution with staleness penalties, medium-term substitution with confidence penalties, and long-term exclusion with partial-score flagging. The key design principle is that missing data should degrade the score gracefully rather than producing either a crash or an artificially confident score.
The boundary discontinuity corrections illustrate a broader tension in scoring system design between interpretability and mathematical smoothness. Threshold-based classification is intuitive and easy to communicate to stakeholders, which is why it appears in the Harmony framework and in countless other scoring systems. But hard thresholds create discontinuities that are both mathematically problematic and economically perverse: agents near a threshold face asymmetric incentives to game their way to the favorable side of the boundary [14]. Sigmoid transition functions provide a practical compromise, preserving the interpretive clarity of regime-based scoring while eliminating the discontinuities that create gaming incentives and scoring anomalies.
The fact that seventeen corrections were identified and addressed before production deployment should be viewed as a success of the stress-testing process rather than a failure of the original specification. Measurement systems that do not undergo rigorous stress testing before deployment will encounter these same categories of errors in production, where the consequences include incorrect scores, damaged stakeholder trust, and regulatory non-compliance. The Amplitude specification's stress-testing process consumed approximately 2,400 person-hours over twelve weeks, representing a significant investment that is nonetheless small relative to the cost of deploying a measurement system with seventeen latent errors to a production environment serving thousands of agents and organizations.
References
- Rabinovich, S. G. (2005). Measurement Errors and Uncertainties: Theory and Practice (3rd ed.). Springer.
- Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM.
- FIRST.Org. (2023). Common Vulnerability Scoring System v4.0: Specification Document. Forum of Incident Response and Security Teams.
- Basel Committee on Banking Supervision. (2011). Basel III: A Global Regulatory Framework for More Resilient Banks and Banking Systems. Bank for International Settlements.
- Fair Isaac Corporation. (2009). Understanding Your FICO Score. FICO.
- Myers, G. J., Sandler, C., & Badgett, T. (2011). The Art of Software Testing (3rd ed.). John Wiley & Sons.
- Metropolis, N., & Ulam, S. (1949). The Monte Carlo Method. Journal of the American Statistical Association, 44(247), 335-341.
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman and Hall/CRC.
- Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous Univariate Distributions (Vol. 2, 2nd ed.). John Wiley & Sons.
- Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). John Wiley & Sons.
- U.S. Department of Justice & Federal Trade Commission. (2023). Merger Guidelines. U.S. Government Publishing Office.
- Han, J., & Moraga, C. (1995). The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning. In J. Mira & F. Sandoval (Eds.), From Natural to Artificial Neural Computation (pp. 195-201). Springer.
- Fishburn, P. C. (1982). Monotonicity Paradoxes in the Theory of Elections. Discrete Applied Mathematics, 4(2), 119-134.
- Goodhart, C. A. E. (1984). Problems of Monetary Management: The U.K. Experience. In C. A. E. Goodhart (Ed.), Monetary Theory and Practice: The U.K. Experience (pp. 91-121). Macmillan.