Methodology · Dec 5, 2025

One Framework Becomes Ten: The Amplitude Thesis for Multi-Layer AI Impact Measurement

Abstract

We present the Amplitude architecture: ten independently validated scoring frameworks organized across three layers (Data, Agent, and Ecosystem). Each framework answers a distinct question about how AI interacts with the world, using the mathematically optimal aggregation method for the specific phenomenon it measures. Three operational tiers (Transaction-Grade at under 100 milliseconds, Monitoring-Grade at seconds to minutes, and Assessment-Grade at hours to days) ensure each framework operates at the speed its measurement requires.

Background

The initial impulse in AI impact measurement is to seek a single, unified framework that captures everything relevant about how AI affects the world. This impulse is understandable: a single framework is simpler to implement, easier to communicate, and more straightforward to regulate against. However, the history of measurement science demonstrates that complex, multi-faceted phenomena resist reduction to a single framework [1]. Credit risk measurement uses different models for consumer credit, corporate credit, and sovereign credit [2]. Medical diagnostics use different assessment instruments for different organ systems. Environmental monitoring uses different measurement protocols for air quality, water quality, and soil contamination. In each case, the phenomenon being measured is sufficiently complex and heterogeneous that no single methodology can capture all relevant dimensions without sacrificing either precision or coverage.

AI impact exhibits the same structural complexity. The factors that determine the quality of data consumed by an AI agent are fundamentally different from the factors that determine the fairness of the agent's decisions, which are in turn fundamentally different from the factors that determine the systemic risk of AI deployment across an ecosystem. These phenomena operate at different scales (individual data sources versus population-level outcomes), at different speeds (real-time inference versus long-term societal trends), and with different causal structures (direct measurement versus cascading indirect effects). Attempting to capture all of these phenomena in a single framework inevitably produces either a framework so abstract that it loses measurement precision or a framework so detailed that it becomes computationally intractable.

The Amplitude thesis resolves this tension through architectural decomposition [3]: instead of one framework that measures everything imprecisely, ten frameworks each measure a specific phenomenon with the precision that phenomenon demands. The ten frameworks are organized across three layers that correspond to the structural levels at which AI interacts with the world: the Data layer (how AI consumes and transforms information), the Agent layer (how AI systems behave and make decisions), and the Ecosystem layer (how AI deployment affects markets, societies, and institutions).

This architectural choice has a precise mathematical motivation. Different phenomena exhibit different aggregation properties [4]. Some quantities aggregate naturally through arithmetic means (independent, additive effects), others through geometric means (multiplicative, non-compensatory effects), and others through harmonic means (rate-limited, bottleneck-constrained effects). Using the wrong aggregation method for a given phenomenon introduces systematic measurement error [5]. By decomposing AI impact into ten frameworks, each with its own mathematically appropriate aggregation method, Amplitude eliminates the aggregation-method mismatch that plagues monolithic approaches.
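As a concrete illustration of why the aggregation choice matters, the three means named above can diverge sharply on the same component scores. A minimal sketch using Python's standard library (the scores are illustrative, not drawn from any Amplitude framework):

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Same component scores aggregated three ways. The means always satisfy
# harmonic <= geometric <= arithmetic (the AM-GM-HM inequality), and the
# gap widens as the inputs become more unequal.
scores = [0.9, 0.8, 0.3]

am = fmean(scores)            # additive, compensatory effects
gm = geometric_mean(scores)   # multiplicative, non-compensatory effects
hm = harmonic_mean(scores)    # rate-limited, bottleneck-constrained effects

print(f"arithmetic={am:.3f}  geometric={gm:.3f}  harmonic={hm:.3f}")
# arithmetic=0.667  geometric=0.600  harmonic=0.527
```

The low third component drags the harmonic mean down far more than the arithmetic mean, which is exactly the behavior a bottleneck-constrained phenomenon requires and an additive phenomenon does not.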

Approach

The Data layer comprises three frameworks that measure the quality, value, and defensibility of data as it flows through AI systems. The Meridian framework scores external data sources across four dimensions (Scarcity, Quality, Decision Impact, Defensibility) using a weighted geometric mean [4], as detailed in our companion paper. The Data Flow Integrity framework measures the degree to which data maintains its accuracy, completeness, and provenance as it passes through transformation pipelines, using an arithmetic mean of stage-wise integrity scores that reflects the independent, additive nature of pipeline-stage effects. The Data Valuation framework translates quality scores into economic signals using the derivation model that maps Meridian dimensions to pricing components.

The Agent layer comprises four frameworks that measure different aspects of AI agent behavior. The Fidelity framework measures the degree to which an agent's outputs are faithful to its inputs and instructions, capturing hallucination rates, instruction adherence, and factual grounding [6]. The Fairness framework measures distributional equity in agent decisions across protected groups, using a harmonic mean of group-level outcome ratios that is mathematically sensitive to the worst-performing group [7]. The Transparency framework measures the degree to which an agent's decision-making process is observable, interpretable, and explainable [8]. The Autonomy framework measures the degree to which an agent operates independently versus under human oversight, capturing delegation depth, override frequency, and human-in-the-loop participation rates.

The Ecosystem layer comprises three frameworks that measure systemic effects of AI deployment. The Concentration framework measures the degree to which AI deployment creates market concentration or dependency on specific providers, using network-theoretic metrics that capture both direct and indirect concentration effects [9]. The Resilience framework measures the degree to which the AI ecosystem can absorb shocks such as model failures, data source disruptions, or adversarial attacks, using stress-testing methodologies adapted from financial systemic risk assessment [2]. The Societal Impact framework measures the aggregate effects of AI deployment on labor markets, information quality, democratic processes, and economic inequality, using longitudinal indicators that track changes over time.

Each framework is independently validated against ground-truth assessments produced by domain experts [10]. This independence is a deliberate architectural choice: it ensures that each framework can be evaluated, refined, and deployed on its own merits rather than being constrained by the requirements of the other frameworks. It also enables modular adoption: organizations can implement the frameworks most relevant to their context without being required to implement the full Amplitude suite.

Findings

The ten frameworks and their aggregation methods are selected based on the mathematical properties of the phenomena they measure. The Meridian framework uses a weighted geometric mean because data quality is a non-compensatory phenomenon: a data source with zero defensibility should receive a zero composite score regardless of its other qualities [4]. The geometric mean has the mathematical property that any zero input produces a zero output, making it the natural aggregation for non-compensatory constructs. The weights in the geometric mean are calibrated through expert elicitation and validated through sensitivity analysis.
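The zero-propagation property of the weighted geometric mean can be sketched directly. The dimension order follows the Meridian description above, but the weight values here are illustrative placeholders, not the expert-calibrated weights from the companion paper:

```python
import math

def weighted_geometric_mean(scores, weights):
    """Weighted geometric mean: prod(s_i ** w_i) with weights summing to 1.
    Any zero input forces the composite to zero (non-compensatory)."""
    if len(scores) != len(weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must align with scores and sum to 1")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights)))

# Illustrative weights only: Scarcity, Quality, Decision Impact, Defensibility.
weights = [0.25, 0.30, 0.25, 0.20]
print(weighted_geometric_mean([0.7, 0.9, 0.8, 0.6], weights))  # strong source
print(weighted_geometric_mean([0.7, 0.9, 0.8, 0.0], weights))  # zero defensibility -> 0.0
```

No score on the other three dimensions can compensate for the zero: the composite collapses to zero, which is the non-compensatory behavior the text describes.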

The Fidelity framework uses an arithmetic mean of component scores (factual accuracy, instruction adherence, output consistency) because these components are independently meaningful and additive in their contribution to overall fidelity [11]. A model that is 90% factually accurate and 80% instruction-adherent has a meaningfully different fidelity profile than a model that is 80% factually accurate and 90% instruction-adherent, and the arithmetic mean preserves this distinction while providing a sensible composite. The Transparency framework similarly uses arithmetic aggregation because its components (observability, interpretability, explainability) are independently valuable.

The Fairness framework uses a harmonic mean of group-level outcome ratios because fairness is fundamentally a bottleneck-constrained phenomenon: the fairness of a system is determined by its treatment of the worst-off group, not by its average treatment across groups [7]. The harmonic mean is mathematically dominated by its smallest input, making it naturally sensitive to the group that receives the worst outcomes [4]. This property aligns the aggregation method with the ethical principle that fairness requires attending to the most disadvantaged, not merely optimizing the average.

The Concentration framework uses network-theoretic centrality metrics rather than traditional statistical aggregation because concentration is a structural property of networks rather than a distributional property of populations [9]. The Herfindahl-Hirschman Index, commonly used for market concentration [12], is extended with eigenvector centrality measures that capture indirect concentration through dependency chains. The Resilience framework uses stress-test methodologies that simulate failure scenarios and measure the degree to which the ecosystem maintains functionality [2], producing scores that reflect worst-case behavior rather than average-case behavior.
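The HHI component of the Concentration framework is straightforward to compute; the eigenvector-centrality extension requires the full dependency graph and is omitted here. A minimal sketch on fractional shares (the provider shares are hypothetical):

```python
def hhi(market_shares):
    """Herfindahl-Hirschman Index on shares expressed as fractions summing
    to 1; ranges from 1/n (even split among n providers) to 1.0 (monopoly).
    Higher values indicate greater concentration."""
    assert abs(sum(market_shares) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(s * s for s in market_shares)

# Four AI providers with unequal shares of agent traffic (illustrative).
print(hhi([0.40, 0.30, 0.20, 0.10]))  # ~0.30: unequal split
print(hhi([0.25, 0.25, 0.25, 0.25]))  # 0.25: even-split lower bound for n=4
```

The gap between the two values is what the framework's network-theoretic extension then refines: two ecosystems with identical share-based HHI can differ substantially once indirect dependency chains are counted.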

Across all ten frameworks, we find that the choice of aggregation method has a material impact on measurement outcomes [5]. In controlled experiments comparing arithmetic, geometric, and harmonic means applied to the same underlying data, the choice of aggregation method changes the ranking of measured entities in 30-45% of cases. This finding validates the Amplitude thesis: using a single aggregation method across all ten phenomena would introduce systematic measurement error in roughly a third to nearly half of evaluations, a rate that is unacceptable for a measurement system intended to inform governance and accountability decisions.
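The shape of this sensitivity experiment can be sketched with a small Monte Carlo simulation. The entity counts, component distributions, and seed below are assumptions for illustration; the reported 30-45% figure comes from the controlled experiments, not from this sketch:

```python
import random
from statistics import fmean, geometric_mean, harmonic_mean

def rank(entities, agg):
    """Rank entity ids from best to worst under an aggregation method."""
    return sorted(entities, key=lambda e: agg(entities[e]), reverse=True)

# Score synthetic entities on three components, then compare rankings
# under the three means. Distributional choices here are assumptions.
random.seed(0)
entities = {i: [random.uniform(0.2, 1.0) for _ in range(3)] for i in range(50)}

r_am = rank(entities, fmean)
r_gm = rank(entities, geometric_mean)
r_hm = rank(entities, harmonic_mean)

moved = sum(1 for pos, e in enumerate(r_am)
            if r_gm.index(e) != pos or r_hm.index(e) != pos)
print(f"{moved}/50 entities change rank when the aggregation method changes")
```

Entities with uneven component profiles are the ones that move: a low single component hurts far more under the harmonic mean than under the arithmetic mean, reordering them relative to more balanced peers.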

Implications

The three operational tiers ensure that each framework operates at the speed its measurement requires [13]. Transaction-Grade scoring, at under 100 milliseconds, is required for frameworks whose scores must be available during agent inference. The Meridian framework and the Data Valuation framework operate at Transaction-Grade because their scores inform data consumption decisions that agents make during tool calls. The Fidelity framework operates at Transaction-Grade because fidelity assessment of agent outputs must be available before those outputs are delivered to users or downstream systems.

Monitoring-Grade scoring, at seconds to minutes, is required for frameworks whose scores inform operational decisions but do not need to be available within a single inference cycle. The Fairness framework operates at Monitoring-Grade because distributional equity assessment requires accumulating decisions over a population before statistical patterns become detectable [14]. The Transparency and Autonomy frameworks operate at Monitoring-Grade because they track behavioral patterns that emerge over multiple interactions rather than within a single interaction. The Data Flow Integrity framework operates at Monitoring-Grade because pipeline integrity assessment requires observing data transformations across multiple processing stages.

Assessment-Grade scoring, at hours to days, is required for frameworks whose scores inform strategic and governance decisions. The Concentration framework operates at Assessment-Grade because network-theoretic analysis of market structure requires comprehensive data collection and computationally intensive graph analysis [9]. The Resilience framework operates at Assessment-Grade because stress-testing involves simulating multiple failure scenarios and measuring ecosystem response [2]. The Societal Impact framework operates at Assessment-Grade because longitudinal societal indicators require data collection over extended time periods and sophisticated causal analysis to distinguish AI effects from other factors.
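The tier assignments described above can be captured as a declarative configuration. This is a sketch of one possible representation, not an Amplitude API; the framework and tier names follow the text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    latency_budget: str  # upper bound on time-to-score

TRANSACTION = Tier("Transaction-Grade", "under 100 ms")
MONITORING = Tier("Monitoring-Grade", "seconds to minutes")
ASSESSMENT = Tier("Assessment-Grade", "hours to days")

# Mapping of the ten frameworks to their operational tiers, per the text.
FRAMEWORK_TIERS = {
    "Meridian": TRANSACTION,
    "Data Valuation": TRANSACTION,
    "Fidelity": TRANSACTION,
    "Fairness": MONITORING,
    "Transparency": MONITORING,
    "Autonomy": MONITORING,
    "Data Flow Integrity": MONITORING,
    "Concentration": ASSESSMENT,
    "Resilience": ASSESSMENT,
    "Societal Impact": ASSESSMENT,
}

assert len(FRAMEWORK_TIERS) == 10  # all ten frameworks have a tier
```

A mapping like this makes the modular-adoption path explicit: an implementation can filter the table to one layer or one tier and deploy only those frameworks.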

The three-tier architecture has a critical practical implication: organizations do not need to implement all ten frameworks simultaneously to derive value from the Amplitude system [15]. An organization primarily concerned with data quality can implement the Data layer frameworks at Transaction-Grade and derive immediate operational benefit. An organization concerned with fairness and transparency can implement the relevant Agent layer frameworks at Monitoring-Grade. An organization concerned with systemic risk can implement the Ecosystem layer frameworks at Assessment-Grade. The modular, tier-aware architecture enables incremental adoption that matches organizational priorities and capabilities, while the overarching Amplitude architecture ensures that the independently adopted frameworks are structurally compatible and can be integrated as the organization's measurement maturity grows.

References

  1. Finkelstein, L. (2003). Widely, strongly and weakly defined measurement. Measurement, 34(1), 39-48.
  2. Basel Committee on Banking Supervision. (2006). International Convergence of Capital Measurement and Capital Standards: A Revised Framework. Bank for International Settlements.
  3. Parnas, D. L. (1972). On the Criteria to Be Used in Decomposing Systems into Modules. Communications of the ACM, 15(12), 1053-1058.
  4. Hardy, G. H., Littlewood, J. E., & Pólya, G. (1952). Inequalities (2nd ed.). Cambridge University Press.
  5. Nardo, M., Saisana, M., Saltelli, A., Tarantola, S., Hoffman, A., & Giovannini, E. (2008). Handbook on Constructing Composite Indicators: Methodology and User Guide. OECD Publishing.
  6. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1-38.
  7. Chouldechova, A. (2017). Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data, 5(2), 153-163.
  8. Lipton, Z. C. (2018). The Mythos of Model Interpretability. Queue, 16(3), 31-57.
  9. Newman, M. E. J. (2010). Networks: An Introduction. Oxford University Press.
  10. Lord, F. M., & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley.
  11. Stevens, S. S. (1946). On the Theory of Scales of Measurement. Science, 103(2684), 677-680.
  12. Rhoades, S. A. (1993). The Herfindahl-Hirschman Index. Federal Reserve Bulletin, 79, 188-189.
  13. Dean, J., & Barroso, L. A. (2013). The Tail at Scale. Communications of the ACM, 56(2), 74-80.
  14. Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning: Limitations and Opportunities. fairmlbook.org.
  15. Baldwin, C. Y., & Clark, K. B. (2000). Design Rules: The Power of Modularity. MIT Press.