Methodology · Jan 5, 2026

On the Selection of Aggregation Functions for AI Impact Measurement

Abstract

We present a principled methodology for selecting aggregation functions in multi-dimensional AI impact scoring. Rather than defaulting to arithmetic mean, each of the ten Amplitude frameworks employs the mathematically optimal aggregation for the specific phenomenon it measures: geometric mean for non-compensatory trust dimensions, harmonic mean for weakest-link security analysis, multiplicative chain for all-components-required oversight, ceiling-constrained mean for outcome-bounded fairness, and arithmetic mean for compensatory market dynamics. Each selection is justified by mapping to how regulators and courts actually evaluate these phenomena.

Background

The selection of an aggregation function is one of the most consequential design decisions in any multi-dimensional scoring system, yet it is often treated as a technical afterthought. When researchers or practitioners need to combine multiple dimension scores into a single composite, the arithmetic mean is the default choice in the overwhelming majority of cases [1]. This default is rarely justified by analysis of the phenomenon being measured; it is chosen because it is familiar, easy to compute, and intuitively interpretable. The arithmetic mean implies that all dimensions are fully compensatory: a high score on one dimension can offset a low score on another dimension, with the composite reflecting the average quality across all dimensions.

The compensatory property is appropriate for some phenomena and catastrophically wrong for others [2]. Consider a university admissions score that combines academic performance, extracurricular involvement, and essay quality. It is reasonable for exceptional academic performance to partially compensate for modest extracurricular involvement; the admissions decision is genuinely compensatory across these dimensions. Now consider an aircraft safety score that combines engine reliability, structural integrity, and control system functionality [3]. It would be absurd for exceptional engine reliability to compensate for a non-functional control system; safety is fundamentally non-compensatory across these dimensions.

AI impact measurement spans both compensatory and non-compensatory phenomena, and a single aggregation function cannot correctly represent both [4]. Trust is non-compensatory: an agent that is perfectly consistent but never fulfills its contracts is not moderately trustworthy. Security is weakest-link: a system with five excellent defenses and one catastrophic vulnerability is insecure. Oversight is all-or-nothing: an oversight system missing any critical component provides zero effective oversight. Fairness is outcome-bounded: procedural perfection cannot exceed the ceiling set by disparate outcomes. Market dynamics are compensatory: moderate performance across multiple economic dimensions can reflect genuine aggregate value.

The failure to match aggregation functions to phenomena has concrete consequences. A security assessment that uses arithmetic mean aggregation will systematically overestimate the security of systems with uneven defense profiles, because the arithmetic mean allows strong defenses to compensate for weak ones in the composite score. A trust assessment that uses arithmetic mean will systematically overestimate the trustworthiness of agents with compensating strengths and weaknesses. An oversight assessment that uses arithmetic mean will systematically overestimate the effectiveness of oversight systems with missing components. In each case, the aggregation function creates a gap between the composite score and the reality it purports to measure.

Approach

We define five families of aggregation functions and establish the mathematical properties that distinguish them. The arithmetic mean, AM(x_1, ..., x_n) = (1/n) * sum(x_i), is the fully compensatory aggregation in which each dimension contributes linearly and additively to the composite [5]. The arithmetic mean is appropriate when dimensions are substitutable: more of one dimension genuinely compensates for less of another. The weighted variant allows dimensions to contribute unequally while preserving full compensability. The Amplitude architecture uses the weighted arithmetic mean for the Harmony competitive dynamics framework, where strong performance in market accessibility can genuinely compensate for moderate performance in innovation metrics.
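As a concrete reference, a minimal Python sketch of the weighted arithmetic mean follows; the scores and weights are illustrative, not Amplitude's actual Harmony dimension schema.

```python
# Weighted arithmetic mean on the 0-100 scale: fully compensatory,
# since every dimension contributes linearly and additively.
def weighted_arithmetic_mean(scores, weights):
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight

# A strong accessibility score offsets a moderate innovation score.
print(weighted_arithmetic_mean([90, 60], [0.5, 0.5]))  # 75.0
```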

The geometric mean, GM(x_1, ..., x_n) = (product(x_i))^(1/n), is the partially non-compensatory aggregation in which each dimension contributes multiplicatively [6]. The geometric mean penalizes imbalance: a score vector of (90, 90, 90) produces a higher geometric mean than a vector of (100, 100, 70) despite both having the same arithmetic mean. More critically, the geometric mean approaches zero as any single dimension approaches zero, ensuring that catastrophic weakness in one dimension cannot be fully compensated by excellence in others. The Amplitude architecture uses the weighted geometric mean for the Fidelity behavioral trust framework and the Provenance identity framework, where trust and identity are non-compensatory across their respective dimensions.
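A sketch of the weighted geometric mean follows, computed through logarithms for numerical stability; the epsilon floor for zero inputs is our assumption, since the pure geometric mean is exactly zero there.

```python
import math

# Weighted geometric mean: exponentiate the weighted mean of logs.
# max(s, eps) guards against log(0); eps is an illustrative choice.
def weighted_geometric_mean(scores, weights, eps=1e-9):
    total_weight = sum(weights)
    log_sum = sum(w * math.log(max(s, eps)) for s, w in zip(scores, weights))
    return math.exp(log_sum / total_weight)

print(round(weighted_geometric_mean([90, 90, 90], [1, 1, 1]), 1))    # 90.0
print(round(weighted_geometric_mean([100, 100, 70], [1, 1, 1]), 1))  # 88.8
```

Both vectors share an arithmetic mean of 90, but the imbalanced vector loses more than a point under the geometric mean, matching the penalty described above.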

The harmonic mean, HM(x_1, ..., x_n) = n / sum(1/x_i), is the strongly non-compensatory aggregation that is dominated by the smallest values in the input vector [7]. The harmonic mean of (95, 95, 15) is approximately 34, far below the arithmetic mean of 68 and even below the geometric mean of approximately 51. This extreme sensitivity to low values makes the harmonic mean appropriate for phenomena governed by weakest-link dynamics, where the system's effective capability is determined by its most vulnerable component. The Amplitude architecture uses the weighted harmonic mean for the Threshold adversarial resilience framework, where security is only as strong as the weakest defense vector.
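The same pattern for the weighted harmonic mean, reproducing the worked example above; flooring at epsilon is again our assumption, since the harmonic mean is undefined when any score is exactly zero.

```python
# Weighted harmonic mean: dominated by the smallest inputs.
def weighted_harmonic_mean(scores, weights, eps=1e-9):
    total_weight = sum(weights)
    return total_weight / sum(w / max(s, eps) for s, w in zip(scores, weights))

# The worked example from the text: (95, 95, 15).
print(round(weighted_harmonic_mean([95, 95, 15], [1, 1, 1]), 1))  # 34.2
```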

The multiplicative chain, MC(x_1, ..., x_n) = product(x_i / 100) * 100, treats each dimension as a gate that must be passed for the composite to be non-zero. If any single dimension scores zero, the multiplicative chain produces zero regardless of all other dimensions. This all-or-nothing property is appropriate for phenomena where every component is essential and no amount of excellence in other components can compensate for the complete absence of one. The Amplitude architecture uses the multiplicative chain for the Mandate human oversight framework, where a zero score on any component, whether override effectiveness, intervention latency, or visibility depth, means the oversight system has a fatal gap.
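A sketch of the multiplicative chain, gating behavior included; the five-component vector anticipates the oversight example analyzed in the Findings section.

```python
# Multiplicative chain: every dimension is a gate, so any zero
# component forces the composite to zero.
def multiplicative_chain(scores):
    product = 1.0
    for s in scores:
        product *= s / 100.0
    return product * 100.0

components = [90, 85, 0, 80, 75]
print(multiplicative_chain(components))   # 0.0
print(sum(components) / len(components))  # 66.0 (arithmetic mean)
```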

The ceiling-constrained mean, CCM(x_1, ..., x_n) = min(ceiling, f(x_1, ..., x_n)), computes a standard aggregation and then caps the result at a maximum determined by a constraint dimension [8]. This aggregation is appropriate when one dimension establishes an upper bound on the achievable composite, regardless of performance on other dimensions. The Amplitude architecture uses the ceiling-constrained mean for the Parity fairness framework, where outcome disparity establishes a hard ceiling on the composite score. An agent with perfect procedural fairness but significant outcome disparity cannot score above the level determined by its outcome disparity, no matter how strong its other dimensions.
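A sketch of the ceiling-constrained mean; using the arithmetic mean as the inner aggregation f and the particular ceiling value are illustrative assumptions, not Parity's documented configuration.

```python
# Ceiling-constrained mean: aggregate the procedural dimensions,
# then cap the composite at a ceiling set by outcome disparity.
def ceiling_constrained_mean(procedural_scores, ceiling):
    base = sum(procedural_scores) / len(procedural_scores)
    return min(ceiling, base)

# Perfect procedure cannot escape a severe outcome-disparity ceiling.
print(ceiling_constrained_mean([100, 100, 100], ceiling=30))  # 30
```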

Findings

The mapping between aggregation functions and regulatory evaluation patterns reveals a striking correspondence. When regulators assess trust, they apply non-compensatory reasoning: a financial institution that is perfectly transparent but systematically unreliable is not considered moderately trustworthy. When courts assess security liability, they apply weakest-link reasoning: a data breach through an unpatched vulnerability is not excused by the excellence of other security measures. When auditors assess oversight, they apply all-or-nothing logic: an oversight system without escalation capability is fundamentally incomplete regardless of its other properties. When civil rights bodies assess fairness, they apply outcome caps [9]: procedural compliance does not satisfy anti-discrimination requirements when outcomes are disparate.

We quantify the gap between arithmetic mean aggregation and the correct aggregation for each framework through simulation. For each of the ten Amplitude frameworks, we generate 10,000 random dimension score vectors and compute both the arithmetic mean composite and the framework-specific composite. The median gap between the two composites ranges from 2 points for the Harmony framework, which uses the arithmetic mean and therefore exhibits only weighting differences, to 31 points for the Threshold framework, which uses the harmonic mean [7]. The 95th percentile gap for Threshold reaches 48 points, meaning that arithmetic mean aggregation would overestimate security scores by nearly half the scale for the most unevenly defended systems.
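A hedged sketch of this gap simulation follows. The sampling distribution is not specified above; uniform scores on [0, 100] are an assumption here, so the resulting statistics will not exactly match the reported figures.

```python
import random
from statistics import median

def harmonic(v, eps=1e-9):
    return len(v) / sum(1 / max(x, eps) for x in v)

# Median and 95th-percentile gap between the arithmetic-mean composite
# and a framework-specific composite over random score vectors.
def gap_percentiles(aggregate, n_dims=5, trials=10_000, seed=0):
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        v = [rng.uniform(0, 100) for _ in range(n_dims)]
        am = sum(v) / len(v)
        gaps.append(am - aggregate(v))
    gaps.sort()
    return median(gaps), gaps[int(0.95 * len(gaps))]

print(gap_percentiles(harmonic))  # (median gap, 95th-percentile gap)
```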

The Mandate framework exhibits the most dramatic gap behavior. Because the multiplicative chain produces zero when any component is zero, the gap between multiplicative and arithmetic aggregation is maximally divergent for oversight systems with missing components. An oversight system scoring (90, 85, 0, 80, 75) on its five components receives a Mandate score of 0, correctly reflecting the fatal gap in the missing component, while the arithmetic mean would return 66, falsely suggesting moderate effectiveness. In our simulation, 23% of random score vectors produced a Mandate score below 10 while the arithmetic mean equivalent exceeded 50, demonstrating that the arithmetic mean would systematically mask oversight failures.
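The masking statistic can be estimated the same way; the uniform sampling is again our assumption, so the resulting fraction will differ from the 23% reported above, which depends on the simulation's actual score distribution.

```python
import random

# Fraction of random vectors where the multiplicative chain is near zero
# while the arithmetic mean still looks moderately healthy.
def masking_rate(trials=10_000, n_dims=5, seed=0):
    rng = random.Random(seed)
    masked = 0
    for _ in range(trials):
        v = [rng.uniform(0, 100) for _ in range(n_dims)]
        chain = 1.0
        for x in v:
            chain *= x / 100.0
        chain *= 100.0
        if chain < 10 and sum(v) / n_dims > 50:
            masked += 1
    return masked / trials

print(masking_rate())
```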

The Parity framework exhibits bounded gap behavior determined by the ceiling constraint. When outcome disparity is moderate (ceiling around 60), the gap between ceiling-constrained and unconstrained aggregation averages 15 points for agents with strong procedural scores. When outcome disparity is severe (ceiling around 30), the gap averages 35 points. This finding validates the ceiling constraint as a meaningful regulatory mechanism: it ensures that no amount of procedural compliance can produce a fairness score that exceeds what outcomes justify, which is precisely how civil rights law evaluates disparate impact [10].
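A single-agent illustration of the bounded gap follows; the 15- and 35-point figures above are averages over the simulated population, whereas this sketch shows one agent with strong procedural scores.

```python
# Gap between unconstrained and ceiling-constrained aggregation.
def parity_gap(procedural_scores, ceiling):
    base = sum(procedural_scores) / len(procedural_scores)
    return base - min(ceiling, base)

strong_procedure = [85, 90, 80]          # unconstrained mean = 85
print(parity_gap(strong_procedure, 60))  # 25.0 under a moderate ceiling
print(parity_gap(strong_procedure, 30))  # 55.0 under a severe ceiling
```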

The cross-framework consistency analysis reveals that the five aggregation families produce materially different rankings of agents when applied to the same underlying dimension scores [11]. Two agents may have identical arithmetic mean composites but very different geometric mean, harmonic mean, or multiplicative chain composites. This means that the choice of aggregation function is not merely a scaling choice; it is a substantive choice about which agents are considered high-performing and which are considered deficient. The aggregation function embeds a normative judgment about what matters, and that judgment should be made explicitly rather than defaulted to the arithmetic mean.
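The ranking divergence is easy to exhibit directly; the two score vectors below are illustrative.

```python
import math

# Two agents with identical arithmetic means rank differently under
# the non-compensatory aggregations.
agent_a = [80, 80, 80]
agent_b = [100, 100, 40]

def am(v): return sum(v) / len(v)
def gm(v): return math.prod(v) ** (1 / len(v))
def hm(v): return len(v) / sum(1 / x for x in v)

for name, f in [("AM", am), ("GM", gm), ("HM", hm)]:
    print(name, round(f(agent_a), 1), round(f(agent_b), 1))
# AM 80.0 80.0  -- tie
# GM 80.0 73.7  -- agent_a pulls ahead
# HM 80.0 66.7  -- agent_a pulls further ahead
```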

Implications

The arithmetic mean default is not merely a technical convenience; it is an implicit normative choice that compensatory performance is the correct evaluation model for every phenomenon [12]. By defaulting to the arithmetic mean, scoring frameworks implicitly assert that excellence in one dimension can compensate for failure in another, which is an empirically testable claim that is false for trust, security, oversight, and fairness as evaluated by actual regulators and courts. Framework designers who default to arithmetic mean without justification are not making a neutral technical choice; they are making a consequential normative choice that may contradict the evaluation patterns of the institutions that will use their scores.

The practical recommendation for scoring framework designers is to begin with the regulatory evaluation model, not the mathematical convenience. Before selecting an aggregation function, the designer should ask: how does the relevant regulatory or judicial body actually evaluate this phenomenon? If the answer is weakest-link, use harmonic mean. If the answer is non-compensatory, use geometric mean. If the answer is all-or-nothing, use multiplicative chain. If the answer is outcome-bounded, use ceiling-constrained aggregation. If and only if the answer is genuinely compensatory should arithmetic mean be selected.
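The recommendation reduces to a small decision table; a sketch, with the evaluation-model labels drawn from this section's vocabulary rather than from any existing API.

```python
# Map from how the relevant institution evaluates the phenomenon
# to the aggregation family that mirrors that evaluation.
AGGREGATION_FOR_MODEL = {
    "compensatory": "arithmetic mean",
    "non-compensatory": "geometric mean",
    "weakest-link": "harmonic mean",
    "all-or-nothing": "multiplicative chain",
    "outcome-bounded": "ceiling-constrained mean",
}

def select_aggregation(evaluation_model: str) -> str:
    return AGGREGATION_FOR_MODEL[evaluation_model]

print(select_aggregation("weakest-link"))  # harmonic mean
```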

For the Amplitude architecture specifically, the selection of five different aggregation families across ten frameworks creates an interpretive challenge: scores from different frameworks are not directly comparable because they embody different aggregation properties [13]. A Fidelity score of 70 and a Threshold score of 70 represent qualitatively different achievement levels because the geometric mean and harmonic mean impose different difficulty curves. The Amplitude architecture addresses this through calibration that ensures each framework's score distribution has similar statistical properties, but users must understand that the mathematical meaning of the score differs by framework.
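One plausible calibration scheme, sketched below, maps each framework's raw composite to its percentile within a reference population; this is an assumption for illustration, as the text above does not specify Amplitude's actual calibration method.

```python
import bisect

# Percentile calibration: a score of 70 then means the same relative
# standing in every framework, whatever the aggregation family.
def percentile_calibrate(raw_score, reference_scores):
    ordered = sorted(reference_scores)
    rank = bisect.bisect_right(ordered, raw_score)
    return 100.0 * rank / len(ordered)

reference = [20, 35, 50, 55, 60, 68, 70, 75, 88, 95]  # hypothetical scores
print(percentile_calibrate(70, reference))  # 70.0
```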

The aggregation methodology also has implications for gaming resistance. Arithmetic mean aggregation is the easiest to game because improvements in any dimension contribute equally and linearly to the composite. Geometric mean is harder to game because it penalizes imbalance, requiring an agent to improve across all dimensions rather than exploiting a single easy dimension [6]. Harmonic mean is the hardest to game because the composite is dominated by the weakest dimension, forcing the agent to address its most difficult vulnerability to improve the score. Multiplicative chain is gaming-resistant by construction because any zero component produces a zero composite regardless of all other efforts. The choice of aggregation function is therefore not only a measurement decision but also a strategic decision about the incentive structure the framework creates.
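The incentive asymmetry can be made concrete by comparing the payoff of a 5-point improvement on the strongest versus the weakest dimension; the score vector is illustrative.

```python
import math

v = [90, 70, 30]

def am(x): return sum(x) / len(x)
def gm(x): return math.prod(x) ** (1 / len(x))
def hm(x): return len(x) / sum(1 / s for s in x)

# Compare boosting the strongest dimension (90 -> 95) against
# boosting the weakest dimension (30 -> 35).
for name, f in [("AM", am), ("GM", gm), ("HM", hm)]:
    boost_strong = f([95, 70, 30]) - f(v)
    boost_weak = f([90, 70, 35]) - f(v)
    print(name, round(boost_strong, 2), round(boost_weak, 2))
# AM rewards both equally; GM and especially HM pay far more for
# fixing the weakest dimension, so easy-dimension gaming fails.
```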

References

  1. Nardo, M., Saisana, M., Saltelli, A., Tarantola, S., Hoffman, A., & Giovannini, E. (2008). Handbook on Constructing Composite Indicators: Methodology and User Guide. OECD Publishing.
  2. Munda, G. (2008). Social Multi-Criteria Evaluation for a Sustainable Economy. Springer-Verlag.
  3. Federal Aviation Administration. (2000). System Safety Handbook. U.S. Department of Transportation.
  4. Grabisch, M., Marichal, J.-L., Mesiar, R., & Pap, E. (2009). Aggregation Functions. Cambridge University Press.
  5. Hardy, G. H., Littlewood, J. E., & Pólya, G. (1952). Inequalities (2nd ed.). Cambridge University Press.
  6. Bullen, P. S. (2003). Handbook of Means and Their Inequalities. Kluwer Academic Publishers.
  7. de Waal, D. J., & du Plessis, J. L. (2005). On the harmonic mean and its role in measurement science. Measurement, 37(1), 1-15.
  8. Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Transactions on Systems, Man, and Cybernetics, 18(1), 183-190.
  9. Equal Employment Opportunity Commission. (1978). Uniform Guidelines on Employee Selection Procedures, 29 C.F.R. Part 1607. Federal Register, 43(166), 38290-38315.
  10. Griggs v. Duke Power Co., 401 U.S. 424 (1971). Supreme Court of the United States.
  11. Hwang, C.-L., & Yoon, K. (1981). Multiple Attribute Decision Making: Methods and Applications. Springer-Verlag.
  12. Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.
  13. Mays, E. (2004). Credit Scoring for Risk Managers: The Handbook for Lenders. Thomson/South-Western.