Abstract
We describe a Model Context Protocol server architecture for delivering real-time data quality scores during agent tool calls at inference time. Transaction-grade scoring in under 100 milliseconds allows quality signals to be injected inline during tool calls, so agents can make data consumption decisions informed by scarcity, quality, decision impact, and defensibility assessments without observable latency impact on their primary task.
Challenge
The practical utility of data quality scoring depends on its availability at the point of consumption [1]. A quality score that is computed after an agent has already consumed data and made a decision serves a forensic purpose but cannot improve the decision itself. For quality signals to be actionable, they must be available during the agent's inference-time operation, specifically during the tool calls through which agents access external data sources. This imposes a strict latency constraint: quality scoring must complete within the latency budget of the agent's tool call, typically under 100 milliseconds, to avoid degrading the agent's responsiveness.
The latency constraint is particularly challenging because comprehensive data quality assessment is inherently multi-dimensional and computationally intensive [2]. Computing the Meridian score for a data source requires evaluating scarcity via alternative source enumeration, quality via accuracy-completeness-freshness assessment, decision impact via counterfactual KL-divergence analysis, and defensibility via compliance coverage evaluation. Each of these computations involves non-trivial processing: database lookups, statistical calculations, and inference passes. Performing all four evaluations synchronously within a 100-millisecond budget is infeasible without architectural innovation.
A second challenge is integration with the agent's existing tool-calling architecture. AI agents access external data through tool calls defined by their runtime environment. Injecting quality signals into this flow requires either modifying the agent's tool definitions (which requires cooperation from the agent developer) or interposing a quality-scoring layer between the agent and its data sources (which requires a protocol-level integration point). The recently standardized Model Context Protocol (MCP) [3] provides exactly such an integration point, enabling quality scoring to be delivered as a protocol-level capability rather than an application-level modification.
A third challenge is maintaining score freshness without recomputing every dimension on every request. Some quality dimensions change rapidly (Decision Impact varies by query), some change moderately (Quality metrics shift as data is updated), and some change slowly (Scarcity and Defensibility are relatively stable over time) [4]. An effective architecture must accommodate these different temporal dynamics, recomputing fast-changing dimensions in real time while serving slow-changing dimensions from cached assessments.
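These temporal dynamics can be summarized as a per-dimension refresh policy. The following sketch is illustrative: the tier labels anticipate the three-tier architecture described in the next section, and the concrete cadences are assumptions consistent with the text rather than normative values.

```python
# Illustrative per-dimension refresh policy. Tier names follow the
# three-tier architecture; cadences are assumptions drawn from the text.
REFRESH_POLICY = {
    "decision_impact": {"tier": "hot",  "refresh": "per-query"},
    "quality":         {"tier": "warm", "refresh": "seconds-to-minutes"},
    "scarcity":        {"tier": "cold", "refresh": "daily"},
    "defensibility":   {"tier": "cold", "refresh": "weekly-or-on-event"},
}
```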
Architecture
We implement quality scoring as a Model Context Protocol server that interposes between AI agents and their data sources [3]. The MCP specification defines a standardized protocol for agents to discover and invoke tools, access resources, and receive contextual information from external servers. Our architecture leverages this protocol to expose quality scoring as a resource that agents can access alongside the data itself. When an agent makes a tool call to retrieve data from an external source, the MCP server enriches the response with quality metadata, including the Meridian composite score and its four dimensional sub-scores.
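As a minimal sketch of the score structure the server returns, the composite and its four sub-scores can be represented as follows. The aggregation function and equal weighting are assumptions for illustration; this section does not specify how the Meridian composite is computed from its dimensions.

```python
# Hypothetical representation of a Meridian score. The weighted-sum
# aggregation and the equal default weights are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class MeridianScore:
    scarcity: float         # cold tier: alternative-source enumeration
    quality: float          # warm tier: accuracy, completeness, freshness
    decision_impact: float  # hot tier: counterfactual KL-divergence
    defensibility: float    # cold tier: compliance coverage

    def composite(self, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
        dims = (self.scarcity, self.quality,
                self.decision_impact, self.defensibility)
        return sum(w * d for w, d in zip(weights, dims))

score = MeridianScore(scarcity=0.8, quality=0.9,
                      decision_impact=0.6, defensibility=0.7)
print(round(score.composite(), 3))  # 0.75
```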
The server architecture is organized into three tiers that correspond to the temporal dynamics of the four Meridian dimensions [5]. The hot tier handles Decision Impact scoring, which must be computed fresh for each agent query because it depends on the specific decision context. The hot tier uses the counterfactual KL-divergence protocol with pre-computed criticality gates [6], achieving scoring latency of under 50 milliseconds through aggressive optimization of the masked inference pass. The warm tier handles Quality scoring, which is recomputed at intervals ranging from seconds to minutes depending on the data source's update frequency. Quality scores are cached and served from memory, with asynchronous background workers that refresh the cache on a schedule calibrated to each source's information velocity.
The cold tier handles Scarcity and Defensibility scoring, which change infrequently and are computationally intensive to evaluate. Scarcity scoring requires enumerating and comparing alternative data sources, a process that involves database queries and content similarity analysis that can take seconds or minutes. Defensibility scoring involves compliance assessment against regulatory requirements [7], often requiring structured evaluation of licensing terms and provenance documentation. Both dimensions are computed asynchronously on hourly or daily schedules and served from a persistent cache.
The three-tier architecture achieves the overall 100-millisecond latency target by ensuring that the critical path for each tool call involves only two operations: a cache lookup for the pre-computed Scarcity, Quality, and Defensibility scores (under 5 milliseconds) and a real-time computation of Decision Impact (under 50 milliseconds) [8]. The remaining latency budget accommodates network overhead, response serialization, and the MCP protocol framing. The architecture is horizontally scalable: multiple MCP server instances can operate behind a load balancer, sharing the warm and cold tier caches through a distributed cache layer [9].
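The two-operation critical path can be sketched as follows. The function names, the shape of the cached entry, and the stubbed Decision Impact computation are assumptions for illustration; only the two-step structure and the 100-millisecond budget come from the text.

```python
# Illustrative sketch of the per-request critical path: one lookup of the
# pre-computed Scarcity/Quality/Defensibility scores, then one real-time
# Decision Impact computation. Names and stubs are hypothetical.
import time

TOTAL_BUDGET_MS = 100.0

def score_request(source_id, query, sqd_cache, compute_decision_impact):
    start = time.perf_counter()
    sqd = sqd_cache[source_id]                          # cold+warm tiers: < 5 ms
    impact = compute_decision_impact(source_id, query)  # hot tier: < 50 ms
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        **sqd,
        "decision_impact": impact,
        "scoring_latency_ms": elapsed_ms,
        "within_budget": elapsed_ms < TOTAL_BUDGET_MS,
    }

# Example with stubbed tiers:
cache = {"src-1": {"scarcity": 0.8, "quality": 0.9, "defensibility": 0.7}}
result = score_request("src-1", "example query", cache, lambda s, q: 0.6)
```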
The MCP server exposes quality scores through two mechanisms. The primary mechanism enriches tool call responses with quality metadata, embedding scores directly in the data payload that the agent processes. This approach requires no modification to the agent's prompting or behavior; the quality information is simply available in the context window alongside the data itself. The secondary mechanism exposes a dedicated quality-query tool that agents can invoke proactively to request quality assessments before deciding whether to consume a data source. This mechanism supports quality-aware agent architectures in which the agent explicitly reasons about data quality as part of its decision-making process [10].
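The primary mechanism, response enrichment, can be sketched as a simple transformation applied before the tool result is returned to the agent. The payload shape and field names below are illustrative assumptions, not part of the MCP specification.

```python
# Minimal sketch of response enrichment: quality metadata is attached to
# the tool-call payload the agent sees. Field names are hypothetical.

def enrich_tool_response(data_payload: dict, meridian: dict) -> dict:
    return {
        "data": data_payload,
        "quality": {
            "meridian_composite": meridian["composite"],
            "dimensions": {
                "scarcity": meridian["scarcity"],
                "quality": meridian["quality"],
                "decision_impact": meridian["decision_impact"],
                "defensibility": meridian["defensibility"],
            },
        },
    }
```

The secondary mechanism, a dedicated quality-query tool, could return only the `"quality"` portion of this payload, letting the agent inspect a source's scores before deciding to fetch the data itself.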
Implementation
Achieving sub-100-millisecond transaction-grade scoring requires careful optimization at every layer of the implementation stack. The hot-tier Decision Impact computation is the most latency-sensitive component. We optimize it through three techniques. First, criticality gate evaluation is piggybacked on the agent's standard inference pass, extracting attention weights at zero marginal cost [11]. Second, the masked inference pass for critical sources uses a distilled version of the agent's model that preserves decision-relevant behavior while reducing inference latency [12]. Third, masked inference passes for multiple critical sources are executed in parallel using batched GPU inference [13], ensuring that the total latency scales with the single-source inference time rather than the number of critical sources.
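The third optimization, running masked passes for multiple critical sources concurrently, can be sketched with standard async fan-out. The `masked_pass` coroutine below is a hypothetical stand-in for a batched GPU inference call, and the 40-millisecond sleep is an illustrative latency, not a measured figure.

```python
# Sketch of parallel masked inference: total latency tracks the slowest
# single pass rather than the sum over critical sources. masked_pass is a
# hypothetical stand-in for a batched distilled-model inference call.
import asyncio

async def masked_pass(source_id: str) -> float:
    await asyncio.sleep(0.04)  # stand-in for a ~40 ms inference pass
    return 0.0                 # stand-in divergence score

async def decision_impact(critical_sources: list[str]) -> dict[str, float]:
    scores = await asyncio.gather(*(masked_pass(s) for s in critical_sources))
    return dict(zip(critical_sources, scores))
```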
The warm-tier Quality cache is implemented as a tiered in-memory store with configurable time-to-live (TTL) values per data source [9]. Each cache entry contains the composite quality score and its three sub-dimension scores (accuracy, completeness, freshness), along with metadata indicating the freshness of the cached assessment relative to the underlying data. The TTL is calibrated to the data source's update frequency: a source that updates every minute receives a cache TTL of 30 seconds, while a source that updates daily receives a cache TTL of one hour. Background refresh workers pre-emptively recompute scores before TTL expiration, ensuring that cache hits always return fresh assessments.
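The TTL calibration and cache lookup can be sketched as follows. The calibration rule (half the source's update interval, capped at one hour) is inferred from the two examples in the text; the class and method names are illustrative.

```python
# Sketch of warm-tier caching with a TTL calibrated to each source's
# update frequency. The half-interval-capped-at-an-hour rule reproduces
# the examples in the text but is otherwise an assumption.
import time

def calibrated_ttl(update_interval_s: float) -> float:
    """TTL = half the source's update interval, capped at one hour."""
    return min(update_interval_s / 2, 3600.0)

class WarmTierCache:
    def __init__(self):
        self._store = {}  # source_id -> (entry, expires_at)

    def put(self, source_id, entry, update_interval_s):
        ttl = calibrated_ttl(update_interval_s)
        self._store[source_id] = (entry, time.monotonic() + ttl)

    def get(self, source_id):
        item = self._store.get(source_id)
        if item is None:
            return None
        entry, expires_at = item
        if time.monotonic() >= expires_at:
            # Stale entry: in production a background worker should have
            # refreshed it before expiry; here we simply evict.
            del self._store[source_id]
            return None
        return entry
```

A source updating every 60 seconds gets a 30-second TTL, and a daily-updating source gets the one-hour cap, matching the calibration examples above.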
The cold-tier cache for Scarcity and Defensibility is implemented as a persistent key-value store with write-through semantics. Scarcity scores are recomputed on a daily schedule using a batch process that enumerates alternative sources and computes content similarity metrics. Defensibility scores are recomputed on a weekly schedule or when triggered by changes to regulatory requirements or data source licensing terms. Both dimensions are supplemented by an event-driven refresh mechanism: if a new data source enters the catalog or an existing source changes its licensing terms, the affected Scarcity and Defensibility scores are recomputed immediately rather than waiting for the scheduled batch.
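The event-driven refresh path can be sketched as a small dispatcher. The event names and recompute hooks are illustrative assumptions; in practice a catalog change may also require recomputing Scarcity for existing sources whose alternatives have changed.

```python
# Sketch of the cold tier's event-driven refresh: catalog events trigger
# immediate recomputation instead of waiting for the scheduled batch.
# Event names and recompute callables are hypothetical.

class ColdTierRefresher:
    def __init__(self, recompute_scarcity, recompute_defensibility):
        self.recompute_scarcity = recompute_scarcity
        self.recompute_defensibility = recompute_defensibility
        self.cache = {}  # (source_id, dimension) -> score

    def on_event(self, event_type: str, source_id: str):
        if event_type == "source_added":
            # A new alternative source can change Scarcity assessments.
            self.cache[(source_id, "scarcity")] = \
                self.recompute_scarcity(source_id)
        elif event_type == "license_changed":
            # Licensing changes invalidate the Defensibility assessment.
            self.cache[(source_id, "defensibility")] = \
                self.recompute_defensibility(source_id)
```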
The end-to-end latency profile, validated under production-representative load, demonstrates the following characteristics. The median total scoring latency is 62 milliseconds, comprising 4 milliseconds for cache lookups, 48 milliseconds for Decision Impact computation, and 10 milliseconds for MCP protocol overhead and response serialization. The 95th percentile latency is 89 milliseconds, and the 99th percentile latency is 97 milliseconds, staying within the 100-millisecond budget at all measured percentiles. Under sustained load of 1,000 scoring requests per second, throughput remains stable with no degradation in latency percentiles, confirming that the architecture meets transaction-grade performance requirements.
Applications
The MCP server architecture enables a new category of agent behavior: quality-informed data consumption [10]. An agent that receives Meridian scores alongside its data can implement conditional logic based on quality thresholds. For example, an agent performing financial analysis can be configured to require a minimum Meridian composite score of 0.7 for data sources used in investment recommendations, automatically flagging recommendations that rely on sub-threshold data for human review. This quality-gating behavior emerges naturally from the inclusion of quality metadata in the agent's context, without requiring changes to the agent's core model or training data.
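The quality-gating example can be sketched as a threshold filter. The 0.7 threshold comes from the example above; the explicit routing function is a hypothetical illustration of behavior that, per the text, normally emerges from the agent's own reasoning over quality metadata in context.

```python
# Sketch of threshold-based quality gating: sources below the minimum
# composite score are flagged for human review rather than consumed.
MIN_COMPOSITE_FOR_RECOMMENDATION = 0.7

def gate_sources(sources: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split data sources into (usable, flagged_for_review)."""
    usable = [s for s in sources
              if s["meridian_composite"] >= MIN_COMPOSITE_FOR_RECOMMENDATION]
    flagged = [s for s in sources
               if s["meridian_composite"] < MIN_COMPOSITE_FOR_RECOMMENDATION]
    return usable, flagged
```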
A second application is real-time quality monitoring for AI operations teams. The MCP server's scoring telemetry provides a continuous stream of quality assessments for every data source consumed by every agent in the organization. Operations teams can visualize quality trends, detect degradation events, and correlate quality changes with agent performance metrics [14]. This monitoring capability transforms data quality management from a periodic audit activity into a continuous operational practice, analogous to how application performance monitoring transformed software operations.
A third application is compliance automation. The Defensibility dimension of the Meridian score, served in real time through the MCP server, enables agents to automatically avoid consuming data sources that lack the required legal basis for use in automated decision-making [7]. This is particularly valuable in regulated industries where the legal permissibility of data use varies by jurisdiction, by decision type, and by the regulatory regime applicable to the data subject. The MCP server's Defensibility scoring can encode these complex regulatory requirements as quantitative scores that agents consume programmatically, reducing the risk of inadvertent regulatory violations.
Finally, the architecture provides the infrastructure foundation for data marketplace integration. When an agent consumes data from a marketplace, the MCP server can transparently score the consumed data and report the scores back to the marketplace platform. This creates a feedback loop in which marketplace participants receive quality signals based on actual agent consumption patterns rather than static quality certifications [15]. Over time, this feedback loop enables the marketplace to surface the highest-quality data sources, create quality-based pricing tiers, and identify data sources whose quality has degraded below marketplace standards.
References
- Wang, R. Y., & Strong, D. M. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), 5-33.
- Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data Quality Assessment. Communications of the ACM, 45(4), 211-218.
- Anthropic. (2024). Model Context Protocol Specification. Anthropic Technical Documentation.
- Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, 41(3), 1-52.
- Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., ... & Venkataramani, V. (2013). Scaling Memcache at Facebook. Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 385-398.
- Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
- European Parliament and Council. (2024). Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act). Official Journal of the European Union.
- Dean, J., & Barroso, L. A. (2013). The Tail at Scale. Communications of the ACM, 56(2), 74-80.
- DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., ... & Vogels, W. (2007). Dynamo: Amazon's Highly Available Key-Value Store. Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles, 205-220.
- Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems, 36.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 5998-6008.
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
- Sze, V., Chen, Y.-H., Yang, T.-J., & Emer, J. S. (2017). Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE, 105(12), 2295-2329.
- Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems, 28, 2503-2511.
- Agarwal, A., Dahleh, M., & Sarkar, T. (2019). A Marketplace for Data: An Algorithmic Solution. Proceedings of the 2019 ACM Conference on Economics and Computation, 701-726.