From core to more: Service providers to the rescue

January 12, 2024   ·   9 min read

Tags: service provider interfaces, application programming interfaces, extensible software, platform development, data modeling, domain-driven design, ubiquitous language, API evolution, SPI evolution, technology compatibility kits, conformance testing, test-driven development, error handling, fault tolerance, capability management, network resilience, real-world testing, modular architecture, energy consumption monitoring, provider implementation, scalability, maintainability

Orbiting about 400 kilometers above Earth, the International Space Station (ISS) serves as a beacon of human ingenuity and collaboration. This modular space habitat, constructed jointly by space agencies from all around the world, including NASA, Roscosmos, ESA, JAXA, and CSA, is a demonstration of the extraordinary feats possible through international partnership. This complex structure, requiring meticulous coordination and cooperation, orbits our planet every 90 minutes, symbolizing a unified effort in space exploration and scientific advancement. Similarly, extensible software systems possess an inherent complexity that is greater than the capabilities of a single team or organization. It’s crucial to establish distinct boundaries among components, designate experts for each, and commit to strategies for discovering integration points among these components.

APIs and SPIs

An API (Application Programming Interface) is familiar terrain for us, serving as the outward-facing interface of a system component. Whether it’s a library we rely on or an external service we communicate with through various protocols (such as REST, gRPC, or other RPC protocols), APIs present our systems to the outside world, ideally through an intuitive interface.

SPIs (Service Provider Interfaces), on the other hand, are the interfaces through which functionality and data are supplied to an existing system, often through mechanisms like callbacks or provider interfaces. This is a common pattern in our daily tools like the plugins of our preferred editor or callbacks from the frameworks we use.

But what does it mean to build such a platform: the foundational layer between the API consumers (a product) and the functionality suppliers (our providers)?

In this article, we’ll delve into the principles of extensibility in software. I’ll share insights on crafting software systems that are not just sturdy and adaptable but also primed for evolution, potentially beyond the confines of your team or organization.

For the sake of this article, let’s say we build a platform that collects the energy consumption of various hardware components (CPU, GPU, …) and also tracks the energy consumption of the external services our product consumes. Since we can’t possibly support every hardware component and its particular measurement method, nor know how much energy each external service uses, we’ll build a platform that can be extended by us, by other teams in our company, and by external parties. In our energy consumption platform, the API represents the overall energy consumption as a metric, while the SPIs are the conduits through which individual component data flows into the platform.
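To make the two directions concrete, here is a minimal sketch of such a platform in Python. All names (`EnergyProvider`, `EnergyPlatform`, the watt-hour unit) are illustrative assumptions, not the article’s actual system; the point is only the split between the SPI (implemented by providers, called by the platform) and the API (called by consumers):

```python
from abc import ABC, abstractmethod


class EnergyProvider(ABC):
    """SPI: each hardware or external-service integration implements this."""

    @abstractmethod
    def component(self) -> str: ...

    @abstractmethod
    def consumed_watt_hours(self) -> float: ...


class EnergyPlatform:
    def __init__(self) -> None:
        self._providers: list[EnergyProvider] = []

    def register(self, provider: EnergyProvider) -> None:
        """SPI side: functionality is supplied to the platform."""
        self._providers.append(provider)

    def total_consumption(self) -> float:
        """API side: consumers see one aggregated metric."""
        return sum(p.consumed_watt_hours() for p in self._providers)


class CpuProvider(EnergyProvider):
    """A hypothetical provider for CPU energy readings."""

    def component(self) -> str:
        return "cpu"

    def consumed_watt_hours(self) -> float:
        return 12.5


platform = EnergyPlatform()
platform.register(CpuProvider())
```

Note how the consumer never learns which providers exist; it only queries the aggregate.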

Crafting Robust Data Model(s)

Designing an effective data model is a complex task. In building for extensibility, I’ve learned that a single data model often isn’t sufficient. A robust data model should represent data in a way that’s intuitive to the consumer. This calls for a ubiquitous language within our domain, though our SPI implementations might have their own interpretations, necessitating a translation into our platform’s domain. Furthermore, SPI providers might require support in this translation process, as they often need to pull data from various APIs, translate it, and then consolidate it before integrating it into our platform. The data model and services we expose aim to present a unified view of all SPI providers, without bias towards any particular one. While this makes sense from an API perspective, the SPI side of our platform needs to be more flexible.
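The translation step described above can be sketched as follows. The vendor payload shape (`joules`, `timestamp_ms`) is a made-up assumption standing in for whatever a real backend returns; the provider’s job is to map it into the platform’s ubiquitous language:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergySample:
    """Platform domain object: the ubiquitous language every provider maps into."""

    component: str
    watt_hours: float
    epoch_seconds: int


def from_vendor_payload(payload: dict) -> EnergySample:
    """Translate a hypothetical vendor payload (joules, milliseconds)
    into the platform's domain model (watt-hours, seconds)."""
    return EnergySample(
        component=payload["device"],
        watt_hours=payload["joules"] / 3600.0,  # 1 Wh = 3600 J
        epoch_seconds=payload["timestamp_ms"] // 1000,
    )
```

Every provider owns one such translation, so the API side can stay unbiased toward any particular vendor format.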

Consider the task of managing vast amounts of data packets for our energy consumption API, which aggregates daily usage from various providers. Each provider has unique data retrieval methods: some offer endpoints for minute-by-minute consumption, others use paginated REST endpoints for specific timeframes, and some provide a continuous stream via HTTP/2. While these methods vary, our goal is to present a unified data stream to our API consumers, masking the underlying complexity. SPI providers, therefore, must adapt their data delivery, potentially offering data in batches or continuous streams, with mechanisms for retries, pauses, or resumptions, ensuring a consistent and cohesive experience for the API consumer. But the variance in our providers, as well as the fact that the providers need to construct our domain objects (creatable and potentially mutable), requires a different interface than the one our API consumers see (queryable and immutable).
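A rough sketch of how the platform can flatten batch-oriented providers into one stream, with a simple per-provider retry. The `BatchingProvider` name and the retry-from-scratch strategy are assumptions for illustration; a production implementation would resume from the last successful batch rather than restart, to avoid duplicated data:

```python
from abc import ABC, abstractmethod
from typing import Iterator


class BatchingProvider(ABC):
    """SPI: providers deliver data however their backend allows, in batches."""

    @abstractmethod
    def fetch_batches(self) -> Iterator[list[float]]: ...


def unified_stream(providers: list[BatchingProvider], retries: int = 3) -> Iterator[float]:
    """API: one continuous stream across all providers, retrying transient failures."""
    for provider in providers:
        for attempt in range(retries):
            try:
                for batch in provider.fetch_batches():
                    yield from batch
                break  # provider drained successfully, move to the next
            except ConnectionError:
                if attempt == retries - 1:
                    raise  # exhausted retries, surface the failure


class FlakyProvider(BatchingProvider):
    """Hypothetical provider whose first fetch fails with a transient error."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch_batches(self) -> Iterator[list[float]]:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("transient network glitch")
        yield [1.0, 2.0]
        yield [3.0]
```

The API consumer sees `[1.0, 2.0, 3.0]` and never learns a retry happened.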

Evolving API and SPI Independently yet Harmoniously

Advancing our platform and its extensions usually can’t happen in lock-step. The evolution of such systems tends to be gradual, influenced by varying schedules and priorities across teams. The API and SPI therefore need to evolve independently: we keep the different work areas aligned without exposing functionality that is not yet implemented.

Establishing a Baseline for Providers

When an API is ready to level up with new features that lean on the providers, it’s key to figure out whether the new functionality is a nice-to-have or a must-have. If it’s something every provider needs to implement, think about going SPI-first. This way, providers get a heads-up to weave in these features well before they hit the API stage and are exposed to products. Getting everyone on the same page about what’s needed and how it all works can be a bit of a tangle; communicating which features a provider needs to implement, and how they should behave, is usually a huge source of toil. Consider writing specifications that are shared across all stakeholders. Depending on the quality of the specification, this can be a reasonable approach (Design by Contract). A great example of such a publicly available specification for an SPI is the Language Server Protocol specification.

Test-driven providers

I’ve noticed that embracing a test-driven approach can really make a difference when it comes to growing service providers. Here’s the gist: the platform provides a suite of automated tests, some detailed, some not so much, that use the APIs to put the providers through their paces. You might know these as Technology Compatibility Kits (TCKs) or conformance tests. Providers can then lean on these tests to build out the necessary functionality, whether against real services or a dummy setup. I’ve been in the trenches with a team crafting tests for more than 60 providers, all chatting with their actual backends. Found an edge case in one? Add a conformance test and see how many others fail for the same edge case. Not only that, it’s also a great way to onboard new people onto the project of building out these providers.
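A miniature version of such a conformance suite might look like this. The specific rules (a float result, non-negative, monotonically non-decreasing) are invented examples of the kind of contract a TCK encodes, not the article’s actual requirements:

```python
from abc import ABC, abstractmethod


class EnergyProvider(ABC):
    """SPI under test: the contract every provider must satisfy."""

    @abstractmethod
    def consumed_watt_hours(self) -> float: ...


def run_conformance(provider: EnergyProvider) -> list[str]:
    """A tiny TCK: exercise the SPI and collect contract violations."""
    failures: list[str] = []
    value = provider.consumed_watt_hours()
    if not isinstance(value, float):
        failures.append("consumed_watt_hours must return a float")
    elif value < 0:
        failures.append("consumption must be non-negative")
    # A second reading must never be lower: consumption is cumulative.
    if provider.consumed_watt_hours() < value:
        failures.append("consumption must be monotonically non-decreasing")
    return failures


class GoodProvider(EnergyProvider):
    def consumed_watt_hours(self) -> float:
        return 4.2


class BadProvider(EnergyProvider):
    def consumed_watt_hours(self) -> float:
        return -1.0  # violates the non-negativity rule
```

A new provider implements the interface, runs the suite, and gets a checklist of what’s still missing, without writing a single test themselves.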

Working under real circumstances

What’s great about this is getting a front-row seat to how our platform deals with what the providers throw at it. It’s not just about checking if the data looks right; it’s seeing if the platform can juggle all the curveballs, like timeouts and retry challenges. And you know what? There’s this moment I can’t shake off: a newcomer on the team came to me, wide-eyed, saying how mind-blowing it was to watch over 18,000 tests fire up on a provider they were piecing together, without them having to script a single test! That test suite was able to guide them through the various requirements the platform had for them, be it data formats, error handling, or exposing their capabilities.

Implementing new functionality

In the natural progression of your platform’s development, you’ll want to expand its feature set and solicit additional functionalities or data from your providers. Consider, for instance, the scenario where you wish your energy measurement providers to also monitor temperature. Effective communication of these new requirements to your provider teams is crucial. Moreover, achieving consensus on the updated API and SPI, coupled with providing clear guidance for implementation, is essential. Leveraging a TCK-driven methodology can greatly streamline this process, facilitating the introduction of new tests for the desired functionalities and monitoring the integration progress of each provider. A critical aspect of this phase is the management of capabilities, ensuring that providers can transparently disclose the functionalities they have successfully integrated.

Fault barriers

In our interconnected realm, interactions with hardware or across networks inherently carry the risk of failure. Recognizing and integrating this reality into our foundational assumptions, protocols, and architectural design is vital. This principle equally applies to our providers. While constructing fault barriers around interactions with providers is feasible, simply acknowledging a vague failure is insufficient. Our platform needs to understand specific error categories exposed through providers. Is the provider currently unable to supply data? Is it struggling with a transient network error that can be retried? Or has it encountered a rate limit which requires a backoff before retrying? Parsing these error nuances is primarily the provider’s responsibility, yet ensuring this process is robustly codified in the specifications and validated through testing enhances the system’s resilience and clarity.
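The error categories above could be codified as an exception hierarchy that providers translate their raw failures into, so the platform’s fault barrier can act per category. The class names and the string returns (standing in for real scheduling decisions) are illustrative assumptions:

```python
from enum import Enum, auto


class ProviderError(Exception):
    """Base class: providers translate raw backend failures into these categories."""


class Unavailable(ProviderError):
    """The provider currently cannot supply data at all."""


class Transient(ProviderError):
    """A passing network error; safe to retry immediately."""


class RateLimited(ProviderError):
    """A rate limit was hit; retry only after a backoff."""

    def __init__(self, retry_after_s: float) -> None:
        super().__init__(f"rate limited, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


def handle(error: ProviderError) -> str:
    """The platform's fault barrier decides what to do per error category."""
    if isinstance(error, RateLimited):
        return f"backoff:{error.retry_after_s}"
    if isinstance(error, Transient):
        return "retry"
    return "skip"  # Unavailable: serve what the other providers delivered
```

Because the categories live in the SPI, the conformance suite can verify that each provider raises the right one for each failure mode.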

Anecdote

In my years of developing extensible platforms, one particular conversation kept popping up. We were putting our providers to the test against actual endpoints, so it wasn’t shocking when some tests failed due to various glitches. The dialogue often opened with, “Shouldn’t we just rerun those tests?”. My viewpoint? A firm no. These ‘flaky’ tests weren’t pointing to issues with the tests themselves; they were shining a light on deeper bugs within our platform. It’s not the tests that are unreliable; it’s the services we’re engaging with. As we’ve touched on before, this unpredictability isn’t just part of the testing landscape; it’s embedded in the real-world scenario, too. Network instability isn’t exclusive to our test runs; it’s a reality in the live environment as well. So, rather than brushing these inconsistencies under the rug, our platform needs to handle them adeptly, ensuring a smooth operation from the backend (like intelligently retrying data fetches from providers) and maintaining a seamless facade for our API consumers.

Opening up your product

When you next find yourself contemplating the adoption of various strategies for data extraction or the integration with multiple external services, it’s worth giving thought to the potential benefits of constructing an SPI. Take into account the perspective of both the API consumer and the SPI implementers. Evaluate whether a straightforward specification suffices, or if the complexity of the integration justifies the need for a test suite resembling a TCK. Remember, the goal is to streamline and enhance the maintainability of your architecture. Although there’s no single solution, determining the most suitable approach depends on the specifics of your environment.


