What type of delivery assurance is required (e.g. guaranteed delivery, best effort, etc.)
How do you pay for the service (an entire topic by itself!)
What I'll focus on here is managing quality of service and service level. Note that these recommendations are equally valid whether your IT is in-house or whether you outsource development and/or operation of your applications (in fact, I originally defined this model for a customer that does extensive outsourcing).
When looking at a service provider (independent of consumers), there are two key metrics you need to capture:
Guarantee of service level at a certain volume. For example, "a 4s (or better) response time will be provided for volumes up to 1000 requests per minute for typical requests". The challenging part here is defining what a "typical" request is. This measures the scalability the service will initially roll out with. You would typically determine it by load testing on machines identical to the production infrastructure.
Cost of scale-out. For example, "the cost to achieve 4s at a volume of 2000 requests per minute will be $X above base infrastructure". This lets the organization capture the cost of success (as more consumers start using the web service): how cost-effectively the service will scale beyond its initial deployment. Again, this can be determined via load testing and hardware/software costing information.
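To make these two provider metrics concrete, here is a minimal sketch of how they might be recorded and compared. The class name, field names, and all figures are hypothetical illustrations, not part of any real contract:

```python
from dataclasses import dataclass

@dataclass
class ProviderContract:
    """Provider-side contract metrics (all values are illustrative)."""
    guaranteed_response_s: float   # e.g. 4s-or-better response time
    guaranteed_volume_rpm: int     # guaranteed up to this many requests/min
    scale_out_volume_rpm: int      # next volume tier the provider can reach
    scale_out_cost: float          # dollars above base infrastructure for that tier

    def cost_per_additional_rpm(self) -> float:
        """Unit cost of growth: dollars per extra request/min of capacity."""
        extra_capacity = self.scale_out_volume_rpm - self.guaranteed_volume_rpm
        return self.scale_out_cost / extra_capacity

# Hypothetical numbers: 4s at 1000 rpm guaranteed; $50,000 to reach 2000 rpm.
contract = ProviderContract(4.0, 1000, 2000, 50_000.0)
print(contract.cost_per_additional_rpm())  # 50.0 dollars per extra request/min
```

The unit-cost figure is what makes later trade-off decisions possible: it turns "cost of success" into a number you can weigh against the business impact of a slowdown.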
When looking at a service consumer, there are two different metrics you should capture:
Peak usage of providers (broken down by time period). For example, "Consumer will invoke less than 250 requests per minute, 5x9, and less than 100 requests per minute otherwise." Capturing this for each consumer (and adding them all up for each time period) lets you determine whether the provider will exceed its expected volume (and thus perform below expectations). If there's an existing system the consumer is replacing, you can capture metrics from that system and extrapolate the expected increases; otherwise, the project team will have to estimate the expected load.
Impact of provider slowdown. For example, "Consumer will provide an 8s page refresh time, assuming the provider delivers a 4s response time. If the provider delivers a 6s response time, page refresh time will be 10s; for an 8s provider response time, ..." To measure this, during load testing of the consumer application you need to insert delays in front of the provider to see how the consumer application performs in each case. Fundamentally, this tells you the impact of a slower provider. As part of this, it's important to capture (at least in relative terms) what this slowdown costs the business (for example, does it impact revenue, customer satisfaction, etc.).
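The two consumer metrics can be sketched the same way: sum each consumer's declared peak per time period against the provider's guaranteed volume, and model the slowdown impact implied by the example above (4s provider gives 8s page, 6s gives 10s, i.e. a fixed consumer-side overhead on top of the provider's response time). Consumer names, peak figures, and the linear overhead model are all hypothetical:

```python
# Declared peak usage per consumer, per time period (made-up numbers).
consumer_peaks_rpm = {
    "billing-ui":   {"business_hours": 250, "off_hours": 100},
    "partner-api":  {"business_hours": 400, "off_hours": 300},
    "batch-loader": {"business_hours": 50,  "off_hours": 600},
}
guaranteed_volume_rpm = 1000  # the provider's guaranteed volume

for period in ("business_hours", "off_hours"):
    total = sum(peaks[period] for peaks in consumer_peaks_rpm.values())
    status = "within contract" if total <= guaranteed_volume_rpm else "OVER contract"
    print(f"{period}: {total} rpm -> {status}")

def page_refresh_s(provider_response_s: float, consumer_overhead_s: float = 4.0) -> float:
    """Slowdown-impact model implied by the example: fixed overhead plus provider time."""
    return consumer_overhead_s + provider_response_s

print(page_refresh_s(4.0))  # 8.0s page refresh at the contracted 4s response
print(page_refresh_s(6.0))  # 10.0s if the provider degrades to 6s
```

In practice the slowdown curve is rarely this linear (retries, timeouts, and queuing make it worse), which is exactly why the delay-injection load tests described above are needed rather than an assumed formula.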
You'll notice that the above "contract metrics" are not defined in absolute terms: this is not a "Consumer A is guaranteed a certain response time up to a certain volume" type of contract. Instead, these metrics are tied to cost/benefit analysis and optimization. With just these four metrics for consumers and providers, you have enough information to make "big picture" decisions about your SOA. If a new consumer is coming online, or there is an unexpected load peak among existing consumers, you can make trade-offs for the provider:
Is the cost of adding capacity to handle the new load justified (compares the cost of scale out to the impact of provider slowdown)?
Or, should lower value consumers be throttled to maintain capacity for higher value consumers (is rationing available capacity a more effective solution)?
More sophisticated forms of this also capture and factor in the cost of sending requests across a WAN to alternate (lower-load) data centers, as another way to use available capacity more effectively (albeit at a potentially higher cost, because WAN bandwidth is rarely free).
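The trade-off decision itself can be sketched as a simple cost comparison across the options listed above. Every dollar figure here is a made-up placeholder; in a real organization these would come from the contract metrics (cost of scale-out, business impact of slowdown) gathered earlier:

```python
# Compare the candidate responses to a load peak by total cost (placeholder figures).
scale_out_cost = 50_000.0        # from the provider's cost-of-scale-out metric
slowdown_cost_per_day = 2_000.0  # business impact of degraded response time
peak_days = 20                   # days the peak is expected to persist

do_nothing_cost = slowdown_cost_per_day * peak_days  # 40,000: cost of accepting slowdown
throttle_cost = 5_000.0          # value lost by throttling low-value consumers
wan_reroute_cost = 12_000.0      # WAN bandwidth to an alternate data center

options = {
    "add capacity": scale_out_cost,
    "accept slowdown": do_nothing_cost,
    "throttle low-value consumers": throttle_cost,
    "reroute to alternate data center": wan_reroute_cost,
}
best = min(options, key=options.get)
print(best)  # with these numbers: "throttle low-value consumers"
```

The point is not the arithmetic, which is trivial, but that without the four contract metrics none of these numbers exist, and "add more capacity" wins by default.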
You can see that the right solution is not always "add more capacity." Load peaks are relatively rare (when you consider there are 24 hours in a day, load peaks rarely last more than an hour a day); most of the time, services operate well below capacity (typical server utilization, per industry metrics, is only around 15%). Adding more capacity just lowers that average utilization further.
By using the capacity you already have more effectively (either by using other data centers, or by prioritizing/throttling load by business value), you can often handle these load peaks in a way that's much more cost-effective for the business, without sacrificing what really matters. The contract metrics I've outlined above give you the tools you need to determine the right trade-offs.