L06 · Dec 27, 2025 · 12 min read

System design: requirements as a contract

A practical way to write requirements that force architecture instead of letting the design drift into vibes.


Most system design failures start before the first diagram.

They start when “requirements” are treated like a warm-up, and architecture is treated like the real work.

If the requirements are vague, the design becomes a debate about taste.

If the requirements are clear, the architecture almost draws itself.

In this chapter I treat requirements as a contract. Not a legal one. A technical one.

It is the agreement on what we are building, what we are not building, and what we will optimize for.

Why I call this a contract

Because requirements are where you buy clarity.

Without a clear contract, you get three failure modes:

  • You argue about components instead of outcomes.
  • You design for everything, which means you design for nothing.
  • You cannot tell if the design is “good”, because “good” was never defined.

With a clear contract, you can say:

  • “This is in scope and this is out of scope.”
  • “This flow matters most, so the architecture supports it.”
  • “We are optimizing latency and reliability over cost, or the reverse.”

The three buckets that keep designs honest

Every requirements doc I trust can be reduced to three buckets:

  1. Functional requirements: what the system must do. The core flows.
  2. Non-functional requirements: how the system must behave over time. Reliability, performance, security, cost, operability.
  3. Constraints and assumptions: the reality we do not get to wish away.

If one of these buckets is missing, the design has no anchor.

1) Functional requirements are flows, not features

A good functional requirement is something you can trace end-to-end.

It reads like: “A user does X, the system does Y, and the outcome is Z.”

For example, if the prompt is “design an invite link service for a product”, your must-haves might be:

  • Create an invite link that points to a specific workspace or resource.
  • Resolve an invite link like /invites/{token} to the correct destination.
  • Optionally track basic usage, like “how many accepts” per invite.

Notice what is not here:

  • “Use Redis.”
  • “Use microservices.”
  • “Use Kubernetes.”

Those are implementation choices, not requirements.

Implementation choices come later and they should fall out of constraints and NFRs.
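To make “flows, not features” concrete, the three must-haves above can be sketched end to end without committing to any implementation choice. This is a minimal in-memory sketch; the names (`create_invite`, `resolve_invite`) and the dict-as-store are illustrative assumptions, not decisions:

```python
import secrets

# In-memory store standing in for whatever data store v1 is allowed to use.
invites: dict[str, dict] = {}

def create_invite(workspace_id: str) -> str:
    """Flow 1: create an invite link that points to a specific workspace."""
    token = secrets.token_urlsafe(16)
    invites[token] = {"workspace_id": workspace_id, "accepts": 0}
    return f"/invites/{token}"

def resolve_invite(token: str) -> str:
    """Flow 2: resolve /invites/{token} to the correct destination."""
    invite = invites.get(token)
    if invite is None:
        raise KeyError("unknown token")  # maps to a 404 at the HTTP layer
    invite["accepts"] += 1              # Flow 3: basic usage tracking
    return f"/workspaces/{invite['workspace_id']}"
```

Each function is one traceable “user does X, system does Y, outcome is Z” flow, and nothing in it names Redis, microservices, or Kubernetes.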

Add acceptance criteria

The easiest way to prevent “we built it, but it does not work” is to write acceptance criteria.

Acceptance criteria are testable statements that tell you when the requirement is done.

Examples:

  • Redirect preserves query params.
  • Creating a link is idempotent if the client retries.
  • If a token does not exist (or is expired), the system returns a clear 404 or 410.

If you cannot test it, it is not a requirement yet.
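Acceptance criteria translate directly into executable checks. A hypothetical sketch, assuming the client sends an idempotency key on create; the stub service exists only so that each criterion maps to one assertion:

```python
# Stub invite-link service; the shape is an assumption for illustration.
links: dict[str, str] = {}         # idempotency_key -> token
destinations: dict[str, str] = {}  # token -> destination

def create_link(idempotency_key: str, destination: str) -> str:
    """Criterion: creating a link is idempotent if the client retries."""
    if idempotency_key not in links:
        token = f"tok-{len(links)}"
        links[idempotency_key] = token
        destinations[token] = destination
    return links[idempotency_key]

def redirect(token: str, query: str = "") -> tuple[int, str]:
    """Criteria: redirect preserves query params; unknown tokens get a 404."""
    dest = destinations.get(token)
    if dest is None:
        return 404, ""              # clear 404 for a missing token
    if query:
        dest = f"{dest}?{query}"    # query params survive the redirect
    return 302, dest
```

If a criterion cannot be written as an assertion like these, it is not testable yet.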

2) Non-functional requirements decide the architecture

Non-functional requirements are the part that forces the system shape.

They decide where you need redundancy, where you can be eventually consistent, what you cache, how you partition data, and what you measure.

In plain words: NFRs decide what you can get away with.

If you need very high reliability, you add redundancy and you plan for failure.

If you need very low latency at high load, you avoid doing slow work on the request path and you shape traffic so the system does not collapse at peak.
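The “avoid slow work on the request path” move can be sketched with a plain queue. The handler and payload shape here are assumptions for illustration:

```python
import queue

# Stand-in for a real message queue; a worker drains it off the request path.
work: queue.Queue = queue.Queue()

def handle_request(payload: dict) -> dict:
    """Fast path: validate, enqueue the slow work, respond immediately."""
    work.put(payload)              # emails, analytics, fan-out: all async
    return {"status": "accepted"}  # the request path stays cheap at peak
```

The request returns as soon as the work is enqueued, so peak latency depends on the queue, not on the slowest downstream step.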

The most useful move here is to stop saying “it should be fast and reliable” and instead put numbers on it.

For reliability, define these four things. This is the part that sounds cryptic until you attach it to a real system.

  • An SLI is the metric you use to represent “is this working for users.”
    • Example: “successful request rate for GET /invites/{token}.”
    • Example: “p99 latency for checkout requests.”
  • An SLO is the target for that SLI over a time window.
    • Example: “99.9% successful redirects over 30 days.”
    • Example: “p99 checkout latency under 300ms over 7 days.”
  • An error budget is what the SLO allows you to get wrong.
    • Example: “99.9% over 30 days” means you can be down or failing for about 43 minutes in that month.
    • This is the number that turns “reliability” into tradeoffs you can actually make.
  • An SLA is optional and external. It is what you promise to customers.
    • Example: “If monthly availability is below 99.9%, customers get service credits.”
    • If you are not making external promises, do not invent an SLA. Use SLIs and SLOs anyway.
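The error budget arithmetic is worth doing explicitly; a tiny helper makes the “about 43 minutes” figure reproducible:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of downtime or failure the SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60

# 99.9% over 30 days -> (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes
```

Tightening the SLO by one nine (99.99%) shrinks the budget tenfold, to about 4.3 minutes a month, which is why each extra nine is a real tradeoff.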

For performance, define:

  • A latency target, usually in percentiles like p99 latency.
  • A load target, like QPS and the peak-to-average ratio (how bursty traffic is).
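If you are estimating these targets from raw samples, a nearest-rank percentile and a peak-to-average ratio are each a few lines. This is a sketch, not a production metrics pipeline:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: p% of samples are at or below this value."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

def peak_to_average(qps_samples: list[float]) -> float:
    """Burstiness: peak QPS divided by average QPS over the same window."""
    return max(qps_samples) / (sum(qps_samples) / len(qps_samples))
```

p99 is the latency that 99% of requests beat, which is why it tracks user pain far better than an average does.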

For security, define:

  • What data is sensitive.
  • Where the trust boundaries are.
  • What abuse you must resist.

For cost, define:

  • What you are allowed to spend per unit, per month, or per request.

If you cannot put numbers on it, you can still write priorities.

“We will trade cost for reliability” is a valid statement.

“We will trade consistency for availability on reads” is a valid statement.

What is not valid is “everything is important.”

3) Constraints and assumptions keep you out of fantasy land

Constraints are rules you must obey.

Assumptions are things you believe are true for this design to work.

Examples of constraints:

  • Must integrate with the existing auth system (SSO, IAM, whatever the org already uses).
  • Must run in an existing cloud account or cluster. No new platform this quarter.
  • Must use the existing data store for v1. Migrating data safely is often the real project, and it usually dominates the timeline.
  • Must comply with a real rule, like data residency, retention, or audit logging.
  • Team and time are fixed. Two engineers and six weeks means v1 must be small.

Examples of assumptions:

  • “Most traffic is reads. Writes are rare, but must be correct.”
  • “Traffic is bursty (peak-to-average ratio of ~10×) and there will be hot keys.”
  • “The happy path is synchronous, but anything slow can be pushed to async.”
  • “We do not need feature X in v1.” This is an assumption about scope, not about reality. Still worth writing down.
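An assumption like “most traffic is reads” shapes the design directly, for example by justifying a read-through cache. A minimal sketch, where the `load_from_store` callback is hypothetical:

```python
# Pays off only if the "mostly reads" assumption holds; if it is wrong,
# this is added complexity (and staleness risk) for nothing.
cache: dict[str, str] = {}

def read(key: str, load_from_store) -> str:
    """Read-through cache: only a miss touches the backing store."""
    if key not in cache:
        cache[key] = load_from_store(key)
    return cache[key]
```

This is exactly why assumptions belong in the contract: if writes turn out to dominate, the cache, and the decision it justified, should be revisited.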

Constraints and assumptions matter because they make decisions explainable.

They also tell you what risks you are taking if an assumption is wrong.

A one-page requirements template

This is the template I use when I want the requirements to force architecture.

Copy-paste it and fill it in.

## Problem statement

One sentence:
"Build X for Y users so they can Z, while meeting measurable targets A/B/C."

## Users and user journeys

- Primary user:
- One end-to-end user journey:
- Secondary users:

## Functional requirements

Must have:
- ...

Nice to have:
- ...

## Non-functional requirements

Reliability:
- SLI:
- SLO:

Performance:
- p99 latency:
- Peak QPS:
- Peak-to-average:

Security:
- Sensitive data:
- Trust boundaries:

Cost:
- Budget:

Operability:
- What must be observable:
- What failures must be easy to debug:

## Constraints and assumptions

Constraints:
- ...

Assumptions:
- ...

## In scope and out of scope

In scope:
- ...

Out of scope:
- ...

## Open questions

- ...

If you want the fastest possible interview version, you can fill this in live in a few minutes.

Then you draw.

Then you revisit the doc and update it as you make design decisions.

How to use this in an interview

  1. Fill the problem statement and the primary user journey first.
  2. Write 3 to 5 must-have functional requirements.
  3. Write 2 to 3 NFR numbers (latency, QPS, SLO) even if they are rough.
  4. List constraints and assumptions.
  5. Only then start drawing the diagram.

If the interviewer changes the problem mid-way, update the contract first.

If you write requirements like that, the architecture conversation becomes boring in the best way.

You are no longer debating components.

You are trading one set of measurable properties for another.

That’s it, may the force be with you!