REST API Concepts for AI
What is an API?
An API (Application Programming Interface) is a contract that defines how two software components communicate with each other. In the context of AI deployment, an API is the bridge between your trained model and the outside world — applications, users, and other services that want to consume predictions.
The Restaurant Analogy
The most intuitive way to understand an API is to think of a restaurant:
| Restaurant | API World |
|---|---|
| Customer | Client application (web app, mobile app, another service) |
| Menu | API documentation (available endpoints, expected inputs) |
| Order | HTTP request with input data (JSON payload) |
| Waiter | API server (receives requests, routes them, returns responses) |
| Kitchen | ML model (processes input, generates prediction) |
| Dish served | HTTP response with prediction results |
| Receipt | Response status code (200 OK, 400 Bad Request, etc.) |
Just like a waiter doesn't need to know how to cook, an API doesn't need to expose the internal workings of your model. The client only needs to know what to send and what to expect back.
REST Architecture
REST (Representational State Transfer) is an architectural style for designing networked applications. A REST API follows a set of constraints that make it scalable, stateless, and easy to understand.
REST Principles
| Principle | Description | AI API Example |
|---|---|---|
| Stateless | Each request contains all information needed to process it | Every prediction request includes the full input features |
| Client-Server | Separation between the consumer and the provider | Web app (client) is separate from the model server |
| Uniform Interface | Standard HTTP methods and URI conventions | POST /api/v1/predict for predictions |
| Resource-Based | Everything is a resource identified by a URI | /models, /predictions, /health |
| Cacheable | Responses can be cached when appropriate | Cache repeated predictions for identical inputs |
| Layered System | Client cannot tell if connected directly or via intermediary | Load balancer sits between client and API |
REST API Architecture for AI
HTTP Methods
HTTP methods define the action you want to perform on a resource. For AI APIs, some methods are more common than others.
| Method | Action | Idempotent | Safe | AI API Usage |
|---|---|---|---|---|
| GET | Retrieve data | ✅ Yes | ✅ Yes | Get model info, health check, list available models |
| POST | Create/Submit data | ❌ No | ❌ No | Submit features for prediction, upload training data |
| PUT | Replace entirely | ✅ Yes | ❌ No | Replace a model version |
| PATCH | Partial update | ❌ No | ❌ No | Update model configuration |
| DELETE | Remove resource | ✅ Yes | ❌ No | Remove a deployed model |
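To see why POST is not idempotent while PUT is, consider a toy in-memory "model registry". Every name here (registry, post_model, put_model) is invented for illustration, not part of any real framework:

```python
# Toy in-memory model registry illustrating idempotency.
import itertools

registry = {}                 # model_id -> config
_ids = itertools.count(1)

def post_model(config):
    """POST: creates a NEW resource on every call, so it is not idempotent."""
    model_id = next(_ids)
    registry[model_id] = config
    return model_id

def put_model(model_id, config):
    """PUT: replaces the resource at a known URI, so repeating it is harmless."""
    registry[model_id] = config
    return model_id

# Two identical POSTs create two distinct resources...
a = post_model({"version": "v1"})
b = post_model({"version": "v1"})
assert a != b and len(registry) == 2

# ...but repeating the same PUT leaves the server state unchanged.
put_model(a, {"version": "v2"})
put_model(a, {"version": "v2"})
assert registry[a] == {"version": "v2"} and len(registry) == 2
```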
Common AI API Endpoints
GET /api/v1/health → Check if the service is running
GET /api/v1/models → List available models
GET /api/v1/models/{id} → Get details about a specific model
POST /api/v1/predict → Submit features, receive prediction
POST /api/v1/predict/batch → Submit multiple inputs for batch prediction
GET /api/v1/predict/{id} → Retrieve a past prediction result
DELETE /api/v1/models/{id} → Remove a deployed model
Why POST for Predictions?
Even though a prediction doesn't "create" a resource in the traditional sense, we use POST because:
- Input features can be complex (nested objects, arrays) — too large for URL parameters
- The request has a body (JSON payload)
- Predictions may have side effects (logging, billing)
HTTP Status Codes
Status codes tell the client what happened with their request. They are grouped by category.
Status Code Families
| Range | Category | Meaning |
|---|---|---|
| 1xx | Informational | Request received, processing continues |
| 2xx | Success | Request successfully processed |
| 3xx | Redirection | Further action needed |
| 4xx | Client Error | Problem with the request |
| 5xx | Server Error | Problem on the server |
Essential Status Codes for AI APIs
| Code | Name | When to Use | AI API Example |
|---|---|---|---|
| 200 | OK | Request succeeded | Prediction returned successfully |
| 201 | Created | Resource created | New model uploaded and registered |
| 204 | No Content | Success, no body | Model deleted successfully |
| 400 | Bad Request | Invalid input format | JSON syntax error in request body |
| 401 | Unauthorized | Missing authentication | No API key provided |
| 403 | Forbidden | Insufficient permissions | API key lacks prediction access |
| 404 | Not Found | Resource doesn't exist | Model ID not found |
| 422 | Unprocessable Entity | Validation failed | Feature values out of expected range |
| 429 | Too Many Requests | Rate limit exceeded | Client sent too many prediction requests |
| 500 | Internal Server Error | Unexpected server failure | Model crashed during inference |
| 503 | Service Unavailable | Server not ready | Model still loading at startup |
Two of these codes are easy to confuse:
- 400 Bad Request: the JSON itself is malformed (a syntax error)
- 422 Unprocessable Entity: the JSON parses, but the data fails validation (e.g., a negative age or a missing required field)
FastAPI uses 422 by default for validation errors from Pydantic models.
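The distinction can be expressed in a few lines of stdlib Python. classify_request is an invented name, and the age rule mirrors the 18-120 range used in the validation examples in this chapter:

```python
import json

def classify_request(raw_body: str) -> int:
    """Return the status code a server would pick for this body."""
    try:
        data = json.loads(raw_body)           # syntax check first
    except json.JSONDecodeError:
        return 400                            # malformed JSON -> Bad Request
    age = data.get("age")
    if not isinstance(age, int) or not (18 <= age <= 120):
        return 422                            # valid JSON, invalid data
    return 200

assert classify_request('{"age": 35')  == 400   # broken syntax
assert classify_request('{"age": -5}') == 422   # fails validation
assert classify_request('{"age": 35}') == 200
```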
JSON Request/Response Format
REST APIs communicate using JSON (JavaScript Object Notation). For AI APIs, you need to design clear input/output schemas.
Prediction Request
{
  "features": {
    "age": 35,
    "income": 55000,
    "credit_score": 720,
    "employment_years": 8,
    "loan_amount": 25000
  },
  "options": {
    "explain": true,
    "threshold": 0.5
  }
}
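A payload like the one above maps directly onto Python dictionaries via the stdlib json module. A quick sketch, using a trimmed version of the request:

```python
import json

raw = '{"features": {"age": 35, "income": 55000}, "options": {"explain": true}}'
payload = json.loads(raw)                      # JSON text -> Python dict
assert payload["features"]["age"] == 35
assert payload["options"]["explain"] is True   # JSON true -> Python True
assert json.loads(json.dumps(payload)) == payload  # round-trips cleanly
```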
Prediction Response
{
  "prediction": "approved",
  "probability": 0.87,
  "confidence": "high",
  "model_version": "loan-classifier-v2.1",
  "timestamp": "2026-02-23T14:30:00Z",
  "explanation": {
    "top_features": [
      {"feature": "credit_score", "importance": 0.42},
      {"feature": "income", "importance": 0.31},
      {"feature": "employment_years", "importance": 0.15}
    ]
  }
}
Error Response
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid input features",
    "details": [
      {
        "field": "age",
        "message": "Value must be between 18 and 120",
        "received": -5
      }
    ]
  },
  "timestamp": "2026-02-23T14:31:00Z",
  "request_id": "req_abc123"
}
API Design Best Practices
1. Endpoint Naming Conventions
| Convention | Good | Bad |
|---|---|---|
| Use nouns, not verbs | /api/v1/predictions | /api/v1/makePrediction |
| Use plural nouns | /api/v1/models | /api/v1/model |
| Use kebab-case | /api/v1/model-versions | /api/v1/modelVersions |
| Version your API | /api/v1/predict | /predict |
| Use hierarchy for relations | /api/v1/models/{id}/predictions | /api/v1/model-predictions |
2. Request Validation
Always validate input data before sending it to your model:
from pydantic import BaseModel, Field, field_validator

class PredictionInput(BaseModel):
    age: int = Field(..., ge=18, le=120, description="Customer age")
    income: float = Field(..., gt=0, description="Annual income in USD")
    credit_score: int = Field(..., ge=300, le=850)

    # Pydantic v2 style; on Pydantic v1, use @validator("income") instead
    @field_validator("income")
    @classmethod
    def income_must_be_reasonable(cls, v: float) -> float:
        if v > 10_000_000:
            raise ValueError("Income seems unrealistically high")
        return v
Strict validation like this:
- Prevents your model from receiving nonsensical inputs
- Returns clear error messages to clients
- Avoids silent failures (model returns a prediction for garbage input)
- Protects against injection attacks
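Assuming Pydantic is installed (the code below works on both v1 and v2), here is how validation fails loudly rather than silently. LoanInput is a trimmed, two-field stand-in for the model above:

```python
from pydantic import BaseModel, Field, ValidationError

class LoanInput(BaseModel):
    age: int = Field(..., ge=18, le=120)
    income: float = Field(..., gt=0)

# Invalid input raises immediately, before the model ever sees it.
try:
    LoanInput(age=-5, income=55000)
except ValidationError as exc:
    # The error report names the offending field.
    assert exc.errors()[0]["loc"] == ("age",)

# Valid input passes through with coerced, typed values.
ok = LoanInput(age=35, income=55000)
assert ok.age == 35
```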
3. Consistent Response Format
Always return responses in a consistent envelope:
{
  "status": "success",          // or "error"
  "data": { ... },              // response payload
  "meta": {                     // metadata
    "model_version": "v2.1",
    "response_time_ms": 45,
    "request_id": "req_abc123"
  }
}
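One way to build such an envelope is a small helper function. The field names follow the example above; the helper itself (and its defaults) is illustrative:

```python
import uuid

def envelope(data=None, error=None, model_version="v2.1"):
    """Wrap a payload in a consistent success/error envelope."""
    return {
        "status": "error" if error else "success",
        "data": data,
        "error": error,
        "meta": {
            "model_version": model_version,
            "request_id": f"req_{uuid.uuid4().hex[:8]}",
        },
    }

ok = envelope(data={"prediction": "approved"})
assert ok["status"] == "success"

bad = envelope(error={"code": "VALIDATION_ERROR"})
assert bad["status"] == "error"
```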
Authentication and Security
Protecting your AI API is critical — you don't want unauthorized users running predictions (which consume compute resources and may access sensitive models).
API Keys
API keys are the simplest authentication method: the client includes a secret key in a request header.
GET /api/v1/models HTTP/1.1
Host: api.example.com
X-API-Key: sk_live_abc123def456
| Pros | Cons |
|---|---|
| Simple to implement | No built-in expiration |
| Easy for clients to use | Hard to manage permissions per key |
| Works for server-to-server | Vulnerable if exposed in client-side code |
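Checking the key server-side can be sketched in plain Python. hmac.compare_digest gives a constant-time comparison (so timing doesn't leak information about the key), and the key value simply reuses the example above:

```python
import hmac

VALID_KEY = "sk_live_abc123def456"   # in practice, load from a secret store

def authenticate(headers: dict) -> bool:
    """Return True if the request carries a valid X-API-Key header."""
    supplied = headers.get("X-API-Key", "")
    # compare_digest avoids leaking key content via timing side channels
    return hmac.compare_digest(supplied, VALID_KEY)

assert authenticate({"X-API-Key": "sk_live_abc123def456"})
assert not authenticate({"X-API-Key": "wrong"})
assert not authenticate({})          # missing key -> 401 Unauthorized case
```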
JWT (JSON Web Tokens) — Overview
JWT is a more advanced authentication mechanism: the server issues a signed token that the client includes in subsequent requests.
A JWT has three parts: a header (algorithm), a payload (claims/permissions), and a signature (used to verify the token wasn't tampered with).
When to use which:
- API Keys: Simple internal services, prototyping, server-to-server
- JWT: Multi-user applications, fine-grained permissions, token expiration needed
- OAuth 2.0: Third-party access, delegated authorization
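To make the three-part structure concrete, here is an educational, stdlib-only sketch of building an HS256 token. Production code should use a maintained library such as PyJWT instead; the secret and claims below are made up:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"   # never hard-code a real signing key

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(claims: dict) -> str:
    header  = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

token = make_token({"sub": "user42", "scope": "predict"})
header_b64, payload_b64, sig_b64 = token.split(".")   # the three parts
```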
CORS (Cross-Origin Resource Sharing)
When a web application at https://myapp.com tries to call your API at https://api.myml.com, the browser blocks it by default. CORS headers tell the browser which origins are allowed.
CORS Configuration Example
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://myapp.com", "http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
Never use allow_origins=["*"] in production. This allows any website to call your API, which can lead to abuse and data leaks.
Rate Limiting
Rate limiting controls how many requests a client can make in a given time window. This is essential for AI APIs because each prediction consumes compute resources (CPU/GPU time, memory).
| Strategy | Description | Use Case |
|---|---|---|
| Fixed Window | X requests per minute/hour | Simple API key quotas |
| Sliding Window | Smoothed rate over rolling window | Prevents burst abuse |
| Token Bucket | Allows short bursts up to a limit | APIs with variable traffic |
| Per-Endpoint | Different limits for different endpoints | /predict = 100/min, /health = unlimited |
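As one concrete case, the token-bucket strategy from the table can be sketched in a few lines of plain Python. The capacity and refill rate here are arbitrary:

```python
import time

class TokenBucket:
    """Allow short bursts up to `capacity`, refilling steadily over time."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Top the bucket up based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller should answer 429 Too Many Requests

bucket = TokenBucket(capacity=3, refill_per_sec=1)
burst = [bucket.allow() for _ in range(5)]
assert burst[:3] == [True, True, True]    # burst up to capacity
assert burst[3] is False                  # then throttled
```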
Rate Limit Response
When a client exceeds the limit, return a 429 Too Many Requests response with helpful headers:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708700000
{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "You have exceeded 100 requests per minute. Please retry after 30 seconds."
  }
}
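On the client side, a well-behaved consumer reads these headers and backs off before retrying. A small sketch, assuming the headers arrive as a plain dict and backoff_seconds is an invented helper name:

```python
def backoff_seconds(status: int, headers: dict, default: float = 1.0) -> float:
    """Return how long the client should sleep before retrying."""
    if status != 429:
        return 0.0                            # not throttled, no wait
    retry_after = headers.get("Retry-After")
    try:
        return float(retry_after)             # server told us how long
    except (TypeError, ValueError):
        return default                        # header missing or unparseable

assert backoff_seconds(200, {}) == 0.0
assert backoff_seconds(429, {"Retry-After": "30"}) == 30.0
assert backoff_seconds(429, {}) == 1.0
```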
REST vs GraphQL vs gRPC
When building AI APIs, REST is the most common choice, but it's worth understanding the alternatives.
| Feature | REST | GraphQL | gRPC |
|---|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/1.1 or HTTP/2 | HTTP/2 |
| Data Format | JSON | JSON | Protocol Buffers (binary) |
| Schema | OpenAPI (optional) | Required (SDL) | Required (.proto) |
| Learning Curve | Low | Medium | High |
| Performance | Good | Good | Excellent |
| Browser Support | Native | Native | Limited (needs proxy) |
| Streaming | Limited | Subscriptions | Bidirectional |
| Use Case | General APIs, web | Flexible queries, mobile | Microservices, low-latency |
| AI Relevance | Most common for ML APIs | Complex multi-model queries | High-throughput inference |
We focus on REST APIs because they are the most widely used, easiest to test, and best supported by tools like Swagger and Postman. If you need extremely low-latency inference between microservices, consider gRPC as a next step.
The Request/Response Lifecycle
Understanding the full lifecycle of an API request helps you debug issues and optimize performance.
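One way to picture the lifecycle: a request flows through layered stages (authentication, validation, inference, serialization), any of which can short-circuit with an error response. Every function name in this sketch is invented for illustration:

```python
def authenticate(req):
    if req.get("api_key") != "sk_live_abc123def456":
        return {"status": 401}       # reject before any work is done
    return None                      # None = continue to the next stage

def validate(req):
    if "features" not in req:
        return {"status": 422}       # valid auth, invalid data
    return None

def infer(req):
    # Stand-in for the actual model call.
    return {"status": 200, "prediction": "approved"}

def handle(req):
    for stage in (authenticate, validate):
        early = stage(req)
        if early:                    # middleware short-circuits the chain
            return early
    return infer(req)                # handler runs last, response serialized

assert handle({})["status"] == 401
assert handle({"api_key": "sk_live_abc123def456"})["status"] == 422
assert handle({"api_key": "sk_live_abc123def456",
               "features": {"age": 35}})["status"] == 200
```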
Summary
| Concept | Key Takeaway |
|---|---|
| REST API | Standard way to expose ML models via HTTP |
| HTTP Methods | POST for predictions, GET for info/health |
| Status Codes | 200 = success, 422 = validation error, 500 = server error |
| JSON | Universal data format for request/response |
| Authentication | API keys (simple) or JWT (advanced) |
| CORS | Required for browser-based clients |
| Rate Limiting | Protects compute resources from abuse |
| REST vs alternatives | REST for most AI APIs, gRPC for internal high-throughput |
What's Next?
Now that you understand REST API concepts, you'll learn to implement them using two Python frameworks:
- FastAPI — Modern, async, auto-documented (next section)
- Flask — Lightweight, flexible, widely used
Vocabulary Quick Reference
| Term | Definition |
|---|---|
| Endpoint | A specific URL path that accepts requests (e.g., /api/v1/predict) |
| Payload | The data sent in the body of a request or response |
| Serialization | Converting data structures to a transferable format (JSON) |
| Idempotent | Repeating the same request leaves the server in the same state as sending it once |
| Stateless | Server doesn't remember previous requests |
| Middleware | Code that runs between receiving a request and returning a response |
| Schema | A formal description of the expected data structure |