Testing AI APIs and Models

Theory 45 min

Why Test AI Systems?

Traditional software testing already matters — but testing AI-powered systems is even more critical because of unique challenges that don't exist in classical applications.

The Weather Forecast Analogy

Think of an AI model like a weather forecasting system:

The same atmospheric conditions can lead to different forecasts depending on tiny variations
The quality of predictions depends entirely on the quality of historical data
You can't just check "is the output correct?" — you need to check "is the output reasonable?"
A bug might not crash the system — it might silently produce wrong predictions for weeks

Silent Failures

The most dangerous bugs in AI systems are silent failures — the model keeps returning predictions, but those predictions are subtly wrong. Unlike a 500 error, nobody notices until real damage is done.

Key Challenges of Testing AI Systems

Challenge	Classical Software	AI Systems
Determinism	Same input → same output always	Same input → may produce slightly different outputs
Correctness	Output is either right or wrong	Output has a confidence level — "right" is relative
Data dependency	Logic is in code	Logic is learned from data — change data, change behavior
Edge cases	Finite set of boundary conditions	Infinite possible inputs, some adversarial
Debugging	Stack trace points to the bug	Model is a black box — hard to pinpoint failure
Regression	Code change breaks a test	Data drift, retraining, or environment change breaks behavior

The Testing Pyramid for AI

The testing pyramid is a classic concept: write many fast, cheap tests at the bottom (unit tests) and fewer slow, expensive tests at the top (end-to-end). For AI systems, we adapt this pyramid to include model-specific testing layers.

Layer Details

Layer	Tests What	Speed	Count	Tools
Unit	Individual functions, data validation, preprocessing, utilities	⚡ Very fast	50-200+	pytest, unittest
Integration	API endpoints, model loading, database connections, service interactions	🔄 Medium	20-50	pytest + TestClient, httpx
End-to-End	Full pipeline from raw input to final response in production-like env	🐢 Slow	5-15	Postman, Newman, selenium

Rule of Thumb

Aim for a 70/20/10 distribution: 70% unit tests, 20% integration tests, 10% end-to-end tests. This keeps your test suite fast while still catching real-world issues.

Pytest Fundamentals

pytest is the de facto testing framework in Python. It's simple, powerful, and extensible. If you've never used it before, you'll love its minimal syntax.

Your First Test

# test_basics.py

def add(a, b):
    return a + b

def test_add_positive_numbers():
    assert add(2, 3) == 5

def test_add_negative_numbers():
    assert add(-1, -1) == -2

def test_add_zero():
    assert add(0, 0) == 0

Run it:

pytest test_basics.py -v

Output:

test_basics.py::test_add_positive_numbers PASSED
test_basics.py::test_add_negative_numbers PASSED
test_basics.py::test_add_zero PASSED
========================= 3 passed in 0.02s =========================

Pytest Fixtures

Fixtures are reusable setup functions that provide test data or resources. Think of them as the "prep work" before a cooking recipe.

# conftest.py — shared fixtures available to all test files

import pytest
import joblib
import numpy as np
from fastapi.testclient import TestClient
from app.main import app


@pytest.fixture
def client():
    """Create a FastAPI test client."""
    return TestClient(app)


@pytest.fixture
def sample_features():
    """Return valid input features for prediction."""
    return {
        "features": [5.1, 3.5, 1.4, 0.2, 2.3]
    }


@pytest.fixture
def trained_model():
    """Load the trained model from disk."""
    return joblib.load("models/model_v1.joblib")


@pytest.fixture
def sample_array():
    """Return a NumPy array of sample features."""
    return np.array([[5.1, 3.5, 1.4, 0.2, 2.3]])

Using fixtures in tests:

# test_prediction.py

def test_prediction_returns_integer(trained_model, sample_array):
    prediction = trained_model.predict(sample_array)
    assert isinstance(prediction[0], (int, np.integer))

def test_prediction_is_valid_class(trained_model, sample_array):
    prediction = trained_model.predict(sample_array)
    assert prediction[0] in [0, 1]

Parametrize — Testing Multiple Inputs

Instead of writing 10 tests for 10 inputs, use @pytest.mark.parametrize to run one test function with many datasets:

import pytest

@pytest.mark.parametrize("features,expected_class", [
    ([5.1, 3.5, 1.4, 0.2, 2.3], [0, 1]),   # valid input → class 0 or 1
    ([6.7, 3.0, 5.2, 2.3, 1.1], [0, 1]),   # valid input → class 0 or 1
    ([4.9, 2.4, 3.3, 1.0, 0.5], [0, 1]),   # valid input → class 0 or 1
])
def test_prediction_valid_classes(trained_model, features, expected_class):
    import numpy as np
    X = np.array([features])
    prediction = trained_model.predict(X)
    assert prediction[0] in expected_class

Markers — Categorizing Tests

Use markers to tag tests and run subsets:

import pytest

@pytest.mark.slow
def test_model_training_from_scratch():
    """This test takes 30+ seconds — skip in CI fast runs."""
    ...

@pytest.mark.integration
def test_api_predict_endpoint(client, sample_features):
    response = client.post("/api/v1/predict", json=sample_features)
    assert response.status_code == 200

@pytest.mark.unit
def test_feature_validation():
    from app.schemas import PredictionRequest
    req = PredictionRequest(features=[1.0, 2.0, 3.0, 4.0, 5.0])
    assert len(req.features) == 5

Run only fast unit tests:

pytest -m "unit" -v
pytest -m "not slow" -v

# pytest.ini
[pytest]
markers =
    unit: Unit tests (fast)
    integration: Integration tests (medium speed)
    slow: Slow tests (skip in CI fast runs)
    e2e: End-to-end tests

conftest.py — The Test Configuration Hub

conftest.py is a special file that pytest discovers automatically. It's where you put shared fixtures, hooks, and plugins:

project/
├── conftest.py              # Root-level fixtures (available everywhere)
├── tests/
│   ├── conftest.py          # Test-specific fixtures
│   ├── unit/
│   │   ├── conftest.py      # Unit test fixtures
│   │   ├── test_schemas.py
│   │   └── test_utils.py
│   ├── integration/
│   │   ├── conftest.py      # Integration test fixtures
│   │   └── test_api.py
│   └── e2e/
│       └── test_full_pipeline.py

conftest.py Scoping

Fixtures in conftest.py are available to all tests in the same directory and subdirectories. No imports needed — pytest finds them automatically.

Testing API Endpoints

For AI APIs built with FastAPI, we use the TestClient (backed by httpx) to simulate HTTP requests without starting a real server.

Basic Endpoint Testing

# tests/integration/test_api.py

from fastapi.testclient import TestClient
from app.main import app

client = TestClient(app)


def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    data = response.json()
    assert data["status"] == "healthy"
    assert "model_loaded" in data


def test_predict_valid_input():
    payload = {"features": [5.1, 3.5, 1.4, 0.2, 2.3]}
    response = client.post("/api/v1/predict", json=payload)
    assert response.status_code == 200
    data = response.json()
    assert "prediction" in data
    assert "confidence" in data
    assert data["prediction"] in [0, 1]
    assert 0.0 <= data["confidence"] <= 1.0


def test_predict_returns_consistent_schema():
    """Verify the response always has the expected shape."""
    payload = {"features": [5.1, 3.5, 1.4, 0.2, 2.3]}
    response = client.post("/api/v1/predict", json=payload)
    data = response.json()
    expected_keys = {"prediction", "confidence", "model_version"}
    assert expected_keys.issubset(data.keys())

Testing Error Responses

def test_predict_missing_features():
    response = client.post("/api/v1/predict", json={})
    assert response.status_code == 422  # Pydantic validation error


def test_predict_wrong_type():
    payload = {"features": "not a list"}
    response = client.post("/api/v1/predict", json=payload)
    assert response.status_code == 422


def test_predict_wrong_feature_count():
    payload = {"features": [1.0, 2.0]}  # expects 5, got 2
    response = client.post("/api/v1/predict", json=payload)
    assert response.status_code == 400
    assert "features" in response.json()["detail"].lower()


def test_predict_invalid_json():
    response = client.post(
        "/api/v1/predict",
        content="this is not json",
        headers={"Content-Type": "application/json"}
    )
    assert response.status_code == 422

Mocking ML Models in Tests

Sometimes you don't want tests to depend on an actual model file. Mocking replaces the real model with a fake that returns predictable results.

Why Mock?

Reason	Explanation
Speed	Loading a large model takes seconds — mocking is instant
Isolation	Test the API logic without depending on model accuracy
Determinism	Mocked predictions are always the same — no randomness
CI/CD	No need to store large model files in your test pipeline

Mocking with unittest.mock

# tests/unit/test_with_mock.py

from unittest.mock import MagicMock, patch
import numpy as np


def test_predict_endpoint_with_mocked_model(client):
    mock_model = MagicMock()
    mock_model.predict.return_value = np.array([1])
    mock_model.predict_proba.return_value = np.array([[0.15, 0.85]])

    with patch("app.ml.model_service.model", mock_model):
        response = client.post(
            "/api/v1/predict",
            json={"features": [5.1, 3.5, 1.4, 0.2, 2.3]}
        )

    assert response.status_code == 200
    data = response.json()
    assert data["prediction"] == 1
    assert data["confidence"] == 0.85
    mock_model.predict.assert_called_once()


def test_predict_handles_model_exception(client):
    mock_model = MagicMock()
    mock_model.predict.side_effect = RuntimeError("Model crashed")

    with patch("app.ml.model_service.model", mock_model):
        response = client.post(
            "/api/v1/predict",
            json={"features": [5.1, 3.5, 1.4, 0.2, 2.3]}
        )

    assert response.status_code == 500
    assert "error" in response.json()["detail"].lower()

When to Mock vs. When to Use the Real Model

Mock when testing API routing, validation, error handling, response format
Use the real model when testing prediction accuracy, model performance, edge-case behavior

Testing Data Validation

Pydantic schemas are your first line of defense. Test them thoroughly:

# tests/unit/test_schemas.py

import pytest
from pydantic import ValidationError
from app.schemas import PredictionRequest, PredictionResponse


class TestPredictionRequest:

    def test_valid_request(self):
        req = PredictionRequest(features=[1.0, 2.0, 3.0, 4.0, 5.0])
        assert len(req.features) == 5

    def test_rejects_empty_features(self):
        with pytest.raises(ValidationError):
            PredictionRequest(features=[])

    def test_rejects_too_few_features(self):
        with pytest.raises(ValidationError):
            PredictionRequest(features=[1.0, 2.0])

    def test_rejects_string_features(self):
        with pytest.raises(ValidationError):
            PredictionRequest(features=["a", "b", "c", "d", "e"])

    def test_accepts_integer_features(self):
        req = PredictionRequest(features=[1, 2, 3, 4, 5])
        assert all(isinstance(f, float) for f in req.features)

    def test_rejects_none_values(self):
        with pytest.raises(ValidationError):
            PredictionRequest(features=[1.0, None, 3.0, 4.0, 5.0])


class TestPredictionResponse:

    def test_valid_response(self):
        resp = PredictionResponse(
            prediction=1,
            confidence=0.95,
            model_version="1.0.0"
        )
        assert resp.prediction == 1

    def test_confidence_in_range(self):
        with pytest.raises(ValidationError):
            PredictionResponse(
                prediction=1,
                confidence=1.5,  # > 1.0
                model_version="1.0.0"
            )

Testing Edge Cases

Edge cases are inputs at the boundaries of what your system can handle. For AI systems, these are especially tricky:

# tests/unit/test_edge_cases.py

import pytest
import numpy as np


class TestEdgeCases:

    def test_empty_input(self, client):
        response = client.post("/api/v1/predict", json={"features": []})
        assert response.status_code in [400, 422]

    def test_null_input(self, client):
        response = client.post("/api/v1/predict", json={"features": None})
        assert response.status_code == 422

    def test_missing_body(self, client):
        response = client.post("/api/v1/predict")
        assert response.status_code == 422

    def test_extremely_large_values(self, client):
        payload = {"features": [1e308, 1e308, 1e308, 1e308, 1e308]}
        response = client.post("/api/v1/predict", json=payload)
        assert response.status_code in [200, 400]

    def test_nan_values(self, client):
        payload = {"features": [float("nan"), 1.0, 2.0, 3.0, 4.0]}
        response = client.post("/api/v1/predict", json=payload)
        assert response.status_code == 400

    def test_infinity_values(self, client):
        payload = {"features": [float("inf"), 1.0, 2.0, 3.0, 4.0]}
        response = client.post("/api/v1/predict", json=payload)
        assert response.status_code == 400

    def test_negative_values(self, client):
        payload = {"features": [-100.0, -50.0, -25.0, -10.0, -1.0]}
        response = client.post("/api/v1/predict", json=payload)
        assert response.status_code == 200

    def test_all_zeros(self, client):
        payload = {"features": [0.0, 0.0, 0.0, 0.0, 0.0]}
        response = client.post("/api/v1/predict", json=payload)
        assert response.status_code == 200

    def test_very_long_feature_list(self, client):
        payload = {"features": [1.0] * 1000}
        response = client.post("/api/v1/predict", json=payload)
        assert response.status_code in [400, 422]

    def test_concurrent_requests(self, client):
        """Verify the API handles rapid sequential requests."""
        payload = {"features": [5.1, 3.5, 1.4, 0.2, 2.3]}
        responses = [
            client.post("/api/v1/predict", json=payload)
            for _ in range(20)
        ]
        assert all(r.status_code == 200 for r in responses)

Don't Forget These Edge Cases

NaN and Infinity: NumPy operations can silently produce NaN — make sure your API detects and rejects them
Type coercion: Pydantic may silently convert "5" to 5.0 — decide if that's acceptable
Unicode input: What if someone sends "features": ["café", "résumé"]?

Test Coverage

Test coverage measures what percentage of your code is exercised by tests. It's not a quality metric — 100% coverage doesn't mean your tests are good — but low coverage is a red flag.

Using pytest-cov

pip install pytest-cov

pytest --cov=app --cov-report=term-missing -v

Example output:

---------- coverage: platform linux, python 3.11 ----------
Name                      Stmts   Miss  Cover   Missing
---------------------------------------------------------
app/__init__.py               0      0   100%
app/main.py                  25      2    92%   41-42
app/ml/model_service.py      18      0   100%
app/schemas.py               15      0   100%
---------------------------------------------------------
TOTAL                        58      2    97%

Generate an HTML Report

pytest --cov=app --cov-report=html
# Open htmlcov/index.html in your browser

Coverage Thresholds in CI

pytest --cov=app --cov-fail-under=80

This command fails if coverage drops below 80% — perfect for CI pipelines.

Coverage Level	Interpretation
< 50%	🔴 Dangerously low — major gaps in testing
50-70%	🟡 Acceptable for early stages
70-85%	🟢 Good — most critical paths covered
85-95%	🟢 Very good — high confidence
> 95%	🔵 Excellent — but beware of diminishing returns

CI/CD Integration of Tests

Tests are only useful if they run automatically on every code change. Here's how to integrate pytest into your CI/CD pipeline.

GitHub Actions Example

# .github/workflows/test.yml

name: Run Tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest pytest-cov httpx

      - name: Run unit tests
        run: pytest tests/unit -v --cov=app --cov-report=xml

      - name: Run integration tests
        run: pytest tests/integration -v

      - name: Check coverage threshold
        run: pytest --cov=app --cov-fail-under=80

      - name: Upload coverage report
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage.xml

Test Pipeline Flow

View CI/CD Pipeline

Best Practices Summary

Practice	Description
Test naming	Use descriptive names: `test_predict_rejects_nan_values`
AAA pattern	Arrange → Act → Assert in every test
One assertion per concept	Each test verifies one behavior (multiple `assert` is OK if they test the same thing)
Don't test the framework	Don't test that Pydantic validates — test YOUR validation rules
Test behavior, not implementation	Test what the function does, not how it does it
Keep tests fast	Mock heavy dependencies, use fixtures, avoid file I/O
Use fixtures for setup	Don't repeat setup code in every test
Test the sad path	Invalid inputs, errors, and edge cases matter more than happy path

Key Takeaways

AI systems need more testing, not less — silent failures, non-determinism, and data dependency make them fragile
Follow the testing pyramid: many unit tests, fewer integration tests, minimal E2E tests
pytest is the standard — master fixtures, parametrize, and markers
Use TestClient for fast, reliable API testing without starting a server
Mock the model when testing API logic; use the real model when testing predictions
Test edge cases aggressively: NaN, infinity, empty input, wrong types, extreme values
Measure coverage but don't obsess over 100% — focus on critical paths
Integrate tests into CI/CD so they run automatically on every push

Why Test AI Systems?​

The Weather Forecast Analogy​

Key Challenges of Testing AI Systems​

The Testing Pyramid for AI​

Layer Details​

Pytest Fundamentals​

Your First Test​

Pytest Fixtures​

Parametrize — Testing Multiple Inputs​

Markers — Categorizing Tests​

conftest.py — The Test Configuration Hub​

Testing API Endpoints​

Basic Endpoint Testing​

Testing Error Responses​

Mocking ML Models in Tests​

Why Mock?​

Mocking with unittest.mock​

Testing Data Validation​

Testing Edge Cases​

Test Coverage​

Using pytest-cov​

Generate an HTML Report​

Coverage Thresholds in CI​

CI/CD Integration of Tests​

GitHub Actions Example​

Test Pipeline Flow​

Best Practices Summary​

Key Takeaways​

Additional Resources​