
Troubleshooting - API Development

Troubleshooting Reference

How to Use This Guide

Each issue follows the same structure:

Section      Description
Symptom      What you see (error message, behavior)
Cause        Why it happens
Solution     Step-by-step fix
Prevention   How to avoid it in the future

Issue 1: Model Loading Errors

FileNotFoundError: Model file not found

Symptom:

FileNotFoundError: [Errno 2] No such file or directory: 'models/model_v1.joblib'

The API starts but immediately crashes or enters degraded mode.

Cause:

The model file path is relative to the current working directory (where you run uvicorn or python), not relative to the Python file. If you start the server from a different directory, the path breaks.
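To see the failure mode directly, compare the two ways of resolving the path (a minimal sketch):

```python
from pathlib import Path

# Relative paths resolve against the current working directory:
cwd_based = Path("models/model_v1.joblib").resolve()

# Anchoring on __file__ resolves against this script's own location:
script_based = Path(__file__).resolve().parent / "models" / "model_v1.joblib"

print(cwd_based)
print(script_based)
```

The two results differ whenever the server is started from a directory other than the script's own.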

Solution:

Use an absolute path, or a path resolved relative to the script:

from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent
MODEL_PATH = BASE_DIR / "models" / "model_v1.joblib"

ml_service.load_model(str(MODEL_PATH))

Prevention:

  • Always use pathlib.Path with __file__ to build paths
  • Set the model path via an environment variable: MODEL_PATH=./models/model_v1.joblib
  • Log the resolved path at startup for debugging
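Combining these bullets, a startup sketch (the MODEL_PATH variable name mirrors the bullet above; the fallback path is this guide's example):

```python
import os
from pathlib import Path

# Environment variable wins; otherwise fall back to a path anchored on __file__.
BASE_DIR = Path(__file__).resolve().parent
DEFAULT_PATH = BASE_DIR / "models" / "model_v1.joblib"
MODEL_PATH = Path(os.environ.get("MODEL_PATH", DEFAULT_PATH)).resolve()

# Log the resolved path at startup so a wrong working directory is obvious.
print(f"Loading model from: {MODEL_PATH}")
```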

ModuleNotFoundError: No module named 'sklearn'

Symptom:

ModuleNotFoundError: No module named 'sklearn'

Happens when loading a model serialized with scikit-learn.

Cause:

The environment where you run the API doesn't have scikit-learn installed, or has a different version than the one used to train the model.

Solution:

pip install scikit-learn

If version mismatch:

pip install scikit-learn==1.3.2  # match training environment version

Prevention:

  • Pin exact versions in requirements.txt
  • Use the same virtual environment for training and serving
  • Consider ONNX format for framework-independent serialization

UnpicklingError or ValueError when loading model

Symptom:

_pickle.UnpicklingError: invalid load key, '\x00'
ValueError: unsupported pickle protocol: 5

Cause:

  • Model was serialized with a different Python/scikit-learn version
  • File is corrupted or truncated
  • Wrong file (not a valid pickle/joblib file)

Solution:

  1. Verify the file is a valid joblib file:

import joblib
model = joblib.load("models/model_v1.joblib")
print(type(model))

  2. Check Python version compatibility:

python --version  # must match training environment

  3. Re-serialize the model if versions don't match.

Prevention:

  • Document Python and scikit-learn versions alongside each model file
  • Use a model registry that tracks metadata
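A lightweight version of the first bullet is a sidecar metadata file written at training time (a sketch; the field names are illustrative, and sklearn_version would come from sklearn.__version__ in the training environment):

```python
import json
import platform

metadata = {
    "model_file": "model_v1.joblib",
    "python_version": platform.python_version(),
    "sklearn_version": "1.3.2",  # placeholder: record the real version at training time
}

# Write this next to the model file, e.g. model_v1.meta.json
print(json.dumps(metadata, indent=2))
```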

Issue 2: CORS Errors

Access to fetch has been blocked by CORS policy

Symptom:

Browser console shows:

Access to fetch at 'http://localhost:8000/api/v1/predict'
from origin 'http://localhost:3000' has been blocked by CORS policy:
No 'Access-Control-Allow-Origin' header is present on the requested resource.

The API works fine with curl but fails from a browser.

Cause:

Browsers enforce the Same-Origin Policy. When your frontend (localhost:3000) calls your API (localhost:8000), the browser blocks the request unless the API explicitly allows cross-origin requests.

Solution — FastAPI:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)

Solution — Flask:

from flask_cors import CORS

CORS(app, origins=["http://localhost:3000"])

Prevention:

  • Always configure CORS at the beginning of your project
  • Test from a browser early, not just curl
  • Never use allow_origins=["*"] in production

CORS Preflight (OPTIONS) Fails

Symptom:

You see an OPTIONS request with a 405 or 500 error in the browser network tab, followed by the real request never being sent.

Cause:

The browser sends a preflight OPTIONS request before POST requests with custom headers. If your server doesn't handle OPTIONS, the preflight fails and the actual request is blocked.

Solution:

The CORS middleware handles this automatically. Make sure it's added before your routes:

# FastAPI — add middleware first
app.add_middleware(CORSMiddleware, ...)

# Then define routes
@app.post("/predict")
def predict():
    ...

Issue 3: Validation Errors (422)

FastAPI Returns 422 for Seemingly Valid Data

Symptom:

{
  "detail": [
    {
      "loc": ["body", "age"],
      "msg": "value is not a valid integer",
      "type": "type_error.integer"
    }
  ]
}

But you're sending "age": "35" which looks correct.

Cause:

Pydantic in strict mode does not coerce strings to integers. "35" (string) is not the same as 35 (integer) in JSON.

Solution:

Send proper JSON types:

# Wrong — age is a string
curl -d '{"age": "35", ...}'

# Correct — age is an integer
curl -d '{"age": 35, ...}'

Or, in Pydantic v2, enable coercion (lax mode):

class PredictionInput(BaseModel):
    model_config = {"strict": False}  # lax mode coerces "35" to 35
    age: int = Field(...)

Prevention:

  • Always validate your JSON payloads (use a JSON linter)
  • Document expected types clearly in your API docs
  • Test with the Swagger UI which enforces correct types
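The first bullet can be automated with a tiny stdlib check before sending the request (a sketch for this guide's example schema; check_types is a hypothetical helper):

```python
import json

def check_types(payload):
    """Return a list of type problems for the example schema (age: int)."""
    data = json.loads(payload)
    problems = []
    age = data.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append('age must be a JSON integer, e.g. 35, not "35"')
    return problems

print(check_types('{"age": "35"}'))  # one problem: age is a string
print(check_types('{"age": 35}'))   # no problems
```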

Missing Content-Type Header

Symptom:

Flask returns None from request.get_json(), or FastAPI returns a 422 error.

Cause:

The client didn't set Content-Type: application/json.

Solution:

Always include the header:

curl -X POST http://localhost:8000/api/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"age": 35, ...}'
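From Python, the same header can be attached with the stdlib (a sketch; the request is only built here, not sent):

```python
import json
import urllib.request

payload = json.dumps({"age": 35}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8000/api/v1/predict",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; inspect the header first.
# Note: urllib normalizes header names to "Content-type" internally.
print(req.get_header("Content-type"))
```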

Issue 4: Memory Leaks and High Memory Usage

Memory Grows Over Time

Symptom:

API memory usage (RSS) increases steadily over hours/days until the process is killed by the OS or container runtime.

Cause:

Common causes in ML APIs:

  1. Accumulating predictions in memory (logging lists that never get cleared)
  2. Creating new model instances per request instead of reusing
  3. Large temporary arrays not being garbage collected
  4. Circular references in custom objects

Solution:

  1. Ensure the model is loaded once and reused:

# Bad — loads model every request
@app.post("/predict")
def predict():
    model = joblib.load("model.joblib")  # memory leak!
    ...

# Good — load once, reuse
ml_service = MLService()
ml_service.load_model("model.joblib")

@app.post("/predict")
def predict():
    result = ml_service.predict(...)  # reuses loaded model
    ...

  2. Don't accumulate data in global lists:

# Bad
prediction_log = []

@app.post("/predict")
def predict():
    prediction_log.append(result)  # grows forever!

  3. Monitor memory:

import psutil
import os

@app.get("/debug/memory")
def memory():
    process = psutil.Process(os.getpid())
    return {"memory_mb": process.memory_info().rss / 1024 / 1024}

Prevention:

  • Monitor memory usage in production (Prometheus, CloudWatch)
  • Set memory limits in your container/process manager
  • Use a dedicated logging service instead of in-memory lists
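If some in-process logging is genuinely needed, a bounded buffer avoids the grows-forever trap (a sketch; the maxlen of 1000 is an arbitrary cap):

```python
from collections import deque

# A deque with maxlen silently discards the oldest entries, so memory
# stays bounded no matter how many predictions are served.
prediction_log = deque(maxlen=1000)

for i in range(5000):
    prediction_log.append({"id": i})

print(len(prediction_log))  # capped at 1000; only the most recent survive
```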

Issue 5: Slow Predictions

High Latency on Prediction Endpoint

Symptom:

Predictions take 500ms–5s instead of the expected 10–50ms.

Cause:

Typical culprits: the model is reloaded on every request, CPU-bound inference blocks the event loop in an async handler, preprocessing is slow, or the model itself is too heavy.

Solution:

  1. Diagnose — add timing to your endpoint:

import time

@app.post("/predict")
def predict(data: PredictionInput):
    t0 = time.perf_counter()

    t1 = time.perf_counter()
    features = preprocess(data)
    preprocess_ms = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    result = model.predict(features)
    predict_ms = (time.perf_counter() - t2) * 1000

    total_ms = (time.perf_counter() - t0) * 1000

    return {
        "result": result,
        "timing": {
            "preprocess_ms": preprocess_ms,
            "predict_ms": predict_ms,
            "total_ms": total_ms,
        },
    }

  2. Model loading: Load once at startup (see Issue 4)

  3. Async blocking: Use def (sync) for CPU-bound inference in FastAPI:

# Wrong — blocks the event loop
@app.post("/predict")
async def predict(data: PredictionInput):
    result = model.predict(...)  # CPU-bound, blocking!

# Right — FastAPI runs sync endpoints in a thread pool
@app.post("/predict")
def predict(data: PredictionInput):
    result = model.predict(...)  # runs in a worker thread

  4. Model optimization: Consider lighter models (decision tree vs. large ensemble)

Prevention:

  • Add response time headers (X-Response-Time-Ms)
  • Set latency budgets (e.g., p95 < 100ms)
  • Profile before optimizing
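Before reaching for a full profiler, per-call timing can be captured with a plain decorator (a framework-agnostic sketch; timed and the toy predict are illustrative names):

```python
import time
from functools import wraps

def timed(fn):
    """Attach the call's wall time in milliseconds to the returned dict."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        result["elapsed_ms"] = (time.perf_counter() - t0) * 1000
        return result
    return wrapper

@timed
def predict(age):
    # stand-in for real preprocessing + inference
    return {"prediction": age >= 18}

print(predict(35))
```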

Issue 6: 422 Errors with Nested/Complex Inputs

Pydantic Fails on Nested Objects

Symptom:

{
  "detail": [{"loc": ["body"], "msg": "value is not a valid dict"}]
}

Cause:

Client sends data in an unexpected format (e.g., form-encoded instead of JSON, or wrapping data in an extra layer).

Solution:

Verify what the client actually sends:

@app.post("/debug")
async def debug(request: Request):
    body = await request.body()
    return {
        "content_type": request.headers.get("content-type"),
        "body_raw": body.decode(),
        "body_size": len(body),
    }

Common fixes:

  • Ensure Content-Type: application/json
  • Don't double-wrap: {"data": {"age": 35}} when the schema expects {"age": 35}
  • Check for BOM characters in the request body
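The BOM case is easy to reproduce and fix with the stdlib: the utf-8-sig codec decodes normal UTF-8 but also drops a leading BOM (a minimal sketch):

```python
import json

# A UTF-8 BOM prepended by some editors and Windows clients:
body = b'\xef\xbb\xbf{"age": 35}'

# json.loads on body.decode("utf-8") fails because the BOM survives;
# "utf-8-sig" strips it when present and is harmless otherwise.
data = json.loads(body.decode("utf-8-sig"))
print(data)
```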

Issue 7: Deployment Issues

uvicorn Refuses Connections from Other Machines

Symptom:

API works on localhost but not when accessed from another machine or container.

Cause:

uvicorn binds to 127.0.0.1 (localhost only) by default.

Solution:

Bind to all interfaces:

uvicorn app.main:app --host 0.0.0.0 --port 8000

OSError: [Errno 98] Address already in use

Symptom:

Can't start the server because the port is occupied.

Solution:

# Find the process using the port
# Linux/macOS
lsof -i :8000

# Windows
netstat -ano | findstr :8000

# Kill it
kill <PID> # Linux/macOS
taskkill /PID <PID> /F # Windows

Or use a different port:

uvicorn app.main:app --port 8001
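The check can also be scripted; attempting to bind the port is a reliable availability test (stdlib sketch; port_is_free is a hypothetical helper):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Try to bind the port; success means no other process holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# Reports False while a server occupies the port on this host.
print(port_is_free(8000))
```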

Multiple Workers and Model Loading

Symptom:

When running with multiple workers (uvicorn --workers 4), each worker loads the model separately, causing high memory usage.

Cause:

Each uvicorn worker is a separate process. The model is loaded in each one.

Solution:

For small models, this is acceptable. For large models:

  1. Use fewer workers
  2. Use a model server (TensorFlow Serving, Triton)
  3. Use shared memory or memory-mapped files

# 4 workers = 4x model memory
uvicorn app.main:app --workers 4

# Consider: is 1 worker with async enough?
uvicorn app.main:app --workers 1

Quick Reference: Error Code → Fix

Error Code   Common Cause          Quick Fix
400          Malformed JSON        Check JSON syntax, add Content-Type: application/json
404          Wrong URL path        Verify endpoint URL, check for typos
405          Wrong HTTP method     Use POST not GET for /predict
422          Validation failure    Check data types match schema, verify required fields
500          Unhandled exception   Check server logs, add try/except in route handler
503          Model not loaded      Verify model file path, check startup logs

Debugging Checklist

When your API doesn't work, follow this systematic checklist:

  1. Check the server logs — the error message is usually there
  2. Verify the endpoint URL: http://, port number, path
  3. Check the HTTP method: POST /predict, not GET /predict
  4. Verify the Content-Type header: application/json
  5. Validate your JSON — use a JSON validator/linter
  6. Test with curl first — eliminates browser/CORS issues
  7. Check the model file — does it exist at the expected path?
  8. Check dependencies: pip list | grep scikit-learn
  9. Try the Swagger UI (/docs in FastAPI)
  10. Read the full error trace — scroll up in the terminal

When All Else Fails

Add a debug endpoint that returns the raw request information:

@app.post("/debug")
async def debug(request: Request):
    body = await request.body()
    return {
        "method": request.method,
        "url": str(request.url),
        "headers": dict(request.headers),
        "body": body.decode("utf-8", errors="replace"),
    }

This tells you exactly what the server receives, eliminating guesswork.