Infrastructure Planning for AI

Theory 45 min

Why Infrastructure Matters

The Foundation Analogy

Infrastructure for AI deployment is like the foundation of a building. The most beautiful architecture is useless if the foundation is weak. Similarly, the most accurate model is worthless if it can't run reliably in production.


Python Virtual Environments

The Problem: Dependency Hell

Imagine you have two projects:

  • Project A requires scikit-learn==1.2.0
  • Project B requires scikit-learn==1.4.0

If both use your system Python, installing one version breaks the other. This is called dependency hell.

The Solution: Virtual Environments

A virtual environment is an isolated Python installation. Each project gets its own set of packages without interfering with others.

venv — The Built-in Option

venv comes with Python and is the simplest option:

# Create a virtual environment
python -m venv .venv

# Activate it (Windows)
.venv\Scripts\activate

# Activate it (macOS/Linux)
source .venv/bin/activate

# Your terminal shows the active environment
(.venv) $ python --version
Python 3.11.5

# Install packages in isolation
(.venv) $ pip install scikit-learn pandas fastapi

# Deactivate when done
(.venv) $ deactivate
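If you are ever unsure whether a virtual environment is actually active, Python itself can tell you. A quick standard-library sanity check (the helper name is ours, not part of `venv`):

```python
import sys

# Inside an active virtual environment, sys.prefix points at the
# environment's directory, while sys.base_prefix still points at the
# system Python the environment was created from.
def in_virtualenv() -> bool:
    return sys.prefix != sys.base_prefix

print("virtualenv active:", in_virtualenv())
```

Run it with the environment activated and again after `deactivate` to see the difference.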

conda — The Data Science Option

Conda is a package manager popular in data science. It manages both Python packages and system-level dependencies (like CUDA for GPUs).

# Create a conda environment
conda create -n ml-project python=3.11

# Activate it
conda activate ml-project

# Install packages (can mix conda and pip)
conda install scikit-learn pandas
pip install fastapi

# Export environment
conda env export > environment.yml

# Recreate from file
conda env create -f environment.yml

venv vs conda

| Feature | venv | conda |
|---|---|---|
| Installation | Built-in (Python 3.3+) | Requires Anaconda/Miniconda |
| Package source | PyPI only | Conda channels + PyPI |
| Non-Python deps | Cannot manage | Can manage (CUDA, C libs) |
| Speed | Fast | Slower (dependency solving) |
| Reproducibility | requirements.txt | environment.yml |
| Disk space | Lightweight | Heavier |
| Best for | Web apps, APIs, CI/CD | Data science, GPU projects |

Recommendation for This Course

We use venv + pip throughout this course. It's simpler, faster, and sufficient for our API-focused deployment workflow. Use conda if you need GPU support or complex scientific libraries.


Dependency Management

requirements.txt — Pinning Versions

A requirements.txt file lists all your project's dependencies with pinned versions for reproducibility:

# Core ML
scikit-learn==1.4.2
pandas==2.2.0
numpy==1.26.4
joblib==1.3.2

# API Framework
fastapi==0.109.0
uvicorn==0.27.0
pydantic==2.5.3

# Testing
pytest==8.0.0
httpx==0.26.0

# Explainability
shap==0.44.1
lime==0.2.0.1

Always Pin Versions

Never use pip install scikit-learn without a version in your requirements file. An unpinned dependency means your project might break tomorrow if a new version is released.
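As a rough illustration, a few lines of Python can flag requirement lines that lack an exact pin. The helper and its heuristic are ours (not a pip feature), and it treats anything without `==` — including range specifiers like `>=` — as unpinned:

```python
def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' pin."""
    bad = []
    for line in requirements_text.splitlines():
        line = line.strip()
        # skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        if "==" not in line:
            bad.append(line)
    return bad

reqs = """\
# Core ML
scikit-learn==1.4.2
pandas
fastapi>=0.109.0
"""
print(unpinned(reqs))  # → ['pandas', 'fastapi>=0.109.0']
```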

Generating requirements.txt

# Option 1: Freeze all installed packages
pip freeze > requirements.txt

# Option 2: Use pipreqs (only project imports)
pip install pipreqs
pipreqs . --force

# Install from requirements
pip install -r requirements.txt

The Lock File Pattern

For stricter reproducibility, modern tools create lock files that pin every sub-dependency:

| Tool | Config File | Lock File |
|---|---|---|
| pip | requirements.txt | requirements.txt (manually) |
| pip-tools | requirements.in | requirements.txt (compiled) |
| Poetry | pyproject.toml | poetry.lock |
| Pipenv | Pipfile | Pipfile.lock |

# Using pip-tools for better dependency management
pip install pip-tools

# Write your direct dependencies in requirements.in
# Then compile the full locked file:
pip-compile requirements.in --output-file requirements.txt

Docker Basics for ML

What is Docker?

Docker packages your application, its dependencies, and the operating system into a single container — a lightweight, portable, self-sufficient unit.

The Shipping Container Analogy

Before standardized shipping containers, every port had different cranes, trucks, and warehouses. Shipping was chaotic and slow. The standardized container revolutionized global trade.

Docker does the same for software:

| Shipping Container | Docker Container |
|---|---|
| Standard size fits any ship/truck/crane | Runs on any machine with Docker |
| Contents are isolated and sealed | App is isolated from host system |
| Stackable and composable | Multiple containers work together |
| Reusable across the world | Same image runs dev/staging/prod |

Dockerfile for an ML Project

A Dockerfile is a recipe for building a container image:

# Start from a Python base image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Copy and install dependencies first (better caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Expose the API port
EXPOSE 8000

# Start the FastAPI server
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Key Docker Commands

# Build an image
docker build -t my-ml-api:v1.0 .

# Run a container
docker run -p 8000:8000 my-ml-api:v1.0

# Run in background
docker run -d -p 8000:8000 --name ml-api my-ml-api:v1.0

# Check running containers
docker ps

# View logs
docker logs ml-api

# Stop container
docker stop ml-api

Docker Layer Caching

Docker builds images in layers. Each instruction in the Dockerfile creates a layer. If a layer hasn't changed, Docker reuses the cached version.

Optimization Tip

Always copy requirements.txt and install dependencies before copying your code. This way, Docker only reinstalls packages when dependencies actually change, not when you edit a Python file.

.dockerignore

Just like .gitignore, a .dockerignore file excludes unnecessary files from the Docker build context:

__pycache__
*.pyc
.git
.venv
.env
*.ipynb_checkpoints
data/raw/
notebooks/
.pytest_cache

GPU vs CPU Considerations

When Do You Need a GPU?

GPUs accelerate the massively parallel matrix operations behind deep learning. Classical ML on tabular data (scikit-learn, XGBoost) usually runs comfortably on CPU, so plan for a GPU only when you train deep neural networks or must serve them under tight latency constraints.

Cost Comparison

| Instance Type | vCPUs | RAM | GPU | Price/hour (approx.) | Use Case |
|---|---|---|---|---|---|
| t3.medium | 2 | 4 GB | None | $0.04 | Simple sklearn models |
| c5.xlarge | 4 | 8 GB | None | $0.17 | XGBoost, feature-heavy models |
| g4dn.xlarge | 4 | 16 GB | 1x T4 | $0.53 | PyTorch inference |
| p3.2xlarge | 8 | 61 GB | 1x V100 | $3.06 | Training deep learning models |
| p4d.24xlarge | 96 | 1152 GB | 8x A100 | $32.77 | Large Language Models |

Cost Alert

A GPU instance can cost 10-100x more than a CPU instance. Always start with CPU and only upgrade to GPU if latency requirements demand it. For this course, CPU instances are sufficient.
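To make the gap concrete, here is the back-of-envelope arithmetic for running 24/7 at the approximate hourly prices in the table above (~730 hours per month):

```python
HOURS_PER_MONTH = 730  # ~24/7 operation

# Approximate on-demand prices ($/hour) from the cost table above
instances = {
    "t3.medium (CPU)": 0.04,
    "g4dn.xlarge (GPU)": 0.53,
    "p3.2xlarge (GPU)": 3.06,
}

for name, price in instances.items():
    print(f"{name}: ~${price * HOURS_PER_MONTH:,.0f}/month")
# t3.medium lands around $29/month; p3.2xlarge around $2,234/month —
# roughly a 75x difference for always-on workloads.
```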

Training vs Inference

| Phase | Compute Needs | Duration | Cost Strategy |
|---|---|---|---|
| Training | High (GPU often) | Hours to days | Use spot instances (60-90% savings) |
| Inference | Lower (CPU often OK) | Continuous | Use reserved instances or serverless |
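The same arithmetic shows why spot instances matter for training. The 10-hour run and the 70% discount below are hypothetical (chosen inside the 60-90% range cited above):

```python
ON_DEMAND = 3.06      # $/hour, p3.2xlarge from the cost table
HOURS = 10            # hypothetical length of one training run
SPOT_DISCOUNT = 0.70  # assumed spot discount (60-90% is typical)

on_demand_cost = ON_DEMAND * HOURS
spot_cost = on_demand_cost * (1 - SPOT_DISCOUNT)
print(f"on-demand: ${on_demand_cost:.2f}  spot: ${spot_cost:.2f}")
# → on-demand: $30.60  spot: $9.18
```

Spot capacity can be reclaimed by the provider mid-run, so this only pays off if your training script checkpoints and can resume.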

Cloud Services for ML

The Big Three

Cloud Services Comparison

| Feature | AWS SageMaker | GCP Vertex AI | Azure ML |
|---|---|---|---|
| Notebooks | SageMaker Studio | Vertex Workbench | Azure ML Studio |
| Training | Training Jobs | Custom Training | Training Pipelines |
| Deployment | Endpoints | Endpoints | Managed Endpoints |
| AutoML | Autopilot | AutoML | AutoML |
| MLOps | Pipelines | Pipelines | Designer + Pipelines |
| Containers | ECR + ECS/EKS | GCR + GKE/Cloud Run | ACR + ACI/AKS |
| Serverless | Lambda | Cloud Functions | Azure Functions |
| Pricing | Pay-as-you-go | Pay-as-you-go | Pay-as-you-go |

Simpler Deployment Options

For college projects and small services, you don't need the full power of SageMaker or Vertex AI:

| Platform | Best For | Free Tier | Complexity |
|---|---|---|---|
| Render | Simple API hosting | 750 hours/month | ⭐ Very Low |
| Railway | Python apps + DB | $5 credit/month | ⭐ Very Low |
| Fly.io | Docker containers | 3 shared VMs | ⭐⭐ Low |
| AWS Lambda | Serverless functions | 1M requests/month | ⭐⭐ Low |
| Google Cloud Run | Container-based APIs | 2M requests/month | ⭐⭐ Low |
| Heroku | Full-stack apps | Eco plan $5/month | ⭐⭐ Low |

For This Course

We'll use local development (FastAPI + uvicorn) for most labs. For the final project, you may optionally deploy to a cloud platform.


CI/CD Basics for ML

What is CI/CD?

CI/CD stands for Continuous Integration / Continuous Deployment. It automates the process of testing and deploying code changes.


The Assembly Line Analogy

CI/CD is like a car assembly line:

  • CI = Quality checks at every station (unit tests, linting, building)
  • CD = The car rolls off the line and drives to the dealership (deployment)

Without CI/CD, it's like hand-building each car and manually driving it to the customer.

CI/CD for ML — What's Different?

Traditional CI/CD tests code. ML CI/CD must also test data and models:

| Traditional CI/CD | ML CI/CD |
|---|---|
| Unit tests pass? | Unit tests pass? |
| Code compiles? | Code compiles? |
| — | Data validation passes? |
| — | Model metrics above threshold? |
| — | No data drift detected? |
| — | Model size within limits? |
| Deploy application | Deploy model + application |

Example: GitHub Actions for ML

name: ML Pipeline
on:
  push:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run tests
        run: pytest tests/ -v

      - name: Check model metrics
        run: python scripts/validate_model.py

      - name: Build Docker image
        run: docker build -t ml-api:latest .
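The "Check model metrics" step above calls a validation script. A minimal sketch of what `scripts/validate_model.py` might contain — the metric names, thresholds, and the idea of reading them from an evaluation artifact are assumptions for illustration, not the course's actual script:

```python
import sys

# Assumed quality gates for this sketch — adjust to your project's targets
THRESHOLDS = {"accuracy": 0.85, "f1": 0.80}

def validate(metrics: dict) -> list[str]:
    """Return human-readable failures; an empty list means all gates pass."""
    failures = []
    for name, minimum in THRESHOLDS.items():
        value = metrics.get(name, 0.0)
        if value < minimum:
            failures.append(f"{name}={value:.3f} below threshold {minimum}")
    return failures

# In CI these numbers would be read from an evaluation artifact,
# e.g. a metrics file written by the training script.
failures = validate({"accuracy": 0.91, "f1": 0.88})
if failures:
    print("\n".join(failures))
    sys.exit(1)  # a non-zero exit code makes the GitHub Actions step fail
print("All metrics above thresholds.")
```

The key design point is the exit code: CI systems only know a step failed because the process returned non-zero.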

Environment Reproducibility

The Reproducibility Pyramid

Minimum Reproducibility Checklist

| File | Purpose | Required? |
|---|---|---|
| requirements.txt | Python dependencies with versions | ✅ Yes |
| Dockerfile | Complete environment definition | ✅ Yes (for deployment) |
| .dockerignore | Exclude unnecessary files | ✅ Yes |
| .gitignore | Exclude generated files from Git | ✅ Yes |
| README.md | Setup and run instructions | ✅ Yes |
| pyproject.toml | Project metadata and tool config | Recommended |
| .env.example | Template for environment variables | Recommended |
| Makefile | Common commands shortcuts | Optional |
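The `.env.example` entry above is simply a committed template with placeholder values; teammates copy it to `.env` (which stays gitignored) and fill in real values. A hypothetical example:

```
# .env.example — copy to .env and fill in real values (.env stays gitignored)
MODEL_PATH=models/model_v1.0.0.pkl
API_PORT=8000
LOG_LEVEL=info
```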

Standard Project Structure

ml-project/
├── app/
│ ├── __init__.py
│ ├── main.py # FastAPI application
│ ├── model.py # Model loading and prediction
│ └── schemas.py # Pydantic request/response models
├── models/
│ └── model_v1.0.0.pkl # Serialized model
├── data/
│ ├── raw/ # Original data (gitignored)
│ └── processed/ # Cleaned data
├── tests/
│ ├── __init__.py
│ ├── test_api.py # API endpoint tests
│ └── test_model.py # Model prediction tests
├── notebooks/
│ └── exploration.ipynb # Data exploration (gitignored in prod)
├── scripts/
│ └── train.py # Training script
├── .gitignore
├── .dockerignore
├── Dockerfile
├── requirements.txt
├── README.md
└── pyproject.toml
This Structure is Used Throughout the Course

Every lab in this course follows this project structure. You'll build it incrementally — starting with the environment setup in TP1, adding the model in Module 2, the API in Module 3, and tests in Module 5.
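As a preview of what lands in `app/schemas.py` in Module 3, here is a minimal Pydantic sketch of the request/response contract; the field names and shapes are illustrative, not the course's final schema:

```python
from pydantic import BaseModel

class PredictRequest(BaseModel):
    # One sample's feature vector; real projects often use named fields
    features: list[float]

class PredictResponse(BaseModel):
    prediction: float
    version: str

req = PredictRequest(features=[5.1, 3.5, 1.4, 0.2])
print(req.features)  # → [5.1, 3.5, 1.4, 0.2]
```

Defining schemas like this gives the API automatic input validation and documentation for free.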


Summary

Infrastructure Decision Tree

Key Takeaways

| # | Concept | Remember |
|---|---|---|
| 1 | Virtual environments | Always isolate project dependencies |
| 2 | Pin versions | requirements.txt with exact versions |
| 3 | Docker | Package everything for reproducibility |
| 4 | CPU first | Only use GPU if deep learning demands it |
| 5 | Cloud options | Simple platforms (Render, Cloud Run) for small projects |
| 6 | CI/CD | Automate testing and deployment |
| 7 | Project structure | Follow conventions for maintainability |

Next Steps

In TP1, you'll put these concepts into practice by setting up your project environment from scratch — creating a virtual environment, installing dependencies, and building the standard project structure.