This post is a deep-dive, hands‑on guide to turning an experimental ML project idea into a production‑grade service you can deploy, scale, and observe. We will build a real‑time anomaly detection microservice for grid sensor data, expose a friendly UI, containerize both services, deploy them to Kubernetes with autoscaling, and reason about SLOs, performance, and extensibility.

The full codebase lives in this repository: https://github.com/Tomas-Kozak/mlops-series-l1

Contents

  • A FastAPI service that serves an IsolationForest model for detecting anomalies in 3 sensor features: voltage, current, frequency.
  • A Streamlit UI that simulates grid readings, sends them in batches to the backend, and visualizes anomalies and tail latencies.
  • Prometheus metrics for end‑to‑end observability (/metrics).
  • Docker images for both services; Docker Compose for local multi‑service dev.
  • Kubernetes manifests with a Horizontal Pod Autoscaler (HPA) driven by CPU, deployable to a local kind cluster.

Patterns and code you can reuse: request/response schemas, model artifact handling, latency instrumentation, container hardening, HPA configuration, and a realistic load test harness.

Architecture Overview

Data flow:

  1. Simulator emits grid readings with occasional injected anomalies.
  2. UI assembles readings into request batches and calls the /predict endpoint.
  3. Backend validates payloads (Pydantic), converts them to a feature matrix, and runs IsolationForest inference.
  4. Backend returns per‑reading anomaly flags and scores, while emitting Prometheus metrics.
  5. UI renders charts, overlays anomalies, and generates burst/sustained load to trigger autoscaling.

Services and deployment:

  • src/backend/*: containerized FastAPI app, scaled by HPA in Kubernetes.
  • src/ui/*: containerized Streamlit UI, pointed to backend via BACKEND_URL.
  • docker/compose.yaml: local multi‑service run with shared network.
  • k8s/*: Deployment + Service for both apps and CPU‑based HorizontalPodAutoscaler for backend.

Diagram: High‑Level Architecture

 +----------------+         HTTP (JSON)          +-------------------+
 |  Streamlit UI  |  ------------------------->  |  FastAPI Backend  |
 |  (src/ui)      |                              |  (src/backend)    |
 +--------+-------+                              +----+--------------+
          ^                                           |
          |                                           | Prometheus exposition
          |                                           v
          |                                     +-----+------+
 Simulation (local)                             |  /metrics  |
   └─ stream_readings()                         +------------+

In Kubernetes:

          +------------------+                     +------------------+
          |  anomaly-ui Pod  |  ---> Service --->  | anomaly-backend  |
          |  (Deployment)    |        (ClusterIP)  |   Pods (HPA)     |
          +---------+--------+                     +---------+--------+
                    |                                        |
             Port-forward                                   HPA
                    |                              (CPU utilization target)
                    v                                        v
               http://localhost:8501                    Scales 1..N pods

Diagram: Request Path and Timers

UI batch -> HTTP POST /predict -> [FastAPI]
                                     |
                                     +-- parse JSON (Pydantic)
                                     |
                                     +-- as_feature_matrix()
                                     |
                                     +-- [T_infer] IsolationForest.predict
                                     |
                                     +-- build response
                                     |
                                     `-- observe REQUEST_LATENCY (total)

Diagram: Sequence (UI burst)

UI(ThreadPool)       Kubernetes SVC        Pod A (backend)       Pod B (backend)
    |                      |                     |                      |
    |-- N requests ------->|-- distribute ------>|-- handle (A)         |
    |                      |                     |                      |
    |-- N requests ------->|-- distribute ----------------------------->|-- handle (B)
    |                      |                     |                      |
    |<- responses (A,B) ---|<--------------------|<---------------------|

Data and Modeling

Core modeling is not the focus of this post, but we’ll briefly go over it to get a sense of what we’re working with. Problem: detect anomalous grid operating points using only point‑in‑time readings for three features: voltage (V), current (A), and frequency (Hz).

We use an IsolationForest (scikit‑learn) trained on synthetically generated “normal” operating points; the model assigns higher anomaly scores to outliers (scores are inverted so that higher means more anomalous). This keeps the focus on the MLOps systems work rather than modeling intricacies, but we still make the model deterministic, reproducible, and artifacted.

Feature schema and defaults (source: src/backend/model.py:11):

DEFAULT_FEATURES = ["voltage", "current", "frequency"]

Model config and training (source: src/backend/model.py:15):

@dataclass(frozen=True)
class ModelConfig:
    feature_names: Tuple[str, ...] = tuple(DEFAULT_FEATURES)
    random_state: int = 42
    contamination: float = 0.05
    n_estimators: int = 200

def train_isolation_forest(config: ModelConfig, n_samples: int = 5000) -> IsolationForest:
    rng = np.random.default_rng(config.random_state)
    X = _generate_synthetic_normal(n_samples, rng)
    model = IsolationForest(
        n_estimators=config.n_estimators,
        contamination=config.contamination,
        random_state=config.random_state,
    )
    model.fit(X)
    return model

Theory notes:

  • IsolationForest isolates points by random splits; outliers require fewer splits to isolate, yielding larger anomaly scores.
  • contamination approximates the expected fraction of anomalies. For energy datasets with non‑stationarity, you’d calibrate this offline (ROC/PR curves) and monitor precision/recall drop‑offs over time.
  • Determinism and reproducibility come from fixed random_state and artifacting the trained model to models/iso_forest.joblib.

Advanced modeling guidance tailored to this service:

  • If your grid has diurnal/seasonal cycles, consider adding time‑of‑day or temperature context features or training per‑segment models (substations/regions). Keep feature_names ordered and versioned.
  • Track input drift via PSI/KL on voltage/current/frequency; alert when drift exceeds thresholds and kick off retraining jobs.
  • Calibrate a score threshold to a target precision/recall on validation data and make it a dynamic config value.
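
If labeled validation data is available, a minimal calibration sketch might look like the following (the validation set, the helper name, and the use of scikit‑learn's precision_recall_curve are assumptions, not part of this repo):

import numpy as np
from sklearn.metrics import precision_recall_curve

def calibrate_threshold(model, X_val, y_val, target_precision: float = 0.9) -> tuple[float, float]:
    # Hypothetical helper: pick the lowest score threshold that meets a target precision.
    # Invert score_samples so that higher means more anomalous, matching the service.
    scores = -model.score_samples(X_val)
    precision, recall, thresholds = precision_recall_curve(y_val, scores)
    ok = np.where(precision[:-1] >= target_precision)[0]
    if ok.size == 0:
        raise ValueError("target precision not reachable on validation data")
    return float(thresholds[ok[0]]), float(recall[ok[0]])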

API Design and Contracts

We model the request/response with Pydantic. The contract is explicit and versionable (good for compatibility testing and schema drift detection).

Schemas (source: src/backend/api.py:1):

class SensorReading(BaseModel):
    voltage: float
    current: float
    frequency: float
    timestamp: Optional[str] = None

class BatchPredictRequest(BaseModel):
    readings: List[SensorReading]

class Prediction(BaseModel):
    anomaly: bool
    score: float

class BatchPredictResponse(BaseModel):
    predictions: List[Prediction]
    anomalies: int
    total: int
    served_by: str | None = None

Endpoints (source: src/backend/app.py):

  • GET /health: quick readiness and instance identity.
  • POST /predict: batch inference. Returns per‑reading anomaly booleans and scores, plus metadata.
  • GET /metrics: Prometheus exposition for scraping.

Example request:

POST /predict
Content-Type: application/json

{
  "readings": [
    {"voltage": 230.1, "current": 9.8, "frequency": 50.02},
    {"voltage": 271.0, "current": 28.5, "frequency": 50.7}
  ]
}

Example response:

{
  "predictions": [
    {"anomaly": false, "score": 0.06},
    {"anomaly": true, "score": 0.79}
  ],
  "anomalies": 1,
  "total": 2,
  "served_by": "pod-xyz:12345"
}

Validation and invariants:

  • Pydantic enforces types; reject empty batches with 400.
  • as_feature_matrix ensures strict feature ordering to match the model’s training order (sketched below).
  • On model unavailability, return 503 to signal readiness gates.
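
For illustration, a minimal version of as_feature_matrix could look like this (a sketch; the repo's actual implementation may differ, and the signature is inferred from its call site):

import numpy as np

def as_feature_matrix(readings: list[dict], feature_names: tuple[str, ...]) -> np.ndarray:
    # Build an (n_readings, n_features) matrix in the exact training-time feature order.
    # A missing key raises KeyError instead of silently reordering or imputing.
    return np.array([[float(r[name]) for name in feature_names] for r in readings], dtype=np.float64)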

Backend: Serving, Metrics, and Error Semantics

Core app (abridged; source: src/backend/app.py:1):

REQUEST_COUNT = Counter("requests_total", "Total HTTP requests", ["path", "method", "status"])
REQUEST_LATENCY = Histogram("request_latency_seconds", "Request latency (s)", ["path", "method"])
INFERENCE_LATENCY = Histogram("inference_latency_seconds", "Model inference latency (s)")
ANOMALY_COUNT = Counter("anomalies_total", "Total anomalies predicted")

@app.post("/predict", response_model=BatchPredictResponse)
def predict(req: BatchPredictRequest) -> BatchPredictResponse:
    if not req.readings:
        raise HTTPException(status_code=400, detail="No readings provided")
    if model_instance is None:
        raise HTTPException(status_code=503, detail="Model not ready")

    start = time.perf_counter()
    status_label = "200"
    try:
        X = as_feature_matrix([r.model_dump() for r in req.readings], model_instance.feature_names)
        infer_t0 = time.perf_counter()
        flags, scores = model_instance.predict_batch(X)
        INFERENCE_LATENCY.observe(time.perf_counter() - infer_t0)
        preds = [Prediction(anomaly=bool(flags[i]), score=float(scores[i])) for i in range(len(req.readings))]
        anomalies = int(flags.sum())
        ANOMALY_COUNT.inc(anomalies)
        return BatchPredictResponse(predictions=preds, anomalies=anomalies, total=len(preds), served_by=INSTANCE_ID)
    except HTTPException as he:
        status_label = str(he.status_code); raise
    except Exception:
        status_label = "500"; raise
    finally:
        dt = time.perf_counter() - start
        REQUEST_LATENCY.labels(path="/predict", method="POST").observe(dt)
        REQUEST_COUNT.labels(path="/predict", method="POST", status=status_label).inc()

Notes on instrumentation and semantics:

  • We separate total request latency from model inference latency; this allows you to spot overhead from JSON parsing, data marshaling, or network.
  • We label counts by path/method/status to power RED/USE dashboards and simple SLOs (e.g., error rate < 1%).
  • A global exception handler ensures 500s are counted even on unexpected code paths.
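
A global handler in that spirit might look like this (a sketch, reusing the counter defined above; the exact handler in app.py may differ):

from fastapi import Request
from fastapi.responses import JSONResponse

@app.exception_handler(Exception)
async def unhandled_exception(request: Request, exc: Exception) -> JSONResponse:
    # Count unexpected failures and return a stable JSON body instead of a bare traceback.
    REQUEST_COUNT.labels(path=request.url.path, method=request.method, status="500").inc()
    return JSONResponse(status_code=500, content={"detail": "Internal server error"})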

Threading and CPU:

  • scikit‑learn inference is CPU‑bound. We scale using Uvicorn workers (multiprocessing) and Kubernetes replicas; the GIL isn’t the bottleneck when workers are processes.
  • Suggested env for predictable CPU usage: set OMP_NUM_THREADS=1 and MKL_NUM_THREADS=1 in production to avoid oversubscription when running multiple workers/pods.

Histogram buckets and cardinality:

  • Customize histogram buckets to your SLOs to make histogram_quantile stable. Keep label sets small to control time‑series cardinality and Prometheus load.

Example bucket customization:

REQUEST_LATENCY = Histogram(
    "request_latency_seconds", "Request latency (s)", ["path", "method"],
    buckets=(0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0)
)

Model Artifact Management

Artifacts live under MODEL_DIR (models locally, /models in containers). On first run, the service trains and persists an artifact, making cold‑start deterministic.

Artifact handling (source: src/backend/model.py:39):

def ensure_model(model_path: Path | str, config: ModelConfig | None = None, train_if_missing: bool = True) -> AnomalyModel:
    path = Path(model_path)
    cfg = config or ModelConfig()
    if path.exists():
        model = joblib.load(path)
        return AnomalyModel(model, cfg)
    if not train_if_missing:
        raise FileNotFoundError(f"Model artifact not found at {path}")
    path.parent.mkdir(parents=True, exist_ok=True)
    model = train_isolation_forest(cfg)
    joblib.dump(model, path)
    return AnomalyModel(model, cfg)

Production variants:

  • Swap the local filesystem for a PVC in Kubernetes or a remote artifact store (S3/GCS) with a digest‑addressed path and integrity check.
  • Gate startup on model availability (TRAIN_IF_MISSING=0) and fail fast if artifact is missing.
  • Record model metadata (config, hash, training data hash) next to the artifact for lineage.
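
One lightweight way to do that is a metadata sidecar written next to the artifact; this helper is hypothetical, but the ModelConfig fields are the ones shown earlier:

import hashlib
import json
from dataclasses import asdict
from pathlib import Path

def write_artifact_metadata(artifact_path: Path, config: ModelConfig) -> Path:
    # Hypothetical lineage record: artifact digest plus training config,
    # stored as iso_forest.joblib.meta.json next to the model file.
    digest = hashlib.sha256(artifact_path.read_bytes()).hexdigest()
    meta = {"artifact": artifact_path.name, "sha256": digest, "config": asdict(config)}
    meta_path = artifact_path.parent / (artifact_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path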

Artifact promotion workflow:

  • Train offline; store artifact with immutable digest and metadata JSON (config, data hashes, metrics).
  • Canary new artifact with 1 replica; compare error/latency and anomaly base rate vs. baseline.
  • Promote by scaling up canary and scaling down baseline or using traffic splitting with Argo Rollouts.

The UI: Streaming, Visualization, and Load Generation

The UI lets you explore the system’s behavior and also stress test it. It simulates readings, batches requests, and displays rolling windows with anomaly overlays and server instance distribution.

Highlights (source: src/ui/main.py):

  • Sidebar controls: backend URL, batch size, interval, anomaly rate, and “new connection each request” to improve load distribution across pods (disables HTTP keep‑alive pinning).
  • Actions: Generate Batch, Start/Stop Streaming, Inject Extreme Reading, Concurrent Burst, and Sustained Bursts (with workers and cycles).
  • Charts: line chart for signals; red triangle markers for anomalies; bar chart of “served_by” instance to visualize request spread across replicas.

Why “new connection each request” matters: HTTP keep‑alive can pin a Streamlit (single‑process) client to a single pod behind a ClusterIP Service, because kube‑proxy balances at the connection level and connection tracking keeps an established 5‑tuple mapped to the same pod. Setting Connection: close on each request approximates a fairer fan‑out across pods at the cost of extra TCP overhead. For production, consider an L7 load balancer that balances per request, or a service mesh.
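
In requests terms the toggle is roughly this (a sketch; the UI's actual helper and parameter names may differ):

import requests

def post_batch(backend_url: str, payload: dict, new_connection: bool = True) -> dict:
    # Connection: close defeats keep-alive so successive requests can land on different pods;
    # without it, a reused TCP connection tends to stick to a single backend pod.
    headers = {"Connection": "close"} if new_connection else {}
    resp = requests.post(f"{backend_url}/predict", json=payload, headers=headers, timeout=5)
    resp.raise_for_status()
    return resp.json()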

Additional UI internals:

  • Maintains a rolling deque (WINDOW=200) of recent points with anomaly overlay; aggregates served_by to visualize load distribution.
  • Burst features use ThreadPoolExecutor; when pushing very high RPS from one host, watch for ephemeral‑port exhaustion and OS connection limits.
  • “Inject Extreme Reading” posts a synthetic outlier to validate end‑to‑end behavior and verify alerting/visualization.

Local Development and Tooling

We use uv with pyproject.toml to pin dependencies and streamline local runs.

Key commands:

  • Install: uv sync --extra dev
  • Run API: uv run python -m src.backend.main or uv run uvicorn src.backend.app:app --host 0.0.0.0 --port 8000
  • Run UI: uv run streamlit run src/ui/main.py
  • Lint/format: uv run python scripts/dev.py fix (optional helper)

Typing and linting:

  • mypy enforces type safety on src/; gradually tighten in critical paths (API surface, model I/O).
  • ruff provides fast lint + formatting; keep consistent style to reduce PR churn.

Containerization

We build two containers — a full backend image with the ML stack and a lean UI image that depends only on Streamlit + Requests.

Backend Dockerfile (source: docker/Dockerfile.backend):

# syntax=docker/dockerfile:1.7

FROM python:3.13-slim AS base

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PORT=8000 \
    MODEL_DIR=/models \
    MODEL_NAME=iso_forest.joblib \
    TRAIN_IF_MISSING=1 \
    VIRTUAL_ENV=/app/.venv \
    PATH="/app/.venv/bin:$PATH"

WORKDIR /app

RUN pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir uv

COPY pyproject.toml README.md uv.lock ./
COPY src ./src

RUN uv --version \
    && uv sync --frozen

RUN groupadd -r app && useradd -r -g app app \
    && mkdir -p /models \
    && chown -R app:app /app /models

USER app
EXPOSE 8000

CMD ["sh", "-c", "uvicorn src.backend.app:app --host 0.0.0.0 --port 8000 --workers ${WORKERS:-1}"]

UI Dockerfile (source: docker/Dockerfile.ui):

# syntax=docker/dockerfile:1.7

FROM python:3.13-slim AS base

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    VIRTUAL_ENV=/app/.venv \
    PATH="/app/.venv/bin:$PATH" \
    BACKEND_URL=http://localhost:8000 \
    STREAMLIT_SERVER_PORT=8501

WORKDIR /app

RUN pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir uv

COPY src/__init__.py ./src/__init__.py
COPY src/ui ./src/ui

RUN uv pip install --system streamlit==1.49.1 requests==2.32.4

EXPOSE 8501

CMD ["streamlit", "run", "src/ui/main.py", "--server.port=8501", "--server.address=0.0.0.0"]

Notes:

  • Non‑root user (app) and a writable /models directory for artifacts in backend.
  • The UI image avoids installing the heavy ML stack, reducing build time and attack surface.
  • Use WORKERS env to tune Uvicorn worker processes for CPU cores.

Supply‑chain and runtime hardening:

  • Pin base image digests; generate SBOMs; scan images in CI.
  • Set OMP_NUM_THREADS=1 and MKL_NUM_THREADS=1 in the backend image to avoid CPU oversubscription.
  • Consider readOnlyRootFilesystem: true and dropping Linux capabilities where possible.

Docker Compose

Local multi‑service development (source: docker/compose.yaml):

services:
  backend:
    build:
      context: ..
      dockerfile: docker/Dockerfile.backend
    image: anomaly-backend
    environment:
      - TRAIN_IF_MISSING=1
      - MODEL_DIR=/models
      - POD_NAME=compose-backend
      - WORKERS=2
    volumes:
      - ../models:/models
    ports:
      - "8000:8000"

  ui:
    build:
      context: ..
      dockerfile: docker/Dockerfile.ui
    image: anomaly-ui
    environment:
      - BACKEND_URL=http://backend:8000
    ports:
      - "8501:8501"
    depends_on:
      - backend

networks:
  default:
    name: anomaly-net

Run both: cd docker && docker compose up --build.

Testing variations locally:

  • Scale WORKERS=4 and compare p95 at fixed RPS.
  • Delete ./models/iso_forest.joblib to exercise cold‑start training path.

Kubernetes: Deploy, Observe, and Scale

We provide Deployment + Service for both apps and an HPA for the backend. Apply via Kustomize: kubectl apply -k k8s/.

Backend deployment (source: k8s/backend-deployment.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: anomaly-backend
  namespace: anomaly
  labels:
    app: anomaly-backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: anomaly-backend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: anomaly-backend
    spec:
      containers:
        - name: app
          image: anomaly-backend:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: TRAIN_IF_MISSING
              value: "1"
            - name: MODEL_DIR
              value: "/models"
            - name: WORKERS
              value: "2"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 2
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 20
            timeoutSeconds: 2
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
          volumeMounts:
            - name: model-store
              mountPath: /models
      volumes:
        - name: model-store
          emptyDir: {}

Horizontal Pod Autoscaler (source: k8s/backend-hpa.yaml):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: anomaly-backend-hpa
  namespace: anomaly
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: anomaly-backend
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
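
To make the target concrete: utilization is measured against the CPU request (200m), so a 60% target means roughly 120m of average usage per pod. The HPA's scaling rule is desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization); for example, if 2 pods are averaging 120% of their request, the controller asks for ceil(2 × 120 / 60) = 4 replicas.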

Backend service (source: k8s/backend-service.yaml):

apiVersion: v1
kind: Service
metadata:
  name: anomaly-backend
  namespace: anomaly
  labels:
    app: anomaly-backend
spec:
  type: ClusterIP
  selector:
    app: anomaly-backend
  ports:
    - name: http
      port: 8000
      targetPort: 8000

UI deployment (source: k8s/ui-deployment.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: anomaly-ui
  namespace: anomaly
  labels:
    app: anomaly-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: anomaly-ui
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: anomaly-ui
    spec:
      containers:
        - name: ui
          image: anomaly-ui:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8501
              name: http
          env:
            - name: BACKEND_URL
              value: http://anomaly-backend:8000
          readinessProbe:
            httpGet:
              path: /
              port: 8501
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 2
          livenessProbe:
            tcpSocket:
              port: 8501
            initialDelaySeconds: 10
            periodSeconds: 20
            timeoutSeconds: 2
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"

UI service (source: k8s/ui-service.yaml):

apiVersion: v1
kind: Service
metadata:
  name: anomaly-ui
  namespace: anomaly
  labels:
    app: anomaly-ui
spec:
  type: ClusterIP
  selector:
    app: anomaly-ui
  ports:
    - name: http
      port: 8501
      targetPort: 8501

Notes and production tips:

  • On Kubernetes, prefer a single Uvicorn worker per container and let HPA/KEDA add pods; use multiple workers per pod only with a clear reason (startup cost/bin-packing) and cap OMP_NUM_THREADS/MKL_NUM_THREADS to 1 to avoid CPU oversubscription.
  • HPA on CPU is simple and effective for CPU‑bound inference. For request‑driven scaling (QPS or queue length), consider KEDA with event sources or custom metrics (requests in flight).
  • Ensure metrics-server is installed; for kind, patch insecure TLS as in the README.
  • Tune requests/limits to achieve a useful utilization target. Too low limits can cause throttling; too high requests can starve scheduling.
  • Prefer readinessProbe on /health over /metrics; treat model availability as a readiness condition.

Diagram: K8s Objects and Flow

 +-------------+     +-------------------+     +------------------------+
 |  Deployment | --> | ReplicaSet (Pods) | --> | HPA observes CPU usage |
 +------+------+     +---------+---------+     +-----------+------------+
        |                        |                           |
        v                        v                           |
  Pods expose :8000        ClusterIP Service                 |
        |                        |                           |
        +-------> kube-proxy/iptables (hash)  <--------------+

Kustomize overlays (source: k8s/kustomization.yaml):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: anomaly
resources:
  - namespace.yaml
  - backend-deployment.yaml
  - backend-service.yaml
  - backend-hpa.yaml
  - ui-deployment.yaml
  - ui-service.yaml

Namespace (source: k8s/namespace.yaml):

apiVersion: v1
kind: Namespace
metadata:
  name: anomaly
  labels:
    name: anomaly

Observability: Metrics, SLI/SLOs, and PromQL

The backend emits Prometheus metrics at /metrics:

  • requests_total{path,method,status}: RED metrics (rate/errors/duration).
  • request_latency_seconds_bucket{path,method} (+ _sum, _count): request latency histograms.
  • inference_latency_seconds_bucket (+ _sum, _count): pure model time.
  • anomalies_total: running count of predicted anomalies.
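
A quick way to sanity‑check the exposition without deploying Prometheus is to scrape and parse /metrics directly; a sketch using the prometheus_client parser (the aggregation here is illustrative, not part of the repo):

import requests
from prometheus_client.parser import text_string_to_metric_families

def scrape_metrics(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    # Sum sample values per metric name so counters and histogram sums are easy to eyeball.
    text = requests.get(url, timeout=5).text
    totals: dict[str, float] = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name in ("requests_total", "anomalies_total", "request_latency_seconds_sum"):
                totals[sample.name] = totals.get(sample.name, 0.0) + sample.value
    return totals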

Example SLOs and capacity guidance:

  • Availability: keep the /predict error rate (requests_total with 5xx status over total) below a target such as 1%.
  • Latency: track p95 of request_latency_seconds against a budget and alert on sustained breaches.
  • Capacity rule of thumb: for a CPU‑bound synchronous service, keep average CPU under roughly 70% to protect p95/p99 from queuing blow‑ups; the HPA targets 60% here to leave headroom.

Reliability and Failure Modes

  • Empty batches return 400; invalid types return 422 (Pydantic validation).
  • A missing model returns 503 until the artifact is present or auto‑trained.
  • Global exception handler increments requests_total{status="500"} and returns a stable JSON body for clients.
  • Timeouts: client defaults to 5s; adjust based on batch sizes and latency budgets.

Resilience extensions:

  • Add circuit breaking/retries at the client or service mesh layer.
  • Expose a lightweight /live probe separate from /health readiness (for faster restarts); see the sketch after this list.
  • Emit structured logs (already using structlog); attach instance, pod, and node for correlation.
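
A liveness endpoint can be as small as the following (a sketch; the repo's /health already carries readiness semantics, including model availability):

@app.get("/live")
def live() -> dict[str, str]:
    # Liveness only proves the process is responsive; model availability stays
    # a readiness concern handled by /health, so a slow model load does not trigger restarts.
    return {"status": "alive"}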

Structured logging example:

logger.info("predict", instance=INSTANCE_ID, pod=POD_NAME, total=len(req.readings), anomalies=anomalies)

Security and Hardening

  • Non‑root containers; explicit resource requests/limits; probes.
  • Avoid including the ML stack in the UI image.
  • For production: ingress with TLS, authentication/authorization, rate limiting, and secrets management for external artifact stores.
  • Image scanning and SBOM generation in CI.

Secrets and config management:

  • Use Secret for credentials (e.g., remote artifact store) and ConfigMap for non‑sensitive configs (thresholds, feature lists).
  • Mount via env or volumes; avoid embedding secrets in images.

CI/CD and Testing Strategy

While this project focuses on systems integration, a robust setup would include:

  • Unit tests around schema conversions (as_feature_matrix), model scoring sign, and endpoint error paths.
  • Contract tests that POST known payloads and assert response schema and monotonicity of scores.
  • Performance tests gated on p95 thresholds per commit.
  • CI pipeline: lint (ruff), typecheck (mypy), test (pytest), build images, push to registry, deploy to a staging namespace via Kustomize overlays, smoke test, then promote.
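
Two tests in that spirit (a sketch; the import path for as_feature_matrix is assumed, and the assertions follow the contracts described above):

import numpy as np
from fastapi.testclient import TestClient

from src.backend.app import app
from src.backend.model import as_feature_matrix  # assumed location of the helper

def test_feature_order_matches_training_order():
    readings = [{"frequency": 50.0, "voltage": 230.0, "current": 10.0}]
    X = as_feature_matrix(readings, ("voltage", "current", "frequency"))
    assert X.shape == (1, 3)
    assert np.allclose(X[0], [230.0, 10.0, 50.0])

def test_empty_batch_is_rejected():
    client = TestClient(app)
    resp = client.post("/predict", json={"readings": []})
    assert resp.status_code == 400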

Testing pyramid tailored to this service:

  • Unit: as_feature_matrix, score polarity, simulator generation bounds.
  • API: schema conformance, empty payloads, large batch behavior.
  • Load: enforce p95/p99 targets at representative RPS.
  • E2E: UI -> backend flow; anomaly injection path.
  • Upgrade: canary two backends with different model digests and verify parity.

Extensibility: Beyond the Minimal Service

  • Model registry and lineage: back your artifacts by a registry (MLflow, WandB, or custom) with immutable digests and promotion flows.
  • Data drift and concept drift: track input feature distributions and anomaly rate; trigger alerts or shadow retraining.
  • Event‑driven streaming: swap the synchronous HTTP flow for Kafka/NATS ingestion + consumer workers; scale with KEDA on lag.
  • Inference servers: the same contract can be implemented with BentoML, Ray Serve, or LitServe while preserving schema and metrics.
  • Real thresholds: convert scores to probabilities via calibration; expose thresholds per segment or dynamic thresholds from control charts.

Streaming architectures:

  • Replace synchronous HTTP with Kafka ingestion and consumer workers; scale with KEDA on lag; UI subscribes via WebSocket/SSE.
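
A skeletal consumer loop for that kind of setup might look like this (a sketch assuming the kafka-python client and the model/feature helpers shown earlier; topic, broker, and group names are placeholders):

import json

from kafka import KafkaConsumer  # assumption: kafka-python client

from src.backend.model import as_feature_matrix  # assumed location, as above

def consume_and_score(model, topic: str = "grid-readings", brokers: str = "kafka:9092") -> None:
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=brokers,
        group_id="anomaly-scorer",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        reading = message.value  # same shape as SensorReading
        X = as_feature_matrix([reading], model.feature_names)
        flags, scores = model.predict_batch(X)
        # Emit results downstream (topic, DB, alert) and record the same Prometheus
        # metrics as the HTTP path; scale consumers with KEDA on consumer-group lag.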

End‑to‑End Run

Bare metal (no containers):

  1. Install deps: uv sync --extra dev.
  2. Backend: uv run python -m src.backend.main.
  3. UI: uv run streamlit run src/ui/main.py.
  4. Generate batches or start streaming from the UI; optionally run uv run python scripts/load.py ... for load.
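
If you prefer not to dig into scripts/load.py, a minimal stand‑in looks like this (the payload shape follows the /predict contract above; batch size, worker count, and request totals are arbitrary):

import concurrent.futures
import statistics
import time

import requests

def one_request(url: str) -> float:
    # One batched /predict call; returns end-to-end latency in seconds.
    payload = {"readings": [{"voltage": 230.0, "current": 10.0, "frequency": 50.0}] * 32}
    t0 = time.perf_counter()
    requests.post(f"{url}/predict", json=payload, timeout=5).raise_for_status()
    return time.perf_counter() - t0

def run_load(url: str = "http://localhost:8000", total: int = 500, workers: int = 16) -> None:
    # Fire `total` requests with `workers` threads and report rough latency percentiles.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(lambda _: one_request(url), range(total)))
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"n={total} p50={statistics.median(latencies) * 1000:.1f} ms p95={p95 * 1000:.1f} ms")

if __name__ == "__main__":
    run_load()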

Docker Compose:

  1. cd docker && docker compose up --build.
  2. UI at http://localhost:8501 (points to backend).
  3. Optionally run load against http://localhost:8000.

Kubernetes (kind):

  1. kind create cluster --name anomaly.
  2. Install metrics‑server (and patch for kind); verify kubectl top nodes works.
  3. Build and load images into kind.
  4. kubectl apply -k k8s/.
  5. Port‑forward UI: kubectl -n anomaly port-forward svc/anomaly-ui 8501:8501.
  6. Optionally port‑forward backend and run the load generator.
  7. Watch kubectl -n anomaly get hpa -w and kubectl -n anomaly get pods -w while increasing RPS.

Smoke test checklist:

  • /health returns status=ok; features match DEFAULT_FEATURES.
  • First run creates model artifact; subsequent runs load it (no retrain).
  • UI shows anomalies with non‑zero anomaly rate or after “Inject Extreme Reading”.
  • /metrics counters and histograms move when generating load.

Appendix

Environment variables:

  • Backend: WORKERS, MODEL_DIR, MODEL_NAME, TRAIN_IF_MISSING, POD_NAME, NODE_NAME.
  • UI: BACKEND_URL, STREAMLIT_SERVER_PORT.

Key files to explore:

  • src/backend/app.py
  • src/backend/api.py
  • src/backend/model.py
  • src/backend/sim.py
  • src/ui/main.py
  • scripts/load.py
  • docker/Dockerfile.backend, docker/Dockerfile.ui, docker/compose.yaml
  • k8s/*

Closing

This project demonstrates the full lifecycle from a simple notebook‑idea model to a production‑ready microservice with observability and autoscaling. The same patterns scale to richer models and larger fleets: keep contracts explicit, instrument everything, design for scale‑out, and treat artifacts and configuration as first‑class citizens.