Deploying Small Language Models with Kubernetes Ingress
The pods were ready. The traffic was coming. The problem was routing it fast, without waste.
Kubernetes Ingress is the standard API for routing HTTP and HTTPS traffic into your cluster. It defines rules for how external requests reach your Services. Without it, you end up exposing each workload through its own LoadBalancer or NodePort, which multiplies cost and complexity. With Ingress, you consolidate entrypoints and keep routing logic in one place.
Pairing Kubernetes Ingress with a Small Language Model (SLM) unlocks a new pattern. The Ingress handles external routing. The SLM processes incoming requests at low latency and with minimal resource usage. Unlike large models, a small language model can run inside the cluster on ordinary CPU nodes, without expensive GPUs. That puts dynamic, AI-driven decision-making close to your data and services.
Use Kubernetes Ingress rules to route requests by HTTP path or hostname; header-based routing is possible through controller-specific annotations. Direct model inference traffic to dedicated pods running the SLM. Use annotations to tap controller features in NGINX, HAProxy, or Traefik. For TLS, configure certificate management with cert-manager. Tune health checks and timeouts for your inference endpoints: generation requests run longer than typical API calls, and a cold model can fail probes that are too aggressive.
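Here is a minimal sketch of such an Ingress, assuming the ingress-nginx controller and cert-manager are installed. The hostname, issuer name, Service name, and timeout values are placeholders, not a reference configuration:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: slm-inference
  namespace: slm                 # dedicated namespace for model traffic
  annotations:
    # cert-manager provisions the TLS certificate
    # (assumes a ClusterIssuer named letsencrypt-prod exists)
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # Generous timeouts so long-running generations are not
    # cut off mid-response (ingress-nginx annotations)
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - slm.example.com        # placeholder hostname
      secretName: slm-tls        # cert-manager writes the certificate here
  rules:
    - host: slm.example.com
      http:
        paths:
          - path: /infer
            pathType: Prefix
            backend:
              service:
                name: slm-inference   # the Service in front of your SLM pods
                port:
                  number: 80
```

With this in place, one entrypoint carries both TLS termination and the inference route.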
An optimized workflow, with manifest sketches after the list:
- Deploy your SLM as a container in Kubernetes.
- Create a Service targeting the pods.
- Configure an Ingress resource to route AI endpoints to that Service.
- Use rewrite-target rules when you need clean API paths.
- Leverage Kubernetes namespaces to isolate model traffic from other workloads.
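For steps 1 and 2, a minimal Deployment and Service sketch. The image name, port, resource figures, and /healthz probe path are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slm-server
  namespace: slm                 # dedicated namespace (step 5)
spec:
  replicas: 2
  selector:
    matchLabels:
      app: slm-server
  template:
    metadata:
      labels:
        app: slm-server
    spec:
      containers:
        - name: slm-server
          image: ghcr.io/example/slm-server:latest   # placeholder image
          ports:
            - containerPort: 8080
          # CPU-only resources: small models typically fit without GPUs
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          # Readiness gate keeps traffic away until the model has loaded
          readinessProbe:
            httpGet:
              path: /healthz     # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 15
---
apiVersion: v1
kind: Service
metadata:
  name: slm-inference
  namespace: slm
spec:
  selector:
    app: slm-server
  ports:
    - port: 80
      targetPort: 8080
```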
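For steps 3 through 5, an Ingress sketch using the ingress-nginx rewrite-target annotation: clients call /api/infer while the pods see /infer. The hostname and resource names are placeholders, and the regex-capture style assumes ingress-nginx; other controllers express rewrites differently:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: slm-api
  namespace: slm                 # same namespace as the Service it targets
  annotations:
    # Strip the /api prefix before the request reaches the SLM pods
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: slm.example.com
      http:
        paths:
          - path: /api(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: slm-inference
                port:
                  number: 80
```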
This design pattern keeps traffic predictable and secure while giving you the freedom to scale the SLM horizontally. If your use case needs low-latency AI (content filtering, smart routing, personalized responses), this approach keeps Kubernetes Ingress and your small language model working as one system.
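One way to get that horizontal scaling is a HorizontalPodAutoscaler. A sketch assuming metrics-server is installed and CPU utilization is a fair proxy for inference load; the replica bounds and threshold are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: slm-server
  namespace: slm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: slm-server          # the Deployment from the workflow above
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU exceeds 70%
```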
See this running live in minutes at hoop.dev. Deploy, route, and serve your small language model through Kubernetes Ingress without friction.