CloudLinux is a global, remote-first company. We are driven by our principles: do the right thing, put employees first, work remote-first, and deliver high-volume, low-cost Linux infrastructure and security products that help companies increase the efficiency of their operations. Everyone on our team supports one another and does what they can to ensure we are all successful.
Check out our website for more information: https://cloudlinux.com/
We are looking for a Senior IaaS / Kubernetes Platform Engineer to join our Infrastructure Department and become a key contributor to the design, implementation, and operation of our private cloud and multi-tenant Kubernetes platform.
Our infrastructure powers 500+ VMs across multiple datacenters, serving 20+ engineering teams. We are in the process of evolving from an OpenNebula-based virtualization platform toward a Kubernetes-native multi-tenant cloud with KubeVirt for VM orchestration — while maintaining reliability and operational excellence throughout the transition.
You will work alongside the existing IaaS Tech Lead and Network Engineer, and must be capable of independently owning and operating the full IaaS stack (compute, storage, networking, bare metal) if needed. This is not a "Kubernetes-only" role — it requires deep infrastructure generalist skills combined with Kubernetes platform expertise.
What You Will Do
Kubernetes Platform Engineering (Primary Focus — 40%)
- Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
- Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
- Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
- Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
- Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies.
- Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools.
Storage Engineering (20%)
- Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
- Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+).
- Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage.
- Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters.
- Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments.
Networking (15%)
- Deploy and manage overlay networks for pod networking, with micro-segmentation and WireGuard/IPsec encryption.
- Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity.
- Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt.
- Work with physical network infrastructure: Juniper switches (Junos OS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs.
- Maintain IPsec site-to-site connectivity between datacenters.
Reliability and Operations (15%)
- Practice SRE discipline: define and maintain SLOs with error budgets, and implement proactive capacity management with 6 to 12 month forecasting.
- Design and execute chaos engineering experiments to validate system resilience.
- Participate in the on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking).
- Write and maintain runbooks, disaster recovery plan (DRP) documentation, and postmortem analyses.
- Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil — then propose and implement solutions without waiting for incidents.
Infrastructure as Code and Automation (10%)
- Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
- Write Ansible playbooks for bare-metal server configuration and fleet management.
- Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.
- Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost.