DevConf 2025 - From spreadsheet scheduling to Kubernetes: building an on-premise ML platform

inspirit941 2025. 7. 1. 09:23

From spreadsheet scheduling to Kubernetes: building an on-premise ML platform

https://youtu.be/Bl_uHfSunxg?si=3IfGzbRJGDQOS1yl

screenCapture_2025-06-30_15.12.20

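A talk by a tech lead and an engineer at a company called Innovatrics.

The company extracts biometric features from images.
- (models that find faces, fingerprints, and similar features in photos)
The goal is to build an R&D platform so that researchers can focus solely on research.
- On-Prem MLOps
- Data Management + Workflows
- Infrastructure Tooling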

Motivation

screenCapture_2025-06-30_15.16.00

screenCapture_2025-06-30_15.16.17

screenCapture_2025-06-30_15.16.28

A spreadsheet was used for workload scheduling and execution.

It was unmanageable: no queue, no fairness mechanism, and a low GPU allocation rate.

screenCapture_2025-06-30_15.20.03

Adopting k8s

It supports basic queueing, fairness, allocation, and monitoring,
but it comes with operational issues.

Current State

screenCapture_2025-06-30_15.22.03

They use Kueue, introduced in the talk as a queue plugin for kube-scheduler.

Easy to deploy and manage.
Does not replace the default scheduler.
Good visibility (Prometheus metrics).
Described as the default CNCF queuing mechanism. (GCP also uses it for scheduling ML workloads.)

Volcano and YuniKorn are alternatives, but both replace the default scheduler and are hard to set up and maintain, so they were not adopted.

screenCapture_2025-06-30_15.25.48

A dashboard built from the Prometheus metrics that Kueue exposes.

First row: how many resources each team is using
Second row: average queue waiting time per team
Third row: the queue itself
- workload name, creation time, priority...
- if a workload is waiting and not yet scheduled, the reason it is pending is shown.
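As a rough illustration of what the second and third rows of such a dashboard compute, here is a minimal Python sketch. It does not query Prometheus or Kueue; the `Workload` record and its field names are hypothetical and only stand in for whatever the real metrics provide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    # Hypothetical record; field names are illustrative, not Kueue's schema.
    name: str
    team: str
    created_at: float             # seconds since epoch
    admitted_at: Optional[float]  # None while still waiting in the queue
    pending_reason: str = ""


def avg_wait_per_team(workloads, now):
    """Average seconds spent waiting, per team.

    Admitted workloads count the time until admission; pending ones
    count the time waited so far (up to `now`).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for w in workloads:
        end = w.admitted_at if w.admitted_at is not None else now
        totals[w.team] += end - w.created_at
        counts[w.team] += 1
    return {team: totals[team] / counts[team] for team in totals}


def pending_rows(workloads):
    """Rows for the queue panel: pending workloads, oldest first, with their reason."""
    return [(w.name, w.team, w.pending_reason)
            for w in sorted(workloads, key=lambda w: w.created_at)
            if w.admitted_at is None]
```

In a real setup these aggregations would more likely be expressed as dashboard queries over the exported metrics; the sketch only shows the arithmetic behind the panels.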

Batch Job scheduling example

screenCapture_2025-06-30_15.49.51

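An example scheduling diagram for a Batch Job (a workload that runs for several days to a few weeks).

1. The user creates an ML Job.
2. Kyverno (policy engine) attaches labels according to policy.
3. The job enters the team queue managed by Kueue. (A fairness score is computed in the background.)
4. A job is picked by priority, the team is allocated the corresponding resources, and the job moves to the in-use state.
   - The fairness score is recomputed.
5. Kueue sets the node selector.
6. k8s schedules the pod.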
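The queueing part of the flow above can be sketched as a toy model in Python. This is not Kueue's actual algorithm: the fairness score here is simply the share of GPUs a team currently holds, and the `Job` and `Cluster` classes are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    team: str
    gpus: int
    priority: int = 0


@dataclass
class Cluster:
    total_gpus: int
    in_use: dict = field(default_factory=dict)  # team -> GPUs currently held
    queues: dict = field(default_factory=dict)  # team -> list of waiting jobs

    def submit(self, job):
        self.queues.setdefault(job.team, []).append(job)

    def fairness_score(self, team):
        # Toy score: share of the cluster the team holds. Lower = more underserved.
        return self.in_use.get(team, 0) / self.total_gpus

    def free_gpus(self):
        return self.total_gpus - sum(self.in_use.values())

    def admit_next(self):
        """Admit one job: most underserved team first, then highest priority."""
        teams = sorted((t for t, q in self.queues.items() if q),
                       key=lambda t: (self.fairness_score(t), t))
        for team in teams:
            queue = self.queues[team]
            # max() keeps the earliest job among equal priorities (FIFO tie-break).
            job = max(queue, key=lambda j: j.priority)
            if job.gpus <= self.free_gpus():
                queue.remove(job)
                # Usage changes here, so later fairness_score calls reflect it.
                self.in_use[team] = self.in_use.get(team, 0) + job.gpus
                return job
        return None  # nothing fits right now; jobs keep waiting
```

Calling `admit_next()` repeatedly drains the queues in an order that favors teams holding fewer GPUs, which mirrors the "pick, allocate, recompute fairness" loop described in the talk.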

Interactive Development example

screenCapture_2025-06-30_16.36.57

screenCapture_2025-06-30_16.41.02

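Coder: a kind of developer portal that can be deployed and managed as a k8s workload

Gives users a single entrypoint
Exposes a port for communicating with the IDE the user chose

1. The user requests a dev environment.
2. The required resources are requested via Terraform apply.
3. Kyverno adds the required labels.
4. The pod is deployed after passing through Kueue.
5. The pod runs coder-agent and provides the IDE environment to the user.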
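The labeling step can be pictured as a mutation applied to the manifest before it is admitted. The Python sketch below imitates that on a plain dict; it is not a Kyverno policy. `kueue.x-k8s.io/queue-name` is the label Kueue uses to route a workload to a queue, while the namespace-to-queue mapping and the queue names are made up for illustration.

```python
import copy

# The label Kueue reads to find the queue a workload belongs to.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"

# Hypothetical mapping from namespace to team queue; all names are invented.
NAMESPACE_TO_QUEUE = {
    "team-face": "face-queue",
    "team-fingerprint": "fingerprint-queue",
}


def add_queue_label(manifest, default_queue="default-queue"):
    """Return a copy of `manifest` with a queue label added.

    A label that the user already set is left untouched.
    """
    mutated = copy.deepcopy(manifest)
    metadata = mutated.setdefault("metadata", {})
    labels = metadata.setdefault("labels", {})
    queue = NAMESPACE_TO_QUEUE.get(metadata.get("namespace"), default_queue)
    labels.setdefault(QUEUE_LABEL, queue)
    return mutated
```

Doing this in a policy engine rather than in each user's manifest means researchers never have to know which queue their workspace or job belongs to.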

screenCapture_2025-06-30_16.41.49

screenCapture_2025-06-30_16.41.55

screenCapture_2025-06-30_16.42.14

screenCapture_2025-06-30_16.42.27

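It looks roughly like the screenshots above.

Essential components such as the devpod are provided pre-built
The user enters a workspace / namespace
terraform apply
Users can scale the components created in a workspace up or down directly from the web UI.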

screenCapture_2025-06-30_16.45.59

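To notify users when a job finishes, they reportedly use an in-house Google Chat application.

It tells the user: "You requested this much, you actually used this much, so next time you can request this much."
It was added because researchers kept requesting more resources than they needed.
It is a simple Python app that provides this information based on Grafana alerts and Prometheus history.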
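A minimal sketch of how such a feedback message might be composed is shown below. The talk did not show the app's code, so everything here is an assumption: the function names, the 20% headroom, and the rounding step are illustrative, and the usage numbers are passed in directly instead of being read from Prometheus.

```python
import math


def recommend_request(peak_used, headroom=0.2, step=1.0):
    """Suggest a request: peak usage plus headroom, rounded up to a multiple of `step`.

    The 20% default headroom is an assumption, not a value from the talk.
    """
    target = peak_used * (1 + headroom)
    return math.ceil(target / step) * step


def feedback_message(job, resource, requested, peak_used, unit=""):
    """Build the 'requested vs. used' message for one resource of one job."""
    # Never suggest more than what was originally requested.
    suggested = min(recommend_request(peak_used), requested)
    return (f"Job {job}: you requested {requested:g}{unit} of {resource}, "
            f"peak usage was {peak_used:g}{unit}. "
            f"Next time, requesting about {suggested:g}{unit} should be enough.")
```

For example, `feedback_message("train-1", "memory", 64, 20, unit="GiB")` suggests about 24GiB for the next run.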

Infrastructure Stuff

screenCapture_2025-06-30_16.48.32

GitOps is done with ArgoCD.

screenCapture_2025-06-30_16.49.34

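Observability uses Loki, Grafana, Mimir + Alloy.

The full LGTM stack of four is commonly used, but they had no need for Tempo (distributed tracing).
- Loki: log aggregation, storage, and querying. Regex queries make it quick to find the logs you need
- Grafana: visualization + dashboards
- Mimir: metrics storage
- Alloy: log collection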

screenCapture_2025-06-30_16.56.01

A Grafana dashboard built by combining Kueue and Prometheus data

screenCapture_2025-06-30_17.01.37

Admin dashboard. Blue means healthy, red means action is required

screenCapture_2025-06-30_17.02.38

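Various other components are also in use

RKE2: a k8s distribution with a single-binary setup. Upgrading k8s only requires replacing the binary
Ansible
Longhorn: stores data locally for the time being when S3 is unreachable, then pushes it once the S3 connection recovers
Nvidia GPU Operator
Calico: CNI
Kyverno: policy engine used to mutate resources and distribute ConfigMaps and Secrets.
- Permissions and settings that researchers need no longer have to be set up by hand each time.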

Conclusion

screenCapture_2025-06-30_17.05.55

Data to back GPU purchasing decisions
GPU utilization of up to 85%
Single Point of Entry to all our servers. (kueue for BatchJob, coder for Interactive Development)
Nodes are stateless: the data needed for training lives on centralized NVMe storage.
Disaster recovery within 5 minutes is possible. Just replace the node and put it back into service

screenCapture_2025-06-30_17.09.21

Lesson Learned

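Use Talos Linux: an operating system with no ssh or CLI access. (Everything goes through the API.)
Manage things declaratively and atomically, so that rolling back is easy when something breaks
Researchers need interactive development. Batch jobs alone cannot meet their requirements
Users need feedback on the resources they actually used.
Use centralized storage. Once individual nodes become stateful (once data locality appears), bottlenecks follow.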

Q&A

Q. How large is the infrastructure you manage?

3 control plane nodes Running in Proxmox (VM for HA)
10 on-prem Bare Metal Servers for Kubernetes

Q. ???

If you run k8s with k0s, it can run even on a single node, with no need for any separate control planes. Kueue itself can run on only 1 node.

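Q. Was there any pushback from researchers?

They improved the product iteratively and steered people toward it by making the good servers available only through the platform.

They said they use Argo Workflows for batch scheduling. (Reportedly it works well with Kueue.)

It was left out of the talk because adoption is still at an early stage. Kubeflow also uses Argo Workflows internally. The main downside is that you need to manage a cluster just for Kubeflow operations.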
