KubeCon2025 - Kubeflow Ecosystem: What's next for Cloud Native AI/ML and LLMOps

inspirit941 2025. 6. 23. 13:41

https://youtu.be/gGP9QdlNr9Y?si=7fbyHmHW-01WnZNN

Kubeflow is an open-source project for managing AI/ML workloads in a simple, scalable, and portable way.

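It runs in any environment, as long as it is Kubernetes.
It is a composable platform: you can run a single component standalone, or set everything up as an end-to-end platform.
You can also think of it as the set of components that connects AI/ML with the cloud ecosystem.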

This talk focuses on GenAI and LLMOps, which are in particularly high demand these days.

According to the speakers, the Kubeflow open-source projects can manage the entire GenAI lifecycle:

Spark Operator: Data processing
Notebooks: for Model Development
Trainer: for Fine-Tuning / distributed training
Katib: for model Optimization / Architecture Search
KServe: large-scale inference
Kubeflow Pipelines: connects all the components
Model Registry: Metadata / Artifacts storage

Notebook: interactive IDE for Data Scientists

Notebooks provide interactive IDEs for data scientists, such as RStudio, JupyterLab, and VSCode.

Spark Operator: Data processing

The component that runs Spark on Kubernetes. It used to be maintained by Google and has since moved to the Kubeflow open-source project.
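
To make this concrete, here is a minimal sketch of what submitting a Spark job through the operator can look like: you describe the job as a `SparkApplication` custom resource and apply it to the cluster. The talk did not show this manifest. The image, file path, service account name, and resource sizes below are hypothetical placeholders, and the field layout follows my understanding of the `sparkoperator.k8s.io/v1beta2` API, so check it against the version you run.

```python
import json


def spark_application(name, image, main_file, executors=2):
    """Build a SparkApplication manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",
            # "spark" is a hypothetical service account with permission to create executor pods.
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"instances": executors, "cores": 1, "memory": "2g"},
        },
    }


manifest = spark_application(
    name="preprocess-dataset",
    image="example.registry/spark-py:latest",    # hypothetical image
    main_file="local:///opt/app/preprocess.py",  # hypothetical script path
)

# kubectl accepts JSON as well as YAML, so this output can be piped to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```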

Katib: Hyperparameter Tuning

The component responsible for model optimization. How to use Katib is covered in a separate session.
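
As a rough illustration of the idea behind hyperparameter tuning, the sketch below runs a random search in plain Python. It is not Katib's API. The `objective` function is a made-up stand-in for a real training run, and the search space and trial count are arbitrary. Katib's job, as I understand it, is to run this kind of loop as parallel trials on Kubernetes, with a choice of search algorithms.

```python
import random


def objective(learning_rate, batch_size):
    """Hypothetical stand-in for a training run; returns a loss to minimize."""
    return (learning_rate - 0.01) ** 2 + abs(batch_size - 64) / 1000.0


def random_search(max_trial_count=20, seed=0):
    """Sample hyperparameters at random and keep the trial with the lowest loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_trial_count):
        params = {
            "learning_rate": rng.uniform(0.001, 0.1),
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        trials.append((objective(**params), params))
    # Compare on the loss only, so ties never fall back to comparing dicts.
    best_loss, best_params = min(trials, key=lambda trial: trial[0])
    return best_loss, best_params, trials


best_loss, best_params, trials = random_search()
print(f"best loss {best_loss:.6f} with {best_params}")
```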

Trainer

A component for general AI/ML model training and fine-tuning. It works for LLMs as well.

Supports multiple frameworks: PyTorch, TensorFlow, ...
Role-Oriented Resource Model: the resource you create depends on who you are.
- Training code is separated from the infrastructure that runs it, so each role can focus on its own job.
  - Data Scientists: TrainJob
  - DevOps Engineers: TrainingRuntime
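
The split between the two roles can be sketched as two separate manifests. This is an abridged illustration, not something shown in the talk: the names, namespace, image, and command are hypothetical, and the fields follow my understanding of the `trainer.kubeflow.org/v1alpha1` API, which may change between releases.

```python
import json

# Owned by DevOps engineers: how and where training runs.
training_runtime = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainingRuntime",
    "metadata": {"name": "torch-gpu-runtime", "namespace": "ml-team"},
    "spec": {
        "mlPolicy": {"numNodes": 2, "torch": {"numProcPerNode": "auto"}},
        # Abridged JobSet template; a real runtime carries more detail here.
        "template": {
            "spec": {
                "replicatedJobs": [
                    {
                        "name": "node",
                        "template": {
                            "spec": {
                                "template": {
                                    "spec": {
                                        "containers": [
                                            {"name": "node", "image": "example.registry/torch:latest"}
                                        ]
                                    }
                                }
                            }
                        },
                    }
                ]
            }
        },
    },
}

# Owned by data scientists: what to train, pointing at a runtime by name.
train_job = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "metadata": {"name": "finetune-demo", "namespace": "ml-team"},
    "spec": {
        "runtimeRef": {
            "apiGroup": "trainer.kubeflow.org",
            "kind": "TrainingRuntime",
            "name": training_runtime["metadata"]["name"],
        },
        "trainer": {"command": ["python", "/workspace/train.py"], "numNodes": 2},
    },
}

for manifest in (training_runtime, train_job):
    print(json.dumps(manifest, indent=2))
```

The only link between the two is `runtimeRef`, so the infrastructure details can change without the data scientist touching the TrainJob.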

It is integrated with the internal pipeline feature, and the integration with ArrowCache also enables distributed ML training.

This will be covered in a separate demo session.

Kubeflow SDK

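For the user writing training code, it works the same way as using Spark.
- You don't need to know about k8s or containers; you only focus on writing PyTorch code.
- Scaling is handled by k8s and Kubeflow.
Components provided by Llama Stack are also compatible with Kubeflow add-ons.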
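
A minimal sketch of that workflow: you write an ordinary Python training function and hand it to the SDK, which takes care of running it on the cluster. The import path `kubeflow.trainer` and the `TrainerClient` / `CustomTrainer` names reflect my understanding of the Kubeflow SDK and are assumptions here, not something confirmed by the talk notes. The training loop is a toy written in pure Python so the example runs without PyTorch or a cluster. The `RANK` and `WORLD_SIZE` environment variables are assumed to be set by the launcher, as `torchrun` does.

```python
def train_func():
    """Training entry point; imports live inside so the function is self-contained."""
    import os

    # Default to a single worker when run locally without a launcher.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # Toy stand-in for a PyTorch loop: fit y = 2x with plain gradient descent.
    data = [(float(x), 2.0 * x) for x in range(1, 9)]
    shard = data[rank::world_size]  # each worker takes its own slice of the data
    w = 0.0
    for _ in range(200):
        grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        w -= 0.01 * grad
    print(f"rank {rank} of {world_size}: learned w = {w:.3f}")
    return w


def submit_to_cluster(num_nodes=2):
    """Submit train_func as a distributed job (assumed SDK API, needs a cluster)."""
    # Imported lazily so this file still runs where the SDK is not installed.
    from kubeflow.trainer import CustomTrainer, TrainerClient

    client = TrainerClient()
    return client.train(trainer=CustomTrainer(func=train_func, num_nodes=num_nodes))


# Local dry run; call submit_to_cluster() instead when a Kubeflow cluster is available.
learned_w = train_func()
```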

It is also compatible with torchtune, which handles fine-tuning.
An end-to-end GenAI experience is also possible through the Kubeflow SDK.

Model Registry

It supports S3-compatible object storage, and OCI registries (Harbor, Nexus) appear to be supported as an alpha feature.

Apparently, models can be registered and used easily through the UI.

GenAI Lifecycle - KServe

The component for serving models.

Built on Knative + Istio
Envoy AI Gateway integration
KEDA integration, so that LLMs can be autoscaled on custom metrics
LLM serving runtime support
Kubernetes Gateway API for Raw Deployment mode
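
For a sense of what deploying an LLM with KServe involves, here is a hedged sketch of an `InferenceService` manifest in Raw Deployment mode. The service name and model id are hypothetical. The `huggingface` model format, the `hf://` storage URI, and the `serving.kserve.io/deploymentMode` annotation follow my understanding of recent KServe releases, so verify them against your installed version. The KEDA and Envoy AI Gateway integrations mentioned above are not shown.

```python
import json


def llm_inference_service(name, model_id):
    """Build an InferenceService manifest for an LLM as a plain dict (illustrative only)."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {
            "name": name,
            # Raw Deployment mode uses plain Kubernetes Deployments instead of Knative.
            "annotations": {"serving.kserve.io/deploymentMode": "RawDeployment"},
        },
        "spec": {
            "predictor": {
                "minReplicas": 1,
                "maxReplicas": 3,
                "model": {
                    "modelFormat": {"name": "huggingface"},
                    "storageUri": f"hf://{model_id}",
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                },
            }
        },
    }


# "example-org/example-llm" is a hypothetical model id.
service = llm_inference_service("demo-llm", "example-org/example-llm")
print(json.dumps(service, indent=2))
```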

Demo: End-to-End Pipeline with Kubeflow

19:00 - 27:20
