KubeCon China 2025 Experience
LLM, LLM, and More LLM

Preface
After resigning at the end of January, I spent time at home for the Chinese New Year, then traveled around Shanghai, Zhangjiajie, Chongqing, Suzhou, and Nanjing. I didn’t return to Shenzhen to start job hunting until mid-April. Initially, I wasn’t sure if I would have time to attend KubeCon China 2025 in June. However, I was fortunate that the company I received an offer from highly values technology. During the interview, my leader mentioned seeing my KubeCon experiences in my blog and said the company strongly encourages participation in such technical exchange events, and even covers expenses in full for employees who give talks.
So, less than a month after joining, I went on a company-funded trip to KubeCon China 2025 in Hong Kong (:
I asked my colleagues if they were interested, but for various reasons, I ended up being the only one attending (sad
TL;DR
In short, this year’s KubeCon China was almost entirely focused on AI on Kubernetes - it could have been renamed to CloudNative AI Con.
This year’s KubeCon China was only two days long, with significantly fewer talks than last year - almost half as many. As a result, I also watched many KubeCon Europe 2025 talks online as a supplement.
Overall, my impressions this year were:
- Kubernetes has become a mature foundation - anything that can run on K8s will eventually be moved to K8s
- AI has brought new life to the CloudNative community, with many new CloudNative projects emerging around AI in the past two years. AI topics have become the absolute main theme of KubeCon.
- The AI deployment section mainly discussed AI inference, with key technical points: distributed inference, scaling, and LLM-aware load balancing, as well as AI model distribution
- There were several discussions about AIOps, ranging from simple ChatBot implementations to more complex Multi-Agent systems for tasks like cloud cost analysis and optimization
- Kuaishou attempted to use Logs/Metrics to train a model for each service in their ultra-large-scale cluster to dynamically adjust HPA, achieving a balance between SLA and cost (if I'm remembering this wrong, I take no responsibility hhh)
- OpenTelemetry is maturing and getting closer to its goal of unifying Logs/Traces/Metrics signals
- Platforms like Uptrace have emerged as unified observability platforms, fully utilizing OTel’s labels to correlate Logs/Traces
- Current best practice is to still use traditional methods for collecting Logs and Metrics at the Infra level, while at the APP level, OTel handles all Logs, Traces, and Metrics, correlating them through Span ID with consistent label semantics
- WASM is still exploring its use cases, with this year’s main focus being running small models at the edge
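The Log/Trace correlation practice above boils down to stamping every log line with the IDs of the span it was emitted under. Here's a minimal stdlib-only sketch of that idea; in a real setup the OTel SDK injects `trace_id`/`span_id` automatically, so the helper and class names here are purely illustrative:

```python
import logging
import secrets

# Hypothetical helper: a real OTel SDK generates and propagates these IDs
# for us; here we fake them just to show the correlation mechanism.
def new_span_ids():
    return secrets.token_hex(16), secrets.token_hex(8)  # trace_id, span_id

class SpanContextFilter(logging.Filter):
    """Attach the current trace/span IDs to every log record."""
    def __init__(self, trace_id, span_id):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True

def make_logger(trace_id, span_id):
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(SpanContextFilter(trace_id, span_id))
    return logger

trace_id, span_id = new_span_ids()
log = make_logger(trace_id, span_id)
log.info("handling request")  # each line now carries trace_id/span_id
```

With the same `trace_id` on both the log line and the trace, a backend can join the two signals without any guessing, which is exactly what the "consistent label semantics" bullet is about.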
KubeCon China 2025 and KubeCon Europe 2025 video playlists:
- KubeCon + CloudNativeCon China 2025 (Hong Kong) - Youtube
- KubeCon + CloudNativeCon Europe 2025 (London) - Youtube
Presentation slides can be downloaded here (NOTE: not all talks will have PDFs uploaded):
Next, I’ll introduce some interesting content I heard, organized by topic, along with corresponding video links and, where available, slide (PPT) links.
Talks
Unified LLM Inference Solution
AIBrix is a complete solution for running distributed LLM inference on K8s, including:
- Distributed inference deployment
- LLM scaling
- LLM request routing (load balancing)
- Distributed KV cache
- Mainly centralizes KV-cache data in shared storage, reducing HBM usage and per-GPU memory requirements
- Dynamic LoRa loading
- …
Code:
AIBrix is currently under the vllm-project, with a good number of stars, suggesting the project is healthy and worth following.
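As a toy illustration of the distributed KV cache idea - reusing the KV values of an already-computed prompt prefix from shared storage instead of recomputing them - here's a minimal sketch. The block size, key scheme, and class names are all invented for illustration, not the actual AIBrix design:

```python
from hashlib import sha256

BLOCK = 4  # tokens per cache block (toy size; real systems use larger blocks)

class SharedKVCache:
    """Toy centralized KV cache: maps a hash of a token prefix to its
    (pretend) KV tensors, so workers can skip recomputing shared prefixes."""
    def __init__(self):
        self.store = {}

    @staticmethod
    def _key(tokens):
        return sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def put(self, tokens):
        # Store an entry for every whole block-aligned prefix.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self.store[self._key(tokens[:end])] = f"kv[0:{end}]"

    def longest_prefix_hit(self, tokens):
        """Return the length of the longest block-aligned prefix already cached."""
        hit = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if self._key(tokens[:end]) in self.store:
                hit = end
            else:
                break
        return hit

cache = SharedKVCache()
cache.put(list(range(12)))                        # a previous request cached 12 tokens
print(cache.longest_prefix_hit(list(range(10))))  # → 8 (two full blocks reused)
```

A new request sharing a prefix (e.g. the same system prompt) only needs to compute KV values for the uncached tail, which is where the HBM and latency savings come from.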
Distributed LLM Inference Deployment
One of the most interesting talks, covering distributed inference architecture, optimization points, and the advantages and usage of LWS.
Code:
Simply put, LWS is a CRD specifically designed for LLM distributed inference deployment, mainly supporting group scheduling for LLM tasks.
NOTE: According to an issue, AIBrix might integrate with LWS (possibly with official support): https://github.com/vllm-project/aibrix/issues/843#issuecomment-2728305020
LLM Scaling and Load Balancing
- KubeCon EU 2025 - Optimizing Metrics Collection & Serving When Autoscaling LLM Workloads
- Quite entertaining, but since I’m familiar with this area, I could guess it was about custom business metrics + KEDA for custom metrics-based scaling, so I just skimmed through it
- KubeCon EU 2025 - Keynote: LLM-Aware Load Balancing in Kubernetes: A New Era of Efficiency - Clayton Coleman, Distinguished Engineer, Google & Jiaxin Shan, Software Engineer, Bytedance
- Very interesting - LLM requests are very different from traditional API requests:
- Input length varies greatly - some requests have simple inputs and are relatively lightweight, while others might include entire PDFs or other long text inputs. Output length is similarly variable - if users request deep reasoning, a single request can consume significant compute
- Different machines might use different GPU types with varying performance
- In a multi-model platform, different models have distinct peak and off-peak periods
- These characteristics make traditional load balancing strategies completely ineffective
- Solution: https://github.com/kubernetes-sigs/gateway-api-inference-extension
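To make the "traditional load balancing doesn't work" point concrete, here's a toy sketch of an LLM-aware endpoint picker that scores replicas by queue depth, KV-cache utilization, and relative GPU speed. The metrics, weights, and names are invented for illustration - this is not the actual gateway-api-inference-extension algorithm:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_len: int        # requests currently waiting
    kv_cache_util: float  # 0.0-1.0 fraction of KV cache in use
    gpu_weight: float     # relative GPU speed (e.g. H100 > A10)

def score(r: Replica, prompt_tokens: int) -> float:
    """Lower is better. Weights are made up for illustration - a real
    endpoint picker applies its own metrics and policy."""
    return (r.queue_len * 10 + r.kv_cache_util * 100 + prompt_tokens / 100) / r.gpu_weight

def pick(replicas, prompt_tokens):
    # Round-robin would ignore all of this state; an LLM-aware balancer
    # routes the heavy request to the replica that can absorb it.
    return min(replicas, key=lambda r: score(r, prompt_tokens)).name

replicas = [
    Replica("a100-busy", queue_len=8, kv_cache_util=0.9, gpu_weight=1.0),
    Replica("h100-idle", queue_len=1, kv_cache_util=0.2, gpu_weight=2.0),
]
print(pick(replicas, prompt_tokens=4000))  # → h100-idle
```

The point of the talk is that these signals (queue depth, KV-cache pressure, hardware heterogeneity) are invisible to a plain L4/L7 balancer, which is why the inference extension moves them into the routing decision.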
AI Model Distribution
AI Model Distribution Challenges and Best Practices
Developers discussed how to distribute LLM models hundreds of GB in size within clusters. Current industry approaches:
- Dragonfly
- JuiceFS
- OCI model spec + OCI volume (K8s 1.33+)
Observability
- Antipatterns in Observability: Lessons Learned and How OpenTelemetry Solves Them - Steve Flanders, Splunk
- Very interesting and informative. The observability antipatterns he listed include:
- Telemetry Data
- Incomplete Instrumentation - need to introduce a zero-code otel sdk for automatic data collection
- metrics/logs/traces signals aren’t all enabled by default - it depends on the agent implementation
- In k8s, it’s recommended to disable both stdout logging and traditional prometheus pull /metrics endpoints, letting otel agent handle all App-level signals. Daemonset mode otel (or vector/fluentbit) mainly handles Infra-level logs
- Over-Instrumentation - need to filter and streamline metrics at the otel-collector level before sending to backend storage
- Inconsistent Naming Conventions - fully adopt OpenTelemetry solution for unified naming
- Observability Platform
- Vendor Lock-in - only choose platforms supporting OTel standards and use OTel naming conventions
- Tool Sprawl - use unified observability platforms like Uptrace that support automatic Log-Trace correlation
- Underestimating Scalability Requirements - use OTel for signal collection and choose scalable backend storage like VictoriaMetrics
- Company Culture
- Silos and Lack of Collaboration
- Lack of Ownership & Accountability
- KubeCon EU 2025 - From Logs To Insights: Real-time Conversational Troubleshooting for Kubernetes With GenAI - Tiago Reichert & Lucas Duarte, AWS
- The opening OnCall skit was very realistic… though getting a phone alert after 1 minute of pod pending seems exaggerated…
- After the skit, they covered the main content: embedding logs with an embedding model and storing them in OpenSearch for RAG, giving the ChatBot read-only k8s permissions (with Secrets access banned), then using Deepseek/Claude for Q&A to solve problems
- Code: https://github.com/aws-samples/sample-eks-troubleshooting-rag-chatbot
- Portrait Service: AI-Driven PB-Scale Data Mining for Cost Optimization and Stability Enhancement - Yuji Liu & Zhiheng Sun, Kuaishou
- Discussed how Kuaishou manages stability and performance optimization in their ultra-large-scale cluster of 200,000 machines
- Covered relatively basic content - mainly collecting vast amounts of cluster information, processing through a big data system, then training dedicated models, with each service potentially having its own resource optimization model
- This approach might be too heavy - worth learning from but not very useful in my current work scenario (scale too small)
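The log-RAG flow from the AWS talk can be sketched with a toy retriever: embed the question and each log line, then hand the most similar lines to the LLM as context. A bag-of-words counter stands in for the real embedding model and OpenSearch here - everything below is illustrative, not their actual implementation:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' - a stand-in for the real embedding model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(logs, question, k=2):
    """Return the k log lines most similar to the question - the context
    that would then be handed to the LLM for Q&A."""
    q = embed(question)
    return sorted(logs, key=lambda line: cosine(embed(line), q), reverse=True)[:k]

logs = [
    "pod payments-7f9c crashloopbackoff back-off restarting failed container",
    "service checkout scaled to 5 replicas",
    "node ip-10-0-1-5 memory pressure evicting pods",
]
print(retrieve(logs, "why is the payments pod in crashloopbackoff?", k=1))
# → the crashloop log line is retrieved as context
```

Swap the toy embedder for a real model and the list for an OpenSearch index and you have the shape of their pipeline: retrieval narrows PBs of logs down to a handful of relevant lines before the LLM ever sees them.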
Service Mesh
- Revolutionizing Sidecarless Service Mesh With eBPF - Zhonghu Xu & Muyang Tian, Huawei
- Mainly covered Huawei’s Kmesh, with detailed explanation of the underlying implementation architecture (actually very similar to what I heard at last year’s KubeCon)
- Simply put, Ambient Mode uses istio-cni (iptables underneath) to intercept traffic into the user-space ztunnel for L4 processing, while Kmesh implements these L4 functions at the kernel level using eBPF. He also briefly introduced Cilium Service Mesh, a per-node proxy, whose main drawbacks are requiring the Cilium network plugin and its primitive, complex CRDs
- Kmesh also attempted to implement HTTP protocol parsing with eBPF, but this requires kernel patching, which is costly
- KubeCon EU 2025 - Choosing a Service Mesh - Alex McMenemy & Dimple Thoomkuzhy, Compare the Market
- While most of what I’ve encountered uses Istio, it’s always good to see how others make their choices
- KubeCon EU 2025 - Navigating the Maze of Multi-Cluster Istio: Lessons Learned at Scale - Pamela Hernandez, BlackRock
- Multi-cluster Istio is used in quite a few large companies - I was asked about it in interviews, worth trying out
- KubeCon EU 2025 - A Service Mesh Benchmark You Can Trust - Denis Jannot, solo.io
- Creating a good benchmark comparison takes a lot of time and effort - it’s most convenient to just look at the results others provide (:
Ingress-Nginx
The Next Steps for Ingress-NGINX and the Ingate Project - Jintao Zhang, Kong Inc.
Ingress-NGINX is finally being retired, with its successor being InGate, though InGate is currently almost empty (:
Code:

Security
Keynote: Who Owns Your Pod? Observing and Blocking Unwanted Behavior at eBay With eBPF
Mainly introduced Cilium’s Tetragon, an eBPF-based K8s security tool, somewhat similar to AppArmor but capable of more fine-grained permission management.
A friend argued that such tools aren’t really necessary - we should use GitOps workflows and move security checks into the CI/CD pipeline instead.
Cloud Cost Analysis and Optimization
KubeCon EU 2025 - Autonomous Al Agents for Cloud Cost Analysis - Ilya Lyamkin, Spotify
Implementation of a Multi-Agent system that automatically creates plans, writes SQL and Python for cloud cost analysis - very valuable reference.
WASM Related
Keynote: An Optimized Linux Stack for GenAI Workloads - Michael Yuan, WasmEdge
Discussed using WasmEdge + LlamaEdge to run small LLM models on edge devices - quite interesting.
How to Build an AI Workflow
An hour-plus tutorial by IBM that installed many components, including Kueue, Kubeflow, PyTorch, Ray, vLLM, and Autopilot.
Non-Tech
Attending KubeCon isn’t just about listening to technical changes and progress from the past year - it’s also an important opportunity to socialize with developers from various fields, kind of like a large-scale online friend meetup (:
This year, I got @scruelt, @ox-warrior, and other friends to come to KubeCon, and at the venue I met up with @cookie, @rizumu, @ayakaneko, and @dotnetfx35 for casual chats. I received 3D-printed Kubernetes and Go cookies from @rizumu and @ayakaneko, and incidentally spread the word about NixOS (:
Meetup successful! Also spread the word about NixOS
K8s/Go cookies and Istio fridge magnets received
On Day 2 morning, there weren’t many talks I wanted to attend, but I noticed there was a Peer Group Meeting I could join, though it required signing up first. I went with @scruelt to sign up, and we were a bit worried that signing up just 20 minutes beforehand might be too late, but when we got to the meeting room, we found only 3 mentors present, so we just chatted casually with them. The three mentors were Nate Waddington (Head of Mentorship & Documentation, Canada), Kohei Ota (CNCF Ambassador, Japan), and Amit DSouza (co-founder of Odyssey Cloud, Australia). A Cisco engineer also joined halfway through.
It was mostly casual conversation. @scruelt’s English is better than mine, and since he just resigned, he had many questions to ask - he initiated most of the topics. As for me, since everything has been going smoothly lately, I didn’t have many questions to ask.

Entered the Peer Group Meeting to find only Mentors hhh
Let’s end with some photos.
Welcome to KubeCon China 2025
Got a T-shirt first hehe
Coffee break time
Want that SUSE plush toy!
A small SUSE on a big SUSE
Using tetragon to restrict file access
LWS Talk, discussing PD separation
Switch store promoting Miku Boxing
Three friends bought a Switch 2 here during KubeCon - the store made a killing

All my 'loot' hhh
Boarding, goodbye Shenzhen
How many times have I flown now?
Had a great time, see you next year!