May 2-4, 2018 - Copenhagen, Denmark
Friday, May 4 • 14:45 - 15:20
Federated Prometheus Monitoring at Scale - Nandhakumar Venkatachalam & LungChih Tung, Oath Inc (Intermediate Skill Level) (Slides Attached)

In Media Build and Products at Oath, we run 12 production Kubernetes clusters across our data centers, totaling ~1200 machines with multi-tenant deployments. We monitor these clusters with Prometheus: each cluster runs its own Prometheus instance, and a single federated instance with persistent storage sits above them. In total we hold ~17 million time series (max 5 million per instance), with a sample ingestion rate of 300K (max 80K per instance). On the federated instance we have built dashboards for the Controller, Scheduler, API server, DNS, Kubelet, and Etcd, along with utilization overall and per tenant namespace, deployment, and container, which together give high visibility. We use Alertmanager for on-call alerting on cluster status, node availability, scrape status, file descriptor (fd) usage, etc. We would like to share our experience of monitoring multiple Kubernetes clusters in a multi-tenant environment.
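
As a rough illustration of the per-cluster plus federated layout described in the abstract, a federating Prometheus typically pulls selected series from each downstream instance through its `/federate` endpoint, passing one or more `match[]` selectors. The sketch below only builds those URLs. The cluster names, hostnames, and selectors are hypothetical and are not taken from the talk; the speakers' actual setup may differ.

```python
from urllib.parse import urlencode

# Hypothetical per-cluster Prometheus endpoints (assumed for illustration;
# the abstract does not name clusters or hosts).
CLUSTER_ENDPOINTS = {
    "cluster-a": "http://prometheus.cluster-a.example:9090",
    "cluster-b": "http://prometheus.cluster-b.example:9090",
}


def federate_url(base_url, selectors):
    """Build the /federate URL that requests series matching the selectors."""
    # /federate takes repeated match[] parameters, one per series selector.
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


def federation_targets(endpoints, selectors):
    """Map each cluster name to the URL a federated instance would scrape."""
    return {name: federate_url(url, selectors) for name, url in endpoints.items()}


if __name__ == "__main__":
    # Example selectors (assumed): kubelet metrics plus the built-in "up" series.
    for name, url in federation_targets(CLUSTER_ENDPOINTS, ['{job="kubelet"}', "up"]).items():
        print(name, url)
```

Keeping the selectors narrow is one common way to limit how many series the federated instance has to ingest from each cluster.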

LungChih Tung

Software Developer II, Oath Inc
LungChih Tung is a software engineer on the core infrastructure team at Oath Media Build and Products. LungChih has been working on building core infrastructure with Kubernetes, monitoring systems, and automating cluster management operations.
Nandhakumar Venkatachalam

Principal Production Engineer, Oath Inc
Nandhakumar Venkatachalam is a Principal Production Engineer and lead of the Kubernetes Infrastructure/Cluster Management team at Oath Media Build and Products. He is a subject matter expert and solution architect specializing in high availability. Nandha has been with Oath for 11 years and...

Friday May 4, 2018 14:45 - 15:20
Auditorium 11+12