About Me
Experience
Production Owner & Lead SRE
UDI (Unique Device Identification) System, Apr 2023 – Now
Position: Production Owner & Lead SRE | Period: Apr 2023 – Present

Job description: Regulators required every medical device to carry a globally unique identifier and to be traceable from manufacture to patient use. The platform had to ingest 1–2 M device records per day, retain 7 years of history (500 TB), and stay online 99.9% of the time.

Environment:
• 8 bare-metal nodes (256 vCPU, 1 TB RAM)
• 20 KVM guests across two data centers
• Kubernetes 1.27, OceanBase 4.2, Hive 3.1, Airflow 2.7, Prometheus + Loki

What I Actually Did (Hands-on):
1. Architecture & Deployment
– Designed a three-tier K8s cluster with ingress, application, and data planes, all deployed with kubeadm and managed through GitOps (Argo CD).
– Wrote 12 Helm charts and 6 Kustomize overlays so that a rollback is a one-command operation.
2. Data Pipeline & Performance
– Built an Airflow DAG chain that moves data from Kafka topics → Hive ODS → ORC-based DWD → ClickHouse-served DM.
– Optimized 37 slow Hive SQL queries; average report latency dropped from 45 min to 6 min.
– Implemented bucketed and sorted ORC tables with Zstandard compression, cutting HDFS usage by 35%.
3. Database HA & Migration
– Migrated 2 TB of MySQL 5.7 data to OceanBase 4.2 with zero downtime using `ob-loader-dumper` plus a dual-write shadow mode.
– Set up a 3-node Paxos-based OceanBase cluster; p99 write latency is under 50 ms and automatic failover completes within 30 s.
4. Observability & Automation
– Deployed kube-prometheus-stack; wrote Python exporters for device-registry business metrics.
– Created Ansible playbooks that patch CVEs across 120 pods in 10 min; achieved 100% compliance every quarter.
5. Incident Handling
– Led 8 Sev-1 incidents (e.g., ClickHouse OOM during batch load). Root-caused them with eBPF flame graphs and fixed the OOM by tuning max_memory_usage; MTTR is now under 15 min.

Impact:
• Platform uptime: 99.95% (measured by black-box probes).
• Regulatory audits: 0 non-conformities in the last two inspections.
• Team cost: eliminated one FTE's worth of manual checks through automation.
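To put the availability and ingest figures above in concrete terms, the short sketch below converts them into a downtime budget and an average record rate. It is illustrative arithmetic only: the 30-day window, the assumption of evenly spread arrivals, and the function names are mine, not part of the platform's tooling.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed over `days` at the given availability.

    The 30-day default window is an assumption made for illustration.
    """
    return (1 - availability_pct / 100) * days * 24 * 60


def average_records_per_second(records_per_day: int) -> float:
    """Average ingest rate, assuming records arrive evenly over the day."""
    return records_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # 99.9% is the stated requirement; 99.95% is the measured uptime.
    for pct in (99.9, 99.95):
        budget = downtime_budget_minutes(pct)
        print(f"{pct}% over 30 days allows {budget:.1f} min of downtime")

    # 1-2 M device records per day, expressed as an average rate.
    for per_day in (1_000_000, 2_000_000):
        rate = average_records_per_second(per_day)
        print(f"{per_day:,} records/day is about {rate:.1f} records/s")
```

Under these assumptions the 99.9% target leaves roughly 43 minutes of downtime per 30 days and the measured 99.95% corresponds to roughly 22 minutes, while the ingest requirement averages out to about 12 to 23 records per second. Real traffic is likely burstier than this average suggests.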