About Me
Experience
Production Owner & Lead SRE
UDI (Unique Device Identification) System | Apr 2023 – Present
Position: Production Owner & Lead SRE | Period: Apr 2023 – Present

Job Description:
Regulators required every medical device to carry a globally unique identifier and to be traceable from manufacture to patient use. The platform had to ingest 1–2 million device records per day, retain 7 years of history (500 TB), and stay online 99.9% of the time.

Environment:
• 8 bare-metal nodes (256 vCPU, 1 TB RAM)
• 20 KVM guests across two data centers
• Kubernetes 1.27, OceanBase 4.2, Hive 3.1, Airflow 2.7, Prometheus + Loki

What I Actually Did (Hands-on):
1. Architecture & Deployment
– Designed a three-tier K8s cluster with ingress, application, and data planes, all deployed with kubeadm and managed through GitOps (Argo CD).
– Wrote 12 Helm charts and 6 Kustomize overlays so that a rollback is a one-command operation.
2. Data Pipeline & Performance
– Built an Airflow DAG chain that moves data from Kafka topics → Hive ODS → ORC-based DWD → ClickHouse-served DM.
– Optimised 37 slow Hive SQL queries; average report latency dropped from 45 min to 6 min.
– Implemented bucketed and sorted ORC tables and enabled Zstandard compression, cutting HDFS usage by 35%.
3. Database HA & Migration
– Migrated 2 TB of MySQL 5.7 data to OceanBase 4.2 with zero downtime using `ob-loader-dumper` plus a dual-write shadow mode.
– Set up a 3-node Paxos-based OceanBase cluster with p99 write latency < 50 ms and automatic failover within 30 s.
4. Observability & Automation
– Deployed kube-prometheus-stack; wrote Python exporters for device-registry business metrics.
– Created Ansible playbooks to patch CVEs across 120 pods in 10 min; achieved 100% compliance every quarter.
5. Incident Handling
– Led 8 Sev-1 incidents (e.g., ClickHouse OOM during batch load). Root-caused them with eBPF flame graphs and fixed the OOM by tuning max_memory_usage; MTTR is now < 15 min.

Impact:
• Platform uptime: 99.95% (measured by black-box probes).
• Regulatory audits: 0 non-conformities in the last two inspections.
• Team cost: eliminated one FTE's worth of manual checks through automation.