Job Description
Key Responsibilities
- Ensure 24/7 availability of critical infrastructure through proactive monitoring, maintenance, and troubleshooting of servers, networks, and storage systems.
- Optimize system performance and scalability by analyzing bottlenecks, tuning configurations, and implementing automation tools for resource management.
- Respond to incidents promptly, conduct root-cause analysis, and document solutions to prevent recurrence while maintaining SLA compliance.
- Deploy and manage Kubernetes clusters, including container orchestration, node provisioning, and integration with CI/CD pipelines.
- Implement security best practices and compliance standards to protect infrastructure assets and ensure data integrity.
- Collaborate with developers and DevOps teams to design scalable architectures and troubleshoot application-level issues.
- Monitor system metrics and logs to identify performance trends, optimize resource allocation, and improve overall system reliability.
- Stay updated on emerging technologies and industry trends to recommend infrastructure improvements and innovations.
- Document technical processes, configurations, and incident resolutions to ensure knowledge sharing and operational continuity.
- Perform regular system audits and capacity planning to anticipate future needs and ensure infrastructure readiness.
 
Job Requirements
- Proven experience in infrastructure management with a minimum of 5 years in system administration, DevOps, or related fields.
- Expertise in Kubernetes cluster deployment, configuration, and operation, including familiarity with related container tooling such as Docker (container runtime) and Helm (package management).
- Strong understanding of cloud platforms (AWS, Azure, GCP) and hybrid cloud environments for infrastructure scalability.
- Proficiency in scripting languages (Python, Bash, PowerShell) and automation frameworks for system maintenance tasks.
- Knowledge of network protocols, DNS management, and security practices (firewalls, encryption, IAM) to ensure infrastructure resilience.
- Ability to analyze system performance metrics and implement solutions for latency reduction and resource optimization.
- Experience with monitoring tools (Prometheus, Grafana, ELK stack) for real-time system health tracking and incident detection.
- Excellent problem-solving skills and an analytical mindset to diagnose complex technical issues and develop preventive measures.
- Strong communication abilities to collaborate with stakeholders, document technical processes, and present solutions effectively.
- Certifications such as Certified Kubernetes Administrator (CKA), AWS Certified Solutions Architect, or CompTIA Security+ are preferred.
- Ability to work in fast-paced environments with strong attention to detail and organizational skills.
- Experience with CI/CD pipelines and infrastructure-as-code (IaC) practices for automated deployment and configuration management.
- Understanding of disaster recovery strategies and business continuity planning for infrastructure resilience.
- Knowledge of containerization technologies and microservices architecture for scalable cloud solutions.
- Ability to design and implement secure, high-performance infrastructure solutions that meet enterprise requirements.
 


