Site Reliability Engineer | Cloud Infrastructure Specialist
masrur [at] masrur [dot] org
Site Reliability Engineer with 10+ years of experience designing, building, and operating highly available, scalable cloud-native applications on AWS and GCP. Proven expertise in infrastructure automation, DevOps practices, and leading cross-functional teams to deliver production-ready, reliable systems.
AWS, GCP, IBM Cloud
Terraform, Ansible, Puppet, Helm
Kubernetes, Docker
Python, Go, C++, Bash
GitLab, GitHub, ArgoCD
Prometheus, Grafana, Zabbix
Kafka, Splunk, Flink
MySQL, PostgreSQL
Leading cross-functional teams ensuring 99.9%+ uptime for critical AI infrastructure services. Designing reliable, secure cloud infrastructure using AWS and IBM Cloud.
Maintained 5+ customer-facing services with high availability requirements. Built distributed data architectures for streaming data and real-time analytics.
Built and operated cloud-native applications on AWS handling high-volume transactions. Maintained 10+ customer-facing services ensuring production readiness.
Created automation tools for cross-compilation of a 500k+ line codebase. Implemented comprehensive monitoring and observability solutions.
Ph.D. in Physics - University of British Columbia
Why custom markdown instructions beat generic automation tools
November 2025
Rethinking infrastructure-as-code for the age of AI agents
November 2025