Steve Kilduff

available

Last update: 06.09.2022

Database / Site Reliability Engineer

Company: Kilduff Informatics
Graduation: Bachelor's degree
Languages: German (Limited professional) | English (Native or Bilingual)

Skills

Everything is a trade-off in system operations.

Over the past 15 years I have used the following technologies and software:
chef, puppet, ansible, cfengine, terraform, packer configuration management; mcollective orchestration; devops attitude.
jenkins, gitlab-ci, bamboo build automation.
pfsense, brocade zxtm, amazon web services, elastic load balancers.
xymon, nagios, icinga2, cacti, smokeping, pingdom, TIG (telegraf, influxdb, grafana), TICK (telegraf, influxdb, chronograf, kapacitor).
elasticsearch, kafka, zookeeper, mssql, mysql, postgresql, cratedb, aurora, redis.
aws ec2, openstack, virtualbox, kvm, ovirt, vagrant, terraform.
debian / ubuntu, but Red Hat / CentOS are the preferred OS.
ipsec / racoon / strongswan / pfsense.

Project history

07/2019 - Present
Lead Database Reliability Engineer
Private (Internet and Information Technology, 5,000-10,000 employees)

Database operations for a leading cyber security company using ansible, terraform, packer, jenkins, on AWS.
Documenting deployments, backups, restores, ticketing on Jira.
Development of elasticsearch maintenance pipelines, ec2 instance retirement notices and replacements, elasticsearch upgrades, backups, restores.
Development of elasticsearch monitoring and alerting using telegraf, influxdb, grafana.
Maintaining 100 self-managed elasticsearch clusters (2000 instances total) on AWS, not the managed Amazon Elasticsearch Service.
Assisting teams in migrating and updating ES versions. Advising teams on ES best practices and feeding back improvements to them.
Cost saving initiatives - scaling clusters up and down relative to their actual usage.
Maintaining/automating kafka / zookeeper stacks.
Development of kafka maintenance pipelines, ec2 instance retirement notices and pipelined node replacements.
Maintaining redis stacks.
Maintaining the jenkins environment on kubernetes.
Implemented an "s3 bootloader" for flashing OS boot drives on AWS: very fast OS replacement with data disks left intact.
Maintaining various AWS aurora/postgres stacks with terraform.

01/2018 - 07/2019
Lead Site Reliability Engineer
Private (Internet and Information Technology, 1000-5000 employees)

Management / maintenance of 250 miscellaneous Linux servers (bare metal and VMs).
Managing software deployments, supporting new developer requests, creating development/automation pipelines.
Migrating from “cdist” (bash/scp) automation to ansible.
Building automation and deployment pipelines with gitlab, terraform, packer, ansible on AWS and OCI.
Planning and groundwork for upgrades from CentOS 5/6 and Debian/Ubuntu* to CentOS 7.
Initial deployment of development and production environments in AWS VPCs using terraform modules.
Creation of ipsec mesh network using ansible to migrate old datacenters and servers to new AWS VPCs.

09/2016 - 06/2017
Lead Devops Engineer
Travian Games GmbH (Internet and Information Technology, 50-250 employees)

Management / maintenance of 1700 CentOS 6 VMs.
CentOS, git, nginx, php, MySQL, nagios, icinga2, grafana, F5, bacula, vmware, aws, bamboo, puppet, vagrant stack.
Planning and groundwork for initial upgrades of core systems to CentOS 7.
Migrating several custom in-house solutions (bash, python, php, service-now) to puppet 4 and hiera.
Deployment of a new datacenter for a new game project.
Icinga2 deployment with puppetdb driven host/service detection/collection.
Improving vagrant environment to support and test new tools.
Interviewed 20 applicants and hired 3 staff members to bring the permanent team back to 9 members.

07/2016 - 09/2016
Senior Devops Engineer
S&P Global (Banks and financial services, 1000-5000 employees)

Improve the integration and testing environment between development and production.
Consider and understand the needs of 20 developers with different personal dev habits on the same project.
Ensure "vagrant up" works for everyone and that everyone can use it for their needs.
CentOS, git, apache, weblogic, cassandra, node js, nagios, ansible, vagrant, AWS stack.
Improve monitoring, alerting, pipeline/build cycles in build management software.

02/2014 - 04/2016
Lead Site Reliability Engineer
Shopping Guide GmbH (Internet and Information Technology, 50-250 employees)

Ensure ciao.de and sister sites have 99.9% uptime (during Shopping Guide's ownership).
CentOS, git, apache, php, xymon, zxtm, bacula, ovirt, puppet, vagrant stack.
Improve monitoring, alerting and graphing.
Migrated cfengine automation scripts to work on localhost vagrant and puppet 3.7.
Upgraded / reinstalled 150 CentOS 5 instances to CentOS 6 / 7.
Migrated developer environments to reuse production puppet code on vagrant.
Migrated php 4, 5.2 and 5.3 application code and stack to php 5.5 with developers.
Management of 20 hardware node ovirt virtualization cluster.
RPM build management for custom packages (optimized apache, php, nginx builds).

05/2012 - 02/2014
Senior Devops Engineer
Xing events GmbH (Internet and Information Technology, 50-250 employees)

Create chef cookbooks to run amiando's software; previously they had no automation.
Ubuntu, git, apache, tomcat, rabbitmq, nagios, mysql, memcached stack.
1st DC migration, from bare-metal DC to AWS. Time limit meant cutting some corners.
Ensure xing-events.com on AWS (formerly amiando.com) had 99.9% uptime.
2nd DC migration, redesign of subnets, redundant gateways, ELB, and chef software upgrade.
Management of the office server environment and network.
Management of the 2-3 person desktop support team, including budgeting and planning.

09/2009 - 05/2012
Senior Site Reliability Engineer
Microsoft Deutschland GmbH (Internet and Information Technology, >10,000 employees)

Ensure ciao.de and sister sites have 99.9% uptime (during Microsoft's ownership).
Development of cfengine automation scripts / pipelines.
Maintenance of core infrastructure, dns, tftp, pxe, dhcp services, jump boxes, provisioning services.
Upgrade path of servers and software from CentOS 4/5 to CentOS 6.
Assisted in DC migration from bare metal unmanaged DC to Microsoft DC in Dublin.
Creating new monitors, graphs and alerts in xymon, cacti and ganglia.
Creation and management of a large test lab (200 bare-metal servers plus switches, 500 virtual machines).

01/2007 - 06/2009
Linux / FreeBSD System Administrator
Bluetmetrix Ltd (Internet and Information Technology, 10-50 employees)

Management of 80 FreeBSD servers in baremetal DC in Copenhagen.
Management of statistic collection software and servers similar to google analytics.
Ensure collection software has 99.99% uptime.
Improvement of monitoring, graphing and alerting with nagios and zabbix.
VPN and IPsec setup / maintenance, general networking on *nix.


Local Availability

Only available in these countries: Germany, Austria and Switzerland
Available for remote work currently due to Coronavirus