I worked as an office sysadmin: ordering equipment, installing operating systems, configuring software, repairing office equipment, and answering users’ questions.
I maintained and expanded a server fleet for commercial web projects, ensuring their stability.
Over two years, I installed a cluster for the company’s commercial projects and automated its deployment. I set up system monitoring at the company, improving the quality of service delivery.
I launched IP telephony for the sales department and simplified document flow for the accounting department in the office. I documented the infrastructure and launched the company’s first knowledge base.
I founded a hosting platform for e-commerce as part of a separate division within an integration company.
I managed a small operations team, automated processes and infrastructure, corresponded with clients, and promoted our services.
I organized work with the sales, development, and technical support departments. I built a financial model, designed tariff plans for our service, and developed strategies for promoting it among customers.
I installed, configured, and automated the core of our service, and launched monitoring and a backup service.
I administered internal and external infrastructure for commercial B2B and B2C projects. I launched over 10 projects, from designing the architecture to handing them over to the support team and, occasionally, decommissioning them.
I organized a technical venue for meetups, helping employees share knowledge. I published videos on YouTube and developed DevRel, helping the company build its HR brand and attract job seekers.
I developed a backup monitoring system, reducing reaction time to data backup problems.
I launched a geo-redundant, fail-safe, and scalable Kafka cluster to handle traffic exceeding 1 Gbps. I configured Kafka broker monitoring for use by development teams.
I launched a unified Grafana metric visualization system, improving the quality of monitoring and project support.
I launched a geo-redundant and scalable ClickHouse cluster for monitoring and solving business problems in commercial projects. I built a centralized system for collecting logs from load balancers, making event searches by support engineers up to 100 times faster.
I launched an office access control and surveillance system and improved cooperation with the security agency. These changes eliminated security incidents.
I launched artifact storage for application releases, consolidating common dependencies across projects. This change had a significant impact on the development, CI, and quality testing of projects.
I launched a corporate Kubernetes cluster for the company’s projects on bare metal. I automated deployment and configuration of the operating system and platform, describing infrastructure as code. I configured the first S3-compatible object storage for GitLab artifacts.
I launched VictoriaMetrics for collecting application metrics and a cluster for collecting and archiving application logs based on Vector and Loki.
About the company. Kuper (formerly SberMarket) is an eCommerce project that delivers products and goods from favorite stores and restaurants. It consists of three monolithic Ruby/Rails applications and several hundred microservices associated with them.
The company’s organizational structure is hierarchical: the IT department consists of domains, domains consist of units, and units consist of teams with engineers. Teams are primarily cross-functional.
I joined the team in 2022 as a Senior DevOps Engineer in the IT operations domain, and since the end of 2022, I have led an SRE subgroup for the Customer domain, responsible for the reliability of the customer path in the mobile and web applications.
More than 50% of the online store’s traffic goes to the “storefront”, the largest monolithic Ruby/Rails application, which more than 45 teams work on.
Since mid-2023, I have headed the unit responsible for the reliability and performance of the storefront: the order lifecycle, payments, and fiscalization.
SDLC optimization. I analyzed and optimized the SDLC and made it observable. I reduced build time by 90%, sped up deployment by 15%, and sped up testing by 50% by parallelizing unit tests and automatically labeling flaky tests.
I doubled the release frequency while maintaining a low error rate when releasing new versions.
I launched the first version of DORA metrics monitoring for projects, allowing me to maintain a balance between development speed (velocity) and system stability.
I completed the Code Push project, a technology for delivering updates to iOS and Android apps. As a result, we reduced the mobile development cycle time to a couple of hours and bypassed sanctions restrictions.
I containerized and migrated the “storefront” to the PaaS platform without downtime, which allowed us to unify the tools for development and operation.
Improving reliability. I added support for canary releases to the “storefront”, which allowed us to reduce the impact on users and shorten recovery time after a failure through automatic rollback of bad releases.
I initiated and launched the first version of SLO/SLA in the company and rolled the tool out to 100% of critical services. I analyzed the critical user path for the Grocery and RTE (restaurant) business lines and developed monitoring for it, which allowed us to make system failures observable and evaluate their impact on the business.
Preparation for high loads. I completed the migration of content from the “storefront” to the PIM system, which allowed us to reduce load and increase cluster capacity.
I organized the work of the load testing team and developed a methodology that allowed us to conduct regression testing of application capacity in the company. We increased the number of tests sixfold, covered 92% of user traffic with tests, and became a driver for increasing capacity in development teams.
I initiated and led a cross-domain project with 200 participants to prepare Kuper for the high season. I developed a preparation strategy that included several plans and failure risk assessments using FMEA, organized the work of teams, and helped projects move forward. As a result, within six months, I increased the system’s performance by 2.2 times while maintaining availability at 99.9%.
Developing the IT brand. I launched a non-commercial DevRel project called “Architectural Katas” (Russian only): games for aspiring architects, aimed at popularizing system design skills within the company and beyond. In one year, I gathered a community of 1000+ people and held more than ten game sessions.
I spoke at 10+ IT conferences and meetups in Russian. I also published four articles and was nominated for “Технотекст-2023” on Habr.