The job of a cloud architect is to ensure the technical architecture is robust and can withstand the test of time. Workload optimization, cost optimization, and security are equally important concerns. It is imperative to implement the best practices suggested by the cloud service provider.
Whether it is the “AWS Well-Architected Framework,” the “Azure Well-Architected Framework,” or another cloud provider’s equivalent, there are similarities across the board. Cloud infrastructure, if not utilized appropriately, can wreak financial havoc in an organization.
Check out: AWS Solutions Architect Associate Practice Exams
So, what exactly is a well-architected framework?
Regardless of the cloud provider, below are the five pillars of a well-architected framework:
- Operational Excellence
- Cost Optimization
- Security
- Reliability
- Performance Efficiency
Let us dive into more detail about each of the points mentioned above:
1. Operational Excellence
The architecture should be designed to provide flexibility for future growth while ensuring stability. The operational excellence pillar ensures that your application and infrastructure are both reliable and running effectively at all times.
Below are some design principles and best practices included as part of operational excellence:
- Infrastructure as code – Automate your infrastructure so that you can spin up resources at a moment’s notice. Furthermore, for a complex and/or hybrid cloud environment, ensure that all platform-level dependencies are identified, understood, documented, and shared across operations teams.
- Make frequent, small, reversible changes – Ensure a systematic approach to the development and release process. DevOps practices like CI/CD must be followed for quick value delivery. Automate your unit and integration testing as part of the application deployment process. As a best practice, configuration settings should be modifiable without rebuilding or redeploying the application. Feature flags are also an excellent way to roll out new code.
- Monitor, visualize and act – The most basic things to track are application logs and resource-level statistics. Application-level events should be automatically correlated with resource-level metrics to quantify the current application state. Visualizing these trends helps you predict operational issues before they occur. Furthermore, capacity utilization should be monitored and used to forecast future growth.
- Code to prevent failures – A good practice is to deploy your application across multiple active regions or other deployment locations. Your workload must be built for self-healing and resiliency. Where needed, enable autoscaling for supporting PaaS and IaaS services. In case of any unexpected event, it is critical to design recovery strategies that minimize downtime. For example, when using multi-region deployments, consider quickly moving traffic from one region to another without impacting user experience.
- Refine operations procedures frequently – Based on the application, set up availability targets such as Service Level Agreements (SLAs) and Service Level Objectives (SLOs). Use Role-Based Access Control (RBAC) to control access to operational and financial dashboards and the underlying data. Also ensure that the regulatory and governance requirements of all workloads are known and well understood.
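To illustrate the feature-flag technique mentioned above, here is a minimal Python sketch. The flag store, flag name, and rollout percentage are hypothetical; real systems typically use a managed service such as AWS AppConfig or LaunchDarkly rather than an in-process dictionary.

```python
import hashlib

# Hypothetical in-memory flag store; a real system would read this from
# a configuration service so it can change without redeploying.
FLAGS = {"new-checkout": {"enabled": True, "rollout_percent": 25}}

def is_enabled(flag_name: str, user_id: str) -> bool:
    """Deterministically bucket a user into a percentage rollout."""
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Hash the user id so each user lands in a stable bucket in [0, 100).
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

# The same user always gets the same answer, so a rollout is reversible
# simply by lowering rollout_percent, without a redeploy.
print(is_enabled("new-checkout", "user-42"))
```

Because the bucketing is deterministic, ramping the percentage up or down moves users in and out of the feature predictably, which is exactly the "small, reversible change" this pillar asks for.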
2. Cost Optimization
Bringing down the total cost of ownership (TCO) for a true cloud architecture model is challenging. TCO is the metric that organizations use to quantify and measure cloud adoption success. Understanding TCO helps organizations with the return on investment (ROI) so that they can prioritize the highest business value delivery within the allocated budget.
Below are some design principles and best practices included as part of cost optimization:
- Not all use cases fit the cloud – Not all workloads and applications will end up using cloud-native functionality like autoscaling, platform notifications, and other features that cloud platforms offer. Understanding whether the application is cloud-native provides a useful high-level indication of potential technical debt for operability and cost-efficiency.
- Create and implement cloud financial management – The goal of cost modeling is to estimate the organization’s overall cost in the cloud. As part of cost modeling, you create logical groups of cloud resources mapped to the organization’s hierarchy and then estimate costs for those groups. Cloud providers often offer discounts on resource usage, so the choice between “Pay As You Go” and “Reserved Capacity” pricing should be based on the application and its use case.
- Analyze and attribute expenditure – Whether for production or non-production environments, allocating and monitoring budgets helps. Consider the ratio of non-production to production spend; if it is substantially higher, consider merging testing environments or revisiting why the cost is so high. It is also worth noting that some cloud regions are more expensive than others, so choosing the right region for the application is essential.
- Monitor costs – Once a cost model is implemented, alerts help keep track of the expenses. Consistent tagging helps streamline the budgets. The most common resources that need to be managed are computing resources, storage, and networking. The alerts should be sent to the application owners as decided at the organization level.
3. Security
Security is one of the most important aspects of any architecture. The security pillar includes the ability to protect organizational data, systems, and assets. Losing these assurances can negatively impact business operations, revenue, and an organization’s reputation in the market.
Below are some design principles and best practices included as part of the security pillar:
- Threat analysis – Threat analysis consists of defining security requirements, identifying threats, mitigating threats, and validating the mitigations. Use penetration testing to eliminate threats proactively. Organizations should monitor the security posture across workloads, and a central SecOps team should monitor security-related telemetry data and investigate security breaches.
- Apply security at all layers – From the virtual network all the way to the backend databases, each layer should be secured individually.
- Automate security best practices – Always use the DevOps approach to building and maintaining software. Automation, close integration of infrastructure and development teams, testability and reliability, and repeatability of deployments increase the organization’s ability to address security concerns rapidly.
- Protect data in transit and at rest – Regardless of where data is stored, it should be encrypted both in transit and at rest. Using the latest version of TLS is essential.
- Principle of least privilege – The Principle of Least Privilege states that a subject should be given only those privileges needed for it to complete its task. If a subject does not require an access right, the subject should not have that right. These security controls apply to all layers of the architecture.
- Prepare for security events – Organizations should embed an incident response team within the SecOps team. Organizations should build playbooks to help incident responders quickly understand the workload and components to mitigate an attack and do an investigation. Furthermore, these procedures should be automated as much as possible.
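For the data-in-transit principle above, a short Python sketch using only the standard library shows how to enforce a modern TLS floor for outbound connections; the exact minimum version your organization requires is a policy decision, TLS 1.2 here is just an example.

```python
import ssl

def make_tls_context() -> ssl.SSLContext:
    """Build a client-side SSL context that refuses legacy protocols."""
    # create_default_context() already enables certificate and
    # hostname verification; we additionally pin a protocol floor.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse TLS 1.1 and older
    return ctx

ctx = make_tls_context()
print(ctx.minimum_version == ssl.TLSVersion.TLSv1_2)
```

Any socket wrapped with this context (for example via `ctx.wrap_socket(...)`) will fail the handshake against servers that only speak older protocol versions, rather than silently negotiating down.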
Read: DevOps Basics
4. Reliability
A well-architected framework protects your application from single points of failure. The reliability pillar includes the ability of a workload to produce consistent and expected results at all times.
Below are some design principles and best practices included as part of the reliability pillar:
- Resilient to failures with automatic recovery – No matter how much we try to prevent software or hardware failures, nothing is guaranteed to work all the time over its life span. As a proactive measure, a self-healing architecture with automatic recovery comes to the rescue. For example, deploy the application across multiple regions so that if one region goes down, another is available. Most importantly, failover and failback steps and processes should be automated.
- Establish and test recovery procedures – Along with Service Level Agreements (SLAs) and Service Level Objectives (SLOs), it is also vital to establish recovery targets. In particular, how long the entire workload can be unavailable (Recovery Time Objective) and how much data is acceptable to lose during a disaster (Recovery Point Objective) are important metrics. Once you have established the targets, it is imperative to do a dry run in a production-like environment for credibility.
- Scale horizontally to increase aggregate workload availability – Autoscaling can be leveraged to absorb unanticipated peak loads and help prevent application outages caused by overloading. Keep in mind that merely enabling autoscaling is not enough; it must be tested, and the time to scale in/out must be measured.
- Take charge of capacity – Designing for application platform resiliency and availability is critical to overall application reliability. The benefit of the cloud is that capacity is available at the click of a button; however, the application code must support capacity changes. Some questions to ask: are the application processes stateless? Is session state non-sticky and externalized to a data store?
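The recovery targets discussed above (RTO and RPO) can be expressed as simple checks, which makes them easy to verify during the dry runs mentioned earlier. The target values and timestamps below are illustrative; real values come from business requirements.

```python
from datetime import datetime, timedelta

# Illustrative recovery targets, not recommendations.
RPO = timedelta(minutes=15)   # maximum acceptable data loss
RTO = timedelta(minutes=60)   # maximum acceptable downtime

def meets_rpo(last_backup: datetime, failure_time: datetime) -> bool:
    """True if the newest backup is recent enough to satisfy the RPO."""
    return failure_time - last_backup <= RPO

def meets_rto(failure_time: datetime, restored_time: datetime) -> bool:
    """True if the restore completed within the RTO."""
    return restored_time - failure_time <= RTO

failure = datetime(2024, 1, 1, 12, 0)
print(meets_rpo(datetime(2024, 1, 1, 11, 50), failure))   # 10 min of loss: OK
print(meets_rto(failure, datetime(2024, 1, 1, 13, 30)))   # 90 min outage: fails
```

Running checks like these against timestamps captured during a disaster-recovery drill turns the targets from aspirations into measurable pass/fail results.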
5. Performance Efficiency
As demand changes and technologies evolve, the ability to use computing resources efficiently to meet system requirements and maintain that efficiency is part of the Performance Efficiency pillar.
Below are some design principles and best practices included as part of the performance efficiency pillar:
- DevOps teams and automation – Organizations should deploy multi-regional workloads to reduce latency and lower costs. High-performing DevOps teams and automation reduce the time to go live, thereby improving time to market. Automation, along with visualization to track long-term trends, is essential for predicting performance issues before they occur.
- Elastic and responsive workload – It is important to monitor capacity utilization and use the data to forecast future growth. In some cases, configuring autoscaling is crucial to meet fluctuating demands. Serverless architecture often helps to not only reduce cost but also adapt rapidly to changes in demand.
- Performance efficiency outside of the application – Besides ensuring applications can scale with traffic, it is important to consider performance efficiencies in the networking stack, such as SSL offloading, using a CDN, and authentication/token verification offloading, to name a few.
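To make the capacity forecasting mentioned above concrete, here is a minimal least-squares trend sketch in Python. The utilization samples are illustrative, and a real forecast would use more history and a monitoring service's built-in trend tooling.

```python
def linear_trend(samples):
    """Least-squares (slope, intercept) over sample index -> value."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def forecast(samples, steps_ahead):
    """Project the fitted line steps_ahead past the last sample."""
    slope, intercept = linear_trend(samples)
    return slope * (len(samples) - 1 + steps_ahead) + intercept

cpu_percent = [40, 44, 47, 52, 55]        # illustrative daily peak utilization
print(round(forecast(cpu_percent, 5), 1))  # projected peak five days out
```

A projection like this answers the practical question behind the pillar: at the current growth rate, when does utilization cross the threshold where you must scale out or re-architect?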
Check out: Cloud Interview Questions
Regularly reviewing your existing architecture and design principles will help identify areas for improvement. In the spirit of agility, the review process is continuous and just as important as the business and operational goals.
The Cloud Well-Architected Framework provides architectural best practices across the five pillars for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud.
So, there you have it – The cloud well-architected framework.
Author: Haman Sharma is a cloud enthusiast. You can connect with him on LinkedIn.
Read next: Agile, DevOps, and CI/CD – How are they related?