Software Architecture Metrics: Measure what matters
When it comes to Quality Attributes, the art of a Solution Architect lies in combining principles and practices within a system. It requires a deep understanding of these criteria and the flexibility to choose which ones to apply, why, and how to measure them.
This article is also available in Vietnamese.
After years of teaching Solution Architecture, one of my favorite parts is the discussion of Quality Attributes, known academically as non-functional requirements (NFRs), which are important criteria for evaluating solutions. In fact, the daily work of a Solution Architect (SA) revolves around these criteria.
From my point of view, NFRs are one of the most difficult topics to teach, because it is not always easy for learners to grasp both the principles behind the criteria defined in textbooks and the practical solutions applied in real systems.
I always begin with: “Measure what matters”.
Given the complexity and fast pace of change in the software industry, a good architecture needs to evolve and improve over time. The development of any application or platform goes hand in hand with the growth of the business. A good application does not pop out of thin air from a single super-expert; it comes from a process of learning from feedback. It takes time for the architecture, the team, and the production process to mature. I believe that measurement is the most objective and transparent way to speed up this process.
In a previous post about DORA metrics, I discussed the criteria I often use to improve the delivery productivity of a high-performance team. Today, on a rainy day in Saigon, let’s discuss NFRs from the angle of practical measurement.
Performance - How fast does the system process a workload?
Performance is one of the most important and easiest to measure of the criteria an SA needs to pay attention to. There are generally two common ways of measuring it:
- Latency - How long it takes for the system to process a unit of work
- Throughput - The amount of work processed in a unit of time
For example, in a message processing system, latency is how long it takes for a single message to be processed, and throughput is the number of messages processed per minute. Lately, this criterion has been underrated because it gets confused with scalability: the fact that a system can scale to accommodate the workload does not mean it processes work efficiently. Today, with APIs everywhere, response time is a key metric for determining a system's performance.
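As a minimal sketch of both measurements (assuming a hypothetical `https://api.example.com/orders` endpoint and the third-party `requests` library), the loop below records per-request response times and derives throughput from the elapsed wall-clock time:

```python
import statistics
import time

import requests  # third-party HTTP client, installed separately

URL = "https://api.example.com/orders"  # hypothetical endpoint
N = 200  # number of sample requests

latencies = []
start = time.perf_counter()
for _ in range(N):
    t0 = time.perf_counter()
    requests.get(URL, timeout=5)
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

latencies.sort()
p50 = statistics.median(latencies)
p95 = latencies[int(0.95 * len(latencies)) - 1]  # approximate 95th percentile
throughput = N / elapsed  # requests per second

print(f"p50={p50 * 1000:.1f} ms  p95={p95 * 1000:.1f} ms  throughput={throughput:.1f} req/s")
```

For real load tests, a dedicated tool (k6, JMeter, Locust) is a better fit, but the idea stays the same: track percentiles, not just averages.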
Real challenges:
- Project teams face both technical and cost difficulties in creating a test environment that is close to production, so the measurement results often do not help reduce the actual risks.
- Creating realistic workload simulators is not simple for complex business problems, and development teams often skip it because they assume performance can be monitored later in operation.
- In operation, it is hard to pinpoint the cause of a performance problem because there is not enough log information, and performance troubleshooting often requires a lot of experience.
Tips: Measure early in the most realistic environment, and measure only the important features that need attention (80/20).
Scalability - Resource Allocation & Cost Effectiveness
Scalability is the ability of the system to scale as the workload increases, together with choosing the right configuration of allocated resources so that the system runs efficiently.
This criterion usually requires a scale test that monitors performance indicators to determine appropriate scaling conditions and sizing. It is closely tied to the choice of architecture, and it tends to surface logic bugs if the team did not keep scalability in mind during development. Weighing the pros and cons of public cloud services, choosing a scalable architecture and using serverless services has become popular among many developers.
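To illustrate what a basic scale test looks for, here is a hedged sketch (reusing the hypothetical `/orders` endpoint and only the standard library's thread pool) that steps up concurrency and watches whether throughput keeps growing; the point where throughput flattens while latency climbs is the saturation point that should drive scaling thresholds and sizing:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client, installed separately

URL = "https://api.example.com/orders"  # hypothetical endpoint
REQUESTS_PER_STEP = 200

def one_call(_):
    """Time a single request in seconds."""
    t0 = time.perf_counter()
    requests.get(URL, timeout=5)
    return time.perf_counter() - t0

for workers in (1, 2, 4, 8, 16, 32):  # step up concurrency
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(one_call, range(REQUESTS_PER_STEP)))
    elapsed = time.perf_counter() - start
    throughput = REQUESTS_PER_STEP / elapsed
    avg_ms = 1000 * sum(latencies) / len(latencies)
    print(f"{workers:>2} workers: {throughput:6.1f} req/s, avg latency {avg_ms:6.1f} ms")
```

If doubling the concurrency no longer increases throughput, scaling out the application will only move the bottleneck, so it is worth finding that bottleneck first.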
Real challenges:
- A scale test always reveals unexpected bottlenecks and bugs. It needs to be done! Yet many teams skip it and assess scalability only through architectural evaluation.
- The cost of services that scale without limits is also a big deal.
- Scalability can hide the fact that the system's processing performance is poor, which easily leads to pouring money into more resources.
Tips: Do scale tests early and often, and set capacity thresholds on services to avoid cost surprises.
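A capacity threshold can start as something as simple as the guard below, a sketch with assumed numbers (hourly price, budget, fleet size) that flags when the current scale, if sustained, would blow the monthly budget:

```python
# Assumed figures for illustration; replace with values from your provider's pricing and autoscaling data.
HOURLY_PRICE_USD = 0.096     # assumed on-demand price per instance
MONTHLY_BUDGET_USD = 2000.0  # agreed spending threshold

def projected_monthly_cost(instance_count: int, hours_per_month: float = 730.0) -> float:
    """Project monthly spend if the current fleet size were sustained."""
    return instance_count * HOURLY_PRICE_USD * hours_per_month

current_instances = 35  # e.g. the autoscaling group size at peak
cost = projected_monthly_cost(current_instances)
if cost > MONTHLY_BUDGET_USD:
    print(f"ALERT: projected ${cost:,.0f}/month exceeds the ${MONTHLY_BUDGET_USD:,.0f} budget")
else:
    print(f"OK: projected ${cost:,.0f}/month is within budget")
```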
Availability - Service continuity
Availability is usually measured as the percentage of time the system can provide service to users. This is an age-old and extremely important criterion for businesses because it is directly tied to business continuity. The metric is affected by failures that leave the system unable to respond, and also by proactive maintenance carried out by the system administration team.
There are two main measures that are often mentioned when discussing availability:
- Mean time between failures (MTBF): the average time between failures, indicating how frequently failures within the system lead to service disruption.
- Mean time to repair (MTTR): the average time to restore the system to its normal state after a fault occurs.
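These two numbers combine into the familiar availability percentage. A minimal sketch, assuming MTBF and MTTR are estimated in hours:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Commonly used steady-state approximation: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Assumed figures for illustration: roughly one failure per 30 days, one hour to recover.
a = availability(mtbf_hours=720.0, mttr_hours=1.0)
print(f"availability = {a:.5f} ({a * 100:.3f}%)")  # ~99.861%, still short of 'three nines'
```

Reading the result the other way around is just as useful: a 99.9% target over a 30-day month leaves a budget of roughly 43 minutes of downtime.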
In practice, MTBF can only be measured once failures actually occur in the production system, while MTTR is an easier criterion to model, measure, and improve while the system is being built. In addition, there are two criteria related to the service's data when a failure occurs:
- Recovery point objective (RPO): the amount of data loss that is acceptable when a failure occurs.
- Recovery time objective (RTO): the target time for the service and its data to return to normal after a failure occurs.
For example, a system that has to be restored together with its data will usually have an RTO higher than its MTTR. If you accept the highest possible RPO, meaning all data is lost and the system simply restarts empty, the RTO can drop back toward zero, and vice versa.
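A hedged back-of-the-envelope sketch of how these targets map to concrete design choices (the backup interval and restore duration below are assumptions):

```python
# Assumed backup/restore characteristics, for illustration only.
backup_interval_hours = 6.0   # a full backup is taken every 6 hours
restore_duration_hours = 1.5  # measured time to restore the latest backup

worst_case_rpo_hours = backup_interval_hours   # everything written since the last backup is lost
worst_case_rto_hours = restore_duration_hours  # service is back once the restore finishes

print(f"worst-case RPO ~ {worst_case_rpo_hours} h, worst-case RTO ~ {worst_case_rto_hours} h")
# Tightening RPO means backing up or replicating more often; tightening RTO means faster
# restore paths such as standby replicas. Both usually cost more.
```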
Real challenges:
- Disaster recovery (DR) is still a difficult problem. In fact, many systems never run DR exercises.
- MTBF is an important indicator but difficult to model and predict.
- The effect of many “nines” of availability makes the system feel almost perfect, but failures still occur.
Tips: Perform chaos testing, conduct planned system maintenance, limit dependence on individual people, and avoid single points of failure (SPOF).
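Chaos testing can start far smaller than full platform tooling. As a minimal sketch of the idea, the decorator below (the `fetch_profile` function is hypothetical) randomly fails or delays a configurable fraction of calls outside production, so the team can verify that timeouts, retries, and fallbacks actually work:

```python
import os
import random
import time

FAULT_RATE = float(os.getenv("FAULT_RATE", "0.1"))  # inject faults into 10% of calls by default

def inject_faults(func):
    """Randomly fail or slow down calls to exercise timeouts, retries, and fallbacks."""
    def wrapper(*args, **kwargs):
        if os.getenv("ENV") != "production" and random.random() < FAULT_RATE:
            if random.random() < 0.5:
                raise ConnectionError("chaos: injected dependency failure")
            time.sleep(2.0)  # chaos: injected latency spike
        return func(*args, **kwargs)
    return wrapper

@inject_faults
def fetch_profile(user_id: str) -> dict:
    # Hypothetical downstream call; replace with a real client in your system.
    return {"user_id": user_id, "name": "example"}
```

Tools like Chaos Monkey do the same thing at the infrastructure level; the point is that failure handling is exercised deliberately instead of being discovered during an incident.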
I am not covering Security in this post. It deserves a detailed article of its own, taking DevSecOps practices and trends into consideration.
Final thoughts
The key point is not to invent a perfect, ten-out-of-ten architecture, but to have a clear strategy and specific measurements that let the system improve gradually and keep the architecture balanced with the scale of the business. Which criteria to choose, why, and how to measure them: that is the art of a Solution Architect, combining principles and practices within a system.