Understand the WHYs in Observability to use it better

As applications become cloud native, observability is vital for every application. Logs, metrics and traces are the three pillars of observability.

Why Metrics when Logs are already there?

Application logs are generally verbose. Even if we compress more information into a single line, as access logs do, there is still one log entry for each event (say, an HTTP request). Holding information about multiple events in a single entry is much cheaper when what you ultimately need is a numerical value to act on.
Logs also have to be ingested into a database before we can act on them, and both ingestion and analysis must be near real time to respond to incidents.

Thus, metrics are nothing but properly modelled, concise logs intended to provide holistic system information (say, an overview of the system).
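To make the idea concrete, here is a minimal sketch of collapsing per-request log lines into a single counter per status code. The log format and the `requests_by_status` helper are assumptions made for illustration, not the format of any particular server.

```python
from collections import Counter

# Hypothetical access-log entries: one line per HTTP request.
ACCESS_LOG = [
    "2024-01-01T10:00:01Z GET /movies 200 35ms",
    "2024-01-01T10:00:02Z GET /movies 200 41ms",
    "2024-01-01T10:00:02Z POST /bookings 500 120ms",
    "2024-01-01T10:00:03Z GET /movies 200 38ms",
]


def requests_by_status(lines):
    """Collapse per-request log lines into one counter per status code."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp, method, path, status, duration.
        _timestamp, _method, _path, status, _duration = line.split()
        counts[status] += 1
    return counts


metrics = requests_by_status(ACCESS_LOG)
print(dict(metrics))
```

Four log entries become two numbers. The individual requests are no longer recoverable from the counter, which is exactly the trade-off metrics make.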

Pros:

A cheaper and faster way to see the system/application state in near real time.
A numeric representation of system information makes it easy and quick to act on the data. E.g.
- Alerting the on-call engineer, or
- Automated on-call actions like
  - Breaking the circuit to the affected downstream server
  - Restarting the application
  - Notifying the customer about non-recommended usage
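Because a metric is just a number, acting on it can be as simple as a comparison. The sketch below maps an error rate to one of the actions listed above; the `choose_action` function and both thresholds are hypothetical values picked for illustration.

```python
def choose_action(error_rate, page_threshold=0.05, break_threshold=0.5):
    """Map an error rate (0.0 to 1.0) to an action name.

    Both thresholds are illustrative assumptions; real values depend on
    the service and its SLA.
    """
    if error_rate >= break_threshold:
        return "break_circuit"
    if error_rate >= page_threshold:
        return "page_on_call"
    return "none"


for rate in (0.01, 0.08, 0.75):
    print(rate, choose_action(rate))
```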

Cons:

Metrics are not intended to track event-level problems. Tracking individual events produces high cardinality, which is costly and can even break the metrics system.
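A rough way to see the cardinality problem is to count the time series a metric can produce. The sketch below assumes the worst case, where every combination of label values occurs; the label names and the `series_count` helper are made up for illustration.

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Bounded labels: a handful of methods and status codes.
bounded = {
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
}

# Adding a per-event label (here a hypothetical user_id) multiplies the count.
per_event = dict(bounded, user_id=[f"user-{i}" for i in range(100_000)])

print(series_count(bounded))    # 6
print(series_count(per_event))  # 600000
```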

Shall we remove logs then?

No. When we need to debug an individual event, metrics don't have that information. Logs are essential for debugging in most cases.

Thus,

Metrics are useful for monitoring the system's performance and identifying anomalies in real time.
Logs are used to debug and narrow down failures.

Pros:

Debugging a particular event or a set of similar events. Logs hold every individual event, such as each HTTP request served or made, each message published or consumed, and each DB call.

Cons:

Aggregating individual event logs into system-level data is costly. E.g.
- Visualising events in a time-series dashboard
- Periodically running aggregate queries for n use cases

Why should we use another one - traces?

In traces, we capture the lifecycle of an event (e.g. an HTTP request) across various boundaries. We can visualise the spans across services.

Tracing the response time in multiple layers of your application, such as the database, cache, REST calls or any custom span.
Tracing requests across various boundaries (say, multiple applications). In a microservice architecture, tracing the same request across apps provides deeper information.

Pros:

It is easy to check what happens to an event (or, on average, to a group of events) in every layer, and how much time each layer takes.

Cons:

Tracing each and every request is costly, since the instrumentation itself adds overhead to the application.
Traces don't provide an overall picture of the system.
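One common way to limit the cost of tracing every request is to trace only a sample of them. The sketch below shows one possible approach, a deterministic decision derived from the trace ID so that every service holding the same ID makes the same choice. The `should_sample` function and the 10% rate are assumptions for illustration.

```python
import hashlib


def should_sample(trace_id, sample_rate):
    """Decide whether to trace a request, deterministically per trace ID.

    The first 8 bytes of a SHA-256 digest are read as an integer in
    [0, 2**64) and compared against sample_rate scaled to the same range.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Hypothetical trace IDs; roughly 10% of them should be selected.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_sample(t, 0.10)]
print(len(sampled) / len(trace_ids))
```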

Wake up call:

Imagine the on-call engineer gets paged in the middle of the night for a new Apdex alert for which the SOP is not documented yet. He/she has to look at the application metrics first to understand

Whether requests are failing or not.
If they are failing, what percentage of requests failed.
If not, how many requests are breaching the SLA.
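The three questions above reduce to simple arithmetic over request metrics. The sketch below uses made-up `(status, latency_ms)` samples and a hypothetical 500 ms target. It assumes that only 5xx responses count as failures and that failed requests count as frustrated. The Apdex score follows the standard definition: satisfied requests (at or under the target T) plus half of the tolerating ones (over T but at most 4T), divided by the total.

```python
def summarize(requests, target_ms=500):
    """Summarise (status, latency_ms) samples into failure, SLA and Apdex figures."""
    total = len(requests)
    ok_latencies = [ms for status, ms in requests if status < 500]
    failed = total - len(ok_latencies)

    satisfied = sum(1 for ms in ok_latencies if ms <= target_ms)
    tolerating = sum(1 for ms in ok_latencies if target_ms < ms <= 4 * target_ms)
    breaching = sum(1 for ms in ok_latencies if ms > target_ms)

    return {
        "failed_pct": 100 * failed / total,
        "sla_breach_pct": 100 * breaching / total,
        # Failed requests fall into neither bucket, so they count as frustrated.
        "apdex": (satisfied + tolerating / 2) / total,
    }


# Hypothetical samples: (HTTP status, latency in milliseconds).
samples = [
    (200, 120), (200, 300), (200, 900), (500, 50), (200, 2500),
    (200, 450), (200, 700), (503, 30), (200, 100), (200, 200),
]

report = summarize(samples)
print(report)
```

With these samples, 20% of requests failed, 30% succeeded but breached the 500 ms target, and the Apdex score is 0.6.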

If the error in the metrics carries no more information than the status code, the logs should be analysed.

Even if the metrics reveal the root cause of the issue, there is nothing wrong with quickly confirming it in the logs.

In this case, metrics and logs are used. Traces can certainly be used as well, to analyse the time taken by a few sample requests if the investigation needs it.

The point is

We should use metrics first for use cases like system-level failures.
Logs are the first place to check for an L2 ticket where a particular event (say, a movie booking for one user) failed.

Conclusion

Logs, metrics and traces are intended to solve different technical problems, but they complement each other. When they are used in combination, observing the system becomes easier. The diagram below is just a sample showing that any observability feature can be served by a combination of logs, metrics and traces; the right combination depends on the use case.

Observation of any scalable system is crucial for

Planning resource capacity
Testing the performance of the system
Detecting anomalies
Responding to disruptions in the system
Identifying the root cause
Scaling the system out or in
Measuring the cost of the system
Ensuring system quality on releases
Maintaining system availability

and more.

Thanks for reading.
