Napkin Math #19: Metrics For Your Web Application's Dashboards

                            March 20, 2022

At the beginning of the year I was helping readwise.io get some of their observability up to snuff. A few weeks later, "What should I monitor?" came up on another call, so I decided to list the metrics that I expect from a great dashboard:

Web Backend (e.g. Django, Node, Rails, Go, ..)
  Response Time p50, p90, p99, sum, avg
  Throughput by HTTP status
  Worker Utilization
  Request Queuing Time
  Service calls
    Database(s), caches, internal services, third-party APIs, ..
    Enqueued jobs are important!
    Circuit Breaker tripping /min
    Errors, throughput, latency p50, p90, p99
  Throttling
  Cache hits and misses %
  CPU and Memory Utilization
  Exception counts /min
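
To make a few of the web backend metrics concrete, here is a minimal Python sketch of how they could be computed from raw request records. Everything in it is an assumption for illustration: the `SAMPLE` data is made up, the function names are hypothetical, and worker utilization is taken to mean busy worker-time divided by available worker-time, which is one common definition rather than the only one. In practice your metrics library or APM would compute these for you.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical sample: (HTTP status, response time in ms) for one time window.
SAMPLE = [
    (200, 12.0), (200, 15.0), (200, 18.0), (200, 22.0), (200, 25.0),
    (200, 31.0), (200, 40.0), (404, 9.0), (500, 350.0), (200, 1200.0),
]


def latency_summary(durations_ms):
    """Return p50, p90, p99, sum, and avg for a list of durations."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(durations_ms, n=100, method="inclusive")
    total = sum(durations_ms)
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "sum": total,
        "avg": total / len(durations_ms),
    }


def throughput_by_status(records):
    """Count requests per HTTP status in the window."""
    return Counter(status for status, _ in records)


def worker_utilization(busy_seconds, workers, window_seconds):
    """Fraction of available worker-time spent serving requests (assumed definition)."""
    return busy_seconds / (workers * window_seconds)


def cache_hit_rate(hits, misses):
    """Cache hits as a percentage of all lookups; 0.0 when there were none."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


if __name__ == "__main__":
    durations = [ms for _, ms in SAMPLE]
    print(latency_summary(durations))
    print(throughput_by_status(SAMPLE))
    print(worker_utilization(busy_seconds=sum(durations) / 1000, workers=4, window_seconds=60))
    print(cache_hit_rate(hits=90, misses=10))
```

Note how a single slow request barely moves the p50 but dominates the p99 and the avg, which is why the list asks for several percentiles rather than one number.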

Job Backend (e.g. Sidekiq, Celery, Bull, ..)
  Job Execution Time p50, p90, p99, sum, avg
  Throughput by Job Status {error, success, retry}
  Worker Utilization
  Time in Queue
  Queue Sizes
    Don't forget scheduled jobs and retries!
  Service calls p50, p90, p99, count, by type
  Throttling
  CPU and Memory Utilization
  Exception counts /min
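
For the job backend, the same idea can be sketched from per-job timestamps. This is a rough illustration under assumptions, not how Sidekiq, Celery, or Bull expose these numbers: the `JobRun` record, its fields, and the helper names are all hypothetical, and the sample values are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JobRun:
    """Hypothetical record of one job attempt; timestamps are seconds since epoch."""
    queue: str
    enqueued_at: float
    started_at: float
    finished_at: float
    status: str  # one of "success", "error", "retry"


def time_in_queue(run):
    """How long the job waited before a worker picked it up."""
    return run.started_at - run.enqueued_at


def execution_time(run):
    """How long the worker spent running the job."""
    return run.finished_at - run.started_at


def throughput_by_status(runs):
    """Count job attempts per status in the window."""
    return Counter(run.status for run in runs)


def queue_backlog(ready, scheduled, retrying):
    """Total outstanding work, counting scheduled jobs and pending retries too."""
    return ready + scheduled + retrying


# Made-up sample data for illustration.
SAMPLE_RUNS = [
    JobRun("default", 100.0, 102.0, 105.0, "success"),
    JobRun("default", 100.0, 110.0, 111.0, "retry"),
    JobRun("mailers", 101.0, 101.5, 103.5, "success"),
    JobRun("mailers", 104.0, 109.0, 109.2, "error"),
]

if __name__ == "__main__":
    print([time_in_queue(r) for r in SAMPLE_RUNS])
    print([execution_time(r) for r in SAMPLE_RUNS])
    print(throughput_by_status(SAMPLE_RUNS))
    print(queue_backlog(ready=40, scheduled=15, retrying=5))
```

Keeping time in queue separate from execution time matters: a growing wait with flat execution time points at too few workers, not slow jobs.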

More details about what these all mean in the latest napkin post!
Any favourites of yours missing? Let me know.
P.S. On Thursday night, Eastern Time, I'll be doing a short talk about napkin math on memory bandwidth.
