March 20, 2022, 2:30 p.m.

Napkin Math #19: Metrics For Your Web Application's Dashboards


At the beginning of the year I was helping readwise.io get some of their observability up to snuff. A few weeks later, “What should I monitor?” came up on another call, so I decided to list the metrics I expect from a great dashboard:

  • Web Backend (e.g. Django, Node, Rails, Go, ..)
    • Response Time p50, p90, p99, sum, avg
    • Throughput by HTTP status
    • Worker Utilization
    • Request Queuing Time
    • Service calls
      • Database(s), caches, internal services, third-party APIs, ..
      • Enqueued jobs are important!
      • Circuit Breaker tripping /min
      • Errors, throughput, latency p50, p90, p99
    • Throttling
    • Cache hits and misses %
    • CPU and Memory Utilization
    • Exception counts /min
  • Job Backend (e.g. Sidekiq, Celery, Bull, ..)
    • Job Execution Time p50, p90, p99, sum, avg
    • Throughput by Job Status {error, success, retry}
    • Worker Utilization
    • Time in Queue
    • Queue Sizes
      • Don’t forget scheduled jobs and retries!
    • Service calls p50, p90, p99, count, by type
    • Throttling
    • CPU and Memory Utilization
    • Exception counts /min
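
To make a couple of these concrete, here is a minimal sketch of how the "p50, p90, p99, sum, avg" summaries and worker utilization could be computed from raw samples. The function names (`percentile`, `summarize`, `worker_utilization`) are made up for illustration, the percentile uses the simple nearest-rank method, and utilization is assumed to mean "busy worker time divided by available worker time" over a window. A real metrics stack will likely use histograms or sketches instead of sorting raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that very small p still returns the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_ms):
    """Summarize response (or job execution) times the way the list above suggests."""
    total = sum(durations_ms)
    return {
        "p50": percentile(durations_ms, 50),
        "p90": percentile(durations_ms, 90),
        "p99": percentile(durations_ms, 99),
        "sum": total,
        "avg": total / len(durations_ms),
    }


def worker_utilization(busy_ms, workers, window_ms):
    """Assumed definition: fraction of total worker time spent busy during the window."""
    return busy_ms / (workers * window_ms)


if __name__ == "__main__":
    # Hypothetical data: 100 requests taking 1ms..100ms.
    print(summarize(list(range(1, 101))))
    # Hypothetical data: 4 workers, 10s window, 30s of combined busy time.
    print(worker_utilization(busy_ms=30_000, workers=4, window_ms=10_000))
```

The sum is worth keeping next to the percentiles: dividing it by the window length and worker count gives the utilization figure directly, so the two metrics can be cross-checked against each other.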

More details about what these all mean in the latest napkin post!

Any favourites of yours missing? Let me know.

P.S. On Thursday night, eastern time, I’ll be doing a short talk about napkin math on memory bandwidth.
