Tracking the progress of a distribution

distributions

                            April 1, 2021

                Tracking the progress of a distribution

                        Tracking the progress of a distribution
I've been super busy lately,
with work, a baby, and house hunting.
Fun projects will have to take a back seat for now.
But! 
I recently found an opportunity 
to exercise some basic linear algebra
when working on a metrics problem.
It's simple enough to write up 
in the moments 
between house showings and baby duties.
To anonymize the domain,
let's say we're working with
sheep colored white, green, or purple.
You start with a flock of n sheep,
each of which may be any of the three colors.
You're given a target distribution
(p_1, p_2, p_3) over colors (white, green, purple),
and you want to make your flock
have that distribution of colors.
You can paint your existing sheep,
and/or buy new pre-painted sheep
in a desired color,
or sell sheep.
However, the operations you make to improve
the flock's distribution toward the target
aren't your only task.
You're a busy shepherd, after all,
and you can't spend all your time painting sheep.
Still, you want to track
how much progress your making each day,
and show a metric that describes
how well you're directing the 
(possibly small) effort you're spending
on improving the color distribution.
The problem is how to measure "progress"
faithfully in this situation.
On one hand, you could use a statistical measure
like total variation, or KL-divergence,
or Euclidean distance between the current distribution
and the target distribution.
But these measures are agnostic
to the size of the underlying flock.
For example,
if you used total variation distance,
and started with a flock of (10, 15, 24)
which has distribution (0.204, 0.306, 0.490),
and your target distribution was (0.3, 0.6, 0.1),
then the total variation distance is 0.39.
If you add a white sheep and a green sheep to the flock,
you get a distribution of (0.216, 0.314, 0.470),
which has total variation distance 0.37 from the target,
an improvement of 0.02.
We spent effort perfectly:
all the operations made had a maximal impact
on improving the distribution.
Now scale the flock up by 100,
and start with (1000, 1500, 2400).
Add one white and one green sheep,
and you only improve the total variation by 0.00019,
even though your effort was still spent perfectly.
The damn statistic makes us look bad!
All the statistical distance measures I'm aware of
have the same problem.
The reason is because these comparisons 
are of distributions,
and so they are not allowed to know 
about the underlying populations 
that the distributions represent.
What we need instead
is some way to measure progress
toward the target distribution
relative to the amount of progress
that could have been made
if we theoretically directed our effort
as much as possible toward improving the ratio.
When I thought about this,
at first I was stuck in "statistics" land,
and came up with a bunch of crappy measurements
that were various ratios of ratios.
Some examples showed that they weren't easy to interpret.
Then I realized (embarrassingly late)
that the population itself is a vector,
and so I can use basic linear algebra
to measure distances of population vectors.
From there a sensible approach felt obvious.
Let R, S be two length-n tuples 
of positive integers
representing the population vectors
before and after some change is made.
Let M_R and M_S be the sum 
of the entries in R and S, respectively,
representing the total flock size before and after.
Let T = (t_1, ..., t_n) be the target distribution.
Then define the 
relative improvement of S from R toward T
as follows,
where norm() denotes the usual Euclidean (L2) norm
and proj_T denotes the vector projection onto T:
improvement = 
  (norm(proj_T(R) - R) - norm(proj_T(S) - S))
  / norm(S - R)

In lieu of a picture,
imagine the target distribution vector
spanning a line in n-dimensional Euclidean space.
This line represents all population states—for 
all possible flock sizes (even non-integer)—that match
the target distribution perfectly.
Then proj_T(R) describes 
the closest ideal population state vector
for a given state vector R.
Note the projection proj_T(R) will always have
norm at most norm(R),
but may have a larger flock size (L1 norm).
Finally, norm(proj_T(R) - R) is how far
a given state is from ideal (error),
and the improvement formula above
subtracts the error of the "after" state
from the error of the "before" state,
and divides by the distance between the before and after states.
Geometry makes analyzing this formula easy.
It must always be between -1 and 1,
because at best you can place S 
on the line between R and proj_T(R),
in which case the numerator is equal to norm(R-S)
and the improvement is 1.
Conversely, the worst you can do is move straight away
from the target, and the improvement is -1.
This hints that the improvement formula
actually computes the cosine of the angle
between S-R and R - proj_T(R).
It's easy to prove.
View S-R as the hypotenuse of a right triangle.
One base lies along R - proj_T(R)
and has length 
norm(proj_T(R) - R) - norm(proj_T(S) - S).
The improvement ratio computes
the ratio of base length to hypotenuse length,
i.e., the cosine of the desired angle.
My intended application of this metric is boring.
It has to do with computers in data centers.
But it could be useful in more situations.
Imagine a recommendation system with a target distribution
over recommendation categories like
"revenue gaining" and "exploratory." 
Progress is relative to 
the number of recommendation opportunities.
Or imagine a hiring system
in which you want to ensure you're receiving
resumes from enough applicants from 
historically disadvantaged groups.
Progress is relative to the amount
of money you spend on recruiting efforts.
I haven't used this metric
in practice yet,
so I can't say whether it works well
even for my intended application.
There's the possibility that for state vectors
starting close to the ideal line,
progress will look wonky
because you'll have no choice
but to overshoot.
You can alleviate that
by adding error bars
around the ideal,
inside of which you don't track progress at all.
But mostly it feels right 
compared to the less coherent formulations
that preceded it.
The geometry also provides a nice framework 
with which to reason about its behavior.
This means I spend less effort justifying
the metric to skeptical stakeholders.

                            Don't miss what's next. Subscribe to Halfspace: