For now, jump over to Twitter if you're waiting for a typical Film School post. I've got plenty of video breakdowns and MS Paint-tier markups on important plays. Until later this week, Film School is out of session, and Math Class is in. I'll dispense with the bad jokes, so without further ado...
I'm known as a data nerd on the USL corners of the internet, and for good reason. I love a good player radar, I'm constantly referencing xG, and I publish my own self-calculated playoff odds. This week, Jordan Doenges on Twitter weaponized my 2021 post on defensive action leaders to argue that Sean Totsch of Louisville City is underwhelming. Indeed, Totsch rated in just the 17th percentile for defensive actions amongst central defenders in 2021. On a raw basis in 2022, he's up to just the 49th percentile, still below average. What's going on? Totsch is a club legend in Kentucky, so why does the data look so bad for him?
Simply put, raw values are misleading. I own that fact in regard to my post from last year; it was devoid of context. After all, a defender for, say, Atlanta is going to face many more shots and bear the brunt of much more attacking play than a Louisville peer. The player on the team that sees less of the ball naturally has more opportunities to garner tackles and clearances. When you weight defensive actions by a team's possession in 2022, Totsch suddenly rises to the 77th percentile for defensive actions. I would argue that this new number is much more indicative of his impact.
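If you want to see what I mean by "weighting by possession," here's a minimal sketch of one way to do it. The column names and figures are made up for illustration, not pulled from my actual dataset, and this is just one reasonable flavor of possession adjustment.

```python
import pandas as pd

# Hypothetical per-player figures: defensive actions per 90 and the
# share of possession each player's team averages (as a percentage).
df = pd.DataFrame({
    "player": ["CB on a low-possession team", "CB on a high-possession team"],
    "def_actions_p90": [9.0, 6.5],
    "team_possession_pct": [42.0, 58.0],
})

# A team holding 42% of the ball leaves its defenders roughly 58% of the match
# to rack up tackles and clearances; dividing by the time spent *out* of
# possession puts everyone on a more even footing.
df["adj_actions_p90"] = df["def_actions_p90"] / ((100 - df["team_possession_pct"]) / 100)

print(df.sort_values("adj_actions_p90", ascending=False))
```

On these invented numbers, the gap between the two centerbacks nearly disappears once the adjustment is applied, which is exactly the Totsch effect.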
That same principle applies to players other than Totsch, of course. Nearby, you can see a ranking of the USL's top players by defensive actions. I've added various weighted alternatives for context. The highs and lows vary greatly! Your choice of adjustment category is super influential. Atlanta, for instance, is actually a high-possession team, but they're horribly outshot because of leaky transition defense. Take a look at Nelson Orji's varied totals for evidence. The same thing applies to a less extreme degree with Brendan Lambe lower on the graphic.
By comparison, take Jasser Khemiri, the second-ranked player by raw actions. He suffers under the possession weighting, but an adjustment that recognizes San Antonio's innate edge in dangerous chances changes things completely. Possession and danger aren't the same thing. Which rank is "more accurate"? It's impossible to say. What matters is that anyone who uses these statistics recognizes that every number comes with a bias.
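To show how much the choice of denominator matters, here's a small sketch comparing two hypothetical adjustments: one by time out of possession, one by shots conceded. The players and figures are invented; the reshuffling is the point.

```python
import pandas as pd

# Illustrative numbers only; the real table behind the graphic isn't shown here.
df = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "def_actions_p90": [10.2, 9.8, 8.1, 7.4],
    "team_possession_pct": [44, 56, 61, 48],
    "team_shots_conceded_p90": [13.5, 9.0, 14.2, 10.8],
})

# Two different denominators: time out of possession vs. shots the team concedes.
df["per_oop"] = df["def_actions_p90"] / ((100 - df["team_possession_pct"]) / 100)
df["per_shot_conceded"] = df["def_actions_p90"] / df["team_shots_conceded_p90"]

# Percentile ranks under each view; the ordering can flip entirely.
for col in ["def_actions_p90", "per_oop", "per_shot_conceded"]:
    df[col + "_pctile"] = df[col].rank(pct=True).round(2)

print(df)
```

Run that and the raw leader drops behind a teammate-of-the-ball type under one adjustment and climbs back under the other, which is the Khemiri situation in miniature.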
Shot totals also vary greatly depending on how you adjust them. In the graphic, the Louisville players - Jorge Gonzalez, Wilson Harris, and Enoch Mushagalusa - all show high rates of variance. They're lower down the ranks for their share of their team's shot total (Louisville shoots a lot). They also suffer when you adjust for Louisville's high rate of possession and the number of touches those attackers get in the final third. By comparison, Frank Lopez and Dylan Borczak of Rio Grande Valley gain when you weight by touches. RGV is a counter-centric side, and those two are invited to shoot without a second thought when their team is in a dangerous area. Again, context matters.
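The same idea applies on the attacking end. Here's a rough sketch of two of the shot adjustments I'm describing, share of team shots versus shots per final-third touch, again with invented numbers rather than the real Louisville or RGV data.

```python
import pandas as pd

# Hypothetical attacker figures: shots, team shot volume, and final-third touches.
df = pd.DataFrame({
    "player": ["High-possession attacker", "Counter-attacking forward"],
    "shots_p90": [3.1, 2.6],
    "team_shots_p90": [16.0, 10.5],
    "final_third_touches_p90": [28.0, 12.0],
})

# Share of the team's shots penalizes players on shot-happy teams...
df["shot_share"] = df["shots_p90"] / df["team_shots_p90"]
# ...while shots per final-third touch rewards players who shoot the moment they arrive.
df["shots_per_f3_touch"] = df["shots_p90"] / df["final_third_touches_p90"]

print(df)
```

On these toy numbers the counter-attacking forward jumps the higher-volume attacker on both adjusted measures despite fewer raw shots, the Lopez and Borczak pattern.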
Still, each and every one of the categories is valuable and predictive in its own right. One note: I've mislabeled R-squared as correlation in the table, but the comparison of shot rates with xG holds up. There's a fairly strong link between each weighted category and a player's expected goals. In other words, these numbers matter, but they each tell a different story.
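For anyone who wants that distinction spelled out: correlation (Pearson's r) measures the strength of a linear relationship, and R-squared is simply its square, the share of variance explained. A quick sketch with made-up shot-rate and xG figures:

```python
import numpy as np

# Made-up per-player values: a weighted shot rate and expected goals per 90.
shot_rate = np.array([0.8, 1.4, 2.1, 2.6, 3.2, 3.9])
xg_p90 = np.array([0.10, 0.18, 0.24, 0.35, 0.38, 0.51])

r = np.corrcoef(shot_rate, xg_p90)[0, 1]  # Pearson correlation
r_squared = r ** 2                        # share of variance explained

print(f"correlation r = {r:.2f}, R-squared = {r_squared:.2f}")
```

Swapping the labels doesn't change the takeaway, but the two numbers aren't interchangeable, since R-squared is always the smaller of the pair.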
These findings and musings are the reason I always return to the tape in the first place. Seeing a team in action and observing a distinct style informs my use of data. I love a good chart as much as the next guy, and I'll never stop rolling out data-driven insights, but it's all worthless without context.