First, a few sentences about how I see the role of statistics in scouting. They're an imperfect source of information among many other imperfect sources of information...they're useful enough to merit consideration, and if you're going to consider them, you might as well do so systematically (i.e. "advanced" statistics, actually finding useful correlations between NCAA stats and NBA performance), not haphazardly (e.g. cherrypicking individual statistics to bolster some preconceived notion about a player). To emphasize how imperfect they are, De'Andre Hunter's per-40 defensive stats stand at 6.3 boards, 0.7 steals, 0.7 blocks. No amount of staring at a box score is going to make you realize that he's a valuable defensive player. There's some hope that more advanced player tracking stats can do a better job, which is probably true, but that's a discussion for another day.
Methodology details below:
---
My model attempts to predict NBA adjusted plus/minus (APM). The major benefit of predicting APM, instead of an NBA box score metric, is that my model doesn't inherit any biases at this stage (e.g. a model that tries to predict PER will necessarily end up with all the same biases/flaws of PER). The major drawback is that adjusted plus/minus is a very noisy stat, and the price I pay for trying to predict a noisy stat is relatively high uncertainty in my model coefficients (e.g. the marginal value of an NCAA assist or rebound). Ultimately, it's a good tradeoff, because in most cases the extra uncertainty in my predictions due to uncertainty in model coefficients is small relative to other sources of uncertainty.
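The basic setup can be sketched as an ordinary least-squares regression of (noisy) APM on NCAA stats. Everything below is synthetic and illustrative — the stat columns, coefficient values, and sample size are my assumptions, not the author's actual data or fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: one row per past prospect, columns are
# NCAA stats (e.g. points, rebounds, assists). Values are made up.
n_prospects = 200
X = rng.normal(size=(n_prospects, 3))
true_coefs = np.array([1.5, 0.8, 0.4])           # made-up "marginal values"
noise = rng.normal(scale=3.0, size=n_prospects)  # APM is a very noisy target
apm = X @ true_coefs + noise

# Ordinary least squares: estimate the marginal value of each NCAA stat.
X1 = np.column_stack([np.ones(n_prospects), X])  # prepend an intercept column
coefs, *_ = np.linalg.lstsq(X1, apm, rcond=None)

print(coefs.round(2))  # intercept followed by the three stat coefficients
```

Because the target is so noisy, the recovered coefficients come with sizable standard errors even when the model form is exactly right — which is the tradeoff described above.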
My model assumes that there are no interaction terms between inputs. That means the value of an NCAA player's assist, according to my model, does not depend at all on how many rebounds he grabs or how many points he scores.
It turns out that this assumption is absolutely crucial. Without it, a model is extremely vulnerable to a problem called "overfitting" in which it's basically tricked into thinking some artifact of statistical noise that affected a few prospects in the past is a fundamental rule that applies to all prospects in the future. A model suffering from overfitting generally does a good job explaining outcomes for past prospects but produces wonky and inaccurate predictions for future prospects.
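Here's a toy demonstration of that failure mode, with entirely synthetic data (sample sizes and noise levels are my choices, not the author's). The true relationship is purely additive; adding every pairwise interaction lets the model chase noise in a small training sample, and its predictions on fresh data get worse:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

n_train, n_test, n_stats = 30, 1000, 7
true_coefs = rng.normal(size=n_stats)  # the truth is purely additive

def make(n):
    """Generate synthetic prospects: stats plus a noisy additive outcome."""
    X = rng.normal(size=(n, n_stats))
    return X, X @ true_coefs + rng.normal(scale=2.0, size=n)

def features(X, interactions):
    """Design matrix: intercept + stats, optionally + every pairwise product."""
    cols = [np.ones(len(X))] + list(X.T)
    if interactions:  # e.g. assists * rebounds, assists * points, ...
        cols += [X[:, i] * X[:, j] for i, j in combinations(range(n_stats), 2)]
    return np.column_stack(cols)

X_tr, y_tr = make(n_train)
X_te, y_te = make(n_test)

test_rmse = {}
for interactions in (False, True):
    A_tr, A_te = features(X_tr, interactions), features(X_te, interactions)
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    test_rmse[interactions] = np.sqrt(np.mean((A_te @ w - y_te) ** 2))
    print(f"interactions={interactions}: test RMSE {test_rmse[interactions]:.2f}")
```

With interactions, the model has nearly as many parameters as training prospects, so it fits the training sample almost perfectly while its out-of-sample error balloons — the "wonky and inaccurate predictions for future prospects" described above.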
The last important thing my model does is estimate uncertainties in its predictions, something notably lacking from most such models people have published. This illuminates some of the strengths and weaknesses of my model. For example, the largest contributor to uncertainty is made two-pointers; that is, my model is generally less accurate at predicting players who make a lot of two-pointers. This makes some sense: a player's two-pointers made per game (even taken together with two-point percentage, or equivalently two-pointers missed) falls far short of describing how good a scorer he really is inside the arc. There's just not enough information in this part of the box score to properly evaluate a player.
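One standard way to get prediction uncertainty from a least-squares fit is to propagate the coefficient covariance matrix through to each new prospect's feature vector. This is a textbook technique, not necessarily the author's exact method, and all numbers below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical OLS fit of noisy APM on a few NCAA stats (plus intercept).
n, p = 150, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.0, 1.2, 0.6, -0.3]) + rng.normal(scale=3.0, size=n)

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coefs
sigma2 = resid @ resid / (n - p)        # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)   # coefficient covariance matrix

# For a new prospect x, the coefficient-driven prediction variance is
# x' Cov x.  Stats whose coefficients are poorly pinned down (like made
# two-pointers in the model described above) dominate this term for
# players with extreme values of those stats.
x_new = np.array([1.0, 0.5, -1.0, 2.0])
pred = x_new @ coefs
pred_se = np.sqrt(x_new @ cov @ x_new)
print(f"prediction {pred:.2f} +/- {pred_se:.2f} (coefficient uncertainty only)")
```

Note this captures only the uncertainty from the coefficients; irreducible noise in APM itself would add a further (larger) term to any full prediction interval.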
Some more minor things:
-All stats are per possession. I also include height and minutes per game.
-I assume a quadratic aging curve. I found that on offense, better prospects actually follow a steeper aging curve than worse prospects, and I accounted for this as well.
-My sample only goes through the 2012 draft. This limits my sample size, and it also means my model is really tuned to predict how players entering the NBA a decade ago would be expected to perform. Obviously the NBA has changed since then, and I have no way of adjusting for that.
-My sample only includes prospects who went on to play significant NBA minutes, so it suffers from "survivor bias" and therefore tends to be slightly too optimistic in its projections. How to correct for this is an interesting question in its own right that I won't get into for now (but could talk more about if you're interested).
-My model has some interesting artifacts because of the relatively large uncertainties in the coefficients I mentioned earlier. For instance, made two-pointers have a slight (not statistically significant) negative value. In reality, they probably have (at least) a slight positive value. I could manually correct things like this to make my model slightly better, but that's obviously a slippery slope toward tweaking and tuning my model in retrospect to make it look the way I think it "should." So I decided to just let it be, even in cases where the helpful tweak is obvious.
-My model doesn't account for strength of schedule or team strength.
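To make the quadratic aging curve from the list above concrete: including both age and age² terms lets projected value rise, peak, and decline rather than growing forever. The coefficients below are invented for illustration, and scaling the age terms by a talent proxy is my guess at one way to encode "better prospects follow a steeper aging curve," not necessarily the author's parameterization:

```python
import numpy as np

# Illustrative, made-up quadratic aging coefficients: value rises with
# age at a decreasing rate, so the curve has a single peak.
b_age, b_age2 = 2.0, -0.037

ages = np.arange(19, 36)
peak_age = -b_age / (2 * b_age2)  # vertex of the parabola

# Steeper curve for better prospects: scale the age terms by a talent
# proxy (an assumed parameterization, for illustration only).
for talent in (0.8, 1.2):
    curve = talent * (b_age * ages + b_age2 * ages ** 2)
    print(f"talent={talent}: peaks near age {peak_age:.1f}, "
          f"values {curve.min():.1f} to {curve.max():.1f}")
```

Whatever the exact coefficients, the key property is that the age² term is negative, so every prospect's projected trajectory eventually bends back down.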