What makes NBA players money?
SQL and Python
What NBA stats are the most influential in increasing player salary?
Overview
It is common to hear sports commentators say that the NBA has changed and that teams do not even try to play defense. Although that may be hyperbole, the statistics seem to agree with the industry professionals. I analyzed NBA data to determine which stats were the most influential in increasing player salary.
Part 1:
After evaluating data from the 2022-2023 season, I determined that points and assists are the most reliable predictors of salary, with points influencing salary about 25% more than assists.
In this model, a player theoretically starts with $955,400, and gets $12,180 for every point and $9,634 for every assist. The model cannot accurately predict salaries (its R-squared value is 0.43), but the p-values for points and assists were statistically significant in every combination of variables tested. At this point I could only conclude that points and assists are the biggest factors in player salaries.
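The Part 1 model can be written out as a small function. This is only a sketch of the arithmetic using the three coefficients reported above; the function name and the example stat line are made up for illustration, and the write-up does not state whether points and assists are season totals or per-game figures.

```python
# Coefficients reported in Part 1 (2022-2023 season).
INTERCEPT = 955_400   # theoretical starting salary in dollars
PER_POINT = 12_180    # dollars added per point
PER_ASSIST = 9_634    # dollars added per assist


def predicted_salary(points: float, assists: float) -> float:
    """Return the linear model's salary estimate for a given stat line."""
    return INTERCEPT + PER_POINT * points + PER_ASSIST * assists


# Hypothetical stat line, for illustration only.
example = predicted_salary(points=1000, assists=300)
print(f"Predicted salary: ${example:,.0f}")

# Relative weight of one point versus one assist in this model.
ratio = PER_POINT / PER_ASSIST
print(f"Point-to-assist coefficient ratio: {ratio:.2f}")
```

With an R-squared of 0.43, any single estimate from this function should be read as a rough tendency, not a forecast.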
Part 2:
In pursuit of a more accurate model, I analyzed the "advanced stats" of NBA players from 2020-2023. These statistics include a wider variety of metrics for evaluating players. Using PER (Player Efficiency Rating), VORP (Value Over Replacement Player), Win Shares, and Usage %, I was able to create a more accurate model with an R-squared value of 0.7.
This model may be more accurate, but each of these stats is derived from multiple calculations on other individual and team stats, so the model cannot tell a player what specifically to improve in order to earn more money. It will be beneficial to gather more data and to break players into position groups.
Gathering Data
Part 1:
Data was taken from basketball-reference.com. A table containing player stats from 2022-2023 and another table containing salaries from the same year were loaded into MySQL.
Some of the players were traded mid season, which created multiple rows of data for them (a row for each team, plus a row containing the totals of all stats). The team-specific rows were dropped, leaving only rows that contained a single player's complete stats. Some null values were replaced with 0, such as "three point percentage" for players who did not attempt any three point shots.
After combining the stats and salary tables and cleaning the data, the table was exported as a CSV file and loaded into Python.
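The cleaning rule described above can be sketched in plain Python. The rows, player ids, team codes, and the column name `c_3p_pct` below are invented for illustration; the actual cleaning in this project was done in MySQL (see Code Part 1), so treat this as an equivalent sketch and not the original implementation.

```python
# Hypothetical rows mimicking the stats table; ids and numbers are made up.
rows = [
    {"id": "playera01", "tm": "TOT", "pts": 900, "c_3p_pct": 0.35},
    {"id": "playera01", "tm": "AAA", "pts": 500, "c_3p_pct": 0.36},
    {"id": "playera01", "tm": "BBB", "pts": 400, "c_3p_pct": 0.34},
    {"id": "playerb01", "tm": "CCC", "pts": 120, "c_3p_pct": None},
]


def keep_complete_rows(rows):
    """Keep one row per player: the 'TOT' row if traded, else the only row."""
    traded_ids = {r["id"] for r in rows if r["tm"] == "TOT"}
    return [r for r in rows if r["tm"] == "TOT" or r["id"] not in traded_ids]


def fill_nulls(rows, value=0):
    """Replace missing values (None) with a default, e.g. empty percentages."""
    return [{k: (value if v is None else v) for k, v in r.items()} for r in rows]


cleaned = fill_nulls(keep_complete_rows(rows))
for row in cleaned:
    print(row)
```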
Part 2:
I gathered data on "advanced stats" from basketball-reference.com for 2020-2023. I used MySQL to average the stats from the 3 seasons and to remove any players who did not play in all 3. Basketball contracts can span several years, so it was important to have data from more than one season.
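The averaging step can be illustrated with a short Python sketch. The Part 2 SQL is not shown in this write-up, so the record layout, season labels, and values here are assumptions chosen only to demonstrate the rule: average each player's stat across seasons and drop anyone without all 3 seasons.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (player id, season, PER) records; values are made up.
records = [
    ("playera01", "2020-21", 18.0),
    ("playera01", "2021-22", 20.0),
    ("playera01", "2022-23", 22.0),
    ("playerb01", "2021-22", 12.0),
    ("playerb01", "2022-23", 14.0),
]


def average_full_span(records, seasons_required=3):
    """Average a stat per player, keeping only players with every season."""
    by_player = defaultdict(dict)
    for player_id, season, value in records:
        by_player[player_id][season] = value
    return {
        player_id: mean(values.values())
        for player_id, values in by_player.items()
        if len(values) == seasons_required
    }


averages = average_full_span(records)
print(averages)  # playerb01 is dropped because only 2 seasons are present
```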
Choosing Which Data to Test
Before running the regression analysis, I had to decide which stats to include. I did not want to include redundant stats (for example, using total rebounds alongside offensive rebounds and defensive rebounds).
Kept:
Points - Scoring points is the ultimate goal of the team.
Assists - Assists are valuable because they show that the player can get the ball to teammates who are in an excellent position to score. Skilled players are able to get more assists by disrupting the defense and forcing opposing players to move towards them, leaving teammates open.
Total Rebounds - Rebounds allow a team to retain possession and take another chance at scoring.
Blocks - Blocks prevent the opposing team from scoring, which is very beneficial when skilled players are able to shoot the ball efficiently.
Turnovers - Turnovers remove the chance for the player's team to score, and give the opportunity to the other team.
Steals - Steals remove the opposing team's chance to score, and give the player's team another opportunity.
Minutes Played - Minutes played is the most accurate stat for tracking a player's time contributing to the team.
Free Throws Attempted - FTA was kept because good players are more likely to draw fouls and be sent to attempt free throws.
Effective Field Goal % - EFG% was kept as a singular measurement of shooting accuracy.
Personal Fouls - Personal Fouls can give opposing teams easy attempts at points.
Dropped:
'Games' and 'Games Started' were dropped. All three playing-time stats heavily influenced salary, but 'Minutes Played' was kept because it measures playing time the most accurately.
Field goals, field goals attempted, FG%, 3P, 3P attempted, 3P%, 2P, 2P attempted, and 2P% were all dropped. I decided that points and effective field goal percentage measure scoring and accuracy using the fewest columns.
Offensive and defensive rebounds were removed because I was using the total rebounds data.
Part 2:
Each individual stat was tested in a linear regression. Variables with strong linear correlations were then put into the model, and redundancies were removed (for example, I did not want both win shares and win shares per 48 minutes).
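Screening each stat on its own can be sketched as follows. For a one-variable least-squares fit, R-squared equals the squared correlation between the stat and salary, which is what this function computes. The data, the stat names used as keys, and the helper names are hypothetical; the write-up does not say which Python library was used for the actual regressions.

```python
def r_squared(xs, ys):
    """R-squared of a one-variable least-squares fit of ys on xs.

    Assumes neither xs nor ys is constant (otherwise this divides by zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy ** 2) / (sxx * syy)


# Made-up values for five hypothetical players (salary in millions).
salary = [1.0, 2.0, 3.0, 4.0, 5.0]
stats = {
    "vorp": [0.1, 0.9, 2.1, 2.9, 4.2],
    "age": [30, 22, 27, 35, 24],
}

# Rank candidate stats by how well each one alone explains salary.
ranked = sorted(
    ((name, r_squared(values, salary)) for name, values in stats.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: R-squared = {score:.3f}")
```

Stats that score well individually can still be redundant with each other, which is why the redundancy check remains a separate, manual step.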
Results
After testing multiple sets of variables, it was clear that points, assists, and playing time (games, games started, and minutes played) were the strongest and most reliable predictors of salary.
Playing time was not included in the final model. Although it plays a significant role mathematically, it does not directly determine a player's salary: if a player played horribly every minute of every game, that would not warrant a high salary.
I was surprised that rebounds, steals, and blocks did not play a more significant role. However, rebounds and blocks are common for Centers and Power Forwards, yet uncommon for Point Guards and Shooting Guards.
Part 2:
Using PER (Player Efficiency Rating), VORP (Value Over Replacement Player), Win Shares, and Usage % created a strong model. PER and VORP measure individual performance, while Win Shares and Usage % capture a player's efficiency with teammates.
Moving Forward
Part 1:
A more accurate model could be built by manipulating and separating the data.
Different positions should be judged differently. A more accurate model would evaluate a Center's stats differently than a Point Guard's.
Stats could be divided by minutes played. Players who do not get as much playing time do not get as many opportunities to increase their stats as players who play the entire game.
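Dividing a stat by minutes played might look like the sketch below. Scaling the result to 36 minutes is a common basketball convention and is an assumption here, as are the function name and the example numbers.

```python
def per_minutes(stat_total: float, minutes_played: float, per: float = 36) -> float:
    """Scale a counting stat to a fixed number of minutes played."""
    if minutes_played <= 0:
        # A player with no minutes has no meaningful rate.
        return 0.0
    return stat_total / minutes_played * per


# Hypothetical bench player and starter with very different raw totals.
bench = per_minutes(stat_total=300, minutes_played=600)
starter = per_minutes(stat_total=1500, minutes_played=3000)
print(f"Bench player: {bench:.1f} points per 36 minutes")
print(f"Starter: {starter:.1f} points per 36 minutes")
```

The raw totals differ by a factor of five, yet both players produce at the same rate once playing time is accounted for.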
Part 2:
Stats proved far more accurate when divided by time.
Splitting the data by position and emphasizing offensive and point-scoring stats could produce accurate models that show which specific stats a player can improve to increase salary.
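A first step toward position-specific models is grouping players by position so that each group can be fitted separately. This sketch assumes a `pos` column holding abbreviations such as `C` and `PG`; the player rows and the `salary` key are invented for illustration.

```python
from collections import defaultdict

# Hypothetical players; ids and numbers are made up.
players = [
    {"id": "playera01", "pos": "C", "trb": 800, "salary": 20_000_000},
    {"id": "playerb01", "pos": "PG", "trb": 250, "salary": 18_000_000},
    {"id": "playerc01", "pos": "C", "trb": 600, "salary": 9_000_000},
]


def split_by_position(players):
    """Group player rows by position so each group can be modeled separately."""
    groups = defaultdict(list)
    for player in players:
        groups[player["pos"]].append(player)
    return dict(groups)


groups = split_by_position(players)
for pos, members in groups.items():
    print(pos, [m["id"] for m in members])
```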
Code Part 1
# Players who were traded mid season had 3+ rows of data
# A 'totals' row, and then a row for each team played on
# I removed all duplicates and kept the totals rows
# I fixed null values (they were fractions that had 0's in the numerator or denominator)
# I joined the stats tables and salary tables together
# I dropped any rows that had null values in the salary column
SELECT `player_additional`, COUNT(`player_additional`) FROM noname23
group by `player_additional`
;
# Keep only the season-total ('tot') rows for players who were traded mid season
create table traded as
SELECT * FROM noname23
where `tm` = 'tot';
SELECT * FROM traded;
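create table test2 as
SELECT * from traded;
# Assumption: the script never shows how test was created;
# a working copy of noname23 is assumed here
create table test as
SELECT * from noname23;
# Remove every row (totals and per-team) belonging to a traded player
delete t1
from test as t1
inner join test2 as t2
on t1.player_additional = t2.player_additional;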
alter table test rename column player_additional to id;
select * from test
limit 5;
# A primary key cannot contain nulls, so remove rows with a null id first
delete from test
where id is null;
alter table test
add primary key(id);
alter table traded rename column player_additional to id;
alter table traded
add primary key(id);
Create table test3 as
SELECT * from test
UNION ALL
select * from traded;
select * from test3;
CREATE TABLE clean as
SELECT id,
pos,
age,
tm as team,
pts,
ast,
trb,
blk,
tov,
stl,
g,
gs,
mp,
fg,
fga,
fg_1 as fg_pct,
column_3p as c_3p,
column_3pa as c_3pa,
column_3p_1 as c_3pa_pct,
column_2p as c_2p,
column_2pa as c_2pa,
column_2p_1 as c_2p_pct,
efg,
ft,
fta,
ft_1 as ft_pct,
orb,
drb,
pf
FROM test3;
select * from clean;
update clean set fg_pct = 0 where fg_pct is null;
update clean set c_3pa_pct = 0 where c_3pa_pct is null;
update clean set c_2p_pct = 0 where c_2p_pct is null;
update clean set efg = 0 where efg is null;
update clean set ft_pct = 0 where ft_pct is null;
select * from clean;
select * from name23
limit 5;
create table names23 as
select player_additional as id,
player as player_name
from name23;
create table testname as
select distinct id,
player_name
from names23;
alter table testname
add primary key (id);
create table cleanname as
select * from clean
left join testname
using (id);
select * from cleanname
limit 100;
select * from salary23;
alter table salary23
add primary key(id);
select id, count(id)
from salary23
group by id;
select * from salary23;
drop table name23;
drop table names23;
drop table test;
drop table test2;
drop table test3;
drop table testname;
drop table traded;
drop table noname23;
select * from cleanname;
select * from salary23;
create table cleansalary23 as
select distinct id,
yr22_23
from salary23;
alter table cleansalary23
add primary key (id);
select * from cleansalary23;
delete from cleansalary23
where yr22_23 is null;
create table final as
SELECT * from cleanname
left join cleansalary23
using (id);
select * from final;
DELETE FROM final
WHERE
yr22_23 IS NULL;
select * from final;