Recently, professional sports associations and teams have made big strides towards leveraging data to inform both personnel and on-the-field decision making. While the four major leagues (NBA, NFL, MLB, NHL) vary in terms of where they are in that process, most people would argue that the NBA is at the forefront of this movement. If you have never heard of SportVu, it is a company that has partnered with the NBA to “utilize a six-camera system installed in basketball arenas to track the real-time positions of players and the ball 25 times per second. Utilizing this tracking data, SportVu is able to create a wealth of innovative statistics based on speed, distance, player separation and ball possession.” The release of aggregated SportVu data has offered brand new insights into how the game of basketball is played and, more importantly, into how each individual plays the game.
In this post, I look at the SportVu data available for all NBA players active during the 2014-2015 season. More specifically, I am interested in finding out whether SportVu data can be leveraged to discover players with similar playing styles, and also teams with similar rosters. To begin, I wrote a quick Python script to scrape SportVu data from http://www.stats.com/sportvu/sportvu-basketball-media/
Collecting and scraping the data
There is nothing extraordinary in the scraping script: I essentially pulled the data from a publicly (I hope) available source and, after a small clean-up, concatenated everything into a single pandas DataFrame. The only thing of note is that I restricted the data collection to players who averaged at least 15 minutes per game and played in at least half the games in the season. In total, this leaves us with 329 NBA players, each with 80 unique SportVu data points.
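As a rough illustration of the filtering step only, here is a minimal pandas sketch. The column names (`player`, `games_played`, `minutes_per_game`) and the toy rows are hypothetical, and "half the games" is taken to mean half of an 82-game regular season; the scraped tables may well be laid out differently.

```python
import pandas as pd

SEASON_GAMES = 82  # NBA regular-season games per team


def filter_players(df, min_minutes=15.0, min_games=SEASON_GAMES / 2):
    """Keep players with enough minutes per game and enough games played."""
    mask = (df["minutes_per_game"] >= min_minutes) & (df["games_played"] >= min_games)
    return df.loc[mask].reset_index(drop=True)


# Toy data: hypothetical column names and made-up values
players = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "games_played": [82, 30, 60, 41],
    "minutes_per_game": [34.5, 28.0, 9.8, 15.0],
})

kept = filter_players(players)
print(kept["player"].tolist())  # ['A', 'D']
```

Player B is dropped for playing too few games and player C for averaging too few minutes; both thresholds are inclusive in this sketch.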
Inferring and visualizing similarities between NBA players
With the data now in our hands (or RAM), we can proceed to the original intent of this blog post: finding the players with the most similar playing styles. After some trial and error, I obtained the best results by computing the correlation matrix between SportVu metrics for all players and then applying the t-SNE dimensionality reduction algorithm. Roughly, t-SNE is useful because it tends to preserve the local neighborhood structure of the data, so that neighboring (i.e. similar) players are mapped to neighboring locations in a two-dimensional space (it is this property that makes it so amenable to image analysis). Other well-known techniques such as k-means (for clustering) or MDS (for dimensionality reduction) would also be adequate for this exercise, but I’ve had good fortune with t-SNE, so I am, perhaps unwisely, sticking to it here.
The advantage of using t-SNE in this context is that we are effectively taking an unsupervised approach, in the hope that we can infer natural groupings of players based on their SportVu statistics. Now that the data has been processed, we can start to visualize it. From here on, I will do some nasty context switching and use R. (I love both R and Python but hate using both in a single project. However, I justify my decision by the fact that, despite recent progress for Python, R currently still has far better wrappers around JS/D3.)
The plot below shows the natural groupings of players, where the shape represents the cluster they belong to and the color represents their team. Feel free to zoom, highlight certain teams or clusters (by hovering over the legends) and generally just play around with it.
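To make the correlation-then-t-SNE idea concrete, here is a minimal sketch using NumPy and scikit-learn. It is not the original script. It assumes the correlation matrix is computed player by player (each player's 80 metrics correlated against every other player's), that metrics are standardized first, and that each player's row of correlations is fed to t-SNE as features. The data is random and the perplexity value is arbitrary.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for the real table: 60 players x 80 metrics (random values)
n_players, n_metrics = 60, 80
metrics = rng.normal(size=(n_players, n_metrics))

# Standardize each metric so no single stat dominates because of its units
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)

# Player-by-player correlation matrix: entry (i, j) is the correlation
# between the metric profiles of players i and j
corr = np.corrcoef(z)

# Embed each player's row of correlations into two dimensions
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne.fit_transform(corr)

print(embedding.shape)  # (60, 2)
```

Each row of `embedding` is a player's position in the two-dimensional map. With real data, the perplexity would need some tuning, which is presumably part of the trial and error mentioned above.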
NBA player similarity
On inspection, we can see that this approach makes a lot of sense. For example, players such as Damian Lillard, Mario Chalmers, Eric Bledsoe or Derrick Rose are very close to each other in space. There are many other examples like this (Serge Ibaka and LaMarcus Aldridge; Jimmy Butler and Andre Iguodala), but it is interesting to note how the shooting guard and point guard positions occupy well-defined regions, whereas the center, power forward and small forward positions show a lot more heterogeneity and complexity. There are some mis-assignments here and there, but these tend to be on the boundary of clusters and could probably be fixed with some further optimization and tinkering of the cluster assignments.
Team roster similarity
We can also leverage the results of the dimensionality reduction to discover teams that share the most similar rosters. Given two teams X and Y with players [x1,...,xn] and [y1,...,ym] respectively, one way of achieving this is through the following steps:
1. Select a player from team X (say x1)
2. Compute the point-to-point distance between player x1 and all players [y1,…,ym] in team Y
3. Select and record the minimum of these distances. We are effectively finding the player in team Y that is most similar to player x1
4. Repeat steps 1 to 3 for all remaining players in team X and sum the recorded minimum distances to get a “distance” value between teams X and Y
By summing distances between the most similar pairs of players across the two teams, we can then assume that pairs of teams with a low total distance between one another have more similar rosters than pairs of teams with a high total distance.
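The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the original code: it assumes each team is given as an array of two-dimensional player coordinates (for example, taken from the t-SNE map) and that "point-to-point distance" means Euclidean distance. The function name `roster_distance` and the toy coordinates are made up.

```python
import numpy as np


def roster_distance(team_x, team_y):
    """Sum, over players in team_x, of the distance to the closest player in team_y.

    Both arguments are arrays of shape (n_players, 2) holding 2-D coordinates.
    """
    team_x = np.asarray(team_x, dtype=float)
    team_y = np.asarray(team_y, dtype=float)
    # pairwise[i, j] = Euclidean distance between player i of X and player j of Y
    diffs = team_x[:, None, :] - team_y[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=-1))
    # Nearest player in Y for each player in X, then the sum over all of X
    return pairwise.min(axis=1).sum()


# Toy coordinates (made up): team X has 2 players, team Y has 3
team_x = np.array([[0.0, 0.0], [3.0, 4.0]])
team_y = np.array([[0.0, 1.0], [3.0, 0.0], [10.0, 10.0]])

print(roster_distance(team_x, team_y))  # 5.0
print(roster_distance(team_y, team_x))
```

Note that the measure as described is not symmetric: the distance from X to Y generally differs from the distance from Y to X, as the two printed values show. If a symmetric score is needed, one option (an assumption on my part, not something stated in the steps above) is to average the two directions.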