PlayerUnknown’s Battlegrounds is currently one of the most popular Battle Royale-style shooters, with millions of players worldwide. This popularity also means that there is a huge amount of data available. The data used in the Kaggle kernel competition ‘PUBG Finish Placement Prediction’ consisted of match statistics taken from the official PUBG API. The goal of the competition was to predict the label winPlacePerc, that is, the percentage rank a player or team achieved in a match. The features were basic game stats such as kills, heals, walkDistance, damageDealt, and so on.
Since the percentage rank is defined for an individual match, my idea was to train a neural net incrementally, one match at a time, and then make the predictions in the same way. The scikit-learn framework provides a multi-layer perceptron (MLP) regressor, sklearn.neural_network.MLPRegressor, which can be trained incrementally by calling the method partial_fit(x_train, y_train).
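As a rough sketch of this setup (the CSV path, the network size, and the short feature list are placeholders, not the kernel’s actual configuration), the incremental training loop could look like this:

```python
import pandas as pd
from sklearn.neural_network import MLPRegressor

# Placeholder data loading: the file name and the feature list are assumptions.
train_df = pd.read_csv("train.csv")
feature_cols = ["kills", "heals", "walkDistance", "damageDealt"]

model = MLPRegressor(hidden_layer_sizes=(64, 32), random_state=0)

# One partial_fit call per match, mirroring the "one match at a time" idea.
for match_id, match in train_df.groupby("matchId"):
    x_train = match[feature_cols].to_numpy()
    y_train = match["winPlacePerc"].to_numpy()
    model.partial_fit(x_train, y_train)
```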
To feed the data to the model, a few modifications were in order. For one, the data was divided into different match types, defined by the feature matchType. The available match types were solo, duo, team, and crash, where the first three correspond to the size of the teams in a match (one, two, or four players) and the last is a special event mode in which players do not use firearms, but instead use melee weapons and vehicles to kill each other. Of course, not all features were equally important in all match types; e.g. the feature revives had little impact on winPlacePerc in solo matches. The respective features were then removed from the data set of that specific match type.
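A minimal sketch of this per-match-type split might look as follows; apart from revives in solo matches, the drop lists and the exact matchType labels are illustrative assumptions rather than the kernel’s actual choices:

```python
# Features to drop per match type; only 'revives' for solo is taken from the text,
# the rest are hypothetical examples.
drop_per_type = {
    "solo": ["revives", "DBNOs"],      # team-related stats carry no signal in solo
    "duo": [],
    "squad": [],
    "crash": ["weaponsAcquired"],      # hypothetical example for the melee-only mode
}

data_by_type = {}
for match_type, group in train_df.groupby("matchType"):
    cols = [c for c in drop_per_type.get(match_type, []) if c in group.columns]
    data_by_type[match_type] = group.drop(columns=cols)
```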
Once the data was sorted according to match type, the features needed to be transformed into a form the MLP regressor can handle. In a first step, I reduced the number of features. The data set contains many numerical features, so my idea was to use a PCA analysis to get the main components and only fit on those. However, the partial_fit method of the MLP regressor has a bug that prevented me from reducing the number of components this way. Therefore I chose to manually remove, in each match type, the features that had the least correlation with winPlacePerc and use PCA on the remaining ones. Finally, since I used a neural net, the data needed to be scaled. For this I chose the scikit-learn MinMaxScaler.
In the picture you can see a sample of the correlation matrix for the solo matches. As you can see, the feature roadKills has only a tiny correlation with winPlacePerc and is therefore removed. Large correlations, and therefore large impact on winPlacePerc, came from the features boosts, killPlace and walkDistance. At first glance, walk distance does not seem very important for winning. However, once you realize that walk distance is also correlated with the time a player stayed in a match, it’s clear that players with large walk distances survived longer and were therefore placed higher in the final score.
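Putting the preprocessing together for one match type could look roughly like this; the correlation threshold of 0.1 and the number of PCA components are made-up values, not the settings used in the kernel:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Sketch for one match type (here 'solo', reusing the split from above).
solo_df = data_by_type["solo"]

# Keep only features whose absolute correlation with the label exceeds a threshold.
corr = solo_df.corr(numeric_only=True)["winPlacePerc"].abs()
keep = corr[corr > 0.1].index.drop("winPlacePerc")

# Scale to [0, 1] for the neural net, then reduce dimensionality with PCA.
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(solo_df[keep])

pca = PCA(n_components=10)
x_reduced = pca.fit_transform(x_scaled)
y = solo_df["winPlacePerc"].to_numpy()
```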
In the last step before the training, I filtered out all matches that had a total of only one or two players. The reason is that the outcome of these matches can be predicted ‘manually’. Matches with only one player automatically assign that player a winPlacePerc of 0.0. For matches with a total of two players, the player with the lower killPlace had a winPlacePerc of 1.0 in 99% of the training samples.
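A sketch of these manual predictions, using the competition’s Id, matchId and killPlace columns and treating the 99% heuristic as an exact rule, might look like this:

```python
import pandas as pd

# Placeholder test set; the file name is an assumption.
test_df = pd.read_csv("test.csv")

# Count how many players each match contains.
group_sizes = test_df.groupby("matchId")["Id"].transform("count")

# Single-player matches: winPlacePerc is 0.0 by definition.
test_df.loc[group_sizes == 1, "winPlacePerc"] = 0.0

# Two-player matches: predict 1.0 for the player with the lower killPlace.
two_player = test_df[group_sizes == 2]
best_kill_place = two_player.groupby("matchId")["killPlace"].transform("min")
test_df.loc[two_player.index, "winPlacePerc"] = (
    two_player["killPlace"] == best_kill_place
).astype(float)
```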
The remaining matches were then fed into the neural net. To evaluate the performance, I used a k-fold-style cross-validation on random subsets of the training data. The resulting evaluation mean absolute error was 0.06078. Compared to the Kernel’s Public Score of 0.05489, the model did a bit better on the test data than it did on the training data.
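The evaluation could be sketched as follows, reusing the placeholder train_df and feature_cols from above; splitting by match rather than by row keeps whole matches in either the training or the validation fold. The fold count and network size are again assumptions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

# Split the match IDs (not the individual rows) into folds.
match_ids = train_df["matchId"].unique()
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(match_ids):
    model = MLPRegressor(hidden_layer_sizes=(64, 32), random_state=0)
    # Train incrementally on the matches in the training folds.
    for mid in match_ids[train_idx]:
        match = train_df[train_df["matchId"] == mid]
        model.partial_fit(match[feature_cols], match["winPlacePerc"])
    # Score the held-out matches with the mean absolute error.
    val = train_df[train_df["matchId"].isin(match_ids[val_idx])]
    preds = model.predict(val[feature_cols])
    scores.append(mean_absolute_error(val["winPlacePerc"], preds))

print(np.mean(scores))
```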
Check out my Kaggle Kernel to get the details.