On a technical level, prediction is not very different from the classification through supervised machine learning we did in Lesson 7. On a conceptual level, however, it is a world apart.
When we classify a text, we essentially summarize its content under a label. There is nothing speculative or inter-temporal about a classification. When we predict, however, we use the past to speculate about the future. Since we have an imperfect understanding of what determines future events, prediction is fraught with uncertainties. As a result, we must be extra careful when we interpret and rely on predictive results.
Machine learning algorithms make it easy to generate predictions. Indeed, predicting is easy. Anyone with access to some sample code can predict an outcome. That gives rise to what I call “dumb predictions”: predictions that lack a causal theory and a deeper understanding of the input data.
While predictions are extremely useful in practice, the researchers and lawyers relying on them should make sure that they are “smart predictions”. Smart predictions are rooted in a sound causal theory that connects causes to outcomes as best we can. Smart predictions also require studying the input data to detect missing variables, biases and other limitations that make predictions less reliable.
In short, prediction is easy. Smart predictions are hard.
In this lesson, my main point is to show you how easy it is to predict. We will do what is common in practice: use an existing dataset and try different machine learning algorithms to see which one performs best. Keep in mind that the deeper challenge lies not in predicting, but in predicting well.
# Load voting data.
setwd("~/Google Drive/Teaching/Canada/Legal Data Science/English Course/Sample judgments")
voting <- read.csv("WJBrennan_voting.csv", header = TRUE)
Let's go back in time to 1980. Judge Brennan had served on the Supreme Court since 1956 and would remain on the bench until 1990. We will try to predict his votes in the 1980s based on his pre-1980 voting record.
In the dataset, note that the voting code works as follows:
[1] means that he voted with the majority.
[2] means that he dissented.
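To see how these codes appear in the data, we can take a quick look at the dataset before splitting it (a quick check using the vote column that this lesson relies on throughout):
# Inspect the first rows and the distribution of voting codes.
head(voting)
table(voting$vote)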
We will use three different machine learning algorithms on the same data to predict his voting choices and see which algorithm performs best. We start again with Naive Bayes.
# Load packages.
library(e1071)
We begin by dividing our dataset into two parts: one pre-1980 and one post-1980. We use the pre-1980 data to train our model and the post-1980 data to test its predictions.
# Creating training and test set.
voting_pre1980 <- voting[c(1:3368),c(2:10)]
voting_post1980 <- voting[c(3369:4746),c(2:10)]
We first train our model on the training data.
model <- naiveBayes(voting_pre1980[,-9], as.factor(voting_pre1980[,9]))
Next, we predict out of sample based on our test data.
prediction <- predict(model, newdata = voting_post1980[,-9])
prediction_Bayes <- as.data.frame(prediction)
Finally, we compare the actual votes to our predictions.
prediction_Bayes <- cbind(voting_post1980$vote,prediction_Bayes)
colnames(prediction_Bayes) <- c("vote","prediction")
head(prediction_Bayes)
So, how well did our prediction perform? To evaluate its quality, we determine the number of correct predictions.
# We again calculate the number of correct assignments.
hits <- 0
for (row in 1:nrow(prediction_Bayes)) {
  if (prediction_Bayes$vote[row] == prediction_Bayes$prediction[row]) {
    hits <- hits + 1
  }
}
correctness_Bayes <- hits/length(prediction_Bayes$vote)
correctness_Bayes
## [1] 0.6748911
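As an aside, the same share can be computed in a single line with a vectorised comparison, which is equivalent to the loop above:
# One-line alternative to the loop for calculating the share of correct predictions.
mean(prediction_Bayes$vote == prediction_Bayes$prediction)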
In about 67% of cases, our prediction proved correct. This is far from perfect, but better than a 50:50 guess.
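A more demanding benchmark than a coin flip is to always predict the more frequent outcome. We can compute that baseline from the test data (a quick sketch; it gives a useful point of comparison for the correctness scores in this lesson):
# Baseline: share of cases we would get right by always predicting a majority vote (code 1).
sum(voting_post1980$vote == 1) / nrow(voting_post1980)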
To further evaluate the performance of the algorithm, we can take a look at the confusion matrix. If the algorithm had predicted every value correctly, all actual decisions (rows) would match the predicted decisions (columns), and the lower-left and upper-right cells would be 0.
# Compare the results in a confusion matrix
table(prediction_Bayes$vote,prediction_Bayes$prediction)
We see that the algorithm got it wrong both ways. Some dissents were mistakenly predicted as majority votes and some majority votes were mistakenly predicted as dissents.
We now repeat the same exercise with another algorithm: Support Vector Machines (SVM).
Again, we start by training our model on the training data to then predict out of sample.
# Training the model.
model <- svm(voting_pre1980[,-9], as.factor(voting_pre1980[,9]))
# Alternative specification with a polynomial kernel (commented out so it does not overwrite the model
# above; note that the outcome must again be a factor, otherwise svm() performs regression):
# model <- svm(voting_pre1980[,-9], as.factor(voting_pre1980[,9]), kernel = "polynomial", degree = 18, cost = 3)
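As an aside, the e1071 package also provides a tune.svm() helper that searches over such parameters by cross-validation. A minimal sketch, with illustrative candidate cost values that are not taken from the lesson:
# Optional: compare a few cost values by cross-validation (this may take a moment).
tuned <- tune.svm(voting_pre1980[,-9], as.factor(voting_pre1980[,9]), cost = c(1, 3, 10))
summary(tuned)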
# Predicting out of sample.
prediction <- predict(model, voting_post1980[,-9])
prediction_SVM <- as.data.frame(prediction)
Finally, to evaluate the performance of our algorithm, we again compare the actual votes to our predictions and calculate the percentage of accurately predicted results.
prediction_SVM <- cbind(voting_post1980$vote,prediction_SVM)
colnames(prediction_SVM) <- c("vote","prediction")
head(prediction_SVM)
# We again calculate the number of correct assignments.
hits <- 0
for (row in 1:nrow(prediction_SVM)) {
  if (prediction_SVM$vote[row] == prediction_SVM$prediction[row]) {
    hits <- hits + 1
  }
}
correctness_SVM <- hits/length(prediction_SVM$vote)
correctness_SVM
## [1] 0.6669086
With close to 67% correctness, the performance of the SVM is comparable to that of Naive Bayes. But take a look at the confusion matrix!
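# Compare the results in a confusion matrix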
table(prediction_SVM$vote,prediction_SVM$prediction)
You will notice that the SVM predicted ALL voting outcomes as majority votes and NONE as dissents. The reason is that Judge Brennan voted with the majority far more often than he dissented. This creates an imbalance in the data; some machine learning algorithms are sensitive to such an imbalance and end up predicting only the more common category.
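We can check this imbalance directly by tabulating the outcome in the training data (a quick check; the exact shares depend on the dataset at hand):
# Share of majority votes and dissents in the training data.
table(voting_pre1980$vote)
prop.table(table(voting_pre1980$vote))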
Another lesson to learn from this is to always look at the confusion matrix to assess what the algorithm got wrong.
Finally, we repeat the same exercise with one last algorithm: K-Nearest Neighbour (KNN).
Again, we start by training our model on the training data to then predict out of sample.
library(class)
# We first train our model on the training data and apply it to the test data.
model.knn <- knn(voting_pre1980[,-9], voting_post1980[,-9], voting_pre1980[,9], k = 3, prob=TRUE)
We then evaluate the performance of our algorithm by comparing the actual votes to our predictions and calculating the percentage of accurately predicted results.
# We create a dataframe with our prediction.
prediction_KNN <- as.data.frame(model.knn)
# Finally, we compare the actual votes to our predictions.
prediction_KNN <- cbind(voting_post1980$vote,prediction_KNN)
colnames(prediction_KNN) <- c("vote","prediction")
head(prediction_KNN)
# We again calculate the number of correct assignments.
hits <- 0
for (row in 1:nrow(prediction_KNN)) {
  if (prediction_KNN$vote[row] == prediction_KNN$prediction[row]) {
    hits <- hits + 1
  }
}
correctness_KNN <- hits/length(prediction_KNN$vote)
correctness_KNN
## [1] 0.6879536
With about 69% correctness, the K-Nearest Neighbour algorithm performs best so far, although the improvement is modest. Let's also take a look at the confusion matrix.
# Compare the results in a confusion matrix
table(prediction_KNN$vote,prediction_KNN$prediction)
Like the Naive Bayes classifier, the K-Nearest Neighbour algorithm produces balanced predictions. For this particular task, we would thus likely choose it, since it also has the best correctness score.
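To make the comparison explicit, we can collect the three correctness scores in a small table (a minimal sketch using the objects created above):
# Summarise the correctness of all three algorithms.
data.frame(algorithm = c("Naive Bayes", "SVM", "K-Nearest Neighbour"),
           correctness = c(correctness_Bayes, correctness_SVM, correctness_KNN))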
Much of the work in prediction (and supervised machine learning generally) is about achieving the highest possible accuracy by trying different algorithms and specifications. These efforts are useful and important. But correctness scores (and related measures such as recall, precision, F-scores, and the area under the curve) should not become the sole target. In the end, predictions have to make sense. They have to be grounded in reliable theories of how the world works and should be informed by the quality of the data.
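For reference, measures like recall and precision can be read off the same confusion matrix. A minimal sketch for the dissent class (code 2) of the KNN predictions, assuming both categories appear in the table:
# Confusion matrix of the KNN predictions (rows: actual votes, columns: predicted votes).
cm <- table(prediction_KNN$vote, prediction_KNN$prediction)
# Recall: share of actual dissents that were predicted as dissents.
recall_dissent <- cm["2", "2"] / sum(cm["2", ])
# Precision: share of predicted dissents that were actually dissents.
precision_dissent <- cm["2", "2"] / sum(cm[, "2"])
recall_dissent
precision_dissent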
So while researchers and lawyers should eagerly apply the tools introduced in this lecture, they should also think about what they are doing and carefully reflect on the data they use. This will ensure that they make smart predictions rather than dumb ones.
Sample of US Supreme Court Data. [Download]