Machine learning is responsible for the success of recent applications of artificial intelligence (AI) technology and is fuelling a wide range of advances from translation to self-driving cars. Machine learning can also be used in legal data science. Machine learning algorithms look for relationships in data and tend to improve as they process more data. For legal researchers, supervised and unsupervised learning are the two most relevant types of machine learning.
In supervised machine learning, a human “teaches” a computer a task, which it can then perform autonomously. For example, you could read 100 Supreme Court decisions and classify them based on outcomes. Subsequently, you feed the text and the associated outcome to a machine learning algorithm. The algorithm “learns” the relationship between input (text of decision) and output (outcome of decision) and stores this relationship in the form of a model. It can then use this model to autonomously assign an outcome when provided with the text of a decision. To get a sense of how good the model is, you should evaluate it against a “gold standard” of manually classified data (excluding the data used for training). If the results are mostly correct, the model works well. If it produces many falsely assigned cases, you may have to hand-code more cases so that the algorithm can learn the relationship between text and outcome better.
Unsupervised machine learning algorithms autonomously mine a dataset for patterns without prior human guidance. We will be looking at topic modeling as one example of such an algorithm. Say you do not know the content of a corpus in advance, but you guess that there are five topics in the corpus. You could then run a topic modeling algorithm that assumes that there are five topics in the corpus, that different words have different probabilities of appearing within a topic and that documents vary in the proportion of topics they talk about. The algorithm then sets out to “find” these five different topics and returns word lists associated with each topic. By interpreting these word lists you then deduce the content of these five topics. Sometimes unsupervised machine learning algorithms work like magic and reveal patterns that make intuitive sense. Sometimes they find relationships that make no sense at all. Sometimes, it literally comes down to luck – unsupervised machine learning algorithms tend to be probabilistic and, depending on where they start and how they “guess”, their findings can be more or less meaningful.
Machine learning algorithms are highly appealing because they can be used to quickly explore the content of a corpus or to classify texts by subject. However, legal researchers should not blindly trust the results, but instead carefully validate them. Aside from quantitative checks, this includes actually reading and reviewing some of the processed documents to assess whether the results make sense. Moreover, researchers should not expect computers to do tasks that humans cannot. Some corpora may be so diverse that they cannot be meaningfully grouped; others allow for multiple equally valid groupings. Just as two lawyers may validly divide the same collection of judicial decisions differently, computer-generated output should not be taken as “truth”, but as one possible way to group the data.
Unsupervised machine learning: It is often used at early stages of research to explore new datasets, because an unsupervised algorithm, in contrast to a human-trained supervised algorithm, can find patterns that the researcher did not actively look for. Unsupervised methods thus work well even when categories are unknown. In fact, many researchers will be disappointed when they use unsupervised algorithms to look for known patterns. Instead of automatically categorizing treaties by their clauses, for example, a topic model is likely to pick up language typically used by specific states and classify treaties by signatories rather than by their content. Where the categories of interest are known, a rules-based dictionary mapping or a supervised machine learning approach is more suitable.
Supervised machine learning: Whenever the researcher is confronted with repetitive tasks at a high volume, supervised machine learning is a useful tool. This is particularly true when the relationship between input and output data is complex. If the relationship can be broken down into a small set of logical rules, a dictionary-based content mapping method might be more appropriate. It also only makes sense to use machine learning when there is a high volume of data: if you are only classifying a few dozen or a hundred cases, it may be quicker to classify the data by hand.
In this lesson, we talk about two types of machine learning for the purposes of classifying the content of texts. In the next lesson, we use similar algorithms for prediction.
1. Unsupervised Machine Learning
2. Supervised Machine Learning
Before we can embark on any type of machine learning, we have to load and pre-process our text.
Today we work with Federal Court decisions that have been classified into three different issue categories: (1) health, (2) aboriginal and (3) immigration.
# Load text data.
setwd("~/Google Drive/Teaching/Canada/Legal Data Science/English Course/Sample judgments")
cases <- read.csv("Sample Canadian Cases.csv", header = TRUE)
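Before we pre-process anything, it is worth checking that the data loaded as expected. A quick, optional sanity check (this assumes the csv contains the columns case, issue and text used throughout this lesson):
# Inspect the structure of the data frame.
str(cases)
# Check how many cases fall into each manually assigned issue category.
table(cases$issue)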
# Load package for text processing.
library(tm)
Now we can create a corpus.
# Create a corpus from the text.
corpus <- VCorpus(VectorSource(cases$text))
Next we pre-process our text.
# We get rid of variation that we don't consider conceptually meaningful.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
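To verify that the cleaning worked as intended, you can print one of the processed documents and read it. This is optional, but in the spirit of validating every automated step:
# Print the cleaned text of the first document.
writeLines(as.character(corpus[[1]]))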
# Finally, we again create a document-term-matrix.
dtm <- DocumentTermMatrix(corpus, control = list(bounds=list(global = c(2, Inf))))
dtm <- as.matrix(dtm)
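It can also help to look at the dimensions of the resulting matrix; the number of rows should match the number of cases, and the number of columns is the size of the vocabulary that survived the frequency bounds:
# Rows are documents, columns are terms.
dim(dtm)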
We want to know what specific issue areas, or TOPICS, our judgments cover. Topics, here, can mean different things. Think about any text: depending on the level of abstraction, a text can have different "topics".
A judicial decision, depending on the level of abstraction, can be about aboriginal rights, the consideration of aboriginal concerns in specific contexts such as sentencing, or the factual circumstances of the case.
We can use these levels of abstraction purposefully. When we are interested in abstract content, we can choose fewer topics that allow us to classify documents in content groups: this is a decision about aboriginal rights, this is a decision about immigration, etc.
When we are interested in more specific content, we can choose a higher number of topics that allows us to see what issues a document talks about: 60% of case A concerns the facts relating to burglaries, 20% relates to the criminal code and 20% relates to sentencing. Hence we can use topic models purposefully for different content analysis tasks.
Here, we focus on the first task: classification. All our cases have already been classified manually. We now want to know whether a topic modelling algorithm can guess these topics correctly.
We begin by installing and loading the packages topicmodels and plyr.
# Load packages (install them first with install.packages("topicmodels") and install.packages("plyr") if necessary).
library(topicmodels)
library(plyr)
Topic models are unsupervised machine learning algorithms. The only information we have to give to the computer is the number of topics we suspect to be in a set of documents. Based on the distribution of words in our documents, the computer then makes a statistically informed "guess" what these topics are.
The algorithm will not return a label for each topic, but it will return a list of words most associated with each topic. By reading this list of words, we can assign labels to each topic (and check whether the grouping is sensible).
In light of the above considerations, a LOW number of topics helps create sensible categories, while a HIGH number of topics will provide more specific categories.
Let's first try to classify our decisions into 3 baskets.
# So we set the number of topics to 3.
k <- 3
# We can run our topic model on our dtm with 3 topics.
topic_model <- LDA(dtm,k)
We now want to see the top-10 words associated with each topic to assign labels to each of the topics.
# We create a new object terms with the top words for each topic as input.
terms <- terms(topic_model, 10)
terms
Note: Your output may look different, since topic modelling is based on a probabilistic algorithm. But if you run it multiple times, it should produce results that allow you to classify texts in three themes.
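If you want your own runs to be reproducible, you can fix the starting point by passing a seed to LDA() via its control argument when estimating the model (the seed value below is arbitrary):
# A fixed seed makes repeated runs of the probabilistic algorithm return the same topics.
topic_model <- LDA(dtm, k, control = list(seed = 1234))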
In our case, the three topics emerge clearly from the most frequent words. Our judgments deal with 1) health, 2) immigration and 3) aboriginal concerns.
We may thus want to assign these names as column headers to our terms.
# Consult the 10 most distinctive words for each topic and assign the matching label ("health", "immigration" or "aboriginal") to each topic number. # !!! IMPORTANT: ADAPT THE ORDERING BELOW TO YOUR RESULTS !!!
topic_label <- c("health","immigration","aboriginal")
colnames(terms) <- topic_label
terms
We can now use that classification to determine which is the most prominent topic per case.
Since we already have the correct, human-assigned label, we will combine the assignment with our classification prediction to compare results.
# We create a new object topics with the topic for each document as input and create a list that compares actual with predicted classifications.
topics <- topics(topic_model)
topics <- mapvalues(topics, from=1:3, to=topic_label)
topics <- as.data.frame(cbind(as.character(cases$issue),topics))
topics <- cbind(cases$case,topics)
colnames(topics) <- c("case","issue", "prediction")
head(topics, 5)
We can determine the probability that a topic is present in a given document. On the one hand, this gives us a sense of what percentage of a document covers which topic. On the other hand, it also helps with error correction. The algorithm assigns a topic as the main topic when it is the most prevalent topic in a document. But if, according to the algorithm, 51% of a document covers health and 49% covers immigration, then a human could arguably classify it as either one. If the document was manually classified as immigration, the algorithm may not be completely wrong. Again, it is important to carefully study the results to assess how much confidence one can have in computer-generated categories. The same can be said for human-assigned categories.
# We determine the probability of whether a topic is in a given document.
topics$prob_topic <- posterior(topic_model)$topics
head(topics,5)
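One practical use of these probabilities is to flag documents where the leading topic only narrowly wins; these borderline cases are the ones most worth re-reading manually. A minimal sketch (the 0.6 cut-off is an arbitrary choice):
# Probability of the most likely topic for each document.
topics$top_prob <- apply(posterior(topic_model)$topics, 1, max)
# List the cases where the winning topic has less than 60% probability.
topics[topics$top_prob < 0.6, c("case", "issue", "prediction")]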
Finally, we can formally quantify how good our automated classification was by comparing it to the manual classification. To do that we simply check the percentage of classes that were guessed correctly.
# Here we check how many times the preassigned "issue" label is identical to our prediction. If the prediction is correct, we count it as a hit. We start the hit count with 0 and add one every time the assignment was correct. We can then divide the hits by the total number of guesses.
hits <- 0
for (row in 1:nrow(topics)) {
if (topics$issue[row] ==topics$prediction[row] ) {
hits <- hits+1
}
}
correctness <- hits/length(topics$issue)
correctness
## [1] 0.8983051
Pretty good guessing! The unsupervised algorithm got 9 out of 10 classifications right. (Note: Your number may be different since the algorithm is probabilistic).
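As an aside, the same share can be computed in a single line, because R treats TRUE as 1 and FALSE as 0:
# Proportion of cases where the prediction matches the manual label.
mean(topics$issue == topics$prediction)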
Extension: We used the topic model to classify our data. But by increasing the number of topics, we can also investigate more granular content. Rerun the analysis with a higher k. Rather than assigning each document to a single topic, we now want to check what percentage of each topic is present in a given document. In that sense, the posterior (the share of each topic per document) is no longer a measure of how confident we are in a unique classification, but describes the allocation of topics across each document.
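A minimal sketch of that extension, using 10 topics purely for illustration (any higher k will do):
# Re-estimate the topic model with a larger number of topics.
k_fine <- 10
topic_model_fine <- LDA(dtm, k_fine)
# Inspect the top words to interpret the more granular topics.
terms(topic_model_fine, 10)
# Share of each topic per document (each row sums to 1).
head(posterior(topic_model_fine)$topics, 5)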
Supervised machine learning, in contrast to unsupervised machine learning (like topic models), uses human-determined categories as a baseline. The computer is trained on already-labelled data and then categorizes not-yet-labelled data automatically.
Again, we will be working with court decisions that we want the algorithm to classify automatically based on their subject matter. We will train the computer on a sub-sample of the decisions and then use it to classify decisions that are not yet labelled.
# Load package (install it first with install.packages("e1071") if necessary).
library(e1071)
It is computationally intensive, and often unnecessary, to feed the entire DTM into the model. An easier way to prepare our estimation is to reduce the dimensions of our DTM.
Remember, the number of rows of our DTM equals the number of documents, and the number of columns equals the number of terms in the corpus. As such, DTMs can have thousands of columns. We want to compress this large matrix into a simpler representation with just 2 dimensions. We can do that by first computing a distance matrix between documents and then scaling it down to two dimensions using classical multidimensional scaling.
# For that we again create a distance matrix.
distance_matrix <- as.matrix(dist(dtm, method="binary"))
# We then scale the distance matrix down to 2 dimensions.
compressed_dtm <- cmdscale(distance_matrix, k = 2)
# We add the issue areas to our dataframe.
compressed_dtm<-as.data.frame(compressed_dtm)
compressed_dtm <- cbind(cases$issue,compressed_dtm)
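To build an intuition for what this compression does, you can plot the two dimensions and colour the points by their manually assigned issue; decisions on the same issue should roughly cluster together. This is an optional check:
# Plot the two scaled dimensions, coloured by issue area.
plot(compressed_dtm$V1, compressed_dtm$V2,
     col = as.integer(as.factor(cases$issue)),
     xlab = "Dimension 1", ylab = "Dimension 2")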
Next we want to create two subsets of our dataframe: [1] a training and [2] a test set.
# For that we generate 15 random row numbers that we will use to build our sets.
sample_rows <- sample(1:length(cases$issue), 15)
# On that basis, we create a test and a training set.
dtm_training <- compressed_dtm[-sample_rows,]
dtm_test <- compressed_dtm[sample_rows,]
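Note that sample() draws different rows on every run, so your training and test sets (and the accuracy below) will differ from ours. If you want a reproducible split, you can call set.seed() before the sampling step above (the seed value is arbitrary):
# Run this before the sample() call above to make the random split reproducible.
set.seed(42)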
We then train our model on the training data.
# We use a simple machine learning algorithm - Naive Bayes - to train our model.
model <- naiveBayes(dtm_training[,-1], as.factor(dtm_training[,1]))
Next, we use our model to predict outside of the training sample. We apply it to the test data to predict the classification of each test case.
# We apply the model using the predict() function.
prediction <- predict(model, newdata = dtm_test)
prediction <- as.data.frame(prediction)
Finally, we compare the actual issue labels of the test cases to our predictions.
prediction <- cbind(cases$issue[sample_rows],prediction)
colnames(prediction) <- c("issue","prediction")
prediction
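Before computing an overall score, a cross-tabulation of actual against predicted labels shows which issue areas, if any, the model confuses. A quick check using base R's table():
# Rows: manually assigned issue; columns: predicted issue.
table(actual = prediction$issue, predicted = prediction$prediction)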
# We follow the same approach as before to calculate the number of correct assignments.
hits <- 0
for (row in 1:nrow(prediction)) {
if (prediction$issue[row] ==prediction$prediction[row] ) {
hits <- hits+1
}
}
correctness <- hits/length(prediction$issue)
correctness
## [1] 0.8666667
Again, not bad. 13 of the 15 test cases (roughly 87%) were labelled correctly. For some applications this may be good enough. For other tasks, we may need a correctness closer to 100%. In that case, there are two strategies. First, we could train the model on more data so that it learns the relationship between text and outcome more accurately. Second, we could try another machine learning algorithm to see if it is more accurate. It is often difficult, if not impossible, to reach perfection.
It is important to remember that assignments made by machine learning algorithms do not need to be the final decision. The algorithm can instead make proposals that are subsequently validated by humans.
Sample of Canadian Court Decisions. [Download]