A relatively low-tech but highly effective information retrieval technique is the regular expression. Most of you already use keywords to find and retrieve information from full texts. Regular expressions – or regexes for short – are like keyword searches, but better: rather than looking only for words, regexes look for patterns.
Fortunately for lawyers and legal researchers, much in law is based on patterns: consistent document identification, standardized citations, formalized text structures and so forth.
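As a first taste, here is a minimal sketch contrasting a fixed keyword search with a pattern search in R. The sentence and the article references in it are invented for illustration:

```r
# A keyword search finds one exact string; a regex finds every string
# that fits a pattern. The example sentence is invented.
text <- "See Art. 3 and Art. 12 of the treaty."

# A fixed keyword only tells us whether that exact string occurs:
grepl("Art. 3", text, fixed = TRUE)
# [1] TRUE

# A regex captures every article reference at once:
regmatches(text, gregexpr("Art\\.\\s\\d+", text))[[1]]
# [1] "Art. 3"  "Art. 12"
```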
In legal data science, regexes serve two basic purposes: segmenting texts into their components and retrieving information from them.
1. What is a Regex?
2. Integrating Regexes into R Code
3. Using Regexes for Text Segmentation
4. Using Regexes for Information Retrieval
For example, the following pattern matches any sequence of exactly four digits, such as a year:
pattern <- "\\d{4}"
?grep
## Description - grep, grepl, regexpr, gregexpr and regexec search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
# Example text:
sample_text <- "World War II lasted from 1939 to 1945."
# So your regex would be "\\d{4}"
pattern <- "\\d{4}"
pattern_matching <- gregexpr(pattern, sample_text)
regmatches(sample_text, pattern_matching)[[1]]
## [1] "1939" "1945"
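As the help page quoted above notes, the functions in the grep() family differ mainly in what they return. A quick sketch of the differences, using the same sample sentence:

```r
sample_text <- "World War II lasted from 1939 to 1945."

grepl("\\d{4}", sample_text)    # logical: is there a match at all?
# [1] TRUE

regexpr("\\d{4}", sample_text)  # position and length of the FIRST match only

gregexpr("\\d{4}", sample_text) # positions and lengths of ALL matches
```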
Aside from finding information, regular expressions help segment text through substitution or splitting.
The command gsub() uses regexes to substitute elements.
gsub(pattern, "[enter year here]", sample_text)
## [1] "World War II lasted from [enter year here] to [enter year here]."
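A related base R function, sub(), replaces only the first match, which is sometimes what you want:

```r
sample_text <- "World War II lasted from 1939 to 1945."

sub("\\d{4}", "[year]", sample_text)   # replaces the first match only
# [1] "World War II lasted from [year] to 1945."

gsub("\\d{4}", "[year]", sample_text)  # replaces all matches
# [1] "World War II lasted from [year] to [year]."
```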
You can also split strings on that pattern using the strsplit() function.
strsplit(sample_text, pattern)
## [[1]]
## [1] "World War II lasted from " " to " "."
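The fragments produced by strsplit() keep the whitespace that surrounded the pattern; trimws() tidies them up:

```r
sample_text <- "World War II lasted from 1939 to 1945."
fragments <- strsplit(sample_text, "\\d{4}")[[1]]

# Trim leading and trailing whitespace from each fragment:
trimws(fragments)
# [1] "World War II lasted from" "to" "."
```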
Using patterns to split a text is useful to break down a contract, statute or treaty into its various subcomponents. As an illustration, let's once again work with the Universal Declaration of Human Rights.
# Let's repeat the part of lesson 2 and load the Universal Declaration of Human Rights.
library(pdftools)
human_rights <- pdf_text("https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf")
# Recall that the pdf_text() function returns an object separated by page. We want to turn this into a single text object.
human_rights <- paste(human_rights, collapse = " ")
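In case the collapse step is unfamiliar, here is what paste(..., collapse = " ") does on a toy two-element vector:

```r
# Joining a character vector into a single string:
pages <- c("page one text", "page two text")
paste(pages, collapse = " ")
# [1] "page one text page two text"
```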
Fortunately, legal documents are often quite uniform. We can exploit that feature to segment them into subcomponents.
Take a look at the Universal Declaration. You will notice that it is segmented into articles, each with its own article header. These headers follow the pattern "Article" + space + one or more digits. As a result, the regex "Article\\s\\d+" should allow us to properly split the Declaration into articles. If you try this, though, you will see that it misses Article 1, because unlike the others it is formatted with a Roman numeral as "Article I". A comprehensive regex therefore looks for "Article" + space + one or more digits OR "Article" + space + a capital letter. We thus use the regex "Article\\s\\d+|Article\\s[A-Z]".
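Before applying the regex to the whole document, it can be tested on a few invented header strings:

```r
# The last string deliberately lacks a space after "Article",
# so it should NOT match:
headers <- c("Article I", "Article 2", "Article 30", "Articles of note")
grepl("Article\\s\\d+|Article\\s[A-Z]", headers)
# [1]  TRUE  TRUE  TRUE FALSE
```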
We can then follow a three-step process of identifying, extracting and splitting based on that pattern.
# (1) identify our pattern in the text
article_matcher <- gregexpr("Article\\s\\d+|Article\\s[A-Z]", human_rights)
# (2) extract the pattern
article_headers <- regmatches(human_rights, article_matcher)[[1]]
# (3) and then split the text at the pattern
article_text <- strsplit(human_rights,"Article\\s\\d+|Article\\s[A-Z]")[[1]]
Finally, we want to create a dataframe that has the number of the article in column 1 and its text in column 2. Note, however, that since treaties have preambles, article_text contains one more element than article_headers.
# Hence, we have to add another header (preamble) before we can match the headers and text in a dataframe.
article_headers <- c("Preamble",article_headers)
# At last, we can combine the headers and text in a dataframe.
article_table <- data.frame(article_headers,article_text)
head(article_table)
| | article_headers | article_text |
|---|---|---|
| 1 | Preamble | Universal Declaration of Human Rights\nPreamble\nWhereas recognition of the inherent dignity and of the ... |
| 2 | Article I | \nAll human beings are born free and equal in dignity and rights. They are\nendowed with reason and cons... |
| 3 | Article 2 | \nEveryone is entitled to all the rights and freedoms set forth in this Declaration,\nwithout distinction... |
| 4 | Article 3 | \nEveryone has the right to life, liberty and security of person.\n |
| 5 | Article 4 | \nNo one shall be held in slavery or servitude; slavery and the slave trade shall be\nprohibited in all their forms.\n |
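The text column still contains line breaks and padding from the PDF. One clean-up sketch, shown on a toy data frame that mirrors the column names used above:

```r
# Toy stand-in for article_table; the text strings are shortened:
toy <- data.frame(
  article_headers = c("Preamble", "Article I"),
  article_text = c("\nWhereas recognition of the inherent dignity ...\n",
                   "\nAll human beings are born free and equal ...\n")
)

# Collapse runs of whitespace (including \n) and trim the ends:
toy$article_text <- trimws(gsub("\\s+", " ", toy$article_text))
toy$article_text[2]
# [1] "All human beings are born free and equal ..."
```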
The research area in which regexes have the most promise is probably information retrieval. If there is a consistent pattern in your text, it is very likely that you can write a regex to extract it.
Examples include:
1) Numbers (years, telephone numbers, page numbers, ...)
2) Names (Names of persons, laws, entities,...)
3) Citations (court decisions, academic works, ...)
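Hedged sketches of patterns for two of these categories, run on an invented sentence (the citation loosely follows the [year] volume SCR page convention):

```r
# Invented example sentence:
text <- "Smith v Jones [1998] 2 SCR 217 was decided in June 1998."

# (1) Numbers: four-digit years
regmatches(text, gregexpr("\\d{4}", text))[[1]]
# [1] "1998" "1998"

# (3) Citations: [year] + volume + reporter + page
regmatches(text, gregexpr("\\[\\d{4}\\]\\s\\d+\\sSCR\\s\\d+", text))[[1]]
# [1] "[1998] 2 SCR 217"
```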
In the Universal Declaration, we might want to extract all proper names. Names are typically capitalized and consist of two or more words.
pattern_matching <- gregexpr("[A-Z][a-z]+\\s[A-Z][a-z]+", human_rights)
regmatches(human_rights, pattern_matching)[[1]]
##
Importantly, the results from regexes are only as good as the clarity and consistency of the underlying pattern. Here we get some false positives: matches like "Human Rights" belong to the title "Universal Declaration of Human Rights" rather than being proper names in their own right.
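One common remedy is to filter out known false positives after matching; the vectors below are invented for illustration:

```r
# Hypothetical match results and a hand-curated exclusion list:
matches <- c("Universal Declaration", "Human Rights",
             "General Assembly", "Member States")
false_positives <- c("Universal Declaration", "Human Rights")

setdiff(matches, false_positives)
# [1] "General Assembly" "Member States"
```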
Regexes can also be used for fuzzy searches. If you are interested in all words connected with "human" you can then write a regex that captures all compound terms that start with the word "human":
pattern_matching <- gregexpr(" human\\s[a-z]+", human_rights)
unique(regmatches(human_rights, pattern_matching)[[1]])
## [1] " human family" " human rights" " human beings" " human person" " human dignity"
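A variant worth knowing: a word boundary (\\b) instead of a leading space also catches "human" at the very start of a string. The example string is invented:

```r
text <- "human dignity and the human family"
unique(regmatches(text, gregexpr("\\bhuman\\s[a-z]+", text))[[1]])
# [1] "human dignity" "human family"
```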
Sample of cases from the Supreme Court of Canada in txt format. [Download]
Go back to the Supreme Court of Canada data.
Imagine you work for a tech startup and have been asked to extract some metadata – data about the case – from the full text of Supreme Court judgments. Attempt to create a set of regular expressions to extract metadata from these cases relating to:
*
Hint: Start with a sample case before applying your code to the entire dataset.
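A possible starting point, using an invented one-line sample; the sample string and the citation format are assumptions to adapt once you inspect the real judgment texts:

```r
# Hypothetical sample line from a judgment header:
case_text <- "Citation: [2005] 3 SCR 141 -- Heard: May 12, 2005"

# Extract the first four-digit year:
regmatches(case_text, regexpr("\\d{4}", case_text))
# [1] "2005"

# Extract the citation (assumed format: [year] volume SCR page):
regmatches(case_text, regexpr("\\[\\d{4}\\]\\s\\d+\\sSCR\\s\\d+", case_text))
# [1] "[2005] 3 SCR 141"
```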