Lesson 2

Webscraping and Data Upload

Introduction

The life-cycle of any legal data science project starts with getting your data into R.

One of R's strengths is that it lets you both upload data from your local machine and scrape the web for data (that is, process websites and download their content). Webscraping is an art and a skill of its own, so today's lesson will only scratch the surface.

A word of warning

Some websites prohibit webscraping in their terms of service. You should therefore always check whether you are allowed to scrape a page and, if in doubt, contact the website's owner. Websites may also make their data available through other means, such as APIs.
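Besides reading the terms of service, you can inspect a site's robots.txt file, which signals which paths automated tools may access. A minimal sketch, using canada.ca purely as an example and assuming the site provides such a file:

# Read a site's robots.txt (if provided) to see which paths crawlers may access.
robots <- readLines("https://www.canada.ca/robots.txt")
head(robots)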

Learning from your error (messages)

As we get into more complicated coding activities, you may encounter errors, i.e. your code is just not working. The reasons range from typos in a function call to forgetting to activate a package. Don't be discouraged: errors are often easy to fix. R will give you an error message indicating the source of the problem. The most important thing is that you learn from your error messages; they will help you identify and fix the problem. As we discussed in the first lesson, there is also plenty of help online to resolve the issue.
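For example, calling a function from a package that has not been activated yields one of the most common error messages (the folder path here is just a placeholder):

readtext("myfolder/*")
 ##
Error in readtext("myfolder/*") : could not find function "readtext"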

What we do in this lesson

There are many ways to get data into R. Today we will consider five of them: CSV files, text files, PDFs, webscraping, and XML data. We will also show you how to set your working directory and how to install packages.

1. Loading and saving CSV files
2. Setting a working directory
3. Installing packages
4. Uploading text files
5. Reading and uploading PDFs
6. Webscraping
7. Working with XML data

R Script

Upload and Save CSV Files
In R, tabular data is the most common data type. For instance, you may have prepared data in an Excel sheet and now want to upload it into R. While there are R packages that can read xlsx files, we will work with the simpler CSV ("comma-separated values") format, in which the columns of a table are separated by commas. For the purpose of this code, create a simple table in Excel and save it in your working directory as a csv file called "myfile".
# Here I created a very simple table in Excel with three countries and some sample data and saved it as myfile.csv. I then load it into R.
Using R's built-in functions, you can upload that csv file with the read.csv() command.
myfile <- read.csv("myfile.csv", header = TRUE, sep = ",", row.names = 1)
To check that your data uploaded correctly, you can check its first couple of entries with the head() function.
head(myfile)
 ##
         Data 1 Data 2 Data 3
Germany       5      2      8
Canada        6      4      5
Spain         3      6      7

To save that file in your working directory, use write.csv().
write.csv(myfile,file="mynewfile.csv")

Setting a Working Directory
The easiest way to get data into R is from your own computer. Whenever R interacts with files on your computer, you need to be aware of the working directory that R currently uses.
You can get to know your working directory through the getwd() function.
getwd()

You can change your working directory through the setwd() function.
setwd("~/Google Drive/")
Now, R will store files that you save at that location. It will also look for files you want to upload in your working directory unless you specify a different path.
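To double-check what R can currently see in that location, you can list the files in your working directory:

# List the files in the current working directory.
list.files()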
Installing R Packages
Up to now, we have only worked with R's built-in functions. Next, we want to see how we can expand R's functionality by installing packages that have already been created by the R user community. There are two ways you can install packages in R.
  1. Using RStudio: The RStudio interface allows you to install packages directly. Look at the bottom right window of your RStudio interface, where you will find a "Packages" tab. From there, you can search for and install packages by name.
  2. Using the R Console: You can also install packages directly from your R console using the install.packages() command with the name of the package you would like to install. For example:
install.packages("readtext")

Importantly, before you can actually use the functionality of that package, you will have to activate the package.
library("readtext")

Whereas packages remain installed unless you uninstall them, you have to activate the packages you use anew in each session. So if a function you are trying to use is not working, double-check that the required package is actually activated.
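A common defensive pattern, shown here as a minimal sketch, is to install a package only when it is missing and then activate it:

# Install "readtext" only if it is not yet installed, then activate it.
if (!requireNamespace("readtext", quietly = TRUE)) install.packages("readtext")
library("readtext")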
Upload Text Files
Say we have already obtained text (.txt) files (either through webscraping or other means). We now want to upload these files into R. An easy way to do that is via the "readtext" package.
library("readtext")

We will work with sample text data of Supreme Court of Canada cases. They are available under the DATASET tab. Please download them into a folder on your hard drive. You must ensure that the folder only contains the texts you want to import into R.
# Change the folder path below to the path of your target folder. Important: don't forget the little asterisk *. It indicates that you want to import all files in that folder. Also make sure the folder only contains the files you want to import.

folder <- "~/Google Drive/Teaching/Canada/Legal Data Science/2019/Data/Supreme Court Cases/*"
Now we can upload the texts from that target folder using the readtext() function.
scc_texts <- readtext(folder)
As you can see, the object is a dataframe that contains both the file name and its text.
print(scc_texts)
 ##
readtext object consisting of 25 documents and 0 docvars.
# data.frame [25 x 2]
    doc_id                          text
    <chr>                          <chr>
1 [2013] 1 S.C.R. 467.txt "\"SUPREME CO\"..."
2 [2013] 1 S.C.R. 61.txt "\"SUPREME CO\"..."
3 [2013] 1 S.C.R. 623.txt "\"SUPREME CO\"..."
4 [2013] 2 S.C.R. 227.txt "\"SUPREME CO\"..."
5 [2013] 3 S.C.R. 1053.txt "\"SUPREME CO\"..."
6 [2013] 3 S.C.R. 1101.txt "\"SUPREME CO\"..."
# ... with 19 more rows

To work with the text only, use the $ operator, which selects the text column.
scc_texts$text
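For instance, to sanity-check the upload, you can peek at the opening characters of the first case:

# Show the first 200 characters of the first case.
substr(scc_texts$text[1], 1, 200)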
Upload PDF Files

Many legal documents are in fact in .pdf and not in .txt format. To upload those we will use the package "pdftools".

# Activate package
library(pdftools)

It is important to note that only pdfs with embedded digital text can be uploaded. Scanned images of a text first need to undergo Optical Character Recognition (OCR). Today, we will work with the Universal Declaration of Human Rights as an example.

# Download a .pdf version of the Universal Declaration of Human Rights directly into R from the internet.
human_rights <- pdf_text("https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf")

The pdf_text() function converts each page into one element of your object. The Universal Declaration is 8 pages long, so it has been converted into a character vector with 8 elements.

# If we want to look at page 5, simply specify the number of that page.

human_rights[5]
 ##  
[1] "    1. Men and women of full age, without any limitation due to race, nationality\r\n        or religion, have the right to marry and to found a family. They are entitled\r\n        to equal rights as to marriage, during marriage and at its dissolution.\r\n    2. Marriage shall be entered into only with the free and full consent of the\r\n        intending spouses.\r\n    3. The family is the natural and fundamental group unit of society and is\r\n        entitled to protection by society and the State.\r\nArticle 17\r\n    1. Everyone has the right to own property alone as well as in association with\r\n        others.\r\n    2. No one shall be arbitrarily deprived of his property.\r\nArticle 18\r\nEveryone has the right to freedom of thought, conscience and religion; this right\r\nincludes freedom to change his religion or belief, and freedom, either alone or in\r\ncommunity with others and in public or private, to manifest his religion or belief in\r\nteaching, practice, worship and observance.\r\nArticle 19\r\nEveryone has the right to freedom of opinion and expression; this right includes\r\nfreedom to hold opinions without interference and to seek, receive and impart\r\ninformation and ideas through any media and regardless of frontiers.\r\nArticle 20\r\n    1. Everyone has the right to freedom of peaceful assembly and association.\r\n    2. No one may be compelled to belong to an association.\r\nArticle 21\r\n"

Webscraping

Webscraping is technically challenging, sometimes impossible, and often prohibited, so this section is only a primer. Some websites offer dedicated interfaces for accessing their data, so-called application programming interfaces (APIs), which are usually preferable to scraping. For larger projects, you may want to ask a computer scientist for help rather than doing the webscraping yourself. With that in mind, I still want to give you a taste of what is possible in R. To do webscraping in R, we need to install and load the package "rvest" (note the pun on "harvest").

# Activate package

library("rvest")

In this instance, we are interested in scraping the texts of labor agreements Canada has signed with third parties. Look for Canadian labor agreements online.

# Once we have found the website, copy its url.


url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"

We can then read the website into R using the read_html() command.

website <- read_html(url)

Websites are written in HTML, a markup language that we can use to locate what we are looking for. Not all websites can be easily scraped; the website's architecture determines how (and whether at all) it is possible to scrape it.

Open our target website in its source code view (most web browsers have an option that lets you view a page's source code). Go through that source code: where do we find the list of labor treaties?

Our target website contains several HTML lists (<li> elements). The labor treaties are located in one of these lists. Within the lists, items link to other pages through anchor (<a>) tags. We now scrape the link texts behind the <a> tags within the list items and manually identify the location of the labor agreements.

# Websites change over time. We are looking for the agreements starting with NAFTA up to the Canada-Honduras agreement. At the time of writing, they are list elements 23 to 30.


treaties_names <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_text()
 
treaties_names <- treaties_names[23:30]
 
treaties_names

 ##
[1] "North American Agreement on Labour Cooperation" "Canada-Chile Agreement on Labour Cooperation"
[3] "Canada-Costa Rica Agreement on Labour Cooperation" "Canada-Peru Agreement on Labour Cooperation"
[5] "Canada-Colombia Agreement on Labour Cooperation" "Canada-Jordan Agreement on Labour Cooperation"
[7] "Canada-Panama Agreement on Labour Cooperation" "Canada-Honduras Agreement on Labour Cooperation"
 

We also want to capture the hyperlink associated with each treaty so that we can go to the corresponding page with the full text and scrape the text.

# The hyperlink associated with each treaty is stored in the "href" attribute.

treaties_links <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_attr("href")
 
treaties_links <- treaties_links[23:30]

Next, we want to go to the website behind each of these hyperlinks and scrape the text of the agreements.

# Links 1 to 7 are relative, so we prepend "https://www.canada.ca". Since we are dealing with a list, we use the lapply() function, which is similar to the for-loop we got to know before, but more efficient. In words, we ask R to loop over the elements in treaties_links and apply function(x) to each element of that list. Here, function(x) wraps the paste() function: we prepend "https://www.canada.ca" to every link (except the 8th one, which is already complete).



treaties_links_full <- lapply(treaties_links, function(x) (paste("https://www.canada.ca",x,sep="")))
 
treaties_links_full[8] <- treaties_links[8]

Now that we have the proper hyperlinks, we can loop over these treaty link urls and extract the full text for each treaty.

# We use the read_html() function to extract the full text for each treaty behind the link url.

treaty_texts <- lapply(treaties_links_full, function(x) (read_html(x)))

The texts of the treaties themselves are contained in paragraph (<p>) tags. The results we obtain here will not be perfect: the code will capture some additional text that does not belong to the treaties. If we wanted to be exact, we would need to trim the data down to just the treaty texts before using it. For our present purposes, this information is good enough.



treaty_texts <- lapply(treaty_texts, function(x) (x %>% html_nodes('body') %>% html_nodes('p') %>% html_text())) 

treaty_texts <- lapply(treaty_texts, function(x) (unlist(x)))
 
treaty_texts <- lapply(treaty_texts, function(x) paste((x), collapse=' '))
 
treaty_texts <- unlist(treaty_texts)

Finally, we can combine the text of each agreement with the name we extracted earlier to create a dataframe.

# Storing our webscraped texts in a dataframe allows us to conduct analysis with it later on. stringsAsFactors = FALSE keeps the texts as character strings rather than factors (factors are the default in R versions before 4.0).
treaty_dataset <- data.frame(treaties_names, treaty_texts, stringsAsFactors = FALSE)
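If you want to keep the scraped texts for later sessions, you can save the dataframe to your working directory as before (a minimal sketch; the file name is arbitrary):

# Save the scraped treaties as a csv file in the working directory.
write.csv(treaty_dataset, file = "treaty_dataset.csv")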

Working with XML Data

XML, like HTML, is a markup language for storing text that also provides a way to annotate it. The combination of text and annotation tags makes XML one of the best data formats for legal data analysis.

XML data can store meta-data, that is, data about the text, directly in the document. For instance, the XML versions of Canadian laws and regulations contain information about the entry into force of the law or regulation, the date of its last amendment, its enabling statutory authority where applicable, and the like. Court data in XML typically provides information on the identity of the litigants, the date of the judgment, the identity of the judge, and other useful information.

XML data also provides for annotated full text data. XML data of Canadian regulations, for instance, distinguishes article headers from article text, "definition" clauses from substantive clauses, and identifies cross-references to other laws. This facilitates the segmentation of the text and makes it easier to extract information.

The only downside of the XML format is that it needs to be parsed (similar to what we did for HTML) in order to extract this information. The code below illustrates how to parse the XML of NAFTA, the North American Free Trade Agreement, an international agreement contained in the Text of Trade Agreements (ToTA) dataset.


# Load libraries
library("xml2")
library("rvest")

We start by reading the NAFTA XML into R. If you were to work with the entire ToTA set instead of just one agreement, you would write a loop that repeats the code below for all the urls that lead to ToTA texts (a sketch of such a loop follows after the code).


# Download XML of NAFTA, which is PTA number 112 in ToTA.
tota_xml <- read_xml("https://raw.githubusercontent.com/mappingtreaties/tota/master/xml/pta_112.xml", options = c("RECOVER", "NOCDATA", "NOENT", "NOBLANKS", "BIG_LINES"), encoding = "UTF-8")
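A hypothetical sketch of the loop mentioned above, assuming the pta_<number>.xml naming pattern holds for the agreement IDs you request:

# Hypothetical sketch: read several ToTA agreements at once.
# The IDs below are examples; the pta_<number>.xml pattern is assumed to hold.
pta_ids <- c(110, 111, 112)
pta_urls <- paste0("https://raw.githubusercontent.com/mappingtreaties/tota/master/xml/pta_", pta_ids, ".xml")
tota_xmls <- lapply(pta_urls, read_xml)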

Next we parse the meta-data of the NAFTA XML.


# Extract agreement name
tota_name <- as.character(xml_find_all(tota_xml, "/treaty/meta/name/text()"))
print(tota_name)

 ## [1] "North American Free Trade Agreement (NAFTA)"



# Extract agreement type
tota_type <- as.character(xml_find_all(tota_xml, "/treaty/meta/type/text()"))
print(tota_type)

 ## [1] "Free Trade Agreement & Economic Integration Agreement"



# Extract agreement parties
tota_parties <- as.character(xml_find_all(tota_xml, "/treaty/meta/parties_original/partyisocode/text()"))
tota_parties <- paste0(tota_parties, collapse = ",")
print(tota_parties)

 ## [1] "CAN,MEX,USA"



# Extract agreement date of signature
tota_date_signature <- as.character(xml_find_all(tota_xml, "/treaty/meta/date_signed/text()"))
print(tota_date_signature)

 ## [1] "1992-12-17"


Finally, we can extract the full text of the NAFTA agreement from the XML.


# Extract full text
full_text <- xml_text(xml_find_all(tota_xml, "/treaty/body"))

If you want to continue working with this information, it makes sense to combine the meta-data and the full text into the row of a dataframe, as we did for the webscraped treaties.
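A minimal sketch, combining the objects parsed above into a one-row dataframe:

# Combine the parsed meta-data and the full text into a one-row dataframe.
nafta_row <- data.frame(name = tota_name, type = tota_type, parties = tota_parties, date_signed = tota_date_signature, text = paste(full_text, collapse = " "), stringsAsFactors = FALSE)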

Dataset

Sample of cases from the Canadian Supreme Court in txt format. [Download]

Exercises

Scrape the treaty name list and corresponding links from the Trade Compliance Website. [Link]
