Imagine you have a text and want to extract all years that are mentioned in it. A keyword search is not an option, because you do not know in advance which years appear, so what is there to do?
Regular expressions (regexes) offer a solution, because they look for patterns rather than exact matches. These patterns can be mixed and matched, which helps us extract a wide variety of elements from a text.
The above example, all years mentioned in a document, can be reframed as looking for a pattern of four consecutive digits. The regex “\\d”, for instance, matches any single digit (the backslash is doubled because the pattern is written as an R string). The regex “\\d{4}” matches any run of four digits. This is an easy strategy for finding the years that are listed, although it will also pick up other four-digit numbers and the first four digits of longer numbers.
pattern <- "\\d{4}"
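As a minimal sketch of how this pattern could be applied, the snippet below uses base R's `gregexpr()` and `regmatches()`. The sample sentences are made up for illustration, and the word-boundary variant `\\b\\d{4}\\b` is an optional refinement not taken from the text above; it is one way to avoid matching part of a longer number.

```r
# Hypothetical sample text, invented for this illustration
text <- "The company was founded in 1998, went public in 2004, and moved in 2015."

pattern <- "\\d{4}"

# gregexpr() locates every match; regmatches() extracts the matched substrings
years <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
print(years)
# "1998" "2004" "2015"

# A longer number also contains four consecutive digits
text2 <- "Order 123456 shipped in 2019."
loose <- regmatches(text2, gregexpr("\\d{4}", text2, perl = TRUE))[[1]]
print(loose)
# "1234" "2019"

# Adding word boundaries (\\b) keeps only stand-alone four-digit numbers
strict <- regmatches(text2, gregexpr("\\b\\d{4}\\b", text2, perl = TRUE))[[1]]
print(strict)
# "2019"
```

The `perl = TRUE` argument switches to Perl-compatible regexes, which handle `\\d` and `\\b` reliably; whether the stricter pattern is needed depends on what other numbers appear in your text.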
There are many other patterns, too. You can look for:
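- “\\w” matches any single word character (a letter, digit, or underscore)
- “\\w+” matches runs of one or more word characters, that is, whole words
- “[A-Z]\\w+” matches words that start with an uppercase letter and are at least two characters long
- “\\s” matches any single whitespace character (space, tab, line break)
- …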
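The patterns listed above can be tried out on a short string. This is only a sketch: the sample sentence and the helper `extract_all()` are hypothetical names introduced for illustration, built on base R's `gregexpr()` and `regmatches()`.

```r
# Hypothetical sample sentence, invented for this illustration
text <- "Alice met Bob in Paris"

# Hypothetical helper: return every match of `pattern` in the string `x`
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

words       <- extract_all("\\w+", text)       # runs of word characters
capitalized <- extract_all("[A-Z]\\w+", text)  # uppercase letter, then word characters
spaces      <- extract_all("\\s", text)        # single whitespace characters

print(words)
# "Alice" "met" "Bob" "in" "Paris"
print(capitalized)
# "Alice" "Bob" "Paris"
print(length(spaces))
# 4
```

Note that `\\w` also matches digits and underscores, so `\\w+` would return a token such as `2020` as well.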
There are many more patterns than these, and mastering regexes requires practice. A great website to test regexes, find support, and try out different code is https://regexr.com.
Last update May 8, 2020.