“…What is this book about?” — I asked
“Regular Expressions” — He told me
“What? I’ve never heard of them” — I replied confused
“Oh, if you read the book, you will find them very useful” — He said
I opened the book, looked through the index, and went directly to the Python section.
I must confess. I was super confused. I didn’t understand a word of what the book was saying. So I closed it.
Some time after that, I was working on a project related to natural language processing. I had to parse PDF files, and it was becoming a nightmare.
I looked at my bookcase with despair and I saw the book standing there. I told myself that I had to give it a try. So I opened it again determined to understand regular expressions.
Every minute that I spent learning them was worth it!
Regex are very powerful and fast. Once you grasp the concept, a new world opens up. They allow you to search complex patterns that would be very difficult to find otherwise.
Note: All images unless otherwise noted are by the author.
Regular expressions, or Regex, are strings that contain a combination of normal and special characters describing patterns to find text within a text.
What??? This sounds very complicated…. Let’s break it down to understand it better.
The image above shows what a regular expression looks like. In Python, the r at the beginning indicates a raw string. It is not mandatory to use it, but advisable.
We said that a Regex contains normal characters… or in other words, literal characters that we already know. The normal characters match themselves. In the case shown in the image, tr exactly matches a t followed by an r.
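As a minimal sketch of these two ideas (the sample word "extract" is only an illustration, not taken from the image), the snippet below shows what the r prefix changes and how literal characters match themselves:

```python
import re

# Without the r prefix, "\n" is a single newline character.
# With it, r"\n" stays as two characters: a backslash and an n.
print(len("\n"))    # 1
print(len(r"\n"))   # 2

# Literal characters match themselves: "tr" finds a t followed by an r.
literal_match = re.search(r"tr", "extract")
print(literal_match.group())  # tr
print(literal_match.span())   # (2, 4)
```

Using raw strings for patterns avoids surprises when the pattern contains backslashes, which is the case for most metacharacters.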
Regex also contains Metacharacters. These special characters do not match themselves. Instead, they are characters that have a “special meaning” in a regular expression. Particularly, they can represent:
1. Types of characters
In this case, the metacharacter represents character classes or special sequences. As an example, \d represents a digit, \s indicates whitespace, and [A-Za-z] matches any letter from A to Z or a to z.
2. Ideas, such as location or repetitions
Metacharacters can indicate a location in the string. They can also act as quantifiers that specify how many times the character to their left must be matched.
In the example, 1 and 2 inside curly braces indicate that the character immediately to the left, in this case \d, should appear between 1 and 2 times. Also, the plus sign (+) indicates that any letter from A to Z, or a to z, should appear 1 or more times.
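To make the quantifiers concrete, here is a small sketch that assumes the pattern being described is \d{1,2}[A-Za-z]+. The sample sentence is made up for illustration.

```python
import re

# One or two digits, followed by one or more letters.
pattern = r"\d{1,2}[A-Za-z]+"

codes = re.findall(pattern, "Rooms 7b, 12A and 3C are free")
print(codes)  # ['7b', '12A', '3C']
```

Plain words such as "Rooms" or "free" are skipped because they do not start with a digit.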
In the table, we can see a list of the supported metacharacters and their meaning.
We said that regex describes a pattern. A pattern is a sequence of characters that maps to words or punctuation.
A data scientist or a software engineer uses pattern matching to find and replace specific text. The use cases are wide: validating strings (such as passwords or email addresses), parsing documents, preprocessing data, and helping with web scraping or data extraction.
Why regex? They are very powerful and fast. They allow us to search complex patterns that would be very hard to find otherwise.
Python has a useful library, the re module, to handle regex.
import re
This library provides us with several functions that make pattern matching easier. Let’s see some of them.
To search for a pattern, we can use the .search() function. It takes the regex and the string. The function scans through the string, looking for the first location where the regex gives a match. It returns a match object, or None if no position in the string matches the pattern.
> re.search(r'\w{4}\d{4}', 'My password is abcd1234.')
In the code, we want to find a word character repeated four times, followed by a digit repeated four times. The .search() function finds the match: abcd1234.
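The value returned by .search() is a match object rather than a plain string. A short sketch of how it can be inspected, and of how to guard against the None case (the second sample string is an assumption added for illustration):

```python
import re

pattern = r"\w{4}\d{4}"

found = re.search(pattern, "My password is abcd1234.")
if found is not None:
    print(found.group())  # abcd1234  (the matched text)
    print(found.span())   # (15, 23)  (start and end positions)

# When nothing matches, .search() returns None instead of raising an error.
missing = re.search(pattern, "There are no digits here.")
print(missing)  # None
```

Checking for None before calling .group() avoids an AttributeError when the pattern is not found.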
Another function that helps us find a match for our pattern is .match(). It also takes the regex and the string.
Why do we need another function when we already have .search()?
The .match() function is anchored at the beginning of the string. It returns a match only if the pattern is found at the very start of the string.
> re.match(r'\w{4}\d{4}', 'My password is abcd1234.')
We’ll use .match() instead of .search() in the previous example. We’ll find out that there is no match, because the string does not begin with four word characters followed by four digits.
> re.match(r'\w{4}\d{4}', 'abcd1234 is my password.')
Let’s change our string. We use a string with our pattern at the beginning. Now, the .match() function is able to find a match.
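The difference between the two functions can be seen side by side in the sketch below, which reuses the same pattern and the two sample strings (the variable names are illustrative):

```python
import re

pattern = r"\w{4}\d{4}"
text_middle = "My password is abcd1234."
text_start = "abcd1234 is my password."

# .search() looks anywhere in the string.
print(re.search(pattern, text_middle) is not None)  # True

# .match() only looks at the beginning of the string.
print(re.match(pattern, text_middle))               # None
print(re.match(pattern, text_start).group())        # abcd1234
```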
To find all matches of a pattern, we can use the .findall() function. It takes two arguments: the regex and the string.
> re.findall(r'\d{1,3}', 'My 3 cats have 15 kittens')
['3', '15']
In the code, we want to find all the matches of any digit that repeats between 1 and 3 times in the specified string. The .findall() function returns a list with the two matches found: '3' and '15'.
Notice that it doesn’t have to be the same digit. It’s just the “digit class” that has to be repeated between 1 and 3 times.
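A small sketch of what this means in practice, using a made-up string with a longer run of digits: the quantifier takes as many digits as it can, up to three, and .findall() then continues from where the previous match ended.

```python
import re

# Seven digits in a row are split into chunks of at most three digits.
chunks = re.findall(r"\d{1,3}", "Order 5551234")
print(chunks)  # ['555', '123', '4']
```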
To replace any pattern match with another string, we can use the .sub() function. It takes three arguments: the regex, the replacement, and the string.
> re.sub(r'\d', ' ', 'My1house2has3white4walls')
'My house has white walls'
In the example, we replace every match of a decimal digit with a blank space.
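One detail worth sketching: \d replaces each digit separately, so a run of several digits becomes several spaces. If a single space per run is wanted, adding the + quantifier is one possible approach (the sample string is an assumption for illustration):

```python
import re

text = "My12house345has6walls"

# Each digit becomes its own space.
per_digit = re.sub(r"\d", " ", text)
print(repr(per_digit))  # 'My  house   has walls'

# Each run of digits becomes a single space.
per_run = re.sub(r"\d+", " ", text)
print(repr(per_run))    # 'My house has walls'
```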
Now that we have covered the basic concepts of Regex, let’s see regex in action.
Imagine that we are cleaning some text that we extracted from the web. We come across some strings (e.g., My&name&is#John Smith. I%live$in#London) that contain symbols that should not be there. How can we clean those strings?
We are going to use regex and the .sub() function. How do we build the regex?
We’ll indicate that we want to search for the symbols #, $, %, and & by putting them between square brackets: [#$%&]. This indicates that any single character listed between the square brackets can be matched. We are going to replace each of them with a blank space, " ". So the code will be the following:
> my_string = "My&name&is#John Smith. I%live$in#London."
> re.sub(r"[#$%&]", " ", my_string)
'My name is John Smith. I live in London.'
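If the same cleanup has to run over many strings, one option is to compile the pattern once and wrap it in a small helper. The sketch below is only an illustration: the names SYMBOLS and clean_text are hypothetical, and the + quantifier and the .strip() call are additions that collapse repeated symbols and trim the edges.

```python
import re

# Compile once, reuse many times. The + collapses runs of symbols.
SYMBOLS = re.compile(r"[#$%&]+")

def clean_text(text):
    """Replace runs of unwanted symbols with one space and trim the ends."""
    return SYMBOLS.sub(" ", text).strip()

print(clean_text("&&My&name#is$$John%"))  # My name is John
```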
Now, imagine that we want to validate a password. This password needs to meet certain requirements. So let’s write the regex that helps us validate each of them:
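- It must start with a minimum of 4 and a maximum of 8 digits: \d{4,8}. Note that there is no space inside the curly braces. Because we have to match the beginning of the string, let’s use .match().
- The numbers must be followed by a minimum of 2 and a maximum of 6 letters, either capital or small: [a-zA-Z]{2,}. The quantifier only sets the minimum here. Because the next part of the pattern accepts any character, an upper limit of 6 at this point would not change which passwords match.
- After that, it can contain any character: .*
- It cannot end with the following symbols: !, @, $, %, &. We write this as [^!@$%&]$. Notice here that we use ^ inside the square brackets to negate the occurrence of the symbols, and $ anchors the pattern to the end of the string.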
So we define a function to validate passwords:
> def validate_password(password):
>     if re.match(r"\d{4,8}[a-zA-Z]{2,}.*[^!@$%&]$", password):
>         print(f"Valid Password {password}")
>     else:
>         print(f"Invalid Password {password}")
And we can test it using an invalid password: 4390Abac! which ends with a symbol.
> validate_password("4390Abac!")
Invalid Password 4390Abac!
And 4390Abac!1 that meets the requirements.
> validate_password("4390Abac!1")
Valid Password 4390Abac!1
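When the result needs to be used by other code rather than printed, a variant that returns True or False may be more convenient. This is only a sketch using the same pattern. The names PASSWORD_PATTERN and is_valid_password are hypothetical, and the last two test passwords are made up for illustration.

```python
import re

# Same pattern as above, compiled once for reuse.
PASSWORD_PATTERN = re.compile(r"\d{4,8}[a-zA-Z]{2,}.*[^!@$%&]$")

def is_valid_password(password):
    """Return True if the password satisfies the pattern, False otherwise."""
    return PASSWORD_PATTERN.match(password) is not None

print(is_valid_password("4390Abac!"))   # False: ends with a forbidden symbol
print(is_valid_password("4390Abac!1"))  # True
print(is_valid_password("123Abc"))      # False: fewer than 4 leading digits
print(is_valid_password("Abac4390"))    # False: does not start with digits
```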
Lastly, imagine that we have to extract dates from a document. People write dates in very different ways. The month can appear as a number or as a name. The day can come after the month or before it. And so on.
In the following example, we need to extract the date as: ordinal_number of month_name year hh:mm.
So let’s build the regex:
- The ordinal number can have 1 or 2 digits. It is followed by st, nd, rd, or th (so 2 small letters): \d{1,2}[a-z]{2}. After that we have whitespace, \s, then the word of, and whitespace again, \s.
- Then, we’ll indicate that we want to match any letter (capital or small) at least one time: [a-zA-Z]+. Then, whitespace: \s.
- Then, a number of 4 digits must follow, \d{4}, and whitespace: \s.
- We then want to match a number of 1 or 2 digits for the hour, \d{1,2}, followed by a colon, :, and a number of two digits for the minutes, \d{2}.
Finally, we’ll have the following regex:
r"\d{1,2}[a-z]{2}\sof\s[a-zA-Z]+\s\d{4}\s\d{1,2}:\d{2}"
So our code and output will end up being the following:
> my_date = "Your appointment has been confirmed for 1st of september 2022 18:30"
> regex = r"\d{1,2}[a-z]{2}\sof\s[a-zA-Z]+\s\d{4}\s\d{1,2}:\d{2}"
> re.findall(regex, my_date)
['1st of september 2022 18:30']
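If the individual parts of the date are needed rather than the whole string, one possible extension is to wrap each part in parentheses so it becomes a capturing group. The grouped pattern below is a variation added for illustration, not part of the original example:

```python
import re

my_date = "Your appointment has been confirmed for 1st of september 2022 18:30"

# Same pattern, with parentheses around day, month, year, hour and minutes.
grouped = r"(\d{1,2})[a-z]{2}\sof\s([a-zA-Z]+)\s(\d{4})\s(\d{1,2}):(\d{2})"

date_match = re.search(grouped, my_date)
if date_match is not None:
    day, month, year, hour, minutes = date_match.groups()
    print(day, month, year, hour, minutes)  # 1 september 2022 18 30
```

Every captured value is a string, so numeric parts would still need int() before doing arithmetic with them.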
A powerful tool, right?
In this article, we learned that regular expressions can match normal or literal characters as well as metacharacters that can represent character classes, quantities, or locations. We explored the re module, which allows us to find, match, search, and replace patterns in a string. Lastly, we saw some examples of how regex can be used to extract data or validate expressions.
However, we just covered the basic concepts of regular expressions. And there is more to it!
An Introduction to Regular Expressions in Python Republished from Source https://towardsdatascience.com/an-introduction-to-regular-expressions-in-python-23baebfa3ac?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed