Introduction to Regular Expression (Regex) in Python

To not miss out on any new articles, consider subscribing.

What is Regex?

A regular expression (also known as regex) is a sequence of characters that define a search pattern and can be used to performs substitutions in a string. They are a very powerful tool for text manipulation and can be implemented in many languages, however, this article focuses on Python 3. This is like a cheat sheet and aims to outline the basic syntax of regex in Python 3.

To use regex in python, you import the re library.
It is advisable to use a raw string when defining patterns to avoid constantly escaping literal meta-characters in our string. A raw string is a string with a r in front.

The match method determines whether a string begins with the given pattern.

The search method finds a match of a pattern anywhere in the string.

The group method returns the string matched. start and end return the start and ending positions of the first match and span returns the start and end positions of the first match as a tuple.

The findall method returns a list of all substrings that match a pattern.

The finditer method does the same thing as findall method, except it returns an iterator, rather than a list.

Meta-characters

The dot(.) signifies any character in a string.
The circumflex accent(^) specifies that the character after it must be the first character in the string.
The dollar($) specifies that the character before it must be the last character in the string.

Here are some patterns using the three meta-characters mentioned above. Comments are written beside each LOC, indicated with # to give explanation.

Character class

A character class is created by putting the characters to be matched inside square brackets.

Character classes can also have ranges.
One thing to note is that regex is case sensitive.

You can place a (^) at the start of a character class to invert it. This causes it to match any character other than the ones included. It basically means, NOT and can be used to filter words with numbers or special characters. Other meta-characters such as ($) and (.), have no meaning within character classes.

Repetitions

The meta-character (*) means zero or more repetitions of the “previous thing”. The “previous thing” can be a single character, a class, or a group of characters in parentheses.

The meta-character (+) is very similar to (*), except it means one or more repetitions, as opposed to zero or more repetitions.
The meta-character (?) means zero or one repetitions.
The meta-character (|) means “or”. The pattern usually has 2 or more characters in a () separated by |.
Curly braces can be used to represent the number of repetitions between two numbers. The regex {x,y} means between x and y repetitions of something. If the first number is missing, it is taken to be zero. If the second number is missing, it is taken to be infinity. Hence {0,1} is the same thing as (?) meta-character explained above.

Groups

We touched on groups a bit earlier, but this section gives more explanation of the various operations capable of by groups in regex.

For groups, () and [] can be used .

The content of groups in a match can be accessed using the group function.
A call of group(0) or group() returns the whole match.
A call of group(n), where n is greater than 0, returns the nth group from the left.
The method groups() returns all groups up from 1.
Groups can be nested.

Special sequences

One useful special sequence is a backslash and a number between 1 and 99, e.g., \1 or \17. This matches the expression of the group of that number.

Another useful special sequence is (\d) and it matches digits
(\s) matches whitespace, and
(\w) matches word characters
Versions of these special sequences with upper case letter, (\D), (\S), and (\W), mean the opposite to the lower-case versions.
The sequence (\b) matches the empty string between word characters (\w) and non-word characters (\W) or word characters and the beginning/end of the string. Basically, it represents the boundary between words.
The sequence (\B) is the opposite of (\b) and it matches the empty string anywhere else.

A period(.) in a pattern is preceded by a backslash (\) to treat is a character.

These are some of my notes from learning about regular expressions in Python 3 via the SoloLearn Python Tutorial Course. The code for this article can be found here. You can use this as a cheat sheet while working with basic regular expression in Python 3, I really hope it was helpful.

Feel free to shoot me an email at contactaniekan at gmail dot com or on Twitter.

To not miss out on any new articles, consider subscribing.

Aniekan's blog

Data, Technology, Career, and Life

Introduction to Regular Expression (Regex) in Python

What is Regex?

Meta-characters

Character class

Repetitions

Groups

Special sequences

Leave a comment Cancel reply

What is Regex?

Meta-characters

Character class

Repetitions

Groups

Special sequences

Share this:

Related

Leave a comment Cancel reply