A Step-by-Step Guide for Learning Regular Expressions : A Guide with Real-Life Usage
Revealing the Power of RegEx as well as it’s Practical Scenarios
What are Regular Expressions?
Regular expressions, often referred to as regex, function by establishing patterns that enable you to seek out specific characters or words within strings.
Once you establish the template you wish to use, you gain the ability to perform actions like making changes, removing particular characters or words, swapping one element with another, pulling out important details from a file or any text that includes that specific template, and more.
Why Should You Learn Regex?
Learning regex can revolutionize the way you handle text, saving you plenty of time and even adding an element of enjoyment to the process.
Using regex simplifies finding specific information within the text. Once you pinpoint what you’re looking for, you can easily make widespread changes, replacements, deletions, and other text adjustments.
Regex proves incredibly useful in various real-life scenarios. It’s handy for renaming multiple files at once, analyzing logs, confirming the accuracy of form submissions, making extensive modifications to a codebase, and conducting thorough searches.
In this article, we’ll start with the fundamentals of regex, using this website to assist us. Later, I’ll present some regex challenges for you to tackle using Python.
Similar to many things in life, comprehending regular expressions truly comes through practical experience. I encourage you to experiment with regex as you navigate through this article.
Regex: The Basics
A regular expression is basically a sequence of characters that follows a certain pattern. In addition to using regular characters (like abc
), there are special characters (*
, +
, ?
, and so on) with specific purposes. You can also use character groups to simplify your regular expressions.
Before you create a regex, it’s important to grasp all the typical scenarios and special cases that match the pattern you’re aiming for.
For instance, if you want to find ‘Hello World’, do you need the line to start exactly with ‘Hello’, or can it start differently? Is there only supposed to be one space between ‘Hello’ and ‘World’, or can there be more? Can there be other characters after ‘World’, or should the line end there? Do you want to consider uppercase and lowercase letters separately? These are the types of questions you should have answers to before you start writing your regex.
Having a clear understanding of these aspects is crucial when you’re about to craft your regex.
Exact match
The most basic form of regex involves matching a sequence of characters in a similar way to what you can do with Ctrl-F in a text editor.
At the top, you’ll notice how many matches there are. And at the bottom, there’s an explanation that shows what the regex is matching, step by step.
Character set
Regex character sets let you match any single character from a selection of characters. This selection is enclosed in square brackets [].
For instance, B[au]g matches both Bag
and Bug
. In this case, B
and g
are constants, but between them, you can have either a
or u
.
Now take a Look at the Explanation:
Matching ranges in regex:
At times, you might want to match a series of characters that follow a sequence, like all uppercase English letters. However, listing all 26 letters individually would be cumbersome.
Regex provides a solution through ranges. The -
symbol functions as a range
operator. Here are some examples of valid ranges:
You can also indicate partial ranges. For instance, [b-e] matches any of the letters ‘b’, ‘c’, ‘d’, or e. Similarly, [3-6]
matches any of the numbers 3
,4
,55
or 6
.
You’re not restricted to just one range when you’re defining a character set. You can include multiple ranges and even mix them with other characters. For instance, [1-5a-k]
will match any of the characters 1
to 5
and a
to k
.
Matching characters not in a set:
When you start the set with a ^
symbol, it does the opposite. For instance, [A-Z0-9]
will match anything except uppercase letters and digits.
Character Classes:
When crafting regex, you often have to repeatedly match certain groups, like digits. You might even need to do this several times within the same expression.
For example, think about how you’d match a pattern like ‘letter-digit-letter-digit’.
Using what you’ve learned so far, you could create
[a-zA-Z]-[0–9]-[a-zA-Z]-[0–9].
This works, but you can imagine how complex the expression would become as the pattern grows longer.
To simplify things, regex has predefined classes for common character groups, like digits. The table below displays these classes and their equivalent expressions using character sets:
Character classes prove to be really useful as they simplify your expressions considerably. We’ll be using them a lot throughout this guide, so you can always refer to this table if you forget any of the classes.
Usually, we might not be concerned about every single position in a pattern. This is where the .
class comes in handy, as it spares us from listing out all the possible characters in a set.
For instance, t..
matches anything that begins with t
and is followed by any two characters. This might remind you of the SQL LIKE
operator, which uses t%%
to achieve a similar outcome.
Quantifiers
When dealing with patterns, repetition often comes into play. For instance, if you want to match a 3-digit number, you could use\d\d\d
. But what if you needed to match 11 digits? Typing \d
eleven times is quite tedious. In programming, including regex, it’s a good rule of thumb to avoid excessive repetition.
That’s where quantifiers in regex come in handy. Instead of writing the same thing multiple times, you can use quantifiers. To match 11 digits, you can simply use the expression \d{11}
.
The following table outlines the quantifiers available for use in regex:
In this instance, the expression can\s+write
finds occurrences of can
followed by one or more spaces and then followed by write
. However, note that canwrite
isn’t considered a match because \s+
requires at least one space to be present. This feature is particularly useful when you’re searching through text that isn’t trimmed.
Can you guess what can\s?write
will match?
Capture Groups
Capture groups are like smaller parts within a larger expression that you put inside parentheses (). You can have many capture groups, and they can even be inside each other.
For example, the expression (The ){2}
matches ‘The ‘ two times. But if there’s no capture group, the expression The {2}
would match ‘The’ followed by 2 spaces, because the counting rule applies to the space character and not the whole ‘The ‘ as a group.
You can put any pattern inside capture groups just like you do with regular regex. For example, (is\s+){2}
matches when it discovers ‘is’ followed by one or more spaces occurring twice.
Using logical OR in regex
You can utilize “|” to match different patterns. For example, (good|bad|sweet)
matches ‘This is ‘ followed by either good
, bad
, or sweet
.
Again, you must understand the importance of capture groups here. Think about what the expression You are good|bad|sweet
would match?
When using a capture group, like (good|bad|sweet), the match for “good”, “bad”, or “sweet” is separated from the preceding “This is”. However, if it’s not within a capture group, the entire regex is treated as a single group. Therefore, the expression This is good|bad|sweet would match the string if it contains ‘This is good’, ‘bad’, or ‘sweet’.
How to Use Capture Group References
Capture groups can be referred to within the same expression or when making replacements, as shown in the Replacement section.
Most tools and programming languages allow you to refer to a captured group with ‘\n’, where ’n’ represents the group’s order. On this website, ‘$n’ is used for referencing during replacement. The syntax for replacement may differ based on the tools or language you’re using. For instance, in JavaScript, it’s ‘$n’, while in Python, it’s ‘\n’.
In the expression You(This) is \1 power
, the group capturing ‘This’ is referenced as ‘\1’, effectively matching This is This power
.
How to Give Names to Capture Groups
You can assign names to your capture groups using the format (?<name>pattern)
, and then refer to them later in the same expression with is\k<name>
.
During replacement, referencing is achieved using $<name>
in JavaScript’s syntax, although this can vary across programming languages. Be aware that this feature might not be available in all languages.
In the expression:
(?<lang>[\w+]+) is the best but \k<lang>.*
the pattern [\w+]+
is captured with the name ‘lang’ and later referred to with \k<lang>
. This pattern matches one or more word characters or the ‘+’ character. The .*
at the end matches any character zero or more times. When replacing, referencing is done using a$<lang>
.
How to Utilize Regex with Command Line Tools
Handy command line tools are at your disposal for performing regex tasks right from your terminal. These tools prove even more time-efficient, allowing you to experiment with various regex patterns without needing to write and compile code in a programming language.
Notable tools in this realm include grep, sed, and awk. Below, I’ll provide a few examples to offer insight into how you can make the most of these tools.
Recursive Regex Search Using Grep
You can harness the capabilities of regex with the help of grep. Grep can look for patterns in a file or even perform searches within directories and subdirectories.
For Windows users, you can install grep using winget. Just enter this command in PowerShell:
winget install -e --id GnuWin32.Grep
You can learn more about grep and its command line options here
Using Substitution with sed
You can use sed to carry out tasks like adding, removing, or replacing text within text files, all by providing a regex pattern. If you’re on Windows, you can get sed from a specific source. Alternatively, if you’re using WSL, tools like grep and sed are already available.
The following is a common way to use sed:
sed 's/pattern/replacement/g' filename
echo "${text}" | sed 's/pattern/replacement/g'
In this case, the g
option is used to replace all occurrences.
There are a couple of other helpful options as well. You can use the -n
option to prevent the default behavior of printing all lines, and you can use p
instead of +
to print only the lines that are influenced by the regex.
Let’s now examine the text given below 👇
Hello rand chars World 56 rand chars
Henlo 52 rand chars W0rld rand chars
GREP rand chars Henlo 62 rand chars
Henlo 10 rand chars Henlo rand chars
GREP rand chars Henlo 45 rand chars
Our goal is to change instances of Henlo
followed by a number to Hello
followed by the same number, but only in lines where “GREP” appears. To do this, we’re looking for the pattern Henlo ([0-9]+)
, which matches ‘Henlo ‘ followed by one or more digits. All the digits are captured. The replacement will be Hello \1
— where ‘\1’ refers to the captured digits.
One approach to achieve this is by using the combination of grep and sed. First, we can use grep to identify lines containing “GREP,” and then we can utilize sed to make the replacements accordingly.
grep "GREP" texts.txt | sed -En 's/Henlo ([0-9]+)/Hello \1/p'
The “-E” option enables extended regex without which you would need to escape the parentheses.
Or you could just use sed. Use /pattern/
to restrict substitution on only the lines where the pattern is present.
sed -En '/GREP/ s/Henlo ([0-9]+)/Hello \1/p' texts.txt
You can read more about sed By Clicking The Link.
Some Real Life Examples of using Regex
1) Defining Logs Parsing Criteria
Here’s how we can outline the requirements for parsing logs using regex:
The log entries must consist of a minimum of 8 characters but should not exceed 16 characters.
Each log entry must include at least one lowercase letter.
Moreover, it should contain at least one uppercase letter.
Furthermore, it must encompass at least one number.
2) Crafting Bulk File Renaming
We can design a set of criteria for bulk file renaming using a regex:
The filenames must be at least 8 characters long but not exceed 16 characters.
Each filename must incorporate at least one lowercase letter.
Additionally, it should involve at least one uppercase letter.
Lastly, it ought to encompass at least one number.
3) Establishing Email Validation Criteria
Here’s how we can create a regex to ensure the following conditions for email validation:
The email address must be at least 8 characters long but not exceed 16 characters.
It must contain at least one lowercase letter.
It should also include at least one uppercase letter.
Furthermore, it should have at least one number.
4) Setting Password Requirements
we can formulate a regex that enforces these conditions:
The password must be at least 8 characters long but no more than 16 characters.
It must include at least one lowercase letter.
It must also contain at least one uppercase letter.
Additionally, it should have at least one number.
you can read some more examples here: Click to watch
I hope this guide has given you valuable practice. Don’t worry about memorizing every detail; understanding the core concepts matters most. If you forget a pattern, you can always search or use a cheat sheet.
As you practice, you’ll become more self-reliant. You can find useful cheat sheets online, like the one from QuickRef, for syntax and language-specific functions.
Remember, regex techniques are consistent across languages, though tools may offer extras. Research your tool to choose wisely.
Don’t use regex just for the sake of it; simpler methods often suffice. But in terminal work, regex is a potent tool.
So, Let’s end this Blog Today. I hope these tools will surely help you in your Daily Life. If you are still here and Reading this. Consider Following me and Subscribing to Email Notifications to Stay Updated with the crazy Content I will upload.
For any Discussions, you can add a comment.