Welcome to the twenty second blog of 52-technologies-in-2016 blog series. This week I wanted to write about speech recognition but end up learning regular expressions. Regular Expressions or Regex allows you to define patterns to match text. For example, you can write a regular expression [Ss]hekh?ar
which will match Shekhar, shekhar, Shekar, or shekar
. I am programming for last eleven years but I was never comfortable with regular expressions. Whenever I see a relatively complicated regular expression, I don't understand how to break it into smaller pieces to make sense of it. So, I decided to learn about regular expressions. In this tutorial, I will walk you through a series of examples that will help you learn about regular expressions. I will end this tutorial by covering a library VerbalExpressions that you can use to programmatically build regular expressions. VerbalExpressions is implemented in most of the commonly used programming languages so you can use it with your favorite programming language.
I think there are two reasons why I never understood regular expressions.
-
You can easily google for common regular expressions so you don't spend time learning about them.
-
Although regular expressions are composable in nature but because we don't understand them well enough we are unable to break them into smaller pieces and understand them. You can always understand a complicated problem by breaking it into smaller easy to understand components.
-
A lot of time developers abuse regular expressions by using them at places where different solution will be better. There are even jokes on regular expressions. You can read about more jokes on regular expressions here.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Yes, regular expressions too have their history. Regular expression originated in 1956 when an American Mathematician Stephen Kleene was working on formalizing Regular languages. Computer Science borrows most of the good ideas from Mathematics. Mathematicians has biggest impact on computer science. Regular language is any language that can be expressed using regular expressions. First application of regular expression was an editor named QED developed by Ken Thompson. Regular expressions are present in every programming language and most of the *nix tools make use of them like find
, grep
, etc.
There are many applications of regular expressions. Some of them are mentioned below:
- Form validation
- Find and replace
- Text transformation
- Log Processing
To learn regular expression, we will learn how to write a pattern that will match all the possible spellings of Gaddafi
. Muammar Gaddafi was a Libyan revolutionary, politician, and political theorist. I found this example in a Stack Overflow question. As per the stackoverflow post, there are 30 possible spellings of Gaddafi as shown below.
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi Khadaffy Khadafy Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
I use https://www.regexpal.com/, an online tool to test regular expressions.
Let's start by matching the first spelling Gadaffi
. To match this spelling, we can simply use /Gadaffi/g
as our regular expression as shown below.
By default, matching is case-sensitive so Gadaffi
and gadaffi
are different. You can use flag \i
to make search case-insensitive.
The first and second spelling differ with each other in number of f
. Gadaffi
has two f
s and Gadafi
has one. We can define these using quantifier like /Gadaf{1,2}i/g
. The {1,2}
tell the regular expression processor that there could be minimum 1 f
and maximum 2 f
in the search text. The braces are considered metacharacters.
![](images/regex-2.png
Rather than defining a range like {1,2}
, you can also define fixed values like f{5}
which means there should be 5 f
in the search text. Also, you upper limit can be open like {2,}
which will mean 2 and above.
There are shorthand notation for common quantifier mentioned below:
*
= {0,}
, zero or more
+
= {1,}
, one or more
?
= {0,1}
, optional
The second and third spellings differ only in the last character -- i
or y
. In regular expression, you can define a set and any of the value in the set will be selected. The regular expression /Gadaf{1,2}[iy]/g
means last character could be either i
or y
.
We can match fourth and fifth spellings very easily by applying quantifiers to d
as shown below.
These two spellings differ with previous spellings in that they have h
in their spellings. This means h
is optional as it is required by some spelling and not required by others. To mark a character optional, you can use quantifier {0,1}
or you can use it shorthand notation ?
as shown below.
This spelling differ from previous one's as it uses t
instead of d
. So, we have to make a choice between either d
or t
. For this we will use |
to define an or clause as shown below.
You can do this by defining an optional h
as shown below.
We can very easily match this spelling by using a set of a
and e
.
Spelling 14-17 and 19-21: Matching Kadaffi
, Kadafi
, Kaddafi
, Kadhafi
, Khadaffy
, Khadafy
, and Khaddafi
To match these the first character has to be K
. We have to define a set of [GK]
to match either G
or K
as the first character as shown below.
This can be very easily matched by adding '
to the [dtz]
set as shown below.
VerbalExpressions
is a library written for most programming languages that helps you construct difficult regular expressions with ease. In this tutorial, I will show you how to use Java implementation of VerbalExpressions
.
If you are using Gradle project then you can add following to your build.gradle
.
compile "ru.lanwen.verbalregex:java-verbal-expressions:1.4"
Let's write a simple test to match
method using VerbalExpression
that will match either Gadaffi
or Gadafi
.
import ru.lanwen.verbalregex.VerbalExpression;
public class GaddafiSpellingMatcher {
public static boolean match(final String spelling) {
VerbalExpression verbalExpression = VerbalExpression
.regex()
.startOfLine()
.then("Gada")
.then("f").count(1, 2)
.then("i")
.endOfLine()
.build();
return verbalExpression.test(spelling);
}
}
In the code shown above, we used the VerbalExpression
fluent API to build a regular expression. We started with creating the expression builder using the VerbalExpression.regex()
. Then, we build the regular expression by specifying our clauses -- start the line with Gada
, then there will be either one or two f
, finally we will have an i
in the end.
To match all the names starting with either G
or K
we wrote following VerbalExpression
code.
import ru.lanwen.verbalregex.VerbalExpression;
public class GaddafiSpellingMatcher {
public static boolean match(final String spelling) {
VerbalExpression verbalExpression = VerbalExpression
.regex()
.startOfLine()
.anyOf("GK")
.maybe("h")
.any("ae")
.anyOf("tdz").count(1, 2)
.maybe("h")
.then("a")
.then("f").count(1, 2)
.anyOf("iy")
.endOfLine()
.build();
return verbalExpression.test(spelling);
}
}
To match all the 30 names, the code looks like as shown below.
import ru.lanwen.verbalregex.VerbalExpression;
public class GaddafiSpellingMatcher {
public static boolean match(final String spelling) {
VerbalExpression verbalExpression = VerbalExpression
.regex()
.startOfLine()
.anyOf("GKQ")
.maybe(VerbalExpression.regex().anyOf("uh"))
.maybe(VerbalExpression.regex().any("ae"))
.anyOf("tdz'").count(1, 2)
.maybe("h")
.maybe("dh")
.then("a")
.then("f").count(1, 2)
.anyOf("iy")
.endOfLine()
.build();
return verbalExpression.test(spelling);
}
}
You can look at all the code here.
That's all for this week. Please provide your valuable feedback by adding a comment to shekhargulati#28.