Hello guys, if you want to learn regular expressions in Java better, you must remember the meaning of all the special characters. They are what make a regular expression look complex, but once you know and understand them, you can easily read at least half of the regular expressions you encounter in Java applications. They are also known as reserved characters. In a regular expression, a character denotes itself unless it is one of the special characters. For example, the regular expression "a" will match the letter "a" and return true if the input is "a" and false otherwise, but "a*" will not match the input "a*"; instead it will match any input that consists only of the letter "a", e.g. "a", "aaa", or "aaaa". It will also match the empty String because * means zero or more times, so a* means "a" appearing zero or more times, as shown below:
regex: a*, input: aaaaa, matches: true
regex: a*, input: , matches: true
regex: a*, input: abcd, matches: false
In this article, I'll list down the meaning of Java regular expression special characters and give some examples of how they work.
What are the reserved characters in Regular expression in Java?
Here is the list of reserved characters in Java regular expressions:
1) Dot .
The dot character (.) stands for "any single character". For example, ".oney" will match both "money" and "honey", but it will not match "Phoney" because the dot matches only one character and here we have two characters, "Ph". Here is the sample output:
regex: .oney, input: money, matches: true
regex: .oney, input: honey, matches: true
regex: .oney, input: Phoney, matches: false
I am using the Pattern.matches() method to generate this output, as shown in the Java program later in this article. Pattern.matches() takes two arguments, the regular expression as a String and the input, and returns true only if the entire input matches the regular expression.
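The sample output above can be reproduced with a few lines of Java. This is only a minimal sketch: it assumes Pattern.matches(regex, input), the same call used in the complete program later in this article, and the class name DotDemo and the helper matches() are made up for illustration.

```java
import java.util.regex.Pattern;

class DotDemo {

    // Returns true only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String regex = ".oney";
        // The dot stands for exactly one character, so "Phoney" fails.
        for (String input : new String[] {"money", "honey", "Phoney"}) {
            System.out.printf("regex: %s, input: %s, matches: %b%n",
                    regex, input, matches(regex, input));
        }
    }
}
```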
2) Asterisk or star *
As I said in the first paragraph, the star (*) character matches zero or more occurrences of the character that precedes it, e.g. "be*" will match both "be" and "bee", but it will not match "been" because there is nothing in the regular expression to match the letter "n". Here is the sample output:
regex: be*, input: be, matches: true
regex: be*, input: bee, matches: true
regex: be*, input: been, matches: false
If you want to match "been", just change the regular expression to "be*n" and it will match the String "been". You can also use the dot special character, as in "be*."; it will match "been", but it will also match "beep", as shown below:
regex: be*., input: be, matches: true
regex: be*., input: bee, matches: true
regex: be*., input: been, matches: true
regex: be*., input: beep, matches: true
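One detail behind these results is that the output comes from a whole-input match. The sketch below contrasts that with Matcher.find(), which looks for the pattern anywhere inside the input. The class name StarDemo and the helpers wholeMatch() and firstFind() are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarDemo {

    // Whole-input match, as in the sample output above.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // First substring matched by the regex, or null if there is none.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("be*", "been"));   // false, the "n" is left over
        System.out.println(firstFind("be*", "been"));    // bee
        System.out.println(wholeMatch("be*n", "been"));  // true
        System.out.println(wholeMatch("be*.", "beep"));  // true
    }
}
```

So "be*" does find "bee" inside "been"; it only fails when the whole input has to match.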
By the way, the regex special characters *, ?, + are also known as Quantifiers in Java regular expression because they allow you to specify the number of occurrences to match against.
3) Question Mark ?
The question mark character (?) is another quantifier, also known as the optional quantifier because ? means zero or one occurrence of the character that precedes it. For example, "ca?" will match both "c" and "ca", but it will not match "caa" or "cat" because "a", the character that precedes the "?" mark, can appear only zero or one time, and in "caa" it appears two times, as shown in the following example:
regex: ca?, input: c, matches: true
regex: ca?, input: ca, matches: true
regex: ca?, input: caa, matches: false
regex: ca?, input: cat, matches: false
You can see that "caa" and "cat" don't match the regular expression "ca?" because "a" comes twice in "caa" and there is a "t" in "cat".
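Here is a small sketch of the optional quantifier in code. The "ca?" cases mirror the output above; the "colou?r" pattern is an extra example I have added, and the class name OptionalDemo and helper matches() are illustrative only.

```java
import java.util.regex.Pattern;

class OptionalDemo {

    // Returns true only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "a?" allows the letter 'a' zero or one time after 'c'.
        System.out.println(matches("ca?", "c"));    // true
        System.out.println(matches("ca?", "ca"));   // true
        System.out.println(matches("ca?", "caa"));  // false

        // Added example: the 'u' is optional, so both spellings match.
        System.out.println(matches("colou?r", "color"));   // true
        System.out.println(matches("colou?r", "colour"));  // true
    }
}
```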
4) Plus +
The plus symbol (+) is also a quantifier in Java regular expressions; it means "one or more" times. For example, "ca+" will match both "ca" and "caa", but it will not match "c" because + means the preceding character, which is "a" in the regex "ca+", must appear one or more times.
Since "a" doesn't appear in the input "c", it won't match the regular expression "ca+". The plus + quantifier is also known as the "at least one time" quantifier.
Here are some examples of using the plus quantifier in a Java regular expression:
regex: ca+, input: c, matches: false
regex: ca+, input: ca, matches: true
regex: ca+, input: caa, matches: true
regex: ca+, input: cat, matches: false
You can see that both "ca" and "caa" match, but "c" and "cat" don't.
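The difference between + and * is easiest to see side by side. The following is a minimal sketch; the class name PlusDemo and the two helper methods are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class PlusDemo {

    // "ca+": a 'c' followed by one or more 'a'.
    static boolean atLeastOneA(String input) {
        return Pattern.matches("ca+", input);
    }

    // "ca*": a 'c' followed by zero or more 'a'.
    static boolean zeroOrMoreA(String input) {
        return Pattern.matches("ca*", input);
    }

    public static void main(String[] args) {
        for (String input : new String[] {"c", "ca", "caa", "cat"}) {
            System.out.printf("input: %s, ca+: %b, ca*: %b%n",
                    input, atLeastOneA(input), zeroOrMoreA(input));
        }
    }
}
```

Only the input "c" is treated differently by the two patterns: it matches "ca*" but not "ca+".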
5) Braces {}
The opening and closing braces {} are used to specify other multiplicities, e.g. to match a pattern exactly k times you write p{k}. For example, p{1} will match just "p", not "pp". Similarly, p{2} will match "pp" but not "p" or "ppp", as shown in the following example. This is known as the exact match quantifier.
regex: p{1}, input: p, matches: true
regex: p{1}, input: pp, matches: false
regex: p{1}, input: ppp, matches: false
You can see p{1} only matches with "p", similarly let's see which input regex p{2} matches:
regex: p{2}, input: p, matches: false
regex: p{2}, input: pp, matches: true
regex: p{2}, input: ppp, matches: false
You can see it only matches "pp" and nothing else, because {k} means the preceding character must appear exactly k times. But you can also convert this into an "at least" quantifier. To require a pattern to appear at least k times, add a comma after the number, e.g. p{2,} will match "pp", "ppp" and "pppp", as shown below:
regex: p{2,}, input: p, matches: false
regex: p{2,}, input: pp, matches: true
regex: p{2,}, input: ppp, matches: true
The first input "p" didn't match because p{2,} requires "p" to occur at least 2 times.
You can also put an upper limit on the number of occurrences by adding a second number inside the braces, e.g. p{2,3} will match at least 2 characters but not more than 3. This means "pp" and "ppp" will match but "ppppp" will not, as shown below:
regex: p{2,3}, input: p, matches: false
regex: p{2,3}, input: pp, matches: true
regex: p{2,3}, input: ppp, matches: true
regex: p{2,3}, input: ppppp, matches: false
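To compare the three brace forms in one run, a sketch along these lines could be used. The class name BraceDemo and the helpers ps() and matches() are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;

class BraceDemo {

    // Builds a String made of n 'p' characters.
    static String ps(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('p');
        }
        return sb.toString();
    }

    // Returns true only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // exactly 2, at least 2, between 2 and 3
        String[] regexes = {"p{2}", "p{2,}", "p{2,3}"};
        for (String regex : regexes) {
            for (int n = 1; n <= 5; n++) {
                String input = ps(n);
                System.out.printf("regex: %s, input: %s, matches: %b%n",
                        regex, input, matches(regex, input));
            }
        }
    }
}
```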
6) Opening bracket [
The bracket [] special characters in Java regular expressions are used to denote character classes. A character class is nothing but a set of character alternatives enclosed in brackets, e.g. [aA] will match either "a" or "A".
Inside a character class the "-" denotes a range (all characters whose Unicode values fall between the two bounds), e.g. [A-Z] will match any character between capital "A" and capital "Z". Similarly, [0-9] will match any digit, as we used in the earlier example to check if a given String is numeric in Java.
However, if you use "-" in the first or last position inside the brackets, then its literal meaning will be used; it won't be special anymore.
Similarly, if the caret character ^ comes in the first position inside the brackets, then the character class denotes the complement (all characters except those specified), for example [^0-9] will match any non-digit character.
There are also many predefined character classes in Java, e.g. \d will match any digit and is equivalent to [0-9]. The regular expression \s will match any whitespace character.
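The character class rules above can be tried out with a short program. This is a sketch under my own assumptions: the inputs are examples I picked, and the class name CharClassDemo and helper matches() are illustrative. Note that \d and \s are written with a double backslash in Java source.

```java
import java.util.regex.Pattern;

class CharClassDemo {

    // Returns true only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[aA]", "A"));       // true, one of the alternatives
        System.out.println(matches("[A-Z]", "q"));      // false, lowercase is outside the range
        System.out.println(matches("[0-9]+", "2024"));  // true, one or more digits
        System.out.println(matches("[^0-9]", "x"));     // true, any non-digit character
        System.out.println(matches("[a-]", "-"));       // true, "-" in last position is literal
        System.out.println(matches("\\d+", "2024"));    // true, \d is the same as [0-9]
        System.out.println(matches("\\s", " "));        // true, a space is whitespace
    }
}
```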
7) Parentheses ()
The parentheses () are used for grouping. They are also used to combine things, for example p(o|e)p will match either "pop" or "pep", but po|ep will match either "po" or "ep". That is a big difference in meaning with and without parentheses. Groups are one of the most difficult concepts in Java regular expressions and are out of the scope of this article. I will write a separate article sometime to explain how to use groups with Java regular expressions.
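The difference between the grouped and ungrouped pattern can be checked with a sketch like the one below. The class name GroupingDemo and the two helper methods are hypothetical names for this example only.

```java
import java.util.regex.Pattern;

class GroupingDemo {

    // With parentheses: 'p', then 'o' or 'e', then 'p'.
    static boolean grouped(String input) {
        return Pattern.matches("p(o|e)p", input);
    }

    // Without parentheses: either "po" or "ep".
    static boolean ungrouped(String input) {
        return Pattern.matches("po|ep", input);
    }

    public static void main(String[] args) {
        for (String input : new String[] {"pop", "pep", "po", "ep"}) {
            System.out.printf("input: %s, p(o|e)p: %b, po|ep: %b%n",
                    input, grouped(input), ungrouped(input));
        }
    }
}
```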
8) Pipe |
The pipe (|) special character in Java regular expressions is used to perform an OR operation. For example, "(b|c)ook" will match both "book" and "cook".
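As a rough illustration, the same pattern can also be used to pull every "book" or "cook" out of a longer text with Matcher.find(). The sentence used as input, the class name PipeDemo and the helper findAll() are my own additions for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeDemo {

    // Collects every substring of the text that the regex matches.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [cook, book]
        System.out.println(findAll("(b|c)ook", "a cook read a book"));
    }
}
```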
9) backslash \
The backslash \ character is used to escape a special character. If you escape a special character, it is no longer special and the regular expression engine uses its normal, literal meaning, e.g. to match the star character in the input String you can use \*, which is written as "\\*" in a Java String literal because the backslash itself must be escaped in Java source code.
You can also use certain special characters as normal characters inside a character class. For example, you only need to escape [ and \ inside a character class, provided you are careful with the position of ], -, and ^.
For example, []^-] will match ], ^ and - literally, as we have seen in how to split a CSV string in Java.
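Here is a hedged sketch of escaping in practice. Besides a hand-written backslash, it uses Pattern.quote(), a standard java.util.regex method that turns a whole String into a literal pattern; the article itself does not mention it, so treat it as my suggestion. The class name EscapeDemo and both helpers are illustrative names.

```java
import java.util.regex.Pattern;

class EscapeDemo {

    // "a\\*" in Java source is the regex a\* : an 'a' followed by a literal star.
    static boolean matchesLiteralStar(String input) {
        return Pattern.matches("a\\*", input);
    }

    // Pattern.quote() escapes every special character in the given literal.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(matchesLiteralStar("a*"));                // true
        System.out.println(matchesLiteralStar("aaa"));               // false
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));       // false, + is a quantifier here
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));      // true
    }
}
```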
10) caret ^
The caret ^ stands for the beginning of a line. In Java regular expressions, ^ is a metacharacter that, when used at the beginning of a pattern, specifies that the pattern should match only at the start of a line. By default this means the start of the whole input; with the Pattern.MULTILINE flag it also matches at the start of every line.
For example, the regular expression ^Hello would match any line that starts with the word "Hello".
Here's a simple Java code snippet demonstrating the use of ^:
import java.util.regex.*;

public class RegexExample {
    public static void main(String[] args) {
        String text = "Hello, World!\nHello, Java!";
        String pattern = "^Hello";
        Pattern regex = Pattern.compile(pattern, Pattern.MULTILINE);
        Matcher matcher = regex.matcher(text);
        while (matcher.find()) {
            System.out.println("Found match: " + matcher.group());
        }
    }
}
In this example, the Pattern.MULTILINE flag is used to make ^ match the beginning of each line in a multiline string. The output will be:
Found match: Hello
Found match: Hello
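To see what the flag changes, the sketch below counts the matches of ^Hello in the same text with and without Pattern.MULTILINE. The class name CaretDemo and the helper countMatches() are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretDemo {

    // Counts how many times the regex is found in the text with the given flags.
    static int countMatches(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hello, World!\nHello, Java!";
        // Without flags, ^ matches only at the start of the whole input.
        System.out.println(countMatches("^Hello", 0, text));                  // 1
        // With MULTILINE, ^ also matches at the start of each line.
        System.out.println(countMatches("^Hello", Pattern.MULTILINE, text));  // 2
    }
}
```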
11) Dollar sign $
The $ sign stands for the end of a line. In Java regular expressions, $ is a metacharacter that, when used at the end of a pattern, specifies that the pattern should match only at the end of a line. By default this means the end of the whole input; with the Pattern.MULTILINE flag it also matches at the end of every line.
For example, the regular expression world$ would match any line that ends with the word "world".
Here's a simple Java code snippet demonstrating the use of $:
import java.util.regex.*;

public class RegexExample {
    public static void main(String[] args) {
        // The first line must end with "World" (no trailing "!") for world$ to match
        String text = "Hello, World\nJava is wonderful";
        String pattern = "world$";
        Pattern regex = Pattern.compile(pattern, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        Matcher matcher = regex.matcher(text);
        while (matcher.find()) {
            System.out.println("Found match: " + matcher.group());
        }
    }
}
In this example, the Pattern.MULTILINE flag is used to make $ match the end of each line in a multiline string, and Pattern.CASE_INSENSITIVE is used to make the pattern case-insensitive.
The output will be:
Found match: World
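The position of the word matters a lot with $. The following sketch uses a three-line text of my own to show which occurrences of "world" sit at the end of a line, and what happens when Pattern.MULTILINE is left out. The class name DollarDemo and helper findAll() are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DollarDemo {

    // Collects every substring of the text that the regex matches with the given flags.
    static List<String> findAll(String regex, int flags, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Hello world\nBig World\nworld peace";

        // Lines 1 and 2 end with the word; line 3 only starts with it.
        int flags = Pattern.MULTILINE | Pattern.CASE_INSENSITIVE;
        System.out.println(findAll("world$", flags, text));  // [world, World]

        // Without MULTILINE, $ refers to the end of the whole input, which is "peace".
        System.out.println(findAll("world$", Pattern.CASE_INSENSITIVE, text));  // []
    }
}
```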
Java Program to demonstrate Regular Expression Meta characters
Now, here is the complete Java program to demonstrate the meaning of all meta characters or special characters in Java.
import java.util.regex.Pattern;

/*
 * Java Program to demonstrate meaning of reserved
 * characters in regular expression, also known
 * as special character or meta characters.
 *
 */
public class Main {

    public static void main(String[] args) {
        String regex = "p{2,3}";

        String input = "p";
        System.out.printf("regex: %s, input: %s, matches: %b %n",
                regex, input, Pattern.matches(regex, input));

        input = "pp";
        System.out.printf("regex: %s, input: %s, matches: %b %n",
                regex, input, Pattern.matches(regex, input));

        input = "ppp";
        System.out.printf("regex: %s, input: %s, matches: %b %n",
                regex, input, Pattern.matches(regex, input));

        input = "ppppp";
        System.out.printf("regex: %s, input: %s, matches: %b %n",
                regex, input, Pattern.matches(regex, input));
    }
}
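And here is a list of the Java regular expression meta characters covered above and their meanings, for quick revision:
- . (dot): any single character
- * (star): zero or more occurrences of the preceding character
- ? (question mark): zero or one occurrence of the preceding character
- + (plus): one or more occurrences of the preceding character
- {k}, {k,}, {j,k} (braces): exactly k, at least k, or between j and k occurrences
- [] (brackets): a character class, e.g. [aA], [A-Z], [^0-9]
- () (parentheses): grouping
- | (pipe): OR
- \ (backslash): escapes a special character
- ^ (caret): beginning of a line
- $ (dollar): end of a line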
That's all about the meaning of special characters in Java regular expressions. If you have to reuse a pattern multiple times, prefer compiling it once with Pattern.compile() and calling matcher(input).matches() on the resulting Pattern, instead of calling Pattern.matches(regex, input) each time, because Pattern.matches() compiles the regular expression on every call.
In short, a precompiled Pattern will be faster if you have to compare thousands of input Strings against the same regular expression.
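A common way to reuse a regular expression is to compile it once into a Pattern constant and create a Matcher per input. The sketch below shows that idea; the class name ReusePatternDemo, the constant TWO_OR_THREE_P and the helper isValid() are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class ReusePatternDemo {

    // Compiled once and reused for every input.
    private static final Pattern TWO_OR_THREE_P = Pattern.compile("p{2,3}");

    // Creates a Matcher for the input and checks the whole input against the pattern.
    static boolean isValid(String input) {
        return TWO_OR_THREE_P.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"p", "pp", "ppp", "ppppp"}) {
            System.out.printf("input: %s, matches: %b%n", input, isValid(input));
        }
    }
}
```

The results are the same as with Pattern.matches("p{2,3}", input); only the compilation step is done a single time.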
Other Java Regular expression tutorials from this blog:
- How to split a comma separated String in Java
- My favorite courses to learn Regular expression in Java
- How to remove all special characters from String in Java?
- How to check if String is numeric in Java?
- How to split a String by whitespace or tab in Java?
- How to check if given String is number in Java?
- 2 ways to split String by Dot (.) in Java?