What are regular expressions and how to interpret them
Regular expressions are patterns used to search text or code. One pattern can match many different strings, as long as each string meets the requirements of the pattern. Regexes (as they are also called) are used in many programming languages and in some Apache directives. If you see an odd-looking combination of letters, digits and symbols in the redirect or rewrite directives of your .htaccess files, you're probably looking at a regular expression. In this article we'll briefly outline what some of the special characters (meta-characters) used in regular expressions mean.
Letters and digits are matched literally. For example, an a in a regex matches the letter a. Meta-characters have a special meaning:
- The dot . means any single character. This includes letters, digits, symbols and whitespace (in most regex engines the dot does not match a newline character by default). For example, the regex d. will match da, db, do, d3, d#, etc.
- The asterisk (star) * indicates that the character after which it appears can be repeated zero or more times. For example, the regex do* will match d, do, doo, etc.
- The plus + means that the character/pattern after which it appears can be repeated one or more times. For example, the regex do+ will match do, doo, etc.
- The question mark ? indicates that the preceding character is optional: it can appear zero times or once. For example, the regular expression do? will match d and do.
- The round brackets ( ) group characters. This lets you apply other meta-characters to a whole group and not just to a single character. For example, the regex (do)+ will match do, dodo, dododo, etc.
- The square brackets [] are for setting up character classes. The regex will match any single character that's within the specified character class. For example, the regex [do] will match d and o. You can use quantifying meta-characters after the character class; for example, the regular expression [do]+ will match d, o, do, ddo, ddoo, dddo, etc. By using hyphens you can specify whole ranges within the character class; for example, [a-zA-Z] will match any single lowercase or uppercase letter of the English alphabet.
- The curly brackets {} are for specifying how many times a particular character or pattern should match. You can specify the minimum and maximum number of times, just the minimum, or an exact number. For example, (do){1,3} will match do, dodo, and dododo. The regex (do){2,} will match dodo, dododo, dodododo, and any larger number of repetitions. The regex (do){2} will match only dodo.
- The backslash \ tells the regex engine to interpret a special character literally. It's placed before the character that you want to be taken literally. For example, the regex do\+ will match do+.
- The caret ^ and the dollar sign $ are anchors. The caret marks the beginning of a string and the dollar sign the end. So, for example, if you use the regex ^do to search the string do do, it will match only the first do. If you search the same string with the regex do$ it will match the second do. And if you use the regex ^do$ for this string there will be no matches.
- The shorthand classes are \d \w \s \D \W \S. The shorthand \d means any digit and \D means any character that's not a digit. \w is for any word character (that is, letters, digits and the underscore), while \W is used for any character that's not a word character. \s means any whitespace character and \S any character that's not whitespace.
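The rules above can be checked with a short script. The following is a minimal sketch that uses Python's built-in re module, chosen here only for illustration: the article does not prescribe a language, and regex engines differ in some details (Apache, for instance, uses PCRE), although the meta-characters covered above should behave the same way for these examples. The helper names matches_whole and match_spans are hypothetical and not part of any library.

```python
import re


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


def match_spans(pattern, text):
    """Return the (start, end) position of every match of the pattern in the text."""
    return [m.span() for m in re.finditer(pattern, text)]


# Dot: any single character.
print(matches_whole(r"d.", "d#"))             # True

# Quantifiers: * (zero or more), + (one or more), ? (zero or one).
print(matches_whole(r"do*", "d"))             # True
print(matches_whole(r"do+", "d"))             # False
print(matches_whole(r"do?", "doo"))           # False

# Grouping and curly-bracket repetition counts.
print(matches_whole(r"(do)+", "dododo"))      # True
print(matches_whole(r"(do){1,3}", "dododo"))  # True
print(matches_whole(r"(do){2,}", "do"))       # False

# Character classes and ranges.
print(matches_whole(r"[do]+", "ddoo"))        # True
print(matches_whole(r"[a-zA-Z]", "_"))        # False

# Backslash: treat the plus sign literally.
print(matches_whole(r"do\+", "do+"))          # True

# Anchors, searched against the string "do do".
print(match_spans(r"^do", "do do"))           # [(0, 2)]
print(match_spans(r"do$", "do do"))           # [(3, 5)]
print(match_spans(r"^do$", "do do"))          # []

# Shorthand classes.
print(re.findall(r"\d", "room 42"))            # ['4', '2']
print(re.findall(r"\w+", "user_1 logged-in"))  # ['user_1', 'logged', 'in']
print(re.findall(r"\S+", "a b\tc"))            # ['a', 'b', 'c']
```

re.fullmatch is used in matches_whole so that each check tests the whole string; re.search or re.finditer would also report a match that covers only part of it, which is why they are used for the anchor examples.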
For more details and more examples you can also check out our regular expressions tutorial.