jwallace.us

tech, tunes, and other stuff

Regular Expressions in PHP

PHP support for regular expressions is built into the language. They are commonly called Perl-compatible regular expressions, or PCRE for short. Note that this is not meant to be a comprehensive discussion of PCRE; for a complete description, see the PCRE section of the PHP manual. That being said, there are just a few basic things you need to know about PCRE in order to start using them.

  • Delimiters
  • Meta-characters
  • Escape Sequences
  • Anchors
  • Subpatterns
  • Functions

Before we get started, I should probably present an example of a simple regular expression. The following regular expression finds every run of two digits followed by a word character (a letter, a digit, or an underscore):

/[0-9]{2,2}\w/

For example, the code below:

<?php
   $s = "aab2k3oo45l987xz0x11Cv22";
   $regex = '/[0-9]{2,2}\w/';
   preg_match_all($regex, $s, $matches);
   echo $s . PHP_EOL;
   print_r($matches);
?>

Will return:

aab2k3oo45l987xz0x11Cv22
Array
(
  [0] => Array
  (
  [0] => 45l
  [1] => 987
  [2] => 11C
  )
)

Delimiters

Every regular expression is enclosed in a pair of delimiter characters. Almost any character can be used, but it is best to pick one that does not occur in the pattern itself, because any occurrence of the delimiter inside the pattern must be escaped with a backslash. For example, in the above PCRE the “/” character was used as the delimiter. A delimiter cannot be a whitespace, backslash, or alphanumeric character.
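As a small sketch of why the choice of delimiter matters, the snippet below writes the same pattern twice, once with “/” and once with “#” as the delimiter. The URL and the variable names are made up for illustration.

```php
<?php
// Hypothetical URL used only for this illustration.
$url = 'https://example.com/blog/2024/';

// With "/" as the delimiter, every literal slash in the pattern must be escaped.
$withSlashes = '/^https:\/\/[^\/]+\/blog\//';

// With "#" as the delimiter, the slashes can be written as-is.
$withHashes = '#^https://[^/]+/blog/#';

var_dump(preg_match($withSlashes, $url)); // int(1)
var_dump(preg_match($withHashes, $url));  // int(1)
```

Both patterns match the same text; the second is simply easier to read.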

Meta-characters

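Meta-characters are characters that have a special meaning in a pattern; they allow for alternatives, repetitions, character classes, and so on. Some common ones, several of which were used in the above example: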

  • [   character class definition start
  • ]   character class definition ending
  • -   character range (special only inside a character class)
  • (   subpattern start
  • )   subpattern end
  • {   min-max quantifier start
  • }   min-max quantifier ending
  • \   escape character

Some characters have special meaning inside a character class:

  • \ escape character
  • ^ class negation, but only if first character e.g. [^a-z789]
  • - character range
  • ] terminates the character class
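The character class rules above can be tried out with a short sketch like the following. The sample string and variable names are arbitrary choices for illustration.

```php
<?php
// Arbitrary sample string for this illustration.
$s = 'order-42_b';

// [a-z] is a range; a leading ^ negates the class.
preg_match_all('/[a-z]/', $s, $letters);
preg_match_all('/[^a-z]/', $s, $others);

// A "-" placed first in the class is a literal hyphen, not a range.
preg_match_all('/[-_]/', $s, $punct);

echo implode('', $letters[0]) . PHP_EOL; // orderb
echo implode('', $others[0]) . PHP_EOL;  // -42_
echo implode('', $punct[0]) . PHP_EOL;   // -_
```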

Escape Sequences

If the escape character \ is followed by a non-alphanumeric character, it takes away any special meaning that character may have. For example, . normally matches any character, but \. matches only a literal period; likewise, \/ matches a literal slash inside a pattern delimited by /. (Writing \" inside a double-quoted PHP string is a PHP string escape rather than a PCRE one.)
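A minimal sketch of the difference between an escaped and an unescaped meta-character, using a made-up file name:

```php
<?php
// Hypothetical file name used only for this illustration.
$file = 'report.txt';

// An unescaped "." matches any character, so "reportXtxt" also matches.
var_dump(preg_match('/^report.txt$/', 'reportXtxt'));  // int(1)

// "\." matches only a literal period.
var_dump(preg_match('/^report\.txt$/', 'reportXtxt')); // int(0)
var_dump(preg_match('/^report\.txt$/', $file));        // int(1)
```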

If the escape character \ is followed by an alphanumeric character, the sequence takes on a special meaning. Escape sequences can encode non-printing characters in a visible manner. For example:

  • \f is a form feed character
  • \n is a newline character
  • \t is a tab character

Escape sequences can also be used to match any character by giving its octal character code. For example:

  • \040 is the space character
  • \007 is the BEL (bell) character
  • \033 is the ESC (escape) character
  • \011 is the tab character
  • \113 is the letter K
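The octal codes listed above can be checked with a quick sketch. The patterns are written in single-quoted strings so that PHP passes the backslash sequences through to PCRE unchanged; the subject strings are arbitrary.

```php
<?php
// \113 is the letter K (octal 113 = decimal 75).
var_dump(preg_match('/\113/', 'OK'));     // int(1)

// \040 is a space.
var_dump(preg_match('/a\040b/', 'a b'));  // int(1)

// \011 is a tab; the subject uses PHP's own "\t" escape to produce one.
var_dump(preg_match('/a\011b/', "a\tb")); // int(1)
```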

Escape sequences can also represent generic character types.  For example:

  • \d is any decimal digit
  • \s is any whitespace character
  • \w is any word character

The uppercase form of each of these sequences matches the opposite set of characters.  For example:

  • \S is any non-whitespace character
  • \W is any non-word character
  • \D is any non-digit

For example:

<?php
   $s = "U29qexVhA8 LJQYP";
   $regex = '/\d/';
   $ar = preg_split($regex, $s);
   foreach ($ar as $a) {
      echo "[$a]\n";
   }
?>

Returns:

[U]
[]
[qexVhA]
[ LJQYP]

Anchors

Outside a character class, in the default matching mode, the circumflex character (^) is an assertion which is true only if the current matching point is at the start of the subject string.

A dollar character ($) is an assertion which is true only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). For example:

<?php
   $s = "U29qexVhA8 LJQYP";
   $regex = '/QYP$/';
   preg_match($regex, $s, $matches);
   print_r($matches);
   $regex = '/^U29/';
   preg_match($regex, $s, $matches);
   print_r($matches);
?>

Returns:

Array
(
  [0] => QYP
)
Array
(
  [0] => U29
)
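The "immediately before a newline that is the last character" rule for $ is easy to miss, so here is a small sketch of it. It assumes a subject string that ends in a newline, and uses the D pattern modifier and the \z assertion, neither of which is covered above, to show the stricter alternatives.

```php
<?php
// Subject with a trailing newline, as often returned by fgets().
$line = "LJQYP\n";

// By default, $ also matches just before a final newline.
var_dump(preg_match('/QYP$/', $line));  // int(1)

// With the D modifier, $ matches only at the very end of the subject.
var_dump(preg_match('/QYP$/D', $line)); // int(0)

// \z always means the very end of the subject.
var_dump(preg_match('/QYP\z/', $line)); // int(0)
```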

Subpatterns

Subpatterns are delimited by parentheses.

A subpattern serves two purposes:

  1. It localizes a set of alternatives, and
  2. It sets up the subpattern as a capturing subpattern.

For example, given a set of localized alternatives: /phil(harmonic|anthropist)/

<?php
   $s = "the st. louis philharmonic orchestra";
   $regex = '/phil(harmonic|anthropist)/';
   preg_match_all($regex, $s, $matches);
   echo $s . PHP_EOL;
   print_r($matches);
?>

returns:

the st. louis philharmonic orchestra
Array
(
  [0] => Array
  (
  [0] => philharmonic
  )
  [1] => Array
  (
  [0] => harmonic
  )
)

In this example, the full match ‘philharmonic’ is returned in $matches[0], and ‘harmonic’, the text captured by the subpattern, is returned in $matches[1].

If an opening parenthesis is followed by “?:”, the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns.

For example:

<?php
   $s = "the new york jets and the new york mets";
   // (?:...) groups the alternatives without capturing them
   $regex = '/new york (?:jets|giants)|(mets|yankees)/';
   preg_match_all($regex, $s, $matches);
   echo $s . PHP_EOL;
   print_r($matches);
?>

produces:

the new york jets and the new york mets
Array
(
  [0] => Array
  (
  [0] => new york jets
  [1] => mets
  )
  [1] => Array
  (
  [0] =>
  [1] => mets
  )
)

Functions

preg_filter() - Perform a regular expression search and replace, returning only the subjects where there was a match

$subject = "It all depends what this is.";
$replace = "is";
$pattern = "/this/";
echo "replacing this" . PHP_EOL . "[$subject ]" . PHP_EOL . "with" . PHP_EOL . "[";
print(preg_filter($pattern, $replace, $subject));
echo "]" . PHP_EOL;

will print:

replacing this
[It all depends what this is. ]
with
[It all depends what is is.]

preg_grep() - Return array entries that match the pattern
preg_last_error() - Returns the error code of the last PCRE regex execution

$names = array('John', 'Bob', 'Teresa', 'Lisa', 'Jimmy', 'Beverly');
$grepped = preg_grep("/^B+/", $names);
print_r($grepped);
if (preg_last_error() == PREG_NO_ERROR) {
   print 'There was no preg error.' . PHP_EOL;
}

will print:

Array
(
  [1] => Bob
  [5] => Beverly
)
There was no preg error.

preg_match_all() - Perform a global regular expression match (see earlier examples for usage)
preg_match() - Perform a regular expression match
preg_match() is similar to preg_match_all(), except that it stops searching after the first match is found.
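A brief sketch of the difference between the two functions, using an arbitrary sample sentence and variable names chosen for illustration:

```php
<?php
// Arbitrary sample text for this illustration.
$s = '3 apples, 12 pears, 7 plums';

// preg_match() stops at the first match and returns 1 (or 0 if none).
$count = preg_match('/\d+/', $s, $first);
echo $count . ': ' . $first[0] . PHP_EOL;                   // 1: 3

// preg_match_all() keeps going and returns the number of matches.
$total = preg_match_all('/\d+/', $s, $all);
echo $total . ': ' . implode(',', $all[0]) . PHP_EOL;       // 3: 3,12,7
```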

preg_quote() - Quote regular expression characters

$regex_chars = '\ + * ? [ ^ ] $ ( ) { } = ! < > | : -';
echo preg_quote($regex_chars) . PHP_EOL;

will print:

\\ \+ \* \? \[ \^ \] \$ \( \) \{ \} \= \! \< \> \| \: \-
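A common use for preg_quote() is embedding arbitrary text in a pattern so that it is matched literally. The sketch below assumes a made-up search string and passes the delimiter as the second argument so that it is escaped as well.

```php
<?php
// Hypothetical user-supplied text containing several meta-characters.
$needle = 'price (USD) 1.50/kg?';

// The second argument tells preg_quote() to escape the "/" delimiter too.
$pattern = '/' . preg_quote($needle, '/') . '/';
echo $pattern . PHP_EOL; // /price \(USD\) 1\.50\/kg\?/

var_dump(preg_match($pattern, 'The price (USD) 1.50/kg? Yes.')); // int(1)
var_dump(preg_match($pattern, 'The price USD 1x50 kg'));         // int(0)
```

Without the quoting step, the parentheses, the period, and the question mark in the search text would all be interpreted as meta-characters.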

preg_replace_callback() - Perform a regular expression search and replace using a callback

$bicycle = array('frame', 'chain', 'gruppo', 'seatpost', 'tires',
                 'handlebars', 'stem', 'saddle');
$pattern = '/\w+/';
print_r(preg_replace_callback($pattern, 'my_callback', $bicycle));
// Called once per match; $matches[0] holds the matched text
function my_callback($matches) {
   $s = "";
   foreach ($matches as $match) {
      $s .= strtoupper($match);
   }
   return $s;
}

will print:

Array
(
  [0] => FRAME
  [1] => CHAIN
  [2] => GRUPPO
  [3] => SEATPOST
  [4] => TIRES
  [5] => HANDLEBARS
  [6] => STEM
  [7] => SADDLE
)

preg_replace() - Perform a regular expression search and replace, returning a string when the subject is a string and an array when the subject is an array

$string = "this sentence has been capitalized.";
$replace = "T";
$pattern = '/^\w/';
// The subject is a string, so the result is a string as well
$result = preg_replace($pattern, $replace, $string);
print_r($result);

Will print:

This sentence has been capitalized.

preg_split() - Split string by a regular expression

$string = "Isn't PHP a cool language?";
$pattern = '/\s/';
$ar = preg_split($pattern, $string);
print_r($ar);

Will print:

Array
(
  [0] => Isn't
  [1] => PHP
  [2] => a
  [3] => cool
  [4] => language?
)