Content-Type: RST
Since nobody responded to the poll on the left, I'll conclude either
people are blind to it, so I shouldn't use it for choosing a blog
topic, or that none of the options were interesting. So, here we go
about something I found particularly interesting in my Discrete Math 2
class. But first, some basics so that I don't dive right in.
Basics
------
I will be focusing on the mathematical aspect of things, and in
programming and other applications, syntax may be different, but the
concepts are the same. First, some definitions:
.. _alphabet:
Alphabet
The set of characters that a certain language_ uses. For example,
``{a,b,c,d,...}`` is the English alphabet, and ``{1,0}`` is the binary
alphabet.
String
A series of characters in a certain order.
For example, the string
of "I think, therefore I am." is the set of these characters: ``I``,
, ``t``, ``h``, ``i``, ``n``, ``k``, , ``,``, et cetera.
.. _language:
Language
The set of all unique strings which conform to a rule. For example,
"My name is Bob." is part of the English language, "Je m'apelle Bob"
is a part of the French language, "fa;aslddkfasl" is part of the
language of strings composed of only characters in the home row of
the QWERTY keyboard, etc.
I will be focusing on how to use regular expressions and state
machines in order to define a language. These language definitions
will be able to just say "yes" or "no" (or, programmatically, true or
false, or 1 or 0) to indicate whether a certain string is in a
language. If it returns "yes", then that is called a **match**.
.. figure :: http://icanhascheezburger.files.wordpress.com/2007/02/tell-me-when-its-over.jpg
You with me?
Regular Expressions (Regex)
---------------------------
As I mentioned, I will only be addressing the mathematical aspect of
regex, but if you want the commonly used POSIX syntax, here is an
`useful cheatsheet
`__.
So, regular expressions are plain strings that describe a
language. A (math) regex can only be composed of the following:
- A plain character sequence: defines a language containing only that
character sequence. `Example:` ``foo`` only matches "foo" and
nothing else.
- Parentheses: to define sub-expressions.
- An asterisk (``*``): the expression before it can be repeated as
many times as necessary, including none. `Example:` ``(foo)*``
matches "" (nothing), "foo", "foofoo", "foofoofoo", etc.
- A plus sign (``+``): read as "OR", it matches either the expression
to its left, or the one to its right. `Example:` ``foo+bar`` only
matches either "foo" or "bar".
That's it. All other regular expressions can be derived from these
simple signs. I find these are best explained through example, so
let's use the alphabet of only 0's and 1's, and look at one
All strings that end with "00": ``(0+1)*00``
What the expression says in plain speech is that a string is in the
defined language if it has any number of either 0's or 1's, followed
by two 0's.
Let's look at another:
All strings that start with "01": ``01(0+1)*``
Same principle. Or, if we were to combine them:
All strings that start with "01" *and* end with "00": ``01(0+1)*00``
All strings that start with "01" *or* end with "00":
``(01(0+1)*)+((0+1)*00)``
At this point you may be asking yourself "But why? Why go to all this
effort to make basic rules like this?" Because computers find them
easy to understand and parse. Using the POSIX regex syntax, for
example, I can define what a phone number is:
.. sourcecode :: plaintext
^(\([0-9]{3}\)|[0-9]{3}(-|\ ))[0-9]{3}(-|\ )[0-9]{4}$
This will match all of: "(682)515-1005" and "682-515-1005" and "682 515
1005". Giving a computer's regex library the expression I outlined
above will make it able to identify phone numbers with no
problem. Plus, many regex parsers have the ability to search within a
larger string or database for matches, so you can do this:
.. figure :: http://imgs.xkcd.com/comics/regular_expressions.png
http://xkcd.com/208/
Yes, for complicated regexes it becomes an unreadable mess, but the
important thing is it *works*. Why did I use POSIX though? Well, for
instance, POSIX supports using ``[0-9]`` for digits, where as in the
simple mathematical regex I'd have to do ``(0+1+2+3+4+5+6+7+8+9)``,
which would be inconvenient. However, even using regex, some problems
get cumbersome, for example:
All strings that have an even number of 0's and 1's:
``((00+01(11)*10)+(1+01(11)*0)(0(11)*0)*(1+0(11)*10))*(1+01(11)*0)(0(11)*0)*``
(`reference `__)
That is where state machines come in!
State Machine
-------------
A state machine can be described by a directed graph, which is a series of
vertices connected by directed edges -- all nicely represented by circles and arrows.
In the case of a state machine, each
vertex is a "state" and each edge is a condition to achieve the state
from the previous state. I'll explain by example:
.. figure :: http://blog.opensourcenerd.com/upload/state-nondet-endswith11-1
Non-deterministic machine for all strings that end in 11.
The `starting state` is the incoming arrow. The machine then follows
arrows from its current state depending on the sequence of
characters. It can either follow a 0 or a 1 and end up in the same
place, or follow two arrows of 1, and end up at a `accepting
state` - represented by the double border. If a string traverses a
state machine and ends up in an accepting state, then it is a match.
The reason I called it non-deterministic in the caption is because the
machine is human-readable, but not quite so for a computer: it has the
1 from the starting state leading off in two directions, and then it
doesn't know what to do if it followed to the right and then got a 0!
To make it computer-friendly, we can make it deterministic:
.. figure :: http://blog.opensourcenerd.com/upload/state-det-endswith11
Deterministic machine for all strings that end in 11.
Deterministic just means that each state has one and only one outgoing
connection for every character in the alphabet_. That's all there
really is to it. Now, about that complicated problem in regex:
.. figure :: http://blog.opensourcenerd.com/upload/state-det-even1and0
Deterministic machine for all strings with an even number of 1's
and 0's.
Of course, I wouldn't even begin to try representing a telephone
number as a state machine, but it's possible. In fact, `regex can be
translated into a non-deterministic machine, and non-deterministic
machines can be translated into deterministic machines, and
deterministic machines can be translated into regex`. They're
equivalent! I can't elaborate on the proofs here since there's not
enough room.
`I have discovered a truly marvelous proof of this, which this
margin is too narrow to contain.` ~ Pierre de Fermat
Irregular Languages
-------------------
Some languages just can't be represented by regex or finite state
machines. For example, the language containing all strings where there
is some number of 0's followed by the same number of 1's. Try it. You
can't. It's because regular expressions and state machines only work
on the current character, and they check the strings character by
character. A problem such as the former needs previous knowledge of
what was already done. Extended POSIX regex syntax provides the
ability to use back-references, though I am not very familiar with
that feature of it, so you should `ask the experts `__.
-----------
Anyway, I hope you enjoyed my recapitulative rant as much as I have,
and found it as useful reading it as I did writing it.