Content-Type: RST Since nobody responded to the poll on the left, I'll conclude either people are blind to it, so I shouldn't use it for choosing a blog topic, or that none of the options were interesting. So, here we go about something I found particularly interesting in my Discrete Math 2 class. But first, some basics so that I don't dive right in. Basics ------ I will be focusing on the mathematical aspect of things, and in programming and other applications, syntax may be different, but the concepts are the same. First, some definitions: .. _alphabet: Alphabet The set of characters that a certain language_ uses. For example, ``{a,b,c,d,...}`` is the English alphabet, and ``{1,0}`` is the binary alphabet. String A series of characters in a certain order. For example, the string of "I think, therefore I am." is the set of these characters: ``I``, , ``t``, ``h``, ``i``, ``n``, ``k``, , ``,``, et cetera. .. _language: Language The set of all unique strings which conform to a rule. For example, "My name is Bob." is part of the English language, "Je m'apelle Bob" is a part of the French language, "fa;aslddkfasl" is part of the language of strings composed of only characters in the home row of the QWERTY keyboard, etc. I will be focusing on how to use regular expressions and state machines in order to define a language. These language definitions will be able to just say "yes" or "no" (or, programmatically, true or false, or 1 or 0) to indicate whether a certain string is in a language. If it returns "yes", then that is called a **match**. .. figure :: http://icanhascheezburger.files.wordpress.com/2007/02/tell-me-when-its-over.jpg You with me? Regular Expressions (Regex) --------------------------- As I mentioned, I will only be addressing the mathematical aspect of regex, but if you want the commonly used POSIX syntax, here is an `useful cheatsheet `__. So, regular expressions are plain strings that describe a language. A (math) regex can only be composed of the following: - A plain character sequence: defines a language containing only that character sequence. `Example:` ``foo`` only matches "foo" and nothing else. - Parentheses: to define sub-expressions. - An asterisk (``*``): the expression before it can be repeated as many times as necessary, including none. `Example:` ``(foo)*`` matches "" (nothing), "foo", "foofoo", "foofoofoo", etc. - A plus sign (``+``): read as "OR", it matches either the expression to its left, or the one to its right. `Example:` ``foo+bar`` only matches either "foo" or "bar". That's it. All other regular expressions can be derived from these simple signs. I find these are best explained through example, so let's use the alphabet of only 0's and 1's, and look at one All strings that end with "00": ``(0+1)*00`` What the expression says in plain speech is that a string is in the defined language if it has any number of either 0's or 1's, followed by two 0's. Let's look at another: All strings that start with "01": ``01(0+1)*`` Same principle. Or, if we were to combine them: All strings that start with "01" *and* end with "00": ``01(0+1)*00`` All strings that start with "01" *or* end with "00": ``(01(0+1)*)+((0+1)*00)`` At this point you may be asking yourself "But why? Why go to all this effort to make basic rules like this?" Because computers find them easy to understand and parse. Using the POSIX regex syntax, for example, I can define what a phone number is: .. sourcecode :: plaintext ^(\([0-9]{3}\)|[0-9]{3}(-|\ ))[0-9]{3}(-|\ )[0-9]{4}$ This will match all of: "(682)515-1005" and "682-515-1005" and "682 515 1005". Giving a computer's regex library the expression I outlined above will make it able to identify phone numbers with no problem. Plus, many regex parsers have the ability to search within a larger string or database for matches, so you can do this: .. figure :: http://imgs.xkcd.com/comics/regular_expressions.png http://xkcd.com/208/ Yes, for complicated regexes it becomes an unreadable mess, but the important thing is it *works*. Why did I use POSIX though? Well, for instance, POSIX supports using ``[0-9]`` for digits, where as in the simple mathematical regex I'd have to do ``(0+1+2+3+4+5+6+7+8+9)``, which would be inconvenient. However, even using regex, some problems get cumbersome, for example: All strings that have an even number of 0's and 1's: ``((00+01(11)*10)+(1+01(11)*0)(0(11)*0)*(1+0(11)*10))*(1+01(11)*0)(0(11)*0)*`` (`reference `__) That is where state machines come in! State Machine ------------- A state machine can be described by a directed graph, which is a series of vertices connected by directed edges -- all nicely represented by circles and arrows. In the case of a state machine, each vertex is a "state" and each edge is a condition to achieve the state from the previous state. I'll explain by example: .. figure :: http://blog.opensourcenerd.com/upload/state-nondet-endswith11-1 Non-deterministic machine for all strings that end in 11. The `starting state` is the incoming arrow. The machine then follows arrows from its current state depending on the sequence of characters. It can either follow a 0 or a 1 and end up in the same place, or follow two arrows of 1, and end up at a `accepting state` - represented by the double border. If a string traverses a state machine and ends up in an accepting state, then it is a match. The reason I called it non-deterministic in the caption is because the machine is human-readable, but not quite so for a computer: it has the 1 from the starting state leading off in two directions, and then it doesn't know what to do if it followed to the right and then got a 0! To make it computer-friendly, we can make it deterministic: .. figure :: http://blog.opensourcenerd.com/upload/state-det-endswith11 Deterministic machine for all strings that end in 11. Deterministic just means that each state has one and only one outgoing connection for every character in the alphabet_. That's all there really is to it. Now, about that complicated problem in regex: .. figure :: http://blog.opensourcenerd.com/upload/state-det-even1and0 Deterministic machine for all strings with an even number of 1's and 0's. Of course, I wouldn't even begin to try representing a telephone number as a state machine, but it's possible. In fact, `regex can be translated into a non-deterministic machine, and non-deterministic machines can be translated into deterministic machines, and deterministic machines can be translated into regex`. They're equivalent! I can't elaborate on the proofs here since there's not enough room. `I have discovered a truly marvelous proof of this, which this margin is too narrow to contain.` ~ Pierre de Fermat Irregular Languages ------------------- Some languages just can't be represented by regex or finite state machines. For example, the language containing all strings where there is some number of 0's followed by the same number of 1's. Try it. You can't. It's because regular expressions and state machines only work on the current character, and they check the strings character by character. A problem such as the former needs previous knowledge of what was already done. Extended POSIX regex syntax provides the ability to use back-references, though I am not very familiar with that feature of it, so you should `ask the experts `__. ----------- Anyway, I hope you enjoyed my recapitulative rant as much as I have, and found it as useful reading it as I did writing it.