20 January 2016
A language is a (possibly infinite) set of strings, where a string is a sequence of symbols (characters) from some alphabet. The alphabet is usually designated by the Greek uppercase \(\Sigma\) (sigma). For example, sequences of binary digits are drawn from \(\Sigma=\{0,1\}\) and base ten integers have \(\Sigma=\{-,0,1,2,3,4,5,6,7,8,9\}\) (the negation sign and all ten numerals).
We can characterize infinite language informally using English descriptions, or by set notation and abusing the ellipses (…), as in these examples:
\(L_0\) is the set of strings from \(\Sigma=\{a,b\}\) where the sequence \(ab\) is repeated one or more times. E.g., \(L_0 = \{ab, abab, ababab, abababab, \dots\}\) but \(aba\not\in L_0\).
\(L_1\) is the set of bit strings (from \(\Sigma=\{0,1\}\)) that end in the sequence \(101\). E.g., \(L_1 = \{101, 0101, 1101, 00101, 01101, 10101, 11101, \dots\}\) but \(1010\not\in L_1\).
\(L_2\) is the set of strings consisting of \(N\) \(a\)’s followed by \(N\) \(b\)’s where \(N \ge 0\). When \(N=0\) this is the empty string, usually designated with the Greek lowercase \(\epsilon\) (epsilon). E.g., \(L_2 = \{\epsilon, ab, aabb, aaabbb, aaaabbbb, \dots\}\) but \(aabbb\not\in L_2\).
\(L_3\) is the set of base ten representations of natural numbers that are multiples of 3. E.g., \(L_3 = \{0,3,6,9,12,15,18,21,\dots\}\) but \(20\not\in L_3\).
\(L_4\) is the set of base ten representations of natural numbers that are prime. E.g., \(L_4 = \{2, 3, 5, 7, 11, 13, 17, 19, 23, 29, \dots\}\) but \(28 \not\in L_4\).
\(L_5\) is the set of strings consisting of matched parentheses or brackets. E.g., \(\Sigma=\{\) (
, )
, [
, ]
\(\}\) and \(L_5 = \{\epsilon\), ()
, []
, ([])
, [()]
, (())
, [[]]
, ([]())
, \(\dots\}\) but ([)]
\(\not\in L_5\).
Can we come up with a finite, formal characterization of languages like these? What if we had a well-defined ‘machine’ of some sort that, given any string, can determine whether the string is in the language. In other terms, the machine recognizes or accepts strings in the language, and it rejects any strings not in the language.
The type of machine we’ll start with is called a Finite State Automaton (FSA). If the term automaton is unfamiliar, just think of machine or device; the plural form is automata. Below is an example of an FSA for the language \(L_0\).
The circles are states, and the arrows are transitions. The state labeled \(S\) is the start state, and the double-circle is an accept or final state. The arrows are labeled with the character that is consumed when you make that transition.
So it works like this: Given a string, let’s say \(abab\), we keep track of what state we’re in and then consume characters one at a time. Each character corresponds to following one of the arrows from the current state. So the first \(a\) moves us from \(S\) to \(X\). The next character is \(b\) so we move from \(X\) to \(Y\). Now we’re in an accept state, so if this was the end of the string, it would be in the language \(L_0\). But since there are more characters we continue. The second \(a\) moves us back to \(X\) and the final \(b\) moves us to \(Y\). We’re back in the accept state and at the end of the string, so the string is recognized by the FSA and therefore it is in the language \(L_0\).
Now, consider some things that can go wrong. What if we encounter a character for which there is no outgoing arrow from the current state? For example, trace the same FSA with the string \(aab\). The first \(a\) takes you to state \(X\), but from there we don’t have an outgoing arrow for the second \(a\). Therefore we’re stuck and the string is rejected: \(aab\not\in L_0\).
Another type of rejection is when we reach the end of the string, but we’re not in the final state. Trace the FSA with the string \(aba\). You’ll get to the end of the string, but end up in state \(X\) which is not a final state. Therefore \(aba\not\in L_0\).
Now let’s try to build an FSA for the next sample language, \(L_1\). The simplest way to approach it is to build a non-deterministic FSA. The start state is \(S\), and from there we have self-transitions on \(0\) or \(1\), back to state \(S\). That means we can – at first – have any number of \(0\)’s and \(1\)’s. But to get to the final state \(Z\), we need the string to end with \(101\).
We can see that the automaton is non-deterministic because, from state \(S\) if we see a \(1\), there are two choices! Do we follow the loop and stay in \(S\), or do we move instead to \(X\)? Consider the string \(1101\). On the first \(1\) if we move to \(X\) then it appears we’re stuck trying to match the second \(1\). The correct way to match this string is to take the loop for the first \(1\) and move to \(X\) on the second \(1\).
The way non-deterministic FSAs are interpreted is that the string is accepted if there is any path that leads to the final state. But when using the machine, how do we know which transitions to take? The strings could be arbitrarily long and complex, so we could have many false starts and have to backtrack.
Fortunately (and maybe surprisingly), any non-deterministic FSA can be converted to a deterministic FSA. The number of states may increase substantially, but it will still be constant relative to the length of the input string.
The conversion algorithm is somewhat intuitive. Starting from the start state, consider each possible input from \(\Sigma\). On a \(0\) we stay in \(S\). But on a \(1\), there are two possibilities: we can end up in either \(S\) or \(X\). So we create a new, union state called \(SX\) and let that be the target of \(1\). Continue from \(SX\): if I’m in either \(S\) or \(X\), what are all the possibilities when I see a \(0\)? Well, if it was \(S\) I would stay put, but if I was in \(X\) I would move to \(Y\). So the new union state is \(SY\). Now, from \(SX\) what are all the possibilities when I see a \(1\)? Well from state \(X\) I would be stuck so I can ignore that. But from state \(S\) I could be in either \(S\) or \(X\). So there is a loop on a \(1\) from \(SX\) to \(SX\).
Once you consider all possible transitions from these union states, you make any state that includes \(Z\) an accept state – in this example there is just one, but it’s possible the non-deterministic accept state could be a member of multiple deterministic union states.
It’s straightforward to implement a deterministic FSA in any programming language. Just keep track of the state and loop through the characters. Here is a Java program that implements the FSA for \(L_0\).
public class FSA_L0 {
enum State { S, X, Y };
static boolean runFSA(String buffer) {
State state = State.S; // Start state
for(int position = 0; position < buffer.length(); position++) {
char current = buffer.charAt(position);
switch(state) {
case S:
if('a' == current) state = State.X;
else return false; // Stuck
break;
case X:
if('b' == current) state = State.Y;
else return false; // Stuck
break;
case Y:
if('a' == current) state = State.X;
else return false; // Stuck
break;
}
}
return state == State.Y; // Are we in accept state?
}
public static void main(String[] args) {
if(args.length > 0) { // Try strings provided on command line
for(String arg : args) {
System.out.println
((runFSA(arg)? "ACCEPT" : "REJECT") + ": " + arg);
}
}
else { // Built-in test cases
assert(runFSA("ab"));
assert(runFSA("abab"));
assert(runFSA("ababab"));
assert(!runFSA("a"));
assert(!runFSA(""));
assert(!runFSA("aba"));
assert(!runFSA("baba"));
System.out.println("Tests pass.");
}
}
}
We invoke it like this:
% java FSA_L0 ab abc abb aba abab
ACCEPT: ab
REJECT: abc
REJECT: abb
REJECT: aba
ACCEPT: abab
% java FSA_L0
Tests pass.
Create a non-deterministic FSA to recognize the language \(L_6\), the set of strings from \(\Sigma=\{a,b,n\}\) that begin with \(ba\) and end with \(na\). For example, it should accept \(bana\), \(banana\), \(babbna\) but not \(banba\).
Convert the non-deterministic FSA from the previous exercise into a deterministic FSA. Try it on the same strings.
Create a deterministic FSA to recognize the language \(L_3\), the natural numbers that are multiples of \(3\). I’ve provided an incomplete starting point below that accepts \(\{0,3,6,9,12,15,18,21\}\). As a shorthand, I’ve labeled arrows with more than one digit to indicate that any of those digits will follow that arrow. You should expand the given FSA so it works for any number in the infinite set \(L_3\). Surprisingly, you shouldn’t need more states, just more arrows. One observation that makes this possible is that the sum of the digits of a number that is a multiple of 3 is also a multiple of 3. So we can tell that 236,529 is a multiple of 3 because \(2+3+6+5+2+9=27\) and 27 is a multiple of 3. (Which we can, in turn, tell because \(2+7=9\) is a multiple of 3.)
When we try to create an FSA for \(L_2\), we run into trouble. Recall that \(L_2\) requires a sequence of \(a\)’s followed by a sequence of \(b\)’s, but those sequences must have the same length. So \(aabb\in L_2\) but \(aaabb\not\in L_2\). We can design an FSA that requires a sequence of \(a\)’s followed by a sequence of \(b\)’s, but it won’t be able to count that we’re getting the same number of \(a\)’s and \(b\)’s.
The candidate FSA accepts everything in \(L_2\), but it fails to reject some strings that are not in \(L_2\), such as \(aaabb\).
Because strings in the language can be arbitrarily long, we cannot keep track of lengths using a constant number of states. There is an expansion of FSAs called push-down automata (PDA) that addresses this problem. In addition to the states and transitions, push-down automata use a stack where you can push or pop data on each transition. So if we push each \(a\) as we encounter it, then when we transition to \(b\)’s we can pop an \(a\) for each \(b\). If the stack is empty when we get to the end of the string, then the counts matched.
PDAs are strictly more powerful than FSAs – they can recognize more languages. In this way, definable languages can be placed into a hierarchy based on the power of the machine needed to recognize them. This is called the Chomsky Hierarchy, after linguist Noam Chomsky described it in his book Syntactic Structures in 1957.
Types 2 and 3 are of the most interest to us, in studying programming languages and compilers. Programming languages are usually defined to be context-free, and tokens (like a number, string, or identifier) are regular.
Regular languages show up in many computer programs in the form of regular expressions, also known as regexes. A regular expression is a type of pattern that can be used to search or match text. They are essential to the command-line tool grep
and built in to the syntax of languages like Perl and Javascript.
We won’t cover regular expressions in great detail here, but suffice to say that a dot .
matches any single character, a *
is a post-fix operator that matches zero or more occurrences of the preceding character or group, and [abc]
matches any single character from the set \(\{a,b,c\}\). So the regular expression c.lo*[rn]
would match colr
, colon
, color
, coloon
, calr
, calon
, czloooor
but not colour
or caloot
.
The set of strings matched by a regular expression forms a regular language. Therefore any regular expression can be translated to an FSA, and in fact this is often how they are implemented. Supposing that \(\Sigma=\{a,b,c,\dots,z\}\), here is an FSA that matches c.lo*[rn]
. The \(\Sigma\) label on one of the arrows means you can make that transition on any character in \(\Sigma\).
[cdf]([aoeui][tgr])*
to an FSA. As in arithmetic, the parentheses group together elements of the regex so the *
operator applies to the whole group. Some strings that match this expression are c
, cat
, dog
, dig
, cator
, cigar
, firet
, caterer
, and forager
.Another tool for characterizing languages is a grammar. The common notation is called Backus-Naur Form (BNF) named after the early programming language developers John Backus and Peter Naur. A grammar consists of a set of terminals and non-terminals, and production rules that determine how sequences of them can be transformed into other sequences. In the treatment below, we’ll use lowercase letters for terminals and uppercase for non-terminals, but other typographical treatments are sometimes used too.
This is a grammar for a subset of simple English sentences. The non-terminal S
stands for sentence, but it’s also the starting rule. Read the bar character |
as “or” – it introduces an alternative to the rule above.
S ::= NP VP
NP ::= NOUN
| ADJ NP
VP ::= VERB
| VP ADV
NOUN ::= dog
| idea
| pizza
ADJ ::= green
| mean
| tasty
VERB ::= sleeps
| eats
| learns
ADV ::= furiously
| studiously
| quickly
The above grammar can be used to generate all sorts of strange quasi-English sentences, like “tasty green pizza learns quickly” and “mean dog sleeps.”
To generate a sentence, start with an S
and then expand it using any of the available rules where S
appears on the left side of ::=
. There is only one, so the first layer transforms S
into NP VP
. Now, for NP
there are two choices – do we want some adjectives or just a noun by itself? As you continue expanding non-terminals, the process naturally constructs a tree where the leaves are the terminals in your final sentence.
The inverse problem is called parsing. That is, given a sentence that we suspect is in the language, what grammar rules would be invoked to generate that sentence? Or in other terms, given the sentence how do we construct the tree?
Let’s try to create a grammar to characterize the language \(L_2\), which we learned was not regular. Its strings contain a sequence of a
’s followed by the same number of b
’s. The grammar is actually quite simple.
S ::= ε
| aSb
The start non-terminal S
can either transform to an empty string, or it can recursively become an S
surrounded by an a
on the left and a matching b
on the right. This ensures we generate the same number of a
s as b
s.
The grammars above are context-free. You can easily tell this because the left-hand side of each ::=
is just a single non-terminal. If we allow terminals on either side of the non-terminal (all on the left side of ::=
), then the rule can only be applied when in a matching context, so the grammar is context-sensitive.
Here’s a small example of a context-sensitive grammar. It recognizes a language from \(\Sigma=\{a,b,c\}\) where there are sequences of \(a\)s, then \(b\)s, and then \(c\)s all with the same number of characters.
S ::= aSBc (rule 1)
| abc (2)
cB ::= Bc (3)
bB ::= bb (4)
Notice how the left side of the ::=
in two of the rules contains a terminal. That means the B
can only be expanded one way if there is a c
to its left, and a different way if there is a b
to its left. Below is a derivation that generates a string in the language.
S ⇒ aSBc (rule 1)
⇒ aaSBcBc (1)
⇒ aaabcBcBc (2)
⇒ aaabBccBc (3)
⇒ aaabbccBc (4)
⇒ aaabbcBcc (3)
⇒ aaabbBccc (3)
⇒ aaabbbccc (4)
Using the English-like grammar above, draw the parse tree for the sentence “mean idea eats furiously quickly.”
The English-like grammar doesn’t support subject-verb agreement. Suppose we wish to add plural nouns like dogs/ideas/pizzas and plural verbs like sleep/eat/learn. Revise and extend the grammar to ensure subject-verb agreement, so that “dogs eat” and “dog eats” are in the language, but not “dogs eats” or “dog eat”. Your grammar should continue to support the same adjectives and adverbs.
Write a context-free grammar corresponding to the language \(L_5\) consisting of strings of matched parentheses. Using your grammar, draw the parse tree for ([]())
.