SQL, how strange art thou

| 11 Comments

One of the things I've been working on lately is an architecture for lexers, and I'm basing my work off an existing SQL lexer I have access to. The lexer is well-factored, so refactoring it into a more generic tool is pretty simple. However, while doing this, I ran into a behavior that I thought was strange. I'm sure you've seen the same thing in your own coding: you take some working code, refactor it, test it out and notice some strange behavior and wonder "did I break something?" In my case, I was noticing that my lexer was reporting the following as two distinct tokens (instead of some sort of error). The text I was entering was:

2foo

I found it strange because what my lexer was telling me is that I was getting a NUMBER followed by an IDENTIFIER. So I did a bit more poking, and it was even more interesting than that.
2.1.1

That was returning NUMBER NUMBER!

So I was checking over my code to see if I had somehow introduced a new behavior, but realized that I hadn't. This behavior had been there since long before I got my fingers into the code. Surely I had just found a rather obscure lexer bug, right?

Wrong.

In SQL, tokens do not need to be separated by whitespace if they can be unambiguously distinguished while lexing. In the two cases I provided above, the lexer can easily figure out the answer in an unambiguous manner. But this has some rather funny ramifications.

select(*)from'sometable'where.1=.1

Hard to read, but perfectly legal SQL statement right there. My understanding of why you would actually want to do something like this is as a front-end optimizaton to your queries. Basically, if you're going to send a plain text query over the wire, you may want to run it through some sort of minimizer first to reduce the size of the query. The only other explanation I can come up with is that it also makes for an interesting blog posting! ;-)

11 Comments

Actually where1 is one identifier. Otherwise, yep :)

I'm surprised that you're surprised; the REALbasic compiler happily tokenizes

if 0=0then

for example.

> Does it correctly do this:
> if.1=.1then MsgBox"This works fine?"
Yep! No syntax error and the message is displayed

I kept telling you SQL was slightly nuts !
Believe me now ? :P

Presumably, when the Rb tokenizer reads 'i', it figures it's the start of either a reserved word or an identifier and reads until it hits '.', sees that it has a reserved word, emits the token, and resumes at '.'. Seems pretty sensible to me.

If 2.1.1 is showing NUMBER NUMBER, you have another problem--is it 2.1 and then 1, or 2 and then 1.1?

RB lexical grammar does have an unambiguous definition of whitespace - it simply doesn't include end-of-line. EOL is its own kind of token.

I didn't design RB's lexical grammar, but you might say I reverse-engineered it. While this behavior looks strange and isn't often used, I'd describe it as a predictable and reliable consequence of the design.

Leave a comment

August 2009

Sun Mon Tue Wed Thu Fri Sat
            1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31