One of the things I've been working on lately is an architecture for lexers, and I'm basing my work off an existing SQL lexer I have access to. The lexer is well-factored, so refactoring it into a more generic tool is pretty simple. However, while doing this, I ran into a behavior that I thought was strange. I'm sure you've seen the same thing in your own coding: you take some working code, refactor it, test it out and notice some strange behavior and wonder "did I break something?" In my case, I was noticing that my lexer was reporting the following as two distinct tokens (instead of some sort of error). The text I was entering was:
2fooI found it strange because what my lexer was telling me is that I was getting a NUMBER followed by an IDENTIFIER. So I did a bit more poking, and it was even more interesting than that.
2.1.1That was returning NUMBER NUMBER!
So I was checking over my code to see if I had somehow introduced a new behavior, but realized that I hadn't. This behavior had been there since long before I got my fingers into the code. Surely I had just found a rather obscure lexer bug, right?
Wrong.
In SQL, tokens do not need to be separated by whitespace if they can be unambiguously distinguished while lexing. In the two cases I provided above, the lexer can easily figure out the answer in an unambiguous manner. But this has some rather funny ramifications.
select(*)from'sometable'where.1=.1Hard to read, but perfectly legal SQL statement right there. My understanding of why you would actually want to do something like this is as a front-end optimizaton to your queries. Basically, if you're going to send a plain text query over the wire, you may want to run it through some sort of minimizer first to reduce the size of the query. The only other explanation I can come up with is that it also makes for an interesting blog posting! ;-)
Actually where1 is one identifier. Otherwise, yep :)
Oops, good catch! I meant where.1=.1, but suck at typing. ;-)
I'm surprised that you're surprised; the REALbasic compiler happily tokenizes
if 0=0then
for example.
Mars is welcome to correct me if I'm wrong, but it was never designed to do that, at least as far as I'm aware. Does it correctly do this:
if.1=.1then MsgBox"This works fine?"
I can see some of that being correct, like .1=.1, but I'd be surprised about the lack of spaces between if/.1, .1/then and possibly MsgBox/"
> Does it correctly do this:
> if.1=.1then MsgBox"This works fine?"
Yep! No syntax error and the message is displayed
I kept telling you SQL was slightly nuts !
Believe me now ? :P
Presumably, when the Rb tokenizer reads 'i', it figures it's the start of either a reserved word or an identifier and reads until it hits '.', sees that it has a reserved word, emits the token, and resumes at '.'. Seems pretty sensible to me.
Sensible? Perhaps. In REALbasic, whitespace is ill-defined IMO. It's important sometimes, and not important other times. For instance, whitespace IS important in terms of statements. All statements are terminated by a new line (except when using line continuation character _). But whitespace is not important enough for it to be an error to write if.1=.1, which is a bit strange. Don't get me wrong -- I can easily see why it's lexed the way it is. But I'm just surprised -- this isn't part of the language design intent (as far as I'm aware). It's not *bad*, it's just not something you should ever rely on for RB. For SQL, it's actually part of the language design intent, so it IS something you can rely on.
Neato -- learning all sorts of new things today!
If 2.1.1 is showing NUMBER NUMBER, you have another problem--is it 2.1 and then 1, or 2 and then 1.1?
@Silverpie -- That's actually not problematic because it is unambiguous -- the 2 starts a number, so it's 2.1 followed by .1. Also, you could never have a 2. followed by a 1.1 because 2. would be an illegal numeric value.
RB lexical grammar does have an unambiguous definition of whitespace - it simply doesn't include end-of-line. EOL is its own kind of token.
I didn't design RB's lexical grammar, but you might say I reverse-engineered it. While this behavior looks strange and isn't often used, I'd describe it as a predictable and reliable consequence of the design.