Difficult Questions to Answer


Here's the source code:


if &hFFFFFFFF > 0 then
break
else
break
end if

(Of course, you can ignore the constants -- this happens with variables as well. Imagine u as being a UInt32 whose value is 0xFFFFFFFF, and i as being an Int32 whose value is 0; you'll get the identical behavior either way.)

When you run this code, you'll get results that seem a little strange at first blush. You get into the else clause -- so the compiler is generating code that says a large unsigned number is less than 0. How's that possible??

Well, what the compiler does whenever it gets any binary operator (+, -, =, <, etc) is first compute a common type between the operands. And for backwards compatibility reasons, the common type between signed and unsigned is signed. This harks back to the days when REALbasic only had signed 32-bit integer values. So in essence, this causes the compiler to treat the large unsigned value as being a signed value whenever the high bit is set. Any value >= 0x80000000 will be treated as negative. Here 0xFFFFFFFF becomes -1, thus turning the if statement into this: if -1 > 0 then, which evaluates to false.
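To make the reinterpretation concrete, here is a minimal sketch in Python rather than REALbasic. It only models the rule described above; it is not the compiler's actual code. The helper names to_int32 and gt_common_signed are made up for illustration.

```python
def to_int32(value):
    """Reinterpret the low 32 bits of value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def gt_common_signed(left, right):
    """Model 'left > right' when the common type of both operands is Int32."""
    return to_int32(left) > to_int32(right)


print(to_int32(0xFFFFFFFF))             # -1
print(gt_common_signed(0xFFFFFFFF, 0))  # False: the else branch runs
print(gt_common_signed(0x7FFFFFFF, 0))  # True: high bit clear, still positive
print(gt_common_signed(0x80000000, 0))  # False: high bit set, now negative
```

The last two lines show the boundary: 0x7FFFFFFF is the largest value that survives the trip to signed unchanged.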

So what should the compiler do in this situation? If it promotes everything to unsigned, then you run into an equally undesirable situation with code like this:


if &h0 > -1 then
break
else
break
end if

Because in that case, -1 would be promoted to 0xFFFFFFFF, and 0 is not greater than four billion.

Keep in mind that the behavior has nothing to do with constants either -- you can get the same behavior using just variables as well. Maybe the compiler could try to do something sensible with constants, but the problem still remains any way you slice it. But let's explore this a bit: should the compiler try to "be smarter" when it comes to constant comparisons? Given the following code:


dim u as UInt32 = &hFFFFFFFF
if u > 0 then
break
else
break
end if

Human logic says that we should get into the first break statement because a nonzero unsigned integer must always be greater than zero. We could make this logic work by having the common type computation prefer unsigned over signed, which would promote 0 to an unsigned integer. But that would then break this code:

dim i as Int32 = -1
if i < &h0 then
break
else
break
end if

Because zero is unsigned, i would be promoted to unsigned and be treated as 0xFFFFFFFF. But then again, this is a less common case. More common would be to just use "0" instead of "&h0", which means both operands would be signed and the comparison would work. But trade &h0 out for anything that starts with an & and the problem still remains.

This tells me that computing the common types is clearly the wrong place to add more logic. No matter what we do, there are very trivial cases where the behavior will be undesirable.
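One way to see the dilemma at a glance is to run the comparisons discussed so far through both candidate rules. This is a rough Python model, not REALbasic and not the compiler's implementation. The names to_int32, to_uint32, exact, compare, and evaluate are hypothetical helpers.

```python
def to_int32(value):
    """Reinterpret the low 32 bits as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def to_uint32(value):
    """Reinterpret the low 32 bits as an unsigned integer."""
    return value & 0xFFFFFFFF


def exact(value):
    """No conversion: compare the values as a person would read them."""
    return value


def compare(left, op, right, convert):
    """Apply 'convert' to both operands, then evaluate left op right."""
    a, b = convert(left), convert(right)
    return a > b if op == ">" else a < b


# The comparisons from the article, written with their intended values.
CASES = [
    ("u > 0, u = &hFFFFFFFF", 0xFFFFFFFF, ">", 0),
    ("&h0 > -1", 0, ">", -1),
    ("i < &h0, i = -1", -1, "<", 0),
]


def evaluate(cases):
    """Return (label, expected, prefer_signed, prefer_unsigned) per case."""
    rows = []
    for label, left, op, right in cases:
        rows.append((
            label,
            compare(left, op, right, exact),
            compare(left, op, right, to_int32),
            compare(left, op, right, to_uint32),
        ))
    return rows


for label, expected, signed, unsigned in evaluate(CASES):
    print(f"{label:24} expected={expected} signed={signed} unsigned={unsigned}")
```

Every case is expected to be true. Preferring signed gets the first one wrong, and preferring unsigned gets the other two wrong, so neither rule gives the intuitive answer across the board.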

The only other alternative is to handle constants as a special case during comparison operations. If either of the operands is a constant, then we could do some special processing that essentially ignores the common type computation. However, what would that special processing be? Convert the constant to the variable's sign? That might work, but what about this code:


dim u as UInt32 = &h80000000
if u > -1 then
end if

This is perfectly legal code that would misbehave if we forced the constant to be unsigned (since the left-hand side would be &h80000000 and the right-hand side would be &hFFFFFFFF). Ok, so if the constant is obviously negative, then we compute common types as we always have? In that case, u would change to a signed value (-2147483648), which is not greater than -1 and we're right back where we started.
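The same kind of model shows why special-casing the constant does not rescue this example. Again this is a Python sketch under the assumptions stated in the text, with made-up helper names, and not anything the compiler actually does.

```python
def to_int32(value):
    """Reinterpret the low 32 bits as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def to_uint32(value):
    """Reinterpret the low 32 bits as an unsigned integer."""
    return value & 0xFFFFFFFF


U = 0x80000000   # the UInt32 variable u
CONSTANT = -1    # the literal on the right-hand side

# What a person reading "u > -1" expects.
as_written = U > CONSTANT

# Option 1: force the constant to the variable's sign (unsigned).
constant_forced_unsigned = to_uint32(U) > to_uint32(CONSTANT)

# Option 2: fall back to the existing rule and treat both as signed.
both_signed = to_int32(U) > to_int32(CONSTANT)

print(as_written)                # True
print(constant_forced_unsigned)  # False: &h80000000 > &hFFFFFFFF
print(both_signed)               # False: -2147483648 > -1
```

Both options disagree with the reading a person would give the source line.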

So basically, this is a really difficult question to answer. As much as I dislike saying it: I think the original behavior is not a bug. It's just the misfortune that comes about from mixing signs. Ideally, the compiler would warn you about this situation -- and that may come about in a future release of the product. But for the time being, you should watch out for cases where you mix signs, and understand that the compiler will always convert both values to signed if either operand in a binary operation is signed. You can avoid this situation by being explicit and telling the compiler how to handle the operator: use typecasting. If you want an unsigned comparison, typecast using UInt32( 0 ), or use a constant value that is implicitly unsigned (like &h0).
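To illustrate why the typecast helps, here is a small Python model in which each operand carries its own signedness. It is a sketch of the rule as stated in this post, not REALbasic code. Operand, int32, uint32, numeric_value, and greater_than are hypothetical names invented for the example.

```python
from collections import namedtuple

# An operand is a 32-bit pattern plus a flag saying how it was declared.
Operand = namedtuple("Operand", ["bits", "signed"])


def int32(value):
    """Model an Int32 operand, such as the plain literal 0."""
    return Operand(value & 0xFFFFFFFF, True)


def uint32(value):
    """Model a UInt32 operand, such as UInt32( 0 ) or &h0."""
    return Operand(value & 0xFFFFFFFF, False)


def numeric_value(operand, as_signed):
    """Read the bit pattern as signed or unsigned."""
    if as_signed and operand.bits >= 0x80000000:
        return operand.bits - 0x100000000
    return operand.bits


def greater_than(left, right):
    """Rule from the post: if either operand is signed, both are read as signed."""
    as_signed = left.signed or right.signed
    return numeric_value(left, as_signed) > numeric_value(right, as_signed)


u = uint32(0xFFFFFFFF)
print(greater_than(u, int32(0)))   # False: plain 0 is signed, so u becomes -1
print(greater_than(u, uint32(0)))  # True: both sides unsigned
```

Nothing about u changes between the two calls. The only difference is how the zero is typed, which is the point of the cast.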

16 Comments

I'd love to see compiler warnings in RB. But if you do decide to implement them, I wish you the best of luck with fending off the 1000 stupid ideas that will come up suggesting warnings for trivial cases that can't even be properly defined ;)

@Steve -- It's not like we don't already get a lot of... different... ideas for compiler features as it stands. ;-) Warnings won't be much different when they happen. People will suggest some feasible and good ideas, and people will also suggest some infeasible ideas. Take em as they come. :-)

If I had had a say in this when unsigned int types were added, I'd have made this rule:

If there's the chance of ambiguity, the compiler would issue an error. The user would then have to add a typecast to his code to remove the ambiguity.

To me, this need for a code change would be better than the current situation, because then people who upgrade older code are at least poked at the cases where code that used to work now may cause trouble, instead of silently possibly doing things that break code and are then very hard to detect.

In principle, I'd always prefer that RB poke us at newly introduced possible problems this way, instead of requiring us to learn all the new possible pitfalls and keep them in mind, which is especially difficult if we adopt foreign or old code where we're not aware of its inner workings.

@Thomas -- that's a noble stance to take on the problem, and a reasonable idea in theory. Unfortunately, it's impossible to accomplish in a practical sense. It's a fact that anything which causes new errors is a reason that people don't upgrade. If it were just a single error or two, it might not be a big deal to most people, and something we could consider -- but this isn't an error or two. In my medium-sized test case, it's 500+ errors!

If you don't believe me, check out what removing the old-style constructors did during the betas of 2008r2. That was turning a deprecated warning situation into an error, and caused quite a stir!

What I would really like to see is a system analogous to C's literal suffixes. Although I find the C syntax more than a little yucky, it would be a great help to have a way to explicitly define integer types, and also single precision floating-point constants. I often see huge amounts of unavoidable double-to-float-to-double conversions in compiled code that makes Shark.app sad :-(

@Frank -- in terms of variables, the type is explicit in the declarations. In terms of constants, I agree that it would be interesting to consider allowing users to be explicit about the types. However, in terms of literals, there's nothing more to be done really. You can be explicit about literal types by typecasting: UInt32( 0 ) will cause 0 to be treated as an unsigned 32-bit integer, for instance.

But the second you use a decimal you're forced into double precision - Single( 1.0 ) doesn't work (nor should it be expected to work since the 1.0 literal is 8 bytes). So casting doesn't help with floats, where in C you can simply use 1.0f. Similarly, you can't say UInt64( 0 ) in Rb where in C code you can define long long literals explicitly. The compiler also seems to use double precision for all floating point math, even if all the variables involved are Singles - but I suppose that's a separate issue.

@Frank -- that's an excellent point, in C you can use 1.0f to force it to a single. And you can only cast, not convert in RB currently. That being said, I think it's not unreasonable to see a CType operator in RB that allows you to convert instead of just cast. It's a bit more verbose than in C, but that's just the way BASIC rolls sometimes. Imagine something like CType( 0, UInt64 ) as a way to convert the literal 0 to be UInt64. Or CType( 1.0, Single ) to get a Single instead of a Double.

What happens if u is a UInt32 with the value 0xFFFFFFFF and you compare it to the constant 0? If that doesn't return true for u > 0 then it seems weird to me. Does that mean that half of all unsigned values are considered less than 0?

Correct, it does not return true. That's because 0 is a signed 32-bit integer literal, and we convert the UInt32 to match for backwards compatibility reasons. So you're right, half of all unsigned values are considered less than zero when you use "0", as weird as that seems. The reason is the sign mismatch. If you're working with UInt, you should use &h0 instead -- then both sides of the comparison are unsigned and there are no problems.

This is the type of problem that led Niklaus Wirth (of Pascal fame) to drop the Cardinal (unsigned integers) data type from the second revision of Modula-2.

(Just a little factoid from the history of compilers.)

@Kirk -- and we all know how wildly popular Modula-2 turned out to be. ;-) :: grins :: sorry, I just couldn't resist.

@ Aaron

Well as a long time Modula-2 user, I'm a big fan. :o) [The free MacMETH Modula-2 compiler written by Wirth et al. was blazing fast on 680x0 Macs. I only just left it behind for good when I went Intel and Classic is no longer supported.]

But my point was that Wirth is a grand poobah of compiler design (having a position at a university where he has spent decades doing nothing but compiler design). I remember reading something by him about the problem. Basically, he and his co-designers were unable to come up with a satisfactory solution and decided that, rather than allow difficult-to-find bugs, the safest thing (Wirth is all about safe computing) was to deprecate unsigned integers altogether. If I remember correctly, the official definition changed Cardinals to be the positive range of Integers. So if Integers are defined as -32768 to +32767 then Cardinals are defined as 0 to 32767 rather than 0 to 65535.

@ Aaron,

Also, RealBasic has finally reached parity with Modula-2 as far as popularity is concerned. According to http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html


"The following list of languages denotes #51 to #100. Since the differences are relatively small, the programming languages are only listed (in alphabetical order).

"ABC, Algol, Alpha, APL, Applescript, AspectJ, Beta, Boo, cg, Ch, Clean, Csh, cT, Curl, DC, Dylan, Eiffel, EXEC, Factor, Felix, Focus, IDL, Inform, Limbo, MAD, Magic, Maple, Mathematica, Modula-2, MOO, MUMPS, Oberon, Occam, Oz, Pike, PILOT, Postscript, PowerBuilder, Progress, Q, R, REALbasic, Rebol, S-lang, SIGNAL, Simula, SPSS, VBScript, VHDL, XSLT."

@Kirk -- I had seen that the other day, yay for us! ;-) And the Modula-2 approach is interesting but really not practical. Can you imagine the incredible amount of confusion that would occur when trying to interact with the OS via declares? Or wire protocols? File formats? There's just no way to avoid true unsigned integers in a practical sense if you want your applications to interact with the outside world. But it's an interesting theory, to be sure. :-)

@ Aaron,

I wasn't trying to say that unsigned integers were altogether bad and RB should get rid of them.

I was just underscoring your point that you must take extra care when using them. Wirth, in striving to make the "perfect" language, decided uints were too dangerous. C et al., geared more toward "getting it done," use them extensively. I think the best approach is to be aware of the possible problems and use them sparingly and carefully. I personally did a little happy dance when RB intro'd int8, uint8, and the rest because, as you point out, they let us get things done with the real world.


Disclaimer

I'm currently an employee of REAL Software. My blog is mine. The opinions represented in this blog are mine as well and may not represent my employer's opinions. All original material is copyrighted and property of the author.

REALbasic® is a registered trademark of REAL Software, Inc. REAL SQL Server™ and Lingua™ are pending trademarks of REAL Software, Inc. All rights reserved.