The little bug that wouldn't...


So there's been a bug in REALbasic that I swear I've fixed a half dozen times. It's the UDPSocket bind bug, where calling .Connect on the socket would cause it to report an error #42 on Windows 98. I don't even remember when I first started seeing reports about this bug, but it was a long time ago.

Well, it'd get fixed, then unfixed, then fixed, then not rolled in, then rolled in, then not fixed but only for some people, etc.

I finally solved the mystery once and for all. Here's the tale in its entirety....

We use the same code base under the hood for our sockets on all platforms except Mac Classic. We do this by providing a shim layer that resolves function pointers at runtime. So on OS X, we load the BSD socket functions via (essentially) soft declares, and on Linux, we link straight against the imported functions, and on Windows, we load them up from WinSock. However, there are two different versions of WinSock being used: WinSock and ws2_32, and here's where the fun begins.

The error is reported by a function called setsockopt, which sets socket options such as the TTL and loopback properties. It's a common function, set up in a general fashion so you can use it to set any number of options. It takes the socket's file descriptor, a protocol level, an option constant, a pointer to the value and the value's size. It's the option constants that started this whole mess.
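For anyone who hasn't called it directly, here's a minimal sketch of what a setsockopt/getsockopt round trip looks like. It's written against the plain BSD API (POSIX headers), not against our shim, and roundtrip_ttl is a name I made up for the example.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sets the IP time-to-live on a fresh UDP socket, then reads it back.
   Returns the TTL the stack reports, or -1 if any call fails.
   (roundtrip_ttl is a hypothetical helper, not part of any library.) */
int roundtrip_ttl(int ttl)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int reported = -1;
    socklen_t len = sizeof reported;

    /* Arguments: socket, protocol level, option constant,
       pointer to the value, size of the value. */
    if (setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof ttl) != 0 ||
        getsockopt(fd, IPPROTO_IP, IP_TTL, &reported, &len) != 0)
        reported = -1;

    close(fd);
    return reported;
}
```

The only part that matters for this story is the third argument: IP_TTL is just an integer, and the library on the other end decides what that integer means.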

When writing a normal networking application, you statically link against the library and use the proper header files to define the constants you're going to use, so none of this is an issue. However, when you're dynamically loading the function pointers, WinSock suddenly becomes a steaming pile of crap. Microsoft had the brilliant idea to change the value of the constants depending on which library you link against. So if you link against WinSock, you're using one set of constant values, but if you link against ws2_32, you're using an incompatible set of constant values. Both sets share the same names. So if you use one library's constants with the other library, you get an error #42 because the socket option doesn't exist, or you end up setting a different option with incorrect values. For example, you may think you're setting the TTL for the socket, but instead you're setting the QOS.
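To make the collision concrete, here's a small sketch of the two numbering schemes side by side. The values are the ones I recall from winsock.h and ws2tcpip.h, so check them against your own SDK headers before relying on them. The enum and function names are made up for the example.

```c
/* Logical options a shim might care about (hypothetical names). */
enum ip_option { OPT_TTL, OPT_MULTICAST_TTL, OPT_MULTICAST_LOOP, OPT_COUNT };

/* Which header's numbering the loaded library expects. */
enum winsock_flavor { FLAVOR_WINSOCK1, FLAVOR_WINSOCK2, FLAVOR_COUNT };

/* Same constant names, different numbers. Values as recalled from the
   SDK headers; treat them as an assumption and verify before use. */
static const int option_values[FLAVOR_COUNT][OPT_COUNT] = {
    /* winsock.h:  IP_TTL, IP_MULTICAST_TTL, IP_MULTICAST_LOOP */
    { 7, 3, 4 },
    /* ws2tcpip.h: IP_TTL, IP_MULTICAST_TTL, IP_MULTICAST_LOOP */
    { 4, 10, 11 },
};

/* Translates a logical option into the number a given library expects. */
int option_value(enum winsock_flavor flavor, enum ip_option option)
{
    return option_values[flavor][option];
}
```

With these values, the number that means IP_TTL to ws2_32 means IP_MULTICAST_LOOP to the older library. Which options collide depends on the pair you pick, but the failure mode is the same: a valid-looking call that sets the wrong thing or gets rejected outright.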

So the original bug was that we were using the wrong set of constant values with the library we loaded against. Fine, so I started using the new constant values and the function loaded from ws2_32. But then I realized not every installation has that library, so I had to use the old version. So I started loading [g/s]etsockopt from WinSock and using the old constant values. Bug was fixed. I tested it out on a Windows 98 machine in the office, and it was working. w00t!

Then I started getting reports that the issue was still there. So I went and took a look... and sure enough, in some instances it was failing for me again. Such as when I used the UDP socket on Windows XP. Doh! I fixed it for 98, but broke it for NT. Ok fine...

So then I tried to outsmart the idiotic ideas that Microsoft came up with by loading ws2_32 when it was available and using that, then using the WinSock version as a fall-back. Worked great except that it broke the sockets on 98 again for the fallback case because the constant values were still a problem.

Ok fine! I'll fix it the brute force way. I duplicated the constants -- one set with the old values, the other set with the new values -- and gave them different constant names. Then I loaded the old version on non-NT systems, and the new version on NT systems. Finally, in the code itself where I went to use them, I check to see if you're on an NT system, and if so, we pass in the new constant values, otherwise we pass in the old constant values. Bam! Issue solved. Again! That fix went into 5.5.5fc1 or fc2.
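The brute force approach boils down to keeping the library and its constants glued together so they can't be mismatched. Here's a rough sketch of the idea, not the actual REALbasic source: the struct, the function name and the DLL file names are my own assumptions for illustration, and the constant values are as I recall them from the SDK headers.

```c
#include <stdbool.h>

/* One record per library: where the functions come from, plus the
   option numbers that library expects. (Hypothetical layout.) */
struct socket_backend {
    const char *library;      /* assumed DLL file name */
    int ip_ttl;
    int ip_multicast_ttl;
    int ip_multicast_loop;
};

/* Old values for the original WinSock library. */
static const struct socket_backend winsock1_backend = { "wsock32.dll", 7, 3, 4 };

/* New values for ws2_32. */
static const struct socket_backend winsock2_backend = { "ws2_32.dll", 4, 10, 11 };

/* NT-based systems get the new library and new constants; everything
   else gets the old library and old constants. */
const struct socket_backend *select_backend(bool is_nt)
{
    return is_nt ? &winsock2_backend : &winsock1_backend;
}
```

Callers then ask the selected backend for ip_ttl instead of using a bare IP_TTL, so the platform check happens in exactly one place.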

Then I hear more rumblings about how the issue still isn't solved for some people! What is this all about, I think? Of course it's solved, I'm being explicit!

Oops, forgot to roll back one of the files. The one that loads the functions. So it was fixed in my version, but broken in Dave's version. Easy mistake to make, but oh well. So I roll that file back, and I go to test the internal build to make sure. Now, by this time, we've got a new Windows 98 machine. Clean install, no other apps installed on it.

And it's broken again.

By now I am starting to gnash my teeth and figure "screw Windows 98 users, they don't need UDP." :-P But I'm determined to fix this bug. It's become a battle of wills at this point! So I start stepping through the code in the debugger and I see that everything is happening the way I would expect it to. We load up the old function. We use the old constant value. We call the function and it gives us an error. Why?!?!

So for giggles, I move the execution point to another spot -- one where we're setting a different option. And that one works. Wait a second... why does it work? I go and I check the constant values to see if maybe I messed those up somehow. Nope -- if I messed those up, I'd for sure get the error because I'd be loading an option that doesn't exist. So what could it be?

Then a little red flag goes up in my head. I wrote in the socket read me that on some older versions of Windows, WinSock doesn't allow you to get the loopback option for multicasting. And that's the function that was failing. So I went and looked at the error number again just to make sure I wasn't dreaming.... error #42 means socket option not supported. It was a valid error that I was getting and just not handling! Wow. So I put in some checks for the error case (we just consume the error internally and don't pass it along to you), and now I can say with confidence that the UDP error #42 issue is finally resolved. Assuming the files get rolled back correctly, which I am going to ensure by standing over Dave's shoulder and watching him roll them back myself.
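The check itself is tiny. Here's a sketch of the shape of it, using the POSIX ENOPROTOOPT code as a stand-in for the WinSock equivalent (WSAENOPROTOOPT, which I'm assuming is what surfaces as #42). All of the names below are invented for the example, and old_stack_get only simulates an old stack that refuses the loopback query.

```c
#include <errno.h>

/* Hypothetical option identifiers for this sketch. */
enum { OPT_MULTICAST_LOOP = 1, OPT_TTL = 2 };

/* A "get option" call that returns 0 on success or an errno-style code. */
typedef int (*get_option_fn)(int option, int *value);

/* Simulated old stack: knows the TTL, rejects the loopback query. */
static int old_stack_get(int option, int *value)
{
    if (option == OPT_MULTICAST_LOOP)
        return ENOPROTOOPT;
    if (option == OPT_TTL) {
        *value = 64;
        return 0;
    }
    return EINVAL;
}

/* Reads an option, consuming "option not supported" for the multicast
   loopback query only. Every other error is passed to the caller. */
int read_option(get_option_fn get, int option, int fallback, int *value)
{
    int err = get(option, value);
    if (err == ENOPROTOOPT && option == OPT_MULTICAST_LOOP) {
        *value = fallback;   /* report a default instead of failing */
        return 0;
    }
    return err;
}
```

The important bit is that the error is only swallowed for the one option that's known to be unsupported, so a genuinely wrong constant would still show up as a failure.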

This has been the most annoying bug to fix that I've ever encountered. Each time I thought I had it nailed, and in my own testing I did, but somehow something would go wrong! Yeesh! I'm just glad it's fixed (and hopefully for good this time!).

But if you ever hear me gripe about how much I hate WinSock, this is just one of the reasons it pisses me off. Microsoft deviates from the underlying BSD implementation in just enough places that using this shim can be a royal PITA sometimes. I mean, come on! There is no need to change the value of the constants! None! :: sighs ::

Anyhow, you should see this bug fix in the next 5.5.5fc -- just remember the blood, sweat and tears that went into fixing it! :-P


Disclaimer

I'm currently an employee of REAL Software. My blog is mine. The opinions represented in this blog are mine as well and may not represent my employer's opinions. All original material is copyrighted and property of the author.

REALbasic® is a registered trademark of REAL Software, Inc. REAL SQL Server™ and Lingua™ are pending trademarks of REAL Software, Inc. All rights reserved.