Dangerous Bug with DefineEncoding


Here's my public service announcement for the month: there's a fairly dangerous bug lurking with DefineEncoding in REALbasic. We're currently working to resolve the underlying issue, but as of right now, this bug exists in REALbasic 2007r3 and all prior versions. To understand the issue with DefineEncoding, you have to be familiar with REALbasic strings first.

In REALbasic, the String datatype is used to hold any chunk of data you want. It's a big block of bytes, basically. And this data has an encoding identifier attached to it which tells interested parties how to interpret the block of bytes. However, internally, one of the things the String datatype does (which isn't exposed to the REALbasic user) is null-terminate the block of bytes. We do this because almost every OS API call on every platform expects null-terminated strings. If we didn't null-terminate every string when it's created, then we'd have to do it when calling OS APIs from the framework, or every time a REALbasic function called a declare, or every time a REALbasic function called a plugin. Basically, it would be a *huge* performance hit, which is why we do it the way we do.

So, you've probably realized by now that this dangerous bug has something to do with null-terminating strings, which it does. REALbasic null-terminates every string with a single null byte when allocating it, because at the time the string is allocated, the encoding frequently is not known. And since REALbasic has its roots firmly grounded in Mac Classic and Windows 95, this was a perfectly reasonable action. However, it's still incorrect, as the DefineEncoding operation just says "treat that bucket of bytes which was XXX encoding as being YYY encoding instead." It doesn't alter those bytes in any way -- which includes the null terminator. For encodings which use a multi-byte null terminator, you've got major issues.
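To make the relabeling concrete, here is a minimal sketch in Python (not REALbasic) of a string as a tagged block of bytes. The names `TaggedString`, `allocate`, and `define_encoding` are hypothetical stand-ins for illustration, not the actual framework internals. The only behavior modeled is what is described above: one null byte appended at allocation, and a relabel that leaves the bytes untouched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical model of a string: raw bytes plus an encoding tag."""
    buffer: bytes             # payload followed by the hidden terminator
    length: int               # number of payload bytes (terminator excluded)
    encoding: Optional[str]   # None while the encoding is still unknown


def allocate(payload: bytes) -> TaggedString:
    # Assumption: allocation appends exactly one null byte, because the
    # encoding is not known yet.
    return TaggedString(payload + b"\x00", len(payload), None)


def define_encoding(s: TaggedString, encoding: str) -> TaggedString:
    # Relabel only: the buffer, terminator included, is passed through as-is.
    return TaggedString(s.buffer, s.length, encoding)


s = define_encoding(allocate(b"\x20\xac"), "utf-16-be")
print(s.encoding, s.buffer.hex(" "))  # utf-16-be 20 ac 00
```

The tag now claims a two-byte encoding, but the buffer still ends in a single null byte.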

So let's take a look at a concrete example of the DefineEncoding bug. Given that UTF-16 (which is basically treated as UCS-2 for all intents and purposes) uses two bytes for each character, including the null terminator, look at this example:

dim mb as new MemoryBlock( 2 )
mb.Short( 0 ) = &h01BF
EditField1.Text = DefineEncoding( mb, Encodings.UTF16 )

This code stuffs the Latin letter wynn (ƿ) into a memory block, and then transfers it to a string with the correct encoding. From the REALbasic user's perspective, this is 100% correct code. To understand the bug, we'll have to look at some binary data.

The MemoryBlock contains two bytes: 01 BF

The string contains three bytes (two from the MemoryBlock, and one for the null terminator): 01 BF 00

Now, when you go to pass this "UTF-16" block of bytes off to an OS API function (whether you do it via declares or the RB framework does it doesn't matter), you have a buffer overrun error. The OS API is expecting a null-terminated UTF-16 string, but it's not getting one. There's only *half* a null terminator there! So the API keeps reading past the end of the string's buffer until it happens to find a full two-byte null, and what you wind up with is bad.

Sometimes, you might get lucky and have another null byte directly following the string, but you certainly cannot rely on that.
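The overrun can be simulated without touching real memory. The Python sketch below treats a `bytes` object as a stand-in for the process's memory and scans it the way a consumer of null-terminated UTF-16 would, two bytes at a time. The function name `utf16_scan` and the neighboring bytes are made up for illustration; a real API would simply keep reading whatever happens to follow the string.

```python
def utf16_scan(memory: bytes, start: int = 0) -> int:
    """Count the bytes read before a two-byte null terminator is found."""
    offset = start
    while offset + 2 <= len(memory):
        if memory[offset:offset + 2] == b"\x00\x00":
            return offset - start
        offset += 2
    raise IndexError("no 16-bit terminator inside the simulated memory")


string_buffer = b"\x01\xbf\x00"            # payload plus ONE null byte

# Unlucky: arbitrary neighboring data follows the string in memory.
unlucky = string_buffer + b"\x41\x42\x43\x00\x00\x00"
# Lucky: the very next byte happens to be null.
lucky = string_buffer + b"\x00\x41\x42"

print(utf16_scan(unlucky))  # 6: four bytes past the two-byte payload
print(utf16_scan(lucky))    # 2: stops exactly where it should
```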

While this problem may seem like an insurmountable one, it's not (I assure you). This problem only affects encodings which have a multi-byte null terminator, namely UTF-16, UCS-2, UTF-32 and UCS-4. All the other encodings use a single-byte null terminator (including UTF-8, which is what REALbasic encodes string constants in), which is already taken care of properly. Also, this problem only affects DefineEncoding and not ConvertEncoding. And finally, this problem is resolved for a future version of REALbasic (all strings are null terminated by a four-byte null terminator when they're allocated -- that catches every case, regardless of possible future encodings).
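Why four null bytes are enough for the encodings listed can be checked mechanically. The Python sketch below is an illustration only: the table of code-unit widths and the helper `is_terminated` are assumptions made for the example, and the payloads are arbitrary non-null filler sized to one whole code unit.

```python
# Width in bytes of one code unit (and so of the null terminator).
UNIT_WIDTH = {"UTF-8": 1, "UTF-16": 2, "UCS-2": 2, "UTF-32": 4, "UCS-4": 4}


def is_terminated(payload: bytes, terminator: bytes, width: int) -> bool:
    """True if a full-width null code unit directly follows the payload."""
    buffer = payload + terminator
    end = len(payload)
    return buffer[end:end + width] == b"\x00" * width


for name, width in UNIT_WIDTH.items():
    payload = b"\x01" * width          # one non-null code unit of filler
    one = is_terminated(payload, b"\x00", width)
    four = is_terminated(payload, b"\x00" * 4, width)
    print(f"{name}: one null byte -> {one}, four null bytes -> {four}")
```

With a single null byte only the one-byte case passes; with four null bytes every width in the table passes.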

If you have current code that uses DefineEncoding to a multi-byte encoding, and have no plans to upgrade your version of REALbasic, there is a simple solution for you to follow. Simply append your own null byte (or null character) to the string. For instance, with our code above:

dim mb as new MemoryBlock( 3 ) // One extra byte for null!
mb.Short( 0 ) = &h01BF
EditField1.Text = DefineEncoding( mb, Encodings.UTF16 )

or

dim mb as new MemoryBlock( 2 )
mb.Short( 0 ) = &h01BF
dim s as String = mb
EditField1.Text = DefineEncoding( s + ChrB( 0 ), Encodings.UTF16 )
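In terms of raw bytes, both variants end up in the same place. The Python sketch below models allocation as "payload plus one hidden null byte" (an assumption taken from the description earlier in this post; `allocate` is a hypothetical helper, not a REALbasic function) and shows the buffer before and after the workaround.

```python
def allocate(payload: bytes) -> bytes:
    # Assumption: the framework appends a single null byte on allocation.
    return payload + b"\x00"


wynn = b"\x01\xbf"                     # the two bytes from the example

before = allocate(wynn)                # 01 bf 00
after = allocate(wynn + b"\x00")       # 01 bf 00 00

print(before.hex(" "))
print(after.hex(" "))
print(wynn.decode("utf-16-be"))        # the character, read as big-endian UTF-16
```

Under this model the string's own byte count grows by one (three bytes instead of two), and an encoding with four-byte code units such as UTF-32 would need three extra null bytes rather than one.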

5 Comments

THANK YOU! I've been fighting with encoding issues all week (ConvertEncoding, and BinaryStream.Read(#,Encodings..)). Today, looking at the problem another way, I tried DefineEncoding and abandoned it quickly after finding that the functionality I was looking for was basically in the BinaryStream.Read call.

Is BinaryStream.Read using an implicit CONVERTEncoding or DEFINEEncoding?

Is BinaryStream.Read affected by this bug also? (Should I be appending a null to the string I receive from it to guarantee my later ConvertEncodings will work?)

BinaryStream (and the other functions which accept an optional string encoding) *do* suffer from the same issue. However, the Win32 API that ConvertEncoding calls is one of the few which does not require a null-terminated string (it allows you to specify the string length to convert), and so (at least on Windows), calling ConvertEncoding to *a different encoding than what is set on the source string* will succeed without danger.

Thanks, again.

Hello!

I have a file I am trying to read in to a BinaryStream. The file starts with an "R" and then a null character and then 30 other bytes I am interested in. It appears that the file stops being read in at that first null character. Is this possible? Could it have anything to do with what you are describing here? Is there a workaround?

Thanks,
Ralph

That doesn't sound like a DefineEncoding issue. It sounds like you're reading in with CString, which will cause that exact behavior. However, without a peek at code, it's impossible to say one way or the other.

I'd recommend posting a question to the forums!


Disclaimer

I'm currently an employee of REAL Software. My blog is mine. The opinions represented in this blog are mine as well and may not represent my employer's opinions. All original material is copyrighted and property of the author.

REALbasic® is a registered trademark of REAL Software, Inc. REAL SQL Server™ and Lingua™ are pending trademarks of REAL Software, Inc. All rights reserved.