Before we continue, I’ll clarify the title slightly. I’m only talking about programming things, and by “early enough” I mean they hadn’t been taught in the first two years of the CS&SE degree I’m doing. The concepts also only really apply to C and C++, though since C and C++ are (combined) still among the most commonly used programming languages AND a full understanding of higher level languages can only be obtained through understanding the lower levels, I believe these points are relevant.
(If you disagree, post a comment and we can discuss it. That’s the point of comments. Better yet, write your own blog post and attract readers to yourself. Certainly many of the ByteClub blogs are way too quiet.)
I will post one thing per blog post, partly to keep individual discussions on track and partly because I don’t want to have to write all three before I click “Publish”.
Thing One: Unicode
If I didn’t take an interest in programming outside of university, I would never have heard of Unicode. Blissful ignorance is a great excuse for “beginning” programmers, but the reality is not actually all that complex.
A bit of background. The first major character set (which maps numbers to displayable characters) was 7-bit ASCII (ASCII being the title, 7-bit being relevant information). Hopefully everyone has just worked out that 7 bits allows 128 possible characters. If you consider the English language (which, not surprisingly, the American Standards Association (later ANSI) did) the characters that need to be displayed are upper and lowercase letters A through Z, each digit, and a sprinkling of punctuation. This fits quite well into 7 bits, including some leftovers for control characters.
But then, people in countries where other languages are spoken decided that they deserved to be able to use their own language’s characters. So they started using codes 128-255. Except, these codes were being used in different ways in different places. Greek characters bear little resemblance to French characters, so Greek computers would use a different set of characters in the 128-255 range. Computers would ship with a selection of so-called “code pages”, allowing the user to select their preferred character set.
Now, when sending email from Greece to France (because people sent lots of emails prior to 1991), the code page had to be specified to ensure your ωs weren’t changed into çs. Of course, this whole setup failed miserably. So Unicode was developed.
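To make the code page problem concrete, here is a minimal sketch of one byte meaning two different things. The function names are hypothetical (mine, not from any library), and the Greek table is deliberately partial: it only covers the lowercase letters α to ω as laid out in ISO-8859-7, which I am using as a stand-in for “a Greek code page”.

```cpp
#include <cstdint>

// Hypothetical helper: ISO-8859-1 (Latin-1) maps every byte to the
// code-point with the same numeric value.
std::uint32_t decodeLatin1(unsigned char b)
{
    return b;
}

// Hypothetical helper: a partial ISO-8859-7 (Greek) decoder covering only
// the lowercase letters alpha (0xE1) to omega (0xF9). Anything else is
// reported as U+FFFD, the Unicode replacement character.
std::uint32_t decodeGreekLowercase(unsigned char b)
{
    if (b >= 0xE1 && b <= 0xF9)
        return 0x03B1 + (b - 0xE1);
    return 0xFFFD;
}
```

Feed the byte 0xF9 to both functions and you get U+00F9 (ù) from the first and U+03C9 (ω) from the second. Same byte, different letter, and nothing in the byte itself says which one was meant.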
Rather than assigning each character a single value between 0 and 255, Unicode assigns each character a code-point value. One of a variety of encoding schemes is then used to store this code-point, the two most popular being UTF-8 and UTF-16. UTF-8 is based around byte-sized (8-bit) elements, while UTF-16 is based around word-sized (16-bit) elements - to use the C++ data types, char and (on Windows, where it is 16 bits wide) wchar_t, or unsigned short.
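As a rough illustration of what “an encoding scheme stores a code-point” means, here is a sketch of a UTF-8 encoder. The name encodeUtf8 is my own (it is not a standard library function), and it assumes the usual platforms where a byte is 8 bits.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode a single Unicode code-point as UTF-8.
// Returns an empty string for values that are not valid characters
// (surrogates, or anything above U+10FFFF).
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 7 bits: stored as-is, identical to ASCII.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 11 bits: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out; // reserved for UTF-16 surrogates
        // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, 'A' (U+0041) comes out as one byte, ω (U+03C9) as two, and the euro sign (U+20AC) as three.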
The majority of code-points in common use fit into 16 bits (ie. are no greater than 0xFFFF), making UTF-16 efficient when a wide variety of languages (or even only one, providing it’s not English) will be used. The code-points needed for English fit into 7 bits (ie. are less than 0x80) and take a single byte each in UTF-8, making UTF-8 more efficient for files consisting primarily of unaccented Latin characters. UTF-8 also has the advantage that any ASCII encoded file is already valid UTF-8, which is a large part of why it is the most commonly used encoding.
However, there is a catch. A program that supports UTF-8 can read ASCII files, but not necessarily the other way around. The reason for this is that ASCII is a single-byte character set (SBCS) while UTF-8 is a multi-byte character set (MBCS). Basically, ASCII guarantees that one byte is enough to describe every supportable character. UTF-8 makes no such guarantee. If a character cannot be encoded in 7 bits, it will be split across two bytes. Or three. Or four. (The original design allowed up to six, but the current standard stops at four.) Suddenly everything taught about ASCII strings goes out the window.
Consider the following code:
const char* myString = "Hello"; // any null-terminated string
int count = 0;
for (const char* c = myString; *c != '\0'; ++c)
{
    count++; // one increment per char element
}
Now, what does this snippet do? Think about it.
The obvious answer is that it counts the number of characters in the string. This is incorrect. It actually counts the number of bytes used by the string. The difference may not be obvious if you attended the same classes as I did. The char data type is an 8-bit integer on practically every platform (whether it is signed or unsigned is up to the compiler) - one byte. Each increment counts as one, regardless of the contents of that byte, provided that it isn’t the null terminator.
For a SBCS, like ASCII, each character requires one byte, so the number of bytes will always equal the number of characters. For a MBCS, like UTF-8 (and, with 16-bit elements in place of bytes, UTF-16), a single character may be encoded in more than one element. For an English string this is unlikely. However, throw in some accented characters and counting bytes is an incredibly inaccurate way of counting characters (though it is still relevant in terms of dealing with the encoded string, just not in decoding it back into individual characters).
The work-around? It’s relatively simple, though potentially not nicely portable. Microsoft’s Visual C++ library has macros for traversing a string, and the UTF8-CPP library looks quite similar, though neither is built into the language (no “batteries included”). Operating systems provide suitable functionality and support for character processing, and the C and C++ standard libraries have some locale-dependent MBCS support through functions such as mblen and mbrtowc (I am somewhat doubtful with regards to the std::string class, which only ever deals in individual char elements).
UTF-16 suffers from the same issue, though not to the same extent. UTF-16 is predominantly used on Windows systems and is the standard for that platform. Most, but not all, API calls come in two versions: an 8-bit version that uses the current code page (which is only UTF-8 if the code page has been set that way) and a UTF-16 version. These are misleadingly referred to as ANSI and Unicode. Some more recent functions only support UTF-16. *nix based operating systems usually use UTF-8, and their “wide characters” are actually 32 bits, making direct compatibility difficult.
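The “same issue” in UTF-16 is the surrogate pair: any code-point above 0xFFFF is split across two 16-bit elements. Here is a small sketch of the encoding step, using a hypothetical encodeUtf16 function of my own and the C++11 char16_t type. It assumes the input is a valid code-point and does not check for it.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode a single (assumed valid) Unicode
// code-point as UTF-16.
std::u16string encodeUtf16(std::uint32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        // Fits in one 16-bit element.
        out += static_cast<char16_t>(cp);
    }
    else
    {
        // Split the remaining 20 bits across a high and a low surrogate.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

ω (U+03C9) comes out as a single element, while U+1F600 (an emoji) comes out as the pair 0xD83D 0xDE00. Code that assumes one 16-bit element per character is wrong in exactly the same way as the byte-counting loop earlier, just less often.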
So how should this be taught? A good start is actually Java, C# or VB, since the char data type in these languages is a 16-bit UTF-16 element, which represents an entire character in most cases, rather than an 8-bit integer. When C++ starts to be used for processing strings, the wchar_t type is a much better option. *nix compilers will treat this as a 32-bit integer, capable of encoding every known character, while Windows will treat it as a 16-bit integer, capable of encoding most. (The char type should only ever be used for strings in cases where memory is more important than processing speed, ie. never.) The standard C++ library provides wide versions of the iostream streams such as wcout and wcin, as well as wstring and wstringstream classes which, while probably not perfect, are considerably better than the 8-bit alternatives.
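If you want to see the wchar_t difference on your own machine, a sketch along these lines should do it. Both function names are hypothetical, and the results depend on your compiler and platform, which is rather the point.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: how many bits wchar_t has on this platform.
std::size_t wideCharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: how many wchar_t elements the compiler uses
// to store the single code-point U+1F600 in a wide string literal.
std::size_t wideLengthOfEmoji()
{
    return std::wstring(L"\U0001F600").size();
}
```

Where wchar_t is 32 bits (typical *nix compilers) the emoji occupies one element. Where it is 16 bits (Windows) it occupies two, because the literal is stored as a UTF-16 surrogate pair. So even wstring::size() is only a character count on some platforms.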
Finally, teach operating system specific function calls. There is going to be more than one subject where this comes up, so do one on Windows and one on Linux. Specifically, teach documentation reading, how to find functions in MSDN and whatever-the-Linux-equivalent-to-MSDN-is. Most graduates are going to end up programming for Windows anyway, probably in a language where OS specific calls are impossible. Ideologically teaching everything in an OS that doesn’t cost anything is of no help to students. Teaching a full range of content is.
To summarise (for all those tl;dr people), ASCII is dead and has been for a long long time. To still be teaching that one char always equals one character is inexcusable and is not setting up the next generation of developers to be able to handle a global audience. Higher level languages provide better support built-in, but an understanding of character sets and code pages is as essential to being a well-rounded software developer as an understanding of bits and bytes (ie. very essential, but that’s a different topic (no, not in this series)).
For further reading, I recommend Joel Spolsky’s “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”.