Blunder Dome Sighting

Professor von Clueless in the Blunder Dome


Hangout for experimental confirmation and demonstration of software, computing, and networking. The exercises don't always work out. The professor is a bumbler and the laboratory assistant is a skanky dufus.


Wednesday, March 08, 2006

What We See Is Not What We Get: Character Codes and Computers

I was swimming around in the Unicode 4.0 specification yesterday.  It reminded me of the amazing chains of abstraction involved in representing human-readable characters in computers.  And it got me thinking about the difficulties we create for ourselves, too.  Here's a peculiar practice around computational abstractions that began with early programming languages on binary computers and continues to this day, 50 years later.

{tags: What Computers Know What Programmers Know Unicode code page character encoding conceptual integrity orcmid}

Disguising Characters as Integers

It is important to understand that we don’t really store human-readable characters in computers. 

For the most part (and it wasn’t always so), there are also no fixed codes in which characters must be stored.  “Modern” computer hardware and software are not so rigid.   Instead, the same binary storage elements used for all data are also used for (possibly-several) representations of text characters.

The preservation of a particular representation is accomplished by constraints enacted in the software that creates, stores, and retrieves data intended to be interpreted as particular character representations.  The computer software is orchestrated so the interpretation of textual data is preserved.  When that is successful, you would never notice that anything remarkable is happening.  It is as if the characters are recorded in the computer.  And they’re not.  Mostly.

It's no longer a novelty to hear that characters and text and other data forms important to us are all represented in computers using binary storage elements.  What is surprising to me is that so-called high-level programming languages remain so low-level that the representations of character data, the numeric codes used, are exposed and can be misused (a.k.a. applied in novel, interesting, and also mistaken ways).

As it was in the beginning …

When Fortran was a teen-ager, starring as the first-ever ANSI standard programming language (ANS X3.9-1966, hence Fortran 66), the provision for text data in Fortran programs was known as the Fortran Hollerith data type.   In programs, you could specify input-output formats for lines of recorded text with a Format Statement like

00100 FORMAT(12A6)

This statement defines transfer of Hollerith data from blocks of 6 characters each, 12 blocks per line.  The statement 

      READ(INPUT, 100) I

refers to the earlier Format Statement.  Hollerith data from the first six-character block of the next input record is stored into the Fortran variable, I.

In the days when binary computer words held 36 bits and character codes required six bits, program variable I could hold the codes for exactly six characters.  When binary computer words shrank to 32 bits and character codes grew to eight bits, program variable I could receive Hollerith data for only four characters.  The rule is to transfer the rightmost characters of the block left-justified into the Hollerith data value, with trailing space characters added if needed.  (Got that?  It’s tricky to figure out how to fish out the code for a single character and how to insert different codes.  There were programmers who managed to do all of that.)
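For anyone who never had to do that kind of fishing, here is a rough sketch, in Java purely for illustration, of the shifting and masking it takes to pull one 6-bit code out of a 36-bit word and to put a different code back.  The packing layout and the method names are my own invention, not anything taken from Fortran 66.

public class PackedSixBitCodes {
    // Extract the 6-bit code in character position pos (0 = leftmost of six).
    static int codeAt(long word, int pos) {
        int shift = 6 * (5 - pos);              // the leftmost code sits in the high-order bits
        return (int) ((word >> shift) & 0x3F);
    }

    // Return a copy of word with the code in position pos replaced by newCode.
    static long withCodeAt(long word, int pos, int newCode) {
        int shift = 6 * (5 - pos);
        long cleared = word & ~(0x3FL << shift);
        return cleared | ((long) (newCode & 0x3F) << shift);
    }

    public static void main(String[] args) {
        long word = 0;
        int[] codes = {21, 39, 14, 14, 31, 48};  // six arbitrary 6-bit codes
        for (int i = 0; i < 6; i++) {
            word = withCodeAt(word, i, codes[i]);
        }
        System.out.println(codeAt(word, 2));      // prints 14
    }
}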

Any manipulation of program variable I in the Fortran program is going to treat the data as a Fortran Integer value in the binary code used by the computer.  As the Fortran specification put it,

“There exists no mechanism to associate a symbolic name with the Hollerith data type.  Thus data of this type, other than constants, are identified under the guise of a name of one of the other types.” 

(The grouping into at most 12 chunks of six characters reflects a limitation of some early card readers which could only deliver the first 72 columns of data on 80-column punched cards.  This is why Fortran statements were coded in only the first 72 columns on programming pads.)

Here we have an early lesson in creating portable code:  Use format A1 to input one character at a time and put them into Fortran Integers for possible manipulation “under the guise” of that type.

      INTEGER TEXT(72)
00200 FORMAT(72A1)
C     IN CASE THAT OLD CARD READER IS STILL AROUND

      READ(INPUT, 200) (TEXT(I), I = 1, 72)

What the Fortran specification failed to provide was a way to know the precise correspondence being made between the character codes on the input medium (or the keys on a keyboard) and their guise in terms of another data type in the storage of the computer.  To accomplish anything beyond carefully moving the disguised Hollerith data type from input to output, it is necessary to know very specific details about the compiler and the computer system and the input-output too.  These details are not assured to survive the transfer of the program to another computer, with or without recompilation.

All of this "out of band" information comes into play if one attempts to make use of the disguises as a way to actually do something with the characters being represented.  Another way to put it is to point out that the representation of the Hollerith data type in Fortran programs is underspecified.  And if it's not specified, implementations can (and do) go about it differently.

Not all of the early programming languages were so peculiar in their accommodation of data in the form of text.  The languages that converged as COBOL, the Common Business Oriented Language, were far more character-centric as one might suppose.  Nevertheless, this “guise of another type” device is prominent and alive. 

 Although clumsy in the case of Fortran, it is also an idea that is at the heart of how computers are so useful.  The challenge is how to manage these representations when the computer system is completely oblivious to what we are out to accomplish.

And in the middle …

Introducing Hollerith data in Fortran might seem very quaint, even crude.  Well, it's not much different from what is accomplished in this C Language program fragment:

char text[74];
     /* allow for '\n' and '\0' at the end */
 
fgets(text, sizeof(text), stdin);

The improvement here is that the individual characters are much more accessible as the char datatype elements, text[i].  But the encoded character data is still in the guise of another type.  Although char would appear to signify "character" (and that is the intention behind its introduction), the char datatype is an arithmetic type, a limited integer representation.  It seemed clever at the time, and it can be helpful when writing software (including C Language compilers) intended to run on a different computer for which there isn't much software yet.

I incorporated the C Language fragment above into a small program and compiled it with the Visual C++ Toolkit 2003 command-line compiler.  When I entered  “Orc¢6” in a console session on my Media Center PC, I obtained the following results:

 'O' text[0] = 79
 'r' text[1] = 114
 'c' text[2] = 99
 '¢' text[3] = -101
 '6' text[4] = 54
'\n' text[5] = 10
'\0' text[6] = 0

Not only are the character codes exposed as an arithmetic type, they can be manipulated as such.  Any attempt to manipulate the characters by operating on their char encodings depends on facts that are not fixed in the C Language specification.  There are characters that must have encodings, but what those encodings must be is not specified.  The under-specification is intentional.  As Harbison and Steele remind us in C: A Reference Manual (4th ed.), section 2.1.3:

“A common C programming error is to assume a particular encoding is in use when, in fact, another one holds.”

Preserving any agreement on a character encoding depends on a set of constraints that are mutually honored by programmers, the C Language compiler’s translation of literal characters to char data, Standard C library functions, and the predictable behavior of keyboards (usually), displays, and printers.   

This char guise allows multiple character encodings to be dealt with programmatically, often in the same program.  Such broad flexibility is also a fruitful source of mystifying errors.
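To make the hazard concrete, here is a minimal sketch, in Java rather than C only because the charset machinery is closer at hand there.  The single byte value 0x9B, the -101 that showed up for '¢' above, names different characters under different encodings.  The charset names below are assumptions about what the local Java runtime happens to provide.

import java.nio.charset.Charset;

public class SameByteDifferentCharacter {
    public static void main(String[] args) {
        byte[] data = { (byte) 0x9B };   // the byte behind the -101 seen above

        // Charset.forName fails at run time if a name is not supported here.
        System.out.println(new String(data, Charset.forName("IBM437")));        // '¢' under the old PC console code page
        System.out.println(new String(data, Charset.forName("windows-1252")));  // '›' under the common Windows code page
    }
}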

When a program is moved to a different compiler-computer combination, all bets are off with regard to assumptions about encodings that the programmer made. 

Since the programs themselves are encoded on electronic media, there may even be some difficulty having the source program appear the same to two different compilers.  It is concern for the interchange of the encoded program text itself that leads some programmers to write, even in comments, something like

     /* allow for '??/n' and '??/0' at the end */

I am confident that most C Language programmers in the United States have no idea what this is about and may never have seen it used.  (The ??/ sequences are trigraphs, a standardized way of spelling the backslash for systems whose character sets don't include one.)

… So shall it be at the end

For all its modernity, the 2005 edition of the Java Language Specification perpetuates char as an arithmetic type (as does C# in its own way).  The values are 16 bits wide, with a range of code values from '\u0000' to '\uFFFF', i.e., arithmetic values 0 to 65535.

Although the Java char data are intended for carrying single Unicode code units, it is not precise to say that the Java char is Unicode:  char is a 16-bit Java IntegralType.  Furthermore, Unicode (as of versions 3.0 and  4.0) requires 21 bits to cover the complete code space.  16 bits are insufficient to accommodate the largest codes, extending to <10FFFF> in Unicode parlance.
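A minimal sketch of that shortfall, using the MUSICAL SYMBOL G CLEF at code point U+1D11E, beyond <FFFF>; the class name is mine:

public class TwentyOneBits {
    public static void main(String[] args) {
        int gClef = 0x1D11E;                                  // a code point beyond <FFFF>
        System.out.println((int) Character.MAX_VALUE);        // 65535, the ceiling of one char
        System.out.println(Character.toChars(gClef).length);  // 2: this one code point needs two char values
    }
}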

It is also asserted that instances of the java.lang.String class represent sequences of Unicode characters.  What String instances actually represent are sequences of char values and, just as with C Language, there is no hard-wired assurance that this has anything to do with Unicode.  

The designers of Java have arranged for there to be high confidence that Unicode is involved, even though the machinery doesn't guarantee it.  There are many Unicode-specific features in the Java language, starting with the support of literal values for char and for String instances.  The Java API (that is, the standard library) is designed to produce char values and String instances using Unicode.  Many standard operations on char values and String instances are predicated on Unicode encoding forms being used.  And the electronic text of Java language source programs is always interpreted as a Unicode encoding (after suitable translation, if known to be required).

These provisions, along with standard support for obtaining String instances from input sources, tend to ensure that Unicode encodings are typically found in Java char values and String instances. 
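For example, one routine way to obtain String instances from an input source is to name the encoding explicitly instead of trusting the platform default.  A small sketch, with nothing in it beyond the standard java.io machinery:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadWithDeclaredEncoding {
    public static void main(String[] args) throws IOException {
        // The declared encoding is one of the constraints that must be honored
        // end to end for the characters-in, characters-out illusion to hold.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, "UTF-8"));
        String line = in.readLine();
        System.out.println(line == null ? "(no input)" : line);
    }
}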

Java’s improvement to char is the reliance on a defined, default encoding that is supported in all facilities of the language.  The benefit of String instances is dramatic for their easy use in the inputting, examining, combining, and outputting of char sequences as representations of character sequences.  The encoding and the library-supported practices are definitely not underspecified.

Of course, the char type is useful for coded data wherever a range of values from 0 to 65535 is sufficient.  String instances are available as containers for sequences of those, whatever they are intended to represent in the application of the Java program.  It is wise to announce such custom arrangements very loudly lest they be misunderstood by programmers later called upon to maintain or modify the program.  It is interesting that the programmer still has complete license to do this (absent effective management intervention, misguided or not) and that we speak of Java as a strongly-typed language.

Because Unicode code points can require up to 21 bits, Unicode characters with codes greater than <FFFF> are encoded in two consecutive char positions of a String.  The encoding is such that neither member of the pair can be confused with the encoding of a valid code point less than <FFFF>.  Consequently, the number of Unicode characters coded in a String or other sequence of char values is not always the same as the number of char values, a wonderful source of programming errors.  To deal with this possibility, Java API version 1.5 introduces new operations, including

String.codePointAt(index)

 for returning the code point number found in position index (and index+1 if needed and available) as a single Unicode character encoding.  The Unicode code point is returned in the guise of a 32-bit Java int value, naturally.  Properly advancing the index is left as an exercise for the programmer.
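Left as an exercise, perhaps, but here is one common way to do it, a sketch using the 1.5 additions String.codePointAt and Character.charCount; the sample text and the names are mine:

public class CodePointWalk {
    public static void main(String[] args) {
        String s = "Orc\u00A26\uD834\uDD1E";    // ordinary characters plus one surrogate pair
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);           // the full code point, in the guise of an int
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);        // advance by 1 or 2 char positions as needed
        }
    }
}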

Disguising Integers as Binary Words

At this point, you may not be too surprised to see me point out that there are other common datatypes that are actually represented in the guise of another type.  For starters, each Java Language IntegralType represents a limited range of integers under the guise of another type: a fixed-length sequence of bits.  That's true for every one of them: byte, int, short, long, and char.  You might also observe that there are no Java Language primitive types that correspond to those unadorned critters.  At some level, perhaps every computer programmer knows that.  But we often don't behave as if we know it, and we are not often taught programming in a way that suggests the instructor knows it either.  It is some kind of empty factoid that survives as a trivia question.
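Java will at least let us peek at the bit sequences behind the guise.  A minimal sketch; the sample values are arbitrary:

public class UnderTheGuise {
    public static void main(String[] args) {
        // Each of these is a fixed-length bit sequence underneath; the library
        // calls below merely render those bits as text.
        System.out.println(Integer.toBinaryString(-1));        // 32 one-bits: two's-complement -1
        System.out.println(Integer.toBinaryString('\u00A2'));  // 10100010: the char widened to int
        System.out.println(Float.floatToRawIntBits(1.0f));     // 1065353216: the bits of float 1.0 read as an int
    }
}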

For each one of these datatypes, the guise breaks down and the Java Language (and C Language and Fortran Language) are no help at all.  For example, in Java,

java.lang.Integer.MAX_VALUE + 1 
and  java.lang.Integer.MIN_VALUE

are the same Java int value. 
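A two-line program confirms it; a minimal sketch:

public class SilentWraparound {
    public static void main(String[] args) {
        int big = Integer.MAX_VALUE;
        System.out.println(big + 1);                        // -2147483648, with no complaint
        System.out.println(big + 1 == Integer.MIN_VALUE);   // true: the guise has quietly broken down
    }
}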

Experienced Java (Fortran, C, and C++) programmers are happy to point out that the IntegralType representations are only faithful in a confined range and that programmers are supposed to deal with it.  Stepping out of the supported range will silently produce incorrect results.  It is up to the programmer to ensure that calculations always stay within the range of faithful representation and provide some sort of exception behavior when that can’t be assured.  (The C# language allows the programmer to select where the preservation of arithmetic representation should be checked for, something that the COBOL and Ada languages have had all along.)

I also find it valuable that the C family of languages uses contractions that are useful tip-offs that these datatypes are not anything familiar to us: int, char, float, and double are clue-full.  Technically, java.lang.Integer also gets a pass, but the technicality is easy to overlook, especially when we just see Integer.  Whether by accident or design, those languages escaped the conceits of integer and real that can be traced from Fortran through ALGOL 60 to Pascal.  It also seems that we quickly learn to identify them with the mathematical abstractions, to the extent those are understood, anyhow.

How We Get What We Get

We could wonder why such low-level representation failure is tolerated as automatic behavior with so-called high-level programming languages.  That’s not the main point of today’s sermon.

The lesson of this long explanation is to emphasize how digital computers and the languages we program them with only accomplish representation of useful elements in the guise of some other fundamental datatypes.  In contemporary digital computers, those fundamental datatypes are actually just fixed sequences of bits and nothing more.  Everything else, including practically everything that matters to you and me, is accomplished by an organized guise involving fundamental datatypes.

Well, is that it?  Or are even the sequences of bits a digital illusion maintained by yet another guise?  

Yes and No. 

It is true that there really aren't any bits in the computer in whatever way we think of them when programming.  Our conception is an abstraction, but it is an abstraction that is faithfully maintained in the design and operation of the computer.  So we don't have to worry that it's not the physical reality, any more than we have to worry about the material objects we experience being mostly immaterial in their subatomic composition.

The guise of fundamental datatypes is maintained so perfectly as a result of brilliant engineering and reliable manufacture that for all intents and purposes fundamental bit-sequence datatypes can be taken as actually manifest in the workings of the computer. 

There’s nothing else to be found underneath.  The computer could be defective, but that means it just operates incorrectly or fails completely.  No sub-fundamental datatypes show through in the operation of computer programs.

Any other datatype is accomplished by some sort of simulation using fundamental datatypes.  There may be operations designed into the computer that favor one simulation over another.  There may be software, including for programming languages, that we can use to provide and maintain the simulation for us.  At bottom, it is all done with the fundamental datatypes and operations the computer provides on them.
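As a toy illustration of such a simulation, consider money amounts carried in the guise of a Java long counting cents.  Nothing in the language knows or enforces the convention; the names and the convention itself are mine, and the meaning survives only as long as every use honors it.

public class MoneyAsLong {
    static long dollars(long d, long cents) { return d * 100 + cents; }

    static String format(long amount) {
        return String.format("$%d.%02d", amount / 100, amount % 100);
    }

    public static void main(String[] args) {
        long price = dollars(19, 95);
        long tax   = dollars(1, 65);
        System.out.println(format(price + tax));   // $21.60
    }
}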

It is one of the marvels of computer software that we can erect simulation upon simulation and achieve extremely useful, even entertaining and evocative results.  When there is a breakdown in the fidelity of our simulations, perhaps due to oversight, we may gain a glimpse beyond the facade, experienced as inscrutable behavior and startling results:  Someone has provided us with a leaking abstraction.  And when an abstraction shears, the breakdown is often ugly and sometimes costly.

Considering how much of what software developers do is devoted to erecting usefully-simulated abstractions in this way, I do find it disappointing that:

  • Some of the most common and heavily-used abstractions delivered with the aid of programming tools are leaky by default if not by design.
        
  • There seems to be more attention and energy expended in overlooking and working around the defects in our leaking simulations than in implementing them properly.
        
  • None of this is a great foundation for training and development of newcomers in an appreciation of the care involved in designing and achieving reliable manifestations of abstractions for useful, important, and sometimes life-critical purposes. 

I confess that I have removed important context in expressing my disappointment.  There are practical factors.  My concern is that the perpetuation of certain practices is now simply habit, long past the time when the practical considerations were crucial.

How We Get What We See

The coherent preservation of such a simple thing as a visible character from input entry onto digital media and across space and time to appear in glowing color on the surface of a display is all because of the coordinated behavior of software running on numerous computers.  The computer is practically no help in the matter.  

This miraculous preservation of human communication over digital threads is accomplished by the painstaking care of engineers who craft the transitions between the perceptual and the bits and by software developers who shepherd the bits-in to bits-out by programming.  It is in no way a small matter. 


What launched me toward this topic was my search for a simple statement around the absence of any provision for Unicode in the current Open Document Management API (ODMA 2.0).  Serious under-specification of character-set handling in the ODMA specification left me to wonder just what character-sets and encodings are permissible in ODMA.  I needed some way to differentiate the possible default encodings (code pages) versus the rather different Unicode encoding forms (and full range of code pages) on the native Windows platform.  I fear that I don’t have the full story.  There are some experiments to be done before I am confident about the encodings that can be encountered using the ODMA API, and how that variety can be accommodated in implementations on both sides of the API.  I’m also concerned that implementors have not been paying attention to this.  I wonder if there are tests that can be used to survey for the degree of alignment between the specification and reality.

My consolation is that I was able to mine articles in the blogs of Michael Kaplan and Raymond Chen.  That is always rewarding.  In the course of my investigations I happened to recall the great Hollerith datatype guising, and here we are.

 