Tuesday, October 11, 2011

UTF-8 encoded files w/ BOM marker - Apache, PHP

Although it is generally known that Unicode support within Apache and PHP is limited, I would like to extend a warning about UTF-8 encoded files. As most of you know, these are the two basic encodings you will meet in text files:
  • UTF-16 = 16-bit code units (most characters take one unit, rarer ones take two)
    • BOM marker (16 bits) at the beginning indicates endianness
    • BOM marker is strongly suggested, if not mandatory
  • UTF-8 = 8-bit code units; ASCII characters stay a single byte and everything else becomes a multi-byte sequence, so it is backwards compatible with ASCII
    • BOM marker is NOT necessary NOR desired
    • Despite this, Windows and many Windows editors add a BOM to UTF-8 files to identify them as such.
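To make the list above concrete, here is a minimal Python sketch that checks which BOM, if any, a chunk of bytes starts with. The function name `detect_bom` is my own for illustration, and it only covers the UTF-8 and UTF-16 marks discussed here (UTF-32 is deliberately ignored).

```python
import codecs

# Byte order marks for the encodings discussed in this post.
# UTF-32 is not handled; this is only an illustration.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
)


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length) if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

Reading the first few bytes of a file in binary mode and passing them to this function is enough to tell whether an editor slipped a BOM in.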
So, if you're adding some Japanese to your web site and have to save as UTF-8, be careful! If under Windows, be *sure* to omit that BOM marker, or else Apache is going to emit the 3-byte BOM marker (EF BB BF) verbatim. It won't be visible, but it may affect the display of the page (a blank line at the top).
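If a file has already been saved with the mark, it can be removed after the fact. The following is a rough sketch rather than a polished tool: `strip_utf8_bom` is a hypothetical helper name, it rewrites the file in place, and it assumes the file is small enough to read into memory. The UTF-8 BOM is the three bytes EF BB BF, which Python exposes as `codecs.BOM_UTF8`.

```python
import codecs


def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file at path.

    Returns True if a BOM was found and removed, False if the file was left alone.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the 3-byte mark.
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Run it on a copy first if the file matters; the rest of the content is written back byte for byte.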

Sure, it seems like it would be a smart move for Apache to skip that BOM marker, if found. After all, it is not likely to naturally occur, ever. However, the Unicode standard neither requires nor recommends a BOM for UTF-8, so the file is the odd one out here, and we know how the F/OSS community is about standards .. and they are right. Without standards, chaos ensues. That said, we are where we are, so pragmatism and compromise must sometimes be the order of the day.

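What is UTF? Unicode Transformation Format. What is Unicode? It is what was created when it became clear there were more characters in the world's writing systems than ASCII (7 bits, normally stored in 1 byte) could handle. Early versions of Unicode assumed 2 bytes (16 bits) per character; today code points run up to U+10FFFF, which is why UTF-16 sometimes needs two 16-bit units for one character and UTF-8 uses one to four bytes. It has been adopted by all major OSes.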
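As a quick illustration of how many bytes a character takes in each encoding, here is a small Python sketch. The helper name `encoded_lengths` and the three sample characters are my own picks; `utf-16-le` is used only so that Python does not prepend a BOM to the encoded bytes.

```python
def encoded_lengths(ch: str):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" encodes without a BOM, so only the character itself is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# A Latin letter, a Japanese hiragana, and a character outside the 16-bit range.
for ch in ("A", "\u3042", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
# prints:
# U+0041 (1, 2)
# U+3042 (3, 2)
# U+1F600 (4, 4)
```

Note that the Japanese character is actually smaller in UTF-16 than in UTF-8, while plain ASCII is half the size in UTF-8.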

----
Now, if you found that useful, go check out some of my software (it's free):
Process Lasso - The ultimate Windows process priority optimization software
