A coder's home for Marc "Foddex" Oude Kotte, who used to be located in Enschede, The Netherlands, but now in Stockholm, Sweden!
foddex.net

ASCII extended character set hell

Originally posted at Fri 19-10-2012 16:40:53, in the nerd stuff category.

The problem

I recently came across a massive issue with "simple" ASCII-encoded text. A CSV file written by Excel containing the Dutch word "categorieën" (categories) was not being processed correctly. By "processed" I mean "converted to UTF-8" using PHP. Both iconv() and mb_convert_encoding() removed the ë completely! I was pretty flabbergasted and decided to do some googling. I found two sites that both claimed to show the correct extended ASCII set, namely (A) ascii-code.com and (B) ascii.nl. Look up the entry for the ë in both tables, and you'll notice that (A) says it's 235, and (B) says it's 137! And of course my data is in variant (B), and both PHP functions for converting encodings assume (A)... Excellent!

// Stand-in for the raw bytes read from the CSV: "Categorieën" with the ë stored as byte 137 (0x89)
$str = "Categorie\x89n";
$encoding = mb_detect_encoding( $str, 'ASCII,ISO-8859-1,Windows-1252,UTF-8' );
$utf8_mb = mb_convert_encoding( $str, 'UTF-8', $encoding );
$utf8_iconv = iconv( $encoding, 'UTF-8', $str );

// Both $utf8_mb and $utf8_iconv now show up as "Categorien",
// the special character is cut out completely!
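
A minimal sketch of what is probably going on at the byte level. It assumes the detection step settled on ISO-8859-1 (not verified here) and that the local iconv build knows the name "CP437", which is not the case everywhere. The variable names are made up for illustration. It shows the two positions of the ë, and that reading byte 137 as ISO-8859-1 produces U+0089, an invisible control character, which would look exactly like the ë being cut out:

```php
<?php
// "ë" written as explicit UTF-8 bytes, so this file's own encoding does not matter.
$e_utf8 = "\xC3\xAB";

// Variant (A): ISO-8859-1 puts the ë at 235 (0xEB).
$latin1_byte = ord(iconv('UTF-8', 'ISO-8859-1', $e_utf8));

// Variant (B): code page 437 puts the ë at 137 (0x89).
$cp437_byte = ord(iconv('UTF-8', 'CP437', $e_utf8));

// The CSV data: "Categorieën" with the ë stored as byte 137.
$raw = "Categorie\x89n";

// Decoding that byte as ISO-8859-1 gives U+0089, a C1 control character,
// which UTF-8 encodes as the two bytes C2 89. Most viewers render nothing for it.
$wrong = iconv('ISO-8859-1', 'UTF-8', $raw);

printf("ISO-8859-1: %d, CP437: %d\n", $latin1_byte, $cp437_byte);
printf("mis-decoded bytes (hex): %s\n", bin2hex($wrong));
```

Dumping the result with bin2hex() rather than echoing it is the useful habit here: the character is not actually gone, it has just become something that does not print.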

The solution

I did a lot of googling, and finally, finally found a solution! Apparently, variant (A) is ISO-8859-1 (Latin-1), which Windows-1252 shares for the accented letters, and it is what the encoding functions use when "ISO-8859-1" is specified as the "from" encoding. Strictly speaking, (US-)ASCII itself is 7-bit and only defines codes 0 to 127, so neither table is "the" ASCII standard; both are 8-bit extensions of it. Variant (B) is code page 437, the original IBM PC (DOS) extension. If you want to use variant (B) in PHP, you have to use iconv(), and specify "CP437" as the source encoding:
// Stand-in for the raw CSV bytes: "Categorieën" with the ë stored as byte 137 (0x89)
$str = "Categorie\x89n";
$utf8_iconv = iconv( "CP437", 'UTF-8', $str );

// $utf8_iconv now contains "Categorieën"!
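
A slightly more defensive variant, as a sketch only. The function name cp437_to_utf8() is hypothetical (it is not part of PHP), and the code assumes the iconv extension is loaded and that the underlying iconv library recognises "CP437". iconv() returns false when it cannot do the conversion, so the wrapper turns that into an exception instead of letting a broken string travel further:

```php
<?php
/**
 * Hypothetical helper: convert a string of CP437 bytes to UTF-8.
 * Throws instead of silently handing back false.
 */
function cp437_to_utf8(string $bytes): string
{
    $utf8 = iconv('CP437', 'UTF-8', $bytes);
    if ($utf8 === false) {
        // Typically means this iconv build does not know the name "CP437".
        throw new RuntimeException('iconv could not convert from CP437');
    }
    return $utf8;
}

// "Categorieën" as it sits in the CSV: the ë is byte 137 (0x89).
$raw = "Categorie\x89n";
$utf8 = cp437_to_utf8($raw);

echo $utf8, "\n";
```

Every byte value has a meaning in CP437, so this conversion cannot detect that the input was really in some other encoding. It only helps once you already know the data is variant (B).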

-- Foddex



3 comment(s)

On Sun 28-06-2015 16:24 Nibs wrote: Thanks to you, the above link looks much better!
On Sun 28-06-2015 16:25 Nibs wrote: Thanks to you, this link looks much better. Extra karma bonus!
https://concen.org/content/crude-2009
On Sat 01-10-2016 15:13 djjjozsi wrote: thank you :) this article helped a lot! i made a converting program from MSSQL(ascii) into mysql utf-8! regards, jjozsi