I recently ran into a nasty problem with "simple" ASCII-encoded text. A CSV file written by Excel containing the Dutch word "categorieën" (categories) was not being processed correctly. By "processed" I mean converted to UTF-8 using PHP. Both iconv() and mb_convert_encoding() removed the character completely! I was pretty flabbergasted and decided to do some googling.

I found two sites that both claim to show the correct extended ASCII set: (A) ascii-code.com and (B) ascii.nl. Look up the entry for ë in both tables and you'll notice that (A) says it's 235 and (B) says it's 137! The catch is that "extended ASCII" is not a single character set: 235 is where ë sits in ISO-8859-1 and Windows-1252, while 137 is its position in code page 437, the old DOS code page. And of course my data is in variant (B), and both PHP conversion calls end up assuming (A)... Excellent!
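The two positions are easy to check from PHP itself. This is only a quick sketch: the variable names are mine, and it assumes your iconv build knows CP437 (glibc does, other builds may not). The ë is written as explicit UTF-8 bytes so the result does not depend on how the script file is saved.

```php
<?php
// "ë" (U+00EB) as explicit UTF-8 bytes.
$e_utf8 = "\xC3\xAB";

// Convert to each single-byte code page and read the resulting byte value.
$in_cp437  = ord( iconv( 'UTF-8', 'CP437', $e_utf8 ) );
$in_latin1 = ord( iconv( 'UTF-8', 'ISO-8859-1', $e_utf8 ) );
$in_cp1252 = ord( iconv( 'UTF-8', 'CP1252', $e_utf8 ) );

printf( "CP437: %d, ISO-8859-1: %d, Windows-1252: %d\n", $in_cp437, $in_latin1, $in_cp1252 );
// Prints: CP437: 137, ISO-8859-1: 235, Windows-1252: 235
```

So both sites are right, they just document different code pages.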
// Raw bytes as they appear in the CSV file: 0x89 (137) is the ë.
$str = "Categorie\x89n";
$encoding = mb_detect_encoding( $str, 'ASCII,ISO-8859-1,Windows-1252,UTF-8' );
$utf8_mb = mb_convert_encoding( $str, 'UTF-8', $encoding );
$utf8_iconv = iconv( $encoding, 'UTF-8', $str );
// Both $utf8_mb and $utf8_iconv now contain "Categorien",
// the special character is cut out completely!
// Same bytes, but this time telling iconv that the source is code page 437.
$str = "Categorie\x89n";
$utf8_iconv = iconv( "CP437", 'UTF-8', $str );
// $utf8_iconv now contains "Categorieën"!
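If you need this for a whole file, the conversion can be wrapped in a small helper. The sketch below is one possible shape, not a definitive implementation: to_utf8() is a hypothetical name, CP437 as the default source encoding and the semicolon delimiter are assumptions based on my own file, and the sample line is made up. It leaves input alone when it is already valid UTF-8, so the same data is not converted twice.

```php
<?php
/**
 * Hypothetical helper: convert a string from an assumed single-byte
 * encoding to UTF-8. Input that is already valid UTF-8 is returned as-is.
 */
function to_utf8( string $raw, string $assumed_encoding = 'CP437' ): string {
    if ( mb_check_encoding( $raw, 'UTF-8' ) ) {
        return $raw; // plain ASCII or already UTF-8, nothing to do
    }
    $converted = iconv( $assumed_encoding, 'UTF-8', $raw );
    if ( $converted === false ) {
        throw new RuntimeException( "Could not convert from $assumed_encoding" );
    }
    return $converted;
}

// Made-up CSV line with the CP437 byte for ë (0x89).
$line   = "Categorie\x89n;Prijs";
$fields = str_getcsv( to_utf8( $line ), ';', '"', '' );
// $fields[0] is now "Categorieën" in UTF-8.
```

The helper cannot tell CP437 from Windows-1252 on its own; you still have to know (or look up in a hex editor) which code page the file was written in.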