
How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.

Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.

One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose. I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.

Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.

Here's a transcription of another answer I gave to a similar question:

If you apply utf8_encode() to an already UTF8 string it will return a garbled UTF8 output. I made a function that addresses all these issues.

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

    $utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
    $latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

    echo Encoding::fixUTF8("Fédération Camerounaise de Football");
    echo Encoding::fixUTF8("Fédération Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
    echo Encoding::fixUTF8("Fédération Camerounaise de Football");

Will output:

    Fédération Camerounaise de Football
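
For what it's worth, here is a rough Python sketch of the two heuristics discussed above: checking whether some bytes are strictly valid UTF-8 (falling back to Windows-1252 if not), and undoing text that was UTF-8-encoded more than once. The function names is_valid_utf8, to_text and fix_double_encoded are made up for this illustration; they are not part of recode or of the Encoding class. Unlike Encoding::toUTF8(), this sketch treats the whole input as one encoding and does not attempt to handle mixed strings.

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8.

    This is only a heuristic: valid UTF-8 strongly suggests UTF-8,
    but it does not prove what the author intended.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def to_text(data: bytes) -> str:
    """Decode bytes, preferring UTF-8 and falling back to Windows-1252."""
    if is_valid_utf8(data):
        # "utf-8-sig" also drops a leading UTF-8 BOM if one is present.
        return data.decode("utf-8-sig")
    try:
        return data.decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D
        # undefined; Latin-1 maps every byte, so it never fails.
        return data.decode("latin-1")


def fix_double_encoded(text: str, max_rounds: int = 3) -> str:
    """Undo text whose UTF-8 bytes were re-encoded as if they were Latin-1.

    Assumption: the damage came from a Latin-1 -> UTF-8 conversion applied
    to data that was already UTF-8. Stops as soon as a round fails or
    changes nothing.
    """
    for _ in range(max_rounds):
        try:
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not (or no longer) double-encoded
        if candidate == text:
            break  # pure ASCII or already clean
        text = candidate
    return text


if __name__ == "__main__":
    clean = "Fédération Camerounaise de Football"
    # Simulate the damage: treat UTF-8 bytes as Latin-1 and re-encode.
    garbled = clean.encode("utf-8").decode("latin-1")
    print(fix_double_encoded(garbled) == clean)
    print(to_text(clean.encode("cp1252")) == clean)
```

As with everything else here, this is heuristic: a string that legitimately contains a sequence such as "Ã" followed by "©" would be "repaired" by fix_double_encoded even though it was never broken.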
