UTF-8 is a variable-length character encoding standard used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode Transformation Format – 8-bit.

Encoding

The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as in ASCII, so valid ASCII text is also valid UTF-8-encoded Unicode.
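The ASCII compatibility described above can be checked directly. The following is a minimal sketch in Python using only the built-in `str.encode` and `bytes.decode` methods; the sample string is an arbitrary choice for illustration.

```python
# Arbitrary sample text made only of ASCII characters (an illustrative choice).
text = "Hello, UTF-8!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The two encodings produce identical bytes for ASCII-only text.
same_bytes = ascii_bytes == utf8_bytes

# Every byte is below 0x80, i.e. it has the single-byte form 0xxxxxxx.
all_single_byte = all(b < 0x80 for b in utf8_bytes)

# Bytes produced as ASCII can be decoded as UTF-8 without change.
round_trip = ascii_bytes.decode("utf-8")

print(same_bytes, all_single_byte, round_trip == text)
```

The reverse does not hold: UTF-8 text containing any character outside the first 128 is not valid ASCII.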

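Code points (hex)   UTF-8 encoding (binary)                                Bytes
0000~007F           0xxxxxxx                                               1
0080~07FF           110xxxxx 10xxxxxx                                      2
0800~FFFF           1110xxxx 10xxxxxx 10xxxxxx                             3
10000~1FFFFF        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                    4
200000~3FFFFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx           5
4000000~7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx  6

Here x represents the bits of the corresponding Unicode code point, written from the most significant bit to the least significant. The five- and six-byte rows belong to the original design of the encoding; since RFC 3629 (2003), UTF-8 is restricted to code points up to 10FFFF, so a valid sequence is at most four bytes long.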
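The bit layout in the table can be sketched as a small encoder. This is an illustrative Python sketch, not a reference implementation: the function name `encode_code_point` and the `LAYOUTS` list are hypothetical names introduced here. It follows all six rows of the table, although current UTF-8 (RFC 3629) only permits code points up to 10FFFF and forbids the surrogate range D800~DFFF, which this sketch does not check.

```python
# Hypothetical helper table: (largest code point, total bytes, leading-byte prefix),
# one entry per multi-byte row of the table above.
LAYOUTS = [
    (0x7FF, 2, 0b11000000),
    (0xFFFF, 3, 0b11100000),
    (0x1FFFFF, 4, 0b11110000),
    (0x3FFFFFF, 5, 0b11111000),
    (0x7FFFFFFF, 6, 0b11111100),
]


def encode_code_point(cp: int) -> bytes:
    """Encode one code point following the table's bit layout (sketch only)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp <= 0x7F:
        # Single byte of the form 0xxxxxxx, identical to ASCII.
        return bytes([cp])

    for limit, length, prefix in LAYOUTS:
        if cp <= limit:
            break
    else:
        raise ValueError("code point does not fit in six bytes")

    out = []
    # Fill the continuation bytes (10xxxxxx) from the low-order bits upward.
    for _ in range(length - 1):
        out.append(0b10000000 | (cp & 0b00111111))
        cp >>= 6
    # The remaining high-order bits go into the leading byte.
    out.append(prefix | cp)
    return bytes(reversed(out))


# Compare with Python's built-in encoder for a few valid code points.
for sample in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_code_point(sample) == chr(sample).encode("utf-8")

print(encode_code_point(0x20AC).hex(" "))
```

For code points that are valid today, the result matches the built-in `str.encode("utf-8")`; the five- and six-byte branches are exercised only by values beyond 1FFFFF, which a conforming modern decoder rejects.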