Wednesday, July 28, 2021

Difference between UTF-8, UTF-16 and UTF-32 Character Encoding? Example

The main difference between UTF-8, UTF-16, and UTF-32 character encoding is how many bytes they require to represent a character in memory. UTF-8 uses a minimum of one byte, while UTF-16 uses a minimum of 2 bytes. BTW, if a character's code point is greater than 127 (the upper limit of ASCII and of what a single UTF-8 byte can represent directly), then UTF-8 may take 2, 3, or 4 bytes, while UTF-16 will take either two or four bytes. On the other hand, UTF-32 is a fixed-width encoding scheme and always uses 4 bytes to encode a Unicode code point. Now, let's start with what character encoding is and why it's important. Well, character encoding is an essential concept in the process of converting byte streams into characters, which can then be displayed.

There are two things which are important to convert bytes to characters: a character set and an encoding. Since there are so many characters and symbols in the world, a character set is required to support all those characters. A character set is nothing but a list of characters, where each symbol or character is mapped to a numeric value, also known as a code point.

On the other hand, UTF-16, UTF-32, and UTF-8 are encoding schemes, which describe how these values (code points) are mapped to bytes (using different code-unit sizes as a basis: 16 bits for UTF-16, 32 bits for UTF-32, and 8 bits for UTF-8). UTF stands for Unicode Transformation Format, which defines an algorithm to map every Unicode code point to a unique byte sequence.

For example, for the character A, which is LATIN CAPITAL LETTER A, the Unicode code point is U+0041, the UTF-8 encoded byte is 41, the UTF-16 encoding is 00 41, and the Java char literal is '\u0041'. In short, you need a character encoding scheme to interpret a stream of bytes; without knowing the encoding, you cannot display the characters correctly. The Java programming language has extensive support for different charsets and character encodings; the default charset used for byte-to-character conversion depends on the platform (it is UTF-8 from Java 18 onwards), while String itself is internally based on UTF-16.
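To make this concrete, here is a minimal Java sketch (the class name is just illustrative) that prints the code point of 'A' and its UTF-8 and UTF-16 bytes using the standard java.nio.charset.StandardCharsets API:

import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        String a = "A"; // LATIN CAPITAL LETTER A

        // Print the code point in U+XXXX form: U+0041
        System.out.printf("Code point: U+%04X%n", a.codePointAt(0));

        // UTF-8 stores it in a single byte: 41
        printHex("UTF-8 bytes ", a.getBytes(StandardCharsets.UTF_8));

        // UTF-16 (big-endian, no BOM) stores it in two bytes: 00 41
        printHex("UTF-16 bytes", a.getBytes(StandardCharsets.UTF_16BE));
    }

    // Helper to print a byte array as space-separated hex values
    private static void printHex(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ": ");
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim());
    }
}

Running it prints U+0041, the single UTF-8 byte 41, and the two UTF-16 bytes 00 41, matching the values above.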





Difference between UTF-32, UTF-16 and UTF-8 encoding

As I said earlier, UTF-8, UTF-16, and UTF-32 are just a couple of ways to store Unicode code points, i.e. those U+ magic numbers, using 8-bit, 16-bit, or 32-bit code units in the computer's memory. Once a Unicode character is converted into bytes, it can easily be persisted on disk, transferred over the network, and recreated at the other end.

The fundamental difference between UTF-32 and UTF-8/UTF-16 is that the former is a fixed-width encoding scheme, while the latter two are variable-length encodings. BTW, even though both UTF-8 and UTF-16 encode Unicode characters with variable-width encoding, there are some differences between them as well.



1. UTF-8 uses a minimum of one byte to encode a character, while UTF-16 uses a minimum of two bytes.

In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or, in fact, up to 4 bytes. In short, UTF-8 is a variable-length encoding and takes 1 to 4 bytes, depending upon the code point. UTF-16 is also a variable-length character encoding, but it takes either 2 or 4 bytes. On the other hand, UTF-32 always takes a fixed 4 bytes.
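As a quick illustration, here is a hedged Java sketch (the class name and sample characters are my own picks) that prints how many bytes each encoding needs for code points from the different ranges mentioned above; note that UTF-32 has no StandardCharsets constant, so it is looked up by name, which the stock JDK supports:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLengthDemo {
    public static void main(String[] args) {
        // One sample character from each interesting code point range
        String[] samples = {
            "A",              // U+0041, ASCII range (0-127)
            "é",              // U+00E9, just above 127
            "€",              // U+20AC, Basic Multilingual Plane
            "\uD83D\uDE00"    // U+1F600, supplementary plane (surrogate pair in UTF-16)
        };

        // UTF-32 is not in StandardCharsets, but the JDK supports it by name
        Charset utf32 = Charset.forName("UTF-32BE");

        for (String s : samples) {
            System.out.printf("U+%04X -> UTF-8: %d bytes, UTF-16: %d bytes, UTF-32: %d bytes%n",
                    s.codePointAt(0),
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length, // BE variant, so no 2-byte BOM
                    s.getBytes(utf32).length);
        }
    }
}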



2. UTF-8 is compatible with ASCII, while UTF-16 is incompatible with ASCII

UTF-8 has an advantage when ASCII characters dominate the text, because in that case most characters need only one byte. A UTF-8 file containing only ASCII characters has exactly the same bytes as an ASCII file, which means English text looks exactly the same in UTF-8 as it did in ASCII. Given the dominance of ASCII in the past, this was the main reason for the initial acceptance of Unicode and UTF-8.
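To see this compatibility in action, here is a minimal sketch (the class name and sample text are mine) comparing the raw bytes of an ASCII-only string under US-ASCII, UTF-8, and UTF-16:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatibilityDemo {
    public static void main(String[] args) {
        String english = "Hello, World!"; // pure ASCII text

        byte[] ascii = english.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8  = english.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = english.getBytes(StandardCharsets.UTF_16BE);

        // UTF-8 produces exactly the same bytes as ASCII for ASCII-only text ...
        System.out.println("ASCII == UTF-8 bytes?  " + Arrays.equals(ascii, utf8));   // true

        // ... while UTF-16 does not, because every character takes two bytes
        System.out.println("ASCII == UTF-16 bytes? " + Arrays.equals(ascii, utf16));  // false
        System.out.println("ASCII length: " + ascii.length + ", UTF-16 length: " + utf16.length);
    }
}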

Here is an example which shows how different characters are mapped to bytes under different character encoding schemes, e.g. UTF-16, UTF-8, and UTF-32. You can see how different schemes take a different number of bytes to represent the same character.

Difference between UTF-8, UTF-16 and UTF-32 character encoding
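If you want to reproduce such a mapping yourself, here is a small Java sketch (the class name and sample characters are my own choices) that prints the actual hex byte sequences of a BMP character and a supplementary character under all three encodings:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteMappingDemo {
    public static void main(String[] args) {
        // A BMP character (the euro sign) and a supplementary character (an emoji)
        String[] samples = {"€", "\uD83D\uDE00"};

        for (String s : samples) {
            System.out.printf("U+%04X%n", s.codePointAt(0));
            dump("  UTF-8 ", s.getBytes(StandardCharsets.UTF_8));
            dump("  UTF-16", s.getBytes(StandardCharsets.UTF_16BE));
            dump("  UTF-32", s.getBytes(Charset.forName("UTF-32BE")));
        }
    }

    // Print a byte array as space-separated hex values
    private static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ": ");
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim());
    }
}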



Summary

1) UTF-16 is not fixed width; it uses either 2 or 4 bytes per character. Only UTF-32 is fixed-width, and unfortunately it is rarely used in practice. Also worth knowing is that Java Strings are represented internally using UTF-16; earlier versions used UCS-2, which is fixed width (the code sketch after this list shows the difference with a surrogate pair).

2) You might think that because UTF-8 takes fewer bytes for many characters it would always take less memory than UTF-16, but that really depends on what language the string is in. For many non-European scripts, such as Chinese, Japanese, and Korean, UTF-8 requires more memory than UTF-16, because those characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16 (also demonstrated in the sketch after this list).

3) Processing pure ASCII is generally faster than processing a multi-byte encoding scheme, simply because there is less data to process and no variable-length decoding to perform.
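As referenced in points 1 and 2, here is a minimal Java sketch (the class name and sample strings are my own illustrative choices) that shows a surrogate pair taking two char values but counting as a single code point, and compares UTF-8 vs UTF-16 sizes for an English and a Japanese string:

import java.nio.charset.StandardCharsets;

public class SummaryDemo {
    public static void main(String[] args) {
        // Point 1: UTF-16 is not fixed width; a supplementary character needs a surrogate pair
        String clef = "\uD834\uDD1E"; // U+1D11E, MUSICAL SYMBOL G CLEF
        System.out.println("length()         = " + clef.length());                         // 2 char values
        System.out.println("codePointCount() = " + clef.codePointCount(0, clef.length()));  // 1 code point

        // Point 2: which of UTF-8 and UTF-16 is smaller depends on the text
        String english  = "character encoding"; // ASCII-only sample
        String japanese = "文字コード";           // Japanese sample ("character code")

        System.out.println("English : UTF-8 = " + english.getBytes(StandardCharsets.UTF_8).length
                + " bytes, UTF-16 = " + english.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
        System.out.println("Japanese: UTF-8 = " + japanese.getBytes(StandardCharsets.UTF_8).length
                + " bytes, UTF-16 = " + japanese.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
    }
}

For the English sample, UTF-8 uses half as many bytes as UTF-16; for the Japanese sample, the relationship is reversed.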



That's all about Unicode, UTF-8, UTF-32, and UTF-16 character encoding. As we have learned, Unicode is a character set of various symbols, while UTF-8, UTF-16, and UTF-32 are different ways to represent them in byte form. Both UTF-8 and UTF-16 are variable-length encodings, where the number of bytes used depends upon the Unicode code point.

On the other hand, UTF-32 is a fixed-width encoding, where each code point takes 4 bytes. Unicode contains code points for almost all representable graphic symbols in the world, and it supports all major languages and scripts, e.g. English, Japanese, Mandarin, and Devanagari.

Always remember: UTF-32 is a fixed-width encoding and always takes 4 bytes, but UTF-8 and UTF-16 are variable-length encodings, where UTF-8 can take 1 to 4 bytes while UTF-16 will take either 2 or 4 bytes.

11 comments :

Kunal Krishna said...

"BTW, if character's code point is greater than 127, " what is Character's CODE POINT ? plz explain.

Anonymous said...

You said: "Java programming language has extensive support for different charset and character encoding, by default it use UTF-8." Then you said: "Also, worth knowing is that Java Strings are represented using UTF-16 bit characters"
Could you clear this out.

gm said...

One question. You mention the default encoding in Java is UTF-8 but at least Character and String have the default UTF-16 (http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html). Is there a different encoding you were reffering to?

Tx, nice blog

Unknown said...

@Kunal
"Code points are the numbers that are used in coded character set where coded character set represent collection of characters and each character will assign a unique number. This coded character set define range of valid code points. Valid code points for Unicode are U+0000 to U+10FFFF."
http://javarevisited.blogspot.com/2012/01/java-string-codepoint-get-unicode.html

Anonymous said...

Hello One point to note is that, UTF-8 can go until 6 bytes, i hope i am not wrong here. Thanks.

javin paul said...

@gm, Yes, Java String uses UTF-16 internally, but when you convert a byte array to characters, Java uses the platform's default character encoding. It's different in different places, e.g. in Eclipse it could be different from your Linux host.

Anonymous said...

Hello there? What is difference between UTF-16, UTF-16LE and UTF-16BE? Are they same?

Anonymous said...

@Anonymous, they are not the same. UTF-16LE stores the bytes in little-endian order, while UTF-16BE stores them in big-endian order on disk. Since UTF-16 uses a minimum of 2 bytes to represent a character, the order in which those two bytes are stored affects the value that is read back. In big-endian order, the most significant byte is stored first.

vijaypratap said...

(£) This symbol we are taking from Data base, while displaying this values in .jsp page it is fine, but while getting the value in to APIs it is coming as (A^£). We are using chaset =utf-8. Could you please tell me why it is happening and what is solution for it.

Unknown said...

Use utf16

Anonymous said...

A character set is nothing but list of characters, where each symbol or character is mapped to a numeric value, also known as code points.
