
Charsets experiments

/!\ CPU intensive /!\

Comparing non-ASCII characters on legacy charsets
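
One way such a comparison could be made is to decode every high byte (0x80 to 0xFF) under two single-byte charsets and list the bytes where the results differ. This is only a sketch, not the method used by the page: the helper names `decode_high_bytes` and `differing_bytes` are hypothetical, and it assumes charsets that Python's codec registry knows and that map one byte to one character.

```python
def decode_high_bytes(charset):
    """Map each byte 0x80..0xFF to its character in `charset` (None if undefined)."""
    table = {}
    for b in range(0x80, 0x100):
        try:
            table[b] = bytes([b]).decode(charset)
        except UnicodeDecodeError:
            # The charset leaves this byte unassigned (or it is not a
            # complete character on its own).
            table[b] = None
    return table


def differing_bytes(charset_a, charset_b):
    """Return the high bytes that the two charsets decode differently."""
    table_a = decode_high_bytes(charset_a)
    table_b = decode_high_bytes(charset_b)
    return [b for b in range(0x80, 0x100) if table_a[b] != table_b[b]]


if __name__ == "__main__":
    for b in differing_bytes("latin-1", "iso8859-15"):
        print(f"0x{b:02X}")
```

Multi-byte legacy charsets would need whole byte sequences rather than single bytes, so they fall outside this sketch.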

Unicode: all code points (0x0 - 0x10FFFF)

Unicode 6.3: all assigned code points

more info
all assigned code points
all assigned code points (excluding low/high surrogates and private use blocks)
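
The difference between the two lists can be sketched with Python's `unicodedata` module, which reports a general category for every code point: `Cn` marks unassigned code points, `Cs` surrogates, and `Co` private use. The function names below are hypothetical, and the result depends on the Unicode version bundled with the Python interpreter, which is unlikely to be 6.3.

```python
import unicodedata


def is_assigned(cp, exclude_special=False):
    """True if code point `cp` is assigned in this interpreter's Unicode data.

    With exclude_special=True, surrogates (Cs) and private use (Co)
    code points are also treated as not assigned.
    """
    category = unicodedata.category(chr(cp))
    if category == "Cn":  # unassigned, including noncharacters
        return False
    if exclude_special and category in ("Cs", "Co"):
        return False
    return True


def count_assigned(exclude_special=False):
    """Count assigned code points over the whole range 0x0 to 0x10FFFF."""
    return sum(is_assigned(cp, exclude_special) for cp in range(0x110000))


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    print("assigned:", count_assigned())
    print("assigned, no surrogates/private use:", count_assigned(True))
```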

How many bytes are used by each assigned Unicode character in each encoding

By block
By code point (U+0000-U+40000)
By code point (U+40000-U+80000)
By code point (U+80000-U+B0000)
By code point (U+B0000-U+F0000)
By code point (all, txt, 88 MB)
(excluding low/high surrogates and private use blocks)
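
For a single character, the byte count in each encoding can be measured by encoding it and taking the length of the result. The sketch below is an illustration, not the script that produced the files above: the encoding list and the name `encoded_lengths` are assumptions, and `None` stands for "this encoding cannot represent the character".

```python
# Encodings to compare. The "-le" variants are used so that no byte order
# mark is added to the encoded output.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(ch, encodings=ENCODINGS):
    """Return {encoding: byte count} for `ch`, or None where it cannot be encoded."""
    lengths = {}
    for name in encodings:
        try:
            lengths[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            lengths[name] = None
    return lengths


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Lone surrogates are reported as `None` here because Python refuses to encode them in the UTF encodings, which is consistent with leaving the surrogate blocks out of the listing.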

Unicode 8 glyph names

made from this 155 MB file
all (4 MB)

kDefinition values of all Han code points in Unicode 8

txt (1.1 MB)