Data Version: 1.0
This is a collection of Unicode code point frequency data gathered from across the web. Frequencies are provided for individual code points and code point pairs, where each frequency count is the number of web pages that particular code point or pair is found on.
Note: the code points do not need to occur next to each other in the source page to be counted. See the collection methodology section for more information.
These data files are licensed under the W3C Software and Document License. See LICENSE.md.
The frequency data files are encoded with Riegeli. Each record is a serialized protobuf with the following schema: unicode_count.proto
This data set contains the frequencies for pairs of code points. Each record will have exactly two
codepoints fields. Records that list the same code point twice give the frequency of that code point
individually.
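As a minimal sketch of how the same-code-point convention can be consumed, the following assumes the records have already been decoded into plain `(code point, code point, count)` tuples. The tuple layout and the `split_records` helper are hypothetical, chosen only for illustration; they are not part of unicode_count.proto or of any library.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical decoded form of one record:
# (first code point, second code point, page count).
Record = Tuple[int, int, int]


def split_records(
    records: Iterable[Record],
) -> Tuple[Dict[int, int], Dict[FrozenSet[int], int]]:
    """Separate individual code point counts from true pair counts."""
    singles: Dict[int, int] = {}
    pairs: Dict[FrozenSet[int], int] = {}
    for a, b, count in records:
        if a == b:
            # A record listing the same code point twice gives the
            # individual frequency of that code point.
            singles[a] = count
        else:
            # Assumption: a pair is unordered, so key it by a frozenset.
            pairs[frozenset((a, b))] = count
    return singles, pairs


# Made-up counts, for illustration only.
example = [(0x61, 0x61, 120), (0x62, 0x62, 80), (0x61, 0x62, 75)]
singles, pairs = split_records(example)
```

Keying pairs by a `frozenset` means a lookup does not depend on which code point is listed first.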
Frequency data is collected by both language and script. The file name will be either:
- Language_<language code>.riegeli, where <language code> is a BCP 47 tag, or
- Script_<script name>.riegeli.
Some of the larger files are split into multiple shards. Shard file names take the form
filename.riegeli-*-of-*.
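The shard naming pattern can be matched with an ordinary glob. The sketch below is one way to pick the shards of a file out of a directory listing; `find_shards` is a hypothetical helper, and the zero-padded shard numbers in the sample listing are an assumption made for illustration, not something this document specifies.

```python
import fnmatch
import glob
from typing import Iterable, List


def find_shards(base_name: str, file_names: Iterable[str]) -> List[str]:
    """Return the names that look like shards of base_name, sorted."""
    # Escape the base name so only the shard suffix acts as a wildcard.
    pattern = glob.escape(base_name) + "-*-of-*"
    return sorted(
        name for name in file_names if fnmatch.fnmatchcase(name, pattern)
    )


# Hypothetical directory listing; the shard numbering is illustrative.
listing = [
    "Language_ja.riegeli-00001-of-00002",
    "Language_ja.riegeli-00000-of-00002",
    "Language_en.riegeli",
]
shards = find_shards("Language_ja.riegeli", listing)
```

An unsharded file such as `Language_en.riegeli` does not match the pattern, so it has to be opened by its plain name.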
The data/metadata.binpb file contains a binary protobuf with metadata about the frequency data files. This includes:
- A list of all available data files.
- For each file, a list of code points covered by that file.
The schema for this metadata is defined in metadata.proto.
You can regenerate this metadata file using the generate_metadata utility:
bazel run -c opt //:generate_metadata -- --input_dir=$(pwd)/data --output_file=$(pwd)/data/metadata.binpb
The ift-encoder library provides tools and libraries for interacting with these data files:
- freq_data_to_sorted_codepoints: can pull out single code point frequencies and output them in a text format. Example usage:
  bazel run util:freq_data_to_sorted_codepoints -- "Language_ja.riegeli@*" --add_character > japanese-freqs.txt
- ift-encoder util::LoadFrequenciesFromRiegeli: provides a C++ API for loading these files.
These also demonstrate how to use the Riegeli library to parse the files. Both are capable of
handling sharded data files. When loading a sharded file, append @* to the file name, for example
Language_ja.riegeli@*.
You can also find a copy of the data files in this repo hosted under https://www.gstatic.com/fonts/unicode_frequency/v1/.
The list of data files that are present is given by DATA_FILE_LIST.
- Pages from a web search index are first randomly sampled.
  - Note: this means that reported counts are not absolute and should be interpreted relative to other counts within the same file.
- Each selected page is analyzed to determine the language that it is written in. Pages with a low-confidence language detection are discarded.
- Based on the detected language, an associated writing script is selected.
  - Note: this means that for some scripts the counts are influenced by page samples from multiple languages. The most prominent example is Latin, which includes many languages.
- For each unique code point pair on a page, the associated count for that script and language is incremented by 1.
  - Here a code point pair just means that both code points are present somewhere on the page. It does not require that they occur next to each other in the text.
  - Each unique pair is counted only once per page.
- Within a script, code points are filtered to those used in that script, using the definitions in googlefonts/nam-files.
- In addition to the individual CJK scripts, an overall CJK code point frequency count is collected by
combining all of the Chinese, Japanese, and Korean counts. This can be found in
Script_CJK.riegeli.
- Script_emoji.riegeli and Script_symbols.riegeli are based on counts across all scripts.
- fallback.riegeli collects counts, across all scripts, for any code points that are not associated with any other script.
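The per-page counting step described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the actual collection pipeline: it omits sampling, language detection, and script filtering, and `count_pairs` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations_with_replacement
from typing import Iterable


def count_pairs(pages: Iterable[str]) -> Counter:
    """Count, for each code point pair, the number of pages containing both."""
    counts: Counter = Counter()
    for text in pages:
        # Deduplicate so each code point, and hence each pair, is counted
        # at most once per page, wherever it appears in the text.
        code_points = sorted({ord(ch) for ch in text})
        # combinations_with_replacement also yields (c, c), which records
        # the individual frequency of c alongside the true pairs.
        for pair in combinations_with_replacement(code_points, 2):
            counts[pair] += 1
    return counts


# Three tiny made-up "pages".
counts = count_pairs(["abca", "ab", "c"])
```

In this toy input, `a` and `c` are not adjacent on the first page, yet the pair `(0x61, 0x63)` is still counted once, matching the note that adjacency is not required.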