Bulk change 001: Initial import
The history of every entry that comes from the original HanDeDict file starts with this version. To prepare for the initial import, the dictionary file salvaged from the Internet Archive’s Wayback Machine was transformed into the new, extended format. Specifically:
- A random ID was assigned to every entry, and the output file was sorted by entry ID.
- Markup was added to represent the initial version of each entry. The changing user is HanDeDict, which is a placeholder to represent the dictionary’s original authors. The change’s timestamp is May 28, 2011, reflecting the retrieved data file’s “last changed” declaration.
- The version history markup incorporates each entry’s status from the original data file. Unverified entries were marked with (u.E.) in one of their German senses; these entries receive the status Stat-New in the extended format. Verified entries, which did not contain (u.E.), receive the status Stat-Verif.
- Every line in the original data file was checked for syntax, and 49 entries were discarded. Most of these did not contain the same number of Hanzi and Pinyin syllables. A smaller number contained rare Hanzi at Unicode code points above U+FFFF (beyond what fits in a single two-byte code unit), which are not supported by HanDeDict @ Zydeo.
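The syntax and status rules in the list above can be illustrated with a short sketch. This is not the code in Wrk10Prepare.cs (which is C#); it is a minimal Python approximation that assumes the CC-CEDICT line layout `Traditional Simplified [pin1 yin1] /sense/sense/`, assumes that "above the two-byte range" means code points above U+FFFF, and uses hypothetical names (`LINE_RE`, `has_non_bmp`, `classify`). The real checks are likely stricter, for example around punctuation or Latin letters in headwords.

```python
import re

# Assumed CC-CEDICT-style layout: Traditional Simplified [pin1 yin1] /sense/sense/
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.+)/$")


def has_non_bmp(text):
    """Return True if any character lies above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


def classify(line):
    """Return ("discard", reason) or ("keep", status) for one dictionary line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return ("discard", "syntax")
    trad, simp, pinyin, senses = m.groups()
    # Check for unsupported Hanzi first; Python counts each code point as one character.
    if has_non_bmp(trad) or has_non_bmp(simp):
        return ("discard", "non-BMP Hanzi")
    syllables = pinyin.split()
    if not (len(trad) == len(simp) == len(syllables)):
        return ("discard", "Hanzi/Pinyin count mismatch")
    # (u.E.) anywhere in the senses marks the entry as unverified.
    status = "Stat-New" if "(u.E.)" in senses else "Stat-Verif"
    return ("keep", status)


# Sample lines are made up for illustration.
print(classify("學生 学生 [xue2 sheng5] /Student (u.E.) (S)/"))
print(classify("你好 你好 [ni3 hao3] /Hallo/"))
print(classify("你好 你好 [ni3] /Hallo/"))
```

Running the non-BMP check before the length comparison keeps the two discard reasons distinct, which mirrors the two categories of discarded entries described above.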
Wrk10Prepare.cs in ZD.Tool is the script that was used for the initial transformation. To execute:
- Place handedict.u8 in a subfolder named _work under the solution root
- Compile and run ZD.Tool with the --10-prepare argument
Output: x-10-handedict.txt contains the enriched entries in the new format.
WrkExamine.cs in ZD.Tool is a diagnostic script that was executed before the transformation. Its outputs:
- hdd-diag.txt contains entries that are discarded in the transformation.
- hdd-trip.txt contains entries that don’t roundtrip unchanged through the CC-CEDICT parser used throughout the HanDeDict @ Zydeo code base. This file is empty.
- hdd-tags.txt contains all the distinct words that appear within parentheses in the original dictionary file, ordered by frequency. This includes, as a subset, a comprehensive list of the metainformation labels used by HanDeDict’s creators.
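A frequency list like the one in hdd-tags.txt could be produced roughly as follows. This is a hedged Python sketch, not the logic of WrkExamine.cs: it treats the whole content of each pair of parentheses as one tag (the actual tool may split it into individual words), and the names `PAREN_RE` and `count_tags` are hypothetical.

```python
import re
from collections import Counter

# Matches the innermost content of a pair of parentheses.
PAREN_RE = re.compile(r"\(([^()]*)\)")


def count_tags(lines):
    """Return (tag, count) pairs for parenthesized text, most frequent first."""
    counts = Counter()
    for line in lines:
        for raw in PAREN_RE.findall(line):
            tag = raw.strip()
            if tag:  # skip empty parentheses
                counts[tag] += 1
    return counts.most_common()


# Sample lines are made up for illustration.
sample = [
    "房子 房子 [fang2 zi5] /Haus (S)/",
    "走 走 [zou3] /gehen (V)/",
    "樹 树 [shu4] /Baum (S, Bio)/",
    "桌子 桌子 [zhuo1 zi5] /Tisch (S)/",
]
for tag, n in count_tags(sample):
    print(n, tag)
```

Sorting by frequency pushes the recurring metainformation labels to the top of the list, while one-off parenthetical remarks end up at the bottom.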