Format and guidelines
This website provides interactive functions to search, browse and edit HanDeDict. The dictionary itself is a large text file that contains entries in the CC-CEDICT format, plus entry-level metainformation including a unique ID and the entry’s full change history. Technically, all of this content is stored in a database, but the website automatically updates the authoritative text file on a nightly schedule.
When you search or edit HanDeDict through the website, you are never directly in touch with the text-based format. Many of the dictionary’s design decisions, however, are a direct consequence of the format’s constraints. This page covers (1) the CC-CEDICT format at the core of the dictionary; (2) the extensions that accommodate entry history; and (3) the editorial guidelines and conventions layered on top of the format itself.
A CC-CEDICT entry
With the exception of comments, which are marked with a # sign, every non-empty line in the file represents a single dictionary entry. These lines have the following internal structure:
檢索 检索 [jian3 suo3] /Recherche, Suche (S)/recherchieren, suchen (V)/
- This is a line from a plain text file; the colors are only added here for extra clarity.
- The line begins with the headword, which in turn is made up of the traditional variant, simplified variant and Pinyin in square brackets. The three parts are separated by spaces.
- Pinyin syllables indicate tone with a digit between 1 and 4, or 5 for the neutral tone. All Pinyin syllables are separated by spaces.
- The rest of the line lists senses, separated and surrounded by slashes.
- As a logical constraint, every headword must contain the same number of traditional characters, simplified characters, and Pinyin syllables.
- Retroflex 儿 is represented by the separate Pinyin syllable r5 - i.e., it is not appended to the previous syllable.
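The structure described above can be checked mechanically. The following is a minimal sketch in Python, not the website's actual parser: the function name `parse_entry`, the two regular expressions and the returned dictionary keys are assumptions made for illustration. It also assumes that every headword character is a single code point and that Pinyin syllables consist of Latin letters (plus `:` for the CC-CEDICT spelling `u:`) followed by a tone digit.

```python
import re

# Assumed shape of an entry line: traditional, simplified, [pinyin], /senses/
ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")
# Assumed shape of a syllable: letters (or "u:") followed by a tone digit 1-5
SYLLABLE_RE = re.compile(r"^[A-Za-z:]+[1-5]$")


def parse_entry(line):
    """Split one entry line into its parts; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = ENTRY_RE.match(line)
    if match is None:
        raise ValueError(f"not an entry line: {line!r}")
    traditional, simplified, pinyin, senses = match.groups()
    syllables = pinyin.split(" ")
    bad = [s for s in syllables if not SYLLABLE_RE.match(s)]
    if bad:
        raise ValueError(f"malformed Pinyin syllables: {bad}")
    # Logical constraint: same number of characters and syllables
    if not (len(traditional) == len(simplified) == len(syllables)):
        raise ValueError("headword parts differ in length")
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": syllables,
        "senses": senses.split("/"),
    }


entry = parse_entry(
    "檢索 检索 [jian3 suo3] /Recherche, Suche (S)/recherchieren, suchen (V)/"
)
print(entry["pinyin"])
print(entry["senses"])
```

Because retroflex 儿 is written as its own syllable r5, a headword such as 一點兒 一点儿 [yi1 dian3 r5] passes the length check without special handling.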
The same entry, when it appears as a search result, is rendered like this:
In addition to uncommented lines holding the current version of entries, HanDeDict’s data file contains additional information in commented lines. In its initial state the entry above looks like this:
# ID-a54wi5I
# Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei
# 檢索 检索 [jian3 suo3] /Recherche, Suche (u.E.) (S)/recherchieren, suchen (u.E.) (V)/
# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenreinigung
檢索 检索 [jian3 suo3] /Recherche, Suche (S)/recherchieren, suchen (V)/
Extended entries are separated by an empty line in the file; each section starts with a commented line stating the entry’s random unique ID. Every change in the entry’s history is represented by one or two lines.
- The first line, which is always present, starts with Ver and states the version’s timestamp, the user that changed the entry, the entry’s status after the change, and a comment after the > mark.
- If the change altered either the headword or any of the translations, a second line follows with the entry’s content after the change.
- The entry’s status can be either New (Stat-New), Approved (Stat-Verif) or Flagged (Stat-Flagged).
- Some changes affect a large number of entries in the dictionary (e.g., semi-automated cleanup through scripts, or entries imported in bulk from an external file). To avoid cluttering the change history with thousands of related items, such changes appear as a single bulk change in the history. Bulk changes are defined implicitly through the three-digit change ID before the > mark that introduces the comment. The website has a dedicated page for every bulk change, providing additional information and the specific input files and scripts that were involved. Regular, non-bulk changes do not have a three-digit ID before the > mark.
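As an illustration of the version lines described above, here is a hedged Python sketch that reads one extended entry into its ID and change history. The names `VER_RE` and `parse_block` are hypothetical, and the regular expression encodes assumptions that the text does not confirm: that user names contain no spaces, and that a non-bulk change simply leaves the slot before the > mark empty.

```python
import re

# Assumed layout: "# Ver <timestamp> <user> Stat-<status> [<bulk ID>]><comment>"
VER_RE = re.compile(r"^# Ver (\S+) (\S+) Stat-(\w+) (\d{3})?>(.*)$")


def parse_block(lines):
    """Read one extended entry: its ID and the list of versions in its history."""
    entry_id = None
    versions = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# ID-"):
            entry_id = line[len("# ID-"):]
            continue
        match = VER_RE.match(line)
        if match is not None:
            timestamp, user, status, bulk_id, comment = match.groups()
            versions.append({
                "timestamp": timestamp,
                "user": user,
                "status": status,      # New, Verif or Flagged
                "bulk_id": bulk_id,    # None for regular changes
                "comment": comment,
                "content": None,       # filled in if a content line follows
            })
        elif versions:
            # Optional second line: the entry's content after the change
            content = line[2:] if line.startswith("# ") else line
            versions[-1]["content"] = content
    return entry_id, versions


block = [
    "# ID-a54wi5I",
    "# Ver 2011-05-28T01:27:49Z HanDeDict Stat-New 001>Originalversion HanDeDict-Datei",
    "# 檢索 检索 [jian3 suo3] /Recherche, Suche (u.E.) (S)/recherchieren, suchen (u.E.) (V)/",
    "# Ver 2016-10-23T15:32:07Z zydeo-robot Stat-New 002>Datenreinigung",
    "檢索 检索 [jian3 suo3] /Recherche, Suche (S)/recherchieren, suchen (V)/",
]
entry_id, versions = parse_block(block)
for version in versions:
    print(version["timestamp"], version["user"], version["bulk_id"])
```

Keeping `bulk_id` as an optional field mirrors the rule that only bulk changes carry a three-digit ID; a regular change would yield `None` under the assumption stated above.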
Original HanDeDict statuses: The majority of entries in the original HanDeDict file from 2011 are unverified, which the file indicates with (u.E.) within the entry text itself. Unverified entries appear as “New” here; the few verified entries received “Approved” status.
Conventions and guidelines
Tone sandhi is not indicated in headwords. Specifically, 一 and 不 always get first and fourth tone, respectively, even when they are pronounced differently in a given context. Similarly, 3>2 tone sandhi is never indicated; the original third tone is retained in 你好, for instance. This is standard practice in Chinese dictionaries.
Lexicalized changes to the neutral tone, however, are made explicit, so 明白 is ming2 bai5, not ming2 bai2.
Parentheses always indicate meta-information: something that is not to be read literally as a German equivalent of the Chinese headword.
- Text that is in parentheses is not retrieved by search. This is the most important thing to keep in mind as an editor: exclusion from search is the primary purpose of parentheses.
- To indicate that parenthesized text is not to be read literally, it is rendered in italics.
- Parentheses may occur at the start and end of a sense, but not in the middle.
- In particular, avoid constructions that place parenthesized text in the middle of a sense. It is impossible to correctly index and search text like this. There are still several such instances in the dictionary, and it’s an editorial goal to eliminate them.
- The inherited HanDeDict file indicates part-of-speech in parentheses, but neither the origin nor the exact meaning of these labels is clear. (Do they refer to the headword or the German equivalent? Exactly what automated method was used to generate them, and how reliable are they?) In spite of these doubts, the original part-of-speech labels are preserved. However, it is absolutely not a requirement to keep them in entries that you edit, or to indicate part-of-speech in newly added entries.
- The inherited HanDeDict file uses a relatively large set of labels for domain and register, but no guidelines or conventions survive to explain how they were applied. For the time being there is no editorial policy about labels.
- When it is impossible to provide a German equivalent, put the entire paraphrase or explanation within parentheses. This is particularly useful in the case of function words like 的, which have no equivalent “translation” in a different language.
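The parentheses rules above lend themselves to a small check. The sketch below, in Python, shows one way to derive the searchable part of a sense and to flag parentheses in the middle of a sense. It is an illustration under assumptions, not the website's indexing code: the function names are hypothetical, nested parentheses are not handled, and the sample senses other than the first are made up for the example.

```python
import re

# A parenthesized group without nested parentheses (a simplifying assumption)
PAREN_RE = re.compile(r"\([^()]*\)")
LEADING_RE = re.compile(r"^(\s*\([^()]*\))+\s*")
TRAILING_RE = re.compile(r"\s*(\([^()]*\)\s*)+$")


def searchable_text(sense):
    """Return the sense with all parenthesized meta-information removed."""
    return " ".join(PAREN_RE.sub(" ", sense).split())


def has_mid_sense_parentheses(sense):
    """True if parentheses remain after stripping groups at the start and end."""
    core = LEADING_RE.sub("", sense)
    core = TRAILING_RE.sub("", core)
    return "(" in core or ")" in core


for sense in [
    "Recherche, Suche (S)",   # label at the end: allowed
    "(umg.) reden (V)",       # hypothetical sense, labels at both ends: allowed
    "etwas (schnell) tun",    # hypothetical sense, parentheses in the middle
]:
    print(repr(searchable_text(sense)), has_mid_sense_parentheses(sense))
```

A sense that consists entirely of a parenthesized explanation, as recommended for function words, yields an empty searchable text and is not flagged.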
What is a word? HanDeDict includes a lot of entries that are, strictly speaking, not words, such as Chengyu, proper names, fixed expressions or simply frequent collocations. As a matter of principle, it doesn’t matter if something is a “word” or not. If you feel that something stands on its own and would be useful for someone who is trying to make sense of a Chinese text, it belongs in HanDeDict. You should avoid adding the following kinds of entries, though:
- Expressions whose meaning is completely transparent from their parts. No trivial phrases or sentences.
- Trivially modified words. 蓝色 refers to the color blue; no need to include 蓝色的 as if that were an adjective in its own right.
Pinyin words. In HanDeDict, Pinyin is used exclusively for the purpose of providing a syllable-by-syllable phonetic transcription of the headword. The rules of Pinyin orthography for word boundaries are not observed. The main reason for this is that the inherited HanDeDict file does not contain this information. Because the dictionary contains a mixture of words and multi-word expressions, it would be an immense manual effort to add true word segmentation, and the practical benefit for the dictionary’s users would be limited.
Orthographic and phonetic variants. Some headwords can be written with alternative characters (typically, alternative traditional characters). Similarly, many headwords have alternative pronunciations (most frequently, with a different tone, often reflecting regional variation). HanDeDict’s current format cannot encode this information elegantly; the only possibility is to create separate headwords. This may be addressed through an extension of the format in the future.
German orthography. New German orthography, as used in Germany, is preferred. It is not mandatory, however; the website’s search function compensates for the most significant differences to make sure all relevant words are found regardless of orthography.
Meanings and alternatives. HanDeDict doesn’t impose any strict guidelines about separating distinct meanings of a headword. Often, alternative translations for a single meaning are listed within a single sense, separated by commas or semicolons. In other entries, alternative translations for the same meaning appear as separate senses. Both approaches are acceptable. Very often, it would be difficult to decide if something’s a synonym for the same meaning, or a different meaning altogether.
When editing an entry, keep in mind the person using the dictionary, and try to find a form that makes it easy for the user to grasp the word’s meanings, register and usage.