HANDEDICT汉德词典

Bulk change 002: Data cleansing

The original HanDeDict data file has several issues. Some are marginal, such as inconsistently used curly and straight quotation marks. Some are annoying, such as <> and () used inconsistently as parentheses, or double spaces. Some might have been caused by a script error in the original HanDeDict site or in the Wayback Machine, such as &gt (without a semicolon) showing up in various entries. Some are serious but easy to fix: missing spaces, for example, obscure true word boundaries and prevent searches from retrieving German words that are actually in the dictionary. Some are serious but not fixable: the file often contains / as text, which cannot be distinguished from / as the sense separator. Finally, the majority of the entries contain (u.E.), which in HanDeDict @ Zydeo belongs in the extended format, not in the displayed entry text.
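
To make the unfixable case concrete, here is a minimal sketch in C# (top-level statements, C# 9 or later). The entry body is a hypothetical example, not taken from the data file; it only illustrates why a script cannot tell a textual / from a separator.

```csharp
using System;

// Hypothetical entry body: the slash in "km/h" is text, not a sense separator.
string body = "/km/h (Kilometer pro Stunde)/";

// Splitting on '/' is the only mechanical way to find the senses.
string[] senses = body.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);

// Prints two lines, "km" and "h (Kilometer pro Stunde)",
// although the author meant a single sense.
foreach (string sense in senses)
    Console.WriteLine(sense);
```

Nothing in the line itself marks which slash was intended as text, so only a human editor can repair such entries.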

Before import, a second script was executed to fix as many of these issues as possible without human intervention. The script records a new version for every changed entry, with zydeo-robot as the changing user. This user is a placeholder for scripted transformations, so that bulk changes like this one do not show up under the account of a real user who also contributes to the dictionary by hand.

For the technical reader, it is easiest to directly include the relevant code that was executed for every entry:

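// Requires: using System.Text.RegularExpressions;
// <something> with no '>' or '/' inside
private Regex reLtGt = new Regex(@"<([^>\/]+)>");
// [something] with no ']' or '/' inside (square brackets, despite the name)
private Regex reAngled = new Regex(@"\[([^\]\/]+)\]");
// Comma directly followed by a letter
private Regex reCommaSpace = new Regex(@",(\p{L})");
// Opening parenthesis, optional spaces, then everything up to the closing parenthesis
private Regex reParenSpace1 = new Regex(@"\( *([^\)]+)\)");
// Spaces directly before a closing parenthesis
private Regex reParenSpace2 = new Regex(@" +\)");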
private string cleanse(string line)
{
    // Non-breaking spaces and tabs become plain spaces; curly quotes become straight
    line = line.Replace('\u00A0', ' '); // NBSP
    line = line.Replace('\t', ' '); // TAB
    line = line.Replace('“', '"'); // Left curly quote
    line = line.Replace('”', '"'); // Right curly quote
    // Remove "(u.E.)" from entry itself. We put this info into the Status meta field.
    line = line.Replace("(u.E.)", "");
    // Fix incorrect entities
    line = line.Replace("&gt", ">");
    line = line.Replace("&quot", "\"");
    // <something> -> (something)
    line = reLtGt.Replace(line, "($1)");
    // Square [brackets] inside body -> (parentheses); the head keeps its [pinyin]
    int spos = line.IndexOf('/');
    string head = line.Substring(0, spos);
    string body = line.Substring(spos);
    body = reAngled.Replace(body, "($1)");
    line = head + body;
    // No space after comma
    line = reCommaSpace.Replace(line, ", $1");
    // Multiple spaces
    while (true)
    {
        string b = line.Replace("  ", " ");
        if (b == line) break;
        line = b;
    }
    // Spaces inside ( parentheses )
    line = reParenSpace1.Replace(line, "($1)");
    line = reParenSpace2.Replace(line, ")");
    // Trailing / leading spaces in senses
    spos = line.IndexOf('/');
    head = line.Substring(0, spos);
    body = line.Substring(spos);
    body = body.Replace("/ ", "/");
    body = body.Replace(" /", "/");
    line = head + body;
    // Phew
    return line;
}
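
The two steps that split the line at the first / can be tried in isolation. The sketch below (C# 9 or later, top-level statements) is not the original script: FixBody is a hypothetical helper, the sample line is made up, and the guard for lines without a / is an addition that the code above does not have.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as reAngled above: [something] with no ']' or '/' inside.
var reSquare = new Regex(@"\[([^\]\/]+)\]");

// Hypothetical helper: applies only the two body-specific steps of cleanse().
string FixBody(string line)
{
    int spos = line.IndexOf('/');
    // Guard added for this sketch: a line without '/' is returned unchanged.
    if (spos < 0) return line;
    string head = line.Substring(0, spos); // headwords and [pinyin]
    string body = line.Substring(spos);    // "/sense/sense/"
    body = reSquare.Replace(body, "($1)"); // [Tech] -> (Tech)
    body = body.Replace("/ ", "/").Replace(" /", "/"); // trim spaces around senses
    return head + body;
}

// Made-up sample line in the head/body layout used by the data file.
string sample = "檢查 检查 [jian3 cha2] / prüfen [Tech] /kontrollieren/";
Console.WriteLine(FixBody(sample));
// Prints: 檢查 检查 [jian3 cha2] /prüfen (Tech)/kontrollieren/
```

Splitting first is what keeps the pinyin brackets in the head intact while square brackets in the senses are converted to parentheses.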

Repro

Wrk20Cleanse.cs from ZD.Tool is the script that was used. To execute:

  1. Put x-10-handedict.txt from the initial transformation script in a subfolder named _work under the solution root
  2. Compile and run ZD.Tool with the --20-cleanse argument

Output: x-20-handedict.txt is the file that was initially imported into HanDeDict @ Zydeo.
