Some Statistics about German Words Derived from Ding

Ding is a dictionary lookup program that also ships with a German-English dictionary (more than 345,000 entries), released under the GNU General Public License by Frank Richter.

Since the German-English dictionary in Ding is in fact the only open data I can find on the Internet that has clean part-of-speech information for German words, I think it is fair to derive some statistics about German from this dictionary.

Unfortunately, there is no documentation about the details of the Ding dictionary data format. Fortunately, the dictionary lookup program is open source, and the official page does provide an image showing how the word items are eventually rendered.

A screenshot of the query "Ding" in Ding 1.4

Even better, Ding is now also available online as a website, BEOLINGUS. Searching there for "Ding" gives a result similar to the screenshot above.

Data Inspection

Since there is no description of the data format at hand, the first step is to inspect the data in the hope of recovering the format.

Ding Dictionary Text

Data analysis is about data first, so let us look at the data first.
Because the official webpage provides a screenshot for the query word "Ding", it is natural to look at "Ding" first.

The "Ding" block in de-en.txt (version 1.8.1) is as follows.

Ding {n}; Sache {f} | Dinge {pl}; Sachen {pl}; Krempel {m} | Dinge für sich behalten | die Dinge laufen lassen | den Dingen auf den Grund gehen | beim augenblicklichen Stand der Dinge | das Ding an sich | über solchen Dingen stehen | Er ist der Sache nicht ganz gewachsen. :: thing | things | to keep things to oneself | to let things slide | to get to the bottom of things | as things stand now; as things are now | the thing-in-itself | to be above such things | He is not really on top of things.
Ding {n} [ugs.] (Produkt) [techn.] [econ.] :: widget [coll.]
Ding {n}; Kniff {m}; Trick {m} :: gimmick
Ding {n}; Sache {f}; Coup {m} [slang] (Einbruch, Überfall) :: job [slang] (burglary, robbery)
Dingel {pl} (Limodorum) (botanische Gattung) [bot.] :: limodores (botanical genus)
Dingens {n} [ugs.] :: thingy [coll.]
Dings {n}; Dingens {n} [ugs.] :: dingus
Dingsbums {n} :: thingamabob; thingumabob; thingmabob; thingamajig; thingumajig; thingmajig; thingummy
Dings {n}; Dingsbums {n}; Dingsda {n} | ganz aus dem Häuschen sein | in der Klemme stecken; in der Patsche sitzen :: dohickey; dojigger; doodad; doodah [Br.]; doohickey; hickey; gimmick | to be all of a doodah | to be in deep doodah [Br.]; to be on a sticky wicket

This is apparently different from the screenshots above, since the versions differ.

By guessing (sorry, not really guessing, since this block was written after I skimmed the source code as described below), some useful information about the format can be inferred.
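The delimiters visible in the block above can be made concrete with a minimal Python sketch. It assumes, as a guess from the sample lines only, that `::` separates the German side from the English side, that `|` separates sub-entries which line up one-to-one across both sides, and that `;` separates alternative terms. The function name `split_ding_line` is hypothetical and not part of Ding.

```python
def split_ding_line(line: str):
    """Split one raw line of de-en.txt into aligned German/English parts.

    Assumed (undocumented) format: '<german> :: <english>', where '|'
    separates sub-entries and ';' separates alternative terms.
    """
    german_side, english_side = line.split("::", 1)
    german_parts = [part.strip() for part in german_side.split("|")]
    english_parts = [part.strip() for part in english_side.split("|")]

    pairs = []
    # Pair the n-th German sub-entry with the n-th English sub-entry.
    for german, english in zip(german_parts, english_parts):
        german_terms = [term.strip() for term in german.split(";")]
        english_terms = [term.strip() for term in english.split(";")]
        pairs.append((german_terms, english_terms))
    return pairs


sample = "Dings {n}; Dingens {n} [ugs.] :: dingus"
for german_terms, english_terms in split_ding_line(sample):
    print(german_terms, "->", english_terms)
```

Applied to the first "Ding" line, this yields nine German sub-entries paired with nine English ones. A `;` inside parentheses would be split incorrectly by this sketch, so it is only a starting point.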

This is pretty much just guessing, so it is better to dive into the source code for verification.

Ding Program Source Code

To be honest, because the source code of the Ding program is written in Tcl, an ancient programming language, I really do not have the will to read it through carefully. By quick scanning and searching, I only got the following information out of it.

Combined with the guessing block above, the format now seems clear, at least to some degree. So the next step is data transformation.

Data Transformation

As is well acknowledged in the data analysis field, the DataFrame is a convenient data structure for analysis. So it is reasonable to first define the record fields (the object attributes, or the data struct / class / type).

Because JavaScript/TypeScript has a really expressive literal syntax, I would like to use JavaScript/TypeScript to define the data structure.

interface DictLine {
  id: number; // could be automatically generated or just the raw line number
  // NOTE: level 0 - '\n', '::'
  fileLineNum: number; // storage is cheaper than data
  fileLineTextGerman: string; // keeps everything, which is more than enough; kept for potential usage, kind of my personal best practice
  fileLineTextEnglish: string;
  // NOTE: level 1 - '|'
  lineNum: number;
  germanLine: string;
  englishLine: string;
  // NOTE: level 2 - ';'
  germanTerms: string[]; // with all `{..}`, `(..)` and `[..]`
  englishTerms: string[]; // no one-to-one correspondence from German to English
}

interface DictGermanEntry {
  id: number;
  // NOTE: level 0 - '\n', '::'
  fileLineNum: number;
  fileLineTextGerman: string;
  // NOTE: level 1 - '|'
  germanLine: string;
  germanLineNum: number;
  // NOTE: level 2 - ';'
  germanTerm: string; // with all `{..}`, `(..)` and `[..]`
  germanTermPositionNum: number;
  leadingGermanTerm: string;
  // NOTE: level 3 - detailed
  germanEntry: string;
  germanEntryPoS: string; // `{..}`
  germanEntryAppendices: string; // everything after `{..}`
}

interface DictEnglishEntry {
  id: number;
  // NOTE: level 0 - '\n', '::'
  fileLineNum: number;
  fileLineTextEnglish: string;
  // NOTE: level 1 - '|'
  englishLine: string;
  englishLineNum: number;
  // NOTE: level 2 - ';'
  englishTerm: string; // with all `{..}`, `(..)` and `[..]`
  englishTermPositionNum: number;
  leadingEnglishTerm: string;
  // NOTE: level 3 - detailed
  englishEntry: string;
  englishEntryPoS: string; // `{..}`
  englishEntryAppendices: string; // everything after `{..}`
}

Since I am currently more interested in German entries, I will go further with only the DictGermanEntry data structure.

Based on the conclusions derived from data inspection, the most relevant Python snippet used to process the original Ding dictionary data is as follows.
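A minimal sketch of such processing, not necessarily the original snippet, could look like this. It turns raw lines into records shaped like DictGermanEntry. Several points are assumptions: that comment lines in de-en.txt start with `#`, that the first `{..}` in a term is its part-of-speech tag, that `leadingGermanTerm` means the first term of the same sub-entry, and that positions are counted from 0. The names `parse_german_term` and `iter_german_entries` are hypothetical.

```python
import re
from typing import Iterable, Iterator

# The first '{..}' in a term is taken to be the part-of-speech tag (assumption).
POS_PATTERN = re.compile(r"\{([^}]*)\}")


def parse_german_term(term: str):
    """Return (entry, part_of_speech, appendices) for one German term."""
    match = POS_PATTERN.search(term)
    if match is None:
        # No tag: keep the whole term as the entry.
        return term.strip(), "", ""
    entry = term[:match.start()].strip()
    appendices = term[match.end():].strip()
    return entry, match.group(1), appendices


def iter_german_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one DictGermanEntry-like dict per German term."""
    record_id = 0
    for file_line_num, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Skip blank lines, assumed '#' comment lines, and lines without '::'.
        if not line or line.startswith("#") or "::" not in line:
            continue
        german_text = line.split("::", 1)[0].strip()
        for line_num, german_line in enumerate(german_text.split("|")):
            terms = [t.strip() for t in german_line.split(";") if t.strip()]
            for position, term in enumerate(terms):
                entry, pos, appendices = parse_german_term(term)
                record_id += 1
                yield {
                    "id": record_id,
                    "fileLineNum": file_line_num,
                    "fileLineTextGerman": german_text,
                    "germanLine": german_line.strip(),
                    "germanLineNum": line_num,
                    "germanTerm": term,
                    "germanTermPositionNum": position,
                    "leadingGermanTerm": terms[0],
                    "germanEntry": entry,
                    "germanEntryPoS": pos,
                    "germanEntryAppendices": appendices,
                }


sample = [
    "Ding {n} [ugs.] (Produkt) [techn.] [econ.] :: widget [coll.]",
    "Ding {n}; Kniff {m}; Trick {m} :: gimmick",
]
for record in iter_german_entries(sample):
    print(record["germanEntry"], record["germanEntryPoS"], record["germanEntryAppendices"])
```

The resulting dicts can be passed directly to `pandas.DataFrame(list(...))` to get one row per German term.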

Some Statistics about Ding Dictionary

The raw statistics of the extracted DictGermanEntry records can be seen in the following image.

As shown above, the complete set of DictGermanEntry records contains 493,141 records, of which 313,574 are tagged with a part-of-speech label.

And as the horizontal bar chart shows, the complete set contains many repeated entries; "Schlag", "Anhänger" and "Verbindung" are each repeated more than 30 times.
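Counting repeated entries of this kind can be sketched with pandas as below. The rows are a tiny hand-made toy table, not the real data, so the counts bear no relation to the figures above; the column name `germanEntry` follows the DictGermanEntry interface.

```python
import pandas as pd

# Toy rows for illustration only; the real table has one row per German term.
records = pd.DataFrame({
    "germanEntry": ["Schlag", "Schlag", "Schlag", "Anhänger", "Anhänger", "Ding"],
    "germanEntryPoS": ["m", "m", "m", "m", "m", "n"],
})

# value_counts() sorts by frequency, most frequent first.
repeat_counts = records["germanEntry"].value_counts()

# Keep only entries that occur more than once.
most_repeated = repeat_counts[repeat_counts > 1]
print(most_repeated)
```

With matplotlib installed, `most_repeated.plot.barh()` would draw a horizontal bar chart of these counts.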

Which part of speech in German has the largest vocabulary?

Given the repeated entries in the complete data set, answering this question requires the set of unique entries, which is fortunately very easy to get with Python pandas. The result is as follows.

Expected or not, feminine nouns have the largest vocabulary in German, followed by masculine nouns. One interesting aspect is that there are more adjectives (adj) than verbs (vt+vi+vr) in German, according to the entries in the Ding dictionary.

What other interesting aspects can be derived from the Ding dictionary?
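The deduplication and counting step might look like the sketch below. It runs on toy rows rather than the real data, and it assumes that an entry counts as unique per (entry, part of speech) pair, which may differ from the rule actually used for the result.

```python
import pandas as pd

# Toy rows for illustration only, including one exact repeat ("Ding", "n").
records = pd.DataFrame({
    "germanEntry": ["Ding", "Ding", "Sache", "Kniff", "Trick", "Dinge", "schön", "laufen"],
    "germanEntryPoS": ["n", "n", "f", "m", "m", "pl", "adj", "vi"],
})

# Drop records without a part-of-speech tag, then drop repeated pairs.
tagged = records[records["germanEntryPoS"] != ""]
unique_entries = tagged.drop_duplicates(subset=["germanEntry", "germanEntryPoS"])

# Number of unique entries per part of speech, largest first.
pos_counts = unique_entries["germanEntryPoS"].value_counts()
print(pos_counts)

# Verbs are spread over several tags; sum them, treating absent tags as 0.
verb_total = int(pos_counts.reindex(["vt", "vi", "vr"], fill_value=0).sum())
print("verbs:", verb_total)
```

Summing `vt`, `vi` and `vr` mirrors the comparison between adjectives and verbs made in this section.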

To be continued!

Personally, I would like to generate some statistics about the relationships between noun endings and the noun genders in German.
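One possible shape for such a statistic is a cross-tabulation of word endings against gender tags. The sketch below uses a handful of toy nouns, not the real data, and takes the last three letters as the "ending", which is an arbitrary simplification.

```python
import pandas as pd

# Toy nouns for illustration only.
nouns = pd.DataFrame({
    "germanEntry": ["Verbindung", "Zeitung", "Freiheit", "Schlag", "Anhänger", "Ding"],
    "germanEntryPoS": ["f", "f", "f", "m", "m", "n"],
})

# Keep singular nouns with a gender tag (plural entries use 'pl').
nouns = nouns[nouns["germanEntryPoS"].isin(["m", "f", "n"])].copy()

# Use the last three letters as a crude ending.
nouns["ending"] = nouns["germanEntry"].str[-3:].str.lower()

# Rows are endings, columns are genders, cells are counts.
table = pd.crosstab(nouns["ending"], nouns["germanEntryPoS"])
print(table)
```

On real data, endings of different lengths (for example `-e`, `-heit`, `-chen`) would need to be handled separately.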

Anyone with an interesting idea is welcome to leave a comment.

* cached version, generated at 2020-05-02 22:12:53 UTC.