15th NODALIDA, Joensuu, May 20-21, 2005
Viggo Kann, Magnus Rosell: Free construction of a Swedish dictionary of synonyms
Building a large dictionary of synonyms for a language is a very tedious task. Hence there exist very few synonym dictionaries for most languages, and those that exist are generally not freely available due to the amount of work that has been put into them. We describe a method that makes it feasible to compile one of the largest synonym dictionaries for Swedish without effort and without copyright problems.
First we need a large list of possible pairs of synonyms. Second we need a hundred thousand people who are willing to participate in deciding which of the pairs are good synonyms. Third we need a computer system where the decisions can be made and where the growing synonym dictionary can be available for lookup and downloading.
1. How can a list of possible pairs of Swedish synonyms be produced? If you have access to a dictionary D1 from Swedish to another language X and a dictionary D2 from X to Swedish you can collect synonym pairs by translating each Swedish word to X and back again to Swedish, i.e.,
{(w,v): exists y: y in D1(w) & v in D2(y)}
We may also consider only the dictionary D1 from Swedish to X:
{(w,v): exists y: y in D1(w) & y in D1(v)}
Similarly we may also consider only D2. The pairs obtained in this way will sometimes be synonyms, but due to ambiguous word senses there will also be lots of rubbish.
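The two set definitions above can be sketched in Python as follows. This is only an illustration, assuming that each dictionary is available as a mapping from a word to a set of translations; the function names and the toy entries are invented for the example and are not taken from Lexin. Trivial pairs (w,w) are dropped, which the set definitions themselves do not require.

```python
from collections import defaultdict

def round_trip_pairs(d1, d2):
    """Pairs (w, v) such that some y is in D1(w) and v is in D2(y)."""
    pairs = set()
    for w, translations in d1.items():
        for y in translations:
            for v in d2.get(y, ()):
                if v != w:  # assumption: skip the trivial pair (w, w)
                    pairs.add((w, v))
    return pairs

def shared_translation_pairs(d1):
    """Pairs (w, v) such that some y is in both D1(w) and D1(v)."""
    by_translation = defaultdict(set)
    for w, translations in d1.items():
        for y in translations:
            by_translation[y].add(w)
    pairs = set()
    for words in by_translation.values():
        for w in words:
            for v in words:
                if w != v:
                    pairs.add((w, v))
    return pairs

# Toy dictionaries, invented for illustration only.
# D1: Swedish -> English, D2: English -> Swedish.
D1 = {"snabb": {"fast", "quick"}, "kvick": {"quick"}, "fast": {"firm"}}
D2 = {"fast": {"snabb", "fast"}, "quick": {"snabb", "kvick"}, "firm": {"fast"}}

candidates = round_trip_pairs(D1, D2)
shared = shared_translation_pairs(D1)
```

In the toy data the English word "fast" is ambiguous, so the round trip yields the pair (snabb, fast) next to the reasonable pair (snabb, kvick). This is the kind of rubbish that the later filtering steps are meant to remove.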
If there are dictionaries available between Swedish and other languages, one can also get lists of word pairs from them. Such lists can then be used either to complement or to refine the original list. If (w,v) is a pair included in many lists, it becomes more probable that w and v are real synonyms.
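A minimal sketch of how several candidate lists might be combined, assuming one list of pairs per intermediate language. Treating (w,v) and (v,w) as the same pair, and the cut-off `min_lists`, are assumptions made for the example; the text does not say how the lists were actually merged.

```python
from collections import Counter

def pair_support(candidate_lists):
    """Count in how many candidate lists each unordered pair appears."""
    support = Counter()
    for pairs in candidate_lists:
        # frozenset makes (w, v) and (v, w) the same key; building a set
        # first ensures a pair is counted at most once per list.
        support.update({frozenset(p) for p in pairs if p[0] != p[1]})
    return support

def keep_supported(candidate_lists, min_lists):
    """Keep the pairs that occur in at least min_lists of the lists."""
    support = pair_support(candidate_lists)
    return {tuple(sorted(p)) for p, n in support.items() if n >= min_lists}

# Toy lists, invented for illustration only.
via_english = {("snabb", "kvick"), ("snabb", "fast")}
via_german = {("kvick", "snabb")}

complemented = keep_supported([via_english, via_german], min_lists=1)
refined = keep_supported([via_english, via_german], min_lists=2)
```

With `min_lists=1` the lists complement each other (a union); with a higher value they refine each other, since only pairs confirmed through several languages survive.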
By using this technique we have constructed a list of 600 000 pairs of possible synonyms. In order to improve the quality of the list we can part-of-speech tag the words and only keep pairs containing words that have the same word class. We also refine the list of synonyms using a method called Random Indexing or RI (Kanerva, Kristoferson, Holst 2000). In RI each word is assigned a random label vector of a few thousand elements. Using these vectors one constructs a co-occurrence vector for each word by adding the random label vectors of all words appearing in the context of each occurrence of the word in a large training corpus (we used the KTH news corpus). Synonymous words normally occur in similar contexts, and hence will get similar co-occurrence vectors. For each word pair (w,v) the cosine of the angle between the co-occurrence vectors of w and v will be used as a measure of the synonymity of the words. A suitably chosen threshold on this measure will refine the list of pairs to an acceptable level.
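The Random Indexing step can be sketched as below. This is a toy version under stated assumptions, not the setup used for the KTH news corpus: the sparse label vectors with a few +1/-1 entries, the dimension, the context window of two words on each side and the three-sentence corpus are all chosen for the example.

```python
import math
import random

def random_label(dim, nonzero, rng):
    """Sparse random label vector: a few +1/-1 entries, the rest 0."""
    vec = [0] * dim
    for i in rng.sample(range(dim), nonzero):
        vec[i] = rng.choice((-1, 1))
    return vec

def context_vectors(sentences, dim=1000, nonzero=8, window=2, seed=0):
    """Sum the label vectors of the words found near each occurrence."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    labels = {w: random_label(dim, nonzero, rng) for w in vocab}
    context = {w: [0] * dim for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    label, vec = labels[s[j]], context[w]
                    for k in range(dim):
                        vec[k] += label[k]
    return context

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus, invented for illustration only.
corpus = [["en", "snabb", "bil"], ["en", "kvick", "bil"], ["ett", "stort", "hus"]]
ctx = context_vectors(corpus)
same_context = cosine(ctx["snabb"], ctx["kvick"])
other_context = cosine(ctx["snabb"], ctx["stort"])
```

Here "snabb" and "kvick" occur in identical contexts, so their vectors coincide and the cosine is 1, while "stort" shares no context words with them and gets a value near 0. A candidate pair would be kept only if its cosine exceeds the chosen threshold.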
2. The Lexin online dictionary (lexin.nada.kth.se) is a very popular website for translations of Swedish words to about ten different languages. During the year 2004 the number of lookups in Lexin was 101 million. This means more than three lookups each second of the year. The users of Lexin are good candidates for doing the job of deciding which word pairs are good synonyms.
The users visit the web site because they need to ask a language (translation) question. They obviously like the idea of an online dictionary, so they are probably motivated to put a small effort into producing a Swedish synonym dictionary by just answering a simple question.
3. The Swedish Agency for School Improvement has allowed us to put such a question on each Lexin lookup answer web page. A question could for example be "Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5 where 0 means 'I do not agree' and 5 means 'I fully agree', or answer 'I do not know'".
When a user has answered this question, a web page of the growing synonym dictionary will be opened where the user may choose to grade more pairs, suggest new synonym pairs, look up words in the synonym dictionary or download the synonym dictionary.
The programs taking care of the answers and the synonym dictionary were developed by a student project group at KTH. The system will soon be linked to the Lexin website. We estimate that the synonym dictionary will be completed after about two months. Then it will continue to grow automatically as long as it is linked to Lexin.
Every web page that invites the public to participate will be subjected to attempts at abuse. Thus the synonym dictionary must have ways to prevent this. Our solution is threefold. First, many gradings of a pair are needed before it is considered to be a good synonym pair and becomes possible to look up in the synonym dictionary. Second, the pair that a user is asked to grade is picked at random from the list of about half a million pairs, so the same user will almost never be asked to grade the same pair more than once. If most of the users answer honestly and have an acceptable idea of the synonymity when they think they have, the quality of the synonym dictionary should be good. Third, the word pairs that users suggest themselves are first checked using a spelling checker and are then added to the long list of pairs, and will eventually be graded by other users. The probability that users will be asked to grade their own suggested pair is extremely small.
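The first safeguard, requiring many gradings before a pair is published, might look like the sketch below. The two thresholds and the use of the mean grade are hypothetical; the text does not state how many gradings are required or how they are combined. An answer of "I do not know" is represented as None and ignored.

```python
from statistics import mean

MIN_GRADINGS = 5       # hypothetical: gradings required before publishing
MIN_MEAN_GRADE = 3.0   # hypothetical: mean grade (scale 0 to 5) required

def accepted_pairs(gradings):
    """Return the pairs that qualify for the dictionary with their mean grade.

    gradings maps a pair (w, v) to a list of answers, each a grade 0..5
    or None for 'I do not know'.
    """
    accepted = {}
    for pair, answers in gradings.items():
        grades = [g for g in answers if g is not None]
        if len(grades) >= MIN_GRADINGS and mean(grades) >= MIN_MEAN_GRADE:
            accepted[pair] = mean(grades)
    return accepted

# Toy answers, invented for illustration only.
gradings = {
    ("snabb", "kvick"): [5, 4, 5, None, 4, 5],
    ("snabb", "fast"): [0, 1, 0, 0, None, 1, 0],
    ("glad", "lycklig"): [5, 5],  # too few gradings so far
}
published = accepted_pairs(gradings)
```

A single dishonest user contributes at most one of the required gradings for a pair, so the effect of abuse on the published dictionary stays small as long as most answers are honest.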
It is interesting to note that the exact meaning of "synonym" does not need to be defined. The users will grade the synonymity using their intuitive understanding of the concept of synonymity and the words in the question. The produced synonym dictionary will therefore use the People's definition of synonymity, and hopefully this is exactly what the People want when looking up words in the same dictionary.