15th NODALIDA, Joensuu, May 20-21, 2005

Hercules Dalianis: Improving search engine retrieval using a compound splitter for Swedish

1. Introduction

Today, a search on the Internet is very likely to return some answer, thanks to the immense amount of information available and the efficient global search engines. There is always some web page that contains the answer to your question. Searching within a single web site, however, is not as easy, so we need more sophisticated tools for searching on web sites.

One in ten queries obtains no answer because it is misspelled (Dalianis 2002). One solution is spelling support linked to the index of the search engine. When the user makes a spelling error, the spelling correction module tries to find one or more words in the index with similar spelling or pronunciation, and the user gets feedback in the form of possible candidate words (Dalianis 2002, Sarr 2003).
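The idea of linking spelling support to the index can be sketched as follows. This is a minimal illustration, not the module evaluated in Dalianis (2002): it uses Python's standard difflib for spelling similarity only (no pronunciation matching), and the function name, the toy vocabulary, and the cutoff value are assumptions made for the example.

```python
import difflib

def suggest_spellings(query_word, index_vocabulary, max_suggestions=3, cutoff=0.75):
    """Return index words whose spelling is close to query_word.

    An empty list is returned when the word is already in the index,
    since no correction is needed in that case.
    """
    if query_word in index_vocabulary:
        return []
    # get_close_matches ranks candidates by a string similarity ratio
    # and keeps those at or above the cutoff.
    return difflib.get_close_matches(
        query_word, sorted(index_vocabulary), n=max_suggestions, cutoff=cutoff
    )

# Hypothetical index vocabulary, for illustration only.
vocabulary = {"diabetes", "patient", "infektion", "streptokock"}
print(suggest_spellings("diabetis", vocabulary))
```

Because the candidates come from the index itself, every suggestion is a word that is guaranteed to occur in at least one document.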

Another common reason for no hits is that the user searches for a word that is written in another inflected form in the documents. This is of course very common in morphologically complex languages (usually not English).

To solve the problem of word inflections, one can use a stemmer that removes the inflections, so that the words in both the search query and the index are stemmed and consequently able to match (Tomlinson 2002). For Swedish, for example, precision and recall increased by 15 and 18 percent respectively when using a stemmer (Carlberger et al. 2001).
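The following sketch shows why stemming both the index and the query makes inflected forms match. The suffix list is a toy assumption chosen for the example and is far simpler than the stemmer of Carlberger et al. (2001); the function names and the two sample documents are likewise hypothetical.

```python
# Toy suffix list (longest first); an assumption, not a real Swedish stemmer.
SUFFIXES = ("erna", "arna", "er", "ar", "en")

def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_stemmed_index(documents):
    """Map each stemmed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every stemmed query word."""
    result = None
    for token in query.split():
        hits = index.get(toy_stem(token), set())
        result = hits if result is None else result & hits
    return result or set()

documents = {1: "patienter med diabetes", 2: "infektioner hos barn"}
index = build_stemmed_index(documents)
print(search(index, "patient"))    # singular query, plural form in the document
print(search(index, "infektion"))
```

The same stemming function is applied at indexing time and at query time, which is what lets the singular query reach the plural form in the text.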

Two other methods for processing queries are compound splitting (decompounding) and compound joining. Swedish, for example, has many compounds, but Swedish writers are heavily influenced by written English and tend to write compounds as separate words.

An example from the Swedish public medical website Vårdguiden: a user searches for diabetespatient (patient with diabetes) and obtains no hits. The system then tries to split the compound into diabetes patient, and the resulting hit becomes patienter med diabetes (patients with diabetes). Note also that the stemmer makes it possible to automatically find the word patienter (plural form, patients). In the opposite situation, the user enters two search words, streptokock infektion, and obtains no hits. The system can then propose the compound streptokockinfektioner (plural form), which gives several relevant hits.
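The two fallback situations in the example above can be sketched as a small query-rewriting step. This is an illustration under stated assumptions, not the system deployed at Vårdguiden: the search function and the splitter are passed in as parameters, and the toy versions below (prefix matching as a stand-in for stemming, a lookup table as a stand-in for the compound splitter) are hypothetical.

```python
def search_with_compound_fallback(query, search, split_compound):
    """Run the query; on zero hits, retry with a split or joined form.

    Returns the query that was finally used together with its hits.
    """
    hits = search(query)
    if hits:
        return query, hits
    words = query.split()
    if len(words) == 1:
        # Single word with no hits: try to split it as a compound.
        parts = split_compound(words[0])
        rewritten = " ".join(parts) if len(parts) > 1 else None
    else:
        # Several words with no hits: try joining them into one compound.
        rewritten = "".join(words)
    if rewritten is None:
        return query, []
    return rewritten, search(rewritten)

# Toy stand-ins, for illustration only.
documents = ["patienter med diabetes", "streptokockinfektioner hos barn"]

def toy_search(query):
    """Match documents where every query word is a prefix of some token."""
    words = query.lower().split()
    return [d for d in documents
            if all(any(tok.startswith(w) for tok in d.split()) for w in words)]

def toy_split(word):
    """Look the compound up in a tiny hand-made table."""
    return {"diabetespatient": ["diabetes", "patient"]}.get(word, [word])

print(search_with_compound_fallback("diabetespatient", toy_search, toy_split))
print(search_with_compound_fallback("streptokock infektion", toy_search, toy_split))
```

Keeping the original query as the first attempt means the rewriting only affects queries that would otherwise return nothing.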

Chen & Gey (2004) used stemming and compound splitting and obtained 14 percent higher precision for Dutch, 37 percent for German, and 30 percent for Swedish and Finnish.

2. Our study and method

We have studied nine Swedish public websites: two municipalities, one university, one political party, a nature conservation site, a public authority site, a popular science site, and two insurance companies. The sites range in size from 500 to 50 000 documents each. Together they contain 100 000 documents, and their search engines received around 1.6 million queries, of which 9.3 percent were misspelled.

Among the top 30 queries with no answer at all, out of the total 1.6 million queries, there were 6 000 compound queries covering 127 different compounds. This is 3.7 per mille of the total number of queries, and on some specific web sites up to 2 percent of the queries. (Another 600 queries were written decompounded and became compounds when their parts were joined.)

We connected the compound splitter described in Sjöbergh & Kann (2004) to the search engine. We carried out compound splitting on each of the 127 compounds on each web site, which generated 7 724 new hits in total. Of these, 64 percent were relevant to the query. 20 compounds were not split, were oversplit, or were incorrectly split, which gives the compound splitter a success rate of 84 percent.
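The two derived figures in this section, the 3.7 per mille share of compound queries and the 84 percent success rate, follow from the reported counts. The check below is only arithmetic; it assumes that "around 1.6 million" is exactly 1 600 000, and the variable names are introduced here.

```python
total_queries = 1_600_000    # "around 1.6 million", taken as exact here
compound_queries = 6_000     # unanswered compound queries
distinct_compounds = 127     # different compounds among them
failed_splits = 20           # not split, oversplit, or incorrectly split

# Share of compound queries among all queries, in per mille.
per_mille = 1000 * compound_queries / total_queries

# Share of compounds that the splitter handled correctly, in percent.
success_rate = 100 * (distinct_compounds - failed_splits) / distinct_compounds

print(f"compound queries: {per_mille:.2f} per mille")
print(f"splitter success rate: {success_rate:.1f} percent")
```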

None of the 126 (100%) investigated compounds obtained any hit at first. After splitting them with the compound splitter and running the search engine again on the split result, we found the following:

  17 (13%) of the investigated split compounds still gave no answers.
  29 (23%) gave bad, non-relevant hits.
+ 80 (64%) gave relevant hits, boosting the search through compound splitting.
= 126 (100%)
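The tally above can be recomputed from its three counts. The sketch below is only an arithmetic check; the dictionary labels are paraphrases introduced here. It gives unrounded shares of about 13.5, 23.0 and 63.5 percent, which the tally reports as whole percentages.

```python
outcomes = {
    "still no answers": 17,
    "non-relevant hits": 29,
    "relevant hits": 80,
}

total = sum(outcomes.values())

# Share of each outcome among all investigated compounds, in percent.
shares = {label: 100 * count / total for label, count in outcomes.items()}

for label, count in outcomes.items():
    print(f"{count:4d}  {shares[label]:5.1f}%  {label}")
print(f"{total:4d}  100.0%  total")
```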

3. Conclusions

In our experiment, we obtained relevant hits in 64 percent of the cases when using the compound splitter described in Sjöbergh & Kann (2004) as a post-processor in a search engine.

Our findings are that nouns, and specifically proper nouns, need to be split in a smart and correct way: split where appropriate, but not oversplit.

We found that compound splitting works better when the two parts of the split compound are longer than 4 and 5 characters. We also found that allowing a maximum distance of 29 words between the parts in the text gave relevant results in a compound-splitting search.
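The word-distance finding can be illustrated with a simple proximity check. This sketch assumes that "distance" means the difference in token positions and that a part matches a token by prefix (a stand-in for stemming); neither detail is specified in the text, and the function name is hypothetical.

```python
def within_distance(tokens, first, second, max_distance=29):
    """True if some occurrence of each part lies within max_distance tokens.

    A part is considered to occur at a position when the token there
    starts with that part.
    """
    first_positions = [i for i, t in enumerate(tokens) if t.startswith(first)]
    second_positions = [i for i, t in enumerate(tokens) if t.startswith(second)]
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)

near = "patienter med diabetes".split()
far = ["diabetes"] + ["ord"] * 40 + ["patienter"]

print(within_distance(near, "diabetes", "patient"))
print(within_distance(far, "diabetes", "patient"))
```

A document in which the two parts occur far apart is then treated as a non-match, even though both parts are present.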

References
Carlberger, J., H. Dalianis, M. Hassel and O. Knutsson. 2001. Improving Precision in Information Retrieval for Swedish using Stemming. In Proceedings of NODALIDA 01, May 21-22, Uppsala, Sweden.
Chen, A. and F. Gey. 2003. Combining Query Translation and Document Translation in Cross Language Retrieval. CLEF 2003. http://clef.iei.pi.cnr.it/2003/WN_web/05.pdf
Dalianis, H. 2002. Evaluating a Spelling Support in a Search Engine. In B. Andersson, M. Bergholtz, P. Johannesson (Eds.), Natural Language Processing and Information Systems, 6th International Conference on Applications of Natural Language to Information Systems, NLDB 2002, Stockholm, Sweden, June 27-28, 2002. Lecture Notes in Computer Science, Vol. 2553, pp. 183-190. Springer Verlag.
Sarr, M. 2003. Improving precision and recall using a spell checker in a search engine. In Proceedings of NODALIDA 2003, the 14th Nordic Conference of Computational Linguistics, Reykjavik, 2003.
Sjöbergh, J. and V. Kann. 2004. Finding the correct interpretation of Swedish compounds, a statistical approach. In Proceedings of LREC 2004, Lisbon, Portugal. http://www.nada.kth.se/theory/projects/xcheck/rapporter/sjoberghkann04.pdf
Tomlinson, S. 2002. Experiments in 8 European Languages with Hummingbird SearchServer™ at CLEF 2002. Third Workshop of the Cross-Language Evaluation Forum, Rome, Italy, September 19-20, 2002. To be published by Springer in their Lecture Notes in Computer Science (LNCS) series. http://www.stephent.com/ir/papers/clef02.html

