|Title||Search engine based approaches for collecting domain-specific Basque-English comparable corpora from the Internet|
|Publication Type||Conference Paper|
|Year of Publication||2009|
|Authors||Leturia, Igor, Iñaki San Vicente, and Xabier Saralegi|
|Conference Name||5th International Web as Corpus Workshop (WAC5)|
|Conference Location||Donostia-San Sebastian, Spain|
In this paper we propose using search engine queries to collect bilingual specialized comparable corpora from the Internet. This is an alternative to using news agencies or focused crawling, and it is expected to yield more varied corpora. The method we propose for obtaining a specialized corpus in one language is based on the BootCaT method (querying search engines for random combinations of seed words representative of the domain or topic, and retrieving the pages returned). Instead of a list of seed words, however, a sample mini-corpus is used as the basis for the process: the most representative words are automatically extracted from it, and a final domain-filtering step is performed using document-similarity measures against this sample corpus. To obtain the bilingual comparable corpora, we propose two variants of this method. One uses a sample mini-corpus for each language and launches the corpus-collecting process for each language independently. The other uses a sample mini-corpus in only one of the languages, and uses dictionaries to translate the extracted seed words and to perform the topic filtering for the other language. We have collected two domain-specific Basque-English comparable corpora with each of the methods and evaluated them using corpus similarity measures.
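The monolingual pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyness score (a smoothed relative-frequency ratio against a reference corpus), the cosine similarity over raw word counts, the threshold value, and all function names are assumptions chosen for brevity. The search engine step is left out, so the sketch only builds the query strings and filters documents that have already been retrieved.

```python
import math
import random
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def extract_seed_words(sample_docs, reference_docs, n_seeds=5):
    """Pick the words most typical of the sample mini-corpus.

    Assumption: typicality is an add-one smoothed ratio of relative
    frequencies (sample vs. reference corpus); ties are broken alphabetically.
    """
    sample = Counter(t for d in sample_docs for t in tokenize(d))
    reference = Counter(t for d in reference_docs for t in tokenize(d))
    s_total = sum(sample.values())
    r_total = sum(reference.values())
    vocab = len(set(sample) | set(reference))

    def score(word):
        p_sample = (sample[word] + 1) / (s_total + vocab)
        p_ref = (reference[word] + 1) / (r_total + vocab)
        return p_sample / p_ref

    return sorted(sample, key=lambda w: (-score(w), w))[:n_seeds]


def make_queries(seed_words, words_per_query=3, n_queries=4, rng=None):
    """Build distinct queries from random combinations of seed words."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    combos = list(combinations(sorted(seed_words), words_per_query))
    rng.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_by_domain(candidates, sample_docs, threshold=0.2):
    """Keep retrieved documents that are similar enough to the sample corpus."""
    profile = Counter(t for d in sample_docs for t in tokenize(d))
    return [d for d in candidates
            if cosine(Counter(tokenize(d)), profile) >= threshold]
```

Because the similarity here uses raw counts, function words can dominate the score. A practical version would likely need stopword removal or a weighting scheme, and for the dictionary-based variant the seed words and the sample profile would first have to be translated into the other language.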