Language Identification on the WWW
LE3 .A278 2008
2008
Benoit, Darcy
Acadia University
Master of Science
Masters
Computer Science
This thesis addresses the problem of identifying the language of Web pages. Previous research has focused mostly on language identification for plain text, and only a limited amount has addressed Web pages. Web language identification is not a straightforward extension of text language identification: the noisy and diverse nature of Web pages introduces additional difficulties. To address them, we introduce a robust approach that combines multiple methods, such as detection of language and character encoding declarations and Web text language identification. The methods are harmonized to maximize their strengths and complement one another. Evaluated on corpora of 1400 Web pages in seven languages, the approach achieved an accuracy of 99.6 percent. Furthermore, we designed a system that employs our composite Web language identification approach, together with several other Web language and geographical distribution approaches, to determine Web server location and the language distribution of portal Web pages on the Internet. The data came from the latest Web census (Benoit, Slauenwhite & Trudel, 2006 and 2007). The system provides information such as countries ranked by number of Web servers, languages ranked by number of identified portal pages, and the language distribution within a country such as Canada.
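As a rough illustration of the composite idea described in the abstract, the Python sketch below combines three signals: a declared language, a hint from the declared character encoding, and a text classifier. It is not the thesis's implementation. The regular expressions, the CHARSET_HINTS table, the function names, and the voting rule with its tie-break order are all assumptions made for this example, and the text classifier is passed in as a hypothetical callable.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Assumed patterns for language and charset declarations (illustrative only).
HTML_LANG_RE = re.compile(
    r'<html\b[^>]*\blang\s*=\s*["\']?([A-Za-z]{2,3})', re.IGNORECASE)
META_LANG_RE = re.compile(
    r'<meta\b[^>]*http-equiv\s*=\s*["\']?content-language["\']?'
    r'[^>]*content\s*=\s*["\']?([A-Za-z]{2,3})', re.IGNORECASE)
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

# Assumed mapping from encodings that are mostly used for one language.
CHARSET_HINTS = {"shift_jis": "ja", "euc-jp": "ja", "euc-kr": "ko",
                 "gb2312": "zh", "big5": "zh"}


def declared_language(html: str) -> Optional[str]:
    """Return the primary language subtag declared in the markup, if any."""
    for pattern in (HTML_LANG_RE, META_LANG_RE):
        match = pattern.search(html)
        if match:
            return match.group(1).lower()
    return None


def charset_hint(html: str) -> Optional[str]:
    """Return a language suggested by the declared charset, if any."""
    match = CHARSET_RE.search(html)
    if match:
        return CHARSET_HINTS.get(match.group(1).lower())
    return None


def strip_tags(html: str) -> str:
    """Crude tag removal so the classifier sees mostly visible text."""
    return re.sub(r"<[^>]+>", " ", html)


def identify(html: str,
             classify_text: Callable[[str], Optional[str]]) -> Optional[str]:
    """Majority vote over the three signals.

    Ties go to the earliest signal in this assumed priority order:
    text classifier, then language declaration, then charset hint.
    """
    votes = [classify_text(strip_tags(html)),
             declared_language(html),
             charset_hint(html)]
    votes = [v for v in votes if v]
    if not votes:
        return None
    counts = Counter(votes)
    best = max(counts.values())
    return next(v for v in votes if counts[v] == best)
```

Any text language identifier can be supplied as classify_text. Letting the text classifier win ties is a design choice of this sketch, motivated by the abstract's point that Web pages are noisy, so declarations may be missing or wrong.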
The author retains copyright in this thesis. Any substantial copying or any other actions that exceed fair dealing or other exceptions in the Copyright Act require the permission of the author.
https://scholar.acadiau.ca/islandora/object/theses:3197