Language Identification on the WWW

Cai, Lehe

Cai, L. (2008). Language Identification on the WWW. https://scholar.acadiau.ca/islandora/object/theses:3197

Details:

Title: Language Identification on the WWW
Author: Cai, Lehe
Call Number: LE3 .A278 2008
Date: 2008
Supervisor: Benoit, Darcy
Degree Grantor: Acadia University
Degree Name: Master of Science
Degree Level: Masters
Discipline: Computer Science
Affiliation: Computer Science
Abstract: This thesis discusses the problem of language identification on Web pages. Previous research has mostly focused on text language identification, and only a limited amount has concentrated on Web language identification. Web language identification is not a straightforward extension of text language identification: the noisy and diverse nature of Web pages introduces additional difficulties. To address this problem, we introduce a robust approach that combines multiple methods, such as detection of language and character encoding declarations and Web text language identification, to better handle these difficulties. The methods are harmonized to maximize their strengths and complement each other. Verified on corpora of 1400 Web pages for seven languages, our approach achieved a 99.6 percent accuracy rate. Furthermore, we designed a system that employs our composite Web language identification approach, together with several other Web language and geographical distribution approaches, to determine Web server location and the language distribution of portal Web pages on the Internet. The data came from the latest Web census (Benoit, Slauenwhite & Trudel, 2006 and 2007). This system allows us to provide information such as countries ranked by number of Web servers, languages ranked by number of identified portal pages, and the language distribution within a country such as Canada.
Rights: The author retains copyright in this thesis. Any substantial copying or any other actions that exceed fair dealing or other exceptions in the Copyright Act require the permission of the author.
Permanent Link: https://scholar.acadiau.ca/islandora/object/theses:3197
