Automation of the Compilation and Processing of a Hausa Corpus

Eno Ubong Ekpo
Ubong Sunday Ekpo
Tunde Adegbola

Abstract

A spell checker is an indispensable tool for text editing: it can compensate for a writer's limited command of the language and help identify and correct inevitable typing errors. With over 40 million speakers, Hausa is the second most widely spoken language in Africa, yet it lacks a standard spell checker.

To create a Hausa spell checker, a Hausa corpus was built by data entry and web crawling. The wordlist derived from the corpus was cleaned to remove non-Hausa words and to correct typographical and other errors. In addition, to determine how well the modest corpus used for the spell checker covers the Hausa language, the rate at which the wordlist grows in relation to corpus size was measured. A modest 2 million-word Hausa corpus was realized. The corpus was then tokenized to produce a wordlist of about 30,000 distinct Hausa tokens. After cleaning, the wordlist was reduced to 23,306 tokens. Using Hausa morphology, the wordlist was compressed to 12,569 stems and 62 affix rules. These stems and rules made up the spell checker files. Also, a 700,000-word corpus drawn from the Hausa corpus was tokenized in separate files with a successive increment of 20,000 words per file.
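The compression of a wordlist into stems and affix rules can be illustrated with a minimal sketch. It is not the authors' procedure or their 62 rules: the function names `compress_wordlist` and `expand`, the toy wordlist, and the two suffixes are assumptions chosen only to show the idea that a derived form need not be stored if its stem and a suffix rule can regenerate it.

```python
from collections import defaultdict


def compress_wordlist(words, suffixes):
    """Split a wordlist into stems plus the suffixes each stem accepts.

    Hypothetical helper: a word is dropped from the stem list only when
    removing one suffix leaves a word that was itself kept as a stem.
    """
    stems = set()
    stem_suffixes = defaultdict(set)
    # Shorter words first, so a candidate stem is always decided before
    # any longer word that might be derived from it.
    for word in sorted(set(words), key=lambda w: (len(w), w)):
        for suffix in suffixes:
            stem = word[: len(word) - len(suffix)]
            if suffix and word.endswith(suffix) and stem in stems:
                stem_suffixes[stem].add(suffix)
                break
        else:
            # No suffix rule explains this word, so keep it as a stem.
            stems.add(word)
    return sorted(stems), {s: sorted(v) for s, v in stem_suffixes.items()}


def expand(stems, stem_suffixes):
    """Regenerate the full wordlist from stems and their suffix rules."""
    words = set(stems)
    for stem, sufs in stem_suffixes.items():
        words.update(stem + suffix for suffix in sufs)
    return words


# Toy data for illustration only; not taken from the paper's wordlist.
toy_words = ["gida", "gidan", "mota", "motar", "ruwa"]
toy_suffixes = ["n", "r"]
toy_stems, toy_rules = compress_wordlist(toy_words, toy_suffixes)
```

On this toy input, five word forms are stored as three stems, and `expand(toy_stems, toy_rules)` restores the original set, which is the property a spell checker dictionary needs from such a compression.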

Results showed that Hausa morphology was effective for information compression, as expected, and a rudimentary spell checker was produced. Furthermore, the corpus study showed that a corpus of 20,000 words yields about 3,000 tokens on average, and that the number of new tokens decreases with each additional file until it levels off at a point where adding further text of any size yields few or no new tokens. The number of new tokens gained with each addition fell from 2,000 tokens to 1,000 tokens and then lower still.
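The kind of measurement behind this result can be sketched as follows, assuming the corpus is read in fixed-size increments and each increment is compared against everything seen so far. The names `tokenize` and `new_types_per_increment` and the regular expression are illustrative assumptions, not the authors' tooling; in particular, the simple letter-run tokenizer ignores apostrophes, which a production Hausa tokenizer would need to handle.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of Unicode letters as words."""
    return re.findall(r"[^\W\d_]+", text.lower())


def new_types_per_increment(tokens, step):
    """Count previously unseen word forms in each successive block.

    `step` plays the role of the 20,000-word increment in the study;
    the return value has one count per block, in corpus order.
    """
    seen = set()
    counts = []
    for start in range(0, len(tokens), step):
        block = set(tokens[start:start + step])
        new = block - seen
        counts.append(len(new))
        seen |= new
    return counts


# Toy text for illustration only; a real run would use step=20000.
sample = "gida mota ruwa gida mota gari gida mota ruwa rana"
growth = new_types_per_increment(tokenize(sample), step=4)
```

A falling sequence of counts, as in the toy run, is the pattern the study reports at scale: each new block contributes fewer unseen word forms than the one before.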

This work is recommended to individuals, institutions and organisations as a guide to designing standard spell checkers for Hausa and for other languages that feature agglutination.

How to Cite
Ekpo, E. U., Ekpo, U. S., & Adegbola, T. (2017). Automation of the Compilation and Processing of a Hausa Corpus. The International Journal of Science & Technoledge, 5(3). Retrieved from http://internationaljournalcorner.com/index.php/theijst/article/view/123450