Parameter-Efficient Amharic Text Summarization

Daniel Mekuriaw
6 min read · Jan 1, 2024


This article presents a project aimed at improving Amharic text summarization. It focuses on efficiently fine-tuning the mT5-small model with minimal parameter updates and on incorporating cross-lingual learning from Arabic and English.

In the field of artificial intelligence, the evolution of Large Language Models (LLMs) and deep learning techniques has significantly transformed Natural Language Processing (NLP) [1]. These advancements, however, are not uniformly experienced across all languages. There remains a notable disparity, with many languages underrepresented due to limited computational resources and insufficient training data [2]. This uneven progress highlights a crucial challenge in NLP: the need to elevate low-resource languages to ensure their inclusion in the rapidly advancing digital landscape. Addressing this gap is essential for broadening access to technological developments for speakers of these less prominent languages [3].

The Amharic Language

Amharic, recognized as the official working language of Ethiopia and spoken by over 57 million people, is a prime example of the challenges faced in NLP for less commonly studied languages. As a member of the Afro-Asiatic language family, Amharic falls under the Southwest Semitic group. It shares its roots with Ge’ez, the ceremonial language of the Ethiopian Orthodox Church, and exhibits linguistic similarities with other languages such as Tigrinya and the South Arabic dialects. The language’s script is unique, comprising 33 basic characters that can each be modified into seven forms, based on the vowel sounds in a syllable [4].

[Figure: The Amharic phonetic alphabet, showing characters and their corresponding sounds [5]]

Amharic NLP

NLP for Amharic is notably less developed compared to languages with more extensive technological backing. Past efforts have primarily centered on creating basic NLP tools such as stemmers, morphological analyzers, and part-of-speech taggers, alongside the development of specific resources and corpora tailored for Amharic [6]. Current efforts focus on expanding the range of available Amharic NLP resources and improving system efficacy. Despite these initiatives, areas like Amharic text summarization have received less research attention [1].

Project Overview

Benchmark Dataset Compilation: The project created a unified benchmark dataset for Amharic text summarization, amalgamating various existing datasets to address the lack of standardized resources in this field.

Efficient Transfer Learning with mT5-small: The project aimed to enhance the mT5-small model’s Amharic text summarization capabilities. This involved efficiently fine-tuning the model, whose multilingual pre-training corpus already includes some Amharic, using a mix of Amharic, Arabic, and English data. Arabic and English were selected for their resource richness and linguistic relevance.

[Figure: Project implementation design, showing the plan to fine-tune the mT5-small model with different datasets]

Outcomes and Contributions: The project produced a standardized Amharic text summarization dataset and fine-tuned multiple versions of the mT5-small model using IA3 (“Infused Adapter by Inhibiting and Amplifying Inner Activations”), a parameter-efficient fine-tuning (PEFT) method that trains small scaling vectors while keeping the base model’s weights frozen. The evaluated and refined models, along with the dataset, are made openly available, and the methodologies and findings are comprehensively documented in the final report linked below.
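As a rough sketch of what an IA3 setup for mT5-small can look like with the Hugging Face PEFT library (an illustration under assumptions, not the project’s exact configuration, which is documented in the report and notebooks; the target modules shown follow PEFT’s defaults for mT5):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import IA3Config, TaskType, get_peft_model

# Load the pre-trained multilingual mT5-small checkpoint.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# IA3 injects learned scaling vectors into the targeted modules while the
# base weights stay frozen. The modules below are PEFT's defaults for mT5:
# the key/value attention projections and part of the gated feed-forward.
config = IA3Config(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["k", "v", "wi_1"],
    feedforward_modules=["wi_1"],
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Only the injected scaling vectors receive gradient updates, which is what keeps the number of trainable parameters so small relative to the 300M-parameter base model.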

Methodology

This section offers a succinct overview of the methodology used in the project, highlighting its key aspects; detailed information is available in the final report.

Data Collection and Aggregation
The study involved collecting several Amharic text summarization datasets from various sources. These datasets included those specifically designed for text summarization and others adapted from related tasks like news category classification. The datasets, in various formats like CSV and JSONL, required careful aggregation and preprocessing to ensure consistency and usability. This process was crucial to form a comprehensive benchmark dataset.
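As a minimal sketch of this aggregation step (the file names and column mappings below are hypothetical placeholders; the actual sources are listed in the final report and the Kaggle dataset):

```python
import pandas as pd

# Hypothetical source files and column names, for illustration only.
csv_df = pd.read_csv("amharic_news_summaries.csv")
jsonl_df = pd.read_json("amharic_articles.jsonl", lines=True)

# Map each source onto a shared (text, summary) schema.
csv_df = csv_df.rename(columns={"article": "text", "headline": "summary"})
jsonl_df = jsonl_df.rename(columns={"body": "text", "title": "summary"})

# Concatenate, then drop empty rows and exact duplicates.
benchmark = pd.concat(
    [csv_df[["text", "summary"]], jsonl_df[["text", "summary"]]],
    ignore_index=True,
)
benchmark = benchmark.dropna().drop_duplicates(subset="text")
benchmark.to_csv("amharic_summarization_benchmark.csv", index=False)
```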

Data Preprocessing and Model Fine-Tuning
Data preprocessing tailored to each language was crucial to align the datasets with the mT5 model’s requirements. For Amharic, this involved character-level normalization and filtering out non-Amharic texts and characters. The Arabic and English datasets underwent similar preprocessing steps, ensuring their compatibility with the model. The fine-tuning of the mT5-small model employed the IA3 PEFT technique, leveraging various Amharic, Arabic, and English datasets. This approach aimed to enhance the model’s summarization capabilities in a parameter-efficient manner, with multiple models developed for comparative evaluation.
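For the Amharic side, a preprocessing pass along these lines is plausible (the homophone mappings shown are a small, illustrative subset of a standard Amharic normalization table, and the 70% threshold is an assumption; the exact rules used are in the report):

```python
import re

# A few common Amharic homophone normalizations (illustrative subset):
# character variants that sound identical are mapped to one canonical form.
NORMALIZE = str.maketrans({"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ", "ፀ": "ጸ"})

# The Ethiopic script occupies the Unicode block U+1200-U+137F.
ETHIOPIC = re.compile(r"[\u1200-\u137F]")

def normalize_amharic(text: str) -> str:
    """Apply character-level normalization to an Amharic string."""
    return text.translate(NORMALIZE)

def is_mostly_amharic(text: str, threshold: float = 0.7) -> bool:
    """Filter out texts whose non-space characters are mostly non-Ethiopic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    ethiopic = sum(1 for c in chars if ETHIOPIC.match(c))
    return ethiopic / len(chars) >= threshold
```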

Evaluation
The project adopted both quantitative and qualitative methods for evaluating the models. Quantitative measures included ROUGE and BLEU scores, along with BERTScore, which assesses semantic similarity based on contextual embeddings. The qualitative evaluation involved human assessments by a native Amharic speaker, providing a nuanced perspective on model performance. This dual approach aimed to comprehensively evaluate the effectiveness of the fine-tuned models in Amharic text summarization, considering both technical precision and human interpretability.
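The quantitative side of such an evaluation can be sketched with the Hugging Face evaluate library (the summaries below are placeholders, and pointing BERTScore at a multilingual encoder is an assumption, since Amharic has no dedicated default model):

```python
import evaluate

# Load the three quantitative metrics used in the project.
rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")
bertscore = evaluate.load("bertscore")

predictions = ["<model-generated Amharic summary>"]
references = ["<reference Amharic summary>"]

# ROUGE and BLEU measure surface (n-gram) overlap.
print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references]))

# BERTScore compares contextual embeddings, capturing semantic similarity.
print(bertscore.compute(predictions=predictions, references=references,
                        model_type="bert-base-multilingual-cased"))
```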

Results

The project’s evaluation phase revealed several key findings:

Quantitative Metrics: The models were evaluated using ROUGE and BLEU scores, which generally indicated low word-level overlap between the generated and reference summaries. The fine-tuned models, particularly those utilizing the Amharic dataset, showed improvements over the base mT5-small model, highlighting the significance of incorporating Amharic data. However, the overall low scores across all models suggest that the improvements, while real, are modest rather than a major leap in performance.

Human Evaluation and BERTScores: Despite quantitative metrics indicating limited overlap, human evaluation revealed that the models fine-tuned with Amharic data produced relevant summaries. This was further corroborated by BERTScores, which suggested a higher level of relevance and semantic similarity than ROUGE and BLEU scores alone could capture. The models fine-tuned on combinations of languages did not demonstrate significant improvements over those fine-tuned solely on Amharic, indicating that the cross-lingual transfer of summarization capabilities was limited.

The analysis of these results led to several insights:

Impact of Dataset Quality: The improved performance of models fine-tuned on a more refined Amharic dataset underscored the importance of dataset quality. The study highlighted that for further advancements in NLP for low-resource languages like Amharic, enhancing the quality of training data is crucial.

Tokenization Truncation Analysis: A significant operational aspect of the study was the decision to set maximum token size limits due to computational resource constraints. The analysis showed that truncation did not significantly impact the model’s performance, and a comparable level of effectiveness was achieved even with fewer training examples. This finding suggests that efficient utilization of training examples and resources can still yield strong model performance.
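A sketch of how such limits are typically applied with the mT5 tokenizer, and of how the share of truncated examples can be measured (the specific limits here are assumptions; the values actually used are documented in the report):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

MAX_INPUT_TOKENS = 512    # assumed article limit
MAX_TARGET_TOKENS = 128   # assumed summary limit

def preprocess(example):
    """Tokenize one (text, summary) pair with truncation applied."""
    model_inputs = tokenizer(example["text"],
                             max_length=MAX_INPUT_TOKENS, truncation=True)
    labels = tokenizer(text_target=example["summary"],
                       max_length=MAX_TARGET_TOKENS, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def truncated_share(texts, limit=MAX_INPUT_TOKENS):
    """Fraction of examples whose tokenized length exceeds the limit."""
    lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
    return sum(length > limit for length in lengths) / len(lengths)
```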

Future Directions

The results and methodologies of this project lay a promising foundation for future research in the field. The availability of standardized datasets, models, and detailed documentation in an open-access format encourages further exploration and collaboration in Amharic text summarization. Future research directions could include exploring parameter-efficient adaptations of other NLP models, such as mBART or GPT-3, to understand their efficacy in processing Amharic. This approach would not only broaden the scope of technology applicable to Amharic but also contribute to the broader field of NLP, especially in handling low-resource languages.

Building on this, there is potential in experimenting with different language combinations and with full fine-tuning methods to enhance cross-lingual transfer learning. Research could also extend to developing higher-quality Amharic datasets, incorporating varied sources like literature and official documents, to further enrich the dataset’s depth. Additionally, the rising interest in multilingual and cross-lingual text summarization in NLP research highlights the timeliness and relevance of this project’s outcomes. It positions the study not just as a step forward for Amharic NLP but also as a valuable contribution to the global NLP research community, encouraging similar explorations in other low-resource languages.

Resources

Link to Final Report: https://drive.google.com/file/d/1LivrnwkGAJfmCo6kDLTUJNIoy8RlpRF7/view?usp=sharing

Link to Datasets: https://www.kaggle.com/datasets/danielmekuriaw/benchmark-dataset-for-amharic-text-summarization/data

Link to Models: https://drive.google.com/drive/folders/1VmE_GlGwOLzDGC91zMI0UBcbhpoNLeNF?usp=sharing

Link to Notebooks: https://github.com/danielmekuriaw/mT5-PEFT-Amharic-Text-Summarization/

References

[1] Atnafu Lambebo Tonja, Tadesse Destaw Belay, Israel Abebe Azime, Abinew Ali Ayele, Moges Ahmed Mehamed, Olga Kolesnikova, and Seid Muhie Yimam. Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities, 2023.

[2] Maartje ter Hoeve, David Grangier, and Natalie Schluter. High-Resource Methodological Bias in Low-Resource Investigations, 2022.

[3] Atnafu Lambebo Tonja, Olga Kolesnikova, Muhammad Arif, Alexander Gelbukh, and Grigori Sidorov. Improving Neural Machine Translation for Low Resource Languages Using Mixed Training: The Case of Ethiopian Languages. In Obdulia Pichardo Lagunas, Juan Martínez-Miranda, and Bella Martínez Seis, editors, Advances in Computational Intelligence, pages 30–40, Cham, 2022. Springer Nature Switzerland.

[4] Britannica, The Editors of Encyclopaedia. Amharic language. https://www.britannica.com/topic/Amharic-language, July 2023.

[5] Alicia Grinberg and Miguel Grinberg. Amharic Alphabet. https://www.amharicmachine.com/default/alphabet, 2023.

[6] Tilahun Yeshambel, Josiane Mothe, and Yaregal Assabie. Amharic Adhoc Information Retrieval System Based on Morphological Features. Applied Sciences, 12(3), 2022.

Written by Daniel Mekuriaw

An undergraduate student at Yale University ('24) majoring in Computer Science and Statistics & Data Science
