Search Results

Found 2 documents matching the query

Mukhlizar Nirwan Samsuri

Perbandingan Penggunaan Kamus Terdistribusi, Partition Around Medoids (PAM) dan Struktur Data Trie dalam Perbaikan Ejaan Otomatis Pada Teks Formal Bahasa Indonesia = A Comparison of Distributed, PAM, and Trie Data Structure Dictionaries in Automatic Spelling Correction for Indonesian Formal Text

"Spelling errors can be divided into two types: non-word errors and real-word errors. A non-word error is a misspelling that yields a string not found in the dictionary, whereas a real-word error is a valid dictionary word used in the wrong place in a sentence. This study focuses on correcting non-word errors in formal Indonesian text. Its objective is to compare the effectiveness of three dictionary structures for spelling correction: a distributed dictionary, a PAM (Partition Around Medoids) dictionary, and a dictionary built on the trie data structure. All three are also compared against a simple flat dictionary used as the baseline. Correction candidates are ranked using two variants of edit distance, Levenshtein and Damerau-Levenshtein, together with n-grams. To support the study, a gold standard dataset was built from 200 sentences comprising 4,323 tokens, 288 of which are non-word errors. Across the combinations of dictionary type and edit distance, the trie data structure with Damerau-Levenshtein distance achieved the best accuracy in generating correction candidates, 95.89% in 45.31 seconds. The same combination also achieved the best accuracy in selecting the best candidate, 73.15%."
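The abstract's best-performing combination, a trie dictionary searched with Damerau-Levenshtein distance, can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual implementation, so everything here is an assumption: the names `TrieNode`, `build_trie`, and `candidates` are hypothetical, the tiny word list is made up, and the distance used is the optimal-string-alignment variant of Damerau-Levenshtein (adjacent transpositions count as one edit).

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set only on nodes that end a dictionary word


def build_trie(words: List[str]) -> TrieNode:
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def candidates(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (word, distance) pairs within max_dist of query.

    One dynamic-programming row is computed per trie node, so words that
    share a prefix also share the work done for that prefix.
    """
    results: List[Tuple[str, int]] = []
    n = len(query)
    first_row = list(range(n + 1))  # distances from the empty prefix

    def walk(node, ch, prev_ch, prev_row, prev_prev_row):
        row = [prev_row[0] + 1]
        for j in range(1, n + 1):
            cost = 0 if query[j - 1] == ch else 1
            best = min(row[j - 1] + 1,          # insertion
                       prev_row[j] + 1,         # deletion
                       prev_row[j - 1] + cost)  # substitution or match
            if (prev_prev_row is not None and j > 1
                    and ch == query[j - 2] and prev_ch == query[j - 1]):
                best = min(best, prev_prev_row[j - 2] + 1)  # adjacent transposition
            row.append(best)
        if node.word is not None and row[n] <= max_dist:
            results.append((node.word, row[n]))
        # Prune: if every entry already exceeds max_dist, no descendant can match.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, ch, row, prev_row)

    for ch, child in root.children.items():
        walk(child, ch, None, first_row, None)
    return sorted(results, key=lambda pair: (pair[1], pair[0]))


# Hypothetical mini-dictionary; "mkaan" is "makan" with two letters swapped.
root = build_trie(["makan", "makna", "malam", "main", "minum"])
print(candidates(root, "mkaan", 1))  # [('makan', 1)]
```

Under plain Levenshtein distance the swapped letters in "mkaan" would cost two edits, so "makan" would fall outside a distance-1 search; this is one plausible reason the two edit-distance variants can produce different candidate sets.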

Depok: Fakultas Ilmu Komputer, 2023

TA-pdf

UI - Final Project (Tugas Akhir), Universitas Indonesia Library

Hanif Arkan Audah

Perbandingan Metode Pemeriksa Ejaan antara SymSpell dan Kombinasi Damerau-Levenshtein Distance dengan Struktur Data Trie = A Spell Checker Method Comparison Between SymSpell and a Combination of Damerau-Levenshtein Distance With the Trie Data Structure

"A non-word error is a spelling error that produces a word not found in the dictionary. The objective of this study is to compare two spell-checking methods for non-word errors: SymSpell, and a combination of Damerau-Levenshtein distance with the trie data structure. Both methods perform isolated-word error correction on non-word errors. In the implementation, SymSpell comes in two variants, weighted and unweighted. The comparison begins by compiling a dictionary from KBBI V word entries, enriched with additional words from Wiktionary; the resulting dictionary contains 91,557 words. Next, a test dataset is generated synthetically using a modified version of Peter Norvig's candidate generation; it contains 58,532 misspelled words. Weighted SymSpell, Unweighted SymSpell, and the combination of Damerau-Levenshtein distance with the trie data structure are then compared on this synthetic dataset in terms of best match accuracy, candidate accuracy, and run time. The results show that SymSpell outperforms the combination of Damerau-Levenshtein distance with the trie data structure, because it achieves a higher best match accuracy and a lower run time while obtaining a candidate accuracy equivalent to the other methods. The best-performing implementation, Weighted SymSpell, obtained a best match accuracy of 66.79%, a candidate accuracy of 99.33%, and a run time of 0.39 ms per word."
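The idea behind SymSpell, precomputing delete-only variants of every dictionary word so that lookup needs no insertions, substitutions, or transpositions, can be sketched as follows. This is a simplified illustration and not the thesis's code or the API of any SymSpell library: the functions `deletes`, `build_index`, and `lookup` and the word list are hypothetical, and the sketch omits the steps a full implementation adds afterwards (verifying each candidate with an edit distance and ranking the survivors).

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def deletes(word: str, max_deletes: int) -> Set[str]:
    """The word itself plus every string made by removing up to max_deletes characters."""
    out = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in keep))
    return out


def build_index(words: Iterable[str], max_deletes: int) -> Dict[str, Set[str]]:
    """Map each delete variant to the dictionary words that produce it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for w in words:
        for d in deletes(w, max_deletes):
            index[d].add(w)
    return index


def lookup(index: Dict[str, Set[str]], query: str, max_deletes: int) -> Set[str]:
    """Unverified candidates: words sharing at least one delete variant with the query."""
    found: Set[str] = set()
    for d in deletes(query, max_deletes):
        found |= index.get(d, set())
    return found


# Hypothetical mini-dictionary.
index = build_index(["makan", "minum", "main"], max_deletes=1)
print(sorted(lookup(index, "mkan", max_deletes=1)))  # ['main', 'makan']
```

Note that "main" is returned even though it is two edits away from "mkan" (both reduce to "man" after one deletion), which is why a verification pass with an edit distance normally follows the index lookup.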

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

TA-pdf

UI - Final Project (Tugas Akhir), Universitas Indonesia Library
