Search Results

Found 196818 documents matching the query

Muhammad Arief Fauzan

Analisis dan Mitigasi Religion Bias pada Dataset dan Embedding NLP Berbahasa Indonesia = Analysis and Mitigation of Religion Bias in Indonesian NLP Datasets and Embeddings

"Previous research has shown that religious identities are misrepresented in Indonesian media. Earlier studies have found that misrepresentation of marginalized identities in datasets and word embeddings for natural language processing (NLP) can harm those identities, and so must be mitigated. This research analyzes several Indonesian-language NLP datasets and word embeddings for religion bias, the impact of that bias on downstream performance, and the process and effect of debiasing datasets and word embeddings. Using the Pointwise Mutual Information (PMI) test to detect dataset bias and word similarity to detect word embedding bias, it finds that two out of three datasets and one out of four word embeddings contain religion bias. Machine learning models trained on biased datasets and word embeddings show degraded downstream performance, in the form of allocation harm and representation harm. Allocation harm appears as a worse false negative rate (FNR) and false positive rate (FPR) for certain religious identities, whereas representation harm appears as non-negative sentences containing religious identity terms being mispredicted as negative. Debiasing at the dataset and word embedding level mitigates the bias at that level, but its effect on allocation and representation harm varies considerably depending on the dataset and word embedding used to train the model. This research uses five debiasing methods: dataset debiasing using sentence templates, dataset debiasing using sentences obtained from Wikipedia, word embedding debiasing using Hard Debiasing, joint debiasing using sentence templates, and joint debiasing using sentences obtained from Wikipedia. Of the five, joint debiasing using sentence templates performs best at mitigating both allocation and representation harm."

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

T-pdf

UI - Tesis Membership Universitas Indonesia Library

Nida Lathifah

Bias liberal pada The New York Times dan The Washington Post: analisis wacana kritis terhadap pemberitaan pemilihan Presiden Indonesia tahun 2014 = Liberal bias in The New York Times and The Washington Post: critical discourse analysis of 2014 Indonesian Presidential election news

"The New York Times and The Washington Post are two American national newspapers that are known to have a liberal bias. This research examines the liberal bias of both newspapers in their coverage of the 2014 Indonesian presidential election. It uses the Critical Discourse Analysis framework, applying Norman Fairclough's three-dimensional method (textual analysis, discourse practice, and sociocultural practice).

The findings show that both newspapers are biased toward Jokowi and his coalition, who tend to support freedom and diversity. The findings matter for criticizing bias in American newspapers, especially in their coverage of news from other countries."

Depok: Fakultas Ilmu Pengetahuan Budaya Universitas Indonesia, 2015

MK-Pdf

UI - Makalah dan Kertas Kerja Universitas Indonesia Library

Zihan Nindia

Analisa Long-Short Term Memory dan BERT Embeddings pada Klasifikasi Teks Data SMS Spam Berbahasa Indonesia = Analysis of Long-Short Term Memory and BERT Embeddings on Text Classification of SMS Spam Data in Indonesian

"The rapid development of information and communication technology has brought many changes to human life. One of the most significant developments is the emergence of Short Message Service (SMS) technology. SMS is often misused as a medium for fraud against telephone users. Fraud often occurs through mass, random sending of up to ten thousand SMS messages per day to all users, which become spam for many people. Text classification using Long-Short Term Memory (LSTM) and BERT embeddings is performed to classify SMS data into two categories, spam and non-spam (ham). The data consists of 5575 labeled SMS messages. Using the LSTM + BERT method, this research achieves an accuracy of 97.85%. This method produces better results than the three previous models. The LSTM + BERT model achieves an accuracy 0.65% higher than LSTM."

Depok: Fakultas Teknik Universitas Indonesia, 2024

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Nabila Dita Putri

Pembangunan Data dan Model Analisis Emosi Fine-Grained pada Teks Media Sosial Berbahasa Indonesia = Fine-Grained Emotion Analysis on Indonesian Social Media Text: Dataset and Models

"Currently, no research in Indonesia uses fine-grained emotions for emotion analysis. In addition, the datasets available for emotion analysis in Indonesia are limited in the amount of data, the range of emotions, and their sources. In this study, the researchers built a large dataset for emotion analysis on Indonesian text, collected from multiple domains and sources. The dataset contains 33k texts, consisting of tweets collected from Twitter and comments collected from Instagram and YouTube posts. The domains covered are sports, entertainment, and life chapter. Thirty-six annotators annotated the dataset with fine-grained emotion labels under a multi-label scheme, using a new emotion taxonomy proposed by the researchers. The taxonomy consists of 44 fine-grained emotions grouped into six basic emotions. The researchers also built baseline models for emotion analysis. Two were obtained: a fine-tuned IndoBERT model, which achieved the highest micro F1-score of 0.3786, and a hierarchical logistic regression model, which achieved the highest exact match ratio of 0.2904. Both baseline models were also evaluated across domains to see how general and robust they are. The dataset and baseline models produced in this study are expected to be valuable resources for future research."

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Kaysa Syifa Wijdan Amin

Pembangunan Data dan Model Analisis Emosi Fine-Grained pada Teks Media Sosial Berbahasa Indonesia = Fine-Grained Emotion Analysis on Indonesian Social Media Text: Dataset and Models

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Gilang Catur Yudishtira

Pembangunan Data dan Model Analisis Emosi Fine-Grained pada Teks Media Sosial Berbahasa Indonesia = Fine-Grained Emotion Analysis on Indonesian Social Media Text: Dataset and Models

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Tsarina Dwi Putri

Analisis sistem rekomendasi berbasis konten dan perancangan tinjauan sistematis pada dataset publikasi penelitian (dengan pendekatan Word Embedding) = Content-Based recommendation system analysis and designing a systematic review on research publication dataset (using Word Embedding Approach)

"ABSTRACT

Word embedding has been widely used for topic modeling. The results have helped change how researchers think about text as a value. In their study of word embedding, Mikolov et al. (2013) convert text into vectors that can be visualized in a continuous vector space, in which proximity can be flexibly computed and which can be processed further in combination with other methods such as LSTM (Long Short Term Memory) and CNN (Convolutional Neural Network) for various research purposes. Many studies have since used embedding values for more complex purposes, which encouraged the author to re-examine their basic benefits and then explore them for an end purpose that no other study has pursued before.

This study uses the final embedding values in a simple way as a content-based recommendation system, which is then developed, as its novel contribution, into a tool for conducting systematic reviews. The results indicate that the merits of the word embedding method vary greatly depending on the dataset and hyperparameters used."

Depok: Fakultas Teknik Universitas Indonesia, 2020

T-Pdf

UI - Tesis Membership Universitas Indonesia Library

Julian Fernando

Pengembangan Metode Ekstraksi Sumber Daya NLP dari Kamus Dwibahasa Indonesia dan Bahasa Daerah = Extracting NLP Resources from Bilingual Dictionaries for Regional Languages in Indonesia

"The development of NLP for regional languages in Indonesia is still relatively slow. Several factors lie behind this, such as poor language documentation, small numbers of speakers, and a lack of resources for studying regional language NLP. This research aims to develop a general method for extracting NLP resources from bilingual dictionaries of Indonesian and regional languages. The resulting system can process many bilingual dictionaries at once into NLP resources. Dictionaries are first converted to machine-readable form and processed into a corpus of entries before extraction. A corpus of entries is a corpus that contains the full information of each dictionary entry, together with the font, size, and position of each word of the entry in the bilingual dictionary. Extraction relies on entry patterns, so entries must be standardized before resources are produced. Besides resource production, spelling correction is carried out specifically for the parallel corpus resources. To evaluate the extraction results, several bilingual dictionaries are taken as samples. Evaluation looks at how accurately each entry component is placed in the extraction results. The research team found that the system succeeded in extracting NLP resources optimally, in the form of bilingual lexicons, morphological dictionaries, and parallel corpora, from 32 Indonesian and regional language bilingual dictionaries. The system still has shortcomings: because extraction depends heavily on the accuracy of font detection, dictionary quality still has a large effect on the quality of the extraction results."

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Raden Fausta Anugrah Dianparama

Pengembangan Metode Ekstraksi Sumber Daya NLP dari Kamus Dwibahasa Indonesia dan Bahasa Daerah = Extracting NLP Resources from Bilingual Dictionaries for Regional Languages in Indonesia

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

S-pdf

UI - Skripsi Membership Universitas Indonesia Library

Harakan Akbar

Pengembangan Metode Ekstraksi Sumber Daya NLP dari Kamus Dwibahasa Indonesia dan Bahasa Daerah = Extracting NLP Resources from Bilingual Dictionaries for Regional Languages in Indonesia

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

S-pdf

UI - Skripsi Membership Universitas Indonesia Library
