Search Results

Found 207399 documents matching the query

Nia Dwi Rahayuningtyas

Analisis Teks pada Tweet Berbahasa Indonesia untuk Mendeteksi Pro Kontra Vaksinasi Menggunakan Pendekatan Stance Detection dan Topic Modeling = Text Analytics on Indonesian Tweets to Detect Pro vs Anti Vaccination Using Stance Detection and Topic Modeling

"Parental hesitancy toward and refusal of vaccination are rising globally. The spread of vaccination issues through social media steers public perception toward vaccine hesitancy, which leads to reduced immunization coverage and a missed IDL target in Indonesia. On Twitter there are two groups: the pro-vaccine group, which supports vaccination, and the anti-vaccine group, which rejects it.

This research aims to identify whether a tweet leans pro- or anti-vaccine and to explore the topics associated with each stance. The dataset consists of more than 9,000 tweets collected from Twitter with the keywords "vaksin" and "imunisasi" between 11 August and 10 September 2019. Annotation was carried out in three consecutive steps with three label pairs: RELEVANT/IRRELEVANT, SUBJECTIVE/NEUTRAL, and PRO/ANTI. Three experiments, covering the selection of features, algorithms, and classification pipeline, were carried out to find the best stance detection model, defined as the one with the highest micro-averaged precision, recall, and f1-score.

The selected features are a combination of three text features, Count + Unigram + Bigram, with the Logistic Regression algorithm and the Two-stage Classification pipeline (f1-score = 80.5%). The selected topic modeling algorithms are NMF and LDA for the pro-vaccine and anti-vaccine corpora, respectively, with a coherence score of 0.999.

Anti-vaccine topics include criticism of the MUI halal fatwa for the MR vaccine, pork content in the Hajj meningitis vaccine, commercialization of vaccines, fake vaccines, KIPI and vaccine hazards, vaccines as a tool of conspiracy and a Jewish agenda, demands for halal vaccines, and so on. Pro-vaccine topics are more homogeneous: the benefits and importance of immunization, vaccine administration rules, campaigns publicizing immunization activities, and vaccine recommendations."

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2020

TA-Pdf

UI - Tugas Akhir Universitas Indonesia Library
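
The Count + Unigram + Bigram feature set named in the abstract above can be pictured with a small sketch. The function name `ngram_counts` and the whitespace tokenizer are illustrative assumptions, not the thesis's actual preprocessing; a real pipeline would feed such counts into a classifier such as Logistic Regression.

```python
from collections import Counter


def ngram_counts(text, n_values=(1, 2)):
    """Return raw counts of word n-grams (unigrams and bigrams by default).

    Tokenisation is a plain lowercase whitespace split, which is an
    assumption made for illustration only.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Prints each unigram and bigram of the example text with its count.
    for gram, count in sorted(ngram_counts("vaksin halal vaksin").items()):
        print(gram, count)
```

Keeping unigrams and bigrams in one count vector lets a linear model weigh single words and short phrases side by side.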

Gibran Brahmanta Patriajati

Analisis Performa Pendekatan Topic Modeling dan Similarity Measure untuk Text Summarization secara Ekstraktif pada Teks Berbahasa Indonesia = Performance Analysis of Topic Modeling and Similarity Measure Approach for Extractive Text Summarization in Indonesian Text

"Extractive text summarization can improve the quality of the user experience in an information retrieval system. Research on extractive text summarization is language-specific. For English, several studies exist, one of which is Belwal et al. (2021), who introduced an extractive text summarization method based on topic modeling and a semantic measure using WordNet. For Indonesian there are also several studies on extractive text summarization, but none has used the method introduced by Belwal et al. (2021). To apply that method to Indonesian, the WordNet-based semantic measure must be replaced with a similarity measure over a vector space model, because no publicly usable Indonesian WordNet model is available. Several methods can be used for each of the three components: topic modeling, vector space model, and similarity measure. This study searches, using a hill-climbing approach, for the combination of methods for these three components that maximizes the performance of the Belwal et al. (2021) method on Indonesian. Evaluation uses the ROUGE-N metric, reported as F-1 score, on two datasets, Liputan6 and IndoSUM. The best-performing combination is Non-Negative Matrix Factorization for topic modeling, Word2Vec for the vector space model, and Euclidean distance for the similarity measure.
This combination achieves ROUGE-1 of 0.291, ROUGE-2 of 0.140, and ROUGE-3 of 0.079 on Liputan6, and ROUGE-1 of 0.455, ROUGE-2 of 0.337, and ROUGE-3 of 0.300 on IndoSUM. Its performance is fairly competitive with other methods, such as TextRank and methods based on the BERT deep learning model, when the input document is coherent."

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2022

S-pdf

UI - Skripsi Membership Universitas Indonesia Library
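
The ROUGE-N F-1 score used for evaluation in the abstract above can be sketched as follows. This is a minimal version that assumes whitespace tokenisation and a single reference summary; the function names are illustrative, and established ROUGE toolkits add stemming and other options not shown here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """F-1 of n-gram overlap between a candidate and a reference summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_n_f1("the cat sat", "the cat sat on the mat", n=1))
    print(rouge_n_f1("the cat sat", "the cat sat on the mat", n=2))
```

Higher n rewards longer exact phrase matches, which is consistent with the ROUGE-3 values in the abstract being lower than the ROUGE-1 values.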

Rilo Chandra Pradana

Deep Embedded Clustering untuk Pendeteksian Topik Tweet Berita Berbahasa Indonesia = Deep Embedded Clustering for Topic Detection on Indonesian News Tweet

Topic detection is a technique for obtaining the topics contained in textual data. One approach to topic detection is clustering. In general, however, clustering does not produce effective clusters on high-dimensional data, so dimensionality reduction is needed before clustering in a lower-dimensional feature space. This research uses Deep Embedded Clustering (DEC) for topic detection. DEC optimizes the feature space and the clusters simultaneously, in two stages. The first stage trains an autoencoder to obtain the encoder weights used for dimensionality reduction, and runs k-means clustering to obtain the initial centroids. The second stage computes the soft assignments, determines the auxiliary distribution that represents the clusters in the data space, and then applies backpropagation to update the encoder weights and the centroids. Two DEC models are built: standard DEC and DEC without backpropagation, which omits the backpropagation step in the second stage. Each model produces topics, which are evaluated using coherence. The results show that DEC without backpropagation is better than standard DEC in computation time, with only a slight difference in coherence.

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2020

S-pdf

UI - Skripsi Membership Universitas Indonesia Library
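
The second DEC stage described in the abstract above (soft assignment followed by an auxiliary distribution) can be sketched in a few lines. The formulas follow the commonly cited DEC formulation with a Student's t kernel and one degree of freedom; that the thesis uses exactly this setting is an assumption, and the function names are illustrative.

```python
def soft_assignment(embedded, centroids):
    """q[i][j]: Student's t similarity of point i to centroid j, row-normalised."""
    q = []
    for z in embedded:
        sims = [
            1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)))
            for mu in centroids
        ]
        total = sum(sims)
        q.append([s / total for s in sims])
    return q


def target_distribution(q):
    """p[i][j] proportional to q[i][j]^2 / f_j, where f_j is the soft cluster size."""
    f = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[j] ** 2 / f[j] for j in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


if __name__ == "__main__":
    q = soft_assignment([[0.0, 0.0], [2.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]])
    p = target_distribution(q)
    print(q[0], p[0])
```

The auxiliary distribution sharpens confident assignments. In standard DEC the encoder and centroids are then updated by backpropagating the KL divergence between `p` and `q`; the "without backpropagation" variant in the abstract skips that update.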

Latifah Al Haura

Deteksi Malicious Account pada Akun Twitter Indonesia Berbasis Tweet = Malicious Accounts Detection on Indonesian Twitter Accounts using Tweets

"Fraud and even information theft now occur frequently on social media through posts by irresponsible users in the form of statuses, tweets, or spam messages containing dangerous links. This is inseparable from the existence of malicious accounts, which have become very troubling and disrupt the security and comfort of social media users. This study therefore aims to use features of tweets (text) to detect malicious accounts among Indonesian Twitter users. Two text feature extraction methods are used and compared: Word2Vec and FastText. The study also compares Machine Learning and Deep Learning methods for classifying users or accounts based on these tweet features. The Machine Learning algorithms used are Logistic Regression, Decision Tree, and Random Forest, while the Deep Learning algorithm is Long Short-Term Memory (LSTM). Across all test scenarios, the average performance of the Word2Vec feature extraction method is higher than that of FastText, with an F1-Score of 74%, and the Random Forest classification method outperforms the other three methods, with an F1-Score of 82%. The best combination of feature extraction and classification method is pre-trained Word2Vec with LSTM, with an F1-Score of 84%."

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2021

T-pdf

UI - Tesis Membership Universitas Indonesia Library

Sultan Daffa Nusantara

Pendekatan Rule-based Menggunakan Kamus dan Named Entity Recognizer untuk Mendeteksi dan Mengoreksi Kesalahan Penulisan Huruf Kapital pada Teks Berbahasa Indonesia = A Rule-based Approach Using Dictionary and Named Entity Recognizer for Detecting and Correcting Capitalization Errors in Indonesian Text

"The correct use of capital letters is an important aspect of writing good and correct Indonesian. The rules for capital letters in Indonesian are set out in Pedoman Umum Ejaan Bahasa Indonesia (PUEBI), which contains 23 such rules. Previous research began developing a detector and corrector for capitalization errors in Indonesian using a rule-based approach with dictionaries and a Named Entity Recognition (NER) component. However, that research covered only 9 of the 23 capitalization rules in PUEBI, and its test dataset was not published, so it cannot be used in follow-up work. This study proposes a method to detect and correct errors under 14 of the 23 PUEBI rules, using an approach similar to the previous research. The NER model was developed by fine-tuning the IndoBERT pretrained language model on an NER dataset. To test the proposed rule-based method, a synthetic dataset of 5,000 sentence pairs was created. Each pair consists of a correctly capitalized sentence and its incorrect counterpart, produced by changing some of the capital letters in the originally correct sentence. Before the incorrect sentences were corrected, accuracy was 83.10%. After applying the method, accuracy rose by 12.35 percentage points to 95.45%."

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2023

S-pdf

UI - Skripsi Membership Universitas Indonesia Library
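
A toy version of the dictionary-plus-rules idea in the abstract above might look like the sketch below. It covers only two situations (the first word of a sentence and names found in a lookup table); the `PROPER_NOUNS` table is a hypothetical stand-in for the thesis's dictionary and NER component, and none of the 14 PUEBI rules is implemented as the author implemented it.

```python
# Hypothetical lookup table standing in for a dictionary / NER component.
PROPER_NOUNS = {"indonesia": "Indonesia", "jakarta": "Jakarta"}


def correct_capitalization(sentence):
    """Capitalise the sentence-initial word and any known proper noun."""
    fixed = []
    for i, word in enumerate(sentence.split()):
        # Strip trailing/leading punctuation before the dictionary lookup.
        key = word.strip(".,!?").lower()
        if key in PROPER_NOUNS:
            # Normalise the whole token, then restore the canonical casing.
            word = word.lower().replace(key, PROPER_NOUNS[key])
        elif i == 0:
            word = word[:1].upper() + word[1:]
        fixed.append(word)
    return " ".join(fixed)


if __name__ == "__main__":
    print(correct_capitalization("saya tinggal di jakarta."))
```

A synthetic test pair in the spirit of the abstract could be built by lowercasing selected capitals in a correct sentence and checking that the corrector restores the original.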

Nicholas Ramos Richardo

Analisis Performa EFCM dengan BERT sebagai Representasi Teks pada Pendeteksian Topik = The Performance of EFCM with BERT as Text Representation on Topic Detection

"Topic detection is the process of determining the topics in a text by analyzing the words in it. It can be done by reading the text, but this becomes increasingly difficult as the data grows. Machine learning methods offer an alternative for handling large amounts of data. Clustering is a method for grouping similar items in a dataset. Examples of clustering methods are K-Means, Fuzzy C-Means (FCM), and Eigenspace-Based Fuzzy C-Means (EFCM). EFCM is a clustering method that combines the dimensionality reduction method Truncated Singular Value Decomposition (TSVD) with FCM (Murfi, 2018). For topic detection, text must be represented as numerical vectors because clustering models cannot process raw text. The previously common method is Term Frequency-Inverse Document Frequency (TFIDF). In 2018 a new method was introduced, Bidirectional Encoder Representations from Transformers (BERT), a pretrained language model developed by Google. This study uses the BERT model and the EFCM clustering method for the topic detection problem. Model performance is evaluated using the coherence metric. The simulation results show that the modified TFIDF method for topic determination is superior to the centroid-based method, achieving a higher coherence value on two of the three datasets. In addition, BERT is superior to TFIDF, with a higher coherence value on all three datasets."

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2022

S-pdf

UI - Skripsi Membership Universitas Indonesia Library
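
The FCM step that EFCM applies after dimensionality reduction, as described in the abstract above, assigns each point a graded membership in every cluster. The sketch below shows only the standard membership update for fixed centroids; the TSVD projection and the centroid update are omitted, and the function name and default fuzzifier `m=2.0` are assumptions for illustration.

```python
import math


def fcm_memberships(points, centroids, m=2.0):
    """Fuzzy c-means membership of each point in each cluster (fuzzifier m > 1)."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [math.dist(x, c) for c in centroids]
        if 0.0 in dists:
            # The point coincides with a centroid: give it full membership there.
            zeros = dists.count(0.0)
            memberships.append([1.0 / zeros if d == 0.0 else 0.0 for d in dists])
            continue
        memberships.append([
            1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
            for d_j in dists
        ])
    return memberships


if __name__ == "__main__":
    print(fcm_memberships([[1.0, 0.0]], [[0.0, 0.0], [3.0, 0.0]]))
```

Because memberships are soft, a document can contribute to several topics at once, which is one reason fuzzy clustering is attractive for topic detection.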

Lista Kurniawati

Analisis Kinerja Gabungan Metode Representasi Teks BERT dan Metode Clustering DEC untuk Pendeteksian Topik = Performance Analysis of BERT as Text Representation Method and DEC Clustering Method for Topic Detection

"Topic detection is a computational problem that analyzes the words of textual data to find the topics in it. On large data, topic detection is more effective and efficient when done with machine learning methods. Textual data must be converted into a numerical vector representation before being fed into a machine learning model. The commonly used text representation method is TF-IDF, but it produces a representation that ignores context. BERT (Bidirectional Encoder Representations from Transformers) is a text representation method that takes into account the context of a word in a document. This study compares the performance of BERT with TF-IDF for topic detection. The resulting text representation is then fed into a machine learning model. One machine learning approach to topic detection is clustering. A popular clustering method is Fuzzy C-Means, but it is not effective on high-dimensional data. Because news text data usually has fairly high dimensionality, dimensionality reduction is needed. Deep Embedded Clustering (DEC) is a clustering method that performs deep-learning-based dimensionality reduction, and this research uses the DEC model for topic detection. The topic detection experiment using the DEC (member) model with the BERT text representation on news text data shows a slightly better coherence value than with the TF-IDF text representation."

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2022

S-pdf

UI - Skripsi Membership Universitas Indonesia Library
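
To make concrete why TF-IDF, as discussed in the abstract above, ignores context, here is a minimal sketch of one common TF-IDF variant (relative term frequency times the natural log of N over document frequency). Other weighting variants exist, and the abstract does not say which one the thesis uses; the function name is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    for vec in tf_idf([["vaksin", "halal"], ["vaksin", "aman"]]):
        print(vec)
```

The weight of a term depends only on how often it occurs, not on the words around it, whereas a contextual model such as BERT can give the same word different vectors in different sentences.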

Muktiari

Kernelisasi fuzzy c-means berbasis ruang eigen untuk pendeteksian topik pada berita online berbahasa Indonesia = Kernelization of eigenspace-based fuzzy c-means for topic detection on Indonesian online news

"ABSTRACT

Topic detection is a practical method for finding topics in a collection of documents. One approach is clustering-based, where the centroids are interpreted as topics, for example eigenspace-based fuzzy c-means (EFCM). The clustering process in EFCM is performed in a lower-dimensional space, the eigenspace, so the accuracy of the clustering may be reduced. In this thesis, the author uses the kernel method so that the clustering process is performed in a higher-dimensional space without transforming the data into that space. The author's simulations show that this kernelization improves the accuracy of EFCM in terms of interpretability scores on Indonesian online news."

Depok: Fakultas Matematika dan Ilmu Pengetahuan Alam Universitas Indonesia, 2018

T50790

UI - Tesis Membership Universitas Indonesia Library
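
The phrase "without transforming the data into that space" in the abstract above refers to the kernel trick: squared distances in the implicit feature space can be written purely in terms of kernel evaluations, since ||phi(x) - phi(c)||^2 = K(x, x) - 2 K(x, c) + K(c, c). The sketch below assumes a Gaussian (RBF) kernel and a centroid kept in the input space; which kernel and which kernelized FCM variant the thesis uses is not stated in the abstract.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel; gamma is an illustrative default, not a tuned value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_distance_sq(x, c, kernel=rbf_kernel):
    """Squared feature-space distance computed from kernel values only."""
    return kernel(x, x) - 2.0 * kernel(x, c) + kernel(c, c)


if __name__ == "__main__":
    print(kernel_distance_sq([1.0, 0.0], [0.0, 0.0]))
```

Substituting this distance for the Euclidean one in the fuzzy c-means membership update gives a kernelized clustering step without ever constructing the high-dimensional vectors.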

Nofa Aulia

Deteksi ujaran kebencian teks panjang berbahasa Indonesia menggunakan data facebook = Hate speech detection on Indonesian long text using facebook data

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2019

T51811

UI - Tesis Membership Universitas Indonesia Library

Darell Hendry

Identifikasi dan reorganisasi intent pada chatbot menggunakan pemodelan topik = Identification and reorganization of chatbot intent using topic modeling.

"A chatbot used by an institution as a virtual assistant can benefit its users. Users can talk to the chatbot directly through short messages; the system then immediately identifies the intent of each message and responds with the relevant action. Unfortunately, the chatbot's knowledge is limited when handling increasingly varied user messages. The main impact of this variation is a change in the composition of the intent labels. This research therefore focuses on two things. First, topic modeling is used to discover intents in user messages whose intent has not been identified. Second, topic modeling is used to reorganize existing intents by analyzing the output of the topic model. The analysis reveals two possible kinds of change in intent composition, merging and splitting of intents, caused by noise in the annotation of the original dataset. The topic models used are Latent Dirichlet Allocation (LDA) as the baseline and the state-of-the-art models Top2Vec and BERTopic. The research was conducted on a dataset from an e-commerce company in Indonesia and four public datasets. The topic models are evaluated with the coherence, topic diversity, and topic quality metrics. The results show that BERTopic and Top2Vec achieve a topic quality of 0.036, better than LDA at -0.014. Intent splitting and intent merging were also identified through proportion threshold analysis."

Depok: Fakultas Ilmu Komputer Universitas Indonesia, 2021

S-pdf

UI - Skripsi Membership Universitas Indonesia Library
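
The abstract above does not define its evaluation metrics. One widely used convention, assumed here, takes topic diversity as the share of unique words among the top-k words of all topics, and topic quality as coherence multiplied by diversity, which would also explain how a negative value such as -0.014 can arise when coherence is negative. The function names and example words are illustrative.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)


def topic_quality(coherence, diversity):
    """Assumed definition: product of coherence and diversity."""
    return coherence * diversity


if __name__ == "__main__":
    topics = [["bayar", "refund", "kurir"], ["kurir", "paket", "alamat"]]
    diversity = topic_diversity(topics, top_k=3)
    print(diversity, topic_quality(0.1, diversity))
```

A diversity close to 1 means the topics share few top words, which matters for intent discovery because overlapping topics suggest intents that might be merged.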
