Comparing automated text classification methods
Authors: Jochen Hartmann, Juliana Huppertz, Christina Schamp, Mark Heitmann
Institution: Marketing & Customer Insight, University of Hamburg, Moorweidenstraße 18, 20148 Hamburg, Germany
Abstract: Online social media drive the growth of unstructured text data. Many marketing applications require structuring this data at scales inaccessible to human coding, e.g., to detect communication shifts in sentiment or other researcher-defined content categories. Several methods have been proposed to automatically classify unstructured text. This paper compares the performance of ten such approaches (five lexicon-based, five machine learning algorithms) across 41 social media datasets covering major social media platforms, various sample sizes, and languages. So far, marketing research has relied predominantly on support vector machines (SVM) and Linguistic Inquiry and Word Count (LIWC). Across all tasks we study, either random forest (RF) or naive Bayes (NB) performs best in terms of correctly uncovering human intuition. In particular, RF exhibits consistently high performance for three-class sentiment, and NB for small sample sizes. SVM never outperforms the remaining methods. All lexicon-based approaches, LIWC in particular, perform poorly compared with machine learning. In some applications, accuracies only slightly exceed chance. Since additional considerations of text classification choice are also in favor of NB and RF, our results suggest that marketing research can benefit from considering these alternatives.
Keywords: Text classification; Social media; Machine learning; User-generated content; Sentiment analysis; Natural language processing
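
To make the distinction between the two method families in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a lexicon-based classifier (counting words from a fixed dictionary) with a multinomial naive Bayes classifier learned from labeled examples. The training texts, the word lists, and all function names are hypothetical and invented for illustration; none of them come from the paper or from LIWC.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration (not from the paper).
TRAIN = [
    ("love this phone great battery", "pos"),
    ("great service very happy", "pos"),
    ("happy with the fast delivery", "pos"),
    ("terrible battery very disappointed", "neg"),
    ("awful service never again", "neg"),
    ("disappointed with the slow delivery", "neg"),
]

# Hypothetical sentiment dictionary standing in for a real lexicon.
POSITIVE_WORDS = {"love", "great", "happy"}
NEGATIVE_WORDS = {"terrible", "awful", "disappointed"}


def lexicon_classify(text):
    """Lexicon approach: count dictionary hits; ties default to 'pos'."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE_WORDS for t in tokens)
    score -= sum(t in NEGATIVE_WORDS for t in tokens)
    return "pos" if score >= 0 else "neg"


def train_naive_bayes(examples):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in text.lower().split():
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def nb_classify(text, model):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        logp = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Smoothed log likelihood of the word given the class.
            logp += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


model = train_naive_bayes(TRAIN)
for sample in ["great phone", "slow delivery"]:
    print(sample, "|", lexicon_classify(sample), "|", nb_classify(sample, model))
```

On the text "slow delivery", the dictionary contains neither word, so the lexicon classifier falls back to its tie-breaking default, whereas the learned model can use the association between "slow" and the negative class in the training examples. This toy case only illustrates the mechanism; it says nothing about the accuracy figures reported in the paper.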
This article is indexed in ScienceDirect and other databases.