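Research on Chinese New Word Identification Based on Regulations and Statistics
Cite this article: Wang Linlin. Research on Chinese New Word Identification Based on Regulations and Statistics[J]. Journal of Jiaxing College, 2014, 26(6): 124-130.
Author: Wang Linlin
Affiliation: College of Information Science and Engineering, Zaozhuang University, Zaozhuang, Shandong 277160
Funding: Shandong Province Higher Education Science and Technology Program project
Abstract: Current word segmentation methods cannot recognize the ordinary new words that continually appear on the Internet. To address this, a new segmentation method combining rules and statistics is designed. Based on the different word-formation patterns of new words, linguistic knowledge is used to break the new word identification problem into finer categories, and new words of the single-character-string pattern and of the suffix-string pattern are taken as the main identification targets of this paper. For candidate new words of the single-character-string pattern, building on an inside word probability model, feature information such as the internal cohesion of a candidate and its degree of dependence on the surrounding context is analyzed, and identification is performed by combining average mutual information with left and right adjacency information entropy. For candidate new words of the suffix pattern, noise strings are filtered out with a noise-tail dictionary obtained by training on a large-scale corpus, which yields the new words.

Keywords: new word identification; left and right information entropy; inside word probability; average mutual information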

Research on Chinese New Word Identification Based on Regulations and Statistics
Wang Linlin. Research on Chinese New Word Identification Based on Regulations and Statistics[J]. Journal of Jiaxing College, 2014, 26(6): 124-130.
Author: Wang Linlin
Institution: College of Information Science and Engineering, Zaozhuang University, Zaozhuang, Shandong 277160
Abstract: Because current word segmentation methods cannot recognize the new words that keep emerging on the Internet, we design a new word segmentation method that combines rules and statistics. Based on the different word-formation patterns of new words, we use linguistic knowledge to subdivide the new word identification problem, and take new words of the single-character-string pattern and of the suffix-string pattern as the main objects of study. For candidate new words of the single-character-string pattern, starting from an inside word probability model, we examine each candidate's internal cohesion and its dependence on the surrounding context, and identify new words by combining average mutual information with left and right entropy. For candidate new words of the suffix-string pattern, new words are obtained by filtering out noise strings with a noise-tail dictionary trained on a large-scale corpus.
Keywords: new word identification; left and right entropy; inside word probability; average mutual information
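
The abstract names two statistics for single-character-string candidates, average mutual information (internal cohesion) and left and right entropy (contextual freedom), but does not give their formulas. The sketch below is one common way to compute them, not the paper's implementation. It assumes that "average mutual information" means the mean pointwise mutual information over every binary split of the candidate, that probabilities are estimated as raw n-gram counts divided by the corpus length, and that a candidate is accepted when both scores pass a threshold. The function names and the thresholds `mi_min` and `entropy_min` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, max_len):
    """Count every character n-gram of length 1..max_len in the corpus."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def average_mutual_information(cand, counts, total):
    """Mean PMI over all binary splits of cand (an assumed definition)."""
    if len(cand) < 2:
        raise ValueError("candidate must have at least two characters")
    if counts[cand] == 0:
        raise ValueError("candidate does not occur in the counted corpus")
    scores = []
    for k in range(1, len(cand)):
        left, right = cand[:k], cand[k:]
        p_whole = counts[cand] / total
        p_parts = (counts[left] / total) * (counts[right] / total)
        scores.append(math.log2(p_whole / p_parts))
    return sum(scores) / len(scores)


def adjacency_entropy(text, cand, side):
    """Entropy of the characters seen directly left or right of cand."""
    neighbors = Counter()
    start = text.find(cand)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(cand)
        if 0 <= idx < len(text):
            neighbors[text[idx]] += 1
        start = text.find(cand, start + 1)
    n = sum(neighbors.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in neighbors.values())


def looks_like_new_word(text, cand, counts, mi_min=1.0, entropy_min=0.9):
    """Accept cand only if it is cohesive inside and free outside.

    mi_min and entropy_min are hypothetical thresholds for illustration.
    """
    ami = average_mutual_information(cand, counts, len(text))
    freedom = min(adjacency_entropy(text, cand, "left"),
                  adjacency_entropy(text, cand, "right"))
    return ami >= mi_min and freedom >= entropy_min
```

Taking the smaller of the left and right entropies is a design choice in this sketch: a string that is free on one side but always preceded or followed by the same character is more likely a fragment of a longer word than a word in its own right. The inside word probability model and the noise-tail dictionary mentioned in the abstract are not covered here.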
This article is indexed in VIP (维普) and other databases.