Research Output


王忠义, Zhang Jin, 黄京: Multi-granularity hierarchical topic-based segmentation of structured, digital library resources

Date: March 7, 2016, 16:28

Abstract

Purpose: Current segmentation systems almost invariably focus on linear segmentation and can only divide text into linear sequences of segments. This suits cohesive text such as news feeds, but not coherent texts such as digital library documents, which have hierarchical structures. To move beyond linear segmentation and achieve hierarchical segmentation of a digital library's structured resources, this paper proposes a new multi-granularity hierarchical topic-based segmentation system (MHTSS) to decide section breaks.

Design/methodology/approach: MHTSS adopts an up-down segmentation strategy to divide a structured digital library document into a document segmentation tree. Specifically, it works in three stages: document parsing, coarse segmentation based on document access structures, and fine-grained segmentation based on lexical cohesion.

Findings: This paper analyzed the limitations of document segmentation methods for structured digital library resources. The authors found that document access structures and lexical cohesion techniques complement each other and that combining them allows for better segmentation of structured digital library resources. Based on this finding, the paper proposed MHTSS for structured digital library resources. To evaluate it, MHTSS was compared with the TT and C99 algorithms on real-world digital library corpora. The comparison found that MHTSS achieves the top overall performance.

Practical implications: With MHTSS, digital library users can retrieve relevant information directly in segments instead of receiving the whole document. This will improve retrieval performance and dramatically reduce information overload.

Originality/value: This paper proposed MHTSS for structured digital library resources, which combines document access structures and lexical cohesion techniques to decide section breaks. With this system, end users can access a document by section through a document structure tree.
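The coarse-then-fine idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' MHTSS implementation. It assumes, purely for illustration, that headings are lines starting with "# " (standing in for document access structures) and that lexical cohesion is measured as bag-of-words cosine similarity between adjacent sentences with an arbitrary threshold. The function names and the threshold value are hypothetical, and the sketch does not implement the optimum partitioning clustering or AIC named in the keywords.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts for one sentence."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fine_segments(sentences, threshold=0.1):
    """Fine stage (assumed): break where adjacent sentences share little vocabulary."""
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            segments.append([cur])      # low cohesion: start a new segment
        else:
            segments[-1].append(cur)    # cohesive: extend the current segment
    return segments


def segment_document(lines, threshold=0.1):
    """Coarse stage (assumed): split on '# ' headings, then refine each section.

    Returns a two-level tree as a list of (heading, segments) pairs.
    Sections without body text are skipped.
    """
    tree = []
    heading, body = "(untitled)", []
    for line in list(lines) + ["# END"]:  # sentinel flushes the last section
        if line.startswith("# "):
            if body:
                tree.append((heading, fine_segments(body, threshold)))
            heading, body = line[2:], []
        elif line.strip():
            body.append(line.strip())
    return tree
```

Calling `segment_document` on a list of lines returns one entry per heading, each holding the cohesion-based segments found under it, which mirrors the document segmentation tree the abstract describes at a much smaller scale.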

 

Keywords: Hierarchical segmentation, Access structures, AIC, Digital library resources, Lexical cohesion, Optimum partitioning clustering, Structured segmentation

 

This paper was published in The Electronic Library, Vol. 35, No. 1.
