    1. System and method for comparing and reviewing documents

      A document processing system for accurately and efficiently analyzing documents and methods for making and using same. Each incoming document includes at least one section of textual content and is provided in an electronic form or as a paper-based document that is converted into an electronic form. Since many categories of documents, such as legal and accounting documents, often include one or more common text sections with similar textual content, the document processing system compares the documents to identify and classify the common text sections. The document comparison can be further enhanced by dividing the document into document segments and ...
    2. The ACODEA framework: Developing segmentation and classification schemes for fully automatic analysis of online discussions

      Research related to online discussions frequently faces the problem of analyzing huge corpora. Natural Language Processing (NLP) technologies may allow automating this analysis. However, the state of the art in machine learning and text mining approaches yields models that do not transfer well between corpora related to different topics. Also, segmenting is a necessary step, but frequently, trained models are very sensitive to the particulars of the segmentation that was used when the model was trained. Therefore, in prior published research on text classification in a CSCL context, the data was segmented by hand. We discuss work towards overcoming these challenges. We ...
    3. A Novel Method For Speech Segmentation Based On Speakers' Characteristics. (arXiv:1205.1794v1 [cs.AI])

      Speech segmentation is the process of change point detection for partitioning an input audio stream into regions, each of which corresponds to only one audio source or one speaker. One application of this process is in speaker diarization systems. There are several methods for speaker segmentation; however, most speaker diarization systems use BIC-based segmentation methods. The main goal of this paper is to propose a new method for speaker segmentation with higher speed than the current methods (e.g. BIC) and acceptable accuracy. Our proposed method is based on the pitch frequency of the speech. The accuracy of this ...
      Mentions: Hossein Sameti BIC
    4. Method and vector analysis for a document

      The invention provides a document representation method and a document analysis method including extraction of important sentences from a given document and/or determination of similarity between two documents. The inventive method detects terms that occur in the input document, segments the input document into document segments, each segment being an appropriately sized chunk and generates document segment vectors, each vector including as its element values according to occurrence frequencies of the terms occurring in the document segments. The method further calculates eigenvalues and eigenvectors of a square sum matrix in which a rank of the respective document segment vector ...
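      The vector construction described above can be sketched as follows. This is a minimal illustration, not the patented method: the whitespace tokenizer, the caller-supplied segments, and the reading of "square sum matrix" as the sum of outer products of the segment vectors are all assumptions, and the function names are hypothetical. The eigenvalue step is left to a numerical library.

```python
from collections import Counter


def segment_vectors(segments):
    """Build one term-frequency vector per document segment.

    Assumption: terms are lowercase whitespace-separated tokens.
    """
    vocab = sorted({term for seg in segments for term in seg.lower().split()})
    vectors = []
    for seg in segments:
        counts = Counter(seg.lower().split())
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def square_sum_matrix(vectors):
    """Sum the outer products v * v^T over all segment vectors.

    Assumption: this is what the abstract means by "square sum matrix".
    """
    n = len(vectors[0])
    matrix = [[0] * n for _ in range(n)]
    for v in vectors:
        for i in range(n):
            for j in range(n):
                matrix[i][j] += v[i] * v[j]
    return matrix
```

      The resulting matrix is symmetric, so its eigenvalues and eigenvectors could then be computed with any standard symmetric eigensolver.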
    5. Mobile Lifelogger – Recording, Indexing, and Understanding a Mobile User’s Life

      A lifelog system captures personal experiences in the form of digital multimedia over an entire lifespan. Recent advancements in mobile sensor technologies have helped to develop these systems using commercial smartphones. These systems have the potential to act as a secondary memory and also aid people who struggle with episodic memory impairment (EMI). Despite their huge potential, there are major challenges that need to be addressed to make them useful. One of them is how to index the inherently large lifelog data so that the person can efficiently retrieve the log segments that interest him or her most. In this ...
    6. System and method of machine-aided information extraction rule development

      An automatic rule generation system generates rules for fact extraction. A rule generation module receives a sample and generates a rule from the sample. A rule relaxation module generates a relaxed rule from the rule. A rule testing module generates a reverse index from a corpus, applies the relaxed rule to the reverse index, and generates text segments. An information extraction module generates modified text segments from the relaxed rule and the text segments. A candidate suggestion module performs a candidate generation process: if the candidate generation process generates no candidates, the candidate suggestion module signals the rule relaxation module ...
    7. Learning word segmentation from non-white space languages corpora

      Illustrative embodiments provide a computer implemented method, apparatus, and computer program product for learning word segmentation from non-white space language corpora. In one illustrative embodiment, the computer implemented method receives text input characters and calculates a ratio-measure for each pair of characters in the input characters. The computer implemented method further determines whether the ratio-measure of each pair of characters is less than a predetermined threshold value. Responsive to determining the ratio-measure is less than the predetermined threshold value and is a local-minimum value, the computer implemented method further identifies the pair as a weak pair and breaks the weak pair of ...
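      The weak-pair rule described above can be sketched as follows. The abstract does not define the ratio-measure, so the one used here (bigram count divided by the product of the two character counts in a reference corpus) is purely an assumption, as are the function names and the tie-handling in the local-minimum check.

```python
from collections import Counter


def ratio_measures(text, corpus):
    """Score each adjacent character pair in `text`.

    Assumption: ratio = count(ab) / (count(a) * count(b)), measured on
    `corpus`. The patent's actual ratio-measure is not given in the abstract.
    """
    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    ratios = []
    for i in range(len(text) - 1):
        denom = unigrams[text[i]] * unigrams[text[i + 1]]
        ratios.append(bigrams[text[i:i + 2]] / denom if denom else 0.0)
    return ratios


def split_weak_pairs(text, ratios, threshold):
    """Break the text at pairs that are below threshold and a local minimum."""
    words, start = [], 0
    for i, r in enumerate(ratios):
        left = ratios[i - 1] if i > 0 else float("inf")
        right = ratios[i + 1] if i + 1 < len(ratios) else float("inf")
        if r < threshold and r <= left and r <= right:
            words.append(text[start:i + 1])  # cut between text[i] and text[i+1]
            start = i + 1
    words.append(text[start:])
    return words
```

      With the corpus "abababcdcdcd", the pair "bc" scores lower than its neighbours "ab" and "cd", so a threshold of 0.2 splits "abcd" into "ab" and "cd".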
    8. A two step salient objects extraction framework based on image segmentation and saliency detection

      Salient object extraction from a still image is a very active topic, as it has many useful applications (e.g., image compression, content-based image retrieval, digital watermarking). In this paper, aiming to improve the performance of the extraction approach, we propose a two-step salient object extraction framework based on image segmentation and saliency detection (TIS). Specifically, during the first step, the image is segmented into several regions using an image segmentation algorithm, and the saliency map for the whole image is detected with a saliency detection algorithm. In the second step, for each region, some features are extracted ...
    9. Segmentation Similarity and Agreement. (arXiv:1204.2847v1 [cs.CL])

      We propose a new segmentation evaluation metric, called segmentation similarity (S), that quantifies the similarity between two segmentations as the proportion of boundaries that are not transformed when comparing them using edit distance, essentially using edit distance as a penalty function and scaling penalties by segmentation size. We propose several adapted inter-annotator agreement coefficients which use S that are suitable for segmentation. We show that S is configurable enough to suit a wide variety of segmentation evaluations, and is an improvement upon the state of the art. We also propose using inter-annotator agreement coefficients to evaluate automatic segmenters in terms ...
      Mentions: Diana Inkpen
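      The idea behind S can be illustrated with a deliberately simplified sketch. It assumes segmentations are given as lists of segment sizes, counts every boundary present in only one segmentation as one edit, and scales by the number of potential boundaries. The transposition edits and the configurability described in the abstract are omitted, so this is not the authors' full metric, and the function names are hypothetical.

```python
def boundaries(masses):
    """Return the set of boundary positions implied by a list of segment sizes."""
    position, result = 0, set()
    for mass in masses[:-1]:  # no boundary after the final segment
        position += mass
        result.add(position)
    return result


def simple_similarity(seg_a, seg_b):
    """Simplified segmentation similarity.

    Assumption: each boundary found in only one segmentation costs one edit,
    and the penalty is scaled by the number of potential boundaries.
    """
    if sum(seg_a) != sum(seg_b):
        raise ValueError("segmentations must cover the same number of units")
    potential = sum(seg_a) - 1
    if potential == 0:
        return 1.0
    edits = len(boundaries(seg_a) ^ boundaries(seg_b))
    return 1 - edits / potential
```

      For example, comparing segment sizes [2, 3, 5] with [2, 8] gives one unmatched boundary out of nine potential boundaries.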
    10. A GPU-Based Accelerator for Chinese Word Segmentation

      The task of Chinese word segmentation is to split a sequence of Chinese characters into tokens so that Chinese information can be more easily retrieved by web search engines. Due to the dramatic increase in the amount of Chinese literature in recent years, it has become a big challenge for web search engines to analyze massive amounts of Chinese information in time. In this paper, we investigate a new approach to high-performance Chinese information processing. We propose a CPU-GPU collaboration model for Chinese word segmentation. In our novel model, a dictionary-based word segmentation approach is proposed to fit the GPU architecture. Three basic word ...
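      As a point of reference for dictionary-based segmentation, the sketch below implements forward maximum matching, a common baseline. The abstract does not say which matching strategy the paper uses or how it is mapped to the GPU, so this CPU-only version, its function name, and the max_len parameter are illustrative assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation using the longest dictionary match.

    Characters with no dictionary match are emitted as single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

      With the dictionary {"研究", "研究生", "生命", "起源"}, the input "研究生命起源" is split into "研究生", "命", "起源", which shows the well-known weakness of greedy matching on ambiguous strings.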
    11. Technique for searching out new words that should be registered in dictionary for speech processing

      To search out a new word that should be newly registered in a dictionary contained in a segmentation device for segmenting a text into words. This system inputs a training text into the segmentation device to cause the segmentation device to segment the training text into words, and thereby generates a plurality of segmentation candidates in association with certainty factors of the results of the segmentation, the segmentation candidates respectively containing mutually different combinations of words as results of the segmentation of the training text. Then, this system computes a likelihood that each word is a new word by ...
    12. A Symbolic Approach for Automatic Detection of Nuclearity and Rhetorical Relations among Intra-sentence Discourse Segments in Spanish

      Automatic discourse analysis is currently a very prominent research topic, since it is useful for developing several applications, such as automatic summarization, automatic translation, and information extraction. Rhetorical Structure Theory (RST) is the most widely used theory. Nevertheless, there are not many studies on this subject in Spanish. In this paper we present the first system assigning nuclearity and rhetorical relations to intra-sentence discourse segments in Spanish texts. To carry out the research, we analyze the learning corpus of the RST Spanish Treebank, a corpus of manually annotated specialized texts, in order to build a list of lexical and syntactic patterns marking rhetorical ...
    13. Predictive Text Entry for Agglutinative Languages Using Unsupervised Morphological Segmentation

      Systems for predictive text entry on ambiguous keyboards typically rely on dictionaries with word frequencies which are used to suggest the most likely words matching user input. This approach is insufficient for agglutinative languages, where morphological phenomena increase the rate of out-of-vocabulary words. We propose a method for text entry, which circumvents the problem of out-of-vocabulary words, by replacing the dictionary with a Markov chain on morph sequences combined with a third order hidden Markov model (HMM) mapping key sequences to letter sequences and phonological constraints for pruning suggestion lists. We evaluate our method by constructing text entry systems for ...
    14. Search-based word segmentation method and device for language without word boundary tag

      The present invention discloses a search-based segmentation method and device for a language without a word boundary tag. The inventive method includes the steps of: a. providing at least one search engine with a segment of a text including at least one segment; b. searching for the segment through the at least one search engine, and returning search results; and c. selecting a word segmentation approach for the segment in accordance with at least part of the returned search results. The invention solves the problems of word segmentation for a language without a word boundary tag, and thus combat the ...
    15. Segmenting DNA sequence into `words' based on statistical language model. (arXiv:1202.2518v2 [q-bio.GN] Cross Listed)

      This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, we find that the length of most DNA "words" is 12 to 15 bps by analyzing the genomes of 12 model species. The bound of language entropy of a DNA sequence is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised 'probability approach to word segmentation' method to segment the DNA sequences. A benchmark for the segmenting method is also proposed. In a cross-segmenting test, we find that different genomes may use a similar language, but belong to different branches ...
    16. Challenges of Chinese Natural Language Processing - Segmentation

      As the Chinese consumer market takes center stage in the world economy, the rush to adapt business tools for the Chinese market is equally frenzied. Fortunately, despite what my friend Ben might say, most of the adaptations are confined to the interface layer. That means, the majority of the challenges a
    17. Ontology Based Segmentation of Geo-Referenced Queries

      The latest generation of search engines is confronted with complex queries whose expression goes beyond the capability of the Bag of Words model and requires systems that understand query sentences. Among these queries, geo-referenced queries are of great importance, i.e. queries whose understanding requires localizing objects of interest, where the user location is the most important parameter. In this paper, we focus on geo-referenced queries and show how natural language analysis can be used to decompose queries into sub-queries and associate them with suitable real-world objects. We propose a syntactic and semantic approach, which ...
    18. A Novel Chinese Word Segmentation Method Utilizing Morphology Information

      In this paper, we present a novel approach to integrating morphology information into the statistical model for Chinese word segmentation (CWS), which yields better accuracy than the traditional CRF-based approach. The improvements are mainly attributed to two aspects. First, the structure information within words is integrated into the CRF model by annotating the Chinese word corpus with morphology tags, which convey the construction modes of Chinese words. Second, the training process adopts a joint CRF model to integrate structure information with other context, which combines the morphology tag and word boundary at the same state level and completes the word segmentation and ...
      Mentions: China Guangdong
    19. Combining Image-Level and Segment-Level Models for Automatic Annotation

      For the task of assigning labels to an image to summarize its contents, many early attempts use segment-level information and try to determine which parts of the images correspond to which labels. Best performing methods use global image similarity and nearest neighbor techniques to transfer labels from training images to test images. However, global methods cannot localize the labels in the images, unlike segment-level methods. Also, they cannot take advantage of training images that are only locally similar to a test image. We propose several ways to combine recent image-level and segment-level techniques to predict both image and segment labels ...
    20. The SALAH Project: Segmentation and Linguistic Analysis of ḥadīṯ Arabic Texts

      A model for the unsupervised segmentation and linguistic analysis of Arabic texts of Prophetic tradition (ḥadīṯs), SALAH, is proposed. The model automatically segments each text unit into a transmitter chain (isnād) and a text content (matn) and further analyzes each segment according to two distinct pipelines: a set of regular expressions chunks transmitter chains into a graph labeled with the relation between transmitters, while a tailored, augmented version of the AraMorph morphological analyzer (RAM) analyzes and annotates the text content lexically and morphologically. A graph with relations among transmitters and a lemmatized text corpus, both in XML format, are the ...
    21. Pinyin Tagging System Research and Implementation Based on Word Segmentation

      The pinyin tagging system has a wide range of applications. Tagging accuracy is lowered by Chinese polyphone problems. Because the proportion of multi-tone words in Chinese is far smaller than the proportion of polyphonic characters, a tagging approach based on word segmentation can improve tagging accuracy. The tagging accuracy is improved by 10.8%. The pinyin dictionary mechanism is studied in the word segmentation process to improve the tagging speed.
      Content Type: Book Chapter. Pages: 495-500. DOI: 10.1007/978-3-642-25658-5_59. Authors: Zhiqiang Ma, College of Information Engineering, Inner Mongolia University of Technology, 010080 Hohhot, China; Limin Liu, College of Information Engineering, Inner Mongolia University of Technology, 010080 Hohhot ...
    22. Research on Dictionary Construction in Automatic Correcting

      Automatic correcting is a key technology in online examination systems. Semantic similarity is the main way to solve auto-correcting, but its exact calculation depends on the word similarity between the student's response and the correct answers. Identifying new words (glossary terms) and resolving segmentation ambiguity are the main factors that reduce the accuracy of word segmentation. A new method for constructing a dictionary that includes a glossary is proposed, in which new words are identified with a PAT array and ambiguities are eliminated by association rule mining. The accuracy of segmentation may reach 95%, an increase of 5% ...
      Mentions: China
