1. Articles in category: Segmentation

    481-504 of 830
    1. Speeding Up Bayesian HMM by the Four Russians Method

      Bayesian computations with Hidden Markov Models (HMMs) are often avoided in practice. Instead, due to reduced running time, point estimates – maximum likelihood (ML) or maximum a posteriori (MAP) – are obtained and observation sequences are segmented based on the Viterbi path, even though the lack of accuracy and dependency on starting points of the local optimization are well known. We propose a method to speed up Bayesian computations which addresses this problem for regular and time-dependent HMMs with discrete observations. In particular, we show that by exploiting sequence repetitions, using the four Russians method, and the conditional dependency structure, it is possible ...
    2. Annotation of sentence structure

      The focus of this article is on the creation of a collection of sentences manually annotated with respect to their sentence structure. We show that the concept of linear segments (linguistically motivated units, which may be easily detected automatically) serves as a good basis for the identification of clauses in Czech. The segment annotation captures such relationships as subordination, coordination, apposition and parenthesis; based on segmentation charts, individual clauses forming a complex sentence are identified. The annotation of a sentence structure enriches a dependency-based framework with explicit syntactic information on relations among complex units like clauses. We have gathered ...
    3. Domain-Adapted Word Segmentation for an Out-of-Domain Language Modeling

      This paper introduces a domain-adapted word segmentation approach for text in which a word delimiter is not used regularly. It depends on an unknown word extraction technique. This approach is essential for language modeling to adapt to new domains, since a vocabulary set is activated in the word segmentation step. We have achieved ERR 21.22% in Korean word segmentation. In addition, we show that incremental domain adaptation of the word segmentation gradually decreases the perplexity of input text. This means that our approach supports out-of-domain language modeling. Content Type: Book Chapter. Pages: 63-73. DOI: 10.1007/978-1-4614-1335-6_9. Authors: Euisok Chung, Speech ...
    4. Methods, systems, and products for classifying content segments

      Methods, systems, and products are disclosed for classifying content segments. A set of annotations is received that occur within a segment of time-varying content. Each annotation is scored to each node in an ontology. The segment is classified based on at least one of the scores.
    5. A New Strategy for Disambiguation in Segmentation of Chinese Words

      Segmentation is the basis of information processing in Chinese, and its difficulty lies in disambiguation. This paper puts forward a new method for disambiguation according to the frequency of single characters functioning as independent meaningful words, context, and the frequency of word collocation. Experiments have shown that it greatly improves the accuracy and efficiency of segmentation. Content Type: Book Chapter. Pages: 466-472. DOI: 10.1007/978-3-642-21411-0_76. Authors: Yueqi Liao, Information Science and Technology School of Hunan Agricultural University Changsha of China, Post Code: 410078; Shaoxian Tang, Information Science and Technology School of Hunan Agricultural University Changsha of ...
    6. On Morphological Analysis for Learner Language, Focusing on Russian

      We describe a framework for performing morphological analysis to account for learner language, focusing on Russian as an example of an inflecting language. Because a set of linguistic analyses is needed to provide feedback on potentially noisy data, there is a large amount of ambiguity for even well-formed words. Using a segmented POS lexicon as a test case, we show how to analyze subparts of words, in order to analyze variations. After describing and implementing this framework for Russian, we focus on removing undesirable analyses to keep the task feasible. This is essentially an investigation of how much overgeneration ...
    7. Automatic Pragmatic Text Segmentation of Historical Letters

      In this investigation we aim to reduce the manual workload by automatically processing a corpus of historical letters for pragmatic research. We focus on two consecutive subtasks: the first task is automatic text segmentation of the letters into formal and informal parts using a statistical n-gram based technique. As a second task we perform semantic labeling of the formal parts of the letters using supervised machine learning. The main stumbling block in our investigation is data sparsity due to the small size of the data set, compounded by the spelling variation present in the historical letters. We try ...
      Mentions: Portugal Gama Pinto
    8. How Ackuna wants to fix language translation by crowdsourcing it

      Wired.co.uk: "If someone enters a phrase that's already been translated properly -- translated, reviewed, edited, or proofread by a real human translator, in other words -- the machine translation step is skipped for that segment and the correct, human-translated ...
    9. Segmentation of Printed Devnagari Documents

      Document segmentation is one of the most important phases in machine recognition of any language. Correct segmentation of individual symbols determines the success of a character recognition technique. It is used to decompose an image of a sequence of characters into sub-images of individual symbols by segmenting lines and words. Devnagari is the most popular script in India. It is used for writing the Hindi, Marathi, Sanskrit and Nepali languages. Moreover, Hindi is the third most popular language in the world. Devnagari documents consist of vowels, consonants and various modifiers. Hence, proper segmentation of a Devnagari word is challenging. A simple approach ...
      Mentions: India Marathi Nagpur
    10. Topic Segmentation: Application of Mathematical Morphology to Textual Data

      Mathematical Morphology (MM) offers a generic theoretical framework for data processing and analysis. Nevertheless, it is still used mainly in the context of image analysis and processing, and attempts to use MM on other kinds of data are still quite rare. We believe MM can provide relevant solutions for data analysis and processing in a far broader range of application fields. To illustrate, we focus here on textual data and show how morphological operators (here the morphological segmentation using watershed transform) may be applied to these data. We thus provide an original MM-based solution to the thematic segmentation problem ...
      Mentions: France Rennes Cedex
    11. Good Friends, Bad News - Affect and Virality in Twitter

      The link between affect, defined as the capacity for sentimental arousal on the part of a message, and virality, defined as the probability that it be sent along, is of significant theoretical and practical importance, e.g. for viral marketing. The basic measure of virality in Twitter is the probability of retweet, and we are interested in which dimensions of the content of a tweet lead to retweeting. We hypothesize that negative news content is more likely to be retweeted, while for non-news tweets positive sentiments support virality. To test the hypothesis we analyze three corpora: A complete sample of ...
    12. Sequential latent Dirichlet allocation

      Understanding how topics within a document evolve over the structure of the document is an interesting and potentially important problem in exploratory and predictive text analytics. In this article, we address this problem by presenting a novel variant of latent Dirichlet allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e. a document consists of multiple segments (e.g. chapters, paragraphs), each of which is correlated to its antecedent and subsequent segments. Such progressive sequential dependency is captured by using the hierarchical two-parameter Poisson–Dirichlet process (HPDP). We develop an efficient collapsed Gibbs sampling ...
    13. Method for building parallel corpora

      A method for identifying documents for enriching a statistical translation tool includes retrieving a source document which is responsive to a source language query that may be specific to a selected domain. A set of text segments is extracted from the retrieved source document and translated into corresponding target language segments with a statistical translation tool to be enriched. Target language queries based on the target language segments are formulated. Sets of target documents responsive to the target language queries are retrieved. The sets of retrieved target documents are filtered, including identifying any candidate documents which meet a selection criterion ...
    14. A Computational Model of Unsupervised Speech Segmentation for Correspondence Learning

      In this paper, we develop a new conceptual framework for an important problem in language acquisition, the correspondence problem: the fact that a given utterance has different manifestations in the speech and articulation of different speakers and that the correspondence of these manifestations is difficult to learn. We put forward the Correspondence-by-Segmentation Hypothesis, which states that correspondence is primarily learned by first segmenting speech in an unsupervised manner and then mapping the acoustics of different speakers onto each other. We show that a rudimentary segmentation of speech can be learned in an unsupervised fashion. We then demonstrate that, using ...
    15. Classifying with Co-stems

      Besides the content, the writing style is an important discriminator in information filtering tasks. Ideally, the solution of a filtering task employs a text representation that models both kinds of characteristics. In this respect word stems are clearly content capturing, whereas word suffixes qualify as writing style indicators. Though the latter feature type is used for part-of-speech tagging, it has not yet been employed for information filtering in general. We propose a text representation that combines both the output of a stemming algorithm (stems) and the stem-reduced words (co-stems). A co-stem can be a prefix, an infix, a ...
      Mentions: Germany Benno Stein
    16. An Iterative Approach to Text Segmentation

      We present divSeg, a novel method for text segmentation that iteratively splits a portion of text at its weakest point in terms of the connectivity strength between two adjacent parts. To search for the weakest point, we apply two different measures: one is based on language modeling of text segmentation and the other, on the interconnectivity between two segments. Our solution produces a deep and narrow binary tree – a dynamic object that describes the structure of a text and that is fully adaptable to a user’s segmentation needs. We treat it as a separate task to flatten the tree ...
      Mentions: Canada Guelph Ontario
    17. Using SRX Standard for Sentence Segmentation

      In this paper, we evaluate using the SRX (Segmentation Rules eXchange) standard for specifying sentence segmentation rules. The rules were originally created for a proofreading tool called LanguageTool. As proofreading tools are quite sensitive to segmentation errors, the underlying segmentation mechanisms must be sufficiently reliable. Even though SRX allows only regular expressions as a means for specifying sentence breaks and exceptions to those breaks, our evaluation shows that it is sufficient for the task, both in terms of the performance of the algorithm used and correctness of results. Moreover, it offers interoperability with different tools, which in turn allows maintaining ...
    18. Automatically linking documents with relevant structured information

      FIELD OF THE INVENTION: The present invention relates generally to information extraction and, in particular, to discovering entities hidden in a given document with respect to a given relational database. BACKGROUND: Faced with growing knowledge management needs, enterprises are increasingly realizing the importance of seamlessly integrating, or interlinking, critical business information distributed across structured and unstructured data sources. However, in a typical enterprise environment, the str
      Mentions: Transaction IBM GTS
    19. Method and apparatus for constructing a link structure between documents

      TECHNICAL FIELD: The present invention relates to document information management technology and, more particularly, to a method and apparatus for constructing a link structure between documents. BACKGROUND: In most cases, information is related to other information. Information is linked together via links and a link topology structure is formed. The link topology is important information about information. A typical example of important linked systems is the WWW. The WWW is a hyperlinked collection. I
      Mentions: SIM
    20. A Transliteration Based Word Segmentation System for Shahmukhi Script

      Word segmentation is an important prerequisite for almost all Natural Language Processing (NLP) applications. Since the word is a fundamental unit of any language, almost every NLP system first needs to segment input text into a sequence of words before further processing. In this paper, Shahmukhi word segmentation is discussed in detail. The presented word segmentation module is part of a Shahmukhi-Gurmukhi transliteration system. Shahmukhi script is usually written without short vowels, leading to ambiguity. Therefore, we have designed a novel approach for Shahmukhi word segmentation in which we use target Gurmukhi script lexical resources instead of Shahmukhi resources. We employ ...
    21. Self-adjusting Bootstrapping

      Bootstrapping has been used as a very efficient method to extract a group of items similar to a given set of seeds. However, the bootstrapping method intrinsically has several parameters whose optimal values differ from task to task, and from target to target. In this paper, we first demonstrate that this is indeed the case and a serious problem. Then, we propose self-adjusting bootstrapping, where the original seed is segmented into the real seed and validation data. We initially bootstrap starting with the real seed, trying alternative parameter settings, and use the validation data to identify the optimal settings. This ...
    22. Word Segmentation for Dialect Translation

      This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text in order to improve the translation quality of statistical machine translation (SMT) approaches for the translation of local dialects by exploiting linguistic information of the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging ...
      Mentions: Japan Kyoto Russian