    1. Czech Text-to-Sign Speech Synthesizer

      Recent progress in developing the Czech Text-to-Sign Speech synthesizer is presented. The current goal is to improve the automatic synthesis system so that it produces accurate Sign Speech. The synthesis system converts written text to an animation of an artificial human model (avatar). This includes translating the text to sign phrases and converting them to the animation of the avatar. The animation is composed of movements and deformations of segments of the hands, the head, and the face. The system has been evaluated in two initial perceptual tests. The perceptual tests indicate that the designed synthesis ...
    2. Automatic Labeling Inconsistencies Detection and Correction for Sentence Unit Segmentation in Conversational Speech

      In conversational speech, irregularities such as overlaps and disruptions make it difficult to decide what constitutes a sentence. Thus, despite very precise guidelines on how to label conversational speech with dialog acts (DA), labeling inconsistencies are likely to appear. In this work, we present various methods to detect labeling inconsistencies in the ICSI meeting corpus. We show that by automatically detecting and removing the inconsistent examples from the training data, we significantly improve the sentence segmentation accuracy. We then manually analyze 200 of the noisy examples detected by the system and observe that only 13% of them are ...
    3. Query-Topic Focused Web Pages Summarization

      We present a novel Web page summarizer, ContextSummarizer, that groups the given Web pages into ‘sense-clusters’ respecting a user’s topical interests. ContextSummarizer then constructs an extractive summary for each sense-cluster. A user’s topical interest is described by the user, who selects and refines some of the word senses disambiguated within the content contexts of the given Web pages. Semantic similarity measures between the contents of Web pages/segments/sentences and the user-selected word senses are used to choose the most topically relevant sentences as the extractive summaries referring to a user’s topical interest. ContextSummarizer addresses the ...
    4. Meeting Structure Annotation

      We describe a generic set of tools for representing, annotating, and analysing multi-party discourse, including an ontology of multimodal discourse, a programming interface for that ontology, and NOMOS, a flexible and extensible toolkit for browsing and annotating discourse. We describe applications built using the NOMOS framework to facilitate a real annotation task, as well as for visualising and adjusting features for machine learning tasks. We then present a set of hierarchical topic segmentations and action item subdialogues collected over 56 meetings from the ICSI and ISL meeting corpora using our tools. These annotations are designed to support research towards automatic ...
    5. Lexical Cohesion Based Topic Modeling for Summarization

      In this paper, we address the problem of forming extracts for text summarization. Forming extracts involves selecting the most representative and significant sentences from the text. Our method takes advantage of the lexical cohesion structure of the text to evaluate the significance of sentences. Lexical chains have been used in summarization research to analyze the lexical cohesion structure and represent topics in a text. Our algorithm represents topics by sets of co-located lexical chains to take advantage of more lexical cohesion clues. Our algorithm segments the text with respect to each topic and finds the most important topic segments ...
    6. Bilingual Segmentation for Alignment and Translation

      We propose a method that bilingually segments sentences in languages with no clear delimiter for word boundaries. In our model, we first convert the search for the segmentation into a sequential tagging problem, allowing for a polynomial-time dynamic-programming solution, and incorporate a control to balance the monolingual and bilingual information at hand. Our bilingual segmentation algorithm, which integrates a monolingual language model and a statistical translation model, is devised to tokenize sentences more suitably for bilingual applications such as word alignment and machine translation. Empirical results show that bilingually motivated segmenters outperform a purely monolingual one in both the word-aligning (12% reduction ...
    7. Automatic resolution of segmentation ambiguities in grammar authoring

      A rules-based grammar is generated. Segmentation ambiguities are identified in training data. Rewrite rules for the ambiguous segmentations are enumerated and probabilities are generated for each. Ambiguities are resolved based on the probabilities. In one embodiment, this is done by applying the expectation maximization (EM) algorithm.
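
      The abstract above does not give the details of the EM step, so the following is only a minimal sketch of the general idea under stated assumptions: each training example comes with a list of candidate segmentations, each candidate is a list of (left-hand side, right-hand side) rewrite rules, and rule probabilities are normalized per left-hand side. The function names `em_rule_probs` and `resolve`, the data layout, and the toy data are hypothetical and not taken from the source.

```python
from collections import defaultdict
from math import prod


def em_rule_probs(examples, iterations=20):
    """Estimate rewrite-rule probabilities with EM (illustrative sketch).

    examples: list of training examples; each example is a list of candidate
    segmentations; each candidate is a list of (lhs, rhs) rewrite rules.
    This data layout is an assumption made for the sketch.
    """
    rules = {r for ex in examples for cand in ex for r in cand}
    by_lhs = defaultdict(set)
    for rule in rules:
        by_lhs[rule[0]].add(rule)
    # Start from a uniform distribution over the rules sharing a left-hand side.
    probs = {r: 1.0 / len(by_lhs[r[0]]) for r in rules}

    for _ in range(iterations):
        # E-step: posterior of each candidate segmentation, accumulated as
        # expected rule counts.
        counts = defaultdict(float)
        for ex in examples:
            weights = [prod(probs[r] for r in cand) for cand in ex]
            total = sum(weights)
            if total == 0.0:
                continue  # no candidate has support; skip this example
            for cand, w in zip(ex, weights):
                for r in cand:
                    counts[r] += w / total
        # M-step: renormalize the expected counts per left-hand side.
        lhs_totals = defaultdict(float)
        for (lhs, _), c in counts.items():
            lhs_totals[lhs] += c
        probs = {
            r: counts[r] / lhs_totals[r[0]] if lhs_totals[r[0]] > 0 else probs[r]
            for r in rules
        }
    return probs


def resolve(candidates, probs):
    """Pick the candidate segmentation with the highest rule-probability product."""
    return max(candidates, key=lambda cand: prod(probs[r] for r in cand))


# Toy data: "to new york" can be segmented two ways; "to boston" is unambiguous.
seg_a = [("Pre", "to"), ("City", "new york")]
seg_b = [("Pre", "to new"), ("City", "york")]
examples = [[seg_a, seg_b], [[("Pre", "to"), ("City", "boston")]]]

probs = em_rule_probs(examples)
best = resolve(examples[0], probs)
```

      In this toy setup the unambiguous example lends extra support to the rule `("Pre", "to")`, so the estimate shifts toward the first segmentation of the ambiguous example over the iterations.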
    8. Cross-Lingual Retrieval of Identical News Events by Near-Duplicate Video Segment Detection

      Reusing the large quantities of accumulated news video now requires technology for searching and tracking news topics. Moreover, since we need to understand a given topic from various viewpoints, we focus on detecting identical events in news programs from different countries. Currently, text information is generally used to retrieve news video. However, cross-lingual retrieval is complicated by machine translation performance and by differing viewpoints and cultures. In this paper, we propose a cross-lingual retrieval method for detecting identical news events that exploits image information together with text information. In an experiment, we verified the effectiveness of making use ...
    9. Learning Word Segmentation Rules for Tag Prediction

      In our previous work we introduced a hybrid, GA&ILP-based approach for learning stem-suffix segmentation rules from an unmarked list of words. Evaluation of the method was made difficult by the lack of word corpora annotated with their morphological segmentation. Here the hybrid approach is evaluated indirectly, on the task of tag prediction. A pair of stem-tag and suffix-tag lexicons is obtained by applying that approach to an annotated lexicon of word-tag pairs. The two lexicons are then used to predict the tags of unseen words in two ways, (1) by using only the stem and suffix generated ...
    10. Joint Inference in Information Extraction

      Hoifung Poon and Pedro Domingos, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, U.S.A. ({hoifung, pedrod}@cs.washington.edu). Abstract: The goal of information extraction is to extract database records from text or semi-structured sources. Traditionally, information extraction proceeds by first segmenting each candidate record separately, and then merging records that refer to the same entities. While computational
    11. A Chinese Segmentation and Tagging Module Based on the Interpolated Probabilistic Model

      Chinese is a challenging language for natural language processing. Unlike in languages such as English or Portuguese, the first step in Chinese text processing is segmentation, because a Chinese sentence contains no delimiters that mark word boundaries. Chinese processing also raises many ambiguity problems, such as segmentation ambiguities, unknown words, and part-of-speech ambiguities. In segmentation and tagging, one of the main tasks is to identify unknown words and recognize proper nouns, and research effort is being devoted to this particular problem. In this paper, an integrated application with segmentation and tagging ability has ...
    12. Compression of annotated nucleotide sequences.

      IEEE/ACM Trans Comput Biol Bioinform. 2007 Jul-Sep;4(3):447-57. Authors: Korodi G, Tabus I. This article introduces an algorithm for the lossless compression of DNA files, which contain annotation text besides the nucleotide sequence. First a grammar is specifically designed to capture the regularities of the annotation text. A revertible transformation uses the grammar rules in order to equivalently represent the original file as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the ...
    13. Text extraction and document image segmentation using matched wavelets and MRF model.

      IEEE Trans Image Process. 2007 Aug;16(8):2117-28. Authors: Kumar S, Gupta R, Khanna N, Chaudhury S, Joshi SD. In this paper, we have proposed a novel scheme for the extraction of textual areas of an image using globally matched wavelet filters. A clustering-based technique has been devised for estimating globally matched wavelet filters using a collection of groundtruth images. We have extended our text extraction scheme for the segmentation of document images into text, background, and picture components (which include graphics and continuous ...
    14. Large-scale evaluation of a medical cross-language information retrieval system.

      Medinfo. 2007;12(Pt 1):392-6. Authors: Markó K, Daumke P, Schulz S, Klar R, Hahn U. We propose an approach to multilingual medical document retrieval in which complex word forms are segmented according to medically relevant morpho-semantic criteria. At its core lies a multilingual dictionary, in which entries are equivalence classes of subwords, i.e. semantically minimal units. Using two different standard test collections for the medical domain, we evaluate our approach for six languages covered by our system. PMID: 17911746
    15. Finding Temporal Order in Discharge Summaries

      Philip Bramsen†, Pawan Deshpande†, Yoong Keok Lee‡, MS, and Regina Barzilay†, PhD. †Massachusetts Institute of Technology (MIT), Cambridge, MA; ‡DSO National Laboratories, Singapore. Abstract: A method for automatic analysis of time-oriented clinical narratives would be of significant practical import for medical decision making, data modeling and biomedical research. This paper proposes a robust corpus-based approach for temporal analysis of medical disc
    16. Minimum Cut Model for Spoken Lecture Segmentation

      Igor Malioutov (Massachusetts Institute of Technology, igorm@csail.mit.edu) and Regina Barzilay (Massachusetts Institute of Technology, regina@csail.mit.edu). Abstract: We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account long-range cohesion dependencies. Our results demonstrate tha
    17. Translation correlation device

      A confirmation link edition unit receives a confirmation link specified by a user. A paragraph correlation unit divides an English text and a Japanese text each into a plurality of paragraphs according to the specified confirmation link. A segment correlation calculation unit correlates an English segment to a Japanese segment for each paragraph. A correlation edition unit presents to the user the correspondence obtained by the segment correlation calculation unit, and edits the correspondence according to a correction instruction from the user, if any.
    18. A Discriminative Model Corresponding to Hierarchical HMMs

      Hidden Markov Models (HMMs) are very popular generative models for sequence data. Recent work has, however, shown that on many tasks, Conditional Random Fields (CRFs), a type of discriminative model, perform better than HMMs. We propose Hierarchical Hidden Conditional Random Fields (HHCRFs), a discriminative model corresponding to hierarchical HMMs (HHMMs). HHCRFs model the conditional probability of the states at the upper levels given observations. The states at the lower levels are hidden and marginalized in the model definition. We have developed two algorithms for the model: a parameter learning algorithm that needs only the states at the upper levels in ...
    19. Systems and methods for providing online fast speaker adaptation in speech recognition

      A system (230) performs speaker adaptation when performing speech recognition. The system (230) receives an audio segment and identifies the audio segment as a first audio segment or a subsequent audio segment associated with a speaker turn. The system (230) then decodes the audio segment to generate a transcription associated with the first audio segment when the audio segment is the first audio segment and estimates a transformation matrix based on the transcription associated with the first audio segment. The system (230) decodes the audio segment using the transformation matrix to generate a transcription associated with the subsequent audio segment ...
    20. Method and apparatus for browsing document content

      A computer-implemented method is provided that includes receiving a document and determining a file type for the document. In addition, the document is segmented into blocks of text as a function of the file type and at least one keyword, and a summary is generated for the document.
