    1. Bayesian Transductive Markov Random Fields for Interactive Segmentation in Retinal Disorders

      In the realm of computer-aided diagnosis (CAD), interactive segmentation schemes have been well received by physicians, since the combination of human and machine intelligence can provide improved segmentation efficacy with minimal expert intervention [1-3]. Transductive learning (TL) or semi-supervised learning (SSL) is a suitable framework for learning-based interactive segmentation given the scarce-label problem. In this paper we present extended work on Bayesian transduction and regularized conditional mixtures for interactive segmentation [3]. We present a Markov random field model integrating a semi-parametric conditional mixture model within a Bayesian transductive learning and inference setting. The model allows efficient learning and ...
    2. Using automated content analysis for audio/video content consumption

      Audio/video (A/V) content is analyzed using speech and language analysis components. Metadata is automatically generated based upon the analysis. The metadata is used in generating user interface interaction components which allow a user to view subject matter in various segments of the A/V content and to interact with the A/V content based on the automatically generated metadata.
    3. Model-Guided Segmentation and Layout Labelling of Document Images Using a Hierarchical Conditional Random Field

      We present a model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields (CRFs, hereafter). Common methods to classify a pixel of a document image into classes (text, background and image) are often noisy and error-prone, and often require post-processing through heuristic methods. The input to the system is a pixel-wise classification produced by a Fisher classifier applied to the output of a set of Globally Matched Wavelet (GMW) filters. The system extracts features which encode contextual information and spatial configurations of a given document image, and learns relations between these layout entities using hierarchical ...
    4. A Novel Role-Based Movie Scene Segmentation Method

      Semantic scene segmentation is a crucial step in movie video analysis, and extensive research efforts have been devoted to this area. However, previous methods rely heavily on the video content itself and lack both an objective evaluation criterion and the necessary semantic link, owing to the semantic gap. In this paper, we propose a novel role-based approach for movie scene segmentation using the script. A script is a text description of movie content that contains the scene structure information and related character names, which can be regarded as an objective evaluation criterion and a useful external reference. The main novelty of our approach is ...
    5. Automatic Evaluation Of Machine Translation Via Word Choice And Word Order

      We propose a novel metric, ATEC, for automatic MT evaluation based on explicit assessment of word choice and word order in an MT output in comparison to its reference translation(s), the two most fundamental factors in the construction of meaning for a sentence. The former is assessed by matching word forms at various linguistic levels, including surface form, stem, sound and sense, and further by weighing the informativeness of each word. The latter is quantified in terms of the discordance of word position and word sequence between a translation candidate and its reference. In the evaluations using the ...
    6. AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan

      This article describes the enrichment of the AnCora corpora of Spanish and Catalan (400 k each) with coreference links between pronouns (including elliptical subjects and clitics), full noun phrases (including proper nouns), and discourse segments. The coding scheme distinguishes between identity links, predicative relations, and discourse deixis. Inter-annotator agreement on the link types is 85–89% above chance, and we provide an analysis of the sources of disagreement. The resulting corpora make it possible to train and test learning-based algorithms for automatic coreference resolution, as well as to carry out bottom-up linguistic descriptions of coreference relations as they occur ...
    7. Segmentation of strings into structured records

      A system for segmenting strings into component parts for use with a database management system. A reference table of string records is segmented into multiple substrings corresponding to database attributes. The substrings within an attribute are analyzed to provide a state model that assumes a beginning, a middle and an ending token topology for that attribute. A null token takes into account an empty attribute component, and copying of states allows for erroneous token insertions and misordering. Once the model is created from the clean data, the process breaks or parses an input record into a sequence of tokens. The ...
    8. Unsupervised Text Normalization Approach for Morphological Analysis of Blog Documents

      In this paper, we propose an algorithm for reducing the number of unknown words in blog documents by replacing peculiar expressions with formal expressions. Japanese blog documents contain many peculiar expressions that morphological analyzers treat as unknown sequences. Reducing these unknown sequences improves the accuracy of morphological analysis for blog documents. Manual registration of peculiar expressions in the morphological dictionaries is the conventional solution, but it is costly and requires specialized knowledge. In our algorithm, substitution candidates of peculiar expressions are automatically retrieved from formally written documents such as newspapers and stored as substitution rules. For the correct replacement, a substitution ...
    9. Text segmentation of spoken meeting transcripts

      Text segmentation has played an important role in information retrieval as well as natural language processing. Current segmentation methods are well suited for written and structured texts, making use of their distinctive macro-level structures; however, text segmentation of transcribed multi-party conversation presents a different challenge given its ill-formed sentences and the lack of macro-level text units. This paper describes an algorithm suitable for segmenting spoken meeting transcripts, combining semantically complex lexical relations with speech cue phrases to build lexical chains for determining topic boundaries. Content Type: Journal Article. DOI: 10.1007/s10772-009-9048-2. Authors: Bernadette Sharp, Staffordshire University FCET Beaconside Stafford ST18 ...
    10. System and method for audio hot spotting

      Audio hot spotting is accomplished by specifying a query criterion that includes a non-lexical audio cue. The non-lexical audio cue can be, e.g., speech rate, laughter, applause, vocal effort, speaker change or any combination thereof. The query criterion is retrieved from an audio portion of a file. A segment of the file containing the query criterion can be provided to a user. The duration of the provided segment can be specified by the user along with the files to be searched. A list of detections of the query criterion within the file can also be provided to the user. Searches ...
    11. Intended boundaries detection in topic change tracking for text segmentation

      This paper presents a topical text segmentation method based on intended boundaries detection and compares it to a well-known default boundaries detection method, c99. We compared the two methods by running them on two different corpora of French texts, and the results were evaluated by two different methods: one using a modified classic measure, the FScore, the other based on a manual evaluation on the Internet. Our results showed that algorithms that are close when automatically evaluated can be quite far apart when manually evaluated. Content Type: Journal Article. DOI: 10.1007/s10772-009-9051-7. Authors: Alexandre Labadié, LIRMM 161 rue Ada 34392 Montpellier ...
    12. Systems and methods for hybrid text summarization

      Techniques are provided for segmenting text into categorized discourse constituents and attaching discourse constituents into a structural representation of discourse. Techniques for determining hybrid structural and non-structural summaries of a text are also provided. A text is segmented based on a theory of discourse analysis into at least a main discourse constituent containing spatio-temporal information about a single event in a possible world view. The discourse constituents are then inserted into a structural representation of discourse. Non-structural techniques are used to determine relevance scores and important discourse constituents are determined. Relevance scores are percolated through the structural representation of discourse ...
    13. Method and apparatus for efficient segmentation of compound words using probabilistic breakpoint traversal

      A method for segmenting a compound word in an unrestricted natural-language input is disclosed. The method comprises receiving a natural-language input consisting of a plurality of characters. Next, a set of probabilistic breakpoints based on a probabilistic breakpoint analysis is constructed in the natural-language input. A plurality of linkable components is identified by traversal of substrings of the natural-language input delimited by the set of probabilistic breakpoints. Finally, a segmented string consisting of a plurality of linkable components spanning the natural-language input is returned. The segmented string can be interpreted as a compound word.
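
      The steps in this abstract (construct breakpoints, traverse the substrings they delimit, return a spanning sequence of components) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented method: the character-bigram table `bigram_break_prob`, the `lexicon` of linkable components, and the `threshold` value are all hypothetical stand-ins for whatever probabilistic breakpoint analysis the patent actually uses.

```python
from typing import Dict, List, Optional, Set


def find_breakpoints(word: str,
                     bigram_break_prob: Dict[str, float],
                     threshold: float = 0.5) -> List[int]:
    """Return candidate split positions, always including both ends.

    A position i is a breakpoint when the (hypothetical) probability of a
    boundary inside the character pair word[i-1:i+1] reaches the threshold.
    """
    points = [0]
    for i in range(1, len(word)):
        if bigram_break_prob.get(word[i - 1:i + 1], 0.0) >= threshold:
            points.append(i)
    points.append(len(word))
    return points


def segment(word: str,
            lexicon: Set[str],
            bigram_break_prob: Dict[str, float],
            threshold: float = 0.5) -> Optional[List[str]]:
    """Traverse breakpoint-delimited substrings and return a sequence of
    lexicon components spanning the whole word, or None if none exists."""
    points = find_breakpoints(word, bigram_break_prob, threshold)
    # best[p] holds one segmentation of word[:p] into lexicon components.
    best: Dict[int, List[str]] = {0: []}
    for end in points[1:]:
        for start in points:
            if start >= end:
                break
            if start in best and word[start:end] in lexicon:
                best[end] = best[start] + [word[start:end]]
                break
    return best.get(len(word))


if __name__ == "__main__":
    probs = {"ks": 0.9, "oo": 0.1}  # hypothetical breakpoint scores
    print(segment("bookshelf", {"book", "shelf"}, probs))
```

      Restricting the traversal to breakpoint positions keeps the search far smaller than trying every character offset, which appears to be the efficiency argument the title alludes to.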
    14. Multimodal News Story Segmentation

      In this paper, we describe a multi-modal approach to segmenting news video based on the perceived shift in content. We divide up a video document into logically coherent semantic units known as stories. We investigate the effectiveness of a number of multimedia features which serve as potential indicators of a story boundary. The results show an improvement of performance over current state-of-the-art story segmenters. Content Type: Book Chapter. DOI: 10.1007/978-81-8489-203-1_7. Authors: Gert-Jan Poulisse, Katholieke Universiteit Leuven Department of Computer Science Celestijnenlaan 200A Box 2402 B-3001 Heverlee Belgium; Marie-Francine Moens, Katholieke Universiteit Leuven Department of Computer Science Celestijnenlaan 200A ...
    15. Secure Distributed Human Computation

      In Peha’s Financial Cryptography 2004 invited talk, he described the Cyphermint PayCash system (see www.cyphermint.com), which allows people without bank accounts or credit cards (a sizeable segment of the U.S. population) to automatically and instantly cash checks, pay bills, or make Internet transactions through publicly-accessible kiosks. Since PayCash offers automated financial transactions and since the system uses (unprotected) kiosks, security is critical. The kiosk must decide whether a person cashing a check is really the person to whom the check was made out, so it takes a digital picture of the person cashing the check and ...
    16. Blade clearance control

      A turbine engine has a circumferentially segmented shroud within a case structure. Each shroud segment is mounted for movement between an inboard position and an outboard position. One or more springs bias the shroud segments toward their inboard positions. One or more valves are positioned to vent one or more volumes so as to counter the spring bias to shift the shroud segments to their outboard positions.
    17. Cleaning, Segmenting, and Spell-Checking Text

      When extracting text from different sources, you commonly end up with “noise” characters and unwanted whitespace. So you need tools to help you clean up this extracted text. For many applications, you’ll also want to segment text by identifying the boundaries of sentences and to spell-check text using a single suggestion or a list of suggestions. In this chapter, you’ll learn how to remove HTML tags, extract full text from an XML file, segment text into sentences, perform stemming and spell-checking, and recognize and remove noise characters. Content Type: Book Chapter. DOI: 10.1007/978-1-4302-2352-8_2. Book: Scripting Intelligence. DOI: 10 ...
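
      Two of the tasks listed in this abstract, removing HTML tags and segmenting text into sentences, can be illustrated with a short sketch. It is not the chapter's own code, which is not reproduced in this listing; the function names `strip_html` and `split_sentences` are hypothetical, and the regular expressions are deliberately naive (they will mis-split on abbreviations such as "e.g." and do not handle malformed markup).

```python
import html
import re


def strip_html(text: str) -> str:
    """Drop anything that looks like a tag, decode entities such as &amp;,
    and collapse runs of whitespace into single spaces."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def split_sentences(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


if __name__ == "__main__":
    raw = "<p>Hello &amp; welcome.</p>  <b>Bye!</b>"
    clean = strip_html(raw)
    print(clean)
    print(split_sentences(clean))
```

      For real input, a proper HTML parser and a trained sentence segmenter would be more robust than these patterns.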
    18. Building a Morphosyntactic Lexicon and a Pre-syntactic Processing Chain for Polish

      This paper introduces a new set of tools and resources for Polish which cover all the steps required to transform a raw unrestricted text into a reasonable input for a parser. This includes (1) a large-coverage morphological lexicon, developed thanks to the IPI PAN corpus as well as a lexical acquisition technique, and (2) multiple tools for spelling correction, segmentation, tokenization and named entity recognition. This processing chain is also able to deal with the XCES format both as input and output, hence making it possible to improve XCES corpora such as the IPI PAN corpus itself. This allows us to give ...
    19. Method of vector analysis for a document

      The invention provides a document representation method and a document analysis method including extraction of important sentences from a given document and/or determination of similarity between two documents. The inventive method detects terms that occur in the input document, segments the input document into document segments, each segment being an appropriately sized chunk, and generates document segment vectors, each vector including as its elements values according to occurrence frequencies of the terms occurring in the document segments. The method further calculates eigenvalues and eigenvectors of a square sum matrix in which a rank of the respective document segment vector ...
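
      A rough sketch of the pipeline described here: term-frequency vectors per segment, then an eigen-analysis of a matrix built from them. The abstract is truncated before it defines the square sum matrix, so this sketch assumes it is the sum of the outer products of the segment vectors; that definition, the function names, and the use of power iteration to find only the dominant eigenpair are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def segment_vectors(segments: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Build one term-frequency vector per segment over a shared vocabulary."""
    terms = sorted({t for seg in segments for t in seg})
    vectors = []
    for seg in segments:
        counts = Counter(seg)
        vectors.append([float(counts[t]) for t in terms])
    return terms, vectors


def square_sum_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """ASSUMED definition: sum over segments of the outer product v v^T."""
    n = len(vectors[0])
    m = [[0.0] * n for _ in range(n)]
    for v in vectors:
        for i in range(n):
            for j in range(n):
                m[i][j] += v[i] * v[j]
    return m


def dominant_eigenpair(m: List[List[float]],
                       iterations: int = 200) -> Tuple[float, List[float]]:
    """Power iteration: approximate the largest eigenvalue and its
    unit-length eigenvector of a symmetric non-negative matrix."""
    n = len(m)
    x = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        y = [sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(c * c for c in y) ** 0.5
        if norm == 0.0:
            return 0.0, x
        x = [c / norm for c in y]
        eigenvalue = norm
    return eigenvalue, x


if __name__ == "__main__":
    terms, vecs = segment_vectors([["a", "a"], ["b"]])
    value, vector = dominant_eigenpair(square_sum_matrix(vecs))
    print(terms, value, vector)
```

      Under this assumed definition the matrix is symmetric and positive semi-definite, so its eigenvectors give orthogonal directions in term space, ordered by how much of the segments' combined weight they capture.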
    20. Towards the quantification of the semantic information encoded in written language. (arXiv:0907.1558v2 [physics.soc-ph] Cross Listed)

      Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contributions to the ...
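
      To make the idea of information carried by a word's distribution across segments concrete, here is a minimal sketch. It is not the measure used in the paper, which is not specified in this truncated abstract. It simply computes the Shannon entropy of one word's occurrence counts over fixed-size segments; the function name and the `segment_size` parameter are hypothetical.

```python
import math
from typing import List


def word_entropy_over_segments(words: List[str],
                               target: str,
                               segment_size: int) -> float:
    """Shannon entropy (in bits) of how occurrences of `target` are spread
    across consecutive segments of `segment_size` words.

    Low entropy means the word is concentrated in few segments; high
    entropy means it is spread evenly through the text.
    """
    counts = []
    for start in range(0, len(words), segment_size):
        counts.append(words[start:start + segment_size].count(target))
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


if __name__ == "__main__":
    print(word_entropy_over_segments(["a", "b", "a", "b"], "a", 2))
    print(word_entropy_over_segments(["a", "a", "b", "b"], "a", 2))
```

      Repeating such a calculation for several values of `segment_size` is one simple way to look for a scale at which word placement is most informative, in the spirit of the characteristic scale the abstract reports.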
    21. Google Translator Kit: Automated Translation Meets Crowdsourcing

      Only a handful of blogs picked up on Google's fresh Translator Toolkit, which the company launched yesterday by means of a blog post, but this new service really deserves a second look, if only because Wikimedia apparently sees the tool as something that could "change the way Wikipedia grows in other languages". You can read an extensive review of the product over at Google Blogoscoped, but ...
