    1. Image-based document indexing and retrieval

      A system that facilitates document retrieval and/or indexing is provided. A component receives an image of a document, and a search component searches data store(s) for a match to the document image. The match is performed over word-level topological properties of images of documents stored in the data store(s).
    2. Word processing with artificial language validation

      BACKGROUND: The present invention relates to data processing by digital computer, and more particularly to word processing. Word processing systems (also referred to as word processors) allow users to create documents, primarily textual documents that might otherwise be prepared on a typewriter. Users can also edit, print or save the documents using the word processor. Such documents will be referred to as word processing documents. Modern word processors offer a greater range of functions than the f
    3. Chinese character-based parser

      BACKGROUND OF THE INVENTION. 1. Technical Field: The present invention relates to data processing and, in particular, to parsing Chinese character streams. Still more particularly, the present invention provides word segmentation, part-of-speech tagging and parsing for Chinese characters. 2. Description of Related Art: There are many natural language processing (NLP) applications, such as machine translation (MT) and question answering systems, that use structural information of a sentence. As word segm
    4. Effects of Repair Support Agent for Accurate Multilingual Communication

      Translation repair plays an important role in intercultural communication using machine translation. It can be used to create messages that have very few translation mistakes. However, translation repair is a laborious task. It is important to carry out translation repair efficiently. Therefore, we propose a repair support agent that provides the segments that have not been translated accurately. We perform experiments on the translation repair efficiency to evaluate the effectiveness of the repair support agent. The results of these experiments are as follows. (1) Providing inaccurately translated segments improves the ability to detect inaccurate segments. (2) The inaccurate-judgment rate can ...
    5. Method and apparatus for window matching in delta compressors

      The present invention relates generally to data compression and, more particularly, to a method for efficient window partition matching in delta compressors to enhance compression performance based on the idea of modeling a dataset with the frequencies of its n-grams. BACKGROUND OF THE INVENTION: Compression programs routinely limit the data to be compressed together in segments called windows. The process of doing this is called windowing. Delta compression techniques were developed to compress a t
      Mentions: lamda Bell Labs
    6. Development and evaluation of a clinical note section header terminology.

      AMIA Annu Symp Proc. 2008;:156-60. Authors: Denny JC, Miller RA, Johnson KB, Spickard A. Clinical documentation is often expressed in natural language text, yet providers often use common organizations that segment these notes into sections, such as history of present illness or physical examination. We developed a hierarchical section header terminology, supporting mappings to LOINC and other vocabularies; it contained 1109 concepts and 4332 synonyms. Physicians evaluated it compared to LOINC and the Evaluation and Management billing schema using a randomly selected corpus of history and physical ...
    7. Systems and methods for interactive topic-based text summarization

      INCORPORATION BY REFERENCE: This Application incorporates by reference: entitled "SYSTEMS AND METHODS FOR DETERMINING THE TOPIC STRUCTURE OF A PORTION OF TEXT" by I. Tsochantaridis et al., filed Mar. 22, 2002 as U.S. patent application Ser. No. 10/103,053; entitled "SYSTEMS AND METHODS FOR DISPLAYING INTERACTIVE TOPIC BASED TEXT SUMMARIES" by F. Chen et al., filed Dec. 16, 2002, as U.S. patent application Ser. No. 10/319,545; entitled "SYSTEMS AND METHODS FOR SENTENCE BASED INTERACTIVE TOPIC BASED T
    8. CoZo+ - A Content Zoning Engine for textual documents. (arXiv:0811.0453v1 [cs.CL])

      Content zoning can be understood as a segmentation of textual documents into zones. This is inspired by [6] who initially proposed an approach for the argumentative zoning of textual documents. With the prototypical CoZo+ engine, we focus on content zoning towards an automatic processing of textual streams while considering only the actors as the zones. We gain information that can be used to realize an automatic recognition of content for pre-defined actors. We understand CoZo+ as a necessary pre-step towards an automatic generation of summaries and to make intellectual ownership of documents detectable.
    9. Compound word breaker and spell checker

      CROSS-REFERENCE TO RELATED APPLICATIONS: Reference is hereby made to the following co-pending and commonly assigned patent applications: U.S. application Ser. No. 10/804,883, filed Mar. 19, 2004, entitled "SYSTEM AND METHOD FOR PERFORMING ANALYSIS ON WORD VARIANTS" and U.S. application Ser. No. 10/804,998, filed Mar. 19, 2004, entitled "FULL-FORM LEXICON WITH TAGGED DATA AND METHODS OF CONSTRUCTING AND USING THE SAME", both of which are incorporated by reference in their entirety. BACKGROUND OF THE
    10. Analyse spectrale des textes: détection automatique des frontières de langue et de discours. (arXiv:0810.1212v1 [cs.CL])

      We propose a theoretical framework within which information on the vocabulary of a given corpus can be inferred on the basis of statistical information gathered on that corpus. Inferences can be made on the categories of the words in the vocabulary, and on their syntactical properties within particular languages. Based on the same statistical data, it is possible to build matrices of syntagmatic similarity (bigram transition matrices) or paradigmatic similarity (probability for any pair of words to share common contexts). When clustered with respect to their syntagmatic similarity, words tend to group into sublanguage vocabularies, and when clustered with respect ...
      Mentions: Markov
    11. Aligning lay and specialized passages in comparable medical corpora.

      Stud Health Technol Inform. 2008;136:89-94. Authors: Deleger L, Zweigenbaum P. While the public increasingly has access to medical information, specialized medical language is often difficult for non-experts to understand, and there is a need to bridge the gap between specialized language and lay language. As a first step towards this end, we describe here a method to build a comparable corpus of expert and non-expert medical French documents and to identify similar text segments of lay and specialized language. Among the top 400 pairs of text segments ...
    12. Multimodal Processing

      With a multimedia document, its semantics are embedded in multiple forms that are usually complementary to each other. For example, a live report on TV about a tsunami conveys information that is far beyond what we read from the newspaper. Therefore, it is necessary to analyze all types of data: image frames, sound tracks, text that can be extracted from image frames, and spoken words that can be deciphered from the audio track [Wang00]. For some applications, automated techniques that process single media, for example, audio or images, may be error-prone, and multimodal processing is used to improve the overall system ...
    13. Text Processing

      Text provides crucial cues for understanding content. For example, the closed captions in broadcast television programs and subtitles in DVD movies facilitate video consumption for viewers. When a transcript is not available for certain content, automatic speech recognition can be used to extract linguistic information. Text information is much more concise than corresponding audio or video. The reason is that we need language knowledge to understand text, and the knowledge itself does not need to be embedded in the text data. For example, we only need five characters to express a “plane,” but to show a video clip of plane ...
    14. Automatic extraction of translations from web-based bilingual materials

      This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer system that maps and compares web-based translation texts of Statistics Canada (StatCan) news releases in the StatCan publication The Daily. The goal is to extract translations for translation memory systems, for translation terminology building, for cross-language information retrieval and for corpus-based machine translation systems. Three years of officially published statistical news release texts at http://www.statcan.ca were collected to compose the StatCan Daily data bank. The English and French texts in this collection were roughly aligned using the Gale-Church statistical algorithm. After ...
    15. Word segmentation based on database semantics in NChiql

      In this paper a novel word-segmentation algorithm is presented to delimit words in Chinese natural language queries in the NChiql system, a Chinese natural language query interface to databases. Although there is a sizable literature on Chinese segmentation, existing methods cannot satisfy the particular requirements of this system. The novel word-segmentation algorithm is based on the database semantics, namely the Semantic Conceptual Model (SCM) for specific domain knowledge. Based on SCM, the segmenter labels words directly with their database semantics, which eases the disambiguation and translation (from natural language to database query) in NChiql. Content Type: Journal Article. DOI: 10.1007/BF02948870. Authors: Xiaofeng Meng, Renmin ...
    16. Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

      Overlapping ambiguity is a major ambiguity type in Chinese word segmentation. In this paper, the statistical properties of overlapping ambiguities are intensively studied based on observations from a very large, balanced, general-purpose Chinese corpus. The relevant statistics are given from different perspectives. The stability of high-frequency maximal overlapping ambiguities is tested based on statistical observations from both a general-purpose corpus and domain-specific corpora. A disambiguation strategy for overlapping ambiguities, with a predefined solution for each of the 5,507 pseudo overlapping ambiguities, is consequently proposed, suggesting that over 42% of overlapping ambiguities in Chinese running text could be solved ...
    17. A Comparison of Language Models for Dialog Act Segmentation of Meeting Transcripts

      This paper compares language modeling techniques for dialog act segmentation of multiparty meetings. The evaluation is twofold; we search for a convenient representation of textual information and an efficient modeling approach. The textual features capture word identities, parts-of-speech, and automatically induced classes. The models under examination include hidden event language models, maximum entropy, and BoosTexter. All presented methods are tested using both human-generated reference transcripts and automatic transcripts obtained from a state-of-the-art speech recognizer. Content Type: Book Chapter. DOI: 10.1007/978-3-540-87391-4_17. Authors: Jáchym Kolář, University of West Bohemia Department of Cybernetics at Faculty of Applied Sciences Univerzitní 8 CZ-306 14 Plzeň ...
      Mentions: Czech Republic
    18. Integration of Named Entity Information for Chinese Word Segmentation Based on Maximum Entropy

      Word segmentation is an essential process in Chinese information processing. Although related research has been reported and has made progress, the Unknown Named Entity (UNE) problem in segmentation is not fully solved. This usually degrades the accuracy of segmentation in general. In this paper, a model to identify UNEs for improving the overall performance of the segmentation is presented. In order to capture the NE information, functions of characters or words are defined with tags. In addition, useful surrounding contexts are collected from a corpus and used as features. The model is constructed based on Maximum Entropy to handle the UNE identification ...
      Mentions: China Macau
    19. Full-form lexicon with tagged data and methods of constructing and using the same

      CROSS-REFERENCE TO RELATED APPLICATIONS: Reference is hereby made to the following co-pending and commonly assigned patent applications: U.S. application Ser. No. 10/804,930, filed Mar. 19, 2004, entitled "Compound Word Breaker and Spell Checker" and U.S. application Ser. No. 10/804,883, filed Mar. 19, 2004, entitled "System and Method for Performing Analysis on Word Variants", both of which are incorporated by reference in their entirety. BACKGROUND OF THE INVENTION: The present invention relates to
