7 Aug 2015

Exploiting Semantic Analysis of Documents for the Domain User

Auditorium, DCC, UChile

Many document organization tasks would be performed more efficiently with document clustering: a student writing the related work chapter of a thesis, a professor surveying the state of the art for a proposal or planning a reading course, or a conference chair organizing sessions. Fully unsupervised document clustering, however, does not always yield clusters that are relevant to the user’s point of view. In this work, we pursue document clustering algorithms that allow the interactive engagement of the user in the clustering process. The main challenge is how to obtain useful clusters with minimum user effort. To address this challenge, we propose (1) a user-supervised double clustering algorithm, designed in three stages, and (2) a novel approach for mapping documents to entities and concepts.

The user-supervised double clustering algorithm was demonstrated to be competitive with state-of-the-art clustering algorithms. It was further extended into an ensemble algorithm to incorporate Wikipedia concepts in the document representation. User supervision was introduced into these algorithms in the form of term supervision (term labelling) and document supervision. A visual interface was designed to make the algorithms accessible to real domain users. The work received the Best Student Paper award at ACM DocEng 2014.

To address the problem of coming up with succinct and intuitive representations of documents in terms of entities and concepts, we have pursued two directions of research. (1) We designed a system that accomplishes entity recognition and disambiguation using the Wikipedia category structure in multiple languages. This system won first prize in the ERD challenge at ACM SIGIR 2014, and we are currently extending it to concept recognition and disambiguation. (2) We proposed a simple but very effective approach for computing semantic relatedness between words and documents based on the Google n-gram corpus, which is competitive with human performance on standard word pair data sets.

The clustering work is joint with H. Nourashraf and D. Arnold, the ERD work with Marek Lipczak and Arash Koushkestani, and the Google n-gram based semantic relatedness work with Aminul Islam and Vlado Keselj.

Previous events

Data Series Management: Fulfilling the Need for Big Sequence Analytics

"A gentle introduction to Gaussian processes with applications"

An algorithm for binary chance-constrained problems using IIS

Mobile Semantic Search

A proof of the AGM bound

A Language-Theoretical Approach to Descriptive Complexity

Scientific productivity in Chile: Are universities more productive or have they just learned how to game the system?

Verification of Infinite-State Probabilistic Systems

JSON: data model, query languages and schema specification

Understanding and preventing natural disasters using data science

Bias in the Web

Telefónica's R&D Center in Chile

The complexity of reverse engineering problems for conjunctive queries

Efficient Computation of Certain Answers: Breaking the CQ Barrier

Efficient Query Processing Under Updates

Scientometrics: Indices and Maps

The Multiset semantics of SPARQL

Statistical Query Algorithms for Stochastic Convex Optimization

The interplay between word equations with regular constraints and graph querying with path comparisons

Querying Wikidata: Comparing SPARQL, Relational and Graph Databases

Word transducers: from two-way to one-way

Tree Automata for Reasoning in Databases and Artificial Intelligence

Big Data as a Unifier of Information in the State

A framework for annotating CSV-like data

Big Data? Big Promise, Big Problems

Copyless cost register automata

Acquiring and Exploiting Lexical Knowledge for Twitter Sentiment Analysis

Extending Weakly-Sticky Datalog+-: Query-Answering Tractability and Optimizations

Reconsidering REST: Do we need a new ontology for the Web?

Rebooting Peer Review Systems via Boot Strapping Databases?

DASL: A Scala-based DSL for Graph Analytics on GPUs

Social media, information and behavior

Linking Open-world Knowledge Bases using Nonmonotonic Rules

Designing a Flash Translation Layer

Distributed Machine Learning in Yahoo Sponsored Search

Data science for astronomy

Learning Opinion Dynamics in Social Networks

Climate Research Infrastructure in Chile: Databases, modelling and the gap to Decision Making

Schema Mappings and Data Examples: An Interplay between Syntax and Semantics

Forbidden vertices

A measure of centrality based on Random Walks

Planning with Linear Temporal Logic Goals: Two Translation Approaches

Dynamic Graph Queries

Fully Homomorphic Encryption: Blindfold Computations in the Cloud

Visual-Semantic Graphs: Using Queries to Reduce the Semantic Gap in Web Image Retrieval

Understanding Real-World Events through Social Media

Querying graph databases (a theoretician's perspective)

CONSTRUCT queries in SPARQL

Characterizing tractable restrictions of query languages

Epistemonikos, the World's Largest Database of Systematic Reviews

Quantifying collective states from online social networking data

The Hearts and Minds of Data Science

Provenance Circuits for Trees and Treelike Instances

Pebble -- a data structure implementation that makes your big data small

Traversal-Based Linked Data Queries

Pushing the Boundaries of Tractable Ontology Reasoning

Decidability of Conjunctive Queries for Description Logics with Counting, Inverses and Nominals

Dynamics in Linked Data Environments

Monadic datalog on trees

CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language

Certain Answers to Well-Designed SPARQL Queries

Lexicographic orderings

SAMOA: A Platform for Mining Big Data Streams

Expressive queries over light-weight ontologies

Techniques to solve computationally hard problems in automata theory

Practical aspects of the research carried out at DAMA-UPC

Stochastic rule-based modelling of biological systems using a parallel implementation of spatial Kappa

Compact Representations of Web Graphs, Social Networks, and RDF

R2RML: Standard RDB2RDF Mapping Language

Contexts and Multidimensional Data Quality Assessment and Cleaning

Semantic Annotators - how good are they, and how can we make them better?

Generating knowledge from online social networks

Querying Graph Databases

Benchmarking RDF and Graph Databases

The Anatomy of a Large-Scale Semantic Web Search Engine

Seminars

Data Series Management: Fulfilling the Need for Big Sequence Analytics

6 September 2018

"A gentle introduction to Gaussian processes with applications"