Center for Web Research D.C.S.
		University of Chile

Research Areas

Databases and Information Retrieval
Information processing poses major challenges in modern society, arising from several interrelated needs:

  1. integration of data from different sources;
  2. navigation through information;
  3. learning and knowledge acquisition from data;
  4. prediction;
  5. decision making support;
  6. information visualization in computational media;
  7. storing, accessing and processing data in computational platforms;
  8. handling non-traditional information, e.g. environmental, geographical, or cartographic data;
  9. confronting the non-traditional structure, or lack of structure, of new kinds of information, e.g. complex objects or Web pages.

All these problems with different restrictions define different types of databases. We focus our research on the last three problems, which include multimedia retrieval, spatial information, semistructured data, and Web agents. At the heart of these lies combinatorial pattern matching, a research area that studies, from a combinatorial point of view, how to search for given patterns in regular and discrete structures such as sequences or graphs. This area should be distinguished from classical pattern matching, which considers continuous elements and uses different techniques.
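To make the combinatorial view concrete, the sketch below implements the classic Knuth-Morris-Pratt algorithm, which finds all occurrences of a pattern in a sequence in linear time (the example strings are arbitrary):

```python
def kmp_search(text, pattern):
    """Return all starting positions of pattern in text (Knuth-Morris-Pratt)."""
    if not pattern:
        return []
    # Failure function: length of the longest proper border of pattern[:i+1].
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, never moving backwards in it.
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(kmp_search("GATTACAGATTACA", "TTA"))  # [2, 9]
```

The same skeleton generalizes to the flexible matching problems discussed below by replacing exact character comparison with more permissive criteria.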


Distributed Systems and Networks
We call a distributed system any multi-processor system where each processor has a local memory, not shared with the rest of the processors. The only way for processors to communicate is by sending messages through a network. To give the programmer a higher-level interface, we need to build distributed runtime software. The main focus is on the computer-science problems, leaving out the hardware implementation and the physical network layer. However, research on protocols for particular high-speed networks is included.

A key issue in this area is scalability. With the success of the Internet, we must now face the possibility of a global distributed system covering the whole world (sometimes called mega-programming), and the algorithms used must scale.

Another key issue is parallelism. After two decades of active research in parallel hardware and software techniques, new approaches to parallel computing are emerging. On the one hand, hardware is converging to a distributed memory model composed of a set of memory-processor pairs which communicate with each other by passing messages through a communication network. We can see this trend at the global level in the Internet, and at the local level in low-cost technologies such as clusters of personal computers. On the other hand, algorithmic design must make no assumptions about the particular features of the hardware, so that portability across different platforms can be ensured. Moreover, algorithmic design should be driven by models of computation that allow accurate performance prediction and support a simple software engineering methodology. It is worthwhile, then, to review new models of parallel computation in order to determine which are most suitable for different Web computing applications, and to develop new strategies based on the specific features of these models.
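One well-known model of this kind (named here as our own example, not in the text above) is the bulk-synchronous parallel (BSP) model, which predicts running time from a few machine parameters. A minimal sketch, with hypothetical parameter values:

```python
def bsp_cost(supersteps, g, l):
    """Predicted running time of a BSP program.

    Each superstep is a pair (w, h): w = maximum local work on any
    processor, h = maximum words sent or received by any processor.
    g = network cost per word, l = barrier synchronization cost.
    """
    return sum(w + h * g + l for w, h in supersteps)

# Two supersteps on a hypothetical machine with g=4 and l=100:
# (1000 + 50*4 + 100) + (400 + 10*4 + 100) = 1300 + 540
print(bsp_cost([(1000, 50), (400, 10)], g=4, l=100))  # 1840
```

The point of such a formula is exactly the performance prediction discussed above: an algorithm's cost can be estimated on any machine once g and l are measured for it.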


Specific Technical Goals
All the problems addressed share the unified goal of seeing the Web as a multimedia database.
Throughout the exposition we refer to our previous work on these problems. In all the problems outlined below we expect three main types of results:

  1. new models,
  2. new algorithms or techniques, and
  3. new specific applications.

Comparing multimedia objects
Multimedia data are quite different from traditional data, in the sense that they do not represent "discrete" information. Rather, they represent continuous signals of the real world which are sampled and quantized. One of the most important consequences of this fact is that there is no point in searching multimedia data by exact equality, as traditional data would be searched. Rather, we need mechanisms to search them by "similarity", that is, to find objects which are similar enough to a sample object.


Combinatorial pattern matching in images and audio.
The signal processing community has traditionally addressed the problem of measuring the similarity between two images or audio segments (or parts thereof) despite slight differences due to scale, orientation, lighting, stretching, etc. (in the first case) or timing, volume, tone, noise, etc. (in the second case). They have used an approach where the object is seen as a continuous signal to be processed.

A recent alternative approach to pattern matching in audio and images relies on combinatorics rather than on signal processing. The audio or image is seen as a one- or two-dimensional text, where one- or two-dimensional patterns are sought. Several results on searching images permitting rotations, scaling, pixel differences and stretching have been obtained, in many of which we have been involved. The same has happened in searching music files, using techniques derived from the large body of knowledge acquired in the field of pattern matching of biological sequences. Although the degree of flexibility obtained is still inferior to that of the signal processing approach, much faster search algorithms have been obtained. These results are rather encouraging and we plan to pursue this line further.


Approximate text searching.
The text, on the other hand, can also be considered as a medium that can be queried by similarity, as opposed to searching for exact strings. Approximate text searching regards the text as a stream of symbols and seeks to retrieve occurrences of user-entered patterns even when they are not correctly written (in the pattern or in the text). This is mainly to recover from errors due to spelling, typing, optical character recognition, etc. We have devoted a lot of research to this problem and plan to continue working on faster algorithms and on their adaptation to the particular requirements of Web search engines.
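A classic baseline for this problem is Sellers' dynamic-programming algorithm, sketched below; the strings and error threshold are illustrative only:

```python
def approx_search(text, pattern, k):
    """Report end positions in text where pattern matches with at most k
    edit errors (insertions, deletions, substitutions) -- Sellers' algorithm."""
    m = len(pattern)
    col = list(range(m + 1))  # edit distances against an empty text prefix
    hits = []
    for j, c in enumerate(text):
        prev_diag, col[0] = col[0], 0  # a match may start anywhere in the text
        for i in range(1, m + 1):
            sub = prev_diag + (pattern[i - 1] != c)
            prev_diag = col[i]
            col[i] = min(sub, col[i] + 1, col[i - 1] + 1)
        if col[m] <= k:
            hits.append(j)
    return hits

# "survey" occurs in "surgery" with 2 errors; matches end at positions 4, 5, 6.
print(approx_search("surgery", "survey", 2))  # [4, 5, 6]
```

The O(mn) cost of this baseline is what the faster algorithms mentioned above improve upon.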


Similarity access methods
In all the cases above, the problem is not solved just by developing fast and accurate algorithms to compare images, audio clips, texts, etc. Given a user query, there will be millions of elements in the multimedia database, and we cannot afford to compare them one by one. Moreover, queries can be more complex than just a measure of similarity, as they can involve complex relations among several objects. We need efficient access methods that permit fast retrieval of those elements that match the query criteria. Only with such technology can we hope for a world-scale Web multimedia database. We plan to contribute to this research in several respects.



Answering structural queries.
We refer to a structural query as one that is expressed by a set of spatial objects and a set of relations for each pair of these objects. Query by sketches, by examples, or by extended SQL commands in Geographic Information Systems are examples of structural queries. Objects in these queries are not necessarily described by their spatial extents in a Euclidean space but by, for example, their distinguishing features (e.g., color, texture, shape, size) or by their semantic classifications (e.g., building and road). Spatial relations are usually a subset of topological, orientation, and distance relations. Answering a structural query means finding instances of objects in the database that satisfy the spatial constraints. As opposed to previous work on answering structural queries, we plan to combine the semantics of objects with their spatial characteristics and interrelations for query processing.
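As a toy illustration of evaluating such a query (all objects, semantic classes, and thresholds below are hypothetical):

```python
import math

# Toy spatial database: (semantic class, x, y).
objects = [
    ("building", 0.0, 0.0),
    ("road", 3.0, 4.0),
    ("building", 10.0, 0.0),
]

def north_of(a, b):
    """Orientation relation: a lies north of b."""
    return a[2] > b[2]

def within(a, b, d):
    """Distance relation: a is within Euclidean distance d of b."""
    return math.hypot(a[1] - b[1], a[2] - b[2]) <= d

# Structural query: find (road, building) pairs where the road is
# north of the building and within distance 6 of it.
answers = [
    (road, building)
    for road in objects if road[0] == "road"
    for building in objects if building[0] == "building"
    if north_of(road, building) and within(road, building, 6.0)
]
print(answers)  # only the building at the origin qualifies
```

Note how the semantic class filters ("road", "building") and the spatial relations are combined in a single query, which is the kind of integration proposed above; real systems would of course use spatial indexes rather than this brute-force scan.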


Search algorithms for metric spaces.
Similarity searching is a research subject that abstracts several of the issues we have mentioned. The problem can be stated as follows: given a set of objects of unknown nature, a distance function defined among them that measures how dissimilar the objects are, and another object called the query, find all the elements of the set which are similar enough to the query. We seek indexing techniques to structure the database so as to perform as few distance evaluations as possible when answering a similarity query.
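One classical index of this kind, for discrete metrics such as edit distance, is the Burkhard-Keller (BK) tree; the sketch below is a minimal illustration with an arbitrary word set:

```python
def edit_distance(a, b):
    """Standard dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree: each child hangs from its parent under the edge
    labeled with their distance, so range queries can prune whole subtrees."""
    def __init__(self, dist):
        self.dist, self.root = dist, None  # node: [element, {distance: child}]

    def add(self, item):
        if self.root is None:
            self.root = [item, {}]
            return
        node = self.root
        while True:
            d = self.dist(item, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = [item, {}]
                return

    def search(self, query, radius):
        """All indexed items within `radius` of `query`, plus the number of
        distance evaluations performed."""
        results, calls = [], 0
        stack = [self.root] if self.root else []
        while stack:
            item, children = stack.pop()
            d = self.dist(query, item)
            calls += 1
            if d <= radius:
                results.append(item)
            # Triangle inequality: only edges labeled in [d - radius, d + radius]
            # can lead to answers.
            stack.extend(c for e, c in children.items() if d - radius <= e <= d + radius)
        return results, calls

tree = BKTree(edit_distance)
for w in ["survey", "surgery", "service", "surface", "margin"]:
    tree.add(w)
found, calls = tree.search("surveys", 2)
print(sorted(found), calls)
```

The number of distance evaluations reported is the cost measure that metric-space indexing tries to minimize; here the triangle inequality lets the tree skip most of the indexed words.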

Several of the problems we have mentioned can be converted into a metric space search problem:

  1. when finding images, audio or video clips "close" to a sample query;
  2. in approximate text searching;
  3. in information retrieval we define a similarity between documents and want to retrieve the most similar ones to the query;
  4. in artificial intelligence applications, for labeling using the closest known point;
  5. in pattern recognition and clustering;
  6. in lossy signal compression (audio, images, video) to quickly find the most similar frame already seen; etc.

All these applications are important for searching the Web:

  1. it permits indexing the Web to search for similar images and audio clips;
  2. it permits coping with the poor quality of the texts that exist on the Web;
  3. it permits quickly finding Web pages relevant to a query;
  4. it permits understanding the content of images and text semantics to enable more sophisticated searching;
  5. it permits better compression of multimedia data, which is essential for transmission over a slow network like the Internet.

Metric space searching is quite young as an area in itself. For this reason, it is still quite immature and open to developments in new algorithms and applications. We have done intensive research on this subject in recent years and plan to continue within the framework of this project.



Handling semistructured information
The widespread penetration of the Web has made HTML a de-facto standard for exchanging documents. HTML is a simplification of SGML, a structured text specification language originally designed as a universal language for exchanging and manipulating structured text. A recent derivative of SGML, called XML, is rapidly gaining ground in the community. It is quite possible that XML will replace HTML in the future, and the research community is putting great effort into standardization, the definition of a suitable query language, and other aspects of XML.

The structure that can be derived from the text is in no case similar to a relational one, which can be separated into fixed fields and records and tabulated accordingly. Texts have a more complex and fuzzy structure, which in the case of the Web is a graph. Designing and implementing suitable query and manipulation languages for structured text databases, including the Web, is an active research area. There are currently several proposals for a query language on XML. We have contributed to the area of structured text searching and to the particular case of efficiently implementing XQL.

We plan to continue working on efficient query languages for XML, developing prototypes to query XML data. The ability to query XML efficiently (and HTML as a simpler case of it) will open the door to enhancing current Web search engines to incorporate predicates on the structure of documents.
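As a small illustration of such structural predicates (using Python's standard library rather than XQL, and an invented document):

```python
import xml.etree.ElementTree as ET

# A toy XML document; the element names and contents are illustrative only.
doc = ET.fromstring("""
<library>
  <book year="1999"><title>Modern Information Retrieval</title></book>
  <book year="2002"><title>Searching the Web</title></book>
</library>
""")

# Structural predicate: titles of books published after 2000.
# The query mixes structure (book/title paths, the year attribute)
# with content, which plain keyword search engines cannot express.
titles = [b.find("title").text
          for b in doc.findall("book")
          if int(b.get("year")) > 2000]
print(titles)  # ['Searching the Web']
```

A search engine able to evaluate queries of this shape over the whole Web is exactly the enhancement envisioned above.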



Mathematical Modeling and Simulation of the Web
The last decade has been marked by an ever-increasing demand for applications running on the Internet that are able to efficiently retrieve and process information scattered over huge and dynamic repositories like the Web. However, it is well known that predicting the detailed behavior of such applications is extremely difficult, since the Internet not only grows at an exponential rate, but also experiences changes in use and topology over time. How to make sensible performance analyses of software artifacts interacting with such a complex and large system is indeed an open question. The whole problem resembles scaling conditions in statistical physics, wherein interesting phenomena arise only in sufficiently large models. A large and good enough model has a chance of exhibiting the "rare" critical fluctuations that seem to emerge regularly in the real Internet. Clearly, analytical approaches quickly become inadequate in such situations. Thus, simulation validated against empirical data is potentially the only tool that can enable the analysis of alternative designs under different scenarios.

Currently, the problem of modeling and simulating the global Internet is receiving little attention. As a result, no work has been done on the development of realistic simulation frameworks for predicting the performance of information retrieval systems running on the Internet. In the immediate future, we anticipate unique opportunities for productive research on the development of better strategies for scanning the whole Web, together with their associated simulation models, which allow these strategies to be analyzed and redesigned before their actual implementation and testing. Suitable simulation models can certainly allow one to explore current and future trends in the ever-changing Internet, under conditions that are impossible to reproduce at will in the real one.

One specific problem is to understand the structure and characteristics of the Web, including its temporal behavior as well as usage behavior. The latter implies analysis of logs and Web data mining. Another important problem is to traverse the Web to gather new and updated pages. This is a hard scheduling problem that can be modeled mathematically and simulated.
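A toy sketch of the scheduling side of this problem: pages are visited in order of a priority score. The particular score (importance times staleness) and all page data below are illustrative assumptions, not the strategy actually proposed:

```python
import heapq

def crawl_order(pages, budget):
    """Visit up to `budget` pages in decreasing order of a priority score.

    `pages` maps URL -> (importance, days_since_last_visit); the score
    importance * staleness is only one of many possible choices.
    """
    heap = [(-imp * age, url) for url, (imp, age) in pages.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]

pages = {
    "a.cl": (0.9, 10),  # important and stale: crawl first
    "b.cl": (0.9, 1),   # important but fresh
    "c.cl": (0.2, 30),  # unimportant but very stale
}
print(crawl_order(pages, 2))  # ['a.cl', 'c.cl']
```

The mathematical modeling mentioned above amounts to choosing and justifying the score function (and predicting page change rates), which a simulation framework can then evaluate against empirical crawl data.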



Distributed Computing Environments
The complex distribution of computing power over the Internet makes it impossible to use traditional programming paradigms to develop Web Computing applications. Our group is exploring new approaches based on the Mobile Agent paradigm. The main idea is to program small agents that migrate from one machine to another, using a small fraction of processing power at each stage, collecting information and making decisions based on their knowledge. From time to time they may return to the machine that created them, for instance when a database is being built.

Much research concerning agents is being done around the world. However, there is still a surprisingly small number of available platforms implementing them. We have built a reflective platform in Java (called Reflex) that provides a functional environment, with dynamic behavior, in which to test these ideas.

Agents are a powerful paradigm for Web Computing. However, many issues remain open before a reliable development platform is available: agents must be robust (fault-tolerant), handle remote objects (remote method invocation, garbage collection), migrate with their state between heterogeneous machines (thread migration), and support replicated objects (consistency).

Parallel computing can be an effective tool for the development of high-performance servers able to process thousands of requests per minute. Web-based applications pose new challenges in this matter. For example, little research has been done so far on the efficient parallel processing of read-only queries on Web documents. For transactional servers, we anticipate new research topics such as the efficient synchronization of sequences of read/write operations coming from a large number of concurrent clients/agents using the services provided by a server site. Similarities with the problem of event synchronization in parallel simulation are evident, and it is worthwhile to investigate the extent to which techniques developed in that field can be applied.

 

Department of Computer Sciences
University of Chile
Blanco Encalada #2120
Santiago, Chile

Questions/Comments: cwr@dcc.uchile.cl

The Center for Web Research (CWR) is possible thanks to the Millenium Science Initiative Program
Millenium Science Initiative, Ministry of Planning and Cooperation - Government of Chile



