|
|
Databases and Information
Retrieval
Information processing is a big
problem in modern society because several interrelated issues
arise that have to do with the need for:
- integration of data from different sources;
- navigation through information;
- learning and knowledge acquisition from
data;
- prediction;
- decision making support;
- information visualization in computational
media;
- storing, accessing and processing data
in computational platforms;
- handling non traditional information, e.g.
environmental, geographical, cartographic, etc;
- confronting the non traditional structure
-or lack of structure of new kinds of information, e.g. complex
objects or Web pages.
All these problems with different restrictions
define different types of databases. We focus our research on
the three last problems, which include multimedia retrieval, spatial
information, semistructured data, and Web agents. At the heart
of them is combinatorial pattern matching, a research area that
studies from a combinatorial point of view how to search for given
patterns in regular and discrete structures such as sequences
or graphs. This area should be distinguished from classical pattern
matching, which considers continuous elements and uses different
techniques.
Distributed Systems and Networks
We call a distributed system any multi-processor system where
each processor has a local memory, not shared with the rest of
the processors. The only way of communicating between processors
is by sending messages through a network. To give a higher-level
interface to the programmer, we need to build a distributed runtime
software. The main focus is on the CS problems, leaving out the
hardware implementation and the physical network layer. However,
research on protocols for particular high-speed networks is included.
A key issue in this area is scalability.
With the success of Internet, we must face now the possibility
of a global distributed system covering the whole world (sometimes
called mega-programming), and the algorithms used must scale.
Another key issue is parallelism. After two decades
of active research in parallel hardware and software techniques,
new approaches to parallel computing are emerging. On the one
hand, hardware is converging to a distributed memory model composed
of a set of memory-processor pairs which communicate with each
other by passing messages through a communication network. We
can see this trend at the global level in the Internet, and at
the local level in technologies of low cost such as clusters of
personal computers. On the other hand, algorithmic design must
make no assumptions about the particular features of the hardware
so that portability across different platforms can be ensured.
Moreover, it is enforced that the algorithmic design be driven
by models of computation which allow accurate performance prediction
and embrace a simple software engineering methodology. It is worthwhile
then to review new models of parallel computation in order to
determine which are most suitable for different Web computing
applications and to develop new strategies based on the specific
features of these models.
Specific Technical Goals
All the problems addressed have the unified goal of seeing
the Web as a multimedia database.
Along the exposition we include our previous work on these problems.
In all the problems outlined below we expect three main types
of results:
- new models,
- new algorithms or techniques, and
- new specific applications.
Comparing multimedia objects
Multimedia data are quite different from traditional data, in
the sense that they do not represent "discrete" information.
Rather, they represent continuous signals of the real world which
are sampled and quantized. One of the most important consequences
of this fact is that there is no point in searching multimedia
data by exact equality, as traditional data would be searched.
Rather, we need mechanisms to search it by "similarity",
that is, find objects which are similar enough to a sample object.
Combinatorial pattern matching
in images and audio.
The signal processing community has traditionally addressed the
problem of measuring the similarity between two images or audio
segments (or parts thereof) despite of slight differences due
to scale, orientation, lighting, stretching, etc. (in the first
case) or timing, volume, tone, noise, etc. (in the second case).
They have used an approach where the object is seen as a continuous
signal to be processed.
A recent alternative approach to pattern matching
in audio and images relies on combinatory rather than on signal
processing. The audio or image is seen as a one or two dimensional
text, where one or two dimensional patterns are sought. Several
results on searching images permitting rotations, scaling, pixel
differences and stretching have been obtained, in many of which
we have been involved. The same has happened in searching music
files, using techniques derived from the large body of knowledge
acquired in the field of pattern matching of biological sequences.
Although the degree of flexibility obtained is still inferior
to that of the signal processing approach, much faster search
algorithms have been obtained. These results are rather encouraging
and we plan to pursue more in this line.
Approximate text searching.
The text, on the other hand, can also be considered as a medium
that can be queried by similarity, as opposed to searching exact
strings. Approximate text searching regards the text as a stream
of symbols and seeks to retrieve occurrences of user entered patterns
even when they are not correctly written (in the pattern or in
the text). This is mainly to recover from errors due to spelling,
typing, optical character recognition, etc. We have devoted a
lot of research to this problem and plan to continue working on
faster algorithms and their adaptation to the particular problematic
of the Web search engines.
Similarity access methods
In all the cases above, the problem is not solved just by
developing fast and accurate algorithms to compare images, audio
clips, texts, etc. Given a user query, there will be millions
of elements in the multimedia database, and we cannot afford comparing
them one by one. Moreover, queries can be more complex than just
a measure of similarity, as they can involve complex relations
among several objects. Efficient access methods are necessary
that permit fast retrieval of those elements that match the query
criteria. Only with such a technology can we hope for a world-scale
Web multimedia database. We plan to contribute to this research
in several aspects.
Top of page
Answering structural queries.
We refer to a structural query as one that is expressed by a set
of spatial objects and a set of relations for each pair of these
objects. Query by sketches, by examples, or by extended SQL commands
in Geographic Information Systems are examples of structural queries.
Objects in these queries are not necessarily described by their
spatial extents in an Euclidean space but by, for example, their
distinguishing features (e.g., color, texture, shape, size) or
by their semantic classifications (e.g., building and road). Spatial
relations are usually a subset of topological, orientation, and
distance relations. Answering a structural query implies to find
instances of objects in the database that satisfy the spatial
constraints. As opposed to previous work on answering structural
queries we plan to combine semantics of objects with their spatial
characteristics and interrelations for query processing.
Search algorithms for metric
spaces.
Similarity searching is a research subject that abstracts several
of the issues we have mentioned. The problem can be stated as
follows: given a set of objects of unknown nature, a distance
function defined among them that measures how dissimilar the objects
are, and given yet another object called the query, find all the
elements of the set which are similar enough to the query. We
seek for indexing techniques to structure the database so as to
perform as few distance evaluations as possible when answering
a similarity query.
Several of the problems we have mentioned can be
converted into a metric space search problem:
- when finding images, audio or video clips "close"
to a sample query;
- in approximate text searching;
- in information retrieval we define a similarity
between documents and want to retrieve the most similar ones
to the query;
- in artificial intelligence applications, for
labeling using the closest known point;
- in pattern recognition and clustering;
- in lossy signal compression (audio, images, video)
to quickly find the most similar frame already seen; etc.
All these applications are important to search the
Web:
- permits indexing the Web to search for similar images and
audio clips;
- permits coping with the poor quality of the texts that exist
in the Web;
- permits quickly finding Web pages relevant to a query;
- permit understanding the content of images and text semantics
to enable more sophisticated searching;
- permits better compression of multimedia data, which is essential
for transmission over a slow network like Internet.
Metric space searching is quite young as an area
by itself. For this reason, it is still quite immature and open
to developments in new algorithms and applications. We have done
intensive research on this subject in the last years and plan
to continue in the framework of this project.
Top of page
Handling semistructured information
The widespread penetration of the Web has converted HTML into
a de-facto standard for exchanging documents. HTML is a simplification
of SGML, a structured text specification language formerly designed
with the aims of a universal language for exchanging and manipulating
structured text. A recent derivation of SGML, called XML, is rapidly
gaining space in the community. It is quite possible that XML
will in the future replace HTML, and the research community is
putting large efforts in standardization, definition of a suitable
query language, etc. on XML.
The structure that can be derived from the text
is in no case similar to a relational one, which can be separated
in fixed fields and records, and tabulated accordingly. Texts
have more complex and fuzzy structure, which in the case of the
Web is a graph. Designing and implementing suitable query and
manipulation languages for structured text databases, including
for the Web, is an active research activity. There are currently
several proposals for a query language on XML. We have contributed
to the area of structured text searching and to the particular
case of efficiently implementing XQL.
We plan to continue working in efficient query languages
over XML, developing prototypes to query XML data. The ability
of efficiently querying XML (and HTML as a simpler case of it)
will open the door to enhancements of current Web search engines
so as to incorporate predicates on the structure of the documents.
Top of page
Mathematical Modeling and
Simulation of the Web
The last decade has been featured by an ever-increasing demand
for applications running on the Internet that are able to efficiently
retrieve and process information scattered on huge and dynamic
repositories like the Web. However, it is well-known that predicting
detailed behavior of such applications is extremely difficult
since the Internet not only grows at an exponential rate, but
it also experiences changes in use and topology over time. How
to make sensible performance analyses of software artifacts interacting
with such a complex and large system is indeed an open question.
The whole problematic resembles scaling conditions in statistical
physics wherein interesting phenomena arise only in sufficiently
large models. A large and good enough model has a chance to exhibit
"rare'' critical fluctuations that seem to emerge regularly
in the real Internet. Clearly, analytical approaches become quickly
inadequate in such situations. Thus, simulation validated against
empirical data is potentially the only tool that can enable the
analysis of alternative designs under different scenarios.
Currently the problem of modeling and simulating
the global Internet is receiving little attention. As a result,
no work has been done in the development of realistic simulation
frameworks for predicting the performance of information retrieval
systems running on the Internet. In the immediate, we anticipate
unique opportunities for productive research on the development
of more suitable strategies for scanning the whole Web and their
associated simulation models, which allow these strategies to
be analyzed and re-designed before their actual implementation
and testing. Suitable simulation models can certainly allow one
to explore current and future trends in the ever-moving Internet,
under conditions that are impossible to reproduce at will in the
real one.
One specific problem is to understand the structure
and characteristics of the Web, including its temporal behavior
as well as usage behavior. The latter implies analysis of logs
and Web data mining. Another important problem is to traverse
the Web to gather new and updated pages. This is a hard scheduling
problem that can be modeled mathematically and simulated.
Top of page
Distributed Computing Environments
The complex distribution of computing power
in the Internet makes it impossible to use the traditional programming
paradigms to develop Web Computing applications. New approaches
are being explored by our group, using the Mobile Agent paradigm.
The main idea is to program small agents that migrate from one
machine to another, using a small fraction of processing power
at each stage, collecting information and making decisions based
on their knowledge. From time to time, they may come back to their
original creator machine, if a database is being built.
Much research concerning agents
is being done around the world. However there is still a surprisingly
low number of available platforms implementing them. We have built
a reflective platform in Java (called Reflex) that provides a
functional environment to test these ideas, with dynamic behavior.
Agents are a powerful paradigm
for Web Computing,. However, many issues are still open to provide
a reliable developing platform: they must be robust (fault-tolerant),
handle remote objects (remote method invocation, garbage collection),
migrate with their state between heterogeneous machines (thread
migration), support replicated objects (consi possible. Parallel computing can then
be an effective tool for the development of high-performance servers
which are able to process thousands of requests per minute. Web
based applications pose new challenges in this matter. For example,
little research has been done so far in the efficient parallel
processing of read-only queries on Web documents. For transactional
servers, we anticipate new research topics such as the efficient
synchronization of sequences of read/write operations coming from
a large number of concurrent clients/agents using the services
provided by the server site. Similarities with the problem of
event synchronization in parallel simulation are evident and it
is worthwhile to investigate the extent to which new techniques
developed in this field can be applied.
|