Qualitative modeling from data

Project type:
Research projects ARRS
Project duration:
01.05.2009 - 30.04.2012
Project website:
/


Qualitative models, in contrast to classification and regression models, which predict classes or numerical quantities, describe qualitative relations between the observed variables. An example of such a relation is y=Q(+x), “y increases with x”. Qualitative models usually also include the conditions under which the relations hold, e.g. y=Q(-x, +z) if x<12, and y=Q(-z) otherwise.
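As a rough illustration, the conditional model from the example above can be written down as a small piece of code. This is only a sketch: the function names and the dictionary representation of a constraint are assumptions made for the illustration and are not taken from QUIN, Pade or Orange.

```python
def qualitative_model(x):
    """Return the qualitative constraint on y that holds at the given x.

    A constraint is represented (hypothetically) as a dict mapping a
    variable name to '+' (y increases with it) or '-' (y decreases).
    """
    if x < 12:
        return {"x": "-", "z": "+"}  # y = Q(-x, +z)
    return {"z": "-"}                # y = Q(-z)


def format_constraint(constraint):
    """Render a constraint in the y=Q(...) notation used in the text."""
    terms = ", ".join(sign + var for var, sign in constraint.items())
    return "y=Q(" + terms + ")"


print(format_constraint(qualitative_model(5)))   # y=Q(-x, +z)
print(format_constraint(qualitative_model(20)))  # y=Q(-z)
```

The condition (x<12) plays the role of an inner node of a tree, and each returned constraint plays the role of a leaf.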

Although such models do not predict exact numerical values, they have several advantages over regression models. Qualitative descriptions are close to the human way of thinking (“the more it rains, the wetter I will get”, rather than “the quantity of water in my clothes equals 0.45 l/cm × rt, where r is the rain intensity in cm/h and t is the time spent in the rain”), so such models are easier to explain and can reveal more information than regression models. They are typically also more robust, since qualitative relations are simpler to model. They are thus often used as a step before regression modeling, where the relations in the qualitative model serve as constraints for the regression model (Šuc, 2004).

Despite the attractive properties of qualitative models, there are no efficient algorithms for their construction. One of the rare methods in the field, QUIN (Šuc, 2001), induces trees similar to classification trees, except that their leaves contain qualitative constraints. While using QUIN on real-world data (Žabkar 2005, 2006), we noticed a number of deficiencies. The algorithm becomes very slow as the number of dimensions grows. It cannot treat the variable representing time separately from other variables, which makes it less suitable for modeling dynamic systems. Because it is based on an impurity measure, it is limited to constructing tree models, which are inappropriate for many practical problems. Finally, its formal definition of a constraint does not correspond to the mathematical definition of a derivative.

To solve the above problems, we developed a new approach to qualitative modeling based on approximating the partial derivatives of the sampled multidimensional function. The procedure computes the derivative at each point where the function is sampled, using the points in its vicinity; the vicinity is defined either by triangulation or by the axis along which we compute the derivative. The computed derivatives can be treated as numbers or, to proceed with qualitative modeling, we can keep only their signs. The data preprocessed in this way can subsequently be modeled by any general machine learning algorithm or presented using a suitable visualization method.
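The general idea of estimating derivative signs at the sampled points can be sketched as follows. This is not the Pade implementation: as a simplifying assumption, the vicinity of a point is taken to be its k nearest neighbours (instead of a triangulation or an axis-aligned neighbourhood), and the gradient is taken from a least-squares plane fitted to those neighbours. The function name, the parameter k and the test function are hypothetical.

```python
import numpy as np


def local_gradient_signs(X, y, k=12):
    """Estimate partial derivatives and their signs at every sample point.

    Assumption: the vicinity is the k nearest neighbours (Euclidean
    distance), and the gradient is the slope of a plane fitted to the
    vicinity by least squares.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    grads = np.empty((n, d))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        idx = np.argsort(dist)[:k + 1]  # the point itself plus k neighbours
        # Columns: offsets from the point in each dimension, then an intercept.
        A = np.hstack([X[idx] - X[i], np.ones((len(idx), 1))])
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        grads[i] = coef[:d]
    return grads, np.sign(grads).astype(int)


# Synthetic data: f(x, z) = x^2 - z, so df/dx = 2x and df/dz = -1.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = X[:, 0] ** 2 - X[:, 1]
grads, signs = local_gradient_signs(X, y)

# Fraction of points where the sign of df/dz is estimated as negative.
print(np.mean(signs[:, 1] == -1))
```

The columns of signs can then be used as class labels, so that any classification algorithm can learn where in the input space the function increases or decreases with a given variable.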

We developed a prototype implementation named Pade (Žabkar 2007a). Its experimental results are excellent even on rather complicated synthetic data, such as the function sin(x)sin(y) over a few periods. Even in its prototype form, the algorithm has been successfully used in the EU projects XPERO and XMEDIA.

The goal of the project is to investigate the field theoretically and to develop procedures useful for qualitative modeling of real-world data. The expected research problems are:

Basic computation of partial derivatives as explained above; we need to investigate the influence of various parameters of the method and develop the existing prototype into a final version;
Differentiation using the chain rule, where all variables are treated as depending on the special variable representing time;
Treatment of discrete variables by incorporating them into the concept of vicinity, differentiating them, and differentiating with respect to them; the present state-of-the-art method QUIN is unable to treat discrete variables at all, while Pade can do some of the above in some of its variants;
Investigating the use of different machine learning algorithms for constructing the final qualitative model;
Qualitative-to-quantitative transformation, where we construct a numerical model from the qualitative one and the data; we intend to focus on the construction of symbolic models and have already published some results in this field (Žabkar 2007b);
Self-evaluation of the modeling; we expect to do this by using statistical measures such as linear correlation and/or by observing the density of the space coverage around the points where we compute the derivative.
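The chain-rule item in the list above can be illustrated with a small sketch. When both x and y are observed as functions of time t, the derivative of y with respect to x can be obtained as (dy/dt)/(dx/dt). The code below is only an illustration under that assumption; it uses finite differences from NumPy, and the function name, the threshold eps and the synthetic signals are hypothetical rather than part of the project's methods.

```python
import numpy as np


def qualitative_dy_dx(t, x, y, eps=1e-9):
    """Sign of dy/dx along a trajectory, via the chain rule.

    dy/dx = (dy/dt) / (dx/dt); points where dx/dt is (nearly) zero are
    left undetermined and marked with 0.
    """
    dx_dt = np.gradient(x, t)  # finite-difference estimate of dx/dt
    dy_dt = np.gradient(y, t)  # finite-difference estimate of dy/dt
    signs = np.zeros(len(t), dtype=int)
    ok = np.abs(dx_dt) > eps
    signs[ok] = np.sign(dy_dt[ok] / dx_dt[ok]).astype(int)
    return signs


# Synthetic trajectory: x grows with time and y = exp(-x), so y = Q(-x).
t = np.linspace(0.1, 2.0, 50)
x = t ** 2
y = np.exp(-x)
signs = qualitative_dy_dx(t, x, y)
print(set(signs.tolist()))  # {-1}
```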

The usefulness of the developed methods will be tested on synthetic data and, especially, on real-world data from industry, medicine and elsewhere. All methods will be implemented within the Orange system (Demšar, 2004) and will be freely available to potential users.

The topic of the project partially overlaps with the anticipated topic of the doctoral dissertation of Jure Žabkar, a researcher at the Faculty of Computer and Information Science whose co-advisor is the leader of this proposed project and who has already worked on these problems. We also expect to engage other undergraduate and postgraduate students in the project, while our partners in various EU projects will test the developed methods as part of the respective projects.

The developed methods will be quite applicable in practice as qualitative models are useful for:

Industry, where we can observe the effects of input parameters on the behavior or the quality of the product (Vladušić, 2006),
Economics and sociology (Samuelson, 1947), where we can, similarly, predict the qualitative effects of one quantity on another or, by treating time separately, predict trends and their dependence on various variables,
Natural sciences, for instance meteorology (Žabkar 2005) or biology and medicine, where, for example, qualitative relations between genes can reveal their mutual dependencies, and elsewhere.

The implemented procedures will also be used for teaching at the Faculty of Computer and Information Science.

Project funding:

Slovenian Research and Innovation Agency