Arto Klami, Department of Information and Computer Science, Aalto University
Proactive user interfaces attempt to improve human-computer interaction by monitoring the user's actions and status and by providing suggestions or making decisions based on information inferred from such implicit measurements. For example, a mobile phone interface can automatically switch between modes depending on whether the user is currently driving a car or sitting in a meeting, inferred from the various sensors in the device.
Eye movements are one of the most promising sources of implicit information. A desktop computer user looks at the screen most of the time, and most natural actions require focusing on specific portions of the screen -- for example, one cannot read text without actually looking at it. With modern equipment it is possible to measure the coordinates the user is looking at with both high spatial and temporal resolution. Given such data, the question is how much we can infer from the eye movements beyond where the user is currently looking. Do the fine details of the eye movements also reveal what the user intends to do or how he interprets the contents of the display?
In this problem you will solve a particular inference task: determining, based solely on eye movement measurements, which of the sentences read by the user is the right answer to a given question. This represents a prototypical inference task required by real proactive systems; the system infers from the implicit actions of the user something that the user decided in his mind.
The practical problem is a relatively straightforward classification task, and you are allowed to use any suitable classifier. However, some work will need to be done to find a good representation for the sentences.
The data used for this problem comes from the Inferring Relevance from Eye Movements Challenge 2005. The data was originally released for a PASCAL Challenge and is consequently well documented on the web site. In brief, the data contains 11 subjects viewing 50 questions each (for the training and validation data). For each question the subjects read 10 sentences (possible answers to the question), and each answer carries one of three possible labels: one of the answers is the right one (C), some of them are relevant to the topic of the question but do not contain the answer (R), while the rest are completely irrelevant (I). The data consists of 27-dimensional feature vectors, one for every occurrence of the subject viewing a single word. That is, the data set is a sequence of words seen by the subject. The features are explained in a technical report.
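For concreteness, one way to hold the parsed data in memory is as one record per candidate answer, keeping the word-level feature matrix intact. The Python sketch below is only an illustration; the field names (subject, question, answer, label, words) are assumptions for this exercise, not the challenge's own naming.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Sentence:
        subject: int        # subject identifier (1-11)
        question: int       # question identifier
        answer: int         # candidate answer (sentence) identifier
        label: str          # 'C' (correct), 'R' (relevant), or 'I' (irrelevant)
        words: np.ndarray   # (n_viewings, 27) matrix, one row per occurrence
                            # of the subject viewing a word, in viewing order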
Note that you need to do all the learning on the training data. The validation data can additionally be used for preliminary evaluation of accuracy or for tuning additional parameters of the model, but you are not allowed to look at the test data labels provided on the website. The test data will only be used for the final evaluation of the model.
Your task is to learn a classifier that assigns each of the possible answers one of the three class labels (C/R/I). Note that the original data vectors are given for individual words in the answers and not for the answers themselves, and hence the problem cannot be solved by direct application of off-the-shelf classifiers. Instead, you will need to represent the sentences somehow. Note also that the same word may be viewed more than once.
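One simple way to obtain a fixed-length representation is to summarize the word-level feature vectors of each sentence with a few statistics. The sketch below assumes the Sentence records introduced above and a list called sentences parsed from the data; it is one possible choice among many, not a recommendation.

    import numpy as np

    def sentence_features(words: np.ndarray) -> np.ndarray:
        """Summarize a (n_viewings, 27) matrix of word-level features
        into one fixed-length vector for the whole sentence."""
        return np.concatenate([
            words.mean(axis=0),       # average over all word viewings
            words.max(axis=0),        # maximum over all word viewings
            words.std(axis=0),        # variability across viewings
            [float(words.shape[0])],  # total number of word viewings
        ])

    # Stack into a design matrix usable by an off-the-shelf classifier.
    X = np.vstack([sentence_features(s.words) for s in sentences])
    y = np.array([s.label for s in sentences])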
Implement your solution as if you were taking part in Competition 1 described on the web page. You are free to choose any classifier algorithm and feature representation for the sentences, but you need to clearly justify the choices. In particular, you should argue why the choices could be good for this data. In order to do that, you will most likely need to perform some exploratory data analysis first to learn what the data looks like. Also report these preliminary analyses, since they are part of the process of solving a classification problem well.
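As an example of such preliminary analysis, and assuming the sentence-level X and y arrays built above, two quick checks are the class proportions (the classes are heavily imbalanced, since only one answer per question is correct) and the per-class means of the summary features:

    import numpy as np

    # Class proportions: 'I' dominates, 'C' is the rarest.
    for c in ("C", "R", "I"):
        print(c, round(float(np.mean(y == c)), 3))

    # Per-class means of each summary feature, to spot features that
    # carry some discriminative signal on their own.
    for j in range(X.shape[1]):
        means = [float(X[y == c, j].mean()) for c in ("C", "R", "I")]
        print(j, means)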
On the website you will find the results of the original competition participants, measured using specific evaluation criteria. Compare your accuracy with those, and discuss the results. Did you beat some of them? Would your result be significantly worse or better in a practical application? What could the reasons be for the differences in performance, and how could your solution be improved?
In addition to computing the accuracy measures used in the competition, try to illustrate the usefulness of your solution by some other means. Would some other goodness measure be more informative? The dummy model that predicts everything as irrelevant has reasonable accuracy (though you should still aim to beat it by a decent margin) but would naturally be worthless in a real application.
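One such illustration is to contrast plain accuracy with macro-averaged F1 or a per-class report: the majority-class baseline then shows zero recall for the correct answers, which is exactly what makes it worthless in practice. A sketch assuming the features and labels from above, split into training and validation parts (X_train, y_train, X_val, y_val):

    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, f1_score, classification_report

    # Baseline that labels every sentence with the majority class ('I').
    dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    pred = dummy.predict(X_val)

    print("accuracy:", accuracy_score(y_val, pred))
    print("macro-F1:", f1_score(y_val, pred, average="macro"))
    # Per-class precision and recall make the baseline's weakness explicit.
    print(classification_report(y_val, pred, labels=["C", "R", "I"],
                                zero_division=0))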
You can also try solving two different binary classification tasks: irrelevant vs relevant+correct, and irrelevant+relevant vs correct. Is either of these considerably easier to solve than the other? Why could that be?
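If you try this, both tasks can reuse the same sentence representation; only the labels need remapping, for instance as follows (assuming the y array of C/R/I labels from above):

    import numpy as np

    # Task 1: irrelevant vs. relevant-or-correct
    y_task1 = np.where(y == "I", "I", "R+C")

    # Task 2: irrelevant-or-relevant vs. correct
    y_task2 = np.where(y == "C", "C", "I+R")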