Instance-based classifiers such as the kNN classifier operate on the premise that unknown instances can be classified by relating them to known instances according to some distance/similarity function. The intuition is that two instances far apart in the instance space, as defined by the appropriate distance function, are less likely to belong to the same class than two closely situated instances.
Unlike many artificial learners, instance-based learners do not abstract any information from the training data during the learning phase. Learning is merely a matter of encapsulating the training data. The process of generalization is postponed until it is absolutely unavoidable, that is, at classification time. This property has led to instance-based learners being referred to as lazy learners, whereas classifiers such as feedforward neural networks, where proper abstraction is done during the learning phase, are often called eager learners.
Classification (generalization) using an instance-based classifier can be a simple matter of locating the nearest neighbour in instance space and labelling the unknown instance with the same class label as that of the located (known) neighbour. This approach is often referred to as a nearest neighbour classifier. The downside of this simple approach is the lack of robustness that characterizes the resulting classifiers. Their high degree of local sensitivity makes nearest neighbour classifiers highly susceptible to noise in the training data.
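The nearest neighbour rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the Euclidean distance is one possible choice of distance function, and the representation of the instance base as (feature vector, class label) pairs is an assumption made here for concreteness.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors in instance space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbour_classify(training, unknown):
    # training: list of (feature_vector, class_label) pairs (the instance base).
    # The unknown instance receives the label of its single nearest neighbour.
    _, label = min(training, key=lambda pair: euclidean(pair[0], unknown))
    return label
```

Note that a single mislabelled training instance near the unknown is enough to flip the decision, which is exactly the noise sensitivity discussed above.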
More robust models can be achieved by locating k, where k > 1, neighbours and letting a majority vote decide the outcome of the class labelling. A higher value of k results in a smoother, less locally sensitive, function. The nearest neighbour classifier can be regarded as a special case of the more general k-nearest neighbours classifier, hereafter referred to as a kNN classifier. The drawback of increasing the value of k is of course that as k approaches n, where n is the size of the instance base, the performance of the classifier approaches that of the most straightforward statistical baseline: the assumption that all unknown instances belong to the class most frequently represented in the training data.
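Extending the single-neighbour rule to a majority vote over the k nearest neighbours can be sketched as follows. As before, the Euclidean distance and the (feature vector, class label) pair representation are assumptions made for this illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, unknown, k):
    # Sort the instance base by distance to the unknown and keep the k nearest.
    neighbours = sorted(training, key=lambda pair: euclidean(pair[0], unknown))[:k]
    # The class label held by the majority of the k neighbours wins.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Setting k = 1 recovers the nearest neighbour classifier, and setting k = n makes the classifier always return the majority class of the training data, matching the baseline behaviour described above.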
This problem can be avoided by limiting the influence of distant instances. One way of doing so is to assign a weight to each vote, where the weight is a function of the distance between the unknown and the known instance. By letting each weight be defined by the inverse of the squared distance between the known and unknown instances, votes cast by distant instances will have very little influence on the decision process compared to votes cast by instances in the near neighbourhood. Distance-weighted voting usually serves as a good middle ground as far as local sensitivity is concerned.
© Ola Söder, May 29, 2008