Multiple Instance Learning


What is multiple instance learning?
Multiple instance learning, unlike ordinary supervised learning, provides you with bags of instances, labelled either positive or negative! A bag is positive if at least one instance in it is positive; a bag is negative if all instances in it are negative.

Real life example: Suppose scientists are developing a drug to cure cancer. They try various sachets of medicines on patients and notice that certain sachets seem to cure the disease, but we do not know which medicine inside each sachet is responsible. Now comes the use of multiple instance learning algorithms! All sachets which cured cancer are positive bags, since at least one medicine in each of these bags must cure cancer. All other sachets are negative bags. Our goal: given a single medicine, estimate the probability that it cures cancer.
Today we will talk about a popular multiple instance learning algorithm called Diverse Density (DD). The original paper is here. We will create a synthetic data set as shown below, vary some parameters, and compare the results.
If you want to see some MIL code in action, check out this cool application in the RL section - here
Dataset preparation for the plots below
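Here is a minimal NumPy sketch of one way to prepare such a dataset, mirroring the paper's synthetic setup: instances drawn uniformly from a square domain (the [0, 100]^2 range is our assumption) with a small true-concept square in the middle whose side length we vary. The names (`in_concept`, `make_bag`, `SIDE`) are our own:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# True concept: a small square centred at (50, 50); SIDE is the
# parameter we vary below (5 in the paper, 10 in our runs).
SIDE = 10.0
LO, HI = 50 - SIDE / 2, 50 + SIDE / 2

def in_concept(x):
    """True where an instance falls inside the true concept square."""
    return np.all((x >= LO) & (x <= HI), axis=-1)

def make_bag(positive, n_instances=10):
    """Rejection-sample one bag of 2-D instances uniform on [0, 100]^2."""
    while True:
        bag = rng.uniform(0.0, 100.0, size=(n_instances, 2))
        # A positive bag must contain at least one concept instance;
        # a negative bag must contain none.
        if bool(in_concept(bag).any()) == positive:
            return bag

pos_bags = [make_bag(True) for _ in range(5)]
neg_bags = [make_bag(False) for _ in range(5)]
```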

How does Diverse Density (DD) work?
It finds a point in space which is "most positive", i.e., roughly speaking, a point close to positive instances and far from negative ones. This picture should make it clear.

Clearly, we do not simply look for a point with high density! We look for one with high diverse density, i.e. a point close to instances from many different positive bags.
Finding the required point:

This is the objective function we maximize:

$$\hat{x} = \arg\max_x \prod_i \Pr(x = t \mid B_i^+) \prod_i \Pr(x = t \mid B_i^-)$$

where each positive bag contributes a noisy-or term $\Pr(x = t \mid B_i^+) = 1 - \prod_j \big(1 - \Pr(x = t \mid B_{ij}^+)\big)$ and each negative bag contributes $\Pr(x = t \mid B_i^-) = \prod_j \big(1 - \Pr(x = t \mid B_{ij}^-)\big)$.
Take the logarithm, and then use gradient ascent to maximize the likelihood function. The $\Pr(x = t \mid B_{ij}^+)$ term is defined as $\exp\big(-\sum_k s_k^2 (B_{ijk} - x_k)^2\big)$, where $B_{ij}$ is the $j$'th instance in positive bag $i$ and $s_k$ is a per-feature scale.

We use the scaling term since we might want to weight different features differently!
Note that this term lies between 0 and 1.
Choosing the step size with line search methods (or other step-size rules) to speed up learning, we run gradient ascent until we arrive at a point x of maximum diverse density; a sketch of the whole search follows.
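Here is one way the search could look as a NumPy/SciPy sketch, under the same assumptions as the dataset code above. The noisy-or bag probabilities come from the paper; `instance_prob`, `log_dd` and `t_hat` are our own names, and L-BFGS-B stands in for plain gradient ascent with line search:

```python
import numpy as np
from scipy.optimize import minimize

def instance_prob(t, bag, scale=1.0):
    """Pr(x = t | B_ij) = exp(-sum_k s_k^2 (B_ijk - t_k)^2), per instance."""
    return np.exp(-np.sum((scale * (bag - t)) ** 2, axis=1))

def log_dd(t, pos_bags, neg_bags, scale=1.0, eps=1e-12):
    """Log diverse density of a candidate point t (noisy-or over each bag)."""
    ll = 0.0
    for bag in pos_bags:
        # Positive bag: at least one instance should be in the concept.
        ll += np.log(1.0 - np.prod(1.0 - instance_prob(t, bag, scale)) + eps)
    for bag in neg_bags:
        # Negative bag: no instance should be in the concept.
        ll += np.log(np.prod(1.0 - instance_prob(t, bag, scale)) + eps)
    return ll

# The landscape has local maxima, so (as in the paper) we restart the
# search from every instance of every positive bag and keep the best run.
best = None
for bag in pos_bags:
    for start in bag:
        res = minimize(lambda t: -log_dd(t, pos_bags, neg_bags),
                       start, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
t_hat = best.x  # the point of (approximately) maximum diverse density
```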
Classifying a new instance:
In order to classify a new instance, we compute the term $\exp(-\|inst - x\|^2)$, where $inst$ is the new instance and $x$ is the point of maximum diverse density found above. If this score lies below a threshold, we say that it is a negative instance; otherwise we classify it as a positive one.
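Concretely, the score is just the same exponential term again (a minimal sketch; `score` is our own name and `t_hat` comes from the search above):

```python
def score(inst, t_hat, scale=1.0):
    """Closeness of a new instance to the learned concept point, in (0, 1]."""
    return np.exp(-np.sum((scale * (np.asarray(inst) - t_hat)) ** 2))
```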
How do we choose this threshold?
We can choose it by sweeping candidate thresholds and keeping the one that performs best on a held-out validation split of the data, for example as sketched below.
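One simple version of that sweep (a sketch; `val_instances` and `val_labels` are hypothetical names for a labelled held-out split):

```python
def pick_threshold(t_hat, val_instances, val_labels):
    """Keep the threshold with the best accuracy on a held-out labelled split."""
    scores = np.array([score(x, t_hat) for x in val_instances])
    best_thr, best_acc = 0.5, -1.0
    for thr in np.linspace(0.0, 1.0, 101):
        acc = np.mean((scores >= thr) == np.asarray(val_labels))
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr
```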
Result Comparison:
First, prepare the 5 positive bags and 5 negative bags using the code above.
Let us compare the plots obtained in the paper to the plots we obtained, varying the size of the true concept square.




Left: plots obtained by us using a 10x10 concept square. Right: plots obtained in the paper using a 5x5 concept square. Clearly, using the larger square gave a smoother result for this particular dataset.

Using a slightly larger true concept square gave rise to a unique maximum at the centre, whereas you can see multiple local maxima in the experiment from the original paper.
As more and more positive instances were added to each positive bag, we noticed better convergence properties (a smooth, unique maximum in the plots).
Hope this post helped you get a good understanding of multiple instance learning.
Cheers!