Prediction of insurance policy

From Teachwiki


The task of this thesis was taken from the CoIL Challenge 2000. I decided to work on an already closed competition because it gives me a good opportunity to compare my approach, decisions and results with those of almost 30 other participants. The CoIL Challenge 2000 was an international data mining competition whose aim was to compare the application of different approaches to the well-known problem of targeting customers (the abbreviation CoIL stands for Computational Intelligence and Learning). A detailed description of the challenge and the participants' solutions can be found on the CoIL website.


Direct mailings to a company’s potential customers can be a very effective way to market a product or service. However, as we all know, much of this mail is of no interest to the majority of the people who receive it. Most of it ends up thrown away, not only wasting the money the company spent on it, but also annoying customers.


The competition consists of two tasks:

  1. Predict which customers are potentially interested in a caravan insurance policy.
  2. Describe the actual or potential customers, and possibly explain why these customers buy a caravan policy.


The data was provided by the Dutch data mining company Sentient Machine Research and is based on real-world business data. It consists of two parts: training data (5822 customers) and test data (4000 customers). The purpose of the training data was to build and validate the model; this part contains the target attribute "CARAVAN: Number of mobile home policies". The test data was given in the same format, but with the information about caravan policies missing. The participants were supposed to return only the list of predicted targets: they were asked to choose the 800 most promising customers. The organizers compared this list with the real outcome, and the number of actual policyholders gave the score of the solution. Each data record consists of 86 attributes, containing socio-demographic data (attributes 1-43) and product ownership (attributes 44-86). The socio-demographic data is derived from zip codes: all customers living in areas with the same zip code have the same socio-demographic attributes (the zip code itself was not included in the data). All attributes have a range from 0 to 10.

It is very important to be aware of this fact, because it enables a correct interpretation of the data. For example, the answer 7 to the question "Are you married?" makes no sense otherwise (the correct interpretation is: the customer lives in an area where 70 percent of the inhabitants are married). We also cannot put as strong an emphasis on the zip-code-derived attributes as on information uniquely belonging to one customer (such as the ownership attributes); they just give us a probable image of our customer.
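The decoding described above can be made explicit in a few lines. This is just a minimal sketch of the interpretation rule given in the text; the helper name and the exact mapping of a coded value k to "roughly k*10 percent of inhabitants" are my own assumptions.

```python
def decode_zip_attribute(value):
    """Interpret a 0-10 coded socio-demographic value as an approximate
    percentage of inhabitants in the customer's zip-code area.
    Hypothetical helper; the bin edges are an assumption, not from the data docs."""
    if not 0 <= value <= 10:
        raise ValueError("coded values must lie in 0..10")
    return value * 10  # approximate percentage of inhabitants

# The answer 7 to "Are you married?" -> about 70% of the area is married
print(decode_zip_attribute(7))  # 70
```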


The whole analysis was performed in Weka 3.4.9.

Attribute choice

The first step of my analysis was attribute selection. The 86 attributes contain a lot of redundant, irrelevant or noisy information, and using all of them can cause overfitting of the model. I handled the data in two ways: numeric and nominal. For converting values into nominal form I used the Discretize filter. In the nominal case I used the following three algorithms with the Ranker search method.
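Weka's unsupervised Discretize filter defaults to equal-width binning; the idea can be sketched in a few lines (this is an illustration of the binning principle, not Weka's implementation):

```python
def equal_width_bins(values, n_bins=10):
    """Assign each numeric value to one of n_bins equal-width intervals,
    mimicking the behaviour of an unsupervised equal-width Discretize filter."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant attribute
    bins = []
    for v in values:
        b = int((v - lo) / width)
        bins.append(min(b, n_bins - 1))  # the maximum value falls into the last bin
    return bins

print(equal_width_bins([0, 1, 5, 9, 10], n_bins=5))  # [0, 0, 2, 4, 4]
```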

  • ChiSquaredAttributeEval - Evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class.
  • GainRatioAttributeEval - Evaluates the worth of an attribute by measuring the gain ratio with respect to the class.

GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute).

  • InfoGainAttributeEval - Evaluates the worth of an attribute by measuring the information gain with respect to the class.

InfoGain(Class,Attribute) = H(Class) - H(Class | Attribute).
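The two formulas above translate directly into code. A minimal sketch (not Weka's implementation, which also handles missing values and the Discretize preprocessing):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(classes, attribute):
    """InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute)."""
    n = len(classes)
    conditional = 0.0
    for value in set(attribute):
        subset = [c for c, a in zip(classes, attribute) if a == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(classes) - conditional

def gain_ratio(classes, attribute):
    """GainR(Class, Attribute) = InfoGain(Class, Attribute) / H(Attribute)."""
    return info_gain(classes, attribute) / entropy(attribute)

# A perfectly informative binary attribute scores 1.0 on both measures
classes   = ["yes", "yes", "no", "no"]
attribute = ["a", "a", "b", "b"]
print(info_gain(classes, attribute))   # 1.0
print(gain_ratio(classes, attribute))  # 1.0
```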

I combined them in a rather intuitive way, with the aim of keeping information from all important domains. Finally I chose these 8 attributes.

  1. Customer subtype
  2. Customer main type
  3. Lower level of education
  4. 1 car
  5. Average salary
  6. Contribution to car policies
  7. Contribution to fire policies
  8. Number of car policies
All values are discrete, but noise was added to the plots to make the number of overlapping points visible.

In the numeric case the situation was more complicated. The CfsSubsetEval method (which evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them) offered me a few ownership attributes which, in my opinion, are not suitable. They typically look like the picture on the right: Number of property insurance policies. There are thousands of customers without this policy, and only a small group of holders. Because of this, the automatic interpretation could be misleading: the caravan rate in a small group is very sensitive to the number of positive cases, and one case more can strongly influence the final rate (compare 3/50 versus 4/50 with 30/500 versus 31/500). So the algorithm can consider such an attribute more informative than it really is.
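The sensitivity argument above can be checked with a quick calculation: one extra positive case shifts the caravan rate in a small group ten times more than in a large one.

```python
# One more policyholder in a group of 50 versus a group of 500
small_before, small_after = 3 / 50, 4 / 50      # 6.0% -> 8.0%
large_before, large_after = 30 / 500, 31 / 500  # 6.0% -> 6.2%

print(round(small_after - small_before, 3))  # 0.02  (two percentage points)
print(round(large_after - large_before, 3))  # 0.002 (two tenths of a point)
```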

Additionally, the ownership of these policies may depend more on the specific situation of the customer than on his general propensity to buy policies. For example, people with dangerous jobs are more likely to buy life insurance, but we do not know how dangerous our customer's job is. Following these arguments, I discarded all ownership attributes except these four:

  • Contribution for car policies
  • Contribution for fire policies
  • Number of car policies
  • Number of fire policies

On their histograms we can see a clear dependency between them and the caravan policy (in the histograms, the customers with a caravan policy are marked in red), and this dependency is supported by a large number of instances. There is one more exception: the number of boat policies. Almost every second customer with a boat policy also has a caravan policy. Because of this, I made a special rule, entirely outside the rest of my model, that all people with a boat are promising customers. The explanation is that boating and caravanning are both typical of an outdoor lifestyle, so these two variables are correlated. I then applied CfsSubsetEval to the remaining attributes. As I explain later, I chose the M5Rules method for classification. An advantage of this method is that it can cope with a large number of attributes, so I also ran it with all attributes and looked at the weights assigned by the algorithm itself. My final choice is:

  1. No religion
  2. Living together
  3. High level education
  4. Farmer
  5. 1 car
  6. Income 75-122.000
  7. Purchasing power
  8. Contribution for car policies
  9. Contribution for fire policies
  10. Number of car policies
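The boat-policy override described earlier can be sketched as a thin wrapper around any base model's score. The function and attribute names are illustrative, not Weka identifiers:

```python
def score_customer(customer, base_model):
    """Rank a customer, forcing boat-policy holders to the top of the list.
    `customer` is a dict of attribute values; `base_model` is any callable
    returning a numeric score. This mirrors the special rule kept outside
    the main model: boat owners are always selected."""
    if customer.get("number_of_boat_policies", 0) > 0:
        return float("inf")  # boat owners go straight into the mailing list
    return base_model(customer)

# Usage with a dummy base model:
base = lambda c: c.get("contribution_car_policies", 0)
print(score_customer({"number_of_boat_policies": 1}, base))    # inf
print(score_customer({"contribution_car_policies": 6}, base))  # 6
```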

In the attribute choices for the nominal and numeric cases there were some differences and some similarities. For the nominal algorithms, Customer subtype and Customer main type are more suitable; for the numeric ones, No religion, Living together, Income 75-122.000 and Farmer are better. Customer type is a typical nominal attribute, where types are assigned by a TypeID, so it is quite clear that such a variable is not applicable for regression. There are also attributes which in fact describe similar properties of the customer, e.g. education, but whose histograms have different shapes.

  • Nominal: Lower level of education, Average income
  • Numeric: Higher level of education, Purchasing power

The attributes

  • Contribution to car policies
  • Contribution to fire policies
  • Number of car policies
  • 1 car

were identical in both selections. The first two in particular were the most important for all algorithms that I used for testing in Weka.

Algorithm choice

The second step was the choice of a classification algorithm. I did not use any theoretical reasons for my choice; I simply ran and validated each algorithm on the task, using cross-validation with 10 folds. I could not use a default measure such as error rate or correlation, because that is not exactly what my task is. The problem with this data is that it contains only 6 percent of customers with a caravan policy. Because the achievable success rate is so low, around 1:7, the best error rate is obtained by simply classifying all customers as having no caravan policy. Therefore I used a measure derived directly from the task: I took the 20 percent most promising customers from the test fold and counted the number of actual policyholders. According to this measure, I chose these two algorithms:

  • M5Rules
  • NaiveBayes
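The evaluation measure described above (rank the fold by predicted score, take the top 20 percent, count actual policyholders) can be sketched as:

```python
def top_fraction_hits(scores, actuals, fraction=0.2):
    """Rank customers by predicted score, take the top `fraction`,
    and count how many of them actually hold a caravan policy."""
    k = int(len(scores) * fraction)
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    return sum(actual for _, actual in ranked[:k])

# Toy fold of 10 customers: the two highest-scored ones are real policyholders
scores  = [0.9, 0.1, 0.8, 0.4, 0.3, 0.7, 0.2, 0.6, 0.5, 0.05]
actuals = [1,   0,   1,   0,   0,   0,   0,   1,   0,   0]
print(top_fraction_hits(scores, actuals, fraction=0.2))  # 2
```

The same function with fraction=0.2 over a test fold reproduces the selection criterion; with a list of 4000 test customers and the 800 best scores it reproduces the competition scoring itself.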

M5Rules is able to work with numeric variables, NaiveBayes with nominal ones. They therefore also use different attributes, and I hoped that they would be able to find different dependencies in the data. The M5Rules algorithm works in two steps: first it builds a decision tree according to the M5 algorithm, and then it builds a regression model for each leaf. For a more detailed description see [2]. In our case the tree has only two leaves: people who do not contribute much to car policies, and the rest. The first leaf is not interesting; customers evaluated according to this rule are not promising. Here are the rules:

Rule: 1
	Contribution car policies <= 5.5

Number of mobile home policies = 
	-0.0041 * No religion  
	+ 0.0056 * 1 car   
	+ 0.0078 
Rule: 2

Number of mobile home policies = 
	-0.0131 * Living together 
	+ 0.0131 * High level education 
	- 0.0177 * Farmer 
	+ 0.0118 * Income 75-122.000 
	+ 0.012 * Purchasing power class 
	+ 0.016 * Contribution fire policies 
	+ 0.0161
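For illustration, the two rules above translate directly into a scoring function. The Python attribute names are my own renamings of the Weka attributes:

```python
def m5rules_score(c):
    """Apply the two M5Rules linear models quoted above.
    `c` is a dict of attribute values keyed by illustrative names."""
    if c["contribution_car_policies"] <= 5.5:  # Rule 1: the uninteresting leaf
        return (-0.0041 * c["no_religion"]
                + 0.0056 * c["one_car"]
                + 0.0078)
    # Rule 2: everyone contributing more to car policies
    return (-0.0131 * c["living_together"]
            + 0.0131 * c["high_level_education"]
            - 0.0177 * c["farmer"]
            + 0.0118 * c["income_75_122"]
            + 0.0120 * c["purchasing_power_class"]
            + 0.0160 * c["contribution_fire_policies"]
            + 0.0161)

# A well-educated, non-farming customer contributing 6 to car policies
customer = {"contribution_car_policies": 6, "living_together": 0,
            "high_level_education": 5, "farmer": 0, "income_75_122": 3,
            "purchasing_power_class": 7, "contribution_fire_policies": 4}
print(round(m5rules_score(customer), 4))  # 0.265
```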

We can see that “Contribution fire policies” is the most important factor, provided the customer is not a farmer. NaiveBayes is a simple application of Bayes' theorem, so it cannot give us a result like the one above; for a more detailed description see Wikipedia. For the final model I combined the two classifiers in a stacking-style fashion, with one more M5Rules run as the combiner. It returned simple weights: 0.56 * M5Rules + 0.1603 * NaiveBayes.
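A minimal sketch of categorical naive Bayes (with Laplace smoothing) and of the linear blend quoted above. This illustrates the idea only; Weka's NaiveBayes implementation differs in its smoothing and numeric handling:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Train a minimal categorical Naive Bayes; returns an (unnormalised)
    posterior function predict(row, cls). Laplace smoothing is simplistic."""
    n = len(labels)
    priors = {c: k / n for c, k in Counter(labels).items()}
    counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, label)][v] += 1

    def predict(row, cls):
        p = priors[cls]
        for i, v in enumerate(row):
            seen = counts[(i, cls)]
            p *= (seen[v] + 1) / (sum(seen.values()) + len(seen) + 1)
        return p  # unnormalised posterior; compare across classes

    return predict

def blended_score(m5, nb):
    """Final model from the text: 0.56 * M5Rules + 0.1603 * NaiveBayes."""
    return 0.56 * m5 + 0.1603 * nb

rows   = [("low", "yes"), ("low", "yes"), ("high", "no"), ("high", "no")]
labels = [1, 1, 0, 0]
predict = train_naive_bayes(rows, labels)
print(predict(("low", "yes"), 1) > predict(("low", "yes"), 0))  # True
```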

Description of customer

In its final version my model is a little bit of a black box, so I use graphs to describe the typical customer. On the x-axis is an attribute, and on the y-axis my prediction (0 - not among the 800 most promising customers, 1 - a promising customer). All values are discrete, but noise was added to make the number of overlapping points visible.
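The noise trick used in these plots is simple jittering. A sketch, kept plotting-library-free (the `spread` parameter and seed are my own choices):

```python
import random

def jitter(values, spread=0.2, seed=0):
    """Add small uniform noise to discrete values so overlapping points
    become visible in a scatter plot; the underlying value stays recoverable
    by rounding as long as spread < 0.5."""
    rng = random.Random(seed)
    return [v + rng.uniform(-spread, spread) for v in values]

xs = jitter([1, 1, 1, 2, 2], spread=0.2)
print(all(abs(x - round(x)) <= 0.2 for x in xs))  # True
```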

My prediction and purchasing power class
My prediction and Contribution to car policies
From these two pictures you can see that we prefer rich customers, and that we choose almost exclusively customers with a contribution to car policies equal to six. In the same way we can build the image of our typical customer:
  • Pays high contributions to car and fire policies
  • Has a higher salary
  • Is married and well educated
  • Has 1 car
  • Belongs to one of these main types: Successful hedonists, Driven Growers, Average Family, Family with grown-ups, Conservative families

I was also interested in whether some of my decisions were right. The first was discarding most of the ownership attributes, which proved to be a good decision: the caravan holders were successfully marked by my algorithm without using information about e.g. social insurance policies. The second decision, concerning the boat policies, was also successful. Although the rate was not as good as in the training dataset (1:2), the rate in the test set (3:12) was still above average.


In my sample of 800 customers there were 115 customers with a caravan policy. The maximum number of policy owners that could be found was 238; the winning model selected 121 policy owners, and a random selection yields 42. With this result I would have taken 2nd-3rd place in the competition. This was surprising for me, but not surprising in general, because this competition illustrates Occam's razor well. The solution of the winner was quite simple: he basically used NaiveBayes, with the following predictors:

  • purchasing power class
  • a private third party insurance policy
  • a boat policy
  • a social security insurance policy
  • a single fire policy with higher contribution
  • plus two derived attributes

The frequency distribution of scores for the prediction task is displayed in the picture below.

Competition result


[1] Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques.

[2] J. R. Quinlan: Learning with Continuous Classes. Basser Department of Computer Science, University of Sydney, Australia, 1992.

