Car insurance company

From Teachwiki



Data description

Our data set consists of 1,200 observations, one per customer of the car insurance company, and 13 variables describing characteristics of the insured that the company initially considered relevant. Of the 13 variables, 6 are nominal and 7 are scalar.

The 6 nominal variables are: sex, personal status, job status, place of residence, auto type, and parking space. The following table summarizes all the possible values that each variable can take, as well as their meaning.

As an example, a customer with a value of 3 in the auto type column would own a luxury car.

The 7 scalar variables are: age, income, dependants, common distance, claims, amount of claims, and age of car. The following table gives a short description of these variables:


However, while carefully examining the data, we found 2 inconsistencies. First, there were observations with an age below 16, which is the minimum driving age in the United States. Second, there were customers with 0 claims and, at the same time, a dollar amount of claims greater than 0, which is illogical. For the first inconsistency we simply deleted the 4 cases found. For the second one, the approximately 200 affected observations were imputed, meaning that their values for the dollar amount of claims were replaced by 0.
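The two cleaning steps described above can be sketched as follows. This is only a minimal illustration: the records, the field names (`age`, `claims`, `claim_amount`) and the helper `clean` are assumptions made for the example, not the original data or program.

```python
# Hypothetical records; field names and values are assumptions for illustration.
customers = [
    {"age": 15, "claims": 1, "claim_amount": 500.0},
    {"age": 34, "claims": 0, "claim_amount": 250.0},
    {"age": 52, "claims": 2, "claim_amount": 1800.0},
    {"age": 41, "claims": 0, "claim_amount": 0.0},
]

MIN_DRIVING_AGE = 16  # threshold stated in the text


def clean(records):
    """Drop under-age records and set the amount to 0 for customers with no claims."""
    cleaned = []
    for rec in records:
        if rec["age"] < MIN_DRIVING_AGE:
            continue  # first inconsistency: the case is deleted
        fixed = dict(rec)  # copy, so the input records are left untouched
        if fixed["claims"] == 0 and fixed["claim_amount"] > 0:
            fixed["claim_amount"] = 0.0  # second inconsistency: amount replaced by 0
        cleaned.append(fixed)
    return cleaned


cleaned_customers = clean(customers)
```

In this toy input, one record is removed for the age rule and one amount is reset to 0, while the consistent records pass through unchanged.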

Descriptive statistics

After resolving the inconsistencies, and in order to better understand the scalar variables, the following statistical properties were calculated for each of them: range, minimum and maximum value, mean, and standard deviation. The next table shows the results obtained:

Table 2.jpg
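As a rough sketch, the five statistics listed above could be computed for one scalar variable as shown here. The helper `describe` and the sample ages are hypothetical, and the sample standard deviation (divisor n - 1) is an assumption, since the text does not say which version was used.

```python
import statistics


def describe(values):
    """Return range, minimum, maximum, mean and sample standard deviation."""
    lo, hi = min(values), max(values)
    return {
        "range": hi - lo,
        "min": lo,
        "max": hi,
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


# Hypothetical ages, for illustration only.
ages = [18, 25, 32, 47, 63]
summary = describe(ages)
```

The same function would be applied to each of the 7 scalar variables in turn.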

Dollar amount of claims: initial approach

The dollar amount of claims is the most relevant variable for the insurance company, since its profitability relies heavily on how this variable behaves, and it is also the principal parameter for identifying the different customer profiles. Our analysis focuses mainly on the interrelationship between the dollar amount of claims and the remaining variables. First, however, it was considered necessary to observe some general properties of this variable.

A histogram was used to examine its statistical distribution. From the following graph it can be inferred that the variable is not normally distributed: it is clearly skewed, with most observations concentrated on the left side. This property directly affects the performance of the linear regression analysis implemented in a later section of this work.

Graph 1.jpg

Four box plots were constructed, each describing the behaviour of the dollar amount of claims for a given number of claims made by the customer. The box plot for customers with 0 claims was omitted, because all of them also have a dollar amount of claims of 0. The box plots are ordered from left to right: the leftmost one includes all customers with 1 claim, and the rightmost one those with 4 claims.


Except for the customers with 4 claims, the variable is skewed to the left side, meaning that a non-normal distribution is also found when the data is restricted to these specific populations. The average dollar amount of claims was expected to rise with the number of claims (the amount of claims for a single customer is the sum of all claims made by that customer). This occurs in all cases except the third one: customers with 3 claims tend to have a lower dollar amount of claims on average. The box plots also show that the distribution of the dollar amount of claims is more spread out for people with 2 and 3 claims. The extreme values obtained cannot be assumed to be outliers, because in none of the cases is the data normally distributed. The Andrews curves confirm the absence of significant outliers, so all observations remaining after the initial data cleaning are considered in our further analysis.
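The comparison of averages across the number of claims can be illustrated with a small grouping sketch. The `(number of claims, dollar amount)` pairs below are invented for the example and do not reproduce the figures behind the box plots.

```python
from collections import defaultdict
import statistics

# Hypothetical (number of claims, dollar amount of claims) pairs.
records = [
    (1, 300.0), (1, 450.0), (1, 1200.0),
    (2, 900.0), (2, 1500.0), (2, 2700.0),
]

# Collect the claim amounts of all customers sharing the same number of claims.
by_claims = defaultdict(list)
for n_claims, amount in records:
    by_claims[n_claims].append(amount)

# Mean and median per group, ordered by number of claims.
group_summary = {
    n: {"mean": statistics.mean(amounts), "median": statistics.median(amounts)}
    for n, amounts in sorted(by_claims.items())
}
```

Reading the group means in order of the number of claims shows directly whether the average amount rises from one group to the next.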


Dollar amount of claims: relationship with other variables

As a first step, a two-dimensional graph is drawn for the dollar amount of claims against each of the other continuous variables (income, age, common distance, and age of the car), in order to test for a relationship between them and, if there is one, to observe whether the pattern is positive or negative and whether clusters or subgroups exist. Once a relationship is identified, a nominal variable is added with the purpose of better understanding why the observed behaviour occurs. All nominal variables were used, but only the graphs showing a clear pattern are presented in this work. In the last subsections, three- and four-dimensional graphs are explored.

Dollar amount of claims vs. income

A positive relationship was found between the 2 variables, meaning that the higher the income, the larger the amount claimed from the insurance company. Two groups were identified as well: customers with an income lower than $80,000 and those with an income greater than $100,000. It can be clearly observed that the observations in the high income subgroup are more spread out than those in the low income subgroup.

The positive relationship between the variables can be intuitively explained by assuming that customers with larger incomes own more expensive cars, so that in the event of an accident or failure the repair, and therefore the claim, will be more costly. From the two subgroups it can be inferred that the company mainly markets to the middle and low income populations, or that its area of influence does not include many high income residents. People earning less than $80,000 a year will rarely make a claim greater than $3,000, while this would be approximately the mean value of the claim of someone making more than $100,000.

The next step consisted in giving a different colour to the observations according to the type of job the person holds: red for independents, blue for managers, green for workers, yellow for employees, and black for non-professionals. All green and black observations are exclusively in the left cloud, meaning that the incomes of all workers and non-professionals do not surpass $80,000, so their claims are likely to be under $3,000.

Most red and blue observations are located in the right cloud, while the yellow observations are found proportionally in both clouds, although in absolute terms they mostly appear in the left one. Consequently, independents and managers tend to have much larger incomes and to claim larger dollar amounts on average. It can also be seen that the insurance company attracts more employees and workers, since the yellow and green dots are the most frequent.

When a linear regression of the dollar amount of claims on income was estimated separately for each job type subgroup, it was found that the relationship is actually not positive for all of them: the non-professionals and employees present a negatively sloped line. The regression line with a positive slope is the one fitted to all observations.
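The situation described above, where a subgroup has a negative slope while the line fitted to all observations has a positive one, can be reproduced with a small sketch. The helper `fit_line` is an ordinary least squares fit written for this example, and the `(income, claim amount)` pairs are invented, not taken from the data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (income, dollar amount of claims) pairs per job type.
groups = {
    "manager": [(100000, 2800), (120000, 3200), (140000, 3600)],
    "employee": [(40000, 1500), (50000, 1300), (60000, 1100)],
}

# One regression per subgroup.
slopes = {
    job: fit_line([x for x, _ in pts], [y for _, y in pts])[0]
    for job, pts in groups.items()
}

# One regression on all observations pooled together.
pooled = [pt for pts in groups.values() for pt in pts]
pooled_slope = fit_line([x for x, _ in pooled], [y for _, y in pooled])[0]
```

In this toy data the employee slope is negative, while the pooled slope is positive because the high income group also has the larger claims.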

Although unexpected, this behaviour seen in the non-professionals and workers can be partly explained by the fact that these populations do not face great salary changes during their working lives (which is why all of these observations are located in the left cloud, as mentioned before), so they do not change their cars as often as the other groups. For this reason, they hold the same car for longer periods of time even though their incomes are increasing (but only slightly), so their cars will be older by the time they earn better salaries, and their claims will be smaller.

For these reasons, the non-professionals and employees with the highest incomes will tend to be the least costly customers for the insurance company, while managers, independents, and workers with higher incomes will be the most costly.

Dollar amount of claims vs. age

There is a clear positive relationship between the dollar amount of claims and the age of the customer but, unlike in the previous section, no subgroups were identified, meaning that the company deals with customers of all ages. This behaviour can be explained by the fact that driving skills worsen with age. However, the amounts claimed become more volatile for customers older than 45 years: the highest values are observed from this age onwards, so the most costly claims incurred by the insurance company fall within this group.

When a nominal variable was introduced, a significant result was only found for the type of car. This analysis is left for a later section, since the relationship is easier to visualize in the three-dimensional graph that also includes income.

Amount of claims vs. common distance

There is a clear negative relationship between the amount of claims and the common distance: the more kilometres a customer drives per day, the lower the amount of claims the insurance company has to pay, which turned out to be the opposite of what we expected.

People living in urban areas, which are assumed to be more prone to accidents, will probably live not too far (measured in kilometres) from their work places. On the other hand, customers living in rural areas are assumed to need to drive many more kilometres per day, and their chances of having an accident will be lower due to less traffic.

The next step consisted in giving different colours to the observations according to the place of residence, and drawing a regression line for each of the 3 populations. The black dots are the urban customers, the blue the suburban, and the red the rural; the regression line with the most negative slope is the one for the urban population. It can be seen that suburban clients are the most representative for the insurance company, since the proportion of blue dots is the largest. The majority of the observations lie below $3,000 in claims and below 30 kilometres. Customers claiming more than $3,000 are more likely to live in urban and suburban areas, while those driving more than 30 kilometres per day are more likely to live in rural areas and to make much cheaper claims. This supports the assumptions made in the previous paragraph to explain the negative relationship between the 2 variables.

Looking at the regression lines, the urban population line has a higher intercept, and it takes much larger values than the suburban and rural regression lines over the range between 0 and 30 kilometres. This means that urban customers claim much more and are therefore more costly for the insurance company. The suburban and rural customers behave similarly when only the kilometres driven per day are taken into account. The steeper slope of the urban regression line indicates that the cheapest customers for the company, that is, those making the smallest claims, are those living in urban areas and driving more than 30 kilometres per day, probably because of greater driving expertise.

Dollar amount of claims vs. age of the car

No relationship was found between the dollar amount of claims and the age of the car. When a nominal variable was added, as in the previous cases, no patterns were found either. The age of the car was therefore found not to be relevant for predicting a claim.

Dollar amount of claims vs. income and age

Knowing that there is a positive relationship between the dollar amount of claims and both the income and the age, a three-dimensional graph was plotted to examine the interaction between them. Again 2 subgroups appear: the upper cloud contains customers with significantly larger incomes, amounts of claims, and ages than the other one, and it also has fewer and more widely spread observations. There is thus a very noticeable relationship between these three variables, so income and age are good customer indicators for predicting the claims.

All the nominal variables were added to the graph just described, and an additional pattern in the behaviour of the three variables was found for the type of car. The red dots indicate ownership of a van, sedan, or compact car, while the blue ones indicate a luxury or sport car. As can be seen, the upper cloud is mostly made up of blue dots, and the lower one has proportionally more red dots than blue ones. It can then be inferred that customers with luxury and sport cars tend to be richer and older than those with a van, sedan, or compact car, and make larger claims to the company. The type of car registered therefore sets a customer profile and a trend for the dollar amount of claims.

Dollar amount of claims vs. age, income, and number of claims

Although the relationship between the first 3 variables was already analyzed in the previous section, the interrelationships are easier to perceive when using Chernoff-Flury faces. This time the number of claims was included as well. The darkness and thickness of the hair represent the dollar amount of claims (darker and thicker hair means larger amounts), the shape of the face the number of claims (fat faces mean a larger number of claims), and the size of the eyes the income (the bigger the eyes, the richer the customer). The first graph includes the 50 youngest observations, and the last one the 50 oldest ones.


In the first graph, many thin faces with light and little hair can be seen, and most eyes are completely closed, while in the second one there are more fat faces with dark and abundant hair, and many open eyes. This confirms the positive relationship between the 4 variables, which had been previously discussed.

Multiple linear regression analysis

After showing how the variables behave and the relationships among them, we tried to detect which variables have a significant influence on the main variable, the dollar amount of claims, by running a multiple linear regression. In the regression equation, the dollar amount of claims is the dependent variable and the other six scalar variables are the independent ones, in the following order: age, income in dollars, number of dependants, average kilometres driven per day, number of claims, and age of car.

When running the regression, two problems were found in our dataset. The first one is that the dependent variable is not normally distributed (as pointed out at the beginning of this paper), mainly because it has more than 200 observations with a value of 0. The second one derives from the very different ranges of the independent variables. In order to make our regression results more accurate and persuasive, we only took into account the observations with 1 or more claims (that is, we excluded all observations with a value of 0 in the dollar amount of claims); logarithms were taken of those variables having very large values.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + u$$
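The data preparation described before the equation (keeping only customers with at least one claim and taking logarithms of large-valued variables) might look roughly like the sketch below. The text does not say which variables were log-transformed, so the choice of income and claim amount here is an assumption, as are the field names, the sample values and the helper `prepare`.

```python
import math

# Hypothetical records; field names and values are illustrative assumptions.
rows = [
    {"age": 30, "income": 45000.0, "claims": 0, "claim_amount": 0.0},
    {"age": 48, "income": 120000.0, "claims": 2, "claim_amount": 3200.0},
    {"age": 37, "income": 60000.0, "claims": 1, "claim_amount": 900.0},
]


def prepare(records):
    """Keep customers with at least one claim and log-transform large-valued fields."""
    prepared = []
    for rec in records:
        if rec["claims"] < 1:
            continue  # customers without claims are excluded from the regression
        prepared.append({
            "age": rec["age"],
            # Assumption: income and claim amount are the "very large" variables.
            # Both must be strictly positive for the logarithm to be defined.
            "log_income": math.log(rec["income"]),
            "log_claim_amount": math.log(rec["claim_amount"]),
        })
    return prepared


regression_rows = prepare(rows)
```

The resulting rows would then be passed to whatever regression routine is used; the fitting step itself is not shown here.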


The ANOVA table shows an overall P-value of 0.000, implying that at least one of the coefficients is different to zero. If we use an α value of 0.05, the significant variables are: age (P-value = 0.0069), income in dollars (P-value = 0.000), and average kilometres driven per day (P-value = 0.0000). The regression results show that age and income in dollars are positively related to the dollar amount of claims, and that average kilometres driven per day is negatively related to the dollar amount of claims, as we had seen in the graphs before.

Discrimination analysis

The objective of our discrimination analysis is to distinguish potential customers according to their expected amount of claims and to establish a cost structure for the company. We classified the data into two groups: low cost type customers, or observations with a dollar amount of claims lower than 780.96 (the median), and high cost type customers, or observations with a dollar amount of claims greater than 780.96. Because the density of the data is unknown, Fisher's linear discrimination rule was applied, which finds the linear combination $a^{\top}x$ maximizing the ratio of the between-group sum of squares to the within-group sum of squares, $a^{\top}Ba / a^{\top}Wa$. For example, a new observation $x_0$ would be classified as a low cost type customer if $a^{\top}(x_0 - \bar{x}) > 0$.
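A minimal two-variable sketch of the rule above follows. It assumes the usual two-group form, in which the direction is computed as W⁻¹ times the difference of the group means (low cost minus high cost) and the reference point is the midpoint of the two group means; the text does not spell these out. The observations (age, income in thousands of dollars) and all function names are hypothetical.

```python
def mean_vector(points):
    """Component-wise mean of a list of 2-dimensional points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter_matrix(points, mean):
    """2x2 within-group sum of squares and cross-products."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_rule(low, high):
    """Return the direction a = W^-1 (mean_low - mean_high) and the midpoint."""
    m_low, m_high = mean_vector(low), mean_vector(high)
    s_low, s_high = scatter_matrix(low, m_low), scatter_matrix(high, m_high)
    w = [[s_low[i][j] + s_high[i][j] for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]  # must be non-zero
    w_inv = [[w[1][1] / det, -w[0][1] / det],
             [-w[1][0] / det, w[0][0] / det]]
    diff = [m_low[0] - m_high[0], m_low[1] - m_high[1]]
    a = [w_inv[0][0] * diff[0] + w_inv[0][1] * diff[1],
         w_inv[1][0] * diff[0] + w_inv[1][1] * diff[1]]
    midpoint = [(m_low[0] + m_high[0]) / 2, (m_low[1] + m_high[1]) / 2]
    return a, midpoint


def classify(x0, a, midpoint):
    """Low cost type if a^T (x0 - midpoint) > 0, high cost type otherwise."""
    score = a[0] * (x0[0] - midpoint[0]) + a[1] * (x0[1] - midpoint[1])
    return "low" if score > 0 else "high"


# Hypothetical (age, income in $1000) observations for the two groups.
low_cost = [(25, 30), (30, 40), (35, 35), (28, 45)]
high_cost = [(50, 90), (55, 110), (60, 100), (48, 120)]
a, midpoint = fisher_rule(low_cost, high_cost)
```

With these toy groups, `classify((27, 38), a, midpoint)` falls on the low cost side and `classify((58, 105), a, midpoint)` on the high cost side. The analysis in the text uses more than two variables, so a general implementation would need a full matrix inverse instead of the 2x2 formula used here.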


The fraction of the red curve to the right of the intersection is the area of misclassification of a true low type customer as a high type, and the fraction of the blue curve to the left of the intersection is the area of misclassification of a true high type customer as a low type.


The multivariate graphical analysis identified the relationship between the claims made by the customers and other variables describing their socioeconomic structure, personal characteristics, and preferences. Diverse customer profiles were found, predicting different behaviours when making a claim to the insurance company. Due to the statistical properties of the dependent variable, a multiple linear regression was run on a subsample of the data, leading to the same conclusions as the previous graphical analysis. These two methods let us recognize the profiles of the costly and inexpensive customers, suggesting the proper fees to be charged according to expected claims, and the type of clients that should be targeted.




  • Data source?
  • No programs
  • Nice statement about the outliers
  • How does this Andrews curves plot show that you have no outliers?
  • You should use smaller dots in the colored scatterplot graphics
  • Could it be that most people are "employees and workers", rather than that the insurance company attracts such people?
  • "The ANOVA table shows an overall P-value of 0.000, implying that at least one of the coefficients is different to zero". Wrong.
  • Comment on R^2?
  • Last graphics: y-axis labelling?