Analysis of Percentage of Bodyfat
- 1 Abstract
- 2 Dataset
- 3 Outlier Detection
- 4 Target of Analysis
- 5 Correlation Analysis
- 6 Cluster Analysis
- 7 Regression Analysis
- 8 Questions Opened
- 9 Reference
- 10 Comments
Based on a dataset consisting of Percentage of Body Fat and 14 other physical measurements, this paper presents the graphical and numerical statistical characteristics of the variables using XploRe, and then analyzes the relationships between Percentage of Body Fat and the other variables.
The results show that the percentage of body fat of American men is negatively related to Age and positively related to Weight and to several body circumferences. We also conclude that Weight plays the most important role in determining the Percentage of Body Fat.
In recent decades, estimating the percentage of body fat has become popular, at least partly as a way to assess people's health. Many doctors suggest that the higher a person's percentage of body fat, the greater his or her risk of disease. An individual's percentage of body fat can be estimated from Body Density, which can be measured by an underwater weighing technique. A variety of health books and texts also suggest that people can estimate their body fat percentage from their age and body circumference measurements.
The dataset consists of 15 variables measured on a sample of 252 men from the United States. The 15 variables are Percentage of body fat (%), Body density from underwater weighing (gm/cm^3), Age (years), Weight (lbs), Height (inches), and ten body circumferences (Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, Wrist, all in cm). Percentage of body fat is computed from Siri's (1956) equation:
Percentage of Body Fat = 495/Density - 450
Five Number Summary
From the Five Number Summary of all 252 observations, we see, surprisingly, that the minimum Percentage of body fat is 0. This "0 percentage" comes from the observation with the maximum Density, from which the Percentage of body fat is computed by Siri's equation. For a body density of 1.1089, Siri's equation gives a negative percentage of body fat, which is recorded as 0 in the dataset.
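The arithmetic behind the "0 percentage" can be checked directly. The following is a minimal sketch of Siri's equation in Python; the function names are ours, and the clamping to 0 is an assumption about how the dataset was prepared, not something documented in the data source.

```python
def siri_body_fat(density_g_per_cm3: float) -> float:
    """Percentage of body fat from body density via Siri's (1956) equation."""
    return 495.0 / density_g_per_cm3 - 450.0


def recorded_body_fat(density_g_per_cm3: float) -> float:
    """Clamp negative estimates to 0 (assumed treatment in the dataset)."""
    return max(0.0, siri_body_fat(density_g_per_cm3))


raw = siri_body_fat(1.1089)         # a few percentage points below zero
stored = recorded_body_fat(1.1089)  # clamped to 0.0
print(f"raw estimate: {raw:.2f}%, stored value: {stored:.1f}%")
```

The raw estimate for the maximum density is negative, which is physically impossible and explains the zero in the Five Number Summary.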
We further find that the minimums of Percentage, Weight, Chest Cir., Abdomen Cir., Hip Cir. and Thigh Cir., as well as the maximum of Density, all come from the 182nd observation. In addition, the maximums of Weight, Neck Cir., Chest Cir., Abdomen Cir., Hip Cir., Thigh Cir., Knee Cir., Biceps Cir. and Wrist Cir. all come from a single observation, the 39th.
So far we can say that the 39th and 182nd observations should be treated as outliers and are better excluded from further analysis.
In the Chernoff-Flury Faces (right), the 39th and 182nd observations are clearly very different from the others, especially the 39th. Note that how the faces differ depends largely on how the variables are assigned to the different parts of the face. The 39th and 182nd observations can also be identified in the Star plot (left), since their shapes differ from the others in quite a few respects.
With the outliers detected, we can look at how the multivariate graphs differ with the outliers included and excluded.
For the Andrews Curves (left), the upper-left and lower-left graphs show the 1st-50th and 151st-200th observations respectively, with the aforementioned outliers (39th and 182nd observations) included; the upper-right and lower-right graphs are drawn with the outliers excluded.
For the Parallel Coordinate Plot (right), the left graph shows a combination of the 1st-50th and 151st-200th observations, including the outliers; the right graph has the outliers excluded.
Although only two of the 252 observations are removed, the graphs differ considerably before and after, which vividly illustrates how abnormal the two outliers are and why they need to be removed.
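To make clear why a single extreme observation stands out in an Andrews plot, the curve definition can be sketched as follows. This is an illustration in Python rather than the XploRe routine used for the figures, and the two observations are hypothetical standardized values, not rows of the body fat data.

```python
import math


def andrews_curve(x, t):
    """Andrews curve f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + ..."""
    value = x[0] / math.sqrt(2.0)
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2  # frequencies 1, 1, 2, 2, 3, 3, ...
        value += xj * (math.sin(k * t) if j % 2 == 1 else math.cos(k * t))
    return value


# Hypothetical standardized observations: one typical, one extreme.
typical = [0.1, -0.2, 0.3, 0.0]
extreme = [3.0, 2.5, 3.2, 2.8]

# Evaluate both curves on a grid over [-pi, pi] and measure the largest gap.
grid = [-math.pi + i * 2.0 * math.pi / 100 for i in range(101)]
max_gap = max(abs(andrews_curve(extreme, t) - andrews_curve(typical, t)) for t in grid)
print(f"largest vertical gap between the two curves: {max_gap:.2f}")
```

Each observation becomes one curve, so an observation that is extreme in several variables at once produces a curve far away from the bundle formed by the others.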
Target of Analysis
With the treated data set, we aim to examine the relationships between Percentage of Body fat and the other variables. Some questions can be raised here, for instance: Does a larger weight mean a larger percentage of body fat? Does age matter? Can we estimate the percentage of body fat of men using only a scale and a measuring tape?
Given this target, the Percentage of Body Fat is naturally the explained variable. Its statistical characteristics are shown below.
From the Boxplot, Histogram, QQ plot and Dot plot of Percentage of Body fat, we can see that the explained variable is approximately normally distributed.
Among the other variables, we first select Age (years), Weight (lbs), Abdomen Circumference (cm) and Hip Circumference (cm) as explanatory variables on intuitive grounds, since common sense suggests that these four variables are relatively highly correlated with the explained variable. We examine their relationships with Percentage of Body fat further in a later part. The statistical characteristics of these four variables, as well as the correlation between the explained variable and each of them, are shown below:
One thing worth mentioning is that the correlation dot plots (lower right) in the graphs for Abdomen Cir. and Hip Cir. (the third and fourth graphs) are nearly the same. So far we have not figured out the reason.
The Jarque-Bera Test is used to assess how well a variable fits the normal distribution, based on its Skewness and Kurtosis. With the Jarque-Bera Test we can examine the normality of the selected variables numerically. A table of the JB-Test output from XploRe, consisting of the JB test statistic, probability value, Standard Deviation, Skewness and Kurtosis, is provided on the left. With a critical value of 5.991 for the JB test statistic in our case (the 5% critical value of the chi-squared distribution with 2 degrees of freedom), we conclude that Percentage of Body fat and Age are closer to normality than the other three variables, since their JB test statistics are both smaller than the critical value.
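For readers who want to see how the statistic is built from Skewness and Kurtosis, a minimal Python sketch follows. It uses the textbook form JB = n/6 * (S^2 + (K - 3)^2 / 4) with moment estimators that divide by n; whether XploRe uses exactly these estimators is an assumption, and the sample is a toy example, not the body fat data.

```python
def jarque_bera(sample):
    """Return (JB statistic, skewness, kurtosis) using moments divided by n."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((v - mean) ** 2 for v in sample) / n
    m3 = sum((v - mean) ** 3 for v in sample) / n
    m4 = sum((v - mean) ** 4 for v in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, skewness, kurtosis


CRITICAL_VALUE = 5.991  # 5% critical value of chi-squared with 2 degrees of freedom

jb, skew, kurt = jarque_bera([1.0, 2.0, 3.0, 4.0, 5.0])  # toy sample
print(f"JB = {jb:.4f}, reject normality: {jb > CRITICAL_VALUE}")
```

A value of JB below the critical value means normality is not rejected at the 5% level; it does not prove that the variable is normal.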
The Scatter Plot shows graphically the correlation between each pair of variables. As the Scatter Plot on the left shows, there is a high correlation between Bodyfat Percentage and Abdomen, as well as between Weight and Abdomen and between Hip and Abdomen. This naturally raises the question: does the Abdomen matter a lot? On the other hand, Age seems to have no obvious correlation with the other four variables. We provide a numerical correlation analysis later for a closer look.
After the graphical description of our body fat data, we go on to examine the numerical relationships. The purpose of the following analysis is to find the factors that best capture the variance of the percentage of body fat. Our process is divided into 2 steps: first, we examine the linear relationships among the 15 variables; second, we establish a regression model that best captures the variance of the dependent variable, the percentage of body fat.
We use the Bravais-Pearson correlation coefficient to measure the linear relationships between variables; see Figure 11.
The correlation results corroborate the previous analysis: X8, the circumference of the abdomen, is highly positively correlated with the percentage of body fat, the weight and the circumference of the hip, with a correlation coefficient R greater than 0.8. In addition, the percentage of body fat is highly positively correlated with weight and with the circumferences of the abdomen, neck, chest and hip, weakly positively correlated with age, ankle, forearm and wrist, and negatively correlated with height. This informal result is quite interesting. Does it capture the truth, or does it only roughly reflect the overall situation? To answer this question, we explore the data set further.
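The Bravais-Pearson coefficient used above can be written down in a few lines. The sketch below is plain Python with hypothetical measurements; it is meant only to show the definition, not to reproduce the values in the correlation table.

```python
import math


def pearson_r(x, y):
    """Bravais-Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements, for illustration only.
abdomen = [1.0, 2.0, 3.0, 4.0]
body_fat = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(abdomen, body_fat)
print(f"r = {r:.2f}")
```

The coefficient lies between -1 and 1 and only measures linear association, which is one reason the question of non-linear relationships remains open.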
The aim of cluster analysis is to find groups with homogeneous properties in a large heterogeneous sample. Our motive for using cluster analysis in this paper is to find out whether our 252 observations fall naturally into different subgroups with respect to certain features. Considering the idea of “apple” and “pear” type people, this analysis is quite meaningful.
The groups or clusters should be as homogeneous as possible and the differences between different groups as large as possible. Cluster analysis can be divided into two fundamental steps:
- Choice of a proximity measure;
- Choice of group-building algorithm.
In this paper, we use the Ward clustering algorithm with Euclidean distance as the proximity measure. The result is shown in Figure 12.
We can clearly distinguish 2 groups (clusters) with relatively high homogeneity. The larger cluster contains 183 observations and the smaller one 69. We therefore conclude that our data can be roughly divided into 2 subgroups. In the following regression, however, we continue to use the whole data set in order to examine the underlying relationships.
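The two steps of cluster analysis, a proximity measure and a group-building algorithm, can be illustrated with a small Ward clustering sketch. This is a naive pure-Python version written for readability, not the XploRe routine behind Figure 12; the points are hypothetical and the function names are ours.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    size_a, size_b = len(a["members"]), len(b["members"])
    sq_dist = sum((p - q) ** 2 for p, q in zip(a["centroid"], b["centroid"]))
    return size_a * size_b / (size_a + size_b) * sq_dist


def ward_clusters(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the pair with the lowest Ward cost."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            "centroid": [(na * p + nb * q) / (na + nb)
                         for p, q in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c["members"]) for c in clusters]


# Hypothetical two-dimensional points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = ward_clusters(points, 2)
print(result)
```

Because Euclidean distance depends on the units of the variables, variables measured on different scales would normally be standardized first; whether this was done for Figure 12 is not stated.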
General Regression on 13 Variables
The purpose of regression is to capture the variance of our explained variable, i.e. the percentage of body fat, by a series of explanatory variables. We first put all the explanatory variables into a general linear regression model. The idea is to get a rough impression of the relationships among the variables. The result is shown in Figure 13:
From this "naive" regression, Model 1, we can see that R-squared is 0.74905 and the adjusted R-squared is slightly smaller at 0.73534. Since R-squared measures the goodness of fit of a linear regression model, we conclude that the model in general represents the percentage of body fat well. However, the t-statistics are less satisfying. Since we have 252 observations, the distribution of the parameter estimates should be approximately normal. At the critical value of 1.96 (α = 0.05), only x6, x8, x14 and x15 are statistically significant. But is this result convincing? Do age, weight and the circumferences of the hip and chest really play an insignificant role in determining the percentage of body fat? Recalling our correlation table, in which most variables are correlated with one another, it is clear that we need to take multicollinearity into account.
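The relation between the two fit measures quoted above can be verified with a short calculation. The sketch below assumes 252 observations and 13 explanatory variables plus an intercept, which reproduces the reported adjusted R-squared up to rounding; the helper names are ours.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R-squared for a model with an intercept and n_regressors variables."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)


def is_significant(t_statistic, critical_value=1.96):
    """Two-sided test at the 5% level using the normal approximation."""
    return abs(t_statistic) > critical_value


adj = adjusted_r_squared(0.74905, n_obs=252, n_regressors=13)
print(f"adjusted R-squared: {adj:.5f}")

# t-statistics quoted in the text for age (1.919) and weight (-1.652) in Model 1.
print(is_significant(1.919), is_significant(-1.652))
```

The penalty for the 13 regressors is small here because the sample is large, so the drop from R-squared to adjusted R-squared says little about which individual variables matter.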
The idea of principal component analysis is to reduce the dimension of the variables while retaining their power to explain the data. Since the scale of the variables can change the result of the PC transformation, we apply this method only to the variables that measure body circumferences, i.e. X6 to X15, because they are all measured in the same unit, centimeters.
Under the criterion that 95% of the variance should be captured, we use the first 3 principal components to represent the 10 variables; see Figure 14.
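The 95% criterion amounts to accumulating the eigenvalues of the covariance matrix until the chosen share of total variance is reached. A minimal Python sketch follows; the eigenvalues are hypothetical and are not those reported in Figure 14.

```python
def components_needed(eigenvalues, threshold=0.95):
    """Smallest number of principal components whose variance share reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a covariance matrix, for illustration only.
eigenvalues = [80.0, 10.0, 6.0, 2.0, 1.0, 1.0]
k = components_needed(eigenvalues, threshold=0.95)
print(f"components needed for 95% of the variance: {k}")
```

With these made-up values the first two components explain 90% and the first three 96%, so three components are retained under the 95% rule.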
As the next step, we construct the transformation from the eigenvectors of the variance-covariance matrix of the variables. The transformation equations are as follows:
Using pc1, pc2 and the remaining variables X3, X4 and X5, we construct the following regression model, which we call Model 2. We drop pc3 because of its tiny contribution to the explanatory power. The regression results are as follows (Figure 15):
As shown in regression Model 2, R-squared has dropped a little compared with the original regression model, while the t-statistics have changed markedly: for x3 the t-statistic drops from 1.919 to 1.324, for x4 it grows in absolute value from -1.652 to -4.601, and for x5 from -0.725 to -0.823. Of the principal components, pc1 is statistically significant while pc2 is not. How should this result be interpreted? Our conclusion is that principal component analysis has solved part of the multicollinearity problem, but the explanatory variables in Model 2 may still be correlated with one another. To test this guess, we again compute the Bravais-Pearson correlations; see Figure 16:
From the correlations on the left, we find that x4 and pc1 are highly correlated, and so are x3 and pc2. We therefore still need a remedy to reduce the multicollinearity. As a further step, we use the forward selection process.
Forward Selection Model
Forward selection starts from our “good” variable and then adds new variables to the basic model step by step until we arrive at a better-fitting final model. The criteria we apply at each step are:
- If the newly added variable increases the value of R-squared and the t-statistics, we let it in.
- If the newly added variable increases neither R-squared nor the t-statistics, we set it aside.
- If the newly added variable increases R-squared but strongly affects the other variables, and its own t-statistic is not significant, we take it as a source of multicollinearity.
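The general idea of forward selection can be sketched in code. The version below is deliberately simplified: it adds, at each step, the variable with the largest gain in R-squared and stops when the gain falls below a threshold, whereas the criteria listed above also look at t-statistics. The threshold `min_gain`, the helper names and the data are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R-squared of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)  # normal equations (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_selection(candidates, y, min_gain=0.01):
    """Greedily add the variable with the largest R-squared gain until gains are small."""
    selected, best_r2 = [], 0.0
    remaining = dict(candidates)
    while remaining:
        trials = {name: r_squared([candidates[s] for s in selected] + [col], y)
                  for name, col in remaining.items()}
        name = max(trials, key=trials.get)
        if trials[name] - best_r2 < min_gain:
            break
        selected.append(name)
        best_r2 = trials[name]
        del remaining[name]
    return selected, best_r2


# Hypothetical data: y depends on x1 and x2 only; x3 carries no extra information.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
x3 = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

selected, best_r2 = forward_selection({"x1": x1, "x2": x2, "x3": x3}, y)
print(selected, round(best_r2, 4))
```

In this toy example x3 is correlated with x2 but adds nothing once x2 is in the model, which is the same pattern the criteria above are designed to detect.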
In this paper we follow the steps shown below:
1. We start with x4, the weight of a person. The choice is based on 2 reasons:
- according to common knowledge, weight plays an important role in determining the percentage of body fat;
- the t-statistic corroborates this common knowledge with a significant value of -4.601. The result of regression model 3 is shown in Figure 17:
2. We then add pc1. From the result we can see that R-squared increases from 0.3738 in regression model 3 to 0.67682 and the adjusted R-squared jumps from 0.37126 to 0.67482, while the t-statistics of both explanatory variables are significant. We therefore conclude that pc1 contributes a lot to the regression model, and we keep it.
3. We add x3 (age) to form regression model 5. The results are shown in Figure 19:
From the above figure we can see that age increases R-squared and adjusted R-squared a little while decreasing the t-statistics of X4 and PC1; meanwhile its own t-statistic is significant at the critical value of 1.96.
4. We try adding X5 (height) to construct regression model 6; see Figure 20. The figure shows a similar story: height increases R-squared and adjusted R-squared while decreasing the t-statistics of weight and PC1, but its own t-statistic is insignificant. Height therefore appears to introduce multicollinearity.
Final Regression Model and Interpretation
After the above analysis, we get the final regression model as follows:
se=(6.2731) (0.0371) (0.0674) (0.0261)
t=(-15.792) (-7.195) (12.899) (1.979)
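Because each t-statistic is the coefficient estimate divided by its standard error, the estimates implied by the two rows above can be recovered by multiplication. The following is a minimal sketch that keeps the order in which the values are printed and makes no assumption about which variable each position refers to; the results are approximate because the inputs are rounded.

```python
standard_errors = [6.2731, 0.0371, 0.0674, 0.0261]
t_statistics = [-15.792, -7.195, 12.899, 1.979]

# t = estimate / se, so estimate = t * se (approximate, inputs are rounded).
estimates = [t * se for t, se in zip(t_statistics, standard_errors)]

# Two-sided significance at the 5% level with the normal critical value 1.96.
significant = [abs(t) > 1.96 for t in t_statistics]

for est, t, sig in zip(estimates, t_statistics, significant):
    print(f"estimate = {est:9.4f}, t = {t:8.3f}, significant = {sig}")
```

All four terms exceed the critical value of 1.96 in absolute value, the last one only narrowly.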
In general we consider this result consistent with daily observation and common sense, and the underlying interpretation is quite interesting:
- The negative intercept of the regression model reflects a simple fact: apart from body fat, the body consists mainly of bones, muscle and other tissues, which are much more important in forming the body. Body fat therefore appears, in theory, only after these necessary requirements are satisfied.
- An increase in age will in general lead to a decrease in the percentage of body fat.
- An increase in weight will in general increase the percentage of body fat.
- An increase in the pc constructed from the circumferences of the neck, abdomen, hip, thigh, knee, ankle, biceps, forearm and wrist will also in general lead to an increase in the percentage of body fat.
Questions Opened
- Should there be non-linear relationships?
- How to solve the multicollinearity between circumferences and other variables?
- Could the conclusion also be used for women?
Reference
- Härdle, W. and Simar, L. (2002). _Applied Multivariate Statistical Analysis_, Springer.
- Johnson, Roger W. (1996). "Fitting Percentage of Body Fat to Simple Body Measurements", _Journal of Statistics Education_, v.4, n.1, Carleton College.
- Bailey, Covert (1994). _Smart Exercise: Burning Fat, Getting Fit_, Houghton-Mifflin Co., Boston.
- Behnke, A.R. and Wilmore, J.H. (1974). _Evaluation and Regulation of Body Build and Composition_, Prentice-Hall, Englewood Cliffs, N.J.
- Siri, W.E. (1956). "Gross composition of the body", in _Advances in Biological and Medical Physics_, vol. IV, edited by J.H. Lawrence and C.A. Tobias, Academic Press, Inc., New York.
- Katch, Frank and McArdle, William (1977). _Nutrition, Weight Control, and Exercise_, Houghton Mifflin Co., Boston.
- Wilmore, Jack (1976). _Athletic Training and Physical Fitness: Physiological Principles of the Conditioning Process_, Allyn and Bacon, Inc., Boston.
Comments
- Where does the dataset come from? Not citing the sources is plagiarism, because you tell the reader that you have made the survey and not A.G. Fisher.
- What is the source of "Lifestyle.jpg"?
- Starplot: What are the strange observations? I can only see one.
- No programs :(
- Figure 1-4: What are the colors for?
- Test on normality for body fat?
- Figure 6-9: The title "Dotplot" is wrong
- What is R?
- Cluster analysis: I would choose 3 clusters
- It is named "Principal Component Analysis"
- Why first using three components and then reducing to two components?