Answer:
A) 8
B) Significantly (increased)
C) Not affected
D) Median
E) Significantly (increased)
F) Not affected
G) Interquartile range
Step-by-step explanation:
A)
An "outlier" is a value in a dataset that lies significantly far from all the other values.
In this problem, we see that we have the following set of data:
0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,4,5,8
We see that all the values between 0 and 5 are close to each other: each one is only 1 away from its nearest neighboring value.
On the other hand, the last value "8" is 3 away from the next highest value ("5"), so the outlier is 8.
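The gap argument above can be checked with a short sketch. It only compares neighboring distinct values, which is the informal criterion used here and not a formal outlier test; the variable names are illustrative.

```python
# Dataset from the problem, built by value counts (25 values in total).
data = [0] * 7 + [1] * 4 + [2] * 7 + [3] * 4 + [4, 5, 8]

# Distinct values in increasing order.
distinct = sorted(set(data))

# Gap between each distinct value and the next one up: (lower, upper, gap).
gaps = [(a, b, b - a) for a, b in zip(distinct, distinct[1:])]

# The largest gap separates the outlier from the rest of the data.
largest_gap = max(gaps, key=lambda g: g[2])
print(gaps)
print(largest_gap)
```

Every gap is 1 except the one between 5 and 8, which is 3.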
B)
The mean of a dataset is its "average" value. It is given by:
$\bar{x} = \frac{\sum x_i}{N}$
where
$\sum x_i$ is the sum of all the values in the dataset
N is the number of values in the dataset (in this case, 25)
An outlier usually has a large effect on the mean, because it changes the value of the sum considerably.
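In fact, in this problem, we have
$\sum x_i = 47$
N = 25
So the mean is
$\bar{x} = \frac{47}{25} = 1.88$
Instead, without the outlier "8",
$\sum x_i = 39$
N = 24
And the mean would be
$\bar{x} = \frac{39}{24} = 1.625$
So we see that the outlier increases the mean significantly.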
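As a quick check of the arithmetic, here is a minimal sketch that computes the mean with and without the outlier. The helper name `mean` is just an illustrative choice.

```python
# Dataset from the problem, built by value counts (25 values in total).
data = [0] * 7 + [1] * 4 + [2] * 7 + [3] * 4 + [4, 5, 8]

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

# Same dataset with the outlier 8 removed (24 values).
without_outlier = [x for x in data if x != 8]

mean_with = mean(data)
mean_without = mean(without_outlier)

print(f"with outlier:    {mean_with:.3f}")
print(f"without outlier: {mean_without:.3f}")
```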
C)
The median of a dataset is its "central value": 50% of the values of the dataset are above the median and 50% are below it.
In this problem, we have 25 values in the dataset, so the median is the 13th value of the ordered dataset (because if we take the 13th value, there are 12 values below it and 12 values above it). The 13th value here is a "2", so the median is 2.
We also see that the outlier does not affect the median.
In fact, if we had a "5" instead of an "8" (so, not an outlier), the median of this dataset would still be 2.
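The same check can be sketched with Python's standard `statistics` module. The variable names are illustrative; the case with the 8 removed is included only as an extra comparison.

```python
import statistics

# Ordered dataset from the problem (25 values in total).
data = [0] * 7 + [1] * 4 + [2] * 7 + [3] * 4 + [4, 5, 8]

# Median with the outlier: the 13th of the 25 ordered values.
median_with = statistics.median(data)

# Replace the outlier 8 with a 5, as in the argument above.
replaced = data[:-1] + [5]
median_replaced = statistics.median(replaced)

# Drop the outlier entirely: 24 values, so the median is the
# average of the 12th and 13th ordered values.
median_removed = statistics.median(data[:-1])

print(median_with, median_replaced, median_removed)
```

In all three cases the median comes out as 2.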
D)
The center of a distribution can be described by one of two quantities, depending on its shape:
- If the distribution is symmetrical and has no tails/outliers, then the mean is a good value to describe the center of the distribution.
- On the other hand, if the distribution is not very symmetrical and has tails/outliers, the median is a better estimator of the center of the distribution, because the mean is strongly affected by the presence of outliers, while the median is not.
So in this case, since the distribution is not symmetrical and has an outlier, it is better to use the median to describe the center of the distribution.
E)
The standard deviation of a distribution is a quantity used to describe the spread of the distribution. It is calculated as:
$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$
where
$\sum (x_i - \bar{x})^2$ is the sum of the squared residuals
N is the number of values in the dataset (in this case, 25)
$\bar{x}$ is the mean value of the dataset
The presence of an outlier in the distribution significantly affects the standard deviation, because it significantly changes the value of the sum $\sum (x_i - \bar{x})^2$.
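In fact, in this problem,
$\sum (x_i - \bar{x})^2 = 84.64$
So the standard deviation is
$\sigma = \sqrt{\frac{84.64}{25}} = 1.84$
Instead, if we remove the outlier from the dataset (so that N = 24 and $\bar{x} = 1.625$),
$\sum (x_i - \bar{x})^2 \approx 45.63$
So the standard deviation is
$\sigma = \sqrt{\frac{45.63}{24}} \approx 1.38$
So, the standard deviation is much higher when the outlier is included.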
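Here is a minimal sketch of the calculation, assuming the standard deviation is taken with N in the denominator (the population form). Using N - 1 instead would give slightly larger numbers but the same conclusion. The function name `population_std` is illustrative.

```python
import math

# Dataset from the problem, built by value counts (25 values in total).
data = [0] * 7 + [1] * 4 + [2] * 7 + [3] * 4 + [4, 5, 8]

def population_std(values):
    """Square root of the mean squared residual (divides by N, not N - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_residuals = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_residuals / n)

std_with = population_std(data)
std_without = population_std([x for x in data if x != 8])

print(f"with outlier:    {std_with:.2f}")
print(f"without outlier: {std_without:.2f}")
```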
F)
The interquartile range (IQ) of a dataset is the difference between the 3rd quartile and the 1st quartile:
$IQ = Q_3 - Q_1$
Where:
$Q_3$ = 3rd quartile is the value of the dataset for which 25% of the values are above $Q_3$, and 75% of the values are below it
$Q_1$ = 1st quartile is the value of the dataset for which 75% of the values are above $Q_1$, and 25% of the values are below it
In this dataset, the 1st and 3rd quartiles are the 7th and the 18th values, respectively (since we have a total of 25 values), so they are:
$Q_1 = 0$
$Q_3 = 2$
So the interquartile range is
$IQ = Q_3 - Q_1 = 2 - 0 = 2$
If the outlier is removed, $Q_1$ and $Q_3$ do not change, therefore the IQ range does not change.
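The sketch below follows the positional rule used above: it simply reads off the 7th and 18th values of the ordered data, and keeps those same positions after the outlier is removed. That choice is an assumption made to mirror this explanation. Other quartile conventions (for example, ones that interpolate between neighboring values) can give slightly different numbers. The function name `iq_by_position` is illustrative.

```python
# Dataset from the problem, built by value counts (25 values in total).
data = [0] * 7 + [1] * 4 + [2] * 7 + [3] * 4 + [4, 5, 8]

def iq_by_position(values, q1_pos=7, q3_pos=18):
    """IQ range from fixed 1-based positions in the ordered data.

    The positions 7 and 18 are the ones used in the explanation above;
    they are an assumption of this sketch, not a general quartile rule.
    """
    ordered = sorted(values)
    q1 = ordered[q1_pos - 1]
    q3 = ordered[q3_pos - 1]
    return q3 - q1

iq_with = iq_by_position(data)
# Removing the largest value does not shift the 7th or 18th ordered value.
iq_without = iq_by_position([x for x in data if x != 8])

print(iq_with, iq_without)
```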
G)
The spread of a distribution can be described by using two different quantities:
- If the distribution is quite symmetrical and has no tails/outliers, then the standard deviation is a good value to describe the spread of the distribution.
- On the other hand, if the distribution is not very symmetrical and has tails/outliers, the interquartile range is a better estimator of the spread of the distribution, because the standard deviation is strongly affected by the presence of outliers, while the IQ range is not.
So in this case, since the distribution is not symmetrical and has an outlier, it is better to use the IQ range to describe the spread of the distribution.