Red wine data-set consisting of 1599 variants of the Portuguese “Vinho Verde” wine. It includes physicochemical (input) and sensory (output) variables. This data-set is publicly available for research. The details are described in [Cortez et al., 2009].
Overview: There are 13 variables within the original data-set. Of these, one is a dependent variable giving a subjective measure of quality based on experts' sensory review of the wine. The main 11 variables are independent physicochemical tests. These may be inter-related but are initially treated as individual measurements.
Input variables (based on physicochemical tests):
Output variable (based on sensory data):
13 - upper_vs_lower (quality grouped into lower and upper)
**Descriptive Statistics of Variables**
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality upper_vs_lower
## Min. :3.000 lower:744
## 1st Qu.:5.000 upper:855
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The summary table depicts descriptive statistics for all of the variables.
Quality is scored on a scale of 0 to 10. However, the data-set is limited to values between 3 and 8.
One categorical variable is generated to group quality into two categories: lower (<=5) and upper (>5). This will be used to explore any major differences in the values of the other variables.
pH is on a scale of 0 to 14 (potential of hydrogen). All values here are acidic (<7), ranging from 2.74 to 4.01.
Alcohol is measured in percentages.
The other variables are all numeric data types.
There are no missing values within the data-set.
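The grouping and the missing-value check described above can be sketched in a few lines of base R. This is a minimal illustration on a small made-up `quality` vector rather than the 1599 real scores; the column names follow the summary table, and the use of `cut()` is an assumption about how the grouping could be done, not a record of how it was done here.

```r
# Hypothetical stand-in for the quality column of the data-set.
wine <- data.frame(quality = c(3, 4, 5, 5, 6, 6, 7, 8))

# Lower is quality <= 5 and upper is quality > 5, as described in the text.
# cut() uses right-closed intervals by default, so 5 falls into "lower".
wine$upper_vs_lower <- cut(
  wine$quality,
  breaks = c(-Inf, 5, Inf),
  labels = c("lower", "upper")
)

# Number of wines per group and number of missing values per column.
group_counts <- table(wine$upper_vs_lower)
missing_per_column <- colSums(is.na(wine))
print(group_counts)
print(missing_per_column)
```

On the real data the same two calls would be expected to reproduce the 744 / 855 split shown in the summary table.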
Quality lies between 3 and 8 despite the scoring range being 0-10. The majority of values are 5 and 6. There is a skew towards the right tail, giving a noticeable number of scores of 7. Given the poor sampling of scores outside of 5 and 6, the quality measure will be split into two categories, upper (>5) and lower (<=5).
Alcohol also has a skewed distribution of values but with most values located between 9% and 13%. A small histogram can give a quick impression of the distribution of data more effectively than reading the descriptive statistics.
To investigate further, nine variables are placed into histograms to understand their distributions. Of these, pH, fixed acidity and volatile acidity have approximately normal distributions; these could be investigated further using quantile-quantile plots.
Residual sugar and chlorides both appear to have highly skewed, over-dispersed data; these could be explored using a transformation to investigate the distribution of values further.
Citric acid has two spikes, one occurring at 0 and one around 0.5. Bin size could be changed to investigate whether there is an artifact in the data related to rounding.
Density appears relatively normal but with occasional spikes; these should be checked for rounding issues.
Free sulfur dioxide and total sulfur dioxide have skewed distributions, with a sparse population of values towards the maximum.
Sulphates also has over dispersed values to the right with a skewed distribution. The rug plot helps highlight the low occurrence of these values in two clusters, close to 2 and close to 1.6.
Using this plot, pH can be examined in greater detail and compared to an idealized normal distribution. pH appears to have light tails, as displayed by the overall sigmoidal shape of the points. It displays very fine-grained clustering, which is related to rounding of values, and it shows an overall skew to the right.
Towards the upper right of the quantile quantile plot against a standard normal distribution the line crosses the 95% confidence intervals.
Density also shows light tails and skew to the right. The fit for a normal distribution is outside of the 95% confidence intervals.
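The comparison against a normal distribution can be sketched with base R's `qqnorm()`. The values below are simulated stand-ins for pH (rounded to two decimals to mimic the clustering noted above), and the 95% confidence bands used in the plots are not reproduced; only the points and the quartile-based reference line are computed.

```r
set.seed(1)
# Hypothetical stand-in for a measured variable recorded to two decimals.
ph <- round(rnorm(500, mean = 3.31, sd = 0.15), 2)

# Sample values paired with the matching standard normal quantiles.
qq <- qqnorm(ph, plot.it = FALSE)

# Reference line through the first and third quartiles, as qqline() draws it.
probs <- c(0.25, 0.75)
slope <- unname(diff(quantile(ph, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(ph, probs[1])) - slope * qnorm(probs[1])

# Deviation of each point from the reference line. Large deviations in the
# tails indicate a departure from a normal distribution.
deviation <- qq$y - (intercept + slope * qq$x)
print(summary(deviation))
```

Because the simulated values really are normal, the deviations here stay small; on the real pH and density columns the tails would be expected to drift away from the line as described.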
To improve visualization of over-dispersed variables, data transformations can be applied. The above three plots show residual sugar on its original scale, followed by square root and log10 transformations. The data transformations give a better representation of the distribution of values. They still have long tails towards the right in this case.
Other over-dispersed variables are explored in the above plot and each is matched to a data transformation that better represents the distribution of data values. Chlorides shows an approximately normal distribution with a long tail to the right; it is centered nicely after a log10 transformation.
Citric acid still has a large count of values close to 0. The data distribution is best represented using a square root transformation; this does not bring the data closer to a normal distribution but represents the distribution of values across the range.
Total sulfur dioxide is transformed using log10 given a wide normal styled distribution.
Free sulfur dioxide puts too many values to the right hand side during a log10 transformation so even with a skew, square root gives a better distribution of values.
Sulphates gives a better distribution of values but is still skewed.
Each of these transformations can be considered when comparing these variables to other variables in further stages of the analysis.
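One way to compare candidate transformations numerically is to look at the sample skewness before and after each one. The sketch below is an illustration only: the `skewness()` helper is defined here (it is not from a package), and the data are simulated log-normal values standing in for a right-skewed variable such as residual sugar.

```r
# Sample skewness (third standardized moment).
# Positive values indicate a long tail to the right.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(42)
# Hypothetical right-skewed stand-in for a variable such as residual sugar.
residual_sugar <- rlnorm(1000, meanlog = log(2.2), sdlog = 0.4)

skew_by_transform <- c(
  raw   = skewness(residual_sugar),
  sqrt  = skewness(sqrt(residual_sugar)),
  log10 = skewness(log10(residual_sugar))
)
print(round(skew_by_transform, 2))
```

The simulated data become roughly symmetric under log10 by construction. The real variables keep some right tail after transformation, as noted above, so the numbers are only a guide for choosing between square root and log10.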
The data-set is in a tidy format. Each observation corresponds to a series of variables.
The data-set consists of one key dependent variable, quality. This is based on the subjective sensory assessment of experts. This should result in a value between 0-10; within this data-set the majority of the samples have scores of 5 or 6.
The other eleven variables are all measurements. It is assumed that these are independent, although there may be relationships between some of these eleven variables.
The main feature of interest is the dependent variable quality.
Without further exploratory work it can not be assessed which of the 11 measurement variables is the most important.
At the moment all other features are of interest as this investigation is not applying any domain knowledge about what is most important.
One categorical variable is created based on the dependent quality variable. This splits quality into two categories, upper and lower. The purpose of this is to represent better (upper) and worse (lower) wines. This is split in two due to the limited distribution of values outside of 5 and 6. The idea is to use this new variable to see if there are any relationships in other variables that separate better or worse wines.
There appear to be no normal distributions in this data-set. Two variables are close, but the QQ plots have shown they sit outside of an idealized normal distribution.
Rounding of inputs appears to cause minor clustering within the data-set; this may relate to the precision of the tools used to assess the physicochemical properties of each wine.
Citric acid has a large number of values close to zero; this could be further investigated to see if this is a data quality issue or true signal.
Several variables have over-dispersed distributions with long tails to the right, and using either square root or log10 transformations can help better represent the distribution of these values. Each of these has been investigated to identify which transformation should be applied for future plots.
This section will begin by assessing whether there are any variables with a strong difference between upper and lower wine quality reviews.
This section will also check if any of the measurement variables are related.
Frequency polygons showing the distribution of values, now split by groups of the dependent variable quality, can help show whether there are trends between any of the input variables and the output variable.
In the above figure, alcohol shows that a large count of wines with lower quality scores is associated with lower alcohol values.
Sulphates also displays a difference between the upper and lower categories, in which higher values of sulphates appear to exclude the lower quality scores.
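A frequency polygon is a line drawn through binned counts, so the numbers behind it can be sketched with `cut()` and `table()`. The alcohol values and group sizes below are simulated and purely illustrative; the bin width of 0.5 is also an assumption. Column proportions are included because raw counts depend on group size.

```r
set.seed(11)
# Hypothetical alcohol values and quality group labels.
wine <- data.frame(
  alcohol = c(rnorm(300, 9.9, 0.8), rnorm(350, 10.9, 1.0)),
  upper_vs_lower = factor(rep(c("lower", "upper"), times = c(300, 350)))
)

# Shared bins so that the two groups are directly comparable.
breaks <- seq(floor(min(wine$alcohol)), ceiling(max(wine$alcohol)), by = 0.5)
bins <- cut(wine$alcohol, breaks = breaks, include.lowest = TRUE)

# Rows are bins and columns are groups. Each column is one frequency polygon.
counts <- table(bins, wine$upper_vs_lower)

# Share of each group per bin, which removes the effect of group size.
proportions <- prop.table(counts, margin = 2)
print(round(proportions, 3))
```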
By plotting the remaining variables in a similar way it is possible to quickly identify any variables that appear to have a stronger relationship to the upper or lower quality group. The first observation is that many of the variables show little difference between the frequency polygons for upper and lower. Those that do show some difference do not show a large or obvious one at first inspection.
Volatile acidity shows a distribution centered to the left for upper. The frequency polygon shows lower values for the upper category than for the lower category.
Total sulfur dioxide has none of its highest values in the upper group.
Both free sulfur dioxide and total sulfur dioxide show two peaks for the upper group (this may relate to count as this is not a density plot).
Citric acid has a spike in its higher values for the upper group.
These issues will be worth investigating further.
The pair plot gives all combinations of bi-variate analysis. Its main limitation is that it is not created using the data transformations previously identified. It acts as a useful way to look up pairs of variables to investigate further.
As previously shown by the bi-variate frequency polygon plot, very few of the variables differ noticeably between the upper and lower categories of quality, apart from alcohol. This can be viewed in the box plots and paired histograms. Alcohol has a 0.48 correlation to quality on this plot. Volatile acidity has a correlation of -0.39 and sulphates has 0.25. These are the variables with the highest correlation to the quality variable.
Free sulfur dioxide and total sulfur dioxide have a correlation of 0.67, but the two variables have little correlation to other variables.
Density appears to have some weak correlation with a few variables like residual sugar and fixed acidity. This could be a candidate for multi-variate analysis.
Chlorides and residual sugars should be checked as these are both highly dispersed variables so it is difficult to see any correlation in this pair plot.
An alternative way to view which variables are correlated is with a correlation matrix. This represents similar observations as above but the magnitude of the colour highlights strong correlations.
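The numbers behind a correlation matrix come from `cor()`. The sketch below uses three simulated columns (names follow the summary table, values are made up) and lists each pair once, ranked by absolute Pearson correlation, which is a compact way to find the pairs worth plotting.

```r
set.seed(7)
n <- 200
# Hypothetical stand-in columns. Density is simulated to fall as alcohol rises.
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1.01 - 0.0013 * alcohol + rnorm(n, sd = 0.001)
sulphates <- rnorm(n, mean = 0.66, sd = 0.15)
wine <- data.frame(alcohol, density, sulphates)

# Pearson correlation for every pair of numeric columns.
corr <- cor(wine, method = "pearson")

# List each pair once (upper triangle), ordered by absolute correlation.
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs_ranked <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r = corr[idx]
)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
print(pairs_ranked, digits = 2)
```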
Total vs. free sulfur dioxide is plotted with both variables on a log10 scale to highlight the correlation (0.67) between the two variables. There are a handful of outliers but this appears to show a relationship.
Density plotted against fixed acidity shows a relationship between the two variables. This has been fitted with a linear model.
Density compared to alcohol shows a negative relationship, it does not fit well using a linear model. There is a cluster of values with low alcohol. It shows a good spread of values.
Chlorides and residual sugar are compared with a log10 transform but there still appears to be limited correlation between these two variables.
The comparison of sulphates to volatile acidity shows a weak negative relationship. It shows a good dispersion of values when both axes are on log10. A linear model does not fit well to this plot.
The fixed compared to volatile acidity shows a negative correlation.
Citric vs fixed acidity shows a positive relationship.
Comparing sulphates to quality values gives an initial appearance of a correlation. In the following series of plots categories 5 and 6 are always key as they hold a much higher proportion of values. Categories 5 and 6 also have a wider distribution of values than the other categories. Despite the apparent trend between high quality and low quality values, this may not be signal given the low number of samples in the highest and lowest scores. Still, there is a clear separation between the sulphate values of category 3 and those of category 8.
Linear models are not used in these plots as they will always be swayed by the high number of values within quality values 5 and 6.
Volatile acidity is another variable with a correlation to the dependent variable, quality. The plot shows each category of quality on the y axis with a scatter of points showing the distribution of volatile acidity. There is a weak correlation, but what is more interesting is which values do not occur in certain classes. For example, there are no values above 0.9 for the top scored wines. For both the high and low quality scores the challenge is the low number of samples.
Alcohol compared to quality shows a positive correlation. The values are predominantly within scores 5 and 6. The lowest scoring wines often have lower alcohol values, while the highest scores tend towards higher values. A key observation is category 6, which spans a wide range of alcohol values. This calls into question the strength of the claim that higher alcohol means higher quality wine.
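The sample-size concern can be made concrete by tabulating, for each quality score, how many wines it contains and the range of alcohol values it covers. The rows below are invented for illustration and are deliberately dominated by scores 5 and 6; the column names follow the summary table.

```r
# Hypothetical scores and alcohol values, dominated by 5 and 6.
wine <- data.frame(
  quality = c(3, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 8),
  alcohol = c(9.0, 9.8, 9.4, 9.5, 10.0, 11.2, 9.3, 10.5, 11.0, 12.8, 11.5, 12.1)
)

# Count, median and range of alcohol within each quality score.
per_score <- aggregate(
  alcohol ~ quality,
  data = wine,
  FUN = function(x) c(n = length(x), median = median(x),
                      min = min(x), max = max(x))
)

# aggregate() stores the four statistics in a matrix column.
# do.call(data.frame, ...) flattens it into ordinary columns.
per_score <- do.call(data.frame, per_score)
print(per_score)
```

A wide min-max range combined with a large `n` for scores 5 and 6, against one or two wines at the extremes, is the pattern that makes the apparent trend hard to trust.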
Comparing two variables using a quantile quantile plot (Empirical QQ plot) can give information about how different the variables are. In this case alcohol is split between upper quality values (>5) and lower quality values (<=5).
The plot and confidence intervals suggest that the difference between the two groups of alcohol samples is not significant, as the confidence bands and values intersect the 1-1 line.
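The points of an empirical QQ plot can be obtained from base R's `qqplot()`, which pairs the sorted values of two samples. The two alcohol vectors below are simulated stand-ins for the lower and upper groups, and the confidence bands discussed above are not reproduced in this sketch.

```r
set.seed(3)
# Hypothetical alcohol values for the two quality groups, rounded to one
# decimal to mimic the stepped appearance caused by measurement rounding.
alcohol_lower <- round(rnorm(300, mean = 9.9, sd = 0.8), 1)
alcohol_upper <- round(rnorm(350, mean = 10.9, sd = 1.0), 1)

# qqplot() pairs quantiles of the two samples. With unequal sample sizes
# the larger sample is interpolated down to the size of the smaller one.
eqq <- qqplot(alcohol_lower, alcohol_upper, plot.it = FALSE)

# Points on the 1-1 line would mean the two groups share a distribution.
# A positive difference means the upper group has the higher quantile.
quantile_diff <- eqq$y - eqq$x
print(summary(quantile_diff))
```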
Overall there are few strong correlations between variables in this data-set. The most promising line of investigation seems to be alcohol and its relationship to quality.
Two other variables (Volatile acidity and sulphates) have a weak relationship to the feature of interest.
All have been investigated, the challenge relates to the low proportion of samples in the highest and lowest values compared to the quality scores of 5 and 6. Often these values show a wider range of values in each independent variable putting to question if the relationships between higher and lower quality scores are signal or noise.
Few strong relationships exist. Density appears to have associations with multiple variables and can be investigated further through multivariate analysis.
A number of plots have outliers. It would be interesting to know if there are clusters within the data-set or any other types of structure that cannot be observed through bi-variate analysis.
Total and free sulfur dioxide have the strongest relationship found, this would make sense if free sulfur dioxide has a proportional relationship to the total amount of sulfur dioxide in a sample.
Density to fixed acidity has the same correlation value in the pair plot as total and free sulfur dioxide.
By using the ratio of pairs of correlated independent variables more dimensions of the data-set can be viewed at the same time. These paired relationships are based on the bi-variate analysis.
Based on bi-variate exploration of variables a number of candidates exist to explore as ratios. Ratios allow for combining two variables with a relationship. This can allow for more multivariate exploration, each plot will combine two pairs of variables to try and see if the quality of wine stands out in some sort of pattern or relationship.
The upper left plot uses the relationship between total and free sulfur dioxide and the relationship between density and fixed acidity. Overall there is only a weak suggestion that this helps in separating quality values. There are some weak clusters (e.g. value 8) between bottom left and top right. This likely reflects that sulfur dioxide does not have a considerable impact on the quality value.
The top right plot uses sulphates and volatile acidity transformed to log10 on the y axis with density and alcohol on the x axis. Both pairs of variables have some correlation and each of these variables has been previously noted as being correlated to quality. This plot does show a relationship to quality values with higher quality values being associated to a low density/alcohol ratio and a positive sulphates/volatile acidity ratio. This is a noisy relationship with a lot of overlap, especially from values 5 and 6 which are the most populated.
The lower left plot is similar to the last plot described but swaps density/alcohol ratio for density/fixed acidity. This shows a similar relationship as the previous plot but not as strongly along the x axis.
The lower right plot uses a ratio of sulphates and chlorides with a log10 transformation on the y axis, with a ratio of pH and alcohol on the x axis. This also shows a relationship, where a high ratio of sulphates to chlorides along with a lower ratio of pH to alcohol suggests higher quality wines.
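The ratio features used in these plots are simple column arithmetic. The sketch below builds the density/alcohol ratio and the log10 sulphates/volatile acidity ratio on three invented rows; the new column names `density_alcohol` and `log_sulph_va` are hypothetical labels chosen for this example.

```r
# Hypothetical rows. Column names follow the summary table.
wine <- data.frame(
  density          = c(0.9978, 0.9968, 0.9950),
  alcohol          = c(9.4, 10.2, 12.5),
  sulphates        = c(0.56, 0.62, 0.80),
  volatile.acidity = c(0.70, 0.52, 0.30),
  quality          = c(5, 6, 7)
)

# Ratio used on the x axis.
wine$density_alcohol <- wine$density / wine$alcohol

# Ratio used on the y axis, placed on a log10 scale.
wine$log_sulph_va <- log10(wine$sulphates / wine$volatile.acidity)

print(wine[, c("quality", "density_alcohol", "log_sulph_va")])
```

A log10 ratio above 0 means sulphates exceed volatile acidity and a value below 0 means the opposite, which makes the y axis easy to read.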
Based on multivariate exploration two pairs of ratios are selected and plotted with upper and lower quality categories as colour and quality as the size in this bubble plot. The size of the shape is subtle due to the dominance of values 5 and 6. It helps highlight the lowest scoring values.
This plot shows there is a relationship within the data that can begin to separate some of the lower and upper quality wines. It also shows there is significant overlap.
The last plot uses a chart separated by each factor of quality. The axes are based on the same pair of ratios as previously used to highlight differences between upper and lower quality wines against alcohol.
The main purpose of this plot is to again show the challenge posed by the number of samples outside of values 5 and 6. Having a more even sample of values at the higher and lower scores would increase the confidence that the relationships previously described are reasonable and can be pursued further.
To assess the dependent variable quality, the most productive investigation was using ratios of correlated variables.
The chosen pair of density/alcohol and sulphates/volatile acidity (using a log10 transform) proved most effective at separating upper and lower quality wines. Neither of these pairs has the strongest correlation, but both pairs showed a good distribution of values, which may be why they work well as ratios to describe variations in the data-set.
Although four variables were selected to highlight the relationship, other combinations of variables each gave different insights into the data-set.
It was surprising to see the relationship using upper and lower quality, considering that bi-variate and uni-variate analysis had so far only shown weak relationships.
Principal component analysis could be used to expand this investigation.
No models have been created due to ambiguity between the dependent and independent variables. Machine learning could create a model based on the observations.
Getting a better sample across quality values should be prioritized before trying sophisticated modelling in this case.
The dependent variable within this data-set is quality, measured between 0-10. The majority of samples have scores of 5 or 6. The extreme scores, 3 and 8, are very poorly sampled, and 0 to 2, 9 and 10 do not occur in the data-set. This makes it challenging to identify robust relationships between quality and the independent variables. This plot highlights that there is no information outside of the range 3-8.
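The gaps in the scoring range can be listed directly by tabulating quality against all eleven possible scores. The vector below is a small invented sample; declaring the factor levels as `0:10` is what keeps the unobserved scores in the table.

```r
# Hypothetical scores. The real column has 1599 values between 3 and 8.
quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8)

# Declaring all eleven possible scores keeps the unobserved ones in the table.
score_counts <- table(factor(quality, levels = 0:10))
print(score_counts)

# Scores on the 0-10 scale that never occur in the sample.
unobserved <- names(score_counts)[as.vector(score_counts) == 0]
print(unobserved)
```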
Alcohol is one of the variables which have an apparent relationship with quality. At first look there appears to be a trend suggesting lower alcohol, lower quality and vice versa.
Comparing two variables using a quantile quantile plot (Empirical QQ plot) can give information about if the difference is significant. In this case alcohol is split between upper quality values (>5) and lower quality values (<=5). This is chosen as alcohol has the strongest correlation to quality (0.48) and the observation made when comparing frequency polygons between these two groups.
The steps in the plot will relate to rounding of measurement. The majority of the values between 10% and 12.5% are separated with higher values in the upper quality group. The 95% confidence bands highlight that this is not a significant observation as the two tails intersect the 1-1 line.
This plot is chosen to highlight that the independent variable with the strongest correlation does not show a significant relationship between upper and lower quality groups.
Based on multivariate exploration of many combinations of pairs of ratios, two are selected to highlight trends in the data. These are plotted with upper and lower quality categories as colour and quality as the size in this bubble plot. The size of the shape is subtle due to the dominance of values 5 and 6. It helps highlight the lowest scoring values.
The two ratios used are density/alcohol along the x axis and sulphates/volatile acidity with a log10 transform along the y axis. There are weak relationships within each of these pairs, and all four variables have been seen to have a weak to moderate relationship with the dependent quality variable.
This plot shows there is a relationship within the data that can begin to separate some of the lower and upper quality wines. Machine learning approaches could fit a model to separate these groups of values. It also shows there is significant overlap, which will hamper the performance of any model created. Expectations of a high quality predictive model from this data-set should not be too high.
Systematically working through all of the variables in a uni-variate and bi-variate analysis allowed the multivariate analysis to be steered in a way that has shown a relationship to the dependent variable.
This was much more effective than just playing with advanced multivariate graphs to plot many variables which is often a temptation.
The struggle with this data-set was not having any domain knowledge about the topic and then the lack of correlated variables. Perhaps there are ways to better identify a relationship between quality and other variables with more knowledge about the measurement types, however I feel this analysis has explored most of the potential combinations of variables.
The other key issue was the low sample rate of quality outside of 5 and 6. This data-set is a good case where gathering a broader sample of observations should be prioritized before committing time to other methodologies like machine learning.
The main proposal for further work would be to focus on getting a more even representation of data for all labels. There is a similar data-set for white wine which could be compared to see if it shows similar relationships. This is still a limited sample as we are looking at only one wine region.
Principal component analysis could be used to reduce the dimensionality of the data-set and explore whether the components relate to quality.
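As a starting point for that proposal, a principal component analysis can be run with base R's `prcomp()`. This is a sketch only: the four columns are simulated, no PCA was carried out in this analysis, and scaling is switched on because the measurements sit on very different ranges.

```r
set.seed(5)
n <- 200
# Hypothetical stand-in measurements. Names follow the summary table.
alcohol <- rnorm(n, 10.4, 1)
wine <- data.frame(
  alcohol = alcohol,
  density = 1.01 - 0.0013 * alcohol + rnorm(n, sd = 0.001),
  sulphates = rnorm(n, 0.66, 0.15),
  pH = rnorm(n, 3.31, 0.15)
)

# Centre and scale so that no variable dominates because of its units.
pca <- prcomp(wine, center = TRUE, scale. = TRUE)

# Proportion of variance carried by each component.
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(variance_explained, 3))

# Scores on the first two components, ready to be coloured by quality group.
scores <- pca$x[, 1:2]
print(head(scores))
```

Plotting `scores` coloured by the upper and lower groups would show whether the leading components separate quality any better than the hand-picked ratios.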
The quality measure itself could also be verified, as an expert's opinion on wine may be subjective. Perhaps collecting each individual reviewer's score for each sample would give further insight.
After exploring further, an experiment would be to form a hypothesis about what controls wine quality and then test this on new, larger data-sets. If this includes attributes that can be controlled during production, it may help wine producers make better quality wine.
For example, is there a typical range of alcohol values that is preferred by wine experts and customers?
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236. Available at: [@Elsevier](http://dx.doi.org/10.1016/j.dss.2009.05.016)