1 Introduction

High-gloss surfaces are gaining importance in the furniture industry; kitchen fronts are a clear example. The product ranges of panel and furniture manufacturers make this trend obvious: high-gloss surfaces are heavily advertised, and the trend seems to be continuing. Because of the increasing demand for high-gloss surfaces, it is important to be able to make reliable statements about their gloss properties. There are different methods to determine the glossiness of surfaces, but one of the greatest challenges is to achieve satisfactory agreement between instrumental gloss measurements and human visual perception, which has not yet been achieved sufficiently. Because of the demand for a method that better describes human perception, a new gloss measurement parameter, the so-called gloss impression, was developed. The term gloss impression was chosen to indicate that the entire visual appearance of surface gloss is addressed.

To better understand the measurement and the idea of gloss, consider the definition given in the 4th edition of the International Lighting Vocabulary of the CIE (CIE 1987):

“Gloss (of a surface)—the mode of appearance by which reflected highlights of objects are perceived as superimposed on the surface due to the directionally selective properties of that surface”.

This definition illustrates the perceptual nature of gloss: any measurement of gloss has to consider the response of the visual system.

Today’s instrumental gloss evaluation is based on the findings of Hunter (1937), who identified six different visual criteria to evaluate gloss, which are described hereafter.

  • Specular gloss: describing the perceived shininess by specular reflection.

  • Sheen: shininess at grazing angles of otherwise matte or low-gloss surfaces.

  • Contrast gloss: related to the contrast between specular highlights and adjacent surface areas.

  • Absence-of-bloom gloss: perceived haze in reflection adjacent to reflected highlights.

  • Distinctness-of-image (DOI) gloss: relates to the sharpness and distinctness of the observed image after reflection from the surface.

  • Absence-of-surface-texture gloss: describes the perceived surface smoothness and absence of texture or orange-peel.

One of the most common methods to assess gloss is to measure what Hunter described as specular gloss with a glossmeter; the result is expressed as the gloss value. However, this method is highly limited in describing what humans can perceive; new approaches are therefore needed to achieve a better correspondence with human perception (Leloup et al. 2014). The limitations of existing methods have been noted in a variety of studies on gloss perception, which all arrived at the same conclusion: there is no linear relationship between human visual perception and instrumentally measured data (Billmeyer and O’Donnell 1987; Gruber et al. 2012; Harrison 1949; Harrison and Poulter 1951; Leloup et al. 2011; Obein et al. 2004). Leloup et al. (2014) in particular provide a good overview of the history of gloss measurement, existing gloss measurement techniques and their limitations. The inconsistency between instrumental and sensory data in gloss measurement is even pointed out in the International Standard for gloss measurement: “However, it is not ensured that the obtained gloss values correspond to the visual gloss perception” (ISO 2014). Therefore, Leloup et al. (2014) highlighted the need for new methods suitable for the soft metrology of surface gloss, in other words, methods of characterizing gloss that correspond satisfactorily to human gloss appraisal.

There are different explanations for this inconsistency, most of them referring to the multidimensionality of gloss: existing methods, such as the gloss value, are based on the measurement of one specific aspect of surface gloss. However, this is not how humans see gloss: according to Billmeyer and O’Donnell (Billmeyer and O’Donnell 1987; O’Donnell and Billmeyer 1986), humans are not able to distinguish between the six gloss dimensions defined by Hunter but can only perceive the overall appearance of gloss.

This limitation has been recognized for a long time. In 1930, Pfund was the first to study the multidimensionality of gloss and noticed that factors other than specular reflection alone must be involved in gloss perception (Pfund 1930). More studies followed: in 1951, Harrison and Poulter carried out an experiment on human perception of specular gloss and found no direct correlation between specular gloss measurements and visual gloss rankings, although a rather good correlation could be achieved by using a correction factor. This result confirms Pfund's finding that other factors must influence human gloss perception and that it is not sufficient to rely on just one aspect of gloss.

In light of these findings, a new gloss parameter called gloss impression (Moser et al. 2016) was developed with the aim of better capturing the human perception of gloss. This new parameter is based on measuring the reflection of a light beam and the spatially resolved image information it produces.

This paper examines how well the new parameter predicts visual discrimination. To investigate whether there is a statistically significant correspondence between human gloss perception and instrumental data, a sensory evaluation study with 113 naïve observers was carried out. Human visual perception was compared on the one hand with the commonly used gloss value and on the other hand with the gloss impression. The evaluators had to rank the glossiness of eight sets of black and white samples. The overall agreement of the ranking test with the gloss value and with the gloss impression was assessed using chi-squared tests at a significance level of α = 0.05.

2 Methodology

2.1 Gloss measurement

For gloss measurement, a Rhopoint Instruments Ltd. Novo-Gloss Trio glossmeter was used. The principle of the glossmeter is based on the measurement of directed reflection: the intensity of the reflected light is measured in a narrow range around the reflection angle. The measurement results of the glossmeter are referenced to a black, polished glass standard with a defined refractive index. For this standard, the measured value is set at 100 gloss units (GU). For materials with a higher refractive index, the measured value can exceed 100 GU. To obtain comparable measurement results, reflectometers are internationally standardized, in particular with respect to the angle of incidence. High-gloss surfaces (>70 GU) are measured with 20° geometry, which was therefore used for all gloss values presented here.
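The scaling to the glass standard can be illustrated with a short sketch. It assumes that the gloss value is the ratio of the detector signal for the sample to the signal for the black glass standard, scaled so that the standard reads 100 GU. The function name and the signal values are hypothetical and are not taken from the instrument's documentation.

```python
def gloss_units(sample_signal: float, standard_signal: float) -> float:
    """Scale a specular reflection signal so that the standard reads 100 GU."""
    if standard_signal <= 0:
        raise ValueError("standard signal must be positive")
    return 100.0 * sample_signal / standard_signal


# Hypothetical detector signals in arbitrary units, all taken at 20° geometry.
standard = 0.050
print(gloss_units(0.050, standard))  # the standard itself
print(gloss_units(0.045, standard))  # a sample reflecting less than the standard
print(gloss_units(0.055, standard))  # a sample reflecting more than the standard
```

The third call illustrates why readings above 100 GU are possible for materials that reflect more strongly than the standard.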

2.2 Measurement of gloss impression

A new setup has been developed (Moser et al. 2016) in which a light beam with a known cross section is projected onto a glossy surface. The reflection of the light beam generates spatially resolved image information. From the deformation of the original light beam by the sample's surface, the new parameter, the so-called gloss impression, can be calculated. The instrumental setup as published by Moser et al. (2016) is shown in Fig. 1.

Fig. 1 Measurement setup (left) with the camera (1), the measuring window (2) and projector (3), and measuring principle (right) (Moser et al. 2016)

The technical implementation uses a pico projector to project the light through a measurement window onto the sample, and the reflection is detected and imaged with a camera. With this setup, an ideally glossy surface (a mirror) should yield a circular image of the light beam. Rough and restless surfaces produce a scattered and distorted reflection of the light beam. The more unstable the surface, the wider the intensity profiles become, or several secondary maxima form. Numerical criteria can then be defined on the basis of the reflection structure image (RSB), which allow an objective characterization of the surface smoothness. One possibility is shown in the following formula:

$$\delta_{RSB}=\frac{1}{\int_{(x,y)\in\Omega} I_{RSB}(x,y)\,d\Omega}\int_{(x,y)\in\Omega} r(x,y)\,I_{RSB}(x,y)\,d\Omega$$

where δRSB is a measure of the surface stability, Ω is the integration area, r(x, y) is the distance of the point (x, y) from the center of the peak and IRSB(x, y) is the local intensity at the point (x, y). By choosing suitable integration areas, different surface properties can be characterized. Normalizing by the total light intensity within Ω makes the measure largely independent of the brightness (Beer–Lambert reflectivity) of the sample. A higher δRSB indicates a less smooth surface. To determine the gloss impression, a circular area with three times the radius of the RSB has proved to be a good choice of integration area. With this new method it is possible to evaluate the gloss impression as a whole. Until now, the gloss impression has not been measured; instead, experienced persons graded the surface from 1 (bad surface) to 5 (good surface). This is the so-called carpenter view. The high correlation between the carpenter view and δRSB (R² = 0.95) demonstrates the agreement of this new objective method with the currently used subjective individual impression (Moser et al. 2016).
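A discrete version of this formula can be sketched for a camera image stored as a grid of pixel intensities. This is only an illustration under stated assumptions, not the implementation of Moser et al. (2016): the peak center is taken to be the brightest pixel, the integration area Ω is a circle of a given pixel radius around it, and the function name and the two small test images are hypothetical.

```python
import math


def delta_rsb(image, radius):
    """Intensity-weighted mean distance from the peak inside a circular area."""
    rows, cols = len(image), len(image[0])
    # Assumption: the center of the peak is the brightest pixel of the image.
    peak_y, peak_x = max(
        ((y, x) for y in range(rows) for x in range(cols)),
        key=lambda pos: image[pos[0]][pos[1]],
    )
    weighted_sum = 0.0   # discrete counterpart of the integral of r * I_RSB
    intensity_sum = 0.0  # discrete counterpart of the integral of I_RSB
    for y in range(rows):
        for x in range(cols):
            r = math.hypot(x - peak_x, y - peak_y)
            if r <= radius:
                weighted_sum += r * image[y][x]
                intensity_sum += image[y][x]
    if intensity_sum == 0:
        raise ValueError("no light inside the integration area")
    return weighted_sum / intensity_sum


# Hypothetical 3x3 images: a sharp reflection and a scattered one.
sharp = [[0, 0, 0], [0, 10, 0], [0, 0, 0]]
scattered = [[1, 2, 1], [2, 10, 2], [1, 2, 1]]
print(delta_rsb(sharp, radius=1.5))
print(delta_rsb(scattered, radius=1.5))
```

The sharp image yields zero because all light sits at the peak, whereas the scattered image yields a positive value, in line with the statement that a higher δRSB indicates a less smooth surface.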

2.3 Samples

Twenty-four melamine-coated particle boards of 2800 mm × 1700 mm were used for the instrumental and sensory evaluation, 12 of them black and 12 white. These boards are designed to be used as furniture fronts. The 24 samples were arranged in groups of three, based on their gloss values and their gloss impression. This grouping produced eight triplets, which differed in colour and in the instrumental parameter considered. The classification is shown in Table 1. Each combination of colour and instrumental parameter was used in two sets.

Table 1 Differentiation within the test sets regarding colour and instrumental data

To obtain comparable values, the differences in the relevant instrumental data within the sets were kept almost constant. In each set, either the gloss value or the gloss impression was used for ranking. In set 1, for instance, the gloss value is of interest, whereas the corresponding values of the gloss impression were nearly identical for the three samples of set 1. The exact values of each test sample are presented in Table 2.

Table 2 Instrumental data of the samples

A gloss value of more than 70 gloss units (GU) characterizes high-gloss surfaces. Low-gloss or matt surfaces are described by gloss values below 10 GU; medium gloss ranges from 10 to 70 GU. In this study, high-gloss (mirror-finish) surfaces were examined.
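The three ranges can be written as a small helper. The thresholds are those stated above; the function name is hypothetical, and treating exactly 10 GU and exactly 70 GU as medium gloss is an assumption based on the wording "from 10 to 70 GU".

```python
def gloss_range(gloss_value_gu: float) -> str:
    """Classify a gloss value (in GU) into the three ranges described in the text."""
    if gloss_value_gu > 70:
        return "high gloss"
    if gloss_value_gu < 10:
        return "low gloss"
    return "medium gloss"


# 90 GU and 100 GU are the lowest gloss values reported for the white and
# black samples of this study; 5 GU and 40 GU are hypothetical examples.
for value in (5, 40, 90, 100):
    print(value, gloss_range(value))
```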

The range of the values for the gloss impression is narrower than for the GU. For smooth surfaces the gloss impression gives a value of around 8, for very rough surfaces a value of 18 is achieved (Regenfelder et al. 2015). Whereas high gloss values imply high gloss, the gloss impression is inverse: a lower gloss impression means a higher gloss of the measured surface.

2.4 Sensory evaluation

Sensory evaluation was performed by 113 potential consumers using ranking tests. The evaluators can be regarded as a homogeneous group with an age range from 22 to 36 years, thereof 44% female and 66% male; all of them were university students. The assessors had to rank the samples according to their glossiness. Ranking tests are a method to determine whether observers can see, taste or feel a difference between three or more specimens. It was decided not to use a rating or scoring test, because the gloss of the samples seemed too similar to achieve reliable results in rating. The range of the data was too narrow for rating: although a difference in gloss was recognizable, it seemed too difficult to give reliable statements about the size of the difference in gloss units or in gloss impression between the specimens. Another reason for the use of ranking tests is that they are easy to conduct. In addition, ranking is not as time-consuming as other methods, panelists do not need special training and the instructions are simple. Furthermore, because the data are ordinal, only minimal assumptions about the level of measurement are required (Lawless and Heymann 2010; Resurreccion 1998). This last argument proved particularly important in this work, as the two compared methods differ completely in their instrumental data and particularly in the ranges of their data.

The evaluators had to assess each set separately. They were instructed to assess the three boards, lying side by side, in relation to each other. They had to determine the glossiness within the respective set by observing the mirror image of the light source above them, as was done by Gruber et al. (2012). The evaluators were instructed to rank the samples according to the distinctness of the reflected image; assigning the same rank to several samples was allowed in order to minimize guessing. Black and white sets were presented alternately to avoid a distortion of the results by learning effects. The samples within one set were presented in random order. The presentation of the sets is shown in Fig. 2.
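The presentation scheme described above (alternating black and white sets, random order within each set) can be sketched as follows. The board labels, the function name, the choice to start with a black set and the use of a seeded random generator are assumptions made for illustration only.

```python
import random


def presentation_plan(black_sets, white_sets, seed=None):
    """Alternate black and white sets and shuffle the boards within each set."""
    rng = random.Random(seed)
    plan = []
    for black, white in zip(black_sets, white_sets):
        for colour, boards in (("black", black), ("white", white)):
            order = list(boards)   # copy so the input sets stay unchanged
            rng.shuffle(order)
            plan.append((colour, order))
    return plan


# Hypothetical labels: four black and four white sets of three boards each.
black_sets = [[f"B{s}{b}" for b in "abc"] for s in range(1, 5)]
white_sets = [[f"W{s}{b}" for b in "abc"] for s in range(1, 5)]
for colour, order in presentation_plan(black_sets, white_sets, seed=1):
    print(colour, order)
```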

Fig. 2 Presentation of the sets

Naïve persons were chosen for the evaluation instead of trained experts. This seemed more appropriate because the objective of the new methodology is to capture customers' perception of high-gloss surfaces. Therefore, the visual perception of potential consumers, not that of coating experts, was decisive. The surroundings were kept at a constant brightness level, and the same light source was used throughout the experiment.

Prior to the sensory experiment, a pre-test with 16 evaluators (8 men and 8 women) was carried out. The objective of the pre-test was to find out whether the test design was appropriate to answer the research question satisfactorily and whether the questionnaire needed to be adapted in some way.

2.5 Statistical analysis

The objective of the study was to investigate how well the new method predicts the visual discrimination between samples of similar (high) gloss appearance. For statistical evaluation, Pearson’s chi-square test was used at a significance level of 0.05. The data obtained from the experiment were coded as follows: if the evaluation of one observer was in accordance with the respective instrumental measurement, the value 1 was given for this evaluation (“match”). If not, the value 0 was given (“no match”). For each board, the given values were summed so that one value was obtained for each of the 24 samples. This count indicates how often the sensory evaluation of one board was in accordance with the instrumental measurement and can also be expressed as a relative frequency.
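The coding scheme can be sketched for a single set of three boards. The sketch assumes that a "match" means the rank an observer gave a board equals the rank implied by the instrumental value, with rank 1 for the glossiest board and equal values sharing a rank. The function names, instrumental values and observer rankings are hypothetical.

```python
def reference_ranks(values, higher_is_glossier):
    """Rank 1 is the glossiest board; equal instrumental values share a rank."""
    ordered = sorted(set(values), reverse=higher_is_glossier)
    return [ordered.index(value) + 1 for value in values]


def matches_per_board(observer_rankings, reference):
    """Sum the 1/0 match codes over all observers, separately for each board."""
    counts = [0] * len(reference)
    for ranking in observer_rankings:
        for board, (given, expected) in enumerate(zip(ranking, reference)):
            counts[board] += 1 if given == expected else 0
    return counts


# Hypothetical set differing in gloss impression (lower value = glossier).
gloss_impression = [9.1, 8.4, 10.2]
reference = reference_ranks(gloss_impression, higher_is_glossier=False)

# Hypothetical rankings from four observers; tied ranks are allowed.
rankings = [[2, 1, 3], [1, 1, 2], [2, 1, 3], [3, 1, 2]]
counts = matches_per_board(rankings, reference)
relative = [count / len(rankings) for count in counts]
print(reference, counts, relative)
```

For a set that differs in gloss value, the same helper would be called with `higher_is_glossier=True`, since a higher gloss value means a glossier surface.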

3 Results

To analyse whether there is a significant difference between the instrumental measurements and human visual perception, χ²-tests were performed. To address different aspects of the research question, several contingency tables were built.

In total, 1140 out of 2712 observations were correct (“match”), whereas 1572 classifications did not match the instrumental measurement (“no match”); overall, 42% of the visual observations matched the instrumental data. This figure does not yet differentiate between the colours of the samples. Table 3 takes a closer look at the correct and incorrect classifications for boards which differ in gloss value and for boards which differ in gloss impression.

Table 3 Percentage of correct and incorrect classifications regarding both methods

Only 41.5% of the correct classifications were obtained for samples which differ in gloss value, whereas samples which differ in gloss impression showed higher concordance, with 58.5% of the matches. The calculated chi-squared probability of less than 0.05 leads to rejection of the hypothesis of no difference in correct classifications between the two measurement methods, χ²(1, N=113) = 54.63, p < 0.001. The gloss impression has a statistically significantly higher correspondence to human gloss appraisal than the gloss value.

This result shows which method corresponds better to visual discrimination between samples of similar (high) gloss appearance, but it remains unclear whether the correspondence of gloss impression and visual perception is satisfactory or just higher than that of the gloss value. Hence, the guessing probability was analysed to determine whether the assessments of the observers were correct only by chance. As there were three potential rank categories for each specimen, and assigning the same rank was allowed, the guessing probability is one-third. As shown in Table 1, half of the 24 samples, and therefore half of the observations (12 × 113 = 1356), concern boards which differ in gloss value, and the other half boards which differ in gloss impression. By guessing, one-third of these 1356 observations would have been correct, so guessing would have led to 452 correct and 904 incorrect answers. Table 4 compares the results for the gloss value with the guessing probability.
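The guessing baseline follows directly from the numbers given above and can be checked in a few lines. The variable names are hypothetical; all values are taken from the text.

```python
observers = 113
boards_per_method = 12   # half of the 24 boards
rank_categories = 3      # guessing probability of one-third

observations = observers * boards_per_method
expected_correct, remainder = divmod(observations, rank_categories)
expected_incorrect = observations - expected_correct

print(observations, expected_correct, expected_incorrect)  # 1356 452 904
```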

Table 4 Number of correct and incorrect classifications regarding the gloss value compared with guessing

Table 4 shows that the commonly used gloss value provides no significantly better results than guessing. If the observers had guessed, 452 correct answers would have been expected. The total number of correct answers is 473, which is not statistically significantly higher, as the p-value exceeds the 0.05 significance level, χ²(1, N=113) = 0.72, p = 0.395. Hence, it can reasonably be assumed that there is no unambiguous relation between the measured gloss value and the visual perception of the observers, a result that fully agrees with the literature findings described in the introduction.
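The comparison with guessing can be retraced with a plain Pearson chi-square test on a 2×2 table (observed answers versus guessing, match versus no match). The sketch assumes that no continuity correction was applied; under that assumption the statistic agrees with the value reported above. The function name is hypothetical, and the p-value uses the fact that for one degree of freedom the upper tail of the chi-square distribution equals erfc(√(χ²/2)).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # Upper tail probability of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


observed = [473, 1356 - 473]   # matches / no matches for the gloss value
guessing = [452, 904]          # expected by guessing (one-third of 1356)
statistic, p_value = chi_square_2x2([observed, guessing])
print(round(statistic, 2), round(p_value, 3))  # 0.72 0.395
```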

Table 5 shows the number of correct and incorrect answers for those boards which differ in gloss impression. The calculated chi-squared probability is less than 0.05. Therefore, the null hypothesis has to be rejected, χ²(1, N=113) = 39.09, p < 0.001. The answers were not correct merely by chance. Hence, it can be claimed that the newly developed parameter corresponds better to human visual perception than the gloss value.

Table 5 Number of correct and incorrect classifications regarding the gloss impression compared with guessing

As shown in Table 6, perception differs considerably with the colour of the samples. In total, 1140 evaluations matched the instrumental measurement, 647 of them for black and 493 for white boards. The share of matches attributable to black samples is about 13 percentage points higher than that of the white boards (647 versus 493 of 1140 matches). On the whole, the gloss of black boards can be distinguished significantly better than the gloss of white boards, and the hypothesis of no difference has to be rejected, χ²(1, N=113) = 35.89, p < 0.001. This had already been suspected during the experiments: after the visual evaluation, observers often reported that it was extremely difficult or nearly impossible to distinguish between the white samples, whereas discrimination of black samples was reported to be far easier. This finding is confirmed in Table 7, which provides a good overview of the results.

Table 6 Number of correct and incorrect classifications regarding the surface colour
Table 7 Number of correct classifications regarding instrumental data, guessing and colours
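The colour comparison reported for Table 6 can be retraced with the closed-form expression for a 2×2 Pearson chi-square statistic, χ² = N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). The sketch assumes that no continuity correction was applied and that each colour received 12 × 113 evaluations; the function and variable names are hypothetical.

```python
def chi_square_shortcut(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


evaluations_per_colour = 12 * 113   # 12 boards of each colour, 113 observers
black_matches, white_matches = 647, 493

statistic = chi_square_shortcut(
    black_matches, evaluations_per_colour - black_matches,
    white_matches, evaluations_per_colour - white_matches,
)
total_matches = black_matches + white_matches
share_black = black_matches / total_matches   # share of all matches
share_white = white_matches / total_matches
rate_black = black_matches / evaluations_per_colour   # match rate per colour
rate_white = white_matches / evaluations_per_colour

print(round(statistic, 2))  # 35.89
print(f"{share_black:.1%} {share_white:.1%} {rate_black:.1%} {rate_white:.1%}")
```

The two kinds of percentage answer different questions: the shares describe how the 1140 matches are split between the colours, whereas the match rates relate the matches of each colour to its 1356 evaluations.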

Table 7 presents only those classifications which are in accordance with the instrumental measurements. These matches are compared with the guessing probability. There were four combinations of colour and instrumental parameter (see Table 1), so 678 evaluations were given for every combination. The number of matches expected by guessing is one-third of the total evaluations given for one combination.

The number of matches for white boards was 250 for samples which differ in gloss value and 243 for samples which differ in gloss impression. For black boards, 223 evaluations of samples which differ in gloss value were correct. If the observers had guessed, 226 correct answers would have been expected. Therefore, the correct classifications of white boards do not differ significantly from a random classification, neither for the gloss value nor for the gloss impression, χ²(1, N=113) = 0.75, p = 0.39. The same applies to black boards which differ in gloss value. The gloss impression, however, provides a significantly better estimate of human gloss perception at least for black boards: the number of matches for the gloss impression was 424, which is statistically significantly higher than the number expected by guessing, χ²(1, N=113) = 20.37, p < 0.001.
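The counts quoted above can be cross-checked against the totals reported earlier (473 and 667 matches per method, 647 and 493 per colour, 1140 overall) and expressed as match rates. The dictionary layout and variable names are hypothetical; all counts are taken from the text.

```python
evaluations = 6 * 113          # six boards per combination, 113 observers
guessing = evaluations // 3    # matches expected by guessing

matches = {
    ("white", "gloss value"): 250,
    ("white", "gloss impression"): 243,
    ("black", "gloss value"): 223,
    ("black", "gloss impression"): 424,
}

by_method = {}
by_colour = {}
for (colour, method), count in matches.items():
    by_method[method] = by_method.get(method, 0) + count
    by_colour[colour] = by_colour.get(colour, 0) + count
    print(f"{colour:5s} {method:16s} {count:3d} of {evaluations} "
          f"({count / evaluations:.1%}, guessing: {guessing})")

print(by_method, by_colour, sum(matches.values()))
```

Only the combination of black boards and gloss impression lies clearly above the one-third guessing level; the other three combinations stay close to it.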

Finally, it was tested whether gender influences gloss perception: there is no significant influence of gender on gloss perception, χ²(1, N=113) = 0.13, p = 0.72. Furthermore, it was investigated whether the need for visual aids has a significant influence on perception. Of the 113 observers, almost 42% wear glasses or contact lenses. However, there is no significant difference in correct classifications between persons who need a visual aid and those who do not, χ²(1, N=113) = 0.42, p = 0.52.

4 Discussion

While not all of the results were significant, the overall direction of the results showed that the gloss impression is a suitable method to measure visual gloss perception. The results are especially good for black surfaces in the high-gloss or mirror-finish range. For white surfaces, the consistency of sensory and instrumental data was quite poor, or rather random. But even though the new method provides predominantly good results for black surfaces and quite poor results for white ones, the results for white surfaces are comparable to those of the gloss value. Considering this, it can be claimed that the overall performance of the gloss impression is closer to human perception than that of the gloss value. The new method is a relevant alternative to the glossmeter for gloss measurement and a suitable method for the soft metrology of surface gloss. As already illustrated in the introduction, the literature reports a rather poor correlation between glossmeter measurements and human visual perception. This finding was confirmed once again, for black as well as for white samples. Since the study was performed on high-gloss samples, distinctness-of-image (DOI) gloss could have been applied as an alternative method to provide reference values. However, the validation is not limited to the comparison with the specular glossmeter, since the study design also tests the results against random guessing. In a further step, the comparison of DOI and the presented method should be carried out in another experimental setup.

However, the new method has limitations that require further research. The described experiments were conducted on mirror-finish boards with gloss values of at least 100 GU for black and 90 GU for white samples. Therefore, it is not possible to make any definite statements about the relation between instrumental and visual data in the low- or medium-gloss range. This range was not of interest for this study because today’s furniture industry and marketing focus mainly on high-gloss surfaces. As demonstrated by Obein et al. (2004), humans are more sensitive in the high-gloss and matte ranges. It is therefore possible that the gloss impression provides good results for high-gloss surfaces but a weaker correspondence in the medium-gloss range. Further research is needed to investigate this hypothesis.

The good results for black surfaces and the quite poor results for the white samples could be explained by the fact that in gloss perception the contrast between the specular highlight and the surrounding surface seems to be of great importance. Pfund (1930) recognized that black surfaces seem glossier to the observer than white ones because of the sharper contrast that can be seen; the same result was obtained by Harrison and Poulter (1951), who also found that darker surfaces appear glossier to observers than light surfaces. This could also explain the better results for black samples.

The results are only valid for materials with surfaces comparable to the coated particle boards used. To determine the gloss of, for example, coated papers, this measuring method still needs to be validated. The same applies to curved surfaces: the gloss impression has so far only been validated for flat surfaces. Further, the relation between instrumental data and visual judgement for other colours was not included in the experiment; therefore, no reliable statement about them is possible. From the results above, it could perhaps be assumed that the gloss impression predicts the visual discrimination between samples better for darker colours than for light ones, but this hypothesis needs further research.

The inconsistency between instrumental measurement and visual assessment for the gloss value is widely known, and it might be partly due to the validation method: Leloup et al. (2014) attribute this insufficient result to “the fact that specular glossmeter readings were originally only intended to correlate in rank with visual judgements”. The same could be said about the gloss impression, as in the present study a correspondence between instrumental and sensory data was only found by using ordinal scales. This objection cannot be completely refuted. Nevertheless, the results of the gloss impression in comparison to those of the specular glossmeter must be taken into consideration. Even though the relation was only calculated with ordinal data, the gloss impression still provides significantly better results than the gloss value. Furthermore, Chadwick and Kentridge (2015) question the feasibility of linking perceptual judgements and physical attributes by means of rating scales. They pointed out that any rating scale needs a reference point, which would allow a comparison between the evaluations of different persons, and that pairs of stimuli must be compared directly with each other in order to obtain reliable results. A paired comparison as applied in AHP methods would allow rating scales to be derived (e.g. Stern et al. 2012; Brudermann et al. 2015). However, given the difficulty of assessing the surface differences visually, such a rating procedure would most likely lead to inconsistent results, an issue frequently associated with pairwise rating approaches. Chadwick and Kentridge (2015) therefore suggest a method in which each sample is compared directly with the other samples. This condition is fulfilled in the present study, for which a rating test was considered too difficult for naïve observers and potentially prone to inconsistent ratings.

5 Conclusion

The sensory experiment revealed that the gloss impression is a suitable method to measure visual gloss perception. The statistical evaluation showed that the overall consistency between visual appraisal and the instrumentally measured gloss impression is higher than the consistency with the data of the glossmeter. As already discussed in the introduction, this study corroborates that the gloss value does not satisfactorily reflect human perception. In contrast to the literature mentioned in the introduction, the present study additionally analyses the guessing probability for the commonly used glossmeter as well as for the newly developed gloss impression. So far, the gloss impression has only been validated for black and white surfaces in the high-gloss range, starting from gloss values of 90 GU. To examine the consistency for other colours, for curved surfaces and for the matte and medium-gloss ranges, further research is needed.

Determining human visual perception more exactly is not only important for the validation of measurement techniques but also, as mentioned in the introduction, has implications for the effective and efficient marketing of furniture or other goods with high-gloss surfaces. With a measurement method that is in accordance with the human senses, designs with different gloss ranges can be presented more effectively.