
1 Introduction

Machine learning has recently risen as one of the most groundbreaking technologies of our time. Many companies use machine learning and other data science techniques to improve processes and resource allocation, significantly increasing business value. While many machine learning algorithms have been around for a long time, the ability to automatically apply complex mathematical calculations to big data, over and over and ever faster, is a recent development. One of the challenges companies face is high-dimensional data. Dealing with many variables can help in certain situations, but it may also divert attention from what really matters. It is therefore important to check whether dimensionality can be reduced while preserving the essential properties of the full data matrix.

Principal Component Analysis (PCA) is one of the most widely used tools in exploratory data analysis and in machine learning for predictive models. PCA is an unsupervised statistical technique used to examine the interrelations among a set of variables. It is also known as a general factor analysis, where regression determines a line of best fit. The main idea of the procedure is to reduce the dimensionality of a dataset while preserving as much ‘variability’ (statistical information) as possible. Preserving variability often implies finding new variables that are linear functions of those in the original dataset, that successively maximize variance, and that are uncorrelated with each other. Finding such new variables, the principal components (PCs), reduces to solving an eigenvalue/eigenvector problem. The mathematical basis of the methodology was presented by Pearson [1] and Hotelling [2]. Advances in data-processing technology have generated a broad body of work on the PCA method. The literature is vast; some of the most substantial books for understanding PCA are [3, 4], while [5, 6] cover more specialized applications.

The use of machine learning techniques for prediction models is also widely studied in the literature. Hervert-Escobar [15] presents a PCA method for dimensionality reduction combined with a multiple regression analysis to generate econometric models. Those models are then optimized to obtain the optimal price of a set of products; the model was tested in a case study showing favorable results in profits for the pilot stores. Additionally, the literature provides articles that compile the advances and uses of machine learning prediction techniques. Sharma [7] presents a survey of well-known, efficient regression approaches to predict the stock market price from stock market data. Buskirk [16] provides a review of commonly used concepts and terms associated with machine learning modeling and evaluation, together with a description of the data set used as a common application example across different machine learning methods.

In this research we present a machine learning technique that, given a matrix of samples of explanatory variable values, can produce predictions for quantitative dependent variables. It can also label the samples, allocating one or several classes. The technique was tested to determine prices of articles for sale in convenience stores, to predict pollution contingencies, to determine leisure activities for tourists, to establish the probability of metastasis in cancer patients or the malignancy of tumors in breast cancer patients, and in other applications.

The procedure starts with a principal component analysis (PCA). This yields several linear combinations of the original explanatory variables, called principal component projections. These linear combinations are used to carry out least-squares curve fitting to give predictions for the variables of interest. Some applications imply a classification of the samples, mainly when the variables of interest have a definition of success or failure (within a threshold value). In such cases, point probabilities are computed for every value of the principal component projections. The curve-fitting model then yields approximate probabilities for the value of a given combination of independent input variables.

The rest of the manuscript is organized as follows. The proposed methodology is presented in Sect. 2. The testing and analysis of the procedure are presented in Sect. 3. Finally, conclusions are given in Sect. 4.

2 General Procedure

The general outline of the technique follows. We are given a matrix X whose rows are samples (m samples) with numerical values and whose columns are variables (n variables). The explanatory variables are separated into matrix E (o variables) and the dependent variables, or variables of interest, into matrix D (p variables); thus n = o + p. From the PCA [8] sample scores and variable loadings, dimension reduction and clustering can also be carried out. Figure 1 shows the steps of the procedure.

Fig. 1. Predictive factor variance association steps.
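
As a small illustration of the notation above, the following Python sketch lays out matrices E, D and X with the stated shapes; the sizes, random values and the normalization step are assumptions for illustration only, not taken from the paper.

```python
import numpy as np

# Illustrative sizes: m samples, o explanatory variables, p variables of interest (n = o + p).
m, o, p = 100, 8, 2
rng = np.random.default_rng(0)

E = rng.normal(size=(m, o))    # explanatory variables
D = rng.normal(size=(m, p))    # dependent variables (variables of interest)
X = np.hstack([E, D])          # full data matrix with n = o + p columns

# Columns are normalized (zero mean, unit variance) before PCA; the same normalization
# is assumed for the "new samples" X' used for prediction in step 7 below.
X = (X - X.mean(axis=0)) / X.std(axis=0)
```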

2.1 Dimension Reduction

The procedure starts by performing a PCA; a minimal code sketch of the core steps appears after the list below.

1. Carry out PCA for matrix X and find matrices P, D and Q such that \( X = PDQ^{t} \).

2. Let \( F = XQ \) be the principal component scores.

3. Use square cosines and a clustering algorithm over the first two columns of F (the first two components) to determine:

   a. Collinearity and dimension reduction. Explanatory variables that are grouped together are collinear; eliminate the redundant columns from matrix X, repeat the PCA, and go back to step 1.

   b. Causality relationships. If explanatory variables are grouped with variables of interest, that is evidence of a causal relationship.

   c. The square cosine is the square of the cosine of the angle between the vectors of variable loadings. If variables are close, the cosine will tend to one; if variables are separated, the cosine will approach zero. Since \( \cos^{2} \left( x \right) > 0.5 \) implies \( \left| {\cos \left( x \right)} \right| > 0.7071 \), associate variables whose squared cosine is greater than 0.5.

4. Eliminate the variables of interest, leaving only matrix E, and carry out PCA over E.

5. Select according to the following cases:

   a. For predictive analytics, carry out curve fitting between the explanatory variables and the variables of interest clustered together in step 3.b. Discard any fit with \( R^{2} < 0.5 \). This indicates which explanatory variables most influence the variable of interest.

   b. For the multi-label problem, estimate probabilities. Match \( F_{i} \) row by row with variable of interest j (assumed to be categorical with values 0 or 1), where i = 1 … o and j = 1 … p. Use the window procedure (Sect. 2.2) to estimate probabilities.

6. Select according to the following cases:

   a. For predictive analytics, match \( F_{i} \) row by row with variable of interest j, where i = 1 … o and j = 1 … p. Carry out curve fitting and discard all fits with \( R^{2} < 0.5 \). Each \( F_{i} \) is a linear combination of explanatory variables that potentially has enough information to give good predictions for the variables of interest. Thus \( \hat{D}_{j} = f_{ij} \left( {F_{i} } \right) = f_{ij} \left( {q_{1} X_{1} ,q_{2} X_{2} , \ldots ,q_{o} X_{o} } \right) \), where the coefficients \( q_{1} , \ldots ,q_{o} \) are determined by matrix Q.

   b. For the multi-label problem, carry out curve fitting and discard all fits with \( R^{2} < 0.5 \). Each \( F_{i} \) is a linear combination of explanatory variables that potentially has enough information to give good predictions for the variables of interest. Thus \( \hat{D}_{j} = f_{ij} \left( {F_{i} } \right) = f_{ij} \left( {q_{1} X_{1} ,q_{2} X_{2} , \ldots ,q_{o} X_{o} } \right) \), where the coefficients \( q_{1} , \ldots ,q_{o} \) are determined by matrix Q. If \( \hat{D}_{j} > 0.5 \), assume a result of 1, and 0 otherwise.

7. Select according to the following cases:

   a. For predictive analytics, normalized new samples \( X^{\prime} \) can be used to obtain predictions, with \( F^{\prime} = X^{\prime} Q \) and \( \hat{D}_{j}^{\prime} = f_{ij} \left( {F_{i}^{\prime} } \right) \).

   b. For the multi-label problem, normalized new samples \( X^{\prime} \) can be used in the same way, with \( F^{\prime} = X^{\prime} Q \) and \( \hat{D}_{j}^{\prime} = f_{ij} \left( {F_{i}^{\prime} } \right) \). Again, if \( \hat{D}_{j}^{\prime} > 0.5 \), assume a result of 1, and 0 otherwise.

2.2 Sorting and Estimating Probabilities

When a multi-label classification solution is required, class probabilities per sample must be calculated [9]. We introduce a new measure called the success probability, usually interpreted as \( P\left[ {x_{i} > l} \right] \), that is, the probability that the variable will reach a value above a threshold \( l \) (although there are many other types of categories). The horizontal axis for this estimate can be any of the explanatory variables, and the vertical axis any of the variables of interest, which in this case is categorical with a value of 1 for success or yes and 0 for failure or no. The most effective models, however, are usually derived from using an entire factor, that is, a linear combination of explanatory variables obtained as a column of matrix \( F \), to predict variables in matrix \( D \).

To estimate this probability, for each factor and each variable of interest, an \( m \)-row vector with two entries per row is created. One entry is \( vx \), the explanatory linear combination from \( F \) (or just a single variable), and the other entry is \( vy \), the categorical variable used to estimate probabilities. This vector is sorted by \( vx \). A sample of \( h \) items is taken above and below a given value of \( vx \), creating a sliding window of samples. In general, \( h \) data points from the plots are lost: \( h/2 \) points at the beginning of the plot and \( h/2 \) at the end. We use \( h = 21 \) and \( h = 41 \). The samples are taken and the number of successes is counted, so \( vy_{i} \) indicates whether the horizontal point \( vx_{i} \) is a success \( \left( {vy_{i} = 1} \right) \) or not \( \left( {vy_{i} = 0} \right) \). The probability of success \( P_{s} \left( i \right) \) at horizontal point \( i \) is estimated by Eq. (1):

$$ P_{s} \left( i \right) = \frac{1}{h}\sum\limits_{j = i - \frac{h - 1}{2}}^{i + \frac{h - 1}{2}} vy_{j} $$
(1)

In essence, we consider each success result in the sample as a Bernoulli experiment and use the sample mean of a collection of results of the variable of interest as an estimator of the probability, which is a well-known maximum likelihood estimator [10]. This is similar to the Krichevsky–Trofimov estimator [12], which is used as a conditional estimator of the outcome of the next Bernoulli experiment [11]. However, since our dataset is not a time series and our objective is to create a global model of the success probability as a function of some variable, we can use the sample mean of the current window, as seen in [12, 13].

The sample size used to estimate the success probability is small, so tests were made with different values of \( h \) such as 11, 41 and even 101 when possible, and it was found that the behaviour of the probability estimate was about the same, showing stability of the estimate. Naturally, the bigger the sample the more precise the estimate will be, but the more variability will be lost to averaging.
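
A minimal sketch of the sliding-window estimator of Eq. (1) is shown below, assuming an odd window size h and a categorical variable vy taking values in {0, 1}; the names are illustrative.

```python
import numpy as np

def window_success_probability(vx, vy, h=21):
    """Estimate P_s(i) of Eq. (1): sort the (vx, vy) pairs by vx, then average the h
    categorical outcomes centered on each position; the (h - 1)/2 points at each end
    have no complete window and are dropped."""
    assert h % 2 == 1, "h is assumed odd so the window is centered"
    order = np.argsort(vx)
    vx_sorted = np.asarray(vx, dtype=float)[order]
    vy_sorted = np.asarray(vy, dtype=float)[order]
    half = (h - 1) // 2
    probs = np.convolve(vy_sorted, np.ones(h) / h, mode="valid")   # moving average of width h
    centers = vx_sorted[half:len(vx_sorted) - half]                # matching horizontal points
    return centers, probs
```

A curve can then be fitted to the (centers, probs) pairs to obtain the global probabilistic models used in Sect. 3, and a new sample is labeled as a success when the fitted probability exceeds 0.5.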

3 Testing

In this section we present different data sets where the proposed methodology is tested.

3.1 Metastasis on Breast Cancer Patients

This data set consists of 2920 reference values for different genes of 78 breast cancer patients. These patients already have breast cancer tumors. We wish to determine the likelihood of tumors metastasizing. This is a binary classification problem. This data file was obtained from the UCI Machine Learning Repository (Dua and Karra [14]).

The probability estimate procedure with a window of only 11 samples found a cubic equation with \( R^{2} = 0.831 \). The fitted model with F17 is able to correctly determine whether the sample metastasized in 75.64% of the samples. See Fig. 2 for a plot of the probabilistic model.

Fig. 2. Probabilistic model determining the probability of metastasis in breast cancer patients.
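
For illustration only, the classification step just described could be sketched as below, reusing the window_success_probability function from Sect. 2.2; loading of the gene-expression data is omitted, and the 0-based column index for F17, the probability clipping and the function name are assumptions of this sketch.

```python
import numpy as np

def fit_and_score(F, y, factor=16, h=11, degree=3):
    """Fit a cubic probabilistic model of metastasis (y in {0, 1}) on factor F17
    (0-based column index 16) and report the fraction of correctly classified
    samples using a 0.5 threshold."""
    centers, probs = window_success_probability(F[:, factor], y, h=h)
    coeffs = np.polyfit(centers, probs, degree)                    # cubic fit to windowed probabilities
    p_hat = np.clip(np.polyval(coeffs, F[:, factor]), 0.0, 1.0)    # predicted probability per sample
    accuracy = np.mean((p_hat > 0.5).astype(int) == np.asarray(y))
    return coeffs, accuracy
```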

3.2 Malignity of Tumors in Breast Cancer Patients

This data set consists of 570 samples of information about cancer patients. The variables are information about the tumor such as radius, perimeter, area, texture, softness, concave points, concavity, symmetry and fractal dimension. The objective is to find if the tumor is malignant or benign. This is a binary classification problem. This data file was obtained from the UCI Machine Learning Repository (Dua and Karra [14]).

The fitted probabilistic model using F1 shows a coefficient of determination of \( R^{2} = 0.9434 \) and is able to correctly determine whether the sample corresponds to a malignant tumor in 91.38% of the samples. See Fig. 3 for a plot of the probabilistic model.

Fig. 3. Probabilistic model determining if a tumor is malignant in breast cancer patients.

Dimension Reduction.

PCA on the data set also shows that some of the variables can be eliminated, achieving dimension reduction, as shown in Fig. 4. This is very important: if a medical device is being designed to help people heal or prevent a health problem, the fewer variables that need to be measured and controlled, the cheaper the device will be and the more people it will help.

Fig. 4. Dimension reduction to determine if a tumor is malignant.

Figure 5 shows how the clustering of samples in the F1 vs. F2 biplot agrees with using F1 to separate malignant from benign tumors. As can be seen, there is a clear separation between samples marked M or B along the F1 axis.

Fig. 5. Clustering of samples. M means malignant, B means benign.

3.3 Travel and Activities

This data set corresponds to 87 samples containing answers to a survey. The information provided is each individual's psycho-socio-economic profile. This data is useful for travel websites in order to determine what kind of activities a user would prefer. The information contains several features for the socio-economic profile and for the psychological profile.

The socio-economic features are gender, age, education, marital, employment, WEMWBS, PANAS: PA, PANAS: NA, SWLS, SWLS: group.

The features for psychological profile are:

  • Personality traits: Extraversion, Agreeableness, Conscientiousness, Imagination, Neuroticism, shown as the columns ‘Big5: Extraversion’, ‘Big5: Agreeableness’, ‘Big5: Conscientiousness’, ‘Big5: Neuroticism’, ‘Big5: Imagination’

  • 3 orientations to happiness: Pleasure, Meaning, and Engagement. The columns: ‘OTH: Pleasure’, ‘OTH: Meaning’, ‘OTH: Engagement’

  • Fear of missing out (FoMO).

The types of activities are (Would the person like to do…?): ‘Outdoors-n-Adventures’, ‘Tech’, ‘Family’, ‘Health-n-Wellness’, ‘Sports-n-Fitness’, ‘Learning’, ‘Photography’, ‘Food-n-Drink’, ‘Writing’, ‘Language-n- Culture’, ‘Music’, ‘Movements’, ‘LGBTQ’, ‘Film’, ‘Sci-Fi-n-Games’, ‘Beliefs’, ‘Arts’, ‘Book Clubs’, ‘Dance’, ‘Hobbies-n-Crafts’, ‘Fashion-n-Beauty’, ‘Social’, ‘Career-n-Business’, ‘Gardening-n-Outdoor housework’, ‘Cooking’, ‘Theatre, Show, Performance, Concerts’, ‘Drinking alcohol, Partying’, ‘Sex and Making Love’.

The data file was obtained from online surveys.

This is a multi-label problem. Table 1 shows which activities are predicted by which factors and the model \( R^{2} \).

Table 1. Types of activities, the component best able to predict each one, and the model \( R^{2} \).

For example, the activity ‘Language-n-Culture’ can be predicted by the linear combination of explanatory variables formed by factor 4, with an \( R^{2} = 0.945 \). The model is able to accurately determine that a person will like this type of activity in 67.81% of the samples. Figure 6 shows the probabilistic model.

Fig. 6. Probabilistic model for language- and culture-related activity preferences.
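
A sketch of how the per-label selection summarized in Table 1 could be carried out follows, again reusing window_success_probability from Sect. 2.2; the window size, the polynomial degree and computing \( R^{2} \) on the windowed curve are assumptions of this sketch.

```python
import numpy as np

def best_factor_per_label(F, D, h=21, degree=3):
    """For each label D[:, j], pick the factor F[:, i] whose windowed-probability
    curve fit has the highest R^2 (one model per activity, as in Table 1)."""
    best = {}
    for j in range(D.shape[1]):
        for i in range(F.shape[1]):
            centers, probs = window_success_probability(F[:, i], D[:, j], h=h)
            coeffs = np.polyfit(centers, probs, degree)
            pred = np.polyval(coeffs, centers)
            ss_tot = np.sum((probs - probs.mean()) ** 2)
            r2 = 1.0 - np.sum((probs - pred) ** 2) / ss_tot if ss_tot > 0 else 0.0
            if j not in best or r2 > best[j][1]:
                best[j] = (i, r2, coeffs)   # best factor index, its R^2, and the fitted curve
    return best
```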

3.4 Convenience Store Pricing Strategy

This is a data set of 6362 samples of sales tickets from several convenience stores. The data consist of type of merchandise, merchandise identification number, year and year-week, article cost, taxes, units bought, purchase amount, units with discount, amount of discount, article price, total profit margin and % of profit margin (from cost), minimum, mean and maximum temperatures, and amount of rain.

Principal component analysis determined that the profit margin and % of profit margin, as well as the units sold, depend only on the units sold with discount and the total discount amount. See Fig. 7.

Fig. 7. Principal component analysis of convenience store sales.

For the convenience store, one important goal is to sell articles with a profit margin of at least 40%. PFVA determined that F10 of a PCA that excludes the variables “profit margin” and “% of profit margin” gave the best prediction of attaining this goal, with an \( R^{2} = 0.753 \). The regression model is able to determine whether a sales ticket will achieve the goal in 68.86% of the samples.

An important observation is that the profit margin and the probability of achieving more than 40% profit margin move in opposite directions. This is because the margin increases with the number of articles sold at a discount, but the probability of achieving more than 40% profit margin diminishes, since discounted articles have a lower price, at the same cost, than regularly priced articles. This can be seen in Figs. 8a and b for the profit margin and Prob(%margin > 40%) versus discount units sold (usdes), and in Figs. 9a and b for the profit margin and Prob(%margin > 40%) versus total discount amount (montodes), both for the article identified as sku = 455.

Fig. 8. a. Profit margin versus discount units sold for article sku = 455. b. Prob(%profit margin > 40%) versus discount units sold for article sku = 455.

Fig. 9. a. Profit margin versus total discount amount for article sku = 455. b. Prob(%profit margin > 40%) versus total discount amount for article sku = 455.

Fig. 10. a. Balance between profit margin and total margin, allowing at most 5 units sold at discount per sale. b. Balance between profit margin and total margin, allowing at most 16 currency units of discount per sale.

3.5 Metropolitan Area Air-Pollution

Monterrey’s Metropolitan Area (MMA) in Mexico consists of 15 municipalities with an approximate population of 4,406,054 inhabitants. As with other large metropolitan areas, pollution is a concern. More than 60% of days in a year have pollution levels that label air quality as bad or extremely bad. There are 12 monitoring stations throughout the metropolitan area, although for this paper we only used data from five stations because the others had too much incomplete data.

Each monitoring station measures, every hour, weather variables such as pressure (PRS), temperature (TOUT), relative humidity (HR), solar radiation (SR), rainfall (Rain), wind speed (WSR) and direction (WDV); and pollutants such as carbon oxides (COx), nitrogen oxides (NOx), sulfur oxides (SOx), ozone (O3), particles with diameter less than 2.5 \( \upmu \)m (PM2.5) and particles with diameter less than 10 \( \upmu \)m (PM10). Regional health authorities consider the last three pollutants the ones with the most adverse effects on the population. Nevertheless, in only 3% of the days measured in 2015 did the daily maximum O3 concentration exceed the norms, whereas PM10 and PM2.5 exceeded the maximum limits on 58% and 63% of days, respectively. Although PM10 includes PM2.5 particles, PM10 particles greater than 2.5 \( \upmu \)m (but less than 10 \( \upmu \)m) are usually mainly dust. There are 14,374 samples in total. The information was obtained from the Integral Air Quality Monitoring system of Nuevo Leon province in Mexico (Martinez et al. 2012).

PM2.5 (Particles with Less than 2.5 \( \upmu \)m in Diameter).

Particulate matter, or PM, is the term for particles found in the air. Many man-made sources emit PM directly or emit other pollutants that react in the atmosphere to form PM (Martinez et al. 2012). PM10 particles pose a health concern because they can be inhaled into and accumulate in the respiratory system. PM2.5 particles are referred to as “fine” particles and are believed to pose the greatest health risks. Because of their small size, fine particles can lodge in the respiratory system and may even reach the bloodstream. Exposure to such particles can affect the respiratory and cardiovascular systems. Numerous scientific studies have linked particle pollution exposure to a variety of problems, including (Lerma et al. 2013): premature death in people with heart or lung disease, nonfatal heart attacks, irregular heartbeat, aggravated asthma, decreased lung function, and increased respiratory symptoms such as irritation of the airways, coughing or difficulty breathing.

Sources of fine particles include all types of combustion activities and industrial processes, but they are also formed indirectly when gases from burning fuels react with sunlight and water vapour (Martinez et al. 2016). These can result from fuel combustion in motor vehicles, at power plants, and in other industrial processes. Most particles form in the atmosphere as a result of complex reactions of chemicals such as sulphur dioxide and nitrogen oxides, which are pollutants emitted from power plants, industries and automobiles.

PM2.5 is the main concern, since these particles form out of dangerous chemicals that have very serious effects on the population. Several measures have been proposed to reduce PM2.5 pollution, such as restricting vehicle transit by license plate number (that is, some vehicles would not be allowed to circulate on some days of the week) and vehicle verification. As can be seen in Figs. 11a and b, although ozone can be directly related to vehicle traffic (as represented by time of day), PM2.5 cannot (Carrera et al. 2015). Therefore, none of those vehicle-related preventive measures would work.

Fig. 11. a. Changes in O3 air concentration according to time of day; O3 is clearly related to daily traffic as represented by time of day. b. Changes in PM2.5 air concentration according to time of day; PM2.5 is clearly NOT related to daily traffic as represented by time of day.

Also, weather affects pollution. Our aim is, given a morning weather forecast, to determine whether there will be a pollution contingency, that is, pollution levels higher than allowed by laws and regulations. This would allow regional governments to implement emergency measures to protect the population.

PCA shows (see Fig. 12) that O3 is mainly related to temperature, solar radiation and relative humidity, as it is well known that O3 levels are often high when the day is hot, sunny and dry. Nevertheless, O3 levels can also be high at low temperatures.

Fig. 12. PCA indicating that O3 is related to weather, specifically temperature, humidity and solar radiation. There seems to be no clear relationship between weather and particles.

For O3 we are able to create a deterministic model as shown in Fig. 13. F1 can predict the value of O3 with \( R^{2} = 0.7801 \).

Fig. 13. Predictive model for O3 based on F1.

Figure 12 also shows that PM2.5 and PM10 are very weakly related to weather, as their biplot lines are orthogonal to those of the weather variables. Nevertheless, since we are mainly worried about the daily maximum levels, a new file was created with 365 samples containing the daily maximum levels of PM2.5 and the maximum, minimum and average daily readings of all weather variables. We found F10 from this new analysis to be the best predictor of the class PM2.5 > 40.5. It has \( R^{2} = 0.9663 \) and is able to correctly predict bad air quality caused by high PM2.5 levels in 65.7% of the samples, giving evidence that even though weather does influence pollution, neither weather nor traffic is the main determining factor for airborne particulate matter. Since the number of samples was large, we tried windows from 11 to 101 samples, finding that windows from 11 to 51 gave the same percentage of correct predictions, with that percentage dropping slightly at window size 101.
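
The construction of the 365-sample daily file could be sketched with pandas as below; the file name and column labels are hypothetical, since the raw format of the monitoring-station data is not specified here.

```python
import pandas as pd

# Hypothetical hourly file with one row per station reading.
hourly = pd.read_csv("mma_hourly_2015.csv", parse_dates=["datetime"])

daily = (
    hourly
    .groupby(hourly["datetime"].dt.date)
    .agg(
        pm25_max=("PM2.5", "max"),                                   # daily maximum PM2.5
        tout_max=("TOUT", "max"), tout_min=("TOUT", "min"), tout_mean=("TOUT", "mean"),
        hr_max=("HR", "max"), hr_min=("HR", "min"), hr_mean=("HR", "mean"),
        # ...the remaining weather variables (SR, Rain, WSR, WDV, PRS) aggregated the same way
    )
)

# Binary class used as the variable of interest for the F10 model.
daily["bad_air"] = (daily["pm25_max"] > 40.5).astype(int)
```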

4 Conclusions

PFVA (predictive factor variance association) is a new technique, based on well-known concepts of matrix algebra and probability, that is able to solve several problems in data analytics. It provides orthogonal linear combinations of the explanatory variables that can be used to predict the value of a variable of interest given a collection of values for the explanatory variables, to determine the classes to which a sample belongs, and to classify samples into groups with common characteristics.

The main differentiator of this technique is the use of a sample window to compute class probabilities, which, even though the window can be quite small, has proven robust and accurate in the examples presented.