Marcel Merchat

Machine-Learning and other Data Products

Introduction

Modern computers and software are powerful tools for analyzing data, and they make machine learning possible. With machine learning, we can make useful predictions, or even prudent automated decisions, provided we understand the statistical accuracy of the estimates on which each prediction is based. An important measure of accuracy is the expected range and variation of the likely outcomes of a statistical variable. The data provide the estimates, while the probability theory of standard random variables tells us how accurate those estimates are.

Accuracy of Estimations and Predictions

Selecting an appropriate probability distribution requires knowledge about the thing being estimated. If an outcome varies for many complex reasons, as a person's height does, the bell-shaped normal distribution is usually a good choice. If the outcome is limited by filtering, as with manufactured parts that have been sorted into bins, the parts within a bin may be spread evenly across the bin's range; in that case a uniform distribution is a better assumption than the bell-shaped one. If a variable resembles the number of phone calls a call center receives per hour, the Poisson distribution is usually a good choice.
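As a rough illustration of how the choice of distribution changes the implied spread, the sketch below uses only the Python standard library. All of the numbers (heights with a mean of 170 cm and a standard deviation of 10 cm, a bin running from 9.9 mm to 10.1 mm, and an average of 4 calls per hour) are made-up assumptions for illustration and do not come from any dataset discussed here.

```python
import math
from statistics import NormalDist

# Normal: hypothetical heights, mean 170 cm, standard deviation 10 cm.
heights = NormalDist(mu=170.0, sigma=10.0)
p_height_within_1sd = heights.cdf(180.0) - heights.cdf(160.0)

# Uniform: hypothetical parts sorted into a bin from 9.9 mm to 10.1 mm.
low, high = 9.9, 10.1
uniform_sd = (high - low) / math.sqrt(12)  # standard deviation of a uniform variable


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when the mean count is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# Poisson: hypothetical call center averaging 4 calls per hour.
rate = 4.0
p_more_than_8_calls = 1.0 - sum(poisson_pmf(k, rate) for k in range(9))

print(f"P(height within 1 sd of the mean) = {p_height_within_1sd:.3f}")
print(f"sd of part size within the bin    = {uniform_sd:.4f} mm")
print(f"P(more than 8 calls in an hour)   = {p_more_than_8_calls:.4f}")
```

Each distribution turns the same kind of question (how far from the typical value can an outcome plausibly fall?) into a different calculation, which is why the choice matters for judging accuracy.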

Principal Components Analysis (PCA)

One of the basic ideas of machine learning is that including multiple correlated variables in an algorithm increases variability and reduces accuracy. Thus one attempts to build algorithms using uncorrelated variables; PCA is an automated way to help accomplish this in an unsupervised manner, using linear algebra to transform the data into a matrix whose columns are uncorrelated. While this eliminates the noise of correlated variables, the transformed variables are linear combinations of the original variables and are inherently difficult to interpret.
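For just two variables, the transformation can be written out by hand, which may make the idea more concrete. The sketch below is a minimal standard-library illustration on synthetic data (the variables `x` and `y` and their parameters are assumptions, not data from any study described here): it rotates the centered data onto the principal axes of its 2x2 covariance matrix, after which the two new columns have zero sample covariance.

```python
import math
import random

random.seed(0)

# Hypothetical data: two strongly correlated variables.
x = [random.gauss(0.0, 1.0) for _ in range(500)]
y = [0.8 * xi + random.gauss(0.0, 0.3) for xi in x]


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


sxx, syy, sxy = covariance(x, x), covariance(y, y), covariance(x, y)

# For a 2x2 covariance matrix, the principal axes are a rotation by theta.
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)

# Center the data, then project it onto the two principal axes.
mx, my = sum(x) / len(x), sum(y) / len(y)
pc1 = [c * (xi - mx) + s * (yi - my) for xi, yi in zip(x, y)]
pc2 = [-s * (xi - mx) + c * (yi - my) for xi, yi in zip(x, y)]

print(f"covariance before: {sxy:.4f}")
print(f"covariance after:  {covariance(pc1, pc2):.2e}")
print(f"share of variance in first component: "
      f"{covariance(pc1, pc1) / (sxx + syy):.3f}")
```

Each new column mixes both original variables, which is the interpretability cost mentioned above. With more than two variables, the same idea is normally carried out with an eigendecomposition or singular value decomposition from a linear algebra library.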

Predicting a Dry June in Illinois

It might seem foolhardy to try to predict anything about weather, but this investigation applies machine learning to weather records from O’Hare Airport in Chicago, Illinois. The analysis certainly shows the difficulties of weather prediction, but it also suggests that the probability of a wet June is weakly correlated with snow in February, cold weather in early spring, and other variables from the preceding months of the yearly weather record.

For unsupervised learning, we first explore the relationship between June rainfall and other weather data by computing the correlations between weather variables and performing cluster analysis. Based on the correlation results, a preliminary ordinary-least-squares analysis is performed for every possible combination of a reduced set of six column variables that have relatively high correlation with June rain. This is followed by principal components analysis of the same reduced set of six variables.
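To give a sense of the size of that search, the sketch below enumerates every non-empty combination of six variables. The column names are hypothetical placeholders, not the report's actual variables, and the intercept-only model is assumed to be excluded.

```python
from itertools import combinations

# Hypothetical names for six candidate predictors; the report's columns may differ.
candidates = [
    "feb_snow", "mar_min_temp", "apr_min_temp",
    "apr_rain", "may_rain", "may_max_temp",
]

# Every non-empty subset of the six variables is one candidate OLS model.
subsets = [
    combo
    for size in range(1, len(candidates) + 1)
    for combo in combinations(candidates, size)
]

print(len(subsets))  # 2**6 - 1 = 63 candidate models
```

With only six variables an exhaustive search is cheap; the count doubles with each added variable, which is one reason to reduce the set first.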

For supervised machine learning, we develop models for four algorithms that predict the amount of June rain and compare the mean-squared error of these models using cross-validation. Finally, we perform final tests for which the bias, variance, and mean-squared error are presented for each method. Although the correlation of June rain with February snow and with cold weather in March and April is low, it changes the probability of a dry June by a small but statistically significant amount, particularly for wet-cluster years whose predicted rain level exceeds 100 mm. The years in the dry cluster seem to have a constant rain level, independent of the amount of predicted rainfall. None of the dry-cluster years have June rain predictions above 100 mm, while the three wettest years have predictions above 100 mm. The report is available at this Rpubs address. The reproducible code for this project and report is shared on GitHub at this address.
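One common way to summarize test-set errors in terms of bias, variance, and mean-squared error is sketched below; the report may define these quantities differently. The rainfall values are invented for illustration. With the variance taken over the prediction errors, the three quantities satisfy MSE = bias² + variance.

```python
def error_summary(actual, predicted):
    """Return (bias, variance, mse) of the prediction errors."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    bias = sum(errors) / n                               # mean error
    variance = sum((e - bias) ** 2 for e in errors) / n  # spread around the bias
    mse = sum(e * e for e in errors) / n                 # mean-squared error
    return bias, variance, mse


# Hypothetical June rainfall in mm (not real O'Hare data).
actual = [80.0, 120.0, 95.0, 60.0, 140.0]
predicted = [90.0, 100.0, 100.0, 85.0, 110.0]

bias, variance, mse = error_summary(actual, predicted)
print(f"bias={bias:.1f} mm, variance={variance:.1f}, mse={mse:.1f}")
```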

Cross-validation with 10 folds was performed to compare the performance of four models.
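The mechanics of 10-fold cross-validation can be sketched with the standard library alone. The example below is only an illustration under stated assumptions: the data are synthetic, and the two models compared (a one-variable least-squares line and a mean-only baseline) stand in for the four algorithms of the actual study.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def fit_mean(xs, ys):
    """Baseline model that always predicts the training mean."""
    return sum(ys) / len(ys), 0.0


def cv_mse(xs, ys, fit, k=10):
    """Mean-squared error over all held-out points across k folds."""
    squared_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        intercept, slope = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_errors += [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in fold]
    return sum(squared_errors) / len(squared_errors)


# Synthetic data with a weak signal buried in noise.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(60)]
ys = [100.0 + 5.0 * x + rng.gauss(0.0, 30.0) for x in xs]

mse_line = cv_mse(xs, ys, fit_line)
mse_mean = cv_mse(xs, ys, fit_mean)
print(f"line: {mse_line:.1f}   mean-only baseline: {mse_mean:.1f}")
```

Because every model is scored on the same held-out folds, the resulting errors are directly comparable. With a signal this weak, the fitted line is not guaranteed to beat the baseline.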

Regression Algorithms

Partial Least Squares (PLS)
Ordinary-Least-Squares (OLS)
Preprocessing with Principal Components Analysis (PCA) followed by PLS
Ridge regression

Ridge regression was included as a model instead of LASSO regression because the latter did not perform well for this rather unpredictable weather problem with many weakly correlated variables. LASSO caused an underfitting problem, with little predictive power for cases far from the expected mean value. Increasing the hyperparameter λ until the model was simplified filtered out too much of the predicted variation from the average response, resulting in underfitting and a loss of sensitivity. The underfitting problem is reduced for ridge regression, and at least some predictive power is obtained.
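A single-predictor comparison may help show why the two penalties behave differently with weak predictors. The sketch below is a textbook simplification, not the models fitted in the study: for one centered predictor with `sxx = sum(x*x)` and `sxy = sum(x*y)`, the ridge and LASSO coefficients have closed forms. The numeric values are invented for illustration.

```python
def ridge_coef(sxy, sxx, lam):
    """Minimizer of 0.5*sum((y - b*x)**2) + 0.5*lam*b**2 for one predictor."""
    return sxy / (sxx + lam)


def lasso_coef(sxy, sxx, lam):
    """Minimizer of 0.5*sum((y - b*x)**2) + lam*abs(b) (soft thresholding)."""
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


sxx, lam = 50.0, 5.0  # hypothetical values
for label, sxy in [("weak predictor", 4.0), ("strong predictor", 40.0)]:
    print(f"{label}: OLS={sxy / sxx:.3f}  "
          f"ridge={ridge_coef(sxy, sxx, lam):.3f}  "
          f"lasso={lasso_coef(sxy, sxx, lam):.3f}")
```

In this simplified setting, LASSO sets the weak predictor's coefficient exactly to zero while ridge only shrinks it. When every predictor is weak, that could leave a LASSO model predicting little more than the mean, which is consistent with the underfitting described above.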

Malignant Tumor Detection with University of Wisconsin Dataset:

This illustrates a machine learning tool called Principal Components Analysis (PCA).

http://rpubs.com/marcelMerchat/244844

Electric Power Reliability

This illustrates a machine-learning study of SAIDI and SAIFI Figures of Merit.

Github Repository: https://github.com/marcelMerchat/electricPowerReliability1

To better understand the correlations between the raw variables, we construct a heat map that shows the level of correlation among the variables. The heat map shows which variables are likely to be more independent and are therefore candidates to include in the model. Figure 2-A and Figure 2-B in the link below show the relationship between the two most important PCA variables computed from patient data. Table 4 describes the accuracy of the algorithm. The program code that implements this study is at this GitHub repository.

Electric Power Reliability

SAIDI and SAIFI Figures of Merit:

http://www.rpubs.com/marcelMerchat/304554

The program code and raw data that generate this study are at this GitHub repository:

Risk and Decisions Based on Machine Learning

We need to understand the nature of the data and the business risk factors, both for loss and for opportunity. To exploit machine learning, risk is managed with a mathematical tool that has a seemingly odd name, the receiver operating characteristic (ROC). The name comes from an early application in which a radar receiver system had to decide whether or not something was an enemy plane. It is called the ROC even when the problem has nothing to do with radar, such as detecting cancer or predicting which product will be purchased; it is simply the name of a general method.

A machine learning algorithm may predict whom we should target for advertising, but sometimes the decision is a bigger one. In the case of a military radar system, the probability of detecting an enemy plane must be balanced against the probability of shooting down one of your own planes. Radar manufacturers do everything possible to avoid mistaking friendly planes for enemies, but the decision must ultimately be made with a limited amount of noisy data and is therefore subject to risk. The unavoidable noise and the laws of probability require letting at least a few unfriendly planes pass in order to shoot down fewer of our own. The best we can do is strike a balance between undesired outcomes. Fortunately, many business decisions are less grave; perhaps advertising dollars could be directed to more likely buyers. Still, there are plenty of safety matters, such as accidents, and a decision that costs the company money or forfeits an opportunity is not without consequences.

Receiver Operating Characteristic (ROC)

The graphical tool called the ROC curve can be used with machine learning to make informed decisions. Like common quality control tools, it permits balancing a tradeoff between detecting something and ignoring or missing it. For each setting of the decision threshold, it plots the probability of detection (deciding there is an unfriendly plane when one is present) against the probability of a false alarm (deciding there is one when none is present), so both probabilities can be read from the same plot.
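The points of an ROC curve can be computed directly from a classifier's scores by sweeping the decision threshold. The sketch below is a minimal standard-library illustration; the scores and labels are invented, with label 1 standing for "unfriendly plane present" and higher scores meaning the detector is more confident.

```python
def roc_points(scores, labels):
    """Return (false-alarm rate, detection rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        flagged = [lab for s, lab in zip(scores, labels) if s >= threshold]
        true_pos = sum(flagged)
        false_pos = len(flagged) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical detector scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for false_alarm, detection in points:
    print(f"P(false alarm)={false_alarm:.2f}  P(detection)={detection:.2f}")
print(f"area under curve = {area_under_curve(points):.4f}")
```

Choosing an operating point on the curve amounts to choosing a threshold: moving toward higher detection rates also raises the false-alarm rate, which is the balance between undesired outcomes discussed above.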