My Journey in Six Sigma, Lean, Data Science

## Saturday, November 16, 2019

### Logistic Regression

This is another piece of new knowledge I picked up after attending a Predictive Analysis course at NUS.

Logistic regression is used to predict the odds of being a case based on the values of the independent variables (predictors). The odds are defined as the probability that a particular outcome is a case divided by the probability that it is a noncase (Wikipedia).

odds = p/(1-p)
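To make the formula concrete, here is a quick sketch in Python (a language I am also picking up). The numbers are made up for illustration:

```python
import math

# odds of being a case, given the probability p of being a case
def odds(p):
    return p / (1 - p)

# logistic regression models the log of the odds (the logit);
# inverting the logit recovers the probability
def inv_logit(x):
    return 1 / (1 + math.exp(-x))

print(odds(0.8))                       # a 0.8 probability gives odds of about 4 to 1
print(inv_logit(math.log(odds(0.8))))  # back to about 0.8
```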

Below is the take-home exercise given to me, together with the code.

data source

Case :

I don't plan to spend too much time discussing the preliminary data assessment and the basic descriptive data analysis we need to do before proceeding to any other kind of deep-dive data analysis. It is common sense.

```
# Note : the script is tested working (1/10/2019)
# Reference : https://stats.idre.ucla.edu/r/dae/logit-regression/
# Reference : https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/
# Applicable for standard logistic regression analysis
# Not tested on any real case

# Set working directory

# Use pacman to load the necessary packages
pacman::p_load(readxl, fpp2, ggplot2, Ecdat, caret, ggpubr,
rpart, caTools, MASS, ROCR, oddsratio, broom)

# Load data from the working directory, name the data frame Raw_Data
# (the file name here is a placeholder; point it to your own data file)
Raw_Data = read_excel("Raw_Data.xlsx")

# Set qualitative data as factor
cols = c('Fac_Class', 'DEATH', 'RACEC', 'INH_INJ', 'FLAME' , 'GENDER')
Raw_Data[,cols] = lapply(Raw_Data[, cols], factor)
# Split Train & Test data with ratio 70:30
set.seed(1)

split_values = sample.split(Raw_Data$DEATH, SplitRatio = 0.7)
train_set = subset(Raw_Data, split_values==T)
test_set = subset(Raw_Data, split_values==F)

# Perform a quick model assessment without interaction
# Check which are the main effects
factor_screening = DEATH ~ AGE + RACEC + TBSA + FLAME + GENDER + Fac_Class + INH_INJ
Main_Factor = glm(factor_screening, data = Raw_Data , family = "binomial")
summary(Main_Factor)

# Run a full factorial GLM model, for a general overview of the GLM model
Full_Term = DEATH ~ AGE + RACEC + TBSA + FLAME + GENDER + Fac_Class + INH_INJ +

AGE:RACEC + AGE:TBSA + AGE:FLAME + AGE:GENDER + AGE:Fac_Class + AGE:INH_INJ +
RACEC:TBSA + RACEC:FLAME + RACEC:GENDER + RACEC:Fac_Class + RACEC:INH_INJ +
TBSA:FLAME + TBSA:GENDER + TBSA:Fac_Class + TBSA:INH_INJ +
GENDER:Fac_Class + GENDER:INH_INJ +
Fac_Class:INH_INJ

Full_Model = glm(Full_Term, data = train_set , family = "binomial")
summary(Full_Model)

# Run stepAIC to find the model giving the minimum AIC value
# stepAIC(Full_Model, direction = 'backward')
Sim_Term_1 = DEATH ~ AGE + RACEC + TBSA + FLAME + GENDER + INH_INJ +
AGE:INH_INJ + RACEC:FLAME + TBSA:INH_INJ
# Model training
Sim_Model_1 = glm(Sim_Term_1, data = train_set , family = "binomial")
summary(Sim_Model_1)
or_glm(data = train_set, model = Sim_Model_1 ,
incr = list(AGE = 10, FLAME = 1, TBSA = 10, INH_INJ = 1 ))

# Model validation with training data set
predict1 = predict(Sim_Model_1, train_set, type = "response")
summary(predict1)
train_set$prob1 = predict1
sapply(train_set, class)

x1 = ifelse(predict1 > 0.5, 1, 0)
x1 = as.factor(x1)
con_mx_train = confusionMatrix(data = x1, reference = train_set$DEATH)

# Generate a table to get optimum F1 score
list1 = c(1:8)/10
F1_Tb_Train = data.frame()

for (i in list1)
{
x2 = ifelse(predict1 > i, 1, 0)
x2 = as.factor(x2)
con_mx_train1 = confusionMatrix(data = x2, reference = train_set$DEATH)

byClass = data.frame(con_mx_train1$byClass)
Precision = byClass[5,]
Recall = byClass[6,]
F1 = byClass[7,]
a = cbind(i, Precision, Recall, F1)
F1_Tb_Train = rbind(F1_Tb_Train, a)
}

F1_Tb_Train = round(F1_Tb_Train,4)

# Plot F1 score chart
p1 = ggplot(F1_Tb_Train, aes(x = i, y = F1)) +
geom_line(color="blue") +
labs(x="i", y="F1 Score", title="F1 Score chart")

# Model validation with test data set
predict2 = predict(Sim_Model_1, test_set, type = "response")
summary(predict2)
test_set$prob1 = predict2
sapply(test_set, class)

x3 = ifelse(predict2 > 0.5, 1, 0)
x3 = as.factor(x3)

con_mx_test = confusionMatrix(data = x3, reference = test_set$DEATH)

# Generate a table to get optimum F1 score
list1 = c(1:8)/10
F1_Tb_Test = data.frame()

for (i in list1)
{
x3 = ifelse(predict2 > i, 1, 0)
x3 = as.factor(x3)
con_mx_test1 = confusionMatrix(data = x3, reference = test_set$DEATH)

byClass = data.frame(con_mx_test1$byClass)
Precision = byClass[5,]
Recall = byClass[6,]
F1 = byClass[7,]
a = cbind(i, Precision, Recall, F1)
F1_Tb_Test = rbind(F1_Tb_Test, a)
}

F1_Tb_Test = round(F1_Tb_Test,4)

# Plot F1 Score chart
p2 = ggplot(F1_Tb_Test, aes(x = i, y = F1)) +
geom_line(color="blue") +
labs(x="i", y="F1 Score", title="F1 Score chart")

ggarrange(p1, p2, ncol = 2, nrow = 1)

# Application
# The file below is specially created for prediction purposes.
# It allows the user to key in the input variables; the script then calls the data for analysis

# Load the prediction input file (the file name here is a placeholder)
Application = read_excel("Application.xlsx")

# Define data type of each column
cols = c('Fac_Class', 'DEATH', 'RACEC', 'INH_INJ', 'FLAME' , 'GENDER')
Application[,cols] = lapply( Application[, cols], factor)

# Calculate probability and odds
Predict_App = predict(Sim_Model_1, Application, type = "response")
summary(Predict_App)

Application$prob1 = Predict_App
sapply(Application, class)
Application$Odd_Ratio = Predict_App/(1-Predict_App)

```
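The threshold scan in the script above searches for the cut-off that gives the best F1 score. The same mechanics can be sketched in Python (toy labels and predicted probabilities, made up purely for illustration):

```python
# Sweep the classification threshold and compute precision, recall and F1,
# mirroring the loop over c(1:8)/10 in the R script above
def prf1(y_true, probs, threshold):
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, q in zip(y_true, preds) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, preds) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, preds) if t == 1 and q == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# toy labels and predicted probabilities (hypothetical)
y = [0, 0, 1, 1, 1]
p = [0.2, 0.4, 0.35, 0.8, 0.9]
for thr in [i / 10 for i in range(1, 9)]:
    print(thr, prf1(y, p, thr))
```

The threshold with the highest F1 is then the one to use for classification.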

Result

## Saturday, November 9, 2019

### Time Series Forecasting - Naive Method

I have many years of experience in data analysis, but as a professor put it, I'm framed in descriptive statistics. If I wish to step into the field of data science, I need to acquire some knowledge in machine learning.

I have some C, C++ and Visual Basic programming knowledge (not really a master in programming, but sufficient for my work).

R is the first programming language I came across in data science. That is the reason why I landed myself with R.

Anyway, I'm also picking up Python for no particular reason (because of this, I have not really put much effort into learning it).

Talking about machine learning scripts, Time Series Forecasting (Naive Method) is the first machine learning script I developed, with some help from Google and YouTube.
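The naive method itself is very simple: every forecast is just the last observed value (the `stlf` call in the script applies it to the seasonally adjusted series). A minimal sketch in Python:

```python
# naive method: every future value is forecast as the last observed value
def naive_forecast(series, h):
    return [series[-1]] * h

print(naive_forecast([3, 5, 4, 6], 3))  # [6, 6, 6]
```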

The script will automatically recalculate a best fit every time a new number is entered into the table. It is somehow self-learning; therefore I classify it as machine learning.

Here is my first working machine learning program. You may just plug and play the script if you follow the criteria below:

1. Set up an Excel table with the same header, and store your data in column C1

2. Copy the R script below, set up your own working directory, and run. It should work

```
# Note : the script is tested working (1/10/2019)
# Reference document : https://otexts.com/fpp2/ chapter 6 section 6.8
# Applicable for time series forecasting
# set path to data folder on your computer

# use pacman to load the necessary packages for data science
# (readxl added here to read the Excel data file)
pacman::p_load(readxl, fpp2, ggplot2, caret, ggpubr, rowr, psych)

# load data file and store it in a data table called "Data"
# (the file name here is a placeholder; point it to your own Excel file)
Data = read_excel("Data.xlsx")

# perform auto data transformation: estimate the optimum lambda value
# for the data transformation
lambda_C1 = BoxCox.lambda(Data$C1)

# transform Data with the optimum lambda value
Data$C1_T = (Data$C1+1)**lambda_C1

# set Data as time series data
# you need to tell R that your data is a time series, with the
# number of months within a cycle
TS_Data = ts(Data$C1_T, frequency = 12)

# perform time series forecasting with the "naive" method, with a 5-month forecast
FC_C1 = stlf(TS_Data, method = c("naive"), h = 5)
Ac_C1 = accuracy(FC_C1)

# plot time series chart
# the x-axis could be further improved: how to display in months instead of "t"?
# values shown in the chart are transformed values; you need to invert them
# back to the actual values
FC_Chart_1 = autoplot(FC_C1, series = "TS_Data") +
autolayer(TS_Data, series = "TS_Data") +
xlab("t") + ylab("value") + ggtitle("Time Series Forecasting Plot")

FC_Chart_1

# Alternative way to plot the time series chart, showing actual values
# data preparation for a Tableau data table (this code was originally designed to perform time
# series analysis in R but display the result with Tableau)

# invert the estimated time series data to its original form, in percentage values
a1 = (FC_C1$fitted**(1/lambda_C1)-1)*100
a2 = (FC_C1$mean**(1/lambda_C1)-1)*100

Tb_Data = data.frame(merge.zoo(a1,a2))
Tb_Data = rowr::cbind.fill(Tb_Data,TS_Data, fill = NA)
colnames(Tb_Data) = c("Fit_Value", "FC_Value", "Act_Value")
Tb_Data$Act_Value = (Tb_Data$Act_Value**(1/lambda_C1)-1)*100
Tb_Data$Segment = "Data"
Tb_Data$t = seq.int(from = 1, to = nrow(Tb_Data))
Tb_Data$Month = seq(as.Date("2017/1/1"), by = "month",
length.out = nrow(Tb_Data))

# write data into a csv file
write.csv(Tb_Data,file ="Data.csv",  row.names = TRUE)

# Since I want to demonstrate the result here,
# I use ggplot to plot the time series chart

FC_Chart_2 = ggplot(Tb_Data, aes(x = Month)) +
geom_line(aes(y = Act_Value, colour = "Act_Value")) +
geom_line(aes(y = Fit_Value, colour = "Fit_Value")) +
geom_line(aes(y = FC_Value, colour = "FC_Value")) +
geom_point(aes(y = Act_Value), color = "blue", size = 2) +
geom_point(aes(y = FC_Value), color = "red", size = 2) +
scale_colour_manual("",
breaks = c("Act_Value", "Fit_Value", "FC_Value"),
values = c("blue", "red" ,"grey")) +
labs(x="Month", y="Act Value", title="Time Series Forecasting Plot")

FC_Chart_2

# End
```
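The transform and its inverse used in the script boil down to a simple power round trip. Sketched in Python with a made-up lambda (in the R script, `BoxCox.lambda()` estimates lambda from the data):

```python
# power transform used in the script: y = (x + 1) ** lam
def transform(x, lam):
    return (x + 1) ** lam

# inverse: x = y ** (1 / lam) - 1
def inverse(y, lam):
    return y ** (1 / lam) - 1

lam = 0.3   # hypothetical lambda, for illustration only
x = 41.0
y = transform(x, lam)
print(round(inverse(y, lam), 6))  # 41.0, the round trip recovers the value
```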

Outcome

End

## Thursday, November 7, 2019

### Common Machine Learning Technique

Below is a list of machine learning techniques commonly discussed in the data science community. So far, I only have some knowledge of Linear Regression, Logistic Regression and Time Series Forecasting. Still a long way to go….

Linear Regression
Logistic Regression
Time Series Forecasting

Decision Tree
Random Forest
SVM
Naïve Bayes
KNN
K Means

Dimensionality Reduction Algorithms

GBM
XGBoost
LightGBM
CatBoost

## Tuesday, November 5, 2019

### Big Data & Machine Learning Definition

This is a collection of common Data Science definition
(for my own reference)

What is Big Data?

Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.

But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

What is machine learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

Common machine learning technique

Supervised machine learning - an algorithm used to train a model with a data set consisting of a series of predictors (independent variables) and a target output (dependent variable).

Examples of Supervised Learning are:
Linear Regression
Logistic Regression
Decision Tree
Random Forest
KNN

Unsupervised machine learning - an algorithm used to train a model with a data set that does not have any target outcome variable to predict. It is used for clustering a population into different groups.

Examples of Unsupervised Learning are:
K-means
Apriori Algorithm
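To see what "no target outcome" means in practice, here is a tiny one-dimensional K-means sketch in Python. The data is a toy set with two obvious groups (k = 2), made up for illustration:

```python
def kmeans_1d(points, k, iters=20):
    # start with the first k points as centroids
    centroids = points[:k]
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], 2))  # two centres, near 1 and 9
```

No labels were given; the algorithm finds the group structure by itself.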

Semi-supervised machine learning - algorithms that fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data. Systems that use this method can considerably improve learning accuracy. Usually, semi-supervised learning is chosen when the labeled data requires skilled and relevant resources to train on or learn from, whereas acquiring unlabeled data generally doesn't require additional resources.

Reinforcement machine learning - an algorithm used to train a model to make specific decisions by exposing itself to the environment; the algorithm continuously trains the model by trial and error, learns from past experience, and tries to capture the best possible knowledge to make accurate decisions.

Examples of Reinforcement Learning are:
Markov Decision Process

Machine learning enables analysis of massive quantities of data. While it generally delivers faster, more accurate results for identifying profitable opportunities or dangerous risks, it may also require additional time and resources to train properly. Combining machine learning with AI and cognitive technologies can make it even more effective at processing large volumes of information.

## Saturday, November 2, 2019

### Reset - 1/11/2019

I have kept this blog idle for a while, for whatever reason. Today I decided to switch all my earlier postings to “Draft” and have a restart.

I started to switch my career from “Full Six Sigma” consultant to “Data Scientist” in early 2019. It is time I compile my journey into data science; maybe I can share my experience of how I blend Six Sigma and Data Science together into a new recipe for Problem Solving and Innovation.

Reset here means: I have to offload all my old ideas in Problem Solving and restart like a fresh graduate, to acquire new technology.