My Journey in Six Sigma, Lean, Data Science

Saturday, November 16, 2019

Logistic Regression

This is another piece of new knowledge I picked up after attending a Predictive Analysis course at NUS.

Logistic Regression. It is used to predict the odds of being a case based on the values of the independent variables (predictors). The odds are defined as the probability that a particular outcome is a case divided by the probability that it is a noncase. (Wikipedia)


odds = p / (1 - p)
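For example (illustrative numbers, not from the exercise data): if the probability of being a case is 0.8, the odds are 0.8 / 0.2 = 4, meaning the case outcome is four times as likely as the noncase outcome. A quick check in R:

p = 0.8             # probability of being a case
odds = p / (1 - p)  # 4: the case is four times as likely as the noncase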

Below is the take-home exercise given to me, together with the code.

Data source and case description: [shown as images in the original post]

I don't plan to spend too much time discussing the preliminary data assessment and the basic descriptive analysis we need to do before proceeding with any deeper data analysis. It is common sense.


# Note : the script is tested working (1/10/2019)
# Reference : https://stats.idre.ucla.edu/r/dae/logit-regression/
# Reference : https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/
# Applicable for standard logistic regression analysis
# Not tested on any real case

# Set working directory
setwd("set your own working directory")

# Load required libraries
pacman::p_load(tidyverse, lubridate, zoo, forecast, fUnitRoots, readxl,
               fpp2, ggplot2, Ecdat, caret, ggpubr, rpart, caTools, MASS,
               ROCR, oddsratio, broom)

# Load data from the working directory and name the data frame Raw_Data
# Set the qualitative columns as factors
Raw_Data = read.csv("BURN1000_1.csv", header = TRUE)
cols = c('Fac_Class', 'DEATH', 'RACEC', 'INH_INJ', 'FLAME', 'GENDER')
Raw_Data[, cols] = lapply(Raw_Data[, cols], factor)

# Split train & test data with a 70 : 30 ratio
set.seed(1)
split_values = sample.split(Raw_Data$DEATH, SplitRatio = 0.7)
train_set = subset(Raw_Data, split_values == TRUE)
test_set = subset(Raw_Data, split_values == FALSE)
# Perform a quick model assessment without interactions
# Check which are the main effects
factor_screening = DEATH ~ AGE + RACEC + TBSA + FLAME + GENDER + Fac_Class + INH_INJ
Main_Factor = glm(factor_screening, data = Raw_Data, family = "binomial")
summary(Main_Factor)

# Run a full factorial GLM model, for a general overview of the GLM model
Full_Term = DEATH ~ AGE + RACEC + TBSA + FLAME + GENDER + Fac_Class + INH_INJ +
            AGE:RACEC + AGE:TBSA + AGE:FLAME + AGE:GENDER + AGE:Fac_Class + AGE:INH_INJ +
            RACEC:TBSA + RACEC:FLAME + RACEC:GENDER + RACEC:Fac_Class + RACEC:INH_INJ +
            TBSA:FLAME + TBSA:GENDER + TBSA:Fac_Class + TBSA:INH_INJ +
            GENDER:Fac_Class + GENDER:INH_INJ +
            Fac_Class:INH_INJ
Full_Model = glm(Full_Term, data = train_set, family = "binomial")
summary(Full_Model)
# Run stepAIC to find the model giving the minimum AIC value
# stepAIC(Full_Model, direction = 'backward')
Sim_Term_1 = DEATH ~ AGE + RACEC + TBSA + FLAME + GENDER + INH_INJ +
             AGE:INH_INJ + RACEC:FLAME + TBSA:INH_INJ
# Model training
Sim_Model_1 = glm(Sim_Term_1, data = train_set, family = "binomial")
summary(Sim_Model_1)

# Odds ratios for the chosen increment of each predictor
or_glm(data = train_set, model = Sim_Model_1,
       incr = list(AGE = 10, FLAME = 1, TBSA = 10, INH_INJ = 1))

# Model validation with the training data set
predict1 = predict(Sim_Model_1, train_set, type = "response")
summary(predict1)
train_set$prob1 = predict1
sapply(train_set, class)

# Classify with a 0.5 probability threshold and build the confusion matrix
x1 = ifelse(predict1 > 0.5, 1, 0)
x1 = as.factor(x1)
con_mx_train = confusionMatrix(data = x1, reference = train_set$DEATH)
# Generate a table to find the optimum F1 score across thresholds
list1 = c(1:8) / 10
F1_Tb_Train = data.frame()
for (i in list1)
{
  x2 = ifelse(predict1 > i, 1, 0)
  x2 = as.factor(x2)
  con_mx_train1 = confusionMatrix(data = x2, reference = train_set$DEATH)
  byClass = data.frame(con_mx_train1$byClass)
  Precision = byClass[5, ]
  Recall = byClass[6, ]
  F1 = byClass[7, ]
  a = cbind(i, Precision, Recall, F1)
  F1_Tb_Train = rbind(F1_Tb_Train, a)
}
F1_Tb_Train = round(F1_Tb_Train, 4)

# Plot F1 score chart
p1 = ggplot(F1_Tb_Train, aes(x = i, y = F1)) +
     geom_line(color = "blue") +
     labs(x = "i", y = "F1 Score", title = "F1 Score chart")
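# For reference (my addition, not part of the original exercise): the F1 score
# swept above is the harmonic mean of precision and recall,
# F1 = 2 * Precision * Recall / (Precision + Recall).
# A quick standalone check, independent of caret:
f1_score = function(precision, recall) {
  2 * precision * recall / (precision + recall)
}
f1_score(0.8, 0.6)  # 0.6857..., should match the F1 row confusionMatrix reports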
# Model validation with the test data set
predict2 = predict(Sim_Model_1, test_set, type = "response")
summary(predict2)
test_set$prob1 = predict2
sapply(test_set, class)

x3 = ifelse(predict2 > 0.5, 1, 0)
x3 = as.factor(x3)
con_mx_test = confusionMatrix(data = x3, reference = test_set$DEATH)

# Generate a table to find the optimum F1 score across thresholds
list1 = c(1:8) / 10
F1_Tb_Test = data.frame()
for (i in list1)
{
  x3 = ifelse(predict2 > i, 1, 0)
  x3 = as.factor(x3)
  con_mx_test1 = confusionMatrix(data = x3, reference = test_set$DEATH)
  byClass = data.frame(con_mx_test1$byClass)
  Precision = byClass[5, ]
  Recall = byClass[6, ]
  F1 = byClass[7, ]
  a = cbind(i, Precision, Recall, F1)
  F1_Tb_Test = rbind(F1_Tb_Test, a)
}
F1_Tb_Test = round(F1_Tb_Test, 4)

# Plot F1 score chart
p2 = ggplot(F1_Tb_Test, aes(x = i, y = F1)) +
     geom_line(color = "blue") +
     labs(x = "i", y = "F1 Score", title = "F1 Score chart")

# Show the train and test F1 charts side by side
ggarrange(p1, p2, ncol = 2, nrow = 1)

# Application
# The file below is specially created for prediction purposes.
# It allows the user to key in the input variables; the script then calls the data for analysis
Application = read.csv("BURN1000_20_80_1.csv", header = TRUE)

# Define the data type of each column
cols = c('Fac_Class', 'DEATH', 'RACEC', 'INH_INJ', 'FLAME', 'GENDER')
Application[, cols] = lapply(Application[, cols], factor)

# Calculate the probability and the odds, p / (1 - p)
Predict_App = predict(Sim_Model_1, Application, type = "response")
summary(Predict_App)
Application$prob1 = Predict_App
sapply(Application, class)
Application$Odd_Ratio = Predict_App / (1 - Predict_App)

Result: [model output and charts shown as images in the original post]

Saturday, November 9, 2019

Time Series Forecasting - Naive Method

I have many years of experience in data analysis, but as a professor once pointed out, I am framed within descriptive statistics. If I wish to step into the field of data science, I need to acquire some knowledge of machine learning.

I have some C, C++ and Visual Basic programming knowledge (not really a master of programming, but sufficient for my work).

R is the first programming language I came across in data science. That is the reason why I landed myself with R.

Anyway, I'm also picking up Python for no particular reason (and because of this, I have not put much effort into learning it).

Talking about machine learning scripts, Time Series Forecasting (Naive Method) is the first machine learning script I developed, with some help from Google and YouTube.

The script automatically recalculates a best fit every time a new number is entered into the table. It is somewhat self-learning, which is why I classify it as machine learning.
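For context, the naive method simply carries the last observed value forward as the forecast (the stlf() call in the script below applies it to the seasonally adjusted series). A minimal sketch with the forecast package, on toy numbers rather than my Excel data:

library(forecast)
y = ts(c(112, 118, 132, 129, 121, 135), frequency = 12)
naive(y, h = 3)  # all three point forecasts equal the last observed value, 135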

Here is my first working machine learning program. You can just plug and play the script if you follow the criteria below:

1. Set up an Excel table with the same header, and store your data in column C1.


2. Copy the R script below, set your own working directory, and run. It should work.

# Note : the script is tested working (1/10/2019)
# Reference document : https://otexts.com/fpp2/ chapter 6 section 6.8
# Applicable for time series forecasting

# Set the path to the data folder on your computer
setwd("set your own working directory")

# Use pacman to load the necessary packages for data science
pacman::p_load(tidyverse, lubridate, zoo, forecast, fUnitRoots, readxl,
               fpp2, ggplot2, caret, ggpubr, rowr, psych)

# Load the data file and store it in a data frame called "Data"
Data = read_xlsx("Data.xlsx")
# Estimate the optimum lambda value for data transformation
lambda_C1 = BoxCox.lambda(Data$C1)

# Transform the data using the estimated lambda value
Data$C1_T = (Data$C1 + 1)**lambda_C1

# Set the data as a time series
# You need to tell R that the data is a time series, with the
# number of months within a cycle
TS_Data = ts(Data$C1_T, frequency = 12)

# Perform time series forecasting with the "naive" method, with a 5-month forecast
FC_C1 = stlf(TS_Data, method = c("naive"), h = 5)
Ac_C1 = accuracy(FC_C1)
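# Side note (my addition, not part of the original exercise): (C1 + 1)^lambda
# above is a simple power transform driven by the estimated lambda, not the
# textbook Box-Cox formula (y^lambda - 1) / lambda. The forecast package has
# helpers for the standard route; this sketch uses new names so the script
# above is unchanged.
C1_T_std = BoxCox(Data$C1, lambda_C1)      # standard Box-Cox transform
C1_back  = InvBoxCox(C1_T_std, lambda_C1)  # exact inverse, recovers Data$C1
# stlf() also accepts a lambda argument and back-transforms the forecasts itself:
FC_alt = stlf(ts(Data$C1, frequency = 12), method = "naive", h = 5, lambda = lambda_C1)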
# Plot the time series chart
# The x-axis could be further improved: how to display months instead of "t"?
# The values shown in the chart are transformed values; you need to invert
# them back to the actual values
FC_Chart_1 = autoplot(FC_C1, series = "TS_Data") +
             autolayer(TS_Data, series = "TS_Data") +
             xlab("t") + ylab("value") + ggtitle("Time Series Forecasting Plot")
FC_Chart_1

# Alternative way to plot the time series chart, showing the actual values
# Data preparation for a Tableau data table (this code was originally designed
# to perform the time series analysis in R but display the result with Tableau)
# Invert the estimated time series data to its original form in percentage values
a1 = (FC_C1$fitted**(1/lambda_C1) - 1) * 100
a2 = (FC_C1$mean**(1/lambda_C1) - 1) * 100
Tb_Data = data.frame(merge.zoo(a1, a2))
Tb_Data = rowr::cbind.fill(Tb_Data, TS_Data, fill = NA)
colnames(Tb_Data) = c("Fit_Value", "FC_Value", "Act_Value")
Tb_Data$Act_Value = (Tb_Data$Act_Value**(1/lambda_C1) - 1) * 100
Tb_Data$Segment = "Data"
Tb_Data$t = seq.int(from = 1, to = nrow(Tb_Data))
Tb_Data$Month = seq(as.Date("2017/1/1"), by = "month",
                    length.out = nrow(Tb_Data))
# Write the data into a csv file
write.csv(Tb_Data, file = "Data.csv", row.names = TRUE)

# Since I want to demonstrate the result here,
# I use ggplot to plot the time series chart
FC_Chart_2 = ggplot(Tb_Data, aes(x = Month)) +
             geom_line(aes(y = Act_Value, colour = "Act_Value")) +
             geom_line(aes(y = Fit_Value, colour = "Fit_Value")) +
             geom_line(aes(y = FC_Value, colour = "FC_Value")) +
             geom_point(aes(y = Act_Value), color = "blue", size = 2) +
             geom_point(aes(y = FC_Value), color = "red", size = 2) +
             scale_colour_manual("",
                  breaks = c("Act_Value", "Fit_Value", "FC_Value"),
                  values = c("blue", "red", "grey")) +
             labs(x = "Month", y = "Act Value", title = "Time Series Forecasting Plot")
FC_Chart_2

# End

Outcome: [forecast charts shown as images in the original post]

End

Thursday, November 7, 2019

Common Machine Learning Technique


Below is a list of machine learning techniques commonly discussed in the data science community. So far, I only have some knowledge of Linear Regression, Logistic Regression and Time Series Forecasting. Still a long way to go….

 Linear Regression
 Logistic Regression
 Time Series Forecasting

 Decision Tree
 Random Forest
 SVM
 Naïve Bayes
 KNN
 K Means

 Dimensionality Reduction Algorithms

 Gradient Boosting Algorithms
   GBM
   XGBoost
   LightGBM
   CatBoost

Tuesday, November 5, 2019

Big Data & Machine Learning Definition

This is a collection of common Data Science definitions
(for my own reference).

What is Big Data?

Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. 


But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. 

What is machine learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.


Common machine learning technique

Supervised machine learning - an algorithm used to train a model with a data set consisting of a series of predictors (independent variables) and a target output (dependent variable). A minimal R sketch follows the examples below.


Examples of Supervised Learning are:
Linear Regression
Logistic Regression
Decision Tree
Random Forest
KNN
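As the minimal sketch promised above (using R's built-in mtcars data, purely for illustration), the model learns the mapping from a predictor to a known target:

# Supervised learning: predictor (wt) and a known target (mpg)
model = lm(mpg ~ wt, data = mtcars)   # linear regression
predict(model, data.frame(wt = 3.0))  # predict the target for a new input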

Unsupervised machine learning - an algorithm used to train a model with a data set that does not have any target outcome variable to predict. It is used for clustering a population into different groups; see the sketch after the examples below.

Examples of Unsupervised Learning are:
K-means
Apriori Algorithm
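And the unsupervised counterpart (again on built-in data): there is no target variable, the algorithm only groups similar rows.

# Unsupervised learning: k-means clustering with no target variable
set.seed(1)
clusters = kmeans(iris[, 1:4], centers = 3)  # group the rows into 3 clusters
table(clusters$cluster)                      # sizes of the discovered groups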

Semi-supervised machine learning - algorithms that fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data. Systems that use this method can considerably improve learning accuracy. Usually, semi-supervised learning is chosen when labeling the acquired data requires skilled and relevant resources, whereas acquiring unlabeled data generally doesn't require additional resources.

Reinforcement machine learning - an algorithm used to train a model to make specific decisions by exposing itself to an environment; the algorithm continuously trains the model by trial and error, learns from past experience, and tries to capture the best possible knowledge to make accurate decisions.

Examples of Reinforcement Learning are:
Markov Decision Process

Machine learning enables the analysis of massive quantities of data. While it generally delivers faster, more accurate results for identifying profitable opportunities or dangerous risks, it may also require additional time and resources to train properly. Combining machine learning with AI and cognitive technologies can make it even more effective at processing large volumes of information.

Saturday, November 2, 2019

Reset - 1/11/2019


I have kept this blog idle for a while, for whatever reason. Today I decided to switch all my earlier postings to “Draft” and have a restart.


I started to switch my career from “Full Six Sigma” consultant to “Data Scientist” in early 2019. It is time I compiled my journey into data science; maybe I could share my experience of how I blend Six Sigma and Data Science together into a new recipe for Problem Solving and Innovation.

Reset here means: I have to offload all my old ideas in Problem Solving, and restart again like a fresh graduate, to acquire new technology.