Predicting Disaster (Titanic)

Intro



I have started a new course (Analyzing Big Data with Microsoft R) and have an exam soon.



So I wanted to test my skills, and a nice way to do this was the Kaggle competition Titanic: Predicting Disaster. I managed to get a decent score (80%). You can run this script by first downloading the .csv files and setting your working directory. I am using the latest Microsoft R Client.



Any review is welcome, but I am most interested in:



  • Did I correctly use the RevoScaleR package from Microsoft?

  • Any coding mistakes?

  • Any way to improve the accuracy of my prediction?

  • Is my R style OK?

Code



# Titanic Kaggle Solution
setwd("<your_work_directory>")
library("dplyr")

test <- read.csv("test.csv")
train <- read.csv("train.csv")

# Combine the DataSets to fill in the missing Data
full <- bind_rows(test, train)

######################
# Transform the Data #
######################
# - Get FamilySize by adding Parch + SibSp + 1
# - Get the Titles and factorize them (Sir is treated the same as Jonkheer [Dutch] and Don; likewise Dona, Madame and the Countess become Lady)
# - Get Embarked as a factor
full <- rxDataStep(inData = full,
                   transforms = list(FamilySize = as.numeric(Parch + SibSp + 1),
                                     Titles = as.factor(gsub("Don|Jonkheer|Sir", "Sir",
                                                             gsub("Dona|Lady|Madame|the Countess", "Lady",
                                                                  gsub("^.*, (.*?)\\..*$", "\\1", Name)))),
                                     Embarked = as.factor(Embarked)))
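
# Optional sanity check: inspect the extracted titles to confirm the regex and
# the Sir/Lady regrouping worked as intended.
table(full$Titles)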

# Sort on PassengerId so the row index matches the id when filling in missing data
full <- arrange(full, PassengerId)

##########################
# Fix the missing values #
##########################
sapply(full, function(y) sum(is.na(y)))
# Fare 1
# Embarked 2
# Age 263
# Cabin 1014 (I will drop this, since there are too many missing values)
# Survived 418 (This is the TestData so not actually missing!)
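
# The same counts can also be obtained in a single call:
colSums(is.na(full))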

###########################
# 1. Guess the missing fare
fare_na <- full[which(is.na(full$Fare)), 1]
full[fare_na, c("Pclass", "Embarked")]
# PClass = 3 & Embarked = S
filtered_fares <- filter(full, Pclass == "3" & Embarked == "S")[, "Fare"]
median_fare <- median(filtered_fares, na.rm = TRUE)
full$Fare[fare_na] <- median_fare

##########################
# 2. Guess the Embarked Port
emb_na <- full[which(is.na(full$Embarked)), "PassengerId"]
full[emb_na, c("Pclass", "Fare")]
# Class = 1 & Fare = 80
full %>% group_by(Embarked, Pclass) %>% filter(Pclass == "1") %>% summarise(mfare = median(Fare), n = n())
# C is most likely (its median first-class fare is closest to 80)
full$Embarked[c(62, 830)] <- "C"
full$Embarked <- droplevels(full$Embarked)

#########################
# Predict the missing Ages
age_na <- full[which(is.na(full$Age)), "PassengerId"]
age_tree <- rxDTree(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Titles + FamilySize,
                    data = full,
                    rowSelection = !is.na(Age))
age_tree <- prune.rxDTree(age_tree, rxDTreeBestCp(age_tree))
predicted_age <- rxPredict(age_tree, data = full)
full$Age[age_na] <- predicted_age$Age_Pred[age_na]
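
# Optional sanity check: the imputed ages should have a distribution broadly
# similar to the observed ones (age_na holds the rows that were filled in).
summary(full$Age[age_na])
summary(full$Age[-age_na])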

#######################################################################################
# Split Full back into test and train --> Afterwards split train to avoid overfitting #
#######################################################################################
test <- rxDataStep(inData = full, rowSelection = is.na(Survived))
train <- rxDataStep(inData = full, rowSelection = !is.na(Survived))
trainData <- rxDataStep(inData = train,
                        transforms = list(set = factor(ifelse(runif(.rxNumRows) >= 0.2,
                                                              "train",
                                                              "test"))))
dataSet <- rxSplit(trainData, splitByFactor = "set")
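
# Optional sanity check: confirm the rough 80/20 split (this assumes rxSplit
# returned a list of in-memory data frames, as it does for data frame input).
sapply(dataSet, nrow)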

#######################
# Predicting survival #
#######################
# Create a random forest model
titanicForrest <- rxDForest(Survived ~ Sex + Pclass + Fare + Age + Embarked + FamilySize + Titles,
                            data = dataSet$trainData.set.train)

# testData + Predict
testForrest <- rxDataStep(inData = dataSet$trainData.set.test,
                          varsToKeep = c("Survived", "Sex", "Pclass", "Fare", "Age",
                                         "Embarked", "FamilySize", "Titles"))
predictDataForrest <- rxPredict(titanicForrest, data = testForrest)

# Calculate the accuracy on the hold-out set
print(sum(round(predictDataForrest$Survived_Pred) == dataSet$trainData.set.test$Survived) / length(predictDataForrest$Survived_Pred))
# ~0.83 (about 83% correct)
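
# Beyond the single accuracy number, a confusion matrix on the hold-out set
# shows where the model errs (nothing RevoScaleR-specific, just base table()).
predicted_class <- round(predictDataForrest$Survived_Pred)
actual_class <- dataSet$trainData.set.test$Survived
table(actual = actual_class, predicted = predicted_class)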

#######################################
# Create csv to be accepted by Kaggle #
#######################################
kaggleData <- rxPredict(titanicForrest, data = test)
dataFrame <- data.frame(PassengerId = test$PassengerId, Survived = round(kaggleData$Survived_Pred))
write.csv(dataFrame, "Forrest.csv", row.names = FALSE)
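
# Quick check of the submission before uploading: Kaggle expects 418 rows with
# the columns PassengerId and Survived.
str(dataFrame)
nrow(dataFrame)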





