Predicting Disaster (Titanic)
Intro
I have started a new course (Analyzing Big Data with Microsoft R) and have an exam soon.
To test my skills, I entered the Kaggle competition Titanic: Predicting Disaster and managed to get a decent score (80%). To run this script, first download the .csv files and set your working directory. I am using the latest Microsoft R Client.
Any review is welcome, but I am most interested in:
- Did I correctly use the RevoScaleR package from Microsoft?
- Any coding mistakes?
- Any way to improve the accuracy of my prediction?
- Is my R style ok?
Code
# Titanic Kaggle Solution
setwd("<your_work_directory>")
library("dplyr")
test <- read.csv("test.csv")
train <- read.csv("train.csv")
# Combine the DataSets to fill in the missing Data
full <- bind_rows(test, train)
######################
# Transform the Data #
######################
# - Get the FamilySize by adding the (Parents + Siblings + 1)
# - Get the Titles, and factorize them (Sir is the same as Jonkheer [Dutch] and Don [Italian], Same for Lady.... )
# - Get Embarked as Factor
full <- rxDataStep(inData = full,
                   transforms = list(FamilySize = as.numeric(Parch + SibSp + 1),
                                     Titles = as.factor(gsub("Don|Jonkheer|Sir", "Sir",
                                                        gsub("Dona|Lady|Madame|the Countess", "Lady",
                                                        gsub("^.*, (.*?)\\..*$", "\\1", Name)))),
                                     Embarked = as.factor(Embarked)
                   )
)
# Sort on PassengerId for missing Data
full <- arrange(full, PassengerId)
##########################
# Fix the missing values #
##########################
sapply(full, function(y) sum(is.na(y)))
# Fare 1
# Embarked 2
# Age 263
# Cabin 1014 (I will drop this, since there are too many missing values)
# Survived 418 (This is the TestData so not actually missing!)
###########################
# 1. Guess the missing fare
fare_na <- full[which(is.na(full$Fare)), "PassengerId"]
full[fare_na, c("Pclass", "Embarked")]
# PClass = 3 & Embarked = S
filtered_fares <- filter(full, Pclass == "3" & Embarked == "S")[, "Fare"]
median_fare <- median(filtered_fares, na.rm = TRUE)
full$Fare[fare_na] <- median_fare
##########################
# 2. Guess the Embarked Port
emb_na <- full[which(is.na(full$Embarked)), "PassengerId"]
full[emb_na, c("Pclass", "Fare")]
# Class = 1 & Fare = 80
full %>% group_by(Embarked, Pclass) %>% filter(Pclass == "1") %>% summarise(mfare = median(Fare), n = n())
# C is most likely (since only 3 from the other closest one)
full$Embarked[emb_na] <- "C"
full$Embarked <- droplevels(full$Embarked)
#########################
# Predict the missing Ages
age_na <- full[which(is.na(full$Age)), "PassengerId"]
age_tree <- rxDTree(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Titles + FamilySize,
                    data = full,
                    rowSelection = !is.na(Age)
)
age_tree <- prune.rxDTree(age_tree, rxDTreeBestCp(age_tree))
predicted_age <- rxPredict(age_tree, data = full)
full$Age[age_na] <- predicted_age$Age_Pred[age_na]
#######################################################################################
# Split Full back into test and train --> Afterwards split train to avoid overfitting #
#######################################################################################
test <- rxDataStep(inData = full, rowSelection = is.na(Survived))
train <- rxDataStep(inData = full, rowSelection = !is.na(Survived))
trainData <- rxDataStep(inData = train,
                        transforms = list(set = factor(ifelse(runif(.rxNumRows) >= 0.2,
                                                              "train",
                                                              "test"
                                                       )
                                                )
                        )
)
dataSet <- rxSplit(trainData, splitByFactor = "set")
#######################
# Predicting survival #
#######################
# Creating a Forest model
titanicForrest <- rxDForest(Survived ~ Sex + Pclass + Fare + Age + Embarked + FamilySize + Titles,
                            data = dataSet$trainData.set.train)
# testData + Predict
testForrest <- rxDataStep(inData = dataSet$trainData.set.test,
                          varsToKeep = c("Survived", "Sex", "Pclass", "Fare", "Age", "Embarked", "FamilySize", "Titles")
)
predictDataForrest <- rxPredict(titanicForrest, data = testForrest)
# Calculate the accuracy on the held-out data (share of correct predictions)
print(mean(round(predictDataForrest$Survived_Pred) == dataSet$trainData.set.test$Survived))
# 0.83 (i.e. 83%)
#######################################
# Create csv to be accepted by Kaggle #
#######################################
kaggleData <- rxPredict(titanicForrest, data = test)
dataFrame <- data.frame(PassengerId = test$PassengerId, Survived = round(kaggleData$Survived_Pred))
write.csv(dataFrame, "Forrest.csv", row.names = FALSE)
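
The hold-out check near the end of the script can also be sketched in base R, which makes it easy to try without RevoScaleR. This is only a minimal sketch under stated assumptions, not the script's own method: the toy data frame, the `holdout_split` and `accuracy` helpers, and the seed value are all made up for illustration. Fixing the seed makes the roughly 80/20 split repeatable, which a bare `runif` call is not.

```r
# Hypothetical helpers for illustration; not part of RevoScaleR or the original script.

# Randomly assign each row to "train" or "test"; a fixed seed makes the split repeatable.
holdout_split <- function(df, test_fraction = 0.2, seed = 42) {
  set.seed(seed)
  is_test <- runif(nrow(df)) < test_fraction
  list(train = df[!is_test, , drop = FALSE],
       test  = df[is_test, , drop = FALSE])
}

# Share of rounded predictions that match the observed labels (accuracy, not loss).
accuracy <- function(predicted_prob, observed) {
  mean(round(predicted_prob) == observed)
}

# Toy data standing in for the Titanic training set (values are made up).
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0),
  Sex      = factor(c("male", "female", "female", "male", "female",
                      "male", "male", "female", "female", "male"))
)

parts <- holdout_split(toy)

# A trivial stand-in "model": survival probability 0.9 for women, 0.1 for men.
predicted <- ifelse(parts$test$Sex == "female", 0.9, 0.1)
print(accuracy(predicted, parts$test$Survived))
```

Calling `holdout_split(toy)` twice with the same seed returns the same rows, so a reported accuracy can be reproduced later; changing `seed` gives a different split for a rough sense of how much the score varies.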
r machine-learning
asked May 24 at 15:23
Ludisposed