Predicting Disaster (Titanic)

Intro



I have started a new course (Analyzing Big Data with Microsoft R) and have an exam soon.



So I wanted to test my skills, and a nice way to do this was the Kaggle competition Titanic: Predicting Disaster. I managed to get a decent score (80%). You can run this script by first downloading the .csv files and setting your working directory. I am using the latest Microsoft R Client.



Any review is welcome, but I am most interested in:



  • Did I correctly use the RevoScaleR package from Microsoft?

  • Any coding mistakes?

  • Any way to improve the accuracy of my prediction?

  • Is my R style OK?

Code



# Titanic Kaggle Solution
setwd("<your_work_directory>")
library("dplyr")

test <- read.csv("test.csv")
train <- read.csv("train.csv")

# Combine the DataSets to fill in the missing Data
full <- bind_rows(test, train)

######################
# Transform the Data #
######################
# - Get FamilySize by adding Parch + SibSp + 1
# - Get the Titles and factorize them (Sir is treated the same as Jonkheer [Dutch] and Don; likewise Dona, Madame and the Countess become Lady)
# - Get Embarked as a factor
full <- rxDataStep(inData = full,
                   transforms = list(FamilySize = as.numeric(Parch + SibSp + 1),
                                     Titles = as.factor(gsub("Don|Jonkheer|Sir", "Sir",
                                                             gsub("Dona|Lady|Madame|the Countess", "Lady",
                                                                  gsub("^.*, (.*?)\\..*$", "\\1", Name)))),
                                     Embarked = as.factor(Embarked)))
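
# Optional sanity check: inspect the extracted titles to confirm the regex and
# the Sir/Lady regrouping worked as intended.
table(full$Titles)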

# Sort on PassengerId so the row index matches the id when filling in missing data
full <- arrange(full, PassengerId)

##########################
# Fix the missing values #
##########################
sapply(full, function(y) sum(is.na(y)))
# Fare 1
# Embarked 2
# Age 263
# Cabin 1014 (I will drop this, since there are too many missing values)
# Survived 418 (This is the TestData so not actually missing!)
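
# The same counts can also be obtained in a single call:
colSums(is.na(full))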

###########################
# 1. Guess the missing fare
fare_na <- full[which(is.na(full$Fare)), 1]
full[fare_na, c("Pclass", "Embarked")]
# PClass = 3 & Embarked = S
filtered_fares <- filter(full, Pclass == "3" & Embarked == "S")[, "Fare"]
median_fare <- median(filtered_fares, na.rm = TRUE)
full$Fare[fare_na] <- median_fare

##########################
# 2. Guess the Embarked Port
emb_na <- full[which(is.na(full$Embarked)), "PassengerId"]
full[emb_na, c("Pclass", "Fare")]
# Class = 1 & Fare = 80
full %>% group_by(Embarked, Pclass) %>% filter(Pclass == "1") %>% summarise(mfare = median(Fare), n = n())
# C is most likely (its median first-class fare is closest to 80)
full$Embarked[c(62, 830)] <- "C"
full$Embarked <- droplevels(full$Embarked)

#########################
# Predict the missing Ages
age_na <- full[which(is.na(full$Age)), "PassengerId"]
age_tree <- rxDTree(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Titles + FamilySize,
                    data = full,
                    rowSelection = !is.na(Age))
age_tree <- prune.rxDTree(age_tree, rxDTreeBestCp(age_tree))
predicted_age <- rxPredict(age_tree, data = full)
full$Age[age_na] <- predicted_age$Age_Pred[age_na]
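
# Optional sanity check: the imputed ages should have a distribution broadly
# similar to the observed ones (age_na holds the rows that were filled in).
summary(full$Age[age_na])
summary(full$Age[-age_na])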

#######################################################################################
# Split Full back into test and train --> Afterwards split train to avoid overfitting #
#######################################################################################
test <- rxDataStep(inData = full, rowSelection = is.na(Survived))
train <- rxDataStep(inData = full, rowSelection = !is.na(Survived))
trainData <- rxDataStep(inData = train,
                        transforms = list(set = factor(ifelse(runif(.rxNumRows) >= 0.2,
                                                              "train",
                                                              "test"))))
dataSet <- rxSplit(trainData, splitByFactor = "set")
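
# Optional sanity check: confirm the rough 80/20 split (this assumes rxSplit
# returned a list of in-memory data frames, as it does for data frame input).
sapply(dataSet, nrow)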

#######################
# Predicting survival #
#######################
# Create a random forest model
titanicForrest <- rxDForest(Survived ~ Sex + Pclass + Fare + Age + Embarked + FamilySize + Titles,
                            data = dataSet$trainData.set.train)

# testData + Predict
testForrest <- rxDataStep(inData = dataSet$trainData.set.test,
                          varsToKeep = c("Survived", "Sex", "Pclass", "Fare", "Age",
                                         "Embarked", "FamilySize", "Titles"))
predictDataForrest <- rxPredict(titanicForrest, data = testForrest)

# Calculate the accuracy on the hold-out set
print(sum(round(predictDataForrest$Survived_Pred) == dataSet$trainData.set.test$Survived) / length(predictDataForrest$Survived_Pred))
# ~0.83 (about 83% correct)
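
# Beyond the single accuracy number, a confusion matrix on the hold-out set
# shows where the model errs (nothing RevoScaleR-specific, just base table()).
predicted_class <- round(predictDataForrest$Survived_Pred)
actual_class <- dataSet$trainData.set.test$Survived
table(actual = actual_class, predicted = predicted_class)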

#######################################
# Create csv to be accepted by Kaggle #
#######################################
kaggleData <- rxPredict(titanicForrest, data = test)
dataFrame <- data.frame(PassengerId = test$PassengerId, Survived = round(kaggleData$Survived_Pred))
write.csv(dataFrame, "Forrest.csv", row.names = FALSE)
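
# Quick check of the submission before uploading: Kaggle expects 418 rows with
# the columns PassengerId and Survived.
str(dataFrame)
nrow(dataFrame)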





