Function to calculate Persistence Rate with optional group_by variable and logical arguments

The Function

"Persistence" is sometimes also referred to as "retention". It is defined as the share of units (IDs) in a given term/period that are also found in the subsequent term/period. So, if I have 10 customers in period 1, and 3 of those customers return in period 2, my persistence rate is 30%.

I have written a function that will either:

Calculate the persistence rate for each period's cohort of IDs, if calculate = TRUE.

Create an indicator variable on the original dataframe that identifies whether the ID persisted (1) or not (0), if calculate = FALSE.

Furthermore, if overall = TRUE when calculate = TRUE, it will include the persistence rate over all of the terms.
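As a rough illustration of the two modes, here is a minimal base R sketch; it is not the function itself. The data frame `visits` and the key-matching approach are assumptions made for the example, and rows in the final period are simply marked 0 here, which may differ from how the real function treats them.

```r
# Hypothetical example data: one row per (id, rank) visit.
visits <- data.frame(
  id   = c("1", "2", "3", "4", "1", "2", "3", "1", "2"),
  rank = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Build an "id rank" key for each row and for the same id one period later.
this_key <- paste(visits$id, visits$rank)
next_key <- paste(visits$id, visits$rank + 1)

# 1 if the id shows up again in the next period, 0 otherwise
# (this is the kind of column calculate = FALSE is described as adding).
visits$persistence_indicator <- as.integer(next_key %in% this_key)

# The per-period rate is the mean of that indicator within the period.
rate_rank_1 <- mean(visits$persistence_indicator[visits$rank == 1])
rate_rank_1
```

In this data three of the four IDs seen at rank 1 return at rank 2, so the rate for rank 1 is 0.75.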

The Arguments

Here is a brief description about each of the arguments:

df (REQUIRED): This is the dataframe argument, and a dataframe should be passed to this argument.

id (REQUIRED): This is the unique identification of the observational unit of interest. (Customer ID, Product ID, Student ID, etc.)

rank (REQUIRED): This is the numeric or ordered factor argument that defines the sequence of periods.

period (OPTIONAL): This is the "label", or more interpretable version, of rank. It only makes the output easier to read, if desired. (e.g., "October" is the period, 10 is the ranking number for October)

... (OPTIONAL): Variables to group_by in case a comparison of persistence rates across groups is desired.

overall (REQUIRED w/ DEFAULT): Logical variable to decide whether or not to include an "overall" persistence rate calculation.

calculate (REQUIRED w/ DEFAULT): Logical variable to decide whether to summarize the data into persistence rates, or to create an indicator variable denoting persistence.
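The description of rank allows a numeric or an ordered factor, while the code excerpt further down only tests is.numeric(). One possible way to accept both is sketched below in base R. The helper name `as_rank_index` is hypothetical and is not part of the function as posted.

```r
# Hypothetical helper: turn a rank column into positions that support `rank + 1`.
as_rank_index <- function(x) {
  if (is.ordered(x)) {
    # For an ordered factor, the level order defines the period sequence.
    return(as.integer(x))
  }
  if (is.numeric(x)) {
    return(x)
  }
  stop("`rank` must be numeric or an ordered factor", call. = FALSE)
}

# Example: months as an ordered factor, with October as the first period.
months <- factor(c("Oct", "Nov", "Oct", "Dec"),
                 levels = c("Oct", "Nov", "Dec"), ordered = TRUE)
month_index <- as_rank_index(months)
month_index
```

Converting up front means the rest of the function only ever has to deal with numbers.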

Perceived Improvement Areas

Of course, any and all suggestions for ways to improve this function are greatly appreciated. I do, however, have some areas that I think could be improved; I'm just not sure how.

Grouping the Optional period Argument: In the section that describes what to do if calculate == TRUE, I had to create an if statement to group the variables differently depending on whether the period argument was supplied. Before, there was only one group_by argument, and if I explicitly called all of the arguments, the function would work great. But when I only called the first 3 required arguments, I would get an error. The current version works fine, but is there a better way to conditionally group optional variables?
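One common pattern for this is to collect the grouping variables into a single list of quosures and splice that list into one group_by() call with `!!!`. The sketch below assumes dplyr is installed. `count_by_rank` and the data frame `d` are hypothetical and stand in for the relevant part of persistence(); this is not the original code.

```r
library(dplyr)

# Hypothetical helper: counts rows per group, adding `period` to the
# grouping only when the caller actually supplies it.
count_by_rank <- function(df, rank, period, ...) {
  # Start with the user's grouping variables plus the required rank column.
  group_vars <- quos(..., !!enquo(rank))

  # Append the optional period column if it was passed.
  if (!missing(period)) {
    group_vars <- c(group_vars, quos(!!enquo(period)))
  }

  df %>%
    group_by(!!!group_vars) %>%   # one group_by() call covers both cases
    summarise(count = n()) %>%
    ungroup()
}

# Hypothetical example data.
d <- data.frame(rank = c(1, 1, 2),
                period = c("A", "A", "B"),
                group = c(1, 2, 1),
                stringsAsFactors = FALSE)

without_period <- count_by_rank(d, rank)
with_period <- count_by_rank(d, rank, period, group)
```

Because period comes before `...` in the signature, an unnamed argument in that position is matched to period rather than treated as a grouping variable.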

Conditional overall Argument: In order to calculate the overall persistence, it seems like I have to repeat a lot of code, which could be computationally expensive, and is a little less easy to read than one continuous dplyr chain would be. Is there a more code-efficient way to calculate the overall rate?
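One way to avoid a second chain is to keep the numerator and denominator of each period's rate in the summary and derive the overall rate from their sums. The sketch below assumes dplyr is installed and ignores the optional grouping variables. The `visits` data and the column name `persisted` are assumptions for illustration. It builds the indicator first, so it aims at readability rather than speed.

```r
library(dplyr)

# Hypothetical example data: one row per (id, rank) visit.
visits <- data.frame(
  id   = c("1", "2", "3", "4", "1", "2", "3", "1", "2"),
  rank = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

rates <- visits %>%
  # TRUE when the same id also appears in the next period.
  mutate(persisted = paste(id, rank + 1) %in% paste(id, rank)) %>%
  # The last period has no following period, so it is left out.
  filter(rank < max(rank)) %>%
  group_by(rank) %>%
  summarise(persisted = sum(persisted), count = n()) %>%
  ungroup() %>%
  # Per-period rate, plus the overall rate from the same two columns.
  mutate(persistence_rate = persisted / count,
         overall = sum(persisted) / sum(count))

rates
```

For this data the overall rate is 5/7, because 5 of the 7 ID-period rows in periods 1 and 2 return in the following period.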

What I've Already Tried

I tried to make things a little more efficient by creating the indicator variable first, whether or not calculate == TRUE. Then I just summarised the persistence_indicator by group. But when I used system.time() to compare performance before and after, my current function was more efficient in almost every combination of arguments. In retrospect, this makes sense. Why create that variable if I don't need it when calculate == TRUE?

I also tried posting an earlier version of my function here on Code Review, just to be completely transparent. It didn't get much attention, which is probably fine since the function has changed so much. But I am still interested in general best practices for improving code, especially as it relates to conditionals.

Sample Data

dataFrame <- data.frame(id = as.character(c(1, 2, 3, 4, 1, 2, 3, 1, 2)),
                        period = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
                        rank = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                        group = c(1, 2, 1, 2, 1, 2, 1, 1, 2),
                        stringsAsFactors = FALSE)

The Function Code

persistence <- function(df, id, rank, period, ..., overall = TRUE, calculate = TRUE) {

  stopifnot(!missing(df), !missing(id), !missing(rank))
  period_missing <- missing(period)

  enq_id <- enquo(id)
  enq_rank <- enquo(rank)
  enq_period <- enquo(period)
  enq_group_var <- quos(...)

  valid_rank_type <- is.numeric(rlang::eval_tidy(enq_rank, df))

  # ... (the rest of the function body is not shown)
}

Sample Function Call, Output, and sessionInfo()

library(dplyr)

persistence(df = dataFrame,
            id = id,
            rank = rank,
            period = period,
            group,
            overall = TRUE,
            calculate = TRUE)

# A tibble: 4 x 6
  group  rank period persistence_rate count   overall
  <dbl> <dbl> <chr>             <dbl> <int>     <dbl>
1     1     1 A                   1.0     2 0.7142857
2     2     1 A                   0.5     2 0.7142857
3     1     2 B                   0.5     2 0.7142857
4     2     2 B                   1.0     1 0.7142857

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C 
[5] LC_TIME=English_United States.1252 

attached base packages:
[1] stats graphics grDevices utils datasets methods base 

other attached packages:
[1] bindrcpp_0.2.2 dplyr_0.7.6 

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.3 compiler_3.4.2 magrittr_1.5 assertthat_0.2.0 R6_2.2.2 
 [6] tools_3.4.2 glue_1.2.0 tibble_1.3.4 yaml_2.1.14 Rcpp_0.12.17 
[11] pkgconfig_2.0.1 rlang_0.2.1 purrr_0.2.4 bindr_0.1.1

Final Note

The data I use interactively to test this function has about 15,000 rows, so when I mentioned performance above using system.time(), it was with much more data than the sample data I have provided. The sample data works just fine.

asked Jul 13 at 18:33

MillionC

161

What exactly do you want to improve in this function?
– minem
Jul 16 at 11:23

I want to improve this function by using code that follows best practices. So, what are the best practices for incorporating optional arguments into a dplyr group_by() chain? And what are the best practices for incorporating a conditional statement, as directed by an argument? Does my code follow those practices, or does it get close? Is there a reason I shouldn't use the method I did?
– MillionC
Jul 16 at 15:19
