Determining whether a list of pathways and its genes are all included in another list

I have a list of pathways, each with its genes, and I want to find out which pathways are entirely included in other pathways. My approach was to write a function that does a single comparison:

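all_in <- function(x, y) {
  # A longer pathway cannot be contained in a shorter one
  if (length(x) > length(y)) {
    0
  } else {
    # 1 if every gene of x is also in y, 0 otherwise
    ifelse(all(x %in% y), 1, 0)
  }
}
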
Then I wrap it in Vectorize so that it can be used with outer:



all_in_vec <- Vectorize(all_in, vectorize.args = c("x", "y"))


And use it as follows:



nested <- outer(paths2genes, paths2genes, all_in_vec)


example:



paths2genes <- structure(list(`1430728` = c("10", "9"), `156580` = c("10", "3", "9"),
    `156582` = c("10", "9"), `194840` = c("2", "3"), `211859` = c("10", "9")),
    .Names = c("1430728", "156580", "156582", "194840", "211859"))


Tests:



library(testthat)
expect_true(all(diag(nested) == 1L))
expect_equal(nested[1, 2], 1L)


However, I recently read that Vectorize should be avoided in favor of built-in vectorization. Is there a built-in method I am missing? How can I vectorize this code?

edited Apr 19 at 3:51 by flodel
asked Apr 18 at 22:06 by Llopis

1 Answer (accepted)

If you add a cat("hello") at the top of your all_in function, you will find that it is called 25 times, once for each pair of pathways. So yes, despite using Vectorize, there is still essentially a big old loop under the hood. Here is how I would write a vectorized function so that the heavy-lifting function (%in%, or in my case, match) is only called once or twice:



all_in_outer <- function(list_x, list_y) {
  # Unique genes of the first list and the size of each of its pathways
  uniq_x <- unique(unlist(list_x, use.names = FALSE))
  len_x <- vapply(list_x, length, integer(1L))
  # Turn a list of pathways into a 0/1 pathway-by-gene matrix
  as_mat <- function(list_a, ids = uniq_x) {
    vec <- unlist(list_a, use.names = FALSE)
    len <- vapply(list_a, length, integer(1L))
    idx <- rep(seq_along(list_a), len)
    mat <- matrix(0L, nrow = length(list_a), ncol = length(ids),
                  dimnames = list(names(list_a), ids))
    mat[cbind(idx, match(vec, ids))] <- 1L
    mat
  }
  # Shared-gene counts per pair, compared with the size of each pathway in list_x
  (as_mat(list_x) %*% t(as_mat(list_y)) == len_x) * 1
}


all_in_outer(paths2genes, paths2genes)
#         1430728 156580 156582 194840 211859
# 1430728       1      1      1      0      1
# 156580        0      1      0      0      0
# 156582        1      1      1      0      1
# 194840        0      0      0      1      0
# 211859        1      1      1      0      1


Some explanation: after finding the unique genes across all pathways in the first argument (10, 9, 3, 2), we turn each argument into a matrix A where A[i,j] = 1 if pathway i contains gene j, and 0 otherwise. For paths2genes, this matrix is as_mat(paths2genes):



#         10 9 3 2
# 1430728  1 1 0 0
# 156580   1 1 1 0
# 156582   1 1 0 0
# 194840   0 0 1 1
# 211859   1 1 0 0


Then, multiplying the first matrix by the transpose of the second gives the number of shared genes for each possible pair of pathways. You then just have to compare that number with the number of genes in the pathway from the first list (len_x): (as_mat(list_x) %*% t(as_mat(list_y)) == len_x).
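To make the counting step concrete, here is a small sketch that writes out the indicator matrix shown above by hand and multiplies it by its transpose. The names A, counts and nested are only for illustration, and len_x is recomputed from the row sums rather than from the original list.

```r
# Indicator matrix from the explanation above: rows are pathways, columns are genes.
A <- matrix(
  c(1L, 1L, 0L, 0L,
    1L, 1L, 1L, 0L,
    1L, 1L, 0L, 0L,
    0L, 0L, 1L, 1L,
    1L, 1L, 0L, 0L),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("1430728", "156580", "156582", "194840", "211859"),
                  c("10", "9", "3", "2"))
)

# counts[i, j] is the number of genes that pathways i and j have in common.
counts <- A %*% t(A)

# Number of genes in each pathway (row sums of the indicator matrix).
len_x <- rowSums(A)

# Pathway i is nested in pathway j when all of its genes are shared.
# len_x is recycled down the columns, so row i is compared with len_x[i].
nested <- (counts == len_x) * 1
nested
```

For instance, counts["1430728", "156580"] is 2, which equals the size of pathway 1430728, so that pathway is nested in 156580; the reverse entry is also 2, but 156580 has three genes, so it is not nested in 1430728.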



In the conversion of each input into a matrix of zeros and ones, see this particular line of code: mat[cbind(idx, match(vec, ids))] <- 1L. This is what fills in the ones. It is completely vectorized, via a single call to match.
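If indexing a matrix with a two-column matrix is unfamiliar, the following sketch isolates that step on made-up data. The pathway names p1 to p3 and the gene ids "a" to "d" are hypothetical, and lengths() is used here in place of vapply(..., length, integer(1L)).

```r
# Three hypothetical pathways with a handful of gene ids.
list_a <- list(p1 = c("a", "b"), p2 = c("b", "c", "d"), p3 = "a")
ids <- c("a", "b", "c", "d")

vec <- unlist(list_a, use.names = FALSE)        # all genes, pathway after pathway
idx <- rep(seq_along(list_a), lengths(list_a))  # row (pathway) number of each gene
col <- match(vec, ids)                          # column (gene) number of each gene

mat <- matrix(0L, nrow = length(list_a), ncol = length(ids),
              dimnames = list(names(list_a), ids))

# A two-column matrix used as an index addresses one cell per row,
# (idx[k], col[k]) for every k, so all ones are set in a single assignment.
mat[cbind(idx, col)] <- 1L
mat
```

Each row of cbind(idx, col) is one (pathway, gene) pair, so the six genes in list_a produce exactly six ones in mat.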






• Clever trick! I would also use lengths instead of vapply(..., length). Perhaps I should have mentioned this in the question, but I also compare two different lists (with different numbers and names of genes and pathways); would this function still work? I am under the impression that using uniq_x in as_mat prevents this usage.
  – Llopis, Apr 19 at 7:29










• No, that's intentional. You are interested in matching the items of the first list, hence uniq_x is an input to both calls of as_mat. And yes, the function is written to take possibly different lists as inputs. Give it a try! And thanks for suggesting lengths; I had not come across it in many years of R programming, so maybe it is a recent addition.
  – flodel, Apr 19 at 11:02
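As a quick check of the two comments above, here is a sketch that calls the function on two different lists. The inputs small and large are made up for illustration, and the function body is the one from the answer with lengths() swapped in for vapply(..., length, integer(1L)), as suggested in the first comment.

```r
all_in_outer <- function(list_x, list_y) {
  uniq_x <- unique(unlist(list_x, use.names = FALSE))
  len_x <- lengths(list_x)  # same values as vapply(list_x, length, integer(1L))
  as_mat <- function(list_a, ids = uniq_x) {
    vec <- unlist(list_a, use.names = FALSE)
    idx <- rep(seq_along(list_a), lengths(list_a))
    mat <- matrix(0L, nrow = length(list_a), ncol = length(ids),
                  dimnames = list(names(list_a), ids))
    # Genes of list_a that are absent from ids give NA in match();
    # with a length-one replacement value those index rows are simply skipped.
    mat[cbind(idx, match(vec, ids))] <- 1L
    mat
  }
  (as_mat(list_x) %*% t(as_mat(list_y)) == len_x) * 1
}

# Hypothetical inputs with different pathway names, sizes and gene sets.
small <- list(s1 = c("10", "9"), s2 = c("2", "7"))
large <- list(l1 = c("10", "3", "9"), l2 = c("2", "3"), l3 = c("7", "2", "8"))

res <- all_in_outer(small, large)
res
```

The result has one row per pathway of small and one column per pathway of large; here s1 is contained in l1 and s2 in l3. Genes that only occur in large (such as "3" and "8") never enter the matrices, which is why building both matrices from uniq_x is enough.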











Answer edited Apr 19 at 4:02
Answered Apr 19 at 3:49 by flodel