Determining whether a list of pathways and its genes are all included in another list

I am checking whether the pathways in a list (each pathway being a set of genes) are included in other pathways. My approach was to write a function that does a single comparison:
    all_in <- function(x, y) {
      # x can only be nested in y if it is not longer than y
      if (length(x) > length(y)) {
        0
      } else {
        ifelse(all(x %in% y), 1, 0)
      }
    }
Then I use Vectorize so that the function can be passed to outer:
    all_in_vec <- Vectorize(all_in, vectorize.args = c("x", "y"))
And I use it as follows:
    nested <- outer(paths2genes, paths2genes, all_in_vec)
Example data:
    paths2genes <- structure(list(`1430728` = c("10", "9"), `156580` = c("10", "3",
    "9"), `156582` = c("10", "9"), `194840` = c("2", "3"), `211859` = c("10",
    "9")), .Names = c("1430728", "156580", "156582", "194840", "211859"
    ))
Tests:
    library(testthat)
    expect_true(all(diag(nested) == 1L))
    expect_equal(nested[1, 2], 1L)
However, I recently read that Vectorize should be avoided in favor of built-in vectorization. Is there a built-in method I am missing? How can I vectorize this code?
r bioinformatics vectorization
edited Apr 19 at 3:51 by flodel
asked Apr 18 at 22:06 by Llopis
1 Answer (accepted)
If you were to add a cat("hello") at the top of your all_in function, you would find that it is called 25 times, once for each pair of pathways. So yes, despite using Vectorize, you still essentially have a big old loop under the hood. Here is how I would write a vectorized function so that the heavy-lifting function (%in%, or in my case, match) is only called once or twice:
    all_in_outer <- function(list_x, list_y) {
      # genes of the first list define the matrix columns
      uniq_x <- unique(unlist(list_x, use.names = FALSE))
      len_x <- vapply(list_x, length, integer(1L))
      # pathway-by-gene 0/1 matrix for a list of pathways
      as_mat <- function(list_a, ids = uniq_x) {
        vec <- unlist(list_a, use.names = FALSE)
        len <- vapply(list_a, length, integer(1L))
        idx <- rep(seq_along(list_a), len)
        mat <- matrix(0L, nrow = length(list_a), ncol = length(ids),
                      dimnames = list(names(list_a), ids))
        mat[cbind(idx, match(vec, ids))] <- 1L
        mat
      }
      # shared-gene counts per pair, compared with the size of each x pathway
      (as_mat(list_x) %*% t(as_mat(list_y)) == len_x) * 1
    }
    all_in_outer(paths2genes, paths2genes)
    #         1430728 156580 156582 194840 211859
    # 1430728       1      1      1      0      1
    # 156580        0      1      0      0      0
    # 156582        1      1      1      0      1
    # 194840        0      0      0      1      0
    # 211859        1      1      1      0      1
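As a side check of the claim about 25 calls, here is a minimal sketch that counts how often the scalar function runs when it goes through Vectorize and outer. The counter environment `calls` and the name `all_in_counted` are made up for this illustration; the rest follows the code in the question.

```r
# Example data from the question.
paths2genes <- list(`1430728` = c("10", "9"), `156580` = c("10", "3", "9"),
                    `156582` = c("10", "9"), `194840` = c("2", "3"),
                    `211859` = c("10", "9"))

# Hypothetical counter (not in the original code): an environment, so the
# function can update it by reference.
calls <- new.env()
calls$n <- 0L

all_in_counted <- function(x, y) {
  calls$n <- calls$n + 1L   # record one scalar comparison
  if (length(x) > length(y)) 0 else as.numeric(all(x %in% y))
}

all_in_vec <- Vectorize(all_in_counted, vectorize.args = c("x", "y"))
nested <- outer(paths2genes, paths2genes, all_in_vec)

calls$n   # 25: one call per pair of the 5 pathways
```

The count grows with the product of the two list lengths, which is why the matrix-based version pays off on larger inputs.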
Some explanation: after finding the unique genes across all pathways in the first argument (10, 9, 3, 2), we turn each argument into a matrix A where A[i,j] = 1 if pathway i contains gene j, and 0 otherwise. For paths2genes, this matrix is as_mat(paths2genes):
    #         10 9 3 2
    # 1430728  1 1 0 0
    # 156580   1 1 1 0
    # 156582   1 1 0 0
    # 194840   0 0 1 1
    # 211859   1 1 0 0
Then, multiplying one such matrix by the transpose of the other gives the number of genes shared by each pair of pathways. You then just have to compare that number with the number of genes in the pathway from the first list: (as_mat(list_x) %*% t(as_mat(list_y)) == len_x).
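To make the counting step concrete, the sketch below hard-codes the 0/1 matrix shown above (the names `A` and `shared` are just for illustration) and multiplies it by its own transpose. `tcrossprod(A)` is base R shorthand for `A %*% t(A)`.

```r
# Incidence matrix from the explanation: rows are pathways, columns are genes.
A <- matrix(c(1L, 1L, 0L, 0L,
              1L, 1L, 1L, 0L,
              1L, 1L, 0L, 0L,
              0L, 0L, 1L, 1L,
              1L, 1L, 0L, 0L),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("1430728", "156580", "156582", "194840", "211859"),
                            c("10", "9", "3", "2")))

shared <- tcrossprod(A)   # shared[i, j] = number of genes pathways i and j share
len_x  <- rowSums(A)      # number of genes in each pathway

shared["156580", "1430728"]   # 2 genes in common ...
len_x["156580"]               # ... but 156580 has 3 genes, so it is not nested

# len_x is recycled down the rows, so row i is compared with len_x[i].
nested <- (shared == len_x) * 1
```

Row i of `nested` is 1 exactly where every gene of pathway i was matched in pathway j.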
As for converting each input into a matrix of zeros and ones, see this particular line of code: mat[cbind(idx, match(vec, ids))] <- 1L. This is what fills in the ones in the matrix. It is completely vectorized, via a single call to match.
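For readers unfamiliar with indexing by a two-column matrix: each row of the index matrix is read as a (row, column) pair, so one assignment sets many scattered cells at once. A small sketch with made-up pathway and gene names:

```r
# Toy 2 x 3 matrix: pathways p1, p2 and genes a, b, c (all hypothetical).
mat <- matrix(0L, nrow = 2, ncol = 3,
              dimnames = list(c("p1", "p2"), c("a", "b", "c")))

vec <- c("a", "c", "b")   # genes of p1 followed by the gene of p2
idx <- c(1L, 1L, 2L)      # the pathway (row) each gene belongs to

# Each row of cbind(...) is one (row, column) cell to set to 1.
mat[cbind(idx, match(vec, colnames(mat)))] <- 1L
mat
#    a b c
# p1 1 0 1
# p2 0 1 0
```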
Clever trick! I would also use lengths instead of vapply(..., length). Perhaps I should have mentioned this in the question, but I also compare two different lists (with different numbers and names of genes and pathways). Would this function still work? I am under the impression that using uniq_x in as_mat prevents this usage.
– Llopis
Apr 19 at 7:29
No, that's intentional. You are interested in matching the items of the first list, hence uniq_x is an input to both calls of as_mat. And yes, the function is written to take possibly different lists as inputs. Give it a try! And thanks for suggesting lengths; I had not come across it in many years of R programming, maybe it is a recent addition.
– flodel
Apr 19 at 11:02
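Following up on this exchange, here is a sketch of the same function with `lengths()` swapped in for `vapply(..., length, integer(1L))`, applied to two different lists. The lists `small` and `big` and their contents are invented for the example. Genes of `big` that never occur in `small` (here "7") get an NA from `match()`, and an assignment of a length-one value skips NA index rows, so they are ignored.

```r
all_in_outer <- function(list_x, list_y) {
  uniq_x <- unique(unlist(list_x, use.names = FALSE))
  len_x  <- lengths(list_x)
  as_mat <- function(list_a, ids = uniq_x) {
    vec <- unlist(list_a, use.names = FALSE)
    idx <- rep(seq_along(list_a), lengths(list_a))
    mat <- matrix(0L, nrow = length(list_a), ncol = length(ids),
                  dimnames = list(names(list_a), ids))
    mat[cbind(idx, match(vec, ids))] <- 1L   # NA matches are skipped
    mat
  }
  (as_mat(list_x) %*% t(as_mat(list_y)) == len_x) * 1
}

# Made-up inputs: two lists with different pathways and different genes.
small <- list(a = c("10", "9"), b = c("2", "3"))
big   <- list(p = c("10", "9", "3"), q = c("2", "3", "7"), r = c("9"))

res <- all_in_outer(small, big)
res
#   p q r
# a 1 0 0
# b 0 1 0
```

The result has one row per pathway of the first list and one column per pathway of the second, so it is no longer square.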
answered Apr 19 at 3:49 by flodel (edited Apr 19 at 4:02)