Build clusters from connections (virus propagation)

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
8
down vote

favorite
2












The scenario is the following: There is a list of persons and places where those persons visit. Each person is connected to a single or multiple places. See the picture below, where O[i] represents objects (persons) and P[i] places, respectively.



I need to get the list of places connected by those persons.



Here **O** represents objects (persons) and **P** places, respectively



Codes:



class Program

static void Main(string args)

var Connections = new List<Tuple<string, List<string>>>()

new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
new Tuple<string, List<string>>("O6", new List<string>() "P4"),
new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
;

var Clusters = Connections.Select(x => x.Item2).ToList();

bool proceed = true;
do

var pairs = Clusters.SelectMany(x => Clusters, (x, y) =>
Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

proceed = false;
foreach (var pair in pairs)

var cluster1 = pair.Item1;
var cluster2 = pair.Item2;

if (cluster1.Intersect(cluster2).Any())

cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x)));
Clusters.Remove(cluster2);
proceed = true;
break;


while (proceed);

foreach (var cluster in Clusters.OrderByDescending(x => x.Count()))

Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

Console.Read();




Output:



P10, P11, P12, P5, P6, P7
P1, P2, P3, P4
P8, P9
P13


Questions:



  • Is this optimal way to solve the given problem for larger data?


  • How can I estimate the computational complexity of those operations?


  • Am I doing here something potentially dangerous or excessive with the collections?


  • Any other concerns or remarks?


Any feedback would be appreciated.







share|improve this question





















  • Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
    – Vogel612♦
    Jun 29 at 10:54










  • As Vogel612 says, if you want to ask for futher help, you'll need a different question, possibly on a different Stack Exchange site. However, I did link to a Disjoints Set implementation I wrote as part of my answer (which is here) (which you are welcome to use, but I can't promise it's perfect). There is also mathblog.dk/disjoint-set-data-structure which is linked from the Wikipedia article.
    – VisualMelon
    Jun 29 at 12:46

















up vote
8
down vote

favorite
2












The scenario is the following: There is a list of persons and places where those persons visit. Each person is connected to a single or multiple places. See the picture below, where O[i] represents objects (persons) and P[i] places, respectively.



I need to get the list of places connected by those persons.



Here **O** represents objects (persons) and **P** places, respectively



Codes:



class Program

static void Main(string args)

var Connections = new List<Tuple<string, List<string>>>()

new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
new Tuple<string, List<string>>("O6", new List<string>() "P4"),
new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
;

var Clusters = Connections.Select(x => x.Item2).ToList();

bool proceed = true;
do

var pairs = Clusters.SelectMany(x => Clusters, (x, y) =>
Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

proceed = false;
foreach (var pair in pairs)

var cluster1 = pair.Item1;
var cluster2 = pair.Item2;

if (cluster1.Intersect(cluster2).Any())

cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x)));
Clusters.Remove(cluster2);
proceed = true;
break;


while (proceed);

foreach (var cluster in Clusters.OrderByDescending(x => x.Count()))

Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

Console.Read();




Output:



P10, P11, P12, P5, P6, P7
P1, P2, P3, P4
P8, P9
P13


Questions:



  • Is this optimal way to solve the given problem for larger data?


  • How can I estimate the computational complexity of those operations?


  • Am I doing here something potentially dangerous or excessive with the collections?


  • Any other concerns or remarks?


Any feedback would be appreciated.







share|improve this question





















  • Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
    – Vogel612♦
    Jun 29 at 10:54










  • As Vogel612 says, if you want to ask for futher help, you'll need a different question, possibly on a different Stack Exchange site. However, I did link to a Disjoints Set implementation I wrote as part of my answer (which is here) (which you are welcome to use, but I can't promise it's perfect). There is also mathblog.dk/disjoint-set-data-structure which is linked from the Wikipedia article.
    – VisualMelon
    Jun 29 at 12:46













up vote
8
down vote

favorite
2









up vote
8
down vote

favorite
2






2





The scenario is the following: There is a list of persons and places where those persons visit. Each person is connected to a single or multiple places. See the picture below, where O[i] represents objects (persons) and P[i] places, respectively.



I need to get the list of places connected by those persons.



Here **O** represents objects (persons) and **P** places, respectively



Codes:



class Program

static void Main(string args)

var Connections = new List<Tuple<string, List<string>>>()

new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
new Tuple<string, List<string>>("O6", new List<string>() "P4"),
new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
;

var Clusters = Connections.Select(x => x.Item2).ToList();

bool proceed = true;
do

var pairs = Clusters.SelectMany(x => Clusters, (x, y) =>
Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

proceed = false;
foreach (var pair in pairs)

var cluster1 = pair.Item1;
var cluster2 = pair.Item2;

if (cluster1.Intersect(cluster2).Any())

cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x)));
Clusters.Remove(cluster2);
proceed = true;
break;


while (proceed);

foreach (var cluster in Clusters.OrderByDescending(x => x.Count()))

Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

Console.Read();




Output:



P10, P11, P12, P5, P6, P7
P1, P2, P3, P4
P8, P9
P13


Questions:



  • Is this optimal way to solve the given problem for larger data?


  • How can I estimate the computational complexity of those operations?


  • Am I doing here something potentially dangerous or excessive with the collections?


  • Any other concerns or remarks?


Any feedback would be appreciated.







share|improve this question













The scenario is the following: There is a list of persons and places where those persons visit. Each person is connected to a single or multiple places. See the picture below, where O[i] represents objects (persons) and P[i] places, respectively.



I need to get the list of places connected by those persons.



Here **O** represents objects (persons) and **P** places, respectively



Codes:



class Program

static void Main(string args)

var Connections = new List<Tuple<string, List<string>>>()

new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
new Tuple<string, List<string>>("O6", new List<string>() "P4"),
new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
;

var Clusters = Connections.Select(x => x.Item2).ToList();

bool proceed = true;
do

var pairs = Clusters.SelectMany(x => Clusters, (x, y) =>
Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

proceed = false;
foreach (var pair in pairs)

var cluster1 = pair.Item1;
var cluster2 = pair.Item2;

if (cluster1.Intersect(cluster2).Any())

cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x)));
Clusters.Remove(cluster2);
proceed = true;
break;


while (proceed);

foreach (var cluster in Clusters.OrderByDescending(x => x.Count()))

Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

Console.Read();




Output:



P10, P11, P12, P5, P6, P7
P1, P2, P3, P4
P8, P9
P13


Questions:



  • Is this optimal way to solve the given problem for larger data?


  • How can I estimate the computational complexity of those operations?


  • Am I doing here something potentially dangerous or excessive with the collections?


  • Any other concerns or remarks?


Any feedback would be appreciated.









share|improve this question












share|improve this question




share|improve this question








edited Jun 29 at 10:53









Vogel612♦

20.9k345124




20.9k345124









asked Jun 25 at 0:03









L_J

1837




1837











  • Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
    – Vogel612♦
    Jun 29 at 10:54










  • As Vogel612 says, if you want to ask for futher help, you'll need a different question, possibly on a different Stack Exchange site. However, I did link to a Disjoints Set implementation I wrote as part of my answer (which is here) (which you are welcome to use, but I can't promise it's perfect). There is also mathblog.dk/disjoint-set-data-structure which is linked from the Wikipedia article.
    – VisualMelon
    Jun 29 at 12:46

















  • Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
    – Vogel612♦
    Jun 29 at 10:54










  • As Vogel612 says, if you want to ask for futher help, you'll need a different question, possibly on a different Stack Exchange site. However, I did link to a Disjoints Set implementation I wrote as part of my answer (which is here) (which you are welcome to use, but I can't promise it's perfect). There is also mathblog.dk/disjoint-set-data-structure which is linked from the Wikipedia article.
    – VisualMelon
    Jun 29 at 12:46
















Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
– Vogel612♦
Jun 29 at 10:54




Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
– Vogel612♦
Jun 29 at 10:54












As Vogel612 says, if you want to ask for futher help, you'll need a different question, possibly on a different Stack Exchange site. However, I did link to a Disjoints Set implementation I wrote as part of my answer (which is here) (which you are welcome to use, but I can't promise it's perfect). There is also mathblog.dk/disjoint-set-data-structure which is linked from the Wikipedia article.
– VisualMelon
Jun 29 at 12:46





As Vogel612 says, if you want to ask for futher help, you'll need a different question, possibly on a different Stack Exchange site. However, I did link to a Disjoints Set implementation I wrote as part of my answer (which is here) (which you are welcome to use, but I can't promise it's perfect). There is also mathblog.dk/disjoint-set-data-structure which is linked from the Wikipedia article.
– VisualMelon
Jun 29 at 12:46











3 Answers
3






active

oldest

votes

















up vote
5
down vote



accepted










I always feel odd reviewing code like this, because there is no API. You should really pull out the clustering code into it's own method, with a nice signature, something like:



IEnumerable<IReadOnlyCollection<string>> Cluster(IEnumerable<Person> people);


This will make it reusable (and so testable).



You ask "Am I doing here something potentially dangerous or excessive with the collections?". I'm not entirely sure what you mean... but I think no. Collections are efficient, and it's hard to do anything dangerous in C# by accident. If anything, you should probably be using more powerful collections.



Tuples



What is Tuple<string, List<string>>? What are Item1 and Item2? Give it a meaningful type, with meaningful member names!



public class Person

public Person(string name, IReadOnlyCollection<string> places)

Name = name;
Places = places;


public string Name get;
public IReadOnlyCollection<string> Places get;



Complexity



Never been much good at complexity... but here's an attempt at answering your second question. Take it with spoonful of salt. Hopefully someone will poke holes in it promptly.



First you need to identify the parameters you care about. Let's say n for the number of people and 'b' number of places. You iterate over pairs of people, which gives you n^2 (just to enumerate the pairs).



Inside that loop you are looking for an intersection between 2 sets of up to b elements, which can be no better than O(b) (that's the case where the hashes turn out perfectly, assuming Intersect converts one of the lists to a HashSet internally). cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x))) is O(b^2). You ought instead be using HashSets from the outset, which will get the inner working down to something more like O(b) again (i.e. HashSet<T>.UnionWith(HashSet<T>) is kind of just about linear most of the time). ... but then that matching can only happen n times... so it's a separate concern.



Then there is the outer most loop... which I don't want to think about... but in the worst case it will be run n times, giving you a worst-case complexity of O(n*n*n*b + n*n*b*b), or just about O(n*n*n*b) if you use HashSets.



I can't think how the space complexity of your code could be anything other than the input space complexity of O(b + n).



Technically, your printing out at the end has third order time complexity: O(m*k*k) in number of clusters m, and number of places per cluster k. Consider using a StringBuilder or string.Join instead of Aggregate and repeated concatenation (that will make it O(m*k)).



Disjoint Sets



Any problem like this I would solve with disjoint sets, and even a poor implementation should get you O(n*b) time-complexity (max tree-depth is O(n)) on this problem.



The basic idea is that you maintain a table pointers which describe a set of trees, each tree being an independent (disjoint) set. You can follow the pointers from any member of a set to the root member, and by assigning the parent of one root node to the root of another you join the two sets together. You can throw an implementation together in no time, with very little code.



Some very unscientific benchmarking suggests that a somewhat untested disjoint sets implementation I just threw together is a couple of orders of magnitude faster than your implementation (using HashSets and with reduced Pair count, as suggested in Paparazzo's answer) on randomly generated problems. Your implementation is taking over 10seconds on my machine to solve a problem with b=4000, which the disjoint sets solve in a few milliseconds. The disjoint set version seems to be linear in n*b (easier to scale than just b in random problems); yours is taking a while for larger problems larger than n*b=5000, and I wouldn't wish to comment on the experimental time-complexity with these few (noisy) results.



A gist of the disjoint sets implementation and cluster method is provided so that people can find juicy bugs in it: 1 bug fixed so far (2018-06-26)






share|improve this answer






























    up vote
    3
    down vote













    Note that a decent amount of this is preference but I do provide reasoning for it below (I try to encourage improved code readability).



    Consistency



    I would make bool proceed = false; into var proceed = false; to keep consistent with your usage of var.



    Naming



    Booleans usually have a prefix of is, has, will, etc. So perhaps willProceed is more fitting. Ultimately, it might be best to name it something that describes the scenario in which we will proceed. In this case, I believe you proceed as long has you have found intersecting pairs/clusters. So maybe hasIntersectingPairs or hasIntersectingClusters it's not as short, but it definitely is more clear on its intentions.



    Casing



    Typically you would make local variables such as Connections and Clusters lower-case (i.e. var connections = ...; and var clusters = ...;).



    Unnecessary Foreach



    I might be wrong, but I'm pretty sure your foreach statement can be replaced with LINQs FirstOrDefault and an if-statement.



    var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
    if (cluster != null)

    cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
    clusters.Remove(cluster.Item2);
    willProceed = true;



    No need for a break and it's a bit more clear what the possible paths your program can take, since your foreach loop will either find an intersecting pair and stop (via break) or will find no intersecting pair and continue until out of pairs.



    Additionally naming it cluster (instead of pair) makes the naming line up with cluster1 -> cluster.Item1 and cluster2 -> cluster.Item2 without having to make unnecessary variables. If you prefer the extra variables I would then suggest you take VisualMelon's advice and make a meaningful type instead of using Tuple<>.



    do/while vs while



    You declare bool proceed = true; then use a do/while loop as if it was initialized as false, but then immediately set it to false inside the loop. The following should be a bit more straight forward and accomplish the same thing,



    var hasIntersectingPairs = true;
    while (hasIntersectingPairs)

    hasIntersectingPairs = false;
    ...



    Putting My Suggestions Together



    static void Main(string args)

    var connections = new List<Tuple<string, List<string>>>()

    new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
    new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
    new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
    new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
    new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
    new Tuple<string, List<string>>("O6", new List<string>() "P4"),
    new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
    new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
    new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
    new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
    new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
    ;
    var clusters = connections.Select(x => x.Item2).ToList();

    var hasIntersectingPairs = true;
    while (hasIntersectingPairs)

    hasIntersectingPairs = false;
    var pairs = clusters.SelectMany(x => clusters, (x, y) => Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

    var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
    if (cluster != null)

    cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
    clusters.Remove(cluster.Item2);
    hasIntersectingPairs = true;



    foreach (var cluster in clusters.OrderByDescending(x => x.Count()))

    Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

    Console.Read();



    Output still Matches



    P10, P11, P12, P5, P6, P7
    P1, P2, P3, P4
    P8, P9
    P13





    share|improve this answer




























      up vote
      1
      down vote













      Look good to me.



      Not sure it matters but you are creating all pairs. (A,B) and (B,A). If you only created unique combinations you would cut the count in 1/2.






      share|improve this answer























        Your Answer




        StackExchange.ifUsing("editor", function ()
        return StackExchange.using("mathjaxEditing", function ()
        StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
        StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
        );
        );
        , "mathjax-editing");

        StackExchange.ifUsing("editor", function ()
        StackExchange.using("externalEditor", function ()
        StackExchange.using("snippets", function ()
        StackExchange.snippets.init();
        );
        );
        , "code-snippets");

        StackExchange.ready(function()
        var channelOptions =
        tags: "".split(" "),
        id: "196"
        ;
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function()
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled)
        StackExchange.using("snippets", function()
        createEditor();
        );

        else
        createEditor();

        );

        function createEditor()
        StackExchange.prepareEditor(
        heartbeatType: 'answer',
        convertImagesToLinks: false,
        noModals: false,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: null,
        bindNavPrevention: true,
        postfix: "",
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        );



        );








         

        draft saved


        draft discarded


















        StackExchange.ready(
        function ()
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f197189%2fbuild-clusters-from-connections-virus-propagation%23new-answer', 'question_page');

        );

        Post as a guest






























        3 Answers
        3






        active

        oldest

        votes








        3 Answers
        3






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes








        up vote
        5
        down vote



        accepted










        I always feel odd reviewing code like this, because there is no API. You should really pull out the clustering code into it's own method, with a nice signature, something like:



        IEnumerable<IReadOnlyCollection<string>> Cluster(IEnumerable<Person> people);


        This will make it reusable (and so testable).



        You ask "Am I doing here something potentially dangerous or excessive with the collections?". I'm not entirely sure what you mean... but I think no. Collections are efficient, and it's hard to do anything dangerous in C# by accident. If anything, you should probably be using more powerful collections.



        Tuples



        What is Tuple<string, List<string>>? What are Item1 and Item2? Give it a meaningful type, with meaningful member names!



        public class Person

        public Person(string name, IReadOnlyCollection<string> places)

        Name = name;
        Places = places;


        public string Name get;
        public IReadOnlyCollection<string> Places get;



        Complexity



        Never been much good at complexity... but here's an attempt at answering your second question. Take it with spoonful of salt. Hopefully someone will poke holes in it promptly.



        First you need to identify the parameters you care about. Let's say n for the number of people and 'b' number of places. You iterate over pairs of people, which gives you n^2 (just to enumerate the pairs).



        Inside that loop you are looking for an intersection between 2 sets of up to b elements, which can be no better than O(b) (that's the case where the hashes turn out perfectly, assuming Intersect converts one of the lists to a HashSet internally). cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x))) is O(b^2). You ought instead be using HashSets from the outset, which will get the inner working down to something more like O(b) again (i.e. HashSet<T>.UnionWith(HashSet<T>) is kind of just about linear most of the time). ... but then that matching can only happen n times... so it's a separate concern.



        Then there is the outer most loop... which I don't want to think about... but in the worst case it will be run n times, giving you a worst-case complexity of O(n*n*n*b + n*n*b*b), or just about O(n*n*n*b) if you use HashSets.



        I can't think how the space complexity of your code could be anything other than the input space complexity of O(b + n).



        Technically, your printing out at the end has third order time complexity: O(m*k*k) in number of clusters m, and number of places per cluster k. Consider using a StringBuilder or string.Join instead of Aggregate and repeated concatenation (that will make it O(m*k)).



        Disjoint Sets



        Any problem like this I would solve with disjoint sets, and even a poor implementation should get you O(n*b) time-complexity (max tree-depth is O(n)) on this problem.



        The basic idea is that you maintain a table pointers which describe a set of trees, each tree being an independent (disjoint) set. You can follow the pointers from any member of a set to the root member, and by assigning the parent of one root node to the root of another you join the two sets together. You can throw an implementation together in no time, with very little code.



        Some very unscientific benchmarking suggests that a somewhat untested disjoint sets implementation I just threw together is a couple of orders of magnitude faster than your implementation (using HashSets and with reduced Pair count, as suggested in Paparazzo's answer) on randomly generated problems. Your implementation is taking over 10seconds on my machine to solve a problem with b=4000, which the disjoint sets solve in a few milliseconds. The disjoint set version seems to be linear in n*b (easier to scale than just b in random problems); yours is taking a while for larger problems larger than n*b=5000, and I wouldn't wish to comment on the experimental time-complexity with these few (noisy) results.



        A gist of the disjoint sets implementation and cluster method is provided so that people can find juicy bugs in it: 1 bug fixed so far (2018-06-26)






        share|improve this answer



























          up vote
          5
          down vote



          accepted










          I always feel odd reviewing code like this, because there is no API. You should really pull out the clustering code into it's own method, with a nice signature, something like:



          IEnumerable<IReadOnlyCollection<string>> Cluster(IEnumerable<Person> people);


          This will make it reusable (and so testable).



          You ask "Am I doing here something potentially dangerous or excessive with the collections?". I'm not entirely sure what you mean... but I think no. Collections are efficient, and it's hard to do anything dangerous in C# by accident. If anything, you should probably be using more powerful collections.



          Tuples



          What is Tuple<string, List<string>>? What are Item1 and Item2? Give it a meaningful type, with meaningful member names!



          public class Person

          public Person(string name, IReadOnlyCollection<string> places)

          Name = name;
          Places = places;


          public string Name get;
          public IReadOnlyCollection<string> Places get;



          Complexity



          Never been much good at complexity... but here's an attempt at answering your second question. Take it with spoonful of salt. Hopefully someone will poke holes in it promptly.



          First you need to identify the parameters you care about. Let's say n for the number of people and 'b' number of places. You iterate over pairs of people, which gives you n^2 (just to enumerate the pairs).



          Inside that loop you are looking for an intersection between 2 sets of up to b elements, which can be no better than O(b) (that's the case where the hashes turn out perfectly, assuming Intersect converts one of the lists to a HashSet internally). cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x))) is O(b^2). You ought instead be using HashSets from the outset, which will get the inner working down to something more like O(b) again (i.e. HashSet<T>.UnionWith(HashSet<T>) is kind of just about linear most of the time). ... but then that matching can only happen n times... so it's a separate concern.



          Then there is the outer most loop... which I don't want to think about... but in the worst case it will be run n times, giving you a worst-case complexity of O(n*n*n*b + n*n*b*b), or just about O(n*n*n*b) if you use HashSets.



          I can't think how the space complexity of your code could be anything other than the input space complexity of O(b + n).



          Technically, your printing out at the end has third order time complexity: O(m*k*k) in number of clusters m, and number of places per cluster k. Consider using a StringBuilder or string.Join instead of Aggregate and repeated concatenation (that will make it O(m*k)).



          Disjoint Sets



          Any problem like this I would solve with disjoint sets, and even a poor implementation should get you O(n*b) time-complexity (max tree-depth is O(n)) on this problem.



          The basic idea is that you maintain a table pointers which describe a set of trees, each tree being an independent (disjoint) set. You can follow the pointers from any member of a set to the root member, and by assigning the parent of one root node to the root of another you join the two sets together. You can throw an implementation together in no time, with very little code.



          Some very unscientific benchmarking suggests that a somewhat untested disjoint sets implementation I just threw together is a couple of orders of magnitude faster than your implementation (using HashSets and with reduced Pair count, as suggested in Paparazzo's answer) on randomly generated problems. Your implementation is taking over 10seconds on my machine to solve a problem with b=4000, which the disjoint sets solve in a few milliseconds. The disjoint set version seems to be linear in n*b (easier to scale than just b in random problems); yours is taking a while for larger problems larger than n*b=5000, and I wouldn't wish to comment on the experimental time-complexity with these few (noisy) results.



          A gist of the disjoint sets implementation and cluster method is provided so that people can find juicy bugs in it: 1 bug fixed so far (2018-06-26)






          share|improve this answer

























            up vote
            5
            down vote



            accepted







            up vote
            5
            down vote



            accepted






            I always feel odd reviewing code like this, because there is no API. You should really pull out the clustering code into it's own method, with a nice signature, something like:



            IEnumerable<IReadOnlyCollection<string>> Cluster(IEnumerable<Person> people);


            This will make it reusable (and so testable).



            You ask "Am I doing here something potentially dangerous or excessive with the collections?". I'm not entirely sure what you mean... but I think no. Collections are efficient, and it's hard to do anything dangerous in C# by accident. If anything, you should probably be using more powerful collections.



            Tuples



            What is Tuple<string, List<string>>? What are Item1 and Item2? Give it a meaningful type, with meaningful member names!



            public class Person

            public Person(string name, IReadOnlyCollection<string> places)

            Name = name;
            Places = places;


            public string Name get;
            public IReadOnlyCollection<string> Places get;



            Complexity



            Never been much good at complexity... but here's an attempt at answering your second question. Take it with spoonful of salt. Hopefully someone will poke holes in it promptly.



            First you need to identify the parameters you care about. Let's say n for the number of people and 'b' number of places. You iterate over pairs of people, which gives you n^2 (just to enumerate the pairs).



            Inside that loop you are looking for an intersection between 2 sets of up to b elements, which can be no better than O(b) (that's the case where the hashes turn out perfectly, assuming Intersect converts one of the lists to a HashSet internally). cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x))) is O(b^2). You ought instead be using HashSets from the outset, which will get the inner working down to something more like O(b) again (i.e. HashSet<T>.UnionWith(HashSet<T>) is kind of just about linear most of the time). ... but then that matching can only happen n times... so it's a separate concern.



            Then there is the outer most loop... which I don't want to think about... but in the worst case it will be run n times, giving you a worst-case complexity of O(n*n*n*b + n*n*b*b), or just about O(n*n*n*b) if you use HashSets.



            I can't think how the space complexity of your code could be anything other than the input space complexity of O(b + n).



            Technically, your printing out at the end has third order time complexity: O(m*k*k) in number of clusters m, and number of places per cluster k. Consider using a StringBuilder or string.Join instead of Aggregate and repeated concatenation (that will make it O(m*k)).



            Disjoint Sets



            Any problem like this I would solve with disjoint sets, and even a poor implementation should get you O(n*b) time-complexity (max tree-depth is O(n)) on this problem.



            The basic idea is that you maintain a table pointers which describe a set of trees, each tree being an independent (disjoint) set. You can follow the pointers from any member of a set to the root member, and by assigning the parent of one root node to the root of another you join the two sets together. You can throw an implementation together in no time, with very little code.



            Some very unscientific benchmarking suggests that a somewhat untested disjoint sets implementation I just threw together is a couple of orders of magnitude faster than your implementation (using HashSets and with reduced Pair count, as suggested in Paparazzo's answer) on randomly generated problems. Your implementation is taking over 10seconds on my machine to solve a problem with b=4000, which the disjoint sets solve in a few milliseconds. The disjoint set version seems to be linear in n*b (easier to scale than just b in random problems); yours is taking a while for larger problems larger than n*b=5000, and I wouldn't wish to comment on the experimental time-complexity with these few (noisy) results.



            A gist of the disjoint sets implementation and cluster method is provided so that people can find juicy bugs in it: 1 bug fixed so far (2018-06-26)






            share|improve this answer















            I always feel odd reviewing code like this, because there is no API. You should really pull out the clustering code into it's own method, with a nice signature, something like:



            IEnumerable<IReadOnlyCollection<string>> Cluster(IEnumerable<Person> people);


            This will make it reusable (and so testable).



            You ask "Am I doing here something potentially dangerous or excessive with the collections?". I'm not entirely sure what you mean... but I think no. Collections are efficient, and it's hard to do anything dangerous in C# by accident. If anything, you should probably be using more powerful collections.



            Tuples



            What is Tuple<string, List<string>>? What are Item1 and Item2? Give it a meaningful type, with meaningful member names!



            public class Person

            public Person(string name, IReadOnlyCollection<string> places)

            Name = name;
            Places = places;


            public string Name get;
            public IReadOnlyCollection<string> Places get;



            Complexity



            Never been much good at complexity... but here's an attempt at answering your second question. Take it with spoonful of salt. Hopefully someone will poke holes in it promptly.



            First you need to identify the parameters you care about. Let's say n for the number of people and 'b' number of places. You iterate over pairs of people, which gives you n^2 (just to enumerate the pairs).



            Inside that loop you are looking for an intersection between 2 sets of up to b elements, which can be no better than O(b) (that's the case where the hashes turn out perfectly, assuming Intersect converts one of the lists to a HashSet internally). cluster1.AddRange(cluster2.Where(x => !cluster1.Contains(x))) is O(b^2). You ought instead be using HashSets from the outset, which will get the inner working down to something more like O(b) again (i.e. HashSet<T>.UnionWith(HashSet<T>) is kind of just about linear most of the time). ... but then that matching can only happen n times... so it's a separate concern.



            Then there is the outer most loop... which I don't want to think about... but in the worst case it will be run n times, giving you a worst-case complexity of O(n*n*n*b + n*n*b*b), or just about O(n*n*n*b) if you use HashSets.



            I can't think how the space complexity of your code could be anything other than the input space complexity of O(b + n).



            Technically, your printing out at the end has third order time complexity: O(m*k*k) in number of clusters m, and number of places per cluster k. Consider using a StringBuilder or string.Join instead of Aggregate and repeated concatenation (that will make it O(m*k)).



            Disjoint Sets



            Any problem like this I would solve with disjoint sets, and even a poor implementation should get you O(n*b) time-complexity (max tree-depth is O(n)) on this problem.



            The basic idea is that you maintain a table pointers which describe a set of trees, each tree being an independent (disjoint) set. You can follow the pointers from any member of a set to the root member, and by assigning the parent of one root node to the root of another you join the two sets together. You can throw an implementation together in no time, with very little code.



            Some very unscientific benchmarking suggests that a somewhat untested disjoint sets implementation I just threw together is a couple of orders of magnitude faster than your implementation (using HashSets and with reduced Pair count, as suggested in Paparazzo's answer) on randomly generated problems. Your implementation is taking over 10seconds on my machine to solve a problem with b=4000, which the disjoint sets solve in a few milliseconds. The disjoint set version seems to be linear in n*b (easier to scale than just b in random problems); yours is taking a while for larger problems larger than n*b=5000, and I wouldn't wish to comment on the experimental time-complexity with these few (noisy) results.



            A gist of the disjoint sets implementation and cluster method is provided so that people can find juicy bugs in it: 1 bug fixed so far (2018-06-26)







            share|improve this answer















            share|improve this answer



            share|improve this answer








            edited Jun 26 at 9:59


























            answered Jun 25 at 17:44









            VisualMelon

            2,105716




            2,105716






















                up vote
                3
                down vote













                Note that a decent amount of this is preference but I do provide reasoning for it below (I try to encourage improved code readability).



                Consistency



                I would make bool proceed = false; into var proceed = false; to keep consistent with your usage of var.



                Naming



                Booleans usually have a prefix of is, has, will, etc. So perhaps willProceed is more fitting. Ultimately, it might be best to name it something that describes the scenario in which we will proceed. In this case, I believe you proceed as long has you have found intersecting pairs/clusters. So maybe hasIntersectingPairs or hasIntersectingClusters it's not as short, but it definitely is more clear on its intentions.



                Casing



                Typically you would make local variables such as Connections and Clusters lower-case (i.e. var connections = ...; and var clusters = ...;).



                Unnecessary Foreach



                I might be wrong, but I'm pretty sure your foreach statement can be replaced with LINQs FirstOrDefault and an if-statement.



                var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
                if (cluster != null)

                cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
                clusters.Remove(cluster.Item2);
                willProceed = true;



                No need for a break and it's a bit more clear what the possible paths your program can take, since your foreach loop will either find an intersecting pair and stop (via break) or will find no intersecting pair and continue until out of pairs.



                Additionally naming it cluster (instead of pair) makes the naming line up with cluster1 -> cluster.Item1 and cluster2 -> cluster.Item2 without having to make unnecessary variables. If you prefer the extra variables I would then suggest you take VisualMelon's advice and make a meaningful type instead of using Tuple<>.



                do/while vs while



                You declare bool proceed = true; then use a do/while loop as if it was initialized as false, but then immediately set it to false inside the loop. The following should be a bit more straight forward and accomplish the same thing,



                var hasIntersectingPairs = true;
                while (hasIntersectingPairs)

                hasIntersectingPairs = false;
                ...



                Putting My Suggestions Together



                static void Main(string args)

                var connections = new List<Tuple<string, List<string>>>()

                new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
                new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
                new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
                new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
                new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
                new Tuple<string, List<string>>("O6", new List<string>() "P4"),
                new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
                new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
                new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
                new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
                new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
                ;
                var clusters = connections.Select(x => x.Item2).ToList();

                var hasIntersectingPairs = true;
                while (hasIntersectingPairs)

                hasIntersectingPairs = false;
                var pairs = clusters.SelectMany(x => clusters, (x, y) => Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

                var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
                if (cluster != null)

                cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
                clusters.Remove(cluster.Item2);
                hasIntersectingPairs = true;



                foreach (var cluster in clusters.OrderByDescending(x => x.Count()))

                Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

                Console.Read();



                Output still Matches



                P10, P11, P12, P5, P6, P7
                P1, P2, P3, P4
                P8, P9
                P13





                share|improve this answer

























                  up vote
                  3
                  down vote













                  Note that a decent amount of this is preference but I do provide reasoning for it below (I try to encourage improved code readability).



                  Consistency



                  I would make bool proceed = false; into var proceed = false; to keep consistent with your usage of var.



                  Naming



                  Booleans usually have a prefix of is, has, will, etc. So perhaps willProceed is more fitting. Ultimately, it might be best to name it something that describes the scenario in which we will proceed. In this case, I believe you proceed as long has you have found intersecting pairs/clusters. So maybe hasIntersectingPairs or hasIntersectingClusters it's not as short, but it definitely is more clear on its intentions.



                  Casing



                  Typically you would make local variables such as Connections and Clusters lower-case (i.e. var connections = ...; and var clusters = ...;).



                  Unnecessary Foreach



                  I might be wrong, but I'm pretty sure your foreach statement can be replaced with LINQs FirstOrDefault and an if-statement.



                  var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
                  if (cluster != null)

                  cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
                  clusters.Remove(cluster.Item2);
                  willProceed = true;



                  No need for a break and it's a bit more clear what the possible paths your program can take, since your foreach loop will either find an intersecting pair and stop (via break) or will find no intersecting pair and continue until out of pairs.



                  Additionally naming it cluster (instead of pair) makes the naming line up with cluster1 -> cluster.Item1 and cluster2 -> cluster.Item2 without having to make unnecessary variables. If you prefer the extra variables I would then suggest you take VisualMelon's advice and make a meaningful type instead of using Tuple<>.



                  do/while vs while



                  You declare bool proceed = true; then use a do/while loop as if it was initialized as false, but then immediately set it to false inside the loop. The following should be a bit more straight forward and accomplish the same thing,



                  var hasIntersectingPairs = true;
                  while (hasIntersectingPairs)

                  hasIntersectingPairs = false;
                  ...



                  Putting My Suggestions Together



                  static void Main(string args)

                  var connections = new List<Tuple<string, List<string>>>()

                  new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
                  new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
                  new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
                  new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
                  new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
                  new Tuple<string, List<string>>("O6", new List<string>() "P4"),
                  new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
                  new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
                  new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
                  new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
                  new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
                  ;
                  var clusters = connections.Select(x => x.Item2).ToList();

                  var hasIntersectingPairs = true;
                  while (hasIntersectingPairs)

                  hasIntersectingPairs = false;
                  var pairs = clusters.SelectMany(x => clusters, (x, y) => Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

                  var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
                  if (cluster != null)

                  cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
                  clusters.Remove(cluster.Item2);
                  hasIntersectingPairs = true;



                  foreach (var cluster in clusters.OrderByDescending(x => x.Count()))

                  Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

                  Console.Read();



                  Output still Matches



                  P10, P11, P12, P5, P6, P7
                  P1, P2, P3, P4
                  P8, P9
                  P13





                  share|improve this answer























                    up vote
                    3
                    down vote










                    up vote
                    3
                    down vote









                    Note that a decent amount of this is preference but I do provide reasoning for it below (I try to encourage improved code readability).



                    Consistency



                    I would make bool proceed = false; into var proceed = false; to keep consistent with your usage of var.



                    Naming



                    Booleans usually have a prefix of is, has, will, etc. So perhaps willProceed is more fitting. Ultimately, it might be best to name it something that describes the scenario in which we will proceed. In this case, I believe you proceed as long has you have found intersecting pairs/clusters. So maybe hasIntersectingPairs or hasIntersectingClusters it's not as short, but it definitely is more clear on its intentions.



                    Casing



                    Typically you would make local variables such as Connections and Clusters lower-case (i.e. var connections = ...; and var clusters = ...;).



                    Unnecessary Foreach



                    I might be wrong, but I'm pretty sure your foreach statement can be replaced with LINQs FirstOrDefault and an if-statement.



                    var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
                    if (cluster != null)

                    cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
                    clusters.Remove(cluster.Item2);
                    willProceed = true;



                    No need for a break and it's a bit more clear what the possible paths your program can take, since your foreach loop will either find an intersecting pair and stop (via break) or will find no intersecting pair and continue until out of pairs.



                    Additionally naming it cluster (instead of pair) makes the naming line up with cluster1 -> cluster.Item1 and cluster2 -> cluster.Item2 without having to make unnecessary variables. If you prefer the extra variables I would then suggest you take VisualMelon's advice and make a meaningful type instead of using Tuple<>.



                    do/while vs while



                    You declare bool proceed = true; then use a do/while loop as if it was initialized as false, but then immediately set it to false inside the loop. The following should be a bit more straight forward and accomplish the same thing,



                    var hasIntersectingPairs = true;
                    while (hasIntersectingPairs)

                    hasIntersectingPairs = false;
                    ...



                    Putting My Suggestions Together



                    static void Main(string args)

                    var connections = new List<Tuple<string, List<string>>>()

                    new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
                    new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
                    new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
                    new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
                    new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
                    new Tuple<string, List<string>>("O6", new List<string>() "P4"),
                    new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
                    new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
                    new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
                    new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
                    new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
                    ;
                    var clusters = connections.Select(x => x.Item2).ToList();

                    var hasIntersectingPairs = true;
                    while (hasIntersectingPairs)

                    hasIntersectingPairs = false;
                    var pairs = clusters.SelectMany(x => clusters, (x, y) => Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

                    var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
                    if (cluster != null)

                    cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
                    clusters.Remove(cluster.Item2);
                    hasIntersectingPairs = true;



                    foreach (var cluster in clusters.OrderByDescending(x => x.Count()))

                    Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

                    Console.Read();



                    Output still Matches



                    P10, P11, P12, P5, P6, P7
                    P1, P2, P3, P4
                    P8, P9
                    P13





                    share|improve this answer













                    Note that a decent amount of this is preference but I do provide reasoning for it below (I try to encourage improved code readability).



                    Consistency



                    I would make bool proceed = false; into var proceed = false; to keep consistent with your usage of var.



                    Naming



                    Booleans usually have a prefix of is, has, will, etc. So perhaps willProceed is more fitting. Ultimately, it might be best to name it something that describes the scenario in which we will proceed. In this case, I believe you proceed as long has you have found intersecting pairs/clusters. So maybe hasIntersectingPairs or hasIntersectingClusters it's not as short, but it definitely is more clear on its intentions.



                    Casing



                    Typically you would make local variables such as Connections and Clusters lower-case (i.e. var connections = ...; and var clusters = ...;).



                    Unnecessary Foreach



                    I might be wrong, but I'm pretty sure your foreach statement can be replaced with LINQs FirstOrDefault and an if-statement.



                    var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
                    if (cluster != null)

                    cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
                    clusters.Remove(cluster.Item2);
                    willProceed = true;



                    No need for a break and it's a bit more clear what the possible paths your program can take, since your foreach loop will either find an intersecting pair and stop (via break) or will find no intersecting pair and continue until out of pairs.



                    Additionally naming it cluster (instead of pair) makes the naming line up with cluster1 -> cluster.Item1 and cluster2 -> cluster.Item2 without having to make unnecessary variables. If you prefer the extra variables I would then suggest you take VisualMelon's advice and make a meaningful type instead of using Tuple<>.



                    do/while vs while



                    You declare bool proceed = true; then use a do/while loop as if it was initialized as false, but then immediately set it to false inside the loop. The following should be a bit more straight forward and accomplish the same thing,



                    var hasIntersectingPairs = true;
                    while (hasIntersectingPairs)

                    hasIntersectingPairs = false;
                    ...



                    Putting My Suggestions Together



                    static void Main(string args)

                    var connections = new List<Tuple<string, List<string>>>()

                    new Tuple<string, List<string>>("O1", new List<string>() "P1", "P2" ),
                    new Tuple<string, List<string>>("O2", new List<string>() "P2", "P3", "P4" ),
                    new Tuple<string, List<string>>("O3", new List<string>() "P8" ),
                    new Tuple<string, List<string>>("O4", new List<string>() "P5", "P11" ),
                    new Tuple<string, List<string>>("O5", new List<string>() "P2", "P3" ),
                    new Tuple<string, List<string>>("O6", new List<string>() "P4"),
                    new Tuple<string, List<string>>("O7", new List<string>() "P7", "P10" ),
                    new Tuple<string, List<string>>("O8", new List<string>() "P5", "P6", "P7", "P12" ),
                    new Tuple<string, List<string>>("O9", new List<string>() "P7", "P11" ),
                    new Tuple<string, List<string>>("O10", new List<string>() "P13" ),
                    new Tuple<string, List<string>>("O11", new List<string>() "P8", "P9" ),
                    ;
                    var clusters = connections.Select(x => x.Item2).ToList();

                    var hasIntersectingPairs = true;
                    while (hasIntersectingPairs)

                    hasIntersectingPairs = false;
                    var pairs = clusters.SelectMany(x => clusters, (x, y) => Tuple.Create(x, y)).Where(x => x.Item1 != x.Item2);

                    var cluster = pairs.FirstOrDefault(p => p.Item1.Intersect(p.Item2).Any());
                    if (cluster != null)

                    cluster.Item1.AddRange(cluster.Item2.Where(x => !cluster.Item1.Contains(x)));
                    clusters.Remove(cluster.Item2);
                    hasIntersectingPairs = true;



                    foreach (var cluster in clusters.OrderByDescending(x => x.Count()))

                    Console.WriteLine(cluster.OrderBy(x => x).Aggregate((i, j) => i + ", " + j));

                    Console.Read();



                    Output still Matches



                    P10, P11, P12, P5, P6, P7
                    P1, P2, P3, P4
                    P8, P9
                    P13






                    share|improve this answer













                    share|improve this answer



                    share|improve this answer











                    answered Jun 25 at 18:20









                    Shelby115

                    1,354516




                    1,354516




















                        up vote
                        1
                        down vote













                        Look good to me.



                        Not sure it matters but you are creating all pairs. (A,B) and (B,A). If you only created unique combinations you would cut the count in 1/2.






                        share|improve this answer



























                          up vote
                          1
                          down vote













                          Look good to me.



                          Not sure it matters but you are creating all pairs. (A,B) and (B,A). If you only created unique combinations you would cut the count in 1/2.






                          share|improve this answer

























                            up vote
                            1
                            down vote










                            up vote
                            1
                            down vote









                            Look good to me.



                            Not sure it matters but you are creating all pairs. (A,B) and (B,A). If you only created unique combinations you would cut the count in 1/2.






                            share|improve this answer















                            Look good to me.



                            Not sure it matters but you are creating all pairs. (A,B) and (B,A). If you only created unique combinations you would cut the count in 1/2.







                            share|improve this answer















                            share|improve this answer



                            share|improve this answer








                            edited Jun 25 at 15:58


























                            answered Jun 25 at 13:23









                            paparazzo

                            4,8131730




                            4,8131730






















                                 

                                draft saved


                                draft discarded


























                                 


                                draft saved


                                draft discarded














                                StackExchange.ready(
                                function ()
                                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f197189%2fbuild-clusters-from-connections-virus-propagation%23new-answer', 'question_page');

                                );

                                Post as a guest













































































                                Popular posts from this blog

                                Greedy Best First Search implementation in Rust

                                Function to Return a JSON Like Objects Using VBA Collections and Arrays

                                C++11 CLH Lock Implementation