Compute leaderboard of competitive riders throughout a season

I have a table in my DB that contains Riders as rows and their points throughout the season as columns. Here are the columns:



['name', 'wwq_id', 'wwq_code', 'year', 'round1_qualy', 'round1_final', 
'round1_timed', 'round1_total_qualy', 'round1_total_qualy_position',
'round1_total_final', 'round1_total_final_position', 'round1_total_timed',
'round1_total_timed_position', 'round1_total', 'round1_total_position'...etc


The 'round1_qualy', 'round1_final' and 'round1_timed' columns all come from different tables and are added by another function. Once they are in the DB, all of the total columns are calculated using generated columns in MySQL.



I need to calculate the position of each rider at different points in the season, e.g. after the round one qualy, after the round two final, etc.



Here is the code I have right now:



def set_positions():
    """Add positions to the DB for each column in our overall table"""
    engine = sql.create_engine(MYSQL_CON_STRING)
    connection = engine.connect()
    trans = connection.begin()

    # Men, Women, Junior
    for category in CATEGORIES:
        table = category + '_overall'
        df = pd.read_sql('SELECT * FROM {0}'.format(table), engine)
        points_columns_list = [col for col in df.columns
                               if 'total' in col and 'position' not in col]

        # Calculate the rank based on the current total points column and
        # insert it into the column at index + 1
        for points_column in points_columns_list:
            position_index = df.columns.get_loc(points_column) + 1
            position_column = df.columns[position_index]
            current_points = df.groupby('year')
            position = current_points[points_column].rank(
                ascending=False, method='min').astype(int)
            df.loc[:, position_column] = position.values

            # No points awarded in this and/or previous rounds. Everyone is
            # at 0, so everyone is rank 1
            df.loc[(df[points_column] == 0)
                   & (df[position_column] == 1), position_column] = 'NULL'

            # Add rows to the DB
            for row in df.itertuples():
                connection.execute('UPDATE {0} '
                                   'SET {1} = {2} '
                                   'WHERE year = {3} AND name = "{4}"'
                                   .format(table, position_column,
                                           getattr(row, position_column),
                                           getattr(row, 'year'),
                                           getattr(row, 'name')))
    trans.commit()
    connection.close()


The code works, but it takes a long time. This function doesn't get executed very often, but considering there are about 8000 rows per table, it takes quite a bit of time.



Is there any way to get the complexity down?
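For concreteness, here is a minimal, self-contained sketch of the ranking step on toy data (the column names and values are illustrative, not from the real DB):

```python
import pandas as pd

# Toy stand-in for one '<category>_overall' table: two seasons, three riders.
df = pd.DataFrame({
    'name': ['ann', 'bob', 'cat', 'ann', 'bob', 'cat'],
    'year': [2017, 2017, 2017, 2018, 2018, 2018],
    'round1_total': [50, 80, 50, 0, 0, 0],
})

# Rank within each season: highest points first; ties share the lowest
# rank, exactly as rank(ascending=False, method='min') does in the code above.
df['round1_total_position'] = (df.groupby('year')['round1_total']
                                 .rank(ascending=False, method='min')
                                 .astype(int))

print(df['round1_total_position'].tolist())  # [2, 1, 2, 1, 1, 1]
```

In 2017 the two riders tied on 50 points both get rank 2; in 2018 everyone is at 0, so everyone is rank 1 — the case the code then blanks out to NULL.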







  • Accidentally posted outside of my account (as a guest), so I can't comment. Thanks for your input! The only problem is that there are generated columns and pre-defined datatypes in the table, so I'm not able to recreate it without breaking the data pipeline. Thinking about it now, it could have been a bad idea to use them...
    – moto
    Mar 1 at 13:16

















asked Mar 1 at 10:26 by moto
edited Mar 1 at 10:40 by Mathias Ettinger











1 Answer

You should strive to reduce the number of queries to your DB. Instead of updating the whole table, row by row, each time you compute a new value, you could:



  1. load the data (using pd.read_sql_table);

  2. perform the computation of the positions (this may require you to use None or numpy.nan instead of 'NULL');

  3. write the whole new table at once at the end (using df.to_sql).
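As an aside on step 2 (an illustrative sketch with made-up toy data, not part of the original code): blanking out positions with a missing value upcasts the integer column, since NaN cannot live in an int64 column. NaN is also what DataFrame.to_sql writes out as SQL NULL:

```python
import pandas as pd

# Toy data: two riders with no points yet, one with points.
df = pd.DataFrame({'points': [0, 0, 10], 'position': [1, 1, 1]})

# Blank out positions where no points were scored; Series.mask fills the
# matching rows with NaN, which to_sql later writes as SQL NULL.
df['position'] = df['position'].mask(df['points'] == 0)

# The int64 column is upcast to float64 so it can hold the missing values.
print(df['position'].dtype)  # float64
```

So after this step the positions are floats in pandas; MySQL will still coerce them back to integers on write if the column type is integral.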

The rewrite could look like:



def set_positions():
    """Add positions to the DB for each column in our overall table"""
    engine = sql.create_engine(MYSQL_CON_STRING)
    connection = engine.connect()
    trans = connection.begin()

    # Men, Women, Junior
    for category in CATEGORIES:
        table = category + '_overall'
        df = pd.read_sql_table(table, engine)
        compute_positions_for_table(df)
        df.to_sql(table, engine, if_exists='replace')
    trans.commit()
    connection.close()


def compute_positions_for_table(df):
    points_columns_list = [col for col in df.columns
                           if 'total' in col and 'position' not in col]

    # Calculate the rank based on the current total points column and
    # insert it into the column at index + 1
    for points_column in points_columns_list:
        position_index = df.columns.get_loc(points_column) + 1
        position_column = df.columns[position_index]
        current_points = df.groupby('year')
        position = current_points[points_column].rank(
            ascending=False, method='min').astype(int)
        df.loc[:, position_column] = position.values

        # No points awarded in this and/or previous rounds. Everyone is
        # at 0, so everyone is rank 1
        df.loc[(df[points_column] == 0)
               & (df[position_column] == 1), position_column] = None


And even if you can't "replace" the table, you should still update it only once, at the end of the computation, rather than after each new column is computed.
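If replacing the table really is off limits (e.g. because of the generated columns mentioned in the comments), a middle ground is one parameterized executemany per column instead of one formatted string per row. A sketch using an in-memory SQLite table as a stand-in (table, column names, and data are illustrative; with MySQL through SQLAlchemy you would use the existing connection and its parameter style instead):

```python
import sqlite3

import pandas as pd

# In-memory stand-in for one of the MySQL '<category>_overall' tables.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE men_overall '
             '(name TEXT, year INTEGER, round1_total_position INTEGER)')
conn.executemany('INSERT INTO men_overall VALUES (?, ?, ?)',
                 [('ann', 2017, 0), ('bob', 2017, 0)])

# Positions as computed in pandas (hard-coded here for the example).
df = pd.DataFrame({'name': ['ann', 'bob'], 'year': [2017, 2017],
                   'round1_total_position': [2, 1]})

# One executemany round-trip for the whole column instead of one execute
# per row; placeholders also avoid the SQL-injection risk of str.format.
# int() casts because sqlite3 does not accept numpy integer types.
rows = [(int(p), int(y), n)
        for n, y, p in zip(df['name'], df['year'],
                           df['round1_total_position'])]
conn.executemany('UPDATE men_overall SET round1_total_position = ? '
                 'WHERE year = ? AND name = ?', rows)
conn.commit()

print(conn.execute('SELECT name, round1_total_position '
                   'FROM men_overall ORDER BY name').fetchall())
# [('ann', 2), ('bob', 1)]
```

The same shape works against MySQL with `%s` (or SQLAlchemy named) placeholders; the point is to batch the parameters, not to build one SQL string per row.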






answered Mar 1 at 12:30 by Mathias Ettinger
edited Mar 1 at 14:04






















             
