Linear Regression on random data

I wrote a simple script to implement linear regression and to practice NumPy/pandas. It uses random data, so the weights (thetas) obviously have no significant meaning. I'm looking for feedback on:

  1. Performance
  2. Python code style
  3. Machine Learning code style

# Performs Linear Regression (from scratch) using randomized data
# Optimizes weights by using Gradient Descent Algorithm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)

features = 3
trainingSize = 10 ** 1
trainingSteps = 10 ** 3
learningRate = 10 ** -2

randData = np.random.rand(trainingSize, features + 1)
colNames = [f'feature{i}' for i in range(1, features + 1)]
colNames.append('labels')

dummy_column = pd.Series(np.ones(trainingSize), name='f0')
df = pd.DataFrame(randData, columns=colNames)

X = pd.concat([dummy_column, df.drop(columns='labels')], axis=1)
y = df['labels']
thetas = np.random.rand(features + 1)

cost = lambda thetas: np.mean((np.matmul(X, thetas) - y) ** 2) / 2
dJdtheta = lambda thetas, k: np.mean((np.matmul(X, thetas) - y) * X.iloc[:, k])
gradient = lambda thetas: np.array([dJdtheta(thetas, k) for k in range(X.shape[1])])

# J(theta) before gradient descent
print(cost(thetas))

# Perform gradient descent
errors = np.zeros(trainingSteps)
for step in range(trainingSteps):
    thetas -= learningRate * gradient(thetas)
    errors[step] = cost(thetas)

# J(theta) after gradient descent
print(cost(thetas))

# Plots Cost function as gradient descent runs
plt.plot(errors)
plt.xlabel('Training Steps')
plt.ylabel('Cost Function')
plt.show()






edited Mar 4 at 14:40 by 200_success
asked Mar 4 at 13:53 by Vivek Jha

1 Answer

Welcome!

Your first two lines are nice comments. Consider putting them in a module docstring:

"""Performs Linear Regression (from scratch) using randomized data.

Optimizes weights by using Gradient Descent Algorithm.
"""

Consider adding random noise to something linear (or to some "wrong model" sine or polynomial), rather than to a constant.
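A minimal sketch of what noised linear data could look like. The sample count, the ground-truth weights in `true_thetas`, and the noise scale are made-up values for illustration, and `np.linalg.lstsq` is used only as a reference fit, not as a replacement for the gradient descent in the question:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 100, 3                  # hypothetical sizes
true_thetas = np.array([2.0, -1.0, 0.5, 3.0])   # hypothetical: intercept, then 3 weights

# Design matrix: a column of ones for the intercept, then random features.
X = np.column_stack([np.ones(n_samples), rng.random((n_samples, n_features))])

# Labels are a linear function of the features plus a little Gaussian noise.
noise = rng.normal(scale=0.01, size=n_samples)
y = X @ true_thetas + noise

# Closed-form least-squares fit, useful as a yardstick for gradient descent.
fitted, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(fitted, 2))
```

With data built this way there is a known answer, so the weights that gradient descent converges to can be compared against `true_thetas` instead of only watching the cost go down.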



np.random.seed(0)

Nice - reproducibility is Good.

trainingSize = 10 ** 1
trainingSteps = 10 ** 3
learningRate = 10 ** -2

These expressions are correct and clear. But why evaluate an exponentiation when you could just write it as a literal? 1e1, 1e3, 1e-2. (This advice would apply in many languages, including Python. And yes, I actually prefer seeing the two integers written as floating point, even if that forces me to call int() on them.)
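As a small sketch of that suggestion (the snake_case names anticipate the PEP 8 spelling and are otherwise just the question's variables):

```python
# Scientific-notation literals are floats in Python, so the two counts
# are converted back to int before being used as sizes or loop bounds.
training_size = int(1e1)
training_steps = int(1e3)
learning_rate = 1e-2

print(training_size, training_steps, learning_rate)
```

The int() calls matter because functions such as range() and np.zeros() reject float sizes.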



PEP 8 asks that you spell it training_size, and so on. Please run flake8, and follow its advice.

Your column names expression is fine. Consider handling the one-based numbering within the format expression:

col_names = [f'feature{i + 1}' for i in range(features)] + ['labels']

Specifying axis=1 is correct. I have a (weak) preference for explicitly spelling out: axis='columns'.

Consider hoisting the expression np.matmul(X, thetas) - y, so it is only evaluated once per gradient call rather than once per feature.
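One way the hoisting could look, sketched under the assumption that X and y are plain NumPy arrays rather than the pandas objects from the question (the data here is a random stand-in). The residual is computed once, and a single matrix product replaces the per-column loop:

```python
import numpy as np

# Stand-in data (hypothetical): a bias column of ones plus three random features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.random((10, 3))])
y = rng.random(10)
thetas = rng.random(X.shape[1])


def gradient(thetas):
    """Gradient of J(theta) = mean(residual ** 2) / 2."""
    residual = X @ thetas - y          # evaluated once per call
    return X.T @ residual / len(y)     # one partial derivative per column of X


# The per-column formulation from the question, for comparison.
per_column = np.array(
    [np.mean((X @ thetas - y) * X[:, k]) for k in range(X.shape[1])]
)
print(np.allclose(gradient(thetas), per_column))
```

Both versions compute the same vector, since the mean of residual * X[:, k] is the dot product of that column with the residual divided by the number of rows.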



The three lambda expressions are fine, but they don't seem to buy you anything. Probably better to use def three times.
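A sketch of the same three functions written with def, again assuming NumPy arrays and stand-in random data instead of the question's DataFrame. Named functions get docstrings and readable tracebacks, and flake8's E731 check flags lambdas that are assigned to a name:

```python
import numpy as np

# Stand-in data (hypothetical): a bias column of ones plus three random features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.random((10, 3))])
y = rng.random(10)


def cost(thetas):
    """Half the mean squared error, J(theta)."""
    return np.mean((X @ thetas - y) ** 2) / 2


def dJdtheta(thetas, k):
    """Partial derivative of J with respect to thetas[k]."""
    return np.mean((X @ thetas - y) * X[:, k])


def gradient(thetas):
    """Vector of all partial derivatives of J."""
    return np.array([dJdtheta(thetas, k) for k in range(X.shape[1])])


thetas = rng.random(X.shape[1])
stepped = thetas - 1e-2 * gradient(thetas)   # one gradient descent step
print(cost(thetas), cost(stepped))
```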



Ship it! But do consider noising a linear function, to make it easier to evaluate your results.






answered May 15 at 3:17 by J_H