Using Sklearn with own text data


I've been experimenting with scikit-learn for the past few months and have been finding it difficult to move away from the inbuilt datasets (such as Twenty Newsgroups and Iris) and onto my own text datasets.



I have finally managed to get something working, but am keen to get my code sense-checked just in case I'm tricking myself into thinking I'm doing better than I am.



The following code is based on this Sklearn tutorial, but uses my own dataset of approximately 25,000 text files spread across 273 subdirectories in the main project folder. Each directory name serves as a descriptive label for the text files contained within it.



The objectives of the following code are as follows:



  1. Iterate over each subdirectory path in the main project folder to extract the name of each label (the names are appended to the categories list, which is passed to sklearn's load_files() function).

  2. Use train_test_split() to hold out 40% of the dataset as test data

  3. Transform the training and testing data to tf-idf

  4. Train the classifier

  5. Evaluate the classifier's predictions with the test data
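Step 1 can be illustrated without string replacement on a hard-coded path. The following is a minimal sketch, assuming each label is simply the name of an immediate subdirectory; `list_categories` is a hypothetical helper, and a temporary directory with made-up label names stands in for the real project folder.

```python
import tempfile
from pathlib import Path


def list_categories(root):
    """Return the sorted names of the immediate subdirectories of `root`."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


# Demo on a throwaway directory tree (hypothetical labels, not the real data).
with tempfile.TemporaryDirectory() as tmp:
    for label in ("contract", "tort", "crime"):
        (Path(tmp) / label).mkdir()
    # A stray file at the top level is ignored because it is not a directory.
    (Path(tmp) / "notes.txt").write_text("not a label")
    demo_categories = list_categories(tmp)

print(demo_categories)
```

Note that load_files() documents categories=None as "load all the categories", so if every subfolder is a label the list may not need to be built at all.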

The code below works and I'm currently averaging around 0.7 accuracy (so obviously, there's some improvement still needed).



The main areas I feel might be lacking are the way in which I'm bringing the names of the categories in and the way I'm dealing with the test/train split.
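On the test/train split: with 273 labels, a purely random 40% hold-out can leave small categories unevenly represented between the two sets. One option is a stratified split. The following is a minimal sketch on toy data, assuming two labels with five documents each; the `docs` and `labels` lists are stand-ins for docs_to_train.data and docs_to_train.target.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the real corpus: two labels, five documents each.
docs = ["doc %d" % i for i in range(10)]
labels = [0] * 5 + [1] * 5

# stratify keeps the label proportions the same in both halves;
# random_state makes the split repeatable between runs.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.4, stratify=labels, random_state=42)

print(Counter(y_train), Counter(y_test))
```

A caveat: train_test_split raises an error when stratifying if any label has fewer than two documents, so very small folders would need to be merged or dropped first.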



Some feedback from more seasoned developers would be gratefully received.



The code



import sklearn
import numpy as np
from glob import glob
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline

# Get paths to labelled data
rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")

categories = []

# Extract the folder paths, reduce down to the label and append to the categories list
for i in rawFolderPaths:
    string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/', '')
    category = string1.strip('/')
    categories.append(category)

# Load the data
print('\nLoading the dataset...\n')
docs_to_train = sklearn.datasets.load_files("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",
                                            description=None, categories=categories, load_content=True,
                                            encoding='utf-8', shuffle=True, random_state=42)

# Split the dataset into training and testing sets
print('\nBuilding out hold-out test sample...\n')
X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data, docs_to_train.target, test_size=0.4)

# THE TRAINING DATA

# Transform the training data into tfidf vectors
print('\nTransforming the training data...\n')
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(raw_documents=X_train)

tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# THE TEST DATA

# Transform the test data into tfidf vectors
print('\nTransforming the test data...\n')
count_vect = CountVectorizer(stop_words='english')
X_test_counts = count_vect.fit_transform(raw_documents=X_test)

tfidf_transformer = TfidfTransformer(use_idf=True)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
print(X_test_tfidf.shape)

docs_test = X_test

# Construct the classifier pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42, verbose=1)),
                     ])

# Fit the model to the training data
text_clf.fit(X_train, y_train)

# Run the test data into the model
predicted = text_clf.predict(docs_test)

# Calculate mean accuracy of predictions
print(np.mean(predicted == y_test))

# Generate labelled performance metrics
print(metrics.classification_report(y_test, predicted,
                                    target_names=docs_to_train.target_names))
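One point worth checking in the script above is the manual tf-idf step for the test data, which fits a second CountVectorizer and TfidfTransformer on X_test. The following is a minimal sketch, on a made-up three-sentence corpus, of the usual convention: fit on the training documents only, then call transform() on the test documents so both matrices share one vocabulary. The example sentences are assumptions for illustration, not taken from the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Tiny hypothetical corpus standing in for X_train / X_test.
train_docs = ["the court dismissed the appeal", "the striker scored a goal"]
test_docs = ["the appeal was dismissed by the judge"]

# Fit the vocabulary and idf weights on the training documents only.
count_vect = CountVectorizer(stop_words='english')
tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(count_vect.fit_transform(train_docs))

# Reuse the fitted objects on the test documents: transform(), not fit_transform().
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(test_docs))

# For contrast, a vectorizer re-fitted on the test documents builds its own vocabulary.
refit_counts = CountVectorizer(stop_words='english').fit_transform(test_docs)

print(X_train_tfidf.shape, X_test_tfidf.shape, refit_counts.shape)
```

In the script itself the Pipeline already does this internally: text_clf.fit(X_train, y_train) fits the vectorizer on the training documents, and text_clf.predict(docs_test) only transforms the test documents. The manually built X_train_tfidf and X_test_tfidf are never passed to the classifier, so they do not affect the reported accuracy.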






edited Feb 4 at 21:59 by 200_success

asked Feb 4 at 17:12 by DanielH
























