Using Sklearn with own text data

I've been experimenting with scikit-learn for the past few months and have been finding it difficult to move away from the inbuilt datasets (such as Twenty Newsgroups and Iris) and onto my own text datasets.

I have finally managed to get something working, but am keen to get my code sense-checked just in case I'm tricking myself into thinking I'm doing better than I am.

The following code is based on this Sklearn tutorial, but uses my own dataset of approximately 25,000 text files spread across 273 subdirectories in the main project folder. Each directory name serves as a descriptive label for the text files contained within it.

The objectives of the following code are as follows:

Iterate over each subdirectory path in the main project folder to extract the name of each label (these are appended to the categories list, which is passed to sklearn's load_files() function).

Use train_test_split() to hold out 40% of the dataset as test data

Transform the training and testing data to tf-idf

Train the classifier

Evaluate the classifier's predictions with the test data
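The first objective above, collecting the label names from the subdirectories, can be sketched with pathlib instead of string replacement on glob results. This is only a minimal illustration: the helper name list_categories and the three folder names are hypothetical, and the temporary directory stands in for the real project folder. As far as I know, load_files() also loads every subfolder by itself when categories is left as None, so the list may not be needed at all.

```python
import tempfile
from pathlib import Path


def list_categories(root):
    """Return the sorted names of the immediate subdirectories of root."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


# Hypothetical layout: one subfolder per label, created here only for the demo.
with tempfile.TemporaryDirectory() as root:
    for label in ("contract", "tort", "crime"):
        (Path(root) / label).mkdir()
    # A stray file at the top level is ignored because it is not a directory.
    (Path(root) / "notes.txt").write_text("not a label")
    categories = list_categories(root)

print(categories)  # ['contract', 'crime', 'tort']
```

Taking p.name avoids depending on the exact prefix of the path, so the same code keeps working if the project folder moves.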

The code below works and I'm currently averaging around 0.7 accuracy (so obviously, there's some improvement still needed).

The main areas I feel might be lacking are the way in which I'm bringing the names of the categories in and the way I'm dealing with the test/train split.
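On the test/train split, one option worth considering with 273 labels is a stratified split, so that each label appears in the training and test sets in roughly the same proportion. The sketch below uses a made-up toy dataset (20 placeholder documents, two labels) purely to show the stratify and random_state arguments of train_test_split(); it is not the real data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for docs_to_train.data and docs_to_train.target.
docs = [f"document {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify keeps the label proportions equal in both parts;
# random_state makes the split reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.4, stratify=labels, random_state=42
)

print(Counter(y_train), Counter(y_test))
```

Note that stratification needs at least two documents per label, so very small categories would have to be merged or dropped first.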

Some feedback from more seasoned developers would be gratefully received.

The code

import sklearn
import numpy as np
from glob import glob
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline

# Get paths to labelled data
rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")


categories = []

# Extract the folder paths, reduce down to the label and append to the categories list
for i in rawFolderPaths:
    string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/','')
    category = string1.strip('/')
    categories.append(category)

# Load the data
print('\nLoading the dataset...\n')
docs_to_train = sklearn.datasets.load_files("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",
    description=None, categories=categories, load_content=True,
    encoding='utf-8', shuffle=True, random_state=42)

# Split the dataset into training and testing sets
print('\nBuilding out hold-out test sample...\n')
X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data, docs_to_train.target, test_size=0.4)

# THE TRAINING DATA

# Transform the training data into tfidf vectors
print('\nTransforming the training data...\n')
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(raw_documents=X_train)

tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# THE TEST DATA

# Transform the test data into tfidf vectors
print('\nTransforming the test data...\n')
count_vect = CountVectorizer(stop_words='english')
X_test_counts = count_vect.fit_transform(raw_documents=X_test)

tfidf_transformer = TfidfTransformer(use_idf=True)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
print (X_test_tfidf.shape)

docs_test = X_test

# Construct the classifier pipeline 
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42, verbose=1)),
])

# Fit the model to the training data
text_clf.fit(X_train, y_train)

# Run the test data into the model
predicted = text_clf.predict(docs_test)

# Calculate mean accuracy of predictions
print (np.mean(predicted == y_test))

# Generate labelled performance metrics
print(metrics.classification_report(y_test, predicted,
                                    target_names=docs_to_train.target_names))
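One point that may be worth sense-checking in the code above: the standalone "THE TEST DATA" block builds a new CountVectorizer and calls fit_transform() on the test documents, which learns a separate vocabulary. The Pipeline does not appear to use those standalone matrices, and its predict() only transforms the test data with what was learned during fit(), so the reported accuracy should not be affected. The sketch below, using three invented sentences and TfidfVectorizer (which combines CountVectorizer and TfidfTransformer), shows the difference between the two approaches.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example sentences, not taken from the real dataset.
train_docs = ["the court allowed the appeal", "the appeal was dismissed"]
test_docs = ["the court dismissed the claim"]

# Learn the vocabulary and idf weights from the training documents only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Reuse the fitted vectorizer: same columns as the training matrix.
X_test = vectorizer.transform(test_docs)

# Refitting on the test documents produces a different, incompatible feature space.
refit = TfidfVectorizer().fit_transform(test_docs)

print(X_train.shape, X_test.shape, refit.shape)  # (2, 6) (1, 6) (1, 4)
```

A classifier trained on X_train could score X_test, but not the refit matrix, because its columns correspond to different words.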

edited Feb 4 at 21:59 by 200_success

asked Feb 4 at 17:12 by DanielH

