Using Sklearn with own text data
I've been experimenting with scikit-learn for the past few months and have found it difficult to move away from the built-in datasets (such as Twenty Newsgroups and Iris) and onto my own text datasets.
I have finally managed to get something working, but am keen to get my code sense-checked just in case I'm tricking myself into thinking I'm doing better than I am.
The following code is based on this Sklearn tutorial, but uses my own dataset of approximately 25,000 text files spread across 273 subdirectories in the main project folder. Each directory name serves as a descriptive label for the text files contained within it.
The objectives of the following code are as follows:
- Iterate over each subdirectory path in the main project folder to extract each label name (these are appended to the categories list, which is passed to sklearn's load_files() function).
- Use train_test_split() to hold out 40% of the dataset as test data.
- Transform the training and testing data to tf-idf.
- Train the classifier.
- Evaluate the classifier's predictions against the test data.
The code below works and I'm currently averaging around 0.7 accuracy (so obviously, there's some improvement still needed).
The main areas I feel might be lacking are the way in which I'm bringing the category names in and the way I'm dealing with the test/train split (a bare sketch of both steps follows below).
Some feedback from more seasoned developers would be gratefully received.
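(For reference, a minimal sketch of those two steps in isolation. The pathlib label extraction and the stratify argument are not part of the code under review; they are just alternatives, included to make the question concrete.)
# Sketch only, not the code under review. Assumes the same layout: one
# subfolder per label, with the text files for that label inside it.
from pathlib import Path
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split

DATA_DIR = Path("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML")
categories = sorted(p.name for p in DATA_DIR.iterdir() if p.is_dir())

docs_to_train = load_files(str(DATA_DIR), categories=categories,
                           encoding='utf-8', shuffle=True, random_state=42)

# A stratified split keeps the 273 class proportions similar in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    docs_to_train.data, docs_to_train.target,
    test_size=0.4, stratify=docs_to_train.target, random_state=42)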
The code
import sklearn
import numpy as np
from glob import glob
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline
# Get paths to labelled data
rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")
categories = []
# Extract the folder paths, reduce down to the label and append to the categories list
for i in rawFolderPaths:
    string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/', '')
    category = string1.strip('/')
    categories.append(category)
# Load the data
print('\nLoading the dataset...\n')
docs_to_train = sklearn.datasets.load_files("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",
description=None, categories=categories, load_content=True,
encoding='utf-8', shuffle=True, random_state=42)
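# Note: load_files() infers each file's label from the name of its parent
# subdirectory, so the categories list built above only restricts which
# folders get loaded.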
# Split the dataset into training and testing sets
print('\nBuilding out hold-out test sample...\n')
X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data, docs_to_train.target, test_size=0.4)
# THE TRAINING DATA
# Transform the training data into tfidf vectors
print('\nTransforming the training data...\n')
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(raw_documents=X_train)
tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# THE TEST DATA
# Transform the test data into tfidf vectors
print('\nTransforming the test data...\n')
count_vect = CountVectorizer(stop_words='english')
X_test_counts = count_vect.fit_transform(raw_documents=X_test)
tfidf_transformer = TfidfTransformer(use_idf=True)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
print (X_test_tfidf.shape)
docs_test = X_test
# Construct the classifier pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
('tfidf', TfidfTransformer(use_idf=True)),
('clf', SGDClassifier(loss='hinge', penalty='l2',
alpha=1e-3, random_state=42, verbose=1)),
])
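# The pipeline vectorises the raw text itself, so it is fit on X_train (the raw
# strings), not on the X_train_tfidf matrix built above.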
# Fit the model to the training data
text_clf.fit(X_train, y_train)
# Run the test data into the model
predicted = text_clf.predict(docs_test)
# Calculate mean accuracy of predictions
print (np.mean(predicted == y_test))
# Generate labelled performance metrics
print(metrics.classification_report(y_test, predicted,
target_names=docs_to_train.target_names))
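For what it's worth, my understanding (which may well be wrong) is that the CountVectorizer + TfidfTransformer pair in the pipeline could be collapsed into a single TfidfVectorizer, roughly like this:
# Sketch only: TfidfVectorizer combines the counting and tf-idf steps.
# Parameters are meant to mirror the pipeline above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

text_clf_alt = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)),
])
text_clf_alt.fit(X_train, y_train)
print(text_clf_alt.score(X_test, y_test))   # mean accuracy on the held-out set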
python machine-learning natural-language-processing
asked Feb 4 at 17:12 by DanielH, edited Feb 4 at 21:59 by 200_success