twitter - Train a customized corpus with NLTK for Python
I am trying to train on a corpus of my own documents. The documents are structured the same way as the original movie_reviews corpus data: 1k positive text files in a folder 'pos' and 1k negative text files in a folder 'neg'. Each text file contains 25 lines of tweets that have been cleaned: URLs, usernames, capital letters, and punctuation have been removed.
How can I adjust the code below to use my own text data instead of movie_reviews?

    import nltk.classify.util
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews
    from collections import defaultdict
    import numpy as np

    # define the split of % training / % test
    split = 0.8

    def word_feats(words):
        # bag-of-words features: every word that occurs maps to True
        return dict([(word, True) for word in words])

    posids = movie_reviews.fileids('pos')
    negids = movie_reviews.fileids('neg')

    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # the same cutoff is applied to both classes (both hold the same number of files)
    cutoff = int(len(posfeats) * split)
    trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
    testfeats = negfeats[cutoff:] + posfeats[cutoff:]
    print('train on %d instances\ntest on %d instances' % (len(trainfeats), len(testfeats)))

    classifier = NaiveBayesClassifier.train(trainfeats)
    print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
    classifier.show_most_informative_features()

You can log in as the root user and open the file at this path:

    /usr/local/lib/python2.7/dist-packages/nltk/corpus/__init__.py

In this file you can find the existing movie_reviews corpus being loaded using LazyCorpusLoader:

    movie_reviews = LazyCorpusLoader(
        'movie_reviews', CategorizedPlaintextCorpusReader,
        r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')

Then try adding something similar to this:

    my_movie = LazyCorpusLoader(
        'my_movie', CategorizedPlaintextCorpusReader,
        r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')

where my_movie is the name you have given your own corpus. Once done, save and exit.
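
The two regular expressions in the loader entry can be checked in isolation before touching any NLTK files. The sketch below uses only the standard `re` module and assumes the patterns are applied with `re.match` to file ids relative to the corpus root, which is my understanding of how the reader behaves. The helper `category_of` and the file names are made up for illustration.

```python
import re

# Patterns copied from the LazyCorpusLoader entry in the answer.
FILE_PATTERN = r'(?!\.).*\.txt'   # any .txt file whose id does not start with a dot
CAT_PATTERN = r'(neg|pos)/.*'     # the category is the leading folder name


def category_of(file_id):
    """Return 'pos' or 'neg' for a matching file id, else None (hypothetical helper)."""
    if not re.match(FILE_PATTERN, file_id):
        return None  # dot-file or not a .txt file
    match = re.match(CAT_PATTERN, file_id)
    return match.group(1) if match else None


# Made-up file ids, written relative to the corpus root.
for file_id in ['pos/tweets_0001.txt', 'neg/tweets_0001.txt',
                '.DS_Store', 'pos/notes.md']:
    print('%s -> %s' % (file_id, category_of(file_id)))
```

If your own files sit in 'pos' and 'neg' folders and end in .txt, the same two patterns should pick them up unchanged.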
Finally, place your corpus in the NLTK data directory, alongside the movie_reviews corpus.
Then try this:

    from nltk.corpus import my_movie  # your newly created corpus

Hope this works.
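
As an alternative that avoids editing NLTK's own `__init__.py` (this is a different route from the one described in the answer), the same reader class can probably be instantiated directly on your folder. The sketch below is a minimal illustration under stated assumptions: `load_tweet_corpus`, the temporary directory, and the two sample files are made up so the example runs on its own; in practice `root` would be the folder that contains your 'pos' and 'neg' directories.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def load_tweet_corpus(root):
    """Build a reader over root/pos/*.txt and root/neg/*.txt (hypothetical helper)."""
    return CategorizedPlaintextCorpusReader(
        root, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')


def word_feats(words):
    # Same bag-of-words features as in the question.
    return dict((word, True) for word in words)


# Tiny stand-in corpus so the sketch is self-contained; replace `root`
# with the path to your own data.
root = tempfile.mkdtemp()
samples = {'pos': 'great day love it\n', 'neg': 'bad day hate it\n'}
for label, text in samples.items():
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, 'tweets_0001.txt'), 'w') as fh:
        fh.write(text)

my_corpus = load_tweet_corpus(root)
posids = my_corpus.fileids('pos')
negids = my_corpus.fileids('neg')
posfeats = [(word_feats(my_corpus.words(fileids=[f])), 'pos') for f in posids]
negfeats = [(word_feats(my_corpus.words(fileids=[f])), 'neg') for f in negids]
print('%d pos files, %d neg files' % (len(posfeats), len(negfeats)))
```

From here the rest of the question's script (cutoff, training, accuracy) should work as before, with `my_corpus` standing in for `movie_reviews`.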