twitter - Train customized corpus with NLTK for Python

July 15, 2011

I am trying to train a classifier on a corpus of my own documents. The documents are structured the same way as the original movie_reviews corpus data: 1k positive text files in a folder 'pos' and 1k negative text files in a folder 'neg'. Each text file contains 25 lines of tweets that have been cleaned, that is, URLs, usernames, capital letters, and punctuation have been removed.

How can I adjust this code to use my own text data instead of movie_reviews?

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np

# define the split of % training / % test
split = 0.8

def word_feats(words):
    # bag-of-words features: every word that occurs maps to True
    return dict([(word, True) for word in words])

posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

cutoff = int(len(posfeats) * split)

trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]

print('train on %d instances\ntest on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))

classifier.show_most_informative_features()

You can log in as the root user and open the file at this path:

/usr/local/lib/python2.7/dist-packages/nltk/corpus/__init__.py

In this file you can find the existing movie_reviews corpus, which is loaded using LazyCorpusLoader:

movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')

Then try adding something similar to this:

my_movie = LazyCorpusLoader(
    'my_movie', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')

Here my_movie is the name of the corpus you have created. Once done, save the file and exit.

Finally, place your corpus in the NLTK data directory where the movie_reviews corpus is found.
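LazyCorpusLoader looks for a corpus under a `corpora` subfolder of one of NLTK's data directories, so the new corpus would sit at `corpora/my_movie` next to `corpora/movie_reviews`. A small sketch for listing the candidate locations on your machine, assuming a standard NLTK install:

```python
import os

import nltk

# nltk.data.path lists the directories NLTK searches for data, in order.
for base in nltk.data.path:
    corpora_dir = os.path.join(base, 'corpora')
    status = 'exists' if os.path.isdir(corpora_dir) else 'missing'
    print('%s (%s)' % (corpora_dir, status))
```

The folder that already contains movie_reviews is the one to copy my_movie into.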

Then try this:

from nltk.corpus import my_movie  # newly created own corpus
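If editing NLTK's own files as root is not an option, the same reader class can probably be pointed straight at your own folder instead. The following is only a sketch of that alternative, not part of the steps above: `load_feats`, the temporary directory, and the four sample tweets are made up for illustration, and in practice `corpus_root` would be the folder that holds your 'pos' and 'neg' directories.

```python
import os
import tempfile

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def word_feats(words):
    """Bag-of-words features: every word that occurs maps to True."""
    return dict((word, True) for word in words)


def load_feats(corpus_root):
    """Hypothetical helper: read a pos/neg folder, return (posfeats, negfeats)."""
    reader = CategorizedPlaintextCorpusReader(
        corpus_root, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
    posfeats = [(word_feats(reader.words(fileids=[f])), 'pos')
                for f in reader.fileids('pos')]
    negfeats = [(word_feats(reader.words(fileids=[f])), 'neg')
                for f in reader.fileids('neg')]
    return posfeats, negfeats


# Stand-in data (an assumption) so the sketch runs without a real corpus.
corpus_root = tempfile.mkdtemp()
samples = {'pos': ['great day love it', 'love this so happy'],
           'neg': ['awful day hate it', 'hate this so sad']}
for label, texts in samples.items():
    os.makedirs(os.path.join(corpus_root, label))
    for i, text in enumerate(texts):
        path = os.path.join(corpus_root, label, 'tweets%d.txt' % i)
        with open(path, 'w') as fh:
            fh.write(text + '\n')

posfeats, negfeats = load_feats(corpus_root)

# Tiny split for the demo; with real data, reuse the 0.8 cutoff logic.
trainfeats = posfeats[:1] + negfeats[:1]
testfeats = posfeats[1:] + negfeats[1:]

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % accuracy(classifier, testfeats))
```

With this approach nothing has to be copied into the NLTK data directory, and the rest of the training code stays the same apart from replacing `movie_reviews` with the reader.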

Hope this works.
