numpy - Term document matrix and cosine similarity in Python


I have the following situation that I want to address using Python (preferably using numpy and scipy):

  1. I have a collection of documents that I want to convert to a sparse term-document matrix.
  2. I want to extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents using cosine similarity within a subset of documents (the documents are labelled with categories, and I want to find similar documents within the same category).

How can I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find cosine similarity, but how do I convert the entire corpus to a large sparse term-document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?

Thanks.

May I recommend you take a look at scikit-learn? It is a very well regarded library in the Python community, with a simple and consistent API. It also has a cosine similarity metric implemented. Here is an example, taken from here, of how you could do it in 3 lines of code:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).toarray()
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
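The second part of the question (top 10 similar documents within the same category) is not covered by the example above, so here is a minimal sketch of one way it could be done. It relies on TfidfVectorizer L2-normalising each row by default, so the dot product of two rows is their cosine similarity. The helper name top_k_in_category and the category labels are made up for illustration; they are not part of scikit-learn or of the original question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_k_in_category(tfidf, labels, query_idx, k=10):
    """Hypothetical helper: return up to k (doc_index, cosine_similarity)
    pairs for the documents most similar to row `query_idx`, restricted
    to documents that share its category label."""
    labels = np.asarray(labels)
    # Rows in the same category as the query, excluding the query itself.
    candidates = np.flatnonzero(labels == labels[query_idx])
    candidates = candidates[candidates != query_idx]
    if candidates.size == 0:
        return []
    # Rows are L2-normalised (TfidfVectorizer default), so a sparse
    # dot product with the query row yields cosine similarities.
    sims = (tfidf[candidates] @ tfidf[[query_idx]].T).toarray().ravel()
    order = np.argsort(-sims)[:k]  # highest similarity first
    return [(int(candidates[i]), float(sims[i])) for i in order]


docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
labels = ["fruit", "fruit", "fruit", "software"]  # assumed labels

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse term-document matrix
result = top_k_in_category(tfidf, labels, query_idx=0, k=10)
print(result)
```

Only the rows of the query's category are multiplied, so the full document-by-document similarity matrix is never built. With the sample documents, document 2 ranks ahead of document 1 for query 0, which agrees with the first row of the matrix above (0.39 versus 0.25).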
