numpy - Term document matrix and cosine similarity in Python
I have the following situation that I want to address using Python (preferably using numpy and scipy):
- I have a collection of documents that I want to convert into a sparse term-document matrix.
- I want to extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents using cosine similarity within a subset of documents (the documents are labelled with categories, and I want to find similar documents within the same category).
How can I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take their dot product to find the cosine similarity, but how do I convert the entire corpus into a large sparse term-document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?
Thanks.
May I recommend you take a look at scikit-learn? It is a very well regarded library in the Python community, with a simple and consistent API. They have also implemented a cosine similarity metric. This is an example, taken from here, of how you could do it in 3 lines of code:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).toarray()
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
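The question also asks for the top 10 most similar documents within the same category. The following is a minimal sketch of one way to do that on top of the TF-IDF matrix, not a definitive implementation. The helper name `top_k_similar` and the category labels are made up for illustration. It relies on `TfidfVectorizer` L2-normalising each row by default (`norm="l2"`), so the dot product of two rows is their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_k_similar(tfidf, labels, doc_index, k=10):
    """Hypothetical helper: return (index, similarity) pairs for the k documents
    most similar to doc_index among the documents that share its label."""
    labels = np.asarray(labels)
    # Row indices of the other documents in the same category.
    candidates = np.flatnonzero(labels == labels[doc_index])
    candidates = candidates[candidates != doc_index]
    if candidates.size == 0:
        return []
    # Rows are unit length, so this sparse product yields cosine similarities.
    sims = (tfidf[candidates] @ tfidf[[doc_index]].T).toarray().ravel()
    order = np.argsort(-sims)[:k]  # highest similarity first
    return [(int(candidates[i]), float(sims[i])) for i in order]


docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
labels = ["fruit", "fruit", "fruit", "software"]  # made-up categories

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one row per document
result = top_k_similar(tfidf, labels, doc_index=0, k=2)
print(result)
```

Only the rows of the requested category are multiplied, so the full document-by-document similarity matrix is never built. That matters once the corpus is too large for a dense n-by-n array.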