Python Pandas - Remove values from first dataframe if not in second dataframe
I have user/item data for a recommender. I'm splitting it into test and train data, and I need to be sure that any new users or items in the test data are omitted before evaluating the recommender. My approach works for small datasets, but when the data gets big, it takes forever. Is there a better way to do this?
import pandas as pd

# Test set: remove users or items that are not in train
te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})
print("training_______")
print(tr)
print("\ntesting_______")
print(te)

# Using 2 joins and selecting the proper columns, 'new' members of the test set are removed
b = pd.merge(
    pd.merge(te, tr, on='user', suffixes=('', '_d')),
    tr, on='item', suffixes=('', '_d'))[['user', 'item']]
print("\nsolution_______")
print(b)
This gives:
training_______
   item  user
0    11     1
1    12     2
2    13     3
3    14     4
4    15     5

testing_______
   item  user
0    16     1
1    12     2
2    19     3
3    15     1
4    13     6
5    12     1

solution_______
   user  item
0     1    15
1     1    12
2     2    12
The solution is correct (any new user or item causes the whole row to be removed from test), but it is slow at scale.
Thanks in advance.
I think you can achieve what you want using the isin Series method on each of the columns:
In [11]: te['item'].isin(tr['item']) & te['user'].isin(tr['user'])
Out[11]:
0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

In [12]: te[te['item'].isin(tr['item']) & te['user'].isin(tr['user'])]
Out[12]:
   item  user
1    12     2
3    15     1
5    12     1
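For reference, the same filter can be written as a small standalone script. This is only a sketch: the helper name `drop_unseen` and its `cols` parameter are made up for illustration and are not part of pandas. It builds one boolean mask per column with `Series.isin` and combines them with `&`, exactly as in the session above.

```python
import pandas as pd

te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})


def drop_unseen(test, train, cols=('user', 'item')):
    """Keep the rows of `test` whose value in every column of `cols` also occurs in `train`.

    Hypothetical helper, not a pandas function.
    """
    # Start with every row kept, then AND in one membership test per column.
    mask = pd.Series(True, index=test.index)
    for col in cols:
        mask &= test[col].isin(train[col])
    return test[mask]


filtered = drop_unseen(te, tr)
print(filtered)
```

Because `isin` is a membership test and not a join, each test row is kept or dropped exactly once and keeps its original index label, which is why the result here is indexed 1, 3, 5 while the merge result is renumbered from 0.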
In 0.13 you'll be able to use the new DataFrame isin method (on current master):
In [21]: te[te.isin(tr.to_dict(outtype='list')).all(1)]
Out[21]:
   item  user
1    12     2
3    15     1
5    12     1
Hopefully the syntax will be a bit better on release:
te[te.isin(tr).all(1)]
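A note for anyone trying this on a released pandas: as far as I can tell, `to_dict` now takes `orient` in place of `outtype`, and `DataFrame.isin` called with another DataFrame matches on the index and column labels as well as the values, so the short form is not equivalent to the dict form. The sketch below compares the two on the data from the question; the variable names `by_values` and `by_position` are just illustrative.

```python
import pandas as pd

te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

# Dict form: each column of te is checked against the list of values
# found in the same-named column of tr, regardless of row position.
by_values = te[te.isin(tr.to_dict(orient='list')).all(axis=1)]

# DataFrame form: a cell only counts as a match when tr holds the same
# value at the same index label and column label.
by_position = te[te.isin(tr).all(axis=1)]

print(by_values)
print(by_position)
```

On this data the dict form keeps rows 1, 3 and 5, while the DataFrame form keeps only row 1, because that is the only row where both values happen to line up with the same index label in `tr`.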