loops - Python: Looping over 2mln lines


I have to loop over a large file with 2 million lines. It looks like this:

p61981  1433g_human
p61982  1433g_mouse
q5rc20  1433g_ponab
p61983  1433g_rat
p68253  1433g_sheep

Currently I have the following function. It takes every entry in a list and, if the entry occurs in the large file, collects the row where it occurs. It's slow (~10 min), which I think is due to the looping scheme. Can you please suggest an optimization?

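up = "database.txt"

def mplist(somelist):
    newlist = []
    with open(up) as u:
        for row in u:
            # Substring test of every entry against every row:
            # the work grows as rows * entries.
            for i in somelist:
                if i in row:
                    newlist.append(row)
    return newlist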

An example of somelist:

somelist = [
    'p68250',
    'p31946',
    'q4r572',
    'q9cqv8',
    'a4k2u9',
    'p35213',
    'p68251'
]

If somelist contains only values found in the first column, then split the line and test just the first value against a set, not a list:

def mplist(somelist):
    someset = set(somelist)
    with open(up) as u:
        # Keep a line only if its first whitespace-separated field is wanted.
        return [line for line in u if line.split(None, 1)[0] in someset]
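To make the split step concrete, here is a small sketch of what `line.split(None, 1)` produces for one of the sample rows. The variable names are only for illustration. Note that a blank line splits into an empty list, so indexing `[0]` on it would raise `IndexError`; this assumes the real file may contain such lines, which the question does not say.

```python
line = "p61981  1433g_human\n"

# sep=None splits on any run of whitespace; maxsplit=1 stops after the first field.
fields = line.split(None, 1)
first = fields[0]
print(fields)  # ['p61981', '1433g_human\n']
print(first)   # p61981

# A blank line yields no fields at all.
blank = "\n".split(None, 1)
print(blank)   # []
```

Limiting the split to one cut avoids splitting the rest of the row, which is never inspected.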

Testing membership against a set is an O(1), constant time operation on average (independent of the size of the set), whereas testing against a list scans it element by element.
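A rough way to see the difference is to time a single membership test against a list and against a set built from the same values. This is only a sketch: the IDs are made up, the sizes are arbitrary, and the absolute timings will vary by machine.

```python
import timeit

# Hypothetical accession-like IDs, generated just for this comparison.
ids = ["p{:05d}".format(n) for n in range(10000)]
id_list = ids
id_set = set(ids)
probe = ids[-1]  # worst case for the list: the last element

# Time 1000 membership tests against each container.
list_time = timeit.timeit(lambda: probe in id_list, number=1000)
set_time = timeit.timeit(lambda: probe in id_set, number=1000)

print("list: {:.4f}s  set: {:.4f}s".format(list_time, set_time))
```

The list lookup has to compare the probe against entries one by one, while the set lookup hashes the probe once, so the gap widens as the collection grows.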

Demo:

>>> up = '/tmp/database.txt'
>>> with open(up, 'w') as f:
...     _ = f.write('''\
... p61981  1433g_human
... p61982  1433g_mouse
... q5rc20  1433g_ponab
... p61983  1433g_rat
... p68253  1433g_sheep
... ''')
...
>>> def mplist(somelist):
...     someset = set(somelist)
...     with open(up) as u:
...         return [line for line in u if line.split(None, 1)[0] in someset]
...
>>> mplist(['p61981', 'q5rc20'])
['p61981  1433g_human\n', 'q5rc20  1433g_ponab\n']

You may want to return a generator instead, and filter lazily rather than build a list in memory:

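def mplist(somelist):
    someset = set(somelist)
    with open(up) as u:
        # Yield inside the with block so the file stays open while the
        # caller iterates; returning a generator expression from here
        # would close the file before the first line is read.
        for line in u:
            if line.split(None, 1)[0] in someset:
                yield line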

You can loop over this result, but not index it:

for match in mplist(somelist):
    print(match, end='')  # do something with match

and you do not need to hold all the matched entries in memory.
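As a sketch of how the lazy version might be used in practice, the variant below guards against blank lines and shows taking only the first few matches with `itertools.islice`. The function name `iter_matches`, its `path` parameter, and the temporary demo file are assumptions for illustration, not part of the original answer.

```python
import itertools
import os
import tempfile


def iter_matches(path, wanted):
    """Yield lines of `path` whose first whitespace-separated field is in `wanted`."""
    wanted = set(wanted)
    with open(path) as handle:
        for line in handle:
            fields = line.split(None, 1)
            # Blank lines split into an empty list, so check before indexing.
            if fields and fields[0] in wanted:
                yield line


# Build a small throwaway file (with one blank line) to exercise the function.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("p61981  1433g_human\n\nq5rc20  1433g_ponab\np61983  1433g_rat\n")
    demo_path = tmp.name

wanted_ids = ["p61981", "q5rc20", "p61983"]

# Consume everything, then only the first two matches.
all_matches = list(iter_matches(demo_path, wanted_ids))
first_two = list(itertools.islice(iter_matches(demo_path, wanted_ids), 2))

os.remove(demo_path)
print(all_matches)
print(first_two)
```

Because the generator only reads as far as the caller asks, stopping after two matches avoids scanning the rest of the file.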

