loops - Python: Looping over 2mln lines -

- January 15, 2011

I have to loop over a large file of 2 million lines that looks like this:

    p61981  1433g_human
    p61982  1433g_mouse
    q5rc20  1433g_ponab
    p61983  1433g_rat
    p68253  1433g_sheep

Currently I have the following function, which takes every entry in a list and, if that entry appears in the large file, collects the row in which it occurs. It is slow (~10 min), probably because of the looping scheme. Can you please suggest an optimization?

    up = "database.txt"

    def mplist(somelist):
        newlist = []
        with open(up) as u:
            for row in u:
                # every row is compared against every entry in somelist
                for i in somelist:
                    if i in row:
                        newlist.append(row)
        return newlist

An example of somelist:

    somelist = [
        'p68250',
        'p31946',
        'q4r572',
        'q9cqv8',
        'a4k2u9',
        'p35213',
        'p68251'
    ]

If somelist contains only values found in the first column, split each line and test the first value against a set, not a list:

    def mplist(somelist):
        someset = set(somelist)
        with open(up) as u:
            # split(None, 1) splits on whitespace once; [0] is the first column
            return [line for line in u if line.split(None, 1)[0] in someset]

Testing membership in a set is an O(1), constant-time operation on average (independent of the size of the set), whereas testing against a list scans the list entry by entry.
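
To make the difference concrete, here is a rough timing sketch. The IDs are made up for illustration (they are not from the real database), and the absolute numbers will vary by machine; only the gap between the two lookups matters.

```python
import timeit

# Hypothetical data: 10,000 made-up IDs, not taken from the real database.
ids = ['id%05d' % n for n in range(10000)]
id_list = list(ids)
id_set = set(ids)

# Look up the last ID, which is the worst case for a linear list scan.
probe = ids[-1]

# Time 1,000 membership tests against each container.
list_time = timeit.timeit(lambda: probe in id_list, number=1000)
set_time = timeit.timeit(lambda: probe in id_set, number=1000)

print('list: %.4f s, set: %.4f s' % (list_time, set_time))
```

With 2 million rows in the file, that per-lookup cost is paid once per row, which is why the container choice dominates the running time.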

Demo:

    >>> up = '/tmp/database.txt'
    >>> open(up, 'w').write('''\
    ... p61981  1433g_human
    ... p61982  1433g_mouse
    ... q5rc20  1433g_ponab
    ... p61983  1433g_rat
    ... p68253  1433g_sheep
    ... ''')
    >>> def mplist(somelist):
    ...     someset = set(somelist)
    ...     with open(up) as u:
    ...         return [line for line in u if line.split(None, 1)[0] in someset]
    ...
    >>> mplist(['p61981', 'q5rc20'])
    ['p61981  1433g_human\n', 'q5rc20  1433g_ponab\n']
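
Note that the set-based test is not an exact drop-in for the loop in the question: `i in row` is a substring test anywhere in the row, while the set test requires the whole first column to match. A small sketch of the difference, using one row from the sample data and a deliberately truncated, hypothetical ID:

```python
row = 'p61981  1433g_human\n'

# Substring test, as in the question's loop: matches anywhere in the row,
# so a truncated ID (or a fragment of the second column) also counts as a hit.
substring_hit = 'p6198' in row
second_column_hit = '1433g' in row

# Exact test on the first column, as in the set-based version.
first = row.split(None, 1)[0]
exact_hit_truncated = first in {'p6198'}
exact_hit_full = first in {'p61981'}

print(substring_hit, second_column_hit)      # True True
print(first)                                 # p61981
print(exact_hit_truncated, exact_hit_full)   # False True
```

If partial matches were actually intended, the set approach does not apply as written.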

You may want to return a generator instead, so that you only filter the file and never build the list in memory:

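    def mplist(somelist):
        someset = set(somelist)
        with open(up) as u:
            # yield inside the with block so the file stays open while the
            # caller iterates; returning a generator expression from here
            # would close the file before the first line is read
            for line in u:
                if line.split(None, 1)[0] in someset:
                    yield line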

You can only loop over the result, not index it:

    for match in mplist(somelist):
        # do something with match
        print(match)

and you do not need to hold all the matched entries in memory.
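
For completeness, here is a self-contained sketch of the lazy approach that can be run without the real database. The function name `iter_matches`, its `path` parameter, and the temporary sample file are assumptions made for this example; it also skips blank lines, which would otherwise raise an IndexError on `[0]`.

```python
import os
import tempfile


def iter_matches(path, wanted):
    """Yield lines of `path` whose first whitespace-separated field is in `wanted`."""
    wanted = set(wanted)
    with open(path) as handle:
        for line in handle:
            fields = line.split(None, 1)
            # Blank lines produce an empty list; skip them instead of failing.
            if fields and fields[0] in wanted:
                yield line


# Build a tiny sample file from rows in the question, plus one blank line.
sample = (
    'p61981  1433g_human\n'
    '\n'
    'p61982  1433g_mouse\n'
    'q5rc20  1433g_ponab\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# list() is used here only to show the result; a for loop would stay lazy.
matches = list(iter_matches(sample_path, ['p61981', 'q5rc20']))
os.remove(sample_path)

print(matches)
```

Passing the path as an argument rather than reading a module-level `up` is only a convenience for testing; the filtering logic is the same.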
