Python: Looping over 2 million lines
I have a loop over a large file (2 million lines) that looks like this:

    p61981 1433g_human
    p61982 1433g_mouse
    q5rc20 1433g_ponab
    p61983 1433g_rat
    p68253 1433g_sheep
I currently have the following function: it takes every entry in a list and, if the entry occurs somewhere in the large file, keeps that row. It is slow (~10 minutes), presumably because of the nested looping. Can you please suggest an optimization?

    up = "database.txt"

    def mplist(somelist):
        newlist = []
        with open(up) as u:
            for row in u:
                for entry in somelist:
                    if entry in row:
                        newlist.append(row)
        return newlist
Example of somelist:

    somelist = ['p68250', 'p31946', 'q4r572', 'q9cqv8', 'a4k2u9', 'p35213', 'p68251']
If somelist contains values found in the first column, split each line and test the first value against a set, not a list:

    def mplist(somelist):
        someset = set(somelist)
        with open(up) as u:
            return [line for line in u if line.split(None, 1)[0] in someset]

Testing membership against a set is an O(1), constant-time operation (independent of the size of the set), whereas testing against a list scans every element. line.split(None, 1) splits on whitespace at most once, so only the first column is extracted for the comparison.
Demo:

    >>> up = '/tmp/database.txt'
    >>> open(up, 'w').write('''\
    ... p61981 1433g_human
    ... p61982 1433g_mouse
    ... q5rc20 1433g_ponab
    ... p61983 1433g_rat
    ... p68253 1433g_sheep
    ... ''')
    >>> def mplist(somelist):
    ...     someset = set(somelist)
    ...     with open(up) as u:
    ...         return [line for line in u if line.split(None, 1)[0] in someset]
    ...
    >>> mplist(['p61981', 'q5rc20'])
    ['p61981 1433g_human\n', 'q5rc20 1433g_ponab\n']
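To see the set-versus-list difference concretely, here is a minimal timing sketch (the 100,000 random six-letter keys are an assumption made up for illustration; absolute numbers will vary by machine):

    import random
    import string
    import timeit

    random.seed(0)
    # Build an illustrative corpus of 100,000 random six-letter keys.
    keys = [''.join(random.choices(string.ascii_lowercase, k=6)) for _ in range(100000)]
    aslist = list(keys)
    asset = set(keys)
    probe = keys[-1]  # a key near the end: worst case for the list scan

    # Membership in a list scans element by element: O(n).
    print(timeit.timeit('probe in aslist', number=100, globals=globals()))
    # Membership in a set is a hash lookup: O(1) on average.
    print(timeit.timeit('probe in asset', number=100, globals=globals()))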
You may want to return a generator instead and filter lazily, so you don't build the whole list in memory. Note that returning a generator expression from inside the with block would not work here, because the file would already be closed by the time the generator is consumed; a generator function keeps the file open while you iterate:

    def mplist(somelist):
        someset = set(somelist)
        with open(up) as u:
            for line in u:
                if line.split(None, 1)[0] in someset:
                    yield line
You can then loop over, but not index, the result:

    for match in mplist(somelist):
        ...  # do something with match here

and you don't need to hold all the matched entries in memory.
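For example, a minimal sketch that streams the matches straight to an output file (matches.txt is a hypothetical path), so memory use stays flat no matter how many lines match:

    with open('matches.txt', 'w') as out:  # hypothetical output path
        for match in mplist(somelist):
            out.write(match)  # each matched line already ends with '\n'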