loops - Python: Looping over 2 million lines
I have a loop over a large file (2 million lines) that looks like this:

p61981 1433g_human
p61982 1433g_mouse
q5rc20 1433g_ponab
p61983 1433g_rat
p68253 1433g_sheep

I currently have the following function: it takes every entry in a list and, if that entry occurs in the large file, collects the matching row. It is slow (~10 min). Given the looping scheme, can you suggest an optimization?
up = "database.txt" def mplist(somelist): newlist = [] open(up) u: row in u: in somelist: if in row: newlist.append(row) return newlist example of somelist
somelist = ['p68250', 'p31946', 'q4r572', 'q9cqv8', 'a4k2u9', 'p35213', 'p68251']
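For scale: the nested loop substring-scans every one of the 2 million rows once per entry in somelist. A rough, hypothetical extrapolation of that cost (made-up key count and non-matching keys; timings vary by machine):

import timeit

row = "p61981 1433g_human"
keys = ["key%04d" % i for i in range(1000)]  # 1,000 made-up keys, none of which match

# Time the inner loop over 1,000 rows' worth of work...
per_1000_rows = timeit.timeit(lambda: any(k in row for k in keys), number=1000)
# ...then extrapolate to 2 million rows.
print("estimated total: %.1f s" % (per_1000_rows * 2000))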
If somelist only contains values found in the first column, split the line and test the first value against a set, not a list:
def mplist(somelist):
    someset = set(somelist)
    with open(up) as u:
        return [line for line in u if line.split(None, 1)[0] in someset]

Testing against a set is an O(1) constant-time operation, independent of the size of the set.
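To see the difference, here is a hypothetical micro-benchmark (made-up keys; exact numbers vary by machine) comparing a failed membership test against a list and a set of 10,000 entries:

import timeit

setup = ("somelist = ['k%06d' % i for i in range(10000)]\n"
         "someset = set(somelist)")

# 'missing' is absent, so the list test is worst-case: it scans all 10,000 entries.
print(timeit.timeit("'missing' in somelist", setup=setup, number=1000))
# The set does a single hash lookup regardless of its size.
print(timeit.timeit("'missing' in someset", setup=setup, number=1000))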
Demo:

>>> up = '/tmp/database.txt'
>>> open(up, 'w').write('''\
... p61981 1433g_human
... p61982 1433g_mouse
... q5rc20 1433g_ponab
... p61983 1433g_rat
... p68253 1433g_sheep
... ''')
>>> def mplist(somelist):
...     someset = set(somelist)
...     with open(up) as u:
...         return [line for line in u if line.split(None, 1)[0] in someset]
...
>>> mplist(['p61981', 'q5rc20'])
['p61981 1433g_human\n', 'q5rc20 1433g_ponab\n']

You may want to return a generator instead and filter lazily, rather than build the whole list in memory:
def mplist(somelist):
    someset = set(somelist)
    with open(up) as u:
        for line in u:
            if line.split(None, 1)[0] in someset:
                yield line

You can loop over the result, but not index it:
for match in mplist(somelist):
    # do something with match

and you never need to hold all the matched entries in memory.
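For example, a short sketch (the output filename is made up) that streams the matches straight into another file, one line at a time:

with open('matches.txt', 'w') as out:
    for match in mplist(somelist):
        out.write(match)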