python - How to create a large pandas DataFrame from an SQL query without running out of memory?


I am having trouble querying a table of > 5 million records from an MS SQL Server database. I want to be able to select all of the records, but my code seems to fail when selecting too much data into memory.

This works:

import pandas.io.sql as psql

sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)

...but this does not work:

sql = "select top 2000000 * mytable"  data = psql.read_frame(sql, cnxn) 

It returns this error:

file "inference.pyx", line 931, in pandas.lib.to_object_array_tuples (pandas\lib.c:42733) memory error 

I have read here that a similar problem exists when creating a DataFrame from a CSV file, and that the work-around is to use the 'iterator' and 'chunksize' parameters, like this:

read_csv('exp4326.csv', iterator=True, chunksize=1000)
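For reference, with chunksize set, read_csv returns an iterator of DataFrames that can be processed one piece at a time rather than all at once; a minimal sketch (process_chunk is a hypothetical placeholder for whatever per-chunk work is needed):

import pandas as pd

# each iteration yields a DataFrame of up to 1000 rows
for chunk in pd.read_csv('exp4326.csv', iterator=True, chunksize=1000):
    process_chunk(chunk)  # hypothetical per-chunk processing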

Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I read in the records in chunks by some other method? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.

Update: Make sure to check out the answer below; pandas now has built-in support for chunked loading.
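In recent pandas versions that built-in support looks roughly like this; a minimal sketch, assuming pd.read_sql is available (when given a chunksize parameter it returns an iterator of DataFrames instead of one large frame):

import pandas as pd

# with chunksize set, read_sql yields DataFrames of up to 10000 rows each
chunks = pd.read_sql("SELECT * FROM MyTable", cnxn, chunksize=10000)
full_df = pd.concat(chunks, ignore_index=True)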

You could try to read the input table chunk-wise and assemble the full DataFrame from the individual pieces afterwards, like this:

import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY gives a stable row order, so consecutive chunks don't overlap
    sql = "SELECT * FROM MyTable ORDER BY id LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    # the last chunk comes back short, which signals the end of the table
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
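One caveat: the LIMIT/OFFSET syntax above is for databases like MySQL or PostgreSQL. Since the question is about MS SQL Server, a rough equivalent for SQL Server 2012 and later (a sketch, assuming the table has an id column to order by) would use OFFSET ... FETCH pagination:

import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # SQL Server 2012+ pagination: ORDER BY is required with OFFSET/FETCH
    sql = ("SELECT * FROM MyTable ORDER BY id "
           "OFFSET %d ROWS FETCH NEXT %d ROWS ONLY" % (offset, chunk_size))
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)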

It might also be that the whole DataFrame is simply too large to fit in memory; in that case, you will have no option other than restricting the number of rows or columns you're selecting.
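As a sketch of that last point, you can cut memory use by naming only the columns you need and capping the row count with TOP (the column names here are hypothetical):

sql = "SELECT TOP 1000000 id, col_a, col_b FROM MyTable"
data = psql.read_frame(sql, cnxn)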

