python - How to create a large pandas dataframe from an sql query without running out of memory?


I am having trouble querying a table of > 5 million records from an MS SQL Server database. I want to be able to select all of the records, but my code seems to fail when selecting too much data into memory.

This works:

import pandas.io.sql as psql

# cnxn is an existing database connection
sql = "select top 1000000 * from mytable"
data = psql.read_frame(sql, cnxn)

...but this does not work:

sql = "select top 2000000 * mytable"  data = psql.read_frame(sql, cnxn) 

It returns this error:

file "inference.pyx", line 931, in pandas.lib.to_object_array_tuples (pandas\lib.c:42733) memory error 

I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters, like this:

read_csv('exp4326.csv', iterator=True, chunksize=1000)
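For reference, read_csv called that way returns an iterator of DataFrames rather than a single frame; a minimal sketch of consuming it (assuming pandas is imported as pd and that concatenating the pieces is actually what you want):

import pandas as pd

reader = pd.read_csv('exp4326.csv', iterator=True, chunksize=1000)
# each item yielded by the reader is a DataFrame of up to 1000 rows
df = pd.concat(reader, ignore_index=True)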

Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I read in the records in chunks by some other method? I have read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.

Update: Make sure to check out the answer below, as pandas now has built-in support for chunked loading.
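As a rough sketch of what that built-in support looks like (read_sql is the newer replacement for psql.read_frame; process is just a placeholder for whatever you do with each chunk):

import pandas as pd

# with chunksize set, read_sql returns an iterator of DataFrames instead of
# one large frame, so only one chunk is held in memory at a time
for chunk in pd.read_sql("select * from mytable", cnxn, chunksize=10000):
    process(chunk)  # placeholder: aggregate, write out, or append the chunk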

You could try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:

import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # fetch the next chunk_size rows; ordering by id keeps the paging stable
    sql = "select * from mytable order by id limit %d offset %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    # the last chunk comes back shorter than chunk_size, so stop there
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)

It might also be that the whole dataframe is simply too large to fit in memory; in that case you will have no other option than to restrict the number of rows or columns you're selecting.
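For example, a minimal sketch of narrowing the query itself (the column names and the filter here are purely illustrative):

# pull only the columns and rows you actually need
sql = "select id, col_a, col_b from mytable where year >= 2012"
data = psql.read_frame(sql, cnxn)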

