python - How to create a large pandas dataframe from an SQL query without running out of memory?
I am having trouble querying a table of > 5 million records from my MS SQL Server database. I want to be able to select all of the records, but my code seems to fail when selecting too much data into memory.
This works:
import pandas.io.sql as psql

sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "select top 2000000 * mytable" data = psql.read_frame(sql, cnxn)
It returns this error:
file "inference.pyx", line 931, in pandas.lib.to_object_array_tuples (pandas\lib.c:42733) memory error
I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:
read_csv('exp4326.csv', iterator=True, chunksize=1000)
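As I understand it, with these parameters read_csv returns an iterator of DataFrames rather than a single frame, so the pieces have to be consumed in a loop. A minimal sketch, assuming the same exp4326.csv file:

import pandas as pd

pieces = []
for chunk in pd.read_csv('exp4326.csv', iterator=True, chunksize=1000):
    # each chunk is a DataFrame with up to 1000 rows
    pieces.append(chunk)
data = pd.concat(pieces, ignore_index=True)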
Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I read in the records in chunks by some other method? I read a bit of the discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.
Update: make sure to check out the answer below, as pandas now has built-in support for chunked loading.
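For reference, the built-in support mentioned above is the chunksize parameter of read_sql / read_sql_query, which yields the result set as an iterator of DataFrame chunks. A minimal sketch, assuming cnxn is a connection pandas accepts (such as a SQLAlchemy engine):

import pandas as pd

pieces = []
for chunk in pd.read_sql_query("SELECT * FROM MyTable", cnxn, chunksize=10000):
    # each chunk is a DataFrame with up to 10000 rows
    pieces.append(chunk)
data = pd.concat(pieces, ignore_index=True)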
You could try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # page through the table until a short chunk signals the end
    sql = "SELECT * FROM MyTable ORDER BY id LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
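Note that the LIMIT/OFFSET paging above is the syntax used by databases such as MySQL, PostgreSQL or SQLite. Since the question is about MS SQL Server, the paging clause would look different there; on SQL Server 2012 and later it is OFFSET ... FETCH, for example:

# SQL Server 2012+ paging equivalent of the LIMIT/OFFSET line above
sql = ("SELECT * FROM MyTable ORDER BY id "
       "OFFSET %d ROWS FETCH NEXT %d ROWS ONLY" % (offset, chunk_size))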
It might also be possible that the whole dataframe is simply too large to fit in memory; in that case you will have no other option than to restrict the number of rows or columns you're selecting.
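If the assembled dataframe really cannot fit, another option is to process each chunk as it arrives and keep only a reduced result instead of concatenating everything. A sketch of that idea, using hypothetical category and amount columns that are not part of the original question:

import pandas as pd

totals = None
for chunk in pd.read_sql_query("SELECT * FROM MyTable", cnxn, chunksize=10000):
    # 'category' and 'amount' are hypothetical column names for illustration
    partial = chunk.groupby('category')['amount'].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)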