Storing large structured binary data with Haskell
I'm writing an application that interacts with a large (10-1000 GB) memory-mapped binary file holding a bunch of objects that refer to each other. I've come up with a mechanism to read and write this data that is effective, but ugly and verbose (IMO).
Q: Is there a more elegant way to achieve what I've done?
I have a typeclass for structured data, with one method that reads the structure into a Haskell datatype (DataOp is a ReaderT around IO).

    class DBStruct a where
      structRead :: Addr a -> DataOp a
To make this more readable, I have another typeclass that defines which structure members go where:

    class DBStruct structTy => StructMem structTy valTy name | structTy name -> valTy where
      offset :: structTy -> valTy -> name -> Int64
I have a few helper functions that use the offset method for reading/writing structure elements, for reading structures from stored references, and for lazily deferring structure reads (to allow lazy reading of the entire file).
The problem is that this involves a lot of repetition to use. For one structure, I first have to define the Haskell type:

    data RowBlock = RowBlock
      { rbNext :: Maybe RowBlock
      , rbPrev :: Maybe RowBlock
      , rbRows :: [RowTy]
      }
Then the name types:

    data Next = Next
    data Prev = Prev
    data Count = Count
    newtype Row = Row Int64
Then the instances for each structure member:

    instance StructMem RowBlock (Maybe (Addr RowBlock)) Next where offset _ _ _ = 0
    instance StructMem RowBlock (Maybe (Addr RowBlock)) Prev where offset _ _ _ = 8
    instance StructMem RowBlock Int64 Count where offset _ _ _ = 16
    instance StructMem RowBlock RowTy Row where offset _ _ (Row n) = 24 + n * 8
Then the structure read method:

    instance DBStruct RowBlock where
      structRead a = do
        n  <- elemMaybePtr a Next
        p  <- elemMaybePtr a Prev
        c  <- elemRead a Count
        rs <- mapM (elemRead a . Row) [0 .. c-1]
        return $ RowBlock n p rs
So all I've really accomplished is to re-implement C structs in a much more verbose (and slow) way. I would be happier if this were more concise while preserving type safety. Surely this is a commonly encountered problem.
A few possible alternatives I can think of are:
- Ditch memory-mapped files and use Data.Binary, writing the ByteStrings to disk the normal way.
- Use deriving Generic to create generic read and write functions.
- Overload functional references.
- Do something magical with monadic lenses.
Edit: SSCCE, as requested.
You might try using Data.Binary with your Ptrs.
For writing:
Use Data.Binary to build a ByteString. A ByteString is a tuple (ForeignPtr Word8, Int, Int) that holds the address, offset, and length of where the data is stored. You can use toForeignPtr from the Data.ByteString.Internal module, which will unpack the tuple for you. Foreign.ForeignPtr provides withForeignPtr, which takes a function that performs an IO action via the pointer. In there you can memcpy (a binding is also provided in Data.ByteString.Internal) the ByteString's storage to the mmapped Ptr you got from mmap.
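The write path could look roughly like the sketch below. It rests on several assumptions: `writeAt` is a hypothetical helper name, a buffer from `allocaBytes` stands in for the region an mmap binding would return, and `unsafeUseAsCStringLen` plus `copyBytes` are swapped in for the `toForeignPtr`/`memcpy` pair because they do the same copy with less unpacking. Note that `encode` returns a lazy ByteString, so it is made strict before copying.

```haskell
import Data.Binary (Binary, encode)
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Int (Int64)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import Foreign.Marshal.Utils (copyBytes)
import Foreign.Ptr (Ptr, castPtr, plusPtr)

-- Hypothetical helper: copy the Data.Binary encoding of a value to a byte
-- offset inside a region and return the number of bytes written.
-- The caller must make sure the region is large enough.
writeAt :: Binary a => Ptr Word8 -> Int -> a -> IO Int
writeAt base off val =
  let bytes = BL.toStrict (encode val)          -- encode yields a lazy ByteString
      dst   = base `plusPtr` off :: Ptr Word8
  in unsafeUseAsCStringLen bytes $ \(src, len) -> do
       copyBytes dst (castPtr src) len          -- raw copy into the region
       return len

main :: IO ()
main = allocaBytes 32 $ \region -> do           -- stand-in for an mmapped region
  n <- writeAt region 8 (258 :: Int64)
  stored <- peekArray n (region `plusPtr` 8) :: IO [Word8]
  print (n, stored)
```

Data.Binary always encodes in big-endian order, so the bytes in the file do not depend on the host machine.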
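For reading:
You can use Data.ByteString.Internal's fromForeignPtr to turn a Ptr (wrapped in a ForeignPtr, for example with newForeignPtr_) into a ByteString. This is what the mmap libraries do, but you can do it a record at a time instead of for the entire region. Once you have a ByteString view on the memory, you can unpack it with Data.Binary.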
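A minimal sketch of that read path, under assumptions: `readInt64At` is a hypothetical helper, the record is a single 8-byte Int64, and an `allocaBytes` buffer again stands in for the mmapped region. The ByteString built here is a zero-copy view, so it is only valid while the underlying memory stays mapped; the sketch forces the decoded value before returning for that reason.

```haskell
import Data.Binary (decode)
import qualified Data.ByteString.Internal as BI
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word8)
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (pokeArray)
import Foreign.Ptr (Ptr, plusPtr)

-- Hypothetical helper: view 8 bytes at a byte offset as a ByteString
-- (no copy) and decode them with Data.Binary.
readInt64At :: Ptr Word8 -> Int -> IO Int64
readInt64At base off = do
  fp <- newForeignPtr_ (base `plusPtr` off)     -- no finalizer: memory is not ours
  let view = BI.fromForeignPtr fp 0 8           -- offset 0, length 8
  return $! decode (BL.fromStrict view)         -- force while memory is still valid

main :: IO ()
main = allocaBytes 8 $ \region -> do            -- stand-in for an mmapped region
  pokeArray region [0, 0, 0, 0, 0, 0, 1, 2]     -- big-endian bytes of 258
  v <- readInt64At region 0
  print v
```

If a decoded value holds on to parts of the view (a nested ByteString field, say), copying the view first with `Data.ByteString.copy` avoids dangling references once the file is unmapped.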
Another option is to take advantage of the fact that ByteString has an alternative implementation in Data.Vector.Storable.ByteString, which would let you use the Storable interface you're using now to read and write them to the mmapped Ptrs. Its interface and basic type are isomorphic to the Data.ByteString one, but it has Storable instances.
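For comparison, here is a rough sketch of going through the plain Storable interface from base, without Data.Vector.Storable.ByteString. The `Header` type and the `readHeader`/`readRow` helpers are hypothetical; the offsets (0, 8, 16, rows from 24) come from the question, and the pointer fields are simplified to raw Int64 values. Unlike Data.Binary, Storable reads and writes in host byte order.

```haskell
import Data.Int (Int64)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- Hypothetical fixed-size header of a row block; next/prev are kept as
-- raw Int64 addresses for simplicity.
data Header = Header
  { hNext  :: Int64
  , hPrev  :: Int64
  , hCount :: Int64
  } deriving (Eq, Show)

-- Read the three header fields at the offsets used in the question.
readHeader :: Ptr Word8 -> IO Header
readHeader p = Header <$> peekByteOff p 0 <*> peekByteOff p 8 <*> peekByteOff p 16

-- Read the n-th row, stored as an Int64 starting at offset 24.
readRow :: Ptr Word8 -> Int -> IO Int64
readRow p n = peekByteOff p (24 + n * 8)

main :: IO ()
main = allocaBytes 40 $ \p -> do                -- stand-in for an mmapped region
  pokeByteOff p 0  (0  :: Int64)                -- next
  pokeByteOff p 8  (0  :: Int64)                -- prev
  pokeByteOff p 16 (2  :: Int64)                -- count
  pokeByteOff p 24 (10 :: Int64)                -- row 0
  pokeByteOff p 32 (20 :: Int64)                -- row 1
  h    <- readHeader p
  rows <- mapM (readRow p) [0 .. fromIntegral (hCount h) - 1]
  print (h, rows)
```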