ff - Memory-Efficient Storage of Large Data on Disk and Fast Access
Functions
The ff package provides data structures that are stored on
disk but behave (almost) as if they were in RAM by
transparently mapping only a section (pagesize) in main memory
- the effective virtual memory consumption per ff object. ff
supports R's standard atomic data types 'double', 'logical',
'raw' and 'integer' and non-standard atomic types boolean (1
bit), quad (2 bit unsigned), nibble (4 bit unsigned), byte (1
byte signed with NAs), ubyte (1 byte unsigned), short (2 byte
signed with NAs), ushort (2 byte unsigned), single (4 byte
float with NAs). For example 'quad' allows efficient storage of
genomic data as an 'A','T','G','C' factor. The unsigned types
support 'circular' arithmetic. There is also support for
close-to-atomic types 'factor', 'ordered', 'POSIXct', 'Date'
and custom close-to-atomic types. ff not only has native
C-support for vectors, matrices and arrays with flexible
dimorder (major column-order, major row-order and
generalizations for arrays). There is also a ffdf class not
unlike data.frames and import/export filters for csv files. ff
objects store raw data in binary flat files in native encoding,
and complement this with metadata stored in R as physical and
virtual attributes. ff objects have well-defined hybrid copying
semantics, which gives rise to certain performance improvements
through virtualization. ff objects can be stored and reopened
across R sessions. ff files can be shared by multiple ff R
objects (using different data en/de-coding schemes) in the same
process or from multiple R processes to exploit parallelism. A
wide choice of finalizer options allows to work with
'permanent' files as well as creating/removing 'temporary' ff
files completely transparent to the user. On certain
OS/Filesystem combinations, creating the ff files works without
notable delay thanks to using sparse file allocation. Several
access optimization techniques such as Hybrid Index
Preprocessing and Virtualization are implemented to achieve
good performance even with large datasets, for example virtual
matrix transpose without touching a single byte on disk.
Further, to reduce disk I/O, 'logicals' and non-standard data
types get stored native and compact on binary flat files i.e.
logicals take up exactly 2 bits to represent TRUE, FALSE and
NA. Beyond basic access functions, the ff package also provides
compatibility functions that facilitate writing code for ff and
ram objects and support for batch processing on ff objects
(e.g. as.ram, as.ff, ffapply). ff interfaces closely with
functionality from package 'bit': chunked looping, fast bit
operations and coercions between different objects that can
store subscript information ('bit', 'bitwhich', ff 'boolean',
ri range index, hi hybrid index). This allows to work
interactively with selections of large datasets and quickly
modify selection criteria. Further high-performance
enhancements can be made available upon request.