Introduction to the TSVIO Package

Background

The TSVIO package provides a fast, simple, read-only interface for accessing subsets of potentially very large (many gigabytes) data matrices stored in plain text files.

Data File Format

Data files are required to be plain text files containing lines with tab-separated data columns.

Each line is separated into logical columns (fields) by tab characters.

The first line must contain a unique label for each data column. The first line may contain one fewer field than the remaining lines. Such files are often produced by R. Alternatively, the first line may contain the same number of fields as the remaining lines, in which case the first field on that line is ignored. Such files are often produced by tools other than R.

Every line (row) after the first must contain the same number of fields. The first field of each line must be a unique row label. (Row and column labels are treated separately and can have labels in common.)
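The two header layouts described above can be produced from R itself. The following is a minimal sketch using only base R; the matrix, the temporary file names, and the helper countFields are made up for illustration and are not part of tsvio.

```r
# Small example matrix with unique row and column labels.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# R-style file: the header line has one fewer field than the data lines.
rfile <- tempfile(fileext = ".tsv")
write.table(m, rfile, sep = "\t", quote = FALSE)

# Alternative style: col.names = NA writes an empty leading header field,
# so every line has the same number of fields.
xfile <- tempfile(fileext = ".tsv")
write.table(m, xfile, sep = "\t", quote = FALSE, col.names = NA)

# Illustrative helper: count the tab-separated fields on each line.
countFields <- function(path) {
  lengths(strsplit(readLines(path), "\t", fixed = TRUE))
}

rcounts <- countFields(rfile)  # header line first, then the data lines
xcounts <- countFields(xfile)
```

Comparing rcounts and xcounts shows the difference between the two layouts: only the header line differs, by one leading field.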

tsvio assumes that the data file is static and does not change during an R session.

Index File

Before data can be read from a data file, an index file containing the starting position of the data line for each row label must be generated.

The index file can be generated explicitly by calling tsvGenIndex:

tsvGenIndex (filename, indexfile)
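As a minimal sketch, a call might look like the following. The data file is a small made-up example, the index file location is an arbitrary choice, and the call is guarded so that the snippet also runs where tsvio is not installed.

```r
# Write a small tab-separated data file to index (illustrative data).
datafile <- tempfile(fileext = ".tsv")
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
write.table(m, datafile, sep = "\t", quote = FALSE)

# The index file location is an arbitrary choice for this example.
indexfile <- tempfile(fileext = ".idx")

# Only call into tsvio if the package is available.
if (requireNamespace("tsvio", quietly = TRUE)) {
  tsvio::tsvGenIndex(datafile, indexfile)
}
```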

Because tsvio assumes that the data file does not change during an R session, an index file, once created, does not change during the session either.

The index file must be regenerated by the user whenever the data file changes. The tsvio package cannot detect that the data file has changed. Using an outdated index file can produce incorrect results or a run-time error.

The data access functions described below can generate the index file automatically on first access. Depending on file permissions, this may allow the user to simply remove the index file whenever the data file is modified. A new index file will be generated on the next access (which will thus be slower than normal).
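One way to automate the removal is to compare file modification times before accessing the data. The sketch below uses only base R. The helper removeStaleIndex is hypothetical and not part of tsvio, and modification times are only a heuristic, so this illustrates the idea rather than guaranteeing that a change is detected.

```r
# Hypothetical helper (not part of tsvio): delete the index file when the
# data file has a newer modification time, so that a fresh index can be
# generated on the next access. Returns TRUE if the index was removed.
removeStaleIndex <- function(datafile, indexfile) {
  if (!file.exists(indexfile)) return(FALSE)
  stale <- isTRUE(file.mtime(datafile) > file.mtime(indexfile))
  if (stale) unlink(indexfile)
  stale
}

# Demonstration with stand-in files; the "index" here is a placeholder,
# not a real tsvio index.
datafile <- tempfile(fileext = ".tsv")
indexfile <- tempfile(fileext = ".idx")
writeLines("a\tb", datafile)
writeLines("placeholder", indexfile)
Sys.setFileTime(indexfile, Sys.time() - 60)  # pretend the index is older

removed <- removeStaleIndex(datafile, indexfile)
again <- removeStaleIndex(datafile, indexfile)  # nothing left to remove
```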

Matrix Data Access

The function tsvGetData is used to read data as a matrix:

tsvGetData (filename, indexfile, rowpatterns, colpatterns, dtype="", findany=TRUE)

rowpatterns is either NULL or a vector of row labels. If NULL, data from all lines in the file is returned. Otherwise, only data from rows matching an entry in rowpatterns is returned. Only exact matches are supported.

Similarly, colpatterns specifies which columns to return data for.

Thus, the entire data matrix can be returned by specifying NULL for both rowpatterns and colpatterns.

The return value is always a data matrix with two dimensions. If rowpatterns or colpatterns is a single element, the corresponding axis of the returned matrix is not ‘dropped’. The standard R function drop can be used to delete any dimensions of length one if desired.
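The effect of drop can be seen without any data file. The sketch below uses a hand-built one-row matrix as a stand-in for a single-row result; the values and labels are made up.

```r
# Stand-in for a single-row result: a 1 x 3 matrix with dimnames.
res <- matrix(c(1, 3, 5), nrow = 1,
              dimnames = list("r1", c("a", "b", "c")))

d <- dim(res)   # both dimensions are retained

# drop() removes the dimension of length one and keeps the column labels
# as the names of the resulting vector.
v <- drop(res)
```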

By default, if rowpatterns or colpatterns is not NULL, any specified labels not in the data file will be silently ignored and not included in the result. However, if there are no matching rows or no matching columns, tsvGetData will throw an error.

Setting the optional parameter findany to FALSE will cause tsvGetData to throw an error if any specified label is not in the data file.

Rows and columns in the returned matrix will occur in the same order as they appear in rowpatterns and colpatterns respectively. Duplicate entries in rowpatterns or colpatterns will never match any label (and always result in an error if findany is FALSE).
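The selection rules above might be exercised as in the following sketch. The data, the file names, and the label "nosuchrow" are made up, and the tsvio calls are guarded so that the snippet also runs where the package is not installed. The comments state the behaviour described in the text rather than verified output.

```r
# Illustrative data file with four rows and three columns.
datafile <- tempfile(fileext = ".tsv")
indexfile <- tempfile(fileext = ".idx")
m <- matrix(1:12, nrow = 4,
            dimnames = list(paste0("r", 1:4), c("a", "b", "c")))
write.table(m, datafile, sep = "\t", quote = FALSE)

rows <- c("r3", "r1", "nosuchrow")  # "nosuchrow" is deliberately absent
cols <- c("c", "a")

if (requireNamespace("tsvio", quietly = TRUE)) {
  tsvio::tsvGenIndex(datafile, indexfile)

  # Default findany = TRUE: the unknown label should be skipped, and rows
  # and columns should follow the order given in rows and cols.
  picked <- tsvio::tsvGetData(datafile, indexfile, rows, cols)

  # findany = FALSE: the unknown label should raise an error, which is
  # captured here as a message string.
  strict <- tryCatch(
    tsvio::tsvGetData(datafile, indexfile, rows, cols, findany = FALSE),
    error = function(e) conditionMessage(e)
  )
}
```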

Matrix Data Type

The returned matrix will have the same mode as the dtype parameter, which can be a string, a numeric, or an integer. Only the type of the parameter matters; its value is ignored. Returning a numeric or integer matrix can be much faster than returning a character matrix and then converting it. However, it requires all data elements in the data file to conform to that type. If any element does not conform, tsvGetData will throw an error.
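As a sketch, dtype can be thought of as a prototype value. The prototypes below are arbitrary choices, and storage.mode is used only to show how base R distinguishes them. The tsvio calls are guarded and run against made-up data.

```r
# Prototype values: only their type is relevant, not their value.
protos <- list(character = "", numeric = 0.0, integer = 0L)
types <- vapply(protos, storage.mode, character(1))

# Illustrative all-integer data file.
datafile <- tempfile(fileext = ".tsv")
indexfile <- tempfile(fileext = ".idx")
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
write.table(m, datafile, sep = "\t", quote = FALSE)

if (requireNamespace("tsvio", quietly = TRUE)) {
  tsvio::tsvGenIndex(datafile, indexfile)
  # Request an integer matrix directly instead of converting afterwards.
  asInt <- tsvio::tsvGetData(datafile, indexfile, NULL, NULL, dtype = 0L)
  # Default dtype = "" requests a character matrix.
  asChr <- tsvio::tsvGetData(datafile, indexfile, NULL, NULL)
}
```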

Row Data Access

The function tsvGetLines returns a subset of the lines in the data file as a string vector:

tsvGetLines (filename, indexfile, patterns, findany=TRUE)

The string vector returned by tsvGetLines consists of the entire first line in the data file, followed by the entirety of every line whose row label occurs in patterns. Unlike with tsvGetData, patterns cannot be NULL, and matching lines are ordered by their position in the data file, not by the order of their labels in patterns. If findany is TRUE, labels in patterns that do not occur in the data file are ignored, but an error is thrown if no labels match at all. If findany is FALSE, an error is thrown if any label in patterns has no matching row.
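Because the result begins with the header line, it can be handed to a standard parser. The sketch below is one possible way to do this with base R's read.table. The data and file names are made up, and where tsvio is not installed a stand-in vector with the shape described above is used instead.

```r
# Illustrative data file with four rows and three columns.
datafile <- tempfile(fileext = ".tsv")
indexfile <- tempfile(fileext = ".idx")
m <- matrix(1:12, nrow = 4,
            dimnames = list(paste0("r", 1:4), c("a", "b", "c")))
write.table(m, datafile, sep = "\t", quote = FALSE)

lines <- if (requireNamespace("tsvio", quietly = TRUE)) {
  tsvio::tsvGenIndex(datafile, indexfile)
  tsvio::tsvGetLines(datafile, indexfile, c("r3", "r1"))
} else {
  # Stand-in: header line, then the matching lines in file order.
  readLines(datafile)[c(1, 2, 4)]
}

# The header has one fewer field than the data lines, so read.table
# uses the first column as row names.
df <- read.table(text = lines, sep = "\t", header = TRUE)
```

Parsing the lines this way is a convenience for small selections; for large numeric subsets, tsvGetData with a numeric dtype avoids the intermediate strings.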