ParseCSV

Revision as of 16:43, 9 May 2016 by Lchrisman (talk | contribs) (corrected some hyperlinks, added some See Also items)


new to Analytica 5.0

ParseCSV( csvText, columnIndex, rowIndex, separator, firstLineIsHeader, trimCells, matchColumnLabels, rowIndexLabelColumn, parseFlags )

Parses text that is in a CSV-format (Comma Separated Values format) and returns a two-dimensional array containing the data. The CSV format in inherently two-dimensional, with items on each row separated by a comma or other specified «separator», and each row on a new line (separated by new-line characters, CR, LF or CRLF). Numerical value or dates are automatically parsed as appropriate.

Indexes for result

When called with only one parameter, it parses using the most common CSV conventions in the same style as Excel. Local indexes named Column and Row are created for you. Use «columnIndex» and «rowIndex» to specify existing indexes for the result. If «columnIndex» has fewer columns than actually exist in the data, then only a subset of the columns are returned. If it has more, then the extra columns are null-padded. Similarly for «rowIndex» -- if shorter than the number of rows in the data, then only the first N rows are parsed, and if it is longer, the final rows are null-padded.

Non-comma separator

If values are delineated be something other than a comma, specify «separator». Commonly occurring non-comma separators include '|', Chr(9) (tab), ';'. «Separator» can be multiple characters.

Header column in data

In some CSV data files, the first line contains columns names. When you set «firstLineIsHeader» to true, the values in the first line are used to construct the local column index (if you didn't specify «columnIndex»), and to identify specify columns when you've specified «matchColumnLabels» or «rowIndexLabelsColumn», but the first line data does not appear in the resulting array itself. When «firstLineIsHeader» is set to false, then the first line is included as part of the array data, and not used to name the columns. When left unspecified, the value is inferred based on the other options specified, defaulting to true when the column names are used or referenced (i.e., when a local column index is created automatically, or «matchColumnLabels» or «rowIndexLabelsColumn» are specified).

Spaces around values

«trimCells» controls whether leading or trailing space around each value are included or not. This defaults to true (surrounding space is not included). Spaces inside quotes are never trimmed.

Subset of or re-ordered columns

If «matchColumnLabels» is set to true, «columnIndex» must be specified, and in this case, the names of the columns in the first line of the data are matched with the index labels (the comparison is case-sensitive). Only the columns that appear in the index are returned, and the order of the columns does not have to be the same as in the data.

Using a column for row labels

You can use «rowIndexLabelsColumn» specify a column to be used for the row index labels, which a local row index is automatically created for you. The column used is removed from the array (unless you specify «columnIndex» with «matchColumnLabels» and explicitly include that column). For example, ParseCSV(csvText, rowIndexLabelsColumn:1) treats the first column as row index labels rather than as array cells.

Parse flags

«parseFlags» is a bit-field of flags that control things such as for which cells an attempt is made to parse the cell into a number or date. The default (0) corresponds to the standard Excel conventions. The flags can be added, and are as follows.

  • 0 = Parse both quoted ("52") and unquoted (52) cells.
  • 1 = Don't parse any cells. E.g., ="52", "52" and 52 will all be return as text.
  • 2 = Don't parse quoted cells. E.g., "52" is text, but 52 is numeric.
  • 4 = Disable the '=' prefix (="0042" normally suppressed parsing).
  • 8 = Recognize backslash-escaped quotes inside quotes. (in addition to the doubling of quotes, which is the usual CSV standard).
  • 16 = Disable generalized date parsing. When on, only the format 6-May-2016 is recognized as a date but not other formats. This flag dramatically speeds up parsing times for data with no other date formats.

The CSV format

There is no single well-defined CSV standard. Undocumented conventions used by Excel are often seen as the most definitive "specification". The article at Comma Separated Values (CSV) Standard File Format is relatively easy to read and covers the basics well, without being the most comprehensive reference.

Each "record" (or row) is a CSV file is delineated by a newline, either CR, LF or CRLF. Each value within a row is separated from the next field by a separator character, usually comma (hence the name "Comma-separated values"), although other characters or character sequences are also sometimes used, including commonly TAB (Chr(9)), bar ('|'), semi-colon (';') and even space.

A row of data might look like this:

California, CA, "39,144,818", 163696, Sacramento, "Edmund Gerald ""Jerry"" Brown, Jr.", ="95814", 9/9/1850

Both textual and numeric values may or may not be surrounded by quotes. Any value containing the separator character or a newline must be surrounded by quotes. Quotes are always the double-quote character ("). A value may also be a date (e.g., 9/9/1850). Spaces to the left or right of a value are normally trimmed (unless you set «textTrim» to false).

When a value is read, if it parses as a valid date, it will be parsed as a date, otherwise if it parses as a valid number, it will parse as a number, and otherwise it will be read as text. The «parseFlags» parameter can be used to alter this behavior somewhat. To force a value to be read as text, even when it would be a valid number, use the equal-quote form, ="95814".

Note that new lines can appear within quoted values, so simply splitting CSV data on newlines does not reliably separate the data into rows.

The first line of a CSV file often consists of columns names, with data starting on the second row. But this is not always the case, some CSV files have no column headers with data starting on the first line.

Examples

The simplest usage is:

ParseCSV( csvText )

which follows the most common format and Excel standards.

To read in a CSV file:

ParseCSV( ReadTextFile( filename ) )


History

This built-in function was introduced in Analytica 5.0. For earlier releases, the Flat-file library contains ReadCsvFile and ParseCsvText functions. The ParseCSV is a substantial improvement.


See Also

Comments


You are not allowed to post comments.