Difference between revisions of "ParseCSV"

Revision as of 03:10, 13 July 2018

new to Analytica 5.0 Converts text from a file with CSV (Comma-Separated Values) format into a 2D array. By default, it converts numbers and dates from the text into Analytica numbers and dates. It offers lots of optional parameters to handle the many variants of the CSV format -- for example, with nonstandard characters that separate columns, rows, and for the quotes around text.

ParseCSV( csvText, columnIndex, rowIndex, separator, firstLineIsHeader, trimCells, columnsToKeep, rowIndexLabelColumn, parseFlags, dateTemplate, decimal, quote )

Given text in CSV-format (Comma Separated Values), it returns a two-dimensional array containing the data. By default, it assumes the same CSV format conventions as Excel -- that is, commas to separate columns, and newline to separate rows, and double quotes to enclose text that may contain commas or newlines. It provides local indexes .Column and .Row.

Indexes for result

By default, it returns the array with local indexes, .Column and .Row. Or you can provide it column and/or row indexes to optional parameters «columnIndex» and «rowIndex». If you provide indexes that are shorter than the number of columns or rows in the original, it ignores the extra columns or rows. If you provide an index that is longer than expected, it pads the result with NULL for the extra columns or rows.

Non-comma separator

By default, it assumes that cells (columns) are separated by commas (,), as you would expect in a "comma-separated" format! You can specify any character (or a text with multiple characters) to the optional parameters «separator». Common separators include '|', Chr(9) (tab), ';'.

Spaces around values

By default it trims away any leading or trailing spaces around each value. Set «trimCells» to False if you want to retain those spaces. It never trims spaces inside quotes.

Header column in data

By default, it assumes that the first row contains columns names, which is common in CSV file, and it puts those names into the local .Column index, unless you specify «columnIndex». In that case, the first row does not appear in the data. It also assumes the first row contains column names if you specify «columnsToKeep» or «rowIndexLabelsColumn». You can override this default behavior by setting «firstLineIsHeader» to false, in which case it treats the first row as data. Conversely, you can override the default of treating the first line as data when you specify a «columnIndex» by setting «firstLineIsHeader» to true.

Subset of or re-ordered columns

The «columnsToKeep» parameter lets you select a subset of the columns, or re-order them.

To extract just a single column, you can specify either the name (from the first line) or the number of the column that you want to keep to «columnsToKeep». You can omit «columnIndex».

To extract several columns or to reorder columns, you should specify a «columnIndex». If the column index contains the column names from the CSV (in any order), you can specify it also to «columnsToKeep» e.g., ParseCSV(csv, col, columnsToKeep:col). Or you can pass «columnsToKeep» a 1D array of names or numbers indexed by «columnIndex». .

If you pass a list or array to «columnsToKeep», but don't specify «columnIndex», it will still extract the specified columns, but less efficiently because array abstraction repeats the call for each element in the array passed to «columnsToKeep», and it parses the «csvText» anew each time.

Using a column for row labels

You can use «rowIndexLabelsColumn» to specify a column number to use for the labels for the local row index. It then removes that column from the array (unless you specify «columnIndex» with «matchColumnLabels» and explicitly include that column). For example, ParseCSV(csvText, rowIndexLabelsColumn: 1) treats the first column as row index labels rather than as array cells.

Parse flags

«parseFlags» is a bit-field of flags to control whether it should try to convert cells into a number or date. The default (0) corresponds to the standard Excel conventions -- i.e. it converts an unquoted number or a number between quotes into a number. You can add flags together to combine their effects:

0 = Parse both quoted ("52") and unquoted (52) cells.
1 = Don't parse any cells. E.g., return ="52", "52" and 52 as text.
2 = Don't parse quoted cells. E.g., "52" is text, but 52 is numeric.
4 = Disable the '=' prefix (="0042" normally suppressed parsing).
8 = Recognize backslash-escaped quotes inside quotes (in addition to the doubling of quotes, which is the usual CSV standard).
16 = Disable generalized date parsing. When on, only the format 6-May-2016 is recognized as a date but not other formats. This flag dramatically speeds up parsing times for data with no other date formats.

Date Formats

Use the «dateTemplate» parameter to specify the international ordering that should be used for parsing dates. Use the letters "d", "M" and "y" to specify the ordering of these components. For example, with a «dateTemplate» of "d/M/y", the text "11/10/9" parses as 11-Oct-2009, whereas with a «dateTemplate» of "y/M/d" it would be 9-Oct-2011, and with "M/d/y" it would be 10-Nov-2009.

This parameter uses the same conventions used in the Custom date formats template in the Number Format Dialog, but only the relative ordering of the day, month and year patterns matter. Hence, all these work equivalently: "dMy", "dd/MM/yyyy", "d-M-yy", "d M yyyy". Also, upper/lower case matters! "M" must be capital (because "m" means minutes").

Decimal point and thousands separators

By default, it assumes that '.' (dot or period) separates the whole number from the fractional part of a decimal number -- i.e. the convention used in English-speaking countries. You can tell it to use ',' (comma) by setting the «decimal» parameter. That parameter accepts only '.' (dot) or ',' (comma).

When comma is used for «decimal», the «separator» is usually something other than comma -- commonly semi-colon ';'. If the «decimal» and «separator» are both commas, there must be quotes around any numbers containing decimals.

When «decimal» is dot, you can use commas to group digits, such as "525,948.77". Conversely, when «decimal» is comma, you can use dot to group digits, such as "525.948,77" (for five hundred twenty-five thousand, nine hundred eight and 77 hundredths).

Quote character

The quote character is used in CSV to enclose numbers or text that may contain separators. By default, the quote character is double quote ("). You can specify a different quote character using the «quote» parameter. The most common other quote character used is the single quote ('). To type a single quote in Analytica, you can type doubleQuote SingleQuote doubleQuote ("'"). Analytica lets you use either matching single quotes or matching double quotes to enclose text strings.

The CSV format

There is no single well-defined CSV standard. Undocumented conventions used by Excel are often seen as the most definitive "specification". The article at Comma Separated Values (CSV) Standard File Format is relatively easy to read and covers the basics well, without being the most comprehensive reference.

Each "record" (or row) is a CSV file is delineated by a newline, either CR, LF or CRLF. Each value within a row is separated from the next field by a separator character, usually comma (hence the name "Comma-separated values"), although other characters or character sequences are also sometimes used, including commonly TAB (Chr(9)), bar ('|'), semi-colon (';') and even space.

A row of data might look like this:

California, CA, "39,144,818", 163696, Sacramento, "Edmund Gerald ""Jerry"" Brown, Jr.", ="95814", 9/9/1850

Both textual and numeric values may or may not be surrounded by quotes. Any value containing the separator character or a newline must be surrounded by quotes. Quotes are always the double-quote character ("). A value may also be a date (e.g., 9/9/1850). Spaces to the left or right of a value are normally trimmed (unless you set «textTrim» to false).

When a value is read, if it parses as a valid date, it will be parsed as a date, otherwise if it parses as a valid number, it will parse as a number, and otherwise it will be read as text. The «parseFlags» parameter can be used to alter this behavior somewhat. To force a value to be read as text, even when it would be a valid number, use the equal-quote form, ="95814".

Note that new lines can appear within quoted values, so simply splitting CSV data on newlines does not reliably separate the data into rows.

The first line of a CSV file often consists of columns names, with data starting on the second row. But this is not always the case, some CSV files have no column headers with data starting on the first line.

Examples

The simplest usage is:

ParseCSV( csvText )

which follows the most common format and Excel standards.

To read in a CSV file:

ParseCSV( ReadTextFile( filename ) )

History

This built-in function was introduced in Analytica 5.0. For earlier releases, the Flat-file library contains ReadCsvFile and ParseCsvText functions. The ParseCSV is a substantial improvement.

@@ Line 2: / Line 2: @@
 ''new to [[Analytica 5.0]]''
+Converts text from a file with CSV (Comma-Separated Values) format into a 2D array. By default, it converts numbers and dates from the text into Analytica numbers and dates. It offers lots of optional parameters to handle the many variants of the CSV format -- for example, with nonstandard characters that separate columns, rows, and for the quotes around text.
 == ParseCSV( csvText'', columnIndex, rowIndex, separator, firstLineIsHeader, trimCells, columnsToKeep, rowIndexLabelColumn, parseFlags, dateTemplate, decimal, quote '' ) ==
-Parses text that is in a CSV-format (Comma Separated Values format) and returns a two-dimensional array containing the data. The CSV format in inherently two-dimensional, with items on each row separated by a comma or other specified «separator», and each row on a new line (separated by new-line characters, CR, LF or CRLF). Numerical value or dates are automatically parsed as appropriate.
+Given text in CSV-format (Comma Separated Values), it returns a two-dimensional array containing the data. By default, it assumes the same CSV format conventions as Excel -- that is, commas to separate columns, and newline to separate rows, and double quotes to enclose text that may contain commas or newlines. It provides local indexes <code>.Column</code> and <code>.Row</code>.
 === Indexes for result ===
-If you give it only the single text parameter, it uses the most common CSV conventions in the same style as Excel -- including commas to separate columns, and newline to separate rows, and double quotes to enclose text that may contain commas or newlines. It creates local indexes, <code>Column</code> and <code>Row</code>. Specify «columnIndex» and «rowIndex» if you want to use existing indexes for the result. If «columnIndex» has fewer columns or  «rowIndex» has fewer rows than the data, it ignores the additional columns or rows in the data.  If «columnIndex» has more columns or  «rowIndex» has more rows than the data, it pads the extra columns with Null.
+By default, it returns the array with  local indexes, <code>.Column</code> and <code>.Row</code>.  Or you can provide it column and/or row indexes to optional parameters «columnIndex» and «rowIndex». If you provide indexes that are shorter than the number of columns or rows in the original, it ignores the extra columns or rows. If you provide an index that is longer than expected, it pads the result with NULL for the extra columns or rows.
 === Non-comma separator ===
-If values are delineated be something other than a comma, specify «separator». Commonly occurring non-comma separators include <code>'|'</code>, <code>[[Chr]](9)</code> (tab), <code>';'</code>. «Separator» can be multiple characters.
+By default, it assumes that cells (columns) are separated by commas (,), as you would expect in a "comma-separated" format!   You can specify any character (or a text with multiple characters) to the optional parameters  «separator». Common separators include <code>'|'</code>, <code>[[Chr]](9)</code> (tab), <code>';'</code>.
+=== Spaces around values ===
+By default it trims away any leading or trailing spaces around each value. Set «trimCells» to False if you want to retain those spaces. It never trims spaces inside quotes.
 === Header column in data ===
-In some CSV data files, the first line contains columns names. When you set «firstLineIsHeader» to <code>true</code>, the values in the first line are used to construct the local column index (if you didn't specify «columnIndex»), and to identify specify columns when you've specified «columnsToKeep» or «rowIndexLabelsColumn», but the first line data does not appear in the resulting array itself. When «firstLineIsHeader» is set to false, then the first line is included as part of the array data, and not used to name the columns. When left unspecified, the value is inferred based on the other options specified, defaulting to true when the column names are used or referenced (i.e., when a local column index is created automatically, or «columnsToKeep» or «rowIndexLabelsColumn» are specified).
+By default, it assumes that the first row contains columns names, which is common in CSV file, and it puts those names into the local <code>.Column</code> index, unless you specify «columnIndex».   In that case, the first row does not appear in the data.  It also assumes the first row contains column names if you specify «columnsToKeep» or «rowIndexLabelsColumn».  You can override this default behavior by setting «firstLineIsHeader» to <code>false</code>, in which case it treats the first row as data.  Conversely, you can override the default of treating the first line as data when you specify a «columnIndex» by setting «firstLineIsHeader» to <code>true</code>.
-=== Spaces around values ===
+=== Subset of or re-ordered columns ===
-«trimCells» controls whether leading or trailing space around each value are included or not. This defaults to true (surrounding space is not included). Spaces inside quotes are never trimmed.
-=== Subset of or re-ordered columns ===
+The «columnsToKeep» parameter lets you select a subset of the columns, or re-order them.
-The «columnsToKeep» parameter can be used to select only a subset of the columns appearing in the CSV, or to re-order the columns.
-If you extracting a single column only, then you can omit «columnIndex» and specify either the column header (from the first line) or the column number that you want to keep to «columnsToKeep».
+To extract just a single column, you can specify either the name (from the first line) or the number of the column that you want to keep to «columnsToKeep».  You can omit «columnIndex».
-If you are extracting more than one column, then you should specify «columnIndex» and «columnsToKeep» should be indexed by «columnIndex». When the elements of «columnIndex» already contains the column labels, you should pass the «columnIndex» to «columnsToKeep», e.g., <code>[[ParseCSV]](csv, col, columnsToKeep:col)</code>.
+To extract several columns or to reorder columns, you should specify a «columnIndex». If the column index contains the column names from the CSV (in any order), you can specify it also to «columnsToKeep»  e.g., <code>[[ParseCSV]](csv, col, columnsToKeep:col)</code>. Or you can pass «columnsToKeep» a 1D array of names or numbers indexed by «columnIndex». .
-If you pass a list or array to «columnsToKeep», but don't specify «columnIndex», it will still extract the specified columns, but not as efficiently. In this case, array abstraction repeats the call for each element in the array passed to «columnsToKeep», parsing the «csvText» anew each time.
+If you pass a list or array to «columnsToKeep», but don't specify «columnIndex», it will still extract the specified columns, but less efficiently because array abstraction repeats the call for each element in the array passed to «columnsToKeep», and it parses the «csvText» anew each time.
 ===  Using a column for row labels ===
-You can use «rowIndexLabelsColumn» specify a column to be used for the row index labels, which a local row index is automatically created for you. The column used is removed from the array (unless you specify «columnIndex» with «matchColumnLabels» and explicitly include that column). For example, <code>ParseCSV(csvText, rowIndexLabelsColumn:1)</code> treats the first column as row index labels rather than as array cells.
+You can use «rowIndexLabelsColumn» to specify a column number to use for the labels for the local row index. It then removes that column from the array (unless you specify «columnIndex» with «matchColumnLabels» and explicitly include that column). For example, <code>ParseCSV(csvText, rowIndexLabelsColumn: 1)</code> treats the first column as row index labels rather than as array cells.
 === Parse flags ===
-«parseFlags» is a bit-field of flags that control things such as for which cells an attempt is made to parse the cell into a number or date. The default (0) corresponds to the standard Excel conventions. The flags can be added, and are as follows.
+«parseFlags» is a bit-field of flags to control whether it should try to convert cells into a number or date. The default (0) corresponds to the standard Excel conventions -- i.e. it converts an unquoted number or a number between quotes into a number. You can add flags together to combine their effects:
 * 0 = Parse both quoted ("52") and unquoted (52) cells.
-* 1 = Don't parse any cells. E.g., <code>="52"</code>, <code>"52"</code> and <code>52</code> will all be return as text.
+* 1 = Don't parse any cells. E.g., return <code>="52"</code>, <code>"52"</code> and <code>52</code> as text.
 * 2 = Don't parse quoted cells. E.g., <code>"52"</code> is text, but <code>52</code> is numeric.
 * 4 = Disable the '=' prefix (<code>="0042"</code> normally suppressed parsing).
-* 8 = Recognize backslash-escaped quotes inside quotes. (in addition to the doubling of quotes, which is the usual CSV standard).
+* 8 = Recognize backslash-escaped quotes inside quotes (in addition to the doubling of quotes, which is the usual CSV standard).
 * 16 = Disable generalized date parsing. When on, only the format <code>6-May-2016</code> is recognized as a date but not other formats. This flag dramatically speeds up parsing times for data with no other date formats.
@@ Line 49: / Line 54: @@
 === Decimal point and thousands separators ===
+By default, it assumes that <code>'.'</code> (dot or period) separates the whole number from the fractional part of a decimal number -- i.e. the convention used in English-speaking countries.  You can tell it to use  <code>','</code> (comma) by setting the «decimal» parameter. That parameter accepts only <code>'.'</code> (dot) or <code>','</code> (comma).
-Use the «decimal» parameter to specify the character used to separate the whole number from the fractional part of a number. Only two values are allowed: <code>'.'</code> (dot) or <code>','</code> (comma). The default when omitted is the United States convention of <code>'.'</code>.
+When comma is used for «decimal», the «separator» is usually  something other than comma -- commonly semi-colon <code>';'</code>. If the «decimal» and «separator» are both commas, there must be quotes around any numbers containing decimals.
-When comma is used for «decimal», you will usually specify «separator» to be something other than comma, with semi-colon <code>';'</code> being a common convention. They aren't required to be different, but numbers containing decimals will need to be contained in quotes, otherwise the comma will be treated as the column separator.
+When «decimal» is dot, you can use commas to group digits, such as <code>"525,948.77"</code>. Conversely, when «decimal» is comma, you can use dot to group digits, such as <code>"525.948,77"</code> (for five hundred twenty-five thousand, nine hundred eight and 77 hundredths).
-When «decimal» is omitted or dot, then comma can be used to group digits, such as <code>"525,948.77"</code>. When «decimal» is comma, then the dot character can be used to group digits, such as <code>"525.948,77"</code> (for five hundred twenty-five thousand, nine hundred eight and 77 hundredths).
+=== Quote character ===
-=== Quote character ===
+The quote character is used in CSV to enclose numbers or text that may contain separators. By default, the quote character is double quote (<code>"</code>). You can specify a different quote character using the «quote» parameter.  The most common other quote character used is the single quote ('). To type a single quote in Analytica, you can type doubleQuote SingleQuote doubleQuote (<code>"'"</code>). Analytica lets you use either matching single quotes or matching double quotes to enclose text strings.
-To use a different character for quotes, specify the character in the «quote» parameter. The default is to use the double quote (<code>"</code>).  The most common other quote character used is the single quote ('). To type this for the parameter in Analytica, you can either type four single quotes (<code>''''</code>) or type doubleQuote SingleQuote doubleQuote (<code>"'"</code>).
 == The CSV format ==