Difference between revisions of "Parsing and formatting data"
(CSV) |
(→XML) |
||
Line 25: | Line 25: | ||
== XML == | == XML == | ||
+ | The ''eXtensible Markup Language'' is a flexible (but verbose) standard for data encoding. The standard nails down all the encoding details, but leaves the specific schema specification to the application, so the actual structure of the data is virtually unlimited. Hence, it is quite possible (likely even) that data in an XML file has a rich structure that may look quite a bit different than a rectangular array. | ||
+ | |||
+ | From a single richly-structured XML source, there will be many pieces of information that you can extract, and each of these typically will fit well into arrays and indexes. So a good way to think about XML data is that you'll extract information from it from a series of "queries". | ||
+ | |||
+ | To parse XML data in your model, use the Microsoft XML DOM parser, which handles the parsing, and provides an extremely rich query capability. In the case where you have already read the XML text into a variable <code>XML_Text</code>, instantiate the parser as follows: | ||
+ | |||
+ | Variable xmlDoc := | ||
+ | [[Var]] d := [[COMCreateObject]]("[https://msdn.microsoft.com/en-us/library/ms764730(VS.85).aspx Msxml2.DOMDocument.3.0]"); | ||
+ | d->[https://msdn.microsoft.com/en-us/library/ms761398(v=vs.85).aspx async] := [[False]]; | ||
+ | d->[https://msdn.microsoft.com/en-us/library/ms754585(v=vs.85).aspx loadXML](XML_Text); | ||
+ | [[If]] (d->[https://msdn.microsoft.com/en-us/library/ms756041(v=vs.85).aspx parseError]->[https://msdn.microsoft.com/en-us/library/ms757867(v=vs.85).aspx errorCode] <> 0) [[Then]] [[Error]](d->[https://msdn.microsoft.com/en-us/library/ms756041(v=vs.85).aspx parseError]->[https://msdn.microsoft.com/en-us/library/ms761342(v=vs.85).aspx reason]); | ||
+ | d | ||
+ | |||
+ | The methods and properties of the XML parser are documented at [https://msdn.microsoft.com/en-us/library/ms764730(VS.85).aspx Microsoft XML DOM Parser API]. | ||
+ | |||
+ | You can use [[ReadTextFile]] or [[ReadFromURL]] to read <code>XML_Text</code>, or you can replace the <code>d->loadXML</code> method call with a direct read from a file or URL using, e.g., | ||
+ | :<code>d->[https://msdn.microsoft.com/en-us/library/ms762722(v=vs.85).aspx load]("http://website.com/download/data.xml")</code> | ||
+ | |||
+ | With the data loaded into the XML DOM, you then extract data via a series of queries using [https://en.wikipedia.org/wiki/XPath XPath] expressions. [https://en.wikipedia.org/wiki/XPath XPath] is a extremely rich and powerful query language, making it suitable to the wide variations in schema among XML files. One useful pattern is illustrated here, where all tags in the XML named <code><person></code> are extracted. | ||
+ | |||
+ | Variable PeopleNodes:= | ||
+ | [[Var]] nodes := xmlDoc->[https://msdn.microsoft.com/en-us/library/ms754523(v=vs.85).aspx selectNodes]("//*/row"); | ||
+ | [[Index]] Row := 1..nodes->[https://msdn.microsoft.com/en-us/library/ms767664(v=vs.85).aspx length]; | ||
+ | nodes->[https://msdn.microsoft.com/en-us/library/ms767664(v=vs.85).aspx item](Row-1) | ||
+ | |||
+ | The resulting <code>nodes</code> is an array of [https://msdn.microsoft.com/en-us/library/ms765513(v=vs.85).aspx IXMLDOMNode] objects. If each of these <code><person></code> tag contains a single tag <code><age></code>, you can extract the age using | ||
+ | :<code>[[ParseNumber]](PeopleNodes->[https://msdn.microsoft.com/en-us/library/ms757846(v=vs.85).aspx selectSingleNode]("age")->[https://msdn.microsoft.com/en-us/library/ms762687(v=vs.85).aspx text])</code> | ||
+ | |||
+ | For further information, see the example at [[Example_Models#Extracting_Data_from_an_XML_file|Extracting data from an XML file]] and consult the [https://msdn.microsoft.com/en-us/library/ms764730(VS.85).aspx Microsoft XML DOM API reference]. | ||
+ | |||
+ | To create XML, it is generally relatively easy to write Analytica code to concatenate your information using the [[Text Concatenation Operator: &]] and [[JoinText]]. When including text, you should XML-encode it in case it contains XML-reserved characters. For example: | ||
+ | :<code>[[JoinText]]( "<person>" & [[TextCharacterEncode]]("XML", personNames) & "</person>", People )</code> | ||
+ | For numbers or dates, use [[NumberToText]] to control the number format used for the numbers. Because XML schemas are so open-ended, there is no generic <code>MakeXML</code> function. | ||
+ | |||
+ | A second method for creating XML, which we have found to be less convenient, is to do it through the same Microsoft DOM that we use for parsing XML. Methods within the DOM allow you to add and modify tags, and once complete, you can write the XML to a file using the [https://msdn.microsoft.com/en-us/library/ms753769(v=vs.85).aspx save] method or extract the XML text using the [https://msdn.microsoft.com/en-us/library/ms762687(v=vs.85).aspx text] property. | ||
+ | |||
== JSON == | == JSON == | ||
== Custom data formats == | == Custom data formats == |
Revision as of 17:26, 8 August 2017
After reading a data file using ReadTextFile, you'll need to parse it to separate the contents into individual values, and to convert some of the fields to numbers and dates. There are data file formats, ranging from common standards like CSV, XML and JSON, to custom formats. When you write data to a text file, using WriteTextFile, you'll need to put your output into the desired standard or custom format.
Data in rows and columns (CSV)
Data files are often organized into rows and columns. Each "cell" contains one datum, which may be a number, a date, or text. The first line of the file may (or may not) contain column headings rather than actual data. This type or organization is broadly referred to as a CSV format. CSV stands for Comma-Separated Values, since a common convention is for the value on each line to be separated by a commas, but the term is broadly applied even when a different separator, such as a tab character, is used.
Even though CSV is one of the most widely used standard data formats, there is no official CSV standard. While all CSV conventions have a lot in common, particularly the 2-D structure of the data, there are many details that can and do vary among applications. Foremost among these conventions regarding when quotes are placed around cells, how separator, new-line, and other special characters are escaped within single-cell text values, and how quoted cells are interpreted. The ParseCSV and MakeCSV functions in Analytica 5.0 and later parse and produce CSV using Excel's conventions by default, with a great deal of flexibility with optional parameters to adapt to other CSV conventions, which makes it quite easy to parse or produce CSV. These functions also handle the conversion from text to numbers and dates and vise-versa.
Reading and parsing a CSV file that uses commas as separators is done as follows:
ParseCSV(ReadTextFile( "MyFile.csv" ) )
The result is a 2-D array, indexed by local indexes named .Row
and .Column
. For a CSV file that uses a tab character as a separator, use
ParseCSV(ReadTextFile( "MyFile.csv" ), separator:Chr(9) )
ParseCSV includes many other options, some of which are likely to be necessary or convenient in a particular case. You may wish to use an existing index for the column index or row index, take the row index labels from a specific column in the data, adopt different quoting conventions, use different international/regional conventions, or extract only a subset of the columns. See ParseCSV for details.
Writing a 2-D array, x
, to a CSV file is done as follows:
WriteTextFile("MyFile.csv", MakeCSV( x, I, J ) )
where I
and J
are the indexes of x
. To write a tab-separated file, use
WriteTextFile("MyFile.csv", MakeCSV( x, I, J, separator:Chr(9) ) )
MakeCSV supports many additional conventions, see Chr(9).
XML
The eXtensible Markup Language is a flexible (but verbose) standard for data encoding. The standard nails down all the encoding details, but leaves the specific schema specification to the application, so the actual structure of the data is virtually unlimited. Hence, it is quite possible (likely even) that data in an XML file has a rich structure that may look quite a bit different than a rectangular array.
From a single richly-structured XML source, there will be many pieces of information that you can extract, and each of these typically will fit well into arrays and indexes. So a good way to think about XML data is that you'll extract information from it from a series of "queries".
To parse XML data in your model, use the Microsoft XML DOM parser, which handles the parsing, and provides an extremely rich query capability. In the case where you have already read the XML text into a variable XML_Text
, instantiate the parser as follows:
Variable xmlDoc :=
Var d := COMCreateObject("Msxml2.DOMDocument.3.0"); d->async := False; d->loadXML(XML_Text); If (d->parseError->errorCode <> 0) Then Error(d->parseError->reason); d
The methods and properties of the XML parser are documented at Microsoft XML DOM Parser API.
You can use ReadTextFile or ReadFromURL to read XML_Text
, or you can replace the d->loadXML
method call with a direct read from a file or URL using, e.g.,
With the data loaded into the XML DOM, you then extract data via a series of queries using XPath expressions. XPath is a extremely rich and powerful query language, making it suitable to the wide variations in schema among XML files. One useful pattern is illustrated here, where all tags in the XML named <person>
are extracted.
Variable PeopleNodes:=
Var nodes := xmlDoc->selectNodes("//*/row"); Index Row := 1..nodes->length; nodes->item(Row-1)
The resulting nodes
is an array of IXMLDOMNode objects. If each of these <person>
tag contains a single tag <age>
, you can extract the age using
ParseNumber(PeopleNodes->selectSingleNode("age")->text)
For further information, see the example at Extracting data from an XML file and consult the Microsoft XML DOM API reference.
To create XML, it is generally relatively easy to write Analytica code to concatenate your information using the Text Concatenation Operator: & and JoinText. When including text, you should XML-encode it in case it contains XML-reserved characters. For example:
JoinText( "<person>" & TextCharacterEncode("XML", personNames) & "</person>", People )
For numbers or dates, use NumberToText to control the number format used for the numbers. Because XML schemas are so open-ended, there is no generic MakeXML
function.
A second method for creating XML, which we have found to be less convenient, is to do it through the same Microsoft DOM that we use for parsing XML. Methods within the DOM allow you to add and modify tags, and once complete, you can write the XML to a file using the save method or extract the XML text using the text property.
JSON
Custom data formats
See Also
- ReadTextFile
- WriteTextFile
- ParseCSV
- MakeCSV
- ParseJSON
- MakeJSON
- Extracting data from an XML file
- For custom-format parsing or formatting
Enable comment auto-refresher