Reference: http://blogs.msdn.com/ericwhite/archive/2008/09/30/linq-to-text-and-linq-to-csv.aspx
LINQ is a great tool for writing ad-hoc queries and transforms, and occasionally I need to write queries or transforms on text files. And sometimes I receive CSV files, and need to do something with them. I wrote a blog post on LINQ to Text files over two years ago. My coding practices today differ from what I presented in that blog post. This post presents my current approach for dealing with text files using LINQ, and includes a function for splitting CSV lines.
In that post, I detailed an approach for writing lazy queries on text files. Well, to tell you the truth, I never write lazy queries on text files; I simply use File.ReadAllLines and write queries over the returned string array. After all, my computer has a lot of RAM (and yours probably does too), and the CSV files I receive are typically around 50 KB and rarely larger than a few hundred KB. It simply doesn’t matter that I read the entire text file into memory. If I were to write the exact same two queries today, here is how I would write them:
string[] lines = File.ReadAllLines("TextFile.txt");
var t1 = lines
    .Where(l => !l.StartsWith("#"))
    .Select(l => l.Split(','))
    .Select(items => String.Format("{0}{1}{2}",
        items[1].PadRight(16),
        items[2].PadRight(16),
        items[3].PadRight(16)));
var t2 = t1
    .Select(l => l.ToUpper());
foreach (var t in t2)
    Console.WriteLine(t);
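If a file ever did grow too large to load comfortably, one option is File.ReadLines (available from .NET Framework 4 onward), which yields lines lazily instead of returning an array. The following is only a sketch under that assumption; LazyTextQueries, FormatColumns and PrintFile are names invented for this illustration, and the input is assumed to have at least four comma-separated columns, as in the query above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LazyTextQueries
{
    // Hypothetical helper: the same filter and projection as the query above,
    // written against IEnumerable<string> so it accepts either the array from
    // File.ReadAllLines or the lazy sequence from File.ReadLines.
    public static IEnumerable<string> FormatColumns(IEnumerable<string> lines)
    {
        return lines
            .Where(l => !l.StartsWith("#"))          // skip comment lines
            .Select(l => l.Split(','))               // naive split, no quoting
            .Select(items => String.Format("{0}{1}{2}",
                items[1].PadRight(16),
                items[2].PadRight(16),
                items[3].PadRight(16)))
            .Select(l => l.ToUpper());
    }

    // Hypothetical helper: streams the file line by line; nothing is read
    // until the foreach starts pulling results.
    public static void PrintFile(string path)
    {
        foreach (var t in FormatColumns(File.ReadLines(path)))
            Console.WriteLine(t);
    }
}
```

Calling LazyTextQueries.PrintFile("TextFile.txt") should print the same lines as the eager version; the only difference is when the file is read.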
I have a small extension method (CsvSplit) that I use to split lines that are in CSV format. I’ve had this method around for a while – it’s not written in the functional style. Instead, it’s a state machine. I’ve thought about what it would take to rewrite this method in the functional style, and as far as I know, the only way to do it would be to define a grammar, and maybe write a recursive descent parser. (Actually, there is another approach, but it would be very inefficient, and the code would be longer and less readable.) Well the job of parsing CSV files simply isn’t worth the effort! So a state machine it is. In any case, the ‘functional impurity’ is local to the function.
The semantics of CsvSplit are:
- CsvSplit is an extension method on the String class.
- Only a comma is valid for the separator character.
- Values can be quoted. The quotes are trimmed.
- Quoted values can, of course, contain internal commas. Quoted values can also contain escape sequences: a backslash followed by any character, including a quote (\"), a backslash (\\) or any other character (\a). The backslash is removed and the character that follows it is kept.
- CsvSplit will throw an exception for incorrectly formatted strings.
If the CSV file that I receive isn’t in this format, then I just load it into Excel and save so that it is in this format.
If you have this for a source file:
Bob,"Bob said to go to the store."
Mary,"Mary said, \"Whatever.\""
Jim,Jim's quote doesn't contain quotes or commas.
Then you can query the CSV file like this:
var data = File.ReadAllLines("TextFile.txt")
    .Select(
        l => {
            var split = l.CsvSplit();
            return new {
                Person = split[0],
                Quote = split[1]
            };
        }
    );
foreach (var item in data)
    Console.WriteLine("{0}:{1}", item.Person, item.Quote);
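The query above assumes that every line parses and has at least two fields. If the input is less trustworthy, one possible guard is sketched below. SafeSplit and SplitLines are hypothetical names, not part of the original code, and the splitter is passed in as a delegate so the sketch stands alone; with the extension method from this post you would pass l => l.CsvSplit().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SafeSplit
{
    // Hypothetical helper: applies the splitter to each line, returns the rows
    // that parsed and have at least minFields fields, and records the 1-based
    // numbers of the lines that were rejected.
    public static List<string[]> SplitLines(
        IEnumerable<string> lines,
        Func<string, string[]> splitter,
        int minFields,
        List<int> badLineNumbers)
    {
        var rows = new List<string[]>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            string[] fields;
            try
            {
                fields = splitter(line);
            }
            catch (Exception)
            {
                // Assumption: any exception from the splitter means a bad line.
                // With CsvSplit you would catch CsvParseException instead.
                badLineNumbers.Add(lineNumber);
                continue;
            }
            if (fields.Length < minFields)
            {
                badLineNumbers.Add(lineNumber);
                continue;
            }
            rows.Add(fields);
        }
        return rows;
    }
}
```

The returned rows can then be projected into the anonymous type exactly as in the query above, with split[0] and split[1] known to exist.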
The function is composable. If you want to convert the CSV file to an XML file, you can do so like this:
XElement xmlDoc = new XElement("Root",
    File.ReadAllLines("TextFile.txt")
        .Select
        (
            line => {
                var split = line.CsvSplit();
                return new XElement("Quote",
                    new XElement("Person", split[0]),
                    new XElement("Text", split[1])
                );
            }
        )
);
Console.WriteLine(xmlDoc);
Here is the listing for CsvSplit (also attached):
public class CsvParseException : Exception
{
    public CsvParseException(string message)
        : base(message)
    {
    }
}
public static class MyExtensions
{
    private enum State
    {
        AtBeginningOfToken,
        InNonQuotedToken,
        InQuotedToken,
        ExpectingComma,
        InEscapedCharacter
    };
    public static string[] CsvSplit(this String source)
    {
        List<string> splitString = new List<string>();
        // Offsets (relative to tokenStart) of escape backslashes in the current quoted token.
        List<int> slashesToRemove = null;
        State state = State.AtBeginningOfToken;
        char[] sourceCharArray = source.ToCharArray();
        int tokenStart = 0;
        int len = sourceCharArray.Length;
        for (int i = 0; i < len; ++i)
        {
            switch (state)
            {
                case State.AtBeginningOfToken:
                    if (sourceCharArray[i] == '"')
                    {
                        state = State.InQuotedToken;
                        slashesToRemove = new List<int>();
                        continue;
                    }
                    if (sourceCharArray[i] == ',')
                    {
                        // Two separators in a row: empty value.
                        splitString.Add("");
                        tokenStart = i + 1;
                        continue;
                    }
                    state = State.InNonQuotedToken;
                    continue;
                case State.InNonQuotedToken:
                    if (sourceCharArray[i] == ',')
                    {
                        splitString.Add(
                            source.Substring(tokenStart, i - tokenStart));
                        state = State.AtBeginningOfToken;
                        tokenStart = i + 1;
                    }
                    continue;
                case State.InQuotedToken:
                    if (sourceCharArray[i] == '"')
                    {
                        state = State.ExpectingComma;
                        continue;
                    }
                    if (sourceCharArray[i] == '\\')
                    {
                        state = State.InEscapedCharacter;
                        slashesToRemove.Add(i - tokenStart);
                        continue;
                    }
                    continue;
                case State.ExpectingComma:
                    if (sourceCharArray[i] != ',')
                        throw new CsvParseException("Expecting comma");
                    string stringWithSlashes =
                        source.Substring(tokenStart, i - tokenStart);
                    // Remove backslashes from the end backwards so earlier offsets stay valid.
                    foreach (int item in slashesToRemove.Reverse<int>())
                        stringWithSlashes =
                            stringWithSlashes.Remove(item, 1);
                    // Trim the surrounding quotes.
                    splitString.Add(
                        stringWithSlashes.Substring(1,
                            stringWithSlashes.Length - 2));
                    state = State.AtBeginningOfToken;
                    tokenStart = i + 1;
                    continue;
                case State.InEscapedCharacter:
                    state = State.InQuotedToken;
                    continue;
            }
        }
        // End of input: finish (or reject) the token that was in progress.
        switch (state)
        {
            case State.AtBeginningOfToken:
                splitString.Add("");
                return splitString.ToArray();
            case State.InNonQuotedToken:
                splitString.Add(
                    source.Substring(tokenStart,
                        source.Length - tokenStart));
                return splitString.ToArray();
            case State.InQuotedToken:
                throw new CsvParseException("Expecting ending quote");
            case State.ExpectingComma:
                string stringWithSlashes =
                    source.Substring(tokenStart, source.Length - tokenStart);
                foreach (int item in slashesToRemove.Reverse<int>())
                    stringWithSlashes = stringWithSlashes.Remove(item, 1);
                splitString.Add(
                    stringWithSlashes.Substring(1,
                        stringWithSlashes.Length - 2));
                return splitString.ToArray();
            case State.InEscapedCharacter:
                throw new CsvParseException("Expecting escaped character");
        }
        throw new CsvParseException("Unexpected error");
    }
}
The whole program (LinqToCsv.cs) is as follows:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml.Linq;
namespace LinqToCsv
{
    public class CsvParseException : Exception
    {
        public CsvParseException(string message)
            : base(message)
        {
        }
    }
    public static class MyExtensions
    {
        private enum State
        {
            AtBeginningOfToken,
            InNonQuotedToken,
            InQuotedToken,
            ExpectingComma,
            InEscapedCharacter
        };
        public static string[] CsvSplit(this String source)
        {
            List<string> splitString = new List<string>();
            // Offsets (relative to tokenStart) of escape backslashes in the current quoted token.
            List<int> slashesToRemove = null;
            State state = State.AtBeginningOfToken;
            char[] sourceCharArray = source.ToCharArray();
            int tokenStart = 0;
            int len = sourceCharArray.Length;
            for (int i = 0; i < len; ++i)
            {
                switch (state)
                {
                    case State.AtBeginningOfToken:
                        if (sourceCharArray[i] == '"')
                        {
                            state = State.InQuotedToken;
                            slashesToRemove = new List<int>();
                            continue;
                        }
                        if (sourceCharArray[i] == ',')
                        {
                            // Two separators in a row: empty value.
                            splitString.Add("");
                            tokenStart = i + 1;
                            continue;
                        }
                        state = State.InNonQuotedToken;
                        continue;
                    case State.InNonQuotedToken:
                        if (sourceCharArray[i] == ',')
                        {
                            splitString.Add(
                                source.Substring(tokenStart, i - tokenStart));
                            state = State.AtBeginningOfToken;
                            tokenStart = i + 1;
                        }
                        continue;
                    case State.InQuotedToken:
                        if (sourceCharArray[i] == '"')
                        {
                            state = State.ExpectingComma;
                            continue;
                        }
                        if (sourceCharArray[i] == '\\')
                        {
                            state = State.InEscapedCharacter;
                            slashesToRemove.Add(i - tokenStart);
                            continue;
                        }
                        continue;
                    case State.ExpectingComma:
                        if (sourceCharArray[i] != ',')
                            throw new CsvParseException("Expecting comma");
                        string stringWithSlashes =
                            source.Substring(tokenStart, i - tokenStart);
                        // Remove backslashes from the end backwards so earlier offsets stay valid.
                        foreach (int item in slashesToRemove.Reverse<int>())
                            stringWithSlashes =
                                stringWithSlashes.Remove(item, 1);
                        // Trim the surrounding quotes.
                        splitString.Add(
                            stringWithSlashes.Substring(1,
                                stringWithSlashes.Length - 2));
                        state = State.AtBeginningOfToken;
                        tokenStart = i + 1;
                        continue;
                    case State.InEscapedCharacter:
                        state = State.InQuotedToken;
                        continue;
                }
            }
            // End of input: finish (or reject) the token that was in progress.
            switch (state)
            {
                case State.AtBeginningOfToken:
                    splitString.Add("");
                    return splitString.ToArray();
                case State.InNonQuotedToken:
                    splitString.Add(
                        source.Substring(tokenStart,
                            source.Length - tokenStart));
                    return splitString.ToArray();
                case State.InQuotedToken:
                    throw new CsvParseException("Expecting ending quote");
                case State.ExpectingComma:
                    string stringWithSlashes =
                        source.Substring(tokenStart, source.Length - tokenStart);
                    foreach (int item in slashesToRemove.Reverse<int>())
                        stringWithSlashes = stringWithSlashes.Remove(item, 1);
                    splitString.Add(
                        stringWithSlashes.Substring(1,
                            stringWithSlashes.Length - 2));
                    return splitString.ToArray();
                case State.InEscapedCharacter:
                    throw new CsvParseException("Expecting escaped character");
            }
            throw new CsvParseException("Unexpected error");
        }
    }
    class Program
    {
        // Compares the actual split result with the expected values and reports the outcome.
        static bool Validate(string[] results, string[] expectedResults)
        {
            if (results.Length != expectedResults.Length)
            {
                Console.WriteLine(" Validation error");
                return false;
            }
            for (int i = 0; i < results.Length; i++)
            {
                if (results[i] != expectedResults[i])
                {
                    Console.WriteLine(" Validation error");
                    return false;
                }
            }
            Console.WriteLine(" Validated");
            return true;
        }
        static void ValidateAll()
        {
            string[] split;
            Console.WriteLine("Test1");
            split = "\"12\\\"3\",\"456\",\"789\"".CsvSplit();
            Validate(split, new[] { "12\"3", "456", "789" });
            Console.WriteLine("Test2");
            split = "\"123\",\"456\",\"789\"".CsvSplit();
            Validate(split, new[] { "123", "456", "789" });
            Console.WriteLine("Test3");
            split = "\"aaa,bbb\",\"ccc,ddd\",ghi".CsvSplit();
            Validate(split, new[] { "aaa,bbb", "ccc,ddd", "ghi" });
            Console.WriteLine("Test4");
            split = "aaa,,bbb".CsvSplit();
            Validate(split, new[] { "aaa", "", "bbb" });
            Console.WriteLine("Test5");
            // A backslash inside a quoted value escapes the next character, so
            // this input is valid: CsvSplit does not throw and returns aaabbb.
            split = "\"aaa\\bbb\",ccc,ddd".CsvSplit();
            Validate(split, new[] { "aaabbb", "ccc", "ddd" });
            Console.WriteLine("Test6");
            try
            {
                split = "\"aaabbb\"bbb,ccc,ddd".CsvSplit();
                Console.WriteLine(" Validation error");
            }
            catch (CsvParseException)
            {
                Console.WriteLine(" Validated");
            }
            Console.WriteLine("Test7");
            split = "aaa,,bbb,".CsvSplit();
            Validate(split, new[] { "aaa", "", "bbb", "" });
            Console.WriteLine("Test8");
            try
            {
                split = "\"aaabbb\",ccc,\"ddd".CsvSplit();
                Console.WriteLine(" Validation error");
            }
            catch (CsvParseException)
            {
                Console.WriteLine(" Validated");
            }
            Console.WriteLine("Test9");
            try
            {
                split = "aaa,ccc,\"ddd\\".CsvSplit();
                Console.WriteLine(" Validation error");
            }
            catch (CsvParseException)
            {
                Console.WriteLine(" Validated");
            }
            Console.WriteLine("Test10");
            split = "\"a\\\\aa\",,bbb,".CsvSplit();
            Validate(split, new[] { "a\\aa", "", "bbb", "" });
            Console.WriteLine("Test11");
            split = "\"a\\aa\",,bbb,".CsvSplit();
            Validate(split, new[] { "aaa", "", "bbb", "" });
        }
        /// <summary>
        /// Main Function
        /// </summary>
        /// <param name="args">Command Line arguments</param>
        static void Main(string[] args)
        {
            XElement xmlDoc = new XElement("Root",
                File.ReadAllLines("TextFile.txt")
                    .Select
                    (
                        line =>
                        {
                            var split = line.CsvSplit();
                            return new XElement("Quote",
                                new XElement("Person", split[0]),
                                new XElement("Text", split[1])
                            );
                        }
                    )
            );
            Console.WriteLine(xmlDoc);
            ValidateAll();
        }
    }
}
Posted by: EricWhite