[ACCEPTED]-C#, regular expressions : how to parse comma-separated values, where some values might be quoted strings themselves containing commas-csv

Accepted answer
Score: 24

Try with this Regex:

"[^"\r\n]*"|'[^'\r\n]*'|[^,\r\n]*

    Regex regexObj = new Regex(@"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");
    Match matchResults = regexObj.Match(input);
    while (matchResults.Success) 
    {
        Console.WriteLine(matchResults.Value);
        matchResults = matchResults.NextMatch();
    }

Ouputs:

  • cat
  • dog
  • "0 = OFF, 1 = ON"
  • lion
  • tiger
  • 'R = red, G = green, B = blue'
  • bear

Note: This regex solution 2 will work for your case, however I recommend 1 you to use a specialized library like FileHelpers.

Score: 22

Why not heed the advice from the experts 10 and Don't roll your own CSV parser.

Your first thought is, "I need 9 to handle commas inside of quotes."

Your 8 next thought will be, "Oh, crap, I 7 need to handle quotes inside of quotes. Escaped 6 quotes. Double quotes. Single quotes..."

It's 5 a road to madness. Don't write your own. Find 4 a library with an extensive unit test coverage 3 that hits all the hard parts and has gone 2 through hell for you. For .NET, use the 1 free and open source FileHelpers library.

Score: 8

it's not a regex, but I've used Microsoft.VisualBasic.FileIO.TextFieldParser 4 to accomplish this for csv files. yes, it 3 might feel a little strange adding a reference 2 to Microsoft.VisualBasic in a C# app, maybe 1 even a little dirty, but hey it works.

Score: 8

Ah, RegEx. Now you have two problems. ;)

I'd 4 use a tokenizer/parser, since it is quite 3 straightforward, and more importantly, much 2 easier to read for later maintenance.

This 1 works, for example:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green,     B = blue',bear"; 
        Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
        CsvParser parser = new CsvParser(myString);

        Int32 lineNumber = 0;
        foreach (string s in parser)
        {
            Console.WriteLine(lineNumber + ": " + s);
        }

        Console.ReadKey();
    }
}

internal enum TokenType
{
    Comma,
    Quote,
    Value
}

internal class Token
{
    public Token(TokenType type, string value)
    {
        Value = value;
        Type = type;
    }

    public String Value { get; private set; }
    public TokenType Type { get; private set; }
}

internal class StreamTokenizer : IEnumerable<Token>
{
    private TextReader _reader;

    public StreamTokenizer(TextReader reader)
    {
        _reader = reader;    
    }

    public IEnumerator<Token> GetEnumerator()
    {
        String line;
        StringBuilder value = new StringBuilder();

        while ((line = _reader.ReadLine()) != null)
        {
            foreach (Char c in line)
            {
                switch (c)
                {
                    case '\'':
                    case '"':
                        if (value.Length > 0)
                        {
                            yield return new Token(TokenType.Value, value.ToString());
                            value.Length = 0;
                        }
                        yield return new Token(TokenType.Quote, c.ToString());
                        break;
                    case ',':
                       if (value.Length > 0)
                        {
                            yield return new Token(TokenType.Value, value.ToString());
                            value.Length = 0;
                        }
                        yield return new Token(TokenType.Comma, c.ToString());
                        break;
                    default:
                        value.Append(c);
                        break;
                }
            }

            // Thanks, dpan
            if (value.Length > 0) 
            {
                yield return new Token(TokenType.Value, value.ToString()); 
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

internal class CsvParser : IEnumerable<String>
{
    private StreamTokenizer _tokenizer;

    public CsvParser(Stream data)
    {
        _tokenizer = new StreamTokenizer(new StreamReader(data));
    }

    public CsvParser(String data)
    {
        _tokenizer = new StreamTokenizer(new StringReader(data));
    }

    public IEnumerator<string> GetEnumerator()
    {
        Boolean inQuote = false;
        StringBuilder result = new StringBuilder();

        foreach (Token token in _tokenizer)
        {
            switch (token.Type)
            {
                case TokenType.Comma:
                    if (inQuote)
                    {
                        result.Append(token.Value);
                    }
                    else
                    {
                        yield return result.ToString();
                        result.Length = 0;
                    }
                    break;
                case TokenType.Quote:
                    // Toggle quote state
                    inQuote = !inQuote;
                    break;
                case TokenType.Value:
                    result.Append(token.Value);
                    break;
                default:
                    throw new InvalidOperationException("Unknown token type: " +    token.Type);
            }
        }

        if (result.Length > 0)
        {
            yield return result.ToString();
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
Score: 7

Just adding the solution I worked on this 7 morning.

var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");

foreach (Match m in regex.Matches("<-- input line -->"))
{
    var s = m.Value; 
}

As you can see, you need to call 6 regex.Matches() per line. It will then return a 5 MatchCollection with the same number of 4 items you have as columns. The Value property 3 of each match is, obviously, the parsed 2 value.

This is still a work in progress, but 1 it happily parses CSV strings like:

2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
Score: 4

CSV is not regular. Unless your regex language 12 has sufficient power to handle the stateful 11 nature of csv parsing (unlikely, the MS 10 one does not) then any pure regex solution 9 is a list of bugs waiting to happen as you 8 hit a new input source that isn't quite handled 7 by the last regex.

CSV reading is not that 6 complex to write as a state machine since 5 the grammar is simple but even so you must 4 consider: quoted quotes, commas within quotes, new 3 lines within quotes, empty fields.

As such 2 you should probably just use someone else's 1 CSV parser. I recommend CSVReader for .Net

Score: 2

Function:

    private List<string> ParseDelimitedString (string arguments, char delim = ',')
    {
        bool inQuotes = false;
        bool inNonQuotes = false; //used to trim leading WhiteSpace

        List<string> strings = new List<string>();

        StringBuilder sb = new StringBuilder();
        foreach (char c in arguments)
        {
            if (c == '\'' || c == '"')
            {
                if (!inQuotes)
                    inQuotes = true;
                else
                    inQuotes = false;
            }else if (c == delim)
            {
                if (!inQuotes)
                {
                    strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
                    sb.Remove(0, sb.Length);
                    inNonQuotes = false;
                }
                else
                {
                    sb.Append(c);
                }
            }
            else if ( !char.IsWhiteSpace(c) && !inQuotes && !inNonQuotes)  
            {
                if (!inNonQuotes) inNonQuotes = true;
                sb.Append(c);
            }
        }
        strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());


        return strings;
    }

Usage

    string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear,         text";
    List<string> strings = ParseDelimitedString(myString);

    foreach( string s in strings )
            Console.WriteLine( s );

Output:

cat
dog
0 = OFF, 1 = ON
lion
tiger
R = red, G = green, B = blue
bear
text

0

Score: 1

I found a few bugs in that version, for 7 example, a non-quoted string that has a 6 single quote in the value.

And I agree use 5 the FileHelper library when you can, however 4 that library requires you know what your 3 data will look like... I need a generic 2 parser.

So I've updated the code to the following 1 and thought I'd share...

    static public List<string> ParseDelimitedString(string value, char delimiter)
    {
        bool inQuotes = false;
        bool inNonQuotes = false;
        bool secondQuote = false;
        char curQuote = '\0';

        List<string> results = new List<string>();

        StringBuilder sb = new StringBuilder();
        foreach (char c in value)
        {
            if (inNonQuotes)
            {
                // then quotes are just characters
                if (c == delimiter)
                {
                    results.Add(sb.ToString());
                    sb.Remove(0, sb.Length);
                    inNonQuotes = false;
                }
                else
                {
                    sb.Append(c);
                }
            }
            else if (inQuotes)
            {
                // then quotes need to be double escaped
                if ((c == '\'' && c == curQuote) || (c == '"' && c == curQuote))
                {
                    if (secondQuote)
                    {
                        secondQuote = false;
                        sb.Append(c);
                    }
                    else
                        secondQuote = true;
                }
                else if (secondQuote && c == delimiter)
                {
                    results.Add(sb.ToString());
                    sb.Remove(0, sb.Length);
                    inQuotes = false;
                }
                else if (!secondQuote)
                {
                    sb.Append(c);
                }
                else
                {
                    // bad,as,"user entered something like"this,poorly escaped,value
                    // just ignore until second delimiter found
                }
            }
            else
            {
                // not yet parsing a field
                if (c == '\'' || c == '"')
                {
                    curQuote = c;
                    inQuotes = true;
                    inNonQuotes = false;
                    secondQuote = false;
                }
                else if (c == delimiter)
                {
                    // blank field
                    inQuotes = false;
                    inNonQuotes = false;
                    results.Add(string.Empty);
                }
                else
                {
                    inQuotes = false;
                    inNonQuotes = true;
                    sb.Append(c);
                }
            }
        }

        if (inQuotes || inNonQuotes)
            results.Add(sb.ToString());

        return results;
    }
Score: 0

since this question: Regex to to parse csv with nested quotes

reports here and is 6 much more generic, and since a RegEx is 5 not really the proper way to solve this 4 problem (i.e. I have had many issues with 3 catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html)

here is a 2 simple parser implementation in Python as 1 well

def csv_to_array(string):
    stack = []
    match = []
    matches = []

    for c in string:
        # do we have a quote or double quote?
        if c == "\"":
            # is it a closing match?
            if len(stack) > 0 and stack[-1] == c:
                stack.pop()
            else:
                stack.append(c)
        elif (c == "," and len(stack) == 0) or (c == "\n"):
            matches.append("".join(match))
            match = []
        else:
            match.append(c)

    return matches

More Related questions