Wednesday, April 25, 2012

Java: splitting a comma-separated string but ignoring commas in quotes


I have a string vaguely like this:




foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"



that I want to split by commas, but I need to ignore commas inside quotes. How can I do this? A regexp approach seems to fail; I suppose I can scan manually and switch to a different mode when I see a quote, but it would be nice to use preexisting libraries. (Edit: I meant libraries that are already part of the JDK or part of a commonly used library such as Apache Commons.)



The above string should split into:




foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"



Note: this is NOT a CSV file; it's a single string contained in a file with a larger overall structure.


Source: Tips4all

9 comments:

  1. Try:

    public class Main {
        public static void main(String[] args) {
            String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
            String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
            for (String t : tokens) {
                System.out.println("> " + t);
            }
        }
    }


    Output:

    > foo
    > bar
    > c;qual="baz,blurb"
    > d;junk="quux,syzygy"


    In other words: split on the comma only if that comma has zero or an even number of quotes ahead of it.

    Needless to say, it won't work if your Strings can contain escaped quotes. In that case, a proper CSV parser should be used.

    Or, a bit friendlier for the eyes:

    public class Main {
        public static void main(String[] args) {
            String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

            String otherThanQuote = " [^\"] ";
            String quotedString = String.format(" \" %s* \" ", otherThanQuote);
            String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                    ", "+      // match a comma
                    "(?= "+    // start positive look ahead
                    " ( "+     // start group 1
                    " %s* "+   // match 'otherThanQuote' zero or more times
                    " %s "+    // match 'quotedString'
                    " )* "+    // end group 1 and repeat it zero or more times
                    " %s* "+   // match 'otherThanQuote'
                    " $ "+     // match the end of the string
                    ") ",      // stop positive look ahead
                    otherThanQuote, quotedString, otherThanQuote);

            String[] tokens = line.split(regex);
            for (String t : tokens) {
                System.out.println("> " + t);
            }
        }
    }


    which produces the same as the first example.
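    If the split runs many times, one option is to compile the pattern once instead of letting String.split compile it on every call. The following is only a sketch: the class and method names are made up for illustration, the group is made non-capturing, and the limit of -1 is an added choice so that trailing empty tokens are kept. The lookahead still re-scans the rest of the string for each comma, so this may be slow on very long inputs.

```java
import java.util.regex.Pattern;

class QuoteAwareSplitter {
    // Same lookahead as in the answer, compiled once; (?: ) is a non-capturing group.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // Hypothetical helper: the limit -1 keeps trailing empty tokens.
    static String[] split(String line) {
        return COMMA_OUTSIDE_QUOTES.split(line, -1);
    }

    public static void main(String[] args) {
        for (String t : split("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"")) {
            System.out.println("> " + t);
        }
    }
}
```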

  2. http://sourceforge.net/projects/javacsv/

    http://opencsv.sourceforge.net/

    http://stackoverflow.com/questions/101100/csv-api-for-java

    http://stackoverflow.com/questions/200609/can-you-recommend-a-java-library-for-reading-and-possibly-writing-csv-files

    http://stackoverflow.com/questions/123/csv-file-to-xml

  3. I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):

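    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // matches either a double quote or a comma
    private static final Pattern splitSearchPattern = Pattern.compile("[\",]");

    private List<String> splitByCommasNotInQuotes(String s) {
        if (s == null) {
            return Collections.emptyList();
        }

        List<String> list = new ArrayList<String>();
        Matcher m = splitSearchPattern.matcher(s);
        int pos = 0;               // start of the current token
        boolean quoteMode = false; // true while inside a quoted section
        while (m.find()) {
            String sep = m.group();
            if ("\"".equals(sep)) {
                quoteMode = !quoteMode;
            } else if (!quoteMode && ",".equals(sep)) {
                int toPos = m.start();
                list.add(s.substring(pos, toPos));
                pos = m.end();
            }
        }
        // add whatever follows the last separator
        if (pos < s.length()) {
            list.add(s.substring(pos));
        }
        return list;
    }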


    (exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)
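    One possible take on that exercise, as a sketch only: it scans character by character instead of using the Pattern above, and it assumes a backslash escapes whatever character follows it. The class and method names are hypothetical. Unlike the method above, it keeps empty tokens and returns a single empty token for an empty string.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplitter {
    // Splits on commas outside double quotes; a backslash escapes the next character.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean escaped = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) {
                escaped = false;           // previous char was a backslash: take c literally
            } else if (c == '\\') {
                escaped = true;            // next char loses any special meaning
            } else if (c == '"') {
                inQuotes = !inQuotes;      // unescaped quote toggles the mode
            } else if (c == ',' && !inQuotes) {
                parts.add(current.toString());
                current.setLength(0);
                continue;                  // the separator itself is not part of a token
            }
            current.append(c);             // tokens keep their quotes and backslashes
        }
        parts.add(current.toString());
        return parts;
    }
}
```

    Tokens are returned with their quotes and backslashes intact; stripping them would be a separate step.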

  4. While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regard to maintainability, e.g.:

    String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
    List<String> result = new ArrayList<String>();
    int start = 0;
    boolean inQuotes = false;
    for (int current = 0; current < input.length(); current++) {
        if (input.charAt(current) == '\"') {
            inQuotes = !inQuotes; // toggle state
        } else if (input.charAt(current) == ',' && !inQuotes) {
            result.add(input.substring(start, current));
            start = current + 1;
        }
        // checked after the comma test so a trailing comma is not kept in the last token
        boolean atLastChar = (current == input.length() - 1);
        if (atLastChar) result.add(input.substring(start));
    }


    If you don't care about preserving the commas inside the quotes, you could simplify this approach (no handling of start index, no last character special case) by replacing the commas in quotes with something else and then splitting at commas:

    String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
    StringBuilder builder = new StringBuilder(input);
    boolean inQuotes = false;
    for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
        char currentChar = builder.charAt(currentIndex);
        if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
        if (currentChar == ',' && inQuotes) {
            builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
        }
    }
    List<String> result = Arrays.asList(builder.toString().split(","));
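    The "replace later" variant hinted at in the comment could look roughly like the sketch below. It assumes a sentinel character (here U+0001, an arbitrary choice) that never occurs in the input; the class and method names are hypothetical.

```java
class SentinelSplit {
    // Assumption: this character never appears in the real input.
    private static final char SENTINEL = '\u0001';

    static String[] split(String input) {
        StringBuilder builder = new StringBuilder(input);
        boolean inQuotes = false;
        for (int i = 0; i < builder.length(); i++) {
            char c = builder.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;            // toggle state
            } else if (c == ',' && inQuotes) {
                builder.setCharAt(i, SENTINEL);  // hide the quoted comma
            }
        }
        String[] parts = builder.toString().split(",", -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace(SENTINEL, ','); // restore the hidden commas
        }
        return parts;
    }
}
```

    Because the commas are put back after splitting, the tokens come out with their quoted commas preserved.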

  5. You're in that annoying boundary area where regexps almost won't do (as Bart pointed out, escaped quotes would make life hard), and yet a full-blown parser seems like overkill.

    If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

  6. Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, replace every quoted group with __IDENTIFIER_1 or some other indicator, and store the original text in a Map<String, String> keyed by that indicator.

    After you split on comma, replace all mapped identifiers with the original string values.
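    A minimal sketch of this placeholder idea follows. The placeholder format, the class name and the method name are assumptions made for illustration, and the code assumes the placeholder text never occurs in the input and that quotes are not escaped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplitter {
    // A quoted group: a quote, any non-quote characters, a closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String input) {
        Map<String, String> saved = new LinkedHashMap<>();
        StringBuilder masked = new StringBuilder();
        Matcher m = QUOTED.matcher(input);
        int last = 0;
        while (m.find()) {
            // Hypothetical placeholder format; the trailing "__" keeps keys unambiguous.
            String key = "__IDENTIFIER_" + saved.size() + "__";
            saved.put(key, m.group());
            masked.append(input, last, m.start()).append(key);
            last = m.end();
        }
        masked.append(input, last, input.length());

        List<String> result = new ArrayList<>();
        for (String token : masked.toString().split(",", -1)) {
            // Put the original quoted text back into each token.
            for (Map.Entry<String, String> e : saved.entrySet()) {
                token = token.replace(e.getKey(), e.getValue());
            }
            result.add(token);
        }
        return result;
    }
}
```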

  7. Try a lookaround like (?<!\"),(?!\"). This should match any , that is not immediately preceded or followed by a ".
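
    To see what an adjacent-character lookaround does on the sample string from the question, here is a small check. The sketch uses a lookbehind, (?<!\"), for the character before the comma (a lookahead in that position would only inspect the comma itself); the class and method names are made up for illustration.

```java
import java.util.Arrays;

class AdjacentLookaroundDemo {
    // Splits on commas that have no double quote directly before or after them.
    static String[] split(String line) {
        return line.split("(?<!\"),(?!\")");
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

    On the sample this yields five tokens rather than the desired four: the commas in the middle of baz,blurb and quux,syzygy have no quote next to them and are split, while the comma right after blurb" is skipped. So this pattern alone does not seem to cover the case in the question.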

  8. Similarly, I had to split by comma but avoid splitting inside strings such as "'xxxxx','x,x','xxx'" or inside function arguments such as function(arg, arg,arg). My solution in JavaScript is:

    function ComaSplit(str) {
        var cw = '';   // current word
        var ar = [];   // finished words
        var b = false; // inside brackets
        var q = false; // inside quotes
        for (var i = 0; i < str.length; i++) {
            var x = str[i];

            // update the flags for the current position: q for quotes, b for brackets
            if (x == "'") { q = !q; }
            if (x == "(") { b = true; }
            if (x == ")") { b = false; }

            // if either flag is set, the comma is added to the current word cw; otherwise start the next word
            if (x != ',' || q || b) {
                cw += x;
            } else {
                ar.push(cw);
                cw = '';
            }
        }
        ar.push(cw);
        return ar;
    }


    It's not a universal solution, but it serves the specific goal.

  9. I would do something like this:

    boolean foundQuote = false;

    // currentString and currentStringIndex are assumed to come from the surrounding scan
    if (currentString.charAt(currentStringIndex) == '"') {
        foundQuote = true;
    }

    if (foundQuote) {
        // do nothing
    } else {
        String[] split = currentString.split(",");
    }
