coding5to9.com - Parsing CSV Files in C

This article is an in-depth guide on how to create a CSV parser in the C programming language. However, if you don't have time for the tutorial or are only interested in the result, here is the finished code:

Simple CSV Parser
CSV Parser with Double Quotes Support

There are two variants. The first is very simple, but it does not support double-quoted fields, so you can't escape the comma. The second one is more sophisticated, as it supports escaping the comma.

The only purpose of these sample CSV parsers is to give you an idea of how you can implement one from scratch without using any third-party libraries.

If you're looking for something more robust, you may want to look at the libcsv library.

Introduction

In this post we'll implement a CSV file parser in C. The abbreviation CSV stands for Comma Separated Values — that is, a file format where the data fields are separated by a comma. Despite its age (defined in the early 80s) this format is still quite popular because it's simple. Really simple. It's a text-based and non-proprietary format. This format is often the common denominator when it comes to exchanging data between applications: The format that (almost) all programs understand.

There are of course many CSV parser libraries out there. In this post, however, we don't plan to compete and challenge them, not at all. The goal of this post is just to show how CSV files can be parsed with the C programming language without any third party libraries used. Why? It helps you develop your algorithmic and general C programming skills.

This tutorial is aimed at beginner programmers who have written their first programs. However, some parts of the article may be useful for a broader audience too.

We'll process an imaginary company's employee records file. During the journey we will:

Read the file as a simple text-file and print each record line
Start handling the file as a real CSV
Learn how to escape commas inside the text and properly process double quotes

What You'll Need

Just your favorite editor and a C compiler. This article assumes basic knowledge of writing and compiling C programs. As a quick recap, you can compile a source file as follows:

    $ gcc source.c -g -Wall -o out

The resulting executable can then be run:

    $ ./out

Suppose the imaginary company's employee records file contains the following information:

Name (first and last name);
Phone number;
Job title.

These fields are separated by a comma, practically forming a CSV file. The sample file is as follows:

sample.csv

first_name,last_name,phone,job_title
John,Doe,555-444,CTO
Jane,Doe,444-555,Director

Notice that the file starts with a header line that lists the name of each column.

Even though this tutorial is about reading CSV files, this is the perfect opportunity to tell you to always add a header if you happen to create CSV files.

The header line is often missing from CSV files, but the data itself, without headers, is not self-descriptive enough and hard to understand for the other parties.

That said, it's a good practice to always add the header to your CSV file.

Read the File and Display its Raw Content

Let's jump into some actual code. We'll open the sample CSV file and print each line:

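#include <stdio.h>

int main(int argc, char **argv) {
    char buf[1024];

    /* argv[1] only exists if a file name was passed on the command line */
    if (argc < 2) {
        printf("Usage: %s <file>\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "r");
    if (!fp) {
        printf("Can't open file %s\n", argv[1]);
        return 1;
    }

    /* fgets keeps the line break, so printing "\n" adds an empty line */
    while (fgets(buf, 1024, fp)) {
        printf("%s\n", buf);
    }

    fclose(fp);

    return 0;
}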

The #include <stdio.h> line (more precisely, a preprocessing directive) tells the compiler that we're going to use functions declared in the C standard library, namely printf and fopen. The former prints text to the standard output (stdout); the latter, as its name suggests, opens a file.

Next we define the main function. The main function is the entry point of the application: it runs when our program is executed. In this particular program we accept parameters from the console, which is why main takes the parameters int argc, char **argv. The argc parameter tells us how many arguments we received from the console, counting the program name itself (argv[0]). The argv array contains the actual arguments.

We'll provide the file name for the program to open (argv[1]).

Next, we open the file with the fopen function. The first parameter is the name of the file we want to open. The second one, "r", indicates that we open the file in read-only mode, since we don't plan to change it.

The return value is a pointer to the file. If the file cannot be opened (for example, because it doesn't exist), fopen returns NULL.

The following if block handles that case: if the fp variable is NULL (!fp), the program terminates with return code 1. By convention, a non-zero return code signals an error.
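A bare "Can't open file" message doesn't say why the call failed. As a minimal sketch, assuming a hypothetical helper called open_for_reading (it's not part of the programs in this article), the standard perror function can print the reason. Note that the C standard doesn't strictly require fopen to set errno, although POSIX systems do:

```c
#include <stdio.h>

/* Hypothetical helper: opens path for reading. On failure it prints the
 * reason (taken from errno) to stderr and returns NULL. */
FILE *open_for_reading(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror(path);   /* prints the path, a colon and the error text */
    }
    return fp;
}
```

The caller still has to check the result for NULL and return a non-zero code, exactly as the if block above does.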

In the while loop each line is read and displayed on the standard output. The fgets function reads at most 1023 characters from the file (one less than the size we pass in) and stores them in buf, followed by a terminating \0. That's why we previously defined buf as an array of 1024 characters. Note that fgets stops early when it encounters a new line or the end of the file. The new line character itself is kept in buf.
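Because fgets stops at the buffer size, a line longer than the buffer arrives in several pieces. Here is a minimal sketch of how to notice this, assuming a hypothetical helper named line_is_complete (not part of the article's code):

```c
#include <string.h>

/* Hypothetical helper: returns 1 if buf, as filled by fgets, contains a
 * line break, and 0 otherwise. fgets stores at most size - 1 characters
 * plus the terminating '\0', so a missing '\n' means the line did not
 * fit into the buffer or the file ended without a final line break. */
int line_is_complete(const char *buf)
{
    return strchr(buf, '\n') != NULL;
}
```

With the 1024-byte buffer used here, a return value of 0 means the line was either longer than 1023 characters or it is the last line of a file that doesn't end with a line break.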

Then we close the file and terminate the application.

I'd like to stress the importance of the fclose function and, more generally, of closing and freeing up resources. If you happen to be a beginner developer, you may forget to call cleanup functions like fclose. It doesn't really have any visible effect, right? Your program runs perfectly without it too.

However, an allocated resource or block of memory that is never freed is a leak, and leaks might eventually result in serious problems that are extremely hard to debug. I've seen enterprise systems where memory leaks started causing problems after weeks of running.

With that said, always pay extra attention to freeing up resources and memory by calling the appropriate functions, even if you write a five-line program. Why? Because if you always call the cleanup functions, it'll eventually become a routine and minimize the risk that one day you'll forget it in an important production application. If you develop the habit of properly closing resources and freeing memory, you'll write better programs.

Now that we've understood the code, compile and run it to see the results:

$ gcc csv_reader1.c -g -pedantic -Wall -o csv_reader1
$ ./csv_reader1 sample.csv
first_name,last_name,phone,job_title

John,Doe,555-444,CTO

Jane,Doe,444-555,Director

$

Great! We managed to print the raw contents of the sample file.

Implementing a Simple CSV Parser

We'll start handling the file as a real CSV file. To do so, we'll use the strtok function to split the current line at the commas. Each time we call this function, it returns a char * pointing to the next token.

If there are no more tokens, strtok returns NULL.

Let's first look at the code, then go through it in detail:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int count_columns(char *line)
{
    int cnt = 0;

    for (int i = 0; line[i]; i++)
    {
        if (line[i] == ',')
        {
            cnt++;
        }
    }

    return (cnt + 1);
}

char **read_csv(char *line, int *arr_len)
{
    int col_count = 0;

    *arr_len = count_columns(line);

    char **arr = (char **)malloc((*arr_len) * sizeof(char *));
    char *tmp = strdup(line);

    char *token = strtok(tmp, ",");
    while (token)
    {
        int len_field = strlen(token);
        int bytes_to_copy = token[len_field - 1] == '\n'
                            ? len_field
                            : (len_field + 1);
        char *col = (char *)malloc(sizeof(char) * bytes_to_copy);
        strncpy(col, token, bytes_to_copy);
        col[bytes_to_copy - 1] = 0;

        arr[col_count++] = col;

        token = strtok(NULL, ",");
    }

    free(tmp);

    return arr;
}

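int main(int argc, char **argv)
{
    /* argv[1] only exists if a file name was passed on the command line */
    if (argc < 2)
    {
        printf("Usage: %s <file>\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "r");

    if (!fp)
    {
        printf("Can't open file %s\n", argv[1]);
        return 1; /* non-zero: signal the error to the caller */
    }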

    char buf[1024];
    int row_count = 0;
    while (fgets(buf, 1024, fp))
    {
        if (++row_count == 1)
        {
            continue;
        }

        int arr_len = 0;
        char **arr = read_csv(buf, &arr_len);

        printf("First name: %s\n", arr[0]);
        printf("Last name: %s\n", arr[1]);
        printf("Phone number: %s\n", arr[2]);
        printf("Job title: %s\n\n", arr[3]);

        for (int i = 0; i < arr_len; i++)
        {
            free(arr[i]);
        }

        free(arr);
    }

    fclose(fp);
    return 0;
}

Let's start with the main function.

First of all, we'll need to count the rows (row_count), because we know that the first line is the header and we don't want to process it. The columns are counted as well, but that happens inside the parser, so that we know which field we are processing.

while (fgets(buf, 1024, fp))
{
    if (++row_count == 1)
    {
        continue; /* skip the header */
    }
}

Then let's look at the next block:

int arr_len = 0;
char **arr = read_csv(buf, &arr_len);

printf("First name: %s\n", arr[0]);
printf("Last name: %s\n", arr[1]);
printf("Phone number: %s\n", arr[2]);
printf("Job title: %s\n\n", arr[3]);

Only the first two lines are really important here.

First, we define a new variable called arr_len. This is where we expect to receive the number of columns the parser found in the given line. If we later iterate through the results, we'll need to know how many columns we actually have.

The next line calls the actual CSV parser. Look at the type of the variable that receives the result: char **. You can think of it as a two-dimensional array, although strictly speaking it's a pointer to an array of char pointers. The first dimension stores the pointers to the tokens, the second dimension stores the characters of each token.

In other words, we can access a single column like this:

printf("The second column is: %s\n", arr[1]);

Since we're allocating memory using malloc, we also need to release it:

    for (int i = 0; i < arr_len; i++)
    {
        free(arr[i]);
    }

    free(arr);
}

fclose(fp);

First we release the memory allocated for each column (line 3), then we release the memory allocated to store the pointers of the columns (line 6). This is an important step to prevent memory leaks.
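Since every caller of read_csv has to repeat these two steps, they could be wrapped in one function. This is only a sketch: free_fields is a hypothetical name, not something the original program defines.

```c
#include <stdlib.h>

/* Hypothetical helper: frees every field first, then the array of
 * pointers itself. free(NULL) does nothing, so NULL slots are harmless. */
void free_fields(char **fields, int count)
{
    if (fields == NULL) {
        return;
    }
    for (int i = 0; i < count; i++) {
        free(fields[i]);
    }
    free(fields);
}
```

In main, the loop and the final free(arr) would then shrink to a single free_fields(arr, arr_len) call.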

Let's continue with the parser itself!

char **read_csv(char *line, int *arr_len)
{
    int col_count = 0;

    *arr_len = count_columns(line);

    char **arr = (char **)malloc((*arr_len) * sizeof(char *));
    char *tmp = strdup(line);

We'll store the index of the column we're currently parsing in the col_count variable.

As we saw previously, we inform the caller through the arr_len pointer about how many columns we have. Counting the columns (function count_columns) is very simple: we count the occurrences of the , character and add one, because n commas separate n + 1 fields.

Then we allocate memory for the array where we'll store the pointers to the columns.

We also create a copy (duplicate) of the current line using the strdup function. Why do we need it? Because we'll use the strtok function that'll modify the input string. Since we do not want the line to be modified, we create a clone of it.
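One thing to be aware of: strdup comes from POSIX and only became part of the C standard with C23, so a strict older compiler mode may not declare it. A minimal stand-in could look like the sketch below, where dup_string is a hypothetical name chosen for this example:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup: returns a heap copy of s, or NULL
 * if the allocation fails. The caller must free() the result. */
char *dup_string(const char *s)
{
    size_t size = strlen(s) + 1;    /* +1 for the terminating '\0' */
    char *copy = malloc(size);
    if (copy != NULL) {
        memcpy(copy, s, size);
    }
    return copy;
}
```

Unlike the listing above, this version makes it easy to check for a failed allocation before the copy is handed to strtok.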

Let's extract the loop:

char *token = strtok(tmp, ",");
while (token)
{
    int len_field = strlen(token);
    int bytes_to_copy = token[len_field - 1] == '\n'
                        ? len_field
                        : (len_field + 1);
    char *col = (char *)malloc(sizeof(char) * bytes_to_copy);
    strncpy(col, token, bytes_to_copy);
    col[bytes_to_copy - 1] = 0;

    arr[col_count++] = col;

    token = strtok(NULL, ",");
}

Here comes the strtok function that splits up the line. Do you spot something interesting? On the first call, the first parameter of strtok is the line that we want to split up. However, when we call it again inside the while loop, the first parameter is NULL. This is how strtok is specified to work.

I remember that when I was studying strtok, I didn't understand how it would know what to parse next if I passed NULL. It turns out that strtok keeps a pointer to the unparsed part of the string in internal static storage, much like a global variable. This pointer keeps its value after strtok returns and is reused when strtok is called again.
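The hidden state also means that strtok can't split two strings at the same time and isn't safe to use from several threads. To illustrate the idea, here is a sketch of a strtok-like function that keeps its position in a pointer owned by the caller, similar in spirit to POSIX strtok_r. The name next_field is hypothetical, and the function is not used by the programs in this article:

```c
#include <string.h>

/* Hypothetical strtok-like helper that keeps its position in *cursor
 * (owned by the caller) instead of in hidden static storage.
 * Returns the next comma-separated field, or NULL when none are left.
 * Unlike strtok, an empty field (as in "a,,c") is returned as "". */
char *next_field(char **cursor)
{
    char *start = *cursor;
    if (start == NULL) {
        return NULL;
    }

    char *comma = strchr(start, ',');
    if (comma != NULL) {
        *comma = '\0';          /* terminate the current field */
        *cursor = comma + 1;    /* the next call continues after the comma */
    } else {
        *cursor = NULL;         /* this was the last field */
    }
    return start;
}
```

The caller sets a char * to the start of a writable line once and then passes its address on every call, so nothing is shared between two lines that are being parsed.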

Calling strtok repeatedly returns the next token separated by ,.
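One caveat worth knowing, sketched below with a hypothetical count_tokens helper: strtok treats a run of consecutive delimiters as a single separator, so an empty field such as the one in John,,555-444,CTO is silently skipped. count_columns, on the other hand, still counts four columns for that line, so the last slot of arr would never be filled.

```c
#include <string.h>

/* Hypothetical helper: counts the tokens strtok finds in line.
 * The buffer is modified, because strtok writes '\0' over the commas. */
int count_tokens(char *line)
{
    int count = 0;
    for (char *tok = strtok(line, ","); tok != NULL; tok = strtok(NULL, ",")) {
        count++;
    }
    return count;
}
```

For a line without empty fields the result matches count_columns. For John,,555-444,CTO it returns 3 instead of 4, which is something to keep in mind if your data may contain empty columns.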

We'll store the token in col. To do this, we need to allocate memory (bytes_to_copy bytes). The condition on bytes_to_copy considers whether this is an intermediate element or the last element in the line. In the latter case, we allocate one byte less.

Why? For the last element we do not want to store the new line \n character, so we need one byte less. However, in both cases, we need one extra byte to store the string termination character \0. So for the last element this sums up to exactly len_field, while for the intermediate elements we allocate an extra byte (len_field + 1).

Then we copy the characters from token into col, then we terminate col with a \0. Now col stores the token, we just have to add it to arr and start looking for the next token.
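The same copy step can also be written without the conditional by measuring how many characters come before the line break. The following is a sketch under that assumption; copy_field is a hypothetical name, and unlike the listing above it also drops a Windows-style \r:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of token without a trailing
 * line break, or NULL if the allocation fails. */
char *copy_field(const char *token)
{
    size_t len = strcspn(token, "\r\n");    /* characters before any line break */
    char *col = malloc(len + 1);            /* +1 for the terminating '\0' */
    if (col == NULL) {
        return NULL;
    }
    memcpy(col, token, len);
    col[len] = '\0';
    return col;
}
```

Inside the loop, the body would then reduce to arr[col_count++] = copy_field(token), followed by the next strtok call.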

If you compile and run the program, you'll see the following output:

$ ./out sample.csv
First name: John
Last name: Doe
Phone number: 555-444
Job title: CTO

First name: Jane
Last name: Doe
Phone number: 444-555
Job title: Director

How to Escape Commas using Double Quotes

Say the employee records database has been updated since the last time we processed it. First, a new member who has the job title Support, Level2 joined the company.

Second, an existing employee got promoted and now holds the job title VP of Department of "Advanced Technologies". Note the double quotes.

The new input file that represents these changes would look like this:

sample2.csv

first_name,last_name,phone,job_title
John,"Doe",555-444,CTO
"Jane","Doe",444-555,"VP of Department of ""Advanced Technologies"""
John,Doe Jr,444-333,"Support, Level2"

Note that "Support, Level2" is double-quoted: if you want to have commas inside a field, you need to enclose the field in double quotes. A double quote inside such a field is escaped by doubling it (""). According to the specification, you can also use double quotes even if there are no commas in the text.
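To make the quoting rule concrete, here is a sketch of the opposite direction: turning a plain field into its quoted CSV form. The function name quote_field is hypothetical and the function is not part of the parser; it only illustrates how the lines in sample2.csv come about.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated, double-quoted copy of
 * field in which every '"' is doubled, or NULL if the allocation fails. */
char *quote_field(const char *field)
{
    size_t len = strlen(field);
    /* Worst case: every character is a quote (2 * len), plus the two
     * enclosing quotes and the terminating '\0'. */
    char *out = malloc(2 * len + 3);
    if (out == NULL) {
        return NULL;
    }

    size_t j = 0;
    out[j++] = '"';
    for (size_t i = 0; i < len; i++) {
        if (field[i] == '"') {
            out[j++] = '"';     /* escape a quote by doubling it */
        }
        out[j++] = field[i];
    }
    out[j++] = '"';
    out[j] = '\0';
    return out;
}
```

Passing Support, Level2 yields "Support, Level2", and passing VP of Department of "Advanced Technologies" yields the last field of the second data line in sample2.csv.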

So, what will happen if we try to parse the new file with the tool?

$ ./out sample2.csv
...
First Name:    John
Last Name:    Doe Jr
Phone Number:    444-333
Job Title:    "Support
    Level2"

Certainly, this is not what we want: the double quotes appear in the output, and the job title was split at the comma.

A new approach to properly parse the CSV file could be the following:

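/* Needs <stdbool.h>, <stdlib.h> and <string.h>. MAX_FIELDS is assumed to be
   defined elsewhere as the maximum number of columns per line. */
char **read_csv(char *line, int *arr_len)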
{
    char **arr = (char **)malloc(sizeof(char *) * MAX_FIELDS);
    char *field = NULL;

    int col_count = 0;
    int char_count = 0;
    int i = 0;

    int line_len = strlen(line);

    bool token_start = true;
    bool skip_store = false;

    bool is_quoted_field = false;

    bool stop_condition = false;
    bool quoted_field_stops = false;
    bool non_quoted_field_stops = false;
    while (i < line_len && line[i])
    {
        if (token_start)
        {
            field = (char *)malloc(sizeof(char) * line_len + 1);

            token_start = false;
            is_quoted_field = line[i] == '"' ? 1 : 0;
            if (is_quoted_field)
            {
                skip_store = 1;
            }
        }

        /* the second quote of an escaped pair ("") is not stored;
           i > 0 keeps line[i - 1] inside the buffer */
        if (is_quoted_field && i > 0 && line[i] == '"' && line[i - 1] == '"')
        {
            skip_store = true;
        }

        quoted_field_stops = is_quoted_field && line[i] == '"' &&
                             (line[i + 1] == ',' || line[i + 1] == '\n');
        non_quoted_field_stops = !is_quoted_field &&
                                 (line[i] == ',' || line[i] == '\n');

        stop_condition = quoted_field_stops || non_quoted_field_stops;
        if (stop_condition)
        {
            skip_store = true;
            token_start = true;

            field[char_count] = 0;
            arr[col_count++] = field;
            char_count = 0;

            if (quoted_field_stops)
            {
                i++;
            }
        }

        if (!skip_store)
        {
            field[char_count++] = line[i];
        }

        skip_store = false;
        i++;
    }

    *arr_len = col_count;
    return arr;
}
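As a point of comparison, the same idea can be sketched as a small state machine that only tracks whether we're inside quotes. This is an alternative sketch under stated assumptions, not the code of this article: split_csv_line and its parameters are hypothetical names, the function modifies the line in place instead of allocating each field, and it treats "" inside a quoted field as one literal quote.

```c
#include <stdbool.h>

/* Hypothetical in-place splitter: rewrites line so that every field is a
 * '\0'-terminated string, stores pointers to the fields in out[] and
 * returns the number of fields found (at most max_fields). */
int split_csv_line(char *line, char **out, int max_fields)
{
    int count = 0;
    char *src = line;

    while (count < max_fields) {
        char *dst = src;            /* the unquoted text is written back here */
        bool in_quotes = false;
        bool end_of_line = false;

        out[count++] = dst;
        for (;;) {
            char c = *src;
            if (c == '\0' || (!in_quotes && (c == '\n' || c == '\r'))) {
                end_of_line = true;
                break;
            }
            if (in_quotes) {
                if (c == '"' && src[1] == '"') {
                    *dst++ = '"';           /* escaped quote */
                    src += 2;
                } else if (c == '"') {
                    in_quotes = false;      /* closing quote */
                    src++;
                } else {
                    *dst++ = c;             /* commas are plain text here */
                    src++;
                }
            } else if (c == '"') {
                in_quotes = true;           /* opening quote */
                src++;
            } else if (c == ',') {
                src++;                      /* step over the separator */
                break;
            } else {
                *dst++ = c;
                src++;
            }
        }
        *dst = '\0';    /* dst never runs ahead of src, so this is safe */
        if (end_of_line) {
            break;
        }
    }
    return count;
}
```

Because the fields point into the line buffer, nothing has to be freed, but the fields are only valid until the buffer is reused by the next fgets call.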