coding5to9.com - How to Split Strings and Extract Tokens in C?

Introduction

When working with strings, it is often necessary to split them by space or any other arbitrary character. Just think of processing CSV files — you'll need to use some tokenizer to extract its fields.

In C there's a function called strtok for this purpose, however, it has some drawbacks.

We will find out in this post when we can use strtok to break strings into tokens and when we should look for something else.

Additionally, for practicing purposes we'll also create our own split function. We'll discover and address some design issues during the implementation.

The strtok function

Signature

According to the manual page of strtok its signature looks like as follows:

char *strtok(char *str, const char *delim);

It's part of the string.h header file.

The first parameter str points to the string we want to tokenize by the set of characters pointed by delim. That is, multiple characters can be given as separator. For example, if delim is pointing to " ,;" then the string will be fragmented when whitespace, comma or semicolon character is hit.

This function is not that simple though: The documentation also says that the pointer to the source string should be passed in the first call. The return value will point to the first token. The subsequent calls will then need a NULL value in the first parameter and it will return the next token. This should be repeated until there's no more token to read. If that happens, the function returns NULL.

Next token and subsequent calls, huh? If you have Java or C# background, you might have expected that this function simply returns an array of the tokens so that you can reference them by typing result[0] and so on. However, that's not the case, it behaves really differently.

If it doesn't return an array then how it works? It can be best looked at through an example.

Example

Let's assume you want to parse a string like One Two Three. In order to receive the individual tokens separated by a whitespace character you have to call strtok three times:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int i = 0;
    char str[1024] = "One Two Three";
    char *p = strtok(str, " ");

    while (p) {
        printf("%i. token = %s\n", i, p);
        p = strtok(NULL, " ");
        i++;
    }
}

Compile and run the code above. You'll see the following:

$ gcc split4.c -Wall -o split && ./split
0. token = One
1. token = Two
2. token = Three
$

If you read through the source code, you can interpret it as follows:

The source string is stored in an array of chars (that's important, in the next chapter we'll see why)
The first call passes the pointer to the source string and returns a pointer to the first token (One)
If there's (more) token to read, then
Print the token
Call the function again, but this time passing NULL as the first parameter. It'll instruct the strtok to continue parsing the previous source string (that is, the string that was passed at the last invocation)

Caveats

Considering the signature of this function, it would be tempting to write something like that:

char *str = "One Two Tree";
char *ret = strtok(str, " ");
printf("First token: %s\n", ret);

As the first token, you expect the word One. But let's actually see what happens if you try to run this program:

The result is segmentation fault

The program crashes! Why? The reason is surprising but rather simple: strtok actually tries to modify the input array pointed by the first parameter. Usually, a string literal created like above lives in the read-only section of the memory. That is, when the strtok tries to modify that data, the program crashes.

Keep that in mind and be very careful with that behavior because even when applying the -Wall flag (show all warnings), we did not get any warning message from the compiler!

The first lesson here is that you should always call strtok with a parameter that points to a char array that can be modified.

And that's not all. We run into another problem when it comes to parsing different strings at the same time.

Let's consider the following example: We want to print the first names of employees and their corresponding job titles from a file. A line in the file might look like as follows:

John Doe|Chief Executive Officer

We expect the following output:

John: Chief Executive Officer

If we take into account these requirements and our current knowledge of the strtok function, we might wind up writing this:

void extract_name(char *full_name, char *first_name)
{
    char *token = strtok(full_name, " ");
    strcpy(first_name, token);
}

/* ... */

p_line_token = strtok(line, "|");
while (p_line_token)
{
    if (cnt_token == 0)
    {
        extract_name(p_line_token, name);
    }
    else if (cnt_token == 1)
    {
        printf("%s: %s\n", name, p_line_token);
    }

    p_line_token = strtok(NULL, "|");
    cnt_token++;
}

Let's compile and see if it works!

~/work $ gcc split5.c -Wall -o split5 && ./split5
John: Doe
~/work $

As the output tells, something went wrong, this is not what we wanted to see. We got back the last name instead of the proper job title.

Where's the problem? If we look into the code, we see that in the loop the extract_name function is called which also invokes the strtok function! Calling strtok again with the full_name parameter resets its internal state. This means, when the subsequent call happens in the main loop with NULL parameter to parse further the job title, it'll rather parse the string last pointed by the first parameter. Which is, in this case, the full_name and not the p_line_token.

The second lesson is that strtok has an internal state and thus cannot parse multiple strings at the same time.

In order to address the second problem, though, there's a variant called strtok_r which receives a third parameter to store the internal state.

Summing up

The best thing you can do is you don't use strtok at all. If possible, use its improved version, the strtok_r.

Nevertheless, if you must use strtok (e.g. working on legacy systems where the newer version is not available), you should be very familiar with the behavior of strtok:

It modifies its input
The input string should be writable. That is, using string literals will likely lead to segmentation fault.
If you intend to reuse the input, you should copy it before using strtok
It has internal (global) state
Trying to use it in multiple contexts (i.e. different inputs) at the same time will produce (hard to catch!) errors.

Custom Split Function

Now that we see the limitations of strtok, let's write something that's more convenient to use and is similar to Java's split function.

Suppose we want to get back an array containing the tokens by a simple call. Of course, we also intend to avoid side-effects, i.e. we do not modify the source string.

The signature of our split function will be the following:

char **split(char *src, char *sep, int *cnt_token)

In general, the return value is just a two-dimensional array of strings. The parameter cnt_token is used to communicate to the caller how many elements the array has so that it can iterate it through.

Here's an example of the intended usage of our brand-new split function:

int i = 0;
int cnt_token = 0;
char **arr = split("Hello World", " ", &cnt_token);
for (i = 0; i < cnt_token; i++)
{
    printf("%d. %s\n", i, arr[i]);
}

Result:

0. Hello
1. World

Looks cool, doesn't it? You can see that it accepts string literal and only a single call was needed to get back the fragments of the string.

Writing Substring From Scratch

Recognizing a character sequence (which is the separator) in the input string is an essential part of our split function:

int i = 0;
int j, k;
while (TRUE)
{
    j = 0;
    k = i;
    while (k < strlen(str) && j < strlen(separator) && str[k] == separator[j] && k++ && j++)
        ;
    if (j == strlen(separator) || k == strlen(str))
    {
        last_pos = k;
        if (k == strlen(str))
        {
            break;
        }
        else
        {
            i = k;
            continue;
        }
    }
    i++;
}

Note that, of course, it could've been done by using the strstr function as well. But as I mentioned at the beginning of the article, this post also aims at improving our coding skills, so let's practice and code it on our own.

What does this code do? At the current position of the source string (denoted by variable i) we start another loop looking for a match with the delimiter.

Comparisons with the length of the strings also makes sure that we properly handle the cases when one of them ends (terminated by the \0 character).

If the condition is met, then we have a token. It starts from position last_pos and ends at i. Translating this to actual code it would look like:

char temp[1024];
strncpy(temp, src + last_pos, i - last_pos);

Storing the Tokens

There are two unknown factors in play when we want to store the results, namely the number of the tokens and their size individually.

As a quick recap, in C we have to manage the memory ourselves — we have to reserve (and free up) memory for the tokens.

Taking the above into account, here's a potential solution:

char **tokens = NULL;
if (j == strlen(separator) || k == strlen(str))
{
    tokens = (char **) realloc(tokens, sizeof(char *) * (*cnt_token + 1));
    tokens[*cnt_token] = (char *) malloc(sizeof(char) * (i - last_pos + 1));
    strncpy(tokens[*cnt_token], str + last_pos, i - last_pos);
    tokens[*cnt_token][i - last_pos] = '\0';
    (*cnt_token)++;
    last_pos = k;
    /* ... */
}

For the first sight that might look OK. But let's think about what happens if this piece of code is run when processing a large file line by line.

The realloc and malloc functions are invoked for each line and token respectively, and they might impact the performance negatively. It depends on the actual implementation of these methods, it may vary by different compilers and operating systems. So if the performance looks fine on our system, it is not guaranteed that compiling the same program with a different compiler on another system will also produce decent results.

Also, keep in mind, that these allocated resources have to be freed up as well. And this is the responsibility of the caller:

for (i = 0; i < cnt_token; i++)
{
    /* ... <> & */
    printf("%d. %s\n", i, arr[i]);
    free(arr[i]);
}
free(arr);

Does it make sense to optimize it for speed by getting rid of the costly calls? It depends. For curiosity, we'll provide another solution and carry out a quick benchmark to see the results.

Alternative Solution for Performance

To show you another option to handle the memory, let's suppose that the performance is really critical.

In this case we could use a pre-allocated memory pool and have our split function use it.

The point of this approach is that a chunk of memory will be allocated by the client. The split function should just place the tokens in the array slots. That being said, most of the resource heavy calls are eliminated.

Our function signature now includes a third parameter:

char **split(char *str, char *sep, int *cnt_token, char **custom_pool);

And the caller should act as follows:

char **custom_pool = (char **)malloc(sizeof(char *) * 255);
for (j = 0; j < 255; j++)
{
    custom_pool[j] = (char *)malloc(sizeof(char) * 255);
}
char **tokens = split(line, " ", &num_of_tokens, custom_pool);

This approach path sounds fine, but there's a disadvantage though: We don't know the number of the tokens and the length in advance.

That is, we have to use some assumption on the default value. If we know the input data that might work, but if we face unknown data then this method is not suitable. This is a trade-off between performance and usability.

Benchmarking The Solutions

We've talked about different implementations so far. It's time to see how they are performing.

I counted the tokens in a file that nearly takes 20 megabytes in size. I tested both the version that contains the realloc and malloc statements and the other one that leveerages pre-allocated memory pool.

I compiled an optimized version from them using the -O3 flag of gcc. Also, to make it more interesting I also processed the file in Java by using its out of the box split function.

Benchmarking the solutions

It surprised me that even the non-optimized version with the memory allocation statements ran only in 553 ms.

We won more than 100 ms by using the custom memory pool implementation.

The Java version was also pretty fast though. The JVM is lightning fast compared to the old days.

And then the variants complied with -O3 flag are taking the first two places in our comparison report. It'd be worth looking into assembly code the compiler generates because the first version is basically more than 4 times faster compared to its non-optimized counterpart.

Conclusion

We've seen how we can extract tokens in C by using the standard library's strtok function. We've also become familiar with the behavior of this function and learned that we have to be very cautious when using it. The best is to simply avoid it and use its refined strtok_r version.

In order to practice, we have also created our own string tokenizer with different implementations.