mdhender

super secret hq

A False Forth Scanner

Two languages that I keep returning to, like a moth to the flame, are LISP and Forth. I have the SwiftForth compiler on my system, along with Clozure CL. I like both, but neither of them are embeddable in a C program, so the next set of posts will be on writing a very simple Forth interpreter that I’ll call “False Forth” because it is not a “true” Forth.

The grammar for False Forth is very simple:

gramar = ( WHITESPACE | word )* .
word   = ( QUOTEDTEXT | UNQUOTEDTEXT ) .

Our scanner will also be simple. It will whitespace and return words only.

Whitespace includes blanks, tabs, linefeeds, and most non-printing characters between words.

Quoted text is anything between two quotation marks, using either single quote or double quote marks as the starting and ending delimiters. Inside a quoted text, two consecutive delimiters are considered an escape and are treated as a single occurrence of the delimiter within the text. All other characters are kept as-is. This includes both printing and non-printing characters.

Unquoted text is everything else. There is no escaping within unquoted text.

Here’s an example showing both words and spaces, along with the equivalent using C-style strings.

I wrote that
"Fred yelled, ""Are you ready?""
while the band marched on."
Dave's pet" "trick."

If you understand C-style strings, that would be equivalent to:

const char *quoted, *unquoted, *space;
unquoted = "I";
   space = " ";
unquoted = "wrote";
   space = " ";
unquoted = "that";
   space = "\n";
  quoted = "Fred yelled, \"Are you ready?\"\nwhile the band marched on.";
   space = "\n";
unquoted = "Dave's";
   space = " ";
unquoted = "pet\"";
   space = " ";
  quoted = "trick";
   space = "\n";

Specifications For A Really Simple Scanner

The simplest scanner I can think of does the following:

  1. Accept a string as input.
  2. If we are at end of input, return a NULL pointer.
  3. Find the end of the word.
  4. Return a pointer to a structure that points to the start and end of the word.
typedef enum { isSpace, isQuotedText, isUnquotedText } WORDKIND;

struct WORD {
  WORDKIND    kind;
  const char *first;
  const char *last;
};
typedef struct WORD WORD;

// scans input for words and whitespace. whitespace is anything
// that the C "isspace" function accepts. words are everything
// else. there are two types of words, quoted text and unquoted
// text. this version of the scanner does not deal with escaped
// quotes at this time. it also ignores errors such as
// unterminated quoted text.
//
WORD *Scan(const char *input) {
  static WORD word;

  // if nothing to parse, return a NULL pointer
  // to signal end of input.
  //
  if (!*input) {
      return (WORD *)(NULL);
  }

  // word.first points to the first character in the text
  //
  word.first = input++;

  // test to see what is waiting for us in the buffer
  //
  if (isspace(*(word.first)) {
      word.kind = isSpace;
  } else if (*(word.first) == '"' || *(word.first) == '\'') {
      word.kind = isQuotedText;
  } else {
      word.kind = isUnquotedText;
  }

  // now that we know the type of text to scan, let's
  // scan it!
  //
  if (word.kind == isSpace) {
      // scan whitespace. ends at the first non-space
      // character or end of string.
      //
      while (isspace(*input)) {
          input++;
      }
      // last points at the last space character
      //
      word.last = input - 1;
  } else if (word.kind == isUnquotedText) {
      // scan unquoted text. ends at the first space
      // character or end of string.
      //
      while (!isspace(*input)) {
          input++;
      }
      // last points at the last character
      // in the unquoted text
      //
      word.last = input - 1;
  } else {
      // scan quoted text. ends at the quote delimiter
      // or end of string.
      //
      char quote = *(word.first);

      while (*input && *input != quote) {
          input++;
      }
      if (*input) {
          // last points at the last character
          // in the quoted text, which should be
          // the closing quote mark.
          //
          word.last = input++;
      } else {
          // whoops. unterminated quoted text
          //
          word.last = input - 1;
      }
  }

  return &word;
}

A Driver For Our Really Simple scanner

The driver will just call the scanner until the input is exhausted.

bool Driver(const char *inputBuffer) {
  WORD *w;
  while ((w = Scan(inputBuffer)) {
      inputBuffer = w->last + 1;
  }
  return true;
}

Next post, we’ll add to the scanner so that it handles errors and escapes in quoted text.

Comments