Tokenizers 20-Jun-21

Tokens

  • A tokenizer is a program that extracts tokens from an input stream
  • A token has two parts:
      • Its value: this is just the characters making up the token
      • Its kind, or type
  • For example, if we tokenize "while (x >= 0)" we might get these tokens:
      • "while", keyword
      • "(", punctuation
      • "x", name
      • ">=", operator
      • "0", integer
      • ")", punctuation
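To make the value/kind pairing concrete, here is a minimal sketch that produces the token list above for "while (x >= 0)". It uses a regular expression with named groups rather than the state machine developed later in these slides, and the class name, pattern, and set of kinds are assumptions chosen only for this one example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPairsDemo {
    // One named group per kind. Order matters: "keyword" is tried before "name".
    // Hypothetical pattern, just large enough for the example input.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<keyword>\\bwhile\\b)|(?<name>[A-Za-z]\\w*)|(?<integer>\\d+)"
        + "|(?<operator>>=|<=|==|[<>=+\\-*/%])|(?<punctuation>[()])");

    private static final String[] KINDS =
        { "keyword", "name", "integer", "operator", "punctuation" };

    /** Returns each token as the string "value", kind. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {              // find() silently skips unmatched characters
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    tokens.add("\"" + m.group(kind) + "\", " + kind);
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("while (x >= 0)")) {
            System.out.println(token);
        }
    }
}
```

Note that this sketch skips whitespace (and anything else it cannot match) without reporting it; a real tokenizer would normally flag unrecognized characters as errors.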

Tokenizers as state machines

  • Tokenizers can be implemented as state machines, but with these important differences:
      • To succeed (recognize a token), the tokenizer does not have to reach the end of input; it only has to reach a final state
      • When the tokenizer returns a token, the remainder of the input string is kept for use in getting the remaining tokens
  • Tokenizers are almost always implemented as state machines
  • We’ll do a quick tokenizer to recognize tokens in arithmetic expressions:
      • Integers (digits only)
      • Variables (letters and digits, starting with a letter)
      • Operators, + - * / %
      • Parentheses, ( )
      • Errors (anything not in the above list)
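The two differences listed above can be seen in a very small sketch: recognize one integer at the front of a string, stop as soon as the digits run out, and hand back the unread remainder. The class and method names here are hypothetical and are not part of the Tokenizer built in the following slides.

```java
class PrefixScanDemo {
    /**
     * Recognizes an integer (digits only) at the start of the input.
     * Returns {token, remainder}, or null if the input does not start with a digit.
     */
    static String[] scanInteger(String input) {
        int i = 0;
        // Stay in the "in number" state while digits keep coming.
        while (i < input.length() && Character.isDigit(input.charAt(i))) {
            i++;
        }
        if (i == 0) {
            return null; // never reached a final state
        }
        // Success without reaching the end of input; the rest is kept for later.
        return new String[] { input.substring(0, i), input.substring(i) };
    }

    public static void main(String[] args) {
        String[] result = scanInteger("123+45");
        System.out.println("token = " + result[0] + ", remainder = " + result[1]);
    }
}
```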

Tokenizers as DFAs

  • A tokenizer is a kind of DFA, but…

    [State diagram: from READY, a digit leads to INTEGER, which loops on digit; a letter leads to VARIABLE, which loops on letter or digit; any of +, -, *, /, (, ) leads to OPERATOR]

  • …if there is no valid transition:
      • If in a “final” state, return with a token; the next call starts in the READY state with the next input character
      • If not in a final state, that’s a syntax error
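One way to read the diagram is as a transition function plus the "no valid transition" rule. The sketch below follows the diagram as drawn (so parentheses count as OPERATOR and % is not included); the names step, isFinal, and firstToken are assumptions made for illustration, and whitespace is deliberately not handled.

```java
class DfaSketch {
    enum State { READY, INTEGER, VARIABLE, OPERATOR }

    /** Returns the next state, or null if there is no valid transition. */
    static State step(State state, char ch) {
        switch (state) {
            case READY:
                if (Character.isDigit(ch)) return State.INTEGER;
                if (Character.isLetter(ch)) return State.VARIABLE;
                if ("+-*/()".indexOf(ch) >= 0) return State.OPERATOR;
                return null;
            case INTEGER:
                return Character.isDigit(ch) ? State.INTEGER : null;
            case VARIABLE:
                return Character.isLetterOrDigit(ch) ? State.VARIABLE : null;
            default:
                return null; // OPERATOR has no outgoing transitions
        }
    }

    /** Every state except READY is a final state in this diagram. */
    static boolean isFinal(State state) {
        return state != State.READY;
    }

    /** Scans one token beginning at index start; returns "STATE:text". */
    static String firstToken(String input, int start) {
        State state = State.READY;
        int pos = start;
        while (pos < input.length()) {
            State next = step(state, input.charAt(pos));
            if (next == null) break;   // no valid transition
            state = next;
            pos++;
        }
        if (!isFinal(state)) {
            throw new IllegalArgumentException("Syntax error at index " + start);
        }
        return state + ":" + input.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("x1+2", 0)); // VARIABLE:x1
        System.out.println(firstToken("x1+2", 2)); // OPERATOR:+
    }
}
```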

Token.Type

  • Type is written here as an enum nested inside the Token class, which is why other code refers to it as Token.Type

    public enum Type { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

Token

    public class Token {
        private Token.Type type;
        private String value;

        public Token(Token.Type type, String value) {
            this.type = type;
            this.value = value;
        }

        public Token.Type getType() { return type; }

        public String getValue() { return value; }
    }

Additions to the Token class

  • For my JUnit testing, I needed to ask whether my Tokenizer was returning the correct Tokens

    @Override
    public boolean equals(Object object) {
        // guard the cast; this also rejects null
        if (!(object instanceof Token)) return false;
        Token that = (Token) object;
        return this.type == that.type && this.value.equals(that.value);
    }

  • When tests fail, you need to see what Tokens you are getting

    @Override
    public String toString() {
        return value + ": " + type;
    }
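A small sketch of how these two methods pay off when comparing tokens. It uses a cut-down Token nested in a demo class so that it runs on its own, and plain printing instead of JUnit. The hashCode() method is an addition that is not on the slide: Java expects objects that are equal to have equal hash codes, so it is usually overridden together with equals().

```java
import java.util.Objects;

class TokenEqualityDemo {
    enum Type { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

    static final class Token {
        private final Type type;
        private final String value;

        Token(Type type, String value) {
            this.type = type;
            this.value = value;
        }

        @Override
        public boolean equals(Object object) {
            if (!(object instanceof Token)) return false;
            Token that = (Token) object;
            return type == that.type && value.equals(that.value);
        }

        // Assumption: not on the slide. Keeps equal tokens in the same hash bucket.
        @Override
        public int hashCode() {
            return Objects.hash(type, value);
        }

        @Override
        public String toString() {
            return value + ": " + type;
        }
    }

    public static void main(String[] args) {
        Token expected = new Token(Type.INTEGER, "42");
        Token actual = new Token(Type.INTEGER, "42");
        System.out.println(expected.equals(actual));                        // true
        System.out.println(expected.equals(new Token(Type.VARIABLE, "x"))); // false
        System.out.println(actual);                                         // 42: INTEGER
    }
}
```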

The constructor and hasNext()

    public class Tokenizer {
        private String input;
        private int position; // index of the last character consumed

        public Tokenizer(String input) {
            // add space to simplify getting last token
            this.input = input.trim() + " ";
            position = -1;
        }

        public boolean hasNext() {
            return position < input.length() - 2;
        }

        public Token next() { ... }
    }

The shell of next()

    public class Tokenizer {
        private enum States { READY, IN_NUMBER, IN_VARIABLE, ERROR };

        public Token next() {
            States state;
            String value = "";
            if (!hasNext()) {
                throw new IllegalStateException("No more tokens!");
            }
            state = States.READY;
            while ((++position) < input.length()) {
                char ch = input.charAt(position);
                switch (state) {
                    case READY: { ... }
                    case IN_VARIABLE: { ... }
                    case IN_NUMBER: { ... }
                    default: { ... }
                        return new Token(Token.Type.ERROR, value);
                }
            }
            assert false; // should never get here
            return null;
        }
    }

The READY state

    case READY:
        value = ch + "";
        if (Character.isWhitespace(ch)) break;
        if ("()".contains(ch + "")) {
            return new Token(Token.Type.PARENTHESIS, value);
        }
        if ("+-*/%".contains(ch + "")) {
            return new Token(Token.Type.OPERATOR, value);
        }
        if (Character.isLetter(ch)) {
            state = States.IN_VARIABLE;
            break;
        }
        if (Character.isDigit(ch)) {
            state = States.IN_NUMBER;
            break;
        }
        return new Token(Token.Type.ERROR, value);

The IN_NUMBER state

    case IN_NUMBER:
        if (Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(Token.Type.INTEGER, value);
        }

The IN_VARIABLE state

    case IN_VARIABLE:
        if (Character.isLetter(ch) || Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(Token.Type.VARIABLE, value);
        }
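The position-- step is a one-character pushback: the character that ended the token has already been read, so it is handed back for the next call. When reading from a stream instead of a String, the standard class java.io.PushbackReader offers the same idea through unread(). The sketch below is only an illustration of that parallel; the class and helper names are hypothetical and this is not how the Tokenizer in these slides is written.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class PushbackDemo {
    /** Reads a run of digits, then pushes back the character that ended it. */
    static String readInteger(PushbackReader in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && Character.isDigit(c)) {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c); // save char for next time
        }
        return digits.toString();
    }

    /** Returns {integer read, next character still available in the stream}. */
    static String[] demo(String input) {
        try (PushbackReader in = new PushbackReader(new StringReader(input))) {
            String number = readInteger(in);
            int next = in.read();
            return new String[] { number, next == -1 ? "" : String.valueOf((char) next) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] result = demo("42+7");
        System.out.println(result[0] + " then " + result[1]); // 42 then +
    }
}
```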

The default case

    default:
        return new Token(Token.Type.ERROR, value);
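Putting the pieces from the preceding slides together gives something like the following. This is an assembled sketch, not the original source file: the classes are nested inside a hypothetical TokenizerDemo wrapper so the example runs from one file, indexOf replaces contains, isLetterOrDigit replaces the two separate checks, the unused ERROR state is dropped, and the unreachable end of next() throws instead of returning null.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizerDemo {
    enum Type { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

    static final class Token {
        final Type type;
        final String value;
        Token(Type type, String value) { this.type = type; this.value = value; }
        @Override public String toString() { return value + ": " + type; }
    }

    static final class Tokenizer {
        private enum States { READY, IN_NUMBER, IN_VARIABLE }
        private final String input;
        private int position = -1; // index of the last character consumed

        Tokenizer(String input) {
            this.input = input.trim() + " "; // trailing space ends the last token
        }

        boolean hasNext() { return position < input.length() - 2; }

        Token next() {
            if (!hasNext()) throw new IllegalStateException("No more tokens!");
            States state = States.READY;
            String value = "";
            while (++position < input.length()) {
                char ch = input.charAt(position);
                switch (state) {
                    case READY:
                        value = ch + "";
                        if (Character.isWhitespace(ch)) break;
                        if ("()".indexOf(ch) >= 0) return new Token(Type.PARENTHESIS, value);
                        if ("+-*/%".indexOf(ch) >= 0) return new Token(Type.OPERATOR, value);
                        if (Character.isLetter(ch)) { state = States.IN_VARIABLE; break; }
                        if (Character.isDigit(ch)) { state = States.IN_NUMBER; break; }
                        return new Token(Type.ERROR, value);
                    case IN_NUMBER:
                        if (Character.isDigit(ch)) { value += ch; break; }
                        position--; // save char for next time
                        return new Token(Type.INTEGER, value);
                    case IN_VARIABLE:
                        if (Character.isLetterOrDigit(ch)) { value += ch; break; }
                        position--; // save char for next time
                        return new Token(Type.VARIABLE, value);
                    default:
                        return new Token(Type.ERROR, value);
                }
            }
            throw new AssertionError("should never get here");
        }
    }

    /** Convenience helper (hypothetical): all tokens rendered with toString(). */
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Tokenizer tokenizer = new Tokenizer(input);
        while (tokenizer.hasNext()) out.add(tokenizer.next().toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("3 * (x1 + 42)"));
    }
}
```

For the input "3 * (x1 + 42)" this yields seven tokens: 3, *, (, x1, +, 42, and ), with types INTEGER, OPERATOR, PARENTHESIS, VARIABLE, OPERATOR, INTEGER, and PARENTHESIS respectively.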

java.util.StringTokenizer

  • StringTokenizer is a trivial tokenizer provided by Sun
      • Everything is either a “token” or a “delimiter”
  • The most important methods are hasMoreTokens() and nextToken()
  • There are three constructors:
      • StringTokenizer(String str)
          • Delimiters are whitespace characters; any sequence of non-whitespace characters is returned as a token
      • StringTokenizer(String str, String delim)
          • Same as above, except you get to specify which characters are delimiters
      • StringTokenizer(String str, String delim, boolean returnDelims)
          • Same as above, except you get to say you also want the delimiters returned as tokens
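A short sketch of the three constructors in use. The drain() helper and the class name are hypothetical conveniences; StringTokenizer, hasMoreTokens(), and nextToken() are the real java.util API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    /** Collects every remaining token into a list. */
    static List<String> drain(StringTokenizer tokenizer) {
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace delimiters (the default)
        System.out.println(drain(new StringTokenizer("x1 + 42")));          // [x1, +, 42]
        // "+" as the only delimiter; delimiters are dropped
        System.out.println(drain(new StringTokenizer("x1+42", "+")));       // [x1, 42]
        // "+" as the delimiter, and delimiters are returned as tokens too
        System.out.println(drain(new StringTokenizer("x1+42", "+", true))); // [x1, +, 42]
    }
}
```

Note that StringTokenizer attaches no kind to what it returns; every token is just a String.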

java.io.StreamTokenizer

  • StreamTokenizer is a much more powerful (and much more complex) tokenizer
      • It is basically capable of tokenizing C and Java programs, including integers, doubles, and comments
      • There are a large number of possible settings, so that the tokenizer can be customized
  • The constructor is StreamTokenizer(Reader r), where Reader is an abstract class for reading character streams
  • The most important method is int nextToken(), where the returned int tells you what kind of token it found
      • Once you know what kind of token has been found, you access fields of the tokenizer to get its value
  • I’m not going to cover StreamTokenizer in my lectures
      • All the details are in the Java API
      • It’s really ugly
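For readers who want a first taste anyway, here is a minimal sketch of the nextToken() loop using the default settings. The class name and the "word"/"number"/"char" labels are made up for this example; TT_EOF, TT_NUMBER, TT_WORD, ttype, nval, and sval are the real StreamTokenizer members. With the defaults, numbers always come back as doubles in nval, and characters such as / and - have special meanings, so the sample input avoids them.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    /** Returns one labeled string per token found in the input. */
    static List<String> tokenize(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        List<String> tokens = new ArrayList<>();
        try {
            // The returned int (also stored in ttype) says what kind of token was found.
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add("number " + tokenizer.nval); // value is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        tokens.add("word " + tokenizer.sval);
                        break;
                    default:
                        // An ordinary character is returned as its own token type.
                        tokens.add("char " + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x1 + 42")); // [word x1, char +, number 42.0]
    }
}
```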

The End
