How to design your own programming language?

2022-09-25

Programming laguage requirement

A typical programming language supports a wide range of features and capabilities that enable developers to write, compile, and execute code to solve various computational problems. While the specific features and capabilities can vary between programming languages, here are some common elements that are typically supported in programming languages:

Data Types: Programming languages support various data types such as integers, floating-point numbers, characters, strings, booleans, arrays, and more. Some languages also allow users to define custom data types.
Variables: You can declare and use variables to store and manipulate data within a program.
Operators: Programming languages provide a set of operators for performing operations on data, such as arithmetic operators (+, -, *, /), comparison operators (==, !=, >, <), and logical operators (&&, ||, !).
Control Structures: This includes features like conditional statements (if, else, switch), loops (for, while, do-while), and branching (goto in some languages).
Functions/Methods: Most languages allow you to define and call functions or methods, which are reusable blocks of code that perform specific tasks.
Error Handling: Support for handling errors and exceptions is essential in programming languages. Try-catch blocks or similar constructs are commonly used for this purpose.
Input/Output (I/O): Programming languages typically offer mechanisms for reading input from users or external sources (e.g., stdin, files) and outputting results (e.g., stdout, files).
Data Structures: Many languages include built-in data structures like arrays, lists, sets, maps (dictionaries), and others for managing and organizing data efficiently.
Object-Oriented Programming (OOP): OOP languages provide support for classes and objects, encapsulation, inheritance, and polymorphism.
Concurrency and Multithreading: Some languages have features for concurrent and parallel programming to take advantage of multiple CPU cores.
Memory Management: Languages may offer automatic memory management through features like garbage collection, or they may require manual memory management through concepts like pointers.
Libraries and Modules: Most programming languages provide standard libraries or modules that offer pre-built functions and classes for common tasks, allowing developers to reuse code.
Standard Library: A standard library is a collection of pre-written code modules that provide various functionalities such as file I/O, string manipulation, and networking.
Debugging and Profiling Tools: Debugging features like breakpoints, step-by-step execution, and profiling tools to analyze code performance are often available.
IDEs and Development Tools: Integrated Development Environments (IDEs) and development tools tailored to the language can make coding, debugging, and testing easier.
Documentation and Community Support: A thriving developer community, official documentation, and online resources are essential for learning and using a programming language effectively.
Interoperability: Some languages support interoperability with other languages, allowing you to call functions or use libraries written in other languages.
Extensibility: Extensibility features enable you to add new functionality through plugins, extensions, or custom modules.
Platform Independence: Some languages are platform-independent or support cross-platform development, while others are more tightly tied to specific platforms.
Security Features: Security-oriented languages may include features for safer programming, such as type checking, buffer overflow prevention, and input validation.

In this blog, I introduce the program that I have designed using the C language as the underlying programming language. The supported features are

- comments
- integers and strings
- dynamically typed (like Scheme and Python)
- classes/objects
- arrays with O(1) access time
- conditionals
- recursion
- iteration
- convenient means to access command line arguments
- convenient means to print to the console
- convenient means to read integers from a file
- an adequate set of operators
- anonymous functions
- functions as first-class objects (i.e. functions can be manipulated as in Scheme - e.g. local functions)
- (graduate only) an inheritance system and detection of variables used before definition

Backus-Naur Form (BNF) for programming language design

The Backus-Naur Form (BNF) is a widely used notation for describing the syntax and structure of programming languages, as well as other formal languages. When designing a programming language using BNF, you typically follow a structured approach to define the grammar rules of your language. Here are the key steps and considerations for designing a programming language using BNF notation:

Define Terminals and Non-terminals:
- Terminals: These are the basic symbols or tokens in your language, such as keywords, operators, literals, and punctuation marks.
- Non-terminals: These are symbols that represent syntactic constructs, such as expressions, statements, functions, and program structures.
Start Symbol: Define a start symbol that represents the entry point of your language’s grammar. This is often called “program” or “start.”
Production Rules: Use production rules to describe how non-terminals can be derived from other non-terminals and terminals. Each production rule has the form:
```
non-terminal -> replacement
```
- The replacement can consist of a combination of terminals and non-terminals.
- Use ‘|’ (pipe) to indicate alternatives if multiple rules are possible for a non-terminal.
- Use epsilon (ε) to represent an empty string or an optional part of the grammar.
Grouping and Precedence: Define rules for operator precedence and associativity to ensure that expressions are parsed correctly. This often involves creating non-terminals for different levels of precedence.
Syntax Diagrams: Optionally, you can complement your BNF notation with syntax diagrams or railroad diagrams to provide a visual representation of the grammar.
Semantic Actions: While BNF primarily describes the syntax of the language, you can include comments or annotations within the production rules to describe the intended semantics of each construct.

Here’s a simple example of a BNF rule for a basic arithmetic expression in a hypothetical programming language:

<expression> ::= <term> | <expression> '+' <term> | <expression> '-' <term>
<term> ::= <factor> | <term> '*' <factor> | <term> '/' <factor>
<factor> ::= <integer> | '(' <expression> ')'
<integer> ::= [0-9]+

In this example:

<expression>, <term>, <factor>, and <integer> are non-terminals.
Operators like +, -, *, / are terminals.
Parentheses are used for grouping.
[0-9]+ represents a regular expression for an integer.

Remember that BNF is a high-level notation for defining the syntax of a language. Implementing a parser based on these grammar rules is a separate task, typically done using tools like parser generators (e.g., yacc, ANTLR) or hand-written code.

When designing a programming language, it’s crucial to carefully consider the syntax and semantics to ensure readability, maintainability, and consistency for developers using your language.

BNF grammar design for my language with operator and symbol define:

QUATE              """
ASSIGN             "="
OPREN              "("
CPPREN             ")"
OBRACE             "{"
CBRACE             "}"
OBRAKET            "[ "  
CBRAKET            "]"
SEMICOLON          ";"
COLON              ":"
COMMA              ","
EQUALS             "=="
NOTEQUALS          "!="
LESSTAN            "<"
GREATERTHAN        ">"
LESSOREQUAL        "<="
GREATEROREQUAL     ">="
PLUS               "+"
TIMES              "*"
DIVIDE             "/"
SUBTRACT           "-"
MODULUS            "%"
AND                "&&"
OR                 "||"
CARET              "^"
POUND              "#"
NOT                "!"

reserved keywords

keywords : show
         | if
         | goto
         | else
         | until
         | call
         | var
         | func
         | loop
         | when

unary

unary : VARIABLE
      | NUMBER
      | OPREN expression CPREN
      | STRING
      | VARIABLE OBRAKET optExpression CBRAKET
      | NULL

operators

operator : PLUS
         | SUBTRACT
         | TIMES
         | DIVIDE
         | CARET
         | ASSIGN
         | condition

condition

condition : LESSTAN
          | GREATERTHAN
          | GREATEROREQUA
          | LESSOREQUAL
          | NOT
          | EQUALS
          | OR
          | AND

expression

expression : unary
           | unary oprator expression
           | printExp
           | functionDefExp
           | functionCallExp
           | variableDefExp

print expression

printExp : show unary

expression list

expressionList : expression
               | expression COMMA expressionList

optional expression

optExpression : expressionList
              | *empty*

statement

statement : expression SEMICOLON
          | loopStatment
          | ifStatment

statements

statements : statement
           | statement statements

parameter list

parameterList : unary
              | unary COMMA parameterList

optional parameter list

optParameterList : parameterList
                 | *empty*

function define expression

functionDefExp : func VARIABLE OPREN optExpression CPREN block

function call expression

functionCallExp : call VARIABLE
                | call VARIABLE OPREN optExpression CPREN

variable define expression

variableDefExp : var VARIABLE
               | var VARIABLE ASSIGN unary

if statement

ifStatment : if OPREN expression CPREN block
           | elseStatement

else statement

elseStatement : else block
              | else ifStatment
              | *empty*

return statement

returnStatement : return SEMICOLON
                | return unary SEMICOLON

program

program : statements

block

block : OBRACE statements CBRACE
      | OBRACE *empty* CBRACE

Scan in souce code of the programming language

When implementing a programming language, the first step is reading in the source code of a program written in that language. Typically, the source code is stored as a file of characters. To read in a source code file, one groups the important individual characters into tokens and discards the unimportant characters. For example, consider the Python program:

print 'Hello World!'

There are two tokens in this program, print and ‘Hello World!’. The unimportant characters are the space that follows the token print and the newline that follows the token ‘Hello World’. Note that the space within the token ‘Hello World!’ is important, so the subsystem for reading in source code must be smart enough to distinguish between important and unimportant spaces, among other things. This subsystem is called lexical analysis.

- types
- lexeme
- lexer
- scanner

Lexical analysis approach

The is a program that read the file you have, it will out put a sequence of lexeme until the end of the line.

This is the .h file to define different token names and corresponding with grammar. I put partial example here:

types.h

#define OPREN "OPREN"
#define FUNCTION_DEF "func"

The approach here is to have a lexeme.c data structure to help me solve the problem.

lexeme.c

typdef structure lexeme {
  char* type;
  char* string;
  int integer;
  double real;
  struct lexeme* left; // reserved for parsing
  struct lexeme* right;// reserved for parsing
}

Another part is the lexer to determine each token

lexer.h

lexeme* lex(void);
void newLexer(char* file);
void skipWhiteSpace(); // for comments
lexeme* lexNumber();
lexeme* lexVariable();
lexeme* lexString();
lexeme* lexUnknown();

The last part is the main function withc output the lexeme name in a sequence.

scanner.c

int scanner(char* file){
  newLexer(file);
  lexeme* token = lex();
  while(strcmp(token -> type, ENDOFPOINT)!=0){
    if(strcmp(toke,NUMBER)){
      printf("%s %d\n", token->type, token->integer);
    }
    else if(strcmp(token->type, REAL)== 0){
      printf("%s %lf\n", token->type, token->real);
    }
    else {
      printf("%s %lf\n", token->type, token->string);
      token = lex();
    }
  }
  return 0;
}

int main(int argc, char* argv[]){
  scanner(arv[1]);
  return 0;
}

After the reding part and lexical part, we can forward to the parser design.

Parser design

Parser is the module that devide the words in the programming language. It can recongnize a certain pattern or a sequence and translate to the logical tree for the lather excecution.

The below features are supported in my project

- Recursive descent parsing
- Transforming grammars
- Support functions for recursive descent parsing
- Recognizing expressions
- Conditionals and iterations

parse

Lexeme *parse(FILE *inputFile)
{
    Parser *p = malloc(sizeof(Parser));
    p->fIn = inputFile;
    p->line = 1;
    p->pending = lex(p);
    p->tree = program(p);
    return p->tree;
}

parser rule

primary

Lexeme *primary(Parser *p)
{
    Lexeme *x, *y = NULL;
    if (literalPending(p))
    {
        return literal(p);
    }
    else if (check(p, BREAK))
    {
        return match(p, BREAK);
    }
    else if (check(p, OPREN))
    {
        match(p, OPREN);
        x = expr(p);
        match(p, CPREN);
        return x;
    }
    else if (lambdaPending(p))
    {
        x = lambda(p);
        if (check(p, OPREN))
        {
            match(p, OPREN);
            y = optParamList(p);
            match(p, CPREN);
            return cons(FUNCCALL, x, y);
        }
        return x;
    }
    else if (check(p, NIL))
    {
        return match(p, NIL);
    }
    else if (check(p, IDENTIFIER))
    {
        x = match(p, IDENTIFIER);
        if (check(p, OBRACKET))
        {
            match(p, OBRACKET);
            y = expr(p);
            match(p, CBRACKET);
            return cons(ARRAYACCESS, x, y);
        }
        else if (check(p, OPREN))
        {
            match(p, OPREN);
            y = optParamList(p);
            match(p, CPREN);
            return cons(FUNCCALL, x, y);
        }
        else if (check(p, DOT))
        {
            y = match(p, DOT);
            y->left = x;
            y->right = primary(p);
            return y;
        }
        return x;
    }
    else
    {

        Fatal("Primary not found.");
        exit(1);
    }
    return NULL;
}

The environment is a tree to store all the value we need for future use. The tree stucture ensure the programming language has features like functions.

In the environment file it will have five basic functions:

- create
- extend
- lookup
- insert
- update

functions

function define

extern Lexeme *createEnv();
extern Lexeme *extendEnv(Lexeme *env, Lexeme *vars, Lexeme *vals);
extern Lexeme *makeTable(Lexeme *vars, Lexeme *vals);
extern Lexeme *lookupEnv(Lexeme *var, Lexeme *env);
extern int sameVariable(Lexeme *x, Lexeme *y);
extern Lexeme *insert(Lexeme *var, Lexeme *val, Lexeme *env);
extern Lexeme *updateEnv(Lexeme *var, Lexeme *env, Lexeme *newVariable);

create environment

Lexeme *createEnv()
{
    return extendEnv(NULL, NULL, NULL);
}

extend environment

Lexeme *extendEnv(Lexeme *env, Lexeme *vars, Lexeme *vals)
{
    return cons(ENVIRONMENT, makeTable(vars, vals), env);
}

make table

Lexeme *makeTable(Lexeme *vars, Lexeme *vals)
{
    return cons(TABLE, vars, vals);
}

lookup environment

Lexeme *lookupEnv(Lexeme *var, Lexeme *env)
{
    while (env != NULL)
    {
        Lexeme *table = car(env);
        Lexeme *vars = car(table);
        Lexeme *vals = cdr(table);
        while (vars != NULL)
        {
            if (sameVariable(var, car(vars)))
            {
                return car(vals);
            }
            //walk the lists in parallel
            vars = cdr(vars);
            vals = cdr(vals);
        }
        env = cdr(env);
    }
    fprintf(stderr, "FATAL: variable, %s, is undefined.", var->stringVal);
    return NULL;
}

insert environment

Lexeme *insert(Lexeme *var, Lexeme *val, Lexeme *env)
{
    Lexeme *table = car(env);
    setCar(table, cons(GLUE, var, car(table)));
    setCdr(table, cons(GLUE, val, cdr(table)));
    return val;
}

update environment

Lexeme *updateEnv(Lexeme *var, Lexeme *val, Lexeme *env)
{
     while (env != NULL)
    {
        Lexeme *table = car(env);
        Lexeme *vars = car(table);
        Lexeme *vals = cdr(table);
        while (vars != NULL)
        {
            if (sameVariable(var, car(vars)))
            {   
                setCar(var, val);
                return val;
            }
            //walk the lists in parallel
            vars = cdr(vars);
            vals = cdr(vals);
        }
        env = cdr(env);
    }
    fprintf(stderr, "FATAL: variable, %s, is undefined.", var->stringVal);
    return NULL;
}

#LP