This project is the coprocessor part of Dingo Expression, written in C++. The project is made standalone to be easily integrated into other C++ projects.
#include "expr/runner.h"
using namespace dingodb::expr;Assume bytes is an array of type expr::Byte[size], which contains the encoded expression
Runner runner;
runner.Decode(bytes, size);
runner.Run();
Operand result = runner.Get();Operand is a class to hold any supported data types. An Operand object can be created from any supported type T by the constructor Operand<T>(T value), and the value contained in an Operand object can be retrieved by
T Operand::GetValue() const;Operand can also hold a NULL value, and can be tested by the overloaded "equals/non-equals" operator, like operand == nullptr or operand != nullptr.
Dingo Expression Coprocessor supports variables. Internally, variables are indexed by integers, start from 0. If the expression contains variables, a Tuple must be bind to the Runner before run
Tuple tuple{1, "Alice", 100.0};
runner.BindTuple(&tuple);where Tuple is a type alias of std::vector<Operand>. The variables indexed by 0, 1, 2 would be assigned the 1st, 2nd, 3rd values of the tuple, respectively.
Note:
- The
runnerdoes not take ownership of thetuple, which should be released by the caller if necessary - The number of elemnets in the
tupleis not checked when retrieving the value and it is the caller's duty to ensure the index of variables are valid
Dingo Expression Coprocessor has also limited implementation for relational algebra. To use it, add the following to the source code
#include "rel/rel_runner.h"
using namespace dingodb::expr;
using namespace dingodb::rel;Assume bytes is an array of type expr::Byte[size], which contains the encoded relational algebra
auto *rel = new RelRunner();
rel->Decode(buf, size);Then Tuples can be put into the RelRunner
for (int i = 0; i < tuples.size(); ++i) {
const auto *output = rel->Put(tuples[i]);
}The output returned is another Tuple or nullptr. The former is a valid result and the latter means no result can be retrieved immediately for the input may be fitered out or cached.
If there are datum cached in the RelRunner (which is typically when the relational algebra contains any aggregation), the cached results must be taken out after all tuples are put in
while ((const auto *output = rel->Get()) != nullptr) {
do_some_thing(output);
};Note:
- The
RelRunnertakes over the ownership of theTupleput in. The caller must not try to release it - If the
outputreturned either byPutorGetis notnullptr, it must be released by the caller - The implementation of
RelRunneris not thread-safe
The Runner class is for expression evaluating. The following figure illustrate the evaluating process, in which the expression is like 2 + T[0] * 3
The Runner contains an operand stack and an operator verctor. The operator vector is constructed by Decode method from the encoded bytes of an expression. Each operator can manipulate (push/pop) operands in the operand stack following pre-defined process. When Run method is called, operators in the vector are carried out one by one. By calling Get method, the top elemement in the stack is poped out as the returned result. Mostly, there is only one oprand left in the stack after Run for a valid expression.
Data types are represented by 4 bit, as described in the following table
| Type | Description | Decimal Code | Hex Code | Corresponding C++ Type |
|---|---|---|---|---|
INT32 |
32 bit integer | 1 |
0x1 |
int32_t |
INT64 |
64 bit integer | 2 |
0x2 |
int64_t |
BOOL |
boolean | 3 |
0x3 |
bool |
FLOAT |
single precision floating point | 4 |
0x4 |
float |
DOUBLE |
double precision floating point | 5 |
0x5 |
double |
DECIMAL |
arbitrary-precision decimal number | 6 |
0x6 |
Not supported yet |
STRING |
string | 7 |
0x7 |
std::shared_ptr<std::string> |
Note the C++ types are wrapped in Operand class.
Immediate numbers are const values embedded in the encoded expressions, to implement const, index of variables, length of arrays, etc. The encoding of immediate numbers is described in the following table
| Type of Immediate Number | Description of Encoding |
|---|---|
INT32 |
See encoding of VarInt in Protocol Buffers |
INT64 |
The same as above |
FLOAT |
4 bytes big-endian representation |
DOUBLE |
8 bytes big-endian representation |
DECIMAL |
To be defined |
STRING |
Encoding the byte length of it first, followed by all the types of it. The length is encoded as INT32 type |
If the expressions are used in relational algebra, a special byte may be needed to indicate the end of expression. Byte 0x00 is used for that purpose. But byte 0x00 is not always means the termination of expression, because there may be any bytes in the presentation of an immediate number.
The expressions Dingo Expression Coprocessor processed are represented in suffix style, i.e. the operator is put after its operands. Consts and variables are also treated as operator in this representation. All the operators are encoded into unique bytes representation. Operators may have one or more immediate numbers followed. Generally
- For 1-byte representation, the HSB is set to
0, and the lower 4 bits can be used to indicate a data type if needed - For 2-bytes representation, the HSB is set to
1, and the higher 4 bits and the lower 4 bits of the 2nd byte can be used to indicate two data types individually if needed - No byte
0x00or0xFFis allowed in the representation except for immediate numbers
The encoding of operators is illustrated in the following figure
The encoding of one-byte operators is as follows (The template-like T is any of the Data Types)
| Operator | Higher 4 Bits | Lower 4 Bits | Immediate Number | Description |
|---|---|---|---|---|
NULL<T> |
0x0 |
Encode type T |
None | NULL value |
CONST<T> |
0x1 |
Encode type T |
T type value |
T type const, T != BOOL |
CONST<BOOL> |
0x1 |
Encode type BOOL |
None | BOOL value true |
CONST_N<T> |
0x2 |
Encode type T |
T type value |
T type const, T == INT32 or T == INT64, the real value is the inverse of the encoded immediate number |
CONST_N<BOOL> |
0x2 |
Encode type BOOL |
None | BOOL value false |
VAR<T> |
0x3 |
Encode type T |
INT32 type value |
T type variable indexed by an integer |
VAR_S<T> |
0x4 |
Encode type T |
STRING type value |
T type variable indexed by a string. Not implemented yet |
NOT |
0x5 |
0x1 |
None | Unary NOT |
AND |
0x5 |
0x2 |
None | Binary AND |
OR |
0x5 |
0x3 |
None | Binary OR |
AND_FUN |
0x5 |
0x4 |
INT32 type value |
Variadic AND, the immediate number is the number of parameters. Not implemented yet |
OR_FUN |
0x5 |
0x5 |
INT32 type value |
Variadic OR, the immediate number is the number of parameters. Not implemented yet |
| Operator | Higher 4 Bits | Lower 4 Bits | Immediate number 0 | Imediate number 1..N | Description |
|---|---|---|---|---|---|
ARRAY<T> |
0x6 |
Encode type T |
INT32 type number N, the number of elements |
encode N values of T type continously |
Array of T type consts. Not implement yet |
ARRAY<AGG> |
None | None | INT32 type number N, the number of elements |
encode N aggregation functions continously |
Array of aggregations, used in relational algebra |
ARRAY<CONST> |
None | None | INT32 type number N, the number of elements |
encode N CONST/CONST_N expressions continously |
Tuple of consts. Not implemented yet |
| Operator | 1st Byte | Higher 4 Bits of 2nd Byte | Lower 4 Bits of 2nd Byte | Immediate Number 0 | Immediate Number 1..2N | Description |
|---|---|---|---|---|---|---|
MAP<K, V> |
0x80 |
Encode type K |
Encode type V |
INT32 type number N, tye number of key-value pairs |
encode K and V type consts alternately |
Map of consts, K and V is the type of key and value. Not implemented yet |
The encoding of two-bytes operators is as follows (The template-like T is any of the Data Types)
| Operator | 1st Byte | Higher 4 Bits of 2nd Byte | Lower 4 Bits of 2nd Byte | Immediate Number | Description |
|---|---|---|---|---|---|
POS<T> |
0x81 |
0x0 |
Encode type T |
None | Unary + |
SUM<T> |
0x81 |
0x1 |
Encode type T |
INT32 type value, the number of parameters |
Varidiac SUM. Not implemented yet |
NEG<T> |
0x82 |
0x0 |
Encode type T |
None | Unary - |
ADD<T> |
0x83 |
0x0 |
Encode type T |
None | Binary + |
SUB<T> |
0x84 |
0x0 |
Encode type T |
None | Binary - |
MUL<T> |
0x85 |
0x0 |
Encode type T |
None | Binary * |
DIV<T> |
0x86 |
0x0 |
Encode type T |
None | Binary / |
MOD<T> |
0x87 |
0x0 |
Encode type T |
None | Binary % |
EQ<T> |
0x91 |
0x0 |
Encode type T |
None | Binary == |
GE<T> |
0x92 |
0x0 |
Encode type T |
None | Binary >= |
GT<T> |
0x93 |
0x0 |
Encode type T |
None | Binary > |
LE<T> |
0x94 |
0x0 |
Encode type T |
None | Binary <= |
LT<T> |
0x95 |
0x0 |
Encode type T |
None | Binary < |
NE<T> |
0x96 |
0x0 |
Encode type T |
None | Binary <> |
IS_NULL<T> |
0xA1 |
0x0 |
Encode type T |
None | Unary IS_NULL function |
IS_TRUE<T> |
0xA2 |
0x0 |
Encode type T |
None | Unary IS_TRUE function |
IS_FALSE<T> |
0xA3 |
0x0 |
Encode type T |
None | Unary IS_FALSE function |
MIN<T> |
0xB1 |
0x0 |
Encode type T |
None | Binary MIN function |
VARG_MIN<T> |
0xB1 |
0x1 |
Encode type T |
INT32 type value, the number of parameters |
Variadic MIN function |
MAX<T> |
0xB2 |
0x0 |
Encode type T |
None | Binary MAX function |
VARG_MAX<T> |
0xB2 |
0x1 |
Encode type T |
INT32 type value, the number of parameters |
Variadic MAX function |
ABS<T> |
0xB3 |
0x0 |
Encode type T |
None | Unary ABS function |
CAST<D, T> |
0xF0 |
Encode type D |
Encode type T |
None | Casting from T type to D type |
The coding of functions is as follows
| Operator | 1st Byte | 2nd Byte | Immediate Number | Description |
|---|---|---|---|---|
FUN |
0xF1 |
1-byte positive integer, the sequence number of the function | None | Pre-defined functions |
VARG_FUN |
0xF2 |
1-byte positive integer, the sequence number of the function | INT32 type value, the number of parameters |
Pre-defined varidiac functions. Not implemented yet |
| Function | Type of Parameters | Type of Return Value | Sequence Number | Description |
|---|---|---|---|---|
CEIL |
DOUBLE |
DOUBLE |
0x01 |
Round up to the smallest integer |
FLOOR |
DOUBLE |
DOUBLE |
0x02 |
Round down to the largest integer |
ROUND |
INT64, INT32 |
INT64 |
0x03 |
Round. Not implemented yet |
ROUND |
DOUBLE, INT32 |
DOUBLE |
0x04 |
Round. Not implemented yet |
POW |
DOUBLE |
DOUBLE |
0x05 |
Power. Not implemented yet |
POW |
INT64 |
INT64 |
0x06 |
Power. Not implemented yet |
SIN |
DOUBLE |
DOUBLE |
0x07 |
Sine |
COS |
DOUBLE |
DOUBLE |
0x08 |
Cosine |
TAN |
DOUBLE |
DOUBLE |
0x09 |
Tangent |
ASIN |
DOUBLE |
DOUBLE |
0x0A |
Arc sine |
ACOS |
DOUBLE |
DOUBLE |
0x0B |
Arc cosine |
ATAN |
DOUBLE |
DOUBLE |
0x0C |
Arc tangent |
SINH |
DOUBLE |
DOUBLE |
0x0D |
Hyperbolic sine |
COSH |
DOUBLE |
DOUBLE |
0x0E |
Hyperbolic cosine |
TANH |
DOUBLE |
DOUBLE |
0x0F |
Hyperbolic tangent |
EXP |
DOUBLE |
DOUBLE |
0x10 |
Exponential |
LOG |
DOUBLE |
DOUBLE |
0x11 |
Logarithm |
| Function | Type of Parameters | Type of Return Value | Sequence Number | Description |
|---|---|---|---|---|
CHAR_LENGTH |
STRING |
INT32 |
0x20 |
Return the length of a string. Not implemented yet |
CONCAT |
STRING, STRING |
STRING |
0x21 |
Concatenate two strings |
LOWER |
STRING |
STRING |
0x22 |
Converts to lowercase |
UPPER |
STRING |
STRING |
0x23 |
Converts to uppercase |
LEFT |
STRING, INT32 |
STRING |
0x24 |
Sub string from left |
RIGHT |
STRING, INT32 |
STRING |
0x25 |
Sub string from right |
TRIM |
STRING |
STRING |
0x26 |
Trim whitespaces |
TRIM |
STRING, STRING |
STRING |
0x27 |
Trim specified string. Not implemented yet |
LTRIM |
STRING |
STRING |
0x28 |
Trim whitespaces from left |
LTRIM |
STRING, STRING |
STRING |
0x29 |
Trim specified string from left. Not implemented yet |
RTRIM |
STRING |
STRING |
0x2A |
Trim whitespaces from right |
RTRIM |
STRING, STRING |
STRING |
0x2B |
Trime specified string from right. Not implemented yet |
SUBSTR |
STRING, INT32, INT32 |
STRING |
0x2C |
Sub string from a "position" to another "position". The 1st "position" is 0 |
SUBSTR |
STRING, INT32 |
STRING |
0x2D |
Sub string from a "position" to the end. The 1st "position" is 0 |
MID |
STRING, INT32, INT32 |
STRING |
0x2E |
Sub string from a "position" to the specified length. The 1st "position" is 1 |
MID |
STRING, INT32 |
STRING |
0x2F |
Sub string from a "position" to the end. The 1st "position" is 1 |
REPEAT |
STRING, INT32 |
STRING |
0x30 |
Repeat the string. Not implemented yet |
REVERSE |
STRING |
STRING |
0x31 |
Reverse the string. Not implemented yet |
REPLACE |
STRING, STRING, STRING |
STRING |
0x32 |
Search and replace. Not implemented yet |
LOCATE |
STRING, STRING, INT32 |
INT32 |
0x33 |
Find the "position" of a string in another string. The 1st "position" is 1. Not implemented yet |
LOCATE |
STRING, STRING |
INT32 |
0x34 |
Find the "position" of a string in another string. The 1st "position" is 1. Not implemented yet |
FORMAT |
DOUBLE, INT32 |
STRING |
0x35 |
Formatted output of a number. Not implemented yet |
Aggregation functions are used only in relational algebra. Complicated aggregations such as AVG is not supported here. Actually, AVG can be converted to SUM and COUNT with a projection before encoded.
Encoding of aggregations does not follow the way to encode normal function and has no leading 0xF1 byte, but is more like that of one-byte operators, as in the following table
| Aggregation | Type of Parameter | Type of Return Value | Higher 4 Bits | Low 4 Bits | Immediate Number | Description |
|---|---|---|---|---|---|---|
COUNT_ALL |
None | INT64 |
0x1 |
0x0 |
None | Count all rows |
COUNT<T> |
T |
INT64 |
0x1 |
Encode type T |
INT32 type value, the column index |
Count only non-null values |
SUM<T> |
T |
T |
0x2 |
Encode type T |
INT32 type value, the column index |
Sum the values |
MAX<T> |
T |
T |
0x3 |
Encode type T |
INT32 type value, the column index |
Maximum of the values |
MIN<T> |
T |
T |
0x4 |
Encode type T |
INT32 type values, the column index |
Minimum of the values |
Note:
- The column indices are all started from
0 - Some aggregation functions are requried to return
0if there are no input rows, such asCOUNT. For simplicity,NULLis returned for all aggregation functions here and it is in the final reducing stage to convertNULLto0
Only "simple" (the operators have only one single input and one output) relational algebra expressions are supported. These operators are chained with one's output connected to another's input to form a whole relational algebra expression. They are encoded one by one from the input end to the output end.
The encoding of each algebra operators contains an unique "leading byte", a "payload" for the parameters and expressions in the operator and a "trailing byte" which is always EOE (End Of Expression). The details are in the following table
| Operator | Leading Byte | Payload | Trailing Byte |
|---|---|---|---|
| Filter | 0x71 |
Encode the filter expression | EOE |
| Project | 0x72 |
Encode the project expression one by one | EOE |
| Grouped Aggregation | 0x73 |
Encode group indices as ARRAY<INT32> type, then the list of aggregation functions as ARRAY<AGG> type |
EOE |
| Ungrouped Aggregation | 0x74 |
Encode the list of aggregation functions as ARRAY<AGG> type |
EOE |
A "Project" operator may contains several expressions but they can be concatenated into one "huge" expression without any separator simplify the evaluating process. The "huge" expression is decoded by one Runner, and after evaluating there will be several results left in the operand stack just as needed. These results can be taken out by multiple calls to Get method.