Commit 66e2eee

topn extension implementation
0 parents  commit 66e2eee

27 files changed

+2722
-0
lines changed

.gitattributes

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
*.[ch] citus-style
/data/customer_reviews_1998.csv filter=lfs diff=lfs merge=lfs -text

.gitignore

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Object files
*.o
*.ko
*.obj
*.elf

# Precompiled Headers
*.gch
*.pch

# Libraries
*.lib
*.a
*.la
*.lo

# Shared objects (inc. Windows DLLs)
*.dll
*.so
*.so.*
*.dylib

# Executables
*.exe
*.app
*.i*86
*.x86_64
*.hex

# Debug files
*.dSYM/

#Temporary files
*~
copy_data.out

.travis.yml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
sudo: required
git:
  lfs_skip_smudge: true
dist: trusty
language: c
branches:
  except: [ /^open-.*$/ ]
matrix:
  fast_finish: true
  include:
    - env: PGVERSION=9.5
    - env: PGVERSION=9.6
    - env: PGVERSION=10
    - env: PGVERSION=11
  allow_failures:
    - env: PGVERSION=11
before_install:
  - bash test_data_provider
  - git clone -b v0.7.1 --depth 1 https://github.com/citusdata/tools.git
  - sudo make -C tools install
  - setup_apt
  - nuke_pg
install:
  - install_uncrustify
  - install_pg
before_script:
  - citus_indent --quiet --check
  - config_and_start_cluster
script: pg_travis_test

Makefile

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
#contrib/topn/Makefile

MODULES = topn
EXTENSION = topn
DATA = topn--2.0.0.sql
##README??

REGRESS = add_agg union_agg char_tests null_tests add_union_tests copy_data customer_reviews_query

EXTRA_CLEAN += -r $(RPM_BUILD_ROOT)

ifdef DEBUG
COPT += -O0
CXXFLAGS += -g -O0
endif

ifndef PG_CONFIG
PG_CONFIG = pg_config
endif

PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

test_data:
	./test_data_provider
check: test_data
	make installcheck

README.md

Lines changed: 147 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,147 @@
# topN
`topN` is a PostgreSQL extension that uses a counter-based algorithm and implements the functions needed for top-n approximation. It uses the Postgres `JSONB` type to aggregate the data and provides the following functionality:

#### 1. Top-n Query
This query finds the most frequent items in a column of data.

#### 2. Union
Union merges two or more topN `JSONB` counter lists to produce cumulative results for a top-n query.

For the top-n approximation, the algorithm keeps a predefined number of counters for frequent items. If a new item already exists among the counters, its frequency is incremented. Otherwise, the algorithm inserts a new counter for the item if there is room for one more; if there is not, the list is pruned by finding the median and removing the bottom half. Accuracy can be increased by storing a greater number of counters, at the cost of more space and slower aggregation.
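
The counter strategy described above can be modeled in a few lines of Python. This is a minimal sketch for illustration, not the extension's C implementation: the names `add_item` and `prune`, the plain dict standing in for the `JSONB` summary, the choice to add the new counter before pruning, and the tie-breaking between equal frequencies are all assumptions.

```python
from typing import Dict


def prune(counters: Dict[str, int]) -> Dict[str, int]:
    """Keep the more frequent half of the counters.

    Assumption: ties are broken in favor of the counter inserted first
    (sorted() is stable and dicts preserve insertion order).
    """
    ranked = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[: len(ranked) // 2])


def add_item(counters: Dict[str, int], item: str,
             number_of_counters: int = 1000) -> Dict[str, int]:
    """Return a new counter dict with `item` counted once more."""
    updated = dict(counters)
    updated[item] = updated.get(item, 0) + 1
    # Once the limit is exceeded, drop the less frequent half.
    if len(updated) > number_of_counters:
        updated = prune(updated)
    return updated


counters: Dict[str, int] = {}
for product in ["a", "a", "b", "c"]:
    counters = add_item(counters, product, number_of_counters=2)
print(counters)  # {'a': 2}
```

With a limit of two counters, the fourth insert pushes the list to three entries, so the sketch prunes down to the single most frequent item. This is the source of the approximation error: counts for pruned items are lost.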

# Usage
We provide user-defined Postgres aggregates and functions:

### Data Type
###### `JSONB`
A PostgreSQL type used to keep the frequent items and their frequencies.

### Aggregates
###### `topn_add_agg(textColumnName)`
This is the aggregate add function. It creates an empty `JSONB` and inserts the series of items from the given column to create an aggregate summary of these items. Note that the values must be of type `TEXT` or cast to `TEXT`.

###### `topn_union_agg(topnTypeColumn)`
This is the aggregate for the union operation. It merges the `JSONB` counter lists and returns the final `JSONB`, which stores the overall result.

### Functions
###### `topn(jsonb, n)`
Returns the n most frequent elements and their frequencies from the given `JSONB` as a set of rows.
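
As a rough model of what `topn(jsonb, n)` returns, the Python sketch below ranks the counters of a summary and keeps the first n. It assumes the summary is a flat JSON object mapping each item to its frequency, which this README does not spell out; the function name `top_n` and the sample data are hypothetical.

```python
import json
from typing import List, Tuple


def top_n(summary_json: str, n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent (item, frequency) pairs, most frequent first."""
    counters = json.loads(summary_json)  # assumed shape: {"item": frequency, ...}
    ranked = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


rows = top_n('{"a": 3, "b": 5, "c": 1}', 2)
print(rows)  # [('b', 5), ('a', 3)]
```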

###### `topn_add(jsonb, text)`
Adds the given text value as a new counter to the `JSONB` and returns a new `JSONB` if there is enough space for one more counter. If there is not, the counter is added and then the counter list is pruned.

###### `topn_union(jsonb, jsonb)`
Takes the union of both `JSONB`s and returns a new `JSONB`.
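
The union can be pictured as summing the frequencies of matching items across the two counter lists. The Python sketch below illustrates that idea only: the name `union_counters` and the dict representation are assumptions, and it does not model any pruning the extension may apply when the merged list exceeds `topn.number_of_counters`.

```python
from collections import Counter
from typing import Dict


def union_counters(left: Dict[str, int], right: Dict[str, int]) -> Dict[str, int]:
    """Merge two counter dicts, adding frequencies of items present in both."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts instead of replacing them
    return dict(merged)


january = {"a": 2, "b": 1}
february = {"b": 4, "c": 3}
print(union_counters(january, february))  # {'a': 2, 'b': 5, 'c': 3}
```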

### Variables
###### `topn.number_of_counters`
Sets the number of counters to be tracked in a `JSONB`. If at some point the current number of counters exceeds this value, the list is pruned. The default value of `topn.number_of_counters` is 1000. Increasing it improves the accuracy of the results at the cost of space and time.

# Build
Once you have PostgreSQL installed, you're ready to build topn. For this, the directory containing `pg_config` must be on your `PATH` when you run make. This is typically your PostgreSQL installation's bin/ directory. For example:

    PATH=/usr/local/pgsql/bin/:$PATH make
    sudo PATH=/usr/local/pgsql/bin/:$PATH make install

You can run the regression tests as follows:

    sudo PATH=/usr/local/pgsql/bin/:$PATH make check

Please note that the test dataset file `customer_reviews_1998.csv` is large, so it is handled by git-lfs.

# Citus Use Case Example
Let's start by downloading and decompressing the data files.

    wget http://examples.citusdata.com/customer_reviews_1998.csv.gz
    wget http://examples.citusdata.com/customer_reviews_1999.csv.gz

    gzip -d customer_reviews_1998.csv.gz
    gzip -d customer_reviews_1999.csv.gz

Create the topn extension and the sum(topn) function on the master and on the worker nodes.

```SQL
-- create extension
CREATE EXTENSION topn;

-- override sum(topn) function
CREATE AGGREGATE sum(jsonb)(
    SFUNC = topn_union_trans,
    STYPE = internal,
    FINALFUNC = topn_pack
);
```

The remaining steps only need to be run on the master node.

```SQL
-- create table
CREATE TABLE customer_reviews
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
);
```

Next, we load data into the table:

```SQL
\COPY customer_reviews FROM 'customer_reviews_1998.csv' WITH CSV;
\COPY customer_reviews FROM 'customer_reviews_1999.csv' WITH CSV;
```

Finally, let's run some example SQL:

```SQL
-- Create a distributed table to insert summaries.
create table popular_products
(
    review_summary jsonb,
    year double precision,
    month double precision
);

SELECT create_distributed_table('popular_products', 'year');
```

```sh
# Create different summaries by grouping the reviews according to
# their year and month, and copy them into the distributed table

psql -d postgres -c "COPY (select
    topn_add_agg(product_id),
    extract(year from review_date) as year,
    extract(month from review_date) as month
from
    customer_reviews
group by
    year,
    month) TO STDOUT" | psql -d postgres -c "COPY popular_products FROM STDIN"
```

```SQL
-- Let's check top-20 items.

SELECT
    *
FROM
    (SELECT
        (topn(sum(review_summary), 20)).*
    FROM
        popular_products
    GROUP BY year
    ) foo
ORDER BY
    2 DESC;
```

data/.DS_Store

6 KB
Binary file not shown.

data/.gitattributes

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
/data/customer_reviews_1998.csv filter=lfs diff=lfs merge=lfs -text

expected/.DS_Store

6 KB
Binary file not shown.
