# Scaling E-Commerce

In this example we demonstrate Dozer's capability of processing large volumes of data.

## Table of Contents

- Data Schema and Volume
- Instance Type
- Experiment 1
- Experiment 2
- API Performance

Running instructions can be found here.
## Data Schema and Volume

Let's consider the following schema. The data source has 4 tables: `customers`, `orders`, `order_items`, and `products`.

Data has been generated using dbldatagen. We generate 11 million rows for `customers`, 11 million rows for `orders`, 16 million rows for `products`, and 100 million rows for `order_items`. These parameters can be adjusted in generate.py.
| Table | No. of Rows |
|---|---|
| customers | 11,000,000 |
| orders | 11,000,000 |
| order_items | 100,000,000 |
| products | 16,000,000 |
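generate.py itself uses dbldatagen on Spark; as a rough, hypothetical illustration of the shape of the generated rows (names, columns, and counts are scaled-down assumptions, not the actual script), here is a plain-Python sketch. The key property is that every `order_items` foreign key lands inside a parent table:

```python
import random

# Hypothetical, scaled-down stand-in for generate.py; the real data is
# produced with dbldatagen on Spark at the volumes in the table above.
N_ORDERS = 1_000         # 11_000_000 in the benchmark
N_PRODUCTS = 500         # 16_000_000 in the benchmark
N_ORDER_ITEMS = 5_000    # 100_000_000 in the benchmark

def gen_order_items(n, n_orders, n_products, seed=7):
    """Yield order_items rows whose foreign keys stay inside the parent tables."""
    rng = random.Random(seed)
    for _ in range(n):
        yield {
            "order_id": rng.randrange(n_orders),
            "product_id": rng.randrange(n_products),
            "quantity": rng.randint(1, 5),
        }

rows = list(gen_order_items(N_ORDER_ITEMS, N_ORDERS, N_PRODUCTS))
print(len(rows), all(r["product_id"] < N_PRODUCTS for r in rows))  # 5000 True
```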
## Instance Type

The following tests have been run on an AWS Graviton m7g.8xlarge instance. Storage-optimized instances such as im4gn.large could be used instead if a high volume of disk reads is expected, as in the case of external storage.

| Instance Type | vCPUs | Memory (GiB) |
|---|---|---|
| m7g.8xlarge | 32 | 128 |
## Experiment 1

Running Dozer directly from source to cache:

```
dozer clean -c direct-config.yaml
dozer build -c direct-config.yaml
dozer run app -c direct-config.yaml
```

- Roughly took `4 mins` to process all the records.
- Note that processing of `customers`, `orders` and `order_items` finished in about `2 mins`, compared to `products`.
- Pipeline latency is very low (`~0.04s`) as there is no transformation involved.

| Start Time | End Time | Elapsed |
|---|---|---|
| 3:00:50 PM | 3:04:38 PM | ~ 4 mins |
## Experiment 2

Running Dozer with aggregations and joins. We run 3 cascading JOINs and a COUNT aggregation on the data source. The SQL can be found in aggregate-config.yaml.

```sql
select c.customer_id, c.name, c.email, o.order_id, o.order_date, o.total_amount, COUNT(*)
into customer_orders
from customers c
inner join orders o on c.customer_id = o.customer_id
join order_items i on o.order_id = i.order_id
join products p on i.product_id = p.product_id
group by c.customer_id, c.name, c.email, o.order_id, o.order_date, o.total_amount
```

```
dozer clean -c aggregate-config.yaml
dozer build -c aggregate-config.yaml
dozer run app -c aggregate-config.yaml
```

- Roughly took `12 mins` to process all the records.
- Note that here the total number of `order_items` records processed increases in conjunction with `products`, due to the dependency introduced by the join.
- Pipeline latency stays under `1s` even with 3 joins and an aggregation.

| Start Time | End Time | Elapsed |
|---|---|---|
| 2:32:48 PM | 2:44:51 PM | ~ 12 mins |
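To make the semantics of the query above concrete, the cascading joins and the COUNT can be sketched in plain Python over toy rows (hypothetical data, nested loops instead of Dozer's streaming engine). Each `order_items` row that survives all three joins increments its group's count:

```python
from collections import Counter

# Toy rows standing in for the four source tables (hypothetical data).
customers = [{"customer_id": 1, "name": "Ada", "email": "ada@example.com"}]
orders = [{"order_id": 10, "customer_id": 1,
           "order_date": "2023-01-05", "total_amount": 30.0}]
order_items = [{"order_id": 10, "product_id": 100},
               {"order_id": 10, "product_id": 101}]
products = [{"product_id": 100}, {"product_id": 101}]

# customers -> orders -> order_items -> products inner joins,
# then COUNT(*) per (customer, order) group, mirroring the SQL above.
product_ids = {p["product_id"] for p in products}
counts = Counter()
for c in customers:
    for o in orders:
        if o["customer_id"] != c["customer_id"]:
            continue
        for i in order_items:
            if i["order_id"] == o["order_id"] and i["product_id"] in product_ids:
                key = (c["customer_id"], c["name"], c["email"],
                       o["order_id"], o["order_date"], o["total_amount"])
                counts[key] += 1

# The single order has two matching order_items, so its group count is 2.
print(counts.most_common())
```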
## API Performance

Dozer really shines when it comes to API performance, as views are pre-materialized. Dozer automatically generates gRPC and REST APIs.

Let's use ghz to run a load test against the gRPC server. You can find the script here.

```
HOST=localhost:50051
TOTAL=100000
CONCURRENCY=50

echo "Querying count of customers: $TOTAL requests and $CONCURRENCY concurrency"

ghz --insecure --proto .dozer/api/customers/v0001/common.proto --call dozer.common.CommonGrpcService.query --total $TOTAL --concurrency $CONCURRENCY --data '{"endpoint":"customers"}' $HOST
```

Dozer maintains an average latency of 4.92 ms at a very high throughput, serving 100,000 total requests at a concurrency of 50.
```
Summary:
  Count:        100000
  Total:        10.83 s
  Slowest:      30.56 ms
  Fastest:      1.27 ms
  Average:      4.92 ms
  Requests/sec: 9234.20

Response time histogram:
  1.268  [1]     |
  4.197  [48567] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  7.126  [36274] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  10.056 [10248] |∎∎∎∎∎∎∎∎
  12.985 [3135]  |∎∎∎
  15.915 [1152]  |∎
  18.844 [414]   |
  21.773 [104]   |
  24.703 [73]    |
  27.632 [26]    |
  30.561 [6]     |

Latency distribution:
  10 % in 2.44 ms
  25 % in 3.18 ms
  50 % in 4.27 ms
  75 % in 5.92 ms
  90 % in 8.11 ms
  95 % in 10.00 ms
  99 % in 14.63 ms

Status code distribution:
  [OK] 100000 responses
```
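As a quick sanity check, the reported throughput follows directly from the totals above; the small difference from ghz's 9234.20 comes from `Total` being rounded to two decimals in the summary:

```python
# Cross-check the ghz summary: throughput ≈ total requests / elapsed time.
total_requests = 100_000
elapsed_s = 10.83  # "Total: 10.83 s" from the summary above
throughput = total_requests / elapsed_s
print(f"{throughput:.1f} requests/sec")  # 9233.6 requests/sec
```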


