# Scaling E-Commerce

In this example we demonstrate Dozer's capability of processing large volumes of data.

## Table of Contents

- Data Schema and Volume
- Instance Type
- Experiment 1
- Experiment 2
- API Performance

Running instructions can be found here.
## Data Schema and Volume

Let's consider the following schema. The data source has 4 tables: `customers`, `orders`, `order_items`, and `products`.

Data has been generated using dbldatagen. We generate 11 million rows for `customers`, 11 million rows for `orders`, 16 million rows for `products`, and 100 million rows for `order_items`. These parameters can be adjusted in generate.py.
| Table | No. of Rows |
|---|---|
| customers | 11,000,000 |
| orders | 11,000,000 |
| order_items | 100,000,000 |
| products | 16,000,000 |
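generate.py itself uses dbldatagen on Spark; as a rough, hypothetical illustration of the shape of the generated rows (names, columns, and counts are scaled-down assumptions, not the actual script), here is a plain-Python sketch. The key property is that every `order_items` foreign key lands inside a parent table:

```python
import random

# Hypothetical, scaled-down stand-in for generate.py; the real data is
# produced with dbldatagen on Spark at the volumes in the table above.
N_ORDERS = 1_000         # 11_000_000 in the benchmark
N_PRODUCTS = 500         # 16_000_000 in the benchmark
N_ORDER_ITEMS = 5_000    # 100_000_000 in the benchmark

def gen_order_items(n, n_orders, n_products, seed=7):
    """Yield order_items rows whose foreign keys stay inside the parent tables."""
    rng = random.Random(seed)
    for _ in range(n):
        yield {
            "order_id": rng.randrange(n_orders),
            "product_id": rng.randrange(n_products),
            "quantity": rng.randint(1, 5),
        }

rows = list(gen_order_items(N_ORDER_ITEMS, N_ORDERS, N_PRODUCTS))
print(len(rows), all(r["product_id"] < N_PRODUCTS for r in rows))  # 5000 True
```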
## Instance Type

The following tests have been run on an AWS Graviton m7g.8xlarge instance. Storage-optimized instances such as im4gn.large could be used instead if a high volume of disk reads is expected, as in the case of external storage.

| Instance Type | vCPUs | Memory (GiB) |
|---|---|---|
| m7g.8xlarge | 32 | 128 |
## Experiment 1

Running Dozer directly from source to cache:

```
dozer clean -c direct-config.yaml
dozer build -c direct-config.yaml
dozer run app -c direct-config.yaml
```

- Roughly took `4 mins` to process all the records.
- Note that processing of `customers`, `orders` and `order_items` finished in about `2 mins`, compared to `products`.
- Pipeline latency is very low (`~0.04s`) as there is no transformation involved.

| Start Time | End Time | Elapsed |
|---|---|---|
| 3:00:50 PM | 3:04:38 PM | ~ 4 mins |
## Experiment 2

Running Dozer with aggregations and joins. We run 3 cascading JOINs and a COUNT aggregation on the data source. The SQL can be found in aggregate-config.yaml.

```sql
select c.customer_id, c.name, c.email, o.order_id, o.order_date, o.total_amount, COUNT(*)
into customer_orders
from customers c
inner join orders o on c.customer_id = o.customer_id
join order_items i on o.order_id = i.order_id
join products p on i.product_id = p.product_id
group by c.customer_id, c.name, c.email, o.order_id, o.order_date, o.total_amount
```

```
dozer clean -c aggregate-config.yaml
dozer build -c aggregate-config.yaml
dozer run app -c aggregate-config.yaml
```

- Roughly took `12 mins` to process all the records.
- Note that here the total number of `order_items` records processed increases in conjunction with `products`, due to the dependency introduced by the join.
- Pipeline latency stays under `1s` even with 3 joins and an aggregation.

| Start Time | End Time | Elapsed |
|---|---|---|
| 2:32:48 PM | 2:44:51 PM | ~ 12 mins |
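To make the semantics of the query above concrete, the cascading joins and the COUNT can be sketched in plain Python over toy rows (hypothetical data, nested loops instead of Dozer's streaming engine). Each `order_items` row that survives all three joins increments its group's count:

```python
from collections import Counter

# Toy rows standing in for the four source tables (hypothetical data).
customers = [{"customer_id": 1, "name": "Ada", "email": "ada@example.com"}]
orders = [{"order_id": 10, "customer_id": 1,
           "order_date": "2023-01-05", "total_amount": 30.0}]
order_items = [{"order_id": 10, "product_id": 100},
               {"order_id": 10, "product_id": 101}]
products = [{"product_id": 100}, {"product_id": 101}]

# customers -> orders -> order_items -> products inner joins,
# then COUNT(*) per (customer, order) group, mirroring the SQL above.
product_ids = {p["product_id"] for p in products}
counts = Counter()
for c in customers:
    for o in orders:
        if o["customer_id"] != c["customer_id"]:
            continue
        for i in order_items:
            if i["order_id"] == o["order_id"] and i["product_id"] in product_ids:
                key = (c["customer_id"], c["name"], c["email"],
                       o["order_id"], o["order_date"], o["total_amount"])
                counts[key] += 1

# The single order has two matching order_items, so its group count is 2.
print(counts.most_common())
```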
## API Performance

Dozer really shines when it comes to API performance, as views are pre-materialized. Dozer automatically generates gRPC and REST APIs.

Let's use ghz to run a load test against the gRPC server. You can find the script here.

```
HOST=localhost:50051
TOTAL=100000
CONCURRENCY=50

echo "Querying count of customers: $TOTAL requests and $CONCURRENCY concurrency"

ghz --insecure --proto .dozer/api/customers/v0001/common.proto --call dozer.common.CommonGrpcService.query --total $TOTAL --concurrency $CONCURRENCY --data '{"endpoint":"customers"}' $HOST
```

Dozer maintains an average latency of 4.92 ms at a very high throughput, serving 100,000 total requests at a concurrency of 50.
```
Summary:
  Count:        100000
  Total:        10.83 s
  Slowest:      30.56 ms
  Fastest:      1.27 ms
  Average:      4.92 ms
  Requests/sec: 9234.20

Response time histogram:
  1.268  [1]     |
  4.197  [48567] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  7.126  [36274] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  10.056 [10248] |∎∎∎∎∎∎∎∎
  12.985 [3135]  |∎∎∎
  15.915 [1152]  |∎
  18.844 [414]   |
  21.773 [104]   |
  24.703 [73]    |
  27.632 [26]    |
  30.561 [6]     |

Latency distribution:
  10 % in 2.44 ms
  25 % in 3.18 ms
  50 % in 4.27 ms
  75 % in 5.92 ms
  90 % in 8.11 ms
  95 % in 10.00 ms
  99 % in 14.63 ms

Status code distribution:
  [OK] 100000 responses
```
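As a quick sanity check, the reported throughput follows directly from the totals above; the small difference from ghz's 9234.20 comes from `Total` being rounded to two decimals in the summary:

```python
# Cross-check the ghz summary: throughput ≈ total requests / elapsed time.
total_requests = 100_000
elapsed_s = 10.83  # "Total: 10.83 s" from the summary above
throughput = total_requests / elapsed_s
print(f"{throughput:.1f} requests/sec")  # 9233.6 requests/sec
```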


