Why would you use spark3D? If you often need to repartition large spatial 3D data sets, or perform spatial queries (neighbour search, window queries, cross-match, clustering, ...), spark3D is for you. It contains optimised classes and methods to do so, and it spares you the implementation time! In addition, a big advantage of all those extensions is to efficiently perform visualisation of large data sets by quickly building a representation of your data set (see more [here](https://astrolabsoftware.github.io/spark3D/)).
spark3D exposes two APIs: Scala (spark3D) and Python (pyspark3d). The core developments are done in Scala and interfaced with Python using the great [py4j](https://www.py4j.org/) package. This means pyspark3d might not contain all the features present in spark3D.
In addition, due to differences between Scala and Python, there might be subtle differences between the two APIs.
While we try to stick to the latest Apache Spark developments, spark3D started with the RDD API and slowly migrated to the DataFrame API. This process left a huge imprint on the code structure, and low-level layers in spark3D often still use RDDs to manipulate the data. Do not be surprised if things are moving: the package is under active development, but we try to keep the user interface as stable as possible!

The spark3D library contains a number of methods and tools to manipulate 3D RDDs. Currently, you can already play with *window query*, *KNN* and *cross-match between data sets*.
## Envelope query
An envelope query takes as input an `RDD[Shape3D]` and an envelope, and returns all objects in the RDD intersecting the envelope (contained in or crossing the envelope):
```python
# Launch this example: spark-submit --master ... --packages spark3D_id
from pyspark.sql import SparkSession

from pyspark3d.geometryObjects import ShellEnvelope
from pyspark3d.spatialOperator import windowQuery

# ... (SparkSession creation and loading of the data set `rdd` elided in this excerpt)

# Load the envelope (Sphere at the origin, and radius 0.5)
sh = ShellEnvelope(0.0, 0.0, 0.0, False, 0.0, 0.5)

# Perform the query
matchRDD = windowQuery(rdd.rawRDD(), sh)
print("{}/{} objects found in the envelope".format(
    len(matchRDD.collect()), rdd.rawRDD().count()))
# 1435/20000 objects found in the envelope
```
Note that the input objects and the envelope can be any `Shape3D`: points, shells (incl. spheres), or boxes.
## Cross-match between data-sets
A cross-match takes as input two data sets, and returns objects matching based on the center distance, or the pixel index of objects. Note that performing a cross-match between a data set of N elements and another of M elements is a priori an NxM operation - so it can be very costly! Let's load two `Point3D` data sets:
```python
# Launch this example: spark-submit --master ... --packages spark3D_id
# ... (loading of the two Point3D data sets elided in this excerpt)
```
By default, the two sets are partitioned randomly (in the sense that points spatially close are probably not in the same partition).
In order to decrease the cost of performing the cross-match, you need to partition the two data sets the same way. By doing so, you will cross-match only points belonging to the same partition. For a large number of partitions, you will decrease the cost significantly:
```python
# npart is the wanted number of partitions.
# Default is rdd.rawRDD() partition number.
npart = 100

# For the spatial partitioning, you can currently choose
# between LINEARONIONGRID, or OCTREE (see GridType.scala).
# ... (re-partitioning of the two data sets elided in this excerpt)
```
We advise caching the re-partitioned sets, to speed up future calls by not performing the re-partitioning again.
However, keep in mind that while a large `npart` decreases the cost of performing the cross-match, it increases the partitioning cost, as more partitions imply more data shuffle between them. There is no magic number for `npart` which applies in general, and you'll need to set it according to the needs of your problem. My only advice would be: re-partitioning is typically done once, queries can be multiple...
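
To get an intuition for why co-partitioning pays off, here is a small, self-contained sketch in plain Python (not spark3D code; the grid-cell function and the point lists are purely hypothetical) comparing the number of pair comparisons done by a brute-force cross-match with one restricted to points sharing the same spatial cell:

```python
import random
from collections import defaultdict

random.seed(0)
N, M = 1000, 1000

def cell(p, size=0.1):
    # Hypothetical stand-in for a spatial partition: a coarse 3D grid cell
    return (int(p[0] / size), int(p[1] / size), int(p[2] / size))

A = [(random.random(), random.random(), random.random()) for _ in range(N)]
B = [(random.random(), random.random(), random.random()) for _ in range(M)]

# Brute-force cross-match considers every (a, b) pair
brute_force_pairs = N * M

# A co-partitioned cross-match only considers pairs falling in the same cell
cells_B = defaultdict(list)
for b in B:
    cells_B[cell(b)].append(b)
partitioned_pairs = sum(len(cells_B[cell(a)]) for a in A)

print(brute_force_pairs, partitioned_pairs)
# 1000000 vs roughly 1000 comparisons for a 10x10x10 grid
```

More cells (partitions) means fewer comparisons per cell, but building and shuffling the partitions has its own cost, which is exactly the trade-off described above.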
### What does a cross-match return?
In spark3D, the cross-match between two sets A and B can return:
* (1) Elements of (A, B) matching (returnType="AB")
* (2) Elements of A matching B (returnType="A")
* (3) Elements of B matching A (returnType="B")
Which one should you choose? That depends on what you need:
(1) gives you all matching pairs but can be slow.
(2) & (3) give you only the elements matching on one side, but are faster.
### What is the criterion for the cross-match?
Currently, we have implemented two methods to perform a cross-match:
* Based on the center distance (a and b match if norm(a - b) < epsilon); see the small sketch after this list.
103
-
* Based on the center angular separation (Healpix index) inside a shell (a and b match if their Healpix index is the same). Note that this strategy can only be used in combination with the `LINEARONIONGRID` partitioning, which produces 3D shells along the radial axis and projects the data in 2D shells (where Healpix can be used!).
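
As a quick illustration of the first criterion, here is a tiny, self-contained sketch in plain Python (not the pyspark3d API; the centers and thresholds are made up) showing when two 3D centers match under the distance rule:

```python
import math

def centers_match(a, b, epsilon):
    """Return True if the 3D centers a and b are closer than epsilon."""
    return math.dist(a, b) < epsilon

# Two hypothetical object centers and a distance threshold
a = (0.10, 0.20, 0.30)
b = (0.11, 0.21, 0.31)
print(centers_match(a, b, epsilon=0.04))  # True  (separation is about 0.017)
print(centers_match(a, b, epsilon=0.01))  # False
```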
Here is an example which returns only elements from A with a counterpart in B, using the center distance:
```python
from pyspark3d.spatialOperator import CrossMatchCenter

# Distance threshold for the match
epsilon = 0.04

# Keeping only elements from A with counterpart in B
# ... (the cross-match call itself is elided in this excerpt)
print("{}/{} elements in A match with elements of B!".format(
    matchRDDB_healpix.count(), rddB_part.count()))
# 15/1000 elements in A match with elements of B!
```
In addition, you can choose to return only the Healpix indices for which points match (returnType="healpix"). It is even faster than returning objects.
## Neighbour search
### Simple KNN
Finds the K nearest neighbours of a query object within a `rdd`. The naive implementation here searches through all the objects in the RDD to get the KNN. The nearness of the objects is decided on the basis of the distance between their centers. Note that `queryObject` and the elements of `rdd` must have the same type (either both Point3D, or both ShellEnvelope, or both BoxEnvelope).
```python
# Launch this example: spark-submit --master ... --packages spark3D_id
# ... (rest of the KNN example elided in this excerpt)
```
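
Since the library example above is truncated in this excerpt, here is a minimal, self-contained sketch in plain Python (not the pyspark3d API; `query_center` and `centers` are hypothetical) of the naive strategy described above, i.e. ranking every object by its center distance to the query object and keeping the K closest:

```python
import math

def naive_knn(query_center, centers, k):
    """Return the k centers closest to query_center (Euclidean distance)."""
    # Rank every candidate by its distance to the query center, keep the k first
    return sorted(centers, key=lambda c: math.dist(c, query_center))[:k]

# Hypothetical toy data: one query point and a handful of 3D centers
query_center = (0.0, 0.0, 0.0)
centers = [(0.1, 0.0, 0.0), (0.5, 0.5, 0.5), (0.0, 0.2, 0.0), (1.0, 1.0, 1.0)]
print(naive_knn(query_center, centers, k=2))
# [(0.1, 0.0, 0.0), (0.0, 0.2, 0.0)]
```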