
Commit c047d36

committed
Update the website sections
1 parent dc03ba8 commit c047d36

File tree

7 files changed: +23 -504 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -22,8 +22,8 @@ spark3D should be viewed as an extension of the Apache Spark framework, and more

Why would you use spark3D? If you often need to repartition large spatial 3D data sets, or perform spatial queries (neighbour search, window queries, cross-match, clustering, ...), spark3D is for you. It contains optimised classes and methods to do so, and it spares you the implementation time! In addition, a big advantage of all those extensions is to efficiently perform visualisation of large data sets by quickly building a representation of your data set (see more [here](https://astrolabsoftware.github.io/spark3D/)).

-spark3D exposes two API: Scala (spark3D) and Python (pyspark3d). The core developments are done in Scala, and interfaced with Python using the great [py4j](https://www.py4j.org/) package. This means pyspark3d might not contain all the features present in the spark3D.
-In addition, due to different in Scala and Python, there might be subtle difference in the two APIs.
+spark3D exposes two API: Scala (spark3D) and Python (pyspark3d). The core developments are done in Scala, and interfaced with Python using the great [py4j](https://www.py4j.org/) package. This means pyspark3d might not contain all the features present in spark3D.
+In addition, due to difference between Scala and Python, there might be subtle differences in the two APIs.

While we try to stick to the latest Apache Spark developments, spark3D started with the RDD API and slowly migrated to use the DataFrame API. This process left a huge imprint on the code structure, and low-level layers in spark3D often still use RDD to manipulate the data. Do not be surprised if things are moving, the package is under an active development but we try to keep the user interface as stable as possible!

docs/03_visualisation_python.md

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@ layout: splash
title: "Tutorial: Visualisation"
date: 2018-06-18 22:31:13 +0200
---
+
+Under construction

docs/03_visualisation_scala.md

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ title: "Tutorial: Visualisation"
date: 2018-06-18 22:31:13 +0200
---

+Under construction

docs/04_query_python.md

Lines changed: 1 addition & 266 deletions
@@ -5,269 +5,4 @@ title: "Tutorial: Query, cross-match, play!"
date: 2018-06-18 22:31:13 +0200
---

# Tutorial: Query, cross-match, play!

The spark3D library contains a number of methods and tools to manipulate 3D RDDs. Currently, you can already play with *window query*, *KNN* and *cross-match between data sets*.

## Envelope query

An envelope query takes as input an `RDD[Shape3D]` and an envelope, and returns all objects in the RDD intersecting the envelope (contained in and crossing the envelope):
```python
# Launch this example: spark-submit --master ... --packages spark3D_id
from pyspark.sql import SparkSession
from pyspark3d.geometryObjects import ShellEnvelope
from pyspark3d.spatial3DRDD import SphereRDD
from pyspark3d.spatialOperator import windowQuery

# Initialise Spark Session
spark = SparkSession.builder\
    .appName("collapse")\
    .getOrCreate()

# Load the data
fn = "src/test/resources/cartesian_spheres.fits"
rdd = SphereRDD(spark, fn, "x,y,z,radius", False, "fits", {"hdu": "1"})

# Load the envelope (Sphere at the origin, and radius 0.5)
sh = ShellEnvelope(0.0, 0.0, 0.0, False, 0.0, 0.5)

# Perform the query
matchRDD = windowQuery(rdd.rawRDD(), sh)
print("{}/{} objects found in the envelope".format(
    len(matchRDD.collect()), rdd.rawRDD().count()))
# 1435/20000 objects found in the envelope
```

Note that the input objects and the envelope can be any `Shape3D`: points, shells (including spheres), or boxes.
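For instance, a minimal sketch of a window query using a hollow shell instead of a full sphere, reusing `rdd` and `windowQuery` from the example above (it assumes the last two `ShellEnvelope` arguments are the inner and outer radii, as the sphere example suggests):

```python
# Shell centred at the origin: keep objects between radius 0.2 and 0.5
# (assumed argument order: x, y, z, isSpherical, inner radius, outer radius)
shell = ShellEnvelope(0.0, 0.0, 0.0, False, 0.2, 0.5)

# Same query as before, with the hollow shell as the window
matchShellRDD = windowQuery(rdd.rawRDD(), shell)
print("{} objects found in the shell".format(len(matchShellRDD.collect())))
```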

## Cross-match between data sets

A cross-match takes as input two data sets, and returns objects matching based on the center distance, or on the pixel index of objects. Note that performing a cross-match between a data set of N elements and another of M elements is a priori an NxM operation - so it can be very costly! Let's load two `Point3D` data sets:
```python
# Launch this example: spark-submit --master ... --packages spark3D_id
from pyspark.sql import SparkSession
from pyspark3d.spatial3DRDD import Point3DRDD

# Initialise Spark Session
spark = SparkSession.builder\
    .appName("collapse")\
    .getOrCreate()

# Load the raw data (spherical coordinates)
fn = "src/test/resources/astro_obs_{}_light.fits"
rddA = Point3DRDD(
    spark, fn.format("A"), "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"})
rddB = Point3DRDD(
    spark, fn.format("B"), "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"})
```

By default, the two sets are partitioned randomly (in the sense that points which are spatially close are probably not in the same partition).
In order to decrease the cost of performing the cross-match, you need to partition the two data sets the same way. By doing so, you will cross-match only points belonging to the same partition. For a large number of partitions, you will significantly decrease the cost:
```python
# npart is the wanted number of partitions.
# Default is rdd.rawRDD() partition number.
npart = 100

# For the spatial partitioning, you can currently choose
# between LINEARONIONGRID, or OCTREE (see GridType.scala).
rddA_part = rddA.spatialPartitioningPython("LINEARONIONGRID", npart)

# Repartition B as A
rddB_part = rddB.spatialPartitioningPython(
    rddA_part.partitioner().get()).cache()
```

We advise to cache the re-partitioned sets, to speed up future calls by not performing the re-partitioning again.
However, keep in mind that while a large `npart` decreases the cost of performing the cross-match, it increases the partitioning cost, as more partitions imply more data shuffle between partitions. There is no magic number for `npart` which applies in general, and you'll need to set it according to the needs of your problem. My only advice would be: re-partitioning is typically done once, queries can be multiple...

### What does a cross-match return?

In spark3D, the cross-match between two sets A and B can return:

* (1) Elements of (A, B) matching (returnType="AB")
* (2) Elements of A matching B (returnType="A")
* (3) Elements of B matching A (returnType="B")

Which one should you choose? That depends on what you need:
(1) gives you all matching pairs but can be slow.
(2) & (3) give you only the elements matching on one side, but are faster.
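As an illustration, here is a minimal sketch contrasting the three return types with the distance-based operator described in the next section (it reuses `rddA_part` and `rddB_part` from above; the `epsilon` value is arbitrary):

```python
from pyspark3d.spatialOperator import CrossMatchCenter

# Arbitrary distance threshold for this sketch
epsilon = 0.04

# (1) All matching pairs (A, B): complete, but can be slow
matchAB = CrossMatchCenter(rddA_part, rddB_part, epsilon, "AB")

# (2) Only the elements of A with a counterpart in B: faster
matchA = CrossMatchCenter(rddA_part, rddB_part, epsilon, "A")

# (3) Only the elements of B with a counterpart in A: faster
matchB = CrossMatchCenter(rddA_part, rddB_part, epsilon, "B")
```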

### What is the criterion for the cross-match?

Currently, we have implemented two methods to perform a cross-match:

* Based on the center distance (a and b match if norm(a - b) < epsilon).
* Based on the center angular separation (Healpix index) inside a shell (a and b match if their Healpix index is the same). Note that this strategy can only be used in combination with the `LINEARONIONGRID` partitioning, which produces 3D shells along the radial axis and projects the data onto 2D shells (where Healpix can be used!).

Here is an example which returns only the elements from A with a counterpart in B, using the center distance:
```python
from pyspark3d.spatialOperator import CrossMatchCenter

# Distance threshold for the match
epsilon = 0.04

# Keeping only elements from A with counterpart in B
matchRDDB = CrossMatchCenter(rddA_part, rddB_part, epsilon, "A")

print("{}/{} elements in A match with elements of B!".format(
    matchRDDB.count(), rddB_part.count()))
# 19/1000 elements in A match with elements of B!
```

and the same using the Healpix indices:
```python
from pyspark3d.spatialOperator import CrossMatchHealpixIndex

# Shell resolution for Healpix indexing
nside = 512

# Keeping only elements from A with counterpart in B
matchRDDB_healpix = CrossMatchHealpixIndex(rddA_part, rddB_part, nside, "A")
print("{}/{} elements in A match with elements of B!".format(
    matchRDDB_healpix.count(), rddB_part.count()))
# 15/1000 elements in A match with elements of B!
```

In addition, you can choose to return only the Healpix indices for which points match (returnType="healpix"). This is even faster than returning the objects themselves.
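A minimal sketch of this mode, reusing `rddA_part`, `rddB_part` and `nside` from the example above (it assumes the result behaves like the RDDs above and exposes `count()`; the printed count is only illustrative):

```python
# Return only the Healpix indices at which A and B both have points,
# instead of the matching objects themselves
matchIndices = CrossMatchHealpixIndex(rddA_part, rddB_part, nside, "healpix")

print("{} Healpix pixels contain at least one match".format(
    matchIndices.count()))
```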

## Neighbour search

### Simple KNN

Finds the K nearest neighbours of a query object within an `rdd`.
The naive implementation here searches through all the objects in the
RDD to get the KNN. The nearness of the objects here is decided on the
basis of the distance between their centers.
Note that `queryObject` and the elements of `rdd` must have the same type
(either both Point3D, or both ShellEnvelope, or both BoxEnvelope).
```python
# Launch this example: spark-submit --master ... --packages spark3D_id
from math import sqrt

from pyspark.sql import SparkSession
from pyspark3d import load_from_jvm  # assumed to be exposed at the package level
from pyspark3d.geometryObjects import Point3D
from pyspark3d.spatial3DRDD import Point3DRDD
from pyspark3d.spatialOperator import KNN

# Initialise Spark Session
spark = SparkSession.builder\
    .appName("collapse")\
    .getOrCreate()

# Load the data (spherical coordinates)
fn = "src/test/resources/astro_obs.fits"
rdd = Point3DRDD(
    spark, fn, "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"})

# Define your query point
pt = Point3D(0.0, 0.0, 0.0, True)

# Perform the query: look for the 5 closest neighbours from `pt`.
# Note that the last argument controls whether we want to eliminate duplicates.
match = KNN(rdd.rawRDD(), pt, 5, True)

# `match` is a list of Point3D. Take the coordinates and convert them
# into cartesian coordinates for later use:
mod = "com.astrolabsoftware.spark3d.utils.Utils.sphericalToCartesian"
converter = load_from_jvm(mod)
match_coord = [converter(m).getCoordinatePython() for m in match]

# Print the distance to the query point (origin)
def normit(l: list) -> float:
    """
    Return the Euclidean distance between a point and the origin.

    Parameters
    ----------
    l : list of float
        Cartesian coordinates of a point in space (3 floats).

    Returns
    ----------
    d : float
        Euclidean norm between the point and the origin.
    """
    return sqrt(sum([e**2 for e in l]))

# Pretty print
print([str(normit(l) * 1e6) + "e-6" for l in match_coord])
# [72, 73, 150, 166, 206]
```

### More efficient KNN

A more efficient implementation of the KNN query above.
We first seek the partitions to which the query object belongs, and
look for the KNN only in those partitions. After this, if the limit k
is not satisfied, we keep looking similarly in the neighbours of the
containing partitions.

Note 1: the elements of `rdd` and `queryObject` can have different types
among Shape3D (Point3D or ShellEnvelope or BoxEnvelope) (unlike KNN above).

Note 2: KNNEfficient only works on repartitioned RDDs (Python version).
```python
# Launch this example: spark-submit --master ... --packages spark3D_id
from math import sqrt

from pyspark.sql import SparkSession
from pyspark3d import load_from_jvm  # assumed to be exposed at the package level
from pyspark3d.geometryObjects import Point3D
from pyspark3d.spatial3DRDD import Point3DRDD
from pyspark3d.spatialOperator import KNNEfficient

# Initialise Spark Session
spark = SparkSession.builder\
    .appName("collapse")\
    .getOrCreate()

# Load the data (spherical coordinates)
fn = "src/test/resources/astro_obs.fits"
rdd = Point3DRDD(
    spark, fn, "Z_COSMO,RA,DEC", True, "fits", {"hdu": "1"})

# Repartition the RDD
rdd_part = rdd.spatialPartitioningPython("LINEARONIONGRID", 5)

# Define your query point
pt = Point3D(0.0, 0.0, 0.0, True)

# Perform the query: look for the 5 closest neighbours from `pt`.
# Automatically discards duplicates.
match = KNNEfficient(rdd_part, pt, 5)

# `match` is a list of Point3D. Take the coordinates and convert them
# into cartesian coordinates for later use:
mod = "com.astrolabsoftware.spark3d.utils.Utils.sphericalToCartesian"
converter = load_from_jvm(mod)
match_coord = [converter(m).getCoordinatePython() for m in match]

# Print the distance to the query point (origin)
def normit(l: list) -> float:
    """
    Return the Euclidean distance between a point and the origin.

    Parameters
    ----------
    l : list of float
        Cartesian coordinates of a point in space (3 floats).

    Returns
    ----------
    d : float
        Euclidean norm between the point and the origin.
    """
    return sqrt(sum([e**2 for e in l]))

# Pretty print
print([str(normit(l) * 1e6) + "e-6" for l in match_coord])
# [72, 73, 150, 166, 206]
```

## Benchmarks

TBD

+Under construction
