The Hilbert Curve in practice

In a previous post, I introduced the concept of space filling curves and their ability to map a high dimensional space down to a low dimensional, or even 1 dimensional, space. I also showed the complexity of the Hilbert Curve in 2 and 3 dimensions, which I hope provided an appreciation of the curve's ability to traverse the higher dimensional space.

In practice there are a number of implementations of the Hilbert curve mapping available in a number of languages, including galtay's hilbertcurve package for Python.

From galtay we have the following example, where the simplest 1st order, 2 dimensional Hilbert curve maps the four integers [0, 1, 2, 3] onto the 2 dimensional space <x,y> with x = (0|1), y = (0|1):

>>> from hilbertcurve.hilbertcurve import HilbertCurve
>>> p=1; n=2
>>> hilbert_curve = HilbertCurve(p, n)
>>> distances = list(range(4))
>>> points = hilbert_curve.points_from_distances(distances)
>>> for point, dist in zip(points, distances):
...     print(f'point(h={dist}) = {point}')

point(h=0) = [0, 0]
point(h=1) = [0, 1]
point(h=2) = [1, 1]
point(h=3) = [1, 0]

It's also possible to query the reverse transformation, going from a point in the space to a distance along the curve:

>>> points = [[0,0], [0,1], [1,1], [1,0]]
>>> distances = hilbert_curve.distances_from_points(points)
>>> for point, dist in zip(points, distances):
...     print(f'distance(x={point}) = {dist}')

distance(x=[0, 0]) = 0
distance(x=[0, 1]) = 1
distance(x=[1, 1]) = 2
distance(x=[1, 0]) = 3

On galtay’s repository, there is a graphic that shows 1st order, 2nd order and 3rd order curves in an <x,y> space. The range represented by each curve gains resolution as the order increases:

  • the 1st order curve supports values (0|1) on the x and y axes, giving us a 2 bit binary range of values, i.e. 00, 01, 10, 11 -> 0..3
  • the 2nd order curve is more complex and supports (0|1|2|3) on the x and y axes, giving us a 4 bit range of values, 0000, 0001, 0010 … 1111 -> 0..15
  • the 3rd order curve extends the x,y values to a combined 6 bits of resolution, giving values 0..63 (a quick check of these ranges follows below).
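
Here is a minimal arithmetic sketch of those ranges (an order-p curve in n dimensions has 2^p positions per axis and 2^(n*p) values along the curve):

# for n = 2 dimensions, print the axis range and curve value range per order
for p in (1, 2, 3):
    print(f'order {p}: axis values 0..{2**p - 1}, '
          f'curve values 0..{2**(2 * p) - 1}')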

As I have noted, an increase in the order of the curve increases its complexity ('wiggliness') and its space covering measure, and also provides more range quanta along the curve.

Returning to my suggestion in the earlier post: the curve can be used to map a geographic space onto the range, and then entities (IP addresses, which by themselves have no geographic relationship) can be mapped onto the range as well. In this fashion, subtraction along the range provides a (resolution dependent) measure of the closeness of the locations of these IP addresses.

galtay's rendering of a 3rd order Hilbert curve in black

Using galtay's rendering of the 3rd order curve shown in black, if one focuses on the value 8 along the curve, it is spatially close in 2 dimensions to 13, 12, 11, 10, 9, 7, 6 and 2, but not spatially close to 63 or 42, which are rendered outside the area shown. With simple subtraction we can have a rule that says IP addresses within 5 or 6 units of 8 are close to it, whereas IP addresses at 20, 30 or 40 units distance are further away. As the order of the curve increases, this measurement gets better resolution.
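
A minimal sketch of this rule using the hilbertcurve package from the earlier example (the 6 unit threshold is just the one suggested above):

from hilbertcurve.hilbertcurve import HilbertCurve

hilbert_curve = HilbertCurve(3, 2)  # 3rd order curve in 2 dimensions
hs = [6, 9, 13, 42, 63]
points = hilbert_curve.points_from_distances(hs)
for h, point in zip(hs, points):
    # a single subtraction stands in for the closeness test
    status = 'close' if abs(h - 8) <= 6 else 'far'
    print(f'h={h:2d} point={point}: {status} to h=8')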

Exploring space filling curves

What is a curve?

A curve is a path inscribed into a dimensional space. We are all familiar with curves in Euclidean space, such as 2 dimensional space where polynomial functions describe a path through that 2-d space. We see below some well known polynomial functions graphed over a small range.

In these examples, the single x value is mapped onto the output y value for the range [-5, 5] of x.

Space filling curve

A space filling curve is a parameterized function which maps a unit line segment to a continuous curve in the unit space (square, cube, …). For the unit square form we have i in the range [0, 1] with a function F(i) that provides output (x, y) with 0 <= x, y <= 1.

The Hilbert Curve

The Hilbert curve is a curve which uses recursion in its definition, and it is a continuous fractal: as its order increases it gets more dense, but it maintains the basic properties of the curve and of the space that the curve fills.

Hilbert Curve: TimSauder, CC BY-SA 4.0

In the space in which this curve is inscribed, (x, y) where 0 <= x, y <= 1, we can see that as the order (complexity) of the curve increases, it does a better job of filling in the space. Another way to consider this is that the x and y inputs at low order are like low resolution numbers, along the lines of the resolution of an Analog to Digital (A/D) converter.

A 1 bit A/D converter can only indicate the presence or absence of a signal, while a 12-bit A/D converter can capture significant signal resolution.

The lowest order Hilbert curve starts with x and y values being either 0 or 1, which gives us the simple “U” pattern with x,y coordinates of [(x=0,y=1), (0,0), (1,0), (1,1)].

Each point along the curve also has a value in the range [0, 1], so again the simple example would have the function map

(0,1) -> 0, (0,0)->0.33, (1,0)-> 0.66 and (1,1)->1
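
This normalization can be reproduced with galtay's hilbertcurve Python package by dividing each integer distance by the maximum distance 3; note that the package's curve starts at (0,0), a rotated variant of the “U” described above:

from hilbertcurve.hilbertcurve import HilbertCurve

hilbert_curve = HilbertCurve(1, 2)  # the simple "U" pattern
distances = list(range(4))
points = hilbert_curve.points_from_distances(distances)
for point, h in zip(points, distances):
    # normalize the integer distances 0..3 onto the range [0, 1]
    print(f'{tuple(point)} -> {h / 3:.2f}')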

As the x,y resolution of the curve increases, the number of curve values can become larger, to the point where, if we leave the A/D converter example and let x and y become real numbers, the curve value can also be a real number between 0 and 1.

If we say that our input parameters are (x, y) and the curve output value is z at point (x, y), then the Hilbert curve has an interesting and very attractive property: values of x and y representing points that are close to each other will generally generate values of z that are also close to each other. This property allows (in our example) 2 dimensional closeness to be mapped onto numbers that retain the ability to measure closeness, but using only 1 dimension.

So far I have been using a 2 dimensional example where the space filling curve traverses a 2 dimensional space, but it's also possible to have a 3 dimensional curve that traverses a 3 dimensional space, and likewise in higher dimensions. This effectively provides a method for high dimensional data to be mapped onto a single dimensional value, where some inherent properties of the high dimensional data are maintained in the linear representation.

In the video above of a 3 dimensional Hilbert curve, you can see that 3d spatial distance maps onto closeness of the value along the curve.

Once we have a space filling curve with points [0 <= x <= 1, 0 <= y <= 1] that maps to values h in the range [0, 1], we can use it for easy relative closeness calculations, simple subtraction, instead of Euclidean distance calculations that involve squaring and square roots. This may not seem like a big deal in 2 dimensions, but as the number of dimensions grows, having a simple calculation to derive closeness will be an interesting capability.
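
To make the contrast concrete, a small sketch (the coordinate values here are arbitrary):

import math

# closeness along the curve: a single subtraction
h_a, h_b = 0.41, 0.38
curve_closeness = abs(h_a - h_b)

# closeness in the original space: one squared term per dimension
# plus a square root, growing with the number of dimensions
p = (0.20, 0.70, 0.10)
q = (0.25, 0.68, 0.12)
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))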

Future plans

In later posts, I want to use this capability to take items that are arranged in a number of different spaces and use Hilbert curves to allow for easy inference of closeness within these spaces. One example is the fact that IP addresses are not geographically allocated, so a method that allows IP addresses to be attached to positions along a Hilbert curve will provide an easy, scalable method of understanding the geographic proximity of IP addresses. For instance, a computation to decide which IP addresses are close to a city or an address will involve taking the GPS coordinates of the city, transforming them into the Hilbert number h, and then defining closeness as a range [h-∂, h+∂] which can be traversed to find all the IP addresses within that space.
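
As a rough sketch of the idea, using galtay's hilbertcurve package (the curve order, the linear quantization of latitude/longitude, the sample IP locations and the ∂ threshold are all hypothetical choices for illustration, and the curve's locality is approximate rather than exact):

from hilbertcurve.hilbertcurve import HilbertCurve

p, n = 16, 2                       # assumed curve order and dimensions
hilbert_curve = HilbertCurve(p, n)
side = 2 ** p                      # grid positions per axis

def gps_to_h(lat, lon):
    # quantize lat [-90, 90] and lon [-180, 180] onto the integer grid
    x = int((lat + 90) / 180 * (side - 1))
    y = int((lon + 180) / 360 * (side - 1))
    return hilbert_curve.distances_from_points([[x, y]])[0]

# hypothetical IP address locations, one near Dublin, one in San Francisco
ip_hs = {'ip_a': gps_to_h(53.34, -6.26), 'ip_b': gps_to_h(37.78, -122.42)}

h_city = gps_to_h(53.35, -6.27)    # Dublin
delta = 10_000                     # hypothetical closeness threshold in curve units
close = {ip for ip, h in ip_hs.items() if h_city - delta <= h <= h_city + delta}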

Tesla Solar Roof Install

While out walking this afternoon I came across an ongoing Tesla Solar Roof install in the neighborhood. Last week, the old roof had been taken off the house and an under-covering branded Tesla had been installed, so I kinda guessed that the Tesla Solar Roof was the next step.

True to form, a large work party, with a truck and a sign advertising that it was a Tesla install, appeared this AM. This afternoon I checked on their progress and the panel installs have begun.

The individual tiles seemed to be delivered in palletized boxes, and one is shown below.

I believe the picture shows 4 panels, with the black triangular piece being the attachment point to the roof. I will have to investigate more; it seems like the panels have significant space underneath them, and they must be sturdy, as the installer seems to be walking on them (with soft shoes) in the above picture.

Once the dust has settled I will speak with the homeowner to get some more information on the size of the install. In the meantime I did a quick search, entered the Tesla Solar Roof site to get a price/recommendation, and the following were some of the recommendations.

Size      Roof cost   Power walls     Price after incentives etc.
4.1 KW    $31,166     1 for $7,770    $36,748; $31,166 after incentive
10.2 KW   $36,867     3 for $17,390   $45,552; $38,867 after incentive

Pricing for two sizings of roof from the Tesla site.

All in all, it looks like a pretty expensive setup, where there is a marginal price difference between the 4.1KW panel setup and the 10.2KW setup. Checking on Amazon, Renogy 30pcs 320W Monocrystalline Solar Panels would set you back $9700, which is close enough to 10KW of panels; so yes, there is a lot of wiring, framing, installation and permitting cost, but is it $20K? But it's Tesla, so it must be good for you… I assume that Tesla will get the unit costs of the solar panels down with mass adoption.

Adding distances calculated from GPS coordinates to network maps

Continuing the theme (part1, part2) of using GPS data to augment network graph data, this post will consider using the data for inclusion in the network graph, and will also introduce taking the network graph and rendering it onto a geographical mapping visualization.

NetworkX graph package

In these discussions I have been using the Python NetworkX package to store a graph structure; nodes in this graph have city names as identifiers, and edges are created as a tuple or pair of city names.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()

Cnames = ['Vancouver', 'Portland', 'San Francisco', 'Seattle', 
          'San Antonio', 'Dallas', 'Austin', 'Sevilla', 
          'Belfast', 'Sydney', 'Canberra', 'Tokyo']
for name in Cnames:
    G.add_node(name)
    G.add_edge(name, 'Dublin')  # 'Dublin' is added implicitly as the hub node

The NetworkX package allows for easy graph creation and manipulation, allows attributes to be added to the graph structure, and then provides graph algorithms (shortest path, cliques, graph analysis), graph traversal, neighbor queries, and graph drawing using standard plotting packages such as matplotlib.
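
For example, with the graph G built above (every city connected to 'Dublin'), a couple of those capabilities look like this:

# the shortest path between two leaf cities routes through the Dublin hub
print(nx.shortest_path(G, 'Tokyo', 'Seattle'))  # ['Tokyo', 'Dublin', 'Seattle']

# neighbor query: every node attached to Dublin
print(list(G.neighbors('Dublin')))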

plt.figure(figsize=(8, 8))
pos = nx.spring_layout(G)
nx.draw(G, nodelist=list(G.nodes), with_labels=True, pos=pos)
_ = nx.draw_networkx_edge_labels(G, pos=pos)

Graph attributes, added and calculated

At this point, we now have the capability to generate a network graph that has named nodes and edges. The edges of the graph have a calculated distance derived from the GPS locations of the nodes. An application of interest to me is calculating an idealized latency value for the network, i.e. what is the idealized one-way latency between location x and location y.

One-way latency for model

For the global network model that we have incrementally built, we started with cities as nodes of the network; using the GPS coordinates of these cities we created graph edges between the cities and calculated the great circle distance between them.

The next step in this model is to assume that a single fiber optic runs between these cities. This is unrealistic for geographical, technical, electro/optical and political reasons but is adequate for the model that I am building.

Light traveling in an optical fiber travels at a different speed than in free space. In free space light travels at 299792458 m/s (approximated as 3×10^8 m/s). In an optical fiber, light travels slower than in free space, approximately 35% slower, based on the refractive indices of the fiber optic core and its cladding.

A fiber optic is composed of two different materials, the optic core and the optic cladding. As light leaves the optic core and enters the optic cladding it is refracted; at a certain angle (called the critical angle) the refraction results in the light being totally reflected back into the optic core. This results, as shown in the figure above, in the light ray 'bouncing' along the fiber optic from one end to the other.

In the figure an angle θ is shown as the incident and reflected angle; there is also a distance x marked which shows the linear distance along the fiber from one reflection point to another. While the light would travel the horizontal distance x directly in free space, inside the fiber optic it has to travel a longer distance, denoted as x/sin(θ). If we assume θ = 45 degrees, sin(45°) = 0.707, so the light in the fiber optic has to travel x/0.707, which in this example is 1.41 times x; hence the earlier rule of thumb of light traveling roughly 35% slower in a fiber optic.

With this figure, a modeled speed of light in fiber of 2.121×10^8 m/s (2.121×10^5 km/s) can be used.
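
A quick back-of-the-envelope check of that number, assuming the θ = 45 degree geometry described above:

import math

c = 299792458                 # speed of light in free space, m/s
theta = math.radians(45)      # assumed reflection angle inside the fiber
# the ray travels x/sin(theta) to advance x along the fiber axis,
# so its effective speed along the axis is c * sin(theta)
v_fiber = c * math.sin(theta)
print(f'{v_fiber:.4g} m/s')   # ~2.12e8 m/s, i.e. ~2.121e5 km/s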

nodes = []
for name in Cnames:
    nodes.append(name)
    G.add_node(name)
    G.add_edge(name, 'Dublin')
    distance = great_circle(Cities['Dublin'],
                            Cities[name]).kilometers
    G.edges['Dublin', name]['distance'] = round(distance, 1)
    # distance is in km and the modeled speed is 2.121e5 km/s,
    # so the one-way latency is in seconds
    G.edges['Dublin', name]['oneway-latency'] = distance / (2.121 * 10**5)

This latency calculation assumes that the fiber optic is laid in a 'straight line' along the great circle between the two locations and does not include any optical regenerators or intervening optical or networking equipment. As a sanity check, Dublin to San Francisco is roughly 8,200 km, giving a modeled one-way latency of about 8200 / (2.121×10^5) ≈ 0.039 seconds, or 39 ms.

Calculating the distance between 2 GPS Coordinates – Follow on

In a previous post, I discussed calculating distances based on GPS coordinates; subsequently I came across a good Python library that offers this functionality as well as a lot more. I decided that anybody who is interested in the distance calculation via GPS coordinates will probably appreciate everything else that the GeoPy library has to offer.

In addition to the distance calculation, GeoPy offers a client functionality to access several popular geocoding web servers.

From the Readme file of the package’s GitHub site:

geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

geopy includes geocoder classes for the OpenStreetMap Nominatim, Google Geocoding API (V3), and many other geocoding services.

Here is my previous example coded with the GeoPy library:

from geopy.distance import great_circle, geodesic

Dublin = (53.35, -6.27)
SanFrancisco = (37.78, -122.42)

great_circle_dis = great_circle(Dublin, SanFrancisco).kilometers
geodesic_dis = geodesic(Dublin, SanFrancisco).kilometers

Cities = { 'Vancouver':(49.25,-123.1),
    'Portland':(45.52,-122.68),
    'San Francisco':(37.78, -122.42),
    'Seattle':(47.62, -122.33),
    'San Antonio':(29.42, -98.5),
    'Dallas':(32.78,-96.8),
    'Austin':(30.25,-97.75),
    'Dublin':(53.35, -6.27),
    'Sevilla':(37.38, -5.98),
    'Belfast':(54.6,-5.93),
    'Sydney':(-33.87, 151.22),
    'Canberra':(-35.3, 149.12),
    'Tokyo':(35.68, 139.7)}

import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
Cnames = ['Vancouver', 'Portland', 'San Francisco', 'Seattle', 
          'San Antonio', 'Dallas', 'Austin', 'Sevilla', 
          'Belfast', 'Sydney', 'Canberra', 'Tokyo']
nodes = []
for name in Cnames:
    nodes.append(name)
    G.add_node(name)
    G.add_edge(name,'Dublin')
    distance = great_circle(Cities['Dublin'],Cities[name]).kilometers
    G.edges['Dublin',name]['distance'] = round(distance,1)
plt.figure(figsize=(8, 8))
pos = nx.spring_layout(G)
nx.draw(G,nodelist=nodes,with_labels=True,pos=pos)
_ = nx.draw_networkx_edge_labels(G,pos=pos)

A complete example is provided on the following GitHub page.

The GeoPy library is well worth reviewing, as it also has a litany of functions which I did not cover here. As the project notes:

Geocoding is provided by a number of different services, which are not affiliated with geopy. These services provide APIs, which anyone could implement, and geopy is a library which provides these implementations for many different services in a single package.

Calculating the distance between 2 GPS points

With the proliferation of GPS tagging in data sets, a useful calculation is one that derives the distance between two points, which is possible using the Haversine formula. The law of haversines is a more general formula within spherical trigonometry which relates the sides and angles of spherical triangles. On the surface of a sphere, the 'straight line' connecting two points is actually the arc of a curve on the sphere's surface. This arc lies along a great circle of the sphere, and its length is mathematically called the spherical distance.

In the haversine calculation, the radius of the earth varies from around 6356.752 km to 6378.137 km, so choosing a single value will result in an error that can be up to about 0.5%.

From Wikipedia, the calculation is defined as:

d = 2r · arcsin( sqrt( sin²((lat_b − lat_a)/2) + cos(lat_a) · cos(lat_b) · sin²((lon_b − lon_a)/2) ) )

so using this formula, the distance along the great circle arc between two points on the earth's surface in Python is:

from math import asin, cos, radians, sin, sqrt

def arclen(lat_a, lon_a, lat_b, lon_b):
    r = 6370  # mean earth radius; this value gives distances in km
    a1, a2 = radians(lat_a), radians(lon_a)
    b1, b2 = radians(lat_b), radians(lon_b)
    delta_lat = b1 - a1
    delta_lon = b2 - a2
    # the cos() factors apply to the two latitudes
    n = (sin(delta_lat / 2) ** 2
         + cos(a1) * cos(b1) * sin(delta_lon / 2) ** 2)
    n = min(n, 1)  # guard against floating point rounding above 1
    return 2 * r * asin(sqrt(n))
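
For example, using the Dublin and San Francisco coordinates from the follow-on post above:

# Dublin to San Francisco, roughly 8,200 km along the great circle
print(round(arclen(53.35, -6.27, 37.78, -122.42)))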

A Jupyter notebook with a more extensive example is available.

Visual Strategies for Biological Data – A review

I recently came across a compilation of columns that were originally published in Nature Methods. The compilation was subsequently published in 2015 as a document that covers data visualization and presentation of scientific data. It is a brief collection (40 pages) that provides a good set of suggestions when presenting and visualizing complex data sets.

Columns include general topics such as:

  • Color Coding
  • Negative Space
  • Arrows
  • Networks
  • Heat maps

It also covers areas specific to Bioinformatics such as:

  • Representing the genome
  • Representing genomic structure
  • Visualizing biological data

As expected, the information content is high and the presentation is professional, all within 40 pages; what's not to like? I learned a few techniques that will help with my data presentation and visualization moving forward.

Observations on Observability

Introduction

Col John Boyd developed the OODA loop (Observe-Orient-Decide-Act) as a decision making concept for use within military operations. Agility and the ability to rapidly iterate through the loop, and through multiple iterations of the loop, is considered a winning advantage in conflict. A stereotypical example is the 'dogfight', where the fighter pilot that 'gets inside' the other's OODA loop wins the day. The OODA loop is applicable in areas where agile operations lead to better results.

OODA Loop

What does it mean to Observe?

The Observe stage is the first part of an iteration of this loop; within the larger consideration of the OODA loop, let's consider what Observe means.

A non-trivial system will have internal state (which is not visible outside the system boundary), and it will also have external outputs which are visible outside the system boundary. Observability is a measure of how well the (invisible) internal system state can be derived from measurements of its external outputs.

To Observe is defined as 'to regard with attention, especially so as to see or learn something', which aligns with the concept of measuring external outputs (regard with attention) to derive internal state (learn something).

Monitoring versus Observability

Monitoring is another term that appears in this space. Monitoring is similar to, but different from, Observability. Both Monitoring and Observability measure the external outputs of a system. Examples of system outputs that would be measured in enterprise software, SaaS or cloud systems include:

  • Events emitted by the system.
  • Log files generated by the system.
  • Measurements or metrics of the system: response and reaction times, load and resource consumption metrics, activity metrics.

Observability differs from Monitoring in its end goals. Monitoring uses the external measurements or signals of the system to indicate that a problem is occurring or has occurred. Observability, on the other hand, also uses the external measurements to understand the internal state of the system as it exists in this problem state, and to offer pathways to resolve the problem situation. Observability attempts to identify the root cause of the issue which has caused the system to enter this problem state. With the root cause in hand, remediation of this problem state can then begin, whereas Monitoring will only indicate that the (unseen) problem state exists based on the external measurement artifacts. Monitoring by itself will not offer a pathway to resolving this problem state.

Observability can be considered a superset of Monitoring, both in terms of the items that it processes and its objectives. As noted, Monitoring attempts to indicate that a problem is occurring or has occurred, while Observability goes on to attempt to diagnose and aid in the remediation of the situation.

Metrics are a key source of measurement data in both processes, and there are a number of key metrics that provide a foundational model of the state and operation of the system under consideration. These key metrics are:

  • Traffic (or Activity): this metric can appear as queries per second, requests per second, or interactions per second, all of which are a measurement of the interactivity of the system with external entities.
  • Latency: this is a measurement of the responsiveness of a service request that the system is responding to. Fundamentally it is the time that the system takes to respond to and complete a request. The latency figures for service requests may indicate that the system is degrading (slowly grinding to a halt). It is important to differentiate between the latency of successful and failed requests: a request that is failing may fail immediately (giving a very low latency number, so it looks like a fast response to the query), but a failed request's latency figure may not be a true representation of the operating or service state of the system. The latency measurements of successfully served requests will provide information regarding the operating/service state of the system, particularly if the latency numbers are outside the normal operating range or are increasing over a period of time (a sketch of this separation follows this list).
  • Errors: the error rate, or the types of errors that are measured and provided as metrics, also provides a valuable view into the system's operation and its interaction with external entities.
  • Resource Utilization: and particularly saturation of resource utilization within a system, provides an informative viewpoint into the operating state of the system. Having a keen understanding of resource utilization during proper operation of the system, as well as the ability to track deviations in resource utilization, provides good data with which to understand the system's operation. One point worth noting is that an increase in resource utilization away from the baseline measurement may not indicate a problem state, but may just be a true indicator that the system is doing more work or offering more service. If a system was operating at baseline resource utilization while serving 100 transactions per unit of time, a resource utilization increase of 10% may not be an issue, particularly if the system is designed with headroom where there is still a significant amount of unused or available resource left. This 10% increase in utilization may be the result of a 10% increase in the number of transactions it is processing per unit of time. If there is no increase in Traffic, then this increase in resource utilization may be an early indicator of a change in the operating state of the system.
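
As a minimal sketch of the latency point above, here is one way to record success and failure latencies separately (the handler and request are stand-ins, not part of any particular framework):

import time
from collections import defaultdict

# record request latency separately for successes and failures, since a
# fast-failing request can make the system look more responsive than it is
latencies = defaultdict(list)

def timed(handler, request):
    start = time.monotonic()
    try:
        result = handler(request)
        latencies['success'].append(time.monotonic() - start)
        return result
    except Exception:
        latencies['failure'].append(time.monotonic() - start)
        raise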

Logs and Events provide a temporal record of the operation and processing that is happening in a system. Logs and Events may be processed to understand and reconstruct threads of activity as recorded by the events and logs. In addition to logs, Tracing capabilities may be available within the system to allow an abstract tree of operations to be recorded as the system interacts with the external world. Traces may be independently created by various sub-systems within the greater system, and, through either unique transaction or workflow identifiers that appear in the various trace streams or through correlation of different identifiers, a transaction level view of activity can be built. These transaction views allow for a clearer understanding of the internal operation and state transitions of the system, and thus provide valuable input to analysis capabilities that further the Observability goals.
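
A minimal sketch of that correlation step, assuming each sub-system stamps its records with a shared transaction identifier (the record shapes here are invented for illustration):

from collections import defaultdict

# trace records emitted independently by different sub-systems
records = [
    {'txid': 'a1', 'subsystem': 'api', 'event': 'request received'},
    {'txid': 'a1', 'subsystem': 'db',  'event': 'query executed'},
    {'txid': 'b2', 'subsystem': 'api', 'event': 'request received'},
]

# group by transaction id to rebuild a per-transaction view of activity
transactions = defaultdict(list)
for rec in records:
    transactions[rec['txid']].append((rec['subsystem'], rec['event']))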