Bogy

Sunday, March 16, 2008

ESRI shapefile structure

Any ESRI shapefile consists of:

A main file (*.shp)

An index file (*.shxd)

A DataBase table (*.dbf)

The main file is a direct access, variable-record-length file in which each record describes a shape with a list of its verticies.

In the index file, each record contains the offset of the corresponding main file record from the beginning of the main file.

The dBASE table contains feature attributes with one record per feature. The one-to-one relationship between geometry and attributes is based on record number.

** The main file (.shp) contains a fixed-length file header followed by variable-length records. Each variable-length record is made up of a fixed-length record header followed by variable-length record contents.

The header for each record stores the record number and content length for the record.Record headers have a fixed length of 8 bytes.

Shapefile record contents consist of a shape type followed by the geometric data for the shape. The length of the record contents depends on the number of parts and verticies in a shape. For each shape type, we first describe the shape and then its mapping to record contents on disk.
Shapes are from simple "Points" till Multiple Polygons. (Detailed description of record structures and tables can be found in the reference document)

** The index file (.shx) contains a 100-byte header followed by 8-byte, fixed-length records.
The index file header is identical in organization to the main file header.
The I’th record in the index file stores the offset and content length for the I’th record in the main file.

** The dBASE file (.dbf) contains any desired feature attributes or attribute keys to which other tables can be joined. Its format is a standard DBF file used by many table-based applications in Windows™ and DOS.

For more information on the dBASE file format, visit http://www.inprise.com/

Ref: http://svn2.assembla.com/svn/boogy/eBooks/shapefile.pdf (use your SVN password)

Friday, March 7, 2008

The Hows and Whys of Commenting

Comments should not be overly long. Comments should not give details for the sake of details; only when a fact is necessary or interesting should it be brought to the attention of the program's reader. If you were reading someone else's program, you would not want to be forced to pick through paragraphs of text describing the intricacies of a for loop's operation only to realize that you could have discovered the same information simply by having read the for loop.
　
For the sake of future readers, you should generally include some header information at the top of your program. This information may include your name and contact information, the date the code was last modified, the purpose of the program, and if necessary, a brief exposition of the algorithm used or the design decisions made. You may also want to include a list of known bugs, inefficiencies, or suggestions for improvement. It is convenient to demarcate this section of the program file in a large comment block of the form
/********
* *
* *
********/
　
Second, whenever you create a new class or a new function definition, you should add a comment explaining what the class or function does.
-For a class, you should explain the purpose of the class, what publicly accessible functions and variables are available, any limitations of the class, and information that the programmer may need if he or she wanted to inherit from the class.
-When defining a function, you should describe what the function does and whether or not it has side effects such as changing global variables, interacting with the user, or so forth. It is also convenient to describe what sort of arguments the function takes and what, if any, value it returns: e.g., if you have a function findTime(int distance, int speed), you would want to tell the user that the distance is in rods, and the speed is in furlongs per fortnight, and that the function returns the time taken by the travel in epochs.
　
Third, when adding comments directly to your code, be spare. The less you say, the better; if you force yourself to be precise, you will make the comments more helpful, and you will force yourself to synthesize the flow of your program rather than allow yourself to repeat what the code already tells the user. The code should fit seamlessly into the algorithm rather than wrap entirely around it and smother the logic embedded in the C++.
　
Fourth, good commenting can improve your programming. You can use them as an organizational device by including comments prior to filling in code -- for instance, when you have a long block of conditional statements, you may wish to comment what should happen when each conditional is executed before you flesh out the code. In doing so, you save yourself the burden of remembering the details of the entire program, allowing you to concentrate on the implementation of one aspect at a time. Additionally, when you force yourself to comment during the programming process, you cannot get away with writing code that "you hope will work" if you make yourself explain why it works. (Keep in mind that if the code is so complex that you don't know how it works, then you probably should be commenting it for the sake of both yourself and others.)
　
Finally, keep in mind that what seems obvious now may not seem obvious later. While you shouldn't excessively comment, do make sure to comment things that are nonstandard algorithms. You do not need to comment a programming idiom, but you do want to comment an algorithm you designed for the program, no matter how simple it may seem to you. No doubt it will seem foreign three weeks after you write the code, and if you plan (and eve n if you do not plan) to come back to the code, it will be immeasurably helpful.

reference:
http://www.cprogramming.com/tutorial/comments.html

Naming Conventions Guide

General Naming Conventions

There are a lot of common naming conventions for classes, functions and objects. Usually these are broken into several broad categories: c-style naming, camelCase, and CamelCase. C-style naming separates words in a name using underscores: this_is_an_identifer. There are two forms of camelCase: one that begins with a lowercase letter and then capitalizes the first letter of every ensuing word, and one that capitalizes the first letter of every single word.

One popular convention is that leading capital letter CamelCase is used for the names of structs and classes, while normal camelCase is used for the names of functions and variables (although sometimes variables are written in c-style to make the visual separation between functions and variables more clear).

It can be useful to use prefixes for certain types of data to remind you what they are: for instance, if you have a pointer, prefixing it with "p_" tells you that it's a pointer. If you see an assignment between a variable starting with "p_" and one that doesn't begin with "p_", then you immediately know that something fishy is going on. It can also be useful to use a prefix for global or static variables because each of these has a different behavior than a normal local variable. In the case of global variables, it is especially useful to use a prefix in order to prevent naming collisions with local variables (which can lead to confusion).

Length of identifiers

A fundamental element of all naming conventions are the rules related to identifier length (i.e., the finite number of individual characters allowed in an identifier). Some rules dictate a fixed numerical bound, while others specify less precise heuristics or guidelines.
Identifier length rules are routinely contested in practice, and subject to much debate academically.
Some considerations:
-shorter identifiers may be preferred as more expedient, because they are easier to type
-extremely short identifiers (such as 'i' or 'j') are very difficult to uniquely distinguish using automated search and replace tools
-longer identifiers may be preferred because short identifiers cannot encode enough information or appear too cryptic
-longer identifiers may be disfavored because of visual clutter

Hungarian Notation

The original idea behind Hungarian notation, however, was more general and useful: to create more abstract "types" that describe how the variable is used rather than how the variable is represented. This can be useful for keeping pointers and integers from intermixing, but it can also be a powerful technique for helping to separate concepts that are often used together, but that should not be mixed.

Abbrevations

Abbreviations are dangerous--vowels are useful and can speed up code reading. Resorting to abbreviations can be useful when the name itself is extremely long because names that are too long can be as hard to read as names that are too short. When possible, be consistent about using particular abbreviations, and restrict yourself to using only a small number of them. Common abbreviations include "itr" for "iterator" or "ptr" for pointer. Even names like i, j, and k are perfectly fine for loop counter variables (primarily because they are so common). Bad abbreviations include things like cmptRngFrmRng, which at the savings of only a few letters eliminates a great deal of readability. If you don't like typing long names, look into the auto-complete facilities of your text editor. You should rarely need to type out a full identifier. (In fact, you rarely want to do this: typos can be incredibly hard to spot.)

references:

http://www.cprogramming.com/tutorial/style_naming_conventions.html

http://geosoft.no/development/cppstyle.html

http://www.possibility.com/Cpp/CppCodingStandard.html

Tuesday, March 4, 2008

R Tree

R-trees are tree data structures that are similar to B-trees, but are used for spatial access methods i.e., for indexing multi-dimensional information; for example, the (X, Y) coordinates of geographical data. A common real-world usage for an R-tree might be: "Find all museums within 2 miles of my current location".

The data structure splits space with hierarchically nested, and possibly overlapping, minimum bounding rectangles (MBR, otherwise known as bounding boxes, i.e. "rectangle", what the "R" in R-tree stands for).

Each node of an R-tree has a variable number of entries (up to some pre-defined maximum). Each entry within a non-leaf node stores two pieces of data: a way of identifying a child node, and the bounding box of all entries within this child node.

The insertion and deletion algorithms use the bounding boxes from the nodes to ensure that "nearby" elements are placed in the same leaf node (in particular, a new element will go into the leaf node that requires the least enlargement in its bounding box). Each entry within a leaf node stores two pieces of information; a way of identifying the actual data element (which, alternatively, may be placed directly in the node), and the bounding box of the data element.
Similarly, the searching algorithms (for example; intersection, containment, nearest) use the bounding boxes to decide whether or not to search inside a child node. In this way, most of the nodes in the tree are never "touched" during a search. Like B-trees, this makes R-trees suitable for databases, where nodes can be paged to memory when needed.

Different algorithms can be used to "split" nodes when they become too full, resulting in the "Quadratic" and "Linear" R-tree sub-types.

R-trees do not historically guarantee good worst-case performance, but generally performed well with real-world data. However, a new algorithm was published in 2004 that defines the Priority R-Tree, which claims to be as efficient as the currently most efficient methods and is at the same time worst-case optimal.

-------------------------

Source: Wikipedia.

Thursday, February 7, 2008

Efficient map-matching of large GPS data sets

1. Distance between a point and a road segment

Let G(V,E) be the directed graph describing the road network. V is the set of vertexes or nodes and E is the set of edges or links. Let Q_i be the set of points given by the GPS data stream i = 1...T. Each GPS point consists of a pair of coordinates (x_i, y_i) and a time-stamp t_i. To measure the distance between a GPS point Q and an oriented link AB, we use the istance as introduced in [8]. Let Q` be the projection of Q on line AB. The distance is defined as follows:

Where de denotes the Euclidean distance, With this definition, d_Q,AB = d_Q,BA and a point is equally distant from the two opposite directed links of a given road segment.

For this reason, a small shift perpendicular to the link direction is introduced, according to the driving side of the studied area. Therefore, in the computation of the distance, an oriented link AB is replaced by A`B` where A` = A + λ1_┴and B` = B + λ1_┴ and 1_┴ ┴ AB, λ is chosen to reflect the average physical distance between the middle of the road and the middle of the driving lanes (in our application, ┴ = 3m). The computation of the distance between a point and a road segment is illustrated on Figure 1 where Q,Q1 and Q2 are equidistant to AB.

2. Score of a path

Let P {E₁,E₂, . . . ,E_p} denote the path composed of the p subse

quent links E₁,E₂, . . . ,E_p.

The absolute score F of path P in matching the sequence Q_i is defined as:

Where δ_ij = 1 if Q_i was matched by link E_j and δ_ij = 0 otherwise. The problem is to find the path P* that minimizes F. The relative score F_r is defined as F_r = F/T.

F_r will be used as a surrogate for the estimation of the error of the algorithm. Note that F_ris highly dependent on the resolution of the coded network. Each point is matched by only one link of the path, so that:

3. Initialization

The algorithm is initialized by finding the set S of the N nearest links from the first GPS point Q₁. Note that we could use in practice a few more GPS points so that the initialization is more robust and less dependent on potential initial error in the measurement. For each link in the set, we create a one-link path with the initial link.

The path P₁, P₂, . . . , P_N are ranked according to their scores: F₁ ≤ F₂ ≤ . . . ≤ Fn.

Technically, they are stored in a sorted set. Alphabetical rank is used to discriminate paths that are different but that have the same score. Searching for the nearest link can be accelerated using quad-tree storage for the links. Experiments done on a network with 400, 000 links showed performances of about 1 million of points per minute on a PC clocked at 2Ghz.

4. Breaking routes into paths

In most cases, a whole route will not be matched by a single continuous path because of interruptions in the GPS data stream. These interruptions can be due to tunnels, tree canopy, poor signal, etc. Therefore, a route will often be matched by a sequence of paths. We do not address here the problem of connecting these paths together.

The outer loop of the algorithm processes the GPS points sequentially. Assume Q_i is the next data point to be processed. If the GPS device is searching for the signal, the point is discarded. If de (Q_i-1,Q_i) is bigger than a given threshold (e.g. 300 meters) or if t_i-1−t_i is longer than a given period (e.g. 30 seconds), the matching is stopped for the current path and a fresh initialization is started with Q_i.

Obviously, this procedure could be improved to be less sensitive on outliers.

5. Following the network topology

Given a new point Q_i, the following procedure is applied for each of the path P in the set S.

Assume that Q_i is matched by the last link E_p, that is δ_ip = 1.
Update the score of path P using (1).
Insert P in a new sorted set V.
Check if the route has reached the next intersection (i.e. the destination of E_p).
If yes, create the new paths P_K^FS that correspond to the forward star of E_p.
Update the score of paths P_K^FS using (1).
Insert P_K^FS in the new sorted set V.

When the condition is met on step 4, new paths are created depending on the network topology. However, it is not obvious to determine if the vehicle has reached the end of the last link. We use the bird distance covered by the GPS points since the beginning of E_p to determine if the vehicle might have cruised the whole link.

Let Q_p0 the first point matched by E_p and L(E_p) the length of E_p. If the following condition holds,

Then we assume that the vehicle has reached the destination of E_p. A problem arises if a vehicle performs a u-turn in the middle of a long link. The parameter α = 0.5 is introduced for that reason.

Let FS (E_p) be the set of downstream links from E_p (forward star). If the condition(2) holds, a new path P_K^FS is created for each of the link in FS (E_p). These children paths differ from their ancestor since they have one more link: E^k_p+1Є FS(Ep) where E^k_p+1 is the last link of path P_K^FS. The children paths are initialized with the last point so that δ_ip+1 = 1. The path creation mechanism is represented on Figure 2 where a vehicle is driving from O to D. At the intersection A, three children paths are created (P₁, P₂, P₃). Clearly the first points close to A do not allow discriminating which direction has been taken. Indeed, the first point suggests AC was the chosen direction. Obviously, the direction was AD but the network description was not accurate enough to describe the curvature of the road section.

Initially, the score of P₃ is better than that of P₂. However, as more points are added, the score of P₂ will dominate P₃ and P₁ in the end.

6. Maintaining a set of candidate paths

Children paths are inserted into the sorted set V and ranked according to their score. Consequently, V has more elements than S. When Q_i is processed, S is emptied and only the N best elements of V are inserted into S. This avoids that the sorted set S grows indefinitely. In practice, the branching and the path creation can happen for a range of different Q_i as soon as the condition (2) is met. The range depends on α. Therefore, there are generally several instances of paths P₁, P₂, P₃ in the S, depending on the point Q_i when the branching occurred. Eventually, the less efficient paths are discarded. The sorted set S has to be large enough to keep track of unlikely trajectories that might turn to be correct ultimately as illustrated on the previous example. Intuitively, the number of candidate paths to maintain is a function of the resolution of the coded network.

When the last point Q_i is processed, the best path from S is kept as the result of the map-matching process. Similarly, when the route is broken and a new path has to be started; only the best path from S is kept as the best matching part for the previous GPS points. The flow-chart of the whole algorithm is presented on Figure 3. It depends on two parameters: α and the size of the set of candidates N.

References:

Efficient map-matching of large GPS data sets - Tests on a speed monitoring experiment in Zurich, F.Marchal_ J. Hackney K.W. Axhausen†, 26th July 2004.

Links:

http://www.strc.ch/pdf_2005/STRC05_D2_Marchal.pdf

Tuesday, January 29, 2008

Map Matching I

Problem Statment:
In general , Global Positioning System (GPS) has been widely used in Vehicle Navigation Systems. It helps the users to determine the vehicle position or provides users with proper maneuver instruction .Because of some biases, such as SA effect, buildings canyon around cities , thus those affect and block GPS signals .In addition , the road network of the digital Map is not accurate enough. These cause vehicle position not to locate precisely on the road network of the digital map. Therefore a map-matching algorithm is necessary. An optimal algorithm is developed through the analysis of some driving conditions (such as turning. U-turn or drive to parking lots).The algorithm is presented.

How the algorithm works:
Assuming that GPS point is G1 with coordinate (XG1, YG1), from-node of
road is S1 with coordinate (XS1, YS1), to-node of road is E1 with coordinate (EE1, YE1).
The value of K can be determined.

where θ is the angle between vector S1G1 and S1E1

1) if 0<= K <=1 then the prependicular distance between G1 and the road is:

( (XG1-XN)^2+(YG1-YN)^2)^1/2

The (Xn, Yn) are the coordinates of the projected point of the road:
XN=K*(XE1-XS1)+ X S1
YN=K*(YE1-YS1)+ YS1

2) If K <>

3) If K > 1, then the vehicle location is not on the candidate road, and the location is away from the to-note E1

in case like the following:

the candidate road of G1 is either line1 or line2. The possibility of making wrong decision is very high. To solve this problem we consider the vehicle traveling direction. In digital road map database the true azimuth (which is a mathematical concept defined as the angle)of road is recorded, and NMEA data also includes the vehicle direction. Therefore, we can compare these two any try to select the candidate road.

After selection of the candidate road, We assume current vehicle location is in G1, next vehicle location is G2, if the distance between G1, and E1, (end-node of road segment) are close to a pre-set value, the next turning candidate road will be Line1, Line2 and Line3. G1E, is the azimuth of traveling direction from G1, to E1, point, G2E, is the azimuth from E1, to G2, point. These values of azimuth will be used for judging which road is the candidate. If driver has a U-turn, the from-node and to-node of road segment must be switched. Assuming current vehicle location is at G1, point, the location of former epoch is G0, endnode of road segment is E1,

ΔHEADING = G1a - G0a

G1a: azimuth of G1E1,G0a : azimuth of G0E1

If ΔDISTANCE is negative and larger then a tolerant value and
ΔHEADING > Tolerance, we presume, we presume vehicle has a U-Turn.

Vector Maps

Vector Map (VMAP), AKA Vector Smart Map, is a vector-based collection of GIS data covering the earth at various detail levels. Level 0 (low resolution) coverage is global, and is entirely in the public domain. Level 1 (global coverage at medium resolution) is only partly in the public domain.

Description

Coordinate reference system: Geographic coordinates stored in decimal degrees with southern and western hemispheres using a negative latitude and longitude

Thematic data layers

Features and data attributes are tagged utilizing the international Feature and Attribute Coding Catalogue (FACC).

major road networks
railroad networks
hydrologic drainage systems
utility networks (cross-country pipelines and communication lines)
major airports
elevation contours
coastlines
international boundaries
populated places
index of geographical names

Levels of resolution

The vector map product comes in three flavors: low resolution (level 0), medium resolution (level 1) and high resolution (level 2).

Level Zero (VMAP0)

Level 0 provides worldwide coverage of geo-spatial data and is equivalent to a small scale (1:1,000,000). The data are offered either on CD-ROM or as direct download, as they have been moved to the public domain. Data are structured following the Vector Product Format (VPF).

Data sets

The entire coverage has been divided into four data sets:

North America (NOAMER)
Europe and North Asia (EURNASIA)
South America, Africa, and Antarctica (SOAMAFR)
South Asia and Australia (SASAUS).

Level One (VMAP1)

Level 1 data are equivalent to a medium scale resolution.

Horizontal accuracy: 125-500m
Vertical accuracy: 0.5-2m

Data sets

VMAP Level 1 is divided in 234 geographical tiles. Only 57 of them are currently (2006) available for download from NGA. Amongst the available datasets, coverage can be found for parts of:

Costa Rica
Libya
United States
Mexico
Iraq
Russia
Panama
Colombia
Japan

Level Two (VMAP2)

Level 2 data are equivalent to a large scale resolution.

Debate about availability of data

The U.S.A. Freedom of Information Act and the Electronic Freedom of Information Act guarantee access to virtually all GIS data created by the US government. Following the trend of the United States, much of the VMAP data has been offered to the public domain.

Still many countries consider mapping and cartography a state monopoly; for such countries, the VMAP Level1 data are kept out of the public domain. With the privatization of public agencies, some data are sold for a profit.

Efforts are being made by various public groups towards moving all VMAP1 data to the public domain.

Further steps have been taken (by the Free World Maps Foundation among others) to licence the data under the GNU General Public License while remaining copyrighted, as an alternative to the public domain. This is an ongoing debate (as of 2006), of which the outcome is yet to be seen.

Tools to read and convert VMAP data

VPFView (V2.1) - developed by NIMA, is available from NGA or USGS (as part of the NIMAMUSE package); this tool can render simple plots and export GIS data to other GIS file formats
"OGR with OGDI driver": this free software tool can convert VMAP format to standard GIS file formats such as SHAPE, PostGIS etc.

References

Main Reference

http://en.wikipedia.org/wiki/Vector_Map

NGA Home Page

http://www.nga.mil

Good Article about vector product format

http://www.nga.mil/portal/site/nga01/index.jsp?epi-content=GENERIC&itemID=a2986591e1b3af00VgnVCMServer23727a95RCRD&beanID=1629630080&viewID=Article

Vector product format standards

http://www.nga.mil/NGASiteContent/StaticFiles/OCR/vpf_main.pdf