Sunday, March 16, 2008

ESRI shapefile structure

Any ESRI shapefile consists of:

  • A main file (*.shp)

  • An index file (*.shxd)

  • A DataBase table (*.dbf)

The main file is a direct access, variable-record-length file in which each record describes a shape with a list of its verticies.

In the index file, each record contains the offset of the corresponding main file record from the beginning of the main file.

The dBASE table contains feature attributes with one record per feature. The one-to-one relationship between geometry and attributes is based on record number.


** The main file (.shp) contains a fixed-length file header followed by variable-length records. Each variable-length record is made up of a fixed-length record header followed by variable-length record contents.

The header for each record stores the record number and content length for the record.Record headers have a fixed length of 8 bytes.

Shapefile record contents consist of a shape type followed by the geometric data for the shape. The length of the record contents depends on the number of parts and verticies in a shape. For each shape type, we first describe the shape and then its mapping to record contents on disk.
Shapes are from simple "Points" till Multiple Polygons. (Detailed description of record structures and tables can be found in the reference document)



** The index file (.shx) contains a 100-byte header followed by 8-byte, fixed-length records.
The index file header is identical in organization to the main file header.
The I’th record in the index file stores the offset and content length for the I’th record in the main file.


** The dBASE file (.dbf) contains any desired feature attributes or attribute keys to which other tables can be joined. Its format is a standard DBF file used by many table-based applications in Windows™ and DOS.

For more information on the dBASE file format, visit http://www.inprise.com/

Ref: http://svn2.assembla.com/svn/boogy/eBooks/shapefile.pdf (use your SVN password)

Friday, March 7, 2008

The Hows and Whys of Commenting

Comments should not be overly long. Comments should not give details for the sake of details; only when a fact is necessary or interesting should it be brought to the attention of the program's reader. If you were reading someone else's program, you would not want to be forced to pick through paragraphs of text describing the intricacies of a for loop's operation only to realize that you could have discovered the same information simply by having read the for loop.
 
For the sake of future readers, you should generally include some header information at the top of your program. This information may include your name and contact information, the date the code was last modified, the purpose of the program, and if necessary, a brief exposition of the algorithm used or the design decisions made. You may also want to include a list of known bugs, inefficiencies, or suggestions for improvement. It is convenient to demarcate this section of the program file in a large comment block of the form
/********
* *
* *
********/
 
Second, whenever you create a new class or a new function definition, you should add a comment explaining what the class or function does.
-For a class, you should explain the purpose of the class, what publicly accessible functions and variables are available, any limitations of the class, and information that the programmer may need if he or she wanted to inherit from the class.
-When defining a function, you should describe what the function does and whether or not it has side effects such as changing global variables, interacting with the user, or so forth. It is also convenient to describe what sort of arguments the function takes and what, if any, value it returns: e.g., if you have a function findTime(int distance, int speed), you would want to tell the user that the distance is in rods, and the speed is in furlongs per fortnight, and that the function returns the time taken by the travel in epochs.
 
Third, when adding comments directly to your code, be spare. The less you say, the better; if you force yourself to be precise, you will make the comments more helpful, and you will force yourself to synthesize the flow of your program rather than allow yourself to repeat what the code already tells the user. The code should fit seamlessly into the algorithm rather than wrap entirely around it and smother the logic embedded in the C++.
 
Fourth, good commenting can improve your programming. You can use them as an organizational device by including comments prior to filling in code -- for instance, when you have a long block of conditional statements, you may wish to comment what should happen when each conditional is executed before you flesh out the code. In doing so, you save yourself the burden of remembering the details of the entire program, allowing you to concentrate on the implementation of one aspect at a time. Additionally, when you force yourself to comment during the programming process, you cannot get away with writing code that "you hope will work" if you make yourself explain why it works. (Keep in mind that if the code is so complex that you don't know how it works, then you probably should be commenting it for the sake of both yourself and others.)
 
Finally, keep in mind that what seems obvious now may not seem obvious later. While you shouldn't excessively comment, do make sure to comment things that are nonstandard algorithms. You do not need to comment a programming idiom, but you do want to comment an algorithm you designed for the program, no matter how simple it may seem to you. No doubt it will seem foreign three weeks after you write the code, and if you plan (and eve n if you do not plan) to come back to the code, it will be immeasurably helpful.


reference:
http://www.cprogramming.com/tutorial/comments.html

Naming Conventions Guide

General Naming Conventions

There are a lot of common naming conventions for classes, functions and objects. Usually these are broken into several broad categories: c-style naming, camelCase, and CamelCase. C-style naming separates words in a name using underscores: this_is_an_identifer. There are two forms of camelCase: one that begins with a lowercase letter and then capitalizes the first letter of every ensuing word, and one that capitalizes the first letter of every single word.

One popular convention is that leading capital letter CamelCase is used for the names of structs and classes, while normal camelCase is used for the names of functions and variables (although sometimes variables are written in c-style to make the visual separation between functions and variables more clear).

It can be useful to use prefixes for certain types of data to remind you what they are: for instance, if you have a pointer, prefixing it with "p_" tells you that it's a pointer. If you see an assignment between a variable starting with "p_" and one that doesn't begin with "p_", then you immediately know that something fishy is going on. It can also be useful to use a prefix for global or static variables because each of these has a different behavior than a normal local variable. In the case of global variables, it is especially useful to use a prefix in order to prevent naming collisions with local variables (which can lead to confusion).




Length of identifiers

A fundamental element of all naming conventions are the rules related to identifier length (i.e., the finite number of individual characters allowed in an identifier). Some rules dictate a fixed numerical bound, while others specify less precise heuristics or guidelines.
Identifier length rules are routinely contested in practice, and subject to much debate academically.
Some considerations:
-shorter identifiers may be preferred as more expedient, because they are easier to type
-extremely short identifiers (such as 'i' or 'j') are very difficult to uniquely distinguish using automated search and replace tools
-longer identifiers may be preferred because short identifiers cannot encode enough information or appear too cryptic
-longer identifiers may be disfavored because of visual clutter




Hungarian Notation
The original idea behind Hungarian notation, however, was more general and useful: to create more abstract "types" that describe how the variable is used rather than how the variable is represented. This can be useful for keeping pointers and integers from intermixing, but it can also be a powerful technique for helping to separate concepts that are often used together, but that should not be mixed.



Abbrevations
Abbreviations are dangerous--vowels are useful and can speed up code reading. Resorting to abbreviations can be useful when the name itself is extremely long because names that are too long can be as hard to read as names that are too short. When possible, be consistent about using particular abbreviations, and restrict yourself to using only a small number of them. Common abbreviations include "itr" for "iterator" or "ptr" for pointer. Even names like i, j, and k are perfectly fine for loop counter variables (primarily because they are so common). Bad abbreviations include things like cmptRngFrmRng, which at the savings of only a few letters eliminates a great deal of readability. If you don't like typing long names, look into the auto-complete facilities of your text editor. You should rarely need to type out a full identifier. (In fact, you rarely want to do this: typos can be incredibly hard to spot.)

references:

Tuesday, March 4, 2008

R Tree

R Tree

R-trees are tree data structures that are similar to B-trees, but are used for spatial access methods i.e., for indexing multi-dimensional information; for example, the (X, Y) coordinates of geographical data. A common real-world usage for an R-tree might be: "Find all museums within 2 miles of my current location".

The data structure splits space with hierarchically nested, and possibly overlapping, minimum bounding rectangles (MBR, otherwise known as bounding boxes, i.e. "rectangle", what the "R" in R-tree stands for).

Each node of an R-tree has a variable number of entries (up to some pre-defined maximum). Each entry within a non-leaf node stores two pieces of data: a way of identifying a child node, and the bounding box of all entries within this child node.

The insertion and deletion algorithms use the bounding boxes from the nodes to ensure that "nearby" elements are placed in the same leaf node (in particular, a new element will go into the leaf node that requires the least enlargement in its bounding box). Each entry within a leaf node stores two pieces of information; a way of identifying the actual data element (which, alternatively, may be placed directly in the node), and the bounding box of the data element.
Similarly, the searching algorithms (for example; intersection, containment, nearest) use the bounding boxes to decide whether or not to search inside a child node. In this way, most of the nodes in the tree are never "touched" during a search. Like B-trees, this makes R-trees suitable for databases, where nodes can be paged to memory when needed.

Different algorithms can be used to "split" nodes when they become too full, resulting in the "Quadratic" and "Linear" R-tree sub-types.

R-trees do not historically guarantee good worst-case performance, but generally performed well with real-world data. However, a new algorithm was published in 2004 that defines the Priority R-Tree, which claims to be as efficient as the currently most efficient methods and is at the same time worst-case optimal.


-------------------------

Source: Wikipedia.