[aka: A case of investigating the possibilities of premature optimisation in a not-even-started project]

In the MySQL days, getting a unique key was a matter of creating a field and tagging it with the auto_increment feature. The database engine would do the rest, and nice sequential unique numbers were added for each record. The predictability of these keys made them less useful in situations where such a key is visible, for instance in a URL.

One easy way to get rid of the sequence is to use a UUID instead. When using a Neo4j graph database, nothing is easier than adding a uuid property and setting its value. If you need a unique key for a node, this is the preferred way. You might be tempted to use the id() of a node, but don’t do that: there is no guarantee that a particular node will keep that number forever.
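As a minimal sketch of what such a key looks like, Python's standard uuid module can generate a random (version 4) UUID. The variable name key is only illustrative.

```python
from uuid import uuid4

# Generate a random (version 4) UUID and take its canonical string form,
# which is the kind of value that would be stored in the uuid property.
key = str(uuid4())

print(key)       # 32 hex digits separated by 4 hyphens
print(len(key))  # 36
```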

One interesting approach to adding UUIDs to graphs is extending Neo itself, but I’d prefer to do it in the application to keep a bit more control, as I only want uuids on some nodes, not on all of them.

In an existing project which uses Neomodel I typically define it like this:

from neomodel import (StructuredNode, StringProperty)
from uuid import uuid4

class ContentItem(StructuredNode):
    uuid = StringProperty(default=uuid4, unique_index=True)
    title = StringProperty(required=True)
    ...

Whenever a node is saved, it will automatically call the uuid4 function, which returns a nice new uuid that is used as the key. If there already is a key, nothing is done. Good stuff. When you add a uuid like that later, don’t forget to re-save all your nodes; otherwise the uuid field will have a random new value every time you retrieve the node. That might do funny things to your app 😉
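To illustrate why an unsaved default keeps changing, here is a small sketch in plain Python. The Node class is a hypothetical stand-in written for this example and is not how Neomodel is actually implemented; it only mimics the idea of a callable default that is used whenever no stored value exists.

```python
from uuid import uuid4


class Node:
    """Hypothetical stand-in for a mapped node; not Neomodel's actual code."""

    def __init__(self, stored):
        # 'stored' plays the role of the properties kept in the database.
        self.stored = dict(stored)

    @property
    def uuid(self):
        # A stored value wins; otherwise the default callable is invoked,
        # producing a fresh value on every access.
        if "uuid" in self.stored:
            return self.stored["uuid"]
        return str(uuid4())

    def save(self):
        # Saving persists a generated value, so later reads are stable.
        self.stored.setdefault("uuid", str(uuid4()))


old = Node({"title": "created before the uuid field existed"})
print(old.uuid == old.uuid)  # False: a new default on each access

old.save()
print(old.uuid == old.uuid)  # True: the stored key is reused
```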

For a new project which is starting I’d like to get rid of the Neomodel dependency for reasons outlined in this previous post. While pondering it I realised that during imports I might need 30.000 or more uuids in a single go. This made me wonder how fast generating them actually is.

from uuid import uuid4
for i in range(0,30000):
    u = uuid4()

and run

atom:uuidtest paulj$ time python uid.py

real	0m0.622s
user	0m0.385s
sys	0m0.223s

Not too bad, so how about a million of them?

real	0m17.488s
user	0m11.711s
sys	0m5.749s

It’s getting hard work, so how about 10.000.000?

real	3m9.005s
user	2m5.535s
sys	1m2.976s
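The time command also counts interpreter start-up and shutdown. As a rough sketch for timing only the generation loop, the standard time.perf_counter can be used. The helper name time_uuids is made up for this example, and the numbers will differ per machine.

```python
from time import perf_counter
from uuid import uuid4


def time_uuids(count):
    """Return the seconds needed to generate `count` random UUIDs."""
    start = perf_counter()
    for _ in range(count):
        uuid4()
    return perf_counter() - start


# Time two batch sizes; larger counts can be added to the tuple.
for count in (30000, 100000):
    elapsed = time_uuids(count)
    print("%8d uuids in %.3f s" % (count, elapsed))
```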

The chances of needing these amounts are less than zero, but it is Friday and it has been ages since I did some premature optimisation, so just out of interest I decided to check how Go is doing in the uuid domain.

The first hit on “golang uuid” is this Stack Overflow discussion which mentions two packages: github.com/nu7hatch/gouuid and github.com/twinj/uuid

Both are very similar in usage:

package main

import "github.com/nu7hatch/gouuid"

//import "fmt"

func main() {
	for i := 0; i < 10000000; i++ {
		_, _ = uuid.NewV4()

	}
}

and:

package main

import "github.com/twinj/uuid"

//import "fmt"

func main() {
	//u := uuid.NewV4()
	//fmt.Println(u)
	for i := 0; i < 10000000; i++ {
		_ = uuid.NewV4()
		// u := uuid.NewV4()
		// fmt.Println(u)
	}
}

I commented out the statements I used to check the output; we are not benchmarking stdio. The _ is Go's funny way of indicating that you deliberately ignore a return value. Assigning the result to a named variable that is never used would stop the program from compiling. Both are very similar in performance:

atom:uuidtest paulj$ time go run nu7hatch.go

real	0m12.164s
user	0m2.348s
sys	0m10.512s

and

atom:uuidtest paulj$ time go run twinj.go

real	0m12.136s
user	0m2.281s
sys	0m10.405s

Where Python needs minutes, Go goes through it in seconds, and this appears to be only the start. Somewhat surprised that there was no "default" package in Go for generating UUIDs, I googled some more and found a few others. One of them is code.google.com/p/go-uuid/uuid, which is a lot faster still:

package main

import "code.google.com/p/go-uuid/uuid"
//import "fmt"

func main() {
	for i := 0; i < 10000000; i++ {
		//u := uuid.NewUUID()
		//fmt.Println(u)
		_ = uuid.NewUUID()

	}
}

and its results:

atom:uuidtest paulj$ time go run go_uuid.go

real	0m2.248s
user	0m2.200s
sys	0m0.153s

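The generated UUIDs are less random than those of the other two packages, because NewUUID() returns time-based (version 1) values rather than random (version 4) ones, so the comparison is not quite like for like: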

8bb7b767-1101-11e5-aeeb-94de80b5e095
8bb7c4c0-1101-11e5-aeeb-94de80b5e095
8bb7c4eb-1101-11e5-aeeb-94de80b5e095

versus

{413F7C38-BC59-4FED-95BD-65158DAF9FE7}
{0265D466-01FD-4654-A0C4-953CCE4726B5}
{A0B17BE3-7BE3-4077-8677-EAC4D18CA92B}

so it might be good to look at the actual implementation before settling on a specific package. And as mentioned, there are many others. Googling for "golang uuid" opens a world of different implementations. Obviously others besides me were wondering about the "best one" as early as 2013.
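One way to see what a package produces without reading its source is to inspect the version field of the UUIDs themselves. A small sketch using Python's standard uuid module on the sample values shown above (the list names are just for this example):

```python
from uuid import UUID

# Sample values copied from the output above.
time_based = [
    "8bb7b767-1101-11e5-aeeb-94de80b5e095",
    "8bb7c4c0-1101-11e5-aeeb-94de80b5e095",
    "8bb7c4eb-1101-11e5-aeeb-94de80b5e095",
]
random_based = [
    "{413F7C38-BC59-4FED-95BD-65158DAF9FE7}",
    "{0265D466-01FD-4654-A0C4-953CCE4726B5}",
    "{A0B17BE3-7BE3-4077-8677-EAC4D18CA92B}",
]

# UUID() accepts the braced, upper-case form as well.
print([UUID(s).version for s in time_based])    # [1, 1, 1]
print([UUID(s).version for s in random_based])  # [4, 4, 4]

# Version 1 values embed a node identifier, identical in all three samples.
print({"%012x" % UUID(s).node for s in time_based})  # {'94de80b5e095'}
```

The shared node part is why the first set looks so similar from one value to the next, which matters if the key ends up somewhere visible.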

Perhaps one "best" implementation will end up in the standard library one day. For now, doing 30.000 plus in Python seems very OK, and if needed there is an escape.

Remember: there are lies, damned lies and statistics. But benchmarks top all of these 😉
These were done on my personal workstation, a Core i7 @ 3.7 GHz running Yosemite.