## Monday, 8 July 2013

### What Every Computer Scientist Should Know About Random Numbers

This article outlines a few key things you should know about random numbers if you're a programmer, in the same vein as What Every Computer Scientist Should Know About Floating Point Arithmetic by David Goldberg (a more accessible version of which is available at The Floating-Point Guide) and Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

But this will be much shorter than those articles.

## The Basics

### What are random numbers?

A random number sequence is one where the next number in the sequence cannot be predicted, even if you know all of the previous numbers—i.e. one that contains no discernible pattern.

A random number generator is a device or program that produces such a pattern.

### Why do we need random numbers?

They have applications in computer games, simulations and computer security. Simple examples would be picking the next track for a music player to select in shuffle mode or the order of a deck of cards in an online poker game. Additionally, many cryptography and security applications rely on starting with some unknown (and unpredictable) number or sequence of numbers, and their security falls down if they can be predicted.

### How can a computer, which is deterministic, produce random numbers?

Unless you take special measures (see the next section) computers can't produce truly random numbers, so instead they do something pretty close that's usually good enough—they produce pseudo-random numbers, which are sequences of numbers that (i) appear random to humans and (ii) are statistically random. Such sequences generally have a uniform distribution—i.e. all numbers within the range of numbers produced by the random number generator are equally likely to be chosen. Another way of looking at this is that if you generate many pseudo-random numbers, each number within the range will be chosen roughly the same number of times, so no numbers will be preferred or chosen more frequently than any others. Of course there are other distributions you may want (e.g. a normal/Gaussian distribution), but usually you'll have a good reason for this (if you've never heard of normal or Gaussian distributions, they're very unlikely to be what you want).

Pseudo-random number sequences generally start with a seed value from which all the other numbers in the sequence follow using some algorithm. The algorithm itself is deterministic, and this is what makes the numbers it generates pseudo random as opposed to truly random.

### What about if I want truly random numbers?

This is generally done using hardware entropy—the operating system detects things about hardware which cannot be predicted, such as when traffic arrives on your network port, and uses it to generate numbers that are truly random. On most UNIX systems these numbers can be accessed through /dev/random, which simply produces random bytes whenever you read from it. However, this method is a lot slower than producing pseudo-random numbers (just try running hd /dev/random on any UNIX system; wiggle the mouse or pound the keyboard to hurry things up), and doesn't have the advantage of being able to repeat the sequence for debugging/testing.

## Random Number APIs

Most languages' standard libraries come with facilities to generate random numbers. They generally consist of at least two basic things:
1. A way to seed a sequence of pseudo-random numbers, and
2. A way to get the next number in the sequence.
Sometimes it's useful to have the same seed, for example for reproducing problems when debugging. Other times it's necessary to choose a seed from a source such as the current time in order to get a different pseudo-random sequence on each run of a program. Be careful choosing your seed from the time—depending on the resolution of the timer you're using, you could end-up with the same sequence being used on consecutive runs of a program (see an earlier post about this).

## Common Mistakes when Producing Random Numbers

These are the mistakes most beginners make, especially when asking for help on forums.

### Trying to make the numbers “more random”

It is a common mistake for people to assume that “pseudo-random” really means “not-very-random” and that they should try to somehow “improve” this randomness. This is typically done by calling some random number generation function (e.g. the standard C rand() or Java's Math.random()) several times and combining the results, as seen in this post on The Daily WTF, for example.

Producing random numbers with a uniform statistical distribution is very hard—you need some pretty advanced number theory to really know what you're doing and get it right. In many cases, combining several random numbers in this way produces a bias for/against some numbers, which actually makes these functions less random than they started out.

On the other hand if you do think you've produced better random numbers than some other implementation, NIST has some statistical tests you can download to check this.

### Re-seeding the random number generator every time you need a new random number

This is another common mistake—a pseudo-random number sequence should generally be seeded once and only once. You'll only need to re-seed such a sequence if you want to reproduce a previous run of random numbers (e.g. for debugging/testing). Frequently when people attempt to re-seed the generator they make use of the current time, which can lead to the same “random” number being used several times in a row if the timer resolution isn't fine enough and the random number function is called often enough.

## How You Should Produce Random Numbers...

How you should produce random numbers depends mostly on what's at stake—i.e. what the consequences would be if someone were able to predict the next random number in a sequence.

### ...when no personal data is at stake

This will be the case for most random number applications, e.g. for games that need to shuffle cards/simulate dice rolls, etc.

In this case, use the random number generator that comes with your language's standard library, seeded with the current time. If you need the random number to be within a certain range, take the modulus and add the base of the range, e.g. in C:

```/*
* Get a random number in range 100 to 110, inclusive.
* There are 11 numbers to choose from (110 - 100 + 1 = 11).
*/
int x = (rand() % 11) + 100;
```

This will work fine so long as the low-order bits of your implementation are as random as the high-order bits, which they will be in any half-way decent implementation. If you're worried about the quality of your standard library's implementation, check out the GNU Scientific Library's random number facilities.

If your timer resolution is in seconds and you're worried about people getting the same seed by running two instances of your program in quick succession, combine the current time with the process ID. Alternatively on UNIX you can use /dev/urandom to generate either a seed or a stream of pseudo-random numbers.

You may also want to consider (i) printing out the seed you use and (ii) adding a flag to set the seed to an explicit value. This will allow you to reproduce problems that may only occur when a certain sequence of numbers is generated.

### ...when personal data is at stake

This will be the case for cryptographic applications where bank details, passwords and the like are being protected. These normally involve a small quantity of random numbers being used once to generate a key, so performance is rarely an issue.

In this case, using a true random number generator is essential. On most UNIX systems you can use /dev/random for this purpose, though if you need a large quantity of random numbers this may take some time—encourage the user to wiggle the mouse in order to generate more hardware entropy (seriously).

If your platform doesn't have such a facility, random.org can produce truly random bytes over a secure HTTPS connection. And if this isn't an option and a true random number generator isn't available, /dev/urandom is probably the safest fallback.

## Thursday, 4 April 2013

### How to Get the Programming Job You Want

I've been on a job hunt recently, and have just been offered a position that is pretty much the ideal position I was after. In light of this, I thought I'd share a quick piece of advice on how to get the programming job you really want.

While I believe this advice certainly applies to programming, it may apply to other fields as well, especially where there's a low barrier to entry. Also, note that this advice even applies if you've never had a programming job before, and if you have absolutely no formal qualifications in programming (I don't, and I had to start somewhere…).

My advice is a fairly simple, two-stage process:
1. Figure out what programming job you want, and
2. Start doing it.
As I mentioned before, I have no formal programming qualifications—but for entry-level positions most employers are willing to look past this, so long as you can show that you have the ability to actually do the job. And what better way to show you can do the job by doing the job in your spare time? Additionally, since software development is a profession where you're constantly learning, being able to demonstrate that you're an “information sponge” is a big plus. Sure, the guy just out of university knows as much Java as you do, but did he just do it to pass his assignments? If you haven't had assignments to pass, you must have learnt it because (i) you were eager to and (ii) you were able to—you're proactive, not just another JavaSchool drone. This logic applies whether you're new to software development, or just to some specific area of software development.

With easy access to open-source and free-of-charge software and relatively cheap hardware, there's a plethora of ways you can get real-world, genuinely valuable experience in your field that'll give you something to put on your CV* and show off at interview to demonstrate you know what you're talking about. You want to write .NET desktop apps? Go write some—re-implement Task Manager if you want, it doesn't matter if it's been written before. Want to work on operating system kernels? Linux and the various BSD variants will accept contributions from anyone, or you could write your own. Want to write websites? Find an excuse to make one, even if it's just a personal one to show-off your LAMP/AJAX skills. Want to program safety-critical real-time embedded systems? Well, you probably can't do that recreationally—but you can get fairly cheap evaluation boards from a variety of sources, get a book on real-time systems and put the two together.

Whatever you want to do, just go ahead and do it—not only will employers relish in the fact that you've given yourself your own on-the-job training, but you'll find out whether you really like that area.

Of course, there's a flip-side—at the end of it you have to know your stuff. You won't have the presumed knowledge that comes with a formal qualification, so be prepared to demonstrate the limits of your knowledge when you interview. Make sure you study as much as possible in your chosen area—read (good) books, visit online forums to try to answer other people's questions (especially if you have to research the answer) and learn everything you can about your chosen language/technologies (they're the tools of the trade, after all).

* That's a résumé, for you Americans.

## Thursday, 18 October 2012

Here's a quick overview of the calls available to a programmer using Pthreads (both the standard Pthreads API and GNU extensions). It's intended to be as brief as possible while being (i) descriptive and (ii) not just a list of all the available calls. The links are to the relevant man pages on linux.die.net.

Note: The "NP" at the end of certain functions and macros stands for "non-portable". They are GNU extensions, accessed by e.g. #define-ing _GNU_SOURCE.

The pthread_attr_t can have the following operations performed on it:

A thread can get its own ID via pthread_self(). Two thread IDs can be tested to see if they refer to the same thread with pthread_equal() (don't just use ==).

You can specify functions to be executed immediately before and after a call to fork() (in both parent and child) with pthread_atfork(). This is useful in e.g. a threaded library, since the child process a fork() creates only consists of a single thread.

Thread clean-up handlers (analogous to atexit()/on_exit()) can be registered/unregistered with pthread_cleanup_push() and pthread_cleanup_pop(). They will be called if the thread is cancelled or it exits using pthread_exit() (but not if it just returns from the thread routine).

If you're worried about asynchronous cancellation between pushing/popping clean-up handlers (and who isn't!) you can use the GNU-specific pthread_cleanup_push_defer_np() and pthread_cleanup_pop_restore_np() instead, which (i) set the thread cancellation type to deferred (see Thread termination) and (ii) restore the previous thread cancellation type, respectively.

A thread that is joinable can be joined with pthread_join(), which will block until the thread has finished. A thread can be moved to the detached state with pthread_detach() (the reverse cannot be done). Normally, an attempt to join will block until the thread has finished, but this blocking behaviour can be avoided with the GNU-specific pthread_tryjoin_np() and pthread_timedjoin_np().

One primitive way of terminating a thread is to cancel it with pthread_cancel(). Whether the thread is able to be cancelled in this way can be controlled with pthread_setcancelstate(), and whether such a cancellation is deferred or delivered at a cancellation point can be controlled with pthread_setcanceltype(). Within a thread, an "artificial" cancellation point can be generated by calling pthread_testcancel().

### Signal operations

A signal can be sent to a specific thread using pthread_kill() or pthread_sigqueue(), analogous to the normal kill() and sigqueue functions.

If you're using LinuxThreads (you'll probably know if you are; most systems are using the Native POSIX Thread Library now, for which this function is irrelevant) you can use pthread_kill_other_threads_np() to send SIGKILL to all other threads, usually to correct a deficiency whereby other threads are not terminated when calling one of the exec() family of functions.

The signal mask for a thread can be accessed/modified with pthread_sigmask(), which uses a sigset_t that can be manipulated with the usual sigsetops.

### Scheduling and CPU time/affinity

The clock_t of a specific thread (e.g. for a call to clock_gettime()) can be accessed with pthread_getcpuclockid().

A thread's scheduling policy and parameters (i.e. priority) can be accessed/modified using pthread_setschedparam() and pthread_getschedparam(). The priority alone can be set using pthread_setschedprio(). The CPU can be released with pthread_yield(), which is analogous to sched_yield() but for an individual thread.

Note: This is irrelevant on Linux, and is ignored. The concurrency (a hint to the system about how many lower-level entities (e.g. kernel threads) to assign) of a process can be accessed/modified with pthread_getconcurrency() and pthread_setconcurrency().

The CPU affinity of a thread can be accessed/modified after creation with the GNU-specific pthread_getaffinity_np() and pthread_setaffinity_np().

## Thread-specific data and one-time-only operations

Once-only initialization can be achieved using pthread_once(). This requires a pthread_once_t which is initialized by assignment from the PTHREAD_ONCE_INIT macro.

## Locking mechanisms

Pthreads provides three types of locking mechanism - mutexes, read/write locks and spin locks.

### Mutexes

These are the most common types of lock, as well as the most flexible. They are also the only locks that can be used with condition variables.

Mutexes can optionally be initialized with attributes in the form of a pthread_mutexattr_t, initialized and freed using pthread_mutexattr_init() and pthread_mutexattr_destroy(). The pthread_mutexattr_t can have the following operations performed on it:

Read/write locks can provide performance benefits over mutexes since they allow any number of threads to access the protected resource with "read-only" behaviour, while granting exclusive access to threads that want to be able to modify the resource. They have some disadvantages over mutexes though, namely that they can't be configured to avoid priority inversion and they can't be used with condition variables.

Currently, the only attribute a read/write lock can be given is whether it can be used by multiple processes. This setting can be accessed/modified with pthread_rwlockattr_getpshared() and pthread_rwlockattr_setpshared().

### Spin locks

These are the simplest form of locks, and are generally only used when you know that other threads will only be holding the locks for a short amount of time (e.g. not whilst performing blocking I/O).

A spin lock is initialized/freed with pthread_spin_init() andpthread_spin_destroy(). There is no associated attributes structure (but they can be made available to other processes upon initialization).

## Memory barriers

These allow threads to synchronize at a pre-defined point of execution.

A memory barrier can optionally be initialized with attributes in the form of a pthread_barrierattr_t, initialized and freed using pthread_barrierattr_init() and pthread_barrierattr_destroy().

Currently, the only attribute a memory barrier can be given is whether it can be used by multiple processes. This property can be accessed/modified using pthread_barrierattr_getpshared() andpthread_barrierattr_setpshared().

Once created, threads can synchronize at a barrier using pthread_barrier_wait(). There is no such thing as a "trywait" or "timedwait" function.

## Condition variables

Condition variables can be used to signal changes in an application's state to other threads. They must be used with an associated pthread_mutex_t.

A condition variable can optionally be initialized with attributes in the form of a pthread_condattr_t, initialized and freed using pthread_condattr_destroy() and pthread_condattr_init(). You can initialize a statically allocated condition variable using PTHREAD_COND_INITIALIZER.

Whether the condition variable is able to be used by multiple processes can be accessed/modified using pthread_condattr_getpshared() and pthread_condattr_setpshared().

The clock used (identified by a clockid_t) for timed waits on a condition variable can be accessed/modified using pthread_condattr_getclock() and pthread_condattr_setclock() (though it cannot be set to a CPU clock).

## Tuesday, 9 October 2012

### 3 Films 1 Room

Recently I was trying to think of all the films I could that are set (almost) entirely in a single room, and a thought struck me; the good ones are some of the best films I know. I'm not sure why this is. Maybe they're a case of constraints breeding creativity? Maybe I just love dialogue-driven films? Maybe I just forget the bad ones? (Although Exam would be a counter-example to that - I want my money back, and I didn't even pay anything to see it*). Here are my three favourite almost-but-not-quite-cult examples.

There are no spoilers.

### Tape

Set in a motel room, two old high-school friends discuss old times, eventually leading to old accusations resurfacing and some very heated moments in such a small space. Ethan Hawke, Robert Sean Leonard and Uma Thurman all give excellent performances.

### The Man from Earth

Professor John Oldman gets accosted by friends before he can leave his whole life behind, seemingly for no reason and not giving anyone any clues as to where he's headed to next. Pushed for the reasons for such a hasty escape, he confesses to being 14,000 years old, periodically moving when people notice he doesn't age. His academic colleagues try to pick holes in his story, but he makes them doubt themselves with an answer for every question.

This film is famous in cult circles for being filmed on a budget of \$200,000 (not much for a film these days) and being publicised mostly through word-of-mouth on the internet (or should that be "word-of-keys"?) - the producer even commenting on how positive file sharing has been in getting word of the film out.

### Conspiracy

A gripping portrayal of the Wannsee Conference, depicting the German high command discussing the "Final Solution" behind closed doors. I'm not much of a history buff, but this film really did engage me. Although this film isn't strictly all in one room all the action (by which I mean dialogue; it feels like action) and a large part of the film does, and the rest is not far away.

### "Was it a play?"

For some reason, when I mention to someone that a film is set mostly or entirely in a single room, they often ask if it's an incarnation of a play (as if a screenwriter could ever bring such a restriction upon themselves). So far as I know, of the offerings above only Tape started out in life as a play, although The Man from Earth was subsequently made into one.

* It was on TV - lose one hundred million points if you briefly hated me in your head for downloading films. Win one hundred million points back if you still hate me for preempting you; I like that.

## Sunday, 7 October 2012

### The Cumbrian Run, or: How Not to Finish Training for a Distance Run

Well, the Cumbrian Run is over for another year, yet again leaving me with what feels like the legs of an 80-year-old man. The previous two years have been a bit of a let-down, with me not training as much as I could (especially last year) and being disappointed with my result. Last year's was 2:25, the previous was 2:15. So last year, I vowed to train all year and really boost my time. Which I did. Well... ish.

My training started off well, but throughout the year I'd had a couple of breaks of a few weeks or so due to commitments and general laziness. And the last couple of weeks I've been preparing for an exam, as well as just experiencing general apathy for the whole running thing (these things come and go, I guess). This may not be the best time, but priorities are priorities.

Anyway, about a month or so ago I would have estimated I could get a 1:50 time fairly confidently, so that's what I was aiming for and the pace I set off at. You couldn't ask for much better weather - still, sunny and fairly cool. Setting off, the first four miles felt pretty good aside from a pain at the back of my ankle - I don't normally get pain here, so that wasn't a good sign. Then Disaster Strikes! in the form of my watch stopping. It's a Garmin Forerunner 305. I've never had any problems with it in the past, and I'm sure I didn't knock it/into anyone, so this is puzzled me somewhat.

Anyway, at mile four I reset it and got on with things. The next four miles were pretty painful but I managed to keep up the pace, then the next three I dropped it a bit. I was happy with my last mile though - it was faster paced than my average for the race, and the crowd really helped me finish.

In the end, I ended up with a 1:52:13 chip time, which I was fairly pleased with - although I was hoping for under 1:50 going in, I wasn't sure how two weeks of sitting on my arse would affect things (and I still don't - would I have knocked two minutes off my time with extra training? Who knows...) my aim from last year was to get under 2 hours, so I can be happy with that. And I knocked over 32 minutes off my time - if I continue at this rate, I'm on for a pretty good time next year...

## Thursday, 4 October 2012

### Chilli Confusion: Win for Evolution

I love chilli peppers. They go great in various foods as well as just whole in a nice glass of wine. I also think they (as well as tomatoes) have a neat method of seed propagation: they get eaten by animals and let their seeds be "distributed" when their digestive systems are done with them. This apparently works well for them, since they get (i) a nice new home in foreign climes and (ii) a toasty warm bed of fertilizer to start life off with.

However, at same time this has always confused me - while I love them cut up and mixed into my food, if I were your average grazing mammal and decided to try biting into a chunk of chilli for the first time, the intense burning sensation in my mouth would soon put an end to such a behaviour. But if they get propagated in the same way as tomatoes, why would they want to ward off their middle-men?

Obviously, there's a perfectly reasonable explanation for this, but neither my teacher at school nor the then-infantile dial-up internet of the time could provide said explanation. Revisiting the problem recently gave me the answers I craved. The heat in chillies arises from evolutionary forces that depend on two major factors that make the answer kind of obvious when you know them, and are a veritable marvel of nature (at least, I think so...).

First, there's the fact that while we perceive chillies as having a hot, burning sensation, birds don't. In humans and other mammals, the capsaicin in chillies activates the same pain receptors as direct application of heat does*, but birds don't have these and so are basically unaffected by this.

Secondly, the seeds in chillies are vulnerable to mammalian digestive tracts - they are harmed when they pass through them, but not when they pass through avian digestive tracts. Thus, a chilli's heat is how it selects its preferred "thing to be eaten by".

So that's my curiosity satisfied, and since it was such a nice evolutionary arrangement I thought I'd share.

* Though it doesn't feel quite the same as actual heat applied to the inside of your mouth (at least to me) because you don't get other consequences of heat (e.g. tissue cell damage and the resultant release of various substances with further consequences). But it really is the same pain receptors that get stimulated (these ones, if you're so inclined), just in a different way.

## Tuesday, 2 October 2012

### Invisibility in Fiction

I'd like to comment on a subject I feel very strongly about at a quite personal level. I feel this issue is important, despite being overlooked by many serious authors of various backgrounds and heights. Quite frankly, it gets right on my tits.

It is this:
An invisible person would be blind.
This is, of course, ignoring any invisibility obtained through magic or similar, since then an author can also get around this problem with hand-waving and magic. I'm talking more about the sort of thing described in H. G. Wells' The Invisible Man*, or the modern film Hollow Man, where people are described as simply having all their matter turn completely transparent somehow. Fringe cases (such as the girl in The Incredibles) could be argued either way, I guess.

Anyway, as you may well know, the big problem is that in order to see, your retina has to absorb light (hence why it's black). But if you're invisible in the way described (i.e. everything that constitutes you/your body is completely transparent), light would pass through your retina, the photoreceptive cells that turn light into neural signals never would get stimulated, and you'd be as blind a completely blind person.

There are a few ways you could try and get around this. In the The Invisible Man, H. G. Wells attempts to do so by describing the recently-made-invisible protagonist's retinae** as being translucent in chapter 20: an attenuated pigment still remained behind the retina of my eyes, fainter than mist.

Of course, this state of affairs would be no better than being blind. Without an optically correct cornea, lens and associated fluids that make your eyes so juicy and delicious, light would not get focused on your retina - the best you could hope for is a blur. And even if these weren't completely transparent and had their normal optical properties, without an opaque sclera (the white that surrounds the rest of your eye) light would be arriving at your retina from all other directions, meaning the world would be a big white blur. In order to see, you'd need these bare essentials to be unaffected by any invisibility remedy, but that would leave you with two very distinct and conspicuous white spheres bobbing around at everyone else's eye-level (unless you're very short/tall or you want to go around crouching).

As an aside, if you could really do this you may actually see better than you do ordinarily. The photoreceptors in your retina are orientated with their photoreceptive pigment at the back of your eye, meaning the cell bodies as well as various retinal processing cells that aggregate/mediate their responses are all in front of the receptors themselves, in the path of the light. I don't know how much it would actually improve your sight, but it wouldn't be by a vast amount - the best values of visual acuity that have been measured are not much better than the theoretical limit, given the distance between adjacent photoreceptors in the retina.

In conclusion, next time I'm watching a film with an invisible person in it, I'll just pretend they suddenly developed extraordinarily good echolocation.

* An attempt at reconciling this problem is made in the book - see later.
** This is a weird one - saying "retinae" instead of "retinas" sounds weird to me, but writing "retinas" instead of "retinae" looks equally weird. And anyway, both are acceptable and I like writing footnotes.