Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.

Topics - harold

Math and Science / Graph coloring mini challenge
« on: October 05, 2014, 11:56:58 am »
You've probably heard that a map (with some restrictions: no enclaves, no "bridges", countries meeting at a point are not considered adjacent) can be colored using only four colors.

Graph coloring is a bit more general: given a bunch of "nodes" with "edges" between them, color the nodes such that no two adjacent nodes (ie, nodes with an edge between them) have the same color. A trivial way to do that would be giving every node its own color. A bit better would be to start at the first node, give it color 0, then give the second node color 0 or 1 (depending on whether it's connected to the first), and so on. There are tricks you can play here with the order in which you visit the nodes (it matters a lot; in fact there is always an order such that you get an optimal coloring this way, and there are many interesting things you can do there).
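For concreteness, that greedy "first fit" procedure looks roughly like this (a sketch with a plain adjacency matrix, nothing optimized; names are mine):

```c
#include <stdlib.h>

/* Greedy "first fit" coloring: visit nodes in index order, give each node
   the smallest color not used by an already-colored neighbor.
   adj is an n*n adjacency matrix (adj[v*n+u] nonzero iff v and u are
   connected), color[] receives the coloring. Returns the number of colors. */
int greedy_color(int n, const unsigned char *adj, int *color)
{
    int used = 0;
    for (int v = 0; v < n; v++) {
        unsigned char *taken = calloc(n + 1, 1);
        for (int u = 0; u < v; u++)      /* colors taken by earlier neighbors */
            if (adj[v * n + u])
                taken[color[u]] = 1;
        int c = 0;
        while (taken[c]) c++;            /* smallest free color */
        color[v] = c;
        if (c + 1 > used) used = c + 1;
        free(taken);
    }
    return used;
}
```

Visiting the nodes in a different order just means permuting the loop over v; sorting by degree (largest first) is a popular heuristic.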

The challenge is usually to find such a coloring with the fewest possible number of colors, but you could also ask: given a certain number of colors you're allowed to use, can you color the graph? (this is an example of an NP-complete problem)

Anyway, for this mini challenge there are two graphs (I could make more, if necessary): a small one with 50 nodes and 350 edges, and a big one with 500 nodes and 62423 edges. The second one is particularly nasty not so much because it has 500 nodes, but mainly because it has about half as many edges as it could maximally have, which is about the worst case (with more edges it becomes so constrained that there are not many solutions to try; with fewer edges you won't need as many colours, so again there are few solutions). But whatever, the point is, there's an easy one and a hard one. The easy one is still big enough that you can't blindly try everything though, you have to be at least slightly clever.

50 graph: gc50.txt
500 graph: gc500.txt
Format: first line is [number of nodes] [number of edges]
Other lines are the edges, a pair of nodes that has a link between them.
The last line is empty (so that every line you read can end in \r\n to simplify parsing or whatever), and there are exactly as many edges as advertised (so you can read them in a for loop and not care about end of file).
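Reading that format is a dozen lines; a sketch in C (fscanf skips the \r\n's by itself, so the parsing really is trivial):

```c
#include <stdio.h>

/* Read the format described above from an open stream: a "nodes edges"
   header line, then exactly that many "u v" edge lines.
   Returns the edge count, or -1 on a malformed file. */
int read_graph(FILE *f, int *nodes, int (*edges)[2], int max_edges)
{
    int m;
    if (fscanf(f, "%d %d", nodes, &m) != 2 || m < 0 || m > max_edges)
        return -1;
    for (int i = 0; i < m; i++)          /* exactly m edges, no EOF games */
        if (fscanf(f, "%d %d", &edges[i][0], &edges[i][1]) != 2)
            return -1;
    return m;
}
```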

Challenges (easy):
  • Find an optimal coloring for gc50.
  • Find a coloring with at most 80 colors for gc500.
  • Prove that you need at least 11 colors to color gc500.

Challenges (medium):
  • Find a coloring with at most 60 colors for gc500.

Challenges (hard):
  • Find a coloring with at most 52 (protip: that's not optimal) colors for gc500.

Challenges (impossible):
  • Find an optimal coloring for gc500. (I don't even know how many colors it will have, so don't ask)

Challenges (other):
  • Find a non-trivial lower bound on the number of colors necessary for gc500 (deliver proof/explanation of why it's a lower bound).

PM answers to me (don't post colorings please; feel free to post lower bounds), and discuss everything (as long as it doesn't violate "don't post colorings"). Pretty much anything goes with respect to how you solve this, but it will be more interesting for you if you play with the problems yourself instead of just dumping them into open source solvers.

Computer Projects and Ideas / Redstone-oriented Minecraft clone
« on: August 22, 2014, 08:29:54 am »
You may have seen a couple of screenshots of this appear on IRC, at the time of writing it looks like this:

They're pulse circuits, so it looks like a bunch of things are incorrect but that's just because of their delay.

Anyway, the goal as I set it for this project is to be able to build/test/run redstone circuits without various minecraft annoyances, most of which can be fixed with mods I suppose..
Basically it's creative mode on steroids. Nothing that distracts from just building circuits. World-edit-like godmode editing. And I have some plans for putting in things like displaying delays and critical paths, and some automated circuit testing stuff (such as cycling through a bunch of different inputs without having to build a circuit to do it). None of that exists just yet, I'm just doing the basics now. Also, the ability to pause, step, and fast-forward the redstone simulation.

Some redstone logic already works to some degree, for example:
 - torches, both floor-mounted and wall-mounted. They don't burn out though, and I'm not sure whether to make them burn out. They also react to 1-tick pulses, which is wrong. (that's how the above screenshot has 1-tick pulses)
 - repeaters. You may have heard me complain about them. A normal 1-tick repeater that isn't locked works fine. Other delays may or may not do what they're supposed to do; the screenshot above is testing how they respond to 1-tick pulses, and they respond correctly. I'm not sure about other inputs though. Also, locks make no sense and are currently incorrect (correct behaviour requires time travel).
 - redstone dust, I think, works correctly in terms of which blocks it powers, what it looks like (apart from the particle effect), how far signals travel, and signals don't go down transparent diodes.
Some things are not there but will be there soon:
 - comparators
 - editing (it just loads and simulates now)
Some things may or may not ever be included:
 - pistons (I know you like them, but they suck to simulate, and I'm sure you remember how long they caused trouble in MC)
 - shadows / lightlevels would be pretty, but not useful and would slow down rendering, particularly of complex+big maps.

Trickery used
The world can get pretty big (256x256x256 at least; it should work with 1024x1024x512 as well, but that's untested), which is smaller than "real minecraft", but still brings some challenges.
For example, it's impossible in general to reupload all geometry every frame, or you'd have maybe 1 frame per second. In order to avoid having to reupload too much stuff that wasn't changed, the world is divided into 16x16x16 chunks. Chunks with no changes are left alone. However, naively, if a redstone signal changes, that means a change in a chunk, because you'd have to update some texture coordinates, and that means reuploading the geometry, this time with different texture coordinates. That would make far too many chunks be reuploaded, so to counter that, I've decoupled powerlevels from geometry, so I can update them separately with a tiny upload of only 2KB (16x16x16/2, because power levels can be packed in nibbles), versus about 1KB per solid block (minus faces that touch other solid blocks) in the chunk. It's a bad deal only if there are 1 or 2 blocks in a chunk, which is not the common case.
The powerlevel data is also used by "objects" (torches, repeaters, etc) to determine their state, even for the lock-bar of repeaters, so as long as only redstone signals change, no geometry is reuploaded.
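To illustrate the packing: power levels fit in nibbles, so a 16x16x16 chunk's levels fit in exactly 2048 bytes. Getting and setting a level looks something like this (a sketch; names are made up, not the actual code):

```c
/* 16*16*16 power levels (0..15), packed two per byte */
static unsigned char levels[16 * 16 * 16 / 2];   /* = 2048 bytes */

int get_level(int i)
{
    return (levels[i >> 1] >> ((i & 1) * 4)) & 0xF;
}

void set_level(int i, int v)
{
    int s = (i & 1) * 4;                 /* low or high nibble */
    levels[i >> 1] = (unsigned char)((levels[i >> 1] & ~(0xF << s)) | (v << s));
}
```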
To avoid having to deal with "ordering" with transparent parts (torches draw some transparent parts), transparent pixels are discarded. AA tends not to like that, but it works in combination with sample shading.
Possible future trickery: if I can figure out how to do it well, maybe fuse some faces and use texture wrapping. That would significantly reduce the polycount in some cases, but I'd have to be careful not to mess anything up. Polycount hasn't really been a problem so far though; a million randomly placed (but not multiple at the same location) solid blocks still render at 30FPS (on a HD6950, +vsync), and I doubt any real build would even come close to that.

Tech used/platform
C# and OpenGL 4. The biggest compatibility obstacle is probably the extensions I used and buggy/old GPU drivers.

Math and Science / Representing disjoint sets
« on: April 02, 2014, 06:56:38 am »
None of this is new, but I think it deserves more attention.

Parent pointers. As discussed in wiki:disjoint-set_data_structure. Every element points to its parent, the root identifies the set. Singleton sets are an element that points to itself. This is a relatively well-known data structure because every CS curriculum covers it, usually in the context of amortized analysis.
It supports merging two sets in O(log n) and identifying which set an item belongs to also in O(log n), but remember the amortized analysis that makes this data structure so good.
But here are some things you don't get (without extensions), or not nicely:
  • seeing how big a set is. But you can fix that: take an extra array "size", initialize it with 1's, and on a union, do size[newroot] += size[otherroot]. Getting the size of the set containing x is just size[find(x)].
  • removing an element from a set. Other elements might be pointing to it, and you have to find them and fix them. The pointers go the wrong way, so you have to scan everything to find them.
  • a good way to enumerate some set. Again the pointers go the wrong way. You can scan through all items and see if they belong to the set, and in some ways that is not so bad (in the worst case, the set contains everything, and you'd have to visit every item anyway), but it could be better.
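Putting the parent-pointer structure and the size extension together, a minimal version looks like this (a sketch; path halving instead of full path compression, and union by size):

```c
/* Union-find with the "size" extension from the list above.
   find() does path halving; unite() links the smaller set under the
   bigger one and does size[newroot] += size[otherroot]. */
#define N 16
static int parent[N], size_[N];

void uf_init(void)
{
    for (int i = 0; i < N; i++) { parent[i] = i; size_[i] = 1; }
}

int uf_find(int x)
{
    while (parent[x] != x)
        x = parent[x] = parent[parent[x]];   /* path halving */
    return x;
}

void uf_union(int x, int y)
{
    int rx = uf_find(x), ry = uf_find(y);
    if (rx == ry) return;
    if (size_[rx] < size_[ry]) { int t = rx; rx = ry; ry = t; }
    parent[ry] = rx;
    size_[rx] += size_[ry];
}

int uf_size(int x) { return size_[uf_find(x)]; }   /* size of x's set */
```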

So here's a way that initially looks related, but is actually very different: cycles. It looks the same in that you start out with an array [0, 1, 2, 3 .. ], but the similarities pretty much end there.
A set is represented by a cycle. You follow a cycle by doing x = array[x]. To merge two sets, take any item x from the first set and any item y from the second set, then swap array[x] and array[y].
Here are some diagrams to clarify:
Begin situation. Every element is in its own cycle/set.

Merge the set {1} with the set {2}.

Merge the set {0} with the set {3}.

Merge the set {0,3} with the set {1,2}. This could be done in several ways, shown here by swapping array[2] with array[3].

It should be obvious that this is reversible, so you can easily "undo" merges if you remember which two items you used in the merge.
You can also remove an item from its cycle, but that requires knowing the element which points to it. When you're iterating over a cycle, you can remember the previous element, so you can unlink elements from their set while enumerating that set.
Unlike with the first data structure, finding out which set an element is in is hard, and there's not even an indisputable representative of a set anyway. You could enumerate a set and take the biggest or smallest item as representative, though. Another trick is to add a bunch of items that are only used to "give names to sets".
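In code the cycle structure is almost nothing; this sketch reproduces the merges from the diagrams above:

```c
/* Cycle representation: cyc[x] is the next item in x's cycle.
   Merging two disjoint sets is a single swap; enumerating a set is
   following the cycle until you're back where you started. */
#define N 8
static int cyc[N];

void cyc_init(void) { for (int i = 0; i < N; i++) cyc[i] = i; }

void cyc_merge(int x, int y)   /* x and y must be in different cycles! */
{
    int t = cyc[x]; cyc[x] = cyc[y]; cyc[y] = t;
}

int cyc_size(int x)            /* enumerate the cycle containing x */
{
    int n = 1;
    for (int i = cyc[x]; i != x; i = cyc[i]) n++;
    return n;
}
```

Doing the same swap on two items that are in the same cycle splits it back into two cycles - that's the reversibility mentioned above.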

But it gets better. What if you combine those two data structures?
Merging becomes O(log n), inherited from the union-find structure. There is now an indisputable representative of a set, namely the root in the union-find structure. And now you can also remove an item from its set in a more reasonable O(|S|) where S is the set containing the item (vs O(n) before), with a very simple algorithm: if the array containing the cycles is called "cycles" and the item we're removing is "x", iterate over the cycle setting the parent pointers to cycle[x], and when you reach the item y such that cycle[y] = x (ie, you've completed the cycle), swap cycle[x] and cycle[y] to unlink the item. You can still keep track of the set sizes with no significant overhead.
So you can:
  • Merge the set containing x and the set containing y in O(log n), or O(1) if you already know that x and y are both roots
  • Determine the set containing x in O(log n)
  • Enumerate the set containing x in O(|S|)
  • Remove an item from its set (making it a singleton set) in O(|S|)
  • Get the size of the set containing x in O(log n), or O(1) if you already know that x is a root
Of course the amortized analysis of union/find with path compression still applies, so it's actually better than those O(log n)'s make it look (that already looks pretty good though).
The algorithm for removing an item from a set guarantees that find operations on any of the items in the set that you just removed an item from will run in O(1) the next time, partially offsetting the bad-looking O(|S|) time. Essentially it performs path compression on every item in the set, and in a way that's faster than usual.
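Here's a sketch of the combined structure with the removal algorithm spelled out (my own naming; the union works on roots, so swapping the roots' cycle entries is safe):

```c
/* Union-find and cycles combined. ds_remove() walks the cycle of x,
   repoints every parent to cyc[x] (which becomes the new root), then
   unlinks x, exactly as described above. */
#define N 8
static int parent[N], cyc[N], size_[N];

void ds_init(void)
{
    for (int i = 0; i < N; i++) { parent[i] = i; cyc[i] = i; size_[i] = 1; }
}

int ds_find(int x)
{
    while (parent[x] != x)
        x = parent[x] = parent[parent[x]];   /* path halving */
    return x;
}

void ds_union(int x, int y)
{
    int rx = ds_find(x), ry = ds_find(y);
    if (rx == ry) return;
    int t = cyc[rx]; cyc[rx] = cyc[ry]; cyc[ry] = t;   /* splice the cycles */
    parent[ry] = rx;
    size_[rx] += size_[ry];
}

void ds_remove(int x)          /* make x a singleton set */
{
    if (cyc[x] == x) return;   /* already a singleton */
    int n = size_[ds_find(x)];
    int r = cyc[x];            /* the next item becomes the new root */
    int y = x;
    while (cyc[y] != x) {      /* walk the cycle, compressing as we go */
        y = cyc[y];
        parent[y] = r;
    }
    int t = cyc[x]; cyc[x] = cyc[y]; cyc[y] = t;   /* unlink x */
    parent[x] = x;
    size_[r] = n - 1;
    size_[x] = 1;
}
```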

As a variant of the cycle structure, you can use two arrays, containing the same cycles but with one going "backwards". Essentially this emulates a doubly linked list instead of a singly linked one. Visualize it with double arrows. In this data structure, you can unlink items from their cycle in O(1) (at all times, not just if you remember the previous node), and here's another fun trick: instead of making the unlinked item into a singleton cycle, you can keep its pointers intact. That way it remembers its original position in the cycle it came from, and you can undo a sequence of unlinkings by relinking them in reverse order. This is a reasonably well-known trick on cyclic doubly linked lists, most famously used in the Dancing Links algorithm.
There's not much point in combining this with the union-find structure, but it works well together with a plain old "array of integers saying which set an item belongs to". That's bad for merging two sets, but it becomes interesting if you're only ever shuffling single items between sets.

Here's another trick: all the arrays can start out as all zeroes, with only small modifications to the code (and no conceptual changes to the algorithms). Instead of treating items as the index of the item they refer to, treat them as an offset. You're indexing into the array, so you always know the current index; the value that you're conceptually storing is just i + array[i]. The ranks already start at 0, and making the set sizes start at zero just means offsetting them by 1.
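As a tiny illustration of the offset version (hypothetical names): a zeroed array now means "every item points to itself", and reading a pointer costs one extra addition:

```c
/* Offset representation: off[i] holds (target - i), so an all-zero array
   represents the initial state [0, 1, 2, 3, ...] for free. */
#define N 8
static int off[N];             /* statics start zeroed */

int points_to(int i)        { return i + off[i]; }   /* conceptual value */
void point_at(int i, int v) { off[i] = v - i; }
```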

Math and Science / A "new" compression format [subject to changes]
« on: November 09, 2013, 03:17:32 pm »
I was trying to write a Deflate decompressor, and that's perfectly doable, but occasionally annoying. The "new" format I'm suggesting is basically Deflate with some changes. Some to make it less annoying, others because "why not". The goal is mostly "improved decompression speed".

The main structure is the same. Matches in a sliding window, with literals and match-lengths (and End of Block marker) in the same alphabet, and Huffman coding.

But there are changes.
The main change is that the Huffman codes are packed in dwords (specifically to help decoding with a 32bit "window"), starting at the most significant bit (helps with decoding). Those dwords are then saved with little-endian byte-order (to help x86 - ARM can work this more easily than x86 could in the reverse situation). They're aligned to their natural boundary, so there may be a little padding after the header (the point is to read in dwords, aligning them makes that more efficient on some architectures and only costs 3 bytes at worst).
The Huffman codes are, of course, canonical, but a different kind of canonical than in Deflate: long codes must begin with zeroes. IE, using the rule "shorter codes have a numerically (if filled with zeros to the right) higher value than longer codes". This ensures that no more than 9 rightmost bits can ever differ from zero, which keeps your decoding tables small without trickery.
The maximum length for a Huffman code is 31, up from 15. (that usually won't help, I'm throwing that in "because why not")
Furthermore, the match-length symbols start at 256, with EndOfBlock assigned the code 288. This makes it slightly more convenient to index a table directly with the match-symbol.
All 32 match-length symbols are used, following the same pattern of "extra bits" as in Deflate, but extended. The last 4 match-length-symbols have 6 "extra bits".
The distance codes work like in Deflate64 (which is like Deflate, but with codes 30 and 31 being valid and getting 14 "extra bits").

The suggested way to decode this is to take two dwords, shift them left (with the bits of the second dword appearing in the lsb of the first dword) by some amount, count the leading zeroes (you can OR with 1 and use BSR since the maximum length of one symbol is 31 bits anyway), shift right by some amount depending on the number of leading zeroes, add some offset depending on the number of leading zeroes, then index a table with that.
Alternatively, take a (possibly unaligned) qword, rotate it by 32 bits, then shift left (normally), count leading zeroes, etc..
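The leading-zero count in that scheme can't misbehave because the window is never all-zero after the OR with 1; on x86 that's one BSR. A sketch of just that step, using the GCC builtin in place of BSR:

```c
#include <stdint.h>

/* Count the leading zeros of the 32-bit decode window. OR-ing with 1
   makes the input nonzero (so BSR / __builtin_clz is well defined) and
   caps the result at 31 - fine, since no symbol is longer than 31 bits. */
int window_nlz(uint32_t window)
{
    return __builtin_clz(window | 1);
}
```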

The header is changed to this:
if the first byte is 0, end of stream
otherwise, the first dword (little-endian) has some bitfields:
bit 0(lsb): 1 (constant) (to ensure the first byte is not zero)
bit 1: 0 if the block is compressed, 1 if the block is stored uncompressed
bit 2-31: length of this block
if block is stored: just the bytes, raw.
if block is compressed:
dword (little endian): length of this block when decompressed
uint16 (little endian): is_present, bit[n] (numbered from lsb up) indicates that the range of symbols [n * 16 .. n * 16 + 15] is part of the alphabet (1) or not (0)
uint16 (little endian): is_default, bit[n] (see above) indicates that the range (see above) has its own 16bit mask (0) or is completely present (1)
For every range that needs a mask (ie is present and not default), an uint16 (little endian) that indicates for every symbol whether it's part of the alphabet (1) or not(0)
Two more masks for the match-lengths (these two masks are always present)
Then, some bytes to store the code lengths:
The lowest 5 bits indicate the code length (a length of 0 may be coded - enables you to drop a subrange mask and save a byte sometimes), the upper 3 bits are a repeat-count. If the repeat-count is 0, the repeat count is 8 + next_byte.
[up to 3 bytes padding to align the next part]
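A sketch of unpacking the first header dword as specified above (the struct and names are mine, not part of the format):

```c
#include <stdint.h>

/* Unpack the bitfields of the block-header dword:
   bit 0 is the constant 1, bit 1 selects stored vs compressed,
   bits 2-31 are the block length. A first byte of 0 ends the stream. */
typedef struct { int end_of_stream; int stored; uint32_t length; } BlockHdr;

BlockHdr parse_hdr(uint32_t w)
{
    BlockHdr h;
    h.end_of_stream = (w & 0xFF) == 0;
    h.stored = (w >> 1) & 1;
    h.length = w >> 2;
    return h;
}
```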

None of those changes are really spectacular, but IMO they all contribute a little bit to make the format less annoying.

The part that I'm least sure about is the header. I'm also not sure whether allowing longer match distances is a mistake or not. Actually I'm not sure whether the whole deal for encoding matches is the right way to go or not. I couldn't come up with anything better, but it's a pain in the behind to parse. For most lengths `n`, parsing the match takes so much time that decoding `n` literals would have been faster.

As always, suggestions are welcome (that's why I'm posting this in the first place). Especially if you've written a Deflate decompressor before (otherwise the changes might seem random).

Miscellaneous / MBTI types
« on: October 20, 2013, 06:15:02 am »
Do a test, such as this one or another one, if you don't know your type. Or guess it. Or look up all the types on Wikipedia and see which one you identify with most (probably more accurate, but time consuming).

INTJ here, pretty much the archetypal INTJ.

Math and Science / Portal Physics
« on: October 13, 2013, 06:22:03 pm »
Ok, what do you guys think about this

(spoiler alert: if you try this in the game, the cube refuses to go through at all)

I'm going for B, because:
1) Option A implies the cube keeps whatever momentum the cube had in relation to the "ground". That means that:
1a) if you take a cube and a portal on a fast train and then slowly push the cube through, it comes shooting out the other side at high speed. Ok, then what happens when the cube is halfway through? The back end is moving into the portal at a slower rate than the front end is leaving the exit portal. So the cube is ripped apart.
1b) in the setup in the picture, the front half would not be moving at all, so it wouldn't actually be going through at all. Of course that means you couldn't have stuck in the front half either.
2) the cube has a relative velocity with the entry portal, it must have the same relative velocity with the exit portal
3) the argument that throwing a portal over an object is like throwing a hoop over it isn't really true. A hoop has its "exit" moving at the same speed as its "entry", and indeed the cube "exits" with the same relative velocity as it enters. That leaves the cube stationary, just as it would if the blue portal were pointed upwards and falling at the same speed as the yellow portal (in that case the cube would move through space but gain no momentum).
4) from the perspective of the yellow portal (and choosing another inertial reference frame like that is OK), the cube is moving at some speed, and it will keep that speed at the other side.


Computer Programming / Addition in the bitfield domain
« on: July 16, 2013, 01:54:05 pm »
Ok I lied, there isn't just one bitfield domain, there are two. And I'm interested in both of them.

One of them (henceforth BD1), as described here (chapter 2.8), works like this:
An abstract value in BD1 is a pair (z, o) describing the set { x for which (x | z) == -1 and (x | o) == o } (I know the o is easy to confuse with zero, but it's really an o and I didn't choose this notation)
So, in English, z is a bitmask where a zero at position k means that no elements in the set are allowed to have bit k set to zero, a one in z at position k means that elements in the set may have bit k set to zero (but they don't necessarily). o on the other hand is a mask that describes which bits are allowed to be one.
For example,
Code: [Select]
z = 1110
o = 0001
set: { 0001 }

z = 1111
o = 0011
set: { 0000, 0001, 0010, 0011 }

z = 1111
o = 1110
set: even numbers
Some abstract operators are:
Code: [Select]
(z1, o1) | (z2, o2) = (z1 & z2, o1 | o2)
(z1, o1) & (z2, o2) = (z1 | z2, o1 & o2)
(z1, o1) ^ (z2, o2) = ((z1 & z2) | (o1 & o2), (z1 & o2) | (o1 & z2))
And obviously you can build addition out of those, but that's not very efficient. So, question 1, does anyone have a better idea for how to implement addition in BD1? Something nice and elegant like the ones above?
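I don't have anything elegant myself, but for whoever wants to experiment: a brute-force reference over a 4-bit universe, useful as ground truth to test candidate formulas against (my own sketch, not from the reference above):

```c
/* Exact abstract addition in BD1 over a 4-bit universe, by enumerating
   all pairs of set members and recording which result bits are ever 0
   and ever 1. */
#define MASK 0xFu

/* x is in the set (z, o) iff (x | z) == -1 and (x | o) == o */
static int bd1_contains(unsigned z, unsigned o, unsigned x)
{
    return ((x | z) & MASK) == MASK && (x | o) == o;
}

void bd1_add_ref(unsigned z1, unsigned o1, unsigned z2, unsigned o2,
                 unsigned *z, unsigned *o)
{
    unsigned az = 0, ao = 0;   /* bits that are ever 0 / ever 1 in a sum */
    for (unsigned x = 0; x <= MASK; x++)
        for (unsigned y = 0; y <= MASK; y++)
            if (bd1_contains(z1, o1, x) && bd1_contains(z2, o2, y)) {
                unsigned s = (x + y) & MASK;
                az |= ~s & MASK;
                ao |= s;
            }
    *z = az; *o = ao;
}
```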

The other obvious implementation of a bitfield domain, BD2, is with a bitmask m that says which bits are known and a value v in which only the bits that are set in m are relevant and the rest are zero for convenience. Unlike BD1, BD2 has no representation for the empty set (BD1 has multiple such representations, namely all those where at some position k, both z and o have a zero).
Anyway, a value in BD2 is thus a tuple (m, v) and as example
Code: [Select]
m = 1111
v = 0101
set: { 0101 }

m = 1001
v = 0000
set: { 0000, 0010, 0100, 0110 }

m = 0001
v = 0000
set: even numbers
Some abstract operations are:
Code: [Select]
(m1, v1) | (m2, v2) = ((m1 & m2) | v1 | v2, v1 | v2)   // both known or one of them is 1
(m1, v1) & (m2, v2) = ((m1 & m2) | (m1 ^ v1) | (m2 ^ v2), v1 & v2) // both known or one of them is 0
(m1, v1) ^ (m2, v2) = (m1 & m2, (v1 ^ v2) & m1 & m2)
Most things get more complicated in this formulation, except XOR.
In BD2, building addition from these primitives is even worse than it was in BD1.
You can rewrite it a little and get this, which is about 14 times faster in practice:
Code: [Select]
uint abm = m1 & m2;
uint knownzero = (m1 ^ v1) | (m2 ^ v2);
uint knownone = v1 | v2;
uint cm = 1 | ((abm & ~(v1 ^ v2)) << 1);
uint cv = (v1 & v2) << 1;
uint m = 0;

for (int i = 0; i < 32; i++)
{
    uint e = cm & abm;
    m |= e;
    uint t = (cm & cv & knownone) << 1;
    cm |= ((e | (cm & ~cv & knownzero)) << 1) | t;
    cv |= t;
}
uint v = v1 + v2;
But I suspect that there's some trick I'm missing. Something elegant I'm overlooking. Something like this:
Code: [Select]
(m1, v1) + (m2, v2) = (~(v1 + v2) ^ ((v1 | ~m1) + (v2 | ~m2)), v1 + v2)
But that doesn't work. This is basically saying "a bit of the result is known if it is the same in (minimum element in set 1 + minimum element in set 2) and in (maximum element in set 1 + maximum element in set 2)", which is not true in general, because for example (now using the general notation where an * means "unknown") 000* + 000* = 00** whereas the above code would conclude that the least significant bit is known to be zero.
So question 2, is there a better way to implement addition in BD2?

Any useful (or interesting, or better yet: both) ideas are appreciated.

Math and Science / Propagating bounds through OR and AND
« on: June 19, 2013, 06:16:34 pm »
Problem description:
There are two variables, x which is known to be in the range [a, b], and y which is known to be in the range [c, d].
The problem is, compute the tightest range that contains all the possible values that x|y  (that's a bitwise OR) could have.

The solution, if you want it, can be found in Hacker's Delight.

Now, the real reason I posted this is that I discovered a more efficient algorithm than the one in HD (it's actually on the HD website as of recently, in the Contributions from Correspondents section), and I'd like to try to prove its correctness formally (I only have a sort of informal justification for it).

Spoiler contains the algorithm I discovered + justification (no proof).

Spoiler For Spoiler:
Explanation below.
Code: [Select]
unsigned minOR(unsigned a, unsigned b, unsigned c, unsigned d)
{
    // nlz(x) = number of leading zeros of x, as in Hacker's Delight
    unsigned settablea = (a ^ b) == 0 ? 0 : 0xFFFFFFFF >> nlz(a ^ b);
    unsigned settablec = (c ^ d) == 0 ? 0 : 0xFFFFFFFF >> nlz(c ^ d);
    unsigned candidatebitsa = (~a & c) & settablea;
    unsigned candidatebitsc = (a & ~c) & settablec;
    unsigned candidatebits = candidatebitsa | candidatebitsc;

    unsigned target = candidatebits == 0 ? 0 : 0x80000000 >> nlz(candidatebits);
    // small optimization compared to ~a & target
    unsigned targeta = c & target;
    unsigned targetc = a & target;

    unsigned newa = a & (targeta == 0 ? -1 : -targeta);
    unsigned newc = c & (targetc == 0 ? -1 : -targetc);
    // no need to actually set the target bit, it will be 1 in the other bound by construction
    return newa | newc;
}

It's based on the algorithm from HD. It uses the same strategy, but takes a shortcut. The way both algorithms work is by increasing a or c in a way that sets a bit that is already set in the other (so setting it does not increase the value of a|c), but in return that lets you reset all the bits that are less significant than that bit, so the value of a|c can go down. Of course you can not increase a beyond b, nor c beyond d. It should be possible to prove that it actually results in the lower bound if you pick the highest bit that you can set in that way.
HD's algorithm starts at the top bit and works its way down until it finds the first bit it can set in either a or c.
The algorithm I discovered relies on the fact that a <= b, and therefore the highest bit k at which a and b differ must be 0 in a and 1 in b. (a | (1 << k)) & -(1 << k) must then be less than or equal to b, because it has the same prefix and then continues with only zeroes.
So the "using this bit won't make a bigger than b"-test is a simple bitmask. That mask can be ANDed with the other condition (let's name the result of that AND the candidate bits), and then the highest set bit is the highest bit that passes both tests. If you do this for both bounds, then the highest set bit in the union of the two sets of candidate bits is the bit that can be set in some bound (you forgot, at that point, which bound - but that's easy to find out again) such that it minimizes a|c.

So.. any tips on how to prove it formally?
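Not a proof, but a useful companion to one: brute-force ground truth over small ranges, so the fast algorithm can be exhaustively compared against it for all 4-bit (a, b, c, d). Here's the reference function (my own sketch):

```c
/* The true tightest lower bound of x|y by enumeration: the minimum of
   x | y over all x in [a, b] and y in [c, d]. Only usable for small
   ranges, but that's enough to exhaustively test the fast algorithm. */
unsigned minOR_ref(unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned best = ~0u;
    for (unsigned x = a; x <= b; x++)
        for (unsigned y = c; y <= d; y++)
            if ((x | y) < best)
                best = x | y;
    return best;
}
```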

Miscellaneous / Good books, reading lists
« on: May 16, 2013, 12:11:49 pm »
Here's a list of books that are good enough that I remember having read them. I'll probably add more later.

  • The Saga of Recluce
  • Memory, Sorrow, and Thorn
  • The Belgariad and The Malloreon
  • Artemis Fowl
These all happen to be series, so that's actually a LOT of reading material.

Other Fiction
  • The Bourne Identity, The Bourne Supremacy and The Bourne Ultimatum
  • The Dirty Streets of Heaven

Non-fiction
  • Hacker's Delight
  • The Art of Computer Programming, particularly volume 4
  • The Rocks Don't Lie
  • Periodic Tales: A Cultural History of the Elements, from Arsenic to Zinc
  • At Home: A Short History of Private Life

Computer Usage and Setup Help / Planned Computer Configuration
« on: April 19, 2013, 12:36:32 pm »
So, I need a new computer. Here's what I have in mind now and why.

- Intel 4770K. Approx €350. I know an Ivy or SB-E would get better perf/price, but I really want AVX2. As an asm optimizer, I will actually use it, too.
- Mobo: dunno, they don't seem to be here yet. Probably a reasonably high end ASUS, but not the Insane Top Model. €200-250ish?
- 16GB DDR3, Corsair Vengeance PC3-15000 CL9 €130 ("but 4770K doesn't support that speed!", I know)
- Corsair H100i €100 (obviously there will be some OCing (CPU ends in a K so yea), which is also why the memory is slightly faster than standard)
- Corsair SP120's (Quiet Edition), because an H100i really needs pressure optimized fans, and the stock fans make a ridiculous amount of noise (hey Corsair, wtf?). €25 for 2.
- Corsair AF120's (Quiet Edition), case fans. €25 for 2. (I actually research and buy fans, I'm crazy like that)
- Corsair C70, one of the not that many cases that will fit an H100i, and pretty nice. €100
- OCZ ZT Series OCZ-ZT750W €100 (fully modular, plenty of power, ships with nice cables)
- Samsung 840 pro 256GB. €200. Neither best perf/price nor best capacity/price, but good balance.
- HD6970, €0 because I already have it. I know it's a noob thing. I'll upgrade it later.

Total: €1280?! Damn.
What it will cost me: €280, because 1k from my grandparents. Jelly?

Computer Projects and Ideas / haroldbot (theorem prover / solver)
« on: April 05, 2013, 12:40:19 pm »
haroldbot (development name: TheoremBot) was an IRC bot (now only a website) that proves theorems, solves equations, and can even calculate. All in the 32-bit bitvector domain (if that means nothing to you, take that to mean "it calculates with unsigned ints, not natural numbers or real numbers or anything like that").

Warning: the interpretation of >> has changed between versions, it is and was an unsigned shift in the current and the first version, but a signed shift in some versions in between.
Warning: the interpretation of >>> has changed between versions, it is now a rotate.
Warning: in version 5d, theorem proving mode has become much more sensitive to divisions by zero, so sensitive that if there's any possibility of division by zero it will freak out. I'll try to do something about that at some point.

How to use it
There are three basic modes.
  • The "solve" mode. Just type in any boolean expression. haroldbot will try to solve it, and give (if applicable) cases of inputs that make it true, false, or result in a divby0. If an expression is always true, haroldbot will search for a proof. Example: "x * 3 == 1" or "x + y == y + x"
  • The calculation/simplification mode. Used for any non-boolean expression. You can make any expression non-boolean by surrounding it with parentheses. For non-constant results, only some inputs result in anything interesting (if the resulting BDD can be matched against a known pattern). Example: "1+2+3+4^15" or "a | a"
note: haroldbot tries to answer you regardless of what else you put in a message as long as it has the string haroldbot: in it. If it doesn't answer, usually that means you asked something it didn't like.
note: sometimes it will say something like "time-out while counting solutions". That means the BDD approach didn't work out and it had to switch to the SAT solver - don't worry if you don't know what that means, the implication is what's important: it will not try to find "nice" results, it will just find any old result. Quantified mode does not fall back to the SAT solver (yet?).

Values in haroldbot have no signedness, instead the operations have signedness. All operators and functions are unsigned by default, append s (operators) or _s (functions) to get the signed version, if there is one.

Expression syntax
Operator precedence and associativity are almost the same as in C, but not quite. Notably, the comparisons have a lower precedence than the bitwise operations (but still higher than the boolean operations). Here is a list of available operators and their gotchas, in order of precedence (groups of equal precedence not shown).
  • ~ bitwise NOT, aka 1's complement.
  • - unary minus, aka 2's complement.
  • * multiplication. haroldbot doesn't really like multiplication, especially not if neither of the two operands is a constant. Computation tends to be slow (using the SAT solver fallback) or even time out and give no result.
  • /, /u, /s division. The warning about multiplication also applies to division. Note that "0x80000000 / -1" in signed mode doesn't "error", as it usually would.
  • %, %u, %s remainder, aka modulo (in the unsigned case). Signed remainder has the sign of the dividend. The warning about multiplication also applies to remainder.
  • + addition.
  • - subtraction.
  • << left shift. haroldbot doesn't like shifts where the right operand isn't a constant, but you can sometimes get away with it. It's not as bad as multiplication.
  • >>, >>u, >>s signed right shift in signed mode, unsigned right shift in unsigned mode. The warning about shifts also applies to signed right shift.
  • & bitwise AND.
  • ^ bitwise XOR.
  • | bitwise OR.
  • ==, !=, <, <u, <s, >, >u, >s, <=, <=u, <=s, >=, >=u, >=s signed or unsigned comparisons. '=' is not a comparison. Comparisons give a mask of all 1's or all 0's instead of being a boolean true or false.
  • => implies. implemented as (~a | b) == -1.
  • && boolean AND.
  • || boolean OR.
  • ?: actually a bitwise mux, acts like the usual ternary operator if the condition can only be 0 or -1 (for example, if the condition is a boolean expression).
  • let name = expression in expression used to give names to things. Useful to factor out common sub-expressions. For example: "haroldbot: let m = x >> 31 in abs(x) == (x + m) ^ m"
Built-in functions:
  • min(a, b) takes the signed or unsigned minimum of a and b.
  • min_s(a, b) takes the signed minimum of a and b.
  • min_u(a, b) takes the unsigned minimum of a and b.
  • max(a, b) takes the signed or unsigned maximum of a and b.
  • max_s(a, b) takes the signed maximum of a and b.
  • max_u(a, b) takes the unsigned maximum of a and b.
  • popcnt(a) the population count aka hamming weight (number of 1 bits) of a.
  • nlz(a) number of leading 0's in a.
  • ntz(a) number of trailing 0's in a.
  • reverse(a) reverses the bits in a.
  • abs(a) takes the absolute value of a where a is interpreted as signed (otherwise it would do nothing).

How it works
It builds an array of 32 binary decision diagrams of the input (one for each bit), and does some fairly standard operations on them (quantified solve mode uses universal quantification over all the variables that you didn't mention, which is a bit rare). There's no magic. If you want to learn more about BDDs, you can read the draft of Knuth's Chapter 7.1.4.
When that fails (that is, there are more than about a hundred thousand nodes in the BDDs), it switches to a SAT solver. The switch happens when the BDDs get too big to fit in their pre-allocated array; how long that takes depends on the speed of the client.

The proofs it finds (if any) are made using the fairly brute-force approach of taking both "ends" of the proof and applying rewrite rules until they meet in the middle. This process is guided somewhat by selecting candidate expressions to rewrite based on their "weight": a combination of how many steps away an expression is from the original and, crucially, how complicated it is. It's an empirical observation that a proof usually doesn't need expressions that are a lot more complicated than the most complicated "end" of the proof. Of course this does not always hold, so this optimization makes some proofs slower to find, while making typical proofs faster to find.

A second important component is two forms of hashing. One form helps to match expressions that are overall structurally equal: all explored expressions are put into hash sets using this type of hash, to make it fast to detect when the two expanding clouds of expressions meet. That hash is the usual "combination of the hashes of the child nodes and this node". The other hash helps to match rewrite rules to expressions; it considers only a node and its immediate children. Rewrite rules are put in bins according to which hashes they would match onto, and when rewriting, this is used to only attempt to match a rule pattern to an expression if it has a decent chance of actually matching.
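That meet-in-the-middle search can be sketched in miniature. Below is a toy model, not haroldbot's actual code: it rewrites strings instead of expression trees, the `prove` function, the rules, and the weight function (steps taken plus expression size) are all invented for illustration, and rules are assumed to be reversible so the backward search can simply invert them.

```python
from heapq import heappush, heappop

def prove(start, goal, rules, limit=10000):
    """Bidirectional rewrite search: expand from `start` with `rules`
    and from `goal` with the inverted rules, cheapest expressions first,
    until the two clouds of explored expressions meet."""
    if start == goal:
        return True
    inverse = [(rhs, lhs) for (lhs, rhs) in rules]
    seen = {start: 'L', goal: 'R'}       # which side first reached each expression
    frontier = []
    heappush(frontier, (len(start), 0, start, 'L'))
    heappush(frontier, (len(goal), 0, goal, 'R'))
    for _ in range(limit):
        if not frontier:
            return False
        _, steps, expr, side = heappop(frontier)
        for pat, rep in (rules if side == 'L' else inverse):
            i = expr.find(pat)
            while i != -1:
                new = expr[:i] + rep + expr[i + len(pat):]
                if seen.get(new, side) != side:
                    return True          # the two expanding clouds met
                if new not in seen:
                    seen[new] = side
                    # weight = steps so far + size, so simple expressions win
                    heappush(frontier, (steps + 1 + len(new), steps + 1, new, side))
                i = expr.find(pat, i + 1)
    return False
```

The weight ordering is the point: expressions much bigger than either "end" sink to the bottom of the priority queue and are rarely explored.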

I'm open to requests, but BDDs are no silver bullet and will not solve everything, and the SAT solver fallback tends to be really slow (through no fault of its own, it's just that it automatically only gets to deal with hard problems - everything easy is solved by the BDDs). For example multiplication, division and remainder, are just fundamentally hard to deal with. It's also not really a good framework for simplification, but you can use theorem proving mode to test if your own simplification is correct.

Computer Projects and Ideas / Continued Fraction Math
« on: March 19, 2013, 09:57:52 am »
(Infinite) continued fractions (CFs) are a nice trick for storing and manipulating numbers in a computer that would otherwise be hard to handle (such as irrationals and transcendentals). The nice thing about them is that you can get a correct finite prefix of the sum/product/decimal-expansion/etc of one or two CFs using only finite* prefixes of the operand(s). That means that, as long as you ask for only finitely many digits of the result, a prefix of the actual correct result rolls out in finite time.

* but not always. For example, it's impossible to compare two equal infinite CFs - maybe the next term will make them differ, so you would have to evaluate infinitely many terms. This also arises in "implicit comparisons", such as subtracting two equal infinite CFs or squaring the square root of an integer that isn't a perfect square. The CF library I've been working on tries to detect those cases, but it can get confused - it's better to just avoid them.

So, how do you do math with them? Didn't we all learn in school that you can't really do math with CFs in a mechanical way (ie you had to reason about it to come up with an answer)? Turns out there is a way, but I guess it's rarely taught. It's a shame, it's not really all that complicated and it's a nice trick to know.

Now, on to the library I've been working on. It's not "ready", and probably full of bugs (if you find one, please let me know), but it's "ready enough" to show some neat things that you might have thought were terribly hard to do, such as evaluating ##\frac{\pi e}{\sqrt{2}}## to 500 decimals. The example program is doing just that, and you can use Mathematica/wolfram-alpha to see it's correct. Just for fun it also tries to merge successive operations, because "one operation" is really of the form ##\frac{axy+bx+cy+d}{exy+fx+gy+h}## where x and y are the inputs, a through h are set to constants that determine the operation, so a lot of stuff can be put together in one operation, such as ##\frac{x}{y} + \frac{x}{2} + 3 = \frac{xy + 2x + 6y}{2y}##, merging 4 operations into 1.
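To give a flavor of how such an operation runs, here is a sketch of the one-input special case ##\frac{ax+b}{cx+d}## (set the xy coefficients of the two-input form above to zero). This is my own illustrative Python, not the library's code, and it glosses over the final "flush" step a finite input CF would need, plus the zero-denominator corner cases:

```python
def cf_transform(terms, a, b, c, d, want):
    """Emit up to `want` CF terms of (a*x + b)/(c*x + d), where `terms`
    are the CF terms of x."""
    out = []
    for t in terms:
        # absorb one term of x: substitute x -> t + 1/x'
        a, b, c, d = a * t + b, a, c * t + d, c
        # emit output terms while both bounds agree on the integer part
        while c != 0 and d != 0 and a // c == b // d:
            q = a // c
            out.append(q)
            # peel off the emitted term: state becomes 1/(old - q)
            a, b, c, d = c, d, a - q * c, b - q * d
            if len(out) == want:
                return out
    return out  # NB: a finite input needs a final flush step, omitted here
```

For example, sqrt(2) = [1; 2, 2, 2, ...], and feeding it through (x + 1)/1 yields [2; 2, 2, ...], the CF of sqrt(2) + 1, one term at a time.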

I'm not sure if this has a real use, but maybe some of the people working on custom CASes are interested in the algorithm? The paper I linked to isn't one of the easiest papers to read, but I can explain it if anyone's interested in implementing it themselves. The limitations I mentioned would be kind of bad for a CAS, but you can hack around them by using a "maximum expansion depth". I didn't implement that because I'm still looking for a better solution.

I'll start out with some stuff I actually use that may not be widely known: ("widely known" is stuff such as red-black trees, array-based heaps, circular buffers, hash tables, linked lists, you name it - first year CS material)
  • Binary Decision Diagrams solve all sorts of problems that you thought were hard or impossible, such as: boolean function equivalence, boolean satisfiability (and in fact counting the number of solutions, in time independent of that count) and Linear Boolean Programming. I've used BDDs with great success in many Project Euler problems (usually the "count solutions" part of BDDs) that were designed to require some mathematical insight. Great for "brute force - but smart".
  • Zero Suppressed BDDs - a variation of BDDs especially suited to sparse functions, such as when solving some Exact Cover (or regular Set Cover) problems (not when choosing a set has very non-local effects, such as with Sudoku). You can actually get a count of the solutions, pick a uniformly distributed random solution, or a solution that optimizes the sum of the weights of the chosen sets. Also good for representing all Hamiltonian paths through a graph, which you can use to solve some nontrivial instances of the Traveling Salesperson Problem.
  • Dynamic Programming. Not an algorithm, but a class of algorithms so big that you have already used DP even if you didn't know it. For example, you know how to calculate Fibonacci numbers in linear time? That's DP. You've used Dijkstra's algorithm? DP. Learn about DP, it will come in handy. Maybe this shouldn't be on the list - almost everyone has heard about it.
  • Branch and Bound. The wiki entry is just a stub, so no link. The idea is simple enough, but powerful: you do a Depth First Search, but you stop going down the recursion tree when the solution cannot possibly become better by going down this sub-tree than the best solution found so far. Especially good when combined with a fast approximation. This technique can save your ass when it looks like you are going to have to use actual brute force, as long as the cost function is monotonic while going down a sub-tree. Often applicable to solving puzzle games. It may seem niche, but I've had to use this in real life software that is actually used by people.
  • Suffix Arrays, takes less space than a suffix tree, and useful for many of the same problems, such as many string problems (such as searching for substrings in sub-linear time). Other significant applications include a linear time Burrows–Wheeler transform (useful for data compression) and linear time Lempel Ziv decomposition (probably useful for LZ77/Deflate).
  • Miller-Rabin primality test for when trial division gets too slow. It will test 64bit numbers for primality in mere milliseconds.
  • Pollard's Rho algorithm for when trial division gets too slow and you actually need a factor, not just a boolean "IsPrime".
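As a taste of that last pair of items, here's a minimal Miller-Rabin sketch in Python. Using the first twelve primes as witnesses is known to make the test deterministic for all n below about 3.3 * 10^24, which comfortably covers 64-bit inputs:

```python
# Witness set that is known to be sufficient for all n < ~3.3e24 (covers 64 bits)
SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

def is_prime(n):
    if n < 2:
        return False
    for p in SMALL_PRIMES:
        if n % p == 0:
            return n == p
    # write n - 1 as d * 2^s with d odd
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in SMALL_PRIMES:
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False    # a is a witness: n is definitely composite
    return True
```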
Let's get this going, I'm sure we can all learn something new and interesting from this thread if people start adding stuff to it.

Miscellaneous / High level languages are vexing
« on: February 18, 2013, 11:27:44 am »
High level languages make many things harder instead of easier. I'm sure most assembly programmers will feel the same way, especially once I show you what I mean.

So you want the full result of a multiplication. That's non-trivial on z80, but on x86 (and many others) you can get this in one or two instructions.
But now you're working in a high level language again and guess what, it won't let you do that. You could perhaps use a type that is twice as wide.. unless it was the widest type already. So you have to split your numbers and multiply the parts separately. To make matters worse, your compiler isn't detecting the idiom, and what should have taken 5 cycles or so now takes more than a dozen. Wow.
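The split-and-multiply-the-parts workaround looks roughly like this. It's sketched in Python (whose ints are arbitrary-precision, so here it's purely illustrative of what a C#/Java programmer is forced to write for a 32x32->64 multiply when 32 bits is the widest type available):

```python
MASK16 = 0xFFFF

def mul_full_32(x, y):
    """Full 64-bit product of two 32-bit values, computed from 16-bit halves.
    Returns (hi, lo) 32-bit words."""
    xl, xh = x & MASK16, x >> 16
    yl, yh = y & MASK16, y >> 16
    ll = xl * yl        # contributes to bits 0..31
    lh = xl * yh        # contributes to bits 16..47
    hl = xh * yl        # contributes to bits 16..47
    hh = xh * yh        # contributes to bits 32..63
    # sum the middle column, collecting the carry out of the low word
    mid = (ll >> 16) + (lh & MASK16) + (hl & MASK16)
    lo = (ll & MASK16) | ((mid & MASK16) << 16)
    hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16)
    return hi, lo
```

That's four multiplies and a pile of shifts, masks and adds, versus the single widening `mul` the hardware already has.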

Ok, but at least that was solvable and the performance may be acceptable. But now you want to do a 64bit modular multiplication. What's the problem, just mul then div, right? (assuming the inputs were already reduced) But of course that relies on 1) a full multiplication and 2) a full division. Not gonna happen. So now what, do a multi-word division? You can.. but it's complex and slow and bug prone, and it's something that's supposed to take 2 instructions.

Non-standard extensions to C/C++ solve this, but what if you're working in C# or Java? There are, afaik, no good solutions. (no, using bigint isn't a good solution, if anything it's worse)

Let's add some more, there's a ton of stuff like this.

Say you want to take the average of two unsigned numbers. Easy, right? Just add them and rotate-right-with-carry by 1.
Oh wait, you don't have a carry, because this is a high level language. Solutions range from "absolutely horrible" to "wrong", except perhaps for "average = low + ((high - low) / 2)" which I suppose is just about acceptable.. but I still have this nagging feeling of "but it should be simpler!". I suppose a smart compiler can detect this idiom and do the right thing anyway, but it's vexing.
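For reference, one of the not-horrible carry-free options is the old `(a & b) + ((a ^ b) >> 1)` identity. A quick Python illustration (the 32-bit mask just simulates a fixed-width unsigned type):

```python
MASK32 = 0xFFFFFFFF  # simulate 32-bit unsigned arithmetic

def avg_floor(a, b):
    # a + b == (a ^ b) + 2*(a & b): the XOR is the carry-less sum, the AND
    # holds the carries. Halving the XOR part *before* adding means the
    # intermediate never exceeds the operand width - no carry flag needed.
    return ((a & b) + ((a ^ b) >> 1)) & MASK32
```

It works for any pair (no need to know which is larger, unlike `low + ((high - low) / 2)`), but it's still three or four operations for what the hardware does in two.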

Or do a multi-word addition/shift. There are various ways to calculate the carry, but you're still forced to jump through hoops just because there's no "add/shift with carry".

Or, you want to find the index of the lowest/highest set bit. Simple. Even in Java, though I wouldn't trust it to compile to the right thing. But in C#? Nope. You'll have to write something complicated and/or slow to emulate it.
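The "complicated and/or slow" emulation typically ends up being the classic de Bruijn multiply trick: isolate the lowest set bit with `v & -v`, multiply by a de Bruijn constant so the top 5 bits of the product uniquely identify the bit position, then look the answer up in a table. Sketched in Python for a 32-bit value:

```python
DEBRUIJN = 0x077CB531
INDEX32 = [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
           31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9]

def ntz32(v):
    """Index of the lowest set bit of a nonzero 32-bit value."""
    assert v != 0
    # v & -v isolates the lowest set bit; the multiply shifts the de Bruijn
    # sequence so its top 5 bits encode the bit's position
    return INDEX32[((v & -v) * DEBRUIJN & 0xFFFFFFFF) >> 27]
```

Branch-free and reasonably fast, but it's still a multiply, a shift and a table load standing in for what should be a single `bsf`/`tzcnt` instruction.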

Rotates. Ok there is "(x << n) | (x >> -n)" and the likes of it, and then hope&pray the compiler isn't an idiot (C and C++ compilers generally aren't idiots with this, but once again C# loses). It could have been so simple. Say, >>> and <<< operators. Though of course Java already used >>> because it doesn't have unsigned types (worthy of a whole separate rant, especially the fact that bytes are signed). Do you even need rotates? Yes. You do.
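The "hope&pray" idiom, spelled out in Python (the masking simulates the 32-bit wraparound a fixed-width language gives you for free; the `& 31` on the shift counts mirrors the `>> -n` trick and dodges the shift-by-32 case, which is undefined behaviour in C):

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n."""
    n &= 31
    return ((x << n) & MASK32) | (x >> ((32 - n) & 31))
```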

Computer Projects and Ideas / Writing a compiler [tutorial]
« on: February 15, 2013, 07:27:00 pm »
Compiler tutorials often combine an overly-simple source language with a useless target. This one will combine a "sufficiently C-like language" with x64 (AMD64) as a target, so you can actually compile some interesting code with it and actually run it on your computer (no emulator needed).

How this tutorial works: first, general information on how to write a compiler like one of my compilers. Then, more about what I actually did in one of my compilers.

Some knowledge of x64 assembly may come in handy.

The Architecture

To keep things simple, the code-gen will be almost directly after the parser, with no significant optimizations or even type checking. Maybe I'll get into those later.

The picture of the architecture then becomes this:
Code: [Select]
input source code -> [parser] -> AST -> [semantic analysis] -> annotated AST -> [code-gen] -> blocks of code -> [linker] -> .exe file
AST means Abstract Syntax Tree.

The Parser

Compiler courses at universities tend to focus on parsing a lot, because it has a lot of "interesting" theory to talk about. In real life though, you don't build your own parse-tables, you use a parser generator. Parsing is, essentially, a solved problem. I use Coco/R a lot, because I like to work in C# and it's one of the few good parser generators that come with a plug-in for Visual Studio. In any case, it doesn't matter how you decide to parse, really - the end result will probably be an Abstract Syntax Tree.

Abstract Syntax Trees

The Abstract Syntax Tree you get out of the parser is a tree structure of the input source code. It contains all the relevant information from the source in a format more suitable for a computer.

Semantic Analysis

In serious compilers, you'd do a lot more here; such as type checking and checking for other errors (Definite Assignment, for example). But to get a functioning compiler, all you need is to bind name-uses to their definitions (strictly speaking you don't even need this, but having it allows your language to make sense). This is where things finally get non-trivial enough to warrant an explanation.

Unless you used "The Parser Hack" (mixing the Symbol Table up with the parser), names in the AST, such as variable names and function names, are textual. But in the rest of the compiler, you'll need to know what the actual thing they refer to is, not just the name. That's easy enough to fix, though.

Walk through the AST, and for every "scope", you make a new symbol table and push it on a stack.
For everything that declares anything, put that thing in the top-most symbol table.
For every use of a name, search for that name in the stack of symbol tables. Use the definition closest to the top of the stack.

Practical considerations:
- the stack can be The Stack, by doing it recursively.
- though slightly "dirty", using a writable field in the AST nodes that represent variable uses and function calls to store a reference to the AST node that declared them, works great.
- using a map<string, Node> (or equivalent) for the symbol tables makes sense, and I see no reason not to do it.
- function arguments should have their declaration be active inside the scope of their functions, not in the global scope, unless you really want your language to not make sense.
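The scope-stack walk described above is only a few lines. A Python sketch (my own names, not the tutorial compiler's actual code - which would be C# - and using an explicit stack rather than The Stack):

```python
class SymbolTables:
    """Stack of symbol tables: one per scope, innermost on top."""
    def __init__(self):
        self.stack = [{}]               # the global scope at the bottom

    def push_scope(self):
        self.stack.append({})           # entering a scope: new table on top

    def pop_scope(self):
        self.stack.pop()                # leaving a scope: its names vanish

    def declare(self, name, node):
        self.stack[-1][name] = node     # declarations go in the top-most table

    def lookup(self, name):
        # use the definition closest to the top of the stack
        for table in reversed(self.stack):
            if name in table:
                return table[name]
        raise NameError(name)
```

Binding is then one walk: push a scope on entering a block, `declare` on every declaration, `lookup` on every use (storing the result in the use's AST node), pop on leaving.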


The Code-Gen

This is where the trouble starts. Code-gen is pretty hard to get through, and this is, I think, where most compiler projects get stuck. Number 1 reason: trying to generate fast code. Everyone likes fast code, but you'll get into lots of trouble - Register Allocation is just about doable (though nontrivial), but you'd need to match tree-patterns against a huge database of patterns to get some nice code. I like fast code too, though, so this is exactly what I'll be doing in my next compiler, and maybe I'll write a short tutorial for that when I get it working.

But for now, there will be slow code. Stack machine code, in fact. But AMD64 isn't a stack machine! And that's really OK, that just means there would be explicit pushes and pops. But on 64bit Windows, you shouldn't do that. It's not like you can't, but you shouldn't, and I'll get into why when I talk about the Linker. But fortunately, you'll only ever use a statically determinable amount of stack, at most the maximum depth of an expression. At most, but it can be less - using the stack for everything would be so wasteful that it would make me cry, so I keep the top of the stack in RAX. The offset on the stack where a temporary value should go is easily determined by keeping track of what the stack-depth would have been if you had used pushes and pops.

Generating code for a binary (2-operand) expression is easy then: generate the code for the Right Hand Side, save the value in the stack at the right offset, generate the code for the Left Hand Side, then use the appropriate "op RAX, [RSP + offset]" instruction.
That last part should make clear why the Right Hand Side comes first: the memory operand has to be the Left Hand Side (otherwise the result would go back into memory, instead of into RAX).
An optimization is immediately obvious: generate "op RAX, constant" if the Right Hand Side is a constant (or the Left Hand Side, if the operation is commutative). Feel free to do it - I did.
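The whole scheme fits in a few lines. Here's a toy version in Python (tuple ASTs and invented names for illustration; the real code-gen emits bytes, not text): RAX plays top-of-stack, and the spill offset comes from the simulated stack depth, exactly as described above.

```python
OPS = {'+': 'add', '-': 'sub', '&': 'and', '|': 'or', '^': 'xor'}

def gen(node, depth, out):
    """Emit pseudo-assembly for ('const', n) / (op, lhs, rhs) tuple ASTs.
    `depth` is what the stack depth would have been with pushes and pops."""
    if node[0] == 'const':
        out.append(f"mov rax, {node[1]}")
        return
    op, lhs, rhs = node
    gen(rhs, depth, out)                        # RHS first...
    out.append(f"mov [rsp+{8 * depth}], rax")   # ...spill it to its fixed slot
    gen(lhs, depth + 1, out)                    # LHS ends up in RAX
    out.append(f"{OPS[op]} rax, [rsp+{8 * depth}]")
```

Note how the RHS-first order falls out naturally: the memory operand of the final instruction must be the spilled RHS, so that the result lands in RAX.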

Generating code for a variable may require some trickery, if you have array-types that should "decay" into pointers for example, using a variable of that type should give its address instead of its value.
For your usual local variables, you'd generate a "mov RAX, [RBP + offset]". But what should the offset be? Well, that's where an extra walk over the AST comes in: one that sums the sizes of all the local variables and remembers what the total size was when a variable was declared - that size will be its offset (you can use smarter layouts, but this works). In this walk over the AST, you should also keep track of the maximum number of arguments to the functions that you call, because you'll need it.

Generating code for a constant is trivial: "mov RAX, constant"

Generating code for assignments. Suddenly the Left Hand Side is not a value, but a location. It's possible to use a special LValue code-gen for that, but I just special-cased everything right there. In the language I compile there isn't much to choose from anyway: local scalar variables (including function arguments), local array dereference or pointer dereference. (no globals, though they wouldn't be hard to add)

Generating code for "return" is interesting. Code should be generated for its argument (if it has one), but then what? How do you return? Not "ret", or you'd skip the epilogue of the function. A jump would need to know the offset to the end.. but that isn't known yet. So a "sort of like half-a-stage" at the very end of the code-gen takes care of that.
Practical: you need to know the label of the end of this function. For this reason and others, you need to know the current function, so keep track of it somehow.

Generating code for "if" and "while". Again labels are needed, and the Linker will deal with them. The new and interesting bit here is conditions. It's possible to use the normal expression code-gen and then do a "test rax, rax", and always use the Z flag. But the code generated that way sucks too much for me, so I wrote a special code-gen for conditions. Tip: only the root node of the AST of the condition is really special, and you can have a rule that it must be a comparison. That means the special condition code-gen only has one method in it.

Generating code for functions. The body is trivial once you have the rest done; it's the prologue and epilogue that are the problem. You should have the "maximum stack usage by expressions" somewhere, and the "space used by locals", and the "maximum number of arguments to functions that you call" (multiply that by 8). Don't forget to align the stack by 16, otherwise Windows functions that you call will likely crash.
Also, to simplify the codegen for using variables and calling functions, the arguments to this function should be saved to their home locations on the stack. Optionally a frame pointer can be used, you don't really need it but I chose to use it because the generated code is a bit easier to understand and debug (bugs in the code-gen will happen, so that's not unimportant).
"return" needed a label to just before the epilogue, so actually put that label there.

Generating code for strings: "lea RAX, [rip+somelabel]". Tell the linker about the string and this usage, though. It has to put it in your data segment and patch up the label reference.

Generating code for imports: you don't. They just have a label. You don't even have to tell the Linker about this, if it is ever called then the call can do that.

Generating code for function calls: mostly trivial, except when calling imports. They have to be called as "call [rip+label]", and you have to tell the Linker about it. Names were bound, so at this point you can distinguish internal calls from calls to imports, and you know the details of the import, such as which dll it was from (because you have its declaration node), so you can do the actual importing in the call node. This means unused imports won't be imported.

You can make up the rest (unary operators, the special handling of division and modulo, for loops, etc), it's just more of the same.

Blocks of Code and the "sort of like half-a-stage"

At this point, you really don't want to be stuck with textual assembly code - you'd only have to parse it again. I just made a giant list of functions like "add_q_r_r(R r, R rm)" that output the right bytes into a buffer, and then collect a bunch of buffers. Every "block" has a buffer, an optional label, and an optional "target label".

The "half stage" then walks over the blocks, and does a couple of things:
- it tries to fix up jumps. But what if they're not within range? Well if you used the 32bit-offset versions, then you are now in trouble because the offset is just too big to make sense. If you used 8bit-offsets, then you may now have to go back and change the instruction and the length of the block. While resizes keep happening, keep iterating this procedure (a resize may cause a jump target to fall out of range)
- when all the sizes are set in stone, fix up the actual offsets of jumps and calls to internal functions.
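That fixpoint iteration (usually called branch relaxation) can be sketched like so. This is an invented minimal model, not the compiler's code: each block is a dict with a body `size` in bytes and an optional `jump_to` label, a short rel8 jump costs 2 bytes and a long rel32 jump costs 5, and jumps only ever widen, which guarantees termination.

```python
def relax_jumps(blocks):
    """Decide short vs long encoding for every jump, iterating to a fixpoint."""
    wide = {b['label']: False for b in blocks if b.get('jump_to')}
    while True:
        # lay everything out with the current choice of jump widths
        start, end, pos = {}, {}, 0
        for b in blocks:
            start[b['label']] = pos
            pos += b['size']
            if b.get('jump_to'):
                pos += 5 if wide[b['label']] else 2
            end[b['label']] = pos
        # widen any short jump whose rel8 displacement no longer fits;
        # widening grows a block, which can push *other* targets out of
        # range - hence the loop
        changed = False
        for b in blocks:
            t = b.get('jump_to')
            if t and not wide[b['label']]:
                rel = start[t] - end[b['label']]  # rel8 counts from the next instruction
                if not -128 <= rel <= 127:
                    wide[b['label']] = True
                    changed = True
        if not changed:
            return start, wide
```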


The Linker

The separate compilation model is, for 99%, a relic of the past. So the "Linker", or what's left of it, has lost the task of gathering stuff from object files. In fact, there are no object files.
What it does, then, is:
- create import and exception directories
- patch up import references (and data references, if you have, say, string literals)
- lay out the program code and data etc in virtual memory
- write it all into a valid .exe file

Generating the import directory is easy when you know its structure. The exception directory is trickier though. It has to know, for every function, where it began and where it ended, how much stack it allocated, which (if any) register it used as frame pointer and what offset it had relative to RSP, and, most annoyingly of all, the exact sequence of events in the prologue.
You need a lot of information that only the function that handles the AST node for functions has, so that's where I put it into an object and give it to the Linker. It just puts it in a list until its stage runs.

Patching up import references. To do this, you need the big list of places where imports were called, and the import directory you just made. Simply write the offsets into the code.

Laying stuff out in virtual memory is simple, just put everything in their segments, and align them. By the segment alignment that is, not by the file alignment.

Writing it out into a valid .exe seems hard but isn't really, just study the structure here. In theory there's a huge number of fields that you could fill in, but most of them aren't necessary to make the program run. Remember to align sections that you're writing by the file alignment, though, which is typically lower than the segment alignment (in other words, in virtual space you have more padding than there will be in the file).


I will probably expand this part quite a few times. There's a lot I could write, but most of it seems irrelevant to me right now. Be sure to ask things - I've found that I explain things better when asked about them, compared to when I'm just monologuing.

One thing I did not mention above is that I have types. Kind of. The types are relevant, for example, in arrays and pointer-dereference. Strings pretend to be pointers to char. Pointers use 64bit math.

Also, I have an "op RAX, [RBP + offset]" optimization, and something similar for array accesses, and I reverse commutative ops when it helps, to save quite a lot of code that I'd have to wade through when debugging.

Constant folding and using trivial identities (eg x*0=0). The way I do this is by giving every expression-node a virtual method Optimize that by default returns "this", but many sub-types override it with a method that returns a constant if the result of that node is (locally - ie it does not track whether a variable is a known constant) known to be a constant.
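The virtual-method scheme looks like this. A Python sketch with invented names (the real thing is C#, and would have a node class per operator):

```python
class Node:
    def optimize(self):
        return self           # default: no optimization applies


class Const(Node):
    def __init__(self, value):
        self.value = value


class Var(Node):
    def __init__(self, name):
        self.name = name


class Mul(Node):
    def __init__(self, l, r):
        self.l, self.r = l, r

    def optimize(self):
        l, r = self.l.optimize(), self.r.optimize()   # fold children first
        if isinstance(l, Const) and isinstance(r, Const):
            return Const(l.value * r.value)           # constant folding
        # trivial identities - purely local, and assuming no side effects
        if (isinstance(l, Const) and l.value == 0) or \
           (isinstance(r, Const) and r.value == 0):
            return Const(0)                           # x*0 = 0
        if isinstance(r, Const) and r.value == 1:
            return l                                  # x*1 = x
        return Mul(l, r)
```

Because each override only looks at its (already optimized) children, the whole pass stays local, which is exactly the limitation mentioned above: it won't notice that a variable happens to always hold a constant.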
