Suppose I create a table with a billion rows on a cluster with N nodes. Later, to increase performance, I add a new node to the cluster. Obviously the existing data is still stored on the original N nodes, and not on the new one. Is there a way to redistribute the data (online) to take advantage of the new node?
I realise the answer might depend on the configuration of the table. If some schemas suit this kind of redistribution well and others don't, I'd be interested to know about that too.
(This will be running on CDH5, if that makes a difference.)