We are building an enterprise data warehouse (EDW) on Phoenix (HBase).
Please refer to the attached diagram.
The EDW supports a unified architecture that serves both streaming and batch use cases.
I am recommending a staging area that is source compliant (i.e., it mimics the source structure).
In the EDW path, data is always loaded into staging first and then moved to the EDW.
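To make the two hops concrete, here is a minimal sketch of the path I have in mind. It uses in-memory dictionaries as stand-ins for the staging and EDW tables, and the field names (`id`, `cust_nm`, `customer_name`) are made up for illustration. It is not our actual Phoenix schema or load code.

```python
# Hypothetical in-memory stand-ins for the staging and EDW tables.
# In the real system both would be Phoenix tables.
staging = {}
edw = {}


def load_to_staging(record):
    """Hop 1: store the record exactly as the source sent it."""
    staging[record["id"]] = dict(record)


def move_to_edw(record_id):
    """Hop 2: conform the staged record and write it to the EDW."""
    src = staging[record_id]
    edw[record_id] = {
        "customer_id": src["id"],
        "customer_name": src["cust_nm"].strip().title(),
    }


load_to_staging({"id": 1, "cust_nm": "  acme corp "})
move_to_edw(1)
print(edw[1])  # {'customer_id': 1, 'customer_name': 'Acme Corp'}
```

The staged copy keeps the source shape untouched, so the EDW row can be rebuilt from staging without going back to the source.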
Others on the team dislike the idea because of the additional hop. They say the hop is unnecessary and will cause latency issues.
My position is that latency can be handled in two ways:
1. The caching layer will take care of it.
2. If the pipeline is designed properly, latency becomes a function of hardware.
What are your thoughts?
One other question: is Kafka required at all?
It was introduced into the architecture so that we can replay messages if there are Kinesis connectivity issues.
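By "replay" I mean something like the sketch below: an append-only log that lets a consumer re-read everything after the last offset it processed. The `ReplayLog` class is a hypothetical in-memory illustration of the idea only. It does not use the Kafka or Kinesis client APIs.

```python
class ReplayLog:
    """Append-only log; a consumer can re-read from any offset it remembers."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self._messages[offset:]


log = ReplayLog()
for event in ["e0", "e1", "e2", "e3"]:
    log.append(event)

# Assume downstream confirmed e0 and e1 before the connectivity issue.
last_delivered = 1
to_replay = log.read_from(last_delivered + 1)
print(to_replay)  # ['e2', 'e3']
```

Any component that retains messages and exposes offsets like this could fill the replay role, which is why I am asking whether Kafka specifically is needed.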
Is there a better way to do it?
Help, as always, is appreciated.