I am currently consulting at a client with the following requirements.
They want to make detailed data usage CDRs available to customers so that they can verify their data usage against the websites they visited. In short, this can be seen as an itemised bill for data usage. The data is currently not loaded into an RDBMS because of the volumes involved. The proposed solution is to load the data into HBase, running on an HDP cluster, and make it available for querying by subscribers. Low latency read access to the subscriber data is critical, as it may be exposed to 25 million subscribers. We will run a scaled-down version first as a proof of concept, with the intention of it becoming an operational data store. Once the solution works properly for the data usage CDRs, other CDR types will be added, so we need to build a cost-effective, scalable solution.
I am thinking of using Apache Phoenix, based on the following:
1. Current data loading into the RDBMS is file based (CSV), via a staging server using the RDBMS file load drivers.
2. Use the Apache Phoenix bin/psql.py script to mimic the above process when loading into HBase.
3. Expected data volume: 60 000 files per
   1 to 10 MB per file
   500 million records per day
   500 GB total volume per day
4. Use the Apache Phoenix client for low latency data retrieval.
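To put the volumes above in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: it assumes decimal gigabytes (1 GB = 10^9 bytes) and a load spread evenly over 24 hours, which real traffic will not be, so peak rates will be higher. The function name `daily_load_profile` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the expected data volume above.
RECORDS_PER_DAY = 500_000_000
BYTES_PER_DAY = 500 * 10**9  # assumption: 1 GB = 10^9 bytes


def daily_load_profile(records_per_day, bytes_per_day):
    """Return average record size and sustained ingest rates.

    Assumes the load is spread evenly across the day (an
    idealisation; peaks will be higher in practice).
    """
    return {
        "avg_record_bytes": bytes_per_day / records_per_day,
        "records_per_second": records_per_day / SECONDS_PER_DAY,
        "megabytes_per_second": bytes_per_day / SECONDS_PER_DAY / 10**6,
    }


profile = daily_load_profile(RECORDS_PER_DAY, BYTES_PER_DAY)
for name, value in profile.items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions this works out to roughly 1 KB per record and a sustained ingest rate of a little under 6 000 records per second, which is the write load any loading mechanism would need to keep up with on average.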
Is Apache Phoenix a suitable candidate for this specific use case?