IJRCS – Volume 3 Issue 2 Paper 1

POST STRATIFICATION SAMPLING AND HORVITZ THOMPSON ESTIMATOR FOR RANGE AGGREGATE QUERIES IN BIG DATA ENVIRONMENTS

Authors: S Barkath Nisha | R Latha Priyadharshini

Volume 03 Issue 02  Year 2016  ISSN No:  2349-3828  Page no: 1-5


Abstract:

Big Data refers to collections of very large datasets, and handling data at this scale is challenging. The Fast Range Aggregate Queries (FastRAQ) approach processes range aggregate queries, which apply an aggregate function to all tuples within the query ranges. The query result can be generated from the range cardinality query algorithm. The weight of each sample estimate is calculated using the Post Stratification sampling method, and the Horvitz Thompson estimator is used to estimate the total and mean of a super population from a stratified sample. Using these sampling methods reduces the time complexity.
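The two estimators named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the `(stratum, value)` sample layout, and the `stratum_sizes` mapping are assumptions made for illustration. The sketch uses the textbook forms, where the Horvitz Thompson total is the sum of each sampled value divided by its inclusion probability, and the post-stratified total weights each stratum's sample sum by N_h / n_h (stratum population size over stratum sample size).

```python
from collections import defaultdict


def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total as the sum of y_i / pi_i.

    values: sampled observations y_i
    inclusion_probs: probability pi_i that unit i was included in the sample
    """
    return sum(y / p for y, p in zip(values, inclusion_probs))


def post_stratified_total(sample, stratum_sizes):
    """Estimate a population total with post-stratification weights.

    sample: list of (stratum, value) pairs (hypothetical layout)
    stratum_sizes: dict mapping stratum -> known population size N_h
    Strata with no sampled units contribute nothing in this sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stratum, value in sample:
        sums[stratum] += value
        counts[stratum] += 1

    total = 0.0
    for stratum, n_h in counts.items():
        weight = stratum_sizes[stratum] / n_h  # N_h / n_h
        total += weight * sums[stratum]
    return total


def post_stratified_mean(sample, stratum_sizes):
    """Estimated mean: estimated total divided by total population size."""
    return post_stratified_total(sample, stratum_sizes) / sum(stratum_sizes.values())


# Illustrative data only: two strata with known sizes 100 and 50.
sample = [("a", 10.0), ("a", 20.0), ("b", 5.0)]
stratum_sizes = {"a": 100, "b": 50}

ps_total = post_stratified_total(sample, stratum_sizes)
ps_mean = post_stratified_mean(sample, stratum_sizes)

# With inclusion probabilities n_h / N_h, the Horvitz Thompson total
# coincides with the post-stratified total for this sample.
probs = [2 / 100, 2 / 100, 1 / 50]
ht_total = horvitz_thompson_total([v for _, v in sample], probs)
```

In this toy example both estimators give the same total because the inclusion probabilities equal n_h / N_h within each stratum; with unequal inclusion probabilities the two would generally differ.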

Keywords:

Balanced partition; Big Data; FastRAQ; Hadoop; Horvitz Thompson; MapReduce; Multidimensional Histogram; Post Stratification; Range Aggregate Query
