The amount of data that needs to be processed grows daily, so many state-of-the-art applications must be designed to process large volumes of data efficiently. During development, however, developers often do not have enough real-world data to benchmark and test their applications. In such cases, synthetic data can be sufficient if it is available in large amounts. As a result, demand has recently been growing for an efficient generator of large datasets that also reflect real data to some extent.

BDgen addresses this problem. It was developed as a software project at the Faculty of Mathematics and Physics of Charles University in Prague. The tool is implemented as a general framework that is highly extensible through plugins, which may be developed by third parties. The system is divided into a scalable backend, designed to generate Big Data on clusters using MPI, and a frontend for user-friendly definition of the backend's input data. We implemented generators for two commonly used formats, JSON and CSV. The generator also includes a plugin for generating data based on a regular grammar.
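
To illustrate the idea behind grammar-based generation, here is a minimal Python sketch of producing random strings from a right-linear (regular) grammar. It is not BDgen's plugin code or input format. The `GRAMMAR` table, the `generate` function, and the `max_steps` parameter are all hypothetical names chosen for this example.

```python
import random

# Hypothetical right-linear grammar (not BDgen's input format).
# Each nonterminal maps to a list of productions (terminal, next_nonterminal);
# next_nonterminal is None when the production ends the derivation.
# This grammar describes the regular language a*bc*d.
GRAMMAR = {
    "S": [("a", "S"), ("b", "T")],
    "T": [("c", "T"), ("d", None)],
}


def generate(grammar, start="S", rng=None, max_steps=1000):
    """Derive one random string by repeatedly picking a production."""
    if rng is None:
        rng = random.Random()
    out = []
    state = start
    for _ in range(max_steps):
        # Choose one production of the current nonterminal uniformly at random.
        terminal, state = rng.choice(grammar[state])
        out.append(terminal)
        if state is None:
            return "".join(out)
    raise RuntimeError("derivation did not terminate within max_steps")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are reproducible
    for _ in range(5):
        print(generate(GRAMMAR, rng=rng))
```

Because each record is derived independently, a generator of this kind could in principle be split across many worker processes, each with its own seed. How BDgen actually distributes work over MPI is not described here.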

See the repository   Download Bin and Doc

Team Members

Supervisor of the project: Doc. RNDr. Irena Holubová, Ph.D.

  • Bc. Jan Škvařil
  • Bc. Tomáš Faltín
  • Bc. Michal Hanzeli
  • Bc. Vojtěch Šípek
  • Bc. Dušan Variš

Contact Us