You Don’t Always Need Spark.

Balakrishnan Sathiyakugan
4 min readJul 5, 2022

Apache Spark is a general-purpose, lightning-fast cluster computing system that provides high-level APIs in Java, Scala, Python, and R. It can run workloads up to 100 times faster than Hadoop MapReduce when the data fits in memory, and roughly ten times faster when reading from disk. But you don’t always need Spark to get things done.

Spark is recommended for big data sets that cannot fit on one computer. But you don’t need Spark if you are working on smaller data sets. For data sets that fit on your local machine, there are many other tools you can use to manipulate data, such as:

  • AWK — a command-line tool for working with text files
  • R — a programming language and software environment for statistical computing
  • Python PyData Stack, which includes pandas, Matplotlib, NumPy, and scikit-learn, among other libraries

Sometimes, you can still use pandas on a single, local machine even if your data set is only a little bit larger than memory. Pandas can read data in chunks. Depending on your use case, you can filter the data and write the relevant parts to disk.
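For example, here is a minimal sketch of chunked reading with pandas; the file name, column, and filter value are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical input file and column names; adjust to your data.
chunks = pd.read_csv("events.csv", chunksize=100_000)

filtered = []
for chunk in chunks:
    # Filter each chunk before it accumulates in memory.
    filtered.append(chunk[chunk["status"] == "completed"])

# Write only the relevant rows back to disk.
pd.concat(filtered).to_parquet("completed_events.parquet")
```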

You can leverage SQL to extract, filter, and aggregate the data if the data is already stored in a relational database such as MySQL or Postgres. If you want to leverage pandas and SQL simultaneously, you can use libraries such as SQLAlchemy, which provides an abstraction layer to manipulate SQL tables with generative Python expressions.
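As an illustration, here is a hedged sketch of letting the database do the heavy lifting and pulling only a small summary into pandas; the connection string, table, and columns are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical Postgres connection string; replace with your own.
engine = create_engine("postgresql://user:password@localhost:5432/shop")

# Push the filtering and aggregation down to the database.
query = """
    SELECT region, SUM(amount) AS total_amount
    FROM orders
    WHERE order_date >= '2022-01-01'
    GROUP BY region
"""

summary = pd.read_sql(query, engine)
print(summary.head())
```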

The most commonly used Python Machine Learning library is scikit-learn. It has a wide range of algorithms for classification, regression, and clustering, as well as utilities for preprocessing data, fine-tuning model parameters, and testing their results. However, if you want to use more complex algorithms — like deep learning — you’ll need to look further. TensorFlow and PyTorch are currently popular packages.
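As a quick sketch of how little code a scikit-learn model needs (using the library’s built-in iris data set, not anything specific to this post):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small built-in data set, split into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a classifier and evaluate it on the held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```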

Spark Use Cases and Resources

Spark provides a faster and more general data processing platform, letting you run programs up to 100x faster in memory or 10x faster on disk than Hadoop. In 2014, Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one-tenth the number of machines, and it also became the fastest open-source engine for sorting a petabyte.


Spark’s Limitations

Spark has some limitations.

Spark Streaming’s latency is at least 500 milliseconds because it operates on micro-batches of records instead of processing one record at a time. Native streaming tools such as Storm, Apex, or Flink can push this latency lower and may be better suited to low-latency applications. Flink and Apex can also handle batch computation, so if you’re already using one of them for stream processing, there’s no need to add Spark to your stack.

Another limitation of Spark is its selection of machine learning algorithms. Currently, Spark only supports algorithms that scale linearly with the input data size. Deep learning is generally unavailable, though many projects integrate Spark with TensorFlow and other deep learning tools.

Beyond Spark for Storing and Processing Big Data

Keep in mind that Spark is not a data storage system, and several tools besides Spark can be used to process and analyze large datasets.

Sometimes it makes sense to use the power and simplicity of SQL on big data. A new class of databases, NoSQL, has been developed for these cases.

For example, you might hear about newer database storage systems like HBase or Cassandra. There are also distributed SQL engines like Impala and Presto. Many of these technologies use query syntax that you are likely already familiar with from your experience with Python and SQL.

Deciding Between Pandas and Spark

Let’s look at a few advantages of using PySpark over pandas. On very large data sets, pandas can be slow, while Spark’s built-in DataFrame API distributes the work across a cluster, making it much faster at that scale. A minimal PySpark sketch follows the considerations below.

Here are a few considerations when choosing PySpark over pandas:

  • Your data is huge and keeps growing over the years, and you want to improve processing time.
  • You need fault tolerance.
  • You need ANSI SQL compatibility.
  • You want a choice of language (Spark supports Python, Scala, Java, and R).
  • You want built-in machine-learning capability.
  • You’d like to read from sources such as Parquet, Avro, Hive, Cassandra, Snowflake, etc.
  • You want to stream data and process it in near real time.
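Here is the minimal PySpark sketch mentioned above; the bucket path and column names are hypothetical, and it assumes a running Spark environment:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-summary").getOrCreate()

# Hypothetical Parquet data set; Spark distributes the read and the aggregation.
orders = spark.read.parquet("s3://my-bucket/orders/")

summary = (
    orders
    .filter(F.col("order_date") >= "2022-01-01")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)

summary.show()
```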

Pandas and Spark aren’t mutually exclusive, though. Apache Arrow, an in-memory columnar data format, lets Spark exchange data with pandas efficiently, so you can do the heavy lifting in Spark and still use the pandas API, which some would argue is easier to work with than Spark’s DataFrames.
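For example, with Arrow enabled you can move a (small enough) Spark DataFrame into pandas and back; this sketch assumes Spark 3.x and a hypothetical Parquet file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Enable Arrow-based columnar transfers between Spark and pandas (Spark 3.x config key).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Hypothetical Parquet path for illustration.
sdf = spark.read.parquet("events.parquet")

# Pull a manageable slice into pandas for local analysis...
pdf = sdf.limit(10_000).toPandas()

# ...and hand it back to Spark when you need distributed processing again.
sdf2 = spark.createDataFrame(pdf)
print(sdf2.count())
```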

What if your company generates data exponentially?

Your company might generate only a small amount of data at first, but it may be expected to grow exponentially as the customer base grows. In that case, it is better to plan and build pipelines early than to crash the whole reporting ecosystem later. It is up to the business to decide whether to build that infrastructure from the start.

Conclusion

Apache Spark can process huge amounts of data very efficiently and with high throughput. It can handle batch processing, near real-time processing, lambda architectures, and Structured Streaming, and its out-of-the-box MLlib component can solve many complex data and predictive analytics problems. Apache Spark has had a significant impact on data engineering and data science at scale. Still, you don’t need Spark if you are working on smaller data sets, though there are some edge cases where you might want to use it anyway. I hope you liked this post!
