Spark PySpark Advanced Python Devops Software Engineering Course By Joshi
Multiple & Logistic Regression in Spark 101
Spark MLlib code
The statement "from pyspark.ml import regression" imports the regression module from the PySpark library, not from local code.
To practice implementing this yourself, create a notebook on Databricks Community Edition:
https://towardsdatascience.com/building-a-linear-regression-with-pyspark-and-mllib-d065c3ba246a?gi=66d5401a1bb7
Regression Spark parameters
Spark Logistic Regression
https://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/ml/regression/LinearRegression.html
Quant Methods in Regression
Maximum likelihood
Log likelihood
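Maximum likelihood picks the parameters that make the observed data most probable; in practice we maximize the log-likelihood, since a sum of logs is numerically better behaved than a product of probabilities. A pure-Python sketch for logistic regression (the data and coefficients are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta0, beta1, xs, ys):
    """Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Made-up data: the outcome switches from 0 to 1 as x grows
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]
```

With beta0 = beta1 = 0, every p is 0.5, so the log-likelihood is 4*log(0.5); a slope that separates the classes scores higher.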
Regression Code
https://dziganto.github.io/classes/data%20science/linear%20regression/machine%20learning/object-oriented%20programming/python/Understanding-Object-Oriented-Programming-Through-Machine-Learning/
Git commands 101 Fork, clone, merge branches
Git command primer to merge two branches
Converting Python into Spark code
Running SciPy would be a bad idea, as it would slow things down: Python does not run on the JVM. It is suggested to use ScalaNLP Breeze instead.
https://stackoverflow.com/questions/40370759/using-python-scipy-from-spark-scala
Is Python code faster, or should we write small UDFs in Scala?
There is no easy, straightforward answer. Since Python does not run on the JVM, PySpark is only an API over it.
Garbage collection and Java serialization largely determine the speed.
https://changhsinlee.com/pyspark-udf/
Spark Optimization:
https://blog.cloudera.com/blog/2015/04/how-to-translate-from-mapreduce-to-apache-spark-part-2/
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
https://stackoverflow.com/questions/32435263/dataframe-join-optimization-broadcast-hash-join/39404486
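The reason reduceByKey is preferred over groupByKey is map-side combining: each partition pre-aggregates its values before the shuffle, so fewer pairs cross the network. A pure-Python illustration of the idea (not actual Spark code; the records are made up):

```python
from collections import defaultdict

# Four (key, value) records split across two "partitions"
records = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
part1, part2 = records[:2], records[2:]

# groupByKey-style: every record crosses the network before aggregation
shuffled_group = records  # 4 pairs shuffled

def combine(partition):
    """Map-side combine: pre-sum values per key within one partition."""
    acc = defaultdict(int)
    for k, v in partition:
        acc[k] += v
    return list(acc.items())

# reduceByKey-style: partitions pre-aggregate, only partial sums shuffle
shuffled_reduce = combine(part1) + combine(part2)  # 3 pairs shuffled
```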
Partitioning in Spark
https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
Executor and Driver - how many GB of memory
How to make partitions in Spark
Make sure the number of partitions is at least the number of executors.
https://github.com/vaquarkhan/vaquarkhan/wiki/How-to-calculate-node-and-executors-memory-in-Apache-Spark
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/sparksqlshufflepartitions_draft.html
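Conceptually, Spark's default HashPartitioner assigns a key to partition hash(key) mod numPartitions, so every record with a given key lands in the same partition. A pure-Python sketch of the idea (Python's hash() stands in for the JVM hashCode here, so the exact assignments differ from Spark's):

```python
def partition_for(key, num_partitions):
    """Hash partitioning: partition index = hash(key) mod num_partitions.
    Illustrative only; Spark uses the JVM hashCode, not Python's hash()."""
    return hash(key) % num_partitions

# Records with the same key always land in the same partition
assignments = [partition_for(k, 4) for k in [10, 11, 10, 12]]
```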
Spark Partitioning data & speed
How to optimally partition data using Spark.
PySpark and its Transformation
https://blog.usejournal.com/tutorial-on-pyspark-transformations-and-mlib-7ed289a9e843?gi=192332db80e
MLlib DataFrame-based API
http://spark.apache.org/docs/latest/ml-guide.html
RDD vs DF
https://www.adsquare.com/comparing-performance-of-spark-dataframes-api-to-spark-rdd/
Submitting a job to Pyspark on terminal
# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
--master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py \
1000
https://spark.apache.org/docs/latest/submitting-applications.html
--driver-memory 5G
--conf spark.driver.maxResultSize
--conf spark.shuffle.service.enabled
--conf spark.dynamicAllocation.enabled
--conf spark.ui.enabled
--conf spark.speculation
--conf spark.port.maxRetries
--queue root.mde.ste_queue.ste_queue3
--conf spark.kryoserializer.buffer.max
--conf spark.executor.pyspark.memory
--conf spark.executor.memoryOverhead
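Putting the flags above together, a hypothetical spark-submit invocation could look like the following. All values are placeholders to tune per cluster; the queue name is taken from the notes and my_job.py is a made-up script name:

```shell
./bin/spark-submit \
  --master yarn \
  --driver-memory 5G \
  --queue root.mde.ste_queue.ste_queue3 \
  --conf spark.driver.maxResultSize=4G \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.ui.enabled=true \
  --conf spark.speculation=true \
  --conf spark.port.maxRetries=32 \
  --conf spark.kryoserializer.buffer.max=512m \
  --conf spark.executor.pyspark.memory=2G \
  --conf spark.executor.memoryOverhead=1G \
  my_job.py
```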
Submit a job to spark cluster
https://stackoverflow.com/questions/49081211/submit-python-script-into-spark-cluster
Compilation using Maven
Classes & Functions in Python 103
Meta Classes
A metaclass is the class of a class. A class defines how an instance of the class (i.e. an object) behaves while a metaclass defines how a class behaves. A class is an instance of a metaclass.
Skeleton of a class.
https://stackoverflow.com/questions/100003/what-are-metaclasses-in-python
https://realpython.com/python-metaclasses/
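A small illustration of "a class is an instance of a metaclass": a hypothetical metaclass that records every class it creates, which is one common practical use (the names here are made up):

```python
class RegistryMeta(type):
    """Metaclass: customizes class creation and records each new class."""
    registry = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.registry[name] = cls  # runs once per class definition
        return cls

class Model(metaclass=RegistryMeta):
    pass

class LinearModel(Model):
    pass
```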
Understanding closures
https://www.programiz.com/python-programming/closure
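A minimal closure: the inner function keeps a reference to a variable from the enclosing scope even after the outer function has returned (names are illustrative):

```python
def make_scaler(factor):
    def scale(x):
        return x * factor  # 'factor' is captured from the enclosing scope
    return scale

double = make_scaler(2)  # 'factor' lives on inside 'double'
```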
Super
super(GeneralizedLinearRegression, self)
When do we use Super?
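super delegates to the next class in the method resolution order, most commonly to reuse the parent's __init__. A sketch mirroring the call above; the class bodies are made up, not PySpark's:

```python
class LinearRegressionBase:
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

class GeneralizedLinearRegression(LinearRegressionBase):
    def __init__(self, link="identity", **kwargs):
        # Python 2 style as in the note; in Python 3 plain super() works too
        super(GeneralizedLinearRegression, self).__init__(**kwargs)
        self.link = link

glr = GeneralizedLinearRegression(link="log", fit_intercept=False)
```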
Staticmethods
@staticmethod
With staticmethods, neither self (the object instance) nor cls (the class) is implicitly passed as the first argument. They behave like plain functions, except that you can call them from an instance or from the class.
https://stackoverflow.com/questions/136097/what-is-the-difference-between-staticmethod-and-classmethod
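A side-by-side sketch: the staticmethod receives neither self nor cls, while the classmethod receives the class implicitly (names are illustrative):

```python
class Stats:
    ddof = 0  # class-level default

    @staticmethod
    def mean(xs):
        # No implicit self/cls: just a plain function stored on the class
        return sum(xs) / len(xs)

    @classmethod
    def sample_variant(cls):
        # cls is passed implicitly, so this also works for subclasses
        obj = cls()
        obj.ddof = 1
        return obj
```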
Double Underscore
How to make class attributes private in Python
Single Underscore
Names with a leading underscore in a class simply indicate to other programmers that the attribute or method is intended to be private. However, nothing special is done with the name itself.
To quote PEP-8:
_single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose name starts with an underscore.
A single leading underscore isn't exactly just a convention: if you use from foobar import *, and module foobar does not define an __all__ list, the names imported from the module do not include those with a leading underscore. Let's say it's mostly a convention, since this case is a pretty obscure corner case.
__foo__: this is just a convention, a way for the Python system to use names that won't conflict with user names.
_foo: this is just a convention, a way for the programmer to indicate that the variable is private (whatever that means in Python).
__foo: this has real meaning: the interpreter replaces this name with _classname__foo as a way to ensure that the name will not overlap with a similar name in another class.
No other form of underscore has meaning in the Python world.
These conventions apply equally to class, variable, and global names.
https://stackoverflow.com/questions/1301346/what-is-the-meaning-of-a-single-and-a-double-underscore-before-an-object-name
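The name-mangling rule above in action (class and attribute names are illustrative):

```python
class Model:
    def __init__(self):
        self._hint = "private by convention only"
        self.__secret = "mangled"  # stored as _Model__secret

m = Model()
```

The double-underscore attribute is still reachable, but only under its mangled name, which is exactly what prevents accidental clashes in subclasses.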
Higher order functions & Decorators
Used @total_ordering, which comes from functools; like other decorators, it is a higher-order function.
@property
def getConfidenceInterval(self):
A decorator passes a function into another function and changes its behavior; the @ symbol applies the decorator.
https://stackoverflow.com/questions/681953/how-to-decorate-a-class
https://docs.python.org/2/library/functools.html
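Two of the pieces mentioned above in one sketch: a hand-written decorator that wraps a function to count its calls, and functools.total_ordering deriving the remaining comparisons from __eq__ and __lt__ (names are illustrative):

```python
from functools import total_ordering, wraps

def counted(func):
    """Decorator: wraps func and counts how many times it is called."""
    @wraps(func)  # preserves func's name and docstring on the wrapper
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@counted
def add(a, b):
    return a + b

@total_ordering
class Interval:
    """total_ordering fills in <=, >, >= from __eq__ and __lt__."""
    def __init__(self, width):
        self.width = width
    def __eq__(self, other):
        return self.width == other.width
    def __lt__(self, other):
        return self.width < other.width
```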
Unit Testing & Integration Testing in Python 101
Two folders typically exist: one for the source code and one for the tests.
Python Unit Testing
UNIT TESTING is a level of software testing where individual units/components of a software are tested. The purpose is to validate that each unit of the software performs as designed. A unit is the smallest testable part of any software. It usually has one or a few inputs and usually a single output.
There are certification exams about automated testing etc.
https://stackoverflow.com/questions/15351546/how-to-import-a-class-from-unittest-in-python
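A minimal unittest sketch: ci_width is a made-up function under test, and the loader/runner pair shows how to run the suite programmatically instead of via unittest.main():

```python
import unittest

def ci_width(std_err, z=1.96):
    # Made-up function under test: width of a z-based confidence interval
    return 2 * z * std_err

class TestCiWidth(unittest.TestCase):
    def test_default_z(self):
        self.assertAlmostEqual(ci_width(1.0), 3.92)

    def test_zero_error(self):
        self.assertEqual(ci_width(0.0), 0.0)

suite = unittest.TestLoader().loadTestsFromTestCase(TestCiWidth)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```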
Magic Methods
https://blog.rmotr.com/python-magic-methods-and-getattr-75cf896b3f88?gi=76e08922d80d
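__getattr__ is invoked only when normal attribute lookup fails, which makes it useful for fallbacks and delegation. A small sketch (LazyConfig and its keys are hypothetical):

```python
class LazyConfig:
    """__getattr__ runs only when normal attribute lookup fails."""
    def __init__(self, defaults):
        self._defaults = defaults  # found by normal lookup, no recursion

    def __getattr__(self, name):
        try:
            return self._defaults[name]
        except KeyError:
            raise AttributeError(name)

    def __repr__(self):
        return f"LazyConfig({self._defaults!r})"

cfg = LazyConfig({"driver_memory": "5G"})
```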
Jira / OpenProject (Agile Development) 101
Reporting Agile Project Development & Sprint
It lets you see all the open questions, know about the meetings, and see what people are working on.
Principles of Agile Development and what are sprints
Jenkins 101
SDLC and Jenkins
Why is DevOps becoming important?
https://www.tutorialspoint.com/jenkins/jenkins_overview.htm
Questions
How do the libraries talk to each other?
SAS running on a VM - 101 course?
Is Python just an API for Spark, or does Python run on the container?
How would we create our code to run on a cluster?
set PYTHONPATH=%PYTHONPATH%;
C:\Users\xxx\Downloads\python-3.6.0-embed-amd64\
Linux Commands
Secure copy (scp)
scp -r username@ip:/project/ C:/Users/xxx/Documents