While cleaning up my GitHub repositories in May 2020, I found a collection of posts from my first blog site, which I didn't stick with. This article is a reflection on taking a course on big data with Apache Spark.
Ten minutes ago I opened a bottle of wine to celebrate finishing the edX course "BerkeleyX: CS100.1x Introduction to Big Data with Apache Spark".
I wanted to write up my reflections on the course: why I took it, and what I learnt from it.
On the path to improving my data analysis abilities, I undertook the task of learning Python. While my abilities have improved a great deal in the past few months, I was hesitant to enrol in a course that required an 'intermediate' knowledge of the language. In the end, I would say the Python requirements were not too onerous. They were just challenging enough.
``` python
# Assumes `sc` is the SparkContext supplied by the course notebook, and that
# `existingRDD` and `listOfMyNewValues` already hold words.
myValuesRDD = sc.parallelize(listOfMyNewValues)
# Merge RDDs. Can also use join in some instances.
allRDD = existingRDD.union(myValuesRDD)
# Turn each word into a (word, 1) tuple
mappedRDD = allRDD.map(lambda x: (x, 1))
# [('oi', 1), ('tudo', 1), ('sim', 1), ('oi', 1), ('oi', 1), ('tudo', 1), ...]
# The map above has effectively created key/value pairs. This is great for 'reduceByKey'.
reducedRDD = mappedRDD.reduceByKey(lambda a, b: a + b)
# [('oi', 3), ('tudo', 2), ('sim', 1), ...]  (element order may vary)
# Transformations are lazy; actions cause the calculations to happen
reducedRDD.count()
# 3...
reducedRDD.collect()
# [('oi', 3), ('tudo', 2), ('sim', 1), ...]
reducedRDD.take(2)
# [('oi', 3), ('tudo', 2)]
# Sometimes you have to sort in strange ways due to the key/value relationship.
# While there may be better ways (and better examples), using two map commands
# can do the job: swap to (count, word), sort by key, then swap back.
(reducedRDD.map(lambda x: (x[1], x[0]))
           .sortByKey(True)
           .map(lambda x: (x[1], x[0]))
           .collect())
# [('sim', 1), ('tudo', 2), ('oi', 3), ...]
# Filters are also widely used
reducedRDD.filter(lambda x: x[1] > 2).collect()
# [('oi', 3), ...]
```
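For anyone reading without a Spark environment to hand, the same map / reduceByKey / sort / filter steps can be imitated in plain Python. This is only a rough sketch of what the pipeline computes, not how Spark implements it: the `reduce_by_key` helper and the sample word list are my own illustrative assumptions, and nothing here is distributed.

``` python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Hypothetical plain-Python stand-in for RDD.reduceByKey."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Combine all values sharing a key with the supplied function
    return {key: reduce(func, values) for key, values in grouped.items()}


# Assumed sample data, matching the words used in the snippet above
words = ["oi", "tudo", "sim", "oi", "oi", "tudo"]

# Like .map(lambda x: (x, 1))
mapped = [(word, 1) for word in words]

# Like .reduceByKey(lambda a, b: a + b)
counts = reduce_by_key(mapped, lambda a, b: a + b)

# Like the swap / sortByKey / swap trick: order by count, ascending
by_count = sorted(counts.items(), key=lambda kv: kv[1])

# Like .filter(lambda x: x[1] > 2)
frequent = [(word, n) for word, n in counts.items() if n > 2]

print(by_count)
print(frequent)
```

The main difference in Spark is that each of these steps runs across partitions of the data, and nothing is computed until an action such as `collect()` is called.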
I finished the course with an A, which I feel translates to exposure to, and a basic understanding of, the concepts. To really become an expert in this area, I would need to work with distributed data sets professionally alongside other experts. The introduction to regex and groupByKey will be invaluable, and it is something I will build on in the next year as I get more exposure to SQL and NLP.
Top 10 movies I would apparently watch!
A paper on collaborative filtering using a regression-based approach: http://www.dabi.temple.edu/~zoran/papers/vucetic_kais05.pdf
So for now - vagrant halt