Social networks are among the most powerful data sources, and Twitter stands out among them. So I decided to build a system (as a proof of concept) to analyze tweets related to Big Data.
First of all, I decided that the app would be deployed on OpenShift. OpenShift is a PaaS product from Red Hat, and I like it for its flexibility. Python was the chosen language, Django the chosen framework, and MongoDB the chosen database.
If you want to interact with Twitter, you need to register your app in order to get its Twitter credentials. So I visited the Twitter Developers site and registered my app, talkingbigdata.
The most important parts are:
- API key
- API secret
- Access token
- Access token secret
Once we have registered the app and obtained the credentials, let's code.
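Before touching the API, the four credentials have to reach the app somehow. One option is to read them from environment variables so they never end up in the repository. This is only a minimal sketch: the variable names and the `load_credentials` helper are my own assumptions, not something Twitter or OpenShift prescribes.

```python
import os

# Hypothetical environment variable names; use whatever your deployment defines.
CREDENTIAL_VARS = {
    "api_key": "TWITTER_API_KEY",
    "api_secret": "TWITTER_API_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four Twitter credentials, failing early if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError("Missing Twitter credentials: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

Failing at startup with the list of missing variables is easier to debug than an authentication error later on.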
The first part is connecting to the Twitter Streaming API, and for that I used tweepy.
Tweepy provides a very useful class, StreamListener, which manages Twitter's stream. I created a CustomStreamListener class, which inherits from StreamListener and holds the common logic, and a MongoDBStreamListener class, which inherits from CustomStreamListener and does the MongoDB-specific work. This structure makes it possible to add other databases, or other kinds of outputs, later.
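The two-level hierarchy can be sketched roughly like this. To keep the sketch runnable without tweepy, `StreamListener` here is a stand-in for tweepy's class (in tweepy 3.x, `on_data` receives the raw JSON string and returning `False` closes the stream). The `store` hook and the `collection` argument are assumptions of mine; any object with an `insert_one` method, such as a pymongo collection, would fit.

```python
import json


class StreamListener:
    """Stand-in for tweepy's StreamListener so the sketch runs without tweepy."""

    def on_data(self, raw_data):
        raise NotImplementedError


class CustomStreamListener(StreamListener):
    """Common behaviour: decode the raw JSON and delegate storage to subclasses."""

    def on_data(self, raw_data):
        data = json.loads(raw_data)
        # The stream also delivers non-tweet messages (e.g. delete notices);
        # only payloads that carry a "text" field are treated as tweets here.
        if "text" in data:
            self.store(data)
        return True  # anything other than False keeps the stream open

    def on_error(self, status_code):
        # HTTP 420 means the client is being rate limited: return False to disconnect.
        return status_code != 420

    def store(self, data):
        raise NotImplementedError


class MongoDBStreamListener(CustomStreamListener):
    """MongoDB-specific part: write each tweet into a collection."""

    def __init__(self, collection):
        super().__init__()
        self.collection = collection  # e.g. a pymongo Collection (assumption)

    def store(self, data):
        self.collection.insert_one(data)
```

Supporting another backend then only means writing another subclass that overrides `store`.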
When a tweet arrives, MongoDBStreamListener extracts the information from the tweet (it comes in JSON format) and puts it inside a Tweet object:
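A minimal sketch of that extraction step, assuming the standard Twitter v1.1 tweet payload (`id_str`, `text`, `created_at`, `user.screen_name`, `lang`, `entities.hashtags`). Which fields this `Tweet` class keeps, and the `from_json` name, are my own choices for illustration.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Timestamp format of the Twitter v1.1 API, e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


@dataclass
class Tweet:
    tweet_id: str
    text: str
    user: str
    created_at: datetime
    lang: str = ""
    hashtags: List[str] = field(default_factory=list)

    @classmethod
    def from_json(cls, raw_data):
        """Build a Tweet from the raw JSON string delivered by the stream."""
        data = json.loads(raw_data)
        entities = data.get("entities") or {}
        return cls(
            tweet_id=data["id_str"],
            text=data["text"],
            user=data["user"]["screen_name"],
            created_at=datetime.strptime(data["created_at"], TWITTER_DATE_FORMAT),
            lang=data.get("lang") or "",
            # Lower-cased so that #BigData and #bigdata count as the same tag.
            hashtags=[tag["text"].lower() for tag in entities.get("hashtags", [])],
        )
```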
This strategy makes it easier to add more studies in the future.
Processor of Tweets:
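The text only says that the design should make new studies easy to add, so the following is just one plausible shape for such a processor, not the project's actual code. `TweetProcessor`, `HashtagStudy` and the `update` method are hypothetical names, and tweets are plain dicts here to keep the sketch self-contained.

```python
from collections import Counter


class HashtagStudy:
    """Hypothetical study: count how often each hashtag appears."""

    def __init__(self):
        self.counts = Counter()

    def update(self, tweet):
        self.counts.update(tag.lower() for tag in tweet.get("hashtags", []))


class TweetProcessor:
    """Passes every incoming tweet to each registered study."""

    def __init__(self, studies):
        self.studies = list(studies)

    def process(self, tweet):
        for study in self.studies:
            study.update(tweet)
```

Adding a new study then means writing another class with an `update` method and registering it with the processor.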
Model (for MongoDB):
With the whole structure in place, we have to connect to Twitter:
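A rough sketch of the connection, assuming tweepy 3.x (in tweepy 4, StreamListener was merged into `Stream`, so this would look different there). The `start_stream` helper, the layout of the `credentials` dict and the default keyword are my own assumptions.

```python
def start_stream(credentials, listener, keywords=("BigData",)):
    """Open the filtered stream; blocks while tweets are delivered to the listener."""
    # Imported here so the module can be loaded without tweepy installed.
    import tweepy  # assumes tweepy 3.x

    auth = tweepy.OAuthHandler(credentials["api_key"], credentials["api_secret"])
    auth.set_access_token(
        credentials["access_token"], credentials["access_token_secret"]
    )
    stream = tweepy.Stream(auth, listener)
    # "track" filters the public stream by keyword.
    stream.filter(track=list(keywords))
```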
Now, we only need to query the data and show it:
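As an example of such a query, here is a sketch of a "top hashtags" ranking built with MongoDB's aggregation framework. It assumes each stored document keeps Twitter's `entities.hashtags` array and that `collection` exposes an `aggregate` method, as a pymongo collection does. The function names are hypothetical, and the project itself may well query through MongoEngine instead.

```python
def top_hashtags_pipeline(limit=10):
    """Aggregation pipeline that ranks hashtags by number of tweets."""
    return [
        # One document per (tweet, hashtag) pair.
        {"$unwind": "$entities.hashtags"},
        # Group case-insensitively and count occurrences.
        {
            "$group": {
                "_id": {"$toLower": "$entities.hashtags.text"},
                "count": {"$sum": 1},
            }
        },
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


def top_hashtags(collection, limit=10):
    """Return (hashtag, count) pairs, most frequent first."""
    rows = collection.aggregate(top_hashtags_pipeline(limit))
    return [(row["_id"], row["count"]) for row in rows]
```

The resulting list of pairs can be handed straight to a Django view and rendered in a template.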
OpenShift is a very good option when you want to try a PaaS for free. The only problem is that OpenShift stops your server if it does not receive a request for a long time, but I have to remember that my account is free.
Python was a very good choice because it is easy and has many helpful libraries, such as tweepy.
MongoDB was another good choice because it is so simple to use when your data has no relations. Besides, Twitter's data comes in JSON, which maps naturally onto the JSON-like (BSON) documents that MongoDB stores.
Django is a very powerful framework, but it was not key for this project. Moreover, the Django ORM does not support MongoDB, so I missed a lot of things that Django gives you for free; hence, I had to work with MongoEngine, which is a very good library and feels like working with the Django ORM or SQLAlchemy. In hindsight, I would have used Flask for this project.
This project is on GitHub.