Python n’ stuff

Making a simple link shortener with AWS and MySQL

2019-02-06T00:00:00+00:00

Link shorteners are handy and pretty simple to implement. There are some free services that you can use for this like bitly and google, but in some cases it might be preferable to have the link shortener be under your own domain. Or maybe you want to implement some additional analytics or features that the other services don’t have. Let’s get started!

Usage

The goal of this project is to create a service that turns any link into a short link under your domain. For example:

POST api.jasonrhaas.com

{
  "url":  "reddit.com"
}

The response should be:

{
  "tiny_url": "http://jasonrhaas.com/6vyv6"
}

When you go to that URL it should point you to reddit.com.

Architecture

When building something, I’m a big believer in using the right tool for the job. In other words, I’m not going to use a batteries included web framework like Django when all I need is a simple micro service. The tools I’m using for this job are:

Flask
MySQL
Zappa (AWS Lambda + API Gateway)

Zappa is a nice tool that makes it easy to deploy an event-driven API. I find the API Gateway user interface a bit cumbersome, so it’s nice to have a framework that allows me to do (almost) everything from my code editor and command line.

Implementation

To build this, I used a simple Flask application containing a single Url SQLAlchemy model and basically two functions: a get_tiny_url function and a get_long_url function.

Model

The model looks like this:

If you aren’t familiar with model classes, I recommend checking out the Flask SQLAlchemy Quickstart for a crash course. If you want to go a bit deeper into what database model classes can provide, check out the Django Tutorial on models. The SQLAlchemy documentation is comprehensive, but it’s very dense and technical and in my opinion is not a good introduction to models.

In the Url model above, here is what the different columns are:

id: just a database id. In some cases (like for Django), this line isn’t even necessary and is provided automatically.
hash: this is what the long url gets turned into after it is hashed.
long: this is the original url.
hits: a simple numeric field that keeps track of how many times the link has been accessed.

We could expand this to provide even more information and analytics, like user IP address, referring URL, timestamp of when it was accessed, etc. But for this case I just needed something simple. The good part is this is easy to expand upon later.

Get Tiny URL

The function to create the short link I decided to call get_tiny_url. The function looks like this:

To explain this, I’m going to pick out a few lines.

Line 3

First we check to make sure that it is a POST request and a valid JSON object. If it’s not, we simply redirect to the base url. This is kind of a “catch all” approach and works fine for our use case, but it could be improved to have more specific error catching and an appropriate error message for the user.

Line 9

It’s never a good idea to store potentially sensitive information in a database in clear text. As secure as your system is, there is always a chance that it may be hacked and the data may be stolen. As a good rule of thumb, passwords should always be hashed, for example.

In this case, I don’t have passwords, I’m just using the hash to identify a unique URL. The area of password hashing is a complex and convoluted one. During my research of implementing a JWT API Gateway Authorization solution, I came across BLAKE2, a fast, modern cryptographic hash function. That speed makes it a good fit for fingerprinting URLs; for storing actual passwords, a deliberately slow algorithm such as bcrypt, scrypt, or Argon2 is the usual recommendation. In Python 3.6, BLAKE2 was added to the Python standard library as part of hashlib.

When this line gets run, you end up with a hash like 53761004cf82ca63a62c430e8a409a6703d63f45. This hash is deterministic, but it’s “one-way”, meaning that it’s (almost) impossible to derive the long_url from this hash code on its own.
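As a rough sketch of that hashing step, here is how a URL could be fingerprinted with the standard library. The function name url_hash and the 20-byte digest size are my assumptions (20 bytes gives 40 hex characters, the same length as the example hash above), not necessarily what the original code used.

```python
import hashlib


def url_hash(long_url: str) -> str:
    """Return a deterministic 40-character hex fingerprint of a URL."""
    # digest_size=20 is an assumption chosen to produce 40 hex characters.
    digest = hashlib.blake2b(long_url.encode("utf-8"), digest_size=20)
    return digest.hexdigest()


if __name__ == "__main__":
    print(url_hash("reddit.com"))
```

The same input always produces the same digest, which is what makes it usable as a "have we seen this URL before?" key.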

Line 12

This line is to try to find the url by the hash. This hash is unique per url, so there should never be any duplicates in the database. If the url does not exist, it will add it to the database.

Line 20

Finally, we make use of the short_url Python library. This library uses a bit-shuffling approach to deterministically generate URLs from a number. In essence, this number corresponds to the database id. For our use case, the number will point to an id in the database, which contains the long_url.
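To make the id-to-short-code idea concrete, here is a minimal stand-in that maps a database id to a short string and back. To be clear, this is not the short_url library’s bit-shuffling algorithm; it is a plain base-36 encoding, and the names ALPHABET, encode_id, and decode_id are made up for illustration.

```python
import string

# 36 symbols: digits followed by lowercase letters (an assumed alphabet).
ALPHABET = string.digits + string.ascii_lowercase


def encode_id(url_id: int) -> str:
    """Turn a non-negative database id into a short base-36 string."""
    if url_id < 0:
        raise ValueError("url_id must be non-negative")
    if url_id == 0:
        return ALPHABET[0]
    chars = []
    while url_id:
        url_id, remainder = divmod(url_id, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def decode_id(code: str) -> int:
    """Reverse encode_id: turn the short string back into the database id."""
    url_id = 0
    for ch in code:
        url_id = url_id * len(ALPHABET) + ALPHABET.index(ch)
    return url_id
```

A plain encoding like this makes ids easy to guess in sequence, which is one reason a library that shuffles the bits first is attractive.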

Get Long URL

On to the reverse function, get_long_url which looks up the original URL given the short link.

Picking out a few lines of interest:

Line 5

This line takes the /tiny_url part of the link and translates it to the url_id, which matches up with the database id.

Line 13

This is a simple counter keeping track of how many times the link gets accessed. This information is then updated in the database.
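Here is a minimal sketch of that lookup-and-count step. It uses Python’s built-in sqlite3 in place of MySQL and SQLAlchemy so that it runs anywhere; the table layout follows the Url fields described earlier, and the names make_db and resolve are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table mirroring the Url fields (id, hash, long, hits)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls ("
        " id INTEGER PRIMARY KEY,"
        " hash TEXT UNIQUE NOT NULL,"
        " long TEXT NOT NULL,"
        " hits INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def resolve(conn: sqlite3.Connection, url_id: int):
    """Bump the hit counter for url_id and return the long URL, or None if unknown."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE urls SET hits = hits + 1 WHERE id = ?", (url_id,)
        )
        if cur.rowcount == 0:
            return None
        row = conn.execute(
            "SELECT long FROM urls WHERE id = ?", (url_id,)
        ).fetchone()
    return row[0]
```

Doing the increment in SQL (hits = hits + 1) rather than reading the value into Python first avoids losing counts when two requests hit the same link at once.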

Conclusion

As you can see, coding up a link shortener is pretty straightforward! To test this locally, all you need is a local MySQL database and Python 3. In the next blog post, I’ll talk in depth about how to set up your local environment and also do automatic deployments using Continuous Integration and AWS.

Technology - The Final Frontier

2017-11-17T00:00:00+00:00

Photo by SpaceX on Unsplash

Technology (and especially software) is constantly changing. Just because one database, framework or programming language is popular now doesn’t mean it will be in 5 years, or even 6 months. A career in technology is a career of constant learning.

So how do you keep up with the latest? Here are a few pointers from my own experience:

Follow Hacker News (news.ycombinator.com)

I’m not saying you need to be on top of this constantly. But this is the pulse of the tech/startup world. Many experienced developers, tech CEOs, and entrepreneurs post and comment on here.

It’s important to see what stories are getting lots of comments, and get a feel for the general sentiment of the community. Also, new frameworks or technologies are often posted here and talked about.

Follow your niche field

For me this is Python. I’m also interested in entrepreneurship. So I subscribe to these two newsletters and at least browse through some of the content every week:

www.pythonweekly.com
www.foundersweekly.com

I know what you’re thinking: Email is the worst. And you would be correct in thinking that. These newsletters are among the few email subscriptions that are actually meaningful and helpful. I recommend you find a similar one in your niche field.

Go to (some) conferences

Technology conferences are a dime a dozen these days. What technology are you into? Kubernetes perhaps? Mesos? Or maybe just all things Big Data? Or maybe you are really into Blockchain. Regardless of what it is, there seems to be a conference for everything.

Honestly I usually don’t learn a heck of a lot from the conferences, but they are a good way to see what other people are thinking and where the industry is going. It’s more of a “meet & greet” and making sure I’m ahead of the curve rather than anything else.

Unfortunately, a lot of them charge multiple hundreds of dollars to attend. In my opinion, most of them aren’t worth paying money for (with the exception of PyCon, of course).

However, some conferences such as DeveloperWeek offer free tickets if you are a developer. The idea is that the companies and recruiters pay the big fees for the chance at recruiting talent (you!). Look for these “developer deals”, and take advantage of them.

Also — if you work for a company that is hip enough to pay for your nerd conference, definitely take advantage of that as well.

Speak at Tech Meetups

This is a good one. It will give you a chance to give back to the community and practice public speaking. That is key if you plan on running your own business someday or even just being in a leadership position.

Some Meetups, like the Austin Python Meetup, have the regular talk, and then “Lightning Talks” afterwards. A Lightning Talk is just a short talk, typically 5–10 minutes in length. Lightning Talks are a great way to get comfortable speaking in front of a crowd.

Use Open Source, and Contribute to it

Many people and organizations are relying on Open Source today, yet less than 1% of users actually contribute back to Open Source.

I totally made that quote up, but it’s probably pretty close. I know, contributing to Flask can be intimidating. The regular contributors to popular Open Source projects know their stuff. In fact, if companies see you are a contributor to an Open Source project, they may let you skip the demeaning “Live Code Interview” process altogether.

You don’t need to start with a big project; start small and work from there. Also, many projects have tags that specifically call out good tasks for people new to the project. Start with those.

Don’t fall for all the new shiny things

This world moves pretty fast. If you don’t stop and look around every once in a while, you might miss it — Ferris Bueller

Ok so there is always some new shiny thing that promises to be THE BEST DATABASE EVER. You should definitely play around with all the new fancy things, but be wary about using them in Production. Learn about it and figure out what it can do that your existing stack or programming language can’t.

You know what powers most of the forward-thinking technology companies today? Unix. Know when Unix was invented? The 70’s. The same staying power applies to MySQL, Regular Old Bash Scripts, and even (gasp) C++. Yeah, C++ might not be “cool”, but there is a reason why it still plays a big role in many of the tried and true tools that are taken for granted today.

4 Keys to Fostering a Successful (Remote) Work Culture

2017-07-18T00:00:00+00:00

Teams working remotely are more common than ever. The combination of flexibility, high speed Internet, and the right tools makes working remotely a viable option for many teams. It has given rise to a whole generation of people who not only can work remotely, they expect it.

What do I know about working remotely? I’ve worked successfully in a remote capacity for almost 3 years. In that time, I traveled around the world with a program called Remote Year while working remotely. I am passionate about fostering a culture that allows remote teams to not only work together, but thrive.

Here are the 4 traits that are absolutely essential to fostering a healthy (remote) work culture.

1. Open communication

By far the most important part of a successful remote work culture is open communication. What do I mean by this? Think about how a traditional office operates. There are conversations that happen in the hallway, at the water cooler, in meetings, in the cafeteria. Sometimes the conversation is casual, but other times it’s important work stuff that should be shared with your co-workers.

I can’t tell you how many times in the past I’ve made some big technical decisions by bumping into a senior engineer in the hallway and asking his advice. Sometimes bouncing ideas off other people helps you solve a problem.

In fact, this is the #1 reason companies like to co-locate and put everyone in a single space. This is good for fostering ideas and innovation, but very bad for solving hard problems that require Deep Work.

Making important decisions without review among your peers is often a bad idea. Ideally the whole core engineering staff would also be involved at some level to validate ideas. In a traditional office, this usually means meetings. However, it’s pretty well understood that meetings are a colossal waste of time and resources.

The solution in a remote work environment is open communication using asynchronous tools. For software development, this often includes tools such as:

Slack (asynchronous chat)
Github (asynchronous software review)
JIRA (asynchronous planning)
Google Hangouts (synchronous meetings)

You can swap any of these out for your tool set of choice. By far the most important part of this whole section is this:

Use public forums for most conversations

I’ll stress this again: have conversations in a public place! Avoid direct messages like the plague. DMs should not be used unless absolutely necessary, when the information needs to stay private with that individual.

When a new employee comes on board, it can be tempting to use DM’s because he or she might be intimidated by asking “stupid questions” in a public forum. But, it needs to be stressed to them that having discussions in public is essential.

There are several advantages of using public channels vs. private channels or direct messages:

Avoiding redundancy

There is nothing more annoying or inefficient than repeating the same question or relaying the exact information to 3, 4, 5, or 50 people over and over again. Why not just say it once in a public channel?

Open accountability

Is the boss wondering what you are working on or why a project is taking so long? Instead of harassing you at regular intervals, all they have to do is catch up on the public channels. At that point they can see what is going on and decide if they want to step in to help.

I once worked with a senior engineer who refused to share any of his knowledge with anyone. Something broken? He’ll fix it for you, but won’t tell you how he did it. This is a horrible engineering culture, and it’s actually really bad for the company. What if he leaves the company? It puts the company in a bad position.

People like to learn new things. Found out a better way to deploy your code? Share it with the team.

2. Accountability

Something that holds bigger or more traditional companies back from allowing their employees to work part or full time remote is accountability. The belief goes that unless employees are being “watched” in the office, they will slack off.

The truth is, if you have hired the right people, the opposite will be true. In fact, there are a lot of studies that show that people are actually more productive when they are free from the distractions of the office.

So how do you stay accountable as an employee? Here are some ideas to build an accountable culture.

Daily standups. A 15 minute synchronous meeting serves as face time for the team to talk about what they are working on today and if there are any blockers. You can also have a #daily-standup channel to provide more detail on Slack.
Daily announcements. This channel is to let people know when you aren’t working. Have to run some errands or want to do some laundry? No problem, let the team know in this channel, and when you are back working let them know again. No need to bother people, but if they want to know where you are, it should be in this channel.
Task tracking. I won’t get into the whole Agile methodology here, but I want to emphasize that your team should break tasks out into small, manageable chunks. This should improve velocity and allow for quick wins. Also, surveys show that engineers are happier when they are deploying code often. As a general rule of thumb, try to make tasks that take no longer than 3 days to complete.
Schedule push. If something is going to take longer than expected, report this early on. Make it known in a public slack channel so there are no surprises and get someone to help.

3. Batching Tasks

Schedule meetings in chunks to allow for proper Deep Work time. Ever wonder why Hackathons exist? It’s because of the simple fact that people are more productive with uninterrupted time to focus on their work.

There is a great article I read a while back that talks about the two types of people in a company: the managers and the makers. The Managers have their whole day booked up, meeting after meeting, jumping from one thing to the next. Their work often requires scheduling, high level oversight, meeting with clients or other managers. If the work they are doing doesn’t require Deep Work, this is a fine thing.

The Makers on the other hand require time for Deep Work to truly be productive. Jumping from one task to the next is counterproductive as an engineer. This is part of the reason why many software developers report doing their best work late at night. The Managers should keep this in mind and try to batch weekly meetings all on one day, and keep other meetings during the week to a specific chunk of time, such as 9am to 12pm, and leave the afternoon for engineering work.

In my last job, I often railed against the constant stream of meetings. I like to use the analogy of comparing software development to driving a manual transmission car.

My best work is done when I’m in 6th gear. But to get to 6th gear I have to go through all of the gears: 1st, 2nd, 3rd, 4th, 5th, and then 6th. When I’m in 6th gear I am “in the zone” or “in flow”. If I get interrupted while I’m in any of the gears and have to switch tasks, I have to restart in 1st gear and work my way up again, similar to having to stop a car at a stop sign.

4. Innovative and Open Culture

The best ideas can come from anywhere. They can come from the CEO, or from the Jr. Engineer who was just hired. Different people in the company have different perspectives on how things work, and may have an idea to improve or help grow the company. It’s important to encourage an environment of innovation, and one that allows everyone to put their ideas in a public space, without fear. Even if the idea will never be implemented, it’s important to hear ideas from all sides.

Some of my best work has come from this kind of innovative culture. Most companies have some sort of road map that is broken up into tasks, with some schedule on how to get there. The thing is, once engineering starts gaining a deeper understanding of how to solve the problem, there may be alternative approaches that drastically improve the product. This applies particularly in the technology industry which is changing constantly.

I remember when I first discovered Kibana for Elasticsearch. At the time the company had a lot of data that was sitting around in raw files, S3, or a SQL database. While doing some data analysis for the CEO, I stumbled upon the out-of-the-box visualization capabilities of Kibana. I read up on it, started indexing the data to Elasticsearch, and created some basic visualizations with Kibana.

When I first told my boss about this, he was frustrated that I wasn’t working on my assigned tasks and that I was “wasting” my time playing with this new technology. However, as soon as I demonstrated this to my boss and relevant stakeholders they were blown away with the capabilities. This side project alone opened up an entire new capability to the company and our customers, and helped to propel the company’s technology into the future.

Call to Action

If you work in a remote culture, or even an office culture, pick one of these traits and see how your company culture stacks up. If the company is lacking in one of these areas, lead by example. Start sharing things in public, share your ideas, and encourage others to do the same. Another idea is to get the buy in of the leadership team, point them to this article and convince them to give it a try for a quarter.

Adding a simple API to your Postgres database

2017-07-17T00:00:00+00:00

When designing systems or platforms, it is very common to use a relational database such as MySQL or Postgres as backend data storage. In order to access this data from a remote endpoint, it’s very handy to have an API that can serve out proper JSON data.

In this post I’m going to discuss one way to approach this problem. I’m a huge fan of simple, elegant approaches, and I think this fits the bill nicely.

I will be using some code that I wrote for the CodeForDC housing insights project.

The state of things

I volunteer some of my time to the Housing-Insights project to help out with the backend and API design and implementation. The current backend design at a high level consists of:

Download open data
Parse and clean data
Add tables to Postgres database
Access data via custom Flask API endpoints

Flask is a great Python framework for making dead simple APIs. It is my go-to if I need a lightweight application to serve up some data. The syntax is as simple as this:

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

That is all the code you need to get a simple endpoint up and running. If you want to create some API endpoints on your database, a simple approach is to use the Postgres psycopg2 module for Python and run SQL queries as needed, then return the results.

And indeed, this works out pretty well. To get raw data from tables, you can do something like this:

@application.route('/api/raw/<table>', methods=['GET'])
@cross_origin()
def list_all(table):
    """ Generate endpoint to list all data in the tables. """

    application.logger.debug('Table selected: {}'.format(table))
    if table not in tables:
        application.logger.error('Error:  Table does not exist.')
        abort(404)

    conn = engine.connect()
    q = 'SELECT row_to_json({}) from {} limit 1000;'.format(table, table)
    proxy = conn.execute(q)
    results = [x[0] for x in proxy.fetchmany(1000)] # Only fetching 1000 for now, need to implement scrolling
    conn.close()

    return jsonify(items=results)

Using the simple SQL statement

'SELECT row_to_json({}) from {} limit 1000;'.format(table, table)

it will return up to 1000 rows from whatever table you select in the table variable. This value comes from the Flask route '/api/raw/<table>', methods=['GET'].
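The code comment above mentions that scrolling still needs to be implemented. One common way to do that is LIMIT/OFFSET paging, sketched below. This is my own illustration, not code from the Housing Insights project: it uses sqlite3 so it is runnable on its own, the name fetch_page is made up, and ordering by rowid is SQLite-specific (on Postgres you would order by a real key column).

```python
import sqlite3

PAGE_SIZE = 1000  # matches the 1000-row limit used above


def fetch_page(conn, table, allowed_tables, page=1, page_size=PAGE_SIZE):
    """Return one page of rows from a whitelisted table."""
    # Table names cannot be bound as query parameters, so check them
    # against a known list before formatting them into the SQL string.
    if table not in allowed_tables:
        raise KeyError(table)
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * page_size
    # A stable ORDER BY keeps pages from overlapping between requests.
    query = 'SELECT * FROM "{}" ORDER BY rowid LIMIT ? OFFSET ?'.format(table)
    return conn.execute(query, (page_size, offset)).fetchall()
```

A route could then accept something like a ?page=2 query argument and pass it straight through as the page parameter.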

But, as Raymond Hettinger likes to say…

there must be a better way

And there is.

Flask Restless

Flask-Restless is a plugin for Flask that takes advantage of SQLAlchemy’s Object Relational Mapper (ORM) to generate quick and easy endpoints. If you have defined your database schema using SQLA (recommended), there is quite a bit of functionality out of the box. Some examples:

Endpoints can be generated from any model
Auto pagination
JSON based search
Pre-processors and post-processors

So that’s a great way to start off accessing the database, and the pagination feature makes sure you don’t end up pulling too much data at once.

But, here is the problem: the database schema was not defined in SQLA. Sigh. But, there is a solution to that as well.

SQL Alchemy Automap

SQLA includes a feature called automap that is able to “reflect” information about your database tables and automatically generate the models. Using this approach, you can now take advantage of the features that SQLA and Flask Restless have to offer.

The code is pretty simple:

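from flask import Flask
from flask_restless import APIManager
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy import MetaData
from sqlalchemy.ext.automap import automap_base

# connect_str is the Postgres connection URI, defined elsewhere
application = Flask(__name__)
application.config['SQLALCHEMY_DATABASE_URI'] = connect_str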
application.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False

db = SQLAlchemy(application)
Base = automap_base()

metadata = MetaData(bind=db)

Base.prepare(db.engine, reflect=True)

db.session.commit()

BuildingPermits = Base.classes.building_permits
Census = Base.classes.census
CensusMarginOfError = Base.classes.census_margin_of_error
Crime = Base.classes.crime
DcTax = Base.classes.dc_tax
Project = Base.classes.project
ReacScore = Base.classes.reac_score
RealProperty = Base.classes.real_property
Subsidy = Base.classes.subsidy
Topa = Base.classes.topa
WmataDist = Base.classes.wmata_dist
WmataInfo = Base.classes.wmata_info

models = [BuildingPermits, Census, CensusMarginOfError, Crime, DcTax, Project, ReacScore,
          RealProperty, Subsidy, Topa, WmataDist, WmataInfo
          ]

db.init_app(application)

manager = APIManager(application, flask_sqlalchemy_db=db)

for model in models:
    # https://github.com/jfinkels/flask-restless/pull/436
    model.__tablename__ = model.__table__.name
    manager.create_api(model, methods=['GET'])


@application.route('/')
def hello():
    return("The Housing Insights API Rules!")


if __name__ == '__main__':
    application.run(host='0.0.0.0', port=5000)

That chunk of code above is all the code you need to reflect your current tables into the SQLA model and serve up the API using Flask Restless. Pretty badass if you ask me. I’ll walk through it a little bit to describe what is going on.

db = SQLAlchemy(application)

This is the Flask-SQLAlchemy wrapper around the basic SQLA engine creator. You pass it your Flask application object.

Base.prepare(db.engine, reflect=True)

Here is where you tell SQLA automap which database to use, and that you want it to reflect your current database tables.

BuildingPermits = Base.classes.building_permits

Here is where you pull the auto-generated model out of Base.classes and assign it a model name.

manager = APIManager(application, flask_sqlalchemy_db=db)

This creates the Flask Restless APIManager.

for model in models:
    # https://github.com/jfinkels/flask-restless/pull/436
    model.__tablename__ = model.__table__.name
    manager.create_api(model, methods=['GET'])

For this chunk of code, we are creating a basic GET endpoint for each of the models defined above. The model.__tablename__ assignment on the line before create_api is a workaround for an issue (see the linked pull request) that is expected to be fixed in version 1.0 of Flask Restless.

Conclusion

If you need a quick and dirty API on top of your SQL database, look no further than Flask and Flask Restless. It’s a great way to get started. You may still have to define some custom endpoints, since Flask Restless doesn’t do everything, but using the ORM approach to create your endpoints probably leads to more maintainable and more clearly written code.

My question to the devs out there is this: How does this approach compare to the Django ORM? I know Django has a lot of capability right out of the box, and has its own built in ORM, and Rest API plugin. I’m curious to compare it to the Flask/SQLA approach in terms of ease of use, flexibility, and overall capability.

Automate all the things

2016-01-24T00:00:00+00:00

Building a computing infrastructure for your applications and big data stack is time consuming. Not only is it time consuming, but it’s very hard to plan for. Your needs today will likely not be your needs a year from now. This is especially the case if you are a growing technology company staying on the edge of the latest developments in the big data world. We all try to plan and think ahead for future needs, but this is often less than perfect.

In the past, system administrators and engineers typically built up their servers using a combination of techniques. Quite often this would involve customizing a particular server or image and then “cloning” it over to other servers. But this only works if the software on each needs to be the same. So inevitably there ends up being some kind of custom bash script or post install script to customize the build on a server by server basis. I’ve seen some pretty fancy bash and perl scripts used, and while very powerful they become a nightmare to maintain.

Ansible

Server provisioning software attempts to solve the code maintainability problem by introducing a framework and standards to manage your infrastructure. Some popular frameworks include Chef, Puppet, and Ansible. They all are a great way to manage your infrastructure, but Ansible stands out because it is agent-less and only requires ssh to provision your server. Also – it is written in Python, which I also like due to Python’s readability and hackability.

Dynamic inventories and group_vars

Ansible uses an inventory file to figure out where your servers are and what they are called. It also has server “groups”, so you can logically group your servers together. For a big data stack this might be zookeeper-nodes, kafka-nodes, spark-worker-nodes, etc. These groups are very powerful because they allow you to scale up or down your infrastructure simply by editing the inventory file. Want to add more resources to your Spark cluster? Just add it to the inventory and re-run the Ansible playbook.

In an Ansible playbook, the spark-worker-nodes group can be accessed with {{ groups['spark-worker-nodes'] }}. You can also access individual elements of the list by adding an index, like {{ groups['spark-worker-nodes'][0] }}.
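For the dynamic side of inventories, Ansible can also run an executable script that prints the inventory as JSON when called with --list. Here is a minimal sketch of such a script. The host names are placeholders I made up; a real script would typically look them up from a cloud API or some other source of truth.

```python
import json
import sys

# Hypothetical hosts, hard-coded purely for illustration.
GROUPS = {
    "zookeeper-nodes": ["zk1.example.com", "zk2.example.com", "zk3.example.com"],
    "kafka-nodes": ["kafka1.example.com", "kafka2.example.com"],
    "spark-worker-nodes": ["spark1.example.com", "spark2.example.com"],
}


def build_inventory(groups):
    """Build the JSON structure Ansible expects from `--list`."""
    inventory = {name: {"hosts": list(hosts)} for name, hosts in groups.items()}
    # Supplying _meta.hostvars up front means Ansible does not need to
    # call the script again with --host for every single host.
    inventory["_meta"] = {"hostvars": {}}
    return inventory


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--host":
        print(json.dumps({}))  # no per-host variables in this sketch
    else:
        print(json.dumps(build_inventory(GROUPS), indent=2))
```

With this in place, groups['spark-worker-nodes'][0] in a playbook would resolve to the first host in that list.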

Roles

Ansible roles are standalone tasks that are meant to be performed for a single piece of infrastructure. Role names typically match up to server groups, but they don’t have to. All of these roles should be able to run independently. This concept is very powerful because you now can design your Ansible playbooks to accommodate almost any number of servers and configurations. A good practice to follow is to have the following folders under each role:

defaults
handlers
meta
tasks
templates
vars

If you don’t have a need for one of these folders, you don’t necessarily need to create it (git won’t even track it if there aren’t any files in it). Underneath each folder you should have a main.yml file. Why call it that? Because Ansible looks for it automatically. You don’t have to put all your code in main.yml. If you wish to break it up into logical parts (such as Debian and RedHat plays), you can pull those files in with include: statements inside your main.yml file.

Defaults

Defaults are the default variable settings for a specific role. These settings have the lowest priority of all variables. They are, well, defaults, and can be overridden by re-defining the variable literally any other place in the Ansible code, and also on the command line using the --extra-vars flag. The most common way to override these defaults is with group_vars, which I will discuss later on. Some example settings that might be in a defaults/main.yml file:

zookeeper_version: 3.4.6
zookeeper_client_port: 2181
zookeeper_install_dir: /opt/zookeeper
zookeeper_base_dir: "{{ zookeeper_install_dir }}/default"
zookeeper_conf_dir: "{{ zookeeper_base_dir }}/conf"
zookeeper_data_dir: "{{ zookeeper_base_dir }}/data"
zookeeper_log_dir: "{{ zookeeper_base_dir }}/logs"

Things like version numbers, port numbers, install directories are nice to put in the defaults section.
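One way to picture how defaults get overridden is as a stack of dictionaries searched from highest priority to lowest. The sketch below models just three of Ansible’s precedence levels (Ansible itself has many more), and the group_vars and extra_vars values are invented for the example.

```python
from collections import ChainMap

# Lowest priority: the role's defaults/main.yml
role_defaults = {"zookeeper_version": "3.4.6", "zookeeper_client_port": 2181}

# Hypothetical group_vars override for one server group
group_vars = {"zookeeper_client_port": 2182}

# Hypothetical --extra-vars passed on the command line (highest priority)
extra_vars = {"zookeeper_version": "3.4.8"}

# ChainMap searches left to right, so the highest-priority source goes first.
resolved = ChainMap(extra_vars, group_vars, role_defaults)

if __name__ == "__main__":
    print(dict(resolved))
```

Anything not overridden higher up simply falls through to the role default.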

Handlers

Handlers are handy for doing things like restarting a process when a file changes. Just define your handlers in handlers/main.yml and then trigger them from the tasks in your role’s tasks folder. For example, here is a simple handler to restart zookeeper (running under supervisord):

- name: restart zookeeper
  supervisorctl:
    name=zookeeper
    state=restarted

In your tasks playbooks, this handler can be used by adding a notify: restart zookeeper in one of the plays. For example,

- name: setup zoo.cfg
  template: 
    dest={{ zookeeper_conf_dir }}/zoo.cfg
    src=zoo.cfg.j2
  notify:
    - restart zookeeper
  tags: zookeeper

Tasks

The tasks folder is where the actual procedure to install your software lives. Your tasks/main.yml file is where you can utilize any of the Ansible modules and take advantage of all your variables, whether those are defined in defaults, group_vars, the inventory, or the command line. Here is a partial snippet of a zookeeper tasks/main.yml file:

- name: create zookeeper install directory
  file:
    path={{ item }}
    state=directory
    mode=0744
  with_items:
    - "{{ zookeeper_install_dir }}"
  tags: zookeeper

- name: check for existing install
  stat: path={{ zookeeper_install_dir }}/zookeeper-{{ zookeeper_version }}
  register: zookeeper
  tags: zookeeper

- name: download zookeeper
  get_url:
    url="{{ repository_infrastructure }}/zookeeper-{{ zookeeper_version }}.tar.gz"
    dest=/tmp/zookeeper-{{ zookeeper_version }}.tgz
    mode=0644
    validate_certs=no
  when: zookeeper.stat.isdir is not defined
  tags: zookeeper

- name: extract zookeeper
  unarchive:
    src=/tmp/zookeeper-{{ zookeeper_version }}.tgz
    dest={{ zookeeper_install_dir }}
    copy=no
  when: zookeeper.stat.isdir is not defined
  tags: zookeeper

Anything surrounded by {{ }} is an Ansible variable. That variable can be defined in a number of ways. The first one that appears is {{ item }}. This is the loop variable that Ansible sets for each entry when a task uses with_items, which is how you write “for” loops. In the “create zookeeper install directory” task above, it’s really not necessary since there is only one folder created. However, if I wanted to add more I could just tack on more items in the with_items YAML list like this:

with_items:
  - "{{ zookeeper_install_dir }}"
  - "{{ some_other_dir }}"

The other Ansible trick used in the tasks above is the when: conditional, which runs a task only if its condition is met. In the case above, the “download zookeeper” task only runs when: zookeeper.stat.isdir is not defined. The zookeeper variable is registered by the previous task, which checks whether the install directory already exists. Some other common ways to use the when: clause are:

Running on different OS’s (Debian vs. Redhat)
Only run when a variable is true
Only run when a variable is defined

Example of running specific Debian or Redhat plays:

- include: setup-RedHat.yml
  when: ansible_os_family == 'RedHat'

- include: setup-Debian.yml
  when: ansible_os_family == 'Debian'

In this case, there are separate playbooks for Debian and Redhat, and each one is only run on the appropriate OS. The same thing can be used for OS specific variables:

- name: Include OS-specific variables.
  include_vars: "{{ ansible_os_family }}.yml"

Templates

Templates are files that typically end in a .j2 extension and are used when you have a file that may need to change based on some variables you have defined in the Ansible code base. Templates are very handy to manage configuration settings for Linux software since almost all tools that run on Linux have some sort of configuration text file that can be customized. Here is a snippet from a Kafka server.properties.j2 template file:

{% for host in kafka_host_list %}
{%- if host == inventory_hostname -%}broker.id={{ loop.index }}{%- endif -%}
{% endfor %}

message.max.bytes={{ kafka_message_max }}
replica.fetch.max.bytes={{ kafka_replica_fetch_max_bytes }}
port={{ kafka_port }}
host.name={{ inventory_hostname }}
advertised.host.name={{ inventory_hostname }}
advertised.port={{ kafka_port }}

Notice the {{ }} variables that are used in the template file. There are a few “special” variables in here that deserve special attention. inventory_hostname is a reserved Ansible variable that maps to the hostname defined in the Ansible inventory file. It will match whatever host Ansible is currently running on.

The first chunk of code above is a fancy for loop that iterates through all elements of the kafka_host_list variable and sets the broker.id Kafka setting to the index of the current host in that list. Also note that the kafka_host_list variable has to be defined somewhere. In this code it is defined at the playbook level as:

kafka_host_list: "{{ groups['kafka-nodes'] }}"

The groups['kafka-nodes'] list is another special Ansible variable that is used to grab all of the hosts in the kafka-nodes group inside the inventory file. So your inventory for Kafka might look like this:

[kafka-nodes]
prod-as-01
prod-as-02
prod-as-03

In this case groups['kafka-nodes'] would contain all of those hostnames. You can access each one individually by using an index number, like this: groups['kafka-nodes'][0].

Back to the for loop above, that code would set the prod-as-01 host to broker.id=1, prod-as-02 host to broker.id=2, and prod-as-03 to broker.id=3.
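To make the loop.index behaviour concrete, here is a minimal Python sketch of what the template loop computes. The helper name broker_id_for is hypothetical and only there for illustration; the one fact it leans on is that Jinja2's loop.index counts from 1. Raising an error for an unknown host is my own choice, since the template would simply render nothing in that case.

```python
# Hypothetical helper mirroring the Jinja2 loop in server.properties.j2.
def broker_id_for(inventory_hostname, kafka_host_list):
    """Return the 1-based position of this host in the Kafka host list."""
    # enumerate(..., start=1) plays the role of Jinja2's loop.index
    for index, host in enumerate(kafka_host_list, start=1):
        if host == inventory_hostname:
            return index
    # Assumption: fail loudly here; the template itself would emit nothing.
    raise ValueError(f"{inventory_hostname} is not in kafka_host_list")


kafka_host_list = ["prod-as-01", "prod-as-02", "prod-as-03"]
for host in kafka_host_list:
    print(f"{host}: broker.id={broker_id_for(host, kafka_host_list)}")
```

One consequence worth keeping in mind: the ids depend on the order of hosts in the inventory group, so reordering the group would reshuffle the broker ids.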

The rest of the Kafka template code above simply uses Ansible variables, defined mostly in the defaults/main.yml file, to populate the fields.

Vars

Variables are used everywhere in Ansible. For me, they were actually the most confusing part of using Ansible at first. Here are (most of) the places variables can be set:

role defaults
role vars
playbook role vars
inventory vars
host_vars
group_vars
command line vars

The Ansible documentation has some good examples of how, when, and where to use variables, but I still think it is a bit confusing for someone new to Ansible.

General guidelines for using variables:

Role defaults are lowest precedence
Role defaults are “meant” to be overridden
group_vars for site specific variables, API keys, accounts
host_vars for host specific variables
--extra-vars for command line one-off playbook runs
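One way to picture these guidelines is as a stack of dictionaries where the first match wins. The Python sketch below models only four of the layers (Ansible's real precedence list has many more levels), and every variable name and value in it is made up for illustration.

```python
from collections import ChainMap

# All variable names and values below are hypothetical.
role_defaults = {"zookeeper_version": "3.4.6", "zookeeper_port": 2181}
group_vars = {"zookeeper_install_dir": "/opt/zookeeper"}
host_vars = {"zookeeper_port": 2182}
extra_vars = {"zookeeper_version": "3.4.8"}  # e.g. passed with --extra-vars

# ChainMap searches its maps left to right, so the highest-precedence
# source goes first and role defaults go last.
resolved = ChainMap(extra_vars, host_vars, group_vars, role_defaults)

print(resolved["zookeeper_version"])      # taken from extra_vars
print(resolved["zookeeper_port"])         # taken from host_vars
print(resolved["zookeeper_install_dir"])  # taken from group_vars
```

This is why role defaults are “meant” to be overridden: they only show through when nothing higher in the stack sets the same name.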

Creating your site playbook

I prefer to keep my Ansible code simple and manage as few .yml files as possible. To do this, I like to have all of my roles and plays in one or maybe two top level playbooks. Just pick and choose which roles you want and put them all in a site-infrastructure.yml, being sure to tag every play appropriately. Note that as of this writing, Ansible 2.0 reads tags dynamically, so if you want to use tags to control how plays get run (I highly recommend this), you need to put them in your top level playbook. Otherwise Ansible will iterate through every single play in your code looking for the --tags value you asked for.

Using --tags and --limit

When you want to run your top level playbook, you can run everything with ansible-playbook -i production site-infrastructure.yml, or limit which plays get run by using the --tags or --limit flags on the command line. For example, ansible-playbook -i production site-infrastructure.yml --limit aws or ansible-playbook -i production site-infrastructure.yml --tags site-kafka.

Remember that each play can have multiple tags. This allows you to pair things logically together. You might want to always run the zookeeper role when you run kafka, since kafka relies on zookeeper. In that case you might have:

- name: Run zookeeper role
  hosts: zookeeper-nodes
  vars:
   - zookeeper_host_list: "{{ groups['zookeeper-nodes'] }}"
  roles: [ zookeeper ]
  tags:
    - site-zookeeper
    - deps-kafka

- name: Run kafka role
  hosts: kafka-nodes
  vars:
   - kafka_host_list: "{{ groups['kafka-nodes'] }}"
   - zookeeper_host_list: "{{ groups['zookeeper-nodes'] }}"
  roles: [ kafka ]
  tags:
    - site-kafka
    - deps-kafka

This way if you run ansible-playbook -i production site-infrastructure.yml --tags deps-kafka it will run both zookeeper and kafka.
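The tag matching in the playbook above can be sketched in a few lines of Python. This is a simplified model, not how Ansible is implemented: it ignores special tags and --skip-tags, and the select_plays function is a hypothetical name. A play is selected when it shares at least one tag with the requested set.

```python
# Plays and tags copied from the example playbook above.
plays = [
    {"name": "Run zookeeper role", "tags": {"site-zookeeper", "deps-kafka"}},
    {"name": "Run kafka role", "tags": {"site-kafka", "deps-kafka"}},
]


def select_plays(plays, requested_tags):
    """Return names of plays sharing at least one tag with requested_tags."""
    requested = set(requested_tags)
    # A non-empty set intersection means the play carries a requested tag
    return [play["name"] for play in plays if play["tags"] & requested]


print(select_plays(plays, ["deps-kafka"]))  # both plays match
print(select_plays(plays, ["site-kafka"]))  # only the kafka play matches
```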

Don’t break the build

Since Ansible is a provisioning tool, you need an operating system to test on. Inevitably you end up with a lot of little bugs to sort out while testing your Ansible code. So – how do you work through all those bugs? You can’t go changing your local machine or any production machines without feeling the wrath of your local sysadmin. Vagrant to the rescue!

Vagrant

Vagrant is a VM scripting tool that allows you to manage different configurations for as many virtual machines as you need. It supports VirtualBox out of the box, and VMware through a separate provider plugin. I personally use Vagrant + VirtualBox because it’s free and works really well. Vagrant 1.8+ also supports VM snapshots, which is very nice for testing different setups and environments. I’ll walk through a simple Vagrant setup with two independent VMs, although this scales to any number of VMs that you wish.

Vagrantfile

The script file that tells Vagrant which VMs to set up and how to provision them is the Vagrantfile. The file is written in Ruby, so it is fully programmable and is there to do your bidding. For scalable VM testing, I chose to have the Vagrantfile read a vagrant_hosts file and parse it to figure out each VM’s name, IP, and type. For example, the vagrant_hosts file may look like:

0.0.1        localhost
168.33.101   vagrant-as-01  vas01  ubuntu
168.33.102   vagrant-as-02  vas02  ubuntu

Another thing I do in my Vagrantfile is overwrite the /etc/hosts file with my vagrant_hosts file so that the VM’s know how to talk to each other on the network. Lastly, I copy over my ssh public key so that I can ssh into the VM’s using ssh vagrant@vagrant-as-01. Normally if you are just testing VM’s without provisioning with Ansible you could use the vagrant ssh command which uses a built in private key that comes with Vagrant. However, to use Ansible via your local console to provision Vagrant, you need to be able to ssh in, ideally without a password. These actions are accomplished by doing:

# Configuration applying to all VMs
config.vm.provision :shell, inline: "cat /vagrant/vagrant_hosts > /etc/hosts"
config.vm.provision :shell, inline: "cat /vagrant/id_rsa.pub >> /home/vagrant/.ssh/authorized_keys"

Note the comment that says these actions will be applied to all VMs. If you want to do something to an individual VM, you have to break it out in a Ruby loop:

# Set up IP addresses and hostnames from 'hosts' file
# It assumes 'localhost' is on the first line
hosts = File.readlines('vagrant_hosts')
hosts[1..-1].each do |h|
  unless /(#|^\s*$)/.match(h)   # ignore commented out hosts and blank lines
    config.vm.define h.split(%r{\s+})[1] do |node|
      if h.split(%r{\s+})[-1] == 'centos'
        node.vm.box = CENTOS_BOX
      elsif h.split(%r{\s+})[-1] == 'ubuntu'
        node.vm.box = UBUNTU_BOX
        node.ssh.shell = "bash -c 'BASH_ENV=/etc/profile exec bash'"
      end
      node.vm.hostname = h.split(%r{\s+})[1]
      node.vm.network "private_network", ip: h.split(%r{\s+})[0]
      node.vm.provision "shell", inline: "service supervisord restart || true", run: "always"
    end
  end
end

To clarify a few spots in the code snippet above, hosts[1..-1].each do |h| sets up an .each loop that iterates from index 1 (not 0, since that is localhost) to the end of the file. To find out what type of VM each line describes, it checks whether the last field is “centos” or “ubuntu”. The line node.ssh.shell = "bash -c 'BASH_ENV=/etc/profile exec bash'" is a trick I discovered to resolve the infamous stdin is not a tty Vagrant bug when provisioning Ubuntu VMs.
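For readers who are more comfortable in Python than Ruby, the parsing logic of that loop can be sketched as follows. The sample file contents and IP addresses are made up, and parse_vagrant_hosts is a hypothetical name; the skipping rules mirror the Ruby code (drop the first line, then ignore blank lines and any line containing a #).

```python
# Sample contents with made-up private IPs; the first line is localhost.
SAMPLE_HOSTS = """\
127.0.0.1    localhost
10.0.0.11    vagrant-as-01  vas01  ubuntu
# 10.0.0.12  vagrant-as-02  vas02  ubuntu

10.0.0.13    vagrant-as-03  vas03  centos
"""


def parse_vagrant_hosts(text):
    """Return a list of (ip, hostname, box_type) tuples, one per VM."""
    vms = []
    # [1:] skips the localhost line, like hosts[1..-1] in the Ruby code
    for line in text.splitlines()[1:]:
        # Skip blank lines and any line containing '#', as the Ruby regex does
        if "#" in line or not line.strip():
            continue
        fields = line.split()
        # First field is the IP, second the hostname, last the box type
        vms.append((fields[0], fields[1], fields[-1]))
    return vms


for ip, hostname, box_type in parse_vagrant_hosts(SAMPLE_HOSTS):
    print(hostname, ip, box_type)
```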

The line node.vm.provision "shell", inline: "service supervisord restart || true", run: "always" is a workaround I use to restart the supervisord process when the VM boots. I like to use supervisord to manage all my running applications since it’s a nice central place to check status on all the custom software or applications I’ve installed.

Testing Ansible with Vagrant

After you run vagrant up your VMs should be pretty much good to go. You may also want to add the IP addresses and hostnames from your vagrant_hosts file to your local /etc/hosts so you can access the VMs via hostname rather than IP address. Make sure you can ssh into the machines as the vagrant user and you are ready to start provisioning with Ansible!

When you run your Ansible code, be sure to run it like ansible-playbook -i inventory site-infrastructure.yml -u vagrant since by default Ansible will try to connect using your current username which does not exist on the Vagrant VM.

Automated Builds with Travis CI

Making a change to code and manually running tests gets old really fast. Not to mention it’s subject to human error. Automating the test process not only speeds up development in the long run, but will catch errors quickly and reliably (assuming your tests are good). This practice of “continuous integration” or “continuous delivery” can also be applied to the “infrastructure as code” approach.

The first thing to do is to run a --syntax-check on your code. This catches any trivial errors and will cause your build to fail very fast so you can fix the bug quickly. Next, you can actually provision the VM that Travis gives you to test with. For Ansible, I recommend breaking this up into different pieces using the --tags option so that you can take advantage of concurrency if your CI software supports it. Lastly, you can run some high level tests on your tools to make sure they are actually working as they should.

For Elasticsearch, index a small document
For Kafka, write something to a topic
For Hadoop, make a file or run a MapReduce job
For HBase, write something to the database

You get the idea.

Here is an example .travis.yml file that I’ve used to test Ansible code:

sudo: required
dist: trusty
addons:
  hosts:
    - travis-trusty
language: python
python: '2.7'
before_install:
  - sudo apt-get update -qq
  - sudo apt-get install -qq python-apt
install:
  - pip install ansible
env:
  - TAGS='site-common'
  - TAGS='site-zookeeper'
  - TAGS='deps-kafka'
  - TAGS='ELK'
  - TAGS='scrapy-services'
  - TAGS='scrapy-cluster'
  - TAGS='deps-storm'
  - TAGS='deps-hadoop'
  - TAGS='site-docker-engine'

matrix:
  fast_finish: true
script:
  - ansible-playbook -i testing site-infrastructure.yml --tags $TAGS --syntax-check
  - ansible-playbook -i testing site-infrastructure.yml --tags $TAGS --connection=local --become

Note that this requires creating a special testing inventory file that uses travis-trusty as the hostname for everything. Also - by combining the Travis TAGS environment matrix with ansible --tags, I can effectively run multiple Ansible builds concurrently, which should reduce the overall build time.
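To see how the env matrix fans out, here is a small Python sketch that expands a few of the TAGS values into the two script commands each Travis job would run. It only builds and prints the command strings; commands_for is a hypothetical helper and nothing here talks to Travis or Ansible.

```python
# A subset of the TAGS values from the env section above.
env_matrix = ["site-common", "site-zookeeper", "deps-kafka"]


def commands_for(tags):
    """Return the two script commands one Travis job would run."""
    base = f"ansible-playbook -i testing site-infrastructure.yml --tags {tags}"
    return [
        f"{base} --syntax-check",               # fail fast on syntax errors
        f"{base} --connection=local --become",  # then provision the build VM
    ]


for tags in env_matrix:
    print(f"job TAGS={tags}")
    for command in commands_for(tags):
        print("  " + command)
```

Each entry under env becomes its own job, so a failure in one tag group shows up on its own instead of being buried in one long provisioning run.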

Conclusion

If you have to manage more than one server, you should probably be using some sort of provisioning framework. Ansible is certainly a good choice, and is becoming increasingly popular relative to other tools such as Chef or Puppet. In fact, judging at least by the amount of attention on GitHub, Ansible is blowing away the competition.

Using a combination of Vagrant VMs and CI tools like Travis is essential to making sure you don’t break the build. Vagrant is great for development and Travis is great for those one line changes that “shouldn’t” break the build but should get tested anyway.

Inspiration for most of the examples and code snippets was taken from the ansible-symphony repository and most of the development for this code was done for IST Research. If you enjoyed this post and want to work with the latest in IT and big data technology, Python, or Java, shoot me an email or get in touch with me on LinkedIn or Twitter!

Using pandoc

2015-12-12T00:00:00+00:00

The résumé is outdated. Why are people still passing around MS Word documents? There are a few problems with this:

Email attachments just suck to begin with.
As soon as a résumé is sent, it is out of date.
You have no control of what happens to the document once you send it.
There are likely multiple different versions of your résumé floating around the web in various states of correctness since they are out of date.

This just leaves everyone confused as to what is the latest version of your résumé, and then the inevitable, “Can you send me an updated copy of your résumé by tonight?” question comes out, and you’re left scrambling to update it.

Keep your information on LinkedIn updated. LinkedIn can handle most of your resume needs, but it still only allows for basic text entry (why no markdown at least?), so people feel the obligation to stick with MS Word. It has an “export to PDF” feature, but it leaves much to be desired.

I wish that LinkedIn would up their game and allow for more formatting and flexibility, but until then most people will be looking for another solution. For many tech companies and startups, a LinkedIn profile is sufficient, but for the old guard a physical résumé document is still the gold standard.

Markdown and Pandoc to the rescue!

For those that want a physical résumé separate from LinkedIn, there is a solution to your MS Word woes, and its name is pandoc. Pandoc is a free document converter that supports all kinds of formats. On a Mac, you can install it with brew install pandoc.

Here’s the best part: you can maintain your résumé in Markdown and then have pandoc automatically generate the other formats for you! I converted my résumé to Markdown, and made a little shell script that generates all the formats that I need. Here is the make.sh script, which I run whenever my Markdown résumé is updated. It generates plain text, docx, and html files.

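#!/bin/bash
# Convert a Markdown résumé to plain text, docx, and HTML with pandoc.
if [ $# -eq 1 ]; then
    name="${1%.md}"   # strip only the trailing .md extension
    pandoc "$1" -t plain -o "$name.txt"
    pandoc "$1" -t docx -o "$name.docx"
    pandoc "$1" -t html5 -o "$name.html"
else
    echo "Usage: $0 YourResume.md"
    exit 1
fi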

As an example of what the outputs look like, here is my résumé in the original Markdown format and the generated .txt, .docx, and .html. I keep the source Markdown file under version control on GitHub and generate the other files with the make.sh script.
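If you would rather drive pandoc from Python, the same idea might look like the sketch below. It assumes pandoc is installed and on your PATH, uses the same -t and -o options as make.sh, and the file name resume.md is just a placeholder. The pandoc_commands function only builds the argument lists; build_all is what would actually invoke pandoc.

```python
from pathlib import Path
import subprocess

# Output formats taken from make.sh: pandoc writer name -> file extension
FORMATS = {"plain": ".txt", "docx": ".docx", "html5": ".html"}


def pandoc_commands(markdown_file):
    """Build one pandoc argument list per output format."""
    source = Path(markdown_file)
    return [
        ["pandoc", str(source), "-t", writer, "-o", str(source.with_suffix(ext))]
        for writer, ext in FORMATS.items()
    ]


def build_all(markdown_file):
    """Run every command; requires pandoc to be installed and on the PATH."""
    for command in pandoc_commands(markdown_file):
        subprocess.run(command, check=True)


# Placeholder file name; this only prints the commands, it does not run them.
for command in pandoc_commands("resume.md"):
    print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string means file names with spaces need no extra quoting.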

Why I love open source

2015-11-30T00:00:00+00:00

There is a thriving open source community out there, just waiting to be tapped into. Have an idea to make an application better? The developers would love to hear it. Have time to code it yourself? Submit a pull request on GitHub.

The community is a welcoming one. Before I found open source, I was a systems engineer working on military systems. You’d be hard pressed to find a community of people that are willing to talk about systems engineering for military systems outside of the DC area. Even then, most people cannot share work they’ve done, or code they’ve written. It’s usually restricted in some capacity or can only be used and sold by The Company.

Part of the appeal of Open Source is that anything you write is free for you and anyone else to use, and can potentially help many other people solve problems in lots of different areas. It’s not just The Company that benefits, it’s everyone.

Why has Open Source become so popular?

Open source has been around for a long time, and UNIX has been around since the 1960’s. However – the big change that has happened in the last 5 years or so is that enterprises and businesses are switching to open source. At my last job we made systems that relied on commercial off the shelf (COTS) hardware and wrote custom Java code to control everything. Before I left, we started to use some hardware that actually had open source libraries to control the hardware. We took that open source software and started to add our own customizations. Parts of what we were doing could be pushed back to the open source repository but a lot of it was closed source. That was my first interaction with Git and Github.

So – Github. Since Github launched in 2008, it has transformed the Open Source universe. Today, the first step to developing a new software product is to check Github and see if someone has already written it for you. And here’s the best part - most of the time the developers are more than willing to help you with the software. The first time I hopped on IRC to ask a question about a Python package, I was amazed at how helpful the developers were.

How can it be free?

The dynamics of the open source software community are strange and unique. People are anxious to give away their software, but then how do they make money? The simple answer is services. The business of selling software licenses is a dying one, and many large technology companies have been shifting to a service based business model. One of the earliest companies to adopt this approach was Red Hat. Red Hat was founded in 1993 on the back of a new Linux distribution that they created called Red Hat Linux. It’s built on the Linux kernel, which is already open source, so the OS itself is also open. They make money by providing services and support to mostly enterprise customers. Enterprises want to take advantage of the flexibility and stability that Linux and Open Source provide, but usually want some kind of security blanket knowing that they can get support if needed.

The other advantage of open sourcing your software is that you get community involvement. Just by putting it out there and making it free to use, there will be people finding bugs for you, submitting feature requests, and even improving your code or adding new features. It’s an unspoken agreement among developers that if you have benefited from another developer’s open source work and improved upon it, you should contribute back to it.

Open sourcing software can also build credibility within the developer community and get people to start using your software. It may even be a good way for companies to recruit new talent. Do you like working with our Deep Learning Code? Come work on it at Google and we’ll pay you to work on it.

Getting involved

I’ve read that only 1% of the population that uses Open Source software actually contributes to it. I have no idea if that’s true, but imagine if that number were 2%, 5%, or even 10%. I imagine we would see an even greater number of companies open sourcing their software to tap into the community. And more community involvement means greater diversity of ideas, which could ultimately lead to better software in the long run.

My advice for anyone looking to get involved is to start small. If you have a software package on Github you like to use, look at helping to improve the documentation. Docs are one of those things that are so important in getting new people to adopt the software, but developers tend to neglect them because they are focused on writing code. After adding or fixing documentation, look at existing issues on Github and see if you can tackle any of them. Fixing existing issues is always appreciated and is sure to bolster your Open Source karma in the community.

My approach to design

2015-11-29T00:00:00+00:00

If you are a Python programmer, or doing any technical design for that matter, I highly recommend checking out The Zen of Python. If you are on OSX or Linux, open up a terminal and type python -c 'import this'. You should see this:

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

If written well, Python almost reads like plain English, and it is clear to another programmer what is going on with the code fairly quickly. If I have to spend more than 10 minutes trying to figure out how a class method or function is working, it’s probably poorly written. My take on some of these points:

Beautiful is better than ugly

I spend extra time making sure my code looks good. It’s not just about aesthetics, it’s about maintainability. One day, I or someone else is going to have to modify this code, and nobody should have to waste time figuring out what the code is doing. It’s kind of like cleaning up your apartment before guests come over to visit. Some examples:

Bad

d = {}
d['some'] = 1
d['thing'] = 2

Better

d = dict(some=1, thing=2)

Best

d = {
    'some': 1,
    'thing': 2
}

One could argue that approach #2 is the simplest to type and takes the least amount of space in your text editor. Although it is convenient, I think #3 looks better, and it makes it easier to see how the dictionary would look after converting to a JSON string.
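A quick check, as a small sketch, that the three styles really do build the same dictionary, and that the layout of the third one lines up with the JSON output. The variable names d1, d2, d3, and as_json are just for this example.

```python
import json

# Style 1: build the dict up key by key
d1 = {}
d1['some'] = 1
d1['thing'] = 2

# Style 2: keyword arguments (keys must be valid Python identifiers)
d2 = dict(some=1, thing=2)

# Style 3: a literal laid out one key per line
d3 = {
    'some': 1,
    'thing': 2
}

assert d1 == d2 == d3  # all three build the same dictionary

# indent=4 makes the JSON output mirror the layout of style 3
as_json = json.dumps(d3, indent=4)
print(as_json)
```

One practical limit of style 2 is that it cannot express keys that are not valid identifiers, such as 'some-key', which is another point in favour of the literal.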

Explicit is better than implicit

This is one of the foundations of Python – if you are going to do something, it should be spelled out in the code. There shouldn’t be any magic going on that can be hard to track down if there are bugs. Python enforces being explicit in most cases, but design decisions that the programmer makes can influence how explicit the code really is. Some examples:

Inheritance vs. Composition

Certainly there is a case for both Inheritance and Composition in Object Oriented Programming and in Python in general. However, in terms of being Explicit, composition wins. Inheritance is very convenient – you inherit from a Parent class and all of a sudden you have some new magic methods to use! This is easy to see in the Python interpreter by running dir(a) on an instance a of the child class. But to figure this out in your text editor you will most likely need to hunt around in different places to find out where the inherited methods are coming from. This is annoying and not that Explicit. With composition, you are forced to be explicit. You will likely have to import specific classes using from some_module import AwesomeClass. From that point on, any time something in the AwesomeClass namespace is used, it is clear in the code where it comes from, like AwesomeClass.more_awesome().
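Here is a minimal sketch of the difference. All of the class and method names are hypothetical, and AwesomeClass is defined inline rather than imported so the example runs on its own.

```python
# All class and method names here are hypothetical.
class AwesomeClass:
    def more_awesome(self):
        return "awesome!"


# Inheritance: more_awesome() shows up on the child without being
# mentioned anywhere in the child's own source.
class InheritingReport(AwesomeClass):
    pass


# Composition: the dependency is held as an attribute, so every use of
# it is spelled out at the call site.
class ComposedReport:
    def __init__(self):
        self.helper = AwesomeClass()

    def render(self):
        return self.helper.more_awesome()


a = InheritingReport()
print("more_awesome" in dir(a))   # the inherited method is visible via dir()
print(a.more_awesome())
print(ComposedReport().render())
```

In the composed version, a reader who sees self.helper.more_awesome() can follow helper back to the line that created it, with no need to walk a class hierarchy.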

Using *args and **kwargs

This is another one that definitely has its uses, but I prefer to stay away from it unless absolutely necessary (decorator functions, inheritance) due to its ambiguity. *args and **kwargs allow the caller to pass an arbitrary number of arguments into your function. Since the function signature does not enforce any arguments, the function body needs to handle all the cases where unexpected arguments could be passed in. This can require a bunch of code that gets messy and hard to maintain. If many arguments need to be passed in, it is better to use a list or a dict and explicitly describe it in the function docstring.
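To illustrate the ambiguity, here is a small sketch with two hypothetical functions that do the same job. The names, parameters, and default values are all made up. The loose version quietly swallows a misspelled keyword, while the explicit version rejects it straight away.

```python
# Function and parameter names are hypothetical.
def connect_loose(*args, **kwargs):
    """Accepts anything, so it has to sort out the arguments itself."""
    host = kwargs.get("host", args[0] if args else "localhost")
    port = kwargs.get("port", args[1] if len(args) > 1 else 9092)
    return f"{host}:{port}"


def connect_explicit(host="localhost", port=9092):
    """The signature documents and enforces what callers may pass."""
    return f"{host}:{port}"


# The misspelled keyword 'prot' is silently ignored and the default port is used
print(connect_loose("broker-1", prot=9093))

# The same typo fails immediately with an explicit signature
try:
    connect_explicit("broker-1", prot=9093)
    typo_caught = False
except TypeError as error:
    typo_caught = True
    print(f"TypeError: {error}")
```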

The Art of Unix Programming

The Art of Unix Programming is another great resource for providing some guidelines on good UNIX and programming practices. These guidelines can benefit any programmer or hacker. Especially if someone is coming from a Windows or strictly Java background, this could be particularly useful. Some of my favorite paradigms are:

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.
Programmer time is expensive; conserve it in preference to machine time.
Avoid hand-hacking; write programs to write programs when you can.

Bear in mind that a lot of this is coming out of a circa 1978 time period. These ideas are especially relevant today – I guess there’s a reason UNIX has been around so long.

Closing thoughts

The next time someone comes to you with an idea or a technical solution, consider these thoughts from the Zen of Python.

If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.

Resources

Kafka

2015-11-27T00:00:00+00:00

There are a few decent resources out there for learning Kafka, but really it comes down to the Apache Documentation and Michael Noll’s publications. While these are both excellent, I still think there could be better information out there to help developers get started. Hopefully this post can help.

Why use Apache Kafka?

There are many use cases, and some of those are discussed in Kafka’s documentation. The benefits of Kafka are many: scalability, speed, durability. That’s all great, but here’s my biggest reason for using it: it serves as a central data bus for all streaming data. This is especially important when you may not know in advance who will be producing data, and who will be consuming that data.

Kafka Basics

Kafka is nothing more than a streaming log system. Think of it as tail -f in UNIX speak. In Linux, if there is a process that is producing some log output, it is very common to run tail -f <filename> on the file to track the log file updates as they happen. A Kafka topic is exactly that: it’s just a log file that lives in the Kafka broker ecosystem. The big difference is that instead of tailing a single file on a single server, you can consume from a topic from anywhere that has access to Kafka. That topic could also have multiple producers writing to it from many different places. For example:

Distributed Application

You have a distributed application that lives across more than one server. This application has some output. Where should that output be written to? The usual choices are to stdout, a flat file, or a database. But what if you don’t have a database set up or don’t know which one to use at first? What if you need to do some additional processing on this distributed data before sending to a database? You could write all the data out to flat files, do some processing on it, and then ingest the data into the database. But then you have to worry about managing all the data between the servers.

Enter Kafka. What if, instead of each server node writing out its data to a different place, they all wrote their data to a common Kafka topic? This is the power of Kafka. In Kafka speak, a client that writes data to a topic is called a producer. Now – if someone wants to read that data stream, they have one single place to go. The client reading the data is called a consumer.
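The idea can be sketched with a toy in-memory log. This is not a Kafka client and the ToyTopic class is entirely hypothetical; it only shows several producers appending to one shared log while each reader keeps track of its own position.

```python
# A toy in-memory stand-in for a topic; this is not a Kafka client.
class ToyTopic:
    def __init__(self):
        self.log = []  # append-only list of messages

    def produce(self, message):
        """Any number of producers append to the same log."""
        self.log.append(message)

    def consume(self, position):
        """Return the messages from `position` on, plus the new position."""
        return self.log[position:], len(self.log)


topic = ToyTopic()
topic.produce("web-01: request served")   # producer on one server
topic.produce("web-02: request served")   # producer on another server

# Two independent readers, each tracking its own position in the log
dashboard_msgs, dashboard_pos = topic.consume(0)
topic.produce("web-01: cache miss")
archiver_msgs, archiver_pos = topic.consume(0)
new_for_dashboard, dashboard_pos = topic.consume(dashboard_pos)

print(dashboard_msgs)
print(archiver_msgs)
print(new_for_dashboard)
```

Reading never removes anything from the log, which is why a late reader like the archiver still sees every message from the beginning.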

Inside the Kafka Topic

Two other built in features of Kafka are parallelism and redundancy. Kafka handles this by giving each topic a certain number of partitions and replicas.

Partitions: A single piece of a Kafka topic. The number of partitions is configurable on a per topic basis. More partitions allow for greater parallelism when reading from the topic. The number of partitions determines how many consumers in a consumer-group can read in parallel. For example, if a topic has 3 partitions, you can have 3 consumers in a consumer-group balancing consumption between the partitions, giving you a parallelism of 3. The right partition count is somewhat hard to determine until you know how fast you are producing data vs. how fast you are consuming it. If you have a topic that you know will be high volume, I would err on the side of more partitions. This also allows room for growth. Aim for between 10 and 50 partitions to start.

Replicas: These are copies of the partitions. They are never written to or read from directly by clients. Their only purpose is data redundancy. If your topic has n replicas, n-1 brokers can fail before there is any data loss. Additionally, you cannot have a topic with a replication factor greater than the number of brokers that you have. For example, if you have 5 Kafka brokers, you could have a topic with a maximum replication factor of 5, and 5-1=4 brokers could go down before there is any data loss.

Offsets: An “offset” is just a pointer to a location in the logfile or “topic”. Each client or “consumer” belongs to a “consumer-group” that is used to track the offset where it is in the topic. The actual offset values are stored in a special Kafka topic called “__consumer_offsets”. Why is it called a “consumer-group” and not just a “consumer”? This is because Kafka supports balanced consuming – meaning that you can have more than one consumer reading from a topic in a round-robin fashion to increase parallelism.

Leaders and In Sync Replicas (ISRs): Once your topic has been created, you can run Kafka’s built in tool ./kafka-topics.sh --describe -z <zookeeper-node>:2181 to describe the topics on your Kafka cluster. You might see something like this:

Topic: test.cleaned_firehose PartitionCount:3    ReplicationFactor:3 Configs:
Topic: test.cleaned_firehose    Partition: 0    Leader: 4   Replicas: 4,5,1 Isr: 1,4,5
Topic: test.cleaned_firehose    Partition: 1    Leader: 5   Replicas: 5,1,2 Isr: 1,2,5
Topic: test.cleaned_firehose    Partition: 2    Leader: 1   Replicas: 1,2,3 Isr: 1,2,3

Each partition has a broker leader, and the replicas simply “follow” the leader and duplicate the data. If a broker that is a leader goes down, Kafka will automatically elect a new broker leader by default. Note that if you have consumers consuming from a topic that temporarily loses its leader, they may need to reconnect to fetch the new metadata from the cluster.
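The partition and replica arithmetic described in this section can be sketched in Python. The round-robin assignment below is a simplified stand-in and not Kafka's actual assignment logic, and both function names are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions over a consumer group, round-robin.

    A simplified stand-in for Kafka's real assignment logic.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def tolerable_broker_failures(replication_factor):
    """With n replicas, n - 1 brokers can fail before data is lost."""
    return replication_factor - 1


print(assign_partitions(3, ["c1", "c2", "c3"]))        # one partition each
print(assign_partitions(3, ["c1", "c2", "c3", "c4"]))  # c4 gets nothing
print(tolerable_broker_failures(5))
```

The second call shows why the partition count caps parallelism: a fourth consumer in the group has no partition left to read from.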

Common problems

The biggest problem I’ve encountered is with brokers randomly going down and then becoming unavailable for leader election. I haven’t gotten to the bottom of this issue but I’m hopeful that some of this stuff has been fixed in 0.9. Rebooting the Kafka broker fixes this problem most of the time.

Another common problem I encounter when using Kafka is that a broker goes down, Kafka elects a new leader, and the consumer doesn’t get the message that there is a new leader in town. This results in the dreaded NotLeaderForPartition errors. This can be solved by updating the metadata for the Kafka consumer. In the case of a Python client, it appears that neither kafka-python nor pykafka can handle this situation. Therefore, the error needs to be caught, and the consumer needs to be re-created.
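The catch-and-recreate pattern might look roughly like the sketch below. To keep it runnable without a broker, it uses a stand-in exception and a fake consumer that fails on first use; neither is a class from kafka-python or pykafka, and with a real client you would catch that library's own error type and build its consumer in the factory function.

```python
# Stand-in exception and consumer; neither is a real client library class.
class NotLeaderError(Exception):
    pass


class FlakyConsumer:
    """Fails on first use to simulate a stale view of the partition leader."""
    created = 0

    def __init__(self):
        FlakyConsumer.created += 1
        self.stale = FlakyConsumer.created == 1

    def fetch(self):
        if self.stale:
            raise NotLeaderError("leader moved")
        return ["message-1", "message-2"]


def fetch_with_recreate(make_consumer, max_attempts=3):
    """Fetch messages, building a fresh consumer after each leader error."""
    consumer = make_consumer()
    for attempt in range(1, max_attempts + 1):
        try:
            return consumer.fetch()
        except NotLeaderError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the error
            # Assumption: a newly created client starts with fresh metadata
            consumer = make_consumer()


messages = fetch_with_recreate(FlakyConsumer)
print(messages, "after creating", FlakyConsumer.created, "consumers")
```

Capping the number of attempts matters here, since a cluster that is genuinely down would otherwise keep this loop spinning forever.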

Tips and Tricks

Check out kafkacat on GitHub for a nice non-JVM CLI tool for inspecting Kafka topics, or for consuming from and producing to them.

Closing thoughts

Kafka is a great tool, but it is still in development, APIs are in flux, and new features are still being added. As of this writing Kafka 0.9.0 has just come out, which introduces a new consumer API (although the old one is still supported) and a security protocol. Before 0.9.0 you had to control access to Kafka via a whitelist, a VPN, or a firewall.

One of the most lacking areas of Kafka is any kind of built in monitoring or “health status” support. When things go wrong, it’s very hard to figure out the root cause, and Kafka will often still be “running” but you’ll see ERROR messages spewing out of the logs. Some kind of built in status check API would be very useful for monitoring the tool and figuring out what’s going on. There are some OK open source solutions out there for monitoring consumer lag, offsets, and broker status, but they aren’t sufficient to solve this problem.

Mac dev tips

2015-09-06T00:00:00+00:00

I’ve been doing software development on a MacBook Pro for a little while now, and I gotta say there are a TON of great free packages and tools that make development that much more enjoyable. I’m not going to get into a Windows/Mac/Linux debate here, let’s just say Mac OS X wins, with Linux a close second. All of the production code that I run runs on Linux, and most of it runs natively on my Mac as well. That, combined with all the other nice features of the Mac, makes it unmatched for software development.

OK so I want to talk about some tools and nifty tricks that I use on a fairly regular basis.

Sublime Text 3

If you do most of your development on a Mac already, you probably know about Sublime Text. It’s a lightweight and fast editor with a ton of free plugins for just about everything you can imagine. I do most of my work in python and use git for version control, so this list will be a little skewed towards those technologies.

Color Sublime

Colorsublime has about a billion different built in color themes, and you can actually preview them right in Sublime before even installing them by using the Sublime Text 3 command palette.

Git Gutter

Git Gutter is another fantastic plugin that compares your working copy of a file to the version in the git index.

This is similar to doing git diff on the command line. By default it compares against HEAD but this can be changed to compare against specific branches, tags, or commits.

Sublime Linter

Sublime Linter is a framework for using code linters in Sublime Text. Any linters you wish to use need to be installed separately. For python, there are many linters but I recommend using pyflakes at a minimum. It’s also a good idea to use the pep8 linter to make sure you are following the PEP8 Python standards.

ProTip: I recommend changing the settings for Sublime Linter to manual mode. By default it lints every file you have open in real time, which I’ve found can cause Sublime to hiccup and lag, which is very annoying. To change the settings:

Open up the command palette, and select SublimeLinter: Choose Lint Mode –> Manual

ReStructuredText Improved

ReStructuredText Improved is a nice plugin that does syntax highlighting of your ReStructuredText. It integrates very nicely into Sublime and is very unobtrusive, unlike some of the Markdown plugins for Sublime that I have seen.

Honorable mentions

Some other plugins I use…

Bracket Highlighter
Sidebar (sidebar enhancements)
PyDOC (links to python documentation by right clicking on code)

FlyCut

FlyCut is a great little piece of software that keeps a copy-paste buffer within easy reach. By default, just hit shift+command+v to pull up the dialog. This is such a simple thing but it saves an immense amount of time. You can download it on the Mac App Store.

Caffeine

If you ever get annoyed at your computer dimming its screen and going to sleep when you want the screen to stay on, this little app is for you. Again, a really simple piece of software that really improves Mac usage. Basically you just click the coffee button when you want your Mac to stay awake.

Spectacle

Another simple, extremely useful piece of software is Spectacle. This application lets you easily place and re-size your windows with a bunch of keyboard shortcuts. Actually, this is such a good idea that Apple has decided to incorporate something very similar into their El Capitan OS X release coming later in the Fall.

f.lux

If you work late like I do, that blueish screen can be pretty harsh on the eyes. Check out f.lux – it gradually makes the screen go redder as the night wears on. It is easier on the eyes and also helps you get to sleep faster after a long night of coding.

SourceTree

This is a GUI interface for git, made by Atlassian. Now – I know what you’re thinking – “it’s not command line! CLI is way more powerful!” Yes – that is true, and I use the command line for git most of the time. However, if you have a bunch of changes and really want to do diffs on what changed, and potentially break the chunks into smaller commits, SourceTree beats the CLI. I know this can be done via git add -p, but the GUI interface is just better for this.
