Jekyll2019-02-06T17:33:51+00:00https://jasonrhaas.github.io/feed.xmlPython n’ stuffPython, and other related technical stuff. Also may include posts about things I like such as brewing beer and riding motorcycles.
Making a simple link shortener with AWS and MySQL2019-02-06T00:00:00+00:002019-02-06T00:00:00+00:00https://jasonrhaas.github.io/2019/02/06/making-a-simple-link-shortener-with-aws-andmysql<p><img src="https://cdn-images-1.medium.com/max/1600/1*NaL3SxOBXbrwXrAciJyM3Q.jpeg" alt="" /></p>
<h3 id="making-a-simple-link-shortener-with-aws-and-mysql">Making a simple link shortener with AWS and MySQL</h3>
<p>Link shorteners are handy and pretty simple to implement. There are some free
services that you can use for this like bitly and google, but in some cases it
might be preferable to have the link shortener be under your own domain. Or
maybe you want to implement some additional analytics or features that the other
services don’t have. Let’s get started!</p>
<h3 id="usage">Usage</h3>
<p>The goal of this project is to create a service that turns any link into a short
link under your domain. For example:</p>
<p>POST <code class="highlighter-rouge">api.jasonrhaas.com</code></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"url": "reddit.com"
}
</code></pre></div></div>
<p>The response should be:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"tiny_url": "http://jasonrhaas.com/6vyv6"
}
</code></pre></div></div>
<p>When you go to that URL it should point you to reddit.com.</p>
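<p>As a rough sketch of the client side, the request above could be built with nothing but the standard library. The <code class="highlighter-rouge">https://</code> scheme, the trailing slash, and the helper name <code class="highlighter-rouge">build_shorten_request</code> are assumptions for illustration; the request is built but not sent.</p>

```python
import json
from urllib.request import Request

# Assumed endpoint: the post only names the host api.jasonrhaas.com,
# so the scheme and path here are guesses.
API_URL = "https://api.jasonrhaas.com/"


def build_shorten_request(long_url: str) -> Request:
    """Build (but do not send) the POST request that asks for a short link."""
    body = json.dumps({"url": long_url}).encode("utf-8")
    return Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_shorten_request("reddit.com")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

<p>Passing the result to <code class="highlighter-rouge">urllib.request.urlopen</code> would actually send it, and the JSON response should then carry the <code class="highlighter-rouge">tiny_url</code> key.</p>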
<h3 id="architecture">Architecture</h3>
<p>When building something, I’m a big believer in using the right tool for the job.
In other words, I’m not going to use a batteries included web framework like
Django when all I need is a simple micro service. The tools I’m using for this
job are:</p>
<ul>
<li>Flask</li>
<li>MySQL</li>
<li>Zappa (AWS Lambda + API Gateway)</li>
</ul>
<p>Zappa is a nice tool that makes it easy to deploy an event driven API. I find
the API Gateway user interface a bit cumbersome, so it’s nice to have a framework
that allows me to do (almost) everything from my code editor and command line.</p>
<h3 id="implementation">Implementation</h3>
<p>To build this, I used a simple Flask application containing a single <code class="highlighter-rouge">Url</code>
SQLAlchemy model and basically two functions, a <code class="highlighter-rouge">get_tiny_url</code> function and a
<code class="highlighter-rouge">get_long_url</code> function.</p>
<h4 id="model">Model</h4>
<p>The model looks like this:</p>
<script src="https://gist.github.com/jasonrhaas/320d9a4bd7a8233f3d1600f9e9c77920.js"> </script>
<p>If you aren’t familiar with model classes, I recommend checking out the <a href="http://flask-sqlalchemy.pocoo.org/2.3/quickstart/">Flask
SQLAlchemy Quickstart</a> for a
crash course. If you want to go a bit deeper into what database model classes
can provide, check out the <a href="https://docs.djangoproject.com/en/2.1/topics/db/models/">Django
Tutorial</a> on models.
The <a href="https://docs.sqlalchemy.org/en/latest/">SQLAlchemy documentation</a> is
comprehensive, but it’s very dense and technical and in my opinion is not a good
introduction to models.</p>
<p>In the <code class="highlighter-rouge">Url</code> model above, here is what the different columns are:</p>
<ul>
<li><code class="highlighter-rouge">id</code>: just a database id. In some cases (like for Django), this line isn’t even
necessary and is provided automatically.</li>
<li><code class="highlighter-rouge">hash</code>: the hash that the long URL gets turned into; it identifies a unique URL.</li>
<li><code class="highlighter-rouge">long</code>: the original URL.</li>
<li><code class="highlighter-rouge">hits</code>: a simple numeric field that keeps track of how many times the link has
been accessed.</li>
</ul>
<p>We could expand this to provide even more information and analytics, like user
IP address, referring URL, timestamp of when it was accessed, etc. But for this
case I just needed something simple. The good part is this is easy to expand
upon later.</p>
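<p>For readers who cannot load the embedded gist, here is a minimal sketch of the same four columns as a plain table. It uses the standard library’s <code class="highlighter-rouge">sqlite3</code> as a stand-in for MySQL and SQLAlchemy, and the column types and constraints are assumptions rather than the exact model from the gist.</p>

```python
import sqlite3

# Stand-in for the MySQL table behind the Url model. Column names follow the
# post; the types and constraints are assumptions.
SCHEMA = """
CREATE TABLE url (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- database id
    hash TEXT    NOT NULL UNIQUE,            -- hash identifying the long URL
    long TEXT    NOT NULL,                   -- original URL
    hits INTEGER NOT NULL DEFAULT 0          -- how many times the link was used
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO url (hash, long) VALUES (?, ?)", ("abc123", "reddit.com"))
row = conn.execute("SELECT id, hash, long, hits FROM url").fetchone()
print(row)
```

<p>Extra analytics such as a timestamp or referring URL would most likely live in a second table keyed on <code class="highlighter-rouge">id</code>, so the core table can stay this small.</p>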
<h4 id="get-tiny-url">Get Tiny URL</h4>
<p>The function to create the short link I decided to call <code class="highlighter-rouge">get_tiny_url</code>. The
function looks like this:</p>
<script src="https://gist.github.com/jasonrhaas/be559a46d24db938758dfa687945a09e.js"> </script>
<p>To explain this, I’m going to pick out a few lines.</p>
<h4 id="line-3">Line 3</h4>
<p>First we check to make sure that it is a POST request and a valid JSON object.
If it’s not, we simply redirect to the base url. This is kind of a “catch all”
approach and works fine for our use case, but it could be improved to have more
specific error catching and an appropriate error message for the user.</p>
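<p>One way the more specific error handling could look is sketched below as a framework-free helper, so it can run without Flask. The function name, status codes, and messages are all hypothetical and are not taken from the gist.</p>

```python
import json
from typing import Optional, Tuple


def validate_shorten_request(method: str, raw_body: bytes) -> Tuple[int, Optional[str], str]:
    """Return (status_code, url, message) for an incoming shorten request."""
    if method != "POST":
        return 405, None, "Use POST to create a short link."
    try:
        payload = json.loads(raw_body)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        return 400, None, "Request body must be valid JSON."
    if not isinstance(payload, dict) or not isinstance(payload.get("url"), str):
        return 400, None, 'JSON body must look like {"url": "..."}.'
    return 200, payload["url"], "ok"


print(validate_shorten_request("GET", b""))
print(validate_shorten_request("POST", b"not json"))
print(validate_shorten_request("POST", b'{"url": "reddit.com"}'))
```

<p>In a Flask view, the status code and message could be returned directly instead of redirecting, which tells the caller what went wrong.</p>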
<h4 id="line-9">Line 9</h4>
<p>It’s <em>never</em> a good idea to store potentially sensitive information in a database
in clear text. As secure as your system is, there is always a chance that it may
be hacked and the data may be stolen. As a good rule of thumb, passwords should
<em>always</em> be hashed, for example.</p>
<p>In this case, I don’t have passwords, I’m just using the hash to identify a
unique URL. The area of password hashing is a complex and convoluted one. During
my research of implementing a JWT API Gateway Authorization solution, I found
that <a href="https://docs.python.org/3/library/hashlib.html#blake2">BLAKE2</a>
is a fast, modern cryptographic hash function, which is more than enough for
identifying a URL. (For storing real passwords, a deliberately slow function
such as <code class="highlighter-rouge">hashlib.scrypt</code> is the usual recommendation.) In Python 3.6,
BLAKE2 was added to the Python standard library.</p>
<p>When this line gets run, you end up with a hash like
<strong>53761004cf82ca63a62c430e8a409a6703d63f45</strong>. This hash is deterministic, but
it’s “one-way”, meaning that it’s (almost) impossible to derive the <code class="highlighter-rouge">long_url</code> from
this hash code on its own.</p>
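<p>A minimal sketch of this hashing step, using <code class="highlighter-rouge">hashlib.blake2b</code> from the standard library. The <code class="highlighter-rouge">digest_size=20</code> setting is an assumption inferred from the 40-character example above, and the helper name <code class="highlighter-rouge">url_fingerprint</code> is made up for illustration.</p>

```python
import hashlib


def url_fingerprint(long_url: str) -> str:
    """Return a hex digest of the URL; 20 bytes gives 40 hex characters."""
    digest = hashlib.blake2b(long_url.encode("utf-8"), digest_size=20)
    return digest.hexdigest()


fingerprint = url_fingerprint("reddit.com")
print(fingerprint)
print(len(fingerprint))
```

<p>Calling the function twice with the same URL yields the same digest, which is what makes it usable as a lookup key.</p>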
<h4 id="line-12">Line 12</h4>
<p>This line is to try to find the url by the hash. This hash is unique per url, so
there should never be any duplicates in the database. If the <code class="highlighter-rouge">url</code> does not
exist, it will add it to the database.</p>
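<p>The find-or-add step might be sketched like this, again with <code class="highlighter-rouge">sqlite3</code> standing in for MySQL. The function name and the table layout are assumptions; <code class="highlighter-rouge">INSERT OR IGNORE</code> is SQLite syntax (MySQL spells it <code class="highlighter-rouge">INSERT IGNORE</code>).</p>

```python
import hashlib
import sqlite3


def get_or_create_url_id(conn: sqlite3.Connection, long_url: str) -> int:
    """Return the id of the row for long_url, inserting the row if it is new."""
    url_hash = hashlib.blake2b(long_url.encode("utf-8"), digest_size=20).hexdigest()
    # The UNIQUE constraint on hash turns a repeated insert into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO url (hash, long) VALUES (?, ?)", (url_hash, long_url)
    )
    conn.commit()
    row = conn.execute("SELECT id FROM url WHERE hash = ?", (url_hash,)).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE url ("
    "id INTEGER PRIMARY KEY, hash TEXT UNIQUE, long TEXT, hits INTEGER DEFAULT 0)"
)
first = get_or_create_url_id(conn, "reddit.com")
second = get_or_create_url_id(conn, "reddit.com")
print(first == second)
```

<p>Letting the database enforce uniqueness avoids a race between two requests that shorten the same URL at the same moment.</p>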
<h4 id="line-20">Line 20</h4>
<p>Finally, we make use of the
<a href="https://github.com/Alir3z4/python-short_url">short_url</a> Python library. This
library uses a bit-shuffling approach to deterministically generate URLs from a
number. In essence, this number corresponds to the database <code class="highlighter-rouge">id</code>. For our use
case, the number will point to an <code class="highlighter-rouge">id</code> in the database, which contains the
<code class="highlighter-rouge">long_url</code>.</p>
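<p>To show the general idea of turning an integer id into a short token and back, here is a simplified stand-in. It is plain base-36 encoding, not the bit-shuffling algorithm that <code class="highlighter-rouge">short_url</code> implements, and the alphabet and function names are made up for illustration.</p>

```python
import string

# Hypothetical alphabet. The real short_url library uses its own alphabet and
# also shuffles bits so that consecutive ids do not look sequential.
ALPHABET = string.digits + string.ascii_lowercase


def encode_id(number: int) -> str:
    """Encode a non-negative database id as a short base-36 token."""
    if number < 0:
        raise ValueError("id must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    base = len(ALPHABET)
    while number:
        number, remainder = divmod(number, base)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_id(token: str) -> int:
    """Decode a token produced by encode_id back into the database id."""
    number = 0
    for char in token:
        number = number * len(ALPHABET) + ALPHABET.index(char)
    return number


token = encode_id(123456)
print(token, decode_id(token))
```

<p>Because the mapping is reversible, the short token never needs to be stored; only the <code class="highlighter-rouge">id</code> does.</p>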
<h4 id="get-long-url">Get Long URL</h4>
<p>On to the reverse function, <code class="highlighter-rouge">get_long_url</code> which looks up the original URL given
the short link.</p>
<script src="https://gist.github.com/jasonrhaas/41ac789259c251133b98dab16558543a.js"> </script>
<p>Picking out a few lines of interest:</p>
<h4 id="line-5">Line 5</h4>
<p>This line takes the <code class="highlighter-rouge">/tiny_url</code> part of the link and translates it to the
<code class="highlighter-rouge">url_id</code> which matches up with the database <code class="highlighter-rouge">id</code>.</p>
<h4 id="line-13">Line 13</h4>
<p>This is a simple counter keeping track of how many times the link gets accessed. This
information is then updated in the database.</p>
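<p>A sketch of the lookup plus hit counter, with <code class="highlighter-rouge">sqlite3</code> standing in for MySQL. The function name and table layout are assumptions; the increment is done in SQL rather than in Python, which is a design choice of this sketch and not necessarily what the gist does.</p>

```python
import sqlite3
from typing import Optional


def resolve_and_count(conn: sqlite3.Connection, url_id: int) -> Optional[str]:
    """Return the long URL for url_id and bump its hit counter, or None."""
    row = conn.execute("SELECT long FROM url WHERE id = ?", (url_id,)).fetchone()
    if row is None:
        return None
    # Incrementing inside the UPDATE keeps concurrent requests from
    # overwriting each other's count.
    conn.execute("UPDATE url SET hits = hits + 1 WHERE id = ?", (url_id,))
    conn.commit()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE url ("
    "id INTEGER PRIMARY KEY, hash TEXT UNIQUE, long TEXT, hits INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO url (hash, long) VALUES ('abc123', 'reddit.com')")
resolve_and_count(conn, 1)
resolve_and_count(conn, 1)
hits = conn.execute("SELECT hits FROM url WHERE id = 1").fetchone()[0]
print(hits)
```

<p>An unknown id returns <code class="highlighter-rouge">None</code>, which the view can turn into a 404 or a redirect to the base URL.</p>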
<h3 id="conclusion">Conclusion</h3>
<p>As you can see, coding up a link shortener is pretty straightforward! To test
this locally, all you need is a local MySQL database and Python3. In the next
blog post, I’ll talk in depth about how to set up your local environment and
also do automatic deployments using Continuous Integration and AWS.</p>Link shorteners are handy and pretty simple to implement. There are some free services that you can use for this like bitly and google, but in some cases it might be preferable to have the link shortener be under your own domain.Technology - The Final Frontier2017-11-17T00:00:00+00:002017-11-17T00:00:00+00:00https://jasonrhaas.github.io/2017/11/17/technology-the-final-frontier<p><img src="https://cdn-images-1.medium.com/max/2000/1*OpY-PPIj2tPZ6Bc1DMiiIA.jpeg" alt="" />
<span class="figcaption_hack">Photo by SpaceX on Unsplash</span></p>
<p>Technology (and especially software) is constantly changing. Just because one
database, framework or programming language is popular now doesn’t mean it will
be in 5 years, or even 6 months. <strong>A career in technology is a career in
constant learning.</strong></p>
<p>So how do you keep up with the latest? Here are a few pointers from my own
experience:</p>
<h3 id="follow-hacker-news-newsycombinatorcom">Follow Hacker News (news.ycombinator.com)</h3>
<p>I’m not saying you need to be on top of this constantly. But this is the pulse
of the tech/startup world. Many experienced developers, tech CEOs, and
entrepreneurs post and comment on here.</p>
<p>It’s important to see what stories are getting lots of comments, and get a feel
for the general sentiment of the community. Also, new frameworks or technologies
are often posted here and talked about.</p>
<h3 id="follow-your-niche-field">Follow your niche field</h3>
<p>For me this is Python. I’m also interested in Entrepreneurship. So I subscribe
to these two newsletters and at least browse through some of the content every
week:</p>
<p><a href="http://www.pythonweekly.com">www.pythonweekly.com</a><br />
<a href="http://www.foundersweekly.com">www.foundersweekly.com</a></p>
<p><strong>I know what you’re thinking: Email is the worst.</strong> And you would be correct in
thinking that. These newsletters are among the few email subscriptions
that are actually meaningful and helpful. I recommend you find one similar in
your niche field.</p>
<h3 id="go-to-some-conferences">Go to (some) conferences</h3>
<p>Technology conferences are a dime a dozen these days. What technology are you
into? Kubernetes perhaps? Mesos? Or maybe just all things Big Data? Or maybe you
are really into Blockchain. Regardless of what it is, there seems to be a
conference for everything.</p>
<p>Honestly I usually don’t learn a heck of a lot from the conferences, but they
are a good way to see what other people are thinking and where the industry is
going. <strong>It’s more of a “meet & greet” and making sure I’m ahead of the curve
rather than anything else.</strong></p>
<p>Unfortunately, a lot of them are charging in the realm of multiple hundreds
of dollars to attend. In my opinion, most of them aren’t worth paying money for
(with the exception of PyCon, of course).</p>
<p>However, some conferences such as DeveloperWeek offer free tickets if you are a
developer. The idea is that the companies and recruiters pay the big fees for
the chance at recruiting talent (you!). <strong>Look for these “developer deals”, and
take advantage of them.</strong></p>
<p>Also — if you work for a company that is hip enough to pay for your nerd
conference, definitely take advantage of that as well.</p>
<h3 id="speak-at-tech-meetups">Speak at Tech Meetups</h3>
<p>This is a good one. It will give you a chance to give back to the community
and practice public speaking. Which — if you plan on running your own business
someday or even just being in a leadership position — is key.</p>
<p>Some Meetups, like the <em>Austin Python Meetup</em> have the regular talk, and then
“Lightning Talks” afterwards. A Lightning Talk is just a short talk, typically 5–10 minutes
in length. <strong>Lightning Talks are a great way to get comfortable speaking in
front of the crowd.</strong></p>
<h3 id="use-open-source-and-contribute-to-it">Use Open Source, and Contribute to it</h3>
<blockquote>
<p>Many people and organizations are relying on Open Source today, yet less than 1%
of users actually contribute back to Open Source.</p>
</blockquote>
<p>I totally made that quote up, but it’s probably pretty close. I know,
contributing to <em>Flask</em> can be intimidating. The regular contributors to popular
Open Source projects know their stuff. <strong>In fact — if companies see you are a
contributor to an Open Source Project, they may let you skip the demeaning “Live
Code Interview” process altogether.</strong></p>
<p>You don’t need to start with a big project, start small and work from there.
Also many projects have tags that specifically call out good tasks for people
new to the project — start with those.</p>
<h3 id="dont-fall-for-all-the-new-shiny-things">Don’t fall for all the new shiny things</h3>
<blockquote>
<p>This world moves pretty fast. If you don’t stop and look around every once in a
while, you might miss it — Ferris Bueller</p>
</blockquote>
<p>Ok so there is always some new shiny thing that promises to be THE BEST DATABASE
EVER. You should definitely play around with all the new fancy things, but be
wary about using them in Production. Learn about it and figure out what it can
do that your existing stack or programming language can’t.</p>
<p><strong>You know what powers most of the forward thinking technology companies today?
Unix.</strong> Know when Unix was invented? The 70’s. Same goes with MySQL, Regular Old
Bash Scripts, and even (gasp) C++. Yea, C++ might not be “cool”, but there is a
reason why it still plays a big role in many of the tried and true tools that
are taken for granted today.</p>Technology (and especially software) is constantly changing. Just because one database, framework or programming language is popular now doesn’t mean it will be in 5 years, or even 6 months. A career in technology is a career in constant learning.4 Keys to Fostering a Successful (Remote) Work Culture2017-07-18T00:00:00+00:002017-07-18T00:00:00+00:00https://jasonrhaas.github.io/2017/07/18/fostering-a-successful-remote-work-culture<p>Teams working remotely is more common than ever. The combination of flexibility, high speed Internet, and the right tools make working remotely a viable option for many teams. It has given rise to a whole generation of people that not only <em>can</em> work remotely, they <a href="http://www.alliedtelecom.net/why-embracing-remote-work-tech-helps-attract-retain-millennial-employees/"><em>expect</em> it</a>.</p>
<p>What do I know about working remotely? I’ve worked successfully in a remote capacity for almost 3 years. In that time, I traveled around the world with a program called Remote Year while working remotely. <strong>I am passionate about fostering a culture that allows remote teams to not only work together, but thrive.</strong></p>
<p>Here are the 4 traits that are absolutely essential to fostering a healthy (remote) work culture.</p>
<h2 id="1-open-communication">1. Open communication</h2>
<p><strong>By far the most important part of a successful remote work culture is open communication</strong>. What do I mean by this? Think about how a traditional office operates. There are conversations that happen in the hallway, at the water cooler, in meetings, in the cafeteria. Sometimes the conversation is casual but other times it’s important work stuff that should be shared with your co-workers.</p>
<p>I can’t tell you how many times in the past I’ve made some big technical decisions by bumping into a senior engineer in the hallway and asking his advice. Sometimes bouncing ideas off other people helps you solve a problem.</p>
<p>In fact, this is the #1 reason companies like to co-locate and put everyone in a single space. This is good for fostering ideas and innovation, but very bad for solving hard problems that require <a href="http://calnewport.com/books/deep-work/">Deep Work</a>.</p>
<p>Making important decisions without review among your peers is often a bad idea. Ideally the whole core engineering staff would also be involved at some level to validate ideas. In a traditional office, this usually means meetings. However, it’s pretty well understood that meetings are a <a href="https://www.themuse.com/advice/how-much-time-do-we-spend-in-meetings-hint-its-scary">colossal waste of time and resources</a>.</p>
<p><strong>The solution in a remote work environment is open communication using asynchronous tools</strong>. For software development, this often includes tools such as:</p>
<ul>
<li>Slack (asynchronous chat)</li>
<li>Github (asynchronous software review)</li>
<li>JIRA (asynchronous planning)</li>
<li>Google Hangouts (synchronous meetings)</li>
</ul>
<p>You can swap any of these out for your tool set of choice. By far the most important part of this whole section is this:</p>
<h3 id="using-public-forums-and-for-most-conversations">Using public forums for most conversations</h3>
<p>I’ll stress this again, have conversations in a public place! Avoid direct messages like the plague. DMs should be used only when absolutely necessary, that is, when the information needs to stay private with that individual.</p>
<p>When a new employee comes on board, it can be tempting to use DM’s because he or she might be intimidated by asking “stupid questions” in a public forum. But, it needs to be stressed to them that having discussions in public is essential.</p>
<p>There are several advantages of using public channels vs. private channels or direct messages:</p>
<h3 id="avoiding-redundancy">Avoiding redundancy</h3>
<p>There is nothing more annoying or inefficient than repeating the same question or relaying the exact information to 3, 4, 5, or 50 people over and over again. Why not just say it once in a public channel?</p>
<h3 id="open-accountability">Open accountability</h3>
<p>Is the boss wondering what you are working on or why a project is taking so long? Instead of harassing you on a regular interval, all they have to do is catch up on the public channels. At that point they can see what is going on and decide if they want to step in to help.</p>
<h3 id="knowledge-sharing">Knowledge sharing</h3>
<p>I once worked with a senior engineer that refused to share any of his knowledge with anyone. Something broken? He’ll fix it for you, but won’t tell you how he did it. This is a horrible engineering culture, and it’s actually really bad for the company. What if he leaves the company? It puts the company in a bad position.</p>
<p><strong>People like to learn new things. Found out a better way to deploy your code? Share it with the team.</strong></p>
<h2 id="2--accountability">2. Accountability</h2>
<p>Something that holds bigger or more traditional companies back from allowing their employees to work part or full time remote is accountability. The belief goes that unless employees are being “watched” in the office, they will slack off.</p>
<p>The truth is, if you have hired the right people, the opposite will be true. <strong>In fact, there are a lot of studies that show that people are actually more productive when they are free from the distractions of the office.</strong></p>
<p>So how do you stay accountable as an employee? Here are some ideas to build an accountable culture.</p>
<ul>
<li><strong>Daily standups</strong>. A 15 minute synchronous meeting serves as face time for the team to talk about what they are working on today and if there are any blockers. You can also have a #daily-standup channel to provide more detail on Slack.</li>
<li><strong>Daily announcements</strong>. This channel is to let people know when you aren’t working. Have to run some errands or want to do some laundry? No problem, let the team know in this channel, and when you are back working let them know again. No need to bother people, but if they want to know where you are, it should be in this channel.</li>
<li><strong>Task tracking</strong>. I won’t get into the whole Agile methodology here, but I want to emphasize that your team should break tasks out into small, manageable chunks. This should improve velocity and allow for quick wins. Also, surveys show that engineers are happier when they are <a href="https://insights.stackoverflow.com/survey/2017#work-how-are-job-satisfaction-and-committing-code-related">deploying code often</a>. <strong>As a general rule of thumb, try to make tasks that take no longer than 3 days to complete.</strong></li>
<li><strong>Schedule push</strong>. If something is going to take longer than expected, report this early on. Make it known in a public slack channel so there are no surprises and get someone to help.</li>
</ul>
<h2 id="3--batching-tasks">3. Batching Tasks</h2>
<p>Schedule meetings in chunks to allow for proper Deep Work time. Ever wonder why Hackathons exist? <strong>It’s because of the simple fact that people are more productive with uninterrupted time to focus on <a href="http://calnewport.com/books/deep-work/">their work</a>.</strong></p>
<p>There is a great article I read a while back that talks about the two types of people in a company: <a href="http://www.paulgraham.com/makersschedule.html">the managers and the makers</a>. The Managers have their whole day booked up, meeting after meeting, jumping from one thing to the next. Their work often requires scheduling, high level oversight, meeting with clients or other managers. If the work they are doing doesn’t require Deep Work, this is a fine thing.</p>
<p>The Makers on the other hand require time for Deep Work to truly be productive. <strong>Jumping from one task to the next is counterproductive as an engineer.</strong> This is part of the reason why many software developers report doing their best work late at night. The Managers should keep this in mind and try to batch weekly meetings all on one day, and keep other meetings during the week to a specific chunk of time, such as 9am - 12pm, and leave the afternoon for engineering work.</p>
<p>In my last job, I often railed back at the constant stream of meetings. I like to use the analogy of comparing software development to driving a manual transmission car.</p>
<p><strong>My best work is done when I’m in 6th gear</strong>. But to get to 6th gear I have to go through all of the gears: 1st, 2nd, 3rd, 4th, 5th, and then 6th. When I’m in 6th gear I am “in the zone” or “in flow”. If I get interrupted while I’m in any of the gears and have to switch tasks, I have to restart in 1st gear and work my way up again, similar to having to stop a car at a stop sign.</p>
<h2 id="4--innovative-and-open-culture">4. Innovative and Open Culture</h2>
<p>The best ideas can come from anywhere. They can come from the CEO, or from the Jr. Engineer that was just hired. Different people in the company have different perspectives on how things work, and may have an idea to improve or help grow the company. It’s important to encourage an environment of innovation, and one that allows everyone to put their ideas in a public space, without fear. Even if the idea will never be implemented, it’s important to hear ideas from all sides.</p>
<p>Some of my best work has come from this kind of innovative culture. Most companies have some sort of road map that is broken up into tasks, with some schedule on how to get there. <strong>The thing is, once engineering starts gaining a deeper understanding of how to solve the problem, there may be alternative approaches that drastically improve the product.</strong> This applies particularly in the technology industry which is changing constantly.</p>
<p>I remember when I first discovered Kibana for Elasticsearch. At the time the company had a lot of data that was sitting around in raw files, S3, or a SQL database. While doing some data analysis for the CEO, I stumbled upon the out-of-the-box visualization capabilities of <a href="https://www.elastic.co/products/kibana">Kibana</a>. I read up on it, started indexing the data to Elasticsearch, and created some basic visualizations with Kibana.</p>
<p>When I first told my boss about this, he was frustrated that I wasn’t working on my assigned tasks and that I was “wasting” my time playing with this new technology. However, as soon as I demonstrated this to my boss and relevant stakeholders they were blown away with the capabilities. This side project alone opened up an entire new capability to the company and our customers, and helped to propel the company’s technology into the future.</p>
<h2 id="call-to-action">Call to Action</h2>
<p>If you work in a remote culture, or even an office culture, pick one of these traits and see how your company culture stacks up. If the company is lacking in one of these areas, lead by example. Start sharing things in public, share your ideas, and encourage others to do the same. Another idea is to get the buy-in of the leadership team, point them to this article and convince them to give it a try for a quarter.</p>Teams working remotely is more common than ever. The combination of flexibility, high speed Internet, and the right tools make working remotely a viable option for many teams. It has given rise to a whole generation of people that not only can work remotely, they expect it.Adding a simple API to your Postgres database2017-07-17T00:00:00+00:002017-07-17T00:00:00+00:00https://jasonrhaas.github.io/2017/07/17/adding-a-simple-api-to-your-postgres-database<p>When designing systems or platforms, it is very common to use a relational database such as MySQL or Postgres as backend data storage. In order to access this data from a remote endpoint, it’s very handy to have an API that can serve out proper JSON data.</p>
<p>In this post I’m going to discuss one way to approach this problem. I’m a huge fan of simple, elegant approaches, and I think this fits the bill nicely.</p>
<p>I will be using some code that I wrote for the <a href="https://codefordc.org">CodeForDC</a> housing insights project.</p>
<h2 id="the-state-of-things">The state of things</h2>
<p>I volunteer some of my time to the <a href="https://github.com/codefordc/housing-insights">Housing-Insights</a> project to help out with the backend and API design and implementation. The current backend design at a high level consists of:</p>
<ul>
<li>Download open data</li>
<li>Parse and clean data</li>
<li>Add tables to Postgres database</li>
<li>Access data via custom Flask API endpoints</li>
</ul>
<p>Flask is a great Python framework for making dead simple APIs. It is my go to if I need a lightweight application to serve up some data. The syntax is as simple as this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">flask</span> <span class="kn">import</span> <span class="n">Flask</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">Flask</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>
<span class="nd">@app.route</span><span class="p">(</span><span class="s">"/"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">hello</span><span class="p">():</span>
<span class="k">return</span> <span class="s">"Hello World!"</span></code></pre></figure>
<p>That is all the code you need to get a simple endpoint up and running. If you want to create some API endpoints on your database, a simple approach is to use the Postgres <code class="highlighter-rouge">psycopg2</code> module for Python and run SQL queries as needed, then return the results.</p>
<p>And indeed, this works out pretty well. To get raw data from tables, you can do something like this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="nd">@application.route</span><span class="p">(</span><span class="s">'/api/raw/<table>'</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s">'GET'</span><span class="p">])</span>
<span class="nd">@cross_origin</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">list_all</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="s">""" Generate endpoint to list all data in the tables. """</span>
<span class="n">application</span><span class="o">.</span><span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s">'Table selected: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">table</span><span class="p">))</span>
<span class="k">if</span> <span class="n">table</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">tables</span><span class="p">:</span>
<span class="n">application</span><span class="o">.</span><span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s">'Error: Table does not exist.'</span><span class="p">)</span>
<span class="n">abort</span><span class="p">(</span><span class="mi">404</span><span class="p">)</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">engine</span><span class="o">.</span><span class="n">connect</span><span class="p">()</span>
<span class="n">q</span> <span class="o">=</span> <span class="s">'SELECT row_to_json({}) from {} limit 1000;'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">table</span><span class="p">)</span>
<span class="n">proxy</span> <span class="o">=</span> <span class="n">conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">proxy</span><span class="o">.</span><span class="n">fetchmany</span><span class="p">(</span><span class="mi">1000</span><span class="p">)]</span> <span class="c"># Only fetching 1000 for now, need to implement scrolling</span>
<span class="n">conn</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">return</span> <span class="n">jsonify</span><span class="p">(</span><span class="n">items</span><span class="o">=</span><span class="n">results</span><span class="p">)</span></code></pre></figure>
<p>Using the simple SQL statement</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'SELECT row_to_json({}) from {} limit 1000;'.format(table, table)
</code></pre></div></div>
<p>it will return 1000 rows from whatever table you select in the <code class="highlighter-rouge">table</code> variable. This value comes from the Flask route <code class="highlighter-rouge">'/api/raw/<table>', methods=['GET']</code>.</p>
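<p>The “scrolling” that the code comment says is still missing could be sketched with <code class="highlighter-rouge">LIMIT</code> and <code class="highlighter-rouge">OFFSET</code>. This is a standalone illustration using <code class="highlighter-rouge">sqlite3</code> in place of Postgres; the function name, the page parameters, and the whitelist contents are assumptions, and <code class="highlighter-rouge">ORDER BY rowid</code> is SQLite-specific (on Postgres you would order by a primary key).</p>

```python
import sqlite3
from typing import Any, Dict, List

# Assumed whitelist of tables that may be exposed through the API.
ALLOWED_TABLES = {"building_permits", "census"}


def fetch_page(
    conn: sqlite3.Connection, table: str, page: int = 1, per_page: int = 1000
) -> List[Dict[str, Any]]:
    """Return one page of rows from a whitelisted table as dictionaries."""
    if table not in ALLOWED_TABLES:
        raise KeyError(table)
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    offset = (page - 1) * per_page
    # Table names cannot be bound as parameters, hence the whitelist check;
    # LIMIT and OFFSET are bound normally.
    query = 'SELECT * FROM "{}" ORDER BY rowid LIMIT ? OFFSET ?'.format(table)
    cursor = conn.execute(query, (per_page, offset))
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (id INTEGER PRIMARY KEY, tract TEXT)")
conn.executemany(
    "INSERT INTO census (tract) VALUES (?)", [("tract-{}".format(i),) for i in range(5)]
)
page_two = fetch_page(conn, "census", page=2, per_page=2)
print(page_two)
```

<p>A Flask route could read <code class="highlighter-rouge">page</code> from the query string and pass it straight through to a helper like this.</p>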
<p>But, as Raymond Hettinger likes to say…</p>
<blockquote>
<p>there must be a better way</p>
</blockquote>
<p>And there is.</p>
<h2 id="flask-restless"><a href="https://flask-restless.readthedocs.io">Flask Restless</a></h2>
<p>Flask-Restless is a plugin for Flask that takes advantage of SQL Alchemy’s Object Relational Mappers to generate quick and easy endpoints. If you have defined your database schema using SQLA (recommended), there is quite a bit of functionality out of the box. Some examples:</p>
<ul>
<li>Endpoints can be generated from any model</li>
<li>Auto pagination</li>
<li>JSON based search</li>
<li>Pre-processors and post-processors</li>
</ul>
<p>So that’s a great way to start off accessing the database, and the pagination feature makes sure you don’t end up pulling too much data at once.</p>
<p>But, here is the problem: the database schema was not defined in SQLA. Sigh. But, there is a solution to that as well.</p>
<h2 id="sql-alchemy-automap"><a href="http://docs.sqlalchemy.org/en/latest/orm/extensions/automap.html">SQL Alchemy Automap</a></h2>
<p>SQLA includes a feature called <code class="highlighter-rouge">automap</code> that is able to “reflect” information about your database tables and automatically generate the models. Using this approach, you can now take advantage of the features that SQLA and Flask Restless have to offer.</p>
<p>The code is pretty simple:</p>
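<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">flask</span> <span class="kn">import</span> <span class="n">Flask</span>
<span class="kn">from</span> <span class="nn">flask_restless</span> <span class="kn">import</span> <span class="n">APIManager</span>
<span class="kn">from</span> <span class="nn">flask_sqlalchemy</span> <span class="kn">import</span> <span class="n">SQLAlchemy</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">MetaData</span>
<span class="kn">from</span> <span class="nn">sqlalchemy.ext.automap</span> <span class="kn">import</span> <span class="n">automap_base</span>
<span class="c"># connect_str is the database connection URI, defined elsewhere</span>
<span class="n">application</span> <span class="o">=</span> <span class="n">Flask</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>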
<span class="n">application</span><span class="o">.</span><span class="n">config</span><span class="p">[</span><span class="s">'SQLALCHEMY_DATABASE_URI'</span><span class="p">]</span> <span class="o">=</span> <span class="n">connect_str</span>
<span class="n">application</span><span class="o">.</span><span class="n">config</span><span class="p">[</span><span class="s">'SQLALCHEMY_TRACK_MODIFICATIONS'</span><span class="p">]</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">SQLAlchemy</span><span class="p">(</span><span class="n">application</span><span class="p">)</span>
<span class="n">Base</span> <span class="o">=</span> <span class="n">automap_base</span><span class="p">()</span>
<span class="n">metadata</span> <span class="o">=</span> <span class="n">MetaData</span><span class="p">(</span><span class="n">bind</span><span class="o">=</span><span class="n">db</span><span class="p">)</span>
<span class="n">Base</span><span class="o">.</span><span class="n">prepare</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">engine</span><span class="p">,</span> <span class="n">reflect</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">db</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
<span class="n">BuildingPermits</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">building_permits</span>
<span class="n">Census</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">census</span>
<span class="n">CensusMarginOfError</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">census_margin_of_error</span>
<span class="n">Crime</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">crime</span>
<span class="n">DcTax</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">dc_tax</span>
<span class="n">Project</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">project</span>
<span class="n">ReacScore</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">reac_score</span>
<span class="n">RealProperty</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">real_property</span>
<span class="n">Subsidy</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">subsidy</span>
<span class="n">Topa</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">topa</span>
<span class="n">WmataDist</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">wmata_dist</span>
<span class="n">WmataInfo</span> <span class="o">=</span> <span class="n">Base</span><span class="o">.</span><span class="n">classes</span><span class="o">.</span><span class="n">wmata_info</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">[</span><span class="n">BuildingPermits</span><span class="p">,</span> <span class="n">Census</span><span class="p">,</span> <span class="n">CensusMarginOfError</span><span class="p">,</span> <span class="n">Crime</span><span class="p">,</span> <span class="n">DcTax</span><span class="p">,</span> <span class="n">Project</span><span class="p">,</span> <span class="n">ReacScore</span><span class="p">,</span>
<span class="n">RealProperty</span><span class="p">,</span> <span class="n">Subsidy</span><span class="p">,</span> <span class="n">Topa</span><span class="p">,</span> <span class="n">WmataDist</span><span class="p">,</span> <span class="n">WmataInfo</span>
<span class="p">]</span>
<span class="n">db</span><span class="o">.</span><span class="n">init_app</span><span class="p">(</span><span class="n">application</span><span class="p">)</span>
<span class="n">manager</span> <span class="o">=</span> <span class="n">APIManager</span><span class="p">(</span><span class="n">application</span><span class="p">,</span> <span class="n">flask_sqlalchemy_db</span><span class="o">=</span><span class="n">db</span><span class="p">)</span>
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">models</span><span class="p">:</span>
    <span class="c"># https://github.com/jfinkels/flask-restless/pull/436</span>
    <span class="n">model</span><span class="o">.</span><span class="n">__tablename__</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">__table__</span><span class="o">.</span><span class="n">name</span>
    <span class="n">manager</span><span class="o">.</span><span class="n">create_api</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s">'GET'</span><span class="p">])</span>
<span class="nd">@application.route</span><span class="p">(</span><span class="s">'/'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">hello</span><span class="p">():</span>
    <span class="k">return</span><span class="p">(</span><span class="s">"The Housing Insights API Rules!"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
    <span class="n">application</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s">'0.0.0.0'</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">5000</span><span class="p">)</span></code></pre></figure>
<p>That chunk of code above is all the code you need to reflect your current tables into the SQLA model and serve up the API using Flask Restless. Pretty badass if you ask me. I’ll walk through it a little bit to describe what is going on.</p>
<p><code class="highlighter-rouge">db = SQLAlchemy(application)</code></p>
<p>is the Flask-SQLAlchemy wrapper around the basic SQLA engine creator. You pass it your Flask application object.</p>
<p><code class="highlighter-rouge">Base.prepare(db.engine, reflect=True)</code></p>
<p>here is where you tell SQLA automap which database to use, and that you want it to reflect your current database tables.</p>
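<p>If “reflecting” a database sounds like magic, the idea can be sketched with nothing but the standard library. The snippet below is not how SQLA automap is implemented; it is a simplified illustration that asks an in-memory SQLite database which tables and columns it has. The table names are borrowed from the code above, while the columns are made up for the example.</p>

```python
import sqlite3

# An in-memory database standing in for the real one (assumption:
# SQLite is used here only because it ships with Python).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crime (id INTEGER PRIMARY KEY, offense TEXT)")
conn.execute("CREATE TABLE topa (id INTEGER PRIMARY KEY, address TEXT)")


def reflect(connection):
    """Map each table name to the list of its column names."""
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        columns = connection.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


schema = reflect(conn)
print(schema)
```

<p>Automap does the same kind of discovery, then goes one step further and builds a mapped class for each table it finds.</p>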
<p><code class="highlighter-rouge">BuildingPermits = Base.classes.building_permits</code></p>
<p>here is where you pull the auto generated model out of <code class="highlighter-rouge">Base.classes</code> and assign it a model name.</p>
<p><code class="highlighter-rouge">manager = APIManager(application, flask_sqlalchemy_db=db)</code></p>
<p>creates the Flask Restless <code class="highlighter-rouge">APIManager</code>, which is what generates the API endpoints.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for model in models:
# https://github.com/jfinkels/flask-restless/pull/436
model.__tablename__ = model.__table__.name
manager.create_api(model, methods=['GET'])
</code></pre></div></div>
<p>For this chunk of code, we are creating a basic GET endpoint for each of the models defined above. The <code class="highlighter-rouge">model.__tablename__</code> assignment just before the <code class="highlighter-rouge">create_api</code> call is a workaround for an issue that will be fixed in version 1.0 of Flask Restless.</p>
<h2 id="conclusion">Conclusion</h2>
<p>If you need a quick and dirty API on top of your SQL database, look no further than Flask and Flask Restless. It’s a great way to get started. You may still have to define some custom endpoints, since Flask Restless doesn’t do everything, but using the ORM approach to create your endpoints probably leads to more maintainable and more clearly written code.</p>
<p><strong>My question to the devs out there is this</strong>: How does this approach compare to the Django ORM? I know Django has a lot of capability right out of the box, and has its own built in ORM, and Rest API plugin. I’m curious to compare it to the Flask/SQLA approach in terms of ease of use, flexibility, and overall capability.</p>When designing systems or platforms, it is very common to use a relational database such as MySQL or Postgres as a backend data storage. In order to access this data from a remote endpoint, it’s very handy to have an API that can serve out proper JSON data.Automate all the things2016-01-24T00:00:00+00:002016-01-24T00:00:00+00:00https://jasonrhaas.github.io/2016/01/24/automate-all-the-things<p>Building a computing infrastructure for your applications and big data stack is time consuming. Not only is it time consuming, but it’s very hard to plan for. Your needs today will likely not be your needs a year from now. This is especially the case if you are a growing technology company staying on the edge of the latest developments in the big data world. We all try to plan and think ahead for future needs, but this is often less than perfect.</p>
<p>In the past, system administrators and engineers typically built up their servers using a combination of techniques. Quite often this would involve customizing a particular server or image and then “cloning” it over to other servers. But this only works if the software on each needs to be the same. So inevitably there ends up being some kind of custom Bash script or post install script to customize the build on a server by server basis. I’ve seen some pretty fancy Bash and Perl scripts used, and while very powerful, they become a nightmare to maintain.</p>
<h1 id="ansible">Ansible</h1>
<p>Server provisioning software attempts to solve the code maintainability problem by introducing a framework and standards to manage your infrastructure. Some popular frameworks include Chef, Puppet, and Ansible. They all are a great way to manage your infrastructure, but Ansible stands out because it is agent-less and only requires <code class="highlighter-rouge">ssh</code> to provision your server. Also – it is written in Python, which I also like due to Python’s readability and hackability.</p>
<h2 id="dynamic-inventories-and-group_vars">Dynamic inventories and group_vars</h2>
<p>Ansible uses an <strong>inventory</strong> file to figure out where your servers are and what they are called. It also has server “groups”, so you can logically group your servers together. For a big data stack this might be <code class="highlighter-rouge">zookeeper-nodes</code>, <code class="highlighter-rouge">kafka-nodes</code>, <code class="highlighter-rouge">spark-worker-nodes</code>, etc. These groups are very powerful because they allow you to scale up or down your infrastructure simply by editing the inventory file. Want to add more resources to your Spark cluster? Just add it to the inventory and re-run the Ansible playbook.</p>
<p>In an Ansible playbook, the spark-worker-nodes group can be accessed by using the <code class="highlighter-rouge">{{ groups['spark-worker-nodes'] }}</code> variable. You can also access individual elements of the list by adding an index, like <code class="highlighter-rouge">{{ groups['spark-worker-nodes'][0] }}</code>.</p>
<h2 id="roles">Roles</h2>
<p>Ansible roles are standalone tasks that are meant to be performed for a single piece of infrastructure. Role names typically match up to server groups, but they don’t have to. All of these roles should be able to run <em>independently</em>. This concept is very powerful because you now can design your Ansible playbooks to accommodate almost any number of servers and configurations. A good practice to follow is to have the following folders under each role:</p>
<ul>
<li>defaults</li>
<li>handlers</li>
<li>meta</li>
<li>tasks</li>
<li>templates</li>
<li>vars</li>
</ul>
<p>If you don’t have a need for one of these folders, you don’t necessarily need to create it (git won’t even track it if there aren’t any files in it). Underneath each folder you should have a <strong>main.yml</strong> file. Why call it that? Because Ansible looks for it automatically. You don’t have to put all your code in <strong>main.yml</strong>. If you wish to break it up into logical parts (such as Debian and RedHat plays), you can <code class="highlighter-rouge">include:</code> them inside your <strong>main.yml</strong> file.</p>
<h3 id="defaults">Defaults</h3>
<p>Defaults are the default variable settings for a specific role. These settings have the <em>lowest</em> priority of all variables. They are, well, defaults, and can be overridden by re-defining the variable literally any other place in the Ansible code, and also on the command line using the <code class="highlighter-rouge">--extra-vars</code> flag. The most common way to override these defaults is with <strong>group_vars</strong>, which I will discuss later on. Some example settings that might be in a <strong>defaults/main.yml</strong> file:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>zookeeper_version: 3.4.6
zookeeper_client_port: 2181
zookeeper_install_dir: /opt/zookeeper
zookeeper_base_dir: "{{ zookeeper_install_dir }}/default"
zookeeper_conf_dir: "{{ zookeeper_base_dir }}/conf"
zookeeper_data_dir: "{{ zookeeper_base_dir }}/data"
zookeeper_log_dir: "{{ zookeeper_base_dir }}/logs"
</code></pre></div></div>
<p>Things like version numbers, port numbers, install directories are nice to put in the defaults section.</p>
<h3 id="handlers">Handlers</h3>
<p>Handlers are handy for doing things like restarting a process when a file changes. Just define your handlers in the <strong>main.yml</strong> and then use them in your main playbook under the <strong>tasks</strong> folder. For example, here is a simple handler to restart zookeeper (running under supervisord):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: restart zookeeper
  supervisorctl:
    name=zookeeper
    state=restarted
</code></pre></div></div>
<p>In your tasks playbooks, this handler can be used by adding a <code class="highlighter-rouge">notify: restart zookeeper</code> in one of the plays. For example,</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: setup zoo.cfg
  template:
    dest={{ zookeeper_conf_dir }}/zoo.cfg
    src=zoo.cfg.j2
  notify:
    - restart zookeeper
  tags: zookeeper
</code></pre></div></div>
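<p>The notify/handler relationship is easier to see in plain code. Here is a rough Python sketch of the semantics, not Ansible’s actual implementation: a handler is only queued when a task reports a change, and each queued handler runs once after the tasks, no matter how many tasks notified it. The second and third task names are hypothetical.</p>

```python
def run_play(tasks, handlers):
    """Run tasks in order, then run every notified handler exactly once."""
    notified = set()
    log = []
    for task in tasks:
        changed = task["action"]()
        log.append(f"task: {task['name']} (changed={changed})")
        if changed and "notify" in task:
            notified.add(task["notify"])
    # Handlers run after the tasks, in the order they were defined.
    for name, handler in handlers.items():
        if name in notified:
            handler()
            log.append(f"handler: {name}")
    return log


restarts = []
handlers = {"restart zookeeper": lambda: restarts.append("zookeeper")}
tasks = [
    {"name": "setup zoo.cfg", "action": lambda: True,
     "notify": "restart zookeeper"},
    # Hypothetical extra tasks, added only to show the de-duplication.
    {"name": "setup logging config", "action": lambda: True,
     "notify": "restart zookeeper"},
    {"name": "create data dir", "action": lambda: False,
     "notify": "restart zookeeper"},
]

log = run_play(tasks, handlers)
print("\n".join(log))
```

<p>Two tasks changed something and both notified the handler, but zookeeper is only restarted once.</p>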
<h3 id="meta">Meta</h3>
<p>The meta folder is meant to handle any dependencies that your role has. For example, zookeeper requires Java and, in our installation, supervisord for managing the process. So the meta file looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
dependencies:
- { role: supervisord, when: "supervisord_has_run is not defined" }
- { role: java, when: "java_has_run is not defined" }
</code></pre></div></div>
<p>Note that the <code class="highlighter-rouge">when: "java_has_run is not defined"</code> part is a sneaky trick that I’m using so that Ansible does not keep re-running the same role on a specific server if it has already been run. At the end of the Java role, I create an “Ansible fact” called <code class="highlighter-rouge">java_has_run</code> and set it to <code class="highlighter-rouge">true</code>. If that fact exists on the specific server, Java will not be run again on that machine. At the end of the Java role, I have this play:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: Set fact java_has_run
  set_fact:
    java_has_run: true
</code></pre></div></div>
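<p>Stripped of the YAML, the trick is just a run-once guard keyed on a per-host flag. A small Python sketch of the idea follows; the function and variable names are mine, not Ansible’s.</p>

```python
def apply_role(role, host_facts, applied):
    """Apply a role to a host unless its '<role>_has_run' fact is set."""
    flag = f"{role}_has_run"
    if flag in host_facts:
        # Mirrors: when: "<role>_has_run is not defined"
        return False
    applied.append(role)
    # Mirrors the set_fact play at the end of the role.
    host_facts[flag] = True
    return True


facts = {}     # facts for a single host
applied = []   # roles that actually did work

first = apply_role("java", facts, applied)   # pulled in by zookeeper
second = apply_role("java", facts, applied)  # pulled in again by another role
print(first, second, applied)
```

<p>The second request for the Java role is skipped because the fact is already defined on that host.</p>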
<h3 id="tasks">Tasks</h3>
<p>The tasks folder is where the actual procedure to install your software lives. Your <strong>tasks/main.yml</strong> file is where you can utilize any of the Ansible <a href="http://docs.ansible.com/ansible/modules_by_category.html">modules</a> and take advantage of all your variables, whether those are defined in <strong>defaults</strong>, <strong>group_vars</strong>, the <strong>inventory</strong>, or the command line. Here is a partial snippet of a zookeeper <strong>tasks/main.yml</strong> file:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: create zookeeper install directory
  file:
    path={{ item }}
    state=directory
    mode=0744
  with_items:
    - "{{ zookeeper_install_dir }}"
  tags: zookeeper
- name: check for existing install
  stat: path={{ zookeeper_install_dir }}/zookeeper-{{ zookeeper_version }}
  register: zookeeper
  tags: zookeeper
- name: download zookeeper
  get_url:
    url="{{ repository_infrastructure }}/zookeeper-{{ zookeeper_version }}.tar.gz"
    dest=/tmp/zookeeper-{{ zookeeper_version }}.tgz
    mode=0644
    validate_certs=no
  when: zookeeper.stat.isdir is not defined
  tags: zookeeper
- name: extract zookeeper
  unarchive:
    src=/tmp/zookeeper-{{ zookeeper_version }}.tgz
    dest={{ zookeeper_install_dir }}
    copy=no
  when: zookeeper.stat.isdir is not defined
  tags: zookeeper
</code></pre></div></div>
<p>Anything surrounded by <code class="highlighter-rouge">{{ }}</code> is an Ansible variable, and a variable can be defined in a number of ways. The first one that appears above is <code class="highlighter-rouge">{{ item }}</code>, a special Ansible variable that holds the current element when looping, much like the loop variable in a “for” loop. In the “create zookeeper install directory” task above, the loop is not really necessary since only one folder is created. However, if I wanted to add more I could just tack on more items in the <code class="highlighter-rouge">with_items</code> yaml list like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>with_items:
  - "{{ zookeeper_install_dir }}"
  - "{{ some_other_dir }}"
</code></pre></div></div>
<p>The other Ansible trick that is used in the tasks above is the <code class="highlighter-rouge">when:</code> conditional. In Ansible you can run plays only when the <code class="highlighter-rouge">when:</code> conditional meets some criteria. In the case above, the “download zookeeper” task is only run <code class="highlighter-rouge">when: zookeeper.stat.isdir is not defined</code>. The <code class="highlighter-rouge">zookeeper</code> variable is defined in the previous task and checks whether a directory already exists. Some other common ways to use the <code class="highlighter-rouge">when:</code> clause are:</p>
<ul>
<li>Running on different OS’s (Debian vs. Redhat)</li>
<li>Only run when a variable is <code class="highlighter-rouge">true</code></li>
<li>Only run when a variable is <code class="highlighter-rouge">defined</code></li>
</ul>
<p>Example of running specific Debian or Redhat plays:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- include: setup-RedHat.yml
  when: ansible_os_family == 'RedHat'
- include: setup-Debian.yml
  when: ansible_os_family == 'Debian'
</code></pre></div></div>
<p>In this case, there are separate playbooks for Debian and Redhat, and each one is only run on the appropriate OS. The same thing can be used for OS specific variables:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: Include OS-specific variables.
  include_vars: "{{ ansible_os_family }}.yml"
</code></pre></div></div>
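<p>The “check, then skip” pattern used by the download and extract tasks earlier is what makes a playbook safe to re-run. Here is a rough Python equivalent of that pattern, assuming a stand-in <code class="highlighter-rouge">fake_download</code> function in place of the real download and extract steps:</p>

```python
import os
import tempfile


def ensure_installed(install_dir, version, download):
    """Install only when the versioned directory is missing."""
    target = os.path.join(install_dir, f"zookeeper-{version}")
    if os.path.isdir(target):
        # Same role as: when: zookeeper.stat.isdir is not defined
        return "skipped"
    download(target)
    return "changed"


def fake_download(target):
    """Stand-in for the real download and extract steps (an assumption)."""
    os.makedirs(target)


with tempfile.TemporaryDirectory() as install_dir:
    first = ensure_installed(install_dir, "3.4.6", fake_download)
    second = ensure_installed(install_dir, "3.4.6", fake_download)

print(first, second)
```

<p>The first run does the work and the second run is a no-op, which is the behaviour you want from every task in a role.</p>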
<h3 id="templates">Templates</h3>
<p>Templates are files that typically end in a <strong>.j2</strong> extension and are used when you have a file that may need to change based on some variables you have defined in the Ansible code base. Templates are very handy to manage configuration settings for Linux software since almost all tools that run on Linux have some sort of configuration text file that can be customized. Here is a snippet from a Kafka <strong>server.properties.j2</strong> template file:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% for host in kafka_host_list %}
{%- if host == inventory_hostname -%}broker.id={{ loop.index }}{%- endif -%}
{% endfor %}
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
message.max.bytes={{ kafka_message_max }}
replica.fetch.max.bytes={{ kafka_replica_fetch_max_bytes }}
port={{ kafka_port }}
host.name={{ inventory_hostname }}
advertised.host.name={{ inventory_hostname }}
advertised.port={{ kafka_port }}
</code></pre></div></div>
<p>Notice the <code class="highlighter-rouge">{{ }}</code> variables that are used in the template file. There are a few “special” variables in here that deserve special attention. <code class="highlighter-rouge">inventory_hostname</code> is a reserved Ansible variable that maps to the hostname defined in the Ansible inventory file. It will match whatever host Ansible is currently running on.</p>
<p>The first chunk of code above is a fancy for loop that iterates through all elements of the <code class="highlighter-rouge">kafka_host_list</code> variable and sets the <code class="highlighter-rouge">broker.id</code> Kafka setting to the position of the current host in that list, counting from 1. Also note that the <code class="highlighter-rouge">kafka_host_list</code> variable has to be defined somewhere. In this code it is defined at the playbook level as:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kafka_host_list: "{{ groups['kafka-nodes'] }}"
</code></pre></div></div>
<p>The <code class="highlighter-rouge">groups['kafka-nodes']</code> list is another special Ansible variable that is used to grab all of the hosts in the <strong>kafka-nodes</strong> group inside the inventory file. So your inventory for Kafka might look like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[kafka-nodes]
prod-as-01
prod-as-02
prod-as-03
</code></pre></div></div>
<p>In this case <code class="highlighter-rouge">groups['kafka-nodes']</code> would contain all of those hostnames. You can access each one individually by using an index number, like this: <code class="highlighter-rouge">groups['kafka-nodes'][0]</code>.</p>
<p>Back to the for loop above, that code would set the <strong>prod-as-01</strong> host to <code class="highlighter-rouge">broker.id=1</code>, <strong>prod-as-02</strong> host to <code class="highlighter-rouge">broker.id=2</code>, and <strong>prod-as-03</strong> to <code class="highlighter-rouge">broker.id=3</code>.</p>
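<p>For anyone who finds the Jinja syntax hard to read, the same logic can be sketched in a few lines of Python. This is only an illustration: the tiny inventory parser is my own simplification, not Ansible’s parser, and it ignores everything except group headers and hostnames.</p>

```python
INVENTORY = """\
[kafka-nodes]
prod-as-01
prod-as-02
prod-as-03
"""


def parse_inventory(text):
    """Return a dict mapping each group name to its list of hosts."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups[current] = []
        elif current is not None:
            # Keep only the hostname; ignore any per-host variables.
            groups[current].append(line.split()[0])
    return groups


def broker_ids(hosts):
    """Assign broker ids the way loop.index does, counting from 1."""
    return {host: index for index, host in enumerate(hosts, start=1)}


groups = parse_inventory(INVENTORY)
ids = broker_ids(groups["kafka-nodes"])
print(groups["kafka-nodes"][0], ids)
```

<p>Because the ids come from list position, reordering hosts in the inventory would change their broker ids, so it is safest to only append new hosts to the end of the group.</p>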
<p>The rest of the Kafka template code above simply uses Ansible variables, defined mostly in the <strong>defaults/main.yml</strong> file, to populate the fields.</p>
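<p>Variable substitution itself is simple enough to imitate. The sketch below is a toy stand-in for Jinja2 (which does far more), written with a regular expression so it runs without any extra packages. The port value of 9092 is an assumed example value, not something taken from the role.</p>

```python
import re

TEMPLATE = """\
port={{ kafka_port }}
host.name={{ inventory_hostname }}
advertised.port={{ kafka_port }}
"""


def render(template, variables):
    """Replace each '{{ name }}' placeholder with its value."""
    def substitute(match):
        # A missing variable raises KeyError instead of rendering silently.
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


rendered = render(
    TEMPLATE,
    {"kafka_port": 9092, "inventory_hostname": "prod-as-01"},
)
print(rendered)
```

<p>Ansible renders the template once per host, so <code class="highlighter-rouge">inventory_hostname</code> takes a different value on each machine.</p>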
<h3 id="vars">Vars</h3>
<p>Variables are used everywhere in Ansible. For me, they were actually the most confusing part of Ansible at first. Here are (most of) the places variables can be set:</p>
<ul>
<li>role defaults</li>
<li>role vars</li>
<li>playbook role vars</li>
<li>inventory vars</li>
<li>host_vars</li>
<li>group_vars</li>
<li>command line vars</li>
</ul>
<p>The Ansible documentation has some good examples of how, when, and where to use variables, but I still think it is a bit confusing for someone new to Ansible.</p>
<p>General guidelines for using variables:</p>
<ul>
<li>Role defaults are lowest precedence</li>
<li>Role defaults are “meant” to be overridden</li>
<li>group_vars for site specific variables, API keys, accounts</li>
<li>host_vars for host specific variables</li>
<li><code class="highlighter-rouge">--extra-vars</code> for command line one-off playbook runs</li>
</ul>
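<p>One way to picture these guidelines is as a stack of dictionaries that is searched from the most specific layer to the least specific. The sketch below uses Python’s <code class="highlighter-rouge">ChainMap</code> and only four layers; real Ansible has many more precedence levels, and the override values here are made up.</p>

```python
from collections import ChainMap

# Lowest precedence: values from the role's defaults/main.yml.
role_defaults = {
    "zookeeper_client_port": 2181,
    "zookeeper_install_dir": "/opt/zookeeper",
}
# Hypothetical overrides, for illustration only.
group_vars = {"zookeeper_install_dir": "/srv/zookeeper"}
host_vars = {}
extra_vars = {"zookeeper_client_port": 2182}  # as if passed via --extra-vars

# ChainMap searches left to right, so the highest precedence goes first.
resolved = ChainMap(extra_vars, host_vars, group_vars, role_defaults)

print(resolved["zookeeper_client_port"])
print(resolved["zookeeper_install_dir"])
```

<p>The port comes from the command line and the install directory from group_vars; the role defaults are only used when nothing else sets the variable.</p>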
<h2 id="creating-your-site-playbook">Creating your site playbook</h2>
<p>I prefer to keep my Ansible code simple and manage as few .yml files as possible. To do this, I like to have all of my roles and plays in one or maybe two top level playbooks. Just pick and choose which roles you want and put them all in a <strong>site-infrastructure.yml</strong>, being sure to <code class="highlighter-rouge">tag</code> every play appropriately. Note that as of this writing, Ansible 2.0 reads tags dynamically, so if you want to use tags to control how plays get run (I highly recommend this), you need to put them in your top level playbook; otherwise Ansible will iterate through every single play in your code looking for the <code class="highlighter-rouge">--tags</code> value that you wanted to run.</p>
<h2 id="using-tags-and-limit">Using --tags and --limit</h2>
<p>When you want to run your top level playbook, you can run everything with <code class="highlighter-rouge">ansible-playbook -i production site-infrastructure.yml</code>, or narrow the run from the command line: <code class="highlighter-rouge">--tags</code> limits which plays get run, and <code class="highlighter-rouge">--limit</code> restricts which hosts they run against. For example, <code class="highlighter-rouge">ansible-playbook -i production site-infrastructure.yml --limit aws</code> or <code class="highlighter-rouge">ansible-playbook -i production site-infrastructure.yml --tags site-kafka</code>.</p>
<p>Remember that each play can have multiple tags. This allows you to pair things logically together. You might want to always run the zookeeper role when you run kafka, since kafka relies on zookeeper. In that case you might have:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: Run zookeeper role
  hosts: zookeeper-nodes
  vars:
    - zookeeper_host_list: "{{ groups['zookeeper-nodes'] }}"
  roles: [ zookeeper ]
  tags:
    - site-zookeeper
    - deps-kafka
- name: Run kafka role
  hosts: kafka-nodes
  vars:
    - kafka_host_list: "{{ groups['kafka-nodes'] }}"
    - zookeeper_host_list: "{{ groups['zookeeper-nodes'] }}"
  roles: [ kafka ]
  tags:
    - site-kafka
    - deps-kafka
</code></pre></div></div>
<p>This way if you run <code class="highlighter-rouge">ansible-playbook -i production site-infrastructure.yml --tags deps-kafka</code> it will run both zookeeper and kafka.</p>
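<p>Tag selection boils down to a set intersection: a play runs if it carries at least one of the requested tags. A short Python sketch of that rule, using the two plays above (the function name is mine):</p>

```python
PLAYS = [
    {"name": "Run zookeeper role", "tags": {"site-zookeeper", "deps-kafka"}},
    {"name": "Run kafka role", "tags": {"site-kafka", "deps-kafka"}},
]


def select_plays(plays, requested_tags):
    """Return the names of plays sharing at least one requested tag."""
    requested = set(requested_tags)
    return [play["name"] for play in plays if play["tags"] & requested]


print(select_plays(PLAYS, ["deps-kafka"]))
print(select_plays(PLAYS, ["site-kafka"]))
```

<p>A shared tag like <code class="highlighter-rouge">deps-kafka</code> gives you a way to run a service together with everything it depends on, while the <code class="highlighter-rouge">site-</code> tags still let you run a single play on its own.</p>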
<h1 id="dont-break-the-build">Don’t break the build</h1>
<p><img src="https://jasonrhaas.github.io/assets/brokebuild.jpg" alt="Alt Text" /></p>
<p>Since Ansible is a provisioning tool, you need an operating system to test on. Inevitably with Ansible you end up with a lot of little bugs to sort out while testing your code. <a href="http://devopsreactions.tumblr.com/post/135373866575/bug-fixed-should-be-ok-now-no-wait">This</a> tends to happen a lot when using Ansible. So, how do you sort through all those bugs? Well, you can’t go changing your local machine or any production machines without feeling the wrath of your local sysadmin. Vagrant to the rescue!</p>
<h1 id="vagrant">Vagrant</h1>
<p>Vagrant is a VM scripting tool that allows you to manage different configurations for as many virtual machines as you need. It supports Virtualbox and VMWare out of the box. I personally use Vagrant + Virtualbox because it’s free and works really well. As of version 1.8, Vagrant also supports VM snapshots, which is very nice for testing different setups and environments. I’ll walk through a simple Vagrant setup with two independent VMs, although this scales to create any number of VM’s that you wish.</p>
<h2 id="vagrantfile">Vagrantfile</h2>
<p>The script file that tells Vagrant which VM’s to set up and how to provision them is the <strong>Vagrantfile</strong>. The file is written in Ruby so it is programmable and is there to do your bidding. For scalable VM testing, I chose to have the Vagrantfile actually read from a <strong>vagrant_hosts</strong> file and parse it to figure out the VM name, IP, and type. For example, the <strong>vagrant_hosts</strong> file may look like:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>127.0.0.1 localhost
192.168.33.101 vagrant-as-01 vas01 ubuntu
192.168.33.102 vagrant-as-02 vas02 ubuntu
</code></pre></div></div>
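<p>The parsing itself is straightforward. The Vagrantfile does it in Ruby, but the same steps can be sketched in Python to show which field ends up where: the first column is the IP, the second is the hostname, and the last is the box type. The dictionary keys below are names I picked for the example.</p>

```python
VAGRANT_HOSTS = """\
127.0.0.1 localhost
192.168.33.101 vagrant-as-01 vas01 ubuntu
192.168.33.102 vagrant-as-02 vas02 ubuntu
"""


def parse_vagrant_hosts(text):
    """Return one dict per VM, skipping the leading localhost line."""
    vms = []
    for line in text.splitlines()[1:]:
        # Ignore commented-out hosts and blank lines.
        if "#" in line or not line.strip():
            continue
        fields = line.split()
        vms.append({
            "ip": fields[0],
            "hostname": fields[1],
            "box_type": fields[-1],
        })
    return vms


vms = parse_vagrant_hosts(VAGRANT_HOSTS)
for vm in vms:
    print(vm)
```

<p>Adding a third VM is then just a matter of adding a third line to the file.</p>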
<p>Another thing I do in my Vagrantfile is overwrite the <strong>/etc/hosts</strong> file with my <strong>vagrant_hosts</strong> file so that the VM’s know how to talk to each other on the network. Lastly, I copy over my ssh public key so that I can ssh into the VM’s using <code class="highlighter-rouge">ssh vagrant@vagrant-as-01</code>. Normally if you are just testing VM’s without provisioning with Ansible you could use the <code class="highlighter-rouge">vagrant ssh</code> command which uses a built in private key that comes with Vagrant. However, to use Ansible via your local console to provision Vagrant, you need to be able to <code class="highlighter-rouge">ssh</code> in, ideally without a password. These actions are accomplished by doing:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># Configuration applying to all VMs</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="ss">:shell</span><span class="p">,</span> <span class="ss">inline: </span><span class="s2">"cat /vagrant/vagrant_hosts > /etc/hosts"</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="ss">:shell</span><span class="p">,</span> <span class="ss">inline: </span><span class="s2">"cat /vagrant/id_rsa.pub >> /home/vagrant/.ssh/authorized_keys"</span></code></pre></figure>
<p>Note the comment that says these actions will be applied to all VM’s. If you want to do something to an individual VM, you have to break it out in another Ruby loop:</p>
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># Set up IP addresses and hostnames from 'hosts' file</span>
<span class="c1"># It assumes 'localhost' is on the first line</span>
<span class="n">hosts</span> <span class="o">=</span> <span class="no">File</span><span class="p">.</span><span class="nf">readlines</span><span class="p">(</span><span class="s1">'vagrant_hosts'</span><span class="p">)</span>
<span class="n">hosts</span><span class="p">[</span><span class="mi">1</span><span class="o">..-</span><span class="mi">1</span><span class="p">].</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">h</span><span class="o">|</span>
  <span class="k">unless</span> <span class="sr">/(#|^\s*$)/</span><span class="p">.</span><span class="nf">match</span><span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="c1"># ignore commented out hosts and blank lines</span>
    <span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">define</span> <span class="n">h</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sr">%r{</span><span class="se">\s</span><span class="sr">+}</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="k">do</span> <span class="o">|</span><span class="n">node</span><span class="o">|</span>
      <span class="k">if</span> <span class="n">h</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sr">%r{</span><span class="se">\s</span><span class="sr">+}</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'centos'</span>
        <span class="n">node</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">box</span> <span class="o">=</span> <span class="no">CENTOS_BOX</span>
      <span class="k">elsif</span> <span class="n">h</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sr">%r{</span><span class="se">\s</span><span class="sr">+}</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'ubuntu'</span>
        <span class="n">node</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">box</span> <span class="o">=</span> <span class="no">UBUNTU_BOX</span>
        <span class="n">node</span><span class="p">.</span><span class="nf">ssh</span><span class="p">.</span><span class="nf">shell</span> <span class="o">=</span> <span class="s2">"bash -c 'BASH_ENV=/etc/profile exec bash'"</span>
      <span class="k">end</span>
      <span class="n">node</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">hostname</span> <span class="o">=</span> <span class="n">h</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sr">%r{</span><span class="se">\s</span><span class="sr">+}</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
      <span class="n">node</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">network</span> <span class="s2">"private_network"</span><span class="p">,</span> <span class="ss">ip: </span><span class="n">h</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sr">%r{</span><span class="se">\s</span><span class="sr">+}</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
      <span class="n">node</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"shell"</span><span class="p">,</span> <span class="ss">inline: </span><span class="s2">"service supervisord restart || true"</span><span class="p">,</span> <span class="ss">run: </span><span class="s2">"always"</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span></code></pre></figure>
<p>To clarify a few spots in the code snippet above, <code class="highlighter-rouge">hosts[1..-1].each do |h|</code> sets up an <code class="highlighter-rouge">.each</code> loop that iterates from index 1 (not 0, since that is localhost) to the end of the file. To find out what type of VM each line describes, the loop checks whether the last field is “centos” or “ubuntu”. The line <code class="highlighter-rouge">node.ssh.shell = "bash -c 'BASH_ENV=/etc/profile exec bash'"</code> is a trick I discovered to resolve the infamous <a href="https://github.com/mitchellh/vagrant/issues/1673">stdin is not a tty</a> Vagrant bug when provisioning Ubuntu VMs.</p>
<p>The line <code class="highlighter-rouge">node.vm.provision "shell", inline: "service supervisord restart || true", run: "always"</code> is a workaround to restart the <strong>supervisord</strong> process every time the VM boots. I like to use supervisord to manage all my running applications since it’s a nice central place to check the status of all the custom software or applications I’ve installed.</p>
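<p>If Ruby is not your first language, the same parsing logic can be sketched in Python. This is only an illustration of what the loop does, not part of the Vagrantfile: the sample file contents and the <code class="highlighter-rouge">parse_hosts</code> helper are made up for the example, and the three-column layout (IP, hostname, OS) is inferred from the indexes the Ruby code uses.</p>

```python
import re

# Hypothetical vagrant_hosts contents; the real file is not shown in this post
SAMPLE_HOSTS = """\
127.0.0.1     localhost
192.168.33.10 node-01 centos
# 192.168.33.11 node-02 centos
192.168.33.12 node-03 ubuntu
"""


def parse_hosts(text):
    """Return one dict per active VM line, skipping the first (localhost) line."""
    vms = []
    for line in text.splitlines()[1:]:
        # Same filter as the Vagrantfile: skip lines containing '#' and blank lines
        if re.search(r"(#|^\s*$)", line):
            continue
        fields = line.split()
        vms.append({"ip": fields[0], "hostname": fields[1], "os": fields[-1]})
    return vms


vms = parse_hosts(SAMPLE_HOSTS)
for vm in vms:
    print(vm)
```

<p>Note that, like the Ruby regex, this skips any line containing a <code class="highlighter-rouge">#</code> anywhere, so trailing comments on a host line would disable that host.</p>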
<h2 id="testing-ansible-with-vagrant">Testing Ansible with Vagrant</h2>
<p>After you run <code class="highlighter-rouge">vagrant up</code>, your VMs should be pretty much good to go. You may also want to add the IP addresses and hostnames in your <strong>vagrant_hosts</strong> file to your machine’s hosts file so you can access the VMs via hostname rather than IP address. Make sure you can ssh into the machines as the <strong>vagrant</strong> user, and you are ready to start provisioning with Ansible!</p>
<p>When you run your Ansible code, be sure to run it like <code class="highlighter-rouge">ansible-playbook -i inventory site-infrastructure.yml -u vagrant</code>, since by default Ansible will try to connect using your current username, which does not exist on the Vagrant VM.</p>
<h2 id="automated-builds-with-travis-ci">Automated Builds with Travis CI</h2>
<p><img src="https://jasonrhaas.github.io/assets/travis.png" alt="Alt Text" /></p>
<p>Making a change to code and manually running tests gets old really fast. Not to mention it’s subject to human error. Automating the test process not only speeds up development in the long run, but will catch errors quickly and reliably (assuming your tests are good). This practice of “continuous integration” or “continuous delivery” can also be applied to the “infrastructure as code” approach.</p>
<p>The first thing to do is to run a <code class="highlighter-rouge">--syntax-check</code> on your code. This catches any trivial errors and will cause your build to fail very fast so you can fix the bug quickly. Next, you can actually provision the VM that Travis gives you to test with. For Ansible, I recommend breaking this up into different pieces using the <code class="highlighter-rouge">--tags</code> option so that you can take advantage of concurrency if your CI software supports it. Lastly, you can run some high level tests on your tools to make sure they are actually working as they should.</p>
<ul>
<li>For Elasticsearch, index a small document</li>
<li>For Kafka, write something to a topic</li>
<li>For Hadoop, make a file or run a MapReduce job</li>
<li>For HBase, write something to the database</li>
</ul>
<p>You get the idea.</p>
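<p>Before any of those service-specific checks, an even cheaper test is to confirm that each service is listening at all. Here is a minimal sketch of that idea using only the Python standard library. The <code class="highlighter-rouge">port_is_open</code> helper is hypothetical, and the demo opens a throwaway listener on localhost instead of assuming a real service is running; in a real build you would point it at your own hosts and ports.</p>

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo: open a throwaway listener so the check has something to find
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen(1)
demo_port = listener.getsockname()[1]

result_open = port_is_open("127.0.0.1", demo_port)
listener.close()
result_closed = port_is_open("127.0.0.1", demo_port)

print("while listening:", result_open)
print("after close:", result_closed)
```

<p>A port check only proves the process is up. It does not replace the deeper tests above, but it fails fast and makes the build log easier to read.</p>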
<p>Here is an example <strong>.travis.yml</strong> file that I’ve used to test Ansible code:</p>
<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">sudo</span><span class="pi">:</span> <span class="s">required</span>
<span class="na">dist</span><span class="pi">:</span> <span class="s">trusty</span>
<span class="na">addons</span><span class="pi">:</span>
  <span class="na">hosts</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">travis-trusty</span>
<span class="na">language</span><span class="pi">:</span> <span class="s">python</span>
<span class="na">python</span><span class="pi">:</span> <span class="s1">'</span><span class="s">2.7'</span>
<span class="na">before_install</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">sudo apt-get update -qq</span>
<span class="pi">-</span> <span class="s">sudo apt-get install -qq python-apt</span>
<span class="na">install</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">pip install ansible</span>
<span class="na">env</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">TAGS='site-common'</span>
<span class="pi">-</span> <span class="s">TAGS='site-zookeeper'</span>
<span class="pi">-</span> <span class="s">TAGS='deps-kafka'</span>
<span class="pi">-</span> <span class="s">TAGS='ELK'</span>
<span class="pi">-</span> <span class="s">TAGS='scrapy-services'</span>
<span class="pi">-</span> <span class="s">TAGS='scrapy-cluster'</span>
<span class="pi">-</span> <span class="s">TAGS='deps-storm'</span>
<span class="pi">-</span> <span class="s">TAGS='deps-hadoop'</span>
<span class="pi">-</span> <span class="s">TAGS='site-docker-engine'</span>
<span class="na">matrix</span><span class="pi">:</span>
  <span class="na">fast_finish</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">script</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">ansible-playbook -i testing site-infrastructure.yml --tags $TAGS --syntax-check</span>
<span class="pi">-</span> <span class="s">ansible-playbook -i testing site-infrastructure.yml --tags $TAGS --connection=local --become</span></code></pre></figure>
<p>Note that this requires creating a special <strong>testing</strong> inventory file that uses <strong>travis-trusty</strong> as the hostname for everything. Also, by combining the <code class="highlighter-rouge">TAGS</code> values in the Travis <code class="highlighter-rouge">env</code> list with Ansible’s <code class="highlighter-rouge">--tags</code> option, I can effectively run multiple Ansible builds concurrently, which should shorten the overall build time.</p>
<h1 id="conclusion">Conclusion</h1>
<p>If you have to manage more than one server, you should probably be using some sort of provisioning framework. Ansible is certainly a good choice, and is becoming increasingly popular relative to other tools such as Chef or Puppet. In fact, judging at least by the amount of attention on GitHub, Ansible is <a href="http://bit.ly/1QiTuLU">blowing away the competition</a>.</p>
<p>Using a combination of Vagrant VMs and CI tools like Travis is essential to making sure you don’t break the build. Vagrant is great for development and Travis is great for those one-line changes that “shouldn’t” break the build but should get tested anyway.</p>
<p>Inspiration for most of the examples and code snippets was taken from the <a href="https://github.com/istresearch/ansible-symphony">ansible-symphony</a> repository and most of the development for this code was done for <a href="http://istresearch.com">IST Research</a>. If you enjoyed this post and want to work with the latest in IT and big data technology, <code class="highlighter-rouge">python</code>, or Java, shoot me an email or get in touch with me on LinkedIn or Twitter!</p>
Building a computing infrastructure for your applications and big data stack is time consuming. Not only is it time consuming, but it’s very hard to plan for. Your needs today will likely not be your needs a year from now. This is especially the case if you are a growing technology company staying on the edge of the latest developments in the big data world. We all try to plan and think ahead for future needs, but this is often less than perfect.
Using pandoc
2015-12-12T00:00:00+00:00
https://jasonrhaas.github.io/2015/12/12/pandoc
<p>The résumé is outdated. Why are people still passing around MS Word documents? There are a few problems with this:</p>
<ul>
<li>Email attachments just suck to begin with.</li>
<li>As soon as a résumé is sent, it is out of date.</li>
<li>You have no control of what happens to the document once you send it.</li>
<li>There are likely multiple different versions of your résumé floating around the web in various states of correctness since they are out of date.</li>
</ul>
<p>This just leaves everyone confused as to what is the latest version of your résumé, and then the inevitable, “Can you send me an updated copy of your résumé by tonight?” question comes out, and you’re left scrambling to update it.</p>
<h2 id="linkedin">LinkedIn</h2>
<p>Keep your information on LinkedIn updated. LinkedIn can handle <em>most</em> of your résumé needs, but it still only allows for basic text entry (why no Markdown at least?), so people feel obligated to stick with MS Word. It has an “export to PDF” feature, but it leaves much to be desired.</p>
<p>I wish that LinkedIn would up their game and allow for more formatting and flexibility, but until then most people will be looking for another solution. For many tech companies and startups, a LinkedIn profile is sufficient, but for the old guard a physical résumé document is still the gold standard.</p>
<h2 id="markdown-and-pandoc-to-the-rescue">Markdown and Pandoc to the rescue!</h2>
<p>For those that want a physical résumé separate from LinkedIn, there is a solution to your MS Word woes, and its name is <code class="highlighter-rouge">pandoc</code>. Pandoc is a free document converter that supports all kinds of formats. On a Mac, you can install it with <code class="highlighter-rouge">brew install pandoc</code>.</p>
<p>Here’s the best part: you can maintain your résumé in Markdown and then have pandoc automatically generate the other formats for you! I converted my résumé to Markdown and made a little shell script that generates all the formats I need. Here is the <code class="highlighter-rouge">make.sh</code> script, which I run whenever my Markdown résumé is updated. It generates plain text, docx, and html files.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#!/bin/bash</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$# </span><span class="nt">-eq</span> 1 <span class="o">]</span><span class="p">;</span> <span class="k">then
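</span>  <span class="nv">name</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">1</span><span class="p">%.md</span><span class="k">}</span><span class="s2">"</span>  <span class="c"># strip only a trailing .md</span>
  pandoc <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="nt">-t</span> plain <span class="nt">-o</span> <span class="s2">"</span><span class="nv">$name</span><span class="s2">.txt"</span>
  pandoc <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="nt">-t</span> docx <span class="nt">-o</span> <span class="s2">"</span><span class="nv">$name</span><span class="s2">.docx"</span>
  pandoc <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="nt">-t</span> html5 <span class="nt">-o</span> <span class="s2">"</span><span class="nv">$name</span><span class="s2">.html"</span>
<span class="k">else
</span>  <span class="nb">echo</span> <span class="s2">"Usage : </span><span class="nv">$0</span><span class="s2"> YourResume.md"</span>
  <span class="nb">exit </span>1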
<span class="k">fi</span></code></pre></figure>
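<p>If you would rather drive pandoc from Python, the same three conversions can be sketched as below. This is an illustrative equivalent of the shell script, not something I actually use: the <code class="highlighter-rouge">pandoc_commands</code> helper and the <code class="highlighter-rouge">FORMATS</code> table are names I made up, and the sketch only builds the command lines so it runs even without pandoc installed.</p>

```python
from pathlib import Path

# Output formats from make.sh: pandoc writer name -> file extension
FORMATS = {"plain": ".txt", "docx": ".docx", "html5": ".html"}


def pandoc_commands(source):
    """Build one pandoc argument list per output format for a Markdown file."""
    src = Path(source)
    return [
        ["pandoc", str(src), "-t", writer, "-o", str(src.with_suffix(ext))]
        for writer, ext in FORMATS.items()
    ]


commands = pandoc_commands("YourResume.md")
for cmd in commands:
    # To actually convert, pass each list to subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```

<p>Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.</p>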
<p>As an example of what the outputs look like, here is my résumé in the original Markdown format and the generated .txt, .docx, and .html versions. I keep the source Markdown file under version control on GitHub and generate the other files with the <code class="highlighter-rouge">make.sh</code> script.</p>
<ul>
<li><a href="{filename}/files/JasonHaas_Resume.md">Resume Markdown</a></li>
<li><a href="{filename}/files/JasonHaas_Resume.txt">Resume Text</a></li>
<li><a href="{filename}/files/JasonHaas_Resume.html">Resume HTML</a></li>
<li><a href="{filename}/files/JasonHaas_Resume.docx">Resume Docx</a></li>
</ul>
Why I love open source
2015-11-30T00:00:00+00:00
https://jasonrhaas.github.io/2015/11/30/why-i-love-open-source
<p>There is a thriving open source community out there, just waiting to be tapped into. Have an idea to make an application better? The developers would love to hear it. Have time to code it yourself? Submit a pull request on GitHub.</p>
<p>The community is a welcoming one. Before I found open source, I was a systems engineer working on military systems. You’d be hard pressed to find a community of people that are willing to talk about systems engineering for military systems outside of the DC area. Even then, most people cannot share work they’ve done, or code they’ve written. It’s usually restricted in some capacity or can only be used and sold by The Company.</p>
<p>Part of the appeal of Open Source is that anything you write is free for you and anyone else to use, and can potentially help many other people solve problems in lots of different areas. It’s not just The Company that benefits, it’s <strong>everyone</strong>.</p>
<h2 id="why-has-open-source-become-so-popular">Why has Open Source become so popular?</h2>
<p>Open source has been around for a long time, and UNIX has been around since the 1960s. However – the big change that has happened in the last 5 years or so is that enterprises and businesses are switching to open source. At my last job we made systems that relied on commercial off-the-shelf (COTS) hardware and wrote custom Java code to control everything. Before I left, we started to use some hardware that actually had open source libraries to control it. We took that open source software and started to add our own customizations. Parts of what we were doing could be pushed back to the open source repository, but a lot of it was closed source. That was my first interaction with Git and GitHub.</p>
<p>So – <strong>GitHub</strong>. Since GitHub launched in 2008, it has transformed the Open Source universe. Today, the first step to developing a new software product is to check GitHub and see if someone has already written it for you. And here’s the best part - most of the time the developers are more than willing to help you with the software. The first time I hopped on IRC to ask a question about a Python package, I was amazed at how helpful the developers were.</p>
<h2 id="how-can-it-be-free">How can it be free?</h2>
<p>The dynamics of the open source software community are strange and unique. People are eager to give away their software, but then how do they make money? The simple answer is <em>services</em>. The business of selling software licenses is a dying one, and many large technology companies have been shifting to a service-based business model. One of the earliest companies to adopt this approach is Red Hat. Red Hat was founded in 1993 on the back of a new Linux distribution that they created called Red Hat Linux. It’s built on Linux, a Unix-like system that is already open source, so the OS itself is also open. They make money by providing services and support to mostly enterprise customers. Enterprises want to take advantage of the flexibility and stability that Linux and Open Source provide, but usually want some kind of security blanket knowing that they can get support if needed.</p>
<p>The other advantage of open sourcing your software is that you get <strong>community involvement</strong>. Just by putting it out there and making it free to use, there will be people finding bugs for you, submitting feature requests, and even improving your code or adding new features. It’s an unspoken agreement among developers that if you have benefited from another developer’s open source work and improved upon it, you should contribute back to it.</p>
<p>Open sourcing software can also build credibility within the developer community and get people to start using your software. It may even be a good way for companies to recruit new talent. Do you like working with our Deep Learning Code? Come work on it at Google and we’ll pay you to work on it.</p>
<h2 id="getting-involved">Getting involved</h2>
<p>I’ve read that only 1% of the population that uses Open Source software actually contributes to it. I have no idea if that’s true, but imagine if that number were 2%, 5%, or even 10%. I imagine we would see an even greater number of companies open sourcing their software to tap into the community. And more community involvement means greater diversity of ideas, which could ultimately lead to better software in the long run.</p>
<p>My advice for anyone looking to get involved is to start small. If you have a software package on GitHub you like to use, look at helping to improve the documentation. Docs are one of those things that are so important in getting new people to adopt the software, but developers tend to neglect them because they are focused on writing code. After adding or fixing documentation, look at existing issues on GitHub and see if you can tackle any of them. Fixing existing issues is always appreciated and is sure to bolster your Open Source karma in the community.</p>
My approach to design
2015-11-29T00:00:00+00:00
https://jasonrhaas.github.io/2015/11/29/my-approach-to-design
<p>If you are a Python programmer, or doing any technical design for that matter, I highly recommend checking out <em>The Zen of Python</em>. If you are on OS X or Linux, open up a terminal and type <code class="highlighter-rouge">python -c 'import this'</code>. You should see this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
</code></pre></div></div>
<p>If written well, Python almost reads like plain English, and it is clear to another programmer what is going on with the code fairly quickly. If I have to spend more than 10 minutes trying to figure out how a class method or function is working, it’s probably poorly written. My take on some of these points:</p>
<h2 id="beautiful-is-better-than-ugly">Beautiful is better than ugly</h2>
<p>I spend extra time making sure my code <em>looks</em> good. It’s not just about aesthetics, it’s about maintainability. One day, I or someone else is going to have to modify this code, and you don’t want them to have to waste time figuring out what the code is doing. It’s kind of like cleaning up your apartment before guests come over to visit. Some examples:</p>
<p>Bad</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">d</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">d</span><span class="p">[</span><span class="s">'some'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">d</span><span class="p">[</span><span class="s">'thing'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span></code></pre></figure>
<p>Better</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">d</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">some</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">thing</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span></code></pre></figure>
<p>Best</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">d</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'some'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
    <span class="s">'thing'</span><span class="p">:</span> <span class="mi">2</span>
<span class="p">}</span></code></pre></figure>
<p>One could argue that approach #2 is the simplest to type and takes the least amount of space in your text editor. Although it is convenient, I think #3 looks better, and it makes it easier to see how the dictionary would look after converting it to a JSON string.</p>
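<p>To make the JSON comparison concrete, here is a small sketch using the standard library’s <code class="highlighter-rouge">json</code> module. The <code class="highlighter-rouge">pretty</code> variable name is just for this example.</p>

```python
import json

d = {
    'some': 1,
    'thing': 2
}

# indent=4 pretty-prints one key per line, mirroring the layout of the literal above
pretty = json.dumps(d, indent=4)
print(pretty)
```

<p>Apart from the double quotes that JSON requires, the printed string has the same shape as the dictionary literal, which is why I find the multi-line form easiest to compare against API payloads.</p>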
<h2 id="explicit-is-better-than-implicit">Explicit is better than implicit</h2>
<p>This is one of the foundations of Python – if you are going to do something, it should be spelled out in the code. There shouldn’t be any magic going on that can be hard to track down if there are bugs. Python enforces being explicit in most cases, but design decisions that the programmer makes can influence how explicit the code really is. Some examples:</p>
<h3 id="inheritance-vs-composition">Inheritance vs. Composition</h3>
<p>Certainly there is a case for both inheritance and composition in object-oriented programming and Python in general. However, in terms of being <em>Explicit</em>, composition wins. Inheritance is very convenient – you inherit from a parent class and then all of a sudden you have some new magic methods to use! You can see them in the Python interpreter by running <code class="highlighter-rouge">dir(a)</code> on an instance of the child class. But to figure this out in your text editor, you will most likely need to hunt around in different places to find out where the inherited methods are coming from. This is annoying and not that <em>Explicit</em>. With composition, you are forced to be explicit. You will likely have to import specific classes using <code class="highlighter-rouge">from some_module import AwesomeClass</code>. From then on, any time something in the <code class="highlighter-rouge">AwesomeClass</code> namespace is used, the code shows exactly where it comes from, as in <code class="highlighter-rouge">AwesomeClass.more_awesome()</code>.</p>
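<p>Here is a minimal sketch of the difference. The <code class="highlighter-rouge">Engine</code>, <code class="highlighter-rouge">InheritedCar</code>, and <code class="highlighter-rouge">ComposedCar</code> classes are invented for illustration.</p>

```python
class Engine:
    def start(self):
        return "engine started"


class InheritedCar(Engine):
    """Inheritance: start() works, but nothing in this class says where it comes from."""


class ComposedCar:
    """Composition: the Engine is an explicit attribute, so every use names it."""

    def __init__(self):
        self.engine = Engine()

    def start(self):
        return self.engine.start()


# The inherited method shows up via dir() on an instance...
inherited_has_start = "start" in dir(InheritedCar())
# ...but it is not defined in the child class body itself
defined_on_child = "start" in vars(InheritedCar)
composed_result = ComposedCar().start()

print(inherited_has_start, defined_on_child, composed_result)
```

<p>Reading <code class="highlighter-rouge">ComposedCar</code> top to bottom tells you everything it can do. Reading <code class="highlighter-rouge">InheritedCar</code> tells you almost nothing until you open <code class="highlighter-rouge">Engine</code>.</p>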
<h3 id="using-args-and-kwargs">Using *args and **kwargs</h3>
<p>This is another one that definitely has its uses, but I prefer to stay away from it unless absolutely necessary (decorator functions, inheritance) due to its ambiguity. <code class="highlighter-rouge">*args</code> and <code class="highlighter-rouge">**kwargs</code> allow the user of a function to pass an arbitrary number of arguments into your function. Since the function does not enforce any arguments, it needs to handle all the cases where random arguments could be passed in. This could require a bunch of code that gets messy and may be hard to maintain. If many arguments need to be passed in, it is better to use a <code class="highlighter-rouge">list</code> or a <code class="highlighter-rouge">dict</code> and explicitly describe that in the function docstring.</p>
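<p>A short sketch of the kind of ambiguity I mean. Both functions and the misspelled <code class="highlighter-rouge">prot</code> argument are made up for the example.</p>

```python
def connect_loose(**kwargs):
    """Accepts anything, so a misspelled key silently falls back to the default."""
    host = kwargs.get("host", "localhost")
    port = kwargs.get("port", 5432)
    return host, port


def connect_explicit(host="localhost", port=5432):
    """The signature documents and enforces what the function accepts."""
    return host, port


# The typo 'prot' goes unnoticed and the default port is used
loose_result = connect_loose(prot=6543)

# The explicit signature rejects the same typo immediately
error = None
try:
    connect_explicit(prot=6543)
except TypeError as err:
    error = err

print(loose_result)
print("rejected:", error)
```

<p>With the explicit signature, the mistake surfaces at the call site instead of showing up later as a connection to the wrong port.</p>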
<h2 id="the-art-of-unix-programming">The Art of Unix Programming</h2>
<p><em>The Art of Unix Programming</em> is another great resource for providing some guidelines on good UNIX and programming practices. These guidelines can benefit any programmer or hacker. Especially if someone is coming from a Windows or strictly Java background, this could be particularly useful. Some of my favorite paradigms are:</p>
<ul>
<li>This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.</li>
<li>Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.</li>
<li>Programmer time is expensive; conserve it in preference to machine time.</li>
<li>Avoid hand-hacking; write programs to write programs when you can.</li>
</ul>
<p>Bear in mind that a lot of this is coming out of a circa 1978 time period. These ideas are especially relevant today – I guess there’s a reason UNIX has been around so long.</p>
<h1 id="closing-thoughts">Closing thoughts</h1>
<p>The next time someone comes to you with an idea or a technical solution, consider these thoughts from the <em>Zen of Python</em>.</p>
<ul>
<li>If the implementation is hard to explain, it’s a bad idea.</li>
<li>If the implementation is easy to explain, it may be a good idea.</li>
</ul>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="http://docs.python-guide.org/en/latest/">The Hitchhiker’s Guide to Python</a></li>
<li><a href="http://www.catb.org/esr/writings/taoup/html/">The Art of Unix Programming</a></li>
</ul>
Kafka
2015-11-27T00:00:00+00:00
https://jasonrhaas.github.io/2015/11/27/kafka
<p>There are a few decent resources out there for learning Kafka, but really it comes down to the <a href="http://kafka.apache.org/documentation.html">Apache Documentation</a> and Michael Noll’s <a href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/">publications</a>. While these are both excellent, I still think there could be better information out there to help developers get started. Hopefully this post can help.</p>
<h3 id="why-use-apache-kafka">Why use Apache Kafka?</h3>
<p>There are many use cases, and some of those are discussed in Kafka’s documentation. The benefits of Kafka are many: scalability, speed, durability. That’s all great, but here’s my biggest reason for using it: <em>it serves as a central data bus for all streaming data</em>. This is especially important when you may not know in advance who will be <strong>producing</strong> data, and who will be <strong>consuming</strong> that data.</p>
<h3 id="kafka-basics">Kafka Basics</h3>
<p>Kafka is nothing more than a streaming log system. Think of it as <code class="highlighter-rouge">tail -f</code> in UNIX speak. In Linux, if there is a process that is producing some log output, it is very common to run <code class="highlighter-rouge">tail -f &lt;filename&gt;</code> on the file to track the log file updates as they happen. A Kafka <strong>topic</strong> is exactly that: it’s just a log file that lives in the Kafka <strong>broker</strong> ecosystem. The big difference is that instead of tailing a single file on a single server, you can <strong>consume</strong> from a <strong>topic</strong> from anywhere that has access to Kafka. That <strong>topic</strong> could also have multiple <strong>producers</strong> writing to it from many different places. For example:</p>
<h4 id="distributed-application">Distributed Application</h4>
<p>You have a distributed application that lives across more than one server. This application has some output. Where should that output be written to? The usual choices are to stdout, a flat file, or a database. But what if you don’t have a database set up or don’t know which one to use at first? What if you need to do some additional processing on this distributed data before sending to a database? You could write all the data out to flat files, do some processing on it, and then ingest the data into the database. But then you have to worry about managing all the data between the servers.</p>
<p>Enter Kafka. What if, instead of each server node writing out its data to a different place, they all wrote their data to a common Kafka <strong>topic</strong>? This is the power of Kafka. In Kafka speak, a client that writes data to a topic is a <strong>producer</strong>. Now – if someone wants to read that data stream, they have one single place to go. The client reading the data is called a <strong>consumer</strong>.</p>
<h3 id="inside-the-kafka-topic">Inside the Kafka Topic</h3>
<p>Two other built in features of Kafka are <em>parallelism</em> and <em>redundancy</em>. Kafka handles this by giving each topic a certain number of <strong>partitions</strong> and <strong>replicas</strong>.</p>
<p><strong>Partitions</strong>: A single piece of a Kafka topic. The number of partitions is configurable on a per-topic basis. More partitions allow for greater parallelism when reading from the topic. The number of partitions caps how many consumers in a <strong>consumer-group</strong> can read at the same time. For example, if a topic has 3 partitions, you can have 3 consumers in a <strong>consumer-group</strong> balancing consumption between the partitions, which gives you a parallelism of 3. The right partition count is somewhat hard to determine until you know how fast you are producing data vs. how fast you are consuming it. If you have a topic that you know will be high volume, I would err on the side of more partitions. This also allows room for growth. Aim for between 10 and 50 partitions to start.</p>
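<p>To see why the partition count caps parallelism, here is a toy model of spreading partitions over the consumers in one group. This is not Kafka’s actual assignment code or any client API, just a round-robin sketch with a made-up <code class="highlighter-rouge">assign_partitions</code> helper.</p>

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


balanced = assign_partitions(3, ["c1", "c2", "c3"])
oversized = assign_partitions(3, ["c1", "c2", "c3", "c4"])

print(balanced)
print(oversized)  # c4 ends up with no partitions to read
```

<p>With 3 partitions, a fourth consumer in the same group has nothing to read, which is the reason to create more partitions than you need today.</p>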
<p><strong>Replicas</strong>: These are copies of the partitions. Clients never write to or read from them directly. Their only purpose is data redundancy. If your topic has <code class="highlighter-rouge">n</code> replicas, <code class="highlighter-rouge">n-1</code> brokers can fail before there is any data loss. Additionally, you cannot have a topic with a replication factor greater than the number of brokers that you have. For example, if you have 5 Kafka brokers, you could have a topic with a maximum replication factor of 5, and 5-1=<strong>4</strong> brokers could go down before there is any data loss.</p>
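<p>The arithmetic is simple enough to write down as a sanity check. Both helper functions below are hypothetical names for this post, not part of any Kafka tooling.</p>

```python
def tolerated_failures(replication_factor):
    """Brokers that can be lost before a partition has no surviving copy."""
    return replication_factor - 1


def check_replication_factor(replication_factor, num_brokers):
    """Raise if the topic asks for more copies than there are brokers to hold them."""
    if replication_factor > num_brokers:
        raise ValueError(
            f"replication factor {replication_factor} exceeds {num_brokers} brokers"
        )
    return replication_factor


factor = check_replication_factor(5, num_brokers=5)
print(tolerated_failures(factor))

# Asking for 6 copies on 5 brokers is rejected
rejected = None
try:
    check_replication_factor(6, num_brokers=5)
except ValueError as err:
    rejected = err
print("rejected:", rejected)
```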
<p><strong>Offsets</strong>: An “offset” is just a pointer to a location in the logfile or “topic”. Each client or “consumer” has their own “consumer-group” that is used to track the offset where they are in the topic. The actual offset values are stored in a special Kafka topic called “__consumer_offsets”. Why is it called a “consumer-group” and not just a “consumer”? This is because Kafka supports balanced consuming – meaning that you can have more than one consumer reading from a topic in a round-robin fashion to increase parallelism.</p>
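<p>Here is a toy model of the offset idea, assuming a single partition held in memory. The <code class="highlighter-rouge">ToyTopic</code> class and the group names are hypothetical and only meant to show that each consumer-group keeps its own position in the same log; it is not a Kafka client.</p>

```python
class ToyTopic:
    """In-memory stand-in for a single-partition topic (not a Kafka client)."""

    def __init__(self):
        self.log = []            # append-only list of messages
        self.group_offsets = {}  # consumer-group name -> next offset to read

    def produce(self, message):
        self.log.append(message)

    def consume(self, group):
        """Return the next unread message for this group, or None at the end."""
        offset = self.group_offsets.get(group, 0)
        if offset >= len(self.log):
            return None
        self.group_offsets[group] = offset + 1
        return self.log[offset]


topic = ToyTopic()
for event in ["a", "b", "c"]:
    topic.produce(event)

first = topic.consume("analytics")   # "a"
second = topic.consume("analytics")  # "b"
other = topic.consume("archiver")    # "a", this group starts from its own offset
```

<p>Both groups read the same data, but reading in one group never moves the offset of the other.</p>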
<p><strong>Leaders and In Sync Replicas (ISRs)</strong>: Once your topic has been created, you can run Kafka’s built-in tool <code class="highlighter-rouge">./kafka-topics.sh --describe -z &lt;zookeeper-node&gt;:2181</code> to describe the topics on your Kafka cluster. You might see something like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Topic: test.cleaned_firehose PartitionCount:3 ReplicationFactor:3 Configs:
Topic: test.cleaned_firehose Partition: 0 Leader: 4 Replicas: 4,5,1 Isr: 1,4,5
Topic: test.cleaned_firehose Partition: 1 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5
Topic: test.cleaned_firehose Partition: 2 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
</code></pre></div></div>
<p>Each partition has a broker leader, and the replicas simply “follow” the leader and duplicate the data. If a broker that is a leader goes down, Kafka will automatically elect a new broker leader by default. Note that if you have consumers consuming from a topic that temporarily loses its leader, they may need to reconnect to fetch the new metadata from the cluster.</p>
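<p>If you want to keep an eye on this from a script, one option is to parse the describe output and flag partitions whose ISR list is shorter than the replica list. The sketch below assumes the line format shown above; the function names are my own, and the sample text is the output above with broker 1 removed from the ISR of partition 0 to simulate a lagging replica.</p>

```python
import re

PARTITION_LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def parse_partitions(describe_output):
    """Parse the per-partition lines of `kafka-topics.sh --describe` output."""
    partitions = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # skip the topic summary line and anything unexpected
        partition, leader, replicas, isr = match.groups()
        partitions.append({
            "partition": int(partition),
            "leader": int(leader),
            "replicas": {int(b) for b in replicas.split(",")},
            "isr": {int(b) for b in isr.split(",") if b},
        })
    return partitions


def under_replicated(partitions):
    """Return partition ids whose ISR is missing at least one replica."""
    return [p["partition"] for p in partitions if p["isr"] != p["replicas"]]


# Modified copy of the output above: broker 1 has dropped out of partition 0's ISR.
sample = "\n".join([
    "Topic: test.cleaned_firehose PartitionCount:3 ReplicationFactor:3 Configs:",
    "Topic: test.cleaned_firehose Partition: 0 Leader: 4 Replicas: 4,5,1 Isr: 4,5",
    "Topic: test.cleaned_firehose Partition: 1 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5",
    "Topic: test.cleaned_firehose Partition: 2 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3",
])
parsed = parse_partitions(sample)
print(under_replicated(parsed))
```

<p>Comparing the two broker lists as sets matters here, because the tool does not print the ISR in the same order as the replica list.</p>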
<h3 id="common-problems">Common problems</h3>
<p>The biggest problem I’ve encountered is with brokers randomly going down and then becoming unavailable for leader election. I haven’t gotten to the bottom of this issue but I’m hopeful that some of this stuff has been fixed in 0.9. Rebooting the Kafka broker fixes this problem most of the time.</p>
<p>Another common problem I encounter when using Kafka is that a broker goes down, Kafka elects a new leader, and the consumer doesn’t get the message that there is a new leader in town. This results in the dreaded <code class="highlighter-rouge">NotLeaderForPartition</code> errors. This can be solved by updating the metadata for the Kafka consumer. In the case of a Python client, it appears that neither <code class="highlighter-rouge">kafka-python</code> nor <code class="highlighter-rouge">pykafka</code> can handle this situation. Therefore, the error needs to be caught, and the consumer needs to be re-created.</p>
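<p>The catch-and-re-create pattern can be sketched without tying it to a specific client library. Everything named here is an assumption for illustration: <code class="highlighter-rouge">NotLeaderError</code> stands in for whatever exception your client raises, and <code class="highlighter-rouge">make_consumer</code> is a factory you would write around your real consumer.</p>

```python
import time


class NotLeaderError(Exception):
    """Hypothetical stand-in for the client library's NotLeaderForPartition error."""


def consume_with_recreate(make_consumer, handle, max_recreates=3, backoff_seconds=1.0):
    """Consume until the stream ends, rebuilding the consumer on leader errors.

    make_consumer: zero-argument factory returning an iterable of messages.
    handle: callback invoked once per message.
    Returns the number of times the consumer had to be re-created.
    """
    recreates = 0
    while True:
        consumer = make_consumer()  # a fresh consumer fetches fresh metadata
        try:
            for message in consumer:
                handle(message)
            return recreates  # stream finished without a leader error
        except NotLeaderError:
            recreates += 1
            if recreates > max_recreates:
                raise  # give up instead of looping forever
            time.sleep(backoff_seconds)  # give the cluster time to elect a leader
```

<p>In a real deployment you would swap <code class="highlighter-rouge">NotLeaderError</code> for the exception your library actually raises, and rely on committed offsets so that the new consumer picks up where the old one stopped.</p>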
<h3 id="tips-and-tricks">Tips and Tricks</h3>
<p>Check out <code class="highlighter-rouge">kafkacat</code> on GitHub for a nice non-JVM CLI tool for inspecting Kafka topics and for consuming from or producing to them.</p>
<h3 id="closing-thoughts">Closing thoughts</h3>
<p>Kafka is a great tool – but it is still in development, APIs are in flux, and new features are still being added. As of this writing Kafka 0.9.0 has just come out, which introduces a new consumer API (although the old one is still supported) and a security protocol. Before 0.9.0 you had to control access to Kafka via a whitelist, VPN, or firewall.</p>
<p>One of the <em>most</em> lacking areas of Kafka is any kind of built-in monitoring or “health status” support. When things go wrong, it’s very hard to figure out the root cause, and Kafka will often still be “running” while ERROR messages spew out of the logs. Some kind of built-in status check API would be <em>very</em> useful for monitoring the tool and figuring out what’s going on. There are some OK open source solutions out there for monitoring consumer lag, offsets, and broker status, but they aren’t sufficient to solve this problem.</p>There are a few decent resources out there for learning Kafka, but really it comes down to the Apache Documentation and Michael Knoll’s publications. While these are both excellent, I still think there could be better information out there to help developers get started. Hopefully this post can help.Mac dev tips2015-09-06T00:00:00+00:002015-09-06T00:00:00+00:00https://jasonrhaas.github.io/2015/09/06/mac-dev-tips<p>I’ve been doing software development on a MacBook Pro for a little while now, and I gotta say there are a TON of great free packages and tools that make development that much more enjoyable. I’m not going to get into a Windows/Mac/Linux debate here, let’s just say Mac OS X wins, with Linux a close second. All of the production code that I run runs on Linux, and most of it runs natively on my Mac as well. That, combined with all the other nice features of the Mac, makes it unmatched for software development.</p>
<p>OK so I want to talk about some tools and nifty tricks that I use on a fairly regular basis.</p>
<h2 id="sublime-text-3">Sublime Text 3</h2>
<p>If you do most of your development on a Mac already, you probably know about Sublime Text. It’s a lightweight and <em>fast</em> editor with a <em>ton</em> of free plugins for just about everything you can imagine. I do most of my work in <code class="highlighter-rouge">python</code> and use <code class="highlighter-rouge">git</code> for version control, so this list will be a little skewed towards those technologies.</p>
<h3 id="color-sublime">Color Sublime</h3>
<p><a href="http://colorsublime.com">Colorsublime</a> has about a billion different built-in color themes, and you can actually preview them right in Sublime before even installing them by using the Sublime Text 3 command palette.</p>
<h3 id="git-gutter">Git Gutter</h3>
<p><a href="https://github.com/jisaacks/GitGutter">Git Gutter</a> is another fantastic plugin that compares your working copy of a file to the version in the git index.</p>
<p>This is similar to doing <code class="highlighter-rouge">git diff</code> on the command line. By default it compares against <strong>HEAD</strong> but this can be changed to compare against specific branches, tags, or commits.</p>
<h3 id="sublime-linter">Sublime Linter</h3>
<p><a href="https://github.com/SublimeLinter/SublimeLinter3">Sublime Linter</a> is a <em>framework</em> for using code linters in <strong>Sublime Text</strong>. Any linters you wish to use need to be installed separately. For <code class="highlighter-rouge">python</code>, there are many linters but I recommend using <strong>pyflakes</strong> at a minimum. It’s also a good idea to use the <strong>pep8</strong> linter to make sure you are following the PEP8 Python standards.</p>
<p>ProTip: I recommend changing the settings for <strong>Sublime Linter</strong> to manual mode. By default it lints every file you have open <em>in real time</em>, which I’ve found can cause Sublime to hiccup and lag – very annoying. To change the settings –</p>
<p>Open up the command palette, and select <strong>SublimeLinter: Choose Lint Mode –> Manual</strong></p>
<h3 id="restructuredtext-improved">ReStructuredText Improved</h3>
<p><a href="https://packagecontrol.io/packages/RestructuredText%20Improved">ReStructuredText Improved</a> is a nice plugin that does syntax highlighting of your ReStructuredText. It integrates very nicely into Sublime and is very unobtrusive, unlike some of the <strong>Markdown</strong> plugins for Sublime that I have seen.</p>
<h3 id="honorable-mentions">Honorable mentions</h3>
<p>Some other plugins I use…</p>
<ul>
<li>Bracket Highlighter</li>
<li>Sidebar (sidebar enhancements)</li>
<li>PyDOC (links to python documentation by right clicking on code)</li>
</ul>
<h2 id="flycut">FlyCut</h2>
<p>FlyCut is a great little piece of software that keeps a copy-paste buffer within easy reach. By default, just hit <code class="highlighter-rouge">shift+command+v</code> to pull up the dialog. This is such a simple thing but it saves an <em>immense</em> amount of time. You can download it on the Mac App Store.</p>
<h2 id="caffeine">Caffeine</h2>
<p>If you ever get annoyed at your computer dimming its screen and going to sleep when you want the screen to stay on, this little app is for you. Again – a really simple piece of software that really improves Mac usage. Basically you just click the <strong>coffee</strong> button when you want your Mac to stay awake.</p>
<h2 id="spectacle">Spectacle</h2>
<p>Another simple, extremely useful piece of software is <strong>Spectacle</strong>. This application lets you easily place and re-size your windows with a bunch of keyboard shortcuts. Actually – this is such a good idea that Apple has decided to incorporate something very similar into their OS X El Capitan release coming later in the fall.</p>
<h2 id="flux">f.lux</h2>
<p>If you work late like I do, that blueish screen can be pretty harsh on the eyes. Check out f.lux – it gradually makes the screen go redder as the night wears on. It is easier on the eyes and also helps you get to sleep faster after a long night of coding.</p>
<h2 id="sourcetree">SourceTree</h2>
<p>This is a GUI interface for <code class="highlighter-rouge">git</code>, made by Atlassian. Now – I know what you’re thinking – “it’s not command line! CLI is way more powerful!” Yes – that is true, and I use the command line for <code class="highlighter-rouge">git</code> most of the time. However, if you have a bunch of changes and really want to do diffs on what changed, and potentially break the chunks into smaller commits, SourceTree beats the CLI. I know this can be done via <code class="highlighter-rouge">git add -p</code>, but the GUI interface is just better for this.</p>