Sunday, December 21, 2014

PyTutorial: Working on Larger Software Projects

Working on "larger" software projects entails problems not normally encountered in smaller ones.
But what do I mean by "larger", and what are these problems, and how are they typically overcome?

Larger projects can involve:

  • More than one developer (perhaps dozens?)
  • Lots of complicated code (think hundreds if not thousands of modules.)
  • Long development period (think years)
  • Actual customers (think bugs, priorities, schedules, etc.)
Well, good luck! :)

Thankfully there are a few tools, approaches, and strategies that help deal with these problems...

Integrated Development Environment (IDE)

Dealing with lots of code is hard.  "What was that function called?"  "What were its parameters?" blah blah blah.  Just moving around the code can be a pain, and a cute little editor like IDLE or Notepad just doesn't cut it.

Here come the IDEs to the rescue!  IDEs do many things to many people, but their main, most critical functionality is helping you navigate tons of code.  Basically, instead of opening a file, they open a project, and let you do fun things such as: opening files within the project, searching files, looking up function / class definitions, renaming functions (refactoring), debugging (pausing a program while it's running and snooping around), etc. etc. etc.  As you can imagine, there are LOTS of IDEs out there, some of them free, others not so free.

One of the most popular and powerful free IDEs is Eclipse.  Although initially designed for Java, it can handle other languages, including Python, via extensions.  The Python extension is PyDev.

All IDEs have a learning curve, and make some things easier than others.  I, personally, fell for PyCharm.  It's free for work on open-source projects, and they have various discounts as well.  I don't mind coughing up a little cash for a wonderful tool.

But everyone has an opinion about their favorite / most hated IDEs.  I really don't care; they're all good, as far as I'm concerned (at least much better than no IDE.)  So, look around, and find one that you like.

Source Control

A.k.a. "Version Control", "Source Control Management" (SCM), etc. etc. etc.

Large software development projects are iterative.  Add a little feature here, fix a little bug there.  They are also accomplished by quite a few people, each adding some features, and fixing some bugs.  But there needs to be a way to coordinate all of this code-writing.  It's also useful to be able to see the history of some code.  Who changed what, when, and why.  As you have probably guessed, this is where source control comes to the rescue.

The most popular SCM is Git.  It's open-source, and does some incredibly hard things very well.  Another popular free SCM is SVN.  It may not be able to do everything that Git does, but is a lot simpler to work with, and does a great job for "smaller" projects (i.e. projects that don't have multiple development centers around the world.)  I actually really like it.

Whichever one you want to work with, setting it up can be a pain.  Thankfully, there are quite a few online services that set them up for you, provide backup services, etc.  Some of the popular ones are GitHub, BitBucket, and CloudForge.  There are hundreds (if not thousands) of other similar services out there.  Find one that you can trust, and go with it!  If you can't trust anyone, then you're going to have to set it up yourself.  That's not impossible, but it takes time to do right.

Both Git & SVN enable you to:

Check-In some code.  That is, upload your changes into the code-server.  You can usually add a comment about the purpose of this code-change.
Check-Out / update your code.  This gets the latest code from the code server.  All the code that other people checked-in will now be on your computer too!  Woohoo!
View a history log:  see which files have changed, when, by whom, and why.
Merge multiple code changes / handle conflicts:  sometimes, more than one person may work on the same file.  In many cases, the SCM will be able to automagically figure out how all these changes fit together.  But sometimes, this becomes too hard for the SCM, and manual intervention is required.  This usually occurs when you try to check in some code that someone changed before you.  Imagine file A.py.  Both you and "Mary" have checked-out the same copy of it.  Great!  Now both you and Mary changed the same line of code.  Perhaps you changed it, while Mary deleted it?  Mary checked-in her changes first.  So far, so good.  But now, when you come to check-in your changes, the SCM throws a complaint:  how can it change a line of code that was just deleted??  You need to get in there and fix this contradiction before you can check-in the code.  Basically, you have to decide between taking your changes (line change), or her changes (line delete), or something else entirely.  After that's done, all is well, and you can check it in and keep working.   An important lesson to be learned from this whole thing is to check-out / update your code regularly, as you want it to incorporate as many of other people's changes as possible, and thus reduce the chances of a manual merge being required.

Branches:  a branch is basically a copy of the code that is set aside for some particular purpose.  For example, if a feature is big and complicated, work on it may be done on a separate branch.  This way, the "main" branch won't be in a semi-broken state for the foreseeable future.  The price associated with a branch is the later "re-integration" work, which takes all the changes made to that branch, and merges them back into the "main" branch.  This can be error-prone and difficult, especially if the code has changed a lot in these two branches.  Fortunately, any good SCM will help automate this process, and make it as painless as possible.  Most organizations have their own branching policies.

Tag / Snapshot: a tag is basically a snapshot of the code at any given moment in time.  This is helpful for future reference.  Tags don't actually change the code, they just label it. For instance, a tag could be "First release", or "Release 2.0", etc.  One can then compare the code in different tags, or inspect the code of a tag to see where a problem may have first come up.
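To make the "you changed it, Mary deleted it" story a bit more concrete, here is a toy sketch of the rule a merge tool applies to each piece of a file.  It is not how Git or SVN are actually implemented: real tools work on diffs of whole files, while this sketch assumes each line has a made-up label so the three copies can be compared directly.  The function name three_way_merge and the sample data are hypothetical.

```python
def three_way_merge(base, mine, theirs):
    """Toy three-way merge.

    Each argument maps a (hypothetical) line label to that line's text.
    A label missing from a copy means the line does not exist there.
    Returns (merged, conflicts).
    """
    merged, conflicts = {}, []
    for label in sorted(set(base) | set(mine) | set(theirs)):
        b, m, t = base.get(label), mine.get(label), theirs.get(label)
        if m == t:        # both copies agree (or both deleted the line)
            result = m
        elif m == b:      # only "theirs" touched this line: take theirs
            result = t
        elif t == b:      # only "mine" touched this line: take mine
            result = m
        else:             # both touched it differently: a human must decide
            conflicts.append(label)
            continue
        if result is not None:
            merged[label] = result
    return merged, conflicts


# The copy of A.py that both you and Mary checked out.
base = {1: "import os", 2: "x = 1", 3: "print(x)"}
# You changed line 2.
mine = {1: "import os", 2: "x = 2", 3: "print(x)"}
# Mary deleted line 2 and added a new line 4.
mary = {1: "import os", 3: "print(x)", 4: "print('done')"}

merged, conflicts = three_way_merge(base, mine, mary)
print("merged:", merged)
print("conflicts on lines:", conflicts)
```

Mary's new line merges without any fuss, because only she touched it.  Line 2 lands in the conflict list, which is exactly the moment the SCM stops and asks you to sort it out by hand.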

BTW, most IDEs have built-in integration with most SCMs.  This way, you can update and check-in your code directly from your IDE!  If you don't like that, most SCMs have nice user-interfaces that can be used.  For example, TortoiseSVN integrates SVN commands right into your file "explorer", letting you perform SCM tasks as easily as right-clicking on a file or folder.  And of course, there is always the command-line interface, which usually gives you access to the full range of commands, if you need to run something peculiar.

Python Specifics

When working on larger Python projects, there are a couple of tools / libraries you probably want to be familiar with.

The first is virtualenvwrapper.  It's a simple tool designed to let you create multiple python environments for you to play with.  Each environment can have different packages installed, and thus your different projects won't conflict.  It's handy if you want to play around with some packages without "messing up" your main work environment.  The docs are pretty good, but basically, to get started you need to:
  • Download / install.  It's a set of shell functions, so it's aimed at Unix-like systems such as Linux and macOS.  Windows users need a separate port, such as virtualenvwrapper-win.
  • Set up your system to automatically use this package.  On Linux this means adding to your .bashrc file the following environment variables: WORKON_HOME, VIRTUALENVWRAPPER_PYTHON, and VIRTUALENVWRAPPER_VIRTUALENV.  Then, also add "source /usr/local/bin/virtualenvwrapper.sh".  QED.
  • To create a virtual environment, run: mkvirtualenv <env_name>
  • To use a virtual environment, run: workon <env_name>
  • To remove a virtual environment, run: rmvirtualenv <env_name>
  • See the virtualenvwrapper docs for all the commands.
  • After running workon, you can add & remove python packages normally using pip.  It will only install them for that particular virtual env.
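If you ever wonder whether workon actually "took", you can ask Python itself.  This is a small sketch, not part of virtualenvwrapper: the function name in_virtualenv is made up, and it relies on two things I believe hold for virtualenv-style environments.  Older virtualenv releases set sys.real_prefix, while the stdlib venv module and newer virtualenv releases make sys.prefix differ from sys.base_prefix.  The activate scripts also set the VIRTUAL_ENV environment variable.

```python
import os
import sys


def in_virtualenv():
    """Best-effort check for whether this interpreter runs inside a virtual env."""
    # Older virtualenv releases add sys.real_prefix to the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases: prefix differs from base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


print("interpreter prefix:", sys.prefix)
print("VIRTUAL_ENV variable:", os.environ.get("VIRTUAL_ENV"))
print("inside a virtual environment:", in_virtualenv())
```

Run it once before workon and once after, and the prefix should change from your system Python to a directory under WORKON_HOME.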
The other is nose.  This is a very common unit-testing tool.  I'll probably blather more about it in my section on unit-testing.  At the very least you need to be aware of it.
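Just to give a taste, here is roughly what a test file for nose can look like.  Everything in it is invented for illustration: word_count is a hypothetical function under test, and the file could be saved under a name like test_word_count.py.  nose collects functions whose names start with "test" and treats a failed assert as a failed test, so plain functions like these are enough to get going.

```python
def word_count(text):
    """Hypothetical function under test: count whitespace-separated words."""
    return len(text.split())


def test_counts_simple_sentence():
    assert word_count("larger software projects") == 3


def test_empty_string_has_no_words():
    assert word_count("") == 0


def test_extra_whitespace_is_ignored():
    assert word_count("  take   naps  ") == 2


if __name__ == "__main__":
    # Without a test runner, just call the tests by hand.
    test_counts_simple_sentence()
    test_empty_string_has_no_words()
    test_extra_whitespace_is_ignored()
    print("all tests passed")
```

With nose installed in your virtual environment, running nosetests in the same directory should find and run the three test functions.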

Gimme more tools!

IDEs and SCMs are by far the two most important tools for larger software projects.  But lots of other helper tools exist, and their usage depends largely on tastes and needs...

Bug-Tracking: if you've got lots of bugs coming in, and you want to figure out who works on what, when each is fixed, etc., then you're going to want some sort of bug-tracking tool.  Some popular ones are Bugzilla, Trac, and Mantis.   A nice feature of Trac is that it has an integrated Wiki (and was written in Python).  They are all good.  Some prefer writing bugs on a white-board, or on little cards. :)

Wiki / Documentation: I'm not a HUGE fan of horrible, outdated documentation.  But having a place for people to share ideas is nice.  I would keep it small and simple.

Ok, that's enough! Just get to work already!! :)

But I want more!  Gimme more!!!

Argf! Ok.  Well, larger projects benefit from a sane approach to development. There is not one "right" answer, but very few people seem to realize this. :) 

I think the most important aspect of large project development is unit testing.  It really helps keep things sane.  

But there are also philosophies! Today's fashions include Agile.  From Agile, I mostly like Pair Programming.

Another important aspect is taking naps.  Working too hard reduces productivity.  It also reduces the amount of fun you have in your life.  And that would be a shame now, wouldn't it?

Oh yeah, there's also that great classic, The Mythical Man-Month, which thoroughly destroys the idea that just adding more people to a project will make it finish faster.

Making big software thingies is hard work.  It takes time.  And you can't plan for everything.  Have fun and good luck!  :)


