Friday, July 23, 2010

GetDeb: Archive traffic distribution

GetDeb's user base has grown exponentially since the project started in 2006. The migration to a proper APT repository, while providing important benefits, is also a big technical challenge. The GetDeb and PlayDeb repositories are presently configured by more than 30k users, and our mirror pool serves tens of gigabytes during traffic peaks.
In 2007 we had a single server and the traffic was unaffordable, so we gathered some mirrors and developed a PHP script which was responsible for validating file availability on the candidate mirrors and then redirecting users to them (using HTTP redirects). This script was poorly developed but sufficient for a long time.
Before moving to APT, file requests were human-originated, coming from web clicks. Now this script is massively used by automatic system upgrades, and its original faults have a much more serious impact. It needs to be replaced.

I have checked existing solutions for mirror distribution:

APT mirror: method - APT supports a specific mirror: method which dynamically obtains a mirror from a URL. However, it is transaction based: the same archive will be used for all requests after the initial retrieval. This means that at the beginning of a transaction it must get the URL of a mirror which provides all the files required by the subsequent operations. For GetDeb this is a major limitation. Since we have very frequent updates (sometimes hourly), most of the mirrors would be unavailable for mirror selection because they would be out of sync; even if they do have the packages for that specific transaction, there is no way to know in advance. This issue is not present with HTTP redirects: we always return the package index from the master server, and files are obtained from individual mirrors as long as they match the master server version, regardless of the overall mirror status.
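As a rough illustration of the redirect policy described above, here is a minimal Python sketch. The master URL, the list of index file names and the function names are all assumptions made for the example; they are not taken from the actual GetDeb scripts.

```python
import posixpath

# Hypothetical values, for illustration only.
MASTER_URL = "http://master.example.org/getdeb"
INDEX_NAMES = {
    "Release", "Release.gpg",
    "Packages", "Packages.gz", "Packages.bz2",
    "Sources", "Sources.gz", "Sources.bz2",
}


def is_index_file(path):
    """True for APT index files, which should always come from the master."""
    return posixpath.basename(path) in INDEX_NAMES


def redirect_target(path, mirrors_with_copy):
    """Return the URL a client should be sent to for `path`.

    `mirrors_with_copy` is a list of mirror base URLs already known to hold
    a copy of the file matching the master's version.
    """
    if is_index_file(path) or not mirrors_with_copy:
        # Index files (and files no mirror has yet) are served by the master.
        return MASTER_URL + path
    # Any matching mirror will do; this sketch simply takes the first one.
    return mirrors_with_copy[0] + path
```

With this policy a mirror that is out of sync overall can still serve every package it already holds, because the decision is made per file and not per transaction.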

Mirrorbrain - Mirrorbrain is used by mainstream projects like openSUSE's build service and OpenOffice, so it was a strong candidate. After some research I found that it detects file availability by using a database which must be kept current by a mirror scan tool which does a full mirror scan (file info: size, last modified). While this may be great for most scenarios, I don't think it is as efficient as doing an on-demand mirror check. Our slowest mirror took more than 10 minutes for a full scan, so we would need large scan intervals, increasing the risk of redirecting to a failed mirror.

mirror-selector - Because I strongly believe in the technical merit of the on-demand scan, I decided to implement a mirror selection system from scratch using Python.
The utility/project name is "mirror-selector". It runs as a standalone HTTP server whose only purpose is to handle static file GET requests: it checks that the file is available in a local directory (it must be run on a local mirror) and then redirects to a mirror after checking that an exact copy of the file is available remotely.
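To make the on-demand check more concrete, here is a minimal sketch in Python. It is not the mirror-selector code: the function names are made up, and comparing only the Content-Length of a HEAD response against the local file size is an assumption about what "exact copy" could mean (a real check might also compare the last modified time).

```python
import http.client
import os
import urllib.request


def local_size(root, request_path):
    """Size of the requested file in the local mirror, or None if missing."""
    path = os.path.join(root, request_path.lstrip("/"))
    return os.path.getsize(path) if os.path.isfile(path) else None


def remote_size(url, timeout=5):
    """Content-Length reported for a HEAD request, or None on any failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            length = response.headers.get("Content-Length")
    except (OSError, http.client.HTTPException):
        return None
    return int(length) if length is not None and length.isdigit() else None


def pick_mirror(size, request_path, mirrors, size_of=remote_size):
    """Return the URL on the first mirror whose copy has the expected size.

    `size` is the local file size; returns None if the file is not available
    locally or no mirror reports a matching copy.
    """
    if size is None:
        return None
    for base in mirrors:
        url = base.rstrip("/") + request_path
        if size_of(url) == size:
            return url
    return None
```

A request handler would call something like pick_mirror(local_size(root, path), path, mirrors) and answer with an HTTP redirect to the returned URL, or serve the file itself when the result is None.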

The HTTP server uses a fixed-size thread pool; each web client request is handled on its own HTTP server thread. When mirror-selector starts, a thread is started for each mirror, and each mirror thread provides an input queue which may be used by any HTTP server thread. With this architecture all requests related to a given mirror are handled on a single thread, which makes it easy to reuse the same TCP connection for multiple requests by using HTTP 1.1 Keep-Alive. The caching facility is also simpler to implement because it works on a per-thread basis.
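The per-mirror thread and its input queue could look roughly like the sketch below. The class and method names are hypothetical, and the `check` callable stands in for the real work (a HEAD request over a connection kept open by the mirror thread); the cache never expires here, which a real implementation would need to handle.

```python
import queue
import threading


class MirrorWorker(threading.Thread):
    """One thread per mirror; every check for that mirror runs here."""

    def __init__(self, mirror, check):
        super().__init__(daemon=True)
        self.mirror = mirror
        self.check = check          # callable(mirror, path) -> bool
        self.inbox = queue.Queue()  # input queue shared by HTTP server threads
        self.cache = {}             # only touched by this thread, so no lock

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:        # sentinel: stop the worker
                break
            path, reply = item
            if path not in self.cache:
                try:
                    self.cache[path] = bool(self.check(self.mirror, path))
                except Exception:
                    # Treat any failure as "not available"; do not cache it.
                    reply.put(False)
                    continue
            reply.put(self.cache[path])

    def has_file(self, path):
        """Called from an HTTP server thread; blocks until the worker answers."""
        reply = queue.Queue(maxsize=1)
        self.inbox.put((path, reply))
        return reply.get()

    def stop(self):
        self.inbox.put(None)
```

Because only the mirror's own thread reads and writes the cache and uses the connection, neither needs any locking; the queue is the only shared structure.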

The code is available on Launchpad: bzr branch lp:mirror-selector (check the README to test it). It should be considered alpha.

GetDeb's/PlayDeb's main archive pool has already been switched to mirror-selector. We may intermittently swap back to the legacy selector, as serious problems may still be found.

To check if it's available and some stats:

Saturday, July 10, 2010

Something is wrong with the software publishing plan for Ubuntu

Lately I have been watching a couple of threads related to Ubuntu packaging and publishing which make me feel something is wrong.

With all due respect, I believe Matt was probably dreaming when he came up with the "We’ve packaged all of the free software" title. There are plenty of applications that are not properly maintained in the archive, or not packaged at all, due to the lack of human resources.

Then Jorge, in this article, narrowed the issue to the centralized versus distributed model. The argument involving the "iPhone’s app store" is a bit concerning: will the Ubuntu app store follow the iPhone's restrictive strategy because *some* people like it?

Opinions aside, there will be a new process for application inclusion, which is great, but which I am afraid will be commercially oriented. If Ubuntu does not have the resources to handle REVU, the current community-oriented package inclusion process for development releases, I really don't see how it will be able to handle post-release inclusions unless there is revenue involved.

Tuesday, July 6, 2010

Python and OpenOffice - Hello World

Recently I needed to automate data gathering into an OpenOffice Calc sheet. After some frustration playing with the OO macro language (OOBasic), I did some research on how to interact with OO from Python. There is some documentation spread over wikis and forums, but not that much, so I am going to share what I learned here.
This is a simple hello world script which fills cell "A1" of the currently active sheet with "Hello World" in bold.
Please note that I will be using regular Python scripts launched from the terminal, which interact with OO using a UNO socket bridge.
First you will need to start oocalc in listening mode (to accept UNO connections), from the terminal:
oocalc "-accept=socket,host=localhost,port=8100;urp;StarOffice.ServiceManager" -norestore -nofirststartwizard -nologo
Now just execute python and try the following code:
from uno import getComponentContext
from com.sun.star.awt.FontWeight import BOLD

# Connect to soffice using UNO
localContext = getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
context = resolver.resolve("uno:socket,host=localhost,port=%d;urp;StarOffice.ComponentContext" % 8100)

# Get desktop service
desktop = context.ServiceManager.createInstanceWithContext("com.sun.star.frame.Desktop", context)

# Get current document
document = desktop.getCurrentComponent()

# Get the active sheet
sheet = document.getCurrentController().getActiveSheet()

# Get the "A1" cell reference
cell = sheet.getCellRangeByName("A1")

# Set a string
cell.setString('Hello World')

# Change to bold
cell.CharWeight = BOLD

That is all for the first lesson :)