Friday, March 29, 2013

Scraping Craigslist for mein Deutsches Auto

Recently, for 30+ years of good service, I was awarded the gift of a son, born 2 weeks early and bringing with him a set of lungs only a new father could love. As part of the GREAT PREPARATION for the boy, I was to source a laundry list of items, the most important being a car.


TWO CARS BETTER THAN ONE

The problem with buying a car is that I already had one. But, despite it having 2 very solid doors and a rag top, it did not transport infants according to rule one of THE WIFE's guidelines on acceptable child bearing machines. To replace it with a 4 door motor would require a vehicle that both satisfied my rule "never buy anything I don't want" and also the wife's "never buy a Mustang again" edict.

After two months of researching, I finally selected a steed I would be happy with.


BUYERS AND SUPPLIERS

It turned out to be incredibly hard, however, to find an example of THE CHOSEN CAR that I wanted: a specific revision of a used Audi A4 Wagon with glass roof and lots of buttons. So difficult was this search that I found myself spending over an hour a day searching a bunch of websites for this specific model. On the 2 occasions when I actually found the ideal car in an evening's search, following up revealed both times that it had already been sold!

To add to the woes, my criteria for "what car" quickly grew, as BABY DAY approached, into accepting almost any car that wasn't "that damn Mustang". Now, instead of spending an hour a day looking for one model, I was spending three times the effort speed reading around 100 new hits, researching the specs of these cars and checking whether the pricing was good or not.


THE GREAT AUTOMATION

So, automating the search was the next step. Craigslist was where I bought the 'Stang and where the other 2 hits for the new motor came from, so I started there.

There are many methods of monitoring Craigslist - RSS feeds being the officially supported method, but directly scraping the HTML pages also works. There are also third party apps that can do this on your behalf, including browser plugins and mashup web services dedicated to searching. One particular app I had a lot of hope for was If This Then That.



Without reviewing each and every service in detail here (IFTTT included): about half of them worked well enough, yet all were either laggy or not specific enough to save me significant time or give me confidence that they would be reliable enough to bag the motor.


THE WHOLE HOG

So I ended up putting in place a hosted Linux server running scripts that scraped things like Craigslist every 5 mins and emailed / texted me as soon as a model turned up that was in my search criteria. It was, as expected, a lot of fun.

You can find the code here if you find yourself with the same issue!

https://github.com/mrlambchop/clcarhunt
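
The repository above holds the real scripts. Purely as an illustration of the poll-and-filter idea, here is a minimal Python sketch; it is not the clcarhunt code. The feed layout (plain RSS `item` elements with a `title` and a `link`), the sample listings, the example.org URLs and the helper names `parse_price` and `find_matches` are all assumptions made up for this example. A real setup would fetch the feed on a timer and send an email or text for each match.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed's layout may well differ.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>2009 Audi A4 Avant wagon - $14500</title><link>http://example.org/1</link></item>
  <item><title>2006 Ford Mustang GT - $9000</title><link>http://example.org/2</link></item>
  <item><title>2010 Audi A4 Avant, glass roof - $21000</title><link>http://example.org/3</link></item>
</channel></rss>"""


def parse_price(title):
    """Return the first $-prefixed number in a title, or None if there is none."""
    m = re.search(r"\$(\d[\d,]*)", title)
    return int(m.group(1).replace(",", "")) if m else None


def find_matches(feed_xml, keywords, max_price, seen):
    """Return (title, link) pairs for unseen items matching every keyword and the price cap.

    `seen` is a set of links already processed; it is updated in place so that
    the next poll only reports listings that are new.
    """
    matches = []
    for item in ET.fromstring(feed_xml).iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if link in seen:
            continue
        seen.add(link)
        price = parse_price(title)
        wanted = all(k.lower() in title.lower() for k in keywords)
        if wanted and price is not None and price <= max_price:
            matches.append((title, link))
    return matches


if __name__ == "__main__":
    seen_links = set()
    for title, link in find_matches(SAMPLE_FEED, ["audi", "a4"], 15000, seen_links):
        print(title, link)
```

Keeping the same `seen` set between polls is what stops the same car being reported every 5 minutes.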

I have since completely redone this into a version that supports scraping all sorts of content for personal pleasure, which was much more interesting technically to put in place - I will save this for another post.

And yes, I still have the Mustang. The boy is 4 months now ;0)

Wednesday, March 20, 2013

Everybody needs a cloud for a pillow

Home networks with NAS units are great for local storage of files, but I have always been wary of exposing these servers directly to the great unwashed of the WWW as, quite frankly people, you sometimes get up to mischief.

YOU CAN SECURE THESE YOU KNOW

True. And I used to run a NAS (a NETGEAR ReadyNAS) from my home router, exposed to the internet, protected by SSH keys and obscure port mappings (YAY). Having not built the NAS Linux kernel and user space from scratch however (or fully characterized all the processes / apps running on it), there was always a niggling doubt that it was simply obscure, not secure. Also, it consumed much of my time to maintain and one day, the device killed one of the RAID disks - the writing was on the wall for this little box that could, but did not. So I reinstalled the original firmware and, for the sake of my kid's baby photos, turned it back into a basic NAS box.

WHAT DO YOU WANT FROM THIS CLOUD ANYWAY? YOU BRITS ALWAYS COMPLAIN ABOUT THE RAIN

Well, I had a bunch of things I was messing around with, all of which needed something to talk to that was “always on”. A box in the cloud was perfect for this, so I started looking around to see what was available. Turns out there are a couple of ways to get one of these here cloud boxes that fitted my "pay nothing to anyone" budget.
  • A dedicated PC or Mac, kept in a data center of your choosing
    • You can supply the HW and pay the hosting centre fees or rent one
    • The hosting service provides power, a network connection and a fixed IP address
    • Runs your OS of choice! AmigaOS please.
    • Physical access is generally required if you provided the box yourself, or some funky BIOS SW is available on pro servers to enable remote administration – some even support KVM and the ability to supply an ISO image for the CD or USB drives over the net.
  • A Virtual Private Server (VPS) 
    • You pay a hosting service for a timeshare on a super fast server
      • Share CPU, memory, storage and network with multiple users
    • The hosting service can run a large selection of virtual machines / OS's for you to use, but the list is limited to what sells in volume (no AmigaOS…)

The latter sounded good, but VPS seems to come in many shapes and sizes! The characteristics I was looking for were limited to:
  • "Hack it if you want, it's quick to reinstall and there is nothing secret on there anyway" administration panel access
  • Low(ish) processing and local storage
  • Good enough bandwidth limits (no plans to host torrents on it)
  • Super cheap!

THE LOW END BOX

The site LowEndBox was a great read on the different solutions available in the VPS Linux box hosting arena. Much discussion basically boils down to the following however:
  • Price
  • Reliability (up time)
  • Where is the server located
  • Are there any deals?
  • Does the owner post on the forums and does he reply to crazies posting about his company?
The site quickly led me to URPad DC, who had an offer of $12 for a year of an Ubuntu 12.04 installation. THIS IS CHEAPER THAN (some) BEER!

Any caveats? Well, the VPS is listed as "unmanaged", meaning that outside of the initial install, it's all down to you to configure.

Update: Further investigation shows lowendstock is also a great resource for budget VPS solutions. At the time of writing: "FOUR DOLLAR VPS".

URPAD

12 dollar dollar bill yo’s later, and we’re in. What I liked:
  • Super quick to setup - payment went through and I received my login almost immediately
  • Easy to access hosting controls to wipe / provision the server
  • Small but good selection of Ubuntu packages pre-installed and APT running quickly to install any missing items
  • Reliable hosting (never found it broken or down so far)
The only negative would be that at one time, I found the VPS going super slow for a few minutes when I was simply at the shell – nothing else running. It wasn't anything critical, but it led to a wander into the tech behind their virtual server stack.

OpenVZ vs VMware

URPad runs OpenVZ. According to its wiki, it's containerization of an OS rather than an entire virtual machine emulator (such as VMware or VirtualBox). The interwebs do a better job of “what” here, so I knocked up a quick table of some of the differences in relation to the “why did my server go slow!” witch hunt:

                         Virtualization    Containerization
Single Kernel?           No                Yes
Full System Isolation    Yes               No (common kernel)
Performance              OK                Better
Scheduler Contention     Yes               Yes
Resource control         Per VM            Per user within the VM
Isolation                Complete          Partial

How could this explain the slow down? Whilst a virtualized OS is still at the whim of the host’s scheduler (memory allocation, disk access etc…), within the virtualized OS, everything runs 'evenly'(ish) within that virtual guest. In theory, you can lock down the scheduling time of the virtual machines to a fairly granular level. With the containerized VM, as the kernel is shared rather than emulated, it's possible for one container to call various syscalls heavily enough to load up the entire system and monopolize the host's CPU time, slowing down everyone else on the box.

The reason for the low pricing of VPS solutions based on containerization is fairly obvious - cheaper maintenance / resource requirements per virtual host - and I am happy with the tradeoffs. Would I want a production server running on this infrastructure however? Probably not. In fact, I’d be straight over to Amazon.

The art of file storage

James, over at Programming in the 21st century, has recently written about how the desktop is an acceptable place for the storage of documents and files. And I couldn't agree more, assuming we are not talking about code.

THE BIG BUCKET

Whilst my current (office) desktop is only partially covered by icons, it's only because every time it gets full, I file the current contents into a sub folder and start again. This rarely happens however as I never minimize all windows or reboot enough to notice the full desktop ;0)

Icons blurred to protect the guilty!

The fact is that I store all documents I'm working on straight to the desktop because the file save / open dialog is too slow to do anything but click on the Desktop icon - by the time I've got to saving a document, I'm already onto the next thing, and dealing with a sluggish file browser is enough of a delay that I'm reaching for the browser whilst it churns along. Moving to Windows 7 + an SSD did make a lot of difference, but it is still nowhere near my tolerance.

FILE RETRIEVAL

Whilst the desktop is good for saving, for retrieving the files Servant Salamander is my go-to choice of file manager (in the Norton Commander style).

Things I use this for all the time:

  • Fast directory switching on my local disk and file opening / closing / deleting / moving etc...
  • Navigating ANY network drive - for some reason, it's 10x faster than Windows Explorer and makes the
  • [S]FTP / SCP copying to / from Linux boxes
  • Creating a customized list of files from a directory
  • Directory size calculation
  • Viewing various types of compressed file (zip, tar, gz etc...)
Typical use - local disk left, remote disk right (this one via SSH)

And yes, this version is unregistered still! After the last laptop upgrade, the license key bought in another country 5 years ago went missing. Not only does the unlicensed version crash repeatedly, but I am too cheap to simply buy it again and would rather spend endless hours scouring old emails for the key. I'll break soon and re-buy any day now...


SAME AND SIMPLE

I have never used the Function key shortcuts in these file managers, nor added any customizations to them - like my love affair with Nano, I get things done so much faster by learning a tool that covers 90% of my requirements than by tweaking something for months and then having to sync the same settings across all the systems I use. Muscle memory is unforgiving when the latest tweak to your .rc file hasn't been replicated to that one box - it's like a mental stubbing of the toe.

Tuesday, March 19, 2013

Android Package Manager

BT is broken on my Galaxy Nexus (JB 4.2.1) when connecting to my Audi - around 40% of the time it refuses to pair and the phone needs a reboot to establish a connection. Sadly, I am almost always trying to call someone when it fails, so I never have time to whip out a USB cable and grab the log. Turning to the Google Play store, I downloaded a bunch of apps that promised logcat extraction with a single click.

But the apps didn't work.


READ_LOGS

We turn to Stack Overflow (really need to buy shares in this company - it's second on my search radar to Google these days) and find this and this explaining the background to what is going on - basically, permission for an application to read the logs has been revoked (the logs are now only accessible to system applications). Aha - now I recall hearing about this.

Trying to hack around the removal of android.permission.READ_LOGS from Jellybean as follows...

adb shell pm grant <pkg> android.permission.READ_LOGS

...but for some reason the console package manager / pm application in the Nexus device does not match the one built from src in the AOSP Android source code. Specifically, the "grant" function is missing from the help (adb shell pm).
adb shell pm grant
Error: unknown command 'grant'

Checking the code for the pm console command, it seemed unusual for such a core function to have a significantly different command set in a Google production phone compared to the AOSP src - let's dig into what is going on here.


ANDROID JAVA CMDS

A bunch of commands that run from the shell in Android are implemented in Java. These 'console' apps can call directly into the core Android framework without jumping through hoops, allowing them to talk to the framework APIs. An incomplete list of the apps on JellyBean 4.2 that run in this way is below:

am, backup, bmgr, bootanimation, bu, bugreport, content, ime, input, installd, ip-up-vpn, pm, rawbu, requestsync, screencap, screenshot, service, servicemanager, settings, svc, system_server

When invoked from the console, each of these console apps is actually a shell script that invokes the "app_process" command, pointing it at the Java class to execute. In this class, a familiar "main()" function is the entry point to the Java application, which then parses the command line and does some work.

The shell script that invokes the pm application is below (it lives at /system/bin/pm):

# Script to start "pm" on the device, which has a very rudimentary
# shell.
#
base=/system
export CLASSPATH=$base/framework/pm.jar
exec app_process $base/bin com.android.commands.pm.Pm "$@"

Example of main from cmds/pm/src/com/android/commands/pm/Pm.java - simply parsing the command line arguments and calling functions in the package manager (or user manager) via binder (IPC to another process).

public static void main(String[] args) {
    new Pm().run(args);
}

public void run(String[] args) {
    boolean validCommand = false;
    if (args.length < 1) {
        showUsage();
        return;
    }
    // Look up the user manager and package manager services via binder
    mUm = IUserManager.Stub.asInterface(ServiceManager.getService("user"));
    mPm = IPackageManager.Stub.asInterface(ServiceManager.getService("package"));
    // <snip>
    mArgs = args;
    String op = args[0];
    mNextArg = 1;
    // <snip>
    if ("grant".equals(op)) {
        runGrantRevokePermission(true);
        return;
    }
    // <snip>
}

ANDROID PACKAGE MANAGER ANOMALY

Because the pm application is just a Java app, stored on the Android file system, we can pull this file off and take a look at what differences it has compared to the AOSP version. I used a Galaxy Nexus here with firmware 4.2.1:
adb pull /system/framework/pm.jar
jar -xf pm.jar
ls -al 
gave a single directory, META-INF containing the Java manifest.

Note: as .JAR files are simply ZIP files (with a few caveats), you can rename it to .zip if you're a Windows user and open it right up, or use the 'jar' application if you have the JDK installed on either Windows or Linux. Triple checking using Windows showed that it did indeed contain just an empty manifest.
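
The same "is there any code in this jar?" check can be sketched in a few lines of Python using the standard zipfile module, since a jar is read like any other ZIP archive. This is only an illustration: the helper names `jar_entries`, `has_dex` and `make_jar` are made up here, and the jars are built in memory as stand-ins rather than pulled from a device.

```python
import io
import zipfile


def jar_entries(jar_bytes):
    """List the entry names inside a jar, treating it as a plain ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as zf:
        return zf.namelist()


def has_dex(jar_bytes):
    """True if the jar carries a classes.dex (i.e. it has not been stripped)."""
    return "classes.dex" in jar_entries(jar_bytes)


def make_jar(names):
    """Build a stand-in jar in memory containing empty entries with the given names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()


if __name__ == "__main__":
    # Stand-in for the stripped pm.jar found on a production device
    stripped = make_jar(["META-INF/MANIFEST.MF"])
    print(jar_entries(stripped), has_dex(stripped))
```

For a real file pulled with adb, passing the path straight to `zipfile.ZipFile("pm.jar")` and calling `namelist()` does the same job.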

What is missing here is the actual code (the classes.dex file) that implements pm. AHA. Of course, the Galaxy Nexus is using a "user" image for production and so it contains a pre-generated odex file (Optimized DEX) and leaves an empty .jar file in place (for a reason I never fully followed - I suspect some check in Dalvik needs to see it's there).

For completeness, and to cross check this, grabbing a copy of /system/framework/pm.jar from a debug arm-v7-neon "eng" config, built from the AOSP, then running the same command on it gives me the expected dex file (Dalvik bytecode or "binary Java").

META-INF
classes.dex

Using the dex2jar tool (http://code.google.com/p/dex2jar/) to convert classes.dex to a standard Sun style jar file, then unpacking that jar:

./dex2jar-0.0.9.13/dex2jar.sh classes.dex
jar -xf classes_dex2jar.jar
cd com/android/commands/pm/
ls -al
unsurprisingly lists the class files that make up this app, as we'd expect:

Pm.class

OK - so we'll grab the /system/framework/pm.odex file from the Galaxy Nexus. To deodex an ODEX file, we use baksmali as such:

./baksmali -a 17 -d yakju-jop40d/sys/framework -x pm.odex

Note that '17' is the API level used in JB 4.2, the framework dir is from /system/framework on the device (adb pull /system/framework) and of course, pm.odex is taken from the device.

The output is put into: out/com/android/commands/pm/

Looking at the Pm.smali file and searching for "grant", we find this in the showUsage function (that gets invoked whenever the command syntax is wrong):
.line 1471
sget-object v0, Ljava/lang/System;->err:Ljava/io/PrintStream;
const-string v1, "       pm grant PACKAGE PERMISSION"
invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V
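
Searching the smali by hand works, but the same hunt can be scripted. A minimal Python sketch follows, assuming string literals appear on `const-string` lines as in the snippet above; the function name `find_strings` and the regular expression are my own and are not part of baksmali.

```python
import re

# Matches lines such as:  const-string v1, "       pm grant PACKAGE PERMISSION"
CONST_STRING = re.compile(r'^\s*const-string(?:/jumbo)?\s+\w+,\s*"(.*)"\s*$')


def find_strings(smali_text, needle):
    """Return (line number, literal) for each const-string literal containing needle."""
    hits = []
    for lineno, line in enumerate(smali_text.splitlines(), start=1):
        m = CONST_STRING.match(line)
        if m and needle in m.group(1):
            hits.append((lineno, m.group(1)))
    return hits


if __name__ == "__main__":
    # The four lines quoted above, used as sample input
    sample = "\n".join([
        ".line 1471",
        "sget-object v0, Ljava/lang/System;->err:Ljava/io/PrintStream;",
        'const-string v1, "       pm grant PACKAGE PERMISSION"',
        "invoke-virtual {v0, v1}, Ljava/io/PrintStream;->println(Ljava/lang/String;)V",
    ])
    for lineno, text in find_strings(sample, "grant"):
        print(lineno, repr(text))
```

Run over the whole of Pm.smali (read the file and pass its text in), this would list every usage string and command name mentioning "grant" in one go.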

CURIOUSER AND CURIOUSER CRIED ALICE

So the question remains - how is the "grant" help code missing from the console app on the Galaxy Nexus?

To Be Continued...