sqlplus, edit stored procedures

Previously, we discussed viewing stored procedures in sqlplus.

An Ask Tom article brought to my attention a very handy way to edit a package in sqlplus.

Set your $SQLPATH environment variable to wherever you wish to store useful scripts, including a login.sql -- make sure to set an editor in login.sql, e.g.,

define _editor=vim
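
For example, $SQLPATH can be set in your shell profile (the path here is illustrative):

export SQLPATH=$HOME/sql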

In the Ask Tom example, simply create a getcode.sql and save it somewhere in your $SQLPATH -- for reference,

set feedback off
set heading off
set termout off
set linesize 1000
set trimspool on
set verify off
spool &1..sql
prompt set define off
select decode( type||'-'||to_char(line,'fm99999'),
               'PACKAGE BODY-1', '/'||chr(10),
                null) ||
       decode(line,1,'create or replace ', '' ) ||
       text text
  from user_source
 where name = upper('&&1')
 order by type, line;
prompt /
prompt set define on
spool off
set feedback on
set heading on
set termout on
set linesize 100

Then, in sqlplus, you can edit an existing procedure/package/function and update it in the database, e.g.,

foo_owner@FOO> @getcode PROCS_DATE
foo_owner@FOO> edit PROCS_DATE

foo_owner@FOO> @PROCS_DATE

Package created.

Package body created.

foo_owner@FOO> 
Posted in oracle

sqlplus, view stored procedures

I would like to view stored procedures and packages in sqlplus.

All stored procedures, functions, and packages can be found in the user_source, dba_source, or all_source views (depending on your access). Let's examine user_source.

foo_owner@FOO> desc user_source
 Name					   Null?    Type
 ----------------------------------------- -------- ----------------------------
 NAME						    VARCHAR2(30)
 TYPE						    VARCHAR2(12)
 LINE						    NUMBER
 TEXT						    VARCHAR2(4000)

foo_owner@FOO> 

Let's examine user_source.type to see what kind of objects are available to view,

foo_owner@FOO> SELECT DISTINCT type FROM user_source;

TYPE
------------
PROCEDURE
PACKAGE
PACKAGE BODY
FUNCTION
TYPE

5 rows selected.

foo_owner@FOO>

The user_source view (like dba_source and all_source) contains line-numbered source text for functions, procedures, and packages.

If you want a list of packages, you could simply run the following:

foo_owner@FOO> SELECT DISTINCT name FROM user_source WHERE type = 'PACKAGE';

NAME
------------------------------
PROCS_GIFT_SUBS
PROCS_TAX_EXCEPTION_REPORT_V1
PROCS_GIFT_SUBS_TAX
PROCS_FLASH_REPORT
PROCS_TAX_V1
PROCS_FISCAL
PROCS_EXCEPTION_REPORT
PROCS_TAX
...

If you'd like to view the source for a package (the package spec and package body), you can query user_source.text as follows,

foo_owner@FOO> SELECT text FROM user_source WHERE name = 'PROCS_DATE' ORDER BY type,line;

TEXT
----------------------------------------------------------------------------------------------------
PACKAGE 	    "PROCS_DATE" AS
  FUNCTION DATE_ID(IN_DATE DATE) RETURN NUMBER RESULT_CACHE;
  FUNCTION DATE_FROM_ID(IN_ID NUMBER) RETURN DATE RESULT_CACHE;
END;

PACKAGE BODY		 "PROCS_DATE" AS

  FUNCTION DATE_ID(IN_DATE DATE) RETURN NUMBER RESULT_CACHE
  IS
    ID NUMBER;
  BEGIN
    SELECT D.ID INTO ID FROM DATES_DIM D WHERE SQL_DATE = TRUNC(IN_DATE);
    RETURN ID;
  END;

  FUNCTION DATE_FROM_ID(IN_ID NUMBER) RETURN DATE RESULT_CACHE
  IS
    the_date DATE;
  BEGIN
    SELECT sql_date INTO the_date FROM dates_dim WHERE id = in_id;
    return the_date;
  END;

END;

foo_owner@FOO> 
Posted in oracle

sqlplus explain plan

I would like to view explain plan output in sqlplus.

One easy approach is to set autotrace. You can set autotrace on to show query statistics as well as an execution plan. You can also view only the execution plan, for example,

foo_owner@FOO> set autotrace on explain
foo_owner@FOO> SELECT COUNT(*) FROM tax_facts WHERE create_date <= TO_DATE('10-JAN-12', 'DD-MON-YY');

  COUNT(*)
----------
   3768447


Execution Plan
----------------------------------------------------------
Plan hash value: 3136400752

--------------------------------------------------------------------------------------
| Id  | Operation	      | Name	     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	     |	   1 |	   8 |	1988   (5)| 00:00:36 |
|   1 |  SORT AGGREGATE       | 	     |	   1 |	   8 |		  |	     |
|*  2 |   INDEX FAST FULL SCAN| TAXFCT_CRTDT |	3774K|	  28M|	1988   (5)| 00:00:36 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("CREATE_DATE"<=TO_DATE('10-JAN-12','DD-MON-YY'))

foo_owner@FOO>
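
For reference, the standard autotrace variants are:

set autotrace on            -- query results, execution plan, and statistics
set autotrace on explain    -- query results and execution plan only
set autotrace traceonly     -- suppress query results, show plan and statistics
set autotrace off           -- turn autotrace back off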

Alternatively, you can use EXPLAIN PLAN FOR in front of your SQL statement, but viewing the plan output is less intuitive, e.g.,

foo_owner@FOO> EXPLAIN PLAN FOR  SELECT COUNT(*) FROM tax_facts WHERE create_date <= TO_DATE('10-JAN-12', 'DD-MON-YY');

Explained.

foo_owner@FOO> SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3136400752

--------------------------------------------------------------------------------------
| Id  | Operation	      | Name	     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	     |	   1 |	   8 |	1988   (5)| 00:00:36 |
|   1 |  SORT AGGREGATE       | 	     |	   1 |	   8 |		  |	     |
|*  2 |   INDEX FAST FULL SCAN| TAXFCT_CRTDT |	3774K|	  28M|	1988   (5)| 00:00:36 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("CREATE_DATE"<=TO_DATE('10-JAN-12','DD-MON-YY'))

14 rows selected.

foo_owner@FOO> 
Posted in oracle

sqlplus, exploring schemas and data

I would like to get information about different schemas and tables in sqlplus. I would like this to be as easy as mysql; it's less intuitive but almost as easy.

To get a list of schemas (similar to mysql "show databases"), you can run the following,

foo_owner@FOO> SELECT username FROM all_users ORDER BY username;

USERNAME
------------------------------
BOB
BOB_APP
FOO_OWNER
SYS
SYSTEM

5 rows selected.

foo_owner@FOO> 

To change to a different schema (similar to mysql "use database"), you can run the following,

foo_owner@FOO> ALTER SESSION SET CURRENT_SCHEMA = foo_owner;

Session altered.

foo_owner@FOO> 

To view the tables (similar to mysql "show tables"), you can run the following,

foo_owner@FOO> SELECT table_name FROM tabs;

TABLE_NAME
------------------------------
FOO_USERS
FOO_GROUPS
FOO_USER_GROUP
...
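
Note that tabs is a synonym for user_tables and only lists tables owned by your login user. To list the tables of another schema you have access to, query all_tables, e.g. (the BOB schema is illustrative):

foo_owner@FOO> SELECT table_name FROM all_tables WHERE owner = 'BOB';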

To view the columns of the table, it's the same as mysql, i.e.,

foo_owner@FOO> DESC foo_user_group

 Name								   Null?    Type
 ----------------------------------------------------------------- -------- --------------------------------------------
 USER_ID							   NOT NULL NUMBER(38)
 GROUP_ID							   NOT NULL NUMBER(38)
 DATE_ID							   NOT NULL NUMBER(38)

And all the normal SQL you feel like running, e.g.,

foo_owner@FOO> SELECT COUNT(*) FROM foo_users;

  COUNT(*)
----------
    746202

foo_owner@FOO> 
Posted in oracle

sqlplus pagesize and linesize

I would like sqlplus output to be more readable.

Fortunately, you can adjust the pagesize and linesize variables on the fly, e.g.,

foo_owner@FOO> set pagesize 50000
foo_owner@FOO> set linesize 120

You can set pagesize to 0, which is very useful for scripted output, as it will not print any column headers and will only print the results. However, if you want to see column headers but not repeat them every nth line, set pagesize to something reasonably high.
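
For example, a minimal script-friendly query (a sketch):

set pagesize 0
set feedback off
SELECT username FROM all_users;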

The linesize defaults to 80, which is fine on a 1980s dumb terminal, but I prefer to set this to a more reasonable width to fit my terminal screen.

You can add both of these lines to your login.sql file (sqlplus will look for login.sql in the current directory or in the directories named by the $SQLPATH environment variable):

-- format output
set pagesize 50000
set linesize 120
Posted in oracle, shell tips

sqlplus command prompt (sqlprompt)

I would like to change the sqlplus command-prompt to something more useful than

SQL>

You can modify the sqlprompt variable as follows,

SQL>
SQL> set sqlprompt "_USER'@'_CONNECT_IDENTIFIER> "
FOO_OWNER@FOO> 

You can also use variables such as _PRIVILEGE and _DATE, although all I really want is the user and schema. Unfortunately, with the above technique, if you ever connect to a database using a full connection string (rather than a tnsnames.ora entry), this prompt will be a bit unwieldy, e.g.,

$ sqlplus foo_owner/foo_pass@'(DESCRIPTION=(ADDRESS_LIST=(load_balance=on)(failover=on)(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.11)(PORT=1528)))(CONNECT_DATA=(SERVER=DEDICATED)(service_name=FOO)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=15))))'

SQL*Plus: Release 10.2.0.4.0 - Production on Wed Jan 11 12:47:19 2012

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning option

SQL> 
SQL> set sqlprompt "_USER'@'_CONNECT_IDENTIFIER> "        
FOO_OWNER@(DESCRIPTION=(ADDRESS_LIST=(load_balance=on)(failover=on)(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.11)(PORT=1528)))(CONNECT_DATA=(SERVER=DEDICATED)(service_name=FOO)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=15))))> 

I love that I can connect without a tnsnames.ora entry, but I would like a simple prompt, and I want it loaded every time I start sqlplus.

Fortunately, sqlplus will look in the current directory and in the $SQLPATH environment variable for a login.sql, and execute it automatically. For example, if you include the following in your login.sql:

SET TERMOUT OFF 
DEFINE sqlprompt=none
COLUMN sqlprompt NEW_VALUE sqlprompt
SELECT LOWER(SYS_CONTEXT('USERENV','CURRENT_USER')) || '@' || SYS_CONTEXT('USERENV','DB_NAME') as sqlprompt FROM DUAL;
SET SQLPROMPT '&sqlprompt> '
UNDEFINE sqlprompt
SET TERMOUT ON

Then whenever you run sqlplus the prompt will be set accordingly, e.g.,

$ sqlplus foo_owner/foo_pass@FOO

SQL*Plus: Release 10.2.0.4.0 - Production on Wed Jan 11 12:47:19 2012

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning option

foo_owner@FOO> show user
USER is "FOO_OWNER"
foo_owner@FOO> 
Posted in oracle, shell tips

reverse ssh tunnel

I would like ssh access to a protected host that is not directly accessible on the Internet but does have outbound access. This is a common scenario in corporate networks that often require a vpn for remote access; but in situations where vpn access is not available (e.g., I forgot my keyfob, or I don't want to install vpn software) a reverse ssh tunnel to a trusted host can be used instead.

Basically, a reverse ssh tunnel is set up by the private host (i.e., any host that's not directly accessible from the Internet). The private host connects out to a trusted host (that is accessible on the Internet).

A reverse ssh tunnel is usually started with something like,

$ ssh -R 12345:localhost:22 user@example.com

This will create a tunnel from example.com to the private host. In other words, you would simply login to example.com and issue the following command,

$ ssh user@localhost -p 12345

And you would now be logged into the private host.
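
Optionally, an entry in ~/.ssh/config on the trusted host makes this shorter (the host alias and user are illustrative):

Host private
    HostName localhost
    Port 12345
    User user

With that in place, ssh private does the same thing.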

The only problem is that you must establish the reverse tunnel from the private host. In order to keep this tunnel alive (without you physically being able to login to the private host), you can use a periodic cron job, e.g.,

#!/bin/bash
#
# e.g., crontab every 5 minutes
# */5 * * * * ~/.ssh/reverse_tunnel >/dev/null 2>&1
#
# -OR-
# */5 * * * * ~/.ssh/reverse_tunnel user@hostname port >/dev/null 2>&1

## set defaults
REMOTE_HOST=user@hostname
REMOTE_PORT=12345

SSH_KEY=/path/to/your/private.key

if [ $# == 2 ]; then
  REMOTE_HOST=$1
  REMOTE_PORT=$2
fi

## reverse tunnel command
REVERSE_TUNNEL="ssh -i $SSH_KEY -q -N -R $REMOTE_PORT:localhost:22 $REMOTE_HOST"

## start tunnel if not already running
pgrep -f -x "$REVERSE_TUNNEL" > /dev/null 2>&1 || $REVERSE_TUNNEL

## test tunnel by executing a command on the remote host
ssh -i $SSH_KEY $REMOTE_HOST netstat -an | egrep "tcp.*:$REMOTE_PORT.*LISTEN"  > /dev/null 2>&1
if [ $? -ne 0 ] ; then
   pkill -f -x "$REVERSE_TUNNEL"
   $REVERSE_TUNNEL
fi

This will maintain a tunnel on the public host to the private host. Now you can easily login to the private host from anywhere that has access to the public host.

Posted in shell tips, ssh

ssh agent across multiple hosts

I would like secure single sign-on across multiple hosts. An easy way to do this is with ssh-agent; however, ssh-agent by itself is a bit limited.

For example, the normal use of ssh-agent looks like this,

$ ssh-agent
SSH_AUTH_SOCK=/tmp/ssh-stSwW11394/agent.11394; export SSH_AUTH_SOCK;
SSH_AGENT_PID=11395; export SSH_AGENT_PID;
echo Agent pid 11395;

You would need to set those environment variables to use the newly created agent, and then use ssh-add to add your credentials, e.g.,

$ eval `ssh-agent`
Agent pid 11464
$ ssh-add
Enter passphrase for /home/foobar/.ssh/id_dsa: 
Identity added: /home/foobar/.ssh/id_dsa (/home/foobar/.ssh/id_dsa)
$ ssh-add -l
1024 84:e3:23:7b:f4:22:a2:da:53:fb:04:19:67:78:2b:3d /home/foobar/.ssh/id_dsa (DSA)
$

From here you can access any host that has your public key in ~/.ssh/authorized_keys.

However, any new shell instances on this same host would have to go through this process. Ideally, I want to login ONCE and ONLY ONCE while working across multiple hosts.

The following code snippet can be added to your .bashrc; it maintains a single agent across multiple shells per host, and also respects ssh auth forwarding. This code will also work in situations where your home directory is mounted across multiple hosts (maintaining an ssh-agent per host as needed, depending on where you login first).

# ssh agent -- for shared home directory across hosts
SSH_ENV=$HOME/.ssh/.environment.`hostname`
function start_agent {
  echo "Starting a new ssh-agent on this host"
  ssh-agent | sed 's/^echo/#echo/' > ${SSH_ENV}
  chmod 600 ${SSH_ENV}
  . ${SSH_ENV} > /dev/null
  ssh-add;
  echo succeeded
}

## ssh-agent
if [ -e "$SSH_AUTH_SOCK" ]; then
  echo "Using ${SSH_AUTH_SOCK}"
elif [ -f "${SSH_ENV}" ]; then
  echo "Using ${SSH_ENV}"
  . ${SSH_ENV} > /dev/null
  ps -ef | grep ${SSH_AGENT_PID} | grep ssh-agent$ > /dev/null || {
    echo "${SSH_ENV} agent is no longer running"
    start_agent;
  }
else
  start_agent;
fi

With this scenario you typically need to login only once per work session (regardless of which host you login to first). And with appropriate use of ssh auth forwarding, e.g., ssh -A user@host, you can jump around from host to host without constantly typing in the same password.
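
For example, rather than remembering -A every time, agent forwarding can be enabled per host in ~/.ssh/config (the host pattern is illustrative):

Host *.internal.example.com
    ForwardAgent yes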

Posted in shell tips, ssh

python, finding recurring pairs of data

I would like to find all pairs of data that appear together at least 10 times.

For example, given a large input file of keywords:

foo, bar, spam, eggs, durian, stinky tofu, ...
fruit, meat, vinegar, sphere, foo, ...
merlot, red, hearty, spam, oranges, durian, ...
...

"durian" and "spam" appear together twice and all other combinations appear together only once.

Rather than a brute-force approach counting through all possible keyword pairs, we can use python set operations to quickly determine the number of times two keywords appeared together.
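
For example, using the keyword lines above, each keyword maps to the set of line numbers it appears on; the intersection of two sets gives the co-occurrences (values here are illustrative):

>>> rows = {'spam': set([1, 3]), 'durian': set([1, 3]), 'foo': set([1, 2])}
>>> rows['spam'] & rows['durian']
set([1, 3])
>>> len(rows['spam'] & rows['durian'])
2

The full script below applies this idea to csv input: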

#!/usr/bin/env python
'''
Processes the given input and returns all pairs that appear together in 
at least 10 different lines.

Algorithm Explanation:

Hash each value such that each value maps a 'set' of row indexes 
(i.e., each line in the file)

Next, rather than compare all possible combinations of values, create a
shorter list of candidate values that have 10 or more row indexes.

The set-intersection will contain all rows that BOTH values were listed in;
if the size of the set-intersection is at least 10, output this pair.
'''

import codecs, csv, argparse, sys

def setIntersection(infile):
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)
    data_sets = {}
    i = 0
    for row in csv_data:
        i = i+1
        for a in row:
            if a in data_sets:
                data_sets[a].add(i)
            else:
                data_sets[a] = set([i])
    candidates = {a:s for a,s in data_sets.items() if len(s) >= 10}
    pairc = {a for a in candidates.keys()}
    for a,s in candidates.items():
        pairc.remove(a)
        for c in pairc:
            if len(s & candidates[c]) >= 10:
                print('%s, %s, %s' % (a,c,len(s & candidates[c])))

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--output-file', '-o', default='')
    p.add_argument('--bitmap', '-b', action="store_true")
    p.add_argument('files', nargs='+')
    args = p.parse_args()

    if (args.output_file != ''):
        sout = sys.stdout
        fh = open(args.output_file, 'w')
        sys.stdout = fh

    for infile in args.files:
        if args.bitmap:
            bitmapIntersection(infile)  # bitmap variant, defined below
        else:
            setIntersection(infile)

    if (args.output_file != ''):
        sys.stdout = sout
        fh.close()

if __name__ == '__main__':
    main()

Alternatively, we could use a bitmap comparison and compute the Hamming Weight, e.g.,

def bitmapIntersection(infile):
    '''
    The bitmap example is more general purpose and easily ported to C.  In fact, 
    the bitCount function is a python version of computing Hamming Weight from K&R.

    A simple example illustrates the bitmap,

      keyword1 : [0 0 0 0 0 0 1 0 1 0 ... ]
      keyword2 : [0 1 1 0 0 0 1 1 1 0 ... ]
             & : [0 0 0 0 0 0 1 0 1 0 ... ] <- bitwise-AND
    '''
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)
    data_maps = {}
    i = 0
    for row in csv_data:
        for a in row:
            if a in data_maps:
                data_maps[a] |= (1 << i)  # bitwise-OR so a repeated keyword in one row is safe
            else:
                data_maps[a] = (1 << i)
        i = i+1
    candidates = {a:b for a,b in data_maps.items() if bitCount(b) >= 10}
    pairc = {a for a in candidates.keys()}
    for a,b in candidates.items():
        pairc.remove(a)
        for c in pairc:
            bits = bitCount(b & candidates[c])
            if bits >= 10:
                print('%s, %s, %s' % (a,c,bits))

def bitCount(int_type):
    ''' ported from Kernighan & Ritchie's "The C Programming Language"
    '''
    count = 0
    while(int_type):
        int_type &= int_type - 1
        count += 1
    return(count)

A bitmap comparison is functionally identical to the set intersection but provides potential optimization enhancements such as using encoded vectors like the BitVector module.
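
For example, on Python 2.6+ a drop-in popcount can also be written with the builtin bin() (a small sketch, equivalent to bitCount above):

def bitCountFast(int_type):
    # count the '1' characters in the binary representation
    return bin(int_type).count('1')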

Posted in python

python, analyzing csv files, part 2

Previously, we discussed analyzing CSV files, parsing the csv into a native python object that supports iteration while providing easy access to the data (such as a sum by column header).

For very large files this can be cumbersome, especially where more advanced analytics are desired.

I would like to keep the same simple interface but use an in-memory database connection, thus transforming the CSV files into database tables for deeper analysis.

For example, I would like to do the following (leveraging the builtin sqlite3 module):

>>> 
>>> reports = Reports('/home/user/reports-1109')
>>> reports.billing_detail.sum('Tax Amount', {'Fiscal Period':'2011-09'})
Decimal('123321.1')
>>> 
>>> reports.tax_summary.sum('Amount', {'Fiscal Period':'2011-09'})
Decimal('123321.1')
>>> 
>>> len(reports.billing_detail)
719153
>>> 
>>> curs = reports.conn.cursor()
>>> curs.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
[(u'billing_detail',), (u'billing_summary',)]
>>> 

This approach can be orders of magnitude faster for even the most basic analysis. Furthermore, this allows OLAP cube analysis of the data from the CSV files, e.g.,

>>> 
>>> curs.execute('CREATE TABLE t_fact(id TEXT UNIQUE, b INT, t INT, r INT)').fetchall()
[]
>>> curs.execute('CREATE INDEX IF NOT EXISTS idxt ON t_fact(id)').fetchall()
[]
>>> 
>>> ## load some data into the fact table
>>> curs.execute('''INSERT OR REPLACE INTO t_fact(id,b,t,r)
                SELECT bd.%(id)s as id, bd.ROWID as b, ts.ROWID as t, rf.ROWID as r
                FROM billing_detail bd
                LEFT OUTER JOIN tax_summary ts ON bd.%(id)s = ts.%(tax_id)s
                LEFT OUTER JOIN refunds rf ON bd.%(id)s = rf.%(ref_id)s
                ''' % query_dict).fetchall()
[]
>>> 
>>> ## e.g., find billing records without tax summaries
>>> billings_without_tax = curs.execute('SELECT id FROM t_fact WHERE t IS NULL').fetchall()
>>> 

Using the same Report and Reports objects discussed previously, the code can be modified to leverage a database connection to support this type of analytics:

import sqlite3

class Report(collections.Mapping):
    def __init__(self, filehint, table = None, conn = sqlite3.connect(':memory:')):
        # note: the default conn is evaluated once at definition time, so by
        # default all Report instances share the same in-memory database
        self.filename = Reports.find_report(filehint)
        self.info = []
        self.headers = []
        self.table = table
        self.conn = conn
        self.indexes = []
        self._load()

    def _load(self):
        logging.debug('loading %s' %(self.filename))
        curs = self.conn.cursor()
        fh = open(self.filename)
        reader = csv.reader(fh)
        self.info = reader.next()
        self.headers = reader.next()
        columns = ', '.join(['c'+str(x) for x in range(len(self.headers))])
        columnTypes = ' TEXT, '.join(['c'+str(x) for x in range(len(self.headers))]) + ' TEXT'
        try:
            curs.execute('CREATE TABLE %s(%s)' %(self.table, columnTypes))
        except sqlite3.OperationalError as e:
            logging.debug('%s -- using existing table' %(e))
        else:
            curs.executemany('INSERT INTO %s (%s) VALUES(%s)' %(
                self.table, columns,
                '?, ' * (len(self.headers) -1) + '?'
            ), reader)
            self.conn.commit()
        curs.close()

    def _column(self, key):
        if key.lower() not in [x.lower() for x in self.headers]:
            raise IndexError('%s not in %s'%(key, self.table))
        return 'c' + str([x.lower() for x in self.headers].index(key.lower()))

    def create_index(self, col):
        col = self._column(col)
        icol = 'i' + col
        if icol not in self.indexes:
            logging.debug('adding index %s to %s(%s)' %(icol, self.table, col))
            curs = self.conn.cursor()
            curs.execute('CREATE INDEX IF NOT EXISTS %s ON %s(%s)' %(icol, self.table, col))
            curs.close()
            self.indexes.append(icol)

    def __getitem__(self, key):
        curs = self.conn.cursor()
        res = list(curs.execute('SELECT * FROM %s WHERE ROWID = %s' %(self.table, key+1)).fetchall()[0])
        curs.close()
        return res

    def __iter__(self):
        # note: the cursor must stay open for iteration; closing it here
        # would invalidate the result set
        curs = self.conn.cursor()
        self.__iter = curs.execute('SELECT * FROM %s' %(self.table))
        return self

    def next(self):
        return self.__iter.next()

    def __len__(self):
        curs = self.conn.cursor()
        ret = curs.execute('SELECT COUNT(*) FROM %s' %(self.table)).fetchall()[0][0]
        curs.close()
        return ret

    def get(self, column, value):
        '''get rows where column matches value'''
        curs = self.conn.cursor()
        column = self._column(column)
        ret = curs.execute('SELECT * FROM %s WHERE %s = "%s"' %(self.table, column, value)).fetchall()
        curs.close()
        return ret

    def sum(self, col, filter = {}):
        curs = self.conn.cursor()
        _where = []
        for k,v in filter.iteritems():
            _where.append(' %s = "%s" ' %(self._column(k),v) )
        ret = Decimal(str(curs.execute('SELECT SUM(%s) FROM %s %s' %(
            self._column(col),
            self.table,
            ' WHERE ' + ' AND '.join(_where) if _where else ''
            )).fetchall()[0][0]))
        curs.close()
        return ret
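
For repeated filtered sums it can help to index the filter column first, e.g. (reusing the column names from the earlier example):

>>> reports.billing_detail.create_index('Fiscal Period')
>>> reports.billing_detail.sum('Tax Amount', {'Fiscal Period':'2011-09'})
Decimal('123321.1')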
Posted in data arch., python, software arch.

python, analyzing csv files, part 1

I would like to analyze a collection of CSV (comma-separated-values) files in python. Ideally, I would like to treat the csv data as a native python object.

For example,

>>> financial_detail = Report('financial-detail.csv')
>>> transactions = []
>>> for row in financial_detail:
...   transactions.append(row['Transaction'])
...
>>> financial_detail.sum('Tax Amount')
Decimal('123456.10')
>>> 

Additionally, I would like to easily retrieve row data by column name, e.g.,

>>> financial_detail = Report('financial-detail.csv')
>>> 
>>> financial_detail[0]['Transaction']
'6324565'
>>> 
>>> ## the above should be equivalent to
>>> financial_detail.headers.index('Transaction')
7
>>> financial_detail[0][7]
'6324565'
>>> 

Given a directory of CSV files, I would like to retrieve these files (magically) by name, e.g.,

>>> reports = Reports('/directory/of/reports/')
>>> len(reports.sales_tax)
4884
>>> 
>>> len(reports.financial_detail)
512916
>>> reports.sales_tax.sum('Amount') - reports.financial_detail.sum('Tax Amount')
Decimal('0.0')
>>> 

This can be accomplished by leveraging the python builtin csv module, the collections.Mapping abstract base class, as well as the python magic __getattr__ methods, i.e.,

#!/usr/bin/env python
# vim: set tabstop=4 shiftwidth=4 autoindent smartindent:
import csv, glob, os, sys 
import collections
from decimal import Decimal

## set the logging level
import logging
logging.basicConfig(
  level=logging.DEBUG,
  format='%(levelname)s:%(filename)s:%(lineno)d -- %(message)s'
)



class Report(collections.Mapping):
    ''' 
    Interface into a CSV Report

    Parses the csv into an iterable collection and provides memory of pivot sums
    '''

    def __init__(self, *filehints):
        self.filename = Reports.find_report(*filehints)
        self.info = []
        self.headers = []
        self.data = []
        self.pivot_sums = {}
        self.load()

    def load(self):
        logging.debug('loading %s' %(self.filename))
        fh = open(self.filename)
        reader = csv.reader(fh)
        self.info = reader.next()
        self.headers = reader.next()
        for row in reader:
            self.data.append(row)
        self.curval = 0 

    def __getitem__(self, key):
        return VirtualRecord(self.data[key], self)

    def __iter__(self):
        self.curval = 0 
        return self

    def next(self):
        # return the current row, then advance (so data[0] is not skipped)
        if self.curval < 0 or self.curval >= len(self.data):
            raise StopIteration
        row = VirtualRecord(self.data[self.curval], self)
        self.curval += 1
        return row

    def __len__(self):
        return len(self.data)

    def sum(self, col):
        if col not in self.pivot_sums:
            logging.debug('computing sum for %s in %s' %(col, ', '.join(self.info)))
            sum = Decimal(0)
            index = col
            if isinstance(index, str):
                index = [x.lower() for x in self.headers].index(index.lower())
            for i in self.data:
                sum += Decimal(i[index]) if i[index] != '' else 0
            self.pivot_sums[col] = sum
        return self.pivot_sums[col]


class VirtualRecord():
    '''
    A virtual record within a Report object, only one instance exists at a time (not-thread safe) 
    to avoid the memory crunch from large report files
    '''
    _instance = None
    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(VirtualRecord, cls).__new__(cls)
        return cls._instance

    def __init__(self, row, parent):
        self.row = row
        self.parent = parent

    def __getitem__(self, key):
        index = key
        if isinstance(key, str):
            index = [x.lower() for x in self.parent.headers].index(key.lower())
        return self.row[index]

    def __len__(self):
        return len(self.row)


class Reports:
    '''
    Magic collection of billing reports
    '''

    def __init__(self, reportdir = '.'):
        self.reportdir = reportdir
        self.reports = {}  # per-instance cache of loaded reports

    def __getattr__(self, report):
        '''
        magically find reports within self.reportdir

        For example, if there is a report 'sales-tax.2011-09.20111018.csv'
        You can access this with an attribute 'sales_tax', e.g.,
        >>> 
        >>> reports = Reports()
        >>> len(reports.sales_tax)
        4884
        >>> 
        '''
        filename = Reports.find_report(self.reportdir, report.replace('_', '?') + '.*.csv')
        if filename is None or not os.path.exists(filename):
            return None
        if report not in self.reports:
            self.reports[report] = Report(filename)
        return self.reports[report]
        return self.reports[report]

    @classmethod
    def find_report(cls, *hints):
        files = glob.glob( os.path.join(*hints) )
        if len(files) == 1:
            return files[0]
        elif len(files) == 0:
            logging.error('No reports found: find_report(%s)' %(os.path.join(*hints)))
        else:
            logging.error('Found too many matching reports: find_report(%s) => %s' %(os.path.join(*hints), ', '.join(files) ))

Posted in data arch., python, software arch.

python, unique files by content

I would like to retrieve a list of unique files by content rather than by filename.

That is, if spam.txt and eggs.txt both contained the same contents I want only one of them to return. A very simple approach is to compute a SHA-1 checksum on each file, and build a dictionary with the checksum as the unique key.

#!/usr/bin/env python
# vim: set tabstop=4 shiftwidth=4 autoindent smartindent:
import hashlib, sys
import logging

def _dupdoc(filelist):
	'''
	returns a list of unique files (by content rather than filename)
	that is, if spam.txt and eggs.txt both contained the same contents, 
	only one filename will be returned
	'''
	shasums = {}
	for file in filelist:
		try:
			fh = open(file, 'rb')
			sha1 = hashlib.sha1(fh.read()).hexdigest()
			if sha1 not in shasums:
				shasums[sha1] = file
				logging.debug('%s %s' %(file, sha1))
		except IOError as e:
			logging.warning('could not open %s' %(file))
	uniquelist = [file for file in shasums.values()]
	return uniquelist


if __name__ == "__main__":
	'''
	command-line, accept either a list of files in STDIN
	or a single filename argument that contains a list of files
	'''

	filelist = []
	if len(sys.argv) > 1:
		fh = open(sys.argv[1], 'r')
		filelist = fh.readlines()
		fh.close()
	else:
		filelist = sys.stdin.readlines()
	filelist = [file.strip() for file in filelist]
	uniques = _dupdoc(filelist)
	for file in uniques:
		print file

The commandline __main__ portion of the program accepts an optional command-line argument; if no argument is specified, then the filelist will be read from STDIN, e.g.,

#  find test -type f | dupdoc
test/spam1.txt
test/spam9.txt
# 
Posted in python, shell tips

python daemon

I would like to create a python daemon, completely detaching from the terminal or parent process, and yet retaining any log handlers through the python logging module. There is a wonderful example of a well-behaved daemon at cookbook-278731; see also PEP 3143.

Borrowing from these examples, here is a simple daemonize function that will retain thread-safe logging handlers that were setup through the logging module. This is also available via github.

# vim: set tabstop=4 shiftwidth=4 autoindent smartindent:
'''
Daemon (python daemonize function)

Detach process through double-fork (to avoid zombie process), close all
file descriptors, and set working directory to root (to avoid umount problems)
'''

import os, resource
import logging

# target environment
UMASK = 0 
WORKINGDIR = '/' 
MAXFD = 1024
if (hasattr(os, "devnull")):
    REDIRECT_TO = os.devnull
else:
    REDIRECT_TO = "/dev/null"

def daemonize():
    '''Detach this process and run it as a daemon'''

    try:
        pid = os.fork() #first fork
    except OSError, e:
        raise Exception, "%s [%d]" % (e.strerror, e.errno)

    if (pid == 0): #first child
        os.setsid()
        try:
            pid = os.fork() #second fork
        except OSError, e:
            raise Exception, "%s [%d]" % (e.strerror, e.errno)

        if (pid == 0): #second child
            os.chdir(WORKINGDIR)
            os.umask(UMASK)
        else:
            os._exit(0)
    else:
        os._exit(0)

    #close all file descriptors except from non-console logging handlers
    maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
    if (maxfd == resource.RLIM_INFINITY):
        maxfd = MAXFD
    filenos = []
    for handler in logging.root.handlers:
        if hasattr(handler, 'stream') and hasattr(handler.stream, 'fileno') and handler.stream.fileno() > 2:
            filenos.append( handler.stream.fileno() )
    for fd in range(0, maxfd):
        try:
            if fd not in filenos:
                os.close(fd)
        except OSError:
            pass

    #redirect stdin, stdout, stderr to null
    os.open(REDIRECT_TO, os.O_RDWR)
    os.dup2(0, 1)
    os.dup2(0, 2)

    return(0)

Simply call the daemonize() function within your application, e.g.,

import logging
import Daemon

logging.info('daemonize?')
Daemon.daemonize()
logging.info('daemonized! we are now safely detached')

From a shell, console logging will only appear before daemonize() was called, e.g.,

# python test/daemon.py 
INFO:daemon.py:4 -- daemonize?
#

And the logfile output:

# cat test.log
2011-09-27 13:26:02,712 INFO:daemon.py:4 -- daemonize?
2011-09-27 13:26:02,717 INFO:daemon.py:6 -- daemonized! we are now safely detached
#
Posted in python, software arch.

python logging

I would like customizable logging in python applications, and I would like to easily send log messages to multiple handlers without any modification of the application code. The built-in logging module provides a very robust and easy-to-use logging capability.

In its simplest form, log messages will be sent to the console with minimal formatting, e.g.,

>>> import logging
>>> 
>>> logging.warning('this is a warning')
WARNING:root:this is a warning

In fact, your application and module code can simply use the built-in logging methods, which inherit any logging handlers set by the main application code.
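
For example, module code only needs a named logger (a minimal sketch; the handlers are inherited from the root logger):

import logging

log = logging.getLogger(__name__)
log.info('module-level message')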

Below is an example of a logging configuration that creates three logging handlers:

import logging

# Log to file
logging.basicConfig(
    filename='test.log',
    level=logging.INFO,
    format='%(asctime)-15s %(levelname)s:%(filename)s:%(lineno)d -- %(message)s'
)   

# Log to console
console = logging.StreamHandler()
console.setLevel(logging.DEBUG)
console.setFormatter(logging.Formatter('%(levelname)s:%(filename)s:%(lineno)d -- %(message)s'))
logging.getLogger().addHandler(console)

# Log to syslog
from logging.handlers import SysLogHandler
syslog = SysLogHandler(address='/dev/log')
syslog.setFormatter(logging.Formatter('%(asctime)-15s %(levelname)s:%(filename)s:%(lineno)d -- %(message)s'))
logging.getLogger().addHandler(syslog)

Once this module is imported, any logging calls to the root logger [e.g., logging.warning(), logging.critical(), etc.] will be processed by all three handlers. You can add as many log handlers as needed while your module code maintains a very simple logging interface, e.g.,

import logging

logging.debug('this is a debug msg')
logging.info('this is an info msg')
logging.warning('this is a warning msg')
logging.error('this is an error msg')
logging.critical('this is a critical error msg')

The console output will look like:

INFO:test_log.py:4 -- this is an info msg
WARNING:test_log.py:5 -- this is a warning msg
ERROR:test_log.py:6 -- this is an error msg
CRITICAL:test_log.py:7 -- this is a critical error msg

The filelog (and syslog) will look like:

2011-09-26 12:30:52,521 INFO:test_log.py:4 -- this is an info msg
2011-09-26 12:30:52,522 WARNING:test_log.py:5 -- this is a warning msg
2011-09-26 12:30:52,522 ERROR:test_log.py:6 -- this is an error msg
2011-09-26 12:30:52,522 CRITICAL:test_log.py:7 -- this is a critical error msg
Posted in python, software arch.

HTML5 canvas Mandelbrot

I would like to create an animated Mandelbrot visualization using JavaScript on an HTML5 <canvas> element. The Mandelbrot set, popularized by Benoît Mandelbrot, is the set of complex numbers c for which the iteration z_{n+1} = z_n^2 + c (starting from z_0 = 0) remains bounded. The iteration count is computed with an escape function; that is, for points in the set, z_n never "escapes" to infinity no matter how large n grows. Computing the Mandelbrot set can be as computationally complex as desired for a given visualization.

In JavaScript, the escape function can be written as follows:

Mandelbrot.prototype.escapeFunc = function(x, y) {
  var r = 0.0, i = 0.0, m = 0.0;
  var j = 0;
  while ((j < this.max) && (m < 4.0)) {
    j++;
    m = r * r - i * i;
    i = 2.0 * r * i + y;
    r = m + x;
  }
  return j;
}

For a given HTML5 canvas element, such as

<canvas id="mandelbrot" width="512" height="512">

A Mandelbrot set over the complex plane can be represented with the following object:

function Mandelbrot(m) {
  this.m = m;
  this.c = m.getContext("2d");
  this.width = m.width;
  this.height = m.height;
  this.SX = -1.75;  // start value real
  this.EX = 0.6;    // end value real
  this.SY = -1.125; // start value imaginary
  this.EY = 1.125;  // end value imaginary
  this.max = 128;     // max escape iterations (value is illustrative)
  this.opacity = 1.0; // alpha used when drawing pixels
  this.xzoom = (this.EX - this.SX) / (this.width*1.0);
  this.yzoom = (this.EY - this.SY) / (this.height*1.0);
}

Given these functions, rendering a Mandelbrot set on an HTML5 canvas element is as simple as looping through each of the pixels of the canvas, calculating the escape value, and drawing the pixel. Here is a simple render function:

Mandelbrot.prototype.render = function() {
  var prev_h = 0;
  for (var x = 0; x < this.width; x=x+1) {
    for (var y = 0; y < this.height; y=y+1) {
      var esc = this.escapeFunc(this.SX + x*this.xzoom, this.SY + y*this.yzoom);
      var h = 360 * (esc/this.max);
      if (h != prev_h) {
         var perc = Math.floor(100*(h/360));
         // only update fillStyle when the hue changes
         this.c.fillStyle='hsla('+ h + ','+ (perc+100) +'%,'+ (60-perc) +'%,'+ this.opacity  +')';
         prev_h = h;
      }
      this.c.fillRect(x,y,1,1);
    }
  }
}
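
Wiring it all together is then just (a sketch, assuming the canvas element and object above):

var canvas = document.getElementById("mandelbrot");
var mandel = new Mandelbrot(canvas);
mandel.render();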

If you have an HTML5 compatible browser you should see an animated example rendered on the canvas element.


Posted in html, javascript