Generate Junk Files

The other day I was benchmarking a delete script and needed to create files of various sizes. More specifically, 1,000,000 files of 5K each. A while ago I found this great snippet on StackOverflow to generate a random junk string:

junk =  (("%%0%dX" % (junk_len * 2)) % random.getrandbits(junk_len * 8)).decode("hex")

I’ve wrapped that in a small utility script:

import os, random, sys
# This tool takes 3 parameters
#   testing <directory to put files in> <how many files> <size of each file in bytes>
# Example:
#   testing dan 100 500
def createLocalDirectory( directoryName ):
  if not os.path.exists( directoryName ):
    os.makedirs( directoryName )
folderName      = sys.argv[1]
how_many_files  = int(sys.argv[2])
junk_len        = int(sys.argv[3])
createLocalDirectory( folderName )
for i in range( 0, how_many_files ):
  junk =  (("%%0%dX" % (junk_len * 2)) % random.getrandbits(junk_len * 8)).decode("hex")
  path = os.path.join( folderName, str(i) + ".txt" )
  # Binary mode, so Windows does not translate newline bytes and change the size.
  f = open( path, 'wb' )
  f.write( junk )
  f.close()
  print path
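For anyone on Python 3, where `str.decode("hex")` no longer exists, here is a minimal sketch of the same idea. It assumes `os.urandom` is an acceptable source of junk bytes. The function name `make_junk_files` and the `.bin` extension are my own choices, not part of the script above.

```python
import os


def make_junk_files(directory, how_many, size_bytes):
    """Create `how_many` files of `size_bytes` random bytes each in `directory`."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for i in range(how_many):
        path = os.path.join(directory, "%d.bin" % i)
        # Binary mode keeps the size on disk exactly equal to size_bytes.
        with open(path, "wb") as f:
            f.write(os.urandom(size_bytes))
        paths.append(path)
    return paths


if __name__ == "__main__":
    import tempfile

    # Small demo run in a throwaway directory (hypothetical sizes).
    with tempfile.TemporaryDirectory() as tmp:
        created = make_junk_files(os.path.join(tmp, "junk"), 10, 5 * 1024)
        print(len(created), "files created")
```

Returning the list of paths makes it easy to hand the files straight to whatever delete script is being benchmarked.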

Search In One File from Keywords in Another File

I needed to check whether a list of email addresses showed up in a log file. One file had the list of email addresses; the other was a log of emails sent. I needed to make sure that the emails were actually sent. Here’s a quick Python script I put together that does this:

import re
import sys
def searchInLogFile( FILE, query ):
  # Rewind, so every search starts from the top of the log.
  FILE.seek( 0, 0 )
  for line in FILE:
    logLine = line.strip()
    # re.escape: treat the dots in an address literally.
    if re.search( re.escape( query ), logLine, re.IGNORECASE ):
      return True
  return False
# This file has a list (\r\n delimited) email addresses.
EmailListFile = open( "email-list-internal.txt", "r")
# This is the log file which we'll use to see if email addresses are in here.
LogFile = open( "POST20100201.log", "r" )
EmailFound = []
EmailNotFound = []
breakTime = 0
# 0 = does the whole list
EmailsToSearchFor = 0
for emailLine in EmailListFile:
  email = emailLine.strip()
  if ( searchInLogFile( LogFile, email ) ):
    print email, "was found"
    EmailFound.append( email )
  else:
    print email, "not found"
    EmailNotFound.append( email )
  if ( EmailsToSearchFor != 0 ):
    breakTime += 1
    if ( breakTime == EmailsToSearchFor ):
      break
# Log results to a file.
OutputFile = open( "output.log", "w" )
divider = "\n\n======== Found ========================================"
print divider
OutputFile.write( divider )
for i in EmailFound:
  print i
  OutputFile.write( "\n" + i )
divider = "\n\n======== Not Found ===================================="  
print divider
OutputFile.write( divider )
for i in EmailNotFound:
  print i
  OutputFile.write( "\n" + i )
OutputFile.close()

Pretty straightforward. The script also writes a file called “output.log” which has a list of emails that were found (marked under “found”) and not found (marked under “not found”).
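The script rescans the log once per address, which gets slow on long lists. A rough Python 3 sketch of an alternative is to read the log once and do a plain case-insensitive substring check. The function name `split_found` and the sample data are made up for illustration, and it assumes the log fits in memory.

```python
def split_found(emails, log_lines):
    """Return (found, not_found) lists; matching is a case-insensitive substring test."""
    # Lower-case the log once instead of rescanning the file for every address.
    haystack = "\n".join(line.strip() for line in log_lines).lower()
    found, not_found = [], []
    for raw in emails:
        email = raw.strip()
        if not email:
            continue  # skip blank lines in the address list
        (found if email.lower() in haystack else not_found).append(email)
    return found, not_found


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two files.
    sample_log = ["2010-02-01 sent to Alice@example.com",
                  "2010-02-01 sent to bob@example.com"]
    sample_emails = ["alice@example.com\r\n", "carol@example.com\r\n"]
    print(split_found(sample_emails, sample_log))
```

Both arguments can be open file objects, since iterating a file yields its lines.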

Get Latest File

In my last post, I made a quick script that checks for the date. It was very limiting, since it used the dir command. This one uses several date/time Python modules and is more capable.

import os, os.path, stat, time
from datetime import date, timedelta, datetime
# Reference
def getFileDate( filenamePath ):
  # Modification time of the file as a datetime object
  used = os.stat( filenamePath ).st_mtime
  year, month, day, hour, minute, second = time.localtime(used)[:6]
  objDateTime = datetime(year, month, day, hour, minute, second)
  return objDateTime
  # Ways to reference this DateTime Object
  # objDateTime.strftime("%Y-%m-%d %I:%M %p")
  # objDateTime.year
  # objDateTime.month
def isDaysOldFromNow( filenamepath, days ):
  # Checks how old a file is. Is it at least "days" [variable] days old?
  inTimeRange = False
  timeDeltaDiff = ( datetime.now() - getFileDate( filenamepath ) ).days
  # Check if the file is "days" days old or more:
  if ( timeDeltaDiff >= days ):
    inTimeRange = True
  return inTimeRange
fname = "C:/temp/decision2.pdf"  
# Set this variable to check if the file is this days old
howOld = 3
if ( isDaysOldFromNow( fname, howOld ) ):
  print fname, "is at least", howOld, "days old"
else:
  print fname, "is NOT at least", howOld, "days old"
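For comparison, here is a shorter Python 3 sketch of the same age check. It assumes the file's modification time is the date that matters, as in the script above. The name `is_older_than` and its optional `now` parameter are my own additions.

```python
import os
from datetime import datetime, timedelta


def is_older_than(path, days, now=None):
    """True if the file's modification time is at least `days` days before `now`."""
    now = now or datetime.now()
    modified = datetime.fromtimestamp(os.path.getmtime(path))
    return now - modified >= timedelta(days=days)


if __name__ == "__main__":
    import tempfile
    import time

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "decision2.pdf")
        open(path, "w").close()
        # Pretend the file was last modified five days ago.
        five_days_ago = time.time() - 5 * 24 * 3600
        os.utime(path, (five_days_ago, five_days_ago))
        print(is_older_than(path, 3))
```

Comparing against a `timedelta` avoids unpacking the time tuple by hand, which is where the field order is easy to get wrong.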


Compress and Move Log Files

Sometimes log files bog a system down. For one of our servers, I made this little Python script that compresses the log files in a directory (via WinRAR) and then moves them to a backup location. The only catch is that I wanted to leave the latest log files in that directory. Log files are created daily, so the latest log files have today’s datestamp. Here’s how I did it.

First Create the Python Script:

import os
import datetime
dateStamp  = datetime.datetime.now().strftime("%Y-%m-%d")
imsLogPath = 'd:\\LogFiles\\'                     
# Use a UNC path rather than a mapped drive for network locations. Task Scheduler seems to choke on mapped drives when it calls Python.
newRARPath = '"\\\\Root\\backups\\' + dateStamp + '.rar"'
rarPath    = '"C:\\Program Files\\WinRAR\\rar.exe" a -m5 ' + newRARPath 
# Get Latest Files
smtpLatest   = os.popen(r"dir /od /a-d /b " + imsLogPath + "SMTP*.log").read().splitlines()[-1]
postLatest   = os.popen(r"dir /od /a-d /b " + imsLogPath + "POST*.log").read().splitlines()[-1]
ischedLatest = os.popen(r"dir /od /a-d /b " + imsLogPath + "iSched*.log").read().splitlines()[-1]
relayLatest  = os.popen(r"dir /od /a-d /b " + imsLogPath + "Relay*.log").read().splitlines()[-1]
qengLatest   = os.popen(r"dir /od /a-d /b " + imsLogPath + "Qeng*.log").read().splitlines()[-1]
# Get List of All Files
allFiles     = os.popen(r"dir /od /a-d /b " + imsLogPath + "*.log").read().splitlines()
# Remove Latest Files from All Files List
allFiles.remove( smtpLatest )
allFiles.remove( postLatest )
allFiles.remove( ischedLatest )
allFiles.remove( relayLatest )
allFiles.remove( qengLatest )
# allFiles Array Has the list of files
# Flatten Array allFiles to be used as a parameter in system command
flatLogPathList = ""
for filenameWithPath in allFiles:
  flatLogPathList = flatLogPathList + imsLogPath + filenameWithPath + " "
# Execute WinRAR
path = rarPath + " " + flatLogPathList.rstrip()
rarStatus = os.system( '"' + path + '"' )
# Delete the archived log files, but only if WinRAR reported success (exit code 0)
if rarStatus == 0:
  os.system( '"del ' + flatLogPathList.rstrip() + '"' )
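The "keep the newest file of each log type" step above could also be done without shelling out to `dir`. This is only a Python 3 sketch of that selection step. It assumes "latest" means most recently modified. The function name `logs_to_archive` is hypothetical, and the WinRAR call itself is left out.

```python
import glob
import os


def logs_to_archive(log_dir, prefixes):
    """Return every *.log file in log_dir except the newest file for each prefix."""
    all_logs = set(glob.glob(os.path.join(log_dir, "*.log")))
    for prefix in prefixes:
        matches = glob.glob(os.path.join(log_dir, prefix + "*.log"))
        if matches:
            # Keep the most recently modified file for this log type.
            all_logs.discard(max(matches, key=os.path.getmtime))
    return sorted(all_logs)
```

With the directory from the script, a call would look like `logs_to_archive("d:\\LogFiles", ["SMTP", "POST", "iSched", "Relay", "Qeng"])`. A prefix with no matching files is simply skipped, whereas the `[-1]` lookups in the script would raise an error.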

Then I set up a Scheduled Task to run the script.

[Screenshots: the Scheduled Task and its settings]

Recursion vs For-Loop

So I’m currently in the process of reading the famous “Code Complete” by Steve McConnell. So far it’s been an amazing book and I’d definitely recommend it to any programmer out there. I’ve just read the section on recursion, which mentions that recursion for a factorial (or Fibonacci) function is not as efficient as a for-loop. I guess I never thought about it, since in computer science classes recursion was always shoved down my throat when doing factorials. I agree with him that computer science professors are eager to apply recursion to factorials, and I can’t remember a professor ever mentioning that it’s not the most efficient way. McConnell states in the book that using recursion for factorials:

  1. Is not as fast as a for-loop.
  2. Is not as clear to read as a for-loop.
  3. Makes the use of run-time memory unpredictable.

Just for fun, I wanted to test his point on speed. This is a Python script that compares the average speed of a factorial using a for-loop and using recursion. I noticed that for anything below 3000! the two functions took exactly the same time. It was only when I bumped it up to 5000!, which is a huge number (16,326 digits), that a difference showed up. Luckily Python lets you work with very large numbers easily. I just had to raise Python’s recursion limit from the default of 1000.

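import win32api
import sys
# 5000 nested calls is well past the default limit of 1000 (6000 is an assumed value with some headroom)
sys.setrecursionlimit( 6000 )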
def factorial_forloop( n ):
  count = 1
  for i in range( n, 0, -1 ):
    count = count * i  
  return count
def factorial_recursion(n):
  if n == 0:
     return 1
  else:
     return n * factorial_recursion(n-1)
total_time_recursion = 0
total_time_forloop   = 0
number_of_tries      = 500
for i in range( number_of_tries ):
  start = win32api.GetTickCount()
  factorial_recursion( 5000 )
  end = win32api.GetTickCount()
  total = end - start  
  total_time_recursion += total
  start = win32api.GetTickCount()
  factorial_forloop( 5000 )
  end = win32api.GetTickCount()
  total = end - start
  total_time_forloop += total  
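print "\n"
# Average per call, in seconds (GetTickCount returns milliseconds)
print "Average time for recursion: ", ( total_time_recursion / float( number_of_tries ) ) * .001
print "Average time for for-loop: ", ( total_time_forloop / float( number_of_tries ) ) * .001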

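So in 500 tries, the results were as follows. (The run that produced these figures divided each total by 10 rather than by the number of tries, so each one is roughly 50 times the per-call average.)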

Average time for recursion:  1.284 seconds
Average time for for-loop:   1.083 seconds

It doesn’t seem like much, but the results are interesting. Then again, a factorial is a very simple algorithm. In future posts I’ll try to test more complicated algorithms and see how they battle it out. Also, this is Python. The results for C, C++, or Java may differ.
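If you want to repeat the comparison without win32api, here is a portable Python 3 sketch using the standard `timeit` module. The function names and the values of `n` and `repeats` are my own choices, and I make no claim about what numbers it will print on your machine.

```python
import sys
import timeit


def factorial_loop(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


def factorial_recursive(n):
    return 1 if n == 0 else n * factorial_recursive(n - 1)


def average_seconds(func, n, repeats):
    """Average wall-clock seconds per call of func(n) over `repeats` calls."""
    return timeit.timeit(lambda: func(n), number=repeats) / repeats


if __name__ == "__main__":
    n, repeats = 1500, 50
    # The recursive version needs roughly n stack frames, so lift the default limit.
    sys.setrecursionlimit(max(sys.getrecursionlimit(), n + 200))
    print("recursion:", average_seconds(factorial_recursive, n, repeats))
    print("for-loop: ", average_seconds(factorial_loop, n, repeats))
```

`timeit` uses a high-resolution clock, so it should be less sensitive to timer granularity than a millisecond tick counter.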

yUML and ColdFusion

I just tried to write a quick Python script that scans CFCs and generates a yUML URL for a diagram. I pointed my script at my root CFC path and got a URL about 13K characters long. I pasted it into the address bar to see what would happen and got the following:

Request-URI Too Large
The requested URL's length exceeds the capacity limit for this server.
Apache/2.2.3 (Debian) Phusion_Passenger/2.0.2 Server at Port 80

I wonder what the limit is. I suppose I’ll have to generate one diagram per CFC and then bind them together somehow. I’m choosing Python so this script can be part of my build script.

Here’s the code so far, which of course, could be optimized:

import re
import os
# UML Syntax
# [
#   User
#   |
#     Property1;
#     Property2
#   |
#     Method1();
#     Method2()
#  ]
# Master Path
ROOT_PATH = 'C:\\temp\\cf-yuml'
def SearchForFile( rootpath, searchfor, includepath = 0 ):
  # Search for a file recursively from a root directory.
  #  rootpath  = root directory to start searching from.
  #  searchfor = regexp to search for, e.g.:
  #                 search for *.jpg : \.jpg$
  #  includepath = appends the full path to the file
  #                this attribute is optional
  # Returns a list of filenames that can be used to loop
  # through.
  # TODO: Use the glob module instead. Could be faster.  
  names = []
  append = ""
  for root, dirs, files in os.walk( rootpath ): 
    for name in files:
      if re.search( searchfor, name ):
        if includepath == 0:
          root = ""
        else:
          append = "\\"
        names.append( root + append + name )
  return names  
def getCFCInfo ( FILE, path ):
  FILE.seek( 0, 0 )
  CFCLines = FILE.readlines()
  CFCFunctions  = []
  CFCProperties = []
  CFC           = {}
  for i in CFCLines:
    # Get names of methods
    if re.search( "^<cffunction", i, re.IGNORECASE | re.MULTILINE ):
      CFCFunctions.append( re.search( r'name\s*=\s*"([\w$-]+)"', i, re.DOTALL | re.IGNORECASE ).group(1) )
    # Get names of properties
    if re.search( "^<cfproperty", i, re.IGNORECASE | re.MULTILINE ):
      CFCProperties.append( re.search( r'name\s*=\s*"([\w$-]+)"', i, re.DOTALL | re.IGNORECASE ).group(1) )
  CFC = { "properties":CFCProperties, "methods":CFCFunctions }
  # Generate URL
  strFunctions  = ""
  strProperties = ""
  for i in CFCFunctions:
    strFunctions  += i + "();"
  for i in CFCProperties:
    strProperties += i + ";"
  # Class name = file name without the directory and the .cfc extension
  CFCFileName = re.search( r"\\([\w-]+)\.cfc$", path, re.DOTALL | re.IGNORECASE ).group(1)
  return "[" + CFCFileName + "|" + ( strProperties.strip()[:-1] + "|" if strProperties.strip()[:-1] else "" ) + strFunctions.strip()[:-1] + "]"
URL = ""
for i in SearchForFile( ROOT_PATH, "\.cfc$", 1 ):
  CFCFile = open( i, "r" )
  URL += getCFCInfo( CFCFile, i ) + ","
  CFCFile.close()
URL = URL[:-1]
print "" + URL

I’ll keep working on this as time goes on. So far it just goes through all the CFCs under the path you point it to, crawling through all subdirectories. There are no relationships between classes, however. Not yet at least.
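One possible way around the Request-URI error is to split the class boxes across several shorter URLs. The Python 3 sketch below is only an illustration of that idea. I don't know the server's actual limit, so `max_len` is a guess you would have to tune, and both function names are hypothetical. The box layout roughly follows what `getCFCInfo` returns.

```python
def yuml_class(name, properties, methods):
    """Build one yUML class box: [Name|prop1;prop2|method1();method2()]."""
    parts = [name]
    if properties:
        parts.append(";".join(properties))
    if methods:
        parts.append(";".join(m + "()" for m in methods))
    return "[" + "|".join(parts) + "]"


def chunk_specs(specs, max_len):
    """Group class boxes into comma-joined strings no longer than max_len.

    A single box longer than max_len still gets a chunk of its own.
    """
    chunks, current = [], ""
    for spec in specs:
        candidate = spec if not current else current + "," + spec
        if current and len(candidate) > max_len:
            chunks.append(current)
            current = spec
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then become its own diagram URL, which fits the "bind them together somehow" plan but does not solve it.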

Python and SQL Server

Setting up Python to connect to SQL Server was relatively easy. First, you select a DB API driver. I chose pyodbc because I saw a Python article on Simple-Talk. There are two simple steps:

  1. Install Pywin32. Get the latest. It’s a dependency for pyodbc.
  2. Install pyodbc. Get it for the version of Python you’re using.

Once you’ve done this, you can query your SQL Server db like so:

import pyodbc
connection = pyodbc.connect('DRIVER={SQL Server};SERVER=;DATABASE=MyAwesomeDB;UID=sa;PWD=password')
cursor = connection.cursor()
cursor.execute("select * from states")
for row in cursor:
  print row.StateID, row.Abbreviation, row.Name

For more snippets and a tutorial, check out the documentation.

Now let’s try something more interesting. Let’s try doing some inserts and see how long it takes.

import win32api
import uuid
import pyodbc 
connection = pyodbc.connect('DRIVER={SQL Server};SERVER=;DATABASE=MrSkittles;UID=sa;PWD=password')
cursor = connection.cursor()
_start = win32api.GetTickCount()
for i in range( 0, 10000 ):
  # Let's insert two pieces of data, both random UUIDs, passed as query parameters.
  sql = "INSERT INTO Manager VALUES( ?, ? )"
  cursor.execute( sql, str( uuid.uuid4() ), str( uuid.uuid4() ) )
_end = win32api.GetTickCount()
_total = _end - _start
# pyodbc connections start with autocommit off, so commit to keep the rows.
connection.commit()
print "\n\nProcess took", _total * .001, "seconds"

After some tests, 10,000 records took roughly 20-30 seconds, and 1,000,000 records took 30 to 40 minutes. A bit slow, but it’s not a server machine. My machine is a 1.8GHz Core Duo with ~4GB of RAM (with PAE) on Windows XP, but I ran this in a VMware VM with 1GB of RAM running SQL Server 2005 on Windows Server 2003. The table had two columns, both varchar(50). On a server machine, it should be a helluva lot faster.
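One thing that might be worth trying is sending the rows as a single batch with the DB-API `executemany` call and committing once. The Python 3 sketch below uses the standard library's `sqlite3` as a stand-in so it runs anywhere. pyodbc uses the same `?` parameter style, but I have not measured whether batching helps against SQL Server. The function name `insert_random_rows` is my own.

```python
import sqlite3
import time
import uuid


def insert_random_rows(connection, n):
    """Insert n rows of two random UUID strings into Manager; return elapsed seconds."""
    rows = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(n)]
    cursor = connection.cursor()
    start = time.perf_counter()
    # One parameterized statement for the whole batch, committed once at the end.
    cursor.executemany("INSERT INTO Manager VALUES (?, ?)", rows)
    connection.commit()
    return time.perf_counter() - start


if __name__ == "__main__":
    # In-memory SQLite stands in for SQL Server so the sketch runs anywhere.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Manager (a VARCHAR(50), b VARCHAR(50))")
    print("10,000 rows took", insert_random_rows(conn, 10000), "seconds")
```

The timing from SQLite says nothing about SQL Server. The point is only the shape of the call.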

IIS Logs Scripts

While working with some IIS logs, I decided to start practicing my Python. I put together some handy Python functions for working with IIS log files. On a 2.5GHz WinXP machine with 3GB of RAM, these functions take about 3 seconds to process a 180MB text file. The code could be optimized further if you’re dealing with larger files.

#!/usr/bin/env python
# An IIS log file can have various log properties. Every time you change the columns logged
# in IIS, it writes a new header row listing the columns.
import re
import os
MainLogDelimiter = "#Software: Microsoft Internet Information Services 6.0"
TestFile         = "C:\\Dan\\IIS-Log-Import\\Logs\\not-the-same.txt"
BigTestFile      = "C:\\Dan\\IIS-Log-Import\\Logs\\ex090914\\ex090914.log"
LogsDir          = "C:\\Dan\\IIS-Log-Import\\Logs"
def SearchForFile( rootpath, searchfor, includepath = 0 ):
  # Search for a file recursively from a root directory.
  #  rootpath  = root directory to start searching from.
  #  searchfor = regexp to search for, e.g.:
  #                 search for *.jpg : \.jpg$
  #  includepath = appends the full path to the file
  #                this attribute is optional
  # Returns a list of filenames that can be used to loop
  # through.
  # TODO: Use the glob module instead. Could be faster.  
  names = []
  append = ""
  for root, dirs, files in os.walk( rootpath ): 
    for name in files:
      if re.search( searchfor, name ):
        if includepath == 0:
          root = ""
        else:
          append = "\\"
        names.append( root + append + name )
  return names  
def isSameLogProperties( FILE ):
  # Tests to see if a log file has the same number of columns throughout
  # This is in case new column properties were added/subtracted in the course
  # of the log file.
  FILE.seek( 0, 0 )
  SubLogs = FILE.read().split( MainLogDelimiter )
  # SubLogs[0] Stores the number of different log variations in the log file
  SubLogs[0] = len( SubLogs ) - 1
  # Grab the column names from the log file, separated by space
  columns = re.search( "^#Fields:\s([\w\-()\s]+)$", SubLogs[1], re.IGNORECASE | re.MULTILINE ).group(1)
  LogSameProperties = True
  for i in range( 2, SubLogs[0] + 1 ):
    # If there are columns
    if ( len( columns ) > 0 ):
      if ( columns != re.search( "^#Fields:\s([\w\-()\s]+)$", SubLogs[i], re.IGNORECASE | re.MULTILINE ).group(1) ):
        LogSameProperties = False
  return LogSameProperties
def getFirstColumn( FILE ):
  # This gets the columns from a log file. It returns only the first set of columns, and ignores any other
  # column row that may exist in case new columns were added/subtracted in IIS.
  # input: FILE
  # output: 1 single element List
  FILE.seek( 0, 0 )
  names = []
  # Grab the column names from the log file, separated by space
  names.append( re.search( "^#Fields:\s([\w\-()\s]+)$", FILE.read().split( MainLogDelimiter )[1], re.IGNORECASE | re.MULTILINE ).group(1).strip() )
  return names
def getAllColumns( FILE ):
  # This gets all the columns from a log file.
  # input: FILE
  # output: List
  FILE.seek( 0, 0 )
  names = []
  SubLogs = FILE.read().split( MainLogDelimiter )
  # SubLogs[0] Stores the number of different log variations in the log file
  SubLogs[0] = len( SubLogs ) - 1
  for i in range( 1, SubLogs[0] + 1 ):
    names.append( re.search( "^#Fields:\s([\w\-()\s]+)$", SubLogs[i], re.IGNORECASE | re.MULTILINE ).group(1).strip() )
  return names
# Loop through all the IIS log files in the directory
for logPath in SearchForFile( LogsDir, "\.txt$", 1 ):
  LogFile = open( logPath, "r" )
  if ( isSameLogProperties( LogFile ) ):
    print logPath, "the same"
  else:
    print logPath, "not the same"
  LogFile.close()
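The same-columns check can also be written against the `#Fields:` lines directly, without splitting on the `#Software:` delimiter. This is a small Python 3 sketch under that assumption. The names `field_headers` and `has_same_columns` are mine, and it takes the log contents as a string rather than a file object.

```python
import re

# One '#Fields:' header per line; trailing spaces and carriage returns are ignored.
FIELDS_RE = re.compile(r"^#Fields:[ \t]*(.+?)[ \t\r]*$", re.MULTILINE)


def field_headers(log_text):
    """Return the column list from every '#Fields:' line, in file order."""
    return [match.group(1).split() for match in FIELDS_RE.finditer(log_text)]


def has_same_columns(log_text):
    """True if every '#Fields:' header in the log lists the same columns."""
    headers = field_headers(log_text)
    return all(h == headers[0] for h in headers[1:])
```

A log with no `#Fields:` line at all counts as consistent here, which may or may not be what you want.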

I’ve Switched to Python from Perl

So I’ve finally dumped Perl for my systems scripts. Part of the reason was maintainability. In my own benchmarks, Perl seems to beat Python at simple text parsing and file manipulation, which is what I use it for most of the time. Ugh. I do find, though, that on most teams Perl can be cryptic and unnecessarily hard to jump into. Python solves this. After playing around with it for about a week, I think Python is a much more elegant language. Python will be a great addition to my toolkit for system automation. It’s much easier to apply OOP principles and write readable code. It’s a pleasure to write in this language and I look forward to learning more about it.

Also, while searching for performance tests on which language was “faster,” I ran across this site: The Great Win32 Computer Language Shootout. It’s not to be used as a definitive guide, of course, but I think it does serve as a baseline for very simple tasks in a language.

On a related note, here’s a great video I saw on “Python in the Enterprise – How to Get Permission”:

If you start your own company or run your own project you can usually choose the programming language, but if you work for a large company there are probably architects and others who keep a tight rein on approved technology. How do you steer a big ship towards dynamic programming languages, and how fast can it turn? Come hear the story of one software developer employee who in 20 months facilitated the adoption of Python as the standard scripting language for an enterprise with 25,000 employees. Leave with ideas for advancing dynamic programming languages in your workplace, and with hope that change is possible.

I looked into Ruby and found various similarities. Python sold me due to its larger community and wider use in the wild. I took a look at PHP for system scripting, but it wasn’t fast enough for parsing large files. Lastly, I thought about JavaScript on the console via JSDB, but then realized its native library wasn’t as broad as Python’s. I really love that Python is getting a lot of momentum from Google, and that Microsoft is doing more to support the IronPython (Python on .NET) platform.