2009年5月4日星期一

A Lambda Calculus Interpreter in Haskell

I will present a Lambda Calculus interpreter in this post. It is an (almost) exact translation from the ML code in chapter 9 of ML for the Working Programmer by Lawrence C. Paulson. This program incorporates a parser to convert string to an internal represenation of lambda expression, a pretty printer to reconvert the internal represenation to a formatted string in a human-readable form, and two evaluation functions (lazy and strict respectively) to reduce a given lambda expression to its normal form.

Enjoy.

Code below:
------------

module Lci (
            Lambda
           ,reader           -- String -> Lambda
           ,pr               -- Lambda -> String
           ,byValue          -- Lambda -> Lambda
           ,byName           -- Lambda -> Lambda
           ,evalV            -- String -> Lambda
           ,evalN            -- String -> Lambda
           ,tryV             -- String -> IO ()
           ,tryN             -- String -> IO ()
           ) where

-- Parser
import Text.ParserCombinators.Parsec
import qualified Text.ParserCombinators.Parsec.Token as T
import Text.ParserCombinators.Parsec.Language (emptyDef)

-- Pretty Printing
import qualified Text.PrettyPrint as PP
import Text.PrettyPrint (Doc, (<>), (<+>))

-- Environment
import qualified Data.Map as Map
import Data.Maybe


-------------------
-- Data
-------------------

data Lambda = Free String
            | Bound Int
            | Abs String Lambda
            | App Lambda Lambda
              deriving (Eq, Show, Read)

type Env = Map.Map String Lambda
 

-------------------
-- Parser
-------------------

abstract :: Int -> String -> Lambda -> Lambda
abstract i s (Free s') | s == s'   = Bound i
                       | otherwise = Free s'
abstract _ _ (Bound i) = Bound i
abstract i s (Abs s' t) = Abs s' (abstract (i+1) s t)
abstract i s (App t1 t2) = App (abstract i s t1) (abstract i s t2)

-- Abstraction over several free variables
absList :: [String] -> Lambda -> Lambda
absList xs t = foldr (\x acc -> Abs x (abstract 0 x acc)) t xs

-- Application of t to several terms               
appList :: Lambda -> [Lambda] -> Lambda
appList = foldl (\acc x -> App acc x)

-- Shift a term's non-local indices by i
shift :: Int -> Int -> Lambda -> Lambda
shift 0 _ u = u
shift i d u@(Free s) = u
shift i d u@(Bound j) = if j >= d then Bound (j+i) else Bound j
shift i d (Abs s t) = Abs s (shift i (d+1) t)
shift i d (App t1 t2) = App (shift i d t1) (shift i d t2)

-- Substitution for bound variable

subst :: Int -> Lambda -> Lambda -> Lambda
subst _ _ t@(Free _) = t
subst i u t@(Bound i') | i' <>
                       | i' == i   = shift i 0 u
                       | otherwise = Bound (i' - 1) -- non-local to t
subst i u (Abs s t') = Abs s (subst (i+1) u t')
subst i u (App t1 t2) = App (subst i u t1) (subst i u t2)

-- Lexer & Parser
lexer :: T.TokenParser ()
lexer = T.makeTokenParser emptyDef

whiteSpace= T.whiteSpace lexer
lexeme    = T.lexeme lexer
symbol    = T.symbol lexer
parens    = T.parens lexer
identifier= T.identifier lexer

p_term :: Parser Lambda
p_term = do{ symbol "^"
           ; ids <- many1 identifier
           ; symbol "."
           ; t <- p_term
           ; return $ absList ids t
           }
     <|> do{ t <- p_atom
           ; ts <- many p_atom
           ; return $ appList t ts
           }

p_atom :: Parser Lambda
p_atom = do{ s <- identifier
           ; return $ Free s
           }
     <|> parens p_term

p_top :: Parser Lambda
p_top = do{ whiteSpace
          ; t <- p_term
          ; eof
          ; return t
          }

reader :: String -> Lambda
reader input = case (parse p_top "" input) of
                  Left err -> error $ show err
                  Right x  -> x


-------------------
-- Pretty Printing
-------------------

-- Free variables in a term
vars :: Lambda -> [String]
vars (Free s) = [s]
vars (Bound _) = []
vars (Abs s t) = vars t
vars (App t u) = vars t ++ vars u

-- Rename variable "a" to avoid clashes
rename bs a = if a `elem` bs then rename bs (a ++ "'") else a

-- Remove leading lambdas, return bound variable names
stripAbs t = strip [] t
    where strip bs (Abs s t) =
              let b = rename (vars t) s
              in strip (b:bs) (subst 0 (Free b) t)
          strip bs u = (reverse bs, u)

pr_term :: Lambda -> Doc
pr_term (Free s) = PP.text s
pr_term (Bound i) = PP.text "??UNMATCHED INDEX??"
pr_term t@(Abs _ _) = 
    let (b:bs, u)     = stripAbs t
        spaceJoin b z = " " ++ b ++ z
        binder        = "^" ++ b ++ (foldr spaceJoin ". " bs)
    in PP.sep [PP.text binder, pr_term u]
pr_term t = PP.sep (pr_app t)
            
pr_app (App t u) = pr_app t ++ [PP.nest 1 (pr_atom u)]
pr_app t         = [pr_atom t]
pr_atom (Free s) = PP.text s
pr_atom t        = PP.nest 1 $ PP.parens $ pr_term t

pr :: Lambda -> String
pr = PP.render . pr_term


-------------------
-- Reduction
-------------------

eval (App t1 t2) =
    case eval t1 of
      (Abs a u) -> eval (subst 0 (eval t2) u)
      t         -> App t (eval t2)
eval t = t

byValue :: Lambda -> Lambda
byValue t = bodies (eval t)
    where bodies (Abs a t) = Abs a (byValue t)
          bodies (App t u) = App (bodies t) (bodies u)
          bodies t         = t

headNF (Abs a t) = Abs a (headNF t)
headNF (App t u) = 
    case headNF t of
      (Abs a t') -> headNF (subst 0 u t')
      u'         -> App u' u
headNF t = t

byName :: Lambda -> Lambda
byName = args . headNF
    where args (Abs a t) = Abs a (args t)
          args (App t u) = App (args t) (byName u)
          args t         = t


-----------------------------------------
-- Standard Environment (Lambda Prelude)
-----------------------------------------

-- Substitution for free variables from environment
inst :: Env -> Lambda -> Lambda
inst env t@(Free s)  = fromMaybe t (Map.lookup s env)
inst env t@(Bound i) = t
inst env (Abs s t')  = Abs s (inst env t')
inst env (App t1 t2) = App (inst env t1) (inst env t2)

insertEnv :: Env -> (String, String) -> Env
insertEnv env (a, s) = Map.insert a (reader s) env

stdEnv = foldl insertEnv Map.empty [
          -- booleans
          ("true",  "^x y. x"), ("false", "^x y. y")
         ,("if",    "^p x y. p x y")
          -- pairs
         ,("pair",  "^x y f. f x y")
         ,("fst",   "^p. p true"), ("snd", "^p. p false")
         ]


-------------------
-- Evaluators
-------------------

evalWith :: (Lambda -> Lambda) -> Env -> String -> Lambda
evalWith fn env = fn . inst env . reader
              
evalV = evalWith byValue stdEnv
evalN = evalWith byName  stdEnv

tryV = putStrLn . pr . evalV
tryN = putStrLn . pr . evalN

2009年2月6日星期五

Delayed-Branch-Slot Optimization and the Result

Background

 An Instruction Set Simulator (ISS) is a program that takes stream of target instructions and simulates the effects of execution of the code stream on a host machine. Usually the target and the host are of different architectures, but nothing prevents you from running a, say, x86 simulator on a PC. Base on this definition, an ISS is nothing but an ordinary interpreter, and from the fact that it interprets machine code instead of high level languages, hence freeing from lexical analysis, parsing, byte-code compilation, garbage collection…, etc., all sorts of complications one normally finds in modern high level language interpreters, it seems that an ISS is a trivial (often labor-intensive though) construct to make.

 Well, not quite so…

 The tricky thing is that you want your simulator runs fast, really fast. A naive implementation with a central “read-evaluate-loop” structure could hardly exceed the speed of interpreting over 10 million target instructions per second (MIPS). While 10 MIPS may be okay for running standalone, user-mode applications, it is hardly enough for running an OS, which easily takes billions of instructions just to boot up.

 Threaded-code is an elegant control structure that strike at a sweet point of balance between implementation complexity and run-time efficiency. A threaded-code ISS with is capable of running system level code at about 100 MIPS with full accuracy at the instruction level.

 This is the technique we actually adopted in developing an ISS for a RISC processor. While happy with the end performance, we want to see if we can squeeze more runtime efficiency out of it. One thing that made me feel less satisfactory in our implementation was the way we coped with branch delay slot.

 Conceptually, processor executes instructions sequentially, one following another, until it reaches a branch or jump instruction, where it branches to an effective address other than the one following it. However, in some RISC machines, to better use of the pipeline structure, branching is delayed after the execution of the instruction immediately following it. The position following the branch instruction is called the branch delay slot. This is one of the few occasions where hardware natures becoming visible to programmers.

 One way to simulate this behavior -- perhaps the most obvious one -- is to keep a Boolean valued flag, whose value is set by every branch/jump instruction (if the branch/jump is taken), and cleared otherwise. This flag is checked at the end of the service routine of every normal instruction. If it is set, we jump to a pre-calculated target address; if it is not, we simple increase the instruction pointer and move on to the next instruction.

 A slightly better solution that can do away with the conditional statement associated with the Boolean flag is to maintain two instruction pointers, ip and next_ip at the same time, so that we always jump to the address pointed to by next_ip after the execution of every instruction.

 Both of these two solutions use an auxiliary variable to simulate the correct execution flow of the processor in the presence of branch delay slot. In either case, a runtime overhead is added for every normal instruction, even though most instructions do not locate in the delay slot.

 Can we do better?

Idea for Optimization

 It did not take long before I realized that whether an instruction locates in a delayed slot is a matter that can be resolved in the decoding phase, e.g., it is independent of the state of the processor. Hence for each instruction that may possibly reside in the delayed slot, we create two specialized intermediate form instructions. The instruction service routines of the two specialized forms have most of their interpretative code in common but have different dispatching semantics. Instructions that do not locate in delayed slot will simply increase the instruction pointer and jumping to the instruction service routine of the next instruction. Instructions that locates in the delayed slot in general must check if the branch was taken (for conditional branch/jump instructions) and if so, set the instruction pointer to the target address calculated by the previous branch/jump instruction and transfer control to the service routine of the instruction at that address. If the branch is not taken, the behavior is the same as those that do not locates in the delayed slot.

 In essence, this is another application of the instruction specialization optimization technique typically found in instruction set simulators where an intermediate form of decoded instructions are utilized to avoid decoding the same stream of instructions over and over again. The difference is, in this case, instead of specializing on the bit pattern of an individual instruction group, we specialize over the spatial position of individual instructions relative to special (branch/jump) instructions.

Implementation

 Once the idea is clear, the implementation is quite straight forward.

 We translate raw instruction stream into the decoded form on a page-by-page basis. A page is the smallest unit (normally 4KB, but may vary in modern architectures) that the Memory Management Unit (MMU) of the processor breaks up the memory address space for critical OS-related operations such as address translation, protection, etc., as a whole. Doing so has the extra benefit that in-page branching, which is predominately the usual case against off-page branching, does not need to go through the MMU – an expensive operation that could degrade the performance of the simulation significantly.

 While decoding the raw instructions of a certain page, we maintain a local flag that is set when the instruction under decoding is a branch/jump instruction and is cleared otherwise. This flag is checked for every instruction: if it is set, we choose the delayed-slot version of the interpretive code of that instruction; if it is clear, we use the normal version of the interpretive code of that instruction. Note that this is very much the same as we did in the pre-optimization implementation, only this time it is done in the decoding phase instead of run time.

 The only problem of that scheme is what to do with the first instruction on a page. Where does the initial value of that flag come from? While cross-page branch delay slot is uncommon, the architecture does not exclude this from happening, so we must handle this in our interpreter.

 The answer is that we maintain a similar, but non-local flag in each of the decoded page, whose value is determined by the last instruction on that page. The local flag then gets its initial value from that state flag of its previous page. If its previous page does not exist when this page is decoded, then the initial value is by default set to false.

 The correctness of this algorithm is based on the following reasoning: if the first instruction on a page is in delayed slot, then we must get here from executing instructions of its immediate previous page, which must have its decoded page in memory with the in-slot flag being correctly set.

 Result

 Happy with this discovery (I am not aware of anyone coming up with a similar idea to the best of my knowledge. This is a very specific domain of application anyway.), I re-implemented the original threaded-code instruction set simulator with the new algorithm and expect a somewhere close to 20% performance improvement.

 How did I come up with this 20% number? Simple. The intrinsic figure of merit of the performance of an instruction set simulator is the average number of host instructions executed to interpret one target instruction. This number is independent of the clock frequency of the host processor. While it certainly depends on the instruction set architecture of the host processor, we have reason to believe that it does not vary significantly from architectures to architectures. I estimate that on average our original simulator executes 20 host instructions to interrupt one target instruction, and by specializing non-branching instructions with regard on their relative position in the code stream, I could at least reduce 3 host instructions on average. One does not need to be particularly good at math to compute the final result then. It is a very rough, but nevertheless valid estimation.

 To my surprise, the benchmark shows that the new simulator executes at 105 MIPS (Millions Instructions per Second) against 115 MIPS of the original one. The delayed-slot “optimization” I have labored with actually brings down the performance by almost ten percent!

 How do I explain the seemingly contradictory result?

 It could be that the time the simulator spends on decoding is not negligible comparing with the time it spends on executing. In this case, the overhead incurred in the new simulator in the decoding phase could offset what it gains in the execution phase. But a more likely reason that I suspect is that the interpretive code has been restructured inadvertently in the new simulator for a less-optimal memory layout for utilizing the cache system of the host machine. This is just a guess. I don’t have a definite answer to this puzzle yet.

 Conclusion

 So what is the lesson of this story?

 While not directly relevant, and by no means did I regret for this small adventure to optimize an already fast simulator, I could not help but thinking of the famous remark D. Knuth made many years ago: “premature optimization is the root of all evil”.

 Optimization is hard, harder than most of us would expect. A simple solution is often the best solution. When optimization is necessary, we should try to optimize “in the large”. For example, by replacing the simple, classic read-evaluate-loop control structure with direct-threaded code, we get an order of magnitude performance boost. Surely it is worth the effort for a performance critical application. But don’t waste your time on tiny piece optimizations that you thought would work. Even when it works, it may take its penalty on something else, say, code clarity, which is more important in an engineering sense than a, say, 5% increase of runtime efficiency.

 Let me end up here by quoting from another great mind: “keep things simple, but no simpler.”


2009年1月4日星期日

Useful Shell Shortcuts

Here are some small, simple, but surprisingly useful shell commands that I use everyday. Hopefully, you may find it useful too.

Change Directory and List Content

alias cd..='cd ..' alias cd...='cd ../..'  cl() {     cd $1 && ls } 

The functions of these commands are self-explanatory and may seem trivial on the first sight. But since ls and cd are the most common shell commands we use everyday, these shortcuts will save you a tremendously amount of key stokes in the long run.

List Sub-directories Only

alias lsd=alias lsd='ls -F | grep / | sed -e '\''s/\///g'\'' | column' 

Say that you are in a large code base (for instance, the Linux kernel) and you want to see if a particular sub-directory exists but do not want to clutter the output from the ls command with all the ordinary files, then this shortcut comes in handy.

Trash instead of Delete

alias trash="mv -t ~/.Trash --backup=t" 

This command moves files into a specific location instead of delete them permanently. It is the shell version trash can utility usually found in a GUI environment. Note that in Ubuntu, you might want to replace ~/.Trash with~/.local/share/Trash/files, the standard location for trashed objects.