Sunday, March 6, 2011

Information Hiding, Abstraction, and Encapsulation

...another take on an eternal topic. Why do I risk publishing my own interpretations of these terms -- even though there are already so many, and so many discussions about them? Simple answer: Because I have not found a text that satisfied me.
Before I give my own interpretation, please take a look at the following links:
  • http://www.itmweb.com/essay550.htm:  Edward V. Berard collected a host of definitions from various books and articles. In my reading, many of the definitions equate information hiding and encapsulation: "Data hiding is sometimes called encapsulation", "Encapsulation (also information hiding) ...", "[E]ncapsulation -- also known as information hiding ...", although Berard tries to convince us that there are significant differences. Still, I am confused.
  • http://electrotek.wordpress.com/2009/04/29/encapsulation-and-information-hiding/ by Viktoras Agejevas collects further citations and ends with "These quotes clearly show that encapsulation and information hiding are almost synonymous." However, this is only a short discussion of the topic.
  • http://nat.truemesh.com/archives/000498.html by Nat Pryce argues for distinguishing between encapsulation and information hiding because "I find it much easier to make good decisions when I am clear about when I am doing encapsulation and when I am doing information hiding." However, to me the text sounds just like an argument for preferring the term "information hiding."
  • http://discuss.joelonsoftware.com/default.asp?design.4.145438.37 by Dave Jarvis also argues that the two are orthogonal. However, in the many comments, the concepts get more and more muddled, including in Dave Jarvis's own comments -- although I must say he tries hard to work out a good usage of the terms.
I will give my definitions in a moment. However, I am of the opinion that the quarrel about these terms is not that important. If someone -- like Nat Pryce, Dave Jarvis, or I -- wants to use these terms with a very specific meaning in mind, this is OK. But in a conversation or discussion, we should accept that other people attach other -- maybe precise, maybe unclear -- meanings and usages to them; and therefore we should focus on the original goal of the discussion or conversation.

Information hiding was introduced in a paper by Dave Parnas in 1972 -- long before object orientation came along. AFAIK, he used it to introduce a new way of modularizing an algorithm. The predominant method of the day was functional decomposition, where an algorithm is taken apart into sub-algorithms.
Parnas argued that a "risk-driven approach" was better: Identify parts of the algorithm that rely on the same design decision, and package each such part into a separate "module." This module now need not expose the consequences of that decision, because all those consequences are now internal to the module. In other words, the module hides all information resulting from that decision. The huge advantage is that changes to that decision will not be seen by other parts of the system -- thus avoiding any ripple effect when the decision changes.

Thus, information hiding is a process concept that includes reasoning about project conditions and decisions. There is, in general, no "right" or "wrong" information hiding. Inherently, information hiding requires judgment about decision risks: If a decision will most certainly not change, there's no need to hide that information. Examples could be the selection of your operating system or RDBMS; or "invariant" features that are very unlikely or even impossible to change, e.g. the list of human sexes or natural laws (... I fear that we agree that even such "invariants" may change in certain scenarios ...).
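
To make this concrete, here is a minimal sketch of one hidden design decision. It is loosely inspired by the line-storage idea in Parnas's paper, but the class LineStorage and its members are hypothetical names chosen for this illustration. The assumed risky decision is "how lines of text are stored"; everything that depends on it sits inside one module.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical module: hides the decision of HOW lines are stored.
public sealed class LineStorage {
    // Hidden decision: lines are kept as a list of strings in memory.
    // Switching to a packed character array or a file should only
    // require changes inside this class.
    private readonly List<string> _lines = new List<string>();

    public int Count { get { return _lines.Count; } }

    public void Add(string line) {
        if (line == null) throw new ArgumentNullException("line");
        _lines.Add(line);
    }

    // Callers ask for a word by position; they never see the storage layout.
    public string GetWord(int lineIndex, int wordIndex) {
        string[] words = _lines[lineIndex].Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return words[wordIndex];
    }
}

public static class LineStorageDemo {
    public static void Main() {
        var storage = new LineStorage();
        storage.Add("information hiding avoids ripple effects");
        Console.WriteLine(storage.GetWord(0, 1)); // prints "hiding"
    }
}
```

Whether this decision is worth hiding at all is exactly the judgment call described above: if the storage format were certain never to change, the wrapper would buy little.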

If information is hidden, it needs to be hidden somewhere: i.e., behind some sort of "walls." We could imagine a "software landscape" where there are only sometimes walls, but none at other places (imagine flying over English meadows -- sometimes with walls or hedges between them, sometimes not). However, it is easier to view the landscape as a set of "modules," with walls around each module. Now, it is important to realize that there are many different sorts of modules -- even in our current "OO age:"
  • functions
  • in OO languages:
      - classes (and their variants, like "structs")
      - class hierarchies
      - groups of nested classes
  • threads and tasks
  • thread groups / task groups
  • co-routines in languages with co-routine support
  • packages/namespaces/"modules" (e.g. in MODULA)
  • assemblies (in Windows and .NET)
  • configuration files and file sets
  • DSLs and generators
  • aspects (in aspect-oriented programming languages)
  • schemas (in relational databases)
  • arbitrarily defined groups of classes, methods, threads, ...: E.g. all whose names match a certain pattern
For each decision whose consequences should be hidden, one must decide which sort of "module" is used to hide it.

In a nutshell: Information Hiding is a process encompassing ...
  1. ... deciding on things that might change / that are risky / that are not under your control: Changes to these should be hidden from the rest of the design of your system.
  2. ... situation-specific judgment about the volatility of decisions -- therefore, there is no "universally right or wrong" information to hide.
  3. ... identifying the "walls" towards other parts of the design behind which information can be hidden -- in practice, the "modules" inside which certain pieces of information are to be hidden.
Abstraction, as a process, is -- for me -- the complement of information hiding, in the following sense:
  • Information hiding focuses on which information is hidden (away) so that changes to decisions about that information do not influence the rest of the system.
  • Abstraction, on the other hand, focuses on the information that is not hidden, i.e. that is exposed so that the rest of the system can rely on it.
"Abstraction" is also the name for the result of such an abstraction process.
From the standpoint of information hiding, in an ideal world, all information is hidden. Obviously, a system in which nothing is exposed cannot be built. Therefore, we must also focus on a useful design of the exposed (non-hidden) parts of each module. A good abstraction requires all those "-ibilities", e.g.
  • usability
  • testability
  • "understandability"
  • completeness with respect to proofs or arguments about the users of the abstraction
Finally, Encapsulation is in my dictionary the name of the techniques used to put information hiding and abstractions into practice. Some of these techniques are embedded in programming languages, others must be designed explicitly. Here are a few important guidelines of encapsulation techniques that pre-date object orientation:
  • separate accessors from (implementation) data;
  • expose only read-only data;
  • expose only copies of internal data;
  • separate APIs ("interfaces") of algorithms and algorithm groups from implementation details;
  • use thread-local/thread-static variables to hide information inside a thread.
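
Two of the guidelines above, "expose only read-only data" and "expose only copies of internal data", can be sketched as follows. The techniques themselves are not tied to classes; a class is used here only to stay in C#. Schedule, Snapshot, and ReadOnlyView are hypothetical names for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

public sealed class Schedule {
    // Implementation data: never handed out directly.
    private readonly List<DateTime> _appointments = new List<DateTime>();

    public void Add(DateTime when) { _appointments.Add(when); }

    // "Expose only copies": the caller may modify the array freely
    // without affecting the internal list.
    public DateTime[] Snapshot() { return _appointments.ToArray(); }

    // "Expose only read-only data": a wrapper around the internal list
    // that offers no public way to modify it.
    public ReadOnlyCollection<DateTime> ReadOnlyView {
        get { return _appointments.AsReadOnly(); }
    }
}

public static class ScheduleDemo {
    public static void Main() {
        var schedule = new Schedule();
        schedule.Add(new DateTime(2011, 3, 6));

        DateTime[] copy = schedule.Snapshot();
        copy[0] = DateTime.MinValue;                      // changes only the copy

        Console.WriteLine(schedule.ReadOnlyView[0].Year); // prints 2011
    }
}
```

The copy costs an allocation per call, while the read-only wrapper is cheap but still reflects later changes to the internal list; which trade-off fits is again a design judgment.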
In the '80s, a group of encapsulation techniques came into fashion that is now commonly known under the term "object orientation" (an older related concept was ADT or "abstract data type") which provides, among others,
  • Combining algorithms and data that are significantly coupled in "cohesive classes"
  • Combining such elements in "inheritance hierarchies"
  • For data exposed from some module, only expose restricted knowledge -- mostly only interfaces (a concept that emerged around the same time)
  • Use private, public, internal, package-private to define the hiding boundaries.
  • Law of Demeter
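
As a rough illustration of the last three items, the following sketch exposes only an interface, uses access modifiers to draw the hiding boundary, and follows the Law of Demeter by letting callers talk to the customer rather than reaching through to the wallet. IPayer, Customer, and Wallet are hypothetical names, not taken from any library.

```csharp
using System;

// Only this restricted knowledge is exposed to the rest of the system.
public interface IPayer {
    bool Pay(decimal amount);
}

public sealed class Customer : IPayer {
    // Private nested class: the hiding boundary is the Customer class.
    private sealed class Wallet {
        private decimal _balance;
        public Wallet(decimal balance) { _balance = balance; }

        public bool TryWithdraw(decimal amount) {
            if (amount <= 0 || amount > _balance) return false;
            _balance -= amount;
            return true;
        }
    }

    private readonly Wallet _wallet;

    public Customer(decimal initialBalance) {
        _wallet = new Wallet(initialBalance);
    }

    // Law of Demeter: callers ask the customer to pay; they never
    // navigate to the wallet or its balance themselves.
    public bool Pay(decimal amount) { return _wallet.TryWithdraw(amount); }
}

public static class CustomerDemo {
    public static void Main() {
        IPayer payer = new Customer(50m);
        Console.WriteLine(payer.Pay(30m)); // prints True
        Console.WriteLine(payer.Pay(30m)); // prints False: only 20 left
    }
}
```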
To sum up:
  • Information Hiding and Abstraction are processes that decide about design based on "risk" information; and the results of these processes.
  • Encapsulation comprises the concrete techniques used to establish information hiding and abstraction.
That's it.

Saturday, March 5, 2011

Parallel Bug #2

This time, the bug is incorrect handling of database connections in an id generator (class IdGen). The intention of the IdGen class is that a new id is created in a separate transaction. Because of this, parallel requests can get new ids without (much) waiting for each other even if the request transactions themselves are long. However, due to a tiny error we made, the behavior was not as intended. Yet, the buggy code was not detected for almost 6 months -- it seems that in real life, there were not too many parallel requests; and/or they were short enough; and/or users accepted waiting times.

In the following code, the threads t1 and t2 represent two requests, e.g. in a web server. The SQL Server WAITFOR statements represent SQL statements that take a few seconds. This allows us to reproduce the bug deterministically.

Here is the code:

    1 using System;
    2 using System.Data;
    3 using System.Data.SqlClient;
    4 using System.Threading;
    5 
    6 #region An SQL Connection And Id Generation Framework
    7 
    8 public static class Logger {
    9     private static readonly DateTime _st = DateTime.Now;
   10     public static void Log(string msg) {
   11         Console.WriteLine((Thread.CurrentThread.Name ?? "m ") +
   12             " @ " + (DateTime.Now - _st).TotalSeconds + ": " + msg);
   13     }
   14 }
   15 
   16 public class DBContext : IDisposable {
   17     public string ConnectionString { get; private set; }
   18     private readonly SqlConnection _conn;
   19     private readonly SqlTransaction _tx;
   20 
   21     public DBContext(string connectionString) {
   22         ConnectionString = connectionString;
   23         _conn = new SqlConnection(connectionString);
   24         _conn.Open();
   25         _tx = _conn.BeginTransaction();
   26         Logger.Log("Enter Transaction");
   27     }
   28 
   29     public int ExecuteIntQuery(string sql) {
   30         using (IDbCommand cmd = _conn.CreateCommand()) {
   31             try {
   32                 Logger.Log("Starting " + sql);
   33                 cmd.CommandText = sql;
   34                 cmd.CommandTimeout = 5;
   35                 cmd.Transaction = _tx;
   36                 return (int)(cmd.ExecuteScalar() ?? 0);
   37             } finally {
   38                 Logger.Log("Done     " + sql);
   39             }
   40         }
   41     }
   42 
   43     public void Dispose() {
   44         Logger.Log("Commit Transaction");
   45         _tx.Commit();
   46         _conn.Close();
   47     }
   48 }
   49 
   50 public static class IdGen {
   51     public static int GetNewIdInSeparateTransaction(DBContext cxt) {
   52         using (new DBContext(cxt.ConnectionString)) {
   53             return cxt.ExecuteIntQuery(
   54                 "UPDATE fwk_id SET ct = ct + 1; SELECT ct FROM fwk_id"
   55             );
   56         }
   57     }
   58 }
   59 
   60 #endregion An SQL Connection And Id Generation Framework
   61 
   62 #region An Application
   63 
   64 class Program {
   65     static void Main() {
   66         const string connString = @"server=.\SQLEXPRESS;Initial Catalog=MyToyDatabase;Integrated Security=SSPI";
   67         using (var setup = new DBContext(connString)) {
   68             setup.ExecuteIntQuery(@"
   69                     BEGIN TRY DROP TABLE fwk_id END TRY
   70                     BEGIN CATCH END CATCH;
   71                     CREATE TABLE fwk_id (ct INTEGER);
   72                     INSERT INTO fwk_id VALUES(1000)");
   73         }
   74 
   75         var t1 = new Thread(() => {
   76             int newId;
   77             using (var cxt = new DBContext(connString)) {
   78                 cxt.ExecuteIntQuery("WAITFOR DELAY '00:00:02'");
   79                 newId = IdGen.GetNewIdInSeparateTransaction(cxt);
   80                 cxt.ExecuteIntQuery("WAITFOR DELAY '00:00:03'");
   81             }
   82             Logger.Log("Result = " + newId);
   83         }) { Name = "t1" };
   84 
   85         var t2 = new Thread(() => {
   86             int newId;
   87             using (var cxt = new DBContext(connString)) {
   88                 cxt.ExecuteIntQuery("WAITFOR DELAY '00:00:01'");
   89                 newId = IdGen.GetNewIdInSeparateTransaction(cxt);
   90             }
   91             Logger.Log("Result = " + newId);
   92         }) { Name = "t2" };
   93         t1.Start(); t2.Start();
   94         t1.Join(); t2.Join();
   95     }
   96 }
   97 
   98 #endregion An Application
   99 

If you run this code, everything works fine. However, if you increase the time in line 80 from 3 seconds to 5, you get a timeout exception.