Saturday, July 31, 2010

Updating a Lucene Index – The “Green” Version

There are plenty of examples available on the internet that are good introductions to the basics of a Lucene.NET index. They explain how to create an index and then how to use it for a search.

At some point you’ll want to update the index, and to update only certain documents rather than everything.

One option is to throw away the entire index and then recreate it from the sources. For some scenarios this might be the best choice. For example, you may have a lot of changes in your data, and a high latency for updating the index may be acceptable. In that case a full re-index each time might be the cheapest approach. Where the trade-off lies depends on your data; for example, when less than 10% of the documents have changed, updating can be more time-efficient. In some cases you probably want to experiment with this a little.

If you go for recreating the entire index then you probably want to build the new index first (in a different directory if file based) and to replace the index in use only once the new index is complete.
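The build-then-swap idea can be sketched roughly as follows. This is only a minimal sketch using plain System.IO: the class name IndexSwapSketch, the ".new" and ".old" directory suffixes and the buildIndex callback are assumptions for illustration, and it assumes that no reader or searcher still has the live index open while the directories are swapped.

```csharp
using System;
using System.IO;

public static class IndexSwapSketch
{
    // Rebuilds a file-based index next to the live one and swaps it in
    // only after the build has finished. buildIndex is a placeholder for
    // your own indexing code; it receives the directory to write to.
    public static void Rebuild(string liveDir, Action<string> buildIndex)
    {
        string basePath = liveDir.TrimEnd(
            Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
        string newDir = basePath + ".new"; // hypothetical naming scheme
        string oldDir = basePath + ".old";

        // Start from a clean build directory.
        if (Directory.Exists(newDir)) Directory.Delete(newDir, true);
        Directory.CreateDirectory(newDir);

        // This may take a long time; the live index is untouched meanwhile.
        buildIndex(newDir);

        // Swap: move the live index aside, then move the new one in place.
        if (Directory.Exists(oldDir)) Directory.Delete(oldDir, true);
        if (Directory.Exists(basePath)) Directory.Move(basePath, oldDir);
        Directory.Move(newDir, basePath);

        // The previous index is no longer needed.
        if (Directory.Exists(oldDir)) Directory.Delete(oldDir, true);
    }
}
```

Because the build directory is a sibling of the live one, both are on the same volume and the two moves are cheap renames. If the build throws, the live index has not been touched.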

Another option is to update only the documents that have changed (the “green” option, as we are re-using the index). This of course requires that you can identify the documents that need to be updated. Depending on your application and your design this might be relatively easy to achieve.

If you opt for updating just the documents that have changed, some blogs suggest removing the existing version of the document first and then adding the new version. For example, here is the code from the discussion of the question “How to Update a Lucene.NET Index” at Stack Overflow:

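int patientID = 12;
IndexReader indexReader = IndexReader.Open( indexDirectory );
// Term values are strings, so the ID has to be converted
indexReader.DeleteDocuments( new Term( "patient_id", patientID.ToString() ) );
indexReader.Close(); // Commits the deletion; the new version is then added with an IndexWriter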

There is, however, another option. Lucene.NET (I’m using version 2.9.2) can update an existing document. Here is the code:

readonly Lucene.Net.Util.Version LuceneVersion = Lucene.Net.Util.Version.LUCENE_29;
var IndexLocationPath = "..."; // Set to your location
var directoryInfo = new DirectoryInfo(IndexLocationPath);
var directory = FSDirectory.Open(directoryInfo);
var writer = new IndexWriter(directory, 
            new StandardAnalyzer(LuceneVersion),
            false, // Don't create index
            IndexWriter.MaxFieldLength.LIMITED);
writer.UpdateDocument(new Term("patient_id", document.Get("patient_id")), document);
writer.Optimize(); // Should be done with low load only ...
writer.Close();

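Be aware that the field you are using for identifying the document needs to be unique. Also, when you add the document, the field has to be added without analysis, so that its value is indexed as a single term that the Term passed to UpdateDocument() can match exactly: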

doc.Add(new Field("patient_id", id.ToString(), 
                  Field.Store.YES, 
                  Field.Index.NOT_ANALYZED));

The good thing about this option is that you don’t have to find or remove the old version. IndexWriter.UpdateDocument() takes care of that.

Happy coding!

Friday, July 30, 2010

Configuring log4net for ASP.NET

Yes, there are already a few posts out there, and yet I think there is value in providing just a recipe to make it work in your ASP.NET project without too many further details. So here you go (in C# where code is used):

Step 1: Download log4net, version 1.2.10 or later, and unzip the archive.

Step 2: In your project add a reference to the assembly log4net.dll.

image

Step 3: Create a file log4net.config at the root of your project (same folder as the root web.config). The following content will log everything to the trace window, e.g. “Output” in Visual Studio:

<configuration>
   <configSections>
      <section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net" />
   </configSections>

   <log4net>
      <appender name="TraceAppender" type="log4net.Appender.TraceAppender" >
         <layout type="log4net.Layout.PatternLayout">
            <param name="ConversionPattern" value="%d %-5p- %m%n" />
         </layout>
      </appender>
      <root>
         <level value="ALL" />
         <appender-ref ref="TraceAppender" />
      </root>
   </log4net>
</configuration>

Step 4: In AssemblyInfo.cs add the following to make the resulting assembly aware of the configuration file:

// Tell log4net to watch the following file for modifications:
[assembly: log4net.Config.XmlConfigurator(ConfigFile = "log4net.config", Watch = true)]

Step 5: In each class in which you want to log, add the following as a private member variable (MethodBase requires a using directive for System.Reflection):

private static readonly log4net.ILog Log =
   log4net.LogManager.GetLogger(
      MethodBase.GetCurrentMethod().DeclaringType);

Step 6: Log as needed. For example, to test that steps 1 to 5 were successful, add the following to Global.asax.cs:

public class Global : HttpApplication {
   protected void Application_Start(object sender, EventArgs e) {
      Log.Info("Application Server starting...");
   }
}

For more information about configuring log4net, e.g. logging to files, see log4net’s web site.

As always, if you find a problem with this recipe, please let me know. Happy coding!

Wednesday, July 28, 2010

Wildcard Searches in Lucene.NET

Yes, you can do wildcard searches with Lucene.Net. For example, you can search for the term “Mc*” in a database of names, and it will return names such as “McNamara” or “McLoud”. When you read the details of the query parser syntax (version 3.0.2) you will notice that the wildcard characters * (any number of characters) and ? (one character) are only allowed in the middle or at the end of the search term, but not at the beginning.

But how about using wildcards at the beginning? You can, but you should be aware of the consequences. You have to switch this on explicitly in your code, because a leading wildcard comes with an additional performance hit on large indexes. So be careful and check whether the resulting performance is acceptable for your users.

And here is the code (C# in this case):

var index = FSDirectory.Open(new DirectoryInfo(IndexLocationPath));
var searcher = new IndexSearcher(index, true);
var queryParser = new QueryParser(LuceneVersion, "content", new StandardAnalyzer(LuceneVersion));
queryParser.SetAllowLeadingWildcard(true);
var query = queryParser.Parse("*" + searchterm + "*"); // Using wildcard at the beginning
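To get a feeling for why a leading wildcard is more expensive, here is a small stand-alone sketch. It is not Lucene's implementation, only an illustration of the general idea under the assumption that terms are kept in sorted order: a pattern with a literal prefix can jump to the matching range, while a pattern starting with a wildcard has to look at every term. The class WildcardScanSketch and its methods are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WildcardScanSketch
{
    // Translates * (any number of characters) and ? (one character)
    // into an anchored regular expression.
    public static Regex ToRegex(string pattern)
    {
        string escaped = Regex.Escape(pattern)
            .Replace("\\*", ".*")
            .Replace("\\?", ".");
        return new Regex("^" + escaped + "$", RegexOptions.Singleline);
    }

    // Returns the literal text in front of the first wildcard character.
    public static string LiteralPrefix(string pattern)
    {
        int i = pattern.IndexOfAny(new[] { '*', '?' });
        return i < 0 ? pattern : pattern.Substring(0, i);
    }

    // sortedTerms must be sorted with ordinal comparison and contain no
    // duplicates. examined reports how many terms had to be tested.
    public static List<string> Find(
        List<string> sortedTerms, string pattern, out int examined)
    {
        string prefix = LiteralPrefix(pattern);
        Regex regex = ToRegex(pattern);
        var matches = new List<string>();
        examined = 0;

        // With a literal prefix we can seek directly to the first candidate.
        int start = 0;
        if (prefix.Length > 0)
        {
            start = sortedTerms.BinarySearch(prefix, StringComparer.Ordinal);
            if (start < 0) start = ~start;
        }

        for (int i = start; i < sortedTerms.Count; i++)
        {
            string term = sortedTerms[i];
            // Past the prefix range nothing can match any more.
            if (!term.StartsWith(prefix, StringComparison.Ordinal)) break;
            examined++;
            if (regex.IsMatch(term)) matches.Add(term);
        }
        return matches;
    }
}
```

With the sorted names Johnson, McLoud, McNamara, Smith and Wilson, the pattern “Mc*” examines only the two terms starting with “Mc”, whereas “*son” examines all five. On a large index that difference is what you pay for.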

Happy coding!

Monday, July 26, 2010

Lucene Index Toolbox

After you have successfully created your first index with Lucene.Net you might wonder whether the index was actually created as you wanted, and whether there is a tool for checking it. Well, such a tool exists, thanks to the binary compatibility between the Java version and the .NET version of Lucene.

The tool is called Lucene Index Toolbox. It is a Java-based tool that allows inspecting file-based indexes. To use it:

  1. Download the Lucene Index Toolbox. (This download is version 1.0.1; please check for newer versions.)
  2. Make sure you have a recent Java runtime installed.
  3. Open a command line for the directory containing the downloaded jar file.
  4. Use “java -jar lukeall-x.y.z.jar” to start the tool. Replace x.y.z with the version you downloaded. I used 1.0.1 so the command line for me is: “java -jar lukeall-1.0.1.jar”

Once started you can try out queries against your Lucene index. Or you can have a look at the files of your index and their meaning. Here is an example:

image

A very useful tool, in particular for the beginner. Happy coding!

Sunday, July 25, 2010

Visual Studio 2010 and WCF: Hard-to-read Error Message

Just ran into the following error/failure when updating a service reference to a WCF based service:

image

The challenge I had was that I couldn’t see the remainder of the message. Furthermore, nothing was selectable in this message box. Ideally the control used for displaying the message should allow for selecting the text and also provide a scrollbar. I suspect this is the default error message box of the OS. If that is the case, I think it could be solved by either the Visual Studio team or the Windows team.

In my case I launched the ASP.NET application hosting the WCF service and typed in the URL in a browser. That way I got access to the same but now complete error information. “The request failed with the error message:” and “The type ‘xyz’, provided as the Service attribute value in the ServiceHost directive, or provided in the configuration element system.serviceModel/serviceHostingEnvironment/serviceActivations could not be found.” now made sense.

And here is the root cause: Since there was an increasing number of services in the ASP.NET app I decided to create a folder in that project and move the UserManagementService into that folder. With some refactoring I also updated the namespaces and it happily compiled. I even remembered to update the entries in the web.config file. What I did overlook was the markup in the .svc files. So here is a simple example:

image

Note the highlighted part: Initially, when I created the service, it was sitting in the root and the name of the implementation including the namespace was “Server.UserManagementService”. When I moved it into a folder named UserManagement, I forgot to update this markup to “Server.UserManagement.UserManagementService”.
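A check like this could even be automated, for instance in a unit test. The following is only a sketch under a few assumptions: the class SvcMarkupCheck is made up for this example, the Service attribute is assumed to contain a plain namespace-qualified type name (no assembly name), and you pass in the assembly that contains your service implementations.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

public static class SvcMarkupCheck
{
    // Extracts the value of the Service attribute from the ServiceHost
    // directive of an .svc file, or null if there is none.
    public static string GetServiceTypeName(string svcMarkup)
    {
        Match m = Regex.Match(
            svcMarkup,
            @"<%@\s*ServiceHost\b[^%]*?\bService\s*=\s*""([^""]+)""",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns true if the type named in the markup can be found in the
    // given assembly.
    public static bool ServiceTypeExists(string svcMarkup, Assembly serviceAssembly)
    {
        string name = GetServiceTypeName(svcMarkup);
        return name != null && serviceAssembly.GetType(name, false) != null;
    }
}
```

Reading each .svc file with File.ReadAllText and passing its content to ServiceTypeExists would have flagged the stale “Server.UserManagementService” entry before the service reference update failed.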

So keep in mind the following when you rename or move a service implementation:

  1. Rename the service
  2. Update the web.config file (this is also mentioned in the comments generated when you use the wizard to add the service)
  3. Update the markup in the associated svc-file.
  4. Update/configure the service references in all service clients.

The last one can be done in two ways: either remove the service reference and re-add it, or reconfigure the existing reference:

image

Next, update the address to the service:

image

Happy coding!

Wednesday, July 21, 2010

SVN Location of Lucene.NET

I’m probably the last one to notice… And if not, here is the Subversion (SVN) repository location of Lucene.NET after it came out of the Apache Software Foundation’s incubator and became a part of Lucene:

https://svn.apache.org/repos/asf/lucene/lucene.net/

In case you want to download the source code, I’m sure you are aware that you want to append either ‘trunk’ or a tag to this URL. Don’t bother looking into branches. As of writing there were none. The latest tag as of writing was version Lucene.Net_2_9_2 (URL in the SVN repository) although the Java version is already at 3.0.

By the way: They also offer binary releases, but the most recent I could find was from March 11, 2007. So I guess this means: DIY. Fortunately, that turned out to be straightforward when using Visual Studio 2005 or later (I used VS 2010). Just get the code of tag Lucene.Net_2_9_2 and compile the solution src\Lucene.Net\Lucene.Net.sln. The output is in Bin\Debug or Bin\Release and consists of a single assembly Lucene.Net.dll, which you need to reference in your project.

Sunday, July 18, 2010

Selenium RC and ASP.NET MVC 2: Controller Invoked Twice

Admittedly MVC (as of writing I use ASP.NET MVC 2) has been designed from the ground up for automated testability (tutorial video about adding unit testing to an MVC application). For example you can test a controller without even launching the ASP.NET development web server. After all a controller is just another class in a .NET assembly.

However, at some point you may want to ensure that all the bits and pieces work together to provide the planned user experience. That is where acceptance tests enter the stage. I use Selenium for this, and a few days ago I hit an issue that turned out to be caused by Selenium server version 1.0.3. Here are the details.

The symptom that I observed was that a controller action was hit twice for a single Selenium.Open(…) command. First I thought that my test was wrong, so I stepped through it line by line. But no, there was only one open command for the URL in question. Next I checked my implementation to see whether I had accidentally created some code that would implicitly call or redirect to the same action. Again, this wasn’t the case: each time I hit the breakpoint on the controller action there was nothing in the call stack.

Then I used Fiddler (a web debugging proxy) for a while and yes, there were indeed a HEAD request and a GET request triggered by the Selenium.Open(…) command. And even more interesting, when I ran my complete test suite I found several cases where the GET request was preceded by a HEAD request for the same URL.

The concerning bit, however, was that I couldn’t find a way to reproduce this with a browser that I operated manually. Only the automated acceptance tests through Selenium RC created this behavior.

For a moment I considered trying to use caching on the server side to avoid executing the action more than once. But then I decided to drill down to get more details. In global.asax.cs I added the following code (Of course you can use loggers other than log4net):

protected void Application_BeginRequest() {
   Log.InfoFormat("Request type is {0}.", Request.RequestType);
   Log.InfoFormat("Request URL is {0}.", Request.Url);
}

private static readonly log4net.ILog Log =
   log4net.LogManager.GetLogger(MethodBase.GetCurrentMethod().DeclaringType);

As a result I was able to track all requests. Of course you wouldn’t want to do this for production purposes. In this case I just wanted to have more information about what was going on. It turned out that Fiddler was right as I found several HEAD requests followed by a GET request.
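If the log gets long, a few lines of code can pick out the suspicious pairs. The following is just a sketch: the class HeadGetPairs is made up for this example, and it assumes you have already parsed the log into a list of request type and URL pairs in the order in which they arrived.

```csharp
using System;
using System.Collections.Generic;

public static class HeadGetPairs
{
    // Each entry holds the request type as key and the URL as value.
    // Returns the URLs for which a HEAD request was directly followed
    // by a GET request for the same URL.
    public static List<string> FindDuplicatedUrls(
        IList<KeyValuePair<string, string>> requests)
    {
        var urls = new List<string>();
        for (int i = 1; i < requests.Count; i++)
        {
            bool headThenGet =
                string.Equals(requests[i - 1].Key, "HEAD", StringComparison.OrdinalIgnoreCase) &&
                string.Equals(requests[i].Key, "GET", StringComparison.OrdinalIgnoreCase);

            if (headThenGet && requests[i - 1].Value == requests[i].Value)
            {
                urls.Add(requests[i].Value);
            }
        }
        return urls;
    }
}
```

Every URL returned by FindDuplicatedUrls is one for which the controller action was executed twice.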

After some research I came across a discussion about Selenium RC HEAD requests, and it turned out that this was a known issue in Selenium server version 1.0.3. As of writing it was fixed in trunk. I thought for a moment about building from trunk but then decided on a different path, and that solution worked as well: Instead of using version 1.0.3 I am now using Selenium Server version 2.0a5 plus the .NET driver from the 1.0.3 package.

So here is what you need to download:

  1. Selenium Remote Control 1.0.3 which includes the .NET driver. Don’t use the server from this download.
  2. Selenium Server Standalone 2.0a5. Use this jar file as your server. The command line at the Windows command prompt changes from “java -jar selenium-server.jar” to “java -jar selenium-server-standalone-2.0a5.jar”.

Then start the 2.0a5 server and run your tests. The HEAD/GET issue should be gone. In my case it was, and I’m now back to extending my test suite, finally making progress again.

My configuration: Visual Studio 2010, .NET 4.0, ASP.NET MVC 2, Vista Ultimate 32, various browsers (IE, Firefox, Chrome, Opera, Safari). The issue I describe here may be different from the one you observe. Consequently, it is possible that this suggested solution doesn’t work for you.