Thursday, March 30, 2006

Parsing HTML.

In a recent project, I needed to parse HTML documents. I was debating between using the WebBrowser control, which exposes the HTML as a DOM, and HttpWebRequest, where I would have to do all the parsing myself. A quick search on Google led me to Html Agility Pack.

It is an HTML parser that lets you parse "out of the web" HTML files and builds a DOM from them. It supports plain XPath and XSLT.
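As a rough illustration of the XPath support, here is a minimal sketch that lists the links in a small HTML fragment. It assumes the project references the Html Agility Pack assembly; the markup and the class name LinkLister are made up for the example.

```csharp
using System;
using HtmlAgilityPack; // third-party library, assumed to be referenced by the project

class LinkLister
{
    static void Main()
    {
        // Made-up markup for illustration; the <p> and <li> tags are never closed.
        string html = "<html><body><p>Links:<ul>"
                    + "<li><a href='a.html'>First</a>"
                    + "<li><a href='b.html'>Second</a>"
                    + "</ul></body></html>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes takes an XPath expression and returns null when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue returns the supplied default when the attribute is missing.
            Console.WriteLine("{0} -> {1}", link.InnerText, link.GetAttributeValue("href", ""));
        }
    }
}
```

The null check matters because SelectNodes does not return an empty collection when the query finds nothing.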

Difference between Directory.GetFiles and Windows Explorer when sorting numeric file names.

The .NET System.IO.Directory.GetFiles method typically returns file names in alphabetical order (the documentation does not guarantee any particular order), whereas Windows Explorer sorts them in numeric order. For example:

In Directory.GetFiles    In Explorer
04.html                  1.html
040.html                 3.html
05.html                  04.html
1.html                   05.html
10.html                  10.html
3.html                   040.html

If you have access to the source code, you can pass the string array returned by Directory.GetFiles to Array.Sort(Array, IComparer), supplying a comparer that orders the names numerically.
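One possible comparer, sketched here rather than offered as the only way, delegates to StrCmpLogicalW, the Windows shell function that compares digit runs as numbers. That makes it Windows-only. The class name LogicalStringComparer and the "*.html" filter are assumptions for the example.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper: compares strings with the Windows shell's logical (numeric-aware) ordering.
class LogicalStringComparer : IComparer
{
    // StrCmpLogicalW lives in shlwapi.dll and treats runs of digits as numbers.
    [DllImport("shlwapi.dll", CharSet = CharSet.Unicode, ExactSpelling = true)]
    private static extern int StrCmpLogicalW(string x, string y);

    public int Compare(object x, object y)
    {
        // Directory.GetFiles never returns null entries, so plain casts are enough here.
        return StrCmpLogicalW((string)x, (string)y);
    }
}

class Program
{
    static void Main(string[] args)
    {
        // Folder to list; falls back to the current directory (an assumption for this sketch).
        string folder = args.Length > 0 ? args[0] : ".";

        string[] files = Directory.GetFiles(folder, "*.html");

        // Resolves to the Array.Sort(Array, IComparer) overload mentioned above.
        Array.Sort(files, new LogicalStringComparer());

        foreach (string file in files)
        {
            Console.WriteLine(Path.GetFileName(file));
        }
    }
}
```

If the code has to run where shlwapi.dll is unavailable, a managed comparer that parses the numeric part of each name would be needed instead.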

If you do not have access to the source code, you can instead make Explorer sort alphabetically by adding the registry value below. Restart Explorer for the change to take effect.

For the current user:
[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer]

For the system:
[HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer]

Value Name: NoStrCmpLogical
Type: REG_DWORD
Value: 1
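For convenience, the same setting can be written as a .reg file and imported by double-clicking it. This sketch covers only the current-user key; swap in the HKEY_LOCAL_MACHINE path for the system-wide setting.

```reg
Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer]
"NoStrCmpLogical"=dword:00000001
```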

Disclaimer: Modifying the registry can cause serious problems that may require you to reinstall your operating system. Use the information provided at your own risk.