Sample for content extraction test (2)

About this file

INA Lintaro 2008-10-10T1120+0900

This file is a simple test that multiple content blocks on one page (as on a blog page) can be extracted appropriately.

tarao 2008-10-10T1124+0900

Comments on the article should not be regarded as content.

tarao 2008-10-10T1126+0900

Or should they be?

The second entry

INA Lintaro 2008-10-10T1127+0900

The second entry in a blog-like page should not be regarded as the main content of the page. Or, if the entries seem to be continuous, the scoring heuristics may regard them as a single content block.

tarao 2008-10-10T1137+0900

You can adjust the parameters of the scoring heuristics.
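
To make the point about adjustable parameters concrete, here is a minimal sketch of how such a heuristic might behave. It is not the extractor's actual algorithm: the Block class, the length-based score, and the parameter names min_length and continuity are all hypothetical, chosen only to show how a threshold can decide whether adjacent entries are merged into one content block or kept apart.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A chunk of page text (hypothetical type, for illustration only)."""
    text: str


def score(block: Block, min_length: int = 20) -> float:
    """Score a block by text length; blocks shorter than min_length score 0."""
    n = len(block.text)
    return float(n) if n >= min_length else 0.0


def group_blocks(blocks: List[Block],
                 min_length: int = 20,
                 continuity: float = 0.5) -> List[List[Block]]:
    """Group adjacent blocks into content candidates.

    A block joins the current group when its score is at least
    `continuity` times the previous block's score. A zero-scoring
    block ends the current group and is itself discarded.
    Both parameters are assumed names, not a real extractor's options.
    """
    groups: List[List[Block]] = []
    current: List[Block] = []
    prev = 0.0
    for block in blocks:
        s = score(block, min_length)
        if s == 0.0:
            # Too short to count as content: close any open group.
            if current:
                groups.append(current)
                current = []
            prev = 0.0
            continue
        if current and s < continuity * prev:
            # Score dropped sharply: start a new group.
            groups.append(current)
            current = []
        current.append(block)
        prev = s
    if current:
        groups.append(current)
    return groups


# Two long entries, a short separator, then another long entry.
blocks = [Block("a" * 100), Block("b" * 80), Block("c" * 10), Block("d" * 90)]
strict = group_blocks(blocks)
loose = group_blocks(blocks, min_length=5, continuity=0.05)
```

With the default parameters the short separator splits the page into two groups; loosening both parameters lets the sketch treat all four blocks as one continuous content block, which mirrors the two outcomes discussed in the entry above.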