HaT 0.3 (c)2004,2005 Stepan Roh
-------------------------------

This archive contains an experimental tool for adding diacritic marks to
plain ASCII text. With the test database, the error rate is around 5%.

Changes from version 0.2
------------------------

- extended test database

Changes from version 0.1
------------------------

- fixed a bug where context frequency was ignored
- extended test database

Contents
--------

hat-0.3.tar.gz

  contains documentation and the tool itself

hat-0.3-db.tar.gz

  contains an example database for use with the tool

Running
-------

Requirements:

  Perl 5.x or higher (tested with v5.8.2)
  Cz::Cstocs (tested with version 3.4)

Generating (training) the database:

  ./hat.pl -b hat.db il2 < train.txt

  - creates the database hat.db from the training data in train.txt, which
    is encoded in iso-8859-2 (il2; encoding names follow Cz::Cstocs)
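
Preparing training data that is stored as UTF-8 (a sketch, not part of HaT):

The command above expects train.txt in iso-8859-2. If the training text is
in UTF-8, it can be converted first with the standard iconv utility. The
helper name utf8_to_il2 and the file name corpus-utf8.txt are placeholders
made up for this example. Whether hat.pl accepts other Cz::Cstocs encoding
names directly has not been checked here.

```shell
# Hypothetical helper (not part of HaT): convert UTF-8 text on stdin to
# ISO-8859-2 on stdout. iconv exits with an error if the input contains
# a character that has no ISO-8859-2 equivalent.
utf8_to_il2() {
    iconv -f UTF-8 -t ISO-8859-2
}

# Usage (corpus-utf8.txt is a placeholder file name):
#   utf8_to_il2 < corpus-utf8.txt > train.txt
#   ./hat.pl -b hat.db il2 < train.txt
```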

Adding diacritic marks:

  ./hat.pl -h hat.db il2 < ascii.txt > czech.txt

  - uses the database hat.db to add diacritic marks to the text in
    ascii.txt and writes the result to czech.txt in iso-8859-2 encoding
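
Estimating the error rate on your own text (a sketch, not part of HaT):

One way to check the output is to take a text that already has correct
diacritics, strip them, restore them with the tool, and count the words
that differ from the original. This README does not say how its error rate
figure was measured, so the word-level comparison below is only an
assumption. The function name word_error_rate and the file names
reference.txt and restored.txt are placeholders.

```shell
# Hypothetical helper: print the percentage of whitespace-separated words
# that differ between a reference file ($1) and a restored file ($2).
# Words are compared position by position, so both files are assumed to
# contain the same words in the same order, differing only in diacritics.
word_error_rate() {
    ref_words=$(mktemp)
    out_words=$(mktemp)
    # One word per line; LC_ALL=C makes tr treat the input as plain bytes.
    LC_ALL=C tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' > "$ref_words"
    LC_ALL=C tr -s '[:space:]' '\n' < "$2" | sed '/^$/d' > "$out_words"
    # Appending "" forces a string comparison even for numeric-looking words.
    paste "$ref_words" "$out_words" |
        LC_ALL=C awk -F '\t' '
            ($1 "") != ($2 "") { bad++ }
            END { if (NR) printf "%.2f\n", 100 * bad / NR; else print "0.00" }'
    rm -f "$ref_words" "$out_words"
}

# Usage (assumes the cstocs utility that ships with Cz::Cstocs is installed):
#   cstocs il2 ascii < reference.txt > ascii.txt
#   ./hat.pl -h hat.db il2 < ascii.txt > restored.txt
#   word_error_rate reference.txt restored.txt
```

If the two files contain a different number of words, the unmatched words
are counted as errors.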

Test database
-------------

The test database was generated from these sources:

CZLUG's statutes (http://www.linux.cz/czlug/stanovy.html)
GNU LGPL (CZ) (http://www.gnu.cz/article.php?id_art=34)
Linux Documentation Project (CZ, 2nd ed.) (http://www.cpress.cz/knihy/ldp2/)
Selected laws of the Czech Republic (http://portal.gov.cz)
Texts from various Czech periodicals and newspapers
A few books, both Czech originals and Czech translations

The exact form of the source texts cannot be reconstructed from the test
database (it does not contain all the information from the original
sources), so I consider this to be fair use.

						Stepan Roh <src@post.cz>
