Merged
184 commits
2ea900b
action: generating ParlaMint-[IL] sample files with #906
matyaskopp Jun 3, 2025
1fc7c03
action: generating ParlaMint roots for Sample folder #906
matyaskopp Jun 3, 2025
a561d33
Merge branch 'main' of github.com:clarin-eric/ParlaMint into devel
matyaskopp Jun 3, 2025
d9d3744
Deploy GitHub pages
matyaskopp Jun 3, 2025
e4c6b1b
Use 'find' for wildcard operations (#905).
TomazErjavec Jun 3, 2025
483e7c4
add taxonomies polishing in initialization
matyaskopp Jun 3, 2025
0f0f1c0
add new taxonomies for translation (sentiment, topic) #753
matyaskopp Jun 3, 2025
99dbc3b
reinsert taxonomies with missing translations #753
matyaskopp Jun 3, 2025
11d4515
Restrict "find -delete" to files only (#905).
TomazErjavec Jun 4, 2025
2b34b29
Add "Parent ID" column to .ana metadata TSV (#897).
TomazErjavec Jun 4, 2025
c207c2f
Add forgotten "-type f" to find commands (#905).
TomazErjavec Jun 4, 2025
79a03bd
Add producing .ana metadata files (#903).
TomazErjavec Jun 4, 2025
f87ebba
setup Canary Islands Parliament (ParlaMint-ES-CN)
matyaskopp Jun 5, 2025
1cd0c6d
add validation of corpus-specific taxonomies - checking IDs and schem…
matyaskopp Jun 5, 2025
cb33d57
Merge branch 'devel' into data
matyaskopp Jun 5, 2025
115a7f7
Update ParlaMint-taxonomy-NER.ana.xml
RubenvanHeusden Jun 5, 2025
2797774
Update ParlaMint-taxonomy-parla.legislature.xml
RubenvanHeusden Jun 5, 2025
0c54796
validate taxonomies even if the translation is not required
matyaskopp Jun 5, 2025
b76aafa
Merge branch 'devel' into data
matyaskopp Jun 5, 2025
3498148
Update ParlaMint-taxonomy-politicalOrientation.xml
RubenvanHeusden Jun 5, 2025
e88c48d
Update ParlaMint-taxonomy-subcorpus.xml
RubenvanHeusden Jun 5, 2025
1252bf0
Update ParlaMint-taxonomy-topic.xml
RubenvanHeusden Jun 5, 2025
fae7231
Update ParlaMint-taxonomy-sentiment.ana.xml
RubenvanHeusden Jun 5, 2025
5fb8c15
Update ParlaMint-taxonomy-topic.xml
starkadur Jun 5, 2025
03b5a4a
Update ParlaMint-taxonomy-sentiment.ana.xml
starkadur Jun 5, 2025
48bdb86
Merge pull request #914 from starkadur/data
matyaskopp Jun 5, 2025
e038499
Merge branch 'data' into data
RubenvanHeusden Jun 5, 2025
9dbd8e0
ES-CT: add partially translated taxonomy
matyaskopp Jun 5, 2025
c83b897
IS: add translation to shared taxonomies
matyaskopp Jun 5, 2025
ea3f1b1
Merge branch 'data' into data
matyaskopp Jun 5, 2025
ed8ecf4
ES-CT: taxonomy translation
matyaskopp Jun 6, 2025
d6282f3
ES-CT: add translation to shared taxonomies
matyaskopp Jun 6, 2025
d56a56a
add script for patching ids in corpus-specific taxonomies #902
matyaskopp Jun 6, 2025
ee8cc59
Add Turkish translations for topic taxonomy #753.
Jun 6, 2025
8eac47b
Add Turkish translations for sentiment taxonomy #753.
Jun 6, 2025
036ba00
Add script to check coalition/opposition date clashes.
TomazErjavec Jun 7, 2025
bbdb76e
Add test for new script validate-parlamint-relation.xsl.
TomazErjavec Jun 8, 2025
29704fd
Deal with special chars in vertical notes.
TomazErjavec Jun 8, 2025
813be0b
Fix relation validation.
TomazErjavec Jun 9, 2025
7252196
One more fix for relation validation.
TomazErjavec Jun 9, 2025
23db2f6
Merge pull request #916 from coltekin/data
matyaskopp Jun 9, 2025
fdad43f
TR: add fi translation into common taxonomies #753
matyaskopp Jun 9, 2025
d5de1d7
Merge branch 'data' into data
matyaskopp Jun 9, 2025
7ae0344
Merge pull request #915 from RubenvanHeusden/data
matyaskopp Jun 9, 2025
392c21f
NL: add nl translation into common taxonomies #753
matyaskopp Jun 9, 2025
3128721
PL: add pl translation into common taxonomies #753
matyaskopp Jun 10, 2025
973aa5c
Fix some spacing issues.
TomazErjavec Jun 10, 2025
c1a4143
Merge branch 'data' into devel.
TomazErjavec Jun 10, 2025
876aa19
Add templates for new sentiment and topic taxonomies.
TomazErjavec Jun 10, 2025
a3ad358
Merge branch 'data' into devel, again.
TomazErjavec Jun 10, 2025
800eaf4
Remove obsolete log.
TomazErjavec Jun 10, 2025
85e3fd9
taxonomies translations
DimitrisGk-iel Jun 10, 2025
71d66aa
Merge branch 'clarin-eric:data' into data
DimitrisGk-iel Jun 10, 2025
a4f2ddc
remove space
DimitrisGk-iel Jun 10, 2025
c8381f1
Remove redundant "Mixed negative" in catDesc.
TomazErjavec Jun 11, 2025
7ba63c5
Merge pull request #917 from DimitrisGk-iel/data
matyaskopp Jun 11, 2025
9c6fb4b
GR: add el translation into common taxonomies #753
matyaskopp Jun 11, 2025
251c6ce
fix mixed negative #918
matyaskopp Jun 11, 2025
a7d317a
Remove extent/measure if not for speeches or words.
TomazErjavec Jun 11, 2025
6ca6020
Update ParlaMint-taxonomy-topic.xml
AnnaParla Jun 11, 2025
f7b0b23
Update ParlaMint-taxonomy-sentiment.ana.xml
AnnaParla Jun 11, 2025
2b785f6
Merge branch 'data' into devel.
TomazErjavec Jun 11, 2025
f1b33b5
Merge pull request #919 from AnnaParla/patch-4
matyaskopp Jun 11, 2025
d946f20
Merge pull request #920 from AnnaParla/patch-5
matyaskopp Jun 11, 2025
4b8044b
UA: add uk translation into common taxonomies #753
matyaskopp Jun 11, 2025
fecaff7
Merge branch 'data' into devel
TomazErjavec Jun 12, 2025
dddc850
ES-GA: add gl translation into common taxonomies #753
matyaskopp Jun 12, 2025
d0b5b98
Fix number format bugs.
TomazErjavec Jun 12, 2025
349d1f5
Fix URL typo.
TomazErjavec Jun 12, 2025
1359144
Merge branch 'data' into devel
TomazErjavec Jun 13, 2025
e026fce
fix number format for Czech
matyaskopp Jun 13, 2025
b67fef7
update taxonomies responsibility, add initTaxonomies4release target #753
matyaskopp Jun 13, 2025
491d314
ES-GA: add spanish taxonomies translation #753
matyaskopp Jun 13, 2025
9d114ad
Update with changes for run.
TomazErjavec Jun 13, 2025
bd18c4e
Discuss local taxonomy IDs and topic and sentiment taxonomies (#902).
TomazErjavec Jun 13, 2025
a9af670
add target for polishing xml, use XMLFILE varieble
matyaskopp Jun 16, 2025
926b3a5
polish common taxonomies #753
matyaskopp Jun 16, 2025
5cc62a1
fix formating and polishing taxonomies #753
matyaskopp Jun 16, 2025
77d6169
format common taxonomies #753
matyaskopp Jun 16, 2025
fcad86e
ES-GA: add es translation #753
matyaskopp Jun 16, 2025
8a464bd
Merge branch 'data' into devel
TomazErjavec Jun 16, 2025
2b26d83
Add ParlaMint-IL registry file.
TomazErjavec Jun 16, 2025
66834c2
improve space normalization in taxonomies #753
matyaskopp Jun 16, 2025
49563a2
Revert "improve space normalization in taxonomies #753"
matyaskopp Jun 16, 2025
7c07d7e
fix taxonomies formatting #753
matyaskopp Jun 16, 2025
d4c2619
Remove spurious backslash.
TomazErjavec Jun 16, 2025
e0a7710
CZ ES ES-CT ES-GA GB IS NL SI TR UA: reinsert release ready taxonomie…
matyaskopp Jun 16, 2025
efa5f46
GitHub Actions: install missing xmllint dependency
matyaskopp Jun 16, 2025
e323b45
GR: reinsert release ready taxonomies with proper formating to sample…
matyaskopp Jun 16, 2025
3293c3a
Merge branch 'devel' into data
matyaskopp Jun 16, 2025
baadd7f
fix typo #753
matyaskopp Jun 16, 2025
895a07d
Merge branch 'data' of github.com:clarin-eric/ParlaMint into data
matyaskopp Jun 16, 2025
f41c31b
Merge pull request #922 from clarin-eric/data
matyaskopp Jun 16, 2025
73d68f7
BG: insert bg translation #753
matyaskopp Jun 17, 2025
141f885
BG: add bg translation into common taxonomies #753
matyaskopp Jun 17, 2025
5648ded
BG: reinsert release ready taxonomies with proper formating to sample…
matyaskopp Jun 17, 2025
50c5b97
BG: typo #753
matyaskopp Jun 17, 2025
75764a8
Merge branch 'data' into devel
TomazErjavec Jun 17, 2025
64e846f
Restore backslash.
TomazErjavec Jun 17, 2025
cc3daaf
Completely remove spurious backslash.
TomazErjavec Jun 17, 2025
cede58b
Add Samples README.
TomazErjavec Jun 17, 2025
e251ca1
Expand on Samples/.
TomazErjavec Jun 17, 2025
23793de
italian translation
Ittig Jun 19, 2025
be0fb9c
italian translation
Ittig Jun 19, 2025
6a4da28
italian translation
Ittig Jun 19, 2025
962177c
FI: add translation for sentiment taxonomy #753
yoge1 Jun 19, 2025
9e01807
FI: add translation for topic taxonomy #753
yoge1 Jun 19, 2025
28fe580
FI: add translation for sentiment taxonomy description #753
yoge1 Jun 19, 2025
e7450f5
FI: use values in sentiment taxonomy category descriptions as ParlaSe…
yoge1 Jun 19, 2025
1406909
FI: add translation for topics taxonomy description #753
yoge1 Jun 19, 2025
f59f8ec
Merge pull request #924 from atomm/data
matyaskopp Jun 19, 2025
543ad21
Merge pull request #925 from SemanticComputing/data
matyaskopp Jun 19, 2025
a515587
IT: add it translation into common taxonomies #753
matyaskopp Jun 19, 2025
75a7a40
FI: add fi translation into common taxonomies #753
matyaskopp Jun 19, 2025
f14ab2b
action: generating ParlaMint-[FI] sample files with #925
matyaskopp Jun 19, 2025
6849ccf
action: generating ParlaMint roots for Sample folder #925
matyaskopp Jun 19, 2025
7560058
FI IT: reinsert release ready taxonomies with proper formating to sam…
matyaskopp Jun 19, 2025
be880a8
Merge branch 'data' of github.com:clarin-eric/ParlaMint into data
matyaskopp Jun 19, 2025
bfb76ff
Merge branch 'devel' into data
matyaskopp Jun 20, 2025
9d00e03
Add -en corpora to README.
TomazErjavec Jun 21, 2025
8606c08
Update for v 5.0.
TomazErjavec Jun 21, 2025
c462099
Minor edits.
TomazErjavec Jun 21, 2025
e40cadc
Make corpus roots.
TomazErjavec Jun 21, 2025
a11a012
Add samples for 5.0.
TomazErjavec Jun 22, 2025
1a756d6
Add metadata TSV for 5.0.
TomazErjavec Jun 22, 2025
3bfda3f
Remove old registry files.
TomazErjavec Jun 22, 2025
140307d
Allow meta-file generation with .ana (for -ana tsvs).
TomazErjavec Jun 22, 2025
73bd3b2
Re-do samples
TomazErjavec Jun 23, 2025
6f77ac6
Make -ana meta file also in local language.
TomazErjavec Jun 23, 2025
e8cbd43
AT: add de translation into common taxonomies #753
matyaskopp Jun 23, 2025
dcbd54a
Revert "AT: add de translation into common taxonomies #753"
matyaskopp Jun 23, 2025
1ab9a5e
AT add partial de translation #753
matyaskopp Jun 23, 2025
839309e
Revert "AT add partial de translation #753"
matyaskopp Jun 23, 2025
d6f16d1
AT: insert buggy de translation #753
matyaskopp Jun 23, 2025
aca217a
AT: fix de partial translations #753
matyaskopp Jun 23, 2025
74fecc8
AT: add de partial-translation into common taxonomies #753
matyaskopp Jun 23, 2025
b65d917
fix initialization taxonomies for release, when translation is missin…
matyaskopp Jun 23, 2025
b388d2e
AT: reinsert release ready taxonomies with proper formating to sample…
matyaskopp Jun 23, 2025
e457860
New samples (with -ana-meta.tsv).
TomazErjavec Jun 23, 2025
37d4c6a
Merge branch 'data' into devel.
TomazErjavec Jun 23, 2025
0df4a29
Bosnian translation added
nljubesi Jun 23, 2025
4670bdb
Croatian translation added
nljubesi Jun 23, 2025
b4af276
Serbian translation added
nljubesi Jun 23, 2025
744e2b7
Update ParlaMint-taxonomy-sentiment.ana.xml
nljubesi Jun 23, 2025
363ba18
Update ParlaMint-taxonomy-sentiment.ana.xml
nljubesi Jun 23, 2025
6143073
Update ParlaMint-taxonomy-sentiment.ana.xml
nljubesi Jun 23, 2025
28f634d
Update ParlaMint-taxonomy-sentiment.ana.xml
nljubesi Jun 23, 2025
30f44ec
Update ParlaMint-taxonomy-sentiment.ana.xml
nljubesi Jun 23, 2025
5a3dd9c
AT: add de taxonomies translation #753
matyaskopp Jun 23, 2025
89e6346
Merge branch 'data' of github.com:clarin-eric/ParlaMint into data
matyaskopp Jun 23, 2025
74bbf31
Merge branch 'data' into nljubesi-translation-RS
matyaskopp Jun 24, 2025
149edf5
Merge pull request #931 from clarin-eric/nljubesi-translation-RS
matyaskopp Jun 24, 2025
6f1cfc3
Merge branch 'data' into nljubesi-translation-HR
matyaskopp Jun 24, 2025
3d6f7f1
Merge pull request #930 from clarin-eric/nljubesi-translation-HR
matyaskopp Jun 24, 2025
e8358a7
Merge branch 'data' into nljubesi-translation-BA
matyaskopp Jun 24, 2025
d9c1e58
Merge pull request #929 from clarin-eric/nljubesi-translation-BA
matyaskopp Jun 24, 2025
4a77cfe
BA: fix lang attribute value en->bs #753
matyaskopp Jun 24, 2025
09b405f
[data 4a77cfe7] BA: fix lang attribute value en->bs #753
matyaskopp Jun 24, 2025
e3e5d9b
HR: fix lang attribute value en->hr #753
matyaskopp Jun 24, 2025
dfe6052
RS: fix lang attribute value en->sr #753
matyaskopp Jun 24, 2025
5a7c173
BA: fix lang attribute value en->bs #753
matyaskopp Jun 24, 2025
659ed0e
BA: add bs to common taxonomies and reinitialize taxonomies #753
matyaskopp Jun 24, 2025
70ecf14
HR: add hr to common taxonomies and reinitialize taxonomies #753
matyaskopp Jun 24, 2025
417ce2f
RS: add sr to common taxonomies and reinitialize taxonomies #753
matyaskopp Jun 24, 2025
d3e32c2
Update with German transation of taxonomy description.
TomazErjavec Jun 24, 2025
1943419
Merge remote-tracking branch 'refs/remotes/origin/data' into data
TomazErjavec Jun 24, 2025
a042595
skip failing when merging large amount of data to main
matyaskopp Jun 26, 2025
39d2733
Make new samples for 5.0
TomazErjavec Jun 27, 2025
d552a0f
Make new ParlaMint roots.
TomazErjavec Jun 27, 2025
b081303
fix corpus-specific taxonomies validation #933
matyaskopp Jun 27, 2025
78081df
fix specific-taxonomies patching (removefd problem with unicode chara…
matyaskopp Jun 27, 2025
9042bf2
Make new root files.
TomazErjavec Jun 28, 2025
68f6c59
Update README.md (#935)
GiliGoldin Jun 30, 2025
de64331
Update roots.
TomazErjavec Jun 30, 2025
88a03df
Update IS, TR samples.
TomazErjavec Jun 30, 2025
463c531
Remove bad IS sample files.
TomazErjavec Jul 1, 2025
c70435f
Fix IL readme (tag quotes, LREV authors).
TomazErjavec Jul 1, 2025
5b896e2
FR: add fr taxonomies translations #753
matyaskopp Jul 1, 2025
8f84230
FR: reinitialize taxonomies #753
matyaskopp Jul 1, 2025
e830727
Add u level sentiment to subcorpattrs for SI 5.0.
TomazErjavec Jul 1, 2025
73b1442
Slight change in visualisation.
TomazErjavec Jul 1, 2025
bd095d3
New roots and meta files.
TomazErjavec Jul 3, 2025
daa97ea
Housekeeping.
TomazErjavec Jul 4, 2025
281edd6
Add log file packing.
TomazErjavec Jul 4, 2025
1 change: 1 addition & 0 deletions .github/actions/ParlaMintEnvSetup/action.yml
@@ -19,6 +19,7 @@ runs:
- name: Install deps
run: |
sudo apt-get install rename
sudo apt-get install -y libxml2-utils
pip3 install --user regex
shell: bash
- name: Setup Java JDK
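The added line installs libxml2-utils, the Debian/Ubuntu package that ships xmllint. A minimal sketch of a pre-flight check for such tools is given below; `missing_deps` is a hypothetical helper used only for illustration and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): print every command
# from the argument list that cannot be found on PATH.
missing_deps() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# xmllint comes from libxml2-utils, rename from the rename package.
missing=$(missing_deps xmllint rename)
if [ -n "$missing" ]; then
  echo "missing tools:" $missing
else
  echo "all tools found"
fi
```

Running a check like this before validation would surface a missing xmllint early rather than in the middle of a make target.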
1 change: 1 addition & 0 deletions .github/actions/ParlaMintValidate/validate.sh
@@ -28,6 +28,7 @@ for parla in $(jq -r '.[]' <<< $1 ); do
else
echo "::warning:: INFO initialize taxonomies with no translations - check if correct(known) ids has been used"
make initTaxonomies-$parla
make validateTaxonomies-$parla | sed "s/^\(.*\)\(\berror\b\)/::error:: \1\2/i" | tee $DIR/taxonomies.log
fi

if [ -f "${DATADIR}/ParlaMint-$parla/ParlaMint-$parla.xml" ] ; then
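The new line pipes the output of `make validateTaxonomies-$parla` through sed so that any line containing the word "error" (case-insensitive) gets a `::error::` prefix, which GitHub Actions renders as an error annotation. The sketch below reuses the same sed expression on made-up log lines; the wrapper name `annotate_errors` and the sample messages are assumptions for illustration, and `\b` plus the `i` flag require GNU sed.

```shell
#!/usr/bin/env bash
# Same substitution as in validate.sh, wrapped in a function for testing.
# \b (word boundary) and the trailing "i" flag are GNU sed extensions.
annotate_errors() {
  sed "s/^\(.*\)\(\berror\b\)/::error:: \1\2/i"
}

# Made-up sample lines: only the first contains "error" as a whole word.
printf '%s\n' \
  'ParlaMint-XX-taxonomy.xml: ERROR unknown id' \
  '3 errors found' \
  'all fine' | annotate_errors
```

Only the first sample line is prefixed: "errors" does not match because `\b` requires a word boundary directly after "error".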
2 changes: 1 addition & 1 deletion .github/workflows/validate.yml
@@ -31,7 +31,7 @@ jobs:
id: detect-changes
uses: ./ParlaMint/.github/actions/ParlaMintStatus
- name: Test total TEI file size limit
if: ${{ steps.detect-changes.outputs.max_parla_changed_size > 100 }}
if: ${{ steps.detect-changes.outputs.max_parla_changed_size > 100 && github.event.pull_request.base.ref != 'main' }}
run: |
echo "::error::100MB file limit has been exceed my one parliament"
exit 1
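The changed condition keeps the 100 MB limit for ordinary pull requests but skips it when the pull request targets main. The decision can be sketched as a small shell predicate; `size_check_applies` and its two arguments are hypothetical stand-ins for `max_parla_changed_size` and `github.event.pull_request.base.ref`, not code from the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the workflow condition:
# succeed only when the size exceeds 100 and the base branch is not main.
size_check_applies() {
  local size_mb=$1 base_ref=$2
  [ "$size_mb" -gt 100 ] && [ "$base_ref" != "main" ]
}

for pair in "150 devel" "150 main" "80 devel"; do
  set -- $pair
  if size_check_applies "$1" "$2"; then
    echo "size=$1 base=$2: fail the job"
  else
    echo "size=$1 base=$2: skip the check"
  fi
done
```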
2 changes: 0 additions & 2 deletions Build/Distro/Makefile

This file was deleted.

1,072 changes: 818 additions & 254 deletions Build/Distro/ParlaMint-en.ana.xml

Large diffs are not rendered by default.

1,117 changes: 827 additions & 290 deletions Build/Distro/ParlaMint.ana.xml

Large diffs are not rendered by default.

402 changes: 263 additions & 139 deletions Build/Distro/ParlaMint.xml

Large diffs are not rendered by default.

166 changes: 82 additions & 84 deletions Build/Makefile
@@ -1,38 +1,40 @@
########### Makefile for making a distributable version of the ParlaMint TEI, TEI.ana, -en.TEI.ana corpora and metadata overviews
#### Variables give the corpora, version, handle, paths and scripts to use
#### make nohup1 starts make all and saves the log in Logs/
########### Makefile for making a distributable version of the ParlaMint TEI, TEI.ana, and -en.TEI.ana corpora and metadata overviews
#### Variables give the included countries, version, handle, paths and scripts to use
#### make nohup1 starts make all and saves the log in Logs/
#### make mt-nohup1 starts make mt-all and saves the log in Logs/
#### make all builds the plain text and linguistically annotated corpora
#### make mt-all builds the machine translated and linguistically annotated corpora
#### and there are a lot of test- targets to test various parts of the build.

### VARIABLES

### COMPLETE SET OF CORPORA
#CORPORA=AT BA BE BG CZ DK EE ES ES-CT ES-GA ES-PV FI FR GB GR HR HU IS IT LV NL NO PL PT RS SE SI TR UA
CORPORA=AT
CORPORA=AT BA BE BG CZ DK EE ES ES-CT ES-GA ES-PV FI FR GB GR HR HU IS IT LV NL NO PL PT RS SE SI TR UA

# Used in targets that run only for one corpus
CORPUS=

# Version number and PID of next ParlaMint release
VERSION = 5.0
HANDLE-TEI = http://hdl.handle.net/11356/2004
HANDLE-ANA = http://hdl.handle.net/11356/2005
HANDLE-MT = http://hdl.handle.net/11356/2006

# For IL only:
# VERSION = 1.0
# HANDLE-TEI = http://hdl.handle.net/11356/2032
# HANDLE-ANA = http://hdl.handle.net/11356/2032

#Absolute paths are needed otherwise problems with XSLT
PARLAMINT := $(shell realpath .. | tr -d '\n')# get real absolute path to ParlaMint directory
HERE = ${PARLAMINT}/Build
TEMP = ${HERE}/Temp
SCH = ${PARLAMINT}/Schema

# Where the submitted corpora are found:
# ParlaMint-XX.TEI/ and ParlaMint-XX.TEI.ana
# Where the submitted corpora are found: ParlaMint-XX.TEI/ and ParlaMint-XX.TEI.ana
SOURCES = ${HERE}/Sources-TEI
# ParlaMint-XX-en.TEI.ana, MTed + semantically tagged:
SOURCES-MT = ${HERE}/Sources-CoNLLU

# Version number and PID of next ParlaMint release
VERSION = 5.0
HANDLE-TEI = http://hdl.handle.net/11356/2004
HANDLE-ANA = http://hdl.handle.net/11356/2005
HANDLE-MT = http://hdl.handle.net/11356/2006

# For IL only:
#VERSION = 1.0
#HANDLE-TEI = http://hdl.handle.net/11356/2032
#HANDLE-ANA = http://hdl.handle.net/11356/2032

#Where the produced corpora are put for inspection
WEB = tomaz@nl.ijs.si:/home/tomaz/www/tmp/ParlaMint

@@ -41,9 +43,9 @@ WEB = tomaz@nl.ijs.si:/home/tomaz/www/tmp/ParlaMint
###### Targets

### Overviews to be put in Metadata/

### This should be done once all the corpora have been built
metadata: metadata-persons metadata-orgs metadata-quant-tsv metadata-quant-tex
#Make overview LaTeX tables (for LREV paper)
#Make overview LaTeX tables (e.g. for LREV paper)
metadata-quant-tex:
$s mode=tex -xsl:Scripts/parlamint2cnt-overview.xsl Distro/ParlaMint.xml > Metadata/ParlaMint-overview-stats.tex
$s mode=tex -xsl:Scripts/parlamint2cnt-particDesc.xsl Distro/ParlaMint.xml > Metadata/ParlaMint-participDesc-stats.tex
@@ -68,9 +70,23 @@ source-metadata:
$s out-lang=xx -xsl:Scripts/listOrg-tei2tsv.xsl Sources-TEI/ParlaMint.xml > Metadata/ParlaMint-listOrg.tsv
$s out-lang=en -xsl:Scripts/listOrg-tei2tsv.xsl Sources-TEI/ParlaMint.xml > Metadata/ParlaMint-listOrg-en.tsv

### Make overall root(.ana) for ParlaMint for Sources-TEI/ and Distro/,
### This should be done once all the corpora have been built
all-roots: source-roots master-roots
source-roots:
$s base=${HERE}/Sources-TEI type=TEI -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Sources-TEI/ParlaMint.xml
$s base=${HERE}/Sources-TEI type=TEI.ana -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Sources-TEI/ParlaMint.ana.xml
master-roots:
$s base=${HERE}/Distro type=TEI -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Distro/ParlaMint.xml
$s base=${HERE}/Distro type=TEI.ana -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Distro/ParlaMint.ana.xml
$s base=${HERE}/Distro type=en.TEI.ana -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Distro/ParlaMint-en.ana.xml

###### Various tests
test:
date
test-tei2:
${FINALIZE} -valid -codes SI -in ${HERE}/Distro -out ${HERE}/Distro
test-tei1:
@@ -136,22 +152,24 @@ test-fix2:
### Fixes
# Merge per-language translated CoNLL-Us (BE, ES-CT, ES-PV, UA) to joint CoNLL-U (with # lang info on newpar)
# It is more useful to have them merged than separate
mrg-conll-nohup:
fix-conll-nohup:
nohup time make mrg-conll > Logs/ParlaMint_Merge_CoNLL-U.log &
mrg-conll:
fix-conll:
Scripts/merge-conllu.pl Distro/ParlaMint-BE.conllu ${SOURCES-MT}/ParlaMint-BE-en.conllu
Scripts/merge-conllu.pl Distro/ParlaMint-ES-CT.conllu ${SOURCES-MT}/ParlaMint-ES-CT-en.conllu
Scripts/merge-conllu.pl Distro/ParlaMint-ES-PV.conllu ${SOURCES-MT}/ParlaMint-ES-PV-en.conllu
Scripts/merge-conllu.pl Distro/ParlaMint-UA.conllu ${SOURCES-MT}/ParlaMint-UA-en.conllu

# Fix a mistake with handle in corpora
# In-place fix mistake with handle in corpora
OLD = http://hdl.handle.net/11356/1810
NEW = http://hdl.handle.net/11356/1488
fix-handle:
for CORPUS in ${CORPORA}; do \
Scripts/fix-handle.pl "Distro/ParlaMint-$${CORPUS}.TEI.ana/ParlaMint-$${CORPUS}.ana.xml"; \
Scripts/fix-handle.pl "Distro/ParlaMint-$${CORPUS}.TEI.ana/*/*.ana.xml"; \
Scripts/fix-handle.pl ${OLD} ${NEW} "Distro/ParlaMint-$${CORPUS}.TEI.ana/ParlaMint-$${CORPUS}.ana.xml"; \
Scripts/fix-handle.pl ${OLD} ${NEW} "Distro/ParlaMint-$${CORPUS}.TEI.ana/*/*.ana.xml"; \
done;

# Copy READMEs to master
# Post-hoc copy READMEs to master, in case they need to be changed after the corpora have been built
cp-readmes:
Scripts/cp-readmes.pl -codes "${CORPORA}" -version ${VERSION} -teihandle ${HANDLE-TEI} -anahandle ${HANDLE-ANA} \
-docs Sources-Distro -out ${HERE}/Distro
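In this hunk the fix-handle target now passes the old and new handle to Scripts/fix-handle.pl as arguments. The script itself is not shown in the diff, so the following is only a sketch of what an in-place handle replacement could look like, written with GNU sed; `replace_handle` and the sample `<idno>` line are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the actual fix-handle.pl:
# replace every occurrence of OLD with NEW in the given files, in place.
replace_handle() {
  local old=$1 new=$2
  shift 2
  local old_re new_re
  # Escape regex metacharacters in OLD and replacement metacharacters in NEW;
  # "|" is the sed delimiter because handles contain "/".
  old_re=$(printf '%s' "$old" | sed 's/[][\.*^$|]/\\&/g')
  new_re=$(printf '%s' "$new" | sed 's/[\&|]/\\&/g')
  sed -i "s|$old_re|$new_re|g" "$@"   # -i without a suffix is GNU sed syntax
}

# Demo on a throw-away file with a made-up line.
tmp=$(mktemp)
printf '%s\n' '<idno>http://hdl.handle.net/11356/1810</idno>' > "$tmp"
replace_handle http://hdl.handle.net/11356/1810 http://hdl.handle.net/11356/1488 "$tmp"
cat "$tmp"
rm -f "$tmp"
```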
@@ -170,7 +188,7 @@ mt-samples:
#Merge original and MTed samples into official Samples directory
cp-samples:
Scripts/cp-samples.pl 'Distro/ParlaMint-*' ../Samples
#cp Logs/ParlaMint-$${CORPUS}-samples.log ../Samples/ParlaMint-$${CORPUS}; \


# Make vertical file with en metadata, a hack:
XX-CORPORA = AT-xx BA-xx BE-xx BG-xx CZ-xx DK-xx EE-xx ES-xx ES-CT-xx ES-GA-xx ES-PV-xx FI-xx FR-xx GB-xx GR-xx HR-xx HU-xx IS-xx IT-xx LV-xx NL-xx NO-xx PL-xx PT-xx RS-xx SE-xx SI-xx TR-xx UA-xx
@@ -179,7 +197,8 @@ make-verts-xx-nohup:
nohup time make make-verts-xx > Logs/ParlaMint-Verts-xx.log &
make-verts-xx:
for CORPUS in ${CORPORA}; do \
../Scripts/parlamintp-tei2vert-xx.pl ${HERE}/Distro/ParlaMint-$${CORPUS}.TEI.ana Temp/ParlaMint-$${CORPUS}-xx.vert; \
../Scripts/parlamintp-tei2vert-xx.pl -jobs ${THREADS} \
-in ${HERE}/Distro/ParlaMint-$${CORPUS}.TEI.ana -out Temp/ParlaMint-$${CORPUS}-xx.vert; \
done;
perl ../Scripts/join-all-verts.pl -codes '${XX-CORPORA}' -in 'Temp' -out Verts/ParlaMint-XX.${VERSION}.vert

@@ -190,40 +209,20 @@ make-verts:
done;
make verts

# Don't make TEI but only text, vert and conllu files only
make-conll-vert-txt:
# Don't make TEI but only text, vert and conllu files
make-txt-vert-conll:
for CORPUS in ${CORPORA}; do \
${FINALIZE} -txt -vert -conll -codes $${CORPUS} -in ${SOURCES} -out ${HERE}/Distro; \
done;

# Make overall root(.ana) for ParlaMint for Sources-TEI/ and Distro/,
all-roots: source-roots master-roots
source-roots:
$s base=${HERE}/Sources-TEI type=TEI -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Sources-TEI/ParlaMint.xml
$s base=${HERE}/Sources-TEI type=TEI.ana -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Sources-TEI/ParlaMint.ana.xml
master-roots:
$s base=${HERE}/Distro type=TEI -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Distro/ParlaMint.xml
$s base=${HERE}/Distro type=TEI.ana -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Distro/ParlaMint.ana.xml
$s base=${HERE}/Distro type=en.TEI.ana -xsl:../Scripts/parlamint2root.xsl \
../Scripts/ParlaMint-rootTemplate.xml > ${HERE}/Distro/ParlaMint-en.ana.xml

mt-logs:
for CORPUS in ${CORPORA}; do \
grep -a -i 'fatal' Logs/ParlaMint-$${CORPUS}-en.log > Logs/ParlaMint-$${CORPUS}-en.error.log; \
grep -a -i 'error' Logs/ParlaMint-$${CORPUS}-en.log >> Logs/ParlaMint-$${CORPUS}-en.error.log; \
grep -a -i 'warn' Logs/ParlaMint-$${CORPUS}-en.log > Logs/ParlaMint-$${CORPUS}-en.warn.log; \
done;

# Put logs and packed build to web for inspection by corpus compilers
web-nohup:
nice nohup time make web > ParlaMint-Web.log &
web:
rsync -av Logs/*.log ${WEB}/Logs
rsync -av Packed/*.tgz ${WEB}/Repo


###### Targets for producing releasable version of ParlaMint corpora
FINALIZE = perl ../Scripts/parlamint2distro.pl -version ${VERSION} -teihandle ${HANDLE-TEI} -anahandle ${HANDLE-ANA} -schema ../Schema -docs Sources-Distro -procMemGB ${JAVA-MEMORY} -procChunkSize ${CHUNK-SIZE} -procThreads ${THREADS}

@@ -237,12 +236,23 @@ nohup2:
nohup3:
nice nohup time make all > Logs/ParlaMint.3.log &

all: final verts
xall: final verts pack
all: final join-verts pack
xall: final join-verts pack

pack-logs:
mkdir -p Packed/ParlaMint-logs
rm -f Packed/ParlaMint-logs/*
for CORPUS in ${CORPORA}; do \
cp Logs/ParlaMint-$${CORPUS}.*log Packed/ParlaMint-logs; \
done
cd Packed; tar -czf ParlaMint-logs.tgz ParlaMint-logs
rm -fr Packed/ParlaMint-logs
mkdir -p Packed/ParlaMint-en-logs
rm -f Packed/ParlaMint-en-logs/*

pack:
perl Scripts/pack-parlamint.pl -codes '${CORPORA}' -in Distro -out Packed
verts:
join-verts:
perl Scripts/join-verts.pl -version ${VERSION} -codes '${CORPORA}' -in Distro -out Verts
final:
for CORPUS in ${CORPORA}; do \
@@ -268,17 +278,11 @@ FINALIZE-MT=perl ../Scripts/parlamint2distro.pl -version ${VERSION} -anahandle $

# Targets
mt-nohup1:
nice nohup time make mt-all > Logs/ParlaMint-en.log &
nice nohup time make mt-all > Logs/ParlaMint-en.1.log &
mt-nohup2:
nice nohup time make mt-all > Logs/ParlaMint-en.2.log &
mt-nohup3:
nice nohup time make mt-all > Logs/ParlaMint-en.3.log &
mt-nohup4:
nice nohup time make mt-all > Logs/ParlaMint-en.4.log &
mt-nohup5:
nice nohup time make mt-all > Logs/ParlaMint-en.5.log &
mt-nohup6:
nice nohup time make mt-all > Logs/ParlaMint-en.6.log &

mt-all: mt-final
mt-xall-final: mt-convert mt-final mt-verts mt-pack mt-web
@@ -291,6 +295,12 @@ mt-convert-txt:
mt-web:
rsync -av Logs/*-en*.log ${WEB}/Logs
rsync -av Packed/*-en*.tgz ${WEB}/Repo
mt-pack-logs:
for CORPUS in ${CORPORA}; do \
cp Logs/ParlaMint-$${CORPUS}-en.*log Packed/ParlaMint-en-logs; \
done
cd Packed; tar -czf ParlaMint-en-logs.tgz ParlaMint-en-logs
rm -fr Packed/ParlaMint-en-logs
nohup-mt-pack:
nohup time make mt-pack > mt-pack.log &
mt-pack:
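The pack-logs and mt-pack-logs recipes added in this file follow one pattern: copy the per-corpus logs into a staging directory under Packed/, archive that directory with tar, then remove it. A minimal shell sketch of that pattern is given below; `pack_logs` and its argument order are hypothetical, and the demo runs on empty throw-away files rather than real logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the recipe steps.
# Arguments: log directory, output directory, archive base name, corpus codes.
pack_logs() {
  local logs=$1 packed=$2 name=$3
  shift 3
  mkdir -p "$packed/$name"
  rm -f "$packed/$name"/*
  local corpus
  for corpus in "$@"; do
    # Same glob as in the Makefile: the "." after the code keeps -en logs out.
    cp "$logs"/ParlaMint-"$corpus".*log "$packed/$name"
  done
  (cd "$packed" && tar -czf "$name.tgz" "$name")
  rm -fr "${packed:?}/$name"
}

# Demo on throw-away data.
work=$(mktemp -d)
mkdir -p "$work/Logs" "$work/Packed"
touch "$work/Logs/ParlaMint-AT.log" "$work/Logs/ParlaMint-AT.error.log" \
      "$work/Logs/ParlaMint-SI.log"
pack_logs "$work/Logs" "$work/Packed" ParlaMint-logs AT SI
tar -tzf "$work/Packed/ParlaMint-logs.tgz"
rm -fr "$work"
```

The `${packed:?}` guard aborts if the output directory variable is empty, so the final `rm -fr` cannot expand to a path at the filesystem root.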
@@ -321,7 +331,7 @@ mt-make-verts:

# Join verts only
mt-verts:
#perl ../Scripts/join-all-verts.pl -codes '${CORPORA}' -in 'Distro' -out Verts/ParlaMint-XX.${VERSION}.vert
perl ../Scripts/join-all-verts.pl -codes '${CORPORA}' -in 'Distro' -out Verts/ParlaMint-XX.${VERSION}.vert
perl ../Scripts/join-all-verts.pl -en -codes '${CORPORA}' -in 'Distro' -out Verts/ParlaMint-XX-en.${VERSION}.vert

# Sanity check for alignment
@@ -382,9 +392,9 @@ mt-test9:
${HERE}/Distro/ParlaMint-ES-CT-en.TEI.ana/2015/ParlaMint-ES-CT-en_2015-10-26-0101.ana.xml > test.vert
mt-test8:
$s -xsl:../Scripts/validate-parlamint.xsl \
${HERE}/Distro/ParlaMint-UA.TEI.ana/ParlaMint-UA.ana.xml
$s meta=${HERE}/Distro/ParlaMint-UA.TEI.ana/ParlaMint-UA.ana.xml -xsl:../Scripts/validate-parlamint.xsl \
${HERE}/Distro/ParlaMint-UA.TEI.ana/2022/ParlaMint-UA_2022-01-25-m0.ana.xml
${HERE}/Distro/ParlaMint-AT-en.TEI.ana/ParlaMint-AT-en.ana.xml
$s meta=${HERE}/Distro/ParlaMint-AT-en.TEI.ana/ParlaMint-AT-en.ana.xml -xsl:../Scripts/validate-parlamint.xsl \
${HERE}/Distro/ParlaMint-AT-en.TEI.ana/2005/ParlaMint-AT-en_2005-04-27-022-XXII-NRSITZ-00108.ana.xml
mt-test7:
$s meta=${HERE}/Distro/ParlaMint-AT-en.TEI.ana/ParlaMint-AT-en.ana.xml -xsl:../Scripts/check-links.xsl \
${HERE}/Distro/ParlaMint-AT-en.TEI.ana/2022/ParlaMint-AT-en_2022-01-20-027-XXVII-NRSITZ-00139.ana.xml
@@ -432,25 +442,13 @@ merge-taxos:
done;
${vta} Taxonomies/ParlaMint-taxonomy-*.xml

### Some ideas, need to think about them...

#REGIS=at ba be bg cz dk es_ct fr gb gr hr hu is it lv nl no pl pt rs se si tr ua
REGIS=ua
QUERY=https://dev:alfabetagama@www.clarin.si/noske-beta/parlamint.cgi/wordlist?
TAIL=wlmaxitems=1000;wlattr=speech.body;wlminfreq=1;include_nonwords=1;wlsort=f;wlnums=docf;format=xml
body:
rm -f body.xml
for REGI in ${REGIS} ; do \
curl "${QUERY}corpname=parlamint30_$${REGI};${TAIL}" | grep -v xml >> body.xml ; \
done

###################### SCRIPT VARIABLES
##$JAVA-MEMORY## Set a java memory maxsize in GB
JAVA-MEMORY=240
JAVA-MEMORY=480
JM := $(shell test -n "$(JAVA-MEMORY)" && echo -n "-Xmx$(JAVA-MEMORY)g")

CHUNK-SIZE=100
THREADS=7
CHUNK-SIZE=500
THREADS=10

P = parallel --citation --gnu --halt 2
#Run java with a large heap, as a complete corpus needs to be read in
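The JM variable above turns JAVA-MEMORY into a JVM heap option through a `$(shell ...)` call: a non-empty value yields `-Xmx<N>g`, an empty one yields nothing. The same logic as a standalone shell function, with `heap_flag` being a hypothetical name used only here:

```shell
#!/usr/bin/env bash
# Mirrors the test -n ... && echo -n ... expression behind JM:
# print -Xmx<N>g when a size in GB is given, print nothing otherwise.
heap_flag() {
  local mem_gb=$1
  if [ -n "$mem_gb" ]; then
    printf -- '-Xmx%sg' "$mem_gb"
  fi
}

echo "JAVA-MEMORY=480 -> '$(heap_flag 480)'"
echo "JAVA-MEMORY empty -> '$(heap_flag "")'"
```

With the value raised to 480 in this change, the resulting option is `-Xmx480g`.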