LIS 654 lecture representation of text and copyright

LIS 654 lecture representation of text and copyright 1 Thomas Krichel 2011-11-15

what’s up doc? • We have seen the representation of images. • Today we look more at the representation of character data. • This is more difficult than the representation of images because it involves a more sophisticated representation of human culture. • After that we start with copyright.

introduction • From Omeka, we have seen that databases store records. • Records contain fields, fields have values. • Here we talk about how, fundamentally, we compose those values. – Numerical values are easy – String values are harder

literature • The library textbooks are hopelessly short and confused about this topic. • I have most of what I have here from my own experience. • I recommend Wikipedia, it has fascinating articles about these topics.

all gone to a number • In all modern information systems, information is stored to be treated on a computer. • A computer can only deal with numbers. • As a consequence all information has to be converted into a number. • It's a huge job. • Let’s start from the ground: numbers.

a bit • A bit is the elementary unit of information. • It takes a binary value. We can label it true/false, black/white, +/-, etc. • Every piece of information in all modern information storage systems has to be reduced to a sequence of bits. • We will denote them 0/1 here.

byte • A byte is a sequence of 8 bits, from '00000000' to '11111111'. There are 2 to the power 8, meaning 256, possibilities to write a byte. • If the byte is required to start with 0, then we can only write '00000000' to '01111111'. This leaves us with 2 to the power 7, meaning 128, possibilities.
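The counts on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function and variable names are made up for the example and are not part of the lecture.

```python
def possibilities(bits: int) -> int:
    """Number of distinct values a sequence of `bits` bits can take."""
    return 2 ** bits


byte_values = possibilities(8)       # a full byte
seven_bit_values = possibilities(7)  # a byte whose first bit is fixed to 0

# Write the extreme values as 8-character bit strings.
smallest = format(0, "08b")
largest = format(255, "08b")
largest_leading_zero = format(127, "08b")

print(byte_values, seven_bit_values)
print(smallest, largest, largest_leading_zero)
```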

hex numbers • Hex numbers contain the usual digits 0 to 9, as well as A to F. A means 10, B means 11, and so on up to F, which means 15. • One hex digit can represent 2 to the power 4, meaning 16, possibilities (0 to 15). • Two hex digits can represent 2 to the power 8 possibilities.

bytes and hex numbers • Since two hex digits convey the same number of possibilities as a byte, a byte is often represented as two hex digits. • Thus, for example • '00000000' in binary is '00' in hex, • '11111111' in binary is 'FF' in hex, • '01111111' in binary is '7F' in hex.
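As a small sketch of the correspondence between a byte and its two hex digits, the following Python helpers convert in both directions. The helper names are hypothetical, chosen only for this illustration.

```python
def byte_to_hex(bits: str) -> str:
    """Convert an 8-character bit string to two upper-case hex digits."""
    return format(int(bits, 2), "02X")


def hex_to_byte(hex_digits: str) -> str:
    """Convert two hex digits back to an 8-character bit string."""
    return format(int(hex_digits, 16), "08b")


for bits in ("00000000", "11111111", "01111111"):
    print(bits, "->", byte_to_hex(bits))
```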

converting information to numbers • A lot of problems in converting information come from some part of the information being encoded in one form and some other part in another form. • Example: “15 juillet 1923” vs “July 17, 1923” • Often such inconsistencies require manual reformatting, which is very expensive.

numerical information • Some information can be converted to a number using a simple conversion. • Examples: – A recent point in time is often converted into a number by taking the number of seconds since the first of January 1970. – A date is often written as an ISO date in the form yyyymmdd, where yyyy is the year, mm is the month and dd is the day, with leading 0s.
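Both conversions can be sketched in Python with the standard `datetime` module. The function names are illustrative, and the sketch assumes that the count of seconds starts at midnight UTC on 1 January 1970, which the slide does not spell out.

```python
from datetime import date, datetime, timezone


def seconds_since_1970(moment: datetime) -> int:
    """Seconds elapsed since 1970-01-01 00:00 UTC (assumed starting point)."""
    return int(moment.timestamp())


def iso_basic_date(day: date) -> str:
    """Write a date as yyyymmdd with leading zeros."""
    return f"{day.year:04d}{day.month:02d}{day.day:02d}"


one_day_later = datetime(1970, 1, 2, tzinfo=timezone.utc)
print(seconds_since_1970(one_day_later))
print(iso_basic_date(date(1923, 7, 15)))
```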

numerizing • In the design of every information system, it is a good idea to convert information into something that is directly a number. • There are examples where it is possible to use a number directly, such as – colours – times and dates – locations.

another hex number example • Colors on the world wide web follow the red/green/blue color model. • Each color is given as a number #rrggbb, where rr is the amount of red, gg is the amount of green and bb is the amount of blue. All these numbers are hex numbers. • Example – #FFFFFF white – #00FFFF aqua
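To make the #rrggbb notation concrete, here is a minimal Python sketch that splits such a string into its three components. The function name is invented for the example.

```python
def parse_web_color(color: str) -> tuple:
    """Split '#rrggbb' into (red, green, blue), each from 0 to 255."""
    digits = color.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected six hex digits")
    # Each pair of hex digits is one byte: rr, gg, bb.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(parse_web_color("#FFFFFF"))  # white
print(parse_web_color("#00FFFF"))  # aqua
```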

non-numerical information • A lot of information is not numerical by its nature. For example – the name of a person – the title of an expression of a work • The information is of a character string nature. • To store character strings in an information system, each character has to be converted to a number.

character • A character is an indivisible unit of textual information. • Textual information is composed of characters, and nothing else.

characters and computer • Computers can not deal with characters directly. They can only deal with numbers. • Therefore we need to associate a number with every character that we want to use in an information encoding system. • A character set associates characters with numbers.
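In Python, the built-in functions `ord` and `chr` expose this association between characters and numbers; Python uses Unicode as its character set. The sketch below is only an illustration and the variable names are made up.

```python
# Build a tiny lookup table from characters to their numbers.
sample = "Aa "
numbers = {character: ord(character) for character in sample}

# Going back from the number to the character.
back = {number: chr(number) for number in numbers.values()}

for character, number in numbers.items():
    print(repr(character), number, format(number, "02X"))
```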

ASCII • ASCII is an old character set developed in the United States. It is a seven-bit character set. • In hex notation, it goes from '00' to '7F'. • Because of Anglo-Saxon cultural imperialism, the first 128 characters in Unicode are the same as in ASCII.

notable characters in ASCII

decimal  hex  byte  Unicode  name
8        8    08    U+0008   backspace
9        9    09    U+0009   horizontal tab
10       A    0A    U+000A   line feed
13       D    0D    U+000D   carriage return
32       20   20    U+0020   space
127      7F   7F    U+007F   delete

wikipedia notation • Wikipedia denotes every character in the BMP as U+hhhh, where h is a hex digit 0-F. • We will follow this notation here.

UCS / Unicode • UCS is a universal character set. • It is maintained by the International Organization for Standardization (ISO). • Unicode is an industry standard for characters. It is better documented than UCS. • For what we discuss here, UCS and Unicode are the same.

Basic multilingual plane • This is a name for the first 65536 characters in Unicode. • Each of these characters fits into two bytes and is conveniently represented by four hex digits. • Even these characters come with numerous complications.

dashes • figure dash ‒ U+2012 to link numbers without a range • en dash – U+2013 to link numbers with a range • em dash — U+2014 for interjections in a sentence • minus sign − U+2212 for mathematics
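The official Unicode names of these characters can be looked up with Python's standard `unicodedata` module. This sketch adds the ASCII hyphen-minus (U+002D) for comparison, which is an addition for illustration and not on the slide.

```python
import unicodedata

# Code points written as escapes so the example does not depend on the editor.
dash_like = ["\u2012", "\u2013", "\u2014", "\u2212", "\u002D"]

names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in dash_like}

for code_point, name in names.items():
    print(code_point, name)
```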

“smart” quotes • U+201C “ is the opening double quote. • U+201D ” is the closing double quote. • U+2019 ’ is the apostrophe. • The single quote of the ASCII character set is considered to be of mixed usage; it should be avoided when a more specific character can be used. • Similarly, the double quote of the ASCII character set is imprecise.

spaces • The non-breaking space, U+00A0, is used when you want to avoid a line break between the two spaced items. For example, in hyperlink text it is good practice to replace spaces with non-breaking spaces, so that a link broken across lines does not appear to be two links. • In whitespace-collapsing contexts, it can also be used to add extra space.

beyond ascii, foreign languages • Everything becomes difficult. • As an example consider the characters – o – ő – ö • The latter two can be considered o with diacritics or as separate characters.
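Unicode supports both views: each of the two accented letters exists as a single character and can also be written as a plain o followed by a combining diacritic. The sketch below uses the standard `unicodedata.normalize` function to switch between the two forms; the helper name is invented for the example.

```python
import unicodedata


def decompose(character: str) -> list:
    """Code points of the fully decomposed (NFD) form of a character."""
    decomposed = unicodedata.normalize("NFD", character)
    return [f"U+{ord(c):04X}" for c in decomposed]


o_plain = "\u006F"         # o
o_double_acute = "\u0151"  # o with double acute
o_umlaut = "\u00F6"        # o with diaeresis

for ch in (o_plain, o_double_acute, o_umlaut):
    print(ch, decompose(ch))

# Composing again gives back the single character.
recomposed = unicodedata.normalize("NFC", "o\u0308")
```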

most problematic: encoding • One issue is how to map characters to numbers. This is complicated for languages other than English. • But assume UCS/Unicode has solved this. • But this is not the main problem that we have when working

encoding • The encoding determines how the numbers of each character should be put into bytes. • If you have a character set that has one byte for each character, you have no encoding issue. • But then you are limited to 256 characters in your character set.

fixed-length encoding • If you have a fixed-length encoding, all characters take the same number of bytes. • Say, for the Basic Multilingual Plane of Unicode, you need two bytes for each character, and then you are limited to that. • If you are writing only ASCII, it appears a waste.

variable length encoding • The most widely used scheme to encode Unicode is a variable-length scheme called UTF-8. • I will leave out the technical details of how this works. • But it is important to understand that the encoding needs to be known and correct.
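The difference between a fixed-length and a variable-length encoding can be observed by counting bytes in Python. As an assumption for this sketch, the two-byte fixed-length encoding is stood in for by Python's "utf-16-be" codec, which uses exactly two bytes for every character of the Basic Multilingual Plane; the helper name is made up.

```python
def byte_length(character: str, encoding: str) -> int:
    """Number of bytes a single character occupies in the given encoding."""
    return len(character.encode(encoding))


samples = ["a", "\u00F6", "\u2014"]  # a, o with diaeresis, em dash

for ch in samples:
    print(f"U+{ord(ch):04X}",
          "utf-8:", byte_length(ch, "utf-8"),
          "utf-16-be:", byte_length(ch, "utf-16-be"))
```

In UTF-8 the ASCII letter takes one byte and the other two take more, while the two-byte encoding spends two bytes on each, including the ASCII letter.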

ASCII vs UTF-8 • The ASCII representation of characters in a byte has the first bit set to zero. • This is the same as in UTF-8. • Any other character occupies at least two bytes in UTF-8. • This is in contrast to character sets such as ISO Latin-1, which place further characters in the second half of the byte, where the first bit is set to one. • This is THE major problem in practical work!
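A short Python sketch shows what goes wrong when the encoding is not known: the same character produces different bytes in UTF-8 and in ISO Latin-1, and reading the bytes with the wrong encoding either yields garbage or fails. The variable names are illustrative.

```python
text = "\u00F6"  # o with diaeresis

utf8_bytes = text.encode("utf-8")      # two bytes
latin1_bytes = text.encode("latin-1")  # one byte in the upper half

# UTF-8 bytes read as Latin-1: no error, but two wrong characters.
misread = utf8_bytes.decode("latin-1")

# Latin-1 bytes read as UTF-8: the single byte is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Pure ASCII text is byte-for-byte identical in both encodings.
ascii_same = "abc".encode("utf-8") == "abc".encode("latin-1")

print(utf8_bytes, latin1_bytes, misread, decoded_ok, ascii_same)
```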

ligature • In fine traditional typography, certain characters appear to be linked to each other. • The most common examples in English usage are fi, ff, fl, ffi, ffl.

ligatures growing up • In certain cases, ligatures have become so common that they have become characters of their own. • A prominent example is the German sz ligature, the Eszett (ß). It looks a bit like a beta because it is derived from the Fraktur form of the characters. • Another example, apparently, is &.
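Unicode reflects this distinction. It contains code points for the typographic ligatures, but marks them as compatibility characters that normalize back to their component letters, whereas ß is a letter in its own right. The sketch below uses the standard `unicodedata` module; the code points U+FB00 to U+FB04 are the Unicode ligature characters, which the slides do not mention explicitly.

```python
import unicodedata

ligatures = ["\uFB00", "\uFB01", "\uFB02", "\uFB03", "\uFB04"]

# Compatibility normalization (NFKC) replaces a ligature by its letters.
expanded = [unicodedata.normalize("NFKC", ch) for ch in ligatures]
print(expanded)

# The Eszett is unchanged by normalization: it is a character of its own.
eszett = "\u00DF"
eszett_normalized = unicodedata.normalize("NFKC", eszett)
eszett_name = unicodedata.name(eszett)

# Traces of its origin remain in case mapping.
eszett_upper = eszett.upper()
print(eszett_name, eszett_upper)
```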

collations • Collations are a topic that is related to characters. • A collation is a sorting order of character strings. • You may think this is trivial, just follow the alphabetic order. • But in many languages, diacritics come to complicate matters.

example German • Here are the extra letters of German: Ä/ä, Ö/ö, Ü/ü, ß • In German, there are two collations. – DIN 5007-1 “dictionary collation” treats umlauted characters as if they had no umlaut, and ß as ss. – DIN 5007-2 “phonebook collation” treats an umlauted character as the base letter followed by e (e.g. ä --> ae), and ß as ss.
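The two collations can be approximated in Python by sorting on a transformed key. This is a simplified sketch under stated assumptions: it only applies the letter substitutions named on the slide, ignores case, and does not implement the tie-breaking rules of the DIN standard. All names are hypothetical.

```python
import unicodedata

A_UML, O_UML, U_UML, ESZETT = "\u00E4", "\u00F6", "\u00FC", "\u00DF"

DICTIONARY = str.maketrans({A_UML: "a", O_UML: "o", U_UML: "u", ESZETT: "ss"})
PHONEBOOK = str.maketrans({A_UML: "ae", O_UML: "oe", U_UML: "ue", ESZETT: "ss"})


def dictionary_key(word: str) -> str:
    """Sort key ignoring umlauts (dictionary style)."""
    return unicodedata.normalize("NFC", word).lower().translate(DICTIONARY)


def phonebook_key(word: str) -> str:
    """Sort key expanding umlauts to base letter plus e (phonebook style)."""
    return unicodedata.normalize("NFC", word).lower().translate(PHONEBOOK)


words = ["Afrika", "\u00C4rger", "Adler"]  # the second word is Aerger with umlaut
dictionary_order = sorted(words, key=dictionary_key)
phonebook_order = sorted(words, key=phonebook_key)
print(dictionary_order)
print(phonebook_order)
```

With the dictionary key the umlauted word sorts like "arger" and comes last; with the phonebook key it sorts like "aerger" and moves between the other two.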

transliterations • When non-English characters are supposed to be entered in a system used by English-speaking people, a transliteration might be used. • This can also be the case if the original script may not be commonly understood. An example is Japanese road signs. • Wikipedia lists 20 different ways to do that for Russian, say. The Library of Congress scheme is apparently the most widely used.

outlook • This mainly follows the book by Hirtle et al. • Here I am working on chapters 1 to 3 of this book. • I will be covering selected content of the other chapters next week. – implications of copyright (what are the rights of the copyright holder) – exemptions to these implications • in general • to library and archives in particular

structure • basics of copyright (as relevant here) • copyright history • what can be copyrighted • how long does it last (complicated)

basis • US Constitution, Article I, Section 8 authorizes Congress to enact laws “To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” • The current use of copyright laws is a travesty of these objectives.

basic conditions • The work or subject matter must fall within a category of material protected by copyright. • Copyright must subsist in that particular work or subject matter, having regard to its originality, authorship, and fixity. • Copyright must not have expired.

copyright vs. possession • Purchasing a copy of a copyrighted work does not create a transfer of copyright. • This even holds when the object is unique, like say a painting. Copyright transfer would have to be negotiated explicitly in writing at the transfer of ownership.

public domain (henceforth: PD) • Works that are not affected by copyright are said to be in the public domain. • Works automatically enter the public domain after a certain time. • But copyright holders appear to be constantly able to extend the terms.

copyright is for expressions • Ideas and facts can not be copyrighted. • Expressions of ideas and facts can be copyrighted. • If an expression is simple, copying the expression is not likely to be in violation of any copyright owner’s rights.

copyright governance • US copyright is governed by the Copyright Act of 1976 as amended and incorporated in the United States Code as Title 17. • We use USC 17 in the following to refer to this. • There are unfortunately other sources of copyright governance – Common law – International treaties

common law • Works that are not protected by federal copyright laws may still be protected by what is often called “common law copyright.” • Common law copyright contains – state-based law – extracts from judicial decisions. • It can vary from state to state. • Implications of common law are minor.

international treaties • The most important here is the Berne Convention. • The US was a late signatory, because it traditionally copied things made abroad.

start of copyright • 1709: The first copyright act, the “Statute of Anne,” passes in England. It grants copyright protection to the authors of books. • 1787: U.S. Constitution in Article 1, Section 8, authorizes Congress to pass copyright and patent legislation.

first steps in specific legislation • 1790: First federal copyright statute passes. Protection is limited to maps, charts, and books. Duration is for 14 years, with the possibility of a 14-year renewal term if the author is still living. • 1831: Term extends to 28 years with the possibility of a 14-year extension. Protection extends to published music, which is protected against reproduction, but not performance.

continuous extensions • 1856: Copyright protection for dramatic public performances is added. • 1865: Photographs and negatives become eligible for copyright protection. • 1870: Copyright protection for dramatic works, pantomimes, paintings, drawings, and sculpture is added.

international dimension starts • 1886: Formulation of the Berne Convention for the Protection of Literary and Artistic Works. • 1891: First U.S. copyright protection for foreign works. Protection for performed music is added.

1909 copyright act • 1909: Copyright act makes major changes. It broadens the definition of works of authorship and extends terms to 28 years with the possibility of a 28-year renewal. • 1912: Movies are afforded copyright protection.

after the war • 1955: United States becomes a signatory to the Universal Copyright Convention (UCC), affording U.S. authors expanded protection abroad. • 1972: Sound recordings receive federal copyright protection.

main copyright act • 1976: Copyright Act of 1976 (went into effect in 1978). It makes a number of major revisions to U.S. copyright, including: – granting federal protection to unpublished items – calculating copyright duration based on life of the author +50 years – codifying the judicial doctrine of fair use – and adding specific exemptions for libraries and archives in Section 108.

Berne convention & renewals • 1988: The United States joins the Berne Convention. This leads to the eventual dismantling of all formal requirements (notice, registration, renewal) for copyright. • 1990: Works of architecture receive federal copyright protection. • 1992: Copyright renewal is made automatic. All works published from 1964 to 1978 are given an automatic 75-year term.

extension and digital adoption • 1998 Sonny Bono Copyright Extension Act extends almost all copyrights by another 20 years. • 1998 Digital Millennium Copyright Act gives online service providers some important safe harbors from copyright-infringement suits.

“stuff” that can be copyrighted • The US copyright act sets out what can be copyrighted. • But it does not furnish an exhaustive list. • It gives examples in 17 USC § 102. • We go through some of these.

literary works • This covers non-dramatic textual works – web sites – emails – toilet wall engravings – speeches – advertisements

musical works • The copyright act gives no definition. • It covers new compositions and arrangements of older ones. • The performance generates a separate copyright. • The owner of a copyrighted musical work has an additional privilege to make or authorize the first recording.

dramatic works, including accompanying music • These are literary works intended to be performed. • As with musical works, the performance may have a separate copyright attached to it.

pantomimes and choreographic works • Simple dance steps can not be copyrighted. • The works will have to be written out in some notation in order to achieve tangible form.

pictorial, graphic, sculptural work • § 101 of the US Copyright Act lists those as – works of fine, graphic and applied art – photographs – prints and art reproductions – maps, charts, globes and other cartographic works – diagrams, models and technical drawings – architectural works

no copyright in useful articles • Such things don’t usually get copyright protection. • Copyright can protect artistic aspects of useful articles if these aspects can be separately identified. • This is an important grey area.

audiovisual works • “consist of a series of related images which are intrinsically intended to be shown by the use of machines …” • Motion pictures are a specific type.

sound recordings • These have enjoyed federal copyright protection since 1972. • They are “works that result from the fixation of a series of musical, spoken or other sounds …” • This excludes the sound of an audiovisual work. • Making a soundalike recording is not an infringement of sound recording copyright.

architectural works • These were granted protection by Congress in 1990. • It only applies to buildings built after 1990, or built before 2002 using plans that were unpublished before 1990. • The copyright holder can not prevent photographs of the building.

government works • Works “prepared by an officer or employee of the United States Government as part of that person’s official duties” are generally excluded from copyright. • This means the Federal government. • Edicts of all levels of government are not protected. • All publications are still protected abroad.

prerequisites for protection • To be protected, works need to: – exist in tangible form – be works of authorship – be original – meet requirements regarding the nationality of the author • We address these in turn.

tangible form • Copyright arises only if the work is fixed in a tangible medium of expression. • It is not necessary that the work be directly perceptible by humans, merely that it can be perceived, reproduced, or otherwise communicated. • Improvised music, dance or speech is not protected by federal law but may be by state common law.

be a work of authorship • The author of the work has to be human. • This excludes works by – nature – computer programs – supernatural beings • Compilations of works supposed to be authored by supernatural beings can be copyrighted.

originality • 17 U.S.C. § 102(a) “original works of authorship”. • Traditionally the threshold is very low. • But pure “sweat of the brow” is not eligible, since Feist vs Rural Telephone. • In other countries the threshold tends to be lower.

databases • Databases can be protected as compilations “formed by the collection and assembling of pre-existing materials or of data that are selected, coordinated or arranged in such a way that the resulting work as a whole constitutes an original work of authorship.” [17 U.S.C. § 101]

authorship • Unpublished works are covered regardless of the nationality of the author. • Published works will be given protection if any of the following holds true – the author is a citizen or resident of the USA – the work is first published in the USA or a country the USA has an agreement with (e.g. Berne convention) – the author is a citizen of a treaty country.

publication status • A work is published when the copyright owner authorizes the distribution of copies through sale, rental, lease or lending. • The offer must be made to the general public. • Public performance is not publication.

duration of copyright • It first depends on the type of work – Unpublished works – Works first published in the United States – Works first published abroad – Sound recordings – Architectural works • The “normal term” is life of the author + 70 years, or 95 years since publication if the author is corporate.

unpublished works • Here we deal with works that were unpublished and not registered with the copyright office before 1978. • These rules are for unpublished works used in the USA, regardless of nationality of author.

terms for unpublished works • Unpublished works: Life of the author +70 years • Exception: Unpublished works created before 1978 that were published after 1977 but before 2003: Life of the author +70 years or 31 December 2047, whichever is greater. • Unpublished anonymous and pseudonymous works, works made for hire (corporate authorship), or works whose author's death date is unknown: 120 years from date of creation.

public domain published works • published before 1923: PD • published 1923 through 1977 without a copyright notice: PD • published 1978 to 1 March 1989 without notice, and without subsequent registration: PD • published 1923 through 1963 with notice but copyright was not renewed: PD

works published in the US, 2 • published 1923 through 1963 with notice and the copyright was renewed: 95 years after publication date • published 1964 through 1977 with notice: 95 years after publication date

normal term works in the US • published 1978 to 1 March 1989 without notice, but with subsequent registration: “normal term” • published 1978 to 1 March 1989 with notice: “normal term” • After 1 March 1989: “normal term”
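The rules on the last three slides for works published in the US amount to a small decision procedure. The following Python sketch encodes them as stated on the slides and nothing more; the function name, parameter names and return strings are invented for the illustration, and it is not a substitute for checking the actual law.

```python
from datetime import date

NOTICE_CUTOFF = date(1989, 3, 1)  # "1 March 1989" on the slides


def us_published_status(published: date, notice: bool,
                        renewed: bool = False,
                        registered_later: bool = False) -> str:
    """Status of a work first published in the US, per the slide rules."""
    year = published.year
    if year < 1923:
        return "PD"
    if year <= 1963:
        # Needs both a notice and a renewal to be protected.
        return "95 years after publication" if notice and renewed else "PD"
    if year <= 1977:
        return "95 years after publication" if notice else "PD"
    if published <= NOTICE_CUTOFF:
        # Notice, or subsequent registration, keeps the work protected.
        return "normal term" if notice or registered_later else "PD"
    return "normal term"


print(us_published_status(date(1950, 6, 1), notice=True, renewed=False))
print(us_published_status(date(1970, 6, 1), notice=True))
```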

background to this • Before 1976 the copyright period was fixed. • The 1909 act set this to 28 years, renewable for another 28 years by notice. • The Sonny Bono Copyright Extension Act of 1998 extended it to 95 years, but left the works published before 1923 untouched.

copyright notice • Between 1923 and 1989 the law required the work to carry a copyright notice. • If the notice was not there, the work entered the public domain.

registration • Registration of copyright was mandatory until the Copyright Renewal Act in 1992 made it optional. • This act did not touch works that were already in the second period, i.e. that had been published before 1964.

the 1978 shed • A work published with notice and renewed: the duration of the copyright will depend on the date of publication: – Prior to 1978: 95 years from publication – Since 1978: normal term

published works • For published works, the normal term is 70 years after the death of the author or, in the case of corporate authorship, 95 years after publication. • This is the normal term.

published foreign works • This clearly includes the following – Works by non-U.S. citizens published only outside the United States – Works by U.S. citizens living outside the United States, published only outside the United States

applicability of US rules • A work by a non-US author published both inside and outside the US is subject to US rules if either of the following applies: – the work was published in the US with a delay of less than 30 days – the copyright was registered in the US • A work by a US author published in the USA or abroad will be considered according to USA rules.

works first published abroad in PD • published before 1923: PD • published 1923-1977 without US formalities and in the PD in its home country as of 1 January 1996: PD

95 years terms • published abroad, no US formalities and not in the PD abroad in 1996: 95 years • published 1923-1977 with notice and renewal: 95 years • published 1923-1977 abroad only, without compliance with U.S. formalities or US republication, and not in the public domain in its home country as of 1 January 1996: 95 years after date of publication

foreign works after 1978 • If published without copyright notice, and in the public domain in its home country as of 1 January 1996: PD • If published either with or without copyright notice, and not in the public domain in its home country as of 1 January 1996: normal term.

sound recording copyright • Recordings made before 1972 are only protected by common law copyright, usually antipiracy and anti-bootlegging legislation. • Recordings made 1972 to 1998 without notice are PD. • With notice: 95 years if before 1978, normal term after 1978. • Foreign recordings get 95 years if not in the PD abroad in 1996.

http://openlib.org/home/krichel Please shut down the computers when you are done. Thank you for your attention!