Corporations Are Crappy Archivists (pcjs.org)
184 points by ingve 3 months ago | 67 comments

Not just archiving of published documentation - they’re poor at archiving internally too.

If you’ve ever been to some old heritage industrial site like a textile mill or other Victorian-era factory, you’ll notice them displaying old records like lists of customers, the employee payroll, notes between managers discussing events, and other items of historical interest.

None of that will exist to future historians of our current industries. Privacy regulations, data lifecycle policies, absence of respect for corporate memory, and fear of legal discovery - these are all going to leave a black hole.

The only things that will get preserved are the “important” things like board papers and financial statements. But those are not the things that give insight into corporate life.

This is survivorship bias. Plenty of old Victorian-era businesses vanished without a trace, and today you’re seeing just the small fraction of things that survived, not the entire records of the business.

Meanwhile, in current businesses some employees keep diaries, some records get out through litigation, others are filed with governments, etc.

It's hard to find the balance between collaborative internal improvement, regulatory documentation requirements, and corporate navel-gazing.

Sounds like we need "just one more rule" to enforce record-keeping.

Except that it runs counter to the rules that enforce record deletion.

Right now the path companies need to tread on is already razor thin: Companies are already feeling the heat from both sides.

Customers already sue because right after account deletion their personal data still exists in backups and database tombstones — while on the other hand, others sue for not being able to resurrect accounts with years of data after their accounts got hacked.

There are no winners here. Long-term archival, as a lesser concern, is one of the major losers in this conflict.

That feels like a false equivalence. This is about preserving knowledge, not preserving user data.

That means someone needs to go through everything and decide what is sensitive and not sensitive, both from a business perspective and with respect to privacy laws.

I recently went through a bunch of older board meeting minutes and related materials from a non-profit because we were considering donating it to an archive. I scanned a few things but basically decided there was too much potentially sensitive stuff. Yeah, it was old but just too much dirty laundry that someone could take out of context that it wasn't worth the scrubbing.

What about that guy who always prints out every email?

It’s specifically corporate policy in the auto industry — records older than x years that don’t directly relate to warranty, service or ongoing litigation are to be destroyed with annual reviews to ensure this. It means that if you sue long enough after the fact discovery won’t yield anything because it’s policy to not have anything responsive to negligence in design. This has been official policy since at least the 1980s, but probably longer than that. I think this sort of policy should be discouraged, but I understand the chain of unintended consequences in jurisprudence and regulation that created the incentives to result in this policy.

Not just in the auto industry. In many companies the maximum retention for most documents is 5 years, and for emails it is less than that by default. Some companies also retain chats for less than 30 days. CYA tactics everywhere.

It's a pretty stupid tactic, because judges can instruct juries to presume an adverse inference (assume that the emails existed at one time, but are not available in the present precisely because of these "CYA" policies), as happened in Apple v. Samsung.

There's an offshoot of the Gell–Mann amnesia effect in play here. When programmers encounter Web sites that enact ridiculous requirements about passwords, they complain of bozos being in charge of technical decisionmaking, and then making decisions that either don't make sense or are downright bad based on nothing other than cargo cult advice. There are plenty of cargo cultists in management and legal departments, too, but they're often given the benefit of the doubt. Obviously they know what they're doing and have good reasons for everything, right? You know, just like the reason that a site requires you to have digits and punctuation in your password, but forbids you to have a 6-word passphrase because it's too long and contains spaces.

But judges generally will not instruct juries to draw an adverse inference if the company is following a pre-defined document retention policy. That's the point of the policy. It's effective CYA.

In the case I cited, Samsung was doing exactly that, and the judge did give those instructions.

The pro-CYA argument also fails to account for aggregate cost of employees running into roadblocks because the crucial piece of information they need to complete their task was sent in an email two weeks and one day ago, so they no longer have a copy.

> More than 100 boxes of the two men’s writings, correspondence, speeches and other items were contained in one of two modular buildings that burned to the ground at the Fountaingrove headquarters of Keysight Technologies. Keysight, the world’s largest electronics measurement company, traces its roots to HP and acquired the archives in 2014 when its business was split from Agilent Technologies - itself an HP spinoff.



I would recommend folks upload material like this that they want archived to both archive.org and bitsavers.org

While a corp might have copyright, they don't have the right to have these cultural artifacts lost to the sands of time.

That's been a perennial topic on HN. My suggestion as to the expense of archiving it was that a student could be hired at minimum wage to photograph them, page by page, with a phone camera.

This suggestion was usually followed by a deluge of angry responses that archiving should be done properly by a trained archivist. Of course, that's expensive, and now the only thing archived is ashes.

I've personally archived a lot of family letters using this technique, and it's fast (several pages a minute) and plenty accurate. Just do it on the kitchen table in the sunlight.

> My suggestion as to the expense of archiving it was that a student could be hired at minimum wage to photograph them, page by page, with a phone camera.

I'm not sure you'd need a trained archivist, but my 15 year-old Brother multifunction laser allows you to scan around 20 pages in a minute through an automatic document feeder at 600dpi.

From a quick look it seems like modern, dedicated scanners in the sub-US$500 range (brand new) have even larger feeders and faster scan rates - being able to chew through a 50 page feeder in around a minute.

If you have such a device at hand, go ahead and use it.

I understand the suggestion to go "minimum wage + phone camera" as "do whatever is prudent so stuff is archived _now_ in whatever quality, it doesn't even cost the world."

I agree with you in principle - my comment was just that for anything except the smallest jobs, it would probably be more economical to get some dedicated hardware to assist the worker.
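To put rough numbers on that, here is a back-of-the-envelope sketch. The 20 and 50 pages-per-minute figures are the ones quoted upthread; the phone rate of 4 pages a minute (standing in for "several") and the 10,000-page job size are assumptions made up for illustration, and the estimate ignores jams, fragile originals, and metadata work.

```python
# Pages-per-minute figures. Only the two feeder rates come from the thread;
# the phone rate is an assumed stand-in for "several pages a minute".
RATES_PAGES_PER_MINUTE = {
    "phone camera (assumed 4 ppm)": 4,
    "older multifunction feeder (20 ppm)": 20,
    "modern dedicated scanner (50 ppm)": 50,
}

# Hypothetical job size, chosen only to make the comparison concrete.
TOTAL_PAGES = 10_000


def hours_to_scan(pages: int, pages_per_minute: float) -> float:
    """Pure capture time in hours, with no allowance for handling or jams."""
    return pages / pages_per_minute / 60


for method, rate in RATES_PAGES_PER_MINUTE.items():
    print(f"{method}: about {hours_to_scan(TOTAL_PAGES, rate):.1f} hours")
```

Under those assumptions the phone takes roughly five times as long as the older feeder, so the hardware only pays for itself once the job is large and the paper is clean enough to feed.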

Yes, but starting from there will generate resistance because of the initial expense and lack of perceived value.

If you can get the role of archivist created then later on there can be an argument to increase the budget without so much risk of the project being cancelled altogether.

My sheet feeding scanner always jams when it has anything other than perfect copier paper in a stack. For folded, odd sizes, fragile, dirty, stapled, glued, torn, other paper, etc., it just doesn't work. Ya also have to constantly clean the platen with alcohol, or your scans get vertical lines on them.

(It's the same problem with a sheet feeding printer. If the printer paper isn't perfect, it jams.)

I agree that a dual side sheet feeder is awesome when it works.

Yeah, I have a ton of stuff I'd like to scan and have scanned some of it. But very little of it is in the category of just load in the hopper and you're done. And once you're looking at scanning hundreds, if not thousands, of pages of stuff and attaching metadata to it, it's pretty much a matter of laying down until the desire to digitize it goes away.

As a side note, I've been involved in a project to digitize the back issues of a student newspaper I worked on. It's both very labor intensive and expensive, given that the issues they only have as bound copies are a real pain to scan even with an expensive large format flatbed scanner.

The Fujitsu ScanSnap sheet fed scanner we have is pretty forgiving with what will successfully run through it. It was about $550 with a full version of acrobat.

I see the commercial version of these at drug stores and hospitals, so I'm guessing that the robust flexibility for scanning isn't unique to my experience.

I don't know. Our (I'm sure multi-K$) Xerox copier/scanners in the office are pretty temperamental if you don't give them fairly clean stacks of paper to scan--especially double-sided.

I'm using the Fujitsu ScanSnap sheet fed scanner :-)

> I'm not sure you'd need a trained archivist, but my 15 year-old Brother multifunction laser allows you to scan around 20 pages in a minute through an automatic document feeder at 600dpi.

Document feeders can and do eat documents.

> That's been a perennial topic on HN. My suggestion as to the expense of archiving it was that a student could be hired at minimum wage to photograph them, page by page, with a phone camera.

> This suggestion was usually followed by a deluge of angry responses that archiving should be done properly by a trained archivist. Of course, that's expensive, and now the only thing archived is ashes.

Isn't it likely that a student willing to work for minimum wage may not care enough to not treat the material carelessly and cause quite a bit of destruction or damaging disorganization?

You're also conflating archiving with digitization, when they're distinct activities.

IMHO, in most cases, it's also more likely that the original paper documents will survive in readable form than a mass of mediocre quality scans.

> a mass of mediocre quality scans

I challenge you to pick a random piece of paper, put it on the kitchen table, take a shot with your phone, and look at the result. Tell me it's not easily readable.

I also have an app on my phone that will OCR the jpg.

> I challenge you to pick a random piece of paper, put it on the kitchen table, take a shot with your phone, and look at the result. Tell me it's not easily readable.

That's a straw man. I never claimed that you can't take a readable photograph of a document with a phone.

The gist of my point is your idea seems to be mainly about insurance against catastrophic destruction (which happens, but infrequently). You're very focused on basic usability of the output of your proposed project, but you don't address 1) if your digitization will survive long enough to do its job as insurance, 2) damage that your project could cause (e.g. poorly paid workers being careless and damaging or disorganizing things irreparably).

I've read a very little bit about archival science, but one of the basic things they emphasize is preserving the original organization, because important information can be encoded in it. That could easily get lost by minimum wage workers spilling documents on the floor, or rearranging things to make their job easier (e.g. when I used to scan receipts, I'd order them by width and rough length, because that would cause the fewest issues with the document feeder). Then you have issues with old fragile documents, accidentally tearing things out of binders, etc.

At least tech companies aren't malicious about it. The big movie production companies lose original copies of older films pretty often and deliberately make watching their older content difficult and/or expensive.

If you're patient, lots of it eventually comes around on TCM. Unfortunately, the Comcast dvr doesn't let you just type in the names of movies to auto record, like Tivo could. It has to be on the current schedule, which only goes out 2 weeks.

It’s a shame TiVo never saw wider adoption. My dad first bought one in the late 90s and the usability of their devices has always been top-notch.

I've been working my way through the old Robert Mitchum movies on TCM. I had no idea they were that good.

I don't think it's because they're deliberately crappy archivists.

The fact is, when you have a company and you have to meet payroll, service customers and make a profit, archiving content that you're not making money from is sort of a secondary thing.

A lot of companies don't archive anything, not even the stuff they make a profit on. They just wing it and hope that when they need it, someone close to retirement will have a copy in the bottom drawer of their desk.

Most of it doesn't exist as paper any longer. I rarely print things out and save even less. I used to have big file cabinets at work. Now I maybe have a few file folders with archived work-related materials.

The rest is either on a local hard disk or a cloud drive.

I am continually angered at Microsoft's decisions to remove the older content from its site --- especially content that has been linked from elsewhere and would otherwise be invaluable in solving a problem or providing historical context.

It doesn't cost much at all to host this information, especially for a company of MS' size; I bet their Windows Updates servers consume far more bandwidth and storage than all of the KB articles they have ever published.

I have a file "knowledgebase16.7z", which appears to still be available online, that contains around 220K of the KB articles starting from Q10000, including Q12230 and Q46369 that this article's author tried to find. It is 1.1GB uncompressed and 134MB compressed.
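For a sense of scale, those figures can be turned into a compression ratio and an average article size. This is only a sketch: it assumes "GB" and "MB" are decimal units and treats "around 220K" as exactly 220,000 articles, since the comment gives rounded numbers.

```python
# Figures quoted in the comment above; all three are rounded in the original.
UNCOMPRESSED_BYTES = 1.1e9  # 1.1 GB, assumed to be decimal gigabytes
COMPRESSED_BYTES = 134e6    # 134 MB, assumed to be decimal megabytes
ARTICLE_COUNT = 220_000     # "around 220K", treated as exact for the estimate


def compression_ratio(uncompressed: float, compressed: float) -> float:
    """How many times smaller the archive is than the raw content."""
    return uncompressed / compressed


def average_size(total_bytes: float, count: int) -> float:
    """Mean number of bytes per article."""
    return total_bytes / count


ratio = compression_ratio(UNCOMPRESSED_BYTES, COMPRESSED_BYTES)
avg_raw = average_size(UNCOMPRESSED_BYTES, ARTICLE_COUNT)
avg_packed = average_size(COMPRESSED_BYTES, ARTICLE_COUNT)

print(f"compression ratio: about {ratio:.1f}x")
print(f"average article: about {avg_raw / 1000:.1f} kB raw, "
      f"about {avg_packed:.0f} bytes compressed")
```

So each article averages about 5 kB of text and only a few hundred bytes once compressed, which supports the point that hosting them costs next to nothing.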

One could argue that keeping such out of date information online and readily accessible could cause harm (like a now deprecated, but top-voted, stackoverflow answer) but it's not something a big flashy red banner at the top of the page couldn't solve.

They do it sometimes, like on this one there's a big purple banner warning that it's no longer supported: https://docs.microsoft.com/en-us/previous-versions/windows/i...

I don't know what the qualification was for stuff making it from TechNet though, because lots of it is missing.

I recall MDN using this for Firefox OS docs when I was searching for information on the ill fated project.

You may hate me for this but:

They simply don't care.

History and other such non-monetizable things are only preserved by what Taleb calls "soul in the game" people. That is, people who accept (even if slightly) negative real world payoffs for what's the right thing to do.

Have to draw the line somewhere. You can spend your life living, or in a sisyphean quest to archive everything.

Come on. The two are not mutually exclusive, and you're making the archivist up to be some obsessed madman.

In part at least, blame Google and Bing. It's not the storage space or bandwidth. It's that you're expending a lot of time and money--sometimes literally in the case of paid search--to surface mostly the most current and relevant information. Personally, when I search for the fix to some problem I'm having, I rarely want a 2010 article about my problem and usually get mildly annoyed when I get one.

And technical content is just part of it. Companies have current things they're selling, marketing, and messaging. They don't want that all mixed up with whatever they were doing five years ago.

Occasionally if you try the internet archive you can get the files.

My friend is an archivist/curator for Louis Vuitton, and my understanding is they do a pretty decent job. I assume this is because they actually derive some brand value from their own historicity.

Dior too has extensive archives


Remember when companies tried maintaining their own libraries?

Mid 1990s, during the infatuation with "learning organizations", we really struggled with onboarding and knowledge sharing. Surely we can do better, right?

So I got my archivist buddy hired. Extract domain knowledge from teams and individuals. Collect, aggregate, curate, and then reshare. Maintain our "library". Populate it with all the manuals, installation disks, training materials, textbooks, etc.

We'll never reinvent the wheel again. Woot!

Flew like a lead zeppelin.


Older me understands:

1) Forgetting is crucial to learning, moving forward, adapting.

2) Often times starting over is cheaper than finding prior answers.

I often wonder about my prior enthusiasm for Remember All The Things. Probably some mix of technophilia and existential dread (fear of being forgotten).

Old me rejects Chesterton's fence. https://en.wikipedia.org/wiki/G._K._Chesterton#Chesterton's_...

Any decision or rule without an attached name (advocate) is fair game for culling. If it was truly important, someone would care. Opposing change on principle is just being reactionary. Which isn't very helpful right now.

Yay for people who do work to remember. I'm in awe of modern historians like Jill Lepore. She's like a hacker or a genius, in that I can't even imagine how she comes up with her original content.

I'm not post-modern. We can learn plenty from the past. Alas, most first-person story tellers are unreliable narrators. And will probably record and archive all the wrong stuff.

Writing this out... I guess that's the difference between archivists and historians. There's no way for archivists to know what details may be important later.

Corporations are often also bad at archiving information that is of direct benefit to themselves. One instance that I am personally familiar with is keeping all the designs of past products in the live design system and relying on the backup system to ensure that they don't get lost instead of deliberately copying them to a curated archive.

Of course this didn't work. I remember getting a call from a designer asking me to restore a file. I asked when he had last seen it so that I could go and pick the right tapes from the fire safe. But he then said that he wasn't sure exactly when but that it was at least two years earlier. We reused our tapes annually so he was completely out of luck and had to reverse engineer the design in question. Turning what should have been half an afternoon's job into a week of work.
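The failure mode here is just date arithmetic: once tapes are reused, nothing older than one rotation period can be restored. A minimal sketch, assuming a fixed 365-day rotation and hypothetical function names (the real backup schedule isn't described beyond "annually"):

```python
from datetime import date, timedelta


def oldest_recoverable(today: date, rotation_days: int) -> date:
    """Earliest backup date whose tape has not yet been overwritten."""
    return today - timedelta(days=rotation_days)


def is_recoverable(last_seen: date, today: date, rotation_days: int = 365) -> bool:
    """True if a tape written while the file still existed is still in rotation.

    `last_seen` is the last date the file is known to have been on disk,
    so it is also the date of the newest backup that could contain it.
    """
    return last_seen >= oldest_recoverable(today, rotation_days)


# Illustrative dates only: a file last seen two years ago is out of reach.
today = date(2021, 1, 1)
print(is_recoverable(date(2020, 6, 1), today))  # within the rotation window
print(is_recoverable(date(2019, 1, 1), today))  # older than the rotation window
```

A curated archive sidesteps the problem because its retention doesn't depend on how long the backup media happen to be kept.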

That sounds like a pretty reasonable tradeoff unless this is a regular occurrence.

The problem comes when you change to a new CAD system and don't bother to convert all of the existing designs. Eventually there is no one left in the company who remembers where the old stuff is, assuming it still exists, and you have no way of reading it anyway. Not a big problem if you are just making throwaway toys but a serious problem if you are expected to maintain machines over a fifty year lifetime. In the past a big engineering company would have a drawing office archive and even have a librarian. Now disk space is so cheap we can store everything but we can't make use of it because file formats become obsolete in a way that drawings on paper never do.

Retaining information in an organised way that scales is a huge unsolved problem.

A former girlfriend of mine used to work in the archival department of the BBC, which covered content and legally sensitive communications. They had big archives and their system worked; their biggest problem was educating new employees throughout the organisation, a vast bureaucracy, to use the system: that's a compliance issue, not a technical issue.

You're missing "at a price that most companies are willing to pay" in there.

I saw a presentation on The New York Times' photo "morgue" a while back. They've been digitizing it, including putting lots of effort into metadata because it's pretty much useless otherwise. https://www.nytimes.com/2018/11/10/reader-center/past-tense-...

There are also a lot of issues with making it generally accessible because the NYT doesn't own the rights to simply opening up the photos to the world.

> Retaining information in an organised way that scales is a huge unsolved problem.

How so? Libraries and archives have been around for a long time, and seem to work pretty well.

Libraries have screwed this up numerous times as well, though proper archives less so. Take microfilm as a broad example. Many libraries started storing periodicals on microfilm. The thing is, for a lot of older local newspapers and the like, that's now all that exists (no known original is left). It also takes money to move those microfilm scans onto something that is not obsolete. You still run into libraries keeping a microfilm reader on hand because they don't have the funds to convert everything.

While microfilm was seen to have advantages during its peak, as it has become more obsolete it has become obvious that it has drawbacks compared to keeping the original documents. At least it's relatively simple in practice, needing just a microscope of about 100x to read.

The problem with digital information is worse, especially if you throw proprietary disk formats and proprietary file formats into the mix, let alone keeping hardware running that can use older interfaces for the storage medium. The alternative is to keep copying the data to newer media and to ensure that no errors happened during that process.
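The "ensure that no errors happened" step is usually done by comparing checksums before and after the copy. A minimal sketch using SHA-256 from the Python standard library; the function names are made up for illustration, and a real migration would also store the digests alongside the files so later copies can be checked against them.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination: Path) -> str:
    """Copy a file, re-read the copy, and fail loudly if the hashes differ."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise OSError(f"checksum mismatch copying {source} to {destination}")
    return expected
```

Note that re-reading straight after the copy may be served from the operating system's cache rather than the new medium, so this catches fewer errors than a later, independent verification pass would.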

I guess my point is that keeping an archived/preserved piece of data accessible always has an upkeep cost.

Even books and other physical things have an upkeep cost, mainly for maintaining a safe storage environment that protects them from things like UV light that may fade items or humidity that makes them mold, etc. Let alone protecting items from cataclysmic events such as fire. Any breakdown along this line for a library or archive means information may be lost or no longer accessible.

The problem is the same for companies except now they have to weigh spending money on this for documentation and such for products they no longer sell and likely do not even support any more if old enough.

Seems like libraries should band together and create archiving standards, and just build on top of a common archiving infrastructure?

They have [1]. However, libraries are incredibly decentralized, can have massive collections of "things" (not simply books) measured in the millions, and possess the kind of institutional inertia that comes with operating for literal centuries on a relatively shoestring budget. It's somewhat difficult getting everyone around the world to agree on a single system that can potentially catalogue every human creation in existence, let alone to implement it as common infrastructure.

[1] https://www.loc.gov/librarians/standards

I believe you, but what about snapshots? I'm thinking something like the internet archive. Documentation websites can go about their business however they like, but there's always a "show me (most of) MSDN as of October 2016" option. Internal bureaucracy would have to be in place to make sure that the feature doesn't get completely broken on a whim by a disinterested branch of the company, but I think it's feasible.
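The Internet Archive already exposes roughly this kind of "as of" lookup through its URL scheme: a 14-digit timestamp followed by the original address, which the Wayback Machine resolves to the nearest capture it holds. A small sketch that only builds such a URL; the MSDN address is used purely as an illustration, and whether a capture exists near that date is not something the code checks.

```python
from datetime import datetime

# Public Wayback Machine prefix; the timestamp and original URL follow it.
WAYBACK_PREFIX = "https://web.archive.org/web"


def snapshot_url(page_url: str, as_of: datetime) -> str:
    """Build a Wayback Machine URL asking for the capture closest to `as_of`."""
    timestamp = as_of.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


# Illustrative request: "show me MSDN as of October 2016".
print(snapshot_url("https://msdn.microsoft.com/", datetime(2016, 10, 1)))
```

A vendor-run version would need the same two ingredients: dated captures of every page and a stable URL scheme for requesting them.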

It's justifying this expense that would be difficult.

> But, I figured that’s OK, because I’ve got enough old MSDN Library CDs to create a tower that would rival any Jenga structure, and surely these articles would be on one of those.

Seems like someone with more of an archival mindset should rip those and create a database of all the Knowledge Base articles (including any variations).

At least with these MS produced a physical artifact that someone could do such a project with.



Making everyone completely responsible for what a company does will simply further concentrate power in the hands of the biggest existing corporations. Limited liability companies make it possible to launch a new company without the existential risk of losing everything that you own at the same time.

Removing limited liability across the board would eliminate that benefit for the smaller companies rather than the larger ones that could afford to tie up everyone in litigation.

Of all the comments, I think this has the most truth in it. Corporations are crappy at everything other than making money. They are a way for us people to organize in one certain way, so that we can express desires and energies in a way that's compatible with civilized society.

The problem comes when we expect anything else of corporations. Like, to be good archivists. Or good anything else. To be moral, responsible, ethical, caring. If these things are not directly tied to profit, and they are mostly not, then it's only secondary to profit, which means that it will suffer. Worst case it's just an externality to a company, or even specifically built upon it.

Now I'm not sold on any solution for what to do about this. But trusting corporations to be anything other than for-profit is naive, and will end in hurt, given enough time.

I do agree with you, the solution I offer is far from ideal. It was only meant to be provocative and make people think about alternatives, as you did.

More broadly, the way I see the situation, is that corporations are really a way for a group of people to get together and in-corporate the group, therefore create an entity acting as a separate person.

This creates an imbalance: a corporation can be created, like any natural organism is born, but never share social responsibilities for its impacts on society, and may never die either, as long as money gets funnelled into it. This is unnatural, and makes the corporation totally adversarial to the needs of the societies it grows in.

More than the mere search of profit, the problem may be the pursuit of a never-ending existence, often at any cost. At any human cost for sure. Which is insane and destructive. It's the very definition of the golem.

> Google mission statement is to “organize the world's information and make it universally accessible and useful.” Its vision statement is to “provide an important service to the world-instantly delivering relevant information on virtually any topic.”

Failed ...

Failed?? What a low quality response. How was Google supposed to archive material that isn’t online? Magic???

And yet they used to. Ever heard of Google Books?

From the looks of it this data was off the web before Google even existed.
