Storage technology changes rapidly and planning a storage purchase can be overwhelming. But it is important to avoid becoming locked into an obsolete storage strategy, especially since data storage systems can be one of the single greatest expenses for many PACS. The authors demystify storage terminology and offer suggestions for a flexible, robust, and cost-effective storage strategy.
Dr. Nagy is the Director of the Radiology Informatics Laboratory, Medical
College of Wisconsin, Milwaukee, WI.
Mr. Farmer is the Chief Technology Officer of Cambridge Computer, Waltham,
MA.
There are a number of options when it comes to choosing the
right storage technology for your medical images. The first
challenge is that the storage industry is evolving rapidly, and you
don't want to be locked into an obsolete storage strategy. Hard
drive capacity has increased by a factor of 17.6 million since its
invention in 1952 (1).
There has been a sustained average annual growth in capacity of
33%. Tomorrow, there will be higher capacity drives at lower prices
and you don't want to be stuck with older, slower, and more
expensive equipment.
The second challenge you will have to face is that storage can
take a big bite out of your budget if you don't fully understand
what you are paying for. Data storage systems can be one of the
single greatest expenses for many picture archiving and communication
systems (PACS), and their cost and complexity are often a barrier to
adopting PACS. It's easy to overbuy, and there is no shortage of
enthusiastic storage salespeople out there to take your money! In
this article, we will demystify some of the terminology thrown
about in the storage arena and will lay the foundation for helping
you create a flexible, robust, and cost-effective storage
strategy.
Disk devices versus jukeboxes: Online, nearline, and offline storage
Storage is usually classified functionally as either online,
nearline, or offline. Online storage refers to data that is stored
on magnetic hard drives with access times in milliseconds and
transfer rates in the range of tens to hundreds of megabytes per
second (MB/sec). Online storage is immediately available to your PACS
application.
Nearline storage typically refers to a tape or optical jukebox
in which robotic arms can retrieve the tapes automatically and
insert them into a drive to read or write data. Generally, a
nearline system can access data within 60 seconds and is able to
transfer data at a few MB/sec. Offline storage is removable tape or
optical media that is stored on a shelf in a catalog and is
retrieved manually. Today, there is little use for offline storage
as it is very slow in retrieving data and can cause data loss if
media is mislabeled, misplaced, or mishandled.
The relationship between online (hard drives) and nearline (tape and
optical media) storage has historically been considered a direct
trade-off between cost and performance. A type of software application
described as hierarchical storage management (HSM) would manage a
relatively small portion of online storage and a larger amount of
nearline storage together as one storage pool. The HSM would try to
predict which studies would be requested and keep them on the
online portion of the system. As the online portion filled up,
older studies would be retired to the nearline storage system. In
PACS, roughly 80% of immediately relevant prior studies are from
within the last 6 months, and 90% are from within 1 year (Figure 1).
A good rule of thumb, therefore, is to buy at least 1
year's worth of online storage. This ensures that you won't
experience delays in retrieving prior studies for 80% to 90% of
your cases.
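
The retirement behavior described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it retires studies strictly by age, whereas a real HSM product may also try to predict which studies will be requested. The names Study and TieredArchive are hypothetical and do not come from any HSM product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Study:
    study_id: str
    study_date: date
    size_gb: float


@dataclass
class TieredArchive:
    """Hypothetical two-tier pool: a small online tier and a large nearline tier."""
    online_capacity_gb: float
    online: dict = field(default_factory=dict)
    nearline: dict = field(default_factory=dict)

    def online_used_gb(self) -> float:
        return sum(s.size_gb for s in self.online.values())

    def store(self, study: Study) -> None:
        """Place a new study online, then retire the oldest studies if needed."""
        self.online[study.study_id] = study
        self._retire_oldest()

    def _retire_oldest(self) -> None:
        # Move the oldest studies to nearline until the online tier fits again.
        while self.online_used_gb() > self.online_capacity_gb and len(self.online) > 1:
            oldest = min(self.online.values(), key=lambda s: s.study_date)
            self.nearline[oldest.study_id] = self.online.pop(oldest.study_id)

    def locate(self, study_id: str) -> str:
        """Report which tier currently holds the study."""
        if study_id in self.online:
            return "online"
        if study_id in self.nearline:
            return "nearline"
        raise KeyError(study_id)


archive = TieredArchive(online_capacity_gb=1.0)
archive.store(Study("a", date(2004, 1, 1), 0.5))
archive.store(Study("b", date(2004, 6, 1), 0.5))
archive.store(Study("c", date(2005, 1, 1), 0.5))
print({sid: archive.locate(sid) for sid in ("a", "b", "c")})
```

In this sketch the PACS sees one pool and simply asks where a study lives; only the retrieval delay differs between tiers.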
At the Medical College of Wisconsin, we generate roughly 10
terabytes (TB) of data annually for the 225,000 radiological
procedures performed. This ratio would be higher for cancer centers
that do a higher percentage of computed tomography (CT) procedures.
Relative to other healthcare applications, PACS requires a
disproportionate amount of data storage, 100 to 1000 times as much,
and must be able to scale in capacity indefinitely.
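
As a rough sizing aid, the figures above can be turned into a per-procedure average and an online capacity estimate. The sketch below assumes decimal units (1 TB = 1,000,000 MB); the annual_growth parameter is a hypothetical input for illustration, not a figure from our site.

```python
def average_study_size_mb(annual_tb: float, annual_procedures: int) -> float:
    """Average storage per procedure in MB (decimal units, 1 TB = 1,000,000 MB)."""
    return annual_tb * 1_000_000 / annual_procedures


def online_storage_needed_tb(annual_tb: float, years_online: int = 1,
                             annual_growth: float = 0.0) -> float:
    """Total TB needed to keep `years_online` years of studies online.

    `annual_growth` is a hypothetical year-over-year increase in data volume.
    """
    return sum(annual_tb * (1 + annual_growth) ** year for year in range(years_online))


# Figures from the text: 10 TB per year for 225,000 procedures.
print(f"{average_study_size_mb(10, 225_000):.0f} MB per procedure on average")
# Hypothetical planning case: 3 years online with 10% annual growth in volume.
print(f"{online_storage_needed_tb(10, years_online=3, annual_growth=0.10):.1f} TB")
```

With the numbers quoted above, the average works out to roughly 44 MB per procedure.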
Moving from jukeboxes to disk arrays
Early adopters of PACS were forced to rely disproportionately on
nearline storage versus online storage. Disk storage was too
expensive to accommodate a sufficient cache of prior studies, and
studies had to be pulled from the nearline jukebox regularly.
Jukeboxes have a limited number of drives, not to mention all kinds
of slow mechanical processes. As such, requests would queue up,
which caused delays, and failures of the robotic systems were
extremely inconvenient and costly.
Another problem with jukeboxes is that they lock you in to
today's cost of storage. At any given time, removable media is
cheaper per MB than hard drives, but jukebox systems require you to
buy most of your storage technology up front, in anticipation of
your long-term needs. By the time you grow into your anticipated
needs, the cost of hard disks could have dropped well below the
original cost of the jukebox. Once you factor in the cost of
maintaining the jukebox and the software and expertise to manage
it, the disk approach is cheaper.
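
The lock-in argument can be illustrated with a small cost comparison. Every number below (prices per TB, the rate of price decline, the planning horizon) is a hypothetical assumption chosen only to show the shape of the calculation. Maintenance and software costs are left out.

```python
def upfront_cost(annual_tb, years, price_per_tb):
    """Cost of buying all capacity for `years` of growth at today's price."""
    return annual_tb * years * price_per_tb


def incremental_cost(annual_tb, years, price_per_tb, annual_price_drop):
    """Cost of buying one year of capacity at a time while prices fall."""
    return sum(
        annual_tb * price_per_tb * (1 - annual_price_drop) ** year
        for year in range(years)
    )


# Hypothetical inputs: removable media starts cheaper per TB than disk,
# but the jukebox capacity for all 5 years is bought on day one.
jukebox_total = upfront_cost(annual_tb=10, years=5, price_per_tb=3000)
disk_total = incremental_cost(annual_tb=10, years=5, price_per_tb=5000,
                              annual_price_drop=0.30)
print(f"jukebox bought up front: ${jukebox_total:,.0f}")
print(f"disk bought yearly:      ${disk_total:,.0f}")
```

Varying annual_price_drop shows how quickly the advantage of cheaper media erodes when capacity must be purchased years ahead of need.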
Using disks for nearline storage
The historical cost difference between hard drives and removable
media, such as tape and optical, no longer exists. On the contrary,
hard drives are not only much faster, but they are also cheaper for
data storage compared with tape or optical media. The role of
nearline storage must change to one of disaster recovery and
obsolescence protection.
Today, the preferable solution is to use disk technology, rather
than a jukebox, for a nearline system. You still have nearline
storage, but you are storing a second copy to a disk device, rather
than to a jukebox. The software that manages the disk-based
nearline archive might be the very same HSM software that manages a
jukebox.
If you are just starting out with a PACS and your budget is
constrained, you could build your system entirely with online
storage and later add nearline storage as the need becomes
apparent. This helps with obsolescence protection, as it makes data
migration much easier when the time comes to move data to newer
hardware platforms or between systems from different vendors.
Dissecting online storage (disk systems)
PACS do not require "enterprise" storage solutions
In the past, the only place to buy high-capacity, scalable disk
arrays was in the enterprise computing marketplace, and, as such,
there is a common misconception that storage systems designed for
corporate data centers are required for PACS. Corporate data
centers often have hundreds of servers, each running different
applications and operating systems. Each server in an enterprise
needs only a few gigabytes (GB) of storage for its textual
information. Managing storage in such diverse environments is a
nightmare, and a very expensive nightmare at that. It makes a lot
of sense to manage all those servers from a central storage area
network to reduce complexity.
Enterprise-class storage systems can be cost-justified for
complex data centers, but they are overkill for most PACS. The best
bet is to buy dedicated storage systems for your PACS archive
(Table 1).
Direct attached storage
There are three ways your PACS can use online storage. With
direct attached storage (DAS), the hard drives reside directly
in the server running the PACS application. This is the simplest
model and the one, historically, that all PACS vendors started
with. Unfortunately, the DAS model has scalability limitations in a
PACS environment in which you need lots of drives. Scalability
means that next year when you need to buy more TB of storage, you
will be limited by the number of drives you can fit into the
server. You can purchase an external small computer system
interface (SCSI) drive system to extend a DAS, but this will buy
you only 1 to 2 years before it reaches capacity. The DAS model is
also not very fault tolerant. If that server goes down, you will
lose access to the data on that server.
Storage area network
One of the most popular solutions for the corporate data center
is the storage area network (SAN), in which the storage is
independent of the servers. A SAN is a dedicated network for
connecting storage devices to computers. This means you can add
storage each year without having to take the servers down. The
storage can be accessed from more than 1 server, so your system can
suffer the loss of a server without losing access to the
storage.
Network attached storage
Another popular solution for enterprise data centers is network
attached storage (NAS). In SAN, the storage is accessed on a
separate dedicated network controlled by the servers. In contrast,
NAS is freestanding storage sitting on the network. NAS is not
directly attached to the servers and the storage is accessed using
network standard protocols. An analogy for NAS would be attaching
your printer to a network as opposed to attaching it directly to your
computer. When you have only one computer, it is simpler to attach
the printer directly, but when you want many computers to use it, you
are better off attaching the printer to the network.
Not being tightly coupled to the PACS vendor and its software
can be a real advantage for NAS. It does not need the same level of
validation with the PACS application for every upgrade. This gives
the customer more freedom to choose among various NAS vendors, rather
than being locked into the storage vendor the PACS vendor prefers.
SCSI versus SATA
There are four different types of hard drives on the market:
SCSI, Fibre Channel, advanced technology attachment (ATA),
and serial ATA (SATA). SCSI and Fibre Channel drives are
typically used in enterprise storage arrays and servers. ATA and
SATA are typically used in personal computers and storage arrays
with less demanding performance requirements. Whether you use SAN,
NAS, or DAS, you should consider using storage systems based on ATA
and SATA drives.
SCSI hard drives--These are the common hard drives used for enterprise
storage. They typically run at high rotational speeds of 10,000 to 15,000
revolutions per minute (rpm). They are more expensive than
commodity ATA drives, primarily because ATA drives ship in up
to 8 times the volume of SCSI drives.
ATA hard drives--ATA drives were originally designed and marketed to the
personal computer marketplace. Individual drive performance for ATA and
SATA is slower than for Fibre Channel and SCSI drives because of lower
rotation speeds: 7200 rpm versus up to 15,000 rpm for SCSI. Interestingly,
the overall system performance differences are not likely to
be noticeable on a PACS. In fact, depending on the controller and
number of drives used in your storage systems, ATA and SATA disk
systems could perform with equal speed or faster than some Fibre
Channel and SCSI systems. Meanwhile, ATA and SATA drives are less
expensive and are available in higher capacities. This translates
to a significantly lower cost per TB without any trade-offs from a
PACS perspective. PACS is all about moving big files around a
network with a high throughput. The SCSI drives are better for
transactional storage, such as email servers and databases, which
need access to small files many thousands of times per second.
The peak loading on a PACS storage system for a large hospital
with multiple simultaneous requests is approximately 30 to 50
MB/sec. The limits on performance are mostly at the
application level, in how fast the workstation can receive the
data.
Disaster recovery and fault tolerance
Online storage systems span data across multiple hard drives in
a redundant array of inexpensive disks (RAID). There are several
different techniques for implementing RAID. RAID 5 is the
configuration that is most suitable for PACS.
RAID technology compensates for the failure of an individual
hard drive. In RAID 5, the disk controllers write data across
multiple drives in such a way that if one drive fails, your data is
still intact. If two drives fail, however, your data is lost, so it
is common to keep at least one online spare drive in the cabinet.
In the event of a drive failure, the online spare is automatically
substituted and the disk array gradually returns to a
fault-tolerant state.
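
The way RAID 5 survives a single drive failure can be shown with exclusive-or (XOR) parity. The following is a minimal sketch of the parity idea only, assuming one stripe of three equal-sized data blocks plus one parity block; real controllers rotate parity across all drives and operate on much larger blocks. The block contents are made-up placeholders.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks (placeholder contents) plus their parity block.
data_blocks = [b"CT01", b"MR02", b"US03"]
parity = xor_blocks(data_blocks)

# Simulate losing the drive that held the second block, then rebuild it
# from the surviving data blocks and the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

The same arithmetic shows why a second failure is fatal: with two blocks of a stripe missing, the single parity block does not hold enough information to recover both.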
Make sure you configure the system so that you will be alerted
to a failure. In the past, when PACS was situated in the
department, a blinking red light on a failed drive might have been
enough to alert an attentive administrator. Today, the PACS is
buried in the back of the data center, where visual error lights
might not be observed. Ensure that your storage system utilizes an
alerting mechanism for any failure that will require human
intervention. This will ensure that you get an email or page when a
drive dies on your server. Simple network management protocol
(SNMP) tools are available that can trap all the errors from your
servers and storage devices so you can see what is going on from
one location.
Due to their mechanical operation, hard drives are prone to
failure. The more drives you have, the higher the probability that
you will have a failure. There are other components that can
malfunction as well. Be sure to ask your vendor to identify other
single points of failure. Power supplies are another common point
of failure. Be sure that you have redundant power supplies and be
sure to plug them into an uninterruptible power supply (UPS)
system, which is a device that uses batteries to back up the
electrical power in case of a power outage.
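
To see how quickly the odds of a failure grow with drive count, consider the following rough estimate. It assumes that drives fail independently and share the same annual failure rate; the 3% rate used here is a hypothetical value for illustration, not a measured figure.

```python
def probability_of_any_failure(drive_count: int, annual_failure_rate: float) -> float:
    """Chance that at least one of `drive_count` drives fails within a year,
    assuming independent failures at the same annual rate."""
    return 1 - (1 - annual_failure_rate) ** drive_count


# Hypothetical 3% annual failure rate per drive.
for drives in (1, 14, 56, 112):
    chance = probability_of_any_failure(drives, 0.03)
    print(f"{drives:4d} drives: {chance:.0%} chance of at least one failure per year")
```

Even with a modest per-drive rate, a large array should be expected to lose drives routinely, which is why spares and alerting matter.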
Mirroring refers to a technique for writing your data to two
locations at the same time. If one storage location fails, the
PACS can access the other copy relatively easily. Replication
refers to a type of mirroring in which data is written to one place
and then copied to another place. Depending on the type of network
connecting the two places, the second copy could be a bit out of
sync with the original.
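
The difference between the two techniques can be sketched with in-memory dictionaries standing in for storage locations. This is a toy model under stated assumptions: the class names MirroredStore and ReplicatedStore are hypothetical, and a real system would copy data over a network rather than between dictionaries.

```python
from collections import deque


class MirroredStore:
    """Every write goes to both locations before it is considered complete."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def write(self, key, data):
        self.primary[key] = data
        self.secondary[key] = data


class ReplicatedStore:
    """Writes land on the primary first and are copied to the secondary later."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = deque()  # keys written but not yet copied

    def write(self, key, data):
        self.primary[key] = data
        self.pending.append(key)

    def replicate(self):
        """Copy everything still pending to the secondary location."""
        while self.pending:
            key = self.pending.popleft()
            self.secondary[key] = self.primary[key]


replica = ReplicatedStore()
replica.write("study-1", b"pixels")
print("in sync before replicate():", "study-1" in replica.secondary)
replica.replicate()
print("in sync after replicate(): ", "study-1" in replica.secondary)
```

The window between write() and replicate() is the period during which the second copy is out of sync with the original.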
Backup systems for PACS might use similar hardware and software
as enterprise data center backup systems. The biggest difference is
that your data is largely cumulative. That is, you are adding data
rather than updating previous data, and you almost never delete
data. Enterprise data centers typically use a backup strategy that
involves making full backups once a week and backing up only the
changes during the week. This approach could be costly and
unnecessary for a PACS. A PACS could be backed up in full only
once, with new files added to the backup system incrementally.
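
Because the archive only grows, the backup logic reduces to copying whatever the backup system has not yet seen. A minimal sketch of that idea follows, using a hypothetical BackupCatalog class and made-up file names; it tracks files by name only and does not perform any actual copying.

```python
class BackupCatalog:
    """Remembers which files have been backed up (hypothetical helper)."""

    def __init__(self):
        self.backed_up = set()

    def incremental_backup(self, archive_files):
        """Return the files that still need copying and record them as done."""
        new_files = sorted(set(archive_files) - self.backed_up)
        self.backed_up.update(new_files)
        return new_files


catalog = BackupCatalog()
# The first run is effectively the one-time full backup.
print(catalog.incremental_backup(["study-001", "study-002"]))
# Later runs pick up only the studies added since the previous run.
print(catalog.incremental_backup(["study-001", "study-002", "study-003"]))
```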
Optical versus tape
If you plan to have offsite storage and want to use a removable
media format, you should look at tape storage rather than optical.
There are two reasons for this. The first is that the cost density
is currently superior, with an industry standard of 500 GB per tape
for Super Advanced Intelligent Tape (SAIT). The optical industry
standard is still really in the range of 5 to 30 GB with the latest
optical media being 30 GB Ultra Dense Optical (UDO). The second
reason is performance. If you need to retrieve data from
removable media after a disaster, tape has a much higher
sequential throughput. The fastest read rate from an optical
drive is 2 to 3 MB/sec, whereas a tape can retrieve at 20 to 30
MB/sec. In the event of a disaster, retrieving 1 TB of data
(approximately 25,000 studies) from an optical drive would take about 4
days, as opposed to about 9 hours with tape.
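
The restore times quoted above follow directly from the transfer rates. The short calculation below assumes decimal units (1 TB = 1,000,000 MB), a sustained transfer at the top of each quoted range, and no time lost to media changes.

```python
def restore_time_hours(data_tb: float, rate_mb_per_sec: float) -> float:
    """Hours needed to read `data_tb` terabytes at a sustained rate in MB/sec."""
    seconds = data_tb * 1_000_000 / rate_mb_per_sec
    return seconds / 3600


optical_hours = restore_time_hours(1, 3)   # optical at 3 MB/sec
tape_hours = restore_time_hours(1, 30)     # tape at 30 MB/sec
print(f"optical: {optical_hours / 24:.1f} days")
print(f"tape:    {tape_hours:.1f} hours")
```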
Conclusion
You should strongly consider putting all of your PACS storage
online; it is not only economical but is also a good protection
from obsolescence. Also, consider using SATA hard drives, which
offer a level of reliability once available only for enterprise
storage at near desktop prices. The cost of storage will continue
to decrease every year while simultaneously increasing in capacity
as the computer industry continues to innovate. The best way to
take advantage of this is to purchase only the storage you need for
the upcoming year and to decouple your storage from the PACS server
by using network attached storage. With careful
planning, you should be able to stay ahead of your storage
requirements without having the archive consume a significant
portion of your hard-earned PACS budget.