Life with a large cloud: Lessons learned

By Bob Violino (ComputerWorld)

Many companies are just getting started with cloud computing, but others are already well entrenched. Indeed, some organizations in the latter camp have built massive private clouds that support significant portions of their operations.

Those giant megaclouds create unique management challenges for CIOs and business leaders. Among other things, administrators need to maintain service levels, ensure that systems hosted in the cloud are secure, and position the new offerings in a way that makes internal customers want to use them. And because so much about cloud computing is relatively new, organizations must learn to deal with hurdles like those while they're in the process of deploying these multi-petabyte IT architectures.

"To me, the biggest challenge in implementing private clouds is the massive culture and operating model shift from a 'do it for the users' model to a user self-service model," says Frank Gens, an analyst at research firm IDC in Framingham, Mass. "The entire IT service delivery model -- from design through deployment and operation, and on through support -- needs to be overhauled."

Here's a look at some of the hurdles associated with big clouds that organizations are dealing with as they implement and use service-based computing infrastructures.

Integration with legacy systems

Enterprises aren't moving to all-cloud environments overnight, so the integration of private clouds and existing IT systems is a key issue.

BAE Systems, an Arlington, Va.-based defense and security company, operates a multi-tenant private cloud on behalf of its government and military customers. The system encompasses multiple petabytes of storage -- though BAE declined to specify exactly how large its cloud is.

BAE also uses a smaller-scale private cloud internally to develop and test systems that it's building for customers, says Jordan Becker, a BAE vice president.

The cloud infrastructure that BAE is building will gradually replace data centers that the company and its customers currently operate. Therefore, integration and migration between the older and newer computing environments is an issue BAE has had to address.

"The private cloud deployment is relatively new and has not yet displaced the existing data centers," Becker says. However, the private cloud has helped "slow the growth of the legacy data centers," he explains. "As the legacy data center infrastructure approaches its natural capital refresh cycle, the infrastructure will be displaced incrementally with new cloud infrastructure. This process will take several years."

During this transition, "we need to elastically extend that legacy data center to enable the applications already running to scale across the cloud transparently," Becker says. "It should look to the user as though it's one virtual infrastructure" from both an applications and management standpoint, he explains.

To achieve that integration, BAE Systems has created a common global namespace -- a heterogeneous, enterprisewide abstraction of all file information -- for all of its image data. The data includes two-dimensional images, files with stereo sound and files that include full-motion video. There is also metadata that goes along with these images, Becker says.

"The global common namespace is unique to each particular customer group," he says. "One such customer group that shares a common namespace [is made up of] users of geospatial information across several defense and intelligence agencies."

Now that BAE employs this common namespace for the private cloud that reaches across legacy data centers and file archives originally developed for stand-alone applications, customers can seamlessly access information and federate it with the peers they want to collaborate with, Becker says.
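To make the idea concrete, a global namespace can be pictured as a lookup layer that maps one stable logical path to whichever storage system currently holds the file. The sketch below is purely illustrative: the class, the backend labels and the paths are hypothetical and do not describe BAE's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalNamespace:
    """Hypothetical lookup layer: logical path -> (backend, physical location)."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_path, backend, physical_path):
        # Record where a file physically lives under its stable logical name.
        self.entries[logical_path] = (backend, physical_path)

    def resolve(self, logical_path):
        # Users and applications only ever see the logical path.
        return self.entries[logical_path]

    def migrate(self, logical_path, new_backend, new_physical_path):
        # Moving a file between environments changes the mapping,
        # not the name that users and applications rely on.
        if logical_path not in self.entries:
            raise KeyError(logical_path)
        self.entries[logical_path] = (new_backend, new_physical_path)


ns = GlobalNamespace()
ns.register("/imagery/scene-001.tif", "legacy-datacenter", "/vol7/archive/0001.tif")
ns.register("/imagery/scene-002.tif", "private-cloud", "bucket-a/0002.tif")

# Later, the legacy copy is moved into the cloud; the logical path is unchanged.
ns.migrate("/imagery/scene-001.tif", "private-cloud", "bucket-a/0001.tif")
print(ns.resolve("/imagery/scene-001.tif"))
```

Because callers resolve the logical path every time, a file can move from a legacy data center to the cloud without any application needing to know.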

Security and service continuity

The University of Southern California operates a 4-petabyte private cloud that supports the USC Digital Repository. The IT professionals responsible for it have found that cloud security is one of their biggest concerns.

The Digital Repository provides clients with digital archives of content such as high-definition videos and high-resolution photos. Services include converting physical or electronic collections to standard digital formats for preservation and online access. The Repository also features high-bandwidth file management capabilities for accessing, managing and manipulating the large digital collections.

In November 2011, USC contracted with Nirvanix, a San Diego-based provider of cloud storage services, to deploy more than 8PB of unstructured data on a Nirvanix private cloud that the vendor is managing as a service from within USC's central data center and at its own facilities. This includes 4PB in the USC data center and 4PB at an out-of-state location to mirror the data.

"Nirvanix gives us a full managed cloud in both places, so I don't have to have staff familiar with their architecture or systems," says Sam Gustman, executive director of the Digital Repository and CTO of USC's Shoah Foundation Institute. "They are responsible for all upgrades and maintenance. Even though I don't have to operate the storage, I get the benefit of having the storage on our local network for access."

The cloud, which has the capacity to grow to 40PB of storage, handles digital content from multiple USC entities.

USC is also leveraging the cloud for its own internal data storage needs, and it's making it available to internal clients, Gustman says. He says the university opted for a cloud approach because it provides a geographically diverse and cost-effective way to store, preserve and distribute content on a global scale.

Protecting data from breaches was one of the major factors USC considered when it selected a storage vendor, Gustman says, noting that the university made sure that Nirvanix had security technology and policies that met the institution's standards.

Nirvanix has a set of policies it uses for data in the cloud, which includes encrypting data in transit and at rest, Gustman says. "It only decrypts when leaving the cloud," he says. "They also let our security team manage the keys to the data. Basically, every one of our 800,000 video files has a computer-generated password that we get to manage."

Matching security policies

"The hardest part of this [cloud endeavor] is security policy: making sure that the service company matches our own security policies," Gustman adds.

In addition to strong security, USC wanted to ensure it had an effective business continuity and disaster recovery strategy in place.

That "geo diversity" -- having data stored in multiple locations -- ensures that the university can continue to provide services from the cloud even if one site experiences downtime, Gustman says.

USC keeps two copies of the data stored in its onsite database: one on a Nirvanix disk that's managed at USC and another located in a different state on the Nirvanix cloud. "Not only can we track that the bits are staying exactly as they should through the multiple copies, but we can ensure that we will have copies under almost any circumstance," Gustman says.
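One common way to track that "the bits are staying exactly as they should" is to record a cryptographic checksum when a file is ingested and then periodically recompute it for every copy. The article does not say which method USC or Nirvanix uses, so the following is only a minimal sketch under that assumption; the function names and sample data are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of a file's contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_copies(reference_digest, copies):
    """Return the names of copies whose contents no longer match the reference."""
    return [
        name
        for name, data in copies.items()
        if sha256_of(data) != reference_digest
    ]


# Digest recorded at ingest time (sample bytes stand in for a video file).
original = b"example video bytes"
reference = sha256_of(original)

# Simulate one healthy copy and one copy with a single corrupted byte.
copies = {
    "onsite": original,
    "out_of_state": b"example video bytez",
}

damaged = verify_copies(reference, copies)
print(damaged)
```

Any copy that fails the check can be restored from a copy that passes, which is one reason keeping more than one replica matters.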

Data security is also a major issue for BAE Systems, especially considering that many of its clients are involved in military or intelligence-gathering operations. "You have systems that source data at multiple levels of security classification that have to be certified and accredited by each agency," Becker says. "Data assets must be replicated and transferred with the proper classification levels."

In addition, BAE's government customers must comply with a number of federal regulations that govern how certain types of information should be protected, who should have access to data and so forth.

Security policy is established by the government customers and then enforced by systems, such as firewalls and XML gateways, that interpret and implement those policies based on a set of complex rules, Becker says. There are secure firewalls and gateways that manage multiple levels of security on BAE's managed systems and on the secure government networks that these systems interconnect over, Becker explains.

Managing scalability and speed

Web pioneer Yahoo has built a huge private cloud that encompasses many thousands of servers around the world, more than 200PB of data and some 11 billion Web pages. That cloud also supports much of the Sunnyvale, Calif.-based company's operations, including its online search and news services.

"I would venture to say we have one of the largest private clouds in the world," says Elissa Murphy, vice president of product development at Yahoo. The company defines the cloud as a series of shared services that nearly every one of its properties uses. For instance, "most of the data you see on a page is requested and pulled from our private cloud," Murphy says.

What drove Yahoo to operate much of its business in a cloud environment was the need for extreme agility and speed. "Any company that's running at the scale that Yahoo runs at, supporting over 700 million users, needs to build new applications and serve up pages as quickly as possible to ensure the best user experience," Murphy says. She says the company's private cloud delivers data at speeds not rivaled by public clouds -- at least not yet.

One of the biggest challenges of managing the cloud is being able to quickly scale systems up and down. "For example, when a breaking news story occurs, we can quickly shunt workloads, moving lower priority workloads -- batch processing, for example -- off the servers and dedicate them to the news spike," Murphy says.
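The shunting Murphy describes can be pictured as priority-based preemption: when a spike arrives, the servers running the least important work are reassigned first. Yahoo's internal tooling is not described in detail, so the sketch below is a hypothetical illustration; the `Server` class, the priority numbers and the workload names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    workload: str
    priority: int  # assumption: a higher number means more important work


def shunt_for_spike(servers, spike_workload, spike_priority, servers_needed):
    """Reassign the lowest-priority servers to the spike workload.

    Only servers running work of strictly lower priority than the spike
    are preempted. Returns the displaced (server, workload) pairs so the
    displaced work can be rescheduled once the spike passes.
    """
    candidates = sorted(
        (s for s in servers if s.priority < spike_priority),
        key=lambda s: s.priority,
    )
    displaced = []
    for server in candidates[:servers_needed]:
        displaced.append((server.name, server.workload))
        server.workload = spike_workload
        server.priority = spike_priority
    return displaced


fleet = [
    Server("s1", "search", 9),
    Server("s2", "batch-reports", 1),
    Server("s3", "log-compaction", 2),
    Server("s4", "news", 8),
]

# A breaking story needs two more servers for the news workload.
displaced = shunt_for_spike(fleet, "news", spike_priority=8, servers_needed=2)
print(displaced)
```

In this toy fleet the two batch-style jobs are displaced while the higher-priority search server is left alone.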

She says Yahoo can quickly scale using technology developed internally, and she adds that, in a large cloud environment, carefully managing that process is critical.

For organizations that strive for scalability, one of the big challenges is keeping data consistent across multiple regions to ensure a consistent user experience. "You have to essentially ensure each copy of the data around the world is consistent and ensure that a user's privacy is maintained," Murphy says. "That introduces a large number of issues when you have products that span the world." One thing that Yahoo must do as it copies data to different regions is ensure that it remains in compliance with the privacy regulations in place in each locale.

From a technology perspective, Yahoo has built some of the first and largest NoSQL data stores in the world. The company uses many of them to keep data consistent across regions.

To address scalability as well as system and application reliability, Yahoo makes extensive use of Hadoop. The Apache Hadoop project develops open-source software to manage so-called big data, and Murphy says Yahoo has contributed more than 70% of the code to Hadoop. The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. The framework is designed to be able to scale up from a single server to thousands of servers.
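The "simple programming model" associated with Hadoop is MapReduce: a map step emits key-value pairs, the framework groups the pairs by key, and a reduce step combines each group. The sketch below imitates those three phases in plain Python on a single machine. It does not use Hadoop's API, and the function names are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["cloud storage", "private cloud", "cloud"])
print(counts)
```

Because each document can be mapped independently and each key can be reduced independently, the same two functions can in principle be spread across many machines, which is what lets the model scale from one server to thousands.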

Speed is another critical issue for a company like Yahoo that lives on the Internet. Murphy says the company's service-level agreements actually include provisions that guarantee the speed at which Yahoo will serve up data to its customers. "We are in the very low milliseconds consistently," she says.

To address the need for speed, Yahoo uses some of the most advanced hardware available, including solid-state drives, which feature much faster read/write times than hard drives. "We also employ the use of caching throughout the infrastructure to ensure that data is always served as quickly as possible," Murphy says.
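The effect of caching can be shown in a few lines: repeated requests for the same item are answered from memory instead of going back to the slower backing store. This is a generic sketch using Python's standard `functools.lru_cache`, not a description of Yahoo's infrastructure; `fetch_article` and the counter are hypothetical stand-ins for a real backend read.

```python
from functools import lru_cache

backend_reads = 0  # counts how often the (simulated) slow backend is hit


@lru_cache(maxsize=1024)
def fetch_article(article_id):
    """Simulate an expensive backend read; results are cached by article_id."""
    global backend_reads
    backend_reads += 1
    return f"article body for {article_id}"


fetch_article("a1")  # first request: goes to the backend
fetch_article("a1")  # repeat request: served from the cache
fetch_article("a2")  # different item: goes to the backend

print(backend_reads)
print(fetch_article.cache_info())
```

Three requests result in only two backend reads, and the saving grows with how often popular items are requested again.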

Training and other personnel issues

Much about cloud computing is new to people inside and outside IT organizations, and there's a learning curve associated with moving to private cloud environments.

As a result, effective training is one of the challenges enterprises need to address when operating massive clouds for themselves and their customers.

"A lot of [technology] training and systems administration training issues go along with managing this type of solution," says BAE's Becker. "People have to be well versed in knowing how virtualization systems work, how to configure systems for failover, how to do backups and provide business continuity."

BAE Systems has training programs that prepare its government agency customers to work with the cloud infrastructure; the programs also cover application support issues. Training in cloud-based offerings such as software as a service (SaaS) is particularly important because the concept is still new to a lot of people, Becker says.

The company trains its application developers to support what end users might be looking for in SaaS applications that they will access via the cloud. Topics include self-service concepts, data access and security.

"The end users are often soldiers or analysts -- not IT people -- who have to be able to use these systems," Becker says. "We have a whole training practice around how to build applications so that they're intuitive for these users." In cases where BAE provides the applications that end users operate, the company provides training on the applications and the end-to-end system, Becker says.

Adjusting to the big cloud environment

The cloud requires different ways of thinking about computing and managing resources.

In 2009, NASA began development of a platform-as-a-service (PaaS) capability that developers could use to operate a variety of Web applications. A separate project called Nebula evolved from that effort. Nebula is a private infrastructure-as-a-service (IaaS) NASA development project designed to provide the scalable infrastructure needed to support PaaS-based services.

It quickly became clear that the IaaS model could provide an attractive alternative to physical infrastructure for NASA projects that typically required their own servers.

"It can take months for NASA projects to procure new server resources, and even longer to install and configure them once they are received," says Raymond O'Brien, CTO for IT at NASA's Ames Research Center. "Further, deploying physical servers sometimes requires specialized facilities and comes with the ongoing burden of operating, maintaining and replenishing physical assets."

The IaaS approach provides an attractive alternative because it makes server resources available instantly, and it doesn't involve the overhead of housing and supporting physical equipment, O'Brien says.

Nebula has progressed through alpha and beta stages and most recently underwent a five-month evaluation by NASA's Science Mission Directorate. During the alpha and beta stages, more than 250 NASA employees and contractors had Nebula accounts.

According to the NASA website, the Nebula team is working on the development and implementation of the Nebula Object Store, a Web services-based approach for storing and retrieving data created by applications that use object-oriented programming techniques.

Object Store will allow total storage capacity to be extended into the hundreds of petabytes if required, "something that is not out of the realm of future possibilities, given NASA's rate of data and information generation," the agency says.

NASA is now reviewing its private cloud project to determine its future path, according to O'Brien.

"One of the biggest challenges for Nebula has been making the transition from a small development project focused on innovation to an agency service," he says. "There is a big difference between developing cloud software and operating a cloud."

Moreover, the cloud model is still new "and there is no textbook on how to best position an on-demand private cloud service to gain broad adoption internally," O'Brien says.

Adding to the challenge, internal users might not completely understand the cloud model, so the space agency must formulate a new business model for private cloud services and integrate the use of cloud services with existing policies and practices.

The bottom line, says O'Brien, is this: "Launching a new cloud service can be a lot of work."

Violino is a freelance writer in Massapequa Park, N.Y.