In 2007, a team of Google engineers needed more accurate time for servers. Accurate time is especially useful for synchronising data, particularly transactional data. Technologies like Cassandra depend on closely synchronised clocks across database servers to be able to reconstruct the order of events on a database. The end goal is to be sure about the “State-of-Data”.
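To see why clock accuracy matters for ordering, here is a minimal sketch. It assumes a simple "last write wins" rule, where the write with the highest timestamp is kept. The `Write` type, server names and skew values are all made up for illustration, and this is not Cassandra's actual code.

```python
from dataclasses import dataclass


@dataclass
class Write:
    server: str
    true_time: float   # when the write really happened (seconds)
    clock_skew: float  # how far that server's clock is off (seconds)
    value: str

    @property
    def timestamp(self) -> float:
        # What the server stamps on the write: real time plus its own skew.
        return self.true_time + self.clock_skew


def last_write_wins(writes):
    """Return the value of the write with the highest recorded timestamp."""
    return max(writes, key=lambda w: w.timestamp).value


# Hypothetical scenario: server A's clock runs 0.5s fast, server B's is accurate.
first = Write("A", true_time=10.0, clock_skew=0.5, value="old")
second = Write("B", true_time=10.2, clock_skew=0.0, value="new")

winner = last_write_wins([first, second])
print(winner)  # the earlier write is kept, because A's clock runs fast
```

With both clocks accurate, the later write would be kept instead. Half a second of skew is enough to reverse the order of events.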
NTP (Network Time Protocol) is what Unix servers and internet machines use to synchronise time. Due to network delays or processing delays, a computer’s time can easily get out of sync with its peers. For most uses the margin of error has been minor, and the demand for high accuracy has not been crucial. But for a large computing company like Google, keeping thousands of systems accurate was important enough for them to create “Spanner”.
Spanner is Google’s globally distributed database, and the time keeping underneath it is built on GPS receivers and atomic clocks. Fortunately the distances we have to cover are at most the span of the earth. I am sure the folks at NASA and other space-faring agencies will have to consider time differences spanning much larger distances.
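As a rough illustration of how an NTP-style exchange estimates clock error, the sketch below applies the standard offset and round-trip delay formulas to four timestamps. The function name and the sample numbers are my own, and a real NTP client does a lot more (filtering, multiple servers, gradual adjustment).

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """
    t0: request sent      (client clock)
    t1: request received  (server clock)
    t2: reply sent        (server clock)
    t3: reply received    (client clock)
    """
    # Estimated amount the client clock must be shifted to match the server.
    offset = ((t1 - t0) + (t2 - t3)) / 2
    # Time spent on the network, excluding the server's processing time.
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Hypothetical exchange: the client is 5s behind the server,
# each network leg takes 1s and the server takes 1s to reply.
offset, delay = ntp_offset_and_delay(t0=100.0, t1=106.0, t2=107.0, t3=103.0)
print(offset, delay)
```

The offset estimate is only exact when the outbound and return legs take the same time. Any asymmetry in the network shows up directly as clock error, which is why network delays matter so much here.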
This is all in the effort to maintain the “State-of-Data”.
Informatica (INFA) has issued a warning on weakness in its Q3 results, reporting guidance of USD189+M, a drop from USD200+M (about a 5% drop). The stock market has further punished the company, with the stock price dropping from USD45 to USD30 (a 33% drop) over the last 6 months.
Informatica blames the weakness on Europe, but could it be that the value of its core business of “data transfer” is being eroded by open source equivalents like Talend? This is definitely a lower cost alternative for open source oriented European companies.
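For anyone who wants to check the percentages, here is the arithmetic as a tiny sketch. It treats the "+" figures as exact round numbers, which is an assumption on my part.

```python
def percent_drop(before, after):
    """Percentage fall from `before` to `after`."""
    return (before - after) / before * 100


guidance_drop = percent_drop(200, 189)  # guidance, in USD millions
price_drop = percent_drop(45, 30)       # share price, in USD

print(f"guidance: {guidance_drop:.1f}%  share price: {price_drop:.1f}%")
```

On those round numbers the guidance cut works out to roughly 5.5% and the share price fall to roughly 33.3%.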
Here are some updates on VMware. Virtualisation, sometimes also called “The Cloud”, is the current fad in IT infrastructure.
1) Virtualisation is going to happen whether we like it or not
- This was driven initially by under-utilized servers, but ease of management and configuration has taken over as the leading reasons for virtualisation.
- Currently only 30% of organisational server infrastructure is in the virtualised environment. If an organisation doesn’t reach 80% virtualised, it doesn’t get the efficiency benefits of virtual infrastructure, but ends up with large overheads managing both virtual infrastructure and traditional infrastructure.
- VMware hopes to push this to 50% in 2011-12
- The issues with adoption are confidence levels in application-infrastructure interoperability, and security.
- VMware has notoriously low security, and is itself a gateway to accessing the entire virtualised infrastructure. (Search Google for “vmware hack”)
- Virtualisation comes with overheads.
- If installing Vista or Windows 7 was not enough, virtualisation can help by adding 10-20% overhead to CPU usage.
- VMs also generate a lot more network traffic.
- VM configuration is going to be crucial, as “the server” is now spread over a VM, SAN storage, a network “bus”, and actual physical locations. So when we have slow VMs, it could be the result of a lot of different factors.
- Cloning a VM for failover/failback scenarios can also generate a lot of network traffic. So virtualisation increases network overheads.
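To put the 10-20% CPU overhead into perspective, here is a back-of-the-envelope sketch of how many VMs fit on one host. The host size and the cores-per-VM figure are hypothetical, and real sizing would also need to look at memory, storage and network.

```python
def usable_cores(physical_cores, overhead_fraction):
    """Cores left for guests after the virtualisation overhead is taken off."""
    return physical_cores * (1 - overhead_fraction)


def vms_per_host(physical_cores, overhead_fraction, cores_per_vm):
    """Whole number of VMs that fit into the usable cores."""
    return int(usable_cores(physical_cores, overhead_fraction) // cores_per_vm)


HOST_CORES = 16    # hypothetical host
CORES_PER_VM = 2   # hypothetical demand per VM

for overhead in (0.0, 0.10, 0.20):
    fit = vms_per_host(HOST_CORES, overhead, CORES_PER_VM)
    print(f"overhead {overhead:.0%}: {fit} VMs per host")
```

On these made-up numbers, a 20% overhead costs two of the eight VMs the host could otherwise carry.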
4) The Virtual Desktop
- VMware hopes to bring back thin-client computing with virtualised downloadable profiles from VM infrastructure.
- Personally, I think this is a shot in the dark, as the PC era is gone, and computing is already transitioning to a fragmented plethora of thin clients (e.g. mobile devices, iPads, netbooks) with profiles stored in SaaS applications.
- The benefit of centralized profiles is supposedly data security. However, with SaaS fragmenting applications and platforms, I doubt the virtual desktop will make it to the enterprise before iDesktops.
- VMware ESX 5 now supports up to 32 cores and up to 1TB RAM per VM. These are the “Big VMs” (or “Monster VMs” if you were a VMware sales person) that VMware has now released.
- This may support the more computationally intensive applications, but only if the virtual infrastructure has been upgraded.
- From an application development point of view, understanding the performance and capability of an application in the virtual infrastructure is harder, as the causes of performance issues are less transparent (e.g. is it a network or disk bottleneck, or over-utilisation of the CPU?).
- CPU utilisation as reported within a Windows/Unix VM is not an accurate reflection of the actual processing capability available to your application.
- So VM infrastructure performance statistics need to be actively shared (in real-time) with application teams.
- Using SPEC CPU benchmarking tools is another way to measure application-infrastructure performance.
- However let’s hope for an open environment with open information sharing.
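One small thing an application team can sometimes check for itself is CPU "steal" time on a Linux guest, i.e. time the VM wanted to run but the host gave the CPU to someone else. The sketch below works out the CPU breakdown between two `cpu` lines in the `/proc/stat` format. The sample lines are made up, and some hypervisors do not report steal to the guest at all, in which case the numbers have to come from the infrastructure team.

```python
def cpu_breakdown(before: str, after: str) -> dict:
    """Percentage of CPU time per state between two 'cpu' lines from /proc/stat."""
    fields = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    # Skip the leading "cpu" label and take the first eight counters.
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    return {name: 100 * d / total for name, d in zip(fields, deltas)}


# Two made-up samples taken a few seconds apart.
sample_before = "cpu 1000 0 500 8000 100 0 0 400 0 0"
sample_after = "cpu 1500 0 700 8200 100 0 0 500 0 0"

breakdown = cpu_breakdown(sample_before, sample_after)
print(f"user {breakdown['user']:.0f}%  idle {breakdown['idle']:.0f}%  steal {breakdown['steal']:.0f}%")
```

In this invented sample the guest looks busy but not saturated, yet a tenth of the CPU time went to other tenants on the host. That is the kind of detail that stays invisible unless the statistics are shared.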
7) Super-Computing / Grid Computing
- Although there has not been any noted implementation of supercomputing in VM infrastructure, there are no reasons why this is not possible.
- Grid Computing, and maybe some aspects of super-computing, are probably possible on VM infrastructure with the appropriate HPC software in place.
8) The Carbon Footprint
- The Carbon Footprint is now the new driver for VM infrastructure.
- Running un-optimised / under-utilised servers kills the environment.
- If electricity prices go up by 30% in the next 2-5 years, what will organisations have to do to mitigate that?
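As a rough way to think about the electricity question, the sketch below estimates an annual power bill and how much energy use would have to fall to cancel out a price rise. The server count, wattage and price per kWh are hypothetical, and cooling is ignored.

```python
def annual_energy_cost(servers, watts_per_server, price_per_kwh):
    """Yearly electricity cost for servers running 24x7."""
    hours_per_year = 24 * 365
    kwh = servers * watts_per_server * hours_per_year / 1000
    return kwh * price_per_kwh


def reduction_needed(price_increase):
    """Fraction of energy use to cut so the bill stays flat after a price rise."""
    return 1 - 1 / (1 + price_increase)


# Hypothetical estate: 100 servers at 400W each, USD0.20 per kWh.
baseline = annual_energy_cost(100, 400, 0.20)
after_rise = annual_energy_cost(100, 400, 0.20 * 1.30)
cut = reduction_needed(0.30)

print(f"baseline {baseline:,.0f}  after rise {after_rise:,.0f}  cut needed {cut:.1%}")
```

A 30% price rise needs roughly a 23% cut in energy use to keep the bill flat, and consolidating under-utilised servers is one of the more obvious places to find it.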
The Agile Director <a href="http://theagiledirector.com/content/4-things-twitter-can-give-business-intelligence" target="_blank">recently commented</a> on using social media feeds as a form of data to give organisations insight through Business Intelligence initiatives. This is very true. If companies realise that their businesses are built on their customers, all their internal systems should align accordingly. This is applicable to retail, property, media, communications, telcos, etc., and the end results are forward thinking, pro-active, customer-centric organisations.
The Data Chasm represents the gap between those who realise this paradigm and those who don’t. It’s as fundamental as the <a href="http://www.catb.org/~esr/writings/homesteading/" target="_blank">manifesto</a> of “<a href="http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar" target="_blank">The Cathedral and the Bazaar</a>”.
Data – A large portion of the corporate future will be driven by those who have it, and those who don’t. Then it’s driven by those who know what to do with it, and those who don’t.
The gap between the haves and have-nots is growing, where even governments and corporations fall under the have-nots.
Open data is the way forward to close the chasm. Supplying data alone is only the first step. As in economics, banking, media, supply chain, logistics, there are eco-systems of data analysts that churn out information. But yes, the common denominator across all these diverse industries is digital media. That is the key to bridge the data chasm.
Data Warehousing (DW) is a common term used in business intelligence (BI) projects and systems. The data warehouse has traditionally been the overhead, a large storeroom which aggregates and stages data from multiple sources into a single point. Analytics can then be conducted on this to provide valuable insights for management.
Now, the problem with the data warehouse is that it’s huge, and expensive. The processes to populate the data warehouse consume large computing resources, and the outcomes after a lengthy project might be inaccurate or off-focus.
Within modern applications, and data analytics, we should consider analytics as part of an application’s design, performing smaller analytics projects on smaller datasets before engaging in larger ones. We should also consider incremental processing of data by actively managing data state in a similar way in which we manage application states.
This fits well with the Agile methodology.
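The incremental processing idea can be sketched in a few lines. This is a minimal example, assuming each record carries an increasing `id` that serves as the "data state"; the record layout and the names `SalesState` and `process_increment` are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SalesState:
    last_id: int = 0                            # highest record id processed so far
    totals: dict = field(default_factory=dict)  # running total per region


def process_increment(state, rows):
    """Fold only the rows not seen before into the running totals."""
    new_rows = [r for r in rows if r["id"] > state.last_id]
    for r in new_rows:
        state.totals[r["region"]] = state.totals.get(r["region"], 0) + r["amount"]
    if new_rows:
        state.last_id = max(r["id"] for r in new_rows)
    return state


state = SalesState()
batch1 = [{"id": 1, "region": "APAC", "amount": 100},
          {"id": 2, "region": "EU", "amount": 70}]
# The second batch overlaps with the first; record 2 must not be counted twice.
batch2 = [{"id": 2, "region": "EU", "amount": 70},
          {"id": 3, "region": "APAC", "amount": 50}]

process_increment(state, batch1)
process_increment(state, batch2)
print(state.totals, state.last_id)
```

Each run touches only the new records, so the totals stay current without reloading and re-aggregating the whole dataset every time.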
So just like the abandoned warehouses along the rivers and docks of modern cities, data warehouses will be abandoned in favour of JIT Analytics, Agile BI, and better application designs.